We summarise the state of the ML competitions landscape and analyse the hundreds of competitions that took place in 2024. Plus an overview of winning solutions and commentary on techniques used.
We found over 400 ML competitions that took place in 2024, across more than 20 competition platforms. The total cash
prize pool across all relevant competitions we found was over $22m, up from $7.8m in 20231.
Platforms
Many competition platforms saw significant user growth in 2024. Most platforms grew their user base by more than 25%
over the previous year, and some platforms more than doubled theirs. Kaggle remains the biggest platform both
by registered users (over 22 million) and total available prize money in 2024 (over $4m).
2024 Platform Comparison
Once again, the open-source CodaLab platform hosted the most competitions in 2024 (113)2, while
its successor, the newer Codabench platform3, also hosted dozens of competitions and
quadrupled its user base in 2024.
Trustii (data compiled in collaboration with the platform team): founded 2020; 2k+ registered users; 4 competitions in 2024; $57,000 in prize money.
Zindi (data compiled in collaboration with the platform team): founded 2018; 75k+ registered users; 21 competitions in 2024; $113,000 in prize money.
Other: 41 competitions in 2024; $15,143,000 in prize money.
Other platforms
For readability and relevance, the table above only includes platforms that hosted multiple competitions in 2024.
Decisions around what exactly constitutes a platform are somewhat subjective, and we remain open to changing this in future.
Competitions in the “Other” bucket in 2024 include:
The DARPA AI Cyber Challenge
The Vesuvius Challenge
MIT Battlecode
Numerous other competitions on special-purpose sites
This year’s platforms table includes two newer platforms which were not included in the 2023 table:
Antigranular hosts ML competitions that incorporate elements of
privacy-enhancing technologies, requiring competitors to balance high predictive accuracy with minimal use of privacy
budgets.
Alongside its general ML competitions, CrunchDAO runs an ongoing competition
to predict returns for US equities, with the predictions used to manage a hedge fund portfolio.
Grand Challenges
There has been a resurgence of interest in grand challenges in machine learning: ambitious competitions centered
around difficult and impactful research problems which we don’t yet know how to solve. A few of these competitions have
been launched in recent years, with three of them funded mostly or entirely by individuals.
The AI Cyber Challenge, a two-year competition organised by DARPA in collaboration with several frontier AI labs,
explores the use of AI for improving cybersecurity. The semifinals took place at DEF CON 2024, where the top 7 teams won
$2m each, in addition to separate prizes for small businesses.
The Vesuvius Challenge aims to uncover text from long-buried two-thousand-year-old papyrus scrolls using high-resolution
X-rays and machine learning. Building on the expertise of Brent Seales, who had already
been working on this project for several years, and with $250k in initial funding from Nat Friedman and Daniel Gross,
the project now has millions of dollars in funding from
several individuals and organisations and has awarded multiple progress prizes.
We covered 2023 results in some depth last year, and provide an update below on the fifth
scroll scanned a few months ago, as well as the progress made in building automation tools.
The AI Mathematical Olympiad, set up by the trading firm XTX Markets, has a $10m prize pot to spur the creation
of a publicly-shared, open source AI model capable of winning a gold medal in the International Mathematical Olympiad
(IMO). Hundreds of thousands of dollars have already been paid out, and the second progress prize competition is
currently ongoing. For insights into the winning strategy for the first progress prize and some changes for the second,
see below.
The Abstraction and Reasoning Corpus (ARC) Prize, created and funded by François Chollet and
Mike Knoop, builds on previous ARC competitions, and aims to be “a barometer of how close or how far we are to
developing general AI”. This iteration has a $1m+ prize pool, and has drawn interest from ML researchers and startups.
There was significant progress on the state of the art in this competition in 2024, and we give a summary further down.
In the recently-launched Konwinski Prize, the first team able to reach 90%+ on a new, dynamic version of the SWE-Bench
code-generation benchmark wins $1m. Funded by Andy Konwinski, this competition evaluates LLMs on real-world software
issues collected from GitHub. The deadline for submissions is 12 March 2025.
Corporates/Nonprofits
Aside from the grand challenges mentioned above, there were dozens of competitions in 2024 run by companies,
non-profits, or government organisations with the goal of solving a problem or hiring data scientists.
For example: the U.S. Bureau of Reclamation, who manage water resources, funded a
series of competitions with a total
of $500k in prize money, hosted on DrivenData, for making accurate water supply forecasts for 26 sites in the Western
U.S.
The Learning Agency Lab, a nonprofit focused on developing the science of learning-based tools, has run
9 competitions on
Kaggle in the past 3 years (including 5 in 2024), with hundreds of thousands of dollars in total prize money.
Competition tasks included automating essay scoring, detecting AI-generated text, and detecting PII data.
Some of these competitions were funded with support from the Bill & Melinda Gates Foundation, Schmidt Futures, and the
Chan Zuckerberg Initiative.
AI for Good, which describes itself as “the United Nations’ leading platform on Artificial Intelligence for
sustainable development”, has run multiple competitions on Zindi in the past few years. These competitions have focused
on problems like estimating air pollution from remote sensor data, and using satellite imagery for cropland mapping or
predicting soil parameters.
Both Meta and Amazon ran multitask competitions on AIcrowd in 2024, in association with the ACM KDD (knowledge
discovery and data mining) conference. Amazon’s competition focused on aspects of online shopping, while Meta’s
competition mainly dealt with web-based knowledge retrieval and summarisation.
Academia
For some competitions affiliated with academic conferences, the goal is to enable researchers to directly compare
state-of-the-art methods on standardised datasets in a shared evaluation environment —
a helpful addition to papers where researchers replicate or cite each other’s results.
This year we identified hundreds of conference-affiliated competitions.
This includes competitions for conferences with official competition tracks — such as NeurIPS, MICCAI, and ICRA
— along with competitions affiliated with conference workshops, which tend to go through a less comprehensive
review process.
While some conference-affiliated competitions attract large numbers of submissions, many target a niche group of
researchers in a particular field. The barrier to entry can be higher, and winners are often rewarded with academic
kudos (in the form of a conference talk or paper) instead of, or in addition to, monetary prizes.
CodaLab hosted the largest number of conference competitions, including many of the computer vision competitions for
CVPR workshops.
Grand Challenge hosted the second-most, including dozens of biomedical imaging competitions for MICCAI.
EvalAI was third, with competitions across CVPR, ECCV, and many other conferences.
Several other platforms also hosted conference competitions — including Kaggle, which hosted competitions for
NeurIPS, CVPR, and other conferences, and AIcrowd, which hosted several tracks of competitions organised for ACM KDD.
NeurIPS Competitions
For NeurIPS 2024, competition organisers and participants from the previous year were
invited to submit papers to the
datasets and benchmarks track
— a first for the conference.
The open-ended AI Agents Global Challenge also had a $1m prize pool,
split across investment and compute credits, for the best “AI Agents”.
Prizes & Participation
The competition with the largest available prize pool was the AI Cyber Challenge, organised and funded by DARPA, with
$14m paid out to winners of the semi-finals that took place in August 2024.
Both the ARC Prize and the AI Mathematical Olympiad had available prize pools of over $1m. In each case some of the
prize pool was conditional on high absolute levels of performance being reached, and the actual amounts paid out were
$125k and $264k, respectively. These prize amounts are expected to roll over into the next editions of these
competitions.
Prize pool
Of the 18 competitions with prize pools of $100k or greater, 9 were hosted by Kaggle. Two were part of the Vesuvius
Challenge (the 2024 Grand Prize and First Automated Segmentation Prize). One was the AI Cyber Challenge.
The other six were on Bitgrit, Codabench, CrunchDAO, DrivenData, Humyn, and AI Singapore.
There were over 200 competitions with prize pools of at least $1k USD, of which more than half had prize pools of at
least $10k.
As always, there was a wide range of participation across competitions, with some niche conference competitions drawing
fewer than 10 teams (often researchers in a specific area), all the way to more mainstream competitions on the
larger platforms drawing over 5,000 teams.
Leaderboard entries
In general, competitions with more prize money tended to attract more participants. However, anecdotally, participants
often value other markers of success in competitions — such as academic kudos or competition platform progression
points — higher than monetary prizes. This is a longer-term outlook consistent with seeing competition success as
valuable professional experience. Some quant trading firms, for example, highly value success in ML competitions
among job applicants, and some even look for promising recruits on competition leaderboards10.
Winning Solution Trends
Winning Toolkit
In line with previous years, Python was the almost-unanimous language of choice among competition winners.
Of the 79 winning solutions we found, 76 primarily used Python. There were three others:
The winner of the Polytope Permutation Puzzle, an
optimisation competition to solve Rubik’s-cube-style permutation puzzles in a minimal number of moves, implemented
their solution in Rust. The second- and
third-place solutions also made extensive use of Rust or C++.
The winner of March Machine Learning Mania 2024,
a college basketball game prediction competition, was a high school science and stats teacher who used R to implement a
Monte Carlo simulation based on a combination of third-party team ratings and personal intuitions about the teams.
Winners of previous editions have also used R, as did the second-place solution this year.
The winner of the
MLCAS Corn Yield Prediction competition, Igor Kuivjogi Fernandes,
used Python to preprocess the provided satellite data, but built their model (a linear mixed model) in R using the
lme4 package. They told us that they used R because “it is pretty strong for linear mixed models and I was
comfortable using it”.
We analysed winning solutions’ code11 where it was publicly available, and gathered information on packages used by some
teams who did not release their solutions’ code. The packages listed below represent the key third-party Python
packages that make up winning competitors’ core toolkit. Packages which were not included in last year’s toolkit are
highlighted.
Python Packages
Core
numpy arrays
pandas dataframes
polars faster dataframes 🆕
scipy optimisation and other fundamentals
matplotlib low-level plotting
seaborn higher-level plotting
NLP
transformers tools for pre-trained models
peft parameter-efficient fine-tuning
trl reinforcement learning for language 🆕
langchain LLM tools 🆕
sentence-transformers embedding models 🆕
Vision
opencv-python core vision algorithms
torchvision core vision algorithms
Pillow core vision algorithms
albumentations image augmentations
timm pre-trained models
scikit-image core vision algorithms
segmentation-models-pytorch segmentation
Modeling
scikit-learn models, transforms, metrics
deep learning
torch
tensorflow
einops Einstein notation for tensor ops 🆕
pytorch-lightning layer on top of PyTorch
accelerate distributed PyTorch 🆕
gradient-boosted trees
lightgbm
catboost
xgboost
Other
tqdm progress bar
joblib parallelisation
optuna hyperparameter optimisation
psutil system tools
loguru logs
datasets loading data 🆕
wandb experiment tracking
shapely planar geometry 🆕
rasterio geospatial raster data 🆕
numba jit compilation
New Additions: Highlights
Many of this year’s popular Python packages were also popular last year. A few interesting packages
were more prominent in winning solutions this year than previously.
einops provides an interface for tensor operations on top of NumPy, PyTorch,
TensorFlow, Jax, and other libraries.
This allows statements like y = x.view(x.shape[0], -1) to be replaced with statements like
y = rearrange(x, 'b c h w -> b (c h w)'), with each dimension given an explicit name.
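As a quick illustration (not taken from any particular winning solution), the two styles are equivalent for a standard image batch:

```python
# A minimal einops example: flattening an image batch by name rather than by shape.
import torch
from einops import rearrange

x = torch.randn(8, 3, 32, 32)                      # batch, channels, height, width
y_view = x.view(x.shape[0], -1)                    # shape-based flattening
y_named = rearrange(x, 'b c h w -> b (c h w)')     # same result, with named dimensions
assert torch.equal(y_view, y_named)
```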
TRL provides tools for language model post-training using
reinforcement learning, including techniques like Proximal Policy Optimisation and Direct Preference Optimisation.
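The DPO objective itself is simple enough to sketch from scratch; the snippet below is a minimal, illustrative implementation of the loss (not TRL's API), where each input is the summed log-probability of a chosen or rejected response under the policy or the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "rewards" are the log-probability ratios between policy and reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximise the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```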
Accelerate makes it easy to take PyTorch code written
for a single device and adapt it to run on a distributed setup, across multiple devices, with minimal changes.
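A minimal sketch of the pattern, using a toy model and dataset — the same script can then be launched on one or many GPUs via accelerate launch:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(256, 16), torch.randn(256, 1)), batch_size=32)

# prepare() moves everything to the right device(s) and shards the dataloader if needed
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
```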
Rasterio
provides tools for dealing with geospatial raster data such as satellite imagery and terrain models in the TIF format.
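Typical usage looks something like the sketch below (the file name is a placeholder):

```python
import rasterio

with rasterio.open("scene.tif") as src:
    band1 = src.read(1)                         # first band as a NumPy array
    print(src.crs, src.transform, band1.shape)  # coordinate system, geotransform, raster size
```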
Deep Learning
Deep Learning: PyTorch vs TensorFlow (vs JAX?)
PyTorch remained the deep learning library of choice among winners, though TensorFlow was more prevalent than in prior
years. Out of the 60 winning solutions we found using deep learning, 53 used PyTorch and 7 used TensorFlow.
Almost all solutions that used TensorFlow did so through the higher-level Keras API. We found 3 uses of
PyTorch Lightning, and 1 of fastai. One solution used Keras directly, without TensorFlow.
Of the solutions using deep-learning-based computer vision architectures, 12 used convolutional neural
nets (CNNs), 5 used Transformers, and 3 used a combination of both.
Computer Vision Architectures
Within computer vision architectures, U-Net, ConvNeXt, and EfficientNet were the most common model families.
Model Families
In some ways, the framing of a competition problem limits the set of suitable architectures — with certain
architectures being designed for per-pixel segmentation or object detection, whereas others are tailored for regression
or classification of whole images.
Some problem specifications leave room for participants to define their own problem framing. For example, in Zindi’s
Arm UNICEF Disaster Vulnerability Challenge,
participants needed to count the number of houses in each image with roofs made from certain types of roof material
(thatch/tin/other). The winning team combined two separate approaches: one framing this as an object detection problem,
where their models were trained to draw bounding boxes around a certain type of roof, which would then be counted.
In the second approach, they trained regression models to directly predict the number of houses of a
certain type in an image, without any intermediate object detection or segmentation.
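As a rough sketch of the second framing (not the winning team's actual code; the backbone and loss are illustrative), a pre-trained image model can be given a three-output regression head and trained to predict per-class counts directly:

```python
import torch
import timm

# 3 outputs: predicted counts of thatch / tin / other roofs in an image
model = timm.create_model("convnext_tiny", pretrained=False, num_classes=3)
criterion = torch.nn.MSELoss()

images = torch.randn(4, 3, 224, 224)   # dummy batch
true_counts = torch.tensor([[2., 0., 1.], [0., 3., 0.], [1., 1., 1.], [0., 0., 4.]])
loss = criterion(model(images), true_counts)
loss.backward()
```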
Compute and Hardware
As in
previous years,
a majority of competition winners (over 80%12) used NVIDIA GPUs to train their models.
One used a Google TPU, through Google Colab, and the remainder used CPU resources only.
We did not find any instances of competition winners using accelerators other than CPUs, TPUs, or NVIDIA GPUs.
Notably, once again, we found no mention of AMD GPUs in winning solutions. This is in line with trends in research
papers13, which also rarely mention the use of AMD GPUs for training.
A December 2024 post by SemiAnalysis, based on extensive testing of AMD’s leading MI300X14,
suggested that software considerations might be the main driver behind the lack of uptake of AMD’s GPUs for machine
learning, despite their cost advantage.
Hardware Used by Winners
In previous years, the NVIDIA A100 GPU was the most popular among winners by only a small margin. This year the A100
was more than twice as popular among winners as the next-most-popular GPU. Increased A100 availability (with the NVIDIA
H100 replacing it as the current top GPU for frontier model training) may have contributed to this leap in popularity.
Accelerator Models
Other than A100s, popular configurations include 1xP100 and 2xT4, which are available in Kaggle Notebooks. The most popular
consumer-grade cards were the RTX 4090 and RTX 3090.
We found two competition winners using 8xH100 nodes for training: Numina, who won the
AI Mathematical Olympiad - Progress Prize 1,
and The ARChitects, winners of
the ARC Prize 2024. This configuration costs
around $24/h using on-demand cloud compute15, and is the most expensive configuration we found (on a per-hour
basis).
The winner of Kaggle’s LLM 20 Questions
competition, who goes by c-number, noted that they started with a local RTX 4090 GPU, but scaled up to renting 8x RTX
4090 when they realised they were running short on time.
They said that “the investment in computational resources (approximately $500 in server
costs) was well justified by the final results”.
Others also mentioned paying for cloud VMs or notebooks to train their solutions.
The winner of the LEAP - Atmospheric Physics using AI (ClimSim)
competition, greySnow, mentioned that their experiments cost “at least 200 bucks in Colab compute units and probably
less than 300 bucks in total”. One other winner told us they spent over $100 renting an A100 to train their solution.
Despite some teams spending money on cloud compute or having access to clusters or powerful personal computers, there
were also winners who trained their solutions entirely for free through Kaggle or Colab notebooks.
Cloud Compute
Cloud compute services mentioned by winners included AWS, Jarvislabs.ai, Lambda, Runpod, TensorDock, and Vast.ai, with
one mention each.
Of the 10 competition winners we found using cloud notebooks for training, 7 used Kaggle Notebooks
(including 5 for non-Kaggle competitions) and three used Google Colab (one on the Pro tier, the other two on unknown
tiers).
As in previous years, there was significant variation in the amount of training data made available for competitions.
Some competitions provided only a few kilobytes of training data (AIMO Progress Prize 1
provided 10 training examples).
On the other end of the scale, the
DigiLut Challenge
came with multiple terabytes of lung biopsy data.
Dataset Size
The largest training datasets were usually for competitions with computer vision elements, or simulations of physical
systems. Competitions focused on reasoning/mathematics or NLP tended to have smaller training sets.
Training Time
Given the prevalence of ensembles and relatively large models, some solutions took multiple days to train. For
example, the winners of the
Kelp Forest Segmentation
challenge used 12 models which were each trained for 3-6 hours, and the winner of the
Youth Mental Health Narratives
competition took ten days to train their final models.
It’s also possible to win a competition with minimal to no training. The winners of the
SNOMED Entity Linking Challenge
used a dictionary-based solution that took 6 minutes to train on CPU only. The winners of the ICML 2024 Automated
Optimization Problem-Solving with Code competition called
OpenAI’s GPT-4-turbo model at test-time, and their final solution did not include any training at all (they did
experiment with fine-tuning GPT-3.5-turbo, but did not use this in their final solution).
Team Demographics
Most competitions allow individuals to team up and develop solutions together. Some platforms have a mechanism for
teams to advertise their openness to new members, and allow team mergers until shortly before the competition deadline.
Despite this, more than half of the winners we found in 2024 were individual competitors. Teams of more than 5
(a common upper limit for team size) were rare. While additional teammates can be helpful, having a team means
sharing any potential prize money, and some platforms explicitly incentivise individual entries as part of their
progression system.
Winning Team Sizes
Over half of the winning teams we found were categorised as ‘first-time winners’, as they did not have any members who
had already won a competition on the same platform16.
Repeat Winners
As always, there were some very prolific teams and individuals.
For example, Team Epoch from TU Delft won one competition on Zindi and one on DrivenData.
The user hyd had two solo wins on Kaggle.
Ivan Panshin won a competition on Zindi as part of a team, and had a solo win on Solafune.
Winning Solution Specifics
This year, we’re spotlighting winning solutions to NLP competitions, and competitions around mathematics and reasoning.
See our 2023 report for more detail on time-series
competitions.
NLP & Sequence Data
With the recent focus on generative modelling and the availability of increasingly powerful foundation models for
sequence data, many problems are being framed as sequence prediction problems.
Here we discuss traditional natural language processing (NLP) or sequence processing problems including text
extraction, text generation, sequence regression, sequence classification, and speech recognition.
We cover competitions focused on mathematics and reasoning tasks below.
The current focus in language model research is on autoregressive decoder models, which generate tokens one step
at a time. This is in contrast to encoder models, which take in strings of tokens and map them to a representation.
While encoder models have a history of success in NLP competitions,
last year we noted that decoder models were also starting to be used
successfully, both directly (e.g. by adding a classification head to a pre-trained model) or
indirectly (e.g. to generate synthetic data).
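The "direct" use is straightforward with the transformers library — a hedged sketch, with the model name and label count purely illustrative:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "mistralai/Mistral-7B-v0.1"   # a decoder-only model (large download)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = tokenizer.eos_token_id   # decoder models often lack a pad token

inputs = tokenizer("An example passage to classify.", return_tensors="pt")
logits = model(**inputs).logits   # shape (1, 2): a classification head on top of the decoder
```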
Decoder Models
This trend continued in 2024, and several competitions seemed designed specifically with these powerful new
decoder LLMs in mind.
The most commonly-used decoder models among competition winners in 2024 were variants of Llama, Mistral, Gemma, Qwen,
and DeepSeek models. Several competition winners used only decoder models.
The winners of AIcrowd’s
KDD Cup 2024
fine-tuned multiple Qwen2-72B models, making use of LoRA and 8 A100 GPUs at train-time and
4-bit quantisation and batch inference at test-time to enable such a large model to be used.
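In general terms, this kind of 4-bit loading is a few lines with transformers and bitsandbytes — a generic sketch (with a smaller Qwen2 model as a stand-in), not the winners' exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B-Instruct",          # placeholder; the winners used Qwen2-72B variants
    quantization_config=quant_config,  # weights stored in 4-bit to fit in GPU memory
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
```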
In the LLM Prompt Recovery
competition, the winner used an ensemble of Mistral-7B and Gemma-7B models. They, alongside other top-scoring teams,
used an adversarial attack on the competition metric that exploited a peculiarity in the vector embedding model, where
adding a certain token — for the string lucrarea — to the end of their submissions improved their scores
significantly.
In the LLM 20 Questions
competition, competitors had to build both question and answer agents, each of which would be paired off with another
competitor’s agent. These pairs of agents then needed to cooperate to guess the correct secret word using as few yes/no
questions as possible.
The overall winner, Kaggle user c-number, used agents with multiple strategies.
For their question-asking agent, they pre-populated a question table using questions generated by GPT-4o mini, and
probability distributions over answers calculated by sampling from Llama-3-8B-Instruct, Phi-3-small-8k-instruct, and
Gemma-7b-it. Their answering agent used Llama-3-8B-Instruct and DeepSeek-Math. No fine-tuning was done on any of
their models.
They were one of many teams who also implemented a simpler strategy purely based on alphabetical
bisection using regular expressions, which was more reliable but depended on the other (randomly assigned) agent in the
pair to also have implemented this strategy, and so potentially wasted one question on an initial ‘handshake’.
In
LMSYS - Chatbot Arena Human Preference Predictions, the winner (sayoulala)
trained a Gemma2-9b model alongside two relatively large models (Llama3-70B and Qwen2-72B), using LoRA/QLoRA, after
which they used distillation to improve the smaller Gemma model (with the logits distribution from the larger models).
The smaller Gemma2-9b model was then quantised to 8-bit, and only that model was used for inference. They note that
“distillation is a very promising approach, especially in the current Kaggle competitions, where inference constraints
are a limiting factor”.
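The core of logit distillation is a soft-target loss between teacher and student; the function below is a generic sketch of that idea, not sayoulala's training code:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then penalise the KL divergence between them
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```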
Encoder Models
Alongside decoder models’ success, encoder-only models still had a place in winning solutions. For the past
few years,
the DeBERTa series of models have been the encoder-only model of choice among competition
winners, and this was still the case in 2024.
While some competition winners only used encoder models — such as the winner of the
Automated Essay Scoring
competition — it was more common to see encoder models combined with decoder models.
One common way to combine them was to generate synthetic data using decoder models, and then use that to fine-tune the
encoder models.
In Kaggle’s
PII Data Detection
competition, an ensemble of various DeBERTa models was trained on synthetic data generated by decoder (mainly Mistral
and Gemma) models. Similarly, the winner of AIcrowd’s
Identifying Relevant Commonsense Facts
competition fine-tuned a DeBERTa model on synthetic data generated using GPT-3.5-Turbo.
Another approach was to use an encoder for retrieval, paired with a decoder for generation, in a RAG
(retrieval-augmented generation) pattern. RAG can search over a corpus of documents to retrieve relevant text, which
is then added to the context window during generation.
The winner of Zindi’s Specializing Large Language Models for Telecom Networks
competition used ColBERT for retrieval, alongside Falcon-7.5B and Phi-2 for generation17.
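A minimal retrieval sketch of this pattern, using a sentence-transformers bi-encoder and a made-up corpus (the Zindi winner used ColBERT, a late-interaction retriever, rather than a bi-encoder):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["Document about 5G handover procedures.",
          "Document about network slicing.",
          "Document about spectrum allocation."]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

query_emb = encoder.encode("How does handover work?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
context = "\n".join(corpus[hit["corpus_id"]] for hit in hits)
# `context` would then be prepended to the generator's prompt
```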
And lastly, it’s also possible to directly ensemble encoder and decoder models in a solution, as was done by the winner
of the
Detecting AI Generated Text
competition, who used a combination of Mistral-7B, DeBERTa, Llama-7B, and Llama-1.1B, along with synthetic data.
With the availability of recently-released encoder models like ModernBERT,
it will be interesting to see if the popularity of DeBERTa endures into 2025.
The authors of the ModernBERT paper claim that it brings “modern model optimizations to encoder-only models”, and that
“ModernBERT-base is the first encoder to beat DeBERTaV3-base since its release in 2021”, while also being faster and
more memory-efficient.
Other Approaches
Not all NLP competitions were won using deep learning.
In the SNOMED CT entity linking challenge,
competitors needed to tag clinical notes with relevant concepts. The winning team used a dictionary-based approach, and
narrowly beat the second-placed team which used an ensemble of BERT models, as well as comfortably beating other
teams which used decoder-only LLMs like Mistral-7B. The winning solution took under seven minutes to train, on CPU
only.
Initially we aimed to develop an LLM-based solution, but as a baseline for comparison we also tried a simple dictionary-based approach.
We proceeded in parallel, but fairly early on it seemed that in the context of this challenge, with the available annotated data, the dictionary approach was more promising.
There are two main axes of resource constraints on competition entries. Firstly, the cost of renting,
or acquiring and running, compute hardware for developing and training. Secondly, the compute constraints imposed by the
competition platform’s evaluation environment at inference time, often specified as a certain number of hours on a
specific hardware configuration (e.g. 1 NVIDIA P100 GPU with 16GB of VRAM for up to 12 hours).
In some cases, extra training compute can be expended to reduce the amount of inference compute required, by using
methods like model distillation and pre-calculating potential answers at train time.
Other techniques can be applied to reduce both train-time and inference-time compute. Two such techniques, which we
briefly mentioned last year,
are model weight quantisation and low-rank adapters.
We found instances of 4-bit, 5-bit, and 8-bit quantised models used in winning solutions.
Quantisation was usually used only for inference, but some teams quantised before fine-tuning.
This was particularly important for the winners of the ARC Prize, who performed inference-time fine-tuning
(also referred to as test-time training or TTT) within the confines of Kaggle’s evaluation environment.
Quantisation: the what and the why
Given that CPU-only inference is generally slow, it’s important for LLMs to fit within GPU memory at inference time to
make the most of the resources available.
There are two main dials that can be used to adjust the amount of memory a model uses: the number of parameters,
and the amount of memory taken up for each parameter.
Most models we saw used in winning solutions were in the 7-9 billion parameter range, though there were also some bigger
(e.g. Qwen2-72b) and smaller (e.g. DeBERTa, with up to 300 million parameters) models.
Parameters are usually stored as floating-point numbers. A few years ago, the default was to train models using
32-bit (4 byte) floats. With the more advanced datatypes supported by modern GPUs, as well as software improvements,
it’s now becoming possible to train using lower precision data types — 16 bit, and even 8 bit in some
cases19.
Inference can be done at even lower precision, and often floating-point weights are quantised into 8-bit or 4-bit
integer formats for inference (an 8-bit format can store 256 different values). Given rigid inference compute
constraints, quantisation trades off a (hopefully small) loss in model quality for an increase in the possible number of
model parameters, or an increase in generation speed and number of possible generations.
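The arithmetic behind this trade-off is simple — ignoring activations and the KV cache, weight memory is just parameters × bits per parameter:

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    return n_params * bits_per_param / 8 / 1e9   # bits -> bytes -> GB

# Rough weight-only memory for a 7B-parameter model at different precisions
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 8-bit: 7.0 GB, 4-bit: 3.5 GB
```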
The winners of the AIMO Progress Prize 1, team Numina, noted that the T4 and P100 GPUs
provided within
the standard Kaggle inference environment do not support “modern data types” like bfloat16, and so they reduced the
precision of their model parameters to 8 bits using AutoGPTQ (noting also that “casting [from bfloat16] to float16
leads to a degradation in model performance”)20.
They mentioned that quantisation “led to a small drop in accuracy, but was mostly mitigated by being able to generate
many candidates during inference” — since quantisation makes inference significantly faster, as well as
reducing memory use. For the second AIMO progress prize, Kaggle is allowing
participants to use more modern L4 GPUs, which do support bfloat16.
Another common technique to reduce memory requirements when fine-tuning is to freeze model weights and add a small
amount of new trainable parameters, through low-rank adapters (LoRA).
LoRA and related techniques were common among winning solutions that involved fine-tuning.
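A generic peft setup looks like the sketch below; the base model, rank, and target modules are illustrative and depend on the architecture being fine-tuned:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")   # placeholder base model
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],      # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # only the small adapter matrices are trainable
```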
The winners of the ARC Prize noted that the Mistral-NeMo-Minitron-8B-Base model performed best
in their experiments. Because of its relatively high parameter count and the limited available GPU memory, they used
both LoRA and 4-bit quantisation when fine-tuning.
Not all winners chose to use LoRA. The winners of the AIMO Progress Prize 1 stated
that they did not use LoRA or similar techniques because they “were not confident they could match
the performance of full fine-tuning without significant experimentation”. The team had access to significant
compute resources and used an 8x H100 node for training, which had enough GPU memory to do full fine-tuning.
Mathematics and Reasoning
Two of 2024’s $1m+ prize pools were for competitions focused on mathematics and reasoning: the AI Mathematical Olympiad
(AIMO), and the Abstraction and Reasoning Corpus (ARC) Prize.
Team Numina won $131k for topping the leaderboard in the first AIMO Progress prize, which involved giving numerical
answers to natural-language maths questions.
Their solution involved using a powerful multi-GPU setup to fine-tune the DeepSeekMath-Base-7B model, as well as
gathering hundreds of additional similar maths problems and solutions for validation.
The ARC Prize involves solving abstract grid-based puzzles.
The ARChitects’ winning solution for the 2024 edition of the ARC Prize was based around tokenising the grids into
1-dimensional sequences, and using an LLM to predict the outputs.
More detail on these two competitions and their winning solutions further down in the
Notable Competitions section.
In addition to these two competitions, there were several other mathematics and reasoning competitions. Winning
solutions to these also generally made use of LLMs, either by calling APIs for frontier models (when allowed), or by
fine-tuning pre-trained models with available weights.
In all three tracks of the ICML 2024 AI4Math Workshop’s “Challenges on Automated Math Reasoning”, participants
generally called LLM APIs rather than training or fine-tuning their own models, and two of the three tracks’ winning
solutions called the GPT-4 API. For more detail on these competitions, see our ICML 2024 liveblog entry.
The Global Artificial Intelligence Championships competition
had a $100,000 prize pool for the best performance on around 400 questions spanning high school,
college, and olympiad-level mathematics. Questions were provided in PDF and LaTeX formats. Answers were either true-false,
multiple-choice, or open-answer (real numbers/polynomial expressions/vectors/arrays). All answer formats could be graded
automatically.
GAIC problem 149: What is the simplified value of the expression, 8x³ − 3xy + √p, if p = 121, x = −2, and y = 3/2?
A) 84
B) 73 + √11
C) −28
D) −44
32 teams submitted answers. The winning team did not disclose any details about their solution. The second-placed team
open-sourced their solution, which was based around GPT-4-Turbo.
Time Series & Tabular Data
Unlike computer vision and NLP, where deep learning unlocked a step change in capabilities over previous approaches and
is now the obvious best option in most cases,
time-series and tabular data are two areas where deep learning methods are generally thought of as one of several
useful modelling approaches. To date, no single leading deep-learning-based architecture has emerged to dominate
time-series or tabular data problems.
On the whole, the techniques that won time-series and tabular data competitions in 2024 were not so different from those
that succeeded in previous years. Gradient-boosted decision trees were still extremely prevalent, and there was some
use of deep-learning-based methods.
Gradient-boosted decision trees (GBDTs) largely continue to dominate competitions with tabular data or time-series
prediction which can be framed as tabular data.
We found 16 winning solutions using LightGBM, 13 using CatBoost, and 8 using XGBoost, the three major GBDT
libraries. It was common, as in previous years, to see ensembles using multiple GBDT libraries, as their
varying implementations lead to different strengths and weaknesses in terms of modelling performance.
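A toy version of this blending pattern, with arbitrary hyperparameters and a simple average of the three libraries' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
models = [
    LGBMRegressor(n_estimators=300),
    CatBoostRegressor(iterations=300, verbose=0),
    XGBRegressor(n_estimators=300),
]
# In practice each model would be tuned and validated separately before blending
blend = np.mean([m.fit(X, y).predict(X) for m in models], axis=0)
```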
DrivenData’s
Water Supply Forecast Rodeo,
the largest time-series prediction competition of 2024 in terms of prize money, was won by Matthew Aeschbacher, who used
an ensemble of CatBoost and LightGBM models. The solution write-up describes initial experiments with XGBoost, which
they ended up replacing with CatBoost due to its superior handling of categorical features. LightGBM was also included
as part of the ensemble, as it “trains extremely fast and offers predictive accuracy comparable to Catboost”.
Aside from modelling performance, competitors might choose one library over another because of familiarity, or
because its implementation is better suited to a competition’s specific computational constraints.
This is why Kaggle user hyd, who used XGBoost to win the
Enefit competition,
opted to instead use CatBoost in their winning solution for the
Optiver competition. This
competition had a live evaluation period of three months following the submission deadline, during which submissions
were frozen but participants’ models could continue to train on newly collected market data.
They noted that “During the private leaderboard phase [for the Optiver competition], we need to train online within
the Kaggle environment. XGBoost tends to run out of GPU memory, while CatBoost consumes less GPU and is also very fast.”
Deep Learning
Another successful approach in tabular data competitions, as seen in previous years, was to build an ensemble
including both gradient-boosted trees and neural nets.
This can be seen in the Home Credit competition,
where the winner (SeungYun Kim) ensembled a LightGBM model, a CatBoost model, and a Denselight model (stack of MLPs with residual connections).
They noted: “At first, I planned to lightly test with Denselight and then build the final model with a larger model
(FT-Transformer, etc.) […] Surprisingly, I haven’t created or found a model that beats Denselight performance.”
Regarding GBDTs, they also noted that LightGBM “was a bit lacking compared to Catboost, but was good for the ensemble.”
Similarly, in the
Game-Playing Strength of MCTS Variants competition,
the winner ensembled a LightGBM model, a CatBoost model, and a TabM (deep learning) model. They noted that the TabM
model was their strongest single model in cross-validation.
The winner of the Linking Writing Processes to Writing Quality competition
generated features from sequences of events, and then built an ensemble of GBDT models and deep neural nets
(TabNet and LightAutoML) on their table of features.
The summary blog post describing successful solutions for DrivenData’s
Water Supply Forecast Rodeo
contains an example of an approach that is a more natural fit for deep learning than for tree-based models.
They note that, while first place and almost all other leading teams used gradient-boosted trees, the second-placed team
used a multi-headed multilayer perceptron (neural net) that simultaneously predicted multiple targets.
Not all gradient-boosted tree libraries support multi-target regression in this way, whereas with libraries like
PyTorch it’s easy to implement multi-target regression using a custom loss function.
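A minimal multi-target regression head in PyTorch — purely illustrative dimensions, not the second-placed team's architecture:

```python
import torch
from torch import nn

n_features, n_targets = 32, 3   # e.g. several forecast quantiles predicted at once
model = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_targets))

x, y = torch.randn(128, n_features), torch.randn(128, n_targets)
loss = nn.functional.mse_loss(model(x), y)   # one loss over all targets simultaneously
loss.backward()
```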
While deep learning has clearly proved its use for tabular data, one notable omission among winning approaches in 2024
is the use of any tabular or time-series pre-trained foundation models.
These models, such as
Moirai and Chronos for time-series data
and TabPFN for tabular data,
are pre-trained on large amounts of data and can be used on novel data with zero to minimal fine-tuning, and in theory
could be used in the same way as pre-trained vision models like ConvNeXt and language models like Llama.
Dataframes
Pandas has long been the dominant dataframe library with few (if any) serious alternatives, but that has been changing
in recent years.
We found 7 winning solutions using Polars, up from 3 in 2023 and none in 2022. All of them also made at least some use
of Pandas. For some, Polars was at the core of their feature engineering pipeline. For others, Polars was just used for
a few peripheral I/O tasks.
Polars is implemented in Rust, and one of the benefits over Pandas is improved speed and memory efficiency.
In line with this, Kaggle user hyd, who won the Optiver and Enefit competitions by themselves, stated they used
Polars over Pandas for performance reasons.
Polars is significantly faster than Pandas. All my feature engineering experiments are now written using Polars.
However, I still use Pandas for some exploratory data analysis (EDA),
though it probably only accounts for about 20% of my time.
— hyd
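A small, made-up example of the kind of grouped feature engineering where Polars' expression API shines:

```python
import polars as pl

df = pl.DataFrame({
    "id":    [1, 1, 2, 2, 2],
    "price": [10.0, 12.0, 9.0, 11.0, 10.5],
})
features = df.group_by("id").agg(
    pl.col("price").mean().alias("price_mean"),
    pl.col("price").std().alias("price_std"),
)
print(features)
```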
Polars hit version 1.0 in July 2024, with the team behind it confirming that this version number signifies
production-readiness21, and support for Polars in mainstream ML libraries has been steadily improving.
AutoML
Automated Machine Learning (AutoML) is a collection of techniques that aim to replace the subjective
human-driven parts of machine learning with automated processes. This includes designing features, choosing a suitable
model or ensemble of models, cross-validation, and hyperparameter optimisation.
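As an example of how little code this can involve, a minimal AutoGluon run looks roughly like the sketch below (file and column names are placeholders):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train.csv")
predictor = TabularPredictor(label="target").fit(train, time_limit=3600)  # 1-hour budget
predictions = predictor.predict(TabularDataset("test.csv"))
```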
Kaggle’s 2024 AutoML Grand Prix involved 5 tabular
data competitions where participants had 24 hours to develop their AutoML-based solutions. A Formula-1-style scoring
system was applied to the participants of each monthly challenge, from 25 points for 1st place down to 1 point for 10th
place, and a $75,000 prize pool was distributed across the five teams with the most points overall.
In the end, the top three teams were separated by a
mere two points. The teams in first and second place were made up of the developers behind the
LightAutoML and AutoGluon libraries, respectively.
Fourth and fifth place were affiliated with
H2O Driverless AI, and third place was an
individual, Robert Hatch, competing solo without any AutoML library affiliation.
The competition didn’t require solutions to be fully automated, just “AutoML-based”.
The value of the AutoML components in this competition came from giving competitors the ability to build
good models in a short space of time (within the 24h competition window).
Perhaps surprisingly, it wasn’t the case that the winning teams used only their own AutoML libraries —
it appears that they generally mostly used their own libraries, but also
at times used other tools such as gradient-boosted tree libraries or even other AutoML libraries.
The LightAutoML team noted that, in the two AutoML Grand Prix stages which they won, they used only LightAutoML.
The developers of AutoGluon maintain a
list of Kaggle
competition solutions which make use of their library, and pointed out that 9 of the top 10 teams in the AutoML Grand
Prix used AutoGluon in a solution for at least one of the five stages.
We found two instances of other competition winners using LightAutoML in their solutions. The winner of Kaggle’s
Linking Writing Processes to Writing Quality
competition used three LightAutoML neural net models, and the winner of Kaggle’s
Home Credit competition
used LightAutoML’s “Dense Light” model, in both cases as part of an ensemble.
Grandmaster-Level Agents?
Given ever-more-competent LLMs, in recent years there has been increasing excitement in the ML community about
LLM-driven autonomous “agents”.
One milestone for such systems is to compete in ML competitions at a level equivalent to expert humans.
This task is easier than many other real-world interaction tasks, in that the core interface is constrained to a few
websites and specific types of interactions.
On the other hand, given the breadth of modelling tasks included and the experience required to
avoid common pitfalls like overfitting to the public leaderboard, it is much harder than related
challenges like pure tabular AutoML or neural architecture search, and is still beyond the abilities of current
systems.
A November 2024 paper titled “Large Language Models Orchestrating Structured Reasoning
Achieve Kaggle Grandmaster Level”
presented a system that autonomously generated submissions to over 60 Kaggle competitions across tabular,
NLP, and computer vision tasks given just the competition URL.
These competitions were not, however, representative of the conditions and difficulty level required to earn a Kaggle
Grandmaster title22, and
the authors clarify at the end of the paper that “this does not imply a formal Grandmaster title on Kaggle”
23.
Some successful Kagglers, including quadruple Kaggle Grandmaster Bojan Tunguz, have commented on the paper and pointed
out the gap between the evaluations in the paper and the requirements to become a Kaggle Grandmaster.
I can decidedly say that the claims of the “Kaggle Grandmaster Level Agent” are total unqualified BS.
Main reason being that none of the tests that they had done were done on an active Kaggle competition,
and the vast majority of the datasets they used were toy synthetic datasets for the playground competitions.
— Bojan Tunguz, Quadruple Kaggle Grandmaster, LinkedIn
Similar caveats apply to results on OpenAI’s MLE-bench
benchmark. As addressed directly in the MLE-bench paper, “not all our chosen competitions are medal-granting, MLE-bench
uses slightly modified datasets and grading, and agents have the advantage of using more recent technology than the
participants in many cases.”
There is a way to sidestep all of these issues: having an AutoML system compete in active
competitions, with the same constraints as human competitors. The authors of the
above-mentioned November 2024 paper have stated their intent to do this in future work.
External and Synthetic Data
Some competitions restrict participants to training models only using the data provided, whereas others allow the use of
external data, often limited only to publicly available data.
Finding the right data to train on can be valuable, such as in Solafune’s
Finding Mining Sites
competition, where only 1,000 annotated images were provided, and the winner relied on additional external datasets
containing a million images.
Even when external data is allowed, its use is not always necessary to win. The winner of Zindi’s
Agricultural Plastic Cover Mapping
competition, Tevin Temu, told us that they “used a LightGBM classifier and focused on generating new features from the
provided dataset”, rather than use any external data.
As we saw in previous years,
some competition winners used generative models to create additional synthetic training data.
For example, in DrivenData’s spacecraft detection competition,
the winners first trained their model on 300,000 synthetic images, for which they generated backgrounds using a
diffusion model, before fine-tuning the model on the provided training data.
The winners of Kaggle’s AI Mathematical Olympiad also used significant synthetic data,
using GPT-4 to generate “reasoning paths”, which they then filtered before training their model on them.
Similarly, the winners of the ARC Prize 2024 used synthetic data to supplement the mere hundreds of
training examples provided.
Other winners of NLP competitions used similar approaches — see NLP & Sequence Data for
more.
Use of API-gated models
The most capable frontier models — such as Claude, Gemini, and OpenAI’s various models — tend to be
available only via an API. This allows model providers to charge for usage, and prevents users from running a
copy of the model on their own hardware.
These models have been used by competition winners to generate synthetic data, and occasionally were called directly in
solutions at inference time.
However, these models can’t be incorporated in most competition solutions. Many of the most interesting recent
competitions are code competitions, where participants submit code that is evaluated on the
competition platform’s servers, as opposed to running their code locally and submitting predictions.
Usually, the evaluation environment for these code competitions does not allow solutions to call out to external APIs.
These conditions allow competition organisers to ensure that participants develop their solutions without any leakage.
Some competitions take a hybrid approach. The ARC Prize, for example, has a private leaderboard which
requires code submission as well as a semi-private leaderboard (“ARC-AGI-Pub”) for which participants can generate
predictions locally and make use of model APIs. The semi-private leaderboard allows models like OpenAI’s
o1 and o3 to be benchmarked on ARC — though the results are not directly comparable to the private leaderboard
results.
It is possible that some future competitions will enable competition organisers to keep their test data private without
requiring participants to share their model code or weights.
A recent pilot experiment run by
OpenMined, in partnership with the UK AI Safety Institute and Anthropic, demonstrated a proof of concept for this type
of mutually private evaluation, using NVIDIA H100’s secure enclave technology.
Notable Competitions
AI Mathematical Olympiad
In a year when Google DeepMind announced a system that reached silver-medal-level performance in the International
Mathematical Olympiad (IMO), there was great focus on automated mathematical reasoning using language models.
XTX Markets’ AI Mathematical Olympiad (AIMO) competition series, first
announced in November 2023, aims to spur the open development of a model capable of winning a gold medal in the IMO,
with $5m set
aside for progress prizes and $5m for the grand prize. In 2024, a total of $263,952 was paid out for the first AIMO
progress prize.
AIMO Progress Prize 1
The annual IMO competition on which the AIMO is based requires answers to include proofs.
The scope of the first AIMO progress prize was more limited: the problems were easier than IMO-level
problems, and only involved integer solutions between 0 and 999 (inclusive). The question difficulty is described as
“similar to an intermediate-level high school math challenge”.
Ten sample questions were provided as training data; the public leaderboard throughout the competition was calculated
based on 50 questions, and the final (private) leaderboard based on another 50 questions. All the questions are
described as novel, and “created by an international team of problem solvers”.
As an example, one question from the AIMO progress prize 1 training set states: “There exists a unique increasing
geometric sequence of five 2-digit positive integers. What is their sum?”
Answer and reasoning
Answer: 211.
One possible line of reasoning: We can write the geometric sequence as m_i = n * x^(i-1), with n an integer and i ranging
from 1 to 5. The fifth number of this sequence is m_5 = n * x^4. For all the m_i to be integers, x needs to be a rational
number, so can be written as p/q, with p and q coprime integers. q^4 must divide n (for m_5 to be an integer),
so we can write n as n = k * q^4.
Picking the smallest possible numbers here gets us k=1, and q=2. p/q needs to be greater than 1 for the sequence to be
increasing, so the smallest p we can pick is 3. That results in a sequence of [16, 24, 36, 54, 81] whose
sum is 211.
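For this kind of answer, a brute-force check is easy to write; the short script below (not from any competition solution) confirms that the sequence is unique and that the sum is 211:

```python
from fractions import Fraction

solutions = []
for a in range(10, 100):                # first term
    for p in range(2, 40):              # ratio numerator
        for q in range(1, p):           # ratio denominator (ratio > 1, so increasing)
            r = Fraction(p, q)
            terms = [a * r ** k for k in range(5)]
            if all(t.denominator == 1 and 10 <= t <= 99 for t in terms):
                seq = [int(t) for t in terms]
                if seq not in solutions:
                    solutions.append(seq)

print(solutions)           # [[16, 24, 36, 54, 81]]
print(sum(solutions[0]))   # 211
```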
The baseline solution, using Google’s open-source Gemma-7B model, scored 3/50 on the public and private test sets.
The majority of the prize pool was set aside for any team able to achieve a score of over 47/50 on both test sets.
The best score achieved during the competition was 29/50, by team Numina, with an impressive lead over the rest of the
field.
Second-placed CMU_MATH got a score of 22, and only ten other teams (out of over 1,000) got a score of 20 or higher.
Winning Solution
Team Numina’s solution can be summarised as:
Gathering a large dataset of hundreds of thousands of relevant mathematics problems with corresponding solutions.
Using GPT-4 to generate additional solutions to some of the gathered problems, with reasoning paths that incorporate
tool use (e.g. symbolic solvers), and filtering out any solutions where the final answer was incorrect.
Fine-tuning DeepSeekMath-Base-7B (a base language model trained for solving mathematics problems) on the above
datasets — first to generate chain-of-thought solutions and then to generate solutions that make use of external
tools within a Python environment. Full fine-tuning was done on a node of 8x H100 GPUs, updating all of the model's
weights without needing to use techniques like LoRA.
At inference-time, generating 48 candidates for each problem, with up to 3 iterations of calling out to a Python
environment, before using majority voting (sketched just after this list) to choose an answer. Quantising their model
to 8-bit precision allowed for efficient inference and more generations than would have been possible at full precision.
Avoiding overfitting by evaluating on four validation sets with around 1,600 problems from AMC, AIME, and the MATH
data sets.
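The majority-voting step at inference time is conceptually simple — something like the sketch below, though Numina's actual inference code also handles answer parsing and tool calls:

```python
from collections import Counter

def majority_vote(candidate_answers):
    """Pick the most common integer answer among the generated candidates."""
    valid = [a for a in candidate_answers if a is not None]   # drop failed generations
    if not valid:
        return 0                                              # fall back to a default guess
    return Counter(valid).most_common(1)[0][0] % 1000         # answers are integers 0-999
```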
Two key papers that influenced their approach are
ToRA (Tool-integrated Reasoning Agent) and
MuMath-Code (multi-perspective data augmentation combined with code-nested
solutions).
We [used] a pipeline leveraging GPT-4 to generate TORA-like reasoning paths,
executing the code and producing results until the solution was complete.
We filtered out solutions where the final answer did not match the reference
and repeated this process three times to ensure accuracy and consistency.
After winning the first progress prize, Project Numina were awarded a €3m grant by XTX Markets to support their
initiative to collect and publicly release a dataset of up to one million formal mathematical problems and proofs.
They are not eligible to enter the second progress prize competition.
AIMO Progress Prize 2
The
second AIMO progress prize
launched in October 2024, and will close on the 25th of March, 2025. This prize has an available prize pool of over $2m,
with harder problems (“around the National Olympiad level”). Answers are still purely numerical —
integers between 0 and 999, though for some questions this will require taking the ‘real’ answer modulo 1000.
The performance threshold for unlocking the $1m+ top prize is set at 47/50 correct solutions on both the public and
private test sets. At the time of publication, the highest score on the public leaderboard is 31/50.
In the first AIMO progress prize, teams were allowed to use “AI models and tools that are open source and were released
prior to 23 February 2024”. All of the top 4 teams based their solutions around variants of the DeepSeek-Math-7B model.
All 4 teams also acknowledged that their solution was based at least in part on the public notebook shared by the
winner of the $10k early sharing prize,
which incentivised open solution sharing among competitors.
In line with Numina’s suggestions in their post-competition blog post to “enable evaluation on modern GPUs” and
"[relax] the pretrained model cutoff date", a few changes have been made for the second AIMO progress prize.
Firstly, there is a whitelisting process
for models that can be used in solutions, enabling teams to use more-recently-released models.
Secondly, evaluation can now be done on
virtual machines with 4x L4 GPUs.
These have a total of 96GB of GPU memory, as compared to 32GB for the 2xT4 GPU or 16GB for the P100 GPU VMs that Kaggle
competitions usually allow. Being a more modern GPU, the L4 also supports the bfloat16 data type and other optimisations
that are useful for efficient inference.
These two modifications enable solutions which are closer to state-of-the-art research in mathematical reasoning with
LLMs.
The $20k
early sharing prize
for the second AIMO progress prize was won by a notebook that achieved a score of 20/50 on the public leaderboard using
the QwQ-32B-preview model — a “reasoning” model which makes use of inference-time compute scaling —
with some prompt engineering and inference tweaks, but without any fine-tuning.
The new ongoing whitelisting process gives onlookers a potential clue as to which models might be used by top teams.
On Thursday 23rd January 2025, a few hours before DeepSeek R1 was whitelisted for this competition24,
the highest score on the public leaderboard was 25, and there were 42 teams on the leaderboard with a score of 21 or
above.
By the following Monday morning, team NemoSkills’ score had jumped from 23 to 27 with four additional submissions,
leapfrogging them into first place, and there were 82 teams with a score of 21 or over.
As of the time of publication, NemoSkills are still at the top of the leaderboard, with a score of 31.
ARC Prize
One of the most-discussed competitions of 2024 was the ARC Prize, based on the ARC-AGI-1 dataset. Described by
its creator François Chollet as “a barometer of how close or how far we are to developing general AI, and
second, a source of inspiration for AI researchers”, ARC-AGI-1 is made up of 2d-grid-based visual puzzles,
similar to the non-verbal reasoning puzzles used in intelligence tests.
In last year’s report, we gave an overview of the previous
iteration of the ARC challenge.
Even though the challenge had been around for almost five years in some form with the same private test set,
the added gravitas of the new $1m prize pool seemed to attract the attention of research labs who saw ARC Prize 2024
as an opportunity to evaluate and demonstrate the reasoning abilities of their systems — particularly at a time
when reasoning, in various forms, was seen as a weakness of language models.
As the ARC Prize team put it: “Many well-funded startups [have] shifted priorities to work on ARC-AGI
— we’ve heard from seven such companies this year.
Additionally, multiple large corporate labs have now spun up internal efforts to tackle ARC-AGI.”
There was significant progress on the ARC-AGI private leaderboard, with the best single-solution score going from
30% in 2023 to 55.5% in 2024. Since the 85% human-equivalent performance threshold was not reached, only $125k of the
total $1.1m available prize pool was paid out in 2024.
ARC 2024 Winner
The top-performing team, the ARChitects, scored 53.5% on the private test set — beating second place’s 40% and
marking a significant jump from the 30% reached by the two joint winners of 2023’s ARCathon.
There was one team that got a higher score of 55.5%25, but they were not eligible
for prizes as they chose not to open-source their solution.
Overview of the ARChitects’ solution. Source: GitHub repo.
The ARChitects’ winning solution has several interesting components:
Tokenisation: they tokenise the problem definitions into a 1-dimensional sequence, one token per cell, allowing them
to be consumed by a language model. They restrict the number of possible tokens to 64, with special tokens
including newline (to be able to represent a 2d grid) and others to delimit the beginning and end of input and output definitions (a minimal sketch of this idea follows the list below).
Fine-tuning: they fine-tune a language model (Mistral-NeMo-Minitron-8B-Base) on the provided training set and additional
data to predict output grids. At test time, they do additional fine-tuning on the private test data, and use LoRA,
4-bit quantisation, and checkpointing to enable this to happen within the limited compute budget of Kaggle’s evaluation
environment.
Candidate generation: they sample 8-16 output candidates from their fine-tuned model for each task, using depth-first search26.
Depth-first search efficiently finds an output string that is assigned high likelihood by the model, whereas
greedy decoding could optimise for the likelihood of individual tokens at the expense of the string as a whole, and
stochastic sampling would be too computationally expensive.
Candidate selection: they select the 2 ‘best’ candidates for submission by choosing those for which the language model expresses the
highest confidence across a set of augmentations.
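To make the tokenisation idea concrete, here is a minimal sketch. This is our own illustration rather than the ARChitects' code: the token names and vocabulary below are invented, and the real solution's 64-token vocabulary differs in detail.

```python
# Illustrative sketch: flattening an ARC grid pair into a 1D token sequence.
# Token names and vocabulary are made up for illustration.
SPECIAL = ["<in>", "</in>", "<out>", "</out>", "<nl>"]       # delimiters and newline
COLOURS = [str(c) for c in range(10)]                        # ARC grids use 10 colours (0-9)
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + COLOURS)}  # small, fixed vocabulary

def grid_to_tokens(grid, open_tok, close_tok):
    """Flatten a 2d grid (list of rows of ints 0-9) into a 1d token list."""
    tokens = [open_tok]
    for row in grid:
        tokens.extend(str(cell) for cell in row)
        tokens.append("<nl>")  # newline token preserves the 2d structure
    tokens.append(close_tok)
    return tokens

def encode_example(input_grid, output_grid):
    tokens = (grid_to_tokens(input_grid, "<in>", "</in>")
              + grid_to_tokens(output_grid, "<out>", "</out>"))
    return [VOCAB[t] for t in tokens]

# Tiny example: a 2x2 input grid and its corresponding 2x2 output grid.
ids = encode_example([[0, 1], [1, 0]], [[1, 0], [0, 1]])
print(ids)  # a 1d sequence of token ids, ready to be consumed by a language model
```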
Their training data includes examples from the Re-ARC, Concept-ARC, and ARC-Heavy datasets, as well as augmentations
of these (rotation, reflection, shuffling the order of examples, and permuting colours). Augmentation is key throughout
the fine-tuning, candidate-generation, and candidate selection parts of their solution.
Notably, their selected augmentations are designed so that they preserve the task’s structure while significantly
changing the 1d tokenised representation that is fed into the model. They find that the model is better at learning
patterns when presented with problems in certain orientations, and the augmentations allow them to exploit this
characteristic.
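As a rough illustration of this kind of structure-preserving augmentation (again ours, not theirs), the sketch below applies rotations, reflections, and a colour permutation to a grid using numpy; in practice the same augmentation would be applied consistently to every grid in a task.

```python
# Illustrative sketch: ARC-style augmentations preserve a task's structure
# while substantially changing its flattened 1d representation.
import numpy as np

rng = np.random.default_rng(0)

def augment(grid, k, flip, colour_perm):
    """Rotate by k*90 degrees, optionally reflect, then remap colours."""
    g = np.rot90(grid, k)
    if flip:
        g = np.fliplr(g)
    return colour_perm[g]  # fancy indexing applies the colour mapping cell-wise

grid = np.array([[0, 1, 2],
                 [1, 0, 2]])
colour_perm = rng.permutation(10)  # a random bijection over the 10 ARC colours

for k in range(4):
    for flip in (False, True):
        print(k, flip, augment(grid, k, flip, colour_perm).flatten())
```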
Augmentation is also used at test time. Each of their 8-16 candidate solutions corresponds to a prediction for one of 16
different augmentations of the task. Candidate selection is key: they show that on the public eval set, one particular
configuration of their algorithm includes the correct solution among its 16 generated candidates 80% of the time, but
with a naive selection strategy the 2 submitted candidates would contain a correct solution only 60.5% of the time.
As for the language model used, they noted that modern architectures and larger models tended to perform better, and
made several optimisations to allow them to use the largest model feasible within the constraints of the evaluation
environment.
There was also great progress on the ARC-AGI-Pub leaderboard, where the best score on the “semi-private” evaluation set
— whose 100 task definitions and solutions were not released publicly, but whose task definitions were exposed to
participants, who could use LLM APIs and internet access in their approaches — jumped from 43% in June 2024 to
53.6% in early December, before OpenAI’s o3 system achieved 75.7% in late December.
While the ARC-AGI private leaderboard limits inference compute and does not allow internet access during evaluation, the
only restriction on the public leaderboard is a $10,000 cap on total inference cost.
OpenAI’s 75.7% score was achieved with an inference cost of $8,689 across the 500 tasks (100 semi-private and 400
public), or just over $17 per task. OpenAI revealed a separate configuration of o3 that could achieve
87.5% on the semi-private evaluation set with “roughly 172x” as much inference compute — therefore not eligible
for the leaderboard, but nonetheless showing that there is room for performance improvement with additional compute.
The ARC Prize team also benchmarked DeepSeek’s R1 model on the semi-private evaluation set, and it achieved a score of
15.8% with an inference cost per task of $0.06.
Interestingly, R1-Zero — the version of DeepSeek’s R1 system which didn’t
go through the supervised fine-tuning stage requiring extensive human-labelled data — performed only slightly
worse, achieving a 14% score with an average cost of $0.11. It’s worth noting that the R1 systems were not adapted
specifically for the ARC task, whereas it is unclear how similar the o3 system evaluated on ARC is to the general o3
system27.
Effective Approaches
The ARC Prize 2024 Technical Report gives a great overview of the variety of
approaches to solving ARC, on both the public and private leaderboards.
There were three major categories of successful approaches:
Deep learning-guided program synthesis: using LLMs to generate Python programs that generate solutions.
Some enhancements to this included iterative debugging (providing the LLM with the output of the programs and letting
it propose edits), and specifying the programs in a domain-specific language (DSL). In contrast to a
general-purpose programming language like Python, the DSL would be designed to abstract away relevant concepts for ARC
like “objects”, “borders”, or “enclosing”, allowing program search over higher-level concepts (a minimal sketch of the
basic synthesis loop follows this list).
Test-time fine-tuning on models that directly predict outputs: updating the weights of the LLM at test time, by
fine-tuning either on the entire private test set or on the specific task at hand. Interestingly, this seems to allow
models to perform much better than just using the “in-context learning” approach of providing the test set examples in
the prompt. The ARChitects’ #1 private leaderboard solution discussed above used this approach.
Combining the two approaches above: each of the two above approaches seemed to do particularly well at different
subsets of tasks, making ensembling them very effective. In the report, the program synthesis approach is generally
referred to as “inductive”, whereas directly predicting outputs is referred to as “transductive”.
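As a minimal sketch of the basic inductive loop described in the first category above (our illustration, not any team's code): `call_llm` is a hypothetical stand-in for whatever model or API a team might use, and real solutions add DSLs, large candidate pools, sandboxed execution, and more sophisticated search.

```python
# Illustrative sketch: LLM-guided program synthesis with iterative debugging.
# `call_llm` is a hypothetical placeholder for an LLM call.
def call_llm(prompt: str) -> str:
    """Placeholder: should return Python source defining a function solve(grid)."""
    raise NotImplementedError

def run_candidate(source: str, train_pairs):
    """Execute generated code and collect the training pairs it gets wrong."""
    namespace = {}
    try:
        exec(source, namespace)  # a real system would sandbox this
        solve = namespace["solve"]
        failures = []
        for x, y in train_pairs:
            pred = solve(x)
            if pred != y:
                failures.append((x, y, pred))
        return failures
    except Exception as e:
        return [("error", repr(e), None)]

def synthesise(task, max_rounds=5):
    prompt = f"Write a Python function solve(grid) for this task:\n{task['train']}"
    source = call_llm(prompt)
    for _ in range(max_rounds):
        failures = run_candidate(source, task["train"])
        if not failures:  # the program fits all training examples
            return source
        # Iterative debugging: show the model its own program and its mistakes.
        prompt = (f"This program:\n{source}\nfails on these examples:\n{failures}\n"
                  "Return a corrected version of solve(grid).")
        source = call_llm(prompt)
    return source  # best effort after max_rounds
```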
Next Steps
A new dataset, ARC-AGI-2, is expected to launch alongside ARC Prize 2025. The organisers have stated that they are
committed to running the Grand Prize competition “until a high-efficiency, open-source solution scoring 85% is created”.
They have also learnt from the experience of running competitions based on ARC-AGI-1 over the last few years, and are
designing ARC-AGI-2 to be harder for automated systems while remaining easy for humans, stating that
“early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3,
potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over
95% with no training).”
AI Cyber Challenge
DARPA — the US Defense Advanced Research Projects Agency — has a long history of running competitions with
large prizes, such as the 2004 DARPA Grand Challenge for autonomous driving, which offered a $1m prize; some of its
participants later went on to found self-driving car companies including Waymo and Cruise.
In the recent DARPA AI Cyber Challenge, run in partnership with Anthropic, Google, Microsoft, OpenAI, and others,
competitors build tools to automatically find and fix vulnerabilities, using LLMs.
Challenge projects in the semi-finals were based on open source projects including Jenkins, the Linux kernel, Nginx,
SQLite3, and Apache Tika.
The submitted tools discovered 22 unique synthetic vulnerabilities, of which they patched 15. Notably, they also found
one real-world bug in SQLite3.
Andrew Carney, program manager for the competition, said of the semifinals that
“we’ve seen that AI systems are capable of not only identifying but also
patching vulnerabilities to safeguard the code that underpins critical infrastructure”28.
The final competition will take place at DEF CON 2025 in Las Vegas, where the winning team will receive
a $4m prize. For more on the AI Cyber Challenge, see their website.
Vesuvius Challenge
As covered in detail in
last year’s report,
the Vesuvius Challenge is an ongoing effort to extract text from papyrus scrolls that were carbonised when Mount
Vesuvius erupted almost two thousand years ago. The scrolls are scanned in minute detail using X-ray tomography, and the
resulting masses of 3D data are analysed to first digitally unwrap each scroll into a flat sheet, then locate ink on
that sheet, and lastly identify the (Greek) letters and passages written in the ink.
In 2023, the team (and competition participants) managed to read over 5% of one of the scrolls, using a mixture of
machine learning/computer vision techniques and hundreds of hours of manual labour. One of the 2024 prizes was targeted
at repeating this feat, but with increased automation: bringing the human labour down to below 4 hours, while
maintaining at least 95% of the result. While this wasn’t fully achieved in 2024, good progress was made towards this
goal and partial prizes were paid out for two submissions.
The 2024 grand prize — to read over 90% of 4 of the scrolls — went unclaimed. Interestingly,
some scrolls appear to be much easier to read than others: while roughly 5% of the first scroll’s
text was recovered in the first year of the competition, even after two years no text has been recovered from the other
three scrolls that made up the initial dataset. In November 2024, a fifth scroll was scanned, ink was
discovered almost immediately, and some letters were visible with minimal effort29.
Great news, as per the organisers:
“Scroll 5’s greatest gift might be its potential ability to operate as a ‘Rosetta Stone’ for ink detection into other
scrolls.”30
Monthly progress prizes in the range of tens of thousands of dollars were paid out throughout 2024 for contributions to
various aspects of uncovering text within the scrolls, and in total around $1.5m in prize money has now been paid out
as part of the Vesuvius Challenge. The initial 2025 prizes include $200k for reading an entire scroll, and $60k for the
first team to uncover at least 10 letters within a small area of scrolls 2, 3, or 4.
The team behind the challenge are on a long-term mission to uncover huge amounts of writing from antiquity,
and are hiring for several roles, including a platform engineer and a
computer vision & geometry researcher.
Looking Ahead
Inference-time scaling
Some recent frontier models — notably OpenAI’s o1 and o3 models as well as DeepSeek’s R1 model —
have demonstrated the effectiveness of using larger amounts of inference-time compute to increase performance, also
referred to as inference-time scaling.
For code competitions, where inference happens on competition platforms’ servers, scaling up inference-time compute
becomes expensive, which may prove an interesting challenge for competition platforms.
On the one hand, there has long been an upward trend in compute usage for ML, and platforms like Kaggle have kept
up with this by first adding K80 GPUs, then P100 GPUs, then 2x T4 GPUs. For both the Konwinski Prize and the second
AIMO progress prize, Kaggle is allowing competitors to use VMs with 4x L4 GPUs for evaluation. These have a total of
96GB of VRAM, three times as much as the older VMs with 2x T4 GPUs.
On the other hand, the logarithmic inference-time compute scaling trends demonstrated by OpenAI’s o1 model seem to have
shifted the ML research community’s focus
to finding effective ways to increase inference-time compute, and so the inference component of solutions’ compute
budget could end up scaling much faster than in previous years.
Ongoing and future competitions
There are many interesting competitions in the works for 2025 and beyond, with some already live.
The second iteration of the AI Mathematical Olympiad is currently running, some new
Vesuvius Challenge prizes have been announced, and the
ARC Prize is also expected to have a new competition launching soon.
The Konwinski Prize builds on the SWE-bench
benchmark, which measures models’ abilities to correctly submit code patches that close GitHub issues. In the words of
the prize’s sponsor: “I’m giving $1M to the first team that exceeds 90% on a new version of the SWE-bench benchmark
containing GitHub issues we collect after we freeze submissions.”
The submission deadline is 12 March 2025. The evaluation set will be made up of new GitHub issues resolved in the three
months following the deadline.
About This Report
About ML Contests
For over five years now, ML Contests has provided a competition directory and shared insights
on trends in machine learning competitions. To receive occasional updates with content like this report,
subscribe to our mailing list or RSS feed.
If you enjoyed reading this report, you might like the previous editions:
Thank you to Peter Carlens, Alex Cross, James Mundy, and Mia Van Petegem, for helpful feedback on drafts of this report.
Thank you to Fritz Cremer, for sharing insights on Kaggle competitions, and to Eniola Olaleye for help with data
gathering.
Thank you to the teams at AIcrowd, Antigranular, CrunchDAO, DrivenData, Grand Challenge, Kaggle, Solafune, ThinkOnward,
Tianchi, Trustii, and Zindi, for providing data on their 2024 competitions and winning solutions.
Thank you to the competition winners who took the time to answer our questions by email or questionnaire:
Igor Kuivjogi Fernandes, Jan Disselhoff, Lewis Hammond, Ivan Panshin, hyd, Gaius Doucet,
Harshit Sheoran, Stefan Strydom, Guberan Manivannan, Kranthi, Saket Kunwar, Tevin Temu, Alex Gichamba,
Tewodros K Idris, Brian Ebiyau, Eric Nyberg, Teruko Mitamura, Victor Tolulope Olufemi,
Aymen Sadraoui, and Charles Yusuf.
Lastly, thank you to the maintainers of the open source projects we made use of in conducting our research and
producing this page: Hugo,
Tailwind CSS, Chart.js,
Linguist, pipreqs, and
nbconvert.
Methodology
Data was gathered from correspondence with competition platforms, organisers, and winners, as well as from public
online discussions.
The following criteria were applied for inclusion, in-line with the submission criteria for our
listings page. Each competition must:
have a total available prize pool of at least $1,000 in cash or liquid cryptocurrency, or be an
official competition at a relevant conference.
have ended between 1 January 2024 and 31 December 2024 (inclusive).
Where possible, we used data provided by each competition platform, describing their own competitions, as a
starting point. Most platforms were able to provide us with this data for 2024.
We did not receive this data from all platforms, and in these cases we gathered the data from their website. Notably,
for CodaLab, Codabench, and EvalAI, three ‘self-service’ platforms where competition organisers can create their own
competitions with minimal to no intervention from the platform team, we did our best to collect complete data and filter
out irrelevant competitions, such as ones used for class assignments or draft competitions which never ended up running.
Our filter for ‘relevant conferences’ was broader this year, as we started with a superset of potentially relevant
competitions from competition platforms before filtering irrelevant ones out, and found a long tail of conferences with
interesting ML-related competitions. This likely contributed significantly to us finding more relevant competitions
without cash prizes.
In future years, we may be more selective with our conference filter and have a
short list of included conferences. We may also choose to exclude conference workshop competitions in future, including
only competitions for conferences with a separate competition track.
We were not able to collect full data for all platforms. For example, the dsworks.ru site bore a
notice stating that the team had “temporarily taken our website offline for updates” when we tried to access it.
It is likely that there are additional competition platforms as yet unknown to us. Please contact us if you know of any;
we are open to including them in future.
When counting a “number of competitions” for purposes such as prize pool distribution, or popularity of programming
languages, we generally use the following definition: If a competition is made up of several tracks, each with separate leaderboards and separate prize pools, then each track
counts as its own competition. If a competition is made up of multiple sub-tasks which are all measured together on
one main leaderboard for one prize pool, they count together as one competition. There are some exceptions.7
For the purposes of this report, we consider a “competition winner” to be the #1-placed team in a competition as defined above.
We are aware that other valid usages of the term exist, with their own advantages — for example, anyone winning a Gold/Silver/Bronze medal,
or anyone winning a prize in a competition. For ease of analysis and in order to avoid double-counting, we exclusively
consider #1-placed teams in this report when aggregating statistics on “winning solutions” or “competition winners”.
Compiling the Python packages section in the winning toolkit involved some discretion. While we
attempted to highlight the most popular and interesting packages for readers, we did not simply take the n most popular
packages.
The number of users for each platform is sourced from the platform directly, where possible. Some platforms, like
Kaggle, AIcrowd, CodaLab, and Codabench, list their user numbers publicly on their website. Most other platforms shared
their user numbers with us via email. Hugging Face does not have user accounts specific to its competitions product.
Solafune chose not to disclose their user numbers. User numbers for Signate, Tianchi, and Bitgrit are a year old, as we
did not receive updates on their user numbers before publication of this report.
We excluded the DARPA AI Cyber Challenge small business track, as it wasn’t open to individuals. We excluded the
AI Agents Global Challenge, as the prizes were a mixture of investment and compute credits, not cash. We excluded the
Google Gemini API Developer Competition, as we felt it was more of a product development competition than a machine
learning competition.
For attribution in academic contexts, please cite this work as
Carlens, H, “State of Machine Learning Competitions in 2024”, ML Contests Research, 2025.
BibTeX citation
@article{carlens2025state,
  author  = {Carlens, Harald},
  title   = {State of Machine Learning Competitions in 2024},
  journal = {ML Contests Research},
  year    = {2025},
  note    = {https://mlcontests.com/state-of-machine-learning-competitions-2024},
}
Over half of the total prize money in 2024 was for the DARPA AI Cyber Challenge. Excluding this
competition, the total available prize pool of around $8.4m was still higher than in 2022 and 2023. ↩︎
The majority of CodaLab’s competitions in 2024 were conference workshop competitions, such as
for the NTIRE workshop at CVPR, with no prize money associated.
Currently these competitions are included in our report dataset (given that our criteria state
“conference affiliation”), but we are in the process of reviewing our inclusion criteria and may in future reports
restrict the criteria to include only conference competitions that are part of official competition tracks, such as at
NeurIPS or ICRA, or with a meaningful monetary prize. ↩︎
As per the CodaLab newsletter:
“Codabench platform software is now concentrating all development effort of the community. In addition to CodaLab
features, it offers improved performance, live logs, more transparency, data-centric benchmarks and more!
We warmly encourage you to use codabench.org for all your new competitions and benchmarks.” ↩︎
Public user numbers taken from AIcrowd,
CodaLab,
Codabench,
EvalAI,
Grand Challenge,
Kaggle,
and Zindi
on 18 February 2025.
User numbers for Bitgrit, Signate, and Tianchi are a year old, as we weren’t able to get an updated figure this year.
For other platforms, user numbers were provided by the platform team over email. ↩︎
The number of competitions and total prize money amounts are for competitions that ended in 2024.
Prize money figures include only cash and liquid cryptocurrency. Travel grants and other types of prizes are excluded.
Amounts are approximate — currency conversion is done at data collection time, and amounts are rounded to the
nearest $1,000 USD. See Methodology for more details. ↩︎
User numbers for Bitgrit are as of March 2024. We reached out to the team at Bitgrit for updated user
numbers but did not get a response. ↩︎
CrunchDAO’s DataCrunch
competition is an ongoing US equity market prediction competition, where prizes are paid out monthly and a total of
$120k is available per year. The CrunchDAO website states that it started on 25/06/2024. In our dataset we have counted
this as one competition with a $60k prize pool in 2024. ↩︎↩︎
“More than 80,000 high-level human resources are participating, including excellent data scientists at major companies and students majoring in the AI field. (as of February 2023)”. Source: Signate’s website, translated from Japanese using Google Translate on 5th of March 2024. ↩︎
Anecdotal evidence from speaking to multiple quant trading firms at ML conferences who explicitly
stated that they value ML competition performance in applicants, as well as competitors who have been contacted by quant
recruiters because of their position on certain competition leaderboards. ↩︎
When analysing winning solutions, we restrict to competitions where ranking is based fully or
primarily on modelling performance, excluding competitions based around writing, data visualisation, or other elements
that require subjective ratings from a panel of judges. ↩︎
Out of the 53 competition winners whose compute resources we were able to confirm, 44 (83%) used
NVIDIA GPUs, 8 only used CPU resources, and one used a Google TPU. ↩︎
Tens of thousands of research papers published in 2024 cited the use of NVIDIA chips, whereas only
hundreds cited AMD chips, as per the State of AI Report Compute Index. ↩︎
Because we don’t have a reliable way to link user accounts across competition platform sites,
these figures only count repeat wins on a given platform. Someone who wins a competition on Kaggle and later wins one on
DrivenData will be counted as a first-time winner in the second competition. ↩︎
The competition rules required the use of Falcon-7.5B or Phi-2, but the choice of ColBERT was down to
the winners themselves. ↩︎
The “Tuning Meta LLMs for African Language Machine Translation” competition required that
“Winning solutions must use open-source models from the Meta repository on Hugging Face, including,
but not limited to, NLLB 200, SeamlessM4T, as well as any derivatives of these models.” ↩︎
For example, during training for the Llama 3 models, parameters were mostly stored in bfloat16 and
gradients in float32:
“To ensure training convergence, we use FP32 gradient accumulation
during backward computation over multiple micro-batches and also reduce-scatter gradients in FP32 across
data parallel workers in FSDP.” (paper).
More recently, DeepSeek V3 used mixed-precision training with FP8: “we cache and dispatch activations in FP8, while
storing low-precision optimizer states in BF16” (paper)
For an introduction to mixed-precision training,
see Sebastian Raschka’s overview from May 2023. ↩︎
The bfloat16 format has the same range as the 32-bit float format, but with lower precision (whereas the
normal 16-bit float format sacrifices both range and precision as compared to float32). ↩︎
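For readers who want to check this themselves, a quick sketch with PyTorch (assuming it is installed):

```python
# Compare range (max) and precision (eps) of the three formats; values shown are approximate.
import torch

for dt in (torch.float32, torch.bfloat16, torch.float16):
    info = torch.finfo(dt)
    print(dt, f"max={info.max:.3e}", f"eps={info.eps:.3e}")
# float32:  max ~3.4e+38, eps ~1.2e-07
# bfloat16: max ~3.4e+38 (same range as float32), eps ~7.8e-03 (much lower precision)
# float16:  max ~6.6e+04 (much smaller range),    eps ~9.8e-04
```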
As per the Polars blog: “With this release, we signify that the Polars in-memory engine and API is production ready.”↩︎
The paper claims “6 gold medals, 3 silver medals, and 7 bronze medals, as defined by
Kaggle’s progression system”, however:
None of the submissions were for active competitions; all submissions were for competitions that had already ended.
All the claimed “gold” and “silver” medals, as well as 5 of the “bronze” medals, were for Community,
Getting Started, or Playground competitions. These types of competitions do not award medals.
The two other claimed “bronze” medals, one for a Featured competition and the other for a Research competition,
were for competitions which ended in 2015 and 2016.
On p. 38 of their paper they state a caveat that downplays their claims somewhat:
“Finally, while we refer to our agent as having “Kaggle Grandmaster-level”
performance, we clarify that this does not imply a formal Grandmaster title on Kaggle.
Our achievements include several gold medals across various competition
types — community, playground, and featured — demonstrating our agent’s competitive capability among data scientists,
but some competitions may be less complex than others, and some are non-medal awarding competitions.
To formally pursue the Grandmaster title, we aim to focus on active competitions, scaling and improving our results as
we advance to newer versions of Agent K.”↩︎
The first six distilled versions of DeepSeek R1 were whitelisted
on Thu 23 Jan 2025 at 18:17 GMT. Our “pre-whitelisting” leaderboard snapshot was taken around three hours earlier,
at 15:14 GMT on the same day. Our post-whitelisting leaderboard snapshot was taken at 10:44 GMT on Mon 27 Jan. ↩︎
The top-performing solution on the private leaderboard was by team MindsAI, with a score of 55.5%.
Team MindsAI chose not to open-source their solution, thereby foregoing the prize money.
Team MindsAI is made up of Jack Cole and Mohammed Osman (joint winners of the 2023 ARCathon),
and Michael Hodel (winner of the 2022 ARCathon).
All three MindsAI members are now part of Tufa Labs, who describe themselves
as “a small, independent research group working on fundamental AI research” initially focused on completing the ARC Prize challenge. ↩︎
In their o3 blog post, the ARC Prize team note that “OpenAI shared they trained the o3 we tested on 75%
of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to
understand how much of the performance is due to ARC-AGI data.”
For more details, see the ARC Prize’s blog posts on o3
and R1. ↩︎