We summarise the state of the competitive landscape, analyse the 300+ competitions that took place in 2023, and take a deep dive into 60+ winning solutions to figure out the best strategies for winning at competitive ML.
In 2023, several new ML competition platforms launched, and there were a number of high-profile
competitions with very large prize pools.
We found over 300 ML competitions that took place in 2023, across more than 18 competition platforms and many more
independent single-competition websites. With a total cash prize pool of over $7.8m, 2023 saw a large jump in prize
money compared to 2022 and other recent years.
We cover major developments on each of the competition platforms in the Platforms section.
Further down we review some of 2023’s Notable Competitions in more depth, and preview some
upcoming developments in the Looking Ahead section. Competition winners’ approaches and technology
choices are analysed in the Winning Solutions section.
Platforms
Google’s Kaggle platform has by far the most registered users (17m+, up from 10m+ last year), and remains the largest
platform by total prize money, with DrivenData second.
CodaLab ran the most competitions that fit our inclusion
criteria1; many of these were competitions affiliated with academic conferences.
Among academic conferences, the other popular competition platforms tended to be EvalAI,
AIcrowd, and Kaggle.
Alibaba Cloud’s Tianchi ran the second-most competitions that fit our criteria.
Zindi — in addition to 21 competitions — also held 47
community hackathons, and awarded $190k in total prize money across community hackathons and competitions in 2023.
2023 Platform Comparison
A few new platforms started listing competitions in 2023:
Codabench is a new iteration of the popular CodaLab platform. It
maintains the open-source nature of its predecessor, and has a public instance hosted by Université Paris-Saclay. It
improves on usability, while also offering new features aimed at ongoing benchmarks.
Hugging Face, an ML stalwart, launched the first version of its competitions product.
Humyn.ai hosts competitions as well as facilitating deeper engagements between
businesses and its user-base of data scientists.
Other platforms
For readability and relevance, the table above only includes platforms with multiple competitions. Decisions around
what exactly constitutes a platform are somewhat subjective, and we remain open to changing this in future.
Competitions in the “Other” bucket in 2023 include:
Onward is the new name for the platform formerly known as
Xeek/Studio X, owned by Shell, mainly hosting competitions relating to climate and energy technology innovation.
Solafune focuses on competitions using satellite and geospatial data.
Trustii focuses on competitions in
healthcare, and publishes winners’ code and solutions on GitHub.
It’s likely there are other relevant platforms which we haven’t covered here, and we’ll do our best to include them in
future once we’re made aware of them.
Academia
Competitions can be a useful research tool, and we found over 100 competitions affiliated with academic conferences in 2023.
NeurIPS 2023 hosted 20 competitions on topics including LLM efficiency, weather forecasting, multi-agent games,
modelling mouse brains, discovering new catalysts, and many others.
This year, NeurIPS introduced a new requirement for competition organisers and participants to publish their post-competition
reports as papers in the following year’s NeurIPS Datasets & Benchmarks track. As well as increasing transparency and reproducibility,
this provides an additional incentive to participants.
NeurIPS Competitions
Robotics: ICRA & IROS
Both ICRA (International Conference on Robotics and Automation) and
IROS
(International Conference on Intelligent Robots
and Systems) have official competition tracks. ICRA 2023 hosted 12 competitions, and IROS 2023 hosted 3 competitions.
Other conferences — including CVPR,
ICCV, and
ICML — also hosted ML competitions in 2023. In general, these conferences
hosted competitions as part of workshops, as opposed to having a separate competitions track. For example, the
Fine-Grained Visual Categorisation (FGVC) workshop at CVPR hosted 7 competitions.
The most popular platform for academic competitions was CodaLab, followed by EvalAI.
Both platforms are open source and free to use, which may explain their popularity with the academic community.
While most competitions used the hosted versions of these platforms, some — such as Sensorium 2023 — chose to run their own instance.
Papers and publications
There were many interesting developments in academia relevant to competitions; we’ll cover only a few here.
A new journal — DMLR, the Journal of Data-centric Machine Learning Research
— launched in 2023, focusing on datasets and benchmarks for ML research, among
other things.
Data Science at the Singularity, a paper published by David Donoho in October
2023, makes the case that recent AI progress isn’t due to a particular method or architecture, but instead due to the
emergence of frictionless reproducibility in data-driven scientific research. In Donoho’s definition, the three key
ingredients of frictionless reproducibility are “data sharing, code sharing, and competitive challenges”.
Prizes & Participation
The $7.8m total prize pool in 2023 was an increase of more than 40% over 2022.
Much of this came from a few competitions with very large prizes — the Vesuvius Challenge’s $1m
prize pool (across 8 different prizes), the M6 Financial Forecasting competition’s $300k
(across 4 separate tracks), and almost $1.5m across DrivenData’s two PETs prizes — but we
saw increases in competitions across all prize pool buckets.
Prize pool
We also found more eligible competitions without prizes than in 2022. Our data set for this report is restricted to
competitions with either meaningful prize money or a conference affiliation1, and we think that some
of the increase in no-prize competitions we found this year is attributable to a methodological improvement.9
This year we manually reviewed CodaLab and EvalAI listings, identifying competitions for CVPR, ICCV, and other conferences, which we may have previously missed.
Monetary prizes are a useful incentive, but anecdotally it appears that often the monetary prizes are secondary to the
kudos and future career value involved in winning — either through gaining medals/climbing up platform rankings,
or presenting research at a conference.
Leaderboard entries
Participation is highly variable; we found 46 competitions with fewer than 10 teams entering, and 24 with over 1,000 teams.
These numbers don’t necessarily reflect the quality or effort involved by participants: some of the academic
competitions are targeted at a specific niche, and allow leading researchers at different labs to compare their methods
on a level playing field.10
Winning Solutions
In this section we first review the most popular programming languages and packages among competition winners, before going into more detail on approaches for specific types of competitions. This year, we focus mostly on time-series and NLP competitions.
See our 2022 report for more detail on tabular data and computer vision competitions.
We analysed winning solutions by reviewing public write-ups and source code, as well as getting information
directly from the winning teams where we were able to establish contact.
Some of the data was gathered from winners using a structured questionnaire, allowing for systematic comparison. In
addition to this, we interviewed some competition winners to understand their solutions in more depth,
either in-person at ICRA and NeurIPS, or by email.
Python remains the most popular programming language among winners, with
63 out of 65 winning solutions11 primarily or exclusively using Python. The two exceptions were the winner of
the M6 Competition’s duathlon track, who used R, and the winner of Kaggle’s
Santa 2022 optimisation competition,
who used C++12.
Primary Programming Language
Some winners used bits of C++ and R to supplement their Python.
We didn’t find any uses of Julia or Rust in participants’ own code — though we are starting to see adoption
of more third-party packages written in Rust (see dataframes).
Python Packages
We did a deep dive into Python packages used by winners last year, and many of the top packages used
by winners have remained the same. This year’s top packages are listed below, in descending order of popularity within
each category. 13
Core
numpy arrays
pandas dataframes
scipy optimisation and other fundamentals
matplotlib low-level plotting
seaborn higher-level plotting
NLP
transformers tools for pre-trained models
nltk swiss army knife for NLP
peft parameter-efficient fine-tuning
Vision
opencv-python core vision algorithms
Pillow core vision algorithms
torchvision core vision algorithms
albumentations image augmentations
timm pre-trained models
scikit-image core vision algorithms
segmentation-models-pytorch segmentation
Modeling
scikit-learn models, transforms, metrics
deep learning: torch, pytorch-lightning (layer on top of torch), tensorflow
gradient-boosted trees: lightgbm, xgboost, catboost
Other
tqdm progress bar
wandb experiment tracking
psutil system tools
joblib parallelisation
loguru logs
ray distributed training
numba jit compilation
optuna hyperparameter optimisation
Deep Learning
PyTorch remains by far the most popular deep learning library, in a two-horse race — out of 53 solutions using deep
learning, 47 primarily used PyTorch, 4 used TensorFlow, and 2 used both PyTorch and TensorFlow.
Deep Learning: PyTorch vs TensorFlow
All solutions using TensorFlow made use of the high-level Keras API.
While the majority of PyTorch users still use PyTorch directly, we found 9 winning solutions using
PyTorch Lightning14
(up from 3 last year). We didn’t find any winning solutions using JAX.
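For readers unfamiliar with it, the sketch below shows the basic PyTorch Lightning pattern that these solutions build on: a LightningModule holding the model, loss, and optimiser, driven by a Trainer. The toy model and random data are purely illustrative, not taken from any winning solution.

```python
# A minimal PyTorch Lightning sketch (illustrative model and data only).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyRegressor(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Random tensors stand in for a real competition dataset.
ds = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
trainer = pl.Trainer(max_epochs=2, accelerator="auto", logger=False)
trainer.fit(TinyRegressor(), DataLoader(ds, batch_size=32))
```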
LLMs
Given their rise in general usage, it’s no surprise that large language models (LLMs) are proving useful in competitive
ML too. However, they are far from ubiquitous among winning approaches.
From our findings, most winners either explicitly said they didn’t use LLMs in any way, or didn’t mention any use of LLMs.
The most popular use of LLMs we found is for code completion, with 30% of winners who completed our
questionnaire explicitly mentioning that they used LLMs for code completion. This is in line with JetBrains’ findings15
that around a quarter of developers frequently use AI assistants for generating code. Given the prevalence of these
tools for code completion, it’s likely that a similar proportion of winners are using them without feeling the need
to mention it in their write-ups — just as they wouldn’t necessarily mention the IDE or OS they’re using.
Some competitions explicitly invited the use of LLMs: Signate’s ChatGPT Challenge
asked participants to identify successful and unsuccessful uses of ChatGPT in business scenarios. DrivenData’s
AI Research Assistants for NASA
was also very well-suited to LLMs, and virtually all successful submissions incorporated some degree of prompt engineering.
Aside from code completion, LLMs were used in a few other ways by competition winners.
One of the winners who completed our questionnaire mentioned using LLMs for idea generation, and quickly getting up
to speed in a new field.
We found two uses of LLMs for generating synthetic data among winners, as described in more detail in the
synthetic data section.
The winner of DrivenData’s Unsupervised Wisdom
competition used LLMs for data extraction.
One of the joint winners of ARCathon 2023 made use of LLMs for abstract reasoning, as mentioned in the ARCathon section.
Lastly, but perhaps most interestingly, one winning team put a classification head on some pre-trained LLMs, alongside
significant use of RAG, allowing them to win Kaggle’s LLM Science Exam competition — more on this in the
NLP foundation models section.
Computer Vision
In our 2022 report
we reviewed how, unlike in NLP, leading computer vision models still largely hadn’t converged on a single architecture.
Things looked similar throughout 2023: both CNNs (convolutional neural networks) and Transformers were used for vision,
and most competition winners still used CNN-based architectures for computer vision competitions.
This lack of architectural convergence is backed up by research — a 2023 paper compared NFNets (a
CNN-based architecture) against Vision Transformers, and found that “NFNets match the reported performance of Vision
Transformers with comparable compute budgets.”16
Computer Vision Architectures
In the solutions we managed to find, ConvNeXt and U-Net (CNN) architectures were most popular. Swin-Transformers were
the most commonly-used Transformer-based models, and both convolutional layers and Transformer modules were used in
custom architectures as well. The choice of architecture is in part determined by the particular task — for example,
YOLO models for object detection and segmentation.
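As an illustration of how these model families are typically instantiated, here is a short sketch using timm and segmentation-models-pytorch (both appear in the packages list above). The specific model names and settings are assumptions for the example, not those of any particular winning solution.

```python
# Illustrative instantiation of the popular architecture families via timm and
# segmentation-models-pytorch; model names and class counts are placeholders.
import timm
import segmentation_models_pytorch as smp

# ConvNeXt backbone for a 10-class classification task.
classifier = timm.create_model("convnext_base", pretrained=True, num_classes=10)

# Swin Transformer backbone, same interface.
swin = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=10)

# U-Net with a pre-trained encoder for binary segmentation.
unet = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet", classes=1)
```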
Model Families
The winner of DrivenData’s Tick Tick Bloom
competition, where participants used satellite imagery to detect specific types of bacteria in bodies of water,
showed that deep learning isn’t necessarily the right tool for all computer vision problems.
Their solution used a combination of k-nearest-neighbours and a LightGBM model, with features including climate data and the
colour of the water. They noted “I tried using a CNN model with satellite images. Unfortunately, this type of model
resulted in very high RMSEs. After some analysis, I suspect that the quality and resolution of the satellite images, as
well as the accuracy of the positions, made it very difficult to fit the CNN model well.”
There were many vision-related competitions from top conferences (ICCV and CVPR) for which we did not
manage to gather data on winning solutions, and their inclusion might have given a different view of the leading
edge. We aim to gather more data from these conferences for the 2024 report.
NLP
Foundation Models
Last year we noted that almost all winning NLP solutions used
versions of Microsoft Research’s DeBERTa model — usually deberta-v3-large.
Since the end of 2022, the firehose of public-weights LLM releases has continued to deliver newer, better, and more
efficient GPT-style generative foundation models that can be fine-tuned for specific tasks — including the Llama,
Falcon, Mistral, Phi, QWEN, OLMo and Gemma series of models.
As expected, these have been adopted by the competitive ML community and have at times outperformed solutions based on
the incumbent DeBERTa models. They are also being used alongside these models — for example, some competition
winners have used generative LLMs to generate synthetic data, which a DeBERTa model is then trained on.
In other situations, generative models have directly replaced these older models in the inference pipeline —
most notably Kaggle’s LLM Science Exam competition.
The winners of that competition used an ensemble of five 7B-parameter LLMs and one 13B-parameter LLM, which they fine-tuned, alongside
Retrieval-Augmented Generation on a chunked local version of Wikipedia. They used an
interesting approach where they added a binary classification head onto the end of several pre-trained LLMs before training them.
The inference pipeline came in at just under the 9-hour runtime requirement, albeit with a staggering 2.5TB of input
data!17
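As a rough sketch of the "classification head on a pre-trained LLM" idea, the snippet below loads a generative backbone with a sequence-classification head via Hugging Face transformers. The model name, label count, and input format are assumptions for the example; the actual winning pipeline (ensembling, retrieval over Wikipedia, fine-tuning) was considerably more involved.

```python
# Sketch: attach a binary classification head to a pre-trained causal LLM.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "mistralai/Mistral-7B-v0.1"  # assumed 7B-parameter backbone, not the winners' exact model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,        # binary head: is this answer option correct?
    torch_dtype="auto",
)
# Causal LLM tokenizers often lack a pad token; reuse EOS so batching works.
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

# Score one (retrieved context + question, answer option) pair.
inputs = tokenizer("retrieved context + question [SEP] answer option", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 2); a higher class-1 logit means "more likely correct"
```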
Kaggle’s Bengali Speech Recognition competition and Zindi’s Swahili ASR competition both involved speech-to-text (automatic speech recognition) for lower-resource18 languages (Bengali and Swahili).
The winners of both competitions used OpenAI’s Whisper model,
either exclusively or as part of an ensemble. For the Bengali competition, the winners trained a custom tokeniser19,
specifically on Bengali text, to avoid the inefficient character-level encodings the Whisper tokeniser would otherwise
use for Bengali text.
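A minimal sketch of training a language-specific tokeniser with the Hugging Face tokenizers library is shown below. The corpus file, vocabulary size, and special tokens are illustrative assumptions rather than details of the winning solution.

```python
# Sketch: train a byte-level BPE tokeniser on target-language text.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["bengali_corpus.txt"],   # hypothetical text corpus in the target language
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<|startoftranscript|>", "<|endoftext|>"],
)
tokenizer.save("bengali_tokenizer.json")  # serialised tokeniser for later use
```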
Synthetic data
The winners of Kaggle’s CommonLit
competition fine-tuned deberta-v3-large on a custom dataset combining Kaggle’s training data with a larger synthetic
dataset they generated using a combination of ChatGPT and other LLMs. 20
The team jumped from 49th place on the public leaderboard to 1st on the final private leaderboard. It appears that their focus on generating a diverse training dataset paid off particularly well because of the significant distribution shift between public and private test data.21
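The snippet below is a simplified sketch of this kind of LLM-based synthetic data generation, using the OpenAI Python client. The prompt, model name, and sampling settings are placeholders; the winners combined several LLMs and substantially more prompt engineering.

```python
# Sketch: generate diverse synthetic training samples with an LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Write a short summary of the following passage, in the style of a school "
    "student. Vary vocabulary and quality between samples.\n\nPassage: ..."
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,   # higher temperature gives more diverse synthetic samples
    n=5,               # several candidate summaries per passage
)
synthetic_summaries = [choice.message.content for choice in response.choices]
```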
The winners of the LLM Science Exam
competition also made use of LLM-generated synthetic data, though they were able to use synthetic data that had already
been shared on Kaggle by others.
Should synthetic data be considered “publicly available”?
Kaggle’s competition rules for the CommonLit competition state that competitors can use external data only if it’s
“publicly available and equally accessible to use by all participants … at no cost”22. There was some
debate in the competition discussion forums as to the nature of synthetic data. If an LLM’s weights are publicly
available, or the LLM can be accessed online for free, should synthetic data generated by it be considered fair game for
future competitions?
Adapters and quantisation
The resource constraints imposed by competitions (explicit limits for inference, and budget constraints for training) mean that naively fine-tuning LLMs is not usually practical.
A plethora of parameter-efficient fine-tuning (PEFT) methods have emerged.
One of the most popular approaches is LoRA,
which was used in the winning LLM Science Exam solution, as well as the A100 track of the
NeurIPS 2023 LLM Efficiency Challenge.
Building on LoRA, QLoRA quantises the frozen pre-trained parameters using a novel
4-bit NormalFloat data-type, to further reduce memory requirements with minimal performance trade-offs. This allows
GPUs with as little as 16GB of memory to fine-tune 7B parameter models.
QLoRA was used to win the RTX 4090 track of the NeurIPS 2023 LLM Efficiency Challenge.
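A minimal sketch of the QLoRA recipe, loading a frozen base model quantised to 4-bit NF4 and then training low-rank adapters with peft, looks roughly like the following. The base model and hyperparameters are illustrative defaults, not those of any specific winning solution.

```python
# Sketch: QLoRA-style setup with transformers, bitsandbytes, and peft.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # the NormalFloat data type from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",             # assumed 7B base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections are a common choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()            # typically well under 1% of total parameters
```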
Timeseries Forecasting
Despite all incorporating time-series elements, time-series forecasting competitions are highly heterogeneous.
This section presents a brief overview of the winning solutions we found, with links to additional materials where available.
We also comment on making manual adjustments to systematic forecasts.
Winners used a mixture of approaches, including statistical models like ARIMA, gradient-boosted tree methods (XGBoost), bayesian factor models, and deep learning methods (fully-connected deep neural nets, LSTMs, and convolutional networks — the latter for exploiting 2D structure in satellite data).
Additionally, DrivenData’s Water Supply Forecast Rodeo and Kaggle’s Trading at the Close competition both started in 2023 but won’t wrap up until later in 2024 — we’ll look to cover them in next year’s report.
CPI Nowcasting
In Zindi’s RMB CPI Nowcasting
competition, participants had to forecast South Africa’s consumer price index, a proxy for inflation, on a monthly basis.
The overall winner of this competition used a combination of ARIMA and XGBoost for their model, and Optuna for hyperparameter optimisation. We also heard from the winners of two of the monthly mini-challenges in this competition, both of whom also used ARIMA-based models.
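As an illustration, here is a hedged sketch of one way ARIMA, XGBoost, and Optuna can be combined: Optuna tunes the ARIMA order on a validation window, and XGBoost models the residuals using lag features. This is a plausible hybrid under stated assumptions, not the winner's actual pipeline.

```python
# Sketch: Optuna-tuned ARIMA plus an XGBoost model on the residuals.
import numpy as np
import optuna
import xgboost as xgb
from statsmodels.tsa.arima.model import ARIMA

y = np.random.default_rng(0).normal(size=120).cumsum()  # stand-in for a monthly CPI series

def objective(trial):
    order = (trial.suggest_int("p", 0, 3), trial.suggest_int("d", 0, 2), trial.suggest_int("q", 0, 3))
    try:
        fit = ARIMA(y[:-12], order=order).fit()
    except Exception:
        return float("inf")
    preds = fit.forecast(steps=12)
    return float(np.mean((preds - y[-12:]) ** 2))        # validation MSE on the last 12 points

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)

best_order = (study.best_params["p"], study.best_params["d"], study.best_params["q"])
best_fit = ARIMA(y, order=best_order).fit()
residuals = y - best_fit.predict(start=0, end=len(y) - 1)

# XGBoost models what ARIMA misses, using simple lag features of the residuals.
lags = 3
X = np.column_stack([residuals[i:len(residuals) - lags + i] for i in range(lags)])
booster = xgb.XGBRegressor(n_estimators=200, max_depth=3).fit(X, residuals[lags:])

next_point = float(best_fit.forecast(steps=1)[0]) + float(booster.predict(residuals[-lags:].reshape(1, -1))[0])
```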
Energy Forecasting
In AIcrowd’s CityLearn Challenge,
one of the official NeurIPS 2023 competitions, the forecast track asked participants to predict building loads, grid
carbon intensity, and solar generation up to 48 hours ahead of time, for each building in a synthetic single-family
neighbourhood.
The winner of this competition used an online forecasting approach, with a model based on a simple feed-forward deep
neural net, and CMA-ES, an evolutionary algorithm, for hyperparameter
optimisation.
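For reference, a tiny sketch of hyperparameter search with CMA-ES via the cma package is shown below; the two-parameter objective is a stand-in for a real validation-loss function, not part of the winning solution.

```python
# Sketch: CMA-ES hyperparameter search with the cma package's ask/tell loop.
import cma

def validation_loss(params):
    learning_rate, hidden_scale = params
    # Placeholder objective; in practice this would train and evaluate the forecaster.
    return (learning_rate - 0.01) ** 2 + (hidden_scale - 2.0) ** 2

es = cma.CMAEvolutionStrategy([0.1, 1.0], 0.5)   # initial guess and step size
while not es.stop():
    candidates = es.ask()
    es.tell(candidates, [validation_loss(c) for c in candidates])
best_params = es.result.xbest
```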
Weather Forecasting
Weather4cast 2023, another official NeurIPS competition, involved precipitation prediction over three tracks:
Core (8h prediction, 7 previously seen regions), Nowcasting (4h prediction, 7 previously seen regions), and Transfer
Learning (4h prediction, including an unseen year and 3 previously unseen regions).
Much of the difficulty in this task comes from the sparsity of rainfall events, significant variation in the magnitudes
of these events when they do happen, and the use of a regression loss, requiring models to be accurate about the amount
of rainfall as well as its likelihood.
A team from Alibaba Cloud won both the Core and Nowcasting tracks, and came second in the Transfer Learning track.
The Transfer Learning track was won by a team from Nanjing University, who came second in Nowcasting and third in the Core track.
The team from Alibaba Cloud innovated on WeatherFusionNet, the approach used by the winners of Weather4cast 2022.
WeatherFusionNet uses a combination of U-Nets and a (recurrent, convolutional) PhyDNet. The Alibaba Cloud team added a
ConvLSTM module and built an ensemble of these learners.
The Nanjing University team also used a U-Net as the base of their model, along with temporal frame interpolation, a
multi-level Dice loss, and thoughtful cropping and augmentation.
Papers describing the winning strategies in more detail are linked to from the competition website.
Financial Forecasting
The M6 Competition involved predicting returns of financial instruments,
and constructing a portfolio to maximise risk-adjusted returns. Participants could use any mix of automated or manual judgement-based strategies,
and submitted predictions and allocations rather than code. We describe this competition and the previous M competitions
in much more detail in the separate notable competitions section.
The winner of the M6 forecasting track has a description of his approach on his
blog, and some slides on the
MOFC website. He used a Bayesian approach, fitting a factor model
using Markov Chain Monte Carlo with the PyMC library.
The Bayesian approach turned out to be particularly useful when DRE, one of the 100 instruments in the M6 competition,
was de-listed in October following its acquisition. When the competition organisers set “future returns” of DRE to 0,
all that was
required was to “set DRE’s return to zero in each of the 4000 samples before taking the quantile probabilities”.
This is one example where a competition includes some of the complexity and messiness of the real world that is
absent in pure supervised learning problems on clean datasets.
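The quoted adjustment is simple to picture in code. Below is a small numpy illustration; the array shapes, asset names, and the quintile bucketing are assumptions made for the example, not details of the winner's implementation.

```python
# Sketch: zero out a de-listed asset's return in posterior samples, then derive
# rank-bucket probabilities from the adjusted samples.
import numpy as np

rng = np.random.default_rng(0)
assets = [f"ASSET_{i:02d}" for i in range(99)] + ["DRE"]
samples = rng.normal(0.0, 0.05, size=(4000, 100))   # stand-in for 4000 posterior return samples

# DRE was de-listed, and the organisers fixed its "future returns" to 0.
samples[:, assets.index("DRE")] = 0.0

# One way to turn samples into rank probabilities: rank assets within each draw,
# bucket ranks into quintiles, and average over draws.
ranks = samples.argsort(axis=1).argsort(axis=1)      # 0 = lowest return in each draw
quintiles = ranks // 20                              # 100 assets -> 5 buckets of 20
rank_probs = np.stack([(quintiles == q).mean(axis=0) for q in range(5)], axis=1)  # shape (100, 5)
```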
The winner of the investment decisions track generated forecasts using his own AutoTS library,
an application of AutoML to time-series data, automatically building ensembles of many different types of models. In a
pre-print paper shared with us, he highlighted the importance of optimising separately for the ranked-probability score
(RPS) and expected investment returns used in the two tracks of the competition. In his portfolio construction, he also
considered a level of meta-uncertainty — generating allocations that were robust over a set of forecasts made
using different hyperparameters. More detail on his personal blog
and in his presentation slides.
The overall duathlon winner used a meta-learning approach for his forecasting solution. The meta-model,
implemented in R using torch, is trained to “identify the most appropriate parametric model for a given family of related
prediction tasks”. For the investment decisions track, he used a rank optimisation method, explicitly aiming to maximise
the probability of ranking highly in the competition — more on this further down in the
design and strategy section. The code for this solution, as well as links to papers describing the approach,
can be found in the winner’s GitHub repository.
Manual adjustments
A recurring theme in some of the top forecasting solutions was the temptation to supplement automated forecasts with subjective
opinion-based judgements. Although the M6 pre-print paper23 states that “judgment-informed forecasting approaches
(albeit, very few) perform on par (if not better) compared to pure data-driven approaches”, the winners of both the M6
forecasting track and M6 decisions track had cautionary anecdotes about their forays into overriding automated forecasts.
Tabular Data
Dataframes
When it comes to Python DataFrames, Pandas still dominates… but there are signs of change.
This is the first year we found any competition winners using Polars — the
dataframe library written in Rust, created in 2020.
Polars: advantages and disadvantages
The main benefit of Polars is its improved performance — it’s significantly faster than Pandas for many operations,
partly due to its ability to parallelise work across multiple CPU cores.
The main downside is that the majority of community code and examples are in Pandas, and Polars uses a different API.
This is a double-edged sword, since Pandas’ API has the burden of 15 years of legacy code to support, and Polars was
able to design a new API without those constraints.
The approach that many seem to be taking is to introduce Polars incrementally into an existing codebase.
Polars dataframes use Arrow Tables as their back-end, and since Pandas 2.0 added the option to also use Arrow Tables instead
of the standard NumPy back-end24, it’s possible to convert between Pandas and Polars dataframes without
copying any data. This means that with some profiling, users can convert the slowest-running operations to be run in
Polars instead of Pandas, without necessarily changing their entire codebase to use Polars.25
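A minimal sketch of this incremental-adoption pattern, with an illustrative dataframe and aggregation, might look like this:

```python
# Sketch: keep a Pandas workflow, but hand one slow aggregation off to Polars.
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"user_id": [1, 1, 2, 2, 2], "spend": [3.0, 1.0, 4.0, 1.0, 5.0]})

# Convert to Polars (cheap when the Pandas frame is Arrow-backed), aggregate, convert back.
pl_df = pl.from_pandas(pdf)
agg = pl_df.group_by("user_id").agg(pl.col("spend").sum().alias("total_spend"))  # "groupby" in older Polars releases
result = agg.to_pandas()
```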
One caveat with this approach is that scikit-learn and other Python machine-learning libraries tend to have great native
support for NumPy-backed Pandas, but don’t yet support Polars or other dataframes using Arrow as a backend.
All three of the winning solutions we found using Polars also made use of Pandas to some extent. One of the teams mentioned that they “made use of Polars because of the CPU constraints and to simply learn it.”
The Amazon KDD Cup 2023 team built an ensemble using components built separately by each of their five members. As well
as Pandas and Polars, they also made use of NVIDIA’s RAPIDS
GPU-accelerated tools such as cuDF and cuML, and NVIDIA’s
Merlin recommender system library. This is perhaps unsurprising, since the team was made up of members of NVIDIA’s Merlin team and three of their resident NVIDIA “KGMON” Kaggle Grandmasters (Chris Deotte, Jean-Francois Puget, and Kazuki Onodera). They made an impressive showing at the KDD Cup in 2023,
winning all three tracks.
Polars’ growth is part of two wider trends:
Rust is increasingly being used to build high-performance tools for Python — with the now widely-adopted
Ruff linter being a great example of this.
The Apache Arrow project has been working on its cross-language columnar
data format since 2016. With buy-in and contributions from Wes McKinney (the creator of Pandas) and Hadley Wickham (the developer of R’s popular tidyverse packages), Apache Arrow has had ambitious goals since its inception.
Aside from the KDD Cup team mentioned above, we didn’t find any uses of other dataframes such as Dask, cuDF/cuML, VAEX,
or Modin. 60 of the winning solutions we found used NumPy,
and 56 used Pandas, making these the two most commonly
imported packages across all the winning solutions we found.
Gradient Boosted Trees
We wrote extensively about gradient-boosted decision trees (GBDTs) in
our 2022 report,
and in our separate piece on tabular data.
Not much changed in 2023. GBDTs are still commonly used in winning solutions, and often in ensembles together with neural-net models.
All three of the main GBDT libraries are commonly used, and there’s not much difference in popularity between them.
LightGBM remains the most popular, with 10 winning solutions using it.
XGBoost was second, with 8 winning solutions, and we found 7 winning solutions using CatBoost.
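As a small illustration of the common pattern of blending the three libraries, here is a sketch with synthetic data and an equal-weight average; the hyperparameters and blend weights are arbitrary choices for the example.

```python
# Sketch: fit the three main GBDT libraries and blend their predictions.
import numpy as np
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
X_test = rng.normal(size=(100, 10))

models = [
    LGBMRegressor(n_estimators=300, learning_rate=0.05),
    XGBRegressor(n_estimators=300, learning_rate=0.05),
    CatBoostRegressor(iterations=300, learning_rate=0.05, verbose=0),
]
# Equal-weight blend; in practice weights are often tuned on a validation fold.
preds = np.mean([m.fit(X, y).predict(X_test) for m in models], axis=0)
```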
Robotics
This year we interviewed organisers and participants of several robotics competitions — including those at
NeurIPS and ICRA.
In general, robotics brings many additional complications that aren’t present in standard ML competitions. The
F1Tenth autonomous racing competition showed the value of experience, with the
winning teams having competed several times before and improved with each iteration. The physical environment of the competition hall, with its added visual and audio noise, wireless connection interference, sometimes uneven floor surfaces, and variable lighting, can introduce issues not experienced in a more controlled lab setting.
Robotics competitions are even more heterogeneous than software-only competitions. Across both ICRA and NeurIPS,
a variety of techniques were competitive — including traditional optimal control
techniques, and both model-based and model-free reinforcement learning.
Compute and Hardware
In this section we analyse the compute resources used by competition winners.
Throughout 2023, top research labs were competing for capacity on NVIDIA’s latest generation of data-centre GPUs to
train large general-purpose foundation models, as training FLOPs for cutting-edge AI systems continue to grow exponentially.26
Competitions, on the other hand, tend to focus on more narrowly-scoped applications, and participants can benefit from those
foundation models by fine-tuning them for a specific task on relatively small amounts of data. This has been the case for
a number of years already, with BERT-like models for NLP and various pre-trained vision models.
Still, modern ML competitions are comparatively compute-hungry. We found that over
70% of winners used a GPU to train their model, and many of these used powerful data-centre GPUs rather than cheaper consumer-oriented ones.
Most competitions aim to attract many participants, including those with more limited compute resources.
The limited-resource
nature of competitions can also make their findings more generally useful — such as the
LLM Efficiency challenge, evaluating LLM fine-tuning on consumer hardware.
Where the nature of a task makes it difficult to do research in a compute-constrained environment, some competition organisers provide compute grants to participants27.
Hardware Used by Winners
We continue to see a diverse mix of hardware types and ownership models. Solutions for tabular data or time-series
forecasting such as ARIMA, gradient-boosted decision trees, or simpler ML methods like linear regression, are often just trained on
CPU-only hardware (from relatively budget desktop processors to the 64-core Threadripper 3990X).
Deep learning models, such as those used for computer vision and NLP tasks, tend to benefit more from special-purpose
hardware, and unsurprisingly we found almost all winners for these types of competitions making use of GPUs or TPUs.
This was the first year we found winners using TPUs for training their solutions: one through a Kaggle Notebook, and the
other through a paid Google Colab subscription. Others also used Kaggle Notebooks or Colab for accessing GPUs.
Several winners mentioned experimenting on local consumer GPUs while developing their models, and using more powerful
cloud-based GPUs for their final training runs.
Once again, all GPU models used by winners were NVIDIA GPUs. While AMD’s offering has been improving, especially for
inference28, adoption is still lagging.
We didn’t find any examples of competitors using accelerators other than CPUs, TPUs, or NVIDIA GPUs for training their
models.
Accelerator Models
The most popular accelerator used by winners continues to be the A100,
just like in 2022. For all but one of the winning
solutions using A100s we were able to confirm that they were using the older 40GB model.
Second-most popular is the P100 model, which is available for free in Kaggle Notebooks. 29
Third-most popular, the RTX 3090 is the top-end retail/gaming GPU from the previous generation (the current-generation
equivalent being the RTX 4090, which was used in one winning solution).
The main notable absence is the H100, the successor to the A100. This has been the chip of choice for companies training
foundation models — with Meta planning to own over 350,000 H100s by the end of 2024. 30
The popularity of the H100 for foundation models likely impacted its availability, as capacity throughout much of
2023 was limited.
Train and fine-tune ML models with fully-dedicated Cloud GPUs
With an intuitive dashboard and pre-installed AI software, Latitude.sh gets you from zero to training in just a few clicks.
Effortlessly train, fine-tune and scale inference on enterprise-grade hardware designed for ML engineers: NVIDIA’s H100, A100 and L40S series available from clusters to container GPUs.
In addition to specific hardware models used, dataset size and training time are useful proxies for the amount of
compute required for competitions.
As was the case last year, training dataset
sizes span at least five orders of magnitude. The competition in 2023 with the most training data was the
Vesuvius Challenge, with almost 8TB of data that was publicly released.
The dataset size we track is just that of the data provided by competition organisers — many competitions allow
participants to also train on data from a selected set of sources, or even any publicly available data.
Training Time
Training time also varies significantly. In a surprising number of competitions, simple models trained in under an hour
(in some cases on consumer hardware) can win. This seems to be the case particularly in forecasting challenges.
In many other competitions, expensive hardware seems to provide a significant advantage.
Cost and accessibility
The use of code competitions is a step towards standardising resources available to participants. In a code competition,
participants submit inference code which runs in a limited-resource environment on the competition platform, rather than
directly submitting predictions.
Still, access to additional resources at train-time can make it easier to iterate more quickly, run more experiments,
and use more compute-intensive methods, which can present an advantage.
In some competitions, winners trained on setups costing thousands or tens of thousands of dollars,
including one training server with 8 RTX A6000s, and an NVIDIA DGX-1 with 8 V100s.
The
NVIDIA KGMON
team who won the KDD Cup 2023 used 8 V100 GPUs for at least six full days, which would cost over $1,000 to run on on-demand
cloud compute, just for the final training run.31
The winner of DrivenData’s BioMassters
competition used 2 A100s and the final training run took 8 days, which would cost over $500 using on-demand cloud compute.32
The winner of Kaggle’s Bengali.AI Speech Recognition
competition had access to 8 RTX A6000 GPUs, which would cost around $8/h in on-demand cloud compute, or around $40k to own33.
Team Demographics
This year, more than half the winning teams we found were made up of just a single individual.
Winning Team Sizes
While forming a larger team can be helpful — such as in the case of the NVIDIA KGMON team who split up
their pipeline into separate components, which they worked on independently — dedicated individuals can clearly compete successfully without teammates.
Kirill Brodt, who won DrivenData’s BioMassters competition
and has won prizes in several other previous competitions, shared with us that in five previous competitions where he
competed solo and came either first or second, he mostly spent between 10 and 40 hours working on his solution.34
Repeat Winners
As also seen in last year’s report, more than half
of the winning teams are first-time winners.
At the other end of the spectrum, serial winners (2+ previous wins) made up roughly a third of winners in 2023. These
include both corporate and independent teams and individuals.
Design and Strategy
As much as competitions aim to be realistic environments for testing models and algorithms,
the practicalities of running a competition and having pre-defined evaluation metrics can create opportunities for participants to
outperform by adopting strategies targeted at winning the competition, which don’t always translate to a real-world
environment.
We detail some examples of these below — including leaderboard probing, explicitly optimising for win probability
rather than the competition metric, and building on winning solutions from previous years.
Optimising to win
Filip Staněk, the winner of the M6 Duathlon track, wrote a note describing his strategy for the investment decisions
part of the competition, where he optimised explicitly for probability of winning the competition rather than targeting
the best expected investment returns. This is an inevitable result of the competition environment, where prizes are
given only to top-ranked participants, creating an asymmetric risk profile (large positive returns are very good, and
large negative returns are only as bad as mediocre returns in the competition setting).
Building on previous solutions
There’s an element of “work smarter, not harder” in some winning solutions, which isn’t always present in academic ML
research.
We noted last year
that the winner of Kaggle’s March Madness competition for predicting college basketball results used the exact code
shared by the winner of a similar competition in 2018. Unbelievably, this happened again this year (albeit with a Python version of the code, rather than the R version used last year, and some minor tweaks).
Leaderboard Probing
In Kaggle’s GoDaddy Microbusiness Density Forecasting
competition, participants had to make monthly predictions on noisy U.S. county-level data. Participants were given
historical data, and the public leaderboard contained slightly more recent data. The
ultimate winner used a linear
regression model (having ruled out various other models), and got an edge from probing the public leaderboard to get
more recent information about a few of the most volatile counties in the dataset — which helped them make better
forecasts for those counties on later data.
But leaderboard probing doesn’t always work out!
In Kaggle’s Identifying Age-Related Conditions
competition, performance on the public leaderboard did not transfer to the final private leaderboard score — with
the eventual winner jumping 2,089 places from public to private leaderboard. Grandmaster Chris Deotte noted that in his submission, “using probed positive targets from public LB actually hurt my private LB score”.
Some competitors identified this as a possibility early on in the competition35, showing the value in
paying attention to the discussion forums.
Notable Competitions
In this summary we highlight three interesting competitions:
The Vesuvius Challenge, a $1m+ effort to read carbonised scrolls buried by the eruption of Mount Vesuvius almost two thousand years ago.
The M6 Challenge, the sixth in a series of rigorous forecasting challenges that has been running for decades.
The ARCathon, an ongoing competition focused on exactly what ML approaches tend to be bad at: common sense.
Vesuvius Challenge
Nat Friedman announced the Vesuvius Challenge in March 2023, with $250k in funding, to build on prior work by
Dr. Brent Seales. Within a week of the announcement,
several other private backers stepped forward to bring the total prize pool to $1m.
The challenge: decipher text hidden in papyrus scrolls which were carbonised when Mount Vesuvius erupted in 79 AD — almost two thousand years ago.
Over the past few centuries, many of these scrolls — the Herculaneum Papyri — were destroyed as a result
of attempts to open them. Their fragile carbonised state makes them almost impossible to open. One Italian monk spent
several decades painstakingly unrolling a few of them, which were found to contain Greek philosophical texts.
The Vesuvius Challenge launched with 5.5TB of data from non-invasive X-ray tomography scans of two of these scrolls,
as well as more
than 2.4TB of scanned fragments of other scrolls. The challenge consisted of multiple smaller prizes followed by a
$700,000 grand prize for anyone who could reveal
at least four passages within the scrolls. Kaggle hosted the $100k
Ink Detection Prize. There were
also several sub-prizes for building tools, fostering a huge global collaborative effort.
Following success in the ink detection prize, the first letters on these scrolls were identified in October by
Luke Farritor and Youssef Nader,
building on prior work by Casey Handmer.
The deadline for the grand prize was the 31st of December 2023, and in February 2024, following reviews by technical and
papyrological teams, the grand prize was awarded to Youssef Nader,
Luke Farritor, and Julian Schilliger. After each individually being successful at previous components of the Vesuvius
Challenge, they teamed up and combined their approaches. Their winning solution is now public on GitHub.
This is a momentous achievement: the Vesuvius Challenge has managed to unroll and read around 5% of one of the
scrolls. Some transcriptions and translations of parts of the discovered text are already available on the
Vesuvius Challenge website.
And they’re not stopping here. According to their recently-published Master Plan,
this is the end of stage one. Stage two will be to improve automation, and read 90% of the four scrolls that have been
scanned. That will form the basis of the 2024 Grand Prize.
In stage three, they expect to spend 2-3 years scanning and reading all (~800) excavated scrolls.
After that, stage four would be to restart excavation of the Villa dei Papiri, hoping to find a larger library in a
deeper part of the villa. If stages 2 and 3 go well, the research spawned by the uncovered text should provide a
powerful catalyst for further excavation.
The Vesuvius Challenge is a great example of important real-world research being accelerated by competitive ML.
Deciphering the hundreds of excavated scrolls could result in a significant increase in the amount of available
text from antiquity, and the first step towards that has been a success.
M6: Financial Forecasting
The M6 Financial Forecasting Competition is the sixth in a series of forecasting competitions run by the Makridakis Open
Forecasting Center (MOFC), named after its founder Spyros Makridakis. The first such “M Competition” was run in 1982.
The M Competitions are particularly rigorous and extensive, and each iteration tends to focus on a particular aspect of
forecasting.
History of M Competitions
The M Competitions provide a great barometer of the changing state-of-the-art in time-series forecasting methods, and
an interesting shift towards ML methods occurred between the M4 and M5 competitions.
M1
Focus/data: Various: micro, macro, industry, demographic; monthly and yearly.
Leading methods: Exponential smoothing.
Participants: Selected experts.
Key finding: “If the forecasting user can discriminate in his choice of methods depending upon the type of data (yearly, quarterly, monthly), the type of series (macro, micro, etc.) and the time horizon of forecasting, then he or she could do considerably better than using a single method across all situations”
M2
Building on M1, and incorporating private data shared by 4 companies. 29 timeseries; 18 methods.
Focus/data: US macro-economic data, and internal data shared by four companies (mostly sales data).
Leading methods: Exponential smoothing.
Participants: Selected experts.
Findings: “The most striking outcome of the M2-Competition is the good and robust performance of exponential smoothing methods and in particular that of Dampen and Single smoothing”; “the less the randomness of the series the better the relative accuracy of the more sophisticated methods used by the forecasters.”
M3
Findings: “The accuracy of the combination of various methods outperforms, on average, the specific methods being combined and does well in comparison with other methods.”
M4
Many more timeseries and methods than previous competitions. 100,000 timeseries; 61 methods; €27,000 prize pool.
Focus/data: Various frequencies (hourly up to yearly), and diverse data. “The 100,000 time series used in the M4 were selected from a database… that contains 900,000 continuous time series, built from multiple, diverse and publicly accessible sources.” Participants submitted point forecasts as well as prediction intervals.
Leading methods: Meta-learning, and ensembles of statistical and ML methods.
Participants: Open to everyone.
Findings: “all of the top-performing methods, in terms of both [point forecasts] and [prediction intervals], were combinations of mostly statistical methods, with such combinations being more accurate numerically than either pure statistical or pure ML methods”… “While hybrid approaches (combining ML and statistical methods) performed very well, the four submitted pure ML approaches definitely performed below expectations… all [pure] ML methods submitted were less accurate than [a simple baseline] benchmark”
M5
Walmart data, run on Kaggle, evaluating accuracy and uncertainty. 42,000 timeseries; 5,507 methods; $100,000 prize pool.
Focus/data: Retail data (Walmart).
Leading methods: Mostly ensembles of LightGBM models, some ensembles of deep neural nets.
Participants: Open to everyone.
Findings: “M5 was the first competition where all of the top-performing methods were both ‘pure’ ML approaches and better than all statistical benchmarks and their combinations.”
M6
Leading methods: Various: meta-learning, bayesian methods, large ensembles.
Participants: Open to everyone.
Findings: The majority of participants underperformed against simple baselines, but a small minority managed to outperform. There was a weak link between teams’ forecasting accuracy and investment performance: “the teams that submitted the best forecasting submissions did not perform similarly well in terms of investment decisions and vice versa.”
The M4 competition paper36, published in 2018, noted that “all [pure] ML methods submitted were less accurate than [a simple baseline] benchmark”. Only a few years later, both tracks of the M5 competition were won by LightGBM models, and the M5 Accuracy paper37 stated that “M5 was the first competition where all of the top-performing methods were both ‘pure’ ML approaches and better than all statistical benchmarks and their combinations.”
Some of the success of ML methods was attributed to “cross-learning” — training one ML model on multiple time-series
as opposed to training separate instances of a traditional statistical model on each of the given time-series. This was
particularly relevant in the M5 competition, where the time-series were aligned and highly correlated with each other.
Another conclusion that came out of several of the competitions is that the “best” method depends on the frequency of
data, on the metric chosen, and on the type of data — particularly the amount of noise present.
M6 Competition
The M6 Competition, which finished in February 2023, focused on financial time-series and was the first M Competition to
use live evaluation on real, public data. It also had the largest prize pool of any M Competition so far, at $300,000.
While there have been several financial forecasting competitions in recent years (see our
summary of financial forecasting competitions in 2022),
this one is unique in several aspects, including its almost year-long evaluation period — roughly four times as
long as the usual three months in other competitions.
There were two components to the M6 Competition, measured separately:
Forecasting: predict the ranking of the 4-weekly returns of 100 different exchange-traded financial instruments (50 US stocks, 50 international ETFs).
Decision-making: allocate capital across the 100 instruments to maximise future risk-adjusted returns.
This competition required sustained commitment from participants. Of the 163 teams who entered both the forecasting and decision-making tracks, only 26 made full original submissions at all 12 submission points (in the case of a missed submission, the previous submission was carried forward to the current period).
In fact, “all five winners in the forecasting track and four of the five winners in the investment track updated their
submission at every single round, while the same was true for three of the duathlon winners”23 — suggesting that
persistence is a highly desirable quality in a competition like this.
Comparing the teams’ performance against the simple (equal-weighted long allocation) investment benchmark showed just
how hard this problem is: “although the vast majority of the teams (75%) have managed to construct less risky portfolios
than the benchmark,… only 31% have realized higher returns and [information ratios].”
Some teams outperformed: “a small group of teams achieved… an impressive rate of return of about 30%”,
but “the performance of the teams is rather symmetric around the mean in the sense that more than one fourth of the teams
realized losses that exceeded 7%, reaching up to 46%.”
Before running the competition, the MOFC published ten hypotheses, and each of these is reviewed in the
pre-print paper23 analysing competition results (the full M6 issue of the IJF will likely be published later this year).
One interesting hypothesis that was confirmed was that there would be only a weak link between forecasting and decision-making ability.
We corresponded with the #1 ranked participants in each of the forecasting, decision-making, and duathlon tracks to
understand their approaches, and explore those in the
timeseries forecasting solutions section.
The next M competition, M7, may launch later in 2024. Keep an eye on the website for updates.
ARCathon
The Abstraction and Reasoning Corpus (ARC) is a benchmark designed to measure AI skill acquisition. Rather than testing performance on specific pre-defined tasks, it expects algorithms to compete using a few-shot approach — correctly interpreting a task from a few given examples, and applying it to the held-out test example.
The 1,000 ARC tasks were manually created by François Chollet, who also created Keras.
Since their creation in 2019, the training examples have been publicly released, but the test examples have
remained carefully guarded.
The first challenge using ARC tasks was Kaggle’s
Abstraction and Reasoning Challenge
in 2020, where the winner reached 21% success on the test set. Most humans can solve 80% of ARC tasks.
After this, Lab42, a part of Swiss AI company Mindfire, partnered with François to run an ongoing challenge using ARC tasks,
called ARCathon, committing around $1,000 in prize money for every 1%
improvement on the state-of-the-art. As of the end of 2023, the best participant-submitted ARCathon solution achieved a 30%
success rate, and an ensemble of solutions put together by the competition organisers reached 31%.
Interestingly, one of the joint winners of the 2023 ARCathon used LLMs as a core part of their solution. This and other
leading solutions are described on Lab42’s winners page.
Lab42 have a useful ARC Playground tool on their site,
allowing users to attempt existing ARC tasks, as well as a task editor to create new ones for a larger, crowdsourced
set of ARC tasks.
Looking Ahead
Platform Developments
Platforms are continuing to invest in their product offerings as the ML competition space grows.
Across the board, platforms are adding features that enable more effective learning, better knowledge sharing, new types
of competitions, deeper integration with pre-trained models, and other improvements for competition organisers, sponsors,
and participants.
A few highlights:
Kaggle launched their Model Hub in 2023,
and are continuing to add new features to it — since the end of January 2024,
anyone can now share and publish their models on Kaggle Models. They’re also pushing forward in other product areas, and
told us that they’re “planning to expand [their] platform to introduce more novel competition formats”.
Zindi are running a year-long skills partnership with Microsoft, helping users build AI and cloud skills through the platform.
They will also be running several Generative AI challenges this year, of which the
first two
are already live.
Hugging Face released a new version of their competition platform
at the end of January 2024, and have been continuing to add new features since.
Since Codabench went live in 2023, there have been several usability improvements. It is likely that more of the
community and development effort will transition over from CodaLab to Codabench in 2024.
DrivenData have told us that they “anticipate a healthy number of scientific competitions this year, as well as an
interesting selection of social impact competitions. On the technical front, we’ve been continuing work on UI/UX
improvements and doing some brainstorming and experimentation around how open challenges may fit with LLM-related tasks.”
EvalAI are continuing to add features to enhance the competition management experience — including improved GitHub sync,
easy leaderboard migration, and additional monitoring.
Upcoming Competitions
We saw general AI excitement reflected in the competitions space in 2023, but that was just the start.
Many big competitions are launching in the coming year, ranging from a continuation of the Vesuvius Challenge to
challenges in mathematics and in finding and fixing security vulnerabilities.
We expect there to be several interesting reinforcement learning (RL) competitions in 2024.
The teams behind competitions like
Lux AI,
the Neural MMO Challenge,
and the Melting Pot Challenge
are building promising competition environments that manage to capture the complexity of real games while being more
tractable for RL because of their improved computational efficiency and ease of parallelisation.
Vesuvius Challenge 2024
As mentioned in the Vesuvius Challenge section, for stage two the
$100k 2024 Grand Prize
will go to the first team able to read 90% of the four scanned scrolls. There are a number of other progress prizes,
with a total of over $500k in prize money available in 2024.
There will also be a celebration of the success of the Vesuvius Challenge 2023 at the Getty Villa Museum in LA on
the 16th of March at 4pm — register here.
AI Mathematical Olympiad
The $10m Artificial Intelligence Mathematical Olympiad (AIMO) Prize, run by the algorithmic trading firm XTX Markets, intends to
spur the open development of AI models that can reason mathematically, leading to the creation of a publicly-shared AI
model capable of winning a gold medal in the International Mathematical Olympiad (IMO).
The grand prize of $5m will be awarded to the first publicly-shared model that can perform at a standard equivalent to
an IMO gold medal, and an additional $5m has been set aside for a series of progress prizes.
This seems to be an area where recent developments in generative models will prove to be highly valuable —
soon after the competition was announced, DeepMind published a paper in Nature announcing AlphaGeometry,
“a neuro-symbolic system… approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist.”38
In the spirit of openness, the competition rules require winning systems to be reproducible — not just to have
their inference code be open, but also their training procedure and training data.39
AI Mathematical Olympiad: public sharing rules
“Our initial view is that, in the interest of advancing scientific knowledge, the AI models should be reproducible by
any third party with sufficient resources. In particular, the training data, training script and final model
(architecture and corresponding weights) should be made public. If the training data is too large to provide, then
alternatively the procedure used to construct the training data should be provided. We will determine the licences under
which AI models would need to be shared in order to be eligible for prizes. This may include a requirement to publish
open-source for non-commercial uses and on a royalty-basis for commercial uses.”
The first $1m AIMO progress prize, to be hosted on Kaggle, was announced on the 23rd of February 202440. For more information or to take part, go to the AIMO website.
AI Cyber Challenge
DARPA’s AI Cyber Challenge (AIxCC) is a bold effort to develop automated systems that can find and fix security
vulnerabilities in software. Like previous DARPA challenges, it’s extremely well-funded — with a total of $29.5m in prizes
to be awarded to teams taking part, including $7m to support small business teams in developing their solutions, and $2m
to each of the winning teams in the semifinals at DEF CON 32 in August 2024.
DARPA’s partners — Anthropic, Google, Microsoft, and OpenAI — are providing participants with credits to
their compute clouds and LLM APIs.
Prospective participants have until the 30th of April 2024 to register by submitting a 5-page whitepaper describing their
planned solution. Once built, solutions will be scored on diversity (how well they do at finding different types
of vulnerabilities), accuracy (penalising false positives), discovery (how many vulnerabilities they find), and program
repair (whether they can submit code patches that fix vulnerabilities without compromising functionality).
After this summer’s semifinals, the finals will take place at DEF CON in 2025, with a top prize of $4m.
As we see more of these competitions, platforms and organisers are confronting the issues involved in trying to
measure generative models in a systematic way. Some platforms already support automated human evaluation at large
scale, and most have run competitions involving panels of expert judges. Will other approaches be developed over time?
The competitive ML community has developed a suite of tools that can be used to more reliably measure the
internet-scale foundation models driving generative applications, without the issues inherent in static benchmarks.41
To list a few relevant techniques:
Benchmarks with test sets that are kept secret, like in the ARCathon competition, where the same private
test set has been used in multiple competitions since 2019.42 This would require a trusted evaluator
to run a version of the model on the secret test set and pass back only the score (see the sketch after this list).
This wouldn't currently be possible for models locked behind an API.
Dynamic benchmarks, where each iteration of the competition/benchmark is run on a new test set unknown to participants.
While this would allow API-locked models to be evaluated, there would be additional concerns about distribution shift.
In addition to the other competition platforms mentioned in this report, the Dynabench
platform aims to embrace this approach to dynamically measure models.
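To make the trusted-evaluator idea above more concrete, the sketch below shows the general shape of such a protocol: the organiser holds the secret test set, the participant supplies only a prediction function, and only an aggregate score crosses the boundary. The class and data here are hypothetical toy examples; real platforms achieve the same isolation with sandboxed code execution rather than a direct function call.

```python
# Hedged sketch of a trusted-evaluator protocol: the secret test set never leaves
# the evaluator, and the participant only ever receives an aggregate score.
from typing import Callable, List, Tuple

class TrustedEvaluator:
    def __init__(self, secret_test_set: List[Tuple[str, str]]):
        self._secret = secret_test_set  # (input, label) pairs, never exposed to participants

    def score(self, predict: Callable[[str], str]) -> float:
        """Run the participant's model on the secret inputs and return only the accuracy."""
        correct = sum(predict(x) == y for x, y in self._secret)
        return correct / len(self._secret)

# Toy usage: the organiser holds the data, the participant submits `predict`.
evaluator = TrustedEvaluator(secret_test_set=[("2+2", "4"), ("3*3", "9")])
print(evaluator.score(lambda expr: str(eval(expr))))  # 1.0 for this toy "model"
```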
We look forward to seeing the developments in this space over the coming months and years. We expect that competitions
have an important role to play in evaluating the foundation models these generative solutions are being built on.
About This Report
If you would like to support this research, please check out our sponsors below, or subscribe to our mailing
list.
About Our Sponsors
Latitude.sh
is the cloud that powers innovation.
From highly demanding applications to machine learning workflows, provision scalable, high-performance, and cost-effective cloud infrastructure on an easy-to-use, modern platform that puts developers first.
With both GPU containers and fully dedicated clusters available, choose the option that best fits your compute needs to effortlessly train, fine-tune and run inference on your Machine Learning models.
Comet’s
Experiment Tracking and Model Production Monitoring tools work with your existing infrastructure to track, visualize, debug, and compare model runs from training straight through to production.
Use Comet’s Artifacts, data lineage, and Model Registry as a single source of truth for all your model training runs and trace bugs in production all the way back to the data the model was trained on.
For prompt engineering workflows, use Comet LLM, an open source tool to track, visualize, and evaluate your prompts, chains, variables, and outputs.
Create a free account
to try Comet’s Experiment Management, Model Registry, and Prompt Management solutions.
About ML Contests
For over four years now, ML Contests has provided a competition directory and shared insights
on trends in the competitive machine learning space. To receive occasional updates with key insights like this report,
subscribe to our mailing list.
Thank you to Eniola Olaleye for help with data gathering, and to Peter Carlens, James Mundy, Alex Cross, and Maja Waite
for helpful feedback on drafts of this report.
To Adrien Pavao, for sharing a draft of his PhD thesis43, and to Fritz Cremer, for sharing insights on Kaggle
competitions.
To Professor Spyros Makridakis, and the Lab42 team, for taking the time to answer our questions and giving extra
insight into the M Competitions and the ARCathon.
To the teams at AIcrowd, DrivenData, Grand Challenge, Hugging Face, Humyn.ai, Kaggle, Lab42, Onward, Solafune, Trustii,
and Zindi, for providing data on their 2023 competitions.
Thank you to the competition winners who took the time to answer our questions by email or questionnaire:
Aleksandr Shatilov,
Andoni Irazusta Garmendia,
Ben Swain,
Colin Catlin,
Filip Staněk,
Francesco Morri,
Harrison Karani Maina,
Nasri Mohammed,
Ning Jia,
Olufemi Victor Tolulope,
Plato,
Tom Wetherell,
Yasser Houssem Eddine Kehal,
Yisak Birhanu Bule,
Kirill Brodt,
and others who preferred not to be named.
Thank you to the organisers of the
Machine Unlearning, PFL-DocVQA, Big-ANN, LLM Efficiency, Weather4cast, Sensorium, Single-cell Perturbation Prediction,
Lux AI, CityLearn, ROAD-R, Robot Air Hockey, and MyoChallenge
NeurIPS competitions,
and the
F1Tenth, AQRC, PUB.R, BARN, METRICS HEART-MET, Manufacturing Robotics, Robotic Grasping and Manipulation,
Human-Robot Collaborative Assembly, Humanoid Robot Wrestling, and Picking, Stacking and Assembly
ICRA competitions, who took the time to speak to us virtually or in person.
Thank you to the organisers of the Vesuvius Challenge, for giving us permission to use their images.
Lastly, thank you to the maintainers of the open source projects we made use of in conducting our research and
producing this page: Hugo,
Tailwind CSS, Chart.js,
Linguist, pipreqs, and
nbconvert.
Methodology
Data was gathered from correspondence with competition platforms, organisers, and winners, as well as from public online
discussions and code.
The following criteria were applied for inclusion, in-line with the submission criteria for our
listings page. Each competition must:
have a total prize pool6 of at least $1,000 in cash or liquid cryptocurrency (BTC/ETH), or be an
official competition at a top machine learning or robotics conference.
have ended44 between 1 Jan 2023 and 31 December 2023 (inclusive).
In our written explanations, the word “competition” is often used to refer to an over-arching competition (e.g. “The
M6 Competition”), while in our statistics we usually count sub-tracks each as their own competition (e.g. M6: Forecasting,
M6: Decisions, M6: Duathlon, and M6: Student Prize).
When counting a “number of competitions” for purposes such as prize pool distribution, or popularity of programming
languages, we use the following definition: If a competition is made up of several tracks, each with separate leaderboards and separate prize pools, then each track
counts as its own competition. If a competition is made up of multiple sub-tasks which are all measured together on
one main leaderboard for one prize pool, they count together as one competition.
For the purposes of this report, we consider a “competition winner” to be the #1-placed team in a competition as defined above.
We are aware that other valid usages of the term exist, with their own advantages — for example, anyone winning a Gold/Silver/Bronze medal,
or anyone winning a prize in a competition. For ease of analysis and in order to avoid double-counting, we exclusively
consider #1-placed teams in this report.
Compiling the Python packages section in the winning toolkit involved some discretion. While we
attempted to highlight the most popular and interesting packages for readers, we did not simply take the n most popular
packages. For example, Optuna, used in three winning solutions, is the most commonly used hyperparameter optimisation
package, and we included it in our toolkit. Polars was also used in three winning solutions, but given that it has
similar functionality to Pandas, which was used in 56 winning solutions, we did not include Polars in the winning
toolkit.
For the sake of transparency, we include a list of all packages which were used in more than three winning solutions
below, along with counts. Note that these counts don’t necessarily map to actual usages, and in some cases we did
additional work to figure out actual usage. For example, while TensorFlow is imported in some files within ten different
winning solutions, several of these didn’t actually make use of TensorFlow — e.g. ROAD-R
or RSNA Screening. We did not have the research resources
to go through all repositories in detail in this way; we investigated in cases where both TensorFlow and PyTorch were
being imported in one solution, and conducted a few other spot checks. As such, the list below should be considered
“raw”, without this extra usage analysis applied.
7 March 2024:
Changed “the winner of the M6 Competition’s forecasting track, who used R…” to “the winner of the M6 Competition’s duathlon track, who used R…” (Winning Toolkit)
Broadly, we included only competitions which had either cash prize money of at least $1k or were affiliated with an academic conference. See our submission criteria for more detail. ↩︎↩︎
The number of competitions and total prize money amounts are for competitions that ended in 2023. Prize money figures include only cash and liquid cryptocurrency (BTC/ETH). Travel grants and other types of prizes are excluded. Amounts are approximate — currency conversion is done at data collection time, and amounts are rounded to the nearest $1,000 USD. See Methodology for more details. ↩︎↩︎
In addition to the three eligible competitions Codabench hosted in 2023, the platform was also used to run
a power network operations challenge for power companies operating in the Paris region of France, with 7 participating
teams and a €500,000 grant as prize.
For more information on this “Learning to Run a Power Network” competition, see Adrien Pavao’s PhD Thesis (PDF). ↩︎
CodaLab’s public instance moved from competitions.codalab.org to
codalab.lisn.fr when the CodaLab platform transitioned from Python2 to Python3, and users
were not automatically migrated to the new platform. The old public CodaLab instance had 128,000 users in October 2022
before it was closed to new competitions. The new public CodaLab instance had 55,767 users as of 5 March 2024.
Source: CodaLab highlights page↩︎
In general, we consider the available prize pool rather than the disbursed prize
pool. In most cases the available prize pool is the same as the disbursed prize pool,
but there are some exceptions. Most notably: the ARCathon competition has prizes tied to benchmark improvements,
and while the available prize pool for the main competition in 2023 was CHF 69,000 (one thousand Swiss Francs per
percentage point improvement between 31% and 100%), nothing was paid out in 2023 as there were no improvements by
participants on the state-of-the-art ensemble solution. In other cases there can be discretionary bonus prizes which are
not always awarded. We count the available prize pool, as it is not always possible to get reliable information on the
disbursed prize pool. ↩︎↩︎
“More than 80,000 high-level human resources are participating, including excellent data scientists at major companies and students majoring in the AI field. (as of February 2023)”. Source: Signate’s website, translated from Japanese using Google Translate on 5th of March 2024. ↩︎
Tianchi: “approximately 1.4 million registered users”. Source: email correspondence with
the Tianchi team on 1 April 2024. ↩︎
We acknowledged this limitation in footnote 4 of our 2022 report. Because platforms like CodaLab and EvalAI allow anyone to list competitions, there isn’t a central platform source we can use for the number of competitions and total prize pool. In 2022 we did not have sufficient resources to exhaustively go through all of the listings on these sites, and had to rely on other sources. ↩︎
It is also the case that in competitions with a higher barrier to entry, the participants who manage
to surmount that barrier will tend to be more dedicated than in competitions where new submissions can be generated
by making minor edits to another team’s solution. For this reason, we think aiming for a large number of submissions is
not always the right goal, and we have removed the “typical entries” column from the competition platforms table. ↩︎
The 65 solutions include 57 where source code was public, 6 where code wasn’t public but we got language/package data from winners completing our questionnaire, and 2 where language/package data was mentioned in the winners’ write-up. In cases where the same solution won multiple tracks (e.g. Weather4cast, KDD Cup), the solution is counted only once. ↩︎
This competition was more of an optimisation problem than strictly a machine learning problem. The winners state: “Here, we searched for the optimal arm movement without additional reconfiguration cost by using dynamic programming (DP) and beam search with pruning… the entire DP table could be calculated in a few seconds when implemented with C++…”. Source: Kaggle solution write-up↩︎
The toolkit section broadly reflects the most-commonly-used packages for different ML-relevant purposes. Some discretion is used: for example, more general-purpose packages like requests and setuptools are excluded from the list. More details in the methodology section. ↩︎
Of these, 8 used the pytorch-lightning package, and one used the newer lightning package. ↩︎
Of over 26,000 developers surveyed, 24% said they use AI assistants “quite often” for generating code. 37% selected “from time to time”, 24% “rarely”, and 15% “never”. Source: JetBrains’ State of Developer Ecosystem 2023 report. ↩︎
“Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%. " Source: Smith et al, ConvNets Match Vision Transformers at Scale. ↩︎
“Our final ensemble is a blend of five 7B models and one 13B model. Each model uses a different context approach using different wikis, embeddings and topk. The pipeline fits precisely into 9-hour runtime and utilizes 2.5TB of input data.”, “We had most success with the following LLM models: Llama-2-7b, Mistral-7B-v0.1, xgen-7b-8k-base, Llama-2-13b”. Source: Kaggle solution write-up. ↩︎
“The original Whisper tokenizer employs character-level tokens for low-resource languages, which can be time-consuming.
So we trained a BPE tokenizer with 12,000 tokens specifically for Bengali text. We then replaced some tokens in the Whisper tokenizer with these.
We carefully replaced tokens after the 10,000th position of the original Whisper’s tokens.” Source: Kaggle solution write-up. ↩︎
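As a rough illustration of the first step described in that write-up, here is a minimal sketch of training a 12,000-token BPE tokenizer with the Hugging Face tokenizers library; the corpus file name is a hypothetical placeholder, and the subsequent splicing of these tokens into Whisper's vocabulary is specific to the winning pipeline and not shown.

```python
# Minimal sketch (not the winners' code): train a 12k-token BPE tokenizer on Bengali text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=12_000, special_tokens=["[UNK]"])
tokenizer.train(files=["bengali_corpus.txt"], trainer=trainer)  # hypothetical corpus file

tokenizer.save("bengali_bpe_12k.json")
```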
“Our final submission only used a 4fold microsoft/deberta v3 large. “, “Carefully designed prompt to guide LLM to spit out the topic and topic text in his stomach”, “Another prompt used to generate ten summary of different quality for each additional topic”, “we tried as many open source LLM as well as chatgpt”. Source: Kaggle solution write-up. ↩︎
Some participants noticed in advance that the public leaderboard dataset was not very diverse, and performance on that dataset was therefore less likely to generalise well to the private dataset: “If I am not mistaken, this must mean that public data only contains texts from grade 10, which makes it a very poor indicator of good model quality. " Source: Kaggle Discussion. ↩︎
“You may use data other than the Competition Data (“External Data”) to develop and test your Submissions. However, you will ensure the External Data is publicly available and equally accessible to use by all participants of the Competition for purposes of the competition at no cost to the other participants. " Source: Kaggle competition rules. ↩︎
M6 paper abstract: “The M6 forecasting competition, the sixth in the Makridakis’ competition sequence, is focused on financial forecasting. A key objective of the M6 competition was to contribute to the debate surrounding the Efficient Market Hypothesis (EMH) by examining how and why market participants make investment decisions. To address these objectives, the M6 competition investigated forecasting accuracy and investment performance on a universe of 100 publicly traded assets. The competition employed live evaluation on real data across multiple periods, a cross-sectional setting where participants predicted asset performance relative to that of other assets, and a direct evaluation of the utility of forecasts. In this way, we were able to measure the benefits of accurate forecasting and assess the importance of forecasting when making investment decisions. Our findings highlight the challenges that participants faced when attempting to accurately forecast the relative performance of assets, the great difficulty associated with trying to consistently outperform the market, the limited connection between submitted forecasts and investment decisions, the value added by information exchange and the “wisdom of crowds”, and the value of utilizing risk models when attempting to connect prediction and investing decisions. " Source: Makridakis, et al., The M6 forecasting competition: Bridging the gap between forecasting and investment decisions, arXiv. ↩︎↩︎↩︎
Using Pandas with an Arrow back-end can be as simple as df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow"). For more details, see
the pandas documentation or
this post by one of the Pandas
devs. ↩︎
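A minimal sketch of the snippet above in context, assuming pandas ≥ 2.0 with pyarrow installed and a hypothetical train.csv:

```python
import pandas as pd

df_numpy = pd.read_csv("train.csv")                             # default NumPy-backed dtypes
df_pyarrow = pd.read_csv("train.csv", dtype_backend="pyarrow")  # Arrow-backed dtypes

print(df_numpy.dtypes.head())    # e.g. int64, float64, object
print(df_pyarrow.dtypes.head())  # e.g. int64[pyarrow], double[pyarrow], string[pyarrow]
```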
Another example of this from outside competitive ML: “During a recent Higher Performance Python course (privately run for a large hedge fund) we had a chat about the internal move towards Polars - some research groups were starting to try Polars with cautious success. Typically this worked in isolated places where Pandas had poor performance and where lots of RAM and many CPUs could more efficiently process the data in Polars. " Source: NotANumber Newsletter. ↩︎
“We show that before 2010 training compute grew in line with Moore’s law, doubling roughly every 20 months. Since the advent of Deep Learning in the early 2010s, the scaling of training compute has accelerated, doubling approximately every 6 months. " Source: Sevilla et al., Compute Trends Across Three Eras of Machine Learning, arXiv. ↩︎
For example, the ROAD-R and Big-ANN competitions provided $200-$1,000 of compute grants to select participants. ↩︎
Some LLM inference benchmarks show AMD’s MI300X having lower latency and higher throughput than NVIDIA’s H100, while being cheaper. For example, on Llama-2 13B, the MI300X shows a 1.2x latency improvement over the H100, according to AMD’s benchmarks. “This is a big deal as OpenAI and Microsoft will be using AMD MI300 heavily for inference.” Source: Semianalysis newsletter. ↩︎
T4s were added alongside P100s in 2022. “A P100 GPU will perform better on some applications and the T4x2 will perform better on others. For example, a P100 typically has better single-precision performance than a T4, but the T4 has better mixed precision performance, and you’ll have twice as much GPU RAM in the T4x2 configuration.” Source: Kaggle product announcement. ↩︎
“I recently shared that by the end of this year we’ll have about 350k H100s and including other GPUs that’ll be around 600k H100 equivalents of compute. " Source: Meta Platforms, Inc. earnings call, 1 Feb 2024 (PDF). ↩︎
With a median on-demand price of $0.89/h for a V100 (source: Cloud GPUs), 8 of them for 6 days would cost $1,025.28. Compute details from the paper: “The code was run on DGX V100 workstation or on NVIDIA cluster nodes with up to 8 V100 GPU (DGX-1 machines). The CNN model in particular was quite compute intense, with more than 24 hours per language on a 8xV100 node.” Source: Deotte et al., Winning Amazon KDD Cup'23, OpenReview. ↩︎
“2 x GPU Nvidia A100 40 GB VRAM… 8 days for training” (source: winner’s solution directory). Using a median on-demand price of $1.5/h for an A100 (40GB). Source: Cloud GPUs. ↩︎
“Trained on 8x 48GB RTX A6000” (source: Kaggle solution write-up). Using a median on-demand price of $0.99/h and a purchase price of $4,698 for an A6000. Rounding up from $37,584 to $40k, considering the cost of CPU, RAM, and other hardware required. Sources: Cloud GPUs, Amazon. ↩︎
More precisely, in these 5 competitions he spent 9h45m,
14h39m, 30h2m,
40h8m, and 41h45m.
These were all deep learning tasks; mostly computer vision. ↩︎
For example, one post from @raddar mentioned: “cross-validation is useless, leaderboard is useless, models are validated on 1-2 data points”. Source: Kaggle Discussion. ↩︎
M4 paper abstract: “The M4 Competition follows on from the three previous M competitions, the purpose of which was to learn from empirical evidence both how to improve the forecasting accuracy and how such learning could be used to advance the theory and practice of forecasting. The aim of M4 was to replicate and extend the three previous competitions by: (a) significantly increasing the number of series, (b) expanding the number of forecasting methods, and (c) including prediction intervals in the evaluation process as well as point forecasts. This paper covers all aspects of M4 in detail, including its organization and running, the presentation of its results, the top-performing methods overall and by categories, its major findings and their implications, and the computational requirements of the various methods. Finally, it summarizes its main conclusions and states the expectation that its series will become a testing ground for the evaluation of new methods and the improvement of the practice of forecasting, while also suggesting some ways forward for the field.” Source: Makridakis et al., The M4 Competition: 100,000 time series and 61 forecasting methods. ↩︎
M5 Accuracy paper abstract: “In this study, we present the results of the M5 “Accuracy” competition, which was the first of two parallel challenges in the latest M competition with the aim of advancing the theory and practice of forecasting. The main objective in the M5 “Accuracy” competition was to accurately predict 42,840 time series representing the hierarchical unit sales for the largest retail company in the world by revenue, Walmart. The competition required the submission of 30,490 point forecasts for the lowest cross-sectional aggregation level of the data, which could then be summed up accordingly to estimate forecasts for the remaining upward levels. We provide details of the implementation of the M5 “Accuracy” challenge, as well as the results and best performing methods, and summarize the major findings and conclusions. Finally, we discuss the implications of these findings and suggest directions for future research.” Source: Makridakis et al., M5 accuracy competition: Results, findings, and conclusions. ↩︎
AlphaGeometry abstract: “Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.” Source: Trinh et al., Solving olympiad geometry without human demonstrations. ↩︎
The “pre-announcement” of the first progress prize describes a multiple choice Olympiad on 100
novel problems, to be hosted on Kaggle, with a total prize fund of $1,048,576 over up to three years.
Source: r/AIMOprize. ↩︎
There will always be some leakage, since participants get information on the test set when they find out their score, but limiting submission frequency can slow down this process. ↩︎
Titled “Methodology for Design and Analysis of Machine Learning Competitions”, it is now publicly
available on Adrien’s website (PDF).
It gives context on the history of competitions and benchmarks, as well as an overview of CodaLab and Codabench, and
considerations on competition design for many different types of competitions. ↩︎
In most cases, we take the submission deadline as the end date. The only exception to this is when a
competition is judged on data which becomes available only after final submissions: for example, Kaggle’s
Trading at the Close competition
had a final submission deadline of 20th December 2023, after which 3 months of trading data are collected; the
competition officially ends on 20th March 2024. It will be included in the 2024 edition of our report, to be published in 2025.
Conversely, while the Vesuvius Challenge winners weren’t announced until February 2024, the
submission deadline was 31 December 2023, and so it is included in this edition of our report. ↩︎