The State of Competitive Machine Learning

2023 Edition

We summarise the state of the competitive landscape and analyse the 300+ competitions that took place in 2023. Plus a deep dive analysis of 60+ winning solutions to figure out the best strategies to win at competitive ML.

Competitive ML Landscape

In 2023, several new ML competition platforms launched, and there were a number of high-profile competitions with very large prize pools.

We found over 300 ML competitions that took place in 2023, across more than 18 competition platforms and many more independent single-competition websites. With a total cash prize pool of over $7.8m, 2023 saw a large jump in prize money compared to 2022 and other recent years.

We cover major developments on each of the competition platforms in the Platforms section. Further down we review some of 2023’s Notable Competitions in more depth, and preview some upcoming developments in the Looking Ahead section. Competition winners’ approaches and technology choices are analysed in the Winning Solutions section.

Platforms

Google’s Kaggle platform has by far the most registered users (17m+, up from 10m+ last year), and remains the largest platform by total prize money, with DrivenData second.

CodaLab ran the most competitions that fit our inclusion criteria1; many of these were competitions affiliated with academic conferences. Among academic conferences, the other popular competition platforms tended to be EvalAI, AIcrowd, and Kaggle.

Alibaba Cloud’s Tianchi ran the second-most competitions that fit our criteria.

Zindi — in addition to 21 competitions — also held 47 community hackathons, and awarded $190k in total prize money across community hackathons and competitions in 2023.

2023 Platform Comparison

A few new platforms started listing competitions in 2023:

Overview of competition platforms (2023)

| Platform | Founded | Users | Competitions | Total prize money |
|---|---|---|---|---|
| AIcrowd | 2017 | 123k+ | 17 | $198,000 |
| Bitgrit | 2017 | 35k+ | 2 | $6,000 |
| Codabench | 2023 | 3.5k+ | 3 | $10,000 |
| CodaLab | 2013 | 55k+ | 71 | $218,000 |
| DrivenData | 2014 | 100k+ | 11 | $1,762,000 |
| DS Works | | | 5 | $127,000 |
| EvalAI | 2017 | 40k+ | 33 | $96,000 |
| Grand Challenge | 2010 | 94k+ | 22 | $40,000 |
| Hugging Face | 2023 | | 5 | $4,000 |
| Humyn.ai | 2023 | 14k+ | 5 | $41,000 |
| Kaggle | 2010 | 17m+ | 41 | $2,358,000 |
| Lab42 | 2022 | ~500 | 3 | $87,000 |
| Onward | 2022 | 3.2k+ | 3 | $110,000 |
| Signate | 2014 | 80k+ | 12 | $163,000 |
| Solafune | 2020 | | 3 | $30,000 |
| Tianchi | 2014 | ~1.4m | 61 | $852,000 |
| Trustii | 2020 | 1.5k+ | 2 | $60,000 |
| Zindi | 2018 | 72k+ | 21 | $134,000 |
| Other | | | 47 | $1,518,000 |


Other platforms

For readability and relevance, the table above only includes platforms with multiple competitions. Decisions around what exactly constitutes a platform are somewhat subjective, and we remain open to changing this in future. Competitions in the “Other” bucket in 2023 include:

Some updates on platforms not included in our 2022 report:

It’s likely there are other relevant platforms which we haven’t covered here, and we’ll do our best to include them in future once we’re made aware of them.

Academia

Competitions can be a useful research tool, and we found over 100 competitions affiliated with academic conferences in 2023.

NeurIPS 2023 hosted 20 competitions on topics including LLM efficiency, weather forecasting, multi-agent games, modelling mouse brains, discovering new catalysts, and many others.

Competitions at NeurIPS 2023

We attended NeurIPS 2023 and spoke to more than half of the competition organisers, as well as several of the participants.

This year, NeurIPS introduced a new requirement for competition organisers and participants to publish their post-competition reports as papers in the following year’s NeurIPS Datasets & Benchmarks track. As well as increasing transparency and reproducibility, this provides an additional incentive to participants.

NeurIPS Competitions

Robotics: ICRA & IROS

Both ICRA (International Conference on Robotics and Automation) and IROS (International Conference on Intelligent Robots and Systems) have official competition tracks. ICRA 2023 hosted 12 competitions, and IROS 2023 hosted 3 competitions.

More on this in the robotics section.

Other Conferences

Other conferences — including CVPR, ICCV, and ICML — also hosted ML competitions in 2023. In general, these conferences hosted competitions as part of workshops, as opposed to having a separate competitions track. For example, the Fine-Grained Visual Categorisation (FGVC) workshop at CVPR hosted 7 competitions.

The most popular platform for academic competitions was CodaLab, followed by EvalAI. Both platforms are open source and free to use, which may explain their popularity with the academic community. While most competitions used the hosted versions of these platforms, some, such as Sensorium 2023, chose to run their own instance.

Papers and publications

There were many interesting developments in academia relevant to competitions; we’ll cover only a few here.

A new journal — DMLR, the Journal of Data-centric Machine Learning Research — launched in 2023, focusing on datasets and benchmarks for ML research, among other things.

Pre-prints are now available on arXiv for most chapters in the new book AI competitions and benchmarks: the science behind the contests, with contributions from authors from various organisations including Kaggle, CodaLab, and ML Contests.

Data Science at the Singularity, a paper published by David Donoho in October 2023, makes the case that recent AI progress isn’t due to a particular method or architecture, but instead due to the emergence of frictionless reproducibility in data-driven scientific research. In Donoho’s definition, the three key ingredients of frictionless reproducibility are “data sharing, code sharing, and competitive challenges”.

Prizes & Participation

The $7.8m total prize pool in 2023 was an increase of more than 40% over 2022. Much of this came from a few competitions with very large prizes — the Vesuvius Challenge’s $1m prize pool (across 8 different prizes), the M6 Financial Forecasting competition’s $300k (across 4 separate tracks), and almost $1.5m across DrivenData’s two PETs prizes — but we saw increases in competitions across all prize pool buckets.

Prize pool

We also found more eligible competitions without prizes than in 2022. Our data set for this report is restricted to competitions with either meaningful prize money or a conference affiliation1, and we think that some of the increase in no-prize competitions we found this year is attributable to a methodological improvement.9 This year we manually reviewed CodaLab and EvalAI listings, identifying competitions for CVPR, ICCV, and other conferences, which we may have previously missed.

Monetary prizes are a useful incentive, but anecdotally it appears that often the monetary prizes are secondary to the kudos and future career value involved in winning — either through gaining medals/climbing up platform rankings, or presenting research at a conference.

Leaderboard entries

Participation is highly variable; we found 46 competitions with fewer than 10 teams entering, and 24 with over 1,000 teams. These numbers don’t necessarily reflect the quality or effort involved by participants: some of the academic competitions are targeted at a specific niche, and allow leading researchers at different labs to compare their methods on a level playing field.10

Winning Solutions

In this section we first review the most popular programming languages and packages among competition winners, and then go into more detail on approaches for specific types of competitions. This year, we focus mostly on time-series and NLP competitions. See our 2022 report for more detail on tabular data and computer vision competitions.

We analysed winning solutions by reviewing public write-ups and source code, as well as getting information directly from the winning teams where we were able to establish contact.

Some of the data was gathered from winners using a structured questionnaire, allowing for systematic comparison. In addition to this, we interviewed some competition winners to understand their solutions in more depth, either in-person at ICRA and NeurIPS, or by email.

Throughout this section we link to several write-ups of winning solutions. For more write-ups, check out DrivenData’s blog, AIcrowd’s blog, Zindi’s meet the winners posts, and the winners of Kaggle’s new best solution write-ups award.

Winning Toolkit

Python remains the most popular programming language among winners, with 63 out of 65 winning solutions primarily or exclusively using Python. The two exceptions were the winner of the M6 Competition’s duathlon track, who used R, and the winner of Kaggle’s Santa 2022 optimisation competition, who used C++.

Primary Programming Language

Some winners used bits of C++ and R to supplement their Python. We didn’t find any uses of Julia or Rust in participants’ own code — though we are starting to see adoption of more third-party packages written in Rust (see dataframes).

Python Packages

We did a deep dive into Python packages used by winners last year, and many of the top packages used by winners have remained the same. This year’s top packages are listed below, in descending order of popularity within each category. 13

Core
  • numpy arrays
  • pandas dataframes
  • scipy optimisation and other fundamentals
  • matplotlib low-level plotting
  • seaborn higher-level plotting
NLP
  • transformers tools for pre-trained models
  • nltk swiss army knife for NLP
  • peft parameter-efficient fine-tuning
Vision
  • opencv-python core vision algorithms
  • Pillow core vision algorithms
  • torchvision core vision algorithms
  • albumentations image augmentations
  • timm pre-trained models
  • scikit-image core vision algorithms
  • segmentation-models-pytorch segmentation
Modeling
  • scikit-learn models, transforms, metrics
  • deep learning:
    • torch
    • pytorch-lightning layer on top of torch
    • tensorflow
  • gradient-boosted trees:
    • lightgbm
    • xgboost
    • catboost
Other
  • tqdm progress bar
  • wandb experiment tracking
  • psutil system tools
  • joblib parallelisation
  • loguru logs
  • ray distributed training
  • numba jit compilation
  • optuna hyperparameter optimisation

Deep Learning

PyTorch remains by far the most popular deep learning library, in a two-horse race — out of 53 solutions using deep learning, 47 primarily used PyTorch, 4 used TensorFlow, and 2 used both PyTorch and TensorFlow.

Deep Learning: PyTorch vs TensorFlow

All solutions using TensorFlow made use of the high-level Keras API. While the majority of PyTorch users still use PyTorch directly, we found 9 winning solutions using PyTorch Lightning14 (up from 3 last year). We didn’t find any winning solutions using JAX.
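For readers unfamiliar with PyTorch Lightning, here is a minimal sketch of how it wraps a plain torch model, with the training loop handled by a Trainer. The model, data, and hyperparameters are purely illustrative and not taken from any winning solution.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    """Minimal LightningModule: the training loop, device placement and
    logging are handled by the Trainer rather than hand-written."""

    def __init__(self, n_features: int = 32, n_classes: int = 4, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)

# Dummy data, just to make the example runnable end-to-end.
x = torch.randn(256, 32)
y = torch.randint(0, 4, (256,))
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

trainer = pl.Trainer(max_epochs=2, accelerator="auto", logger=False)
trainer.fit(LitClassifier(), loader)
```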

Use of LLMs

Given their rise in general usage, it’s no surprise that large language models (LLMs) are proving useful in competitive ML too. However, they are far from ubiquitous among winning approaches.

From our findings, most winners either explicitly said they didn’t use LLMs in any way, or didn’t mention any use of LLMs.

The most popular use of LLMs we found is for code completion, with 30% of winners who completed our questionnaire explicitly mentioning that they used LLMs for code completion. This is in line with JetBrains’ findings15 that around a quarter of developers frequently use AI assistants for generating code. Given the prevalence of these tools for code completion, it’s likely that a similar proportion of winners are using them without feeling the need to mention it in their write-ups — just as they wouldn’t necessarily mention the IDE or OS they’re using.

Some competitions explicitly invited the use of LLMs: Signate’s ChatGPT Challenge asked participants to identify successful and unsuccessful uses of ChatGPT in business scenarios. DrivenData’s AI Research Assistants for NASA was also very well-suited to LLMs, and virtually all successful submissions incorporated some degree of prompt engineering.

Aside from code completion, LLMs were used in a few other ways by competition winners.

One of the winners who completed our questionnaire mentioned using LLMs for idea generation, and quickly getting up to speed in a new field.

This competition was my first time-series problem, so I used ChatGPT at the very start of my research process to get a broad overview of the field. Asking questions along the lines of “What are classical approaches to time series problems”, “What are modern/state-of-the-art/deep learning based approaches?”, etc. I found it pretty helpful to get model names and terms to research further (ARIMA, Theta, Prophet, TCN, …).
— Tom Wetherell, Zindi RMB CPI Nowcasting winner

We found two uses of LLMs for generating synthetic data among winners, as described in more detail in the synthetic data section. The winner of DrivenData’s Unsupervised Wisdom competition used LLMs for data extraction. One of the joint winners of ARCathon 2023 made use of LLMs for abstract reasoning, as mentioned in the ARCathon section.

Lastly, but perhaps most interestingly, one winning team put a classification head on some pre-trained LLMs, alongside significant use of RAG, allowing them to win Kaggle’s LLM Science Exam competition — more on this in the NLP foundation models section.

Computer Vision

In our 2022 report we reviewed how, unlike in NLP, leading computer vision models still largely hadn’t converged on a single architecture.

Things looked similar throughout 2023: both CNNs (convolutional neural networks) and Transformers were used for vision, and most competition winners still used CNN-based architectures for computer vision competitions.

This lack of architectural convergence is backed up by research — a 2023 paper compared NFNets (a CNN-based architecture) against Vision Transformers, and found that “NFNets match the reported performance of Vision Transformers with comparable compute budgets.”16

Computer Vision Architectures

In the solutions we managed to find, ConvNeXt and U-Net (CNN) architectures were most popular. Swin-Transformers were the most commonly-used Transformer-based models, and both convolutional layers and Transformer modules were used in custom architectures as well. The choice of architecture is in part determined by the particular task — for example, YOLO models for object detection and segmentation.
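As an illustration of how these architectures are typically instantiated in practice, here is a small sketch using timm and segmentation-models-pytorch. The model names, class counts, and input sizes are examples for illustration, not the configuration of any specific winner.

```python
import timm
import segmentation_models_pytorch as smp
import torch

# Classification backbones: a ConvNeXt (CNN) and a Swin Transformer,
# both loaded with ImageNet pre-trained weights via timm.
convnext = timm.create_model("convnext_base", pretrained=True, num_classes=10)
swin = timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=10)

# A U-Net for segmentation, with a pre-trained CNN encoder.
unet = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",
    in_channels=3,
    classes=1,
)

images = torch.randn(2, 3, 224, 224)
print(convnext(images).shape)  # (2, 10) class logits
print(unet(images).shape)      # (2, 1, 224, 224) per-pixel logits
```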

Model Families

The winner of DrivenData’s Tick Tick Bloom competition, where participants used satellite imagery to detect specific types of bacteria in bodies of water, showed that deep learning isn’t necessarily the right tool for all computer vision problems.

Their solution used a combination of k-nearest-neighbours and a LightGBM model, with features including climate data and the colour of the water. They noted “I tried using a CNN model with satellite images. Unfortunately, this type of model resulted in very high RMSEs. After some analysis, I suspect that the quality and resolution of the satellite images, as well as the accuracy of the positions, made it very difficult to fit the CNN model well.”
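A rough sketch of that style of tabular approach is shown below: blending a k-nearest-neighbours model with a LightGBM model over hand-built features. The feature names, blend weights, and data are purely illustrative, not the winner's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from lightgbm import LGBMRegressor

# Illustrative tabular features (e.g. location, climate, water colour stats).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "latitude": rng.uniform(25, 49, 2000),
    "longitude": rng.uniform(-125, -67, 2000),
    "temperature": rng.normal(20, 8, 2000),
    "water_colour_green": rng.uniform(0, 1, 2000),
})
y = rng.normal(0, 1, 2000)  # illustrative regression target

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
gbm = LGBMRegressor(n_estimators=500, learning_rate=0.05).fit(X_train, y_train)

# Simple weighted blend of the two models' predictions.
blend = 0.4 * knn.predict(X_valid) + 0.6 * gbm.predict(X_valid)
rmse = np.sqrt(np.mean((blend - y_valid) ** 2))
print(f"validation RMSE: {rmse:.3f}")
```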

There were many vision-related competitions from top conferences (ICCV and CVPR) for which we did not manage to gather data on winning solutions, and their inclusion might have given a different view of the leading edge. We aim to gather more data from these conferences for the 2024 report.

NLP

Foundation Models

Last year we noted that almost all winning NLP solutions used versions of Microsoft Research’s DeBERTa model — usually deberta-v3-large.

Since the end of 2022, the firehose of public-weights LLM releases has continued to deliver newer, better, and more efficient GPT-style generative foundation models that can be fine-tuned for specific tasks — including the Llama, Falcon, Mistral, Phi, QWEN, OLMo and Gemma series of models.

As expected, these have been adopted by the competitive ML community and have at times outperformed solutions based on the incumbent DeBERTa models. They are also being used alongside these models — for example, some competition winners have used generative LLMs to generate synthetic data, which a DeBERTa model is then trained on.

In other situations, generative models have directly replaced these older models in the inference pipeline — most notably Kaggle’s LLM Science Exam competition.

The winners of that competition used an ensemble of five 7B-parameter LLMs and one 13B-parameter LLM, which they fine-tuned, alongside Retrieval-Augmented Generation on a chunked local version of Wikipedia. They used an interesting approach where they added a binary classification head onto the end of several pre-trained LLMs before training them. The inference pipeline came in at just under the 9-hour runtime requirement, albeit with a staggering 2.5TB of input data!17
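A minimal sketch of the classification-head idea follows, using the transformers sequence-classification wrapper to score each answer option given retrieved context. The model name, prompt format, and the (omitted) retrieval step are placeholders; the winners' actual pipeline was considerably more involved.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A decoder-only LLM wrapped with a single-logit classification head.
model_name = "mistralai/Mistral-7B-v0.1"  # placeholder 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, torch_dtype=torch.float16, device_map="auto"
)
# Decoder-only tokenizers often lack a pad token; reuse EOS if needed.
model.config.pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

def score_option(context: str, question: str, option: str) -> float:
    """Score one (context, question, option) triple; higher = more likely correct."""
    text = f"Context: {context}\nQuestion: {question}\nAnswer: {option}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# At inference: retrieve relevant Wikipedia chunks (the RAG step, omitted here),
# score every multiple-choice option, and rank them by score.
context = "...retrieved Wikipedia passage..."
question = "Which particle mediates the electromagnetic force?"
options = ["The photon", "The gluon", "The Higgs boson"]
ranked = sorted(options, key=lambda o: score_option(context, question, o), reverse=True)
print(ranked)
```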

In low-compute-resource situations, Mistral-7B and QWEN-7b emerged as two popular foundation models, as used by the winners of the NeurIPS 2023 LLM Efficiency Challenge.

Non-English Languages and Speech Recognition

Kaggle’s Bengali Speech Recognition competition and Zindi’s Swahili ASR competition both involved speech recognition for lower-resource18 languages (Bengali and Swahili). The winners of both competitions used OpenAI’s Whisper model, either exclusively or as part of an ensemble. For the Bengali competition, the winners trained a custom tokeniser19, specifically on Bengali text, to avoid the inefficient character-level encodings the Whisper tokeniser would otherwise use for Bengali.
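As a rough illustration of the idea (not the winners' exact code), a byte-level BPE tokeniser can be trained on a Bengali corpus with the tokenizers library, so that common Bengali words map to single tokens rather than long character-level sequences. The corpus and special tokens below are placeholders.

```python
from tokenizers import ByteLevelBPETokenizer

# Illustrative corpus; in practice this would be a large collection of Bengali text.
bengali_corpus = [
    "আমি বাংলায় গান গাই",
    "বাংলাদেশ দক্ষিণ এশিয়ার একটি দেশ",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    bengali_corpus,
    vocab_size=8000,
    min_frequency=2,
    special_tokens=["<|startoftranscript|>", "<|endoftext|>"],
)

encoded = tokenizer.encode("আমি বাংলায় গান গাই")
# With a large corpus, frequent Bengali words become single tokens
# instead of long runs of byte-level pieces.
print(len(encoded.ids), encoded.tokens)
```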

Synthetic data

The winners of Kaggle’s CommonLit competition fine-tuned deberta-v3-large on a custom dataset combining Kaggle’s training data with a larger synthetic dataset they generated using a combination of ChatGPT and other LLMs.20 The team jumped from 49th place on the public leaderboard to 1st on the final private leaderboard. It appears that their focus on generating a diverse training dataset paid off particularly well because of the significant distribution shift between public and private test data.21

The winners of the Stable Diffusion - Image to Prompts competition used ChatGPT to generate a large number of synthetic prompts.

The winners of the LLM Science Exam competition also made use of LLM-generated synthetic data, though they were able to use synthetic data that had already been shared on Kaggle by others.
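A simplified sketch of the general pattern: generating synthetic training examples with an LLM API and saving them for later fine-tuning. The prompt, model name, and output schema are illustrative; the winners' actual generation pipelines were more elaborate.

```python
import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a short passage suitable for a grade 9 reading test, followed by a "
    "one-sentence summary of it. Format: PASSAGE: ... SUMMARY: ..."
)

rows = []
for _ in range(5):  # in practice, thousands of generations with varied prompts
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    text = response.choices[0].message.content
    passage, _, summary = text.partition("SUMMARY:")
    rows.append({"passage": passage.replace("PASSAGE:", "").strip(),
                 "summary": summary.strip()})

with open("synthetic_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["passage", "summary"])
    writer.writeheader()
    writer.writerows(rows)
# A deberta-v3-large model can then be fine-tuned on the combined real + synthetic set.
```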

Should synthetic data be considered “publicly available”?

Kaggle’s competition rules for the CommonLit competition state that competitors can use external data only if it’s “publicly available and equally accessible to use by all participants … at no cost”22. There was some debate in the competition discussion forums as to the nature of synthetic data. If an LLM’s weights are publicly available, or the LLM can be accessed online for free, should synthetic data generated by it be considered fair game for future competitions?

Adapters and quantisation

The resource constraints imposed by competitions (explicit runtime limits for inference, and practical budget constraints for training) mean that naively fine-tuning LLMs is usually not feasible.

A plethora of parameter-efficient fine-tuning (PEFT) methods have emerged. One of the most popular approaches is LoRA, which was used in the winning LLM Science Exam solution, as well as the A100 track of the NeurIPS 2023 LLM Efficiency Challenge.

Building on LoRA, QLoRA quantises the frozen pre-trained parameters using a novel 4-bit NormalFloat data-type, to further reduce memory requirements with minimal performance trade-offs. This allows GPUs with as little as 16GB of memory to fine-tune 7B parameter models. QLoRA was used to win the RTX 4090 track of the NeurIPS 2023 LLM Efficiency Challenge.
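A condensed sketch of the QLoRA recipe with transformers, bitsandbytes, and peft is shown below. The hyperparameters and target module names are typical choices for illustration, not taken from any specific winning solution.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder 7B base model

# 4-bit NF4 quantisation of the frozen base weights (the QLoRA part).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Small trainable low-rank adapters on the attention projections (the LoRA part).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# The model can now be fine-tuned with a standard transformers Trainer or SFT
# loop on a single 16-24GB GPU.
```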

Timeseries Forecasting

Some of the notable time-series prediction competitions from 2023 include the M6 Competition, the Weather4cast competition, Zindi’s RMB CPI Nowcasting Challenge, AIcrowd’s CityLearn Forecasting Track, and Kaggle’s GoDaddy Microbusiness Density Forecasting.

Additionally, DrivenData’s Water Supply Forecast Rodeo and Kaggle’s Trading at the Close competition both started in 2023 but won’t wrap up until later in 2024 — we’ll look to cover them in next year’s report.

Winners used a mixture of approaches, including statistical models like ARIMA, gradient-boosted tree methods (XGBoost), bayesian factor models, and deep learning methods (fully-connected deep neural nets, LSTMs, and convolutional networks — the latter for exploiting 2D structure in satellite data).

Despite all incorporating time-series elements, time-series forecasting competitions are highly heterogeneous. This section presents a brief overview of the winning solutions we found, with links to additional materials where available. We also comment on making manual adjustments to systematic forecasts.

CPI Nowcasting

In Zindi’s RMB CPI Nowcasting competition, participants had to forecast South Africa’s consumer price index, a proxy for inflation, on a monthly basis.

The overall winner of this competition used a combination of ARIMA and XGBoost for their model, and Optuna for hyperparameter optimisation. We also heard from the winners of two of the monthly mini-challenges in this competition, who also both used ARIMA-based models.
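A rough sketch of how an ARIMA model and a gradient-boosted model can be combined for a nowcasting task, with Optuna tuning the ARIMA order and blend weight. All data, orders, and hyperparameters here are illustrative, not the winner's configuration.

```python
import numpy as np
import optuna
from statsmodels.tsa.arima.model import ARIMA
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
# Illustrative monthly CPI-like series and some exogenous features (e.g. fuel prices).
y = np.cumsum(rng.normal(0.4, 0.3, 120)) + 100
exog = rng.normal(size=(120, 3))

train_y, valid_y = y[:108], y[108:]
train_X, valid_X = exog[:108], exog[108:]

def objective(trial):
    p = trial.suggest_int("p", 0, 3)
    q = trial.suggest_int("q", 0, 3)
    w = trial.suggest_float("blend_weight", 0.0, 1.0)

    # Statistical component: ARIMA on the raw series.
    arima_pred = ARIMA(train_y, order=(p, 1, q)).fit().forecast(steps=len(valid_y))

    # ML component: gradient-boosted trees on exogenous features.
    gbm = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
    gbm.fit(train_X, train_y)
    gbm_pred = gbm.predict(valid_X)

    blended = w * arima_pred + (1 - w) * gbm_pred
    return float(np.sqrt(np.mean((blended - valid_y) ** 2)))  # RMSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```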

Energy Forecasting

In AIcrowd’s CityLearn Challenge, one of the official NeurIPS 2023 competitions, the forecast track asked participants to predict building loads, grid carbon intensity, and solar generation up to 48 hours ahead of time, for each building in a synthetic single-family neighbourhood.

The winner of this competition used an online forecasting approach, with a model based on a simple feed-forward deep neural net, and CMA-ES, an evolutionary algorithm, for hyperparameter optimisation.
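CMA-ES is available in the pycma package; here is a small sketch of hyperparameter search with its ask/tell interface. The objective function is a stand-in for training and evaluating the forecaster, and the parameter names are illustrative.

```python
import cma
import numpy as np

def forecast_error(params: np.ndarray) -> float:
    """Stand-in objective: in practice this would train/evaluate the forecaster
    with the given hyperparameters (e.g. learning rate, hidden size, lookback)."""
    lr, hidden, lookback = params
    return (lr - 0.01) ** 2 + (hidden - 1.5) ** 2 + (lookback - 2.0) ** 2

# Initial guess and step size for three (continuous) hyperparameters.
es = cma.CMAEvolutionStrategy([0.1, 1.0, 1.0], 0.5)

while not es.stop():
    candidates = es.ask()                          # sample a population
    fitnesses = [forecast_error(np.asarray(c)) for c in candidates]
    es.tell(candidates, fitnesses)                 # update the search distribution

print("best hyperparameters:", es.result.xbest)
```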

Weather Forecasting

Weather4cast 2023, another official NeurIPS competition, involved precipitation prediction over three tracks: Core (8h prediction, 7 previously seen regions), Nowcasting (4h prediction, 7 previously seen regions), and Transfer Learning (4h prediction, including an unseen year and 3 previously unseen regions).

Much of the difficulty in this task comes from the sparsity of rainfall events, significant variation in the magnitudes of these events when they do happen, and the use of a regression loss, requiring models to be accurate about the amount of rainfall as well as its likelihood.

A team from Alibaba Cloud won both the Core and Nowcasting tracks, and came second in the Transfer Learning track. The Transfer Learning track was won by a team from Nanjing University, who came second in Nowcasting and third in the Core track.

The team from Alibaba Cloud innovated on WeatherFusionNet, the approach used by the winners of Weather4cast 2022. WeatherFusionNet uses a combination of U-Nets and a (recurrent, convolutional) PhyDNet. The Alibaba Cloud team added a ConvLSTM module and built an ensemble of these learners.

The Nanjing University team also used a U-Net as the base of their model, along with temporal frame interpolation, a multi-level Dice loss, and thoughtful cropping and augmentation.

Papers describing the winning strategies in more detail are linked to from the competition website.

Financial Forecasting

The M6 Competition involved predicting returns of financial instruments, and constructing a portfolio to maximise risk-adjusted returns. Participants could use any mix of automated or manual judgement-based strategies, and submitted predictions and allocations rather than code. We describe this competition and the previous M competitions in much more detail in the separate notable competitions section.

The winner of the M6 forecasting track has a description of his approach on his blog, and some slides on the MOFC website. He used a Bayesian approach, fitting a factor model using Markov Chain Monte Carlo with the PyMC library.

The Bayesian approach turned out to be particularly useful when DRE, one of the 100 instruments in the M6 competition, was de-listed in October following its acquisition. When the competition organisers set “future returns” of DRE to 0, all that was required was to “set DRE’s return to zero in each of the 4000 samples before taking the quantile probabilities”. This is one example where a competition includes some of the complexity and messiness of the real world that is absent in pure supervised learning problems on clean datasets.
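To make the quoted fix concrete, here is a small numpy sketch of that post-processing step: given posterior samples of the instruments' returns, zero out the de-listed instrument and recompute the rank-quintile probabilities that the M6 forecasting track required. Array shapes, the instrument index, and the sampling are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_assets = 4000, 100
returns = rng.normal(0, 0.05, size=(n_samples, n_assets))  # posterior return samples

dre_idx = 42                # illustrative index of the de-listed instrument
returns[:, dre_idx] = 0.0   # organisers set DRE's "future return" to 0

# For each posterior sample, rank the assets and assign them to quintiles 1-5,
# then average over samples to get each asset's quintile probabilities.
ranks = returns.argsort(axis=1).argsort(axis=1)   # 0 = worst return in that sample
quintiles = ranks * 5 // n_assets + 1             # values in {1, ..., 5}
quintile_probs = np.stack(
    [(quintiles == q).mean(axis=0) for q in range(1, 6)], axis=1
)  # shape (n_assets, 5); each row sums to 1
print(quintile_probs[dre_idx])
```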

The winner of the investment decisions track generated forecasts using his own AutoTS library, an application of AutoML to time-series data, automatically building ensembles of many different types of models. In a pre-print paper shared with us, he highlighted the importance of optimising separately for the ranked-probability score (RPS) and expected investment returns used in the two tracks of the competition. In his portfolio construction, he also considered a level of meta-uncertainty — generating allocations that were robust over a set of forecasts made using different hyperparameters. More detail on his personal blog and in his presentation slides.

The overall duathlon winner used a meta-learning approach for his forecasting solution. The meta-model, implemented in R using torch, is trained to “identify the most appropriate parametric model for a given family of related prediction tasks”. For the investment decisions track, he used a rank optimisation method, explicitly aiming to maximise the probability of ranking highly in the competition — more on this further down in the design and strategy section. The code for this solution, as well as links to papers describing the approach, can be found in the winner’s GitHub repository.

Manual adjustments

A recurring theme in some of the top forecasting solutions was the temptation to supplement automated forecasts with subjective opinion-based judgements. Although the M6 pre-print paper23 states that “judgment-informed forecasting approaches (albeit, very few) perform on par (if not better) compared to pure data-driven approaches”, the winners of both the M6 forecasting track and M6 decisions track had cautionary anecdotes about their forays into overriding automated forecasts.

On one occasion, a market decision was subjectively modified based solely on author bias to favor an increased ‘buy’ weighting to AMZN which had recently seemingly underperformed in the market. This proved deleterious as in the next interval there was an even greater loss in returns for the targeted stock.
— Colin Catlin, M6 decisions track winner (unpublished pre-print paper)

Tabular Data

Dataframes

When it comes to Python DataFrames, Pandas still dominates… but there are signs of change.

This is the first year we found any competition winners using Polars — the dataframe library written in Rust, created in 2020.

Polars: advantages and disadvantages

The main benefit of Polars is its improved performance — it’s significantly faster than Pandas for many operations, partly due to its ability to parallelise work across multiple CPU cores.

The main downside is that the majority of community code and examples are in Pandas, and Polars uses a different API. This is a double-edged sword, since Pandas’ API has the burden of 15 years of legacy code to support, and Polars was able to design a new API without those constraints.

The approach that many seem to be taking is to introduce Polars incrementally into an existing codebase. Polars dataframes use Arrow Tables as their back-end, and since Pandas 2.0 added the option to also use Arrow Tables instead of the standard NumPy back-end24, it’s possible to convert between Pandas and Polars dataframes without copying any data. This means that with some profiling, users can convert the slowest-running operations to be run in Polars instead of Pandas, without necessarily changing their entire codebase to use Polars.25
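A small sketch of that incremental pattern: converting a Pandas dataframe to Polars for a heavy groupby, then back again. The column names and the operation are illustrative, and whether the conversion is truly zero-copy depends on the dtypes involved; the API shown assumes recent Pandas (2.x) and Polars versions.

```python
import numpy as np
import pandas as pd
import polars as pl

# A Pandas dataframe, converted to the Arrow-backed dtypes added in Pandas 2.0.
df_pd = pd.DataFrame({
    "user": np.random.randint(0, 1_000, 1_000_000),
    "value": np.random.rand(1_000_000),
}).convert_dtypes(dtype_backend="pyarrow")

# Hand the data to Polars for the slow part of the pipeline...
df_pl = pl.from_pandas(df_pd)
agg = (
    df_pl.group_by("user")
    .agg(pl.col("value").mean().alias("mean_value"))
    .sort("user")
)

# ...and back to Pandas (keeping Arrow-backed columns) for the rest of the code.
result_pd = agg.to_pandas(use_pyarrow_extension_array=True)
print(result_pd.head())
```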

One caveat with this approach is that scikit-learn and other Python machine-learning libraries tend to have great native support for NumPy-backed Pandas, but don’t yet support Polars or other dataframes using Arrow as a backend.

We found three winning solutions using Polars: Kaggle’s Student Performance Prediction, Kaggle’s Sleep State Detection, and one member of the winning team for AIcrowd’s Amazon KDD Cup 2023.

All three of these winning solutions also made use of Pandas to some extent. One of the teams mentioned that they “made use of Polars because of the CPU constraints and to simply learn it.”

The Amazon KDD Cup 2023 team built an ensemble from components developed separately by each of its five members. As well as Pandas and Polars, they also made use of NVIDIA’s RAPIDS GPU-accelerated tools such as cuDF and cuML, and NVIDIA’s Merlin recommender system library. This is perhaps unsurprising, since the team was made up of NVIDIA’s Merlin team and three of NVIDIA’s resident “KGMON” Kaggle Grandmasters (Chris Deotte, Jean-Francois Puget, and Kazuki Onodera). They made an impressive showing at the KDD Cup in 2023, winning all three tracks.

Polars’ growth is part of two wider trends:

  1. Rust is increasingly being used to build high-performance tools for Python — with the now widely-adopted Ruff linter being a great example of this.
  2. The Apache Arrow project has been working on its cross-language columnar data format since 2016. With buy-in and contributions from Wes McKinney (the creator of Pandas) and Hadley Wickham (the developer of R’s popular tidyverse packages), Apache Arrow has had ambitious goals since its inception.

Aside from the KDD Cup team mentioned above, we didn’t find any uses of other dataframe libraries such as Dask, cuDF/cuML, Vaex, or Modin. 60 of the winning solutions we found used NumPy, and 56 used Pandas, making these the two most commonly imported packages across all the winning solutions.

Gradient Boosted Trees

We wrote extensively about gradient-boosted decision trees (GBDTs) in our 2022 report, and in our separate piece on tabular data.

Not much changed in 2023. GBDTs are still commonly used in winning solutions, and often in ensembles together with neural-net models.

All three of the main GBDT libraries are commonly used, and there’s not much difference in popularity between them. LightGBM remains the most popular, with 10 winning solutions using it. XGBoost was second, with 8 winning solutions, and we found 7 winning solutions using CatBoost.
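The three libraries expose very similar scikit-learn-style interfaces, which makes ensembling them straightforward; here is a minimal sketch with illustrative data and an equal-weighted blend (not any particular winner's ensemble).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

X, y = make_regression(n_samples=5000, n_features=20, noise=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

models = [
    LGBMRegressor(n_estimators=500, learning_rate=0.05),
    XGBRegressor(n_estimators=500, learning_rate=0.05),
    CatBoostRegressor(iterations=500, learning_rate=0.05, verbose=0),
]
preds = [m.fit(X_tr, y_tr).predict(X_va) for m in models]

# Equal-weighted average of the three GBDT libraries' predictions.
ensemble = np.mean(preds, axis=0)
rmse = np.sqrt(np.mean((ensemble - y_va) ** 2))
print(f"ensemble RMSE: {rmse:.3f}")
```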

Robotics

This year we interviewed organisers and participants of several robotics competitions — including those at NeurIPS and ICRA.

In general, robotics brings many additional complications that aren’t present in standard ML competitions. The F1Tenth autonomous racing competition showed the value of experience, with the winning teams having competed several times before and improved with each iteration. The competition hall’s unpredictable physical environment, with its visual and audio noise, wireless connection interference, sometimes uneven floor surfaces, and variable lighting, can introduce issues not experienced in a more controlled lab setting.

Robotics Competitions at ICRA 2023

We discuss winning robotics solutions in depth in our separate post on robotics competitions from ICRA 2023, following a week of in-person interviews.

Robotics competitions are even more heterogeneous than software-only competitions. Across both ICRA and NeurIPS, a variety of techniques were competitive — including traditional optimal control techniques, and both model-based and model-free reinforcement learning.

Compute and Hardware

In this section we analyse the compute resources used by competition winners.

Throughout 2023, top research labs were competing for capacity on NVIDIA’s latest generation of data-centre GPUs to train large general-purpose foundation models, as training FLOPs for cutting-edge AI systems continue to grow exponentially.26

Competitions, on the other hand, tend to focus on more narrowly-scoped applications, and participants can benefit from those foundation models by fine-tuning them for a specific task on relatively small amounts of data. This has been the case for a number of years already, with BERT-like models for NLP and various pre-trained vision models.

Still, modern ML competitions are comparatively compute-hungry. We found that over 70% of winners used a GPU to train their model, and many of these used powerful data-centre GPUs rather than cheaper consumer-oriented ones.

Most competitions aim to attract many participants, including those with more limited compute resources. The limited-resource nature of competitions can also make their findings more generally useful — such as the LLM Efficiency challenge, evaluating LLM fine-tuning on consumer hardware.

Where the nature of a task makes it difficult to do research in a compute-constrained environment, some competition organisers provide compute grants to participants27.

Hardware Used by Winners

We continue to see a diverse mix of hardware types and ownership models. Solutions for tabular data or time-series forecasting such as ARIMA, gradient-boosted decision trees, or simpler ML methods like linear regression, are often just trained on CPU-only hardware (from relatively budget desktop processors to the 64-core Threadripper 3990X).

Deep learning models, such as those used for computer vision and NLP tasks, tend to benefit more from special-purpose hardware, and unsurprisingly we found almost all winners for these types of competitions making use of GPUs or TPUs.

This was the first year we found winners using TPUs for training their solutions: one through a Kaggle Notebook, and the other through a paid Google Colab subscription. Others also used Kaggle Notebooks or Colab for accessing GPUs.

Several winners mentioned experimenting on local consumer GPUs while developing their models, and using more powerful cloud-based GPUs for their final training runs.

Once again, all GPU models used by winners were NVIDIA GPUs. While AMD’s offering has been improving, especially for inference28, adoption is still lagging.

We didn’t find any examples of competitors using accelerators other than CPUs, TPUs, or NVIDIA GPUs for training their models.

Accelerator Models

The most popular accelerator used by winners continues to be the A100, just like in 2022. For all but one of the winning solutions using A100s we were able to confirm that they were using the older 40GB model.

Second-most popular is the P100 model, which is available for free in Kaggle Notebooks. 29

Third-most popular, the RTX 3090 is the top-end retail/gaming GPU from the previous generation (the current-generation equivalent being the RTX 4090, which was used in one winning solution).

The main notable absence is the H100, the successor to the A100. This has been the chip of choice for companies training foundation models — with Meta planning to own over 350,000 H100s by the end of 2024. 30 The popularity of the H100 for foundation models likely impacted its availability, as capacity throughout much of 2023 was limited.

Dataset Size

In addition to specific hardware models used, dataset size and training time are useful proxies for the amount of compute required for competitions.

As was the case last year, training dataset sizes span at least five orders of magnitude. The competition in 2023 with the most training data was the Vesuvius Challenge, with almost 8TB of data that was publicly released.

The dataset size we track is just that of the data provided by competition organisers — many competitions allow participants to also train on data from a selected set of sources, or even any publicly available data.

Training Time

Training time also varies significantly. In a surprising number of competitions, simple models trained in under an hour (in some cases on consumer hardware) can win. This seems to be the case particularly in forecasting challenges.

In many other competitions, expensive hardware seems to provide a significant advantage.

Cost and accessibility

The use of code competitions is a step towards standardising resources available to participants. In a code competition, participants submit inference code which runs in a limited-resource environment on the competition platform, rather than directly submitting predictions. Still, access to additional resources at train-time can make it easier to iterate more quickly, run more experiments, and use more compute-intensive methods, which can present an advantage.

In some competitions, winners trained on setups costing thousands or tens of thousands of dollars, including one training server with 8 RTX A6000s, and an NVIDIA DGX-1 with 8 V100s.

The NVIDIA KGMON team who won the KDD Cup 2023 used 8 V100 GPUs for at least six full days, which would cost over $1,000 to run on on-demand cloud compute, just for the final training run.31

The winner of DrivenData’s BioMassters competition used 2 A100s and the final training run took 8 days, which would cost over $500 using on-demand cloud compute.32

The winner of Kaggle’s Bengali.AI Speech Recognition competition had access to 8 RTX A6000 GPUs, which would cost around $8/h in on-demand cloud compute, or around $40k to own33.

Team Demographics

This year, more than half the winning teams we found were made up of just a single individual.

Winning Team Sizes

While forming a larger team can be helpful — such as in the case of the NVIDIA KGMON team who split up their pipeline into separate components, which they worked on independently — dedicated individuals can clearly compete successfully without teammates.

Kirill Brodt, who won DrivenData’s BioMassters competition and has won prizes in several other previous competitions, shared with us that in five previous competitions where he competed solo and came either first or second, he mostly spent between 10 and 40 hours working on his solution.34

Repeat Winners

As also seen in last year’s report, more than half of the winning teams were first-time winners.

At the other end of the spectrum, serial winners (2+ previous wins) made up roughly a third of winners in 2023. These include both corporate and independent teams and individuals.

Design and Strategy

As much as competitions aim to be realistic environments for testing models and algorithms, the practicalities of running a competition with pre-defined evaluation metrics can create opportunities for participants to outperform by adopting strategies targeted at winning the competition itself, strategies which don’t always translate to a real-world environment.

We detail some examples of these below — including leaderboard probing, explicitly optimising for win probability rather than the competition metric, and building on winning solutions from previous years.

Optimising to win

Filip Staněk, the winner of the M6 Duathlon track, wrote a note describing his strategy for the investment decisions part of the competition, where he optimised explicitly for the probability of winning the competition rather than targeting the best expected investment returns. This is an inevitable result of a competition environment, where prizes are given only to top-ranked participants, creating an asymmetric risk profile (large positive returns are very good, while large negative returns are only as bad as mediocre returns in the competition setting).

This portfolio strategy can attain a comparable probability of winning as a participant capable of consistently generating approximately double the market returns. However, it exhibits poor performance in expectation. This highlights that the task of succeeding in such a competition may not always coincide with the task of maximizing expected investment returns.
— Filip Staněk, M6 Duathlon winner, paper abstract

Building on previous solutions

There’s an element of “work smarter, not harder” in some winning solutions, which isn’t always present in academic ML research.

We noted last year that the winner of Kaggle’s March Madness competition for predicting college basketball results used the exact code shared by the winner of a similar competition in 2018. Unbelievably, this happened again this year! (albeit with a Python version of the code, rather than the R version used last year, and some minor tweaks)

My submission was essentially the @raddar code from a few years back… My contribution was limited to 1) commenting out the np.exp() line when calculating Team Quality as it ended up returning quite a few inf values and 2) No overrides in match-ups of seeds 1-4 against seeds 13-16.
— RustyB, March ML Mania 2023 winner, Kaggle Discussion

Leaderboard Probing

In Kaggle’s GoDaddy Microbusiness Density Forecasting competition, participants had to make monthly predictions on noisy U.S. county-level data. Participants were given historical data, and the public leaderboard contained slightly more recent data. The ultimate winner used a linear regression model (having ruled out various other models), and got an edge from probing the public leaderboard to get more recent information about a few of the most volatile counties in the dataset — which helped them make better forecasts for those counties on later data.

But leaderboard probing doesn’t always work out! In Kaggle’s Identifying Age-Related Conditions competition, performance on the public leaderboard did not transfer to the final private leaderboard score — with the eventual winner jumping 2089 places from public to private leaderboard. Grandmaster Chris Deotte noted that in his submission, “using probed positive targets from public LB actually hurt my private LB score”.

Some competitors identified this as a possibility early on in the competition35, showing the value in paying attention to the discussion forums.

Notable Competitions

In this summary we highlight three interesting competitions: the Vesuvius Challenge, the M6 Financial Forecasting Competition, and ARCathon.

Vesuvius Challenge

Nat Friedman announced the Vesuvius Challenge in March 2023, with $250k in funding, to build on prior work by Dr. Brent Seales. Within a week of the announcement, several other private backers stepped forward to bring the total prize pool to $1m.

The challenge: decipher text hidden in papyrus scrolls which were carbonised when Mount Vesuvius erupted in 79 AD — almost two thousand years ago.

We’re using a particle accelerator and AI to read a lost library from a dead empire. People have been trying to read the Herculaneum Papyri for 275 years. With your help, we’ll do it in 2023.
— Nat Friedman, 15 March 2023, Twitter

Over the past few centuries, many of these scrolls — the Herculaneum Papyri — were destroyed as a result of attempts to open them. Their fragile carbonised state makes them almost impossible to open. One Italian monk spent several decades painstakingly unrolling a few of them, which were found to contain Greek philosophical texts.

An illustration of the villa where the scrolls were found
One of the carbonised scrolls
A fragment of a scroll.
Dr. Seales and his team at the Oxford Diamond Light Source
Images from scrollprize.org, reproduced with permission.

The Vesuvius Challenge launched with 5.5TB of data from non-invasive X-ray tomography scans of two of these scrolls, as well as more than 2.4TB of scanned fragments of other scrolls. The challenge consisted of multiple smaller prizes followed by a $700,000 grand prize for anyone who could reveal at least four passages within the scrolls. Kaggle hosted the $100k Ink Detection Prize. There were also several sub-prizes for building tools, fostering a huge global collaborative effort.

Following success in the ink detection prize, the first letters on these scrolls were identified in October by Luke Farritor and Youssef Nader, building on prior work by Casey Handmer.

The deadline for the grand prize was the 31st of December 2023, and in February 2024, following reviews by technical and papyrological teams, the grand prize was awarded to Youssef Nader, Luke Farritor, and Julian Schilliger. After each individually being successful at previous components of the Vesuvius Challenge, they teamed up and combined their approaches. Their winning solution is now public on GitHub.

This is a momentous achievement: the Vesuvius Challenge has managed to unroll and read around 5% of one of the scrolls. Some transcriptions and translations of parts of the discovered text are already available on the Vesuvius Challenge website.

And they’re not stopping here. According to their recently-published Master Plan, this is the end of stage one. Stage two will be to improve automation, and read 90% of the four scrolls that have been scanned. That will form the basis of the 2024 Grand Prize.

In stage three, they expect to spend 2-3 years scanning and reading all (~800) excavated scrolls. After that, stage four would be to restart excavation of the Villa dei Papiri, hoping to find a larger library in a deeper part of the villa. If stages two and three go well, the research spawned by the uncovered text should provide a powerful catalyst for further excavation.

The Vesuvius Challenge is a great example of important real-world research being accelerated by competitive ML. Deciphering the hundreds of excavated scrolls could result in a significant increase in the amount of available text from antiquity, and the first step towards that has been a success.

M6: Financial Forecasting

The M6 Financial Forecasting Competition is the sixth in a series of forecasting competitions run by the Makridakis Open Forecasting Center (MOFC), named after its founder Spyros Makridakis. The first such “M Competition” was run in 1982. The M Competitions are particularly rigorous and extensive, and each iteration tends to focus on a particular aspect of forecasting.

History of M Competitions

The M Competitions provide a great barometer of the changing state-of-the-art in time-series forecasting methods, and an interesting shift towards ML methods occurred between the M4 and M5 competitions.

M1 (1982)
Micro, macro, industry, demographic, monthly-yearly.
1,001 timeseries; 24 methods
Focus/data: Various: micro, macro, industry, demographic; monthly and yearly.
Leading methods: Exponential smoothing
Participants: Selected experts.
Findings: “If the forecasting user can discriminate in his choice of methods depending upon the type of data (yearly, quarterly, monthly), the type of series (macro, micro, etc.) and the time horizon of forecasting, then he or she could do considerably better than using a single method across all situations”
More: M1 Competition Paper

M2 (1987-1991)
Building on M1, and incorporating private data shared by 4 companies.
29 timeseries; 18 methods
Focus/data: US macro-economic data, and internal data shared by four companies (mostly sales data).
Leading methods: Exponential smoothing
Participants: Selected experts.
Findings: “The most striking outcome of the M2-Competition is the good and robust performance of exponential smoothing methods and in particular that of Dampen and Single smoothing”; “the less the randomness of the series the better the relative accuracy of the more sophisticated methods used by the forecasters.”
More: M2 Competition Paper

M3 (2000)
Extending M1 and M2; the first to be open to all.
3,003 timeseries; 24 methods
Focus/data: Various types (micro, industry, macro, etc.) and different intervals (yearly, quarterly, etc.)
Leading methods: Theta (decomposition), ForecastPro (expert system), Robust Trend (non-parametric exponential smoothing)
Participants: Open to everyone.
Findings: “The accuracy of the combination of various methods outperforms, on average, the specific methods being combined and does well in comparison with other methods.”
More: M3 Competition Paper

M4 (2018)
Many more timeseries and methods than previous competitions.
100,000 timeseries; 61 methods; €27,000 prize pool
Focus/data: Various frequencies (hourly up to yearly), and diverse data. “The 100,000 time series used in the M4 were selected from a database… that contains 900,000 continuous time series, built from multiple, diverse and publicly accessible sources.” Participants submitted point forecasts as well as prediction intervals.
Leading methods: Meta-learning, and ensembles of statistical and ML methods
Participants: Open to everyone.
Findings: “all of the top-performing methods, in terms of both [point forecasts] and [prediction intervals], were combinations of mostly statistical methods, with such combinations being more accurate numerically than either pure statistical or pure ML methods.” While hybrid approaches (combining ML and statistical methods) performed very well, the four submitted pure ML approaches “definitely performed below expectations… all [pure] ML methods submitted were less accurate than [a simple baseline] benchmark”
More: M4 Paper | IJF special issue

M5 (2020)
Walmart data, run on Kaggle, evaluating accuracy and uncertainty.
42,000 timeseries; 5,507 methods; $100,000 prize pool
Focus/data: Retail data (Walmart)
Leading methods: Mostly ensembles of LightGBM models, some ensembles of deep neural nets.
Participants: Open to everyone.
Findings: “M5 was the first competition where all of the top-performing methods were both ‘pure’ ML approaches and better than all statistical benchmarks and their combinations.”
More: M5 Accuracy Paper | M5 Uncertainty Paper | IJF Special Issue

M6 (2022-2023)
Financial forecasting and portfolio construction.
100 timeseries; 226 methods; $300,000 prize pool
Focus/data: Financial forecasting and portfolio construction
Leading methods: Various: meta-learning, Bayesian methods, large ensembles.
Participants: Open to everyone.
Findings: The majority of participants underperformed against simple baselines, but a small minority managed to outperform. There was a weak link between teams’ forecasting accuracy and investment performance: “the teams that submitted the best forecasting submissions did not perform similarly well in terms of investment decisions and vice versa.”
Paper: Pre-print paper

The M4 competition paper36, published in 2018, noted that “all [pure] ML methods submitted were less accurate than [a simple baseline] benchmark”. Only a few years later, both tracks of the M5 competition were won by LightGBM models, and the M5 Accuracy paper37 stated that “M5 was the first competition where all of the top-performing methods were both ‘pure’ ML approaches and better than all statistical benchmarks and their combinations.”

Some of the success of ML methods was attributed to “cross-learning” — training one ML model on multiple time-series as opposed to training separate instances of a traditional statistical model on each of the given time-series. This was particularly relevant in the M5 competition, where the time-series were aligned and highly correlated with each other.

Another conclusion that came out of several of the competitions is that the “best” method depends on the frequency of data, on the metric chosen, and on the type of data — particularly the amount of noise present.

M6 Competition

The M6 Competition, which finished in February 2023, focused on financial time-series and was the first M Competition to use live evaluation on real, public data. It also had the largest prize pool of any M Competition so far, at $300,000.

While there have been several financial forecasting competitions in recent years (see our summary of financial forecasting competitions in 2022), this one is unique in several aspects, including its almost year-long evaluation period — roughly four times as long as the usual three months in other competitions.

There were two components to the M6 Competition, measured separately:

  1. Forecasting: predict the ranking of the 4-weekly returns of 100 different exchange-traded financial instruments (50 US stocks, 50 international ETFs).
  2. Decision-making: allocate capital across the 100 instruments to maximise future risk-adjusted returns.

This competition required sustained commitment from participants. Of the 163 teams who entered both the forecasting and decision-making tracks, only 26 made full original submissions at all 12 submission points (in the case of a missed submission, the previous submission was carried forward to the current period).

In fact, “all five winners in the forecasting track and four of the five winners in the investment track updated their submission at every single round, while the same was true for three of the duathlon winners”23 — suggesting that persistence is a highly desirable quality in a competition like this.

Comparing the teams’ performance against the simple (equal-weighted long allocation) investment benchmark showed just how hard this problem is: “although the vast majority of the teams (75%) have managed to construct less risky portfolios than the benchmark,… only 31% have realized higher returns and [information ratios].”

Some teams outperformed: “a small group of teams achieved… an impressive rate of return of about 30%”, but “the performance of the teams is rather symmetric around the mean in the sense that more than one fourth of the teams realized losses that exceeded 7%, reaching up to 46%.”

Before running the competition, the MOFC published ten hypotheses, and each of these is reviewed in the pre-print paper23 analysing competition results (the full M6 issue of the IJF will likely be published later this year). One interesting hypothesis that was confirmed predicted a weak link between forecasting and decision-making ability.

We corresponded with the #1 ranked participants in each of the forecasting, decision-making, and duathlon tracks to understand their approaches, and explore those in the timeseries forecasting solutions section.

The next M competition, M7, may launch later in 2024. Keep an eye on the website for updates.

ARCathon

The Abstraction and Reasoning Corpus (ARC) is a benchmark designed to measure AI skill acquisition. Rather than testing performance on a fixed set of known tasks, it expects algorithms to solve each task few-shot: correctly inferring the underlying transformation from a few given examples, and applying it to the remaining test example.

The 1,000 ARC tasks were manually created by François Chollet, who also created Keras. Since their creation in 2019, the training examples have been publicly released, but the test examples have remained carefully guarded.
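ARC tasks are distributed as small JSON files of coloured grids (integers 0-9). The sketch below shows the published format together with a toy "solver" that tests a single hand-written hypothesis against the training examples; it is purely to illustrate the data structure, not a real approach.

```python
import json

# The public ARC training tasks follow this structure (values are colour indices 0-9).
task_json = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]]}
  ]
}
"""
task = json.loads(task_json)

def solve(task: dict) -> list:
    """Toy 'solver': guess that the transformation reverses each row,
    and accept the guess only if it matches all training examples."""
    def reverse_rows(grid):
        return [row[::-1] for row in grid]

    if all(reverse_rows(ex["input"]) == ex["output"] for ex in task["train"]):
        return [reverse_rows(t["input"]) for t in task["test"]]
    raise ValueError("hypothesis does not fit the training examples")

print(solve(task))  # [[[0, 3], [3, 0]]]
```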

An example ARC task. Source: Lab42

The first challenge using ARC tasks was Kaggle’s Abstraction and Reasoning Challenge in 2020, where the winner reached 21% success on the test set. Most humans can solve 80% of ARC tasks.

After this, Lab42, a part of Swiss AI company Mindfire, partnered with François to run an ongoing challenge using ARC tasks, called ARCathon, committing around $1,000 in prize money for every 1% improvement on the state-of-the-art. As of the end of 2023, the best participant-submitted ARCathon solution got a 30% success rate, and an ensemble of solutions put together by the competition organisers got up to 31%.

Interestingly, one of the joint winners of the 2023 ARCathon used LLMs as a core part of their solution. This and other leading solutions are described on Lab42’s winners page.

Lab42 have a useful ARC Playground tool on their site, allowing users to attempt existing ARC tasks, as well as a task editor to create new ones for a larger, crowdsourced set of ARC tasks.

Looking Ahead

Platform Developments

Platforms are continuing to invest in their product offerings as the ML competition space grows. Across the board, platforms are adding features that enable more effective learning, better knowledge sharing, new types of competitions, deeper integration with pre-trained models, and other improvements for competition organisers, sponsors, and participants.

A few highlights:

Upcoming Competitions

We saw general AI excitement reflected in the competitions space in 2023, but that was just the start. Many big competitions are launching in the coming year, ranging from a continuation of the Vesuvius Challenge to competitions in mathematics and automated fixing of security vulnerabilities.

We expect there to be several interesting reinforcement learning (RL) competitions in 2024. The teams behind competitions like Lux AI, the Neural MMO Challenge, and the Melting Pot Challenge are building promising competition environments that manage to capture the complexity of real games while being more tractable for RL because of their improved computational efficiency and ease of parallelisation.

Vesuvius Challenge 2024

As mentioned in the Vesuvius Challenge section, for stage two the $100k 2024 Grand Prize will go to the first team able to read 90% of the four scanned scrolls. There are a number of other progress prizes, with a total of over $500k in prize money available in 2024.

There will also be a celebration of the success of the Vesuvius Challenge 2023 at the Getty Villa Museum in LA on the 16th of March at 4pm — register here.

AI Mathematical Olympiad

The $10m Artificial Intelligence Mathematical Olympiad (AIMO) Prize, run by the algorithmic trading firm XTX Markets, intends to spur the open development of AI models that can reason mathematically, leading to the creation of a publicly-shared AI model capable of winning a gold medal in the International Mathematical Olympiad (IMO).

The grand prize of $5m will be awarded to the first publicly-shared model that can perform at a standard equivalent to an IMO gold medal, and an additional $5m has been set aside for a series of progress prizes.

This seems to be an area where recent developments in generative models will prove to be highly valuable — soon after the competition was announced, DeepMind published a paper in Nature announcing AlphaGeometry, “a neuro-symbolic system… approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist.”38

In the spirit of openness, the competition rules require winning systems to be reproducible — not just to have their inference code be open, but also their training procedure and training data.39

AI Mathematical Olympiad: public sharing rules

Our initial view is that, in the interest of advancing scientific knowledge, the AI models should be reproducible by any third party with sufficient resources. In particular, the training data, training script and final model (architecture and corresponding weights) should be made public. If the training data is too large to provide, then alternatively the procedure used to construct the training data should be provided. We will determine the licences under which AI models would need to be shared in order to be eligible for prizes. This may include a requirement to publish open-source for non-commercial uses and on a royalty-basis for commercial uses.

The first $1m AIMO progress prize, to be hosted on Kaggle, was announced on the 23rd of February 202440. For more information or to take part, go to the AIMO website.

AI Cyber Challenge

DARPA’s AI Cyber Challenge (AIxCC) is a bold effort to develop automated systems that can find and fix security vulnerabilities in software. Like previous DARPA challenges, it is extremely well-funded, with a total of $29.5m in prizes to be awarded to participating teams, including $7m to support small-business teams in developing their solutions, and $2m for each of the winning teams in the semifinals at DEF CON 32 in August 2024.

DARPA’s partners — Anthropic, Google, Microsoft, and OpenAI — are providing participants with credits to their compute clouds and LLM APIs.

Teams have until the 30th of April 2024 to register by submitting a 5-page whitepaper describing their planned solution. Once built, solutions will be scored on diversity (how well they find different types of vulnerabilities), accuracy (with false positives penalised), discovery (how many vulnerabilities they find), and program repair (whether they can submit code patches that fix vulnerabilities without compromising functionality).

After this summer’s semifinals, the finals will take place at DEF CON in 2025, with a top prize of $4m.

For more info, go to the AIxCC website.

Benchmarking GenAI

There were several competitions incorporating elements of generative modelling in 2023: Kaggle’s LLM Science Exam and Image to Prompts competitions, the LLM Efficiency Challenge, Tianchi’s FT-Data Ranker, AIcrowd’s Hackaprompt challenge, Signate’s ChatGPT Challenge, and Solafune’s Finding AI-generated Images, among others.

Several new ones have already started in 2024, including Zindi’s Malawi Health Systems LLM Challenge, AIcrowd’s Generative Interior Design Challenge, and Kaggle’s LLM Prompt Recovery.

As more of these competitions take place, platforms and organisers are confronting the difficulty of measuring generative models in a systematic way. Some platforms already have infrastructure to support human evaluation at large scale, and most have run competitions involving panels of expert judges. Will other approaches be developed over time?

The competitive ML community has developed a suite of tools that can be used to more reliably measure the internet-scale foundation models driving generative applications, without the issues inherent in static benchmarks.41 To list a few relevant techniques:

We look forward to seeing the developments in this space over the coming months and years. We expect that competitions have an important role to play in evaluating the foundation models these generative solutions are being built on.
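As one concrete illustration of this kind of tooling: some benchmarks embed a unique canary string in every data file, so that copies which leak into web-scraped training corpora can be filtered out, or detected after the fact (see footnote 41). The sketch below is hypothetical, and the canary value is a placeholder rather than BIG-bench's actual string:

from pathlib import Path

# Placeholder canary value; real benchmarks use a long, globally unique string.
CANARY = "BENCHMARK CANARY 123e4567-e89b-12d3-a456-426614174000"

def contaminated_files(corpus_dir: Path) -> list[Path]:
    """Return corpus files containing the canary, i.e. likely benchmark leakage."""
    hits = []
    for path in corpus_dir.rglob("*.txt"):
        if CANARY in path.read_text(errors="ignore"):
            hits.append(path)
    return hits

# A data pipeline can drop these files before training; the same check can be
# run against a model's outputs to help diagnose whether benchmark data was memorised.
if __name__ == "__main__":
    for path in contaminated_files(Path("scraped_corpus")):
        print("canary found in", path)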

About This Report

If you would like to support this research, please check out our sponsors below, or subscribe to our mailing list.

About Our Sponsors

About ML Contests

For over four years now, ML Contests has provided a competition directory and shared insights on trends in the competitive machine learning space. To receive occasional updates with key insights like this report, subscribe to our mailing list.

If you enjoyed reading this report, you might enjoy these other articles:

images/Report Header.png
State of Competitive ML 2022

Last year's report, where we summarised results from 200+ competitions in 2022.

images/NeurIPS Logo.png
Competitions at NeurIPS 2023

An overview of the 20 competitions at NeurIPS 2023 in New Orleans.

images/robot_parade.png
Robotics competitions at ICRA 2023

A close look at the 12 robotics competitions at ICRA 2023, in London.

Acknowledgements

Thank you to Eniola Olaleye for help with data gathering, and to Peter Carlens, James Mundy, Alex Cross, and Maja Waite for helpful feedback on drafts of this report.

To Adrien Pavao, for sharing a draft of his PhD thesis43, and to Fritz Cremer, for sharing insights on Kaggle competitions.

To Professor Spyros Makridakis, and the Lab42 team, for taking the time to answer our questions and giving extra insight into the M Competitions and the ARCathon.

To the teams at AIcrowd, DrivenData, Grand Challenge, Hugging Face, Humyn.ai, Kaggle, Lab42, Onward, Solafune, Trustii, and Zindi, for providing data on their 2023 competitions.

Thank you to the competition winners who took the time to answer our questions by email or questionnaire: Aleksandr Shatilov, Andoni Irazusta Garmendia, Ben Swain, Colin Catlin, Filip Staněk, Francesco Morri, Harrison Karani Maina, Nasri Mohammed, Ning Jia, Olufemi Victor Tolulope, Plato, Tom Wetherell, Yasser Houssem Eddine Kehal, Yisak Birhanu Bule, Kirill Brodt, and others who preferred not to be named.

Thank you to the organisers of the Machine Unlearning, PFL-DocVQA, Big-ANN, LLM Efficiency, Weather4cast, Sensorium, Single-cell Perturbation Prediction, Lux AI, CityLearn, ROAD-R, Robot Air Hockey, and MyoChallenge NeurIPS competitions, and the F1Tenth, AQRC, PUB.R, BARN, METRICS HEART-MET, Manufacturing Robotics, Robotic Grasping and Manipulation, Human-Robot Collaborative Assembly, Humanoid Robot Wrestling, and Picking, Stacking and Assembly ICRA competitions, who took the time to speak to us virtually or in person.

Thank you to the organisers of the Vesuvius Challenge, for giving us permission to use their images.

Lastly, thank you to the maintainers of the open source projects we made use of in conducting our research and producing this page: Hugo, Tailwind CSS, Chart.js, Linguist, pipreqs, and nbconvert.

Methodology

Data was gathered from correspondence with competition platforms, organisers, and winners, as well as from public online discussions and code.

The following criteria were applied for inclusion, in line with the submission criteria for our listings page. Each competition must:

  1. have a total prize pool6 of at least $1,000 in cash or liquid cryptocurrency (BTC/ETH), or be an official competition at a top machine learning or robotics conference.
  2. have ended44 between 1 January 2023 and 31 December 2023 (inclusive).

In our written explanations, the word “competition” is often used to refer to an over-arching competition (e.g. “The M6 Competition”), while in our statistics we usually count sub-tracks each as their own competition (e.g. M6: Forecasting, M6: Decisions, M6: Duathlon, and M6: Student Prize).

When counting a “number of competitions” for purposes such as prize pool distribution, or popularity of programming languages, we use the following definition:
If a competition is made up of several tracks, each with separate leaderboards and separate prize pools, then each track counts as its own competition. If a competition is made up of multiple sub-tasks which are all measured together on one main leaderboard for one prize pool, they count together as one competition.

For the purposes of this report, we consider a “competition winner” to be the #1-placed team in a competition as defined above. We are aware that other valid usages of the term exist, with their own advantages — for example, anyone winning a Gold/Silver/Bronze medal, or anyone winning a prize in a competition. For ease of analysis and in order to avoid double-counting, we exclusively consider #1-placed teams in this report.

Compiling the Python packages section in the winning toolkit involved some discretion. While we attempted to highlight the most popular and interesting packages for readers, we did not simply take the n most popular packages. For example, Optuna, used in three winning solutions, is the most commonly used hyperparameter optimisation package, and we included it in our toolkit. Polars was also used in three winning solutions, but given that it has similar functionality to Pandas, which was used in 56 winning solutions, we did not include Polars in the winning toolkit.

For the sake of transparency, we include a list of all packages which were used in more than three winning solutions below, along with counts. Note that these counts don’t necessarily map to actual usages, and in some cases we did additional work to figure out actual usage. For example, while TensorFlow is imported in some files within ten different winning solutions, several of these didn’t actually make use of TensorFlow — e.g. ROAD-R or RSNA Screening. We did not have the research resources to go through all repositories in detail in this way; we investigated in cases where both TensorFlow and PyTorch were being imported in one solution, and conducted a few other spot checks. As such, the list below should be considered “raw”, without this extra usage analysis applied.
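As an indication of how counts like those listed below can be derived, here is a minimal sketch that scans each solution directory for top-level imports and counts each package at most once per solution. This is an illustrative approximation rather than our exact pipeline, and the directory layout shown is hypothetical:

import re
from collections import Counter
from pathlib import Path

# Matches the top-level package name in "import x ..." or "from x import ..." lines.
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)

def packages_in_solution(solution_dir: Path) -> set[str]:
    """Return the set of top-level packages imported anywhere in one solution."""
    found = set()
    for py_file in solution_dir.rglob("*.py"):
        text = py_file.read_text(errors="ignore")
        found.update(IMPORT_RE.findall(text))
    return found

# Count each package at most once per winning solution.
solutions_root = Path("winning_solutions")  # hypothetical: one directory per solution
counts = Counter()
for solution in solutions_root.iterdir():
    if solution.is_dir():
        counts.update(packages_in_solution(solution))

print(counts.most_common(20))

Notebook files would first need converting to plain Python (for example with nbconvert), and, as noted above, raw import counts still need manual checks to confirm that an imported package was actually used.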

List of packages
[
 ('numpy', 60),
 ('pandas', 56),
 ('torch', 49),
 ('tqdm', 41),
 ('sklearn', 38),
 ('matplotlib', 32),
 ('transformers', 22),
 ('scipy', 21),
 ('cv2', 15),
 ('PIL', 14),
 ('torchvision', 14),
 ('yaml', 13),
 ('albumentations', 11),
 ('wandb', 11),
 ('seaborn', 11),
 ('timm', 10),
 ('tensorflow', 10),
 ('lightgbm', 10),
 ('requests', 8),
 ('setuptools', 8),
 ('pytorch_lightning', 8),
 ('xgboost', 8),
 ('psutil', 7),
 ('catboost', 7),
 ('skimage', 7),
 ('datasets', 6),
 ('IPython', 6),
 ('joblib', 6),
 ('segmentation_models_pytorch', 6),
 ('loguru', 5),
 ('h5py', 5),
 ('mmcv', 5),
 ('nltk', 4),
 ('onnxsim', 4),
 ('onnx', 4),
 ('pycocotools', 4),
 ('numba', 4),
 ('huggingface_hub', 4),
 ('ray', 4),
 ('torchaudio', 3),
 ('onnxruntime', 3),
 ('thop', 3),
 ('tensorrt', 3),
 ('openvino', 3),
 ('polars', 3),
 ('google', 3),
 ('termcolor', 3),
 ('networkx', 3),
 ('accelerate', 3),
 ('einops', 3),
 ('optuna', 3),
 ('evaluate', 3),
 ('peft', 3),
 ('jieba', 3)]

Attribution

For attribution in academic contexts, please cite this work as

Carlens, H, “State of Competitive Machine Learning in 2023”, ML Contests Research, 2024.

BibTeX citation

@article{carlens2024state,
  author  = {Carlens, Harald},
  title   = {State of Competitive Machine Learning in 2023},
  journal = {ML Contests Research},
  year    = {2024},
  note    = {https://mlcontests.com/state-of-competitive-machine-learning-2023},
}

Updates and Errata

5 March 2024: Added year founded and number of users for Bitgrit (Overview of Competition Platforms)

7 March 2024: Changed “the winner of the M6 Competition’s forecasting track, who used R…” to “the winner of the M6 Competition’s duathlon track, who used R…” (Winning Toolkit)

2 April 2024: Added number of registered users for Tianchi (Overview of Competition Platforms).


  1. Broadly, we included only competitions which had either cash prize money of at least $1k or were affiliated with an academic conference. See our submission criteria for more detail. ↩︎ ↩︎

  2. The number of competitions and total prize money amounts are for competitions that ended in 2023. Prize money figures include only cash and liquid cryptocurrency (BTC/ETH). Travel grants and other types of prizes are excluded. Amounts are approximate — currency conversion is done at data collection time, and amounts are rounded to the nearest $1,000 USD. See Methodology for more details. ↩︎ ↩︎

  3. In addition to the three eligible competitions Codabench hosted in 2023, the platform was also used to run a power network operations challenge for power companies operating in the Paris region of France, with 7 participating teams and a €500,000 grant as prize. For more information on this “Learning to Run a Power Network” competition, see Adrien Pavao’s PhD Thesis (PDF). ↩︎

  4. CodaLab’s public instance moved from competitions.codalab.org to codalab.lisn.fr when the CodaLab platform transitioned from Python2 to Python3, and users were not automatically migrated to the new platform. The old public CodaLab instance had 128,000 users in October 2022 before it was closed to new competitions. The new public CodaLab instance had 55,767 users as of 5 March 2024. Source: CodaLab highlights page ↩︎

  5. “Total users: 94,732” as of 5 March 2024. Source: Grand Challenge statistics page ↩︎

  6. In general, we consider the available prize pool rather than the disbursed prize pool. In most cases the available prize pool is the same as the disbursed prize pool, but there are some exceptions. Most notably: the ARCathon competition has prizes tied to benchmark improvements, and while the available prize pool for the main competition in 2023 was CHF 69,000 (one thousand Swiss Francs per percentage point improvement between 31% and 100%), nothing was paid out in 2023 as there were no improvements by participants on the state-of-the-art ensemble solution. In other cases there can be discretionary bonus prizes which are not always awarded. We count the available prize pool, as it is not always possible to get reliable information on the disbursed prize pool. ↩︎ ↩︎

  7. “More than 80,000 high-level human resources are participating, including excellent data scientists at major companies and students majoring in the AI field. (as of February 2023)”. Source: Signate’s website, translated from Japanese using Google Translate on 5th of March 2024. ↩︎

  8. Tianchi: “approximately 1.4 million registered users”. Source: email correspondence with the Tianchi team on 1 April 2024. ↩︎

  9. We acknowledged this limitation in footnote 4 of our 2022 report. Because platforms like CodaLab and EvalAI allow anyone to list competitions, there isn’t a central platform source we can use for the number of competitions and total prize pool. In 2022 we did not have sufficient resources to exhaustively go through all of the listings on these sites, and had to rely on other sources. ↩︎

  10. It is also the case that in competitions with a higher barrier to entry, the participants who manage to surmount that barrier will tend to be more dedicated than in competitions where new submissions can be generated by making minor edits to another team’s solution. For this reason, we think aiming for a large number of submissions is not always the right goal, and we have removed the “typical entries” column from the competition platforms table. ↩︎

  11. The 65 solutions include 57 where source code was public, 6 where code wasn’t public but we got language/package data from winners completing our questionnaire, and 2 where language/package data was mentioned in the winners’ write-up. In cases where the same solution won multiple tracks (e.g. Weather4cast, KDD Cup), the solution is counted only once. ↩︎

  12. This competition was more of an optimisation problem than strictly a machine learning problem. The winners state: “Here, we searched for the optimal arm movement without additional reconfiguration cost by using dynamic programming (DP) and beam search with pruning… the entire DP table could be calculated in a few seconds when implemented with C++…”. Source: Kaggle solution write-up ↩︎

  13. The toolkit section broadly reflects the most-commonly-used packages for different ML-relevant purposes. Some discretion is used: for example, more general-purpose packages like requests and setuptools are excluded from the list. More details in the methodology section. ↩︎

  14. Of these, 8 used the pytorch-lightning package, and one used the newer lightning package. ↩︎

  15. Of over 26,000 developers surveyed, 24% said they use AI assistants “quite often” for generating code. 37% selected “from time to time”, 24% “rarely”, and 15% “never”. Source: JetBrains’ State of Developer Ecosystem 2023 report↩︎

  16. “Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%. " Source: Smith et al, ConvNets Match Vision Transformers at Scale↩︎

  17. “Our final ensemble is a blend of five 7B models and one 13B model. Each model uses a different context approach using different wikis, embeddings and topk. The pipeline fits precisely into 9-hour runtime and utilizes 2.5TB of input data.”, “We had most success with the following LLM models: Llama-2-7b, Mistral-7B-v0.1, xgen-7b-8k-base, Llama-2-13b”. Source: Kaggle solution write-up↩︎

  18. Languages where less training data is available are sometimes referred to as “low[er] resource languages”. See e.g. Low-resource Languages: A Review of Past Work and Future Challenges↩︎

  19. “The original Whisper tokenizer employs character-level tokens for low-resource languages, which can be time-consuming. So we trained a BPE tokenizer with 12,000 tokens specifically for Bengali text. We then replaced some tokens in the Whisper tokenizer with these. We carefully replaced tokens after the 10,000th position of the original Whisper’s tokens.” Source: Kaggle solution write-up↩︎

  20. “Our final submission only used a 4fold microsoft/deberta v3 large. “, “Carefully designed prompt to guide LLM to spit out the topic and topic text in his stomach”, “Another prompt used to generate ten summary of different quality for each additional topic”, “we tried as many open source LLM as well as chatgpt”. Source: Kaggle solution write-up↩︎

  21. Some participants noticed in advance that the public leaderboard dataset was not very diverse, and performance on that dataset was therefore less likely to generalise well to the private dataset: “If I am not mistaken, this must mean that public data only contains texts from grade 10, which makes it a very poor indicator of good model quality. " Source: Kaggle Discussion↩︎

  22. “You may use data other than the Competition Data (“External Data”) to develop and test your Submissions. However, you will ensure the External Data is publicly available and equally accessible to use by all participants of the Competition for purposes of the competition at no cost to the other participants. " Source: Kaggle competition rules↩︎

  23. M6 paper abstract: “The M6 forecasting competition, the sixth in the Makridakis’ competition sequence, is focused on financial forecasting. A key objective of the M6 competition was to contribute to the debate surrounding the Efficient Market Hypothesis (EMH) by examining how and why market participants make investment decisions. To address these objectives, the M6 competition investigated forecasting accuracy and investment performance on a universe of 100 publicly traded assets. The competition employed live evaluation on real data across multiple periods, a cross-sectional setting where participants predicted asset performance relative to that of other assets, and a direct evaluation of the utility of forecasts. In this way, we were able to measure the benefits of accurate forecasting and assess the importance of forecasting when making investment decisions. Our findings highlight the challenges that participants faced when attempting to accurately forecast the relative performance of assets, the great difficulty associated with trying to consistently outperform the market, the limited connection between submitted forecasts and investment decisions, the value added by information exchange and the “wisdom of crowds”, and the value of utilizing risk models when attempting to connect prediction and investing decisions. " Source: Makridakis, et al., The M6 forecasting competition: Bridging the gap between forecasting and investment decisions, arXiv. ↩︎ ↩︎ ↩︎

  24. Using Pandas with an Arrow back-end can be as simple as df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow"). For more details, see the pandas documentation or this post by one of the Pandas devs. ↩︎

  25. Another example of this from outside competitive ML: “During a recent Higher Performance Python course (privately run for a large hedge fund) we had a chat about the internal move towards Polars - some research groups were starting to try Polars with cautious success. Typically this worked in isolated places where Pandas had poor performance and where lots of RAM and many CPUs could more efficiently process the data in Polars. " Source: NotANumber Newsletter↩︎

  26. “We show that before 2010 training compute grew in line with Moore’s law, doubling roughly every 20 months. Since the advent of Deep Learning in the early 2010s, the scaling of training compute has accelerated, doubling approximately every 6 months. " Source: Sevilla et al., Compute Trends Across Three Eras of Machine Learning, arXiv. ↩︎

  27. For example, the ROAD-R and Big-ANN competitions provided $200-$1,000 of compute grants to select participants. ↩︎

  28. Some LLM inference benchmarks show AMD’s MI300X having lower latency and higher throughput than NVIDIA’s H100, while being cheaper. For example, on Llama-2 13B, the MI300X shows a 1.2x latency improvement over the H100, according to AMD’s benchmarks. “This is a big deal as OpenAI and Microsoft will be using AMD MI300 heavily for inference.” Source: Semianalysis newsletter↩︎

  29. T4s were added along P100s in 2022. “A P100 GPU will perform better on some applications and the T4x2 will perform better on others. For example, a P100 typically has better single-precision performance than a T4, but the T4 has better mixed precision performance, and you’ll have twice as much GPU RAM in the T4x2 configuration. " Source: Kaggle product announcement↩︎

  30. “I recently shared that by the end of this year we’ll have about 350k H100s and including other GPUs that’ll be around 600k H100 equivalents of compute. " Source: Meta Platforms, Inc. earnings call, 1 Feb 2024 (PDF). ↩︎

  31. With a median on-demand price of $0.89/h for a V100 (source: Cloud GPUs), 8 of them for 6 days would cost $1025.28. Compute details from the paper: “The code was run on DGX V100 workstation or on NVIDIA cluster nodes with up to 8 V100 GPU (DGX-1 machines). The CNN model in particular was quite compute intense, with more than 24 hours per language on a 8xV100 node.” Source: Deotte et al, Winning Amazon KDD Cup'23, OpenReview ↩︎

  32. “2 x GPU Nvidia A100 40 GB VRAM… 8 days for training” (source: winner’s solution directory). Using a median on-demand price of $1.5/h for an A100 (40GB). Source: Cloud GPUs↩︎

  33. “Trained on 8x 48GB RTX A6000” (source: Kaggle solution write-up). Using a median on-demand price of $0.99/h and a purchase price of $4,698 for an A6000. Rounding up from $37,584 to $40k, considering the cost of CPU, RAM, and other hardware required. Sources: Cloud GPUs, Amazon↩︎

  34. More precisely, in these 5 competitions he spent 9h45m, 14h39m, 30h2m, 40h8m, and 41h45m. These were all deep learning tasks; mostly computer vision. ↩︎

  35. For example, one post from @raddar mentioned: “cross-validation is useless, leaderboard is useless, models are validated on 1-2 data points”. Source: Kaggle Discussion↩︎

  36. M4 paper abstract: “The M4 Competition follows on from the three previous M competitions, the purpose of which was to learn from empirical evidence both how to improve the forecasting accuracy and how such learning could be used to advance the theory and practice of forecasting. The aim of M4 was to replicate and extend the three previous competitions by: (a) significantly increasing the number of series, (b) expanding the number of forecasting methods, and (c) including prediction intervals in the evaluation process as well as point forecasts. This paper covers all aspects of M4 in detail, including its organization and running, the presentation of its results, the top-performing methods overall and by categories, its major findings and their implications, and the computational requirements of the various methods. Finally, it summarizes its main conclusions and states the expectation that its series will become a testing ground for the evaluation of new methods and the improvement of the practice of forecasting, while also suggesting some ways forward for the field.” Source: Makridakis et al., The M4 Competition: 100,000 time series and 61 forecasting methods↩︎

  37. M5 Accuracy paper abstract: “In this study, we present the results of the M5 “Accuracy” competition, which was the first of two parallel challenges in the latest M competition with the aim of advancing the theory and practice of forecasting. The main objective in the M5 “Accuracy” competition was to accurately predict 42,840 time series representing the hierarchical unit sales for the largest retail company in the world by revenue, Walmart. The competition required the submission of 30,490 point forecasts for the lowest cross-sectional aggregation level of the data, which could then be summed up accordingly to estimate forecasts for the remaining upward levels. We provide details of the implementation of the M5 “Accuracy” challenge, as well as the results and best performing methods, and summarize the major findings and conclusions. Finally, we discuss the implications of these findings and suggest directions for future research.” Source: Makridakis et al., M5 accuracy competition: Results, findings, and conclusions↩︎

  38. AlphaGeometry abstract: “Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.” Source: Trinh et al., Solving olympiad geometry without human demonstrations↩︎

  39. Source: AIMO FAQs, 8.2 ↩︎

  40. The “pre-announcement” of the first progress prize describes a multiple choice Olympiad on 100 novel problems, to be hosted on Kaggle, with a total prize fund of $1,048,576 over up to three years. Source: r/AIMOprize↩︎

  41. Some benchmarks, such as BIG-bench, include a canary string in all their task definition files, intended both to help researchers filter out information about the benchmark from scraped text, and to allow for post-hoc diagnosis of whether BIG-bench data was used in training. Source: Srivastava et al., Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models ↩︎

  42. There will always be some leakage, since participants get information on the test set when they find out their score, but limiting submission frequency can slow down this process. ↩︎

  43. Titled “Methodology for Design and Analysis of Machine Learning Competitions”, it is now publicly available on Adrien’s website (PDF). It gives context on the history of competitions and benchmarks, as well as an overview of CodaLab and Codabench, and considerations on competition design for many different types of competitions. ↩︎

  44. In most cases, we take the submission deadline as the end date. The only exception to this is when a competition is judged on data which becomes available only after final submissions: for example, Kaggle’s Trading at the Close competition had a final submission deadline of 20th December 2023, after which 3 months of trading data is collected, and the competition officially ends on 20th March, 2024. This competition will be included in the 2024 edition of our report, to be published in 2025. Conversely, while the Vesuvius Challenge winners weren’t announced until February 2024, the submission deadline was 31 December 2023, and so it is included in this edition of our report. ↩︎