7 Mar, 2023
In our report on the state of competitive machine learning in 2022, we review the 200+ machine learning competitions that took place in 2022 and give an overview of winning solutions.
One debate which we only touch on briefly in the report is: what is the best type of model for tabular data problems?
As a quick reminder, by tabular data problems we mean any problem where the training data is structured as rows and columns of scalar numerical values — in contrast to problems where the data is structured as a series of images, or text, or interactions between an agent and its environment. Many “business ML” problems fall into the category of tabular data problems.
The main candidate models in recent years for tabular data are neural networks and gradient-boosted decision trees (GBDTs), and the debate over which is better has been going on for a while.
Neural network models consist of a series of linear transformation layers interspersed with simple non-linear functions, and are trained through some form of gradient descent. They are currently state-of-the-art for most ML problems, including computer vision and natural language processing. Popular neural net libraries include PyTorch and TensorFlow.
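As a concrete (if toy) illustration - not taken from any particular winning solution - here's a minimal sketch of such a network in PyTorch; the layer widths, feature count, and batch size are arbitrary placeholder values:

```python
import torch
import torch.nn as nn

# A small feed-forward network for tabular data: linear layers
# interspersed with a simple non-linearity (ReLU), as described above.
class TabularMLP(nn.Module):
    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TabularMLP(num_features=20, num_classes=2)
x = torch.randn(32, 20)   # a batch of 32 rows, each with 20 scalar features
logits = model(x)         # shape: (32, 2); train with e.g. cross-entropy loss
```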
Gradient-boosted decision trees are ensemble models — combinations of many smaller models — and are iteratively built up, one tree at a time. Popular GBDT libraries include LightGBM, CatBoost, and XGBoost.
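For comparison, here's an equally minimal LightGBM sketch, with synthetic data standing in for a real training set and illustrative (untuned) hyperparameters:

```python
import numpy as np
import lightgbm as lgb

# Synthetic stand-in for a tabular dataset: 1,000 rows, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy binary target

# Boosting builds the ensemble iteratively, one tree per round.
params = {"objective": "binary", "num_leaves": 31, "learning_rate": 0.1}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

preds = booster.predict(X)  # predicted probability of the positive class
```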
So, how did neural networks compare to gradient boosted decision trees in 2022?
The most significant tabular data competition in 2022 was Kaggle’s American Express - Default Prediction competition. In fact, this was by far the most popular competition of the whole year - with almost 5,000 teams entering, and a $100k prize pool up for grabs. In this competition, the goal was to use (anonymised and normalised) customer statements to predict the probability that a customer would default on their credit card debt. While there is a small time-series element (the input data includes a sequence of statements per customer), many participants seemed to approach it as a pure tabular problem.
The first place solution used an ensemble of GBDTs and Recurrent Neural Networks (specifically, GRUs), coupled with heavy manual feature engineering. The second and third place solutions also featured significant feature engineering, and an ensemble of GBDTs.
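To see why an RNN is a natural fit here, consider that each customer's input is a sequence of monthly statements. Below is a hedged sketch of a GRU-based default-probability model - our illustration, not the winners' code, with placeholder sequence length and feature count:

```python
import torch
import torch.nn as nn

class StatementGRU(nn.Module):
    """Encodes a customer's sequence of statements into a default probability."""
    def __init__(self, num_features: int, hidden_size: int = 64):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, statements):
        # statements: (batch, num_statements, num_features)
        _, last_hidden = self.gru(statements)
        # last_hidden[-1] is the final hidden state for each customer
        return torch.sigmoid(self.head(last_hidden[-1]))  # (batch, 1)

model = StatementGRU(num_features=150)
batch = torch.randn(8, 13, 150)  # 8 customers x 13 monthly statements each
p_default = model(batch)
```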
It’s a tentative 1-0 to GBDTs so far - while neural nets were used in the winning solution, the use of an RNN was clearly driven by the time-series aspects of the data, modelling the sequence of monthly statements.
Another interesting tabular competition is Zindi’s Alvin Smart Money Classification Challenge, in which participants had to classify individual bank transactions into one of 13 categories (“Bills & Fees”, “Education”, …). While the prize money for this competition - $3,000 - was much smaller than that for the American Express one, over 200 individuals entered. The winning solution used an ensemble of different GBDT models.
2-0.
And then there’s DrivenData’s Airport Configuration Prediction competition, run together with NASA. This competition was open only to students, and had a $40,000 prize pool for the teams who could best predict an airport’s runway configuration from 30 minutes to 6 hours into the future - a textbook multiclass classification problem. (While, again, there’s a time-series element to the data, teams generally approached it as a purely tabular data problem, and some mentioned that trying to incorporate time-series elements led to worse performance.)
The top 4 teams’ solutions are all mentioned on DrivenData’s write-up page for this competition, and are certainly worth reading through. Once again, the results come out mostly in favour of gradient-boosted decision trees. The #1 team exclusively used CatBoost for modelling. Team 2 used XGBoost for sub-models which fed into a dense neural network. Third place used two layers of XGBoost models.
3-0.
DrivenData also ran an air quality prediction competition, together with NASA, with both Particulate Matter (PM2.5) and Trace Gases (NO2) tracks, and a total of $50k in prizes available across both tracks. These were regression tasks - using various types of satellite data to predict a float value representing the PM2.5 or NO2 measurement for a given 5x5km grid cell at a given point in time.
Once again, the winners of both tracks primarily used GBDTs - the first place winner of the PM2.5 track used ensembles of XGBoost, CatBoost, and LightGBM models. The first place winner of the NO2 track was an experienced competitive ML winner - having won the M5 forecasting competition, and several others. He commented that “LightGBM is the gold standard for these situations”, and primarily used LightGBM for his solution, but found that “A final ensemble with a neural network added an additional 0.5% to the model’s performance.”
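The simplest form of this kind of multi-library GBDT ensemble is a plain average of each model's predictions. The sketch below is our own illustration of the idea for a regression task - synthetic data standing in for the real satellite features, default hyperparameters throughout:

```python
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

# Synthetic regression data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)

lgb_model = lgb.LGBMRegressor().fit(X, y)
xgb_model = xgb.XGBRegressor().fit(X, y)
cat_model = CatBoostRegressor(verbose=0).fit(X, y)

# Uniform blend; in practice the weights would be tuned on a validation set.
preds = (lgb_model.predict(X) + xgb_model.predict(X) + cat_model.predict(X)) / 3
```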
4-0?
Another interesting DrivenData/NASA competition with two tracks (GCMS and EGA) is the Mars Spectrometry challenge, where competitors had to analyse mass spectrometry data from Martian samples to identify the presence of different chemical substances. The GCMS data has three feature dimensions: time, mass, and intensity, with binary labels across 9 different classes of potential compounds (Aromatic, Hydrocarbon, … Mineral) - of which multiple can be present in any given sample. The EGA data has four feature dimensions: time, temperature, mass-to-charge-ratio, and abundance. This time there are ten classes (Basalt, Carbonate, …, Sulfide), and they are not mutually exclusive.
While this competition could be seen as a time-series tabular competition, the winning approaches were different from the ones mentioned above. In both tracks, the #1 solutions used specific neural net architectures to exploit structure in the data - for example, “I represented the mass spectrogram as a 2D image (temperature vs m/z values) used as an input to CNN, RNN or transformer-based models”, or “The first architecture is based on 1D CNN-Transformer where each sample is transformed into a sequence of 1D arrays before feeding into a 1D Event Detection network. The second architecture is based on 2D CNN where 3 different preprocessing methods are applied to the sample to generate a 3 channel image-like representation, then this image-like representation is fed into a 2D Event Detection network.”
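As a rough illustration of the "spectrogram as 2D image" idea - a much simpler network than the winners' custom architectures, with made-up input dimensions and the EGA track's ten non-exclusive labels:

```python
import torch
import torch.nn as nn

# Multi-label classifier over a 2D "image" of a spectrogram
# (e.g. temperature on one axis, m/z on the other, intensity as pixel value).
class SpectrogramCNN(nn.Module):
    def __init__(self, num_labels: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_labels)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))  # one logit per label

model = SpectrogramCNN()
x = torch.randn(4, 1, 64, 128)  # batch of 4 single-channel spectrogram "images"
# Labels are not mutually exclusive, so use a per-label sigmoid/BCE loss:
loss = nn.BCEWithLogitsLoss()(model(x), torch.randint(0, 2, (4, 10)).float())
```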
4-1.
A special case of tabular/time-series competitions is financial forecasting competitions. These competitions are notoriously difficult to run. In 2021, the Volatility Prediction competition on Kaggle had a controversial outcome. Rather than using Kaggle’s time-series API, which ensures solutions can’t “peek” at future data, the competition relied on data being shuffled and normalised, with timestamps obfuscated. When some creative competitors managed to reverse-engineer the time ids, the organisers added random noise to the data to try to penalise these unwanted solutions. The Kaggle community wasn’t happy, with the organisers being accused of “changing the test data after the end to manipulate the [leaderboard]” by one Kaggle Grandmaster. The decision was quickly reversed.
The downside of relying on the time-series API and collecting test data after the submission deadline, as was done in all three of Kaggle’s financial forecasting competitions in 2022, is the trade-off between collecting enough test data to overcome noise and finishing the competition within a reasonable time period. All three competitions used a test period of three months.
| Competition | Train data | Universe | Data resolution | Evaluation metric |
|---|---|---|---|---|
| G-Research Crypto | ~24m rows, ~3GB | 14 cryptocurrencies | 15 minutes | Pearson correlation coefficient |
| Ubiquant Market Prediction* | ~3.1m rows, ~18GB | ~3,600 “investments” | Unclear | Pearson correlation coefficient |
| JPX Stock Price Prediction | ~2.4m rows, ~1.3GB | 2,000 stocks | Daily | Sharpe ratio |
* The official Ubiquant dataset was removed from the Kaggle website after the competition, so our data metrics for this competition are taken from this notebook. It is unclear from the data that remains and the descriptions exactly what asset classes are covered or what the frequency of the data is.
The JPX competition required participants to rank stocks each day; buy/sell decisions were then made based on the relative ranking (roughly, the top 200 ranked stocks were bought and the bottom 200 sold each day). The Sharpe ratio was then calculated on the returns of running this hypothetical trading strategy with each participant’s daily rankings as input.
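In simplified form, the evaluation works roughly as in the sketch below. This is our own equal-weighted approximation - the official metric also applied position weights within each bucket - and the column names are ours:

```python
import numpy as np
import pandas as pd

def daily_spread_return(day: pd.DataFrame, k: int = 200) -> float:
    """Mean next-day return of the k best-ranked stocks minus that of the
    k worst-ranked ones. Expects columns 'rank' (0 = best) and 'return'."""
    day = day.sort_values("rank")
    return day["return"].head(k).mean() - day["return"].tail(k).mean()

def sharpe(daily_returns: np.ndarray) -> float:
    # Competition score: mean over standard deviation of daily spread returns.
    return daily_returns.mean() / daily_returns.std()

# Synthetic example: 2,000 stocks, ~90 daily rebalances over the test period.
rng = np.random.default_rng(0)
days = [
    pd.DataFrame({"rank": np.arange(2000), "return": rng.normal(0, 0.02, 2000)})
    for _ in range(90)
]
score = sharpe(np.array([daily_spread_return(d) for d in days]))
```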
It appears that the winner of the JPX competition used a simple linear regression model, and may even have had some bugs in the feature generation stage. This adds weight to some participants’ belief that the competition involved more than the usual amount of luck - a belief further supported by the fact that the top of the leaderboard was made up almost exclusively of very inexperienced participants. The test period involved only 90 independent events (daily rebalances for 3 months), with price moves on any given day highly correlated across stocks, and a high level of noise - so there’s certainly an argument that the test set was too small for a genuinely predictive method to reliably outperform a merely lucky one.
The Ubiquant and G-Research competitions were more straightforward, and asked participants to directly predict asset returns. G-Research published a blog post following the conclusion of their competition, confirming that “all of our top three used a LightGBM model, though some experimented with other approaches, including neural networks”. We confirmed in writing directly with Meme Lord Capital, the top team in that competition, that they primarily used LightGBM, but also made use of PyTorch, XGBoost, and scikit-learn, and were able to hand-code a number of features using domain knowledge (both team members have professional experience in quant finance). Disclaimer: G-Research is one of the sponsors of this report.
The winners of the Ubiquant competition published a summary of their approach as a Kaggle Discussion. They used heavy feature engineering with an ensemble of a LightGBM model and TabNet, a deep neural net architecture based on an attention mechanism similar to that used by Transformer models.
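In outline, such a blend might look like the sketch below. We're assuming the pytorch-tabnet package here, which is one common TabNet implementation - not necessarily the one the winners used - and the blend weights are arbitrary:

```python
import numpy as np
import lightgbm as lgb
from pytorch_tabnet.tab_model import TabNetRegressor

# Synthetic stand-in for the (heavily feature-engineered) training matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50)).astype(np.float32)
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=2000)).astype(np.float32)

gbdt = lgb.LGBMRegressor().fit(X, y)

tabnet = TabNetRegressor(verbose=0)
tabnet.fit(X, y.reshape(-1, 1), max_epochs=10)  # TabNet expects 2D targets

# Weighted blend; in practice the weights would be tuned on a validation set.
preds = 0.6 * gbdt.predict(X) + 0.4 * tabnet.predict(X).ravel()
```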
| Competition | Winning approach | Winning modelling packages |
|---|---|---|
| Air Quality Prediction (PM2.5) | GBDT + Linear Regression | XGBoost, CatBoost, LightGBM, scikit-learn |
| Air Quality Prediction (NO2) | GBDT + NN | LightGBM, PyTorch Lightning |
| Alvin Transaction Classification | GBDT | CatBoost, LightGBM, XGBoost, scikit-learn |
| Amex Default Prediction | GBDT + GRU | LightGBM, PyTorch |
| JPX Stock Price Prediction | Linear Regression | scikit-learn |
| Mars Spectrometry (EGA) | CNN/RNN/Transformer | PyTorch |
| Mars Spectrometry (GCMS) | CNN, Transformer | PyTorch |
| NASA Airport Configuration | GBDT | CatBoost |
| G-Research Crypto | GBDT + NN | LightGBM |
| Ubiquant Market Prediction | GBDT + TabNet | LightGBM, PyTorch/TensorFlow |
While we’ve half-seriously scored GBDTs as “winning” on “pure” tabular data, in reality it’s very rare to be dealing with pure tabular data that doesn’t have any other kind of structure - like the spatial dimensions in the air quality prediction competitions, or the time series elements in the other competitions. If specific neural net architectures can impose an inductive bias that helps exploit this structure, then they’re clearly adding value! One thing we can say is that the dream of picking the correct neural net architecture for a problem and having it “discover” all the relevant features is probably not quite here yet - since creative feature generation seemed to be a key element of all the winning solutions for these tabular competitions.
Another noticeable point is how many winners used an ensemble of both GBDTs and neural nets - suggesting that asking which of the two is “best” is the wrong question if the goal is maximising predictive power. While this kind of approach does well in competitive machine learning, the added complexity and differing inference requirements of mixed model types might make it less suited to production use cases.
To sum up:
- GBDTs still win at pure tabular data.
- Neural nets can outperform when there is structure to be exploited.
- Ensembles of GBDTs and neural nets are probably best.
- LightGBM is now the most popular GBDT library among winners.