NeurIPS Highlights

New Orleans, 10-16 December 2023


A few weeks before the end of last year, NeurIPS wrapped up its week-long 2023 programme in New Orleans. It was the biggest NeurIPS yet in terms of in-person attendees (13,307) and accepted papers (3,540), and possibly the largest academic AI conference ever¹.

Given its scale, it’s an impossible conference to summarise. For some fragments from the invited talks, as well as some of the orals, the exhibit hall, poster sessions, tutorials, workshops, and competitions, see our daily blogs.

Having said that, in this post we attempt to take a step back and highlight the themes that stood out to us at this year's conference, as well as what they might suggest about AI trends in 2024.

Plenty of room at the bottom

One of the main themes throughout the conference sessions was that many current cutting-edge models are too big: not merely cumbersome to manage, expensive to run, difficult to train, and memory-hungry, but bigger than they need to be, with equivalent performance achievable from smaller models.

Throughout this NeurIPS, researchers presented significant leaps forward on the efficiency front, whether through mathematically equivalent algorithmic improvements to the implementation of attention, alternatives to attention with better asymptotic scaling, clever quantisation techniques that reduce memory usage, or more thoughtful data filtering that improves performance. For just a few highlights, see our summary of the Efficiency Oral Session, Chris Ré's Invited Talk, and the Beyond Scaling panel.
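To make the quantisation point concrete, here's a minimal, purely illustrative sketch (not taken from any specific paper at the conference) of symmetric per-tensor int8 quantisation: storing int8 values plus a single scale cuts the memory for a weight matrix roughly 4x, at the cost of a small rounding error.

```python
import numpy as np

# Illustrative post-training weight quantisation (not any specific paper's method):
# store float32 weights as int8 plus a per-tensor scale.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4096, 4096)).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # symmetric quantisation scale
w_int8 = np.round(weights / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale  # approximate reconstruction

print(f"float32 size: {weights.nbytes / 1e6:.1f} MB")  # ~67 MB
print(f"int8 size:    {w_int8.nbytes / 1e6:.1f} MB")   # ~17 MB, 4x smaller
print(f"max abs error: {np.abs(weights - w_dequant).max():.4f}")
```

Real quantisation schemes presented at the conference are considerably more sophisticated (per-channel or per-group scales, outlier handling, sub-8-bit formats), but the memory arithmetic is the same.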

This thinking was also validated by models released during NeurIPS, such as Mixtral and Phi. Both of these small-ish models show benchmark performance that's equivalent, and sometimes superior, to that of larger models.

To quote Björn Ommer (quoting Richard Feynman) during his invited talk on Scaling and Generative AI: “There's plenty of room at the bottom.”²

Flavour of the week: LLMs and Diffusion Models

This was the first NeurIPS with submission deadlines after the releases of ChatGPT and Stable Diffusion³, and as expected there was a lot of attention on both LLMs and Diffusion Models. Many of the best-attended sessions focused on related topics, such as the tutorial on Latent Diffusion Models and several of the Invited Talks.

Fittingly, the Test of Time award went to a paper which provided many of the ingredients for the LLM revolution (Jeff Dean and Greg Corrado presented; Ilya Sutskever and the other co-authors weren’t there to collect it in person).

The exhibition hall featured lots of companies with specialised solutions for effectively pre-training, fine-tuning, and serving LLMs, alongside the usual large tech firms, quant traders, and MLOps providers.

Better data, please

Alongside the growth of the relatively young datasets and benchmarks track, data continues to be a focus at NeurIPS. Many of the speakers referenced the importance of a deep understanding of training and evaluation data, with the emphasis shifting from quantity to quality.

One of the runners-up for the Outstanding Paper Award, Scaling Data-Constrained Language Models, examined the effects of multi-epoch training in LLMs and presented several other interesting empirical results around training data for LLMs.

In one of the conference competitions, the LLM Efficiency Challenge (where participants maximised fine-tuned model performance given only 24h and a single GPU), the winners attributed much of their edge over others to selecting the right subset of training data.

The tutorial on Data-Centric AI made a compelling case for data-centric learning (as opposed to model-centric learning), and presented several useful resources for applying this approach to building more reliable and responsible AI, including a tool for monitoring performance on subsets of data during model training.
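To illustrate the kind of subset-level monitoring that approach implies, here's our own minimal sketch (not the tutorial's tool; the function and variable names are hypothetical): compute accuracy per data slice at each evaluation pass, so regressions on small or hard subgroups aren't hidden by the overall average.

```python
from collections import defaultdict

def slice_accuracy(predictions, labels, slice_ids):
    """Return overall and per-slice accuracy for one evaluation pass."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, slice_id in zip(predictions, labels, slice_ids):
        total[slice_id] += 1
        correct[slice_id] += int(pred == label)
    per_slice = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_slice

# e.g. after each training epoch (names are placeholders):
# overall, per_slice = slice_accuracy(model_preds, eval_labels, eval_slice_ids)
```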

Degrees of openness

In the panel on Beyond Scaling, Percy Liang pointed out that thinking of a foundation model⁴ as “open” or “not open” isn’t a very useful distinction, and that it’s more useful to think about properties such as a model being open-weights, open-training-data, or open-training-code.

Many recent models, like Meta’s Llama/Llama2, Microsoft’s Phi, and Mistral’s models, are open-weights, in the sense that anyone can download the model weights for their own inference or fine-tuning. But this doesn’t tell us how the model was trained, or on what data⁵. And without knowing those two things, it’s hard to really know how good a model is, or how to get the most out of it.
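In practice, open-weights simply means the weights can be downloaded and run locally. A minimal sketch using the Hugging Face transformers library (the model name is just one example of an open-weights release; some models require accepting a licence on the Hub first, and a 7B model needs substantial RAM or a GPU to run):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # an open-weights model, used here as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Run a short generation locally, with no API calls to the model provider.
inputs = tokenizer("There's plenty of room at the bottom", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```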

The panel highlighted Eleuther, HuggingFace, BigScience, AI2, and LLM360 as organisations releasing models that are open in more respects than just their weights.

Benchmarking and Goodhart’s Law

As the community shifts to using more foundation models with varying degrees of openness, the benchmarking norms that were designed for open models (or closed models fully developed within the organisation using them) are no longer sufficient.

One of the key difficulties is: how can we know that a model wasn’t trained on the benchmark dataset it’s being evaluated on?

Even if a model wasn’t trained directly on a benchmark dataset, over time any publicly available benchmark dataset will leak into other data, especially when web-scraped training data is so pervasive. Without access to the training data, evaluators are unable to examine similarity between the eval/benchmark samples and the training corpus. This problem is exacerbated by the fact that models are marketed on their benchmark performance, creating incentives that aren’t conducive to thorough cleaning of training data — a clear example of Goodhart’s Law.
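When the training corpus is available, one common and simple family of checks is verbatim n-gram overlap between benchmark samples and training documents. The sketch below is purely illustrative (the function names are ours, and the 13-token window is just one commonly used choice, not a standard from any specific evaluation suite):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All n-token windows of a whitespace-tokenised, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(eval_sample: str, training_docs: list[str], n: int = 13) -> float:
    """Fraction of the eval sample's n-grams that appear verbatim in any training document."""
    eval_ngrams = ngrams(eval_sample, n)
    if not eval_ngrams:
        return 0.0
    train_ngrams = set().union(*(ngrams(doc, n) for doc in training_docs))
    return len(eval_ngrams & train_ngrams) / len(eval_ngrams)

# A high overlap fraction suggests the benchmark sample may have leaked into the training data.
```

Checks like this only work with access to the training corpus, which is exactly what closed-data releases withhold.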

This is an open challenge, though the competitions track has been dealing with these considerations for some time.


It was a great NeurIPS, and left us with the feeling that there’s much more to come soon — especially in terms of democratising access to powerful and fast models. We look forward to another year of groundbreaking research!

For more on NeurIPS 2023, read our daily blogs: expo day, tutorials, day 1, day 2, day 3, and the competition track days.


  1. Our World in Data shows recent data for some of the top conferences, aggregating both virtual and in-person attendees. NeurIPS 2020 and 2021 were fully virtual, and NeurIPS 2022 had 9,835 attendees (source: NeurIPS fact sheet). The only other conferences listed there with more than 13,000 attendees are IROS 2020 and ICML 2021, which were both fully virtual. It’s possible that there were larger AI conferences a few decades ago; data for those is not as readily available. ↩︎

  2. Richard Feynman used this phrase as the title of a lecture which some see as the origin of nanotechnology. He was referring specifically to smaller-scale mechanical manipulation down to the level of individual atoms; in the machine-learning context it refers to parameter counts or memory usage rather than physical dimensions. More on Wikipedia. ↩︎

  3. Stable Diffusion was released in August 2022, and ChatGPT in November 2022. The NeurIPS 2022 conference took place after this, in December 2022, but much of the agenda for that conference had been set much earlier — with abstract and paper submission deadlines in May 2022. ↩︎

  4. Foundation model: “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks” (arXiv). ↩︎

  5. There is a bit more info on Phi-2 training data — “Dataset size: 250B tokens, combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4” — than Llama2 — “Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data”. (source: HuggingFace model cards). ↩︎