Tutorial: Towards Efficient Generative LLM Serving

Decoding algorithms, architecture design improvements, model compression, quantisation, parallelism, memory management, request scheduling, and kernel optimisations.


The first tutorial of the day (link) looked at efficient LLM serving from the perspectives of both algorithmic innovations and system optimisations. Xupeng Miao gave a run-through of the techniques covered in his lab's December 2023 survey paper, along with some newer ones.

Branching LLM serving taxonomy diagram, splitting off between algorithmic innovations and system optimisations.
Source: Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

The tutorial also featured some of the group's own research, including SpecInfer (tree-based speculative inference) and SpotServe (efficient and reliable LLM serving on pre-emptible cloud instances).
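Speculative inference is worth a quick sketch, since it underpins SpecInfer: a cheap draft model proposes several tokens, and the expensive target model verifies them in a single batched pass, accepting a prefix via rejection sampling so the output distribution matches target-only decoding. The toy sketch below shows the standard sequence-based scheme (SpecInfer generalises this to token trees); the distributions and names (`draft_dist`, `target_dist`, `speculative_step`) are illustrative stand-ins, not SpecInfer's actual API.

```python
# Minimal sketch of sequence-based speculative decoding. Toy categorical
# distributions stand in for real draft/target models; in a real system
# the target model scores all drafted positions in one forward pass,
# which is where the speedup comes from.
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_dist(seq, temperature):
    """Stand-in for a model's next-token distribution (hypothetical)."""
    rng = random.Random(hash(tuple(seq)) % (2**32) + int(temperature * 10))
    weights = [rng.random() + 1e-6 for _ in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def draft_dist(seq):   # small, cheap model
    return toy_dist(seq, temperature=2.0)

def target_dist(seq):  # large, expensive model
    return toy_dist(seq, temperature=1.0)

def sample(dist, rng):
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

def speculative_step(seq, k, rng):
    """Propose k draft tokens, then accept a prefix with the standard
    rejection rule: accept token t with prob min(1, p_target / p_draft)."""
    drafted, q, ctx = [], [], list(seq)
    for _ in range(k):
        d = draft_dist(ctx)
        t = sample(d, rng)
        drafted.append(t)
        q.append(d)
        ctx.append(t)
    accepted = []
    for i, t in enumerate(drafted):
        p = target_dist(seq + accepted)  # target scores each position
        if rng.random() < min(1.0, p[t] / q[i][t]):
            accepted.append(t)
        else:
            # On rejection, resample from the residual target distribution,
            # which keeps the overall output distribution unchanged.
            residual = {tok: max(p[tok] - q[i][tok], 0.0) for tok in VOCAB}
            z = sum(residual.values()) or 1.0
            accepted.append(sample({tok: v / z for tok, v in residual.items()}, rng))
            return accepted
    # All k drafts accepted: take one bonus token from the target model.
    accepted.append(sample(target_dist(seq + accepted), rng))
    return accepted

rng = random.Random(0)
seq = ["the"]
while len(seq) < 12:
    seq += speculative_step(seq, k=3, rng=rng)
print(" ".join(seq))
```

SpecInfer's contribution is to draft a tree of candidate continuations rather than a single sequence, so one verification pass can accept the best branch rather than a single prefix.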