Performance of LLM backends and models in Curnagl
Introduction
This page compares the inference performance of popular open-weight LLM models across several inference backends and across the GPU cards that are available, or will soon be available, on the Curnagl cluster. The goal is to help you choose a model, backend, and GPU combination suited to your workload.
Models and backends tested
Tested Models
Llama3
- Official access to Meta Llama3 models: Meta Llama3 models on Hugging Face
- Meta-Llama-3.1-8B-Instruct
- Meta-Llama-3.1-70B-Instruct
Mistral
- Official access to Mistral models: Mistral models on MistralAI website
- Access to Mistral models on Hugging Face: Mistral models on Hugging Face
- Mistral-7B-Instruct-v0.3
- Mixtral-8x7B-Instruct-v0.1
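All of the checkpoints listed above can be downloaded ahead of time from Hugging Face. Below is a minimal sketch using the huggingface_hub client; the target directory is a placeholder, and gated models such as the Llama 3.1 family additionally require an accepted license and an access token (via huggingface-cli login).

```python
# Sketch: pre-download a checkpoint to a local directory with huggingface_hub.
# Gated models (e.g. the Llama 3.1 family) require an accepted license and an
# authenticated token before this call succeeds.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="models/Mistral-7B-Instruct-v0.3",  # hypothetical target directory
)
```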
Tested Backends
vLLM
The vLLM backend provides efficient memory usage and fast token sampling. It is well suited to running Llama3 and Mistral models in environments that require high-speed responses and low latency.
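As an illustration, a minimal offline-inference run with vLLM could look like the sketch below; the model name and the tensor_parallel_size value are assumptions to adapt to the checkpoint and the number of GPUs you request.

```python
# Minimal offline-inference sketch with vLLM. The model name is one of the
# checkpoints listed above; adjust tensor_parallel_size to the GPUs available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed to be cached locally or downloadable
    tensor_parallel_size=1,                          # e.g. 2+ for the 70B model across several GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Explain what a NUMA node is."], params)
for out in outputs:
    print(out.outputs[0].text)
```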
llama.cpp
llama.cpp was primarily developed for the Llama family of models, but it can be applied to other LLM models as well. This optimized backend provides efficient inference on GPUs.
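llama.cpp itself ships as C/C++ binaries, but to keep the examples in a single language the sketch below goes through the llama-cpp-python bindings. The GGUF file name is a placeholder, and the bindings must have been built with GPU support for the offloading to take effect.

```python
# Sketch using the llama-cpp-python bindings. Assumes a GGUF conversion of the
# model exists at model_path and that the bindings were compiled with CUDA.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window size
)

result = llm("Explain what a NUMA node is.", max_tokens=128, temperature=0.0)
print(result["choices"][0]["text"])
```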
Hugging Face Transformers
The Hugging Face Transformers library is one of the most widely used LLM toolkits, if not the most widely used. Easy to use, it supports a wide range of models and backends. One of its main advantages is its quick setup, which enables fast experimentation across architectures.
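A typical quick start with the Transformers pipeline API might look as follows; the model name and dtype are assumptions, and device_map="auto" places the weights on the available GPU.

```python
# Quick-start sketch with the Transformers pipeline API, using one of the
# checkpoints listed above.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,  # half precision to fit GPU memory
    device_map="auto",          # place weights on the available GPU(s)
)

print(pipe("Explain what a NUMA node is.", max_new_tokens=128)[0]["generated_text"])
```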
mistral-inference
This is the official inference backend for Mistral models. It is, in principle, optimized for Mistral's architecture and should therefore improve model performance. However, our benchmark results do not show any Mistral-specific advantage: llama.cpp appears to perform better.
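For reference, here is a sketch of the mistral-inference chat-completion flow (v1-style API); the model directory is a placeholder for a local download of Mistral-7B-Instruct-v0.3 together with its v3 tokenizer.

```python
# Sketch of the mistral-inference API. Paths are placeholders for a local
# download of Mistral-7B-Instruct-v0.3 with its v3 tokenizer file.
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_inference.generate import generate
from mistral_inference.transformer import Transformer

model_dir = "/path/to/mistral-7B-Instruct-v0.3"  # placeholder
tokenizer = MistralTokenizer.from_file(f"{model_dir}/tokenizer.model.v3")
model = Transformer.from_folder(model_dir)

request = ChatCompletionRequest(messages=[UserMessage(content="Explain what a NUMA node is.")])
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens], model, max_tokens=128, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```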
Hardware description
Three different types of GPUs have been used to benchmark LLM models:
- A100, already available on Curnagl (official documentation),
- GH200, which will be available soon on Curnagl (official documentation),
- L40, which will be available soon on Curnagl (official documentation and specifications).
Their specifications are summarized below:
| Characteristics | A100 | GH200 | L40 |
|---|---|---|---|
| Number of nodes at UNIL | 8 | 1 | 8 |
| GPU memory per card (GB) | 40 | 80 | 48 |
| Architecture | x86_64 | aarch64 | x86_64 |
| CPU | AMD Epyc2 7402 | Neoverse-V2 | AMD EPYC 9334 |
| CPU cores per NUMA node | 48 | 72 | 8 |
| Memory bandwidth, up to (TB/s) | 1.9 | 4 | 0.8 |
| FP64 performance (TFLOPS) | 9.7 | 34 | NA |
| FP64 Tensor Core performance (TFLOPS) | 19.5 | 67 | NA |
| FP32 performance (TFLOPS) | 19.5 | 67 | 90.5 |
| TF32 Tensor Core performance (TFLOPS) | 156 | 494 | 90.5 |
| TF32 Tensor Core performance with sparsity (TFLOPS) | 312 | 989 | 181 |
| FP16 performance (TFLOPS) | 312 | 990 | 181 |
| INT8 performance (TOPS) | 624 | 1979 | 362 |
Depending on the code you run, one of these GPUs may suit your requirements better than the others.
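To attribute benchmark numbers to the right column of the table above, it helps to confirm which card a job was actually allocated; a quick check from Python is enough:

```python
# Sketch: report the GPU card a job landed on, so results can be matched to
# the corresponding column of the specifications table.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.0f} GB")
else:
    print("No CUDA device visible to this job.")
```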