Performance of LLM backends and models in Curnagl
Introduction
This page compares the inference performance of popular open-weight LLM models across several inference backends and across the GPU cards that are available, or will soon be available, on the Curnagl cluster. The goal is to help you choose a model, backend, and GPU combination suited to your workload.
Models and backends tested
Tested Models
Llama3
- Official access to Meta Llama3 models: Meta Llama3 models on Hugging Face
- Meta-Llama-3.1-8B-Instruct
- Meta-Llama-3.1-70B-Instruct
Mistral
- Official access to Mistral models: Mistral models on MistralAI website
- Access to Mistral models on Hugging Face: Mistral models on Hugging Face
- Mistral-7B-Instruct-v0.3
- Mixtral-8x7B-Instruct-v0.1
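All of the checkpoints listed above can be downloaded ahead of time from Hugging Face. Below is a minimal sketch using the huggingface_hub client; the target directory is a placeholder, and gated models such as the Llama 3.1 family additionally require an accepted license and an access token (via huggingface-cli login).

```python
# Sketch: pre-download a checkpoint to a local directory with huggingface_hub.
# Gated models (e.g. the Llama 3.1 family) require an accepted license and an
# authenticated token before this call succeeds.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="models/Mistral-7B-Instruct-v0.3",  # hypothetical target directory
)
```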
Tested Backends
vLLM
The vLLM backend provides efficient memory usage and fast token sampling. It is well suited to running Llama3 and Mistral models in environments that require high-speed responses and low latency.
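As an illustration, a minimal offline-inference run with vLLM could look like the sketch below; the model name and the tensor_parallel_size value are assumptions to adapt to the checkpoint and the number of GPUs you request.

```python
# Minimal offline-inference sketch with vLLM. The model name is one of the
# checkpoints listed above; adjust tensor_parallel_size to the GPUs available.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed to be cached locally or downloadable
    tensor_parallel_size=1,                          # e.g. 2+ for the 70B model across several GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(["Explain what a NUMA node is."], params)
for out in outputs:
    print(out.outputs[0].text)
```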
llama.cpp
llama.cpp was primarily developed for the Llama family of models, but it can be applied to other LLM models as well. This optimized backend provides efficient inference on GPUs.
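llama.cpp itself ships as C/C++ binaries, but to keep the examples in a single language the sketch below goes through the llama-cpp-python bindings. The GGUF file name is a placeholder, and the bindings must have been built with GPU support for the offloading to take effect.

```python
# Sketch using the llama-cpp-python bindings. Assumes a GGUF conversion of the
# model exists at model_path and that the bindings were compiled with CUDA.
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window size
)

result = llm("Explain what a NUMA node is.", max_tokens=128, temperature=0.0)
print(result["choices"][0]["text"])
```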
Hugging Face Transformers
The Hugging Face Transformers library is one of the most widely used LLM toolkits, if not the most widely used. Easy to use, it supports a wide range of models and backends. One of its main advantages is its quick setup, which enables fast experimentation across architectures.
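A typical quick start with the Transformers pipeline API might look as follows; the model name and dtype are assumptions, and device_map="auto" places the weights on the available GPU.

```python
# Quick-start sketch with the Transformers pipeline API, using one of the
# checkpoints listed above.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.float16,  # half precision to fit GPU memory
    device_map="auto",          # place weights on the available GPU(s)
)

print(pipe("Explain what a NUMA node is.", max_new_tokens=128)[0]["generated_text"])
```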
mistral-inference
This is the official inference backend for Mistral models. It is, in principle, optimized for Mistral's architecture and should therefore improve model performance. However, our benchmark results do not show any Mistral-specific advantage: llama.cpp appears to perform better.
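For reference, here is a sketch of the mistral-inference chat-completion flow (v1-style API); the model directory is a placeholder for a local download of Mistral-7B-Instruct-v0.3 together with its v3 tokenizer.

```python
# Sketch of the mistral-inference API. Paths are placeholders for a local
# download of Mistral-7B-Instruct-v0.3 with its v3 tokenizer file.
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_inference.generate import generate
from mistral_inference.transformer import Transformer

model_dir = "/path/to/mistral-7B-Instruct-v0.3"  # placeholder
tokenizer = MistralTokenizer.from_file(f"{model_dir}/tokenizer.model.v3")
model = Transformer.from_folder(model_dir)

request = ChatCompletionRequest(messages=[UserMessage(content="Explain what a NUMA node is.")])
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens], model, max_tokens=128, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```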
Hardware description
Three different types of GPUs have been used to benchmark LLM models:
- A100, already available on Curnagl (official documentation),
- GH200, which will be available soon on Curnagl (official documentation),
- L40, which will be available soon on Curnagl (official documentation and specifications).
Their specifications are summarized below:
| Characteristics | A100 | GH200 | L40 |
|---|---|---|---|
| Number of nodes at UNIL | 8 | 1 | 8 |
| GPU memory per card (GB) | 40 | 80 | 48 |
| Architecture | x86_64 | aarch64 | x86_64 |
| CPU | AMD Epyc2 7402 | Neoverse-V2 | AMD EPYC 9334 |
| CPU cores per NUMA node | 48 | 72 | 8 |
| Memory bandwidth, up to (TB/s) | 1.9 | 4 | 0.8 |
| FP64 performance (TFLOPS) | 9.7 | 34 | NA |
| FP64 Tensor Core performance (TFLOPS) | 19.5 | 67 | NA |
| FP32 performance (TFLOPS) | 19.5 | 67 | 90.5 |
| TF32 Tensor Core performance (TFLOPS) | 156 | 494 | 90.5 |
| TF32 Tensor Core performance with sparsity (TFLOPS) | 312 | 989 | 181 |
| FP16 performance (TFLOPS) | 312 | 990 | 181 |
| INT8 performance (TOPS) | 624 | 1979 | 362 |
Depending on the code you run, one of these GPUs may suit your requirements better than the others.
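To attribute benchmark numbers to the right column of the table above, it helps to confirm which card a job was actually allocated; a quick check from Python is enough:

```python
# Sketch: report the GPU card a job landed on, so results can be matched to
# the corresponding column of the specifications table.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, memory: {props.total_memory / 1024**3:.0f} GB")
else:
    print("No CUDA device visible to this job.")
```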