
Performance of LLM backends and models in Curnagl

Introduction

This page compares the inference performance of several LLM backends and models on the GPU nodes available in Curnagl. It describes the models and backends that were tested, the GPU hardware used for the benchmarks, and the measured inference latency, so that users can choose a combination suited to their workload.

Models and backends tested

Tested Models

  • Llama3
  • Mistral


Tested Backends

The vLLM backend provides efficient memory usage and fast token sampling. It is well suited for running Llama3 and Mistral models in environments that require high-speed responses and low latency.
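
As an illustration, here is a minimal sketch using vLLM's Python API; the model name and prompt are placeholders, and the weights are assumed to be available locally or from Hugging Face.

```python
# Minimal vLLM sketch (model name and prompt are illustrative;
# vLLM must be installed and the weights accessible to the job).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["What is an HPC cluster?"], sampling)
print(outputs[0].outputs[0].text)
```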

llama.cpp was primarily used for Llama models, but it can be applied to other LLMs. This optimized backend provides efficient inference on GPUs.
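
A minimal sketch using the llama-cpp-python bindings is shown below (the llama.cpp CLI can be used equivalently); the GGUF file path is a placeholder, and GPU offloading assumes the bindings were built with CUDA support.

```python
# Minimal llama.cpp sketch via the llama-cpp-python bindings
# (the GGUF path is a placeholder; n_gpu_layers=-1 offloads all layers
# to the GPU, assuming the bindings were built with CUDA support).
from llama_cpp import Llama

llm = Llama(model_path="/path/to/llama3-8b-instruct.gguf", n_gpu_layers=-1)
out = llm("What is an HPC cluster?", max_tokens=128)
print(out["choices"][0]["text"])
```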

The Hugging Face Transformers library is, if not the most widely used LLM library, certainly one of them. Easy to use, it supports a wide range of models and backends. One of its main advantages is its quick setup, which enables fast experimentation across architectures.
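
For comparison, a minimal sketch with the Transformers pipeline API follows; the model name is illustrative, and device_map="auto" assumes the accelerate package is installed.

```python
# Minimal Hugging Face Transformers sketch (model name is illustrative;
# device_map="auto" places the model on the available GPU and requires
# the accelerate package).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)
result = generator("What is an HPC cluster?", max_new_tokens=128)
print(result[0]["generated_text"])
```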

mistral-inference is the official inference backend for Mistral. It is supposed to be optimized for Mistral's architecture and should therefore improve the model's performance. However, our benchmark results do not show any Mistral-specific advantage, as llama.cpp seems to perform better.


Hardware description

Three different types of GPU have been used to benchmark the LLM models: the NVIDIA A100, GH200, and L40.

Here are their specifications:

| Characteristics                            | A100          | GH200       | L40                             |
|--------------------------------------------|---------------|-------------|---------------------------------|
| Number of nodes at UNIL                    | 8             | 1           | 8                               |
| Memory per node (GB)                       | 40            | 80          | 48                              |
| Architecture                               | x86_64        | aarch64     | x86_64                          |
| CPU                                        | AMD EPYC 7402 | Neoverse-V2 | AMD EPYC 9334 32-Core Processor |
| Number of CPUs per NUMA node               | 48            | 72          | 8                               |
| Memory bandwidth - up to (TB/s)            | 1.9           | 4           | 0.8                             |
| FP64 performance (teraFlops)               | 9.7           | 34          | NA                              |
| FP64 Tensor Core performance (teraFlops)   | 19.5          | 67          | NA                              |
| FP32 performance (teraFlops)               | 19.5          | 67          | 90.5                            |
| TF32 performance (teraFlops)               | 156           | 494         | 90.5                            |
| TF32 performance with sparsity (teraFlops) | 312           | 494         | 362                             |
| FP16 performance (teraFlops)               | 312           | 990         | 181                             |
| INT8 performance (teraOPS)                 | 624           | 1.9         | 362                             |

Depending on the code you are running, one GPU type may suit your requirements and expectations better than another.
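
As a quick sanity check, the following sketch prints the GPU type and memory visible to the current job; it assumes a CUDA-enabled PyTorch installation on the node.

```python
# Minimal sketch: print the GPU type and memory visible to the current job
# (assumes a CUDA-enabled PyTorch installation).
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, memory: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA-capable GPU visible to this job.")
```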


Inference latency results