
Performance of LLM backends and models in Curnagl


Introduction

This page compares the inference performance of large language models (LLMs) on the GPU nodes of the Curnagl cluster. We benchmarked the Llama3 and Mistral models with several inference backends on the different GPU types available at UNIL, to help users choose the combination best suited to their workload.

Models and backends tested

Tested Models

Llama3

Mistral


Tested Backends

vLLM

The vLLM backend provides efficient memory usage and fast token sampling. This backend is well suited to running the Llama3 and Mistral models in environments that require high-speed responses and low latency.
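
As an illustration, here is a minimal sketch of offline inference with vLLM's Python API. The model name, prompt, and sampling parameters are placeholders for this example, not the exact configuration used in the benchmarks.

```python
from vllm import LLM, SamplingParams

# Hypothetical model and settings; adjust to the model you actually want to run.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain what an HPC cluster is."], sampling)
print(outputs[0].outputs[0].text)
```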

llama.cpp

llama.cpp was primarily designed for the Llama family of models, but it can be applied to other LLMs. This optimized backend provides efficient inference on GPUs.
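
For reference, the sketch below uses the llama-cpp-python bindings rather than the llama.cpp command-line tools; the GGUF file path and parameters are assumptions, not the benchmark setup.

```python
from llama_cpp import Llama

# Placeholder GGUF file; any quantized Llama3 or Mistral checkpoint can be used.
llm = Llama(
    model_path="models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window size
)

out = llm("Explain what an HPC cluster is.", max_tokens=128)
print(out["choices"][0]["text"])
```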

Hugging Face Transformers

The Hugging Face Transformers library is one of the most widely used LLM toolkits, if not the most widely used. It is easy to use and supports a wide range of models and backends. One of its main advantages is its quick setup, which enables fast experimentation across architectures.
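
A minimal sketch with the Transformers pipeline API is shown below; the model identifier and generation settings are illustrative assumptions, not the benchmark configuration.

```python
import torch
from transformers import pipeline

# Illustrative checkpoint; any Llama3 or Mistral model from the Hub works similarly.
pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",  # place the model on the available GPU(s)
)

result = pipe("Explain what an HPC cluster is.", max_new_tokens=128)
print(result[0]["generated_text"])
```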

Mistral inference

This is the official inference backend for Mistral. It is supposed to be optimized for Mistral's architecture and should therefore improve the model's performance. However, our benchmark results do not show any Mistral-specific advantage, as llama.cpp seems to perform better.
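
As a rough sketch, inference with the mistral-inference package looks as follows. The model folder is a placeholder and the API may differ between package versions, so treat this as an assumption rather than the exact setup used here.

```python
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Placeholder folder containing the downloaded Mistral weights and tokenizer.
model_dir = "mistral-7B-Instruct-v0.3"

tokenizer = MistralTokenizer.from_file(f"{model_dir}/tokenizer.model.v3")
model = Transformer.from_folder(model_dir)

request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain what an HPC cluster is.")]
)
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=128,
    temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```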


Hardware description

Three different types of GPUs have been used to benchmark the LLM models: the NVIDIA A100, the GH200, and the L40. Their specifications are summarized below:

Characteristics                            A100            GH200        L40
Number of nodes at UNIL                    8               1            8
Memory per node (GB)                       40              80           48
Architecture                               x86_64          aarch64      x86_64
CPU                                        AMD Epyc2 7402  Neoverse-V2  AMD EPYC 9334 (32-core)
Number of CPUs per NUMA node               48              72           8
Memory bandwidth, up to (TB/s)             1.9             4            0.8
FP64 performance (TFLOPS)                  9.7             34           NA
FP64 Tensor Core performance (TFLOPS)      19.5            67           NA
FP32 performance (TFLOPS)                  19.5            67           90.5
TF32 performance (TFLOPS)                  156             494          90.5
TF32 performance with sparsity (TFLOPS)    312             494          362
FP16 performance (TFLOPS)                  312             990          181
INT8 performance (TOPS)                    624             1.9          362

Depending on the code you are running, one type of GPU may suit your requirements and expectations better than the others.


Inference latency results