
Performance of LLM backends and models in Curnagl

Introduction

This page presents the performance of Llama and Mistral models on Curnagl hardware. We have measured the token throughput, which should give you an idea of what is achievable with Curnagl resources. Training and inference times for different tasks can be estimated from these results.
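
The last point can be turned into a quick back-of-the-envelope calculation: the expected wall-clock time of a generation task is roughly the number of tokens to produce divided by the measured throughput. The sketch below illustrates this, using a throughput value from the tables further down; the function name is ours and not part of any benchmark code.

```python
def estimated_generation_time(n_tokens: int, tokens_per_second: float) -> float:
    """Rough wall-clock estimate for generating n_tokens at a given throughput."""
    return n_tokens / tokens_per_second

# Example: 400 tokens with mistral-7B-Instruct-v0.3 on an A100 via vLLM (~74.1 tokens/s)
print(f"{estimated_generation_time(400, 74.1):.1f} s")  # ~5.4 s
```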


Models and backends tested

Tested Models

  • Llama3 (8B Instruct and 70B Instruct)
  • Mistral (mistral-7B-Instruct-v0.3 and Mixtral-8x7B-v0.1-Instruct)


Tested Backends

The vLLM backend provides efficient memory usage and fast token sampling. It is ideal for testing Llama3 and Mistral models in environments that require high-speed responses and low latency.
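
As an illustration, a minimal vLLM run with greedy sampling might look like the sketch below; the checkpoint name and prompt are placeholders, not the exact benchmark setup.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; the benchmarks below cover Mistral and Llama3 models.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")

# temperature=0 gives deterministic (greedy) decoding, as in our benchmarks
params = SamplingParams(temperature=0, max_tokens=400)

outputs = llm.generate(["Explain what an HPC cluster is."], params)
print(outputs[0].outputs[0].text)
```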

llama.cpp was primarily developed for Llama models, but it can be applied to other LLMs. This optimized backend provides efficient inference on GPUs.
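
A minimal sketch using the llama-cpp-python bindings, assuming a GGUF checkpoint is available locally (the path is a placeholder), could look as follows; it mirrors the settings described in the benchmark section (context size 8192, 99 layers offloaded to the GPU).

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct-v0.3.Q8_0.gguf",  # placeholder path
    n_ctx=8192,        # context size used in the benchmarks
    n_gpu_layers=99,   # offload (practically) all layers to the GPU
)

out = llm("Explain what an HPC cluster is.", max_tokens=400, temperature=0)
print(out["choices"][0]["text"])
```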

The Hugging Face Transformers library is one of the most widely used LLM toolkits, if not the most widely used. Easy to use, it supports a wide range of models and backends. One of its main advantages is its quick setup, which enables fast experimentation across architectures.
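
A minimal Transformers sketch might look as follows; the checkpoint name is a placeholder, and greedy decoding is used to match the temperature-0 setting of the benchmarks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Explain what an HPC cluster is.", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)  # greedy decoding
print(tokenizer.decode(output[0], skip_special_tokens=True))
```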

mistral-inference is the official inference backend for Mistral. It is supposedly optimized for Mistral's architecture, which should increase model performance. However, our benchmark results do not show any Mistral-specific advantage, as llama.cpp seems to perform better.


Hardware description

Three different types of GPUs have been used to benchmark the LLM models: A100, GH200, and L40.

Here are their specifications:

| Characteristics | A100 | GH200 | L40 |
| --- | --- | --- | --- |
| Number of nodes at UNIL | 8 | 1 | 8 |
| Memory per node (GB) | 40 | 80 | 48 |
| Architecture | x86_64 | aarch64 | x86_64 |
| Number of CPU cores per NUMA node | 48 | 72 | 8 |
| Memory bandwidth, up to (TB/s) | 1.9 | 4 | 0.8 |
| FP64 performance (TFLOPS) | 9.7 | 34 | NA |
| FP64 Tensor Core performance (TFLOPS) | 19.5 | 67 | NA |
| FP32 performance (TFLOPS) | 19.5 | 67 | 90.5 |
| TF32 performance (TFLOPS) | 156 | 494 | 90.5 |
| TF32 performance with sparsity (TFLOPS) | 312 | 494 | 362 |
| FP16 performance (TFLOPS) | 312 | 990 | 181 |
| INT8 performance (TOPS) | 624 | 1.9 | 362 |

Depending on the code you are running, one GPU may better suit your requirements and expectations.

A100 GPUs

A100 nodes are particularly adapted if:

  • you want to run your code interactively (2 A100 GPUs of 20GB each are available on Curnagl interactive sessions),
  • your code is using mixed precision (FP16/FP32).

Disadvantages:

GH200

The GH200's specificity is to combine a Grace CPU and an H100 Hopper GPU sharing a unified CPU-GPU memory, which delivers superior memory bandwidth.

The GH200 node is particularly adapted if your code needs:

  • extreme memory bandwidth,
  • quite a lot of memory (up to 80GB),
  • performant Tensor Core operations.

Disadvantages:

  • as there is only one GH200 node (with a single GPU), multi-GPU distributed computing cannot be performed.

L40 GPUs

L40 nodes are particularly adapted if your code is:

  • implemented in single precision,
  • using distributed GPU programming, as there are eight L40 nodes.

Disadvantages:

  • not suited for double-precision (FP64) workloads.

Note: These architectures are not powerful enough to train Large Language Models.

Note: Our benchmarks aim to determine which GPU types should be provided to researchers. If you require new GPUs for your research, feel free to reach out to us through the Help Desk. If you and other researchers agree on the same GPU request, we will do our best to provide new resources that meet your needs.


Inference latency results

This chat dataset from GPT-3 has been used to benchmark the models.

The following model characteristics have been set (a sketch of how throughput is measured under these settings follows the list):

  • The maximum number of tokens to generate (ignoring the number of tokens in the prompt) is set to 400 or 1000 depending on the benchmark case, although this parameter did not appear to influence the latency of the model in our benchmarks.
  • The temperature, which controls the output randomness, is set to 0.
  • The context size, which is the number of tokens the model can process within a single input, is set to 8192. This choice offers a good balance between hardware memory capacity and model performance.
  • The number of layers offloaded to the GPU for computation, which applies only to the llama.cpp backend, is set to 99 (see this documented example).
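
Throughput figures of this kind can be obtained by timing a single generation call and dividing the number of generated tokens by the elapsed time. The backend-agnostic sketch below illustrates the idea; generate_fn is a placeholder for whichever backend call is being measured and is assumed to return the number of tokens it produced.

```python
import time

def measure_throughput(generate_fn, prompt: str) -> float:
    """Return generated tokens per second for a single call to a backend.

    generate_fn is a placeholder: it must take a prompt and return the
    number of tokens it generated.
    """
    start = time.perf_counter()
    n_tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```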

Mistral models

Results are given in tokens per second.

mistral-7B-Instruct-v0.3

| Backend | A100 | GH200 | L40 |
| --- | --- | --- | --- |
| vllm | 74.1 | - | - |
| llama.cpp | 53.8 | 138.4 | 42.8 |
| HuggingFace | 30 | 41.3 | 21.6 |
| mistral-inference | 23.4 | - | 25 |

Mixtral-8x7B-v0.1-Instruct

| Backend | A100 | GH200 | L40 |
| --- | --- | --- | --- |
| llama.cpp | NA | NA | 23.4 |
| HuggingFace | NA | NA | 8.5 |

Note: the Mixtral model uses 8x7 billion parameters. The resulting memory footprint for inference can only be accommodated by multiple L40 GPUs using distributed computing.
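
With the Transformers backend, one way to spread such a model over several GPUs is to let Accelerate place the layers automatically via device_map="auto". The sketch below illustrates the idea; it is a minimal example and not necessarily the exact configuration used for the benchmark.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" shards the model across all visible GPUs (e.g. several L40s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Explain what an HPC cluster is.", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=400, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```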

Llama models

Results are given in tokens per second.

8B Instruct

| Backend | A100 | GH200 | L40 |
| --- | --- | --- | --- |
| llama.cpp | 62.645 | 100.845 | 43.387 |
| Transformers | 31.650 | 43.321 | 21.062 |
| vllm | 44.686 | - | 45.176 |

70B Instruct

| Backend | L40 |
| --- | --- |
| Transformers | 2.372 |
| vllm | 30.945 |