
Performance of LLM backends and models in Curnagl

TODO

  • Introduction (Cristian)
  • Backends and models tested (Margot)
  • Hardware description (Margot)
  • Inference latency results (Margot and Cristian) -> create one table per model, replace node names with GPU card names, and improve the column titles.

Introduction


Backends and models tested

Tested models

  • Llama3 Instruct
  • Mistral Instruct

Tested backends

vLLM

  • repository: https://github.com/vllm-project/vllm

The vLLM backend provides efficient memory usage and fast token sampling. It is well suited for testing the Llama3 and Mistral models in environments that require high-speed responses and low latency.
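As a rough illustration, the sketch below times a single generation with vLLM. The model name, prompt, and token budget are placeholders chosen for the example, not the exact configuration used in our benchmarks.

```python
# Minimal vLLM latency probe (illustrative settings, not the benchmark config).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # assumes the weights are accessible
params = SamplingParams(temperature=0.0, max_tokens=128)  # greedy decoding, fixed budget

start = time.perf_counter()
outputs = llm.generate(["Explain what an HPC cluster is."], params)
elapsed = time.perf_counter() - start

print(outputs[0].outputs[0].text)
print(f"Generated {len(outputs[0].outputs[0].token_ids)} tokens in {elapsed:.2f} s")
```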

llama.cpp

  • repository: https://github.com/ggerganov/llama.cpp

llama.cpp was primarily designed for Llama models, but it can be applied to other LLMs as well. This optimized backend provides efficient inference on GPUs.
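A comparable probe through the llama-cpp-python bindings could look like the following; the GGUF path is a placeholder, and offloading all layers to the GPU only works if the bindings were built with GPU support.

```python
# Sketch using the llama-cpp-python bindings; the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,  # offload every layer to the GPU (requires a CUDA build)
)

out = llm("Explain what an HPC cluster is.", max_tokens=128, temperature=0.0)
print(out["choices"][0]["text"])
```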

mistral-inference

  • repository: https://github.com/mistralai/mistral-inference

This is the official inference backend for Mistral. It is supposedly optimized for Mistral's architecture and should therefore improve the model's performance. However, our benchmark results do not show any Mistral-specific advantage; llama.cpp seems to perform better.
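For reference, a minimal generation with mistral-inference follows the pattern below, adapted from the upstream README. The model folder and tokenizer path are placeholders, and the exact module layout may differ between releases.

```python
# Sketch of a single mistral-inference generation (paths are placeholders).
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_inference.generate import generate
from mistral_inference.transformer import Transformer

tokenizer = MistralTokenizer.from_file("mistral-7B-Instruct/tokenizer.model.v3")
model = Transformer.from_folder("mistral-7B-Instruct")

request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain what an HPC cluster is.")]
)
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=128,
    temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```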
transformers

  • repository: https://github.com/huggingface/transformers

The Hugging Face Transformers library is one of the most widely used LLM toolkits, if not the most widely used. Easy to use, it supports a wide range of models and backends. One of its main advantages is its quick setup, which enables rapid experimentation across architectures.
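As a quick-setup illustration, the pipeline API loads a model and generates text in a few lines. The model id below is an example, and device_map="auto" assumes the accelerate package is installed so the model is placed on the available GPU(s).

```python
# Quick experimentation with the transformers pipeline API (model id is illustrative).
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model id
    device_map="auto",  # requires accelerate; places the model on available GPU(s)
)

out = pipe("Explain what an HPC cluster is.", max_new_tokens=128, do_sample=False)
print(out[0]["generated_text"])
```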


Hardware description


Inference latency results