
# Performance of LLM backends and models in Curnagl

TODO

  • Introduction (Cristian)
  • Backends and models tested (Margot)
  • Hardware description (Margot)
  • Inference latency results (Margot and Cristian) -> create one table per model and replace node names with GPU card names; we can also improve the column titles.

## Introduction


## Backends and models tested

#### Tested models

##### Llama3

Official access to Meta Llama models: [Meta Llama3 models on Hugging Face](https://huggingface.co/meta-llama)

- [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
##### Mistral

Official access to Mistral models: Mistral models on MistralAI website

Access to Mistral models on Hugging Face: [Mistral models on Hugging Face](https://huggingface.co/mistralai)
- [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)

#### Tested backends
[**vLLM repository**](https://github.com/vllm-project/vllm)
The vLLM backend provides efficient memory usage (via its PagedAttention KV-cache management) and fast token sampling. It is well suited for running Llama3 and Mistral models in environments that require high throughput and low latency.
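
As an illustration, here is a minimal sketch of offline inference with vLLM; the model id is one of the models tested above, and the sampling values are illustrative placeholders rather than the exact benchmark settings.

```python
# Minimal sketch: offline batched inference with vLLM.
from vllm import LLM, SamplingParams

# Load the model (fetched from Hugging Face on first use).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=128)  # illustrative values

# vLLM batches prompts internally, which is where much of its speed comes from.
outputs = llm.generate(["What is the capital of Switzerland?"], params)
for output in outputs:
    print(output.outputs[0].text)
```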
[**llama.cpp repository**](https://github.com/ggerganov/llama.cpp)
llama.cpp was originally developed for the Llama family of models, but it can run many other LLMs. This optimized C/C++ backend provides efficient inference on both CPUs and GPUs.
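
llama.cpp itself is a C/C++ library with a command-line interface; for consistency with the other examples, the sketch below uses the llama-cpp-python bindings instead, which is an assumption of convenience. The GGUF file path and the GPU-offload setting are placeholders.

```python
# Minimal sketch via the llama-cpp-python bindings around llama.cpp.
from llama_cpp import Llama

llm = Llama(
    model_path="models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # placeholder: quantized GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU
)
out = llm("Q: What is the capital of Switzerland? A:", max_tokens=64)
print(out["choices"][0]["text"])
```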
**[mistral-inference repository](https://github.com/mistralai/mistral-inference)**
This is the official inference backend for Mistral models. It is (supposedly) optimized for Mistral's architecture, which should improve the models' performance. However, our benchmark results do not show any Mistral-specific advantage: llama.cpp appears to perform better.
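
For reference, a minimal sketch of the mistral-inference chat-completion flow, following the project's README; the local weights path and tokenizer file name are placeholders, and module paths may differ across versions.

```python
# Hedged sketch of chat completion with mistral-inference.
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

MODELS_PATH = "mistral_models/7B-Instruct-v0.3"  # placeholder: local download of the weights

tokenizer = MistralTokenizer.from_file(f"{MODELS_PATH}/tokenizer.model.v3")
model = Transformer.from_folder(MODELS_PATH)

request = ChatCompletionRequest(
    messages=[UserMessage(content="What is the capital of Switzerland?")]
)
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens], model, max_tokens=64, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```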
[**Transformers repository**](https://huggingface.co/docs/transformers)
It is one of the most widely used LLM toolkits, if not the most widely used. Easy to use, the Hugging Face Transformers library supports a wide range of models and backends. One of its main advantages is its quick setup, which enables fast experimentation across architectures.
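
A minimal sketch with the Transformers pipeline API illustrates that quick setup; the model id is one of the models tested above, and `device_map="auto"` assumes the accelerate package is installed.

```python
# Minimal sketch: text generation with the Hugging Face Transformers pipeline.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    device_map="auto",  # spread the model over the available GPUs
)
result = pipe("What is the capital of Switzerland?", max_new_tokens=64)
print(result[0]["generated_text"])
```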
---

## Hardware description


## Inference latency results