# Performance of LLM backends and models in Curnagl
## Introduction
## Backends and models tested
### Tested Models
#### Llama3
- **Models download**: [Meta Llama 3 models on Hugging Face](https://huggingface.co/meta-llama) (official Meta Llama models; access requires accepting the license on Hugging Face). A download sketch follows the model lists below.
- [**Meta-Llama-3.1-8B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [**Meta-Llama-3.1-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
#### Mistral
- **Models download**: [Mistral models on Hugging Face](https://huggingface.co/mistralai)
- [**Mistral-7B-Instruct-v0.3**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [**Mixtral-8x7B-Instruct-v0.1**](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
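The models above can be fetched from Hugging Face. Below is a minimal download sketch using the `huggingface_hub` Python package; the repository ID is one of the tested models, but the target directory is a hypothetical placeholder that depends on where you store models on the cluster. Gated repositories such as the Meta Llama models additionally require accepting the license on Hugging Face and authenticating with a token.

```python
from huggingface_hub import snapshot_download

# Download a full model snapshot (weights, tokenizer, config) to a local directory.
# For gated models (e.g. meta-llama/*), first accept the license on Hugging Face
# and authenticate, for example with `huggingface-cli login`.
model_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="/scratch/<username>/models/Mistral-7B-Instruct-v0.3",  # hypothetical path
)
print(f"Model downloaded to: {model_path}")
```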
---
### Tested Backends
- [**vLLM repository**](https://github.com/vllm-project/vllm)
The vLLM backend provides efficient GPU memory management and fast, batched token generation, which makes it well suited for running Llama 3 and Mistral models when low latency and high throughput are required. A minimal usage sketch follows this list.
- [**llama.cpp repository**](https://github.com/ggerganov/llama.cpp)
llama.cpp was primarily developed for Llama models, but it can run many other LLM architectures as well. This optimized backend provides efficient inference on GPUs; see the sketch after this list.
- [**mistral-inference repository**](https://github.com/mistralai/mistral-inference)
This is the official inference library from Mistral AI. It is tailored to Mistral's model architecture, which should translate into better performance for these models.
- [**Transformers repository**](https://huggingface.co/docs/transformers)
The Hugging Face Transformers library is one of the most widely used interfaces for running LLMs. It is easy to use, supports a wide range of models and backends, and its quick setup enables fast experimentation across architectures; a minimal sketch follows this list.
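As referenced above, here is a minimal offline-inference sketch with vLLM's Python API. The model name points to one of the tested models; the prompt, sampling settings, and `tensor_parallel_size` value are placeholders to adapt to your job.

```python
from vllm import LLM, SamplingParams

# Load the model; tensor_parallel_size can be increased to shard large models
# (e.g. the 70B variant) across several GPUs of the same node.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what an HPC cluster is."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```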
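llama.cpp itself is a C/C++ project that is usually driven from the command line; from Python it is commonly used through the separate `llama-cpp-python` bindings together with a quantized GGUF model file. A minimal sketch, assuming such a GGUF file is already available locally (the file path below is hypothetical):

```python
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers=-1 offloads all layers to the GPU
# (this requires a build of llama-cpp-python with GPU support enabled).
llm = Llama(
    model_path="/path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical file
    n_gpu_layers=-1,
    n_ctx=4096,
)

result = llm("Explain what an HPC cluster is.", max_tokens=128)
print(result["choices"][0]["text"])
```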
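Finally, a minimal sketch with the Hugging Face Transformers `pipeline` API; the model name and generation settings are only examples.

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline; device_map="auto" places the model on the
# available GPU(s) and bfloat16 halves the memory footprint compared to float32.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator("Explain what an HPC cluster is.", max_new_tokens=128)
print(result[0]["generated_text"])
```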
## Hardware description
## Inference latency results