Performance of LLM backends and models in Curnagl

TODO

  • Introduction (Cristian)
  • Backends and models tested (Margot)
  • Hardware description (Margot)
  • Inference latency results (Margot and Cristian) -> create one table per model, replace node names with the GPU card names, and improve the column titles.

Introduction

 

Backends and models tested

# Summary of LLM Models and Backends

## Tested Models

### Llama3
- **Models download**: [Meta Llama3 models on Hugging Face](https://huggingface.co/meta-llama) (official Meta Llama models; access must be requested on Hugging Face). A download sketch is given after the model lists below.
- [**Meta-Llama-3.1-8B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [**Meta-Llama-3.1-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)

### Mistral
- **Models download**: [Mistral models on Hugging Face](https://huggingface.co/mistralai)
- [**mistral-7B-Instruct-v0.3**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [**Mixtral-8x7B-Instruct-v0.1**](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
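
As an illustration of the download step referenced above, a model snapshot can be fetched with the `huggingface_hub` package; the Llama and Mistral repositories are gated, so access must first be requested on Hugging Face and a token configured (for example with `huggingface-cli login`). The target directory below is a placeholder.

```python
from huggingface_hub import snapshot_download

# Example only: fetch one model snapshot into a local directory on the cluster.
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="/scratch/<user>/models/Llama-3.1-8B-Instruct",  # placeholder path
)
```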

---

## Tested Backends

- [**vLLM repository**](https://github.com/vllm-project/vllm)
The vLLM backend provides efficient memory usage and fast token sampling, which makes it well suited for testing the Llama3 and Mistral models in settings that require high-speed responses and low latency (see the vLLM sketch after this list).

- [**llama.cpp repository**](https://github.com/ggerganov/llama.cpp)
llama.cpp was originally developed for the Llama models, but it can also run many other LLMs. This optimized backend provides efficient inference on GPUs (see the llama.cpp sketch after this list).

- [**mistral-inference repository**](https://github.com/mistralai/mistral-inference)
This is the official inference backend for Mistral. It is, at least in principle, optimized for Mistral's architecture, which should improve model performance (see the mistral-inference sketch after this list).

- [**Transformers repository**](https://huggingface.co/docs/transformers)
The Hugging Face Transformers library is one of the most widely used LLM frameworks, if not the most widely used. It is easy to use, supports a wide range of models and backends, and its quick setup enables fast experimentation across architectures (see the Transformers sketch after this list).
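
A minimal vLLM generation script might look like the sketch below; the model name, prompt, and sampling settings are placeholders, and on the cluster the weights would typically be loaded from a local directory rather than downloaded at run time.

```python
from vllm import LLM, SamplingParams

# Placeholder model; point `model` at a local directory on the cluster if the
# weights have already been downloaded.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=128)

# Greedy generation for a single prompt; vLLM also accepts a list of prompts
# and batches them internally.
outputs = llm.generate(["What is the capital of Switzerland?"], params)
print(outputs[0].outputs[0].text)
```

For the 70B model, the `tensor_parallel_size` argument of `LLM` can be used to shard the weights across several GPUs.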
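
For llama.cpp, the sketch below uses the llama-cpp-python bindings so that all examples stay in Python (llama.cpp itself also ships command-line tools); the GGUF file path, context size, and offloading settings are assumptions.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=-1,  # offload all layers to the GPU (requires a GPU-enabled build)
    n_ctx=4096,       # context window
)

out = llm("Q: What is the capital of Switzerland? A:", max_tokens=64)
print(out["choices"][0]["text"])
```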
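
For mistral-inference, the sketch below mirrors the usage pattern documented in its repository; the model directory is a placeholder, and module and helper names may differ between versions of the package.

```python
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

model_path = "/path/to/mistral-7B-Instruct-v0.3"  # placeholder directory

tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)

request = ChatCompletionRequest(
    messages=[UserMessage(content="What is the capital of Switzerland?")]
)
tokens = tokenizer.encode_chat_completion(request).tokens

# Generate a short completion and decode it back to text.
out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=64,
    temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```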
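
With Hugging Face Transformers, a text-generation pipeline is usually enough for a first test; the model name, dtype, and token budget below are illustrative.

```python
import torch
from transformers import pipeline

# Placeholder model and settings; on the cluster the weights are usually
# loaded from a local path instead of the Hugging Face Hub.
pipe = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = pipe("What is the capital of Switzerland?", max_new_tokens=64)
print(result[0]["generated_text"])
```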

 

Hardware description

 

Inference latency results
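
One simple way to obtain such latency figures is to time a full generation and derive a tokens-per-second rate, as in the sketch below; the model name and token budget are placeholders, and the same loop can be repeated for each backend listed above.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the same timing loop applies to any of the models above.
name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("What is the capital of Switzerland?", return_tensors="pt").to(model.device)

# Time one full generation and report the throughput.
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

generated = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated} new tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tokens/s")
```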