# Performance of LLM backends and models in Curnagl
## Introduction
## Backends and models tested
### Tested Models
#### Llama3
- **Models download**: [Meta Llama 3 models on Hugging Face](https://huggingface.co/meta-llama) (official Meta Llama models; access requires accepting the license on Hugging Face). A download sketch follows the model lists below.
- [**Meta-Llama-3.1-8B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- [**Meta-Llama-3.1-70B-Instruct**](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)
#### Mistral
- **Models download**: [Mistral models on Hugging Face](https://huggingface.co/mistralai)
- [**Mistral-7B-Instruct-v0.3**](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)
- [**Mixtral-8x7B-Instruct-v0.1**](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1)
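The models above can be fetched from Hugging Face. Below is a minimal download sketch using the `huggingface_hub` Python package; the repository ID is one of the tested models, but the target directory is a hypothetical placeholder that depends on where you store models on the cluster. Gated repositories such as the Meta Llama models additionally require accepting the license on Hugging Face and authenticating with a token.

```python
from huggingface_hub import snapshot_download

# Download a full model snapshot (weights, tokenizer, config) to a local directory.
# For gated models (e.g. meta-llama/*), first accept the license on Hugging Face
# and authenticate, for example with `huggingface-cli login`.
model_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="/scratch/<username>/models/Mistral-7B-Instruct-v0.3",  # hypothetical path
)
print(f"Model downloaded to: {model_path}")
```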
---
### Tested Backends
- [**vLLM repository**](https://github.com/vllm-project/vllm)
The vLLM backend provides efficient GPU memory management and fast, batched token generation, which makes it well suited for running Llama 3 and Mistral models when low latency and high throughput are required. A minimal usage sketch follows this list.
- [**llama.cpp repository**](https://github.com/ggerganov/llama.cpp)
llama.cpp was primarily developed for Llama models, but it can run many other LLM architectures as well. This optimized backend provides efficient inference on GPUs; see the sketch after this list.
- [**mistral-inference repository**](https://github.com/mistralai/mistral-inference)
This is the official inference library from Mistral AI. It is tailored to Mistral's model architecture, which should translate into better performance for these models.
- [**Transformers repository**](https://huggingface.co/docs/transformers)
The Hugging Face Transformers library is one of the most widely used interfaces for running LLMs. It is easy to use, supports a wide range of models and backends, and its quick setup enables fast experimentation across architectures; a minimal sketch follows this list.
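As referenced above, here is a minimal offline-inference sketch with vLLM's Python API. The model name points to one of the tested models; the prompt, sampling settings, and `tensor_parallel_size` value are placeholders to adapt to your job.

```python
from vllm import LLM, SamplingParams

# Load the model; tensor_parallel_size can be increased to shard large models
# (e.g. the 70B variant) across several GPUs of the same node.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what an HPC cluster is."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```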
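llama.cpp itself is a C/C++ project that is usually driven from the command line; from Python it is commonly used through the separate `llama-cpp-python` bindings together with a quantized GGUF model file. A minimal sketch, assuming such a GGUF file is already available locally (the file path below is hypothetical):

```python
from llama_cpp import Llama

# Load a quantized GGUF model; n_gpu_layers=-1 offloads all layers to the GPU
# (this requires a build of llama-cpp-python with GPU support enabled).
llm = Llama(
    model_path="/path/to/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical file
    n_gpu_layers=-1,
    n_ctx=4096,
)

result = llm("Explain what an HPC cluster is.", max_tokens=128)
print(result["choices"][0]["text"])
```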
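Finally, a minimal sketch with the Hugging Face Transformers `pipeline` API; the model name and generation settings are only examples.

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline; device_map="auto" places the model on the
# available GPU(s) and bfloat16 halves the memory footprint compared to float32.
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

result = generator("Explain what an HPC cluster is.", max_new_tokens=128)
print(result[0]["generated_text"])
```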
## Hardware description
## Inference latency results