# Performance of LLM backends and models in Curnagl
TODO:

- Introduction (Cristian)
- Backends and models tested (Margot)
- Hardware description (Margot)
- Inference latency results (Margot and Cristian): create one table per model, replace node names with the GPU card names, and improve the column titles.
## Introduction

## Backends and models tested
### Tested models

#### Llama3
- Official access to Meta Llama3 models
- Access to Llama3 models on Hugging Face: [Meta Llama3 models on HuggingFace](https://huggingface.co/meta-llama)
#### Mistral
- Official access to Mistral models: Mistral models on MistralAI website
- Access to Mistral models on Hugging Face: [Mistral models on HuggingFace](https://huggingface.co/mistralai)
### Tested backends
#### vLLM

The vLLM backend provides efficient memory usage and fast token sampling, which makes it well suited for running Llama3 and Mistral models in environments that require high-speed responses and low latency.
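For reference, the snippet below is a minimal sketch of a single-prompt offline run with vLLM's Python API; the model name and sampling parameters are illustrative and not the exact settings used for the benchmarks.

```python
from vllm import LLM, SamplingParams

# Load the model (checkpoint name is only an example).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Offline (batch) generation for a single prompt.
outputs = llm.generate(["Explain what an HPC cluster is in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```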
#### llama.cpp

llama.cpp was primarily developed for Llama models, but it can be applied to other LLMs as well. This optimized backend provides efficient inference on GPUs.
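A comparable run with llama.cpp can be scripted through the llama-cpp-python bindings, as in the sketch below. It assumes a quantized GGUF checkpoint; the file path and offloading settings are placeholders rather than the benchmark configuration.

```python
from llama_cpp import Llama

# Load a quantized GGUF checkpoint (path is a placeholder).
llm = Llama(
    model_path="/path/to/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window
)

# Simple completion call.
result = llm("Q: What is an HPC cluster? A:", max_tokens=64, temperature=0.0)
print(result["choices"][0]["text"])
```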
#### Hugging Face Transformers

The Hugging Face Transformers library is, if not the most widely used LLM library, one of the most widely used. It is easy to use, supports a wide range of models and backends, and its quick set-up enables fast experimentation across architectures.
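As an illustration, the pipeline API shown below is one quick way to get a model running with Transformers; the checkpoint name and generation parameters are only examples.

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline (checkpoint name is only an example).
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

output = generator("Explain what an HPC cluster is in one sentence.", max_new_tokens=64)
print(output[0]["generated_text"])
```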
#### Mistral inference

**[Mistral inference repository](https://github.com/mistralai/mistral-inference)**

This is the official inference backend for Mistral. It is supposed to be optimized for Mistral's architecture and therefore to improve the model's performance. However, our benchmark results do not show any Mistral-specific advantage, as llama.cpp seems to perform better.
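For completeness, the sketch below follows the usage pattern documented in the mistral-inference repository (v1.x API, which may differ between releases); the model path is a placeholder for a locally downloaded checkpoint.

```python
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

model_path = "/path/to/mistral-7B-Instruct-v0.3"  # placeholder for a downloaded checkpoint

# Load the tokenizer and model weights from the checkpoint folder.
tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)

# Encode a chat request and generate a short completion.
request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain what an HPC cluster is in one sentence.")]
)
tokens = tokenizer.encode_chat_completion(request).tokens
out_tokens, _ = generate(
    [tokens], model, max_tokens=64, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```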