# Performance of LLM backends and models in Curnagl
TODO:

- Introduction (Cristian)
- Backends and models tested (Margot)
- Hardware description (Margot)
- Inference latency results (Margot and Cristian): create one table per model, replace node names with the GPU card names, and improve the column titles.
## Introduction

## Backends and models tested
### Tested models

#### Llama3
- Official access to Meta Llama3 models
- Access to Llama3 models on Hugging Face: [Meta Llama3 models on HuggingFace](https://huggingface.co/meta-llama)
#### Mistral
- Official access to Mistral models: Mistral models on MistralAI website
- Access to Mistral models on Hugging Face: [Mistral models on HuggingFace](https://huggingface.co/mistralai)
### Tested backends
#### vLLM

The vLLM backend provides efficient memory usage and fast token sampling, which makes it well suited for running Llama3 and Mistral models in environments that require high-speed responses and low latency.
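For reference, the snippet below is a minimal sketch of a single-prompt offline run with vLLM's Python API; the model name and sampling parameters are illustrative and not the exact settings used for the benchmarks.

```python
from vllm import LLM, SamplingParams

# Load the model (checkpoint name is only an example).
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# Offline (batch) generation for a single prompt.
outputs = llm.generate(["Explain what an HPC cluster is in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```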
#### llama.cpp

llama.cpp was primarily developed for Llama models, but it can be applied to other LLMs as well. This optimized backend provides efficient inference on GPUs.
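A comparable run with llama.cpp can be scripted through the llama-cpp-python bindings, as in the sketch below. It assumes a quantized GGUF checkpoint; the file path and offloading settings are placeholders rather than the benchmark configuration.

```python
from llama_cpp import Llama

# Load a quantized GGUF checkpoint (path is a placeholder).
llm = Llama(
    model_path="/path/to/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,       # context window
)

# Simple completion call.
result = llm("Q: What is an HPC cluster? A:", max_tokens=64, temperature=0.0)
print(result["choices"][0]["text"])
```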
#### Hugging Face Transformers

The Hugging Face Transformers library is, if not the most widely used LLM library, one of the most widely used. It is easy to use, supports a wide range of models and backends, and its quick set-up enables fast experimentation across architectures.
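As an illustration, the pipeline API shown below is one quick way to get a model running with Transformers; the checkpoint name and generation parameters are only examples.

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline (checkpoint name is only an example).
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)

output = generator("Explain what an HPC cluster is in one sentence.", max_new_tokens=64)
print(output[0]["generated_text"])
```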
#### Mistral inference

**[Mistral inference repository](https://github.com/mistralai/mistral-inference)**

This is the official inference backend for Mistral. It is supposed to be optimized for Mistral's architecture and therefore to improve the model's performance. However, our benchmark results do not show any Mistral-specific advantage, as llama.cpp seems to perform better.
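For completeness, the sketch below follows the usage pattern documented in the mistral-inference repository (v1.x API, which may differ between releases); the model path is a placeholder for a locally downloaded checkpoint.

```python
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

model_path = "/path/to/mistral-7B-Instruct-v0.3"  # placeholder for a downloaded checkpoint

# Load the tokenizer and model weights from the checkpoint folder.
tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model.v3")
model = Transformer.from_folder(model_path)

# Encode a chat request and generate a short completion.
request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain what an HPC cluster is in one sentence.")]
)
tokens = tokenizer.encode_chat_completion(request).tokens
out_tokens, _ = generate(
    [tokens], model, max_tokens=64, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))
```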