How to run LLM models
This tutorial shows how to run an LLM on the UNIL clusters.
Simple test
Set up
For this simple test, we are going to use the `transformers` library from Hugging Face. Run the following commands to set up a suitable Python environment:
module load python
python -m venv venv
source venv/bin/activate
pip install transformers accelerate torch
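Optionally, you can check that the packages were installed correctly with a quick import test:
python -c "import torch, transformers; print(transformers.__version__)"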
If you plan to use an instruct model, you will need a chat template file, which you can download from https://github.com/chujiezheng/chat_templates . For this example, we are going to use the Llama 3 template:
wget https://raw.githubusercontent.com/chujiezheng/chat_templates/refs/heads/main/chat_templates/llama-3-instruct.jinja
Then, create the following Python file, named run_inference.py:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Path to the Llama 3.1 8B Instruct model stored on the cluster
model_hf = '/reference/LLMs/Llama3/llama3_1/Meta-Llama-3.1-8B-Instruct-hf/'

# Load the model (device_map="auto" places it on the available GPU) and its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_hf, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_hf)

# Attach the chat template downloaded above to the tokenizer
with open('llama-3-instruct.jinja', "r") as f:
    chat_template = f.read()
tokenizer.chat_template = chat_template

# Read the user prompt from a plain text file
with open('prompt.txt') as f:
    prompt = f.read()

prompts = [
    [{'role': 'user', 'content': prompt}]
]

# Build the model input from the chat template and move it to the GPU
model_inputs = tokenizer.apply_chat_template(
    prompts,
    return_tensors="pt",
    tokenize=True,
    add_generation_prompt=True  # append the assistant turn header so the model generates a reply, useful in chat mode
).to("cuda")

# Generate at most 400 new tokens
generated_ids = model.generate(
    model_inputs,
    max_new_tokens=400,
)

# Decode and print the answers (the decoded text includes the prompt)
for answer in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(answer)
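Note that batch_decode returns the full sequence, prompt included. If you only want the generated answer, you can decode just the new tokens by replacing the final loop with a variant like this:
new_tokens = generated_ids[:, model_inputs.shape[1]:]  # drop the prompt tokens
for answer in tokenizer.batch_decode(new_tokens, skip_special_tokens=True):
    print(answer)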
This Python code reads a prompt from a text file called prompt.txt and uses the Llama 3.1 8B Instruct model to perform the inference.
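The file prompt.txt can contain any plain-text prompt, for example:
Explain in two sentences what a job scheduler such as SLURM does.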
To run it on the cluster, we can use the following job script, saved for example as job.sh:
#!/bin/bash
#SBATCH -p gpu
#SBATCH --mem 20G
#SBATCH --gres gpu:1
#SBATCH -c 2

module load python   # load the same Python module used to create the virtual environment
source venv/bin/activate
python run_inference.py
Submit the job with sbatch:
sbatch job.sh
The result of the inference will be written in the SLURM output file slurm-xxxx.out, where xxxx is the job ID.
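You can check the state of the job and read the output with the usual SLURM commands, for example:
squeue -u $USER
cat slurm-xxxx.out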