
Deep Learning with GPUs

The training phase of a deep learning model can be very time consuming. To accelerate it you may want to use GPUs, and for that you need to install the deep learning packages, such as Keras or PyTorch, properly. This page briefly documents how to install some well-known deep learning packages in Python and R. If you encounter any problem during the installation, or if you need to install other deep learning packages (in Python, R or another programming language), please send an email to helpdesk@unil.ch with the subject DCSR: Deep Learning package installation, and we will try to help you.

Keras

We will install TensorFlow 2's implementation of the Keras API (tf.keras); see https://keras.io/about/

To install the packages in your home directory:

cd $HOME

Log into a GPU node:

Sinteractive -m 4G -G 1

Check that the GPU is visible:

nvidia-smi

Load the required modules (compiler, CUDA, cuDNN and Python):

module purge
module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

Create a virtual environment. Here we will call it "venv_tensorflow_gpu", but you may choose another name:

python -m venv venv_tensorflow_gpu

Activate the virtual environment:

source venv_tensorflow_gpu/bin/activate

Install TensorFlow (which includes Keras):

pip install tensorflow

Check that TensorFlow was properly installed:

python -c 'import tensorflow; print(tensorflow.__version__)'

There might be a warning message and the output should be something like "2.5.0".

You may install extra packages that your deep learning code will use. For example:

pip install numpy
pip install scikit-learn
pip install pandas
pip install matplotlib

Deactivate your virtual environment and logout from the GPU node:

deactivate
exit

Comment

If you want to make your installation more reproducible, you may proceed as follows:

1. Create a file called "requirements.txt" and write the package names inside. You may also specify the package versions. For example:

tensorflow==2.5.0
numpy==1.19.5
scikit-learn==0.24.2
pandas==1.2.5
matplotlib==3.4.2

2. Proceed as above, but instead of installing the packages individually, type 

pip install -r requirements.txt

Run your deep learning code

To test your deep learning code (maximum 1h), say "my_deep_learning_code.py", you may use the interactive mode:

cd /scratch/username/

Sinteractive -p interactive -m 4G -G 1

module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

source $HOME/venv_tensorflow_gpu/bin/activate

Run your code:

python my_deep_learning_code.py

or copy/paste your code inside a Python session:

python

copy/paste your code. For example:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

etc
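If you just want a quick end-to-end test, here is a minimal, self-contained sketch you could paste instead; the toy data, layer sizes and training settings are arbitrary illustrations, not part of the official example:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Toy dataset: 1000 random samples with 20 features and 3 classes
x_train = np.random.random((1000, 20))
y_train = to_categorical(np.random.randint(3, size=(1000,)), num_classes=3)

# Small fully connected network, just to exercise the GPU
model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32)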

Comment

To confirm that TensorFlow is using the GPU:

import tensorflow as tf
tf.config.list_physical_devices("GPU")

or to obtain the number of GPUs available:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices("GPU")))

Once you have finished testing your code, you must close your interactive session (by typing exit), and then run it on the cluster by using an sbatch script, say "my_sbatch_script.sh":

#!/bin/bash -l
#SBATCH --account your_account_id
#SBATCH --mail-type ALL
#SBATCH --mail-user firstname.surname@unil.ch

#SBATCH --chdir /scratch/username/
#SBATCH --job-name my_deep_learning_job
#SBATCH --output my_deep_learning_job.out

#SBATCH --partition gpu
#SBATCH --gres gpu:1
#SBATCH --gres-flags enforce-binding
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 10G
#SBATCH --time 01:00:00

module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

source $HOME/venv_tensorflow_gpu/bin/activate

python /PATH_TO_YOUR_CODE/my_deep_learning_code.py

To launch your job:

cd $HOME/PATH_TO_YOUR_SBATCH_SCRIPT/

sbatch my_sbatch_script.sh

Multi-GPU parallelism

If you want to use a single GPU, you do not need to tell Keras to use the GPU. Indeed, if a GPU is available, Keras will use it automatically.

On the other hand, if you want to use 2 (or more) GPUs on the same node, you need to use a dedicated TensorFlow API, tf.distribute.MirroredStrategy, in your Python code "my_deep_learning_code.py"; see the Keras documentation: https://keras.io/guides/distributed_training/ If no devices are specified in the constructor argument of the strategy, it will use all the available GPUs. If no GPUs are found, it will use the available CPUs.

This strategy implements single-machine, multi-GPU data parallelism. It works as follows: the batch is split into multiple sub-batches, a copy of the model processes each sub-batch on a dedicated GPU, and the results are finally concatenated (on the CPU) into one big batch. For example, if your batch_size is 64 and you use 2 GPUs, the input data is split into 2 sub-batches of 32 samples, each sub-batch is processed on one GPU, and the full batch of 64 processed samples is returned. This typically yields quasi-linear speedup.
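As an illustration, here is a minimal sketch of how the strategy is typically used inside "my_deep_learning_code.py"; the model and its settings are arbitrary placeholders:

import tensorflow as tf

# With no constructor argument, MirroredStrategy uses all visible GPUs
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The model must be built and compiled inside the strategy scope so that
# its variables are mirrored across the GPUs
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...) is then called as usual, outside the scope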

And the sbatch script must contain the line:

#SBATCH --gres gpu:2

TensorBoard

To use TensorBoard on Curnagl, you need to modify your code as explained in https://keras.io/api/callbacks/tensorboard/.
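For example, a minimal sketch of the required modification; the "./logs" directory name is just an illustration (any path will do, as long as you point TensorBoard at it later):

import tensorflow as tf

# Create a TensorBoard callback that writes its event files into ./logs
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")

# Pass the callback to model.fit(), e.g.:
# model.fit(x_train, y_train, epochs=5, callbacks=[tensorboard_callback])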

After your TensorBoard "logs" directory has been created, you need to proceed as follows:

[/scratch/pjacquet] Sinteractive -m 4G -G 1
Sinteractive is running with the following options:

--gres=gpu:1 -c 1 --mem 4G -J interactive -p interactive -t 1:00:00 --x11

salloc: Granted job allocation 2466209
salloc: Waiting for resource configuration
salloc: Nodes dnagpu001 are ready for job

You need to remember the GPU node's name dnagpuXXX. Here it is dnagpu001.

Then

[/scratch/pjacquet] module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

[/scratch/pjacquet] source $HOME/venv_tensorflow_gpu/bin/activate

(venv_tensorflow_gpu) [/scratch/pjacquet] ls
logs

(venv_tensorflow_gpu) [/scratch/pjacquet] tensorboard --logdir=./logs --port=6006

You will see the following message:

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit)

On your laptop, you need to type:

ssh -J curnagl.dcsr.unil.ch -L 6006:localhost:6006 dnagpuXXX

where dnagpuXXX is the GPU node's name you used to launch TensorBoard (above it was dnagpu001).

Finally, on your laptop, you may use any web browser (e.g. Chrome) to open the page http://localhost:6006 (copy/paste this link into your web browser). You should then see TensorBoard with the information located in the "logs" folder.

TensorFlow

The installation of TensorFlow 2 is the same as for Keras, so please look at the above Keras installation.

Warning

In TensorFlow 1.15 and previous versions, the packages for CPU and GPU are offered separately:

pip install tensorflow==1.15 # CPU
pip install tensorflow-gpu==1.15 # GPU

PyTorch

To install the packages in your home directory:

cd $HOME

Log into a GPU node:

Sinteractive -m 4G -G 1

Check that the GPU is visible:

nvidia-smi

Load the required modules (compiler, CUDA, cuDNN and Python):

module purge
module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

Create a virtual environment. Here we will call it "venv_pytorch_gpu", but you may choose another name:

python -m venv venv_pytorch_gpu

Activate the virtual environment:

source venv_pytorch_gpu/bin/activate

Install PyTorch:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Check that PyTorch was properly installed:

python -c 'import torch; print(torch.__version__)'

There might be a warning message and the output should be something like "1.9.1+cu111".
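You may also check, from Python, that PyTorch actually sees the GPU; a quick sketch (run it on the GPU node, with the virtual environment activated):

import torch

print(torch.cuda.is_available())          # should print True on a GPU node
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU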

You may install extra packages that your deep learning code will use. For example:

pip install scikit-learn
pip install pandas
pip install matplotlib

Deactivate your virtual environment and logout from the GPU node:

deactivate
exit

Comment

If you want to make your installation more reproducible, you may proceed as follows:

1. Create a file called "requirements.txt" and write the package names inside. You may also specify the package versions. For example:

torch==1.9.1+cu111
torchvision==0.10.1+cu111
scikit-learn==0.24.2
pandas==1.2.4
matplotlib==3.4.2

2. Proceed as above, but instead of installing the packages individually, type 

pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Run your deep learning code

To test your deep learning code (maximum 1h), say "my_deep_learning_code.py", you may use the interactive mode:

cd /scratch/username/

Sinteractive -m 4G -G 1

module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

source $HOME/venv_pytorch_gpu/bin/activate

Run your code:

python my_deep_learning_code.py

or copy/paste your code inside a Python session:

python

copy/paste your code
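If you do not have a script at hand, here is a minimal, self-contained sketch you could paste to check that training runs on the GPU; the toy model, data and hyperparameters are arbitrary illustrations:

import torch
import torch.nn as nn

# Use the GPU if one is visible, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Toy model and random data, just to exercise the GPU
model = nn.Linear(20, 3).to(device)
x = torch.randn(64, 20, device=device)
y = torch.randint(0, 3, (64,), device=device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(5):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    print(step, loss.item())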

Once you have finished testing your code, you must close your interactive session (by typing exit), and then run it on the cluster by using an sbatch script, say "my_sbatch_script.sh":

#!/bin/bash -l
#SBATCH --account your_account_id
#SBATCH --mail-type ALL
#SBATCH --mail-user firstname.surname@unil.ch

#SBATCH --chdir /scratch/username/
#SBATCH --job-name my_deep_learning_job
#SBATCH --output my_deep_learning_job.out

#SBATCH --partition gpu
#SBATCH --gres gpu:1
#SBATCH --gres-flags enforce-binding
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 10G
#SBATCH --time 01:00:00

module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13

source $HOME/venv_pytorch_gpu/bin/activate

python /PATH_TO_YOUR_CODE/my_deep_learning_code.py

To launch your job:

cd $HOME/PATH_TO_YOUR_SBATCH_SCRIPT/

sbatch my_sbatch_script.sh

TensorBoard

You may use TensorBoard with PyTorch by following the documentation

https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

and by slightly adapting the instructions above (see TensorBoard in the Keras section).
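For example, a minimal sketch using torch.utils.tensorboard; it assumes the tensorboard package is installed in your virtual environment (pip install tensorboard), and the "./logs" directory and the logged values are arbitrary illustrations:

from torch.utils.tensorboard import SummaryWriter

# Write a dummy scalar curve into ./logs; open it later with
# "tensorboard --logdir=./logs --port=6006" as in the Keras section above
writer = SummaryWriter(log_dir="./logs")
for step in range(100):
    writer.add_scalar("dummy/loss", 1.0 / (step + 1), step)
writer.close()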

R Keras

R Keras is an interface to Python Keras. In simple terms, the Keras R package allows you to enjoy the benefits of R programming while having access to the capabilities of the Python Keras package.

To install the packages in your home directory:

cd $HOME

Log into a GPU node:

Sinteractive -m 4G -G 1

Check that the GPU is visible:

nvidia-smi

Load the required modules (compiler, CUDA, cuDNN, Python and R):

module purge
module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13 r/4.2.1

Launch an R session:

R

Install the R Keras package, using a Python virtual environment (here called "venv_r-tensorflow_gpu"):

install.packages("keras")

Would you like to use a personal library instead? (yes/No/cancel) yes

Would you like to create a personal library to install packages into? (yes/No/cancel) yes

And select Switzerland for the CRAN mirror.

library(keras)
library(tensorflow)

install_tensorflow(version = "2.5.0-gpu", method = "virtualenv", envname = "venv_r-tensorflow_gpu")

q()

This will install Keras and TensorFlow.

Comment

If you receive an error message concerning "conda", you may need to look in your .bashrc file for a conda init block and comment out that part.

Run your deep learning code

To test your deep learning code (maximum 1h), say "my_deep_learning_code.R", you may use the interactive mode:

Sinteractive -m 4G -G 1

module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13 r/4.2.1

R

library(keras)
library(tensorflow)

copy/paste your code

Comment

To confirm that TensorFlow is using the GPU:

tf$config$list_physical_devices("GPU")

or to obtain the number of GPUs available:

print(length(tf$config$list_physical_devices("GPU")))

Once you have finished testing your code, you must quit R (by typing q()) and close your interactive session (by typing exit), and then run it on the cluster by using an sbatch script, say "my_sbatch_script.sh":

#!/bin/bash -l
#SBATCH --account your_account_id
#SBATCH --mail-type ALL
#SBATCH --mail-user firstname.surname@unil.ch

#SBATCH --chdir /scratch/username/
#SBATCH --job-name my_deep_learning_job
#SBATCH --output my_deep_learning_job.out

#SBATCH --partition gpu
#SBATCH --gres gpu:1
#SBATCH --gres-flags enforce-binding
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --mem 10G
#SBATCH --time 01:00:00

module load gcc/10.4.0 cuda/11.6.2 cudnn/8.4.0.27-11.6 python/3.9.13 r/4.2.1

Rscript /PATH_TO_YOUR_CODE/my_deep_learning_code.R

To launch your job:

cd $HOME/PATH_TO_YOUR_SBATCH_SCRIPT/

sbatch my_sbatch_script.sh

Multi-GPU parallelism

See the explanation under the Python Keras installation.