High performance computing - HPC

This service provides access to UNIL's high-performance computing (HPC) clusters for processing non-sensitive research data.

Getting Started

DCSR? Kesako?

The full name is the Division de Calcul et Soutien à la Recherche / Computing and Research Support unit

The mission of the DCSR is to supply the University of Lausanne with compute and storage capabilities for all areas of research.

As well as managing compute and storage systems, we also provide user support.

The official DCSR homepage is at: https://www.unil.ch/ci/dcsr-en

Getting Started

How to access the clusters

The DCSR maintains a general-purpose cluster (Curnagl), which is described here. Researchers needing to process sensitive data must use the air-gapped cluster Jura.

There are 3 requirements to be able to connect to Curnagl:

  1. To be part of a PI project
  2. To be on the UNIL network (either physically or using the UNIL VPN if you work remotely)
  3. To have an SSH client

Step 1: Be part of a PI project

To access the clusters, your PI will first need to request resources via: https://conference.unil.ch/research-resource-requests/. Then the PI must add you as a member of one of their projects. Your access should be granted within 24 hours.

Step 2: Activate the UNIL VPN

Unless you are physically within the UNIL network you need to activate the UNIL VPN (Crypto). Documentation to install and run it can be found here.

Step 3: Open an SSH client

On Linux and macOS, an SSH client is available by default; you simply need to open a terminal.

Windows users can either use PowerShell if they are on Windows 10, or install a third-party client such as PuTTY or MobaXterm.

Step 4: Log into the cluster

ssh -X <username>@curnagl.dcsr.unil.ch

where <username> is your UNIL username. You will be prompted for your UNIL password.

Note: we strongly recommend that you set up SSH keys to connect to the clusters and that you protect your SSH keys with a passphrase. See Part 3 of http://wally-head.unil.ch/courses/pdf/linux_intro.pdf for details.
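
As a quick sketch (the full procedure is described on the SSH connection page further below), a key pair with a passphrase can be created and its public part copied to the cluster as follows:

ssh-keygen -t ed25519
ssh-copy-id -i ~/.ssh/id_ed25519 <username>@curnagl.dcsr.unil.ch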

More details regarding the different clients are available in this documentation.

Getting Started

I'm a PI and would like to use the clusters - what do I do?

It's easy! Please fill in the request form at https://conference.unil.ch/research-resource-requests/ and we'll get back in touch with you as soon as possible.

Help!

How do I ask for help?

Before asking for help please take the time to check that your question hasn't already been answered in our FAQ.

To contact us, please send an e-mail to the UNIL Helpdesk at helpdesk@unil.ch, starting the subject line with DCSR. For example:

From:    user.lambda@unil.ch

To:      helpdesk@unil.ch

Subject: DCSR Cannot run CowMod on Curnagl

Dear DCSR,
I am unable to run the CowMod code on Curnagl - please see job number 1234567 for example.

The error message is "No grass left in field - please move to alpage"

You can find my input in /users/ulambda/CowMod/tests/hay/

To reproduce the issue on the command line the following recipe works (or rather doesn't work)

module load CowMod
cd /users/ulambda/CowMod/tests/hay/
CowMod --input=Feedtest 

Thanks

Dr Lambda

It helps us if you can provide all relevant information including how we can reproduce the problem and a Job ID if you submitted your task via the batch system.

Once we have analysed your problem we will get in touch with you.

Help!

How do I recover deleted files?

This depends on where the file was and when it was created and deleted.

/scratch

There are no backups and no snapshots, so the file is gone forever.

/users

If it was in your home directory /users/<username>  then you can recover files from up to 7 days ago using the built-in snapshots by navigating to the snapshot directory as follows:

[ulambda@login ~]$ pwd
/users/ulambda

[ulambda@login ~]$ date
Tue Jun  1 13:59:28 CEST 2021

[ulambda@login ~]$ cd /users/.snapshots/

[ulambda@login .snapshots]$ ls
2021-05-26  2021-05-27  2021-05-28  2021-05-29  2021-05-30  2021-05-31  2021-06-01

[ulambda@login .snapshots]$ cd 2021-05-31/ulambda

[ulambda@login ]$ pwd
/users/.snapshots/2021-05-31/ulambda

[ulambda@login ]$ ls
..
my_deleted_file_from_yesterday
..
..

The snapshots are taken at around 3 am, so if you created a file in the morning and deleted it the same afternoon, we cannot help.

Beyond 7 days the file is lost forever.

Infrastructure and Resources

Curnagl

Kesako?

Curnagl (Romansh), or Chocard à bec jaune in French, is a sociable bird known for its acrobatic exploits and is found throughout the Alpine region. More information is available at https://www.vogelwarte.ch/fr/oiseaux/les-oiseaux-de-suisse/chocard-a-bec-jaune

It's also the name of our new compute cluster which will replace the Wally and Axiom clusters.

If you experience unexpected behaviour or need assistance please contact us via helpdesk@unil.ch starting the mail subject with DCSR Curnagl

How to connect

The login node is curnagl.dcsr.unil.ch

For full details on how to connect using SSH, please read the documentation.

Before connecting we recommend that you add the host's key to your list of known hosts:

echo "curnagl.dcsr.unil.ch ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCunvgFAN/X/8b1FEIxy8p3u9jgfF0NgCl7CX4ZmqlhaYis2p7AQ34foIXemaw2wT+Pq1V9dCUh18mWXnDsjGrg=" >> ~/.ssh/known_hosts

You can also type "yes" during the first connection to accept the host key but this is less secure.

Please be aware that you must be connected to the VPN if you are not on the campus network.

Then simply run ssh username@curnagl.dcsr.unil.ch, where username is your UNIL account name.

The login node must not be used for any form of compute or memory intensive task apart from software compilation and data transfer. Any such tasks will be killed without warning.

Hardware

Compute

The cluster is composed of 72 compute nodes, of which eight have GPUs. All nodes have the same 24-core processors (two per node).

Number of nodes | Memory  | CPU                | GPU
52              | 512 GB  | 2 x AMD Epyc2 7402 | -
12              | 1024 GB | 2 x AMD Epyc2 7402 | -
8               | 512 GB  | 2 x AMD Epyc2 7402 | 2 x NVIDIA A100

Network

The nodes are connected with both HDR Infiniband and 100 Gb Ethernet. The Infiniband is the primary interconnect for storage and inter-node communication.

Partitions 

There are 3 main partitions on the cluster:

interactive

The interactive partition allows rapid access to resources but comes with a number of restrictions, notably on the resources and run time that can be requested together. For example:

CPU cores requested | Memory requested (GB) | GPUs requested | Run time allowed
4                   | 32                    | 1              | 8 hours
8                   | 64                    | 1              | 4 hours
16                  | 128                   | 1              | 2 hours
32                  | 256                   | 1              | 1 hour

We recommend that users access this using the Sinteractive command. This partition should also be used for compiling codes.
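
For example, a session matching the first row of the table above could be requested as follows (the values are illustrative; see the Sinteractive section later in this documentation for the full list of options):

Sinteractive -c 4 -m 32G -G 1 -t 08:00:00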

This partition can also be accessed using the following sbatch directive:

#SBATCH -p interactive 

Note on GPUs in the interactive partition

There is one GPU node in the interactive partition. In order to allow multiple users to work at the same time, its A100 cards have been partitioned into two instances, each with 20 GB of memory, for a total of 4 GPU instances.

The maximum time limit when requesting a GPU is 8 hours, with the CPU and memory limits above still applying.

For longer jobs and to have whole A100 GPUs please submit batch jobs to the gpu partition.

Please do not block resources if you are not using them as this prevents other people from working.

If you request too many resources then you will see the following error:

salloc: error: QOSMaxCpuMinutesPerJobLimit
salloc: error: Job submit/allocate failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

Please reduce either the time or the CPU / memory / GPU resources requested.

cpu

This is the main partition and includes the majority of the compute nodes. Interactive jobs are not permitted. The partition is configured to prevent long running jobs from using all available resources and to allow multi-node jobs to start within a reasonable delay.

The limits are:

Normal jobs - 3 days

Short jobs - 12 hours

Normal jobs are restricted to ~2/3 of the resources, which prevents the cluster from being blocked by long-running jobs.

In exceptional cases wall time extensions may be granted but for this you need to contact us with a justification before submitting your jobs!

The cpu partition is the default, so there is no need to specify it, but if you wish to do so then use the following sbatch directive:

#SBATCH -p cpu

gpu

This contains the GPU equipped nodes. 

To request resources in the gpu partition please use the following sbatch directive:

#SBATCH -p gpu

The limits are:

Normal jobs - 1 day

Short jobs - 6 hours

Normal jobs are restricted to ~2/3 of the resources, which prevents the cluster from being blocked by long-running jobs.

GPUs must be requested explicitly with:

--gres=gpu:N

where N is 1 or 2.
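
As a minimal sketch, a single-GPU batch job in this partition would include directives along these lines (the time value is illustrative):

#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --time 06:00:00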

Software

The DCSR software stack is loaded by default so when you connect you will see the following:

$ module avail

 /dcsrsoft/spack/hetre/v1.1/spack/share/spack/lmod/Zen2-IB-test/linux-rhel8-x86_64/Core 
   cmake/3.20.0    gcc/9.3.0     mpfr/3.1.6
   cuda/11.2.2     git/2.31.0    xz/5.2.5

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

To see more packages, load a compiler (gcc or intel):

$ module load gcc

$ module avail

 /dcsrsoft/spack/hetre/v1.1/spack/share/spack/lmod/Zen2-IB-test/linux-rhel8-x86_64/gcc/9.3.0 
   admixtools/7.0.1         gsl/2.6                       python/2.7.18
   bamaddrg/0.1             htslib/1.10.2                 python/3.8.8       (D)
   bamtools/2.5.1           intel-tbb/2020.3              qtltools/1.3.1
   bcftools/1.10.2          julia/1.6.0                   r/4.0.4
   bedtools2/2.29.2         maven/3.6.3                   rsem/1.3.1
   blast-plus/2.11.0        miniconda3/4.9.2              star/2.7.6a
   bowtie2/2.4.2            mvapich2/2.3.5                stream/5.10-openmp
   cmake/3.20.0      (D)    nlopt/2.6.1                   stream/5.10        (D)
   eigen/3.3.9              octave/6.2.0                  tskit/0.3.1
   fftw/3.3.9               openblas/0.3.14-openmp        xz/5.2.5           (D)
   gdb/10.1                 openblas/0.3.14        (D)    zlib/1.2.11
   gmsh/4.7.1-openmp        openjdk/11.0.8_10
   gnuplot/5.2.8            perl/5.32.1

- /dcsrsoft/spack/hetre/v1.1/spack/share/spack/lmod/Zen2-IB-test/linux-rhel8-x86_64/Core --
   cmake/3.20.0    cuda/11.2.2    gcc/9.3.0 (L)    git/2.31.0    mpfr/3.1.6    xz/5.2.5

  Where:
   D:  Default Module
   L:  Module is loaded

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the
"keys".

The provided versions of key tools can be seen in the module listings above.

Storage

The storage is provided by a Lenovo DSS system and the Spectrum Scale (GPFS) parallel filesystem.

/users

Your home space is at /users/username and there is a per user quota of 50 GB and 100,000 files.

We would like to remind you that all scripts and code should be stored in a Git repository.

/scratch

The scratch filesystem is the primary working space for running calculations.

The scratch space runs on SSD storage and has an automatic cleaning policy: in case of a shortage of free space, files older than 2 weeks will be deleted, starting with the oldest first.

Initially this cleanup will be triggered if the space is more than 90% used and this limit will be reviewed as we gain experience with the usage patterns.

The space is per user and there are no quotas (*). Your scratch space can be found at /scratch/username 

e.g. /scratch/ulambda

Use of this space is not charged for as it is now classed as temporary storage.

* There is a quota of 50% of the total space per user to prevent runaway jobs wreaking havoc

/work

The work space is for storing data that is being actively worked on as part of a research project. Projects have quotas assigned, and while we will not delete data in this space, there is no backup, so all critical data must also be kept on the DCSR NAS.

The structure is: 

/work/FAC/FACULTY/INSTITUTE/PI/PROJECT

This space can, and should, be used for the installation of any research-group-specific software tools, including Python virtual environments.
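
For example, a Python virtual environment could be set up under your project space roughly as follows (the project path and environment name are placeholders):

module load gcc/9.3.0 python/3.8.8
python3 -m venv /work/FAC/FACULTY/INSTITUTE/PI/PROJECT/venvs/myenv
source /work/FAC/FACULTY/INSTITUTE/PI/PROJECT/venvs/myenv/bin/activate
pip install numpy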

Infrastructure and Resources

Storage on Curnagl

Where is data stored?

Research project data is kept on the DCSR NAS. This storage is accessible from within the UNIL network using the SMB/CIFS protocol. It is also accessible on the cluster login node at /nas (see this guide).

The UNIL HPC clusters also have dedicated storage that is shared amongst the compute nodes, but this is not, in general, accessible outside of the clusters except via file transfer tools such as scp.

This space is intended for active use by projects and is not a long term store.

Cluster filesystems

The cluster storage is based on the IBM Spectrum Scale (GPFS) parallel filesystem. There are two disk-based filesystems (users and work) and one SSD-based filesystem (scratch). Whilst there is no backup, the storage is reliable and resilient to disk failure.

The role of each filesystem as well as details of the data retention policy is given below.

How much space am I using?

For the users and work filesystems the quotacheck command allows you to see the used and allocated space:

[ulambda@login ~]$ quotacheck 
### Work Quotas ###
 
Project: pi_ulambda_100111-pr-g 
 
                         Block Limits                                    |     File Limits
Filesystem type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  Remarks
work       FILESET      304.6G     1.999T         2T          0     none |  1107904 9990000 10000000        0     none DCSR-DSS.dcsr.unil.ch
 
 
Project: gruyere_100666-pr-g 
 
                         Block Limits                                    |     File Limits
Filesystem type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  Remarks
work       FILESET           0        99G       100G          0     none |        1  990000  1000000        0     none DCSR-DSS.dcsr.unil.ch

 
### User Quota ###
 
                         Block Limits                                    |     File Limits
Filesystem type         blocks      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace  Remarks
users      USR          8.706G        50G        51G       160M     none |    66477  102400   103424      160     none DCSR-DSS.dcsr.unil.ch

Users

/users/<username>

This is your home directory and can be used for storing small amounts of data. The per user quota is 50 GB and 100,000 files.

There are daily snapshots kept for seven days in case of accidental file deletion. See here for more details. 

Work

/work/<path to my project>

This space is allocated per project and the quota can be increased on request by the PI as long as free space remains. 

This space is not backed up but there is no over-allocation of resources so we will never ask you to remove files.

Scratch

/scratch/<username>

The scratch space is for intermediate files and the results of computations. There is no quota and the space is not charged for. You should think of it as temporary storage for a few weeks while running calculations.

In case of limited space, files will be automatically deleted to free it up. The current policy is that if usage reaches 90%, files will be removed, starting with the oldest first, until the occupancy is reduced to 70%. No files newer than two weeks will be removed.

$TMPDIR

For certain types of calculation it can be useful to use the NVMe drive on the compute node. This has a capacity of ~400 GB and can be accessed inside a batch job by using the $TMPDIR variable.

At the end of the job this space is automatically purged.
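
A typical pattern, sketched below with placeholder paths and a hypothetical program name, is to copy the input to $TMPDIR at the start of the batch job, work there, then copy the results back before the job ends:

# inside your batch script (paths and program name are illustrative)
cp /work/FAC/FACULTY/INSTITUTE/PI/PROJECT/input.dat $TMPDIR/
cd $TMPDIR
my_program --input input.dat --output results.dat
cp results.dat /scratch/<username>/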

Infrastructure and Resources

Jura

Jura is a cluster for the analysis of sensitive data and is primarily used by the CHUV.

Computing resources

Storage resources

ATTENTION: the /data directory is NOT BACKED UP

Getting resources on Jura

Accessing the infrastructure from UNIL

ATTENTION: LOG OUT PROPERLY

Transferring data in

sib-1-24:~ someuser$ sftp someuser@jura.dcsr.unil.ch
Password:
Verification code:
Connected to someuser@jura.dcsr.unil.ch.
sftp> dir
data 
sftp> cd data
sftp> dir
sftp> put AVeryImportantFile.tgz
Uploading AVeryImportantFile.tgz to /data/AVeryImportantFile.tgz
AVeryImportantFile.tgz

Transferring code in/out

There is a DCSR-managed Git service accessible from Jura. More information can be found at:

https://wiki.unil.ch/ci/books/service-de-calcul-haute-performance-%28hpc%29/page/why-is-there-a-dcsr-gitlab-service-and-what-is-it

Accessing the infrastructure from CHUV

ssh <unil-username>@stockage-horus.chuv.ch

Infrastructure and Resources

Data Centre Migration 2022

CCT Move 

The following page describes the work going on over the summer to install new resources and migrate existing ones to the new CCT data centre.

Further details including the exact dates will be shared with the affected user groups once this information is known. 


Stage 1 (early summer)

Installation and configuration of Urblauna and the Curnagl extension.

Now that the CCT is finally ready the hardware is in place and is being configured. Behind the scenes there are also major network changes going on to allow for a dedicated research network that will not disturb other UNIL activities and which allows for better connections between experimental data acquisition systems (e.g. sequencers and microscopes) and the DCSR facilities.


Stage 2 (now)

Data migration for Curnagl

New software stack


Once the new storage is in place the data will be copied from Géopolis to the CCT. How long this takes is difficult to estimate precisely and will influence the timing of the following steps.

The new software stack will be put in service on the machines at the CCT with the current one remaining available but without any updates until the end of 2023. At this point we will move to an annual release cycle.

Note on data migration.

The data in /work and /users will be migrated by the DCSR, so there is nothing you need to do. /scratch will not be migrated as it is considered temporary data.


Stage 3 (18th to 22nd August)

Data synchronisation for Curnagl 

Shutdown of Curnagl and switch to the extension

When all data on /work and /users has been copied from Géopolis to the CCT, there will be a period of a few days during which all Curnagl compute resources will be stopped to allow for final data synchronisation. Once this is complete, the new Curnagl nodes will be made available and, for a transitional period, will provide all the compute resources.

Note: during this period only 24 compute nodes (no GPUs) will be available, which is 1/3 of the current capacity. Please plan ahead and avoid submitting any non-essential workloads during this period.

Note: /scratch will not be copied as this is considered temporary data. 


Stage 4 (22nd August until mid September)

Move of Curnagl nodes and storage to the CCT.

Addition of the nodes to the "new" Curnagl

Urblauna in service

The existing machines will be moved to the CCT and integrated in the new Curnagl cluster. This will take the total capacity to 96 machines (8 with GPUs).

To allow for network reconfiguration there will be a second short downtime.

The existing storage will also be moved and integrated with the new system to double the capacity (2 PB total for /work).

The new sensitive data cluster will be put in service and Jura users should begin the migration of their data and workloads.


Stage 5 (autumn)

Jura decommissioning

Once the migration to Curnagl is complete the Jura system will be stopped.


Infrastructure and Resources

Urblauna

Kesako?


Urblauna (Romansh), or Lagopède alpin in French, is a bird known for its changing plumage, which functions as very effective camouflage. More information is available at https://www.vogelwarte.ch/fr/oiseaux/les-oiseaux-de-suisse/lagopede-alpin

It's also the name of our new sensitive-data compute cluster, which will replace the Jura cluster.

Using the Clusters

How to run a job on Curnagl

Overview

Suppose that you have finished writing your code, say a Python script called <my_code.py>, and you want to run it on the Curnagl cluster. You will need to submit a job (a bash script) with information such as the number of CPUs you want to use and the amount of memory you will need. This information is processed by the job scheduler (a software service running on the cluster) and your code is then executed. The job scheduler used on Curnagl is called SLURM (Simple Linux Utility for Resource Management). It is free, open-source software used by many of the world's computer clusters.

The partitions

The clusters contain several partitions (sets of compute nodes dedicated to different purposes). To list them, type:

sinfo

As you can see, there are three partitions: interactive, cpu and gpu.

Each partition is associated with a submission queue. A queue is essentially a waiting line for your compute job to be matched with an available compute resource. Those resources become available once a compute job from a previous user is completed.

Note that the nodes may be in different states: idle=not used, alloc=used, down=switched off, etc. Depending on what you want to do, you should choose the appropriate partition/submission queue.

The sbatch script

To execute your Python code on the cluster, you need to write a bash script, say <my_script.sh>, specifying the information needed to run it (you may want to use nano, vim or emacs as an editor on the cluster). Here is an example:

#!/bin/bash -l

#SBATCH --account project_id
#SBATCH --mail-type ALL
#SBATCH --mail-user firstname.surname@unil.ch

#SBATCH --chdir /scratch/<your_username>/
#SBATCH --job-name my_code
#SBATCH --output my_code.out

#SBATCH --partition cpu

#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --mem 10G
#SBATCH --time 00:30:00
#SBATCH --export NONE

module load gcc/9.3.0 python/3.8.8

python3 /PATH_TO_YOUR_CODE/my_code.py

Here we have used the command "module load gcc/9.3.0 python/3.8.8" before "python3 /PATH_TO_YOUR_CODE/my_code.py" to load some libraries and to make several programs available.

To display the list of available modules or to search for a package:

module avail
module spider package_name

For example, to load bowtie2:

module load gcc/9.3.0 bowtie2/2.4.2

To display information of the sbatch command, including the SLURM options:

sbatch --help
sbatch --usage

Finally, you submit the bash script as follows:

sbatch my_script.sh

Important: we recommend storing the above bash script and your Python code in your home folder, and your main input data in your work space. The data can then be read by your Python code. Finally, you must write your results to your scratch space.

To show the state (R=running or PD=pending) of your jobs, type:

squeue
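
To list only your own jobs, you can use the standard SLURM -u option:

squeue -u <username>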

If you realize that you made a mistake in your code or in the SLURM options, you can cancel the job:

scancel JOBID

An interactive session

Often it is convenient to work interactively on the cluster before submitting a job. Remember that when you connect to the cluster you are actually on the front-end (login) machine and you must NOT run any code there. Instead, you should connect to a compute node by using the Sinteractive command as shown below.

[ulambda@login ~]$ Sinteractive -c 1 -m 8G -t 01:00:00
 
interactive is running with the following options:

-c 1 --mem 8G -J interactive -p interactive -t 01:00:00 --x11

salloc: Granted job allocation 172565
salloc: Waiting for resource configuration
salloc: Nodes dna020 are ready for job
[ulambda@dna020 ~]$  hostname
dna020.curnagl

You can then run your code.

Hint: If you are having problems with a job script then copy and paste the lines one at a time from the script into an interactive session - errors are much more obvious this way.

You can see the available options by passing the -h option.

[ulambda@login1 ~]$ Sinteractive -h
Usage: Sinteractive [-t] [-m] [-A] [-c] [-J]

Optional arguments:
    -t: time required in hours:minutes:seconds (default: 1:00:00)
    -m: amount of memory required (default: 8G)
    -A: Account under which this job should be run
    -R: Reservation to be used
    -c: number of CPU cores to request (default: 1)
    -J: job name (default: interactive)
    -G: Number of GPUs (default: 0)

To log out from the node, simply type:

exit

Embarrassingly parallel jobs

Suppose you have 14 configuration files in <path_to_configurations> and you want to process them in parallel using your Python code <my_code.py>. This is an example of embarrassingly parallel computing, where you run 14 independent jobs, each with a different set of parameters specified in one of your configuration files. One way to do this is to use a job array:

#!/bin/bash -l

#SBATCH --account project_id
#SBATCH --mail-type ALL
#SBATCH --mail-user firstname.surname@unil.ch

#SBATCH --chdir /scratch/<your_username>/
#SBATCH --job-name my_code
#SBATCH --output=my_code_%A_%a.out

#SBATCH --partition cpu
#SBATCH --ntasks 1

#SBATCH --cpus-per-task 8
#SBATCH --mem 10G
#SBATCH --time 00:30:00
#SBATCH --export NONE

#SBATCH --array=0-13

module load gcc/9.3.0 python/3.8.8

FILES=(/path_to_configurations/*)

python3 /PATH_TO_YOUR_CODE/my_code.py ${FILES[$SLURM_ARRAY_TASK_ID]}

The above allocations (for example time=30 minutes) apply to each individual task in your array.

Similarly, if your configurations are simple numbers rather than files:

#!/bin/bash -l

#SBATCH --account project_id
#SBATCH --mail-type ALL
#SBATCH --mail-user firstname.surname@unil.ch

#SBATCH --chdir /scratch/<your_username>/
#SBATCH --job-name my_code
#SBATCH --output=my_code_%A_%a.out

#SBATCH --partition cpu
#SBATCH --ntasks 1

#SBATCH --cpus-per-task 8
#SBATCH --mem 10G
#SBATCH --time 00:30:00
#SBATCH --export NONE

#SBATCH --array=0-13

module load gcc/9.3.0 python/3.8.8

ARGS=(0.1 2.2 3.5 14 51 64 79.5 80 99 104 118 125 130 100)

python3 /PATH_TO_YOUR_CODE/my_code.py ${ARGS[$SLURM_ARRAY_TASK_ID]}

Another way to run embarrassingly parallel jobs is by using one-line SLURM commands. For example, this may be useful if you want to run your Python code on all the files with a .bam extension in a folder:

for file in *.bam
do
  sbatch --account project_id --mail-type ALL --mail-user firstname.surname@unil.ch \
         --chdir /scratch/<your_username>/ --job-name my_code --output my_code-%j.out --partition cpu \
         --nodes 1 --ntasks 1 --cpus-per-task 8 --mem 10G --time 00:30:00 \
         --wrap "module load gcc/9.3.0 python/3.8.8; python3 /PATH_TO_YOUR_CODE/my_code.py $file"
done

Good practice

Using the Clusters

What projects am I part of and what is my default account?

In order to find out what projects you are part of on the clusters, you can use the Sproject tool:

$ Sproject 

The user ulambda ( Ursula Lambda ) is in the following project accounts
  
   ulambda_default
   ulambda_etivaz
   ulambda_gruyere
 
Their default account is: ulambda_default

If Sproject is called without any arguments then it tells you what projects/accounts you are in. 

To find out what projects other users are in, you can call Sproject with the -u option:

$ Sproject -u nosuchuser

The user nosuchuser ( I really do not exist ) is in the following project accounts
..
..

Using the Clusters

Providing access to external collaborators

In order to allow non-UNIL collaborators to use the HPC clusters there are three steps, which are detailed below.

Please note that the DCSR cannot accredit external collaborators (Step 1) as this is a centralised process.

  1. The PI and the external collaborator must ask for a UNIL account using this form
  2. The PI to whom the external collaborator is attached must use this application to add the collaborator to the appropriate project. Log in to the application if necessary (top right), and click on the "Manage members list / Gérer la liste de membres" icon for your project. Usernames always have 8 characters (e.g. Greta Thunberg's username would be gthunber)
  3. The external collaborator needs to use the UNIL VPN:

    https://www.unil.ch/ci/fr/home/menuinst/catalogue-de-services/reseau-et-telephonie/acces-hors-campus-vpn/documentation.html

Once on the VPN, the external collaborator can then log in to the HPC cluster as if they were inside the UNIL network.

Using the Clusters

Requesting and using GPUs

GPU Nodes

As part of the gpu partition there are a number of GPU equipped nodes available.

Currently there are 7 nodes, each with 2 NVIDIA A100 GPUs. One additional GPU node is in the interactive partition.

Requesting GPUs

In order to access the GPUs they need to be requested via SLURM as one does for other resources such as CPUs and memory. 

The flag required is --gres=gpu:1 for 1 GPU per node and --gres=gpu:2 for 2 GPUs per node. 

 An example job script is as follows:

#!/bin/bash

#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 12
#SBATCH --mem 64G
#SBATCH --time 12:00:00

# NOTE - GPUS are in the gpu partition

#SBATCH --partition gpu
#SBATCH --gres gpu:1
#SBATCH --gres-flags enforce-binding

# Set up my modules

module purge
module load my list of modules
module load cuda

# Check that the GPU is visible

nvidia-smi

# Run my GPU-enabled Python code

python mygpucode.py 

If the #SBATCH --gres gpu:1 directive is omitted then no GPUs will be visible, even if they are present on the compute node.

If you request one GPU it will always be seen as device 0.
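
To check which GPUs are visible from within a job, you can, for example, list them with the NVIDIA tools (SLURM normally also sets CUDA_VISIBLE_DEVICES for jobs that request GPUs):

nvidia-smi -L
echo $CUDA_VISIBLE_DEVICES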

The #SBATCH --gres-flags enforce-binding option ensures that the CPUs allocated will be on the same PCI bus as the GPU(s) which greatly improves the memory bandwidth. This may mean that you have to wait longer for resources to be allocated but it is strongly recommended.

If you select 2 GPUs then we strongly advise also requesting #SBATCH --exclusive to have all the resources of the node available to your job.

Using CUDA

In order to use the CUDA toolkit there is a module available:

module load cuda

This loads the nvcc compiler and the CUDA libraries. There is also a cudnn module for the DNN tools/libraries.
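
As a minimal sketch (the source file name is hypothetical), a CUDA code can then be compiled with nvcc:

module load cuda
nvcc -O2 -o my_kernel my_kernel.cu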

Containers and GPUs

Singularity containers can make use of GPUs, but in order to make them visible to the container environment an extra flag, "--nv", must be passed to Singularity:

module load singularity

singularity run --nv mycontainer.sif

The full documentation is at https://sylabs.io/guides/3.5/user-guide/gpu.html

Using the Clusters

How do I run a job for more than 3 days?

The simple answer is that you can't without special authorisation. Please do not submit such jobs and then ask for a time extension!

If you think that you need to run for longer than 3 days then please do the following:

Contact us via helpdesk@unil.ch and explain what the problem is.

We will then get in touch with you to analyse your code and suggest performance or workflow improvements to either allow it to complete within the required time or to allow it to be run in steps using checkpoint/restart techniques.

In recent cases, codes that were predicted to take months to run ended up finishing in a few days after a bit of optimisation.

If the software cannot be optimised, there is the possibility of using a checkpoint mechanism - contact us for more information.

Using the Clusters

Access NAS DCSR from the cluster

The NAS is available from the login node only under /nas. The folder hierarchy is:

/nas/FAC/<your_faculty>/<your_department>/<your_PI>/<your_project>

Cluster -> NAS

To copy a file to the new NAS:

cp /path/to/file /nas/FAC/<your_faculty>/<your_department>/<your_PI>/<your_project>

To copy a folder to the new NAS:

cp -r /path/to/folder /nas/FAC/<your_faculty>/<your_department>/<your_PI>/<your_project>

For more complex operations, consider using rsync. For the documentation see the man page:

man rsync

or check out this link.
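
For example, the following sketch (paths are placeholders) mirrors a results folder from your scratch space to the NAS, showing progress and skipping files that are already up to date:

rsync -av --progress /scratch/<username>/results/ /nas/FAC/<your_faculty>/<your_department>/<your_PI>/<your_project>/results/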

NAS -> cluster

As above, just swapping the source and destination:

cp /nas/FAC/<your_faculty>/<your_department>/<your_PI>/<your_project>/file /path/to/dest
cp -r /nas/FAC/<your_faculty>/<your_department>/<your_PI>/<your_project>/folder /path/to/dest

Using the Clusters

SSH connection to DCSR cluster

This page presents how to connect to the DCSR clusters depending on your operating system.

Linux

An SSH client is installed by default on most common Linux distributions, so no extra package should be needed.

Connection with a password

To connect using a password, just run the following command:

ssh username@curnagl.dcsr.unil.ch

Of course, replace username in the command line with your UNIL login, and use your UNIL password.

Connection with a key

To connect with a key, you first have to generate the key on your laptop. This can be done as follows:

ssh-keygen -t ed25519
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/ejeanvoi/.ssh/id_ed25519): /home/ejeanvoi/.ssh/id_dcsr_cluster
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/ejeanvoi/.ssh/id_dcsr_cluster
Your public key has been saved in /home/ejeanvoi/.ssh/id_dcsr_cluster.pub
The key fingerprint is:
SHA256:8349RPk/2AuwzazGul4ki8xQbwjGj+d7AiU3O7JY064 ejeanvoi@archvm
The key's randomart image is:
+--[ED25519 256]--+
|                 |
|    .            |
|     + .       . |
|    ..=+o     o  |
|     o=+S+ o . . |
|     =*+oo+ * . .|
|    o *=..oo Bo .|
|   . . o.o.oo.+o.|
|     E..++=o   oo|
+----[SHA256]-----+

By default, it suggests saving the private key to ~/.ssh/id_ed25519 and the public key to ~/.ssh/id_ed25519.pub. You can simply hit "Enter" when asked if you don't use any other key. Otherwise, you can choose another path, for instance ~/.ssh/id_dcsr_cluster as in the example above.

Then, you have to enter a passphrase (twice). This is optional but you are strongly encouraged to choose a strong passphrase.

Once the key is created, you have to copy the public key to the cluster. This can be done as follows:

[ejeanvoi@archvm ~]$ ssh-copy-id -i /home/ejeanvoi/.ssh/id_dcsr_cluster ejeanvoi@curnagl.dcsr.unil.ch
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/ejeanvoi/.ssh/id_dcsr_cluster.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
ejeanvoi@curnagl.dcsr.unil.ch's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'ejeanvoi@curnagl.dcsr.unil.ch'"
and check to make sure that only the key(s) you wanted were added.

Thanks to the -i option, you can specify the path to the private key; here we use /home/ejeanvoi/.ssh/id_dcsr_cluster to match the beginning of the example. You are asked to enter your UNIL password to access the cluster, and behind the scenes the public key is automatically copied to the cluster.

Finally, you can connect to the cluster using your key, and this time you will be asked to enter the passphrase of the key (and not the UNIL password):

[ejeanvoi@archvm ~]$ ssh -i /home/ejeanvoi/.ssh/id_dcsr_cluster ejeanvoi@curnagl.dcsr.unil.ch
Enter passphrase for key '.ssh/id_dcsr_cluster':
Last login: Fri Nov 26 10:25:05 2021 from 130.223.6.87
[ejeanvoi@login ~]$

Remote graphical interface

To visualize a graphical application running on the cluster, you have to connect using the -X option:

ssh -X username@curnagl.dcsr.unil.ch

macOS

As on Linux, SSH is natively supported on macOS, so nothing special has to be installed, except for the graphical part.

Connection with a password

This is similar to the Linux version described above.

Connection with a key

This is similar to the Linux version described above.

Remote graphical interface

To enable graphical visualization over SSH, you have to install an X server. The most common one is XQuartz, which can be installed like any other .dmg application.

Then, you have to add the following line at the beginning of the ~/.ssh/config file (if the file doesn't exist, you can create it):

XAuthLocation /opt/X11/bin/xauth
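
Optionally (this is a convenience, not a requirement), you can also add a host entry to the same ~/.ssh/config file so that the host name, your username and your key are filled in automatically; the values below are illustrative:

Host curnagl
    HostName curnagl.dcsr.unil.ch
    User <your_username>
    IdentityFile ~/.ssh/id_dcsr_cluster
    ForwardX11 yes

You can then connect with just "ssh curnagl".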

Finally, just add the -X flag to the ssh command and run your graphical applications.

Windows

To access the DCSR clusters from a Windows host, you have to use an SSH client.

Several options are available.

We present here only MobaXterm (since it's a great tool that also allows you to transfer files with a GUI) and PowerShell. For both options, we'll see how to connect through SSH with a password and with a key.

MobaXterm

Connection with a password

After opening MobaXterm, you have to create a new session: