
AlphaFold

The project home page, where you can find the latest information, is at https://github.com/deepmind/alphafold.

For details on how to run the model, please see the Supplementary Information article.

For some ideas on how to separate the CPU and GPU parts, see https://github.com/Zuricho/ParallelFold.

Alternatively, check whether the structure you are interested in has already been calculated.


Reference databases

The reference databases needed for AlphaFold are available in /reference/alphafold, so there is no need to download them. Each directory is named after the date on which the databases were downloaded.

$ ls /reference/alphafold/
20210719  
20211104

New versions will be downloaded if required.

The versions correspond to:

  • 20210719 - Initial AlphaFold 2.0 release
  • 20211104 - AlphaFold 2.1 release with multimer data
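Each dated directory contains the individual databases in the layout produced by the standard AlphaFold download scripts. The exact contents depend on the release, but you can expect something along these lines (check with ls on the cluster before relying on specific paths):

$ ls /reference/alphafold/20211104
bfd  mgnify  params  pdb70  pdb_mmcif  pdb_seqres  small_bfd  uniclust30  uniprot  uniref90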


Using containers

The AlphaFold project recommends using Docker to run the code. This works on cloud or personal resources, but not on shared HPC systems, where the administrative access required by Docker is not permitted.

Singularity containers

We provide Singularity images that can be used on the DCSR clusters; they can be found in /dcsrsoft/singularity/containers/.

The currently available images are:

  • alphafold-v2.1.1.sif
  • alphafold-v2.1.2.sif

When running the images directly it is necessary to provide all the database paths, which is error-prone and tedious.

$ singularity run /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif --helpshort
Full AlphaFold protein structure prediction script.
flags:

/app/alphafold/run_alphafold.py:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
  --data_dir: Path to directory of supporting data.
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config  (full_dbs)
    (default: 'full_dbs')
  --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences, then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used
    to name the output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/opt/conda/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/opt/conda/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/usr/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/usr/bin/hmmsearch')
  --is_prokaryote_list: Optional for multimer system, not used by the single chain system. This list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown. These values determine the pairing
    method for the MSA.
    (a comma separated list)
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/usr/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/usr/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
  --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
    (default: 'monomer')
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
  --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
  --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
  --uniclust30_database_path: Path to the Uniclust30 database for use by HHblits.
  --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed.
    (default: 'false')

Try --helpfull to get a list of all flags.

To run the container with a GPU, the --nv flag must be used to make the GPU visible inside the container:

module load singularity

singularity run --nv /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif <OPTIONS>
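To illustrate why passing everything by hand is tedious, a full monomer invocation might look something like the sketch below. The database file names assume the standard AlphaFold download layout; verify them against the actual contents of the reference directory before running.

module load singularity

export SINGULARITY_BINDPATH="/scratch,/dcsrsoft,/users,/work,/reference"

DATA=/reference/alphafold/20211104

singularity run --nv /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif \
  --data_dir=$DATA \
  --output_dir=/scratch/ulambda/alphatest \
  --fasta_paths=T1024.fasta \
  --max_template_date=2021-05-01 \
  --model_preset=monomer \
  --db_preset=full_dbs \
  --uniref90_database_path=$DATA/uniref90/uniref90.fasta \
  --mgnify_database_path=$DATA/mgnify/mgy_clusters_2018_12.fa \
  --bfd_database_path=$DATA/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --uniclust30_database_path=$DATA/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
  --pdb70_database_path=$DATA/pdb70/pdb70 \
  --template_mmcif_dir=$DATA/pdb_mmcif/mmcif_files \
  --obsolete_pdbs_path=$DATA/pdb_mmcif/obsolete.dat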


Helper Scripts

For standard usage we provide helper scripts that wrap the container and require fewer options to be passed.

$ run_alphafold_v2.1.1_gpu.sh

Usage: run_alphafold_v2.1.1_gpu.sh <OPTIONS>
Required Parameters:
-d <data_dir>         Path to directory of supporting data
-o <output_dir>       Path to a directory that will store the results.
-f <fasta_path>       Path to a FASTA file containing sequence. If a FASTA file contains multiple sequences, then it will be folded as a multimer
-t <max_template_date> Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets
Optional Parameters:
-g <use_gpu>          Enable NVIDIA runtime to run with GPUs (default: true)
-n <openmm_threads>   OpenMM threads (default: all available cores)
-a <gpu_devices>      Comma separated list of devices to pass to 'CUDA_VISIBLE_DEVICES' (default: 0)
-m <model_preset>     Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model (default: 'monomer')
-c <db_preset>        Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs) (default: 'full_dbs')
-p <use_precomputed_msas> Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed (default: 'false')
-l <is_prokaryote>    Optional for multimer system, not used by the single chain system. A boolean specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown. This value determine the pairing method for the MSA (default: 'None')
-b <benchmark>        Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins (default: 'false')

These helper scripts can be found in /dcsrsoft/singularity/containers/

The currently available scripts are:

  • run_alphafold_v2.1.1_gpu.sh

Note that these are based on the alphafold_non_docker project, so the latest AlphaFold image might not yet have an associated helper script.

We recommend that you copy the helper script to your work space and modify it if needed.
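For example (the destination is just a placeholder for your own project space):

$ cp /dcsrsoft/singularity/containers/run_alphafold_v2.1.1_gpu.sh /work/path/to/my/project/
$ chmod u+x /work/path/to/my/project/run_alphafold_v2.1.1_gpu.sh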

An example batch job using the helper script is

#!/bin/bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 24
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem 200G
#SBATCH -t 6:00:00

module purge
module load singularity

export SINGULARITY_BINDPATH="/scratch,/dcsrsoft,/users,/work,/reference"

./run_alphafold_v2.1.1_gpu.sh -d /reference/alphafold/20211104 -o /scratch/ulambda/ap212 -m monomer -f T1024.fasta -t 2021-05-01 
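Once the job has finished, the results are written to a sub-directory of the output directory named after the FASTA file; with the standard AlphaFold output layout this contains the ranked and per-model PDB files, the pickled features and results, the msas directory, ranking_debug.json and timings.json. For the batch job above the results end up under:

$ ls /scratch/ulambda/ap212/T1024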


AlphaFold without containers


As an alternative to using containers, this guide shows how to run the AlphaFold framework on the DCSR clusters using a Python virtual environment and the DCSR software stack.

Fans of Conda may also wish to check out https://github.com/kalininalab/alphafold_non_docker, from which a lot of the information below is taken; just make sure to module load gcc miniconda3 rather than following the exact procedure!
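A minimal sketch of that Conda route (the Python version is taken from the alphafold_non_docker instructions and may need adjusting):

$ module load gcc miniconda3
$ conda create -n alphafold python=3.8 -y
$ conda activate alphafold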

Setting up a virtual environment

In order to satisfy a number of the dependencies, and to install AlphaFold itself, we use a Python virtual environment.

As usual we recommend that you create these environments in your project space on the /work filesystem. You can, of course, create one per lab and share it.

$ module load gcc python

$ cd /work/path/to/my/project

$ python -m venv alpha-venv

$ source alpha-venv/bin/activate
(alpha-venv) [ulambda@curnagl ]$ 

$ pip install alphafold
..

$ pip install --upgrade "jax[cuda111]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
..

You can check what has been installed by running pip list inside the virtual environment.
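To check that the CUDA-enabled jax build is usable (this will only report a GPU when run on a node that actually has one, for example inside an interactive GPU job):

$ python -c "import jax; print(jax.devices())"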

AlphaFold and friends

Whilst we have already installed the AlphaFold Python package, it is useful to have the source code, which can be obtained with git:

git clone https://github.com/deepmind/alphafold.git

This will create a folder called alphafold.

Go into the directory (cd alphafold) and download the helper script:

wget https://raw.githubusercontent.com/kalininalab/alphafold_non_docker/main/run_alphafold.sh

As you will be running jobs via Slurm, please comment out (with #) the following lines in run_alphafold.sh so that, if multiple GPUs are requested, they will all be visible:

# Export ENVIRONMENT variables (change me if required)
if [[ "$use_gpu" == true ]] ; then
    export CUDA_VISIBLE_DEVICES=0
fi
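After commenting out, the block should read:

# Export ENVIRONMENT variables (change me if required)
# if [[ "$use_gpu" == true ]] ; then
#     export CUDA_VISIBLE_DEVICES=0
# fi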

It's also useful to make the helper script executable:

$ chmod a+x run_alphafold.sh

Now we download some chemical data needed by the code:

 wget -q -P alphafold/common/ https://git.scicore.unibas.ch/schwede/openstructure/-/raw/7102c63615b64735c4941278d92b554ec94415f8/modules/mol/alg/src/stereo_chemical_props.txt

After all this, the setup is complete and we are ready to go.
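If you do not yet have an input, note that a FASTA file is simply a header line followed by the amino-acid sequence; the sequence below is an arbitrary placeholder, not a real target:

$ cat > test.fasta << 'EOF'
>test_sequence
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
EOF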

Running an example


Here we show running on an interactive node using Sinteractive, but the same logic applies to batch jobs, which will be needed for longer-running tasks.

$ Sinteractive -G 1 -t 2:00:00 -c 16 -m 64G
 
Sinteractive is running with the following options:
 
--gres=gpu:1 -c 16 --mem 64G -J interactive -p interactive -t 2:00:00
 
salloc: Granted job allocation 123456
salloc: Waiting for resource configuration
salloc: Nodes dnagpu001 are ready for job
[ulambda@dnagpu001 ]$ 
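Before going further you can check that a GPU has indeed been allocated to the job:

$ nvidia-smi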

In order to make all the necessary tools available, we first need to load some modules:

$ module load gcc python hh-suite openmm hmmer pdbfixer kalign cuda cudnn

$ module list

Currently Loaded Modules:
  1) gcc/9.3.0     2) python/3.8.8   3) hh-suite/3.3.0   4) fftw/3.3.9   5) openmm/7.5.0   
  6) hmmer/3.3.2   7) pdbfixer/1.7   8) kalign/3.3.1     9) cuda/11.2.2  10) cudnn/8.1.1.33-11.2

We then activate the virtual environment

$ source /work/path/to/my/project/alpha-venv/bin/activate
(alpha-venv) [ulambda@dnagpu001 ~]$ 

Then change into the alphafold repository and launch a task

$ cd /work/path/to/my/project/alphafold

$ ./run_alphafold.sh -d /reference/alphafold/20210719 -o /scratch/ulambda/alphatest -m model_1 -f T1024.fasta -t 2021-05-01 -g true

2021-07-20 17:32:00.051940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
..
..
2021-07-20 17:32:04.076381: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:171] XLA service 0x318b090 initialized for platform Interpreter (this does not guarantee that XLA will be used). Devices:
2021-07-20 17:32:04.076419: I external/org_tensorflow/tensorflow/compiler/xla/service/service.cc:179]   StreamExecutor device (0): Interpreter, <undefined>
..
I0720 17:45:51.702012 140086992525120 utils.py:36] Started HHsearch query
I0720 17:47:37.402516 140086992525120 utils.py:40] Finished HHsearch query in 105.700 seconds
I0720 17:47:38.506151 140086992525120 hhblits.py:128] Launching subprocess "/dcsrsoft/spack/hetre/v1.2/spack/opt/spack/linux-rhel8-zen2/gcc-9.3.0/hh-suite-3.3.0-k3vfe6b2jsdl6cebrcmb3qoxav2gyukz/bin/hhblits -i T1024.fasta -cpu 4 -oa3m /tmp/tmpkv138q2u/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /dcsrsoft/reference/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /dcsrsoft/reference/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I0720 17:47:38.572123 140086992525120 utils.py:36] Started HHblits query
..
..
Example batch script

A job script, to be submitted via sbatch, which does the same thing as above:

#!/bin/bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 24
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem 200G
#SBATCH -t 6:00:00

module purge

module load gcc python hh-suite openmm hmmer pdbfixer kalign cuda cudnn

source /work/path/to/my/project/alpha-venv/bin/activate

cd /work/path/to/my/project/alphafold/

./run_alphafold.sh -d /reference/alphafold/20210719 -o /scratch/ulambda/alphatest -m model_1 -f T1024.fasta -t 2021-05-01 -g true
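Save the script under a name of your choice (alphafold_T1024.sh below is just an example), submit it with sbatch and follow progress with squeue and the Slurm output file:

$ sbatch alphafold_T1024.sh
$ squeue -u $USER
$ tail -f slurm-<jobid>.out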

The above analysis for T1024 takes approximately 2 hours with the resources requested.

The timings.json file shows

{
    "features": 7004.208073139191,
    "process_features_model_1": 8.682352781295776,
    "predict_and_compile_model_1": 148.41881656646729,
    "relax_model_1": 64.47628593444824
}