
AlphaFold

The project home page, where you can find the latest information, is https://github.com/deepmind/alphafold

For details on how to run the model, please see the Supplementary Information of the AlphaFold paper.

For some ideas on how to separate the CPU and GPU parts, see https://github.com/Zuricho/ParallelFold.

Alternatively, check out what has already been calculated in the AlphaFold Protein Structure Database.

Note on GPU usage

While AlphaFold makes use of GPUs for the inference part of the modelling, this accounts for only a small fraction of the total running time, as can be seen in the timings.json file that is produced for every run:

For the T1024 test case:

{
    "features": 6510.152379751205,
    "process_features_model_1_pred_0": 3.555035352706909,
    "predict_and_compile_model_1_pred_0": 124.84101128578186,
    "relax_model_1_pred_0": 25.707252502441406,
    "process_features_model_2_pred_0": 2.0465400218963623,
    "predict_and_compile_model_2_pred_0": 104.1096305847168,
    "relax_model_2_pred_0": 14.539108514785767,
    "process_features_model_3_pred_0": 1.7761900424957275,
    "predict_and_compile_model_3_pred_0": 82.07982850074768,
    "relax_model_3_pred_0": 13.683411598205566,
    "process_features_model_4_pred_0": 1.8073537349700928,
    "predict_and_compile_model_4_pred_0": 82.5819890499115,
    "relax_model_4_pred_0": 15.835367441177368,
    "process_features_model_5_pred_0": 1.9143474102020264,
    "predict_and_compile_model_5_pred_0": 77.47663712501526,
    "relax_model_5_pred_0": 14.72615647315979
}

That means that out of the ~2 hour run time, about 1h48 is spent running "classical" CPU code (mostly hhblits) and only ~10 minutes is spent on the GPU.
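This split can be checked directly from timings.json. The awk sketch below sums the predict_and_compile_* entries (GPU inference) against the total of all timings; the T1024 timings from above are inlined as a here-doc for illustration, but in practice you would point awk at the timings.json of your own run.

```shell
# Sum GPU inference time (predict_and_compile_*) versus total time.
# Replace the here-doc with "< timings.json" for a real run.
awk -F': ' '
  /predict_and_compile/ { gpu   += $2 }
  /": [0-9]/            { total += $2 }
  END { printf "GPU: %.0fs of %.0fs total (%.1f%%)\n", gpu, total, 100 * gpu / total }
' <<'EOF'
{
    "features": 6510.152379751205,
    "process_features_model_1_pred_0": 3.555035352706909,
    "predict_and_compile_model_1_pred_0": 124.84101128578186,
    "relax_model_1_pred_0": 25.707252502441406,
    "process_features_model_2_pred_0": 2.0465400218963623,
    "predict_and_compile_model_2_pred_0": 104.1096305847168,
    "relax_model_2_pred_0": 14.539108514785767,
    "process_features_model_3_pred_0": 1.7761900424957275,
    "predict_and_compile_model_3_pred_0": 82.07982850074768,
    "relax_model_3_pred_0": 13.683411598205566,
    "process_features_model_4_pred_0": 1.8073537349700928,
    "predict_and_compile_model_4_pred_0": 82.5819890499115,
    "relax_model_4_pred_0": 15.835367441177368,
    "process_features_model_5_pred_0": 1.9143474102020264,
    "predict_and_compile_model_5_pred_0": 77.47663712501526,
    "relax_model_5_pred_0": 14.72615647315979
}
EOF
# → GPU: 471s of 7077s total (6.7%)
```

Note this counts only inference as GPU time; if relaxation also runs on the GPU the figure is slightly higher, but the conclusion is the same.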

As such, do not request 2 GPUs: the potential speedup is negligible and doing so blocks resources for other users.

If we look at the overall efficiency of the job using seff we see:

Nodes: 1
Cores per node: 24
CPU Utilized: 03:28:24
CPU Efficiency: 7.33% of 1-23:21:36 core-walltime
Job Wall-clock time: 01:58:24
Memory Utilized: 81.94 GB
Memory Efficiency: 40.97% of 200.00 GB

Reference databases

The reference databases needed for AlphaFold have been made available in /reference/alphafold, so there is no need to download them. Each directory is named after the date on which the databases were downloaded.

$ ls /reference/alphafold/
20210719  
20211104
20220414

New versions will be downloaded if required.

The versions correspond to:

  • 20210719 - Initial Alphafold 2.0 release
  • 20211104 - 2.1 release with multimer data
  • 20220414 - Updated weights 
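Since the directory names (YYYYMMDD) sort chronologically, a job script that should always track the newest snapshot can pick it automatically. A minimal sketch, where latest_db is a hypothetical helper rather than something we provide:

```shell
# latest_db: print the most recent (lexicographically last) snapshot
# directory under the given database root; YYYYMMDD names sort correctly.
latest_db() { ls "$1" | sort | tail -n 1; }

# On the clusters this would give, e.g.:
#   DATA_DIR=/reference/alphafold/$(latest_db /reference/alphafold)
```

Pinning an explicit date is still preferable for reproducibility, since a new snapshot changes the MSAs and templates.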

 

Using containers

The AlphaFold project recommends using Docker to run the code, which works on cloud or personal resources but not on shared HPC systems, as the administrative access required by Docker is obviously not permitted.

Singularity containers

We provide Singularity images which can be used on the DCSR clusters; these can be found in /dcsrsoft/singularity/containers/

The currently available images are:

  • alphafold-v2.1.1.sif
  • alphafold-v2.1.2.0.sif

Note: there seems to be an issue with JAX in the v2.1.2 container - please use v2.1.1 for now!

When running the images directly it is necessary to provide all the paths to the reference databases, which is error prone and tedious.

$ singularity run /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif --helpshort
Full AlphaFold protein structure prediction script.
flags:

/app/alphafold/run_alphafold.py:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
  --data_dir: Path to directory of supporting data.
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config  (full_dbs)
    (default: 'full_dbs')
  --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences, then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used
    to name the output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/opt/conda/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/opt/conda/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/usr/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/usr/bin/hmmsearch')
  --is_prokaryote_list: Optional for multimer system, not used by the single chain system. This list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown. These values determine the pairing
    method for the MSA.
    (a comma separated list)
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/usr/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/usr/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
  --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
    (default: 'monomer')
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
  --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
  --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
  --uniclust30_database_path: Path to the Uniclust30 database for use by HHblits.
  --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed.
    (default: 'false')

Try --helpfull to get a list of all flags.

To run the container with a GPU, the --nv flag must be used to make the GPU visible inside the container:

module load singularity

singularity run --nv /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif <OPTIONS>
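For reference, a fully spelled-out direct invocation might look like the sketch below. The database file names are assumptions based on the layout produced by AlphaFold's standard download scripts; check the actual contents of /reference/alphafold/20220414 before relying on them.

```shell
# Sketch only - when bypassing the helper scripts, every database path must
# be passed explicitly. The file names below are assumed, not verified.
DATA=/reference/alphafold/20220414
singularity run --nv /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif \
  --data_dir="$DATA" \
  --uniref90_database_path="$DATA/uniref90/uniref90.fasta" \
  --mgnify_database_path="$DATA/mgnify/mgy_clusters_2018_12.fa" \
  --bfd_database_path="$DATA/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt" \
  --uniclust30_database_path="$DATA/uniclust30/uniclust30_2018_08/uniclust30_2018_08" \
  --pdb70_database_path="$DATA/pdb70/pdb70" \
  --template_mmcif_dir="$DATA/pdb_mmcif/mmcif_files" \
  --obsolete_pdbs_path="$DATA/pdb_mmcif/obsolete.dat" \
  --max_template_date=2021-05-01 \
  --fasta_paths=T1024.fasta \
  --output_dir=/scratch/ulambda/alphafold
```

This is exactly the tedium the helper scripts described below exist to hide.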

 

Helper Scripts

For standard usage we provide helper scripts which wrap the container and allow for fewer options to be passed. The current wrapper script is run_alphafold_2.2.0.py:

$ run_alphafold_2.2.0.py -h

usage: run_alphafold_2.2.0.py [-h] --fasta-paths FASTA_PATHS [FASTA_PATHS ...] [--max-template-date MAX_TEMPLATE_DATE] [--db-preset {reduced_dbs,full_dbs}] [--model-preset {monomer,monomer_casp14,monomer_ptm,multimer}]
                              [--num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL] [--benchmark] [--use-precomputed-msas] [--data-dir DATA_DIR] [--docker-image DOCKER_IMAGE] [--output-dir OUTPUT_DIR]
                              [--use-gpu] [--run-relax] [--enable-gpu-relax] [--gpu-devices GPU_DEVICES] [--cpus CPUS]

Singularity launch script for Alphafold v2.2.0

optional arguments:
  -h, --help            show this help message and exit
  --fasta-paths FASTA_PATHS [FASTA_PATHS ...], -f FASTA_PATHS [FASTA_PATHS ...]
                        Paths to FASTA files, each containing one sequence. All FASTA paths must have a unique basename as the basename is used to name the output directories for each prediction.
  --max-template-date MAX_TEMPLATE_DATE, -t MAX_TEMPLATE_DATE
                        Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets.
  --db-preset {reduced_dbs,full_dbs}
                        Choose preset model configuration - no ensembling with uniref90 + bfd + uniclust30 (full_dbs), or 8 model ensemblings with uniref90 + bfd + uniclust30 (casp14).
  --model-preset {monomer,monomer_casp14,monomer_ptm,multimer}
                        Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model.
  --num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL
                        How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer.
  --benchmark, -b       Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
  --use-precomputed-msas
                        Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or configuration have changed.
  --data-dir DATA_DIR, -d DATA_DIR
                        Path to directory with supporting data: AlphaFold parameters and genetic and template databases. Set to the target of download_all_databases.sh.
  --docker-image DOCKER_IMAGE
                        Alphafold docker image.
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for results.
  --use-gpu             Enable NVIDIA runtime to run with GPUs.
  --run-relax           Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage.
  --enable-gpu-relax    Run relax on GPU if GPU is enabled.
  --gpu-devices GPU_DEVICES
                        Comma separated list of devices to pass to NVIDIA_VISIBLE_DEVICES.
  --cpus CPUS, -c CPUS  Number of CPUs to use.

The helper scripts can be found in /dcsrsoft/singularity/containers/

The currently available scripts are:

  • run_alphafold_2.2.0.py

Note that these are based on the alphafold_non_docker project, so the latest Alphafold image might not have an associated helper script.

We recommend that you copy the helper script to your work space and modify it if needed.

 

An example batch job script using the helper script is:

#!/bin/bash

#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 24
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem 200G
#SBATCH -t 6:00:00

module purge
module load singularity

export SINGULARITY_BINDPATH="/scratch,/dcsrsoft,/users,/work,/reference"

./run_alphafold_2.2.0.py --data-dir /reference/alphafold/20220414 --cpus 24 --use-gpu --fasta-paths ./T1024.fasta --output-dir /scratch/ulambda/ap212

 

Alphafold without containers

 

Fans of Conda may also wish to check out https://github.com/kalininalab/alphafold_non_docker.  Just make sure to module load gcc miniconda3 rather than following the exact procedure!
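A rough sketch of how that might start on the clusters - the environment name and Python version here are illustrative, and the remaining steps should follow the alphafold_non_docker README:

```shell
# Use the cluster's miniconda module instead of installing conda yourself.
module load gcc miniconda3

# Illustrative only: create and activate a dedicated environment, then
# continue with the installation steps from the alphafold_non_docker README.
conda create -n alphafold python=3.8
conda activate alphafold
```

Everything after the module load follows that project's documentation; only the first line is cluster-specific.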