AlphaFold
The project home page, where you can find the latest information, is https://github.com/deepmind/alphafold
For details on how to run the model, please see the Supplementary Information article.
For some ideas on how to separate the CPU and GPU parts, see https://github.com/Zuricho/ParallelFold.
Alternatively, check whether what you need has already been calculated.
Note on GPU usage
Whilst AlphaFold makes use of GPUs for the inference part of the modelling, this accounts for only a small fraction of the total run time, as seen in the timings.json file that is produced for every run.
For the T1024 test case:
{
"features": 6510.152379751205,
"process_features_model_1_pred_0": 3.555035352706909,
"predict_and_compile_model_1_pred_0": 124.84101128578186,
"relax_model_1_pred_0": 25.707252502441406,
"process_features_model_2_pred_0": 2.0465400218963623,
"predict_and_compile_model_2_pred_0": 104.1096305847168,
"relax_model_2_pred_0": 14.539108514785767,
"process_features_model_3_pred_0": 1.7761900424957275,
"predict_and_compile_model_3_pred_0": 82.07982850074768,
"relax_model_3_pred_0": 13.683411598205566,
"process_features_model_4_pred_0": 1.8073537349700928,
"predict_and_compile_model_4_pred_0": 82.5819890499115,
"relax_model_4_pred_0": 15.835367441177368,
"process_features_model_5_pred_0": 1.9143474102020264,
"predict_and_compile_model_5_pred_0": 77.47663712501526,
"relax_model_5_pred_0": 14.72615647315979
}
That means that out of the ~2 hour run time, about 1 h 48 min is spent running "classical" code (mostly hhblits) and only ~10 minutes is spent on the GPU.
As such, do not request 2 GPUs: the potential speedup is negligible and this would block resources for other users.
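As a rough check, this split can be reproduced by summing the entries of timings.json. The sketch below recreates the T1024 file quoted above and tallies the GPU-side entries (predict_* and relax_*) against the total; it relies on the simple one-key-per-line layout AlphaFold writes and is not a general JSON parser.

```shell
# Recreate the T1024 timings.json quoted above.
cat > timings.json <<'EOF'
{
"features": 6510.152379751205,
"process_features_model_1_pred_0": 3.555035352706909,
"predict_and_compile_model_1_pred_0": 124.84101128578186,
"relax_model_1_pred_0": 25.707252502441406,
"process_features_model_2_pred_0": 2.0465400218963623,
"predict_and_compile_model_2_pred_0": 104.1096305847168,
"relax_model_2_pred_0": 14.539108514785767,
"process_features_model_3_pred_0": 1.7761900424957275,
"predict_and_compile_model_3_pred_0": 82.07982850074768,
"relax_model_3_pred_0": 13.683411598205566,
"process_features_model_4_pred_0": 1.8073537349700928,
"predict_and_compile_model_4_pred_0": 82.5819890499115,
"relax_model_4_pred_0": 15.835367441177368,
"process_features_model_5_pred_0": 1.9143474102020264,
"predict_and_compile_model_5_pred_0": 77.47663712501526,
"relax_model_5_pred_0": 14.72615647315979
}
EOF
# Split into GPU-side seconds vs total seconds.
awk -F': ' '
  /predict|relax/ { gpu += $2 }       # GPU-side entries
  /: [0-9]/       { total += $2 }     # every timed entry
  END { printf "GPU-side: %.0f s of %.0f s (%.1f%%)\n", gpu, total, 100*gpu/total }
' timings.json
# prints: GPU-side: 556 s of 7077 s (7.9%)
```

The ~556 s of GPU work out of ~7077 s total is the "~10 minutes out of ~2 hours" quoted above.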
If we look at the overall efficiency of the job using seff <jobid> we see:
Nodes: 1
Cores per node: 24
CPU Utilized: 03:28:24
CPU Efficiency: 7.33% of 1-23:21:36 core-walltime
Job Wall-clock time: 01:58:24
Memory Utilized: 81.94 GB
Memory Efficiency: 40.97% of 200.00 GB
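The reported CPU efficiency is simply CPU time divided by allocated core-walltime, so the figure above can be checked by hand:

```shell
# Check seff's arithmetic for the job above:
# CPU efficiency = CPU-time / (cores * wall-clock time).
cpu=$(( 3*3600 + 28*60 + 24 ))    # CPU Utilized: 03:28:24 -> 12504 s
wall=$(( 1*3600 + 58*60 + 24 ))   # Wall-clock:   01:58:24 ->  7104 s
awk -v c="$cpu" -v w="$wall" \
    'BEGIN { printf "CPU efficiency: %.2f%%\n", 100 * c / (24 * w) }'
# prints: CPU efficiency: 7.33%
```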
Reference databases
The reference databases needed for AlphaFold have been made available in /reference/alphafold
so there is no need to download them. The directory name is the date on which the databases were downloaded.
$ ls /reference/alphafold/
20210719
20211104
20220414
New versions will be downloaded if required.
The versions correspond to:
- 20210719 - Initial AlphaFold 2.0 release
- 20211104 - 2.1 release with multimer data
- 20220414 - Updated weights
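Because the directory names are dates in YYYYMMDD form, lexical order matches chronological order, so the newest snapshot can be picked automatically. The sketch below simulates the layout with a local demo directory rather than assuming access to /reference/alphafold:

```shell
# Simulate the /reference/alphafold layout locally, then pick the newest
# snapshot: YYYYMMDD names sort lexically in date order.
mkdir -p demo/20210719 demo/20211104 demo/20220414
LATEST=$(ls demo | sort | tail -n 1)
echo "$LATEST"
# prints: 20220414
```

On the clusters you would point the same logic at /reference/alphafold instead of demo.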
Using containers
The AlphaFold project recommends using Docker to run the code, which works on cloud or personal resources but not on shared HPC systems, as the administrative access required for Docker is obviously not permitted.
Singularity containers
We provide Singularity images which can be used on the DCSR clusters; these can be found in /dcsrsoft/singularity/containers/
The currently available images are:
- alphafold-v2.1.1.sif
- alphafold-v2.1.2.sif
- alphafold-v2.2.0.sif
Note: There seems to be an issue with JAX in the v2.1.2 container - please use v2.1.1 for now!
When running the images directly it is necessary to provide all the paths to the databases, which is error prone and tedious.
$ singularity run /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif --helpshort
Full AlphaFold protein structure prediction script.
flags:
/app/alphafold/run_alphafold.py:
--[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
(default: 'false')
--bfd_database_path: Path to the BFD database for use by HHblits.
--data_dir: Path to directory of supporting data.
--db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config (full_dbs)
(default: 'full_dbs')
--fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences, then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used
to name the output directories for each prediction.
(a comma separated list)
--hhblits_binary_path: Path to the HHblits executable.
(default: '/opt/conda/bin/hhblits')
--hhsearch_binary_path: Path to the HHsearch executable.
(default: '/opt/conda/bin/hhsearch')
--hmmbuild_binary_path: Path to the hmmbuild executable.
(default: '/usr/bin/hmmbuild')
--hmmsearch_binary_path: Path to the hmmsearch executable.
(default: '/usr/bin/hmmsearch')
--is_prokaryote_list: Optional for multimer system, not used by the single chain system. This list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown. These values determine the pairing
method for the MSA.
(a comma separated list)
--jackhmmer_binary_path: Path to the JackHMMER executable.
(default: '/usr/bin/jackhmmer')
--kalign_binary_path: Path to the Kalign executable.
(default: '/usr/bin/kalign')
--max_template_date: Maximum template release date to consider. Important if folding historical test sets.
--mgnify_database_path: Path to the MGnify database for use by JackHMMER.
--model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
(default: 'monomer')
--obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
--output_dir: Path to a directory that will store the results.
--pdb70_database_path: Path to the PDB70 database for use by HHsearch.
--pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
--random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be deterministic, because processes like GPU inference are nondeterministic.
(an integer)
--small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
--template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
--uniclust30_database_path: Path to the Uniclust30 database for use by HHblits.
--uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
--uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
--[no]use_precomputed_msas: Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed.
(default: 'false')
Try --helpfull to get a list of all flags.
To run the container (here we are using a GPU) the --nv flag must be used to make the GPU visible inside the container:
module load singularity
singularity run --nv /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif <OPTIONS>
Helper Scripts
To make life simpler there is a wrapper script, run_alphafold_2.2.0.py, which wraps the container so that fewer options need to be passed:
$ python3 run_alphafold_2.2.0.py -h
usage: run_alphafold_2.2.0.py [-h] --fasta-paths FASTA_PATHS [FASTA_PATHS ...] [--max-template-date MAX_TEMPLATE_DATE] [--db-preset {reduced_dbs,full_dbs}] [--model-preset {monomer,monomer_casp14,monomer_ptm,multimer}] [--num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL] [--benchmark]
[--use-precomputed-msas] [--data-dir DATA_DIR] [--docker-image DOCKER_IMAGE] [--output-dir OUTPUT_DIR] [--use-gpu] [--run-relax] [--enable-gpu-relax] [--gpu-devices GPU_DEVICES] [--cpus CPUS]
Singularity launch script for Alphafold v2.2.0
optional arguments:
-h, --help            show this help message and exit
--fasta-paths FASTA_PATHS [FASTA_PATHS ...], -f FASTA_PATHS [FASTA_PATHS ...]
Paths to FASTA files, each containing one sequence. All FASTA paths must have a unique basename as the basename is used to name the output directories for each prediction.
--max-template-date MAX_TEMPLATE_DATE, -t MAX_TEMPLATE_DATE
Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets.
--db-preset {reduced_dbs,full_dbs}
Choose preset MSA database configuration - no ensembling with uniref90 + bfd + uniclust30 (full_dbs), or 8 model ensemblings with uniref90 + bfd + uniclust30 (casp14).
--model-preset {monomer,monomer_casp14,monomer_ptm,multimer}
Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
--num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL
How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer
--benchmark, -b       Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
--use-precomputed-msas
Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or configuration have changed.
--data-dir DATA_DIR, -d DATA_DIR
Path to directory with supporting data: AlphaFold parameters and genetic and template databases. Set to the target of download_all_databases.sh.
--docker-image DOCKER_IMAGE
Alphafold docker image.
--output-dir OUTPUT_DIR, -o OUTPUT_DIR
Output directory for results.
--use-gpu             Enable NVIDIA runtime to run with GPUs.
--run-relax           Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage.
--enable-gpu-relax    Run relax on GPU if GPU is enabled.
--gpu-devices GPU_DEVICES
Comma separated list of devices to pass to NVIDIA_VISIBLE_DEVICES.
--cpus CPUS, -c CPUS  Number of CPUs to use.
An example batch job script using the helper script is:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 24
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem 200G
#SBATCH -t 6:00:00
module purge
module load singularity
export SINGULARITY_BINDPATH="/scratch,/dcsrsoft,/users,/work,/reference"
./run_alphafold_2.2.0.py --data-dir /reference/alphafold/20220414 --cpus 24 --use-gpu --fasta-paths ./T1024.fasta --output-dir /scratch/ulambda/ap2test
Alphafold without containers
Fans of Conda may also wish to check out https://github.com/kalininalab/alphafold_non_docker. Just make sure to module load gcc miniconda3
rather than following the exact procedure!