AlphaFold

The project home page where you can find the latest information is at https://github.com/deepmind/alphafold 

For details on how to run the model, please see the Supplementary Information article.

For some ideas on how to separate the CPU and GPU parts: https://github.com/Zuricho/ParallelFold.

Alternatively, check out what has already been calculated.

Note on GPU usage

Whilst AlphaFold uses GPUs for the inference part of the modelling, depending on the use case this can be a small part of the total running time, as shown by the timings.json file that is produced for every run.

For the T1024 test case (times in seconds):

{
    "features": 6510.152379751205,
    "process_features_model_1_pred_0": 3.555035352706909,
    "predict_and_compile_model_1_pred_0": 124.84101128578186,
    "relax_model_1_pred_0": 25.707252502441406,
    "process_features_model_2_pred_0": 2.0465400218963623,
    "predict_and_compile_model_2_pred_0": 104.1096305847168,
    "relax_model_2_pred_0": 14.539108514785767,
    "process_features_model_3_pred_0": 1.7761900424957275,
    "predict_and_compile_model_3_pred_0": 82.07982850074768,
    "relax_model_3_pred_0": 13.683411598205566,
    "process_features_model_4_pred_0": 1.8073537349700928,
    "predict_and_compile_model_4_pred_0": 82.5819890499115,
    "relax_model_4_pred_0": 15.835367441177368,
    "process_features_model_5_pred_0": 1.9143474102020264,
    "predict_and_compile_model_5_pred_0": 77.47663712501526,
    "relax_model_5_pred_0": 14.72615647315979
}

That means that out of the ~2 hour run time, 1h48 is spent running "classical" CPU code (mostly hhblits) and less than 10 minutes is spent on prediction and relaxation, of which about 8 minutes is GPU inference.
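
The breakdown above can be reproduced from any timings.json with a few lines of Python. This is a minimal sketch and not part of AlphaFold: the function names summarise_timings and print_summary are illustrative, and the grouping assumes the key names shown in the example above.

```python
import json
from collections import defaultdict

# Key prefixes as they appear in the timings.json shown above.
STAGES = ("features", "process_features", "predict_and_compile", "relax")


def summarise_timings(timings):
    """Sum the per-model timings (in seconds) into one total per stage."""
    totals = defaultdict(float)
    for key, seconds in timings.items():
        # A key belongs to a stage if it is the stage name itself
        # ("features") or starts with "<stage>_" ("relax_model_1_pred_0").
        stage = next(
            (s for s in STAGES if key == s or key.startswith(s + "_")),
            "other",
        )
        totals[stage] += seconds
    return dict(totals)


def print_summary(path):
    """Print minutes and share of the total for each stage, largest first."""
    with open(path) as handle:
        totals = summarise_timings(json.load(handle))
    grand_total = sum(totals.values())
    for stage, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
        share = 100 * seconds / grand_total
        print(f"{stage:22s} {seconds / 60:8.1f} min {share:5.1f} %")
```

Calling print_summary("timings.json") on the T1024 file above gives roughly 108.5 minutes (about 92 %) for features and just under 8 minutes for predict_and_compile.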

As such, do not request 2 GPUs: the potential speedup is negligible and this will block resources for other users.

For multimer modelling the GPU part can take longer and, depending on what you need, it might be worth turning off relaxation. Always check the timings.json file to see where time is being spent!

If we look at the overall efficiency of the job using seff we see:

Nodes: 1
Cores per node: 24
CPU Utilized: 03:28:24
CPU Efficiency: 7.33% of 1-23:21:36 core-walltime
Job Wall-clock time: 01:58:24
Memory Utilized: 81.94 GB
Memory Efficiency: 40.97% of 200.00 GB
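
The CPU efficiency reported by seff is the CPU time used divided by the core-walltime (number of cores multiplied by the wall-clock time). The sketch below redoes that arithmetic for the figures above; the helper names to_seconds and cpu_efficiency are illustrative and not part of Slurm.

```python
def to_seconds(duration):
    """Convert a Slurm-style '[D-]HH:MM:SS' string to seconds."""
    days = 0
    if "-" in duration:
        day_part, duration = duration.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(x) for x in duration.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_used, wall_clock, cores):
    """Percentage of the allocated core time that was actually used."""
    core_walltime = to_seconds(wall_clock) * cores
    return 100.0 * to_seconds(cpu_used) / core_walltime


# Values taken from the seff output above.
print(f"{cpu_efficiency('03:28:24', '01:58:24', 24):.2f} %")  # prints 7.33 %
```

Put differently, on average fewer than two of the 24 allocated cores were busy over the lifetime of the job.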


Reference databases

The reference databases needed for AlphaFold have been made available in /reference/alphafold, so there is no need to download them. Each directory name is the date (YYYYMMDD) on which the databases were downloaded.

$ ls /reference/alphafold/
20210719  
20211104
20220414

New versions will be downloaded if required.
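
Because the directory names are dates, the most recent database set can be selected programmatically. A minimal sketch, assuming that only directories with a valid YYYYMMDD name should be considered; the function name latest_database_dir is illustrative.

```python
from datetime import datetime
from pathlib import Path


def latest_database_dir(root="/reference/alphafold"):
    """Return the sub-directory of root whose YYYYMMDD name is most recent."""
    dated = []
    for entry in Path(root).iterdir():
        if not entry.is_dir():
            continue
        try:
            stamp = datetime.strptime(entry.name, "%Y%m%d")
        except ValueError:
            continue  # not a dated directory, skip it
        dated.append((stamp, entry))
    if not dated:
        raise FileNotFoundError(f"no dated database directories in {root}")
    # Compare on the parsed date only.
    return max(dated, key=lambda pair: pair[0])[1]
```

With the listing above, latest_database_dir() would return /reference/alphafold/20220414. The newest set is not automatically the right one: check that it matches the AlphaFold version you intend to run.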

The versions correspond to:


Using containers

The AlphaFold project recommends using Docker to run the code. This works on cloud or personal resources but not on shared HPC systems, where the administrative access that Docker requires is not permitted.

Singularity containers

We provide Singularity images for use on the DCSR clusters. They can be found in /dcsrsoft/singularity/containers/

The currently available images are:

When running the images directly it is necessary to provide all the paths to the databases, which is tedious and error-prone, as the list of flags shows:

$ singularity run /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif --helpshort
Full AlphaFold protein structure prediction script.
flags:

/app/alphafold/run_alphafold.py:
  --[no]benchmark: Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
    (default: 'false')
  --bfd_database_path: Path to the BFD database for use by HHblits.
  --data_dir: Path to directory of supporting data.
  --db_preset: <full_dbs|reduced_dbs>: Choose preset MSA database configuration - smaller genetic database config (reduced_dbs) or full genetic database config  (full_dbs)
    (default: 'full_dbs')
  --fasta_paths: Paths to FASTA files, each containing a prediction target that will be folded one after another. If a FASTA file contains multiple sequences, then it will be folded as a multimer. Paths should be separated by commas. All FASTA paths must have a unique basename as the basename is used
    to name the output directories for each prediction.
    (a comma separated list)
  --hhblits_binary_path: Path to the HHblits executable.
    (default: '/opt/conda/bin/hhblits')
  --hhsearch_binary_path: Path to the HHsearch executable.
    (default: '/opt/conda/bin/hhsearch')
  --hmmbuild_binary_path: Path to the hmmbuild executable.
    (default: '/usr/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the hmmsearch executable.
    (default: '/usr/bin/hmmsearch')
  --is_prokaryote_list: Optional for multimer system, not used by the single chain system. This list should contain a boolean for each fasta specifying true where the target complex is from a prokaryote, and false where it is not, or where the origin is unknown. These values determine the pairing
    method for the MSA.
    (a comma separated list)
  --jackhmmer_binary_path: Path to the JackHMMER executable.
    (default: '/usr/bin/jackhmmer')
  --kalign_binary_path: Path to the Kalign executable.
    (default: '/usr/bin/kalign')
  --max_template_date: Maximum template release date to consider. Important if folding historical test sets.
  --mgnify_database_path: Path to the MGnify database for use by JackHMMER.
  --model_preset: <monomer|monomer_casp14|monomer_ptm|multimer>: Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
    (default: 'monomer')
  --obsolete_pdbs_path: Path to file containing a mapping from obsolete PDB IDs to the PDB IDs of their replacements.
  --output_dir: Path to a directory that will store the results.
  --pdb70_database_path: Path to the PDB70 database for use by HHsearch.
  --pdb_seqres_database_path: Path to the PDB seqres database for use by hmmsearch.
  --random_seed: The random seed for the data pipeline. By default, this is randomly generated. Note that even if this is set, Alphafold may still not be deterministic, because processes like GPU inference are nondeterministic.
    (an integer)
  --small_bfd_database_path: Path to the small version of BFD used with the "reduced_dbs" preset.
  --template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif
  --uniclust30_database_path: Path to the Uniclust30 database for use by HHblits.
  --uniprot_database_path: Path to the Uniprot database for use by JackHMMer.
  --uniref90_database_path: Path to the Uniref90 database for use by JackHMMER.
  --[no]use_precomputed_msas: Whether to read MSAs that have been written to disk. WARNING: This will not check if the sequence, database or configuration have changed.
    (default: 'false')

Try --helpfull to get a list of all flags.

To run the container on a GPU, the --nv flag must be used to make the GPU visible inside the container:

module load singularity

singularity run --nv /dcsrsoft/singularity/containers/alphafold-v2.1.1.sif <OPTIONS>
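
To illustrate what <OPTIONS> involves for a monomer run with the full databases, the sketch below assembles the command line from a single database directory. The flag names are taken from the --helpshort output above. The relative paths in ASSUMED_LAYOUT are an assumption about how the database directory is organised and must be checked with ls before use; the function name build_command and the example values for the FASTA file, output directory and template date are hypothetical.

```python
import shlex
from pathlib import Path

IMAGE = "/dcsrsoft/singularity/containers/alphafold-v2.1.1.sif"

# ASSUMPTION: locations of each database relative to the database
# directory. Verify every entry against the real directory contents.
ASSUMED_LAYOUT = {
    "uniref90_database_path": "uniref90/uniref90.fasta",
    "mgnify_database_path": "mgnify/mgy_clusters_2018_12.fa",
    "bfd_database_path": "bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt",
    "uniclust30_database_path": "uniclust30/uniclust30_2018_08/uniclust30_2018_08",
    "pdb70_database_path": "pdb70/pdb70",
    "template_mmcif_dir": "pdb_mmcif/mmcif_files",
    "obsolete_pdbs_path": "pdb_mmcif/obsolete.dat",
}


def build_command(data_dir, fasta, output_dir, max_template_date):
    """Return the argument list for a direct singularity run."""
    data_dir = Path(data_dir)
    cmd = [
        "singularity", "run", "--nv", IMAGE,
        f"--data_dir={data_dir}",
        f"--fasta_paths={fasta}",
        f"--output_dir={output_dir}",
        f"--max_template_date={max_template_date}",
    ]
    # One flag per database, each pointing inside data_dir.
    cmd += [f"--{flag}={data_dir / rel}" for flag, rel in ASSUMED_LAYOUT.items()]
    return cmd


# Example values only; print the command for inspection instead of running it.
print(shlex.join(build_command(
    "/reference/alphafold/20211104", "T1024.fasta", "out", "2021-11-04")))
```

Printing the command rather than executing it makes it easy to check every path first. A multimer run uses a different set of database flags, which this sketch does not cover.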


Helper Scripts

To make life simpler there is a wrapper script, run_alphafold_2.2.0.py, which can be found at:

/dcsrsoft/singularity/containers/run_alphafold_2.2.0.py

Please copy it to your working directory. The -h option lists its arguments:

$ python3 run_alphafold_2.2.0.py -h

usage: run_alphafold_2.2.0.py [-h] --fasta-paths FASTA_PATHS [FASTA_PATHS ...] [--max-template-date MAX_TEMPLATE_DATE] [--db-preset {reduced_dbs,full_dbs}] [--model-preset {monomer,monomer_casp14,monomer_ptm,multimer}] [--num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL] [--benchmark]
                              [--use-precomputed-msas] [--data-dir DATA_DIR] [--docker-image DOCKER_IMAGE] [--output-dir OUTPUT_DIR] [--use-gpu] [--run-relax] [--enable-gpu-relax] [--gpu-devices GPU_DEVICES] [--cpus CPUS]

Singularity launch script for Alphafold v2.2.0

optional arguments:
  -h, --help            show this help message and exit
  --fasta-paths FASTA_PATHS [FASTA_PATHS ...], -f FASTA_PATHS [FASTA_PATHS ...]
                        Paths to FASTA files, each containing one sequence. All FASTA paths must have a unique basename as the basename is used to name the output directories for each prediction.
  --max-template-date MAX_TEMPLATE_DATE, -t MAX_TEMPLATE_DATE
                        Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets.
  --db-preset {reduced_dbs,full_dbs}
                        Choose preset model configuration - no ensembling with uniref90 + bfd + uniclust30 (full_dbs), or 8 model ensemblings with uniref90 + bfd + uniclust30 (casp14).
  --model-preset {monomer,monomer_casp14,monomer_ptm,multimer}
                        Choose preset model configuration - the monomer model, the monomer model with extra ensembling, monomer model with pTM head, or multimer model
  --num-multimer-predictions-per-model NUM_MULTIMER_PREDICTIONS_PER_MODEL
                        How many predictions (each with a different random seed) will be generated per model. E.g. if this is 2 and there are 5 models then there will be 10 predictions per input. Note: this FLAG only applies if model_preset=multimer
  --benchmark, -b       Run multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required for inferencing many proteins.
  --use-precomputed-msas
                        Whether to read MSAs that have been written to disk instead of running the MSA tools. The MSA files are looked up in the output directory, so it must stay the same between multiple runs that are to reuse the MSAs. WARNING: This will not check if the sequence, database or configuration
                        have changed.
  --data-dir DATA_DIR, -d DATA_DIR
                        Path to directory with supporting data: AlphaFold parameters and genetic and template databases. Set to the target of download_all_databases.sh.
  --docker-image DOCKER_IMAGE
                        Alphafold docker image.
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Output directory for results.
  --use-gpu             Enable NVIDIA runtime to run with GPUs.
  --run-relax           Whether to run the final relaxation step on the predicted models. Turning relax off might result in predictions with distracting stereochemical violations but might help in case you are having issues with the relaxation stage.
  --enable-gpu-relax    Run relax on GPU if GPU is enabled.
  --gpu-devices GPU_DEVICES
                        Comma separated list of devices to pass to NVIDIA_VISIBLE_DEVICES.
  --cpus CPUS, -c CPUS  Number of CPUs to use.


An example batch script using the helper script is:

#!/bin/bash

# One task on one node with 24 CPU cores (passed to the wrapper via --cpus)
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 24
# A single GPU is sufficient (see the note on GPU usage)
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=enforce-binding
#SBATCH --mem 200G
#SBATCH -t 6:00:00

module purge
module load singularity

# Make the cluster filesystems visible inside the container
export SINGULARITY_BINDPATH="/scratch,/dcsrsoft,/users,/work,/reference"

./run_alphafold_2.2.0.py \
    --data-dir /reference/alphafold/20220414 \
    --cpus 24 \
    --use-gpu \
    --fasta-paths ./T1024.fasta \
    --output-dir /scratch/ulambda/alphafold/runtest
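
The help text of the wrapper requires every FASTA path to have a unique basename, because the basename names the output directory. When passing several files to --fasta-paths it can be worth checking this before submitting the job. A minimal sketch, assuming the basename is the file name without its extension; the function name duplicate_basenames is illustrative.

```python
from collections import Counter
from pathlib import Path


def duplicate_basenames(fasta_paths):
    """Return the basenames (without extension) used by more than one path."""
    counts = Counter(Path(p).stem for p in fasta_paths)
    return sorted(name for name, n in counts.items() if n > 1)


clashes = duplicate_basenames(["run1/T1024.fasta", "run2/T1024.fasta"])
print(clashes)  # prints ['T1024']
```

An empty list means that all output directories will be distinct.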


AlphaFold without containers


Fans of Conda may also wish to check out https://github.com/kalininalab/alphafold_non_docker. Just make sure to run module load gcc miniconda3 rather than following the exact procedure!




Revision #31
Created 20 July 2021 15:31:52 by Ewan Roche
Updated 26 May 2023 06:55:44 by Ewan Roche