
How to Solve the Protein Folding Problem: AlphaFold2


A deeper look at AlphaFold2 and its neural architecture

Illustration of protein sequence to shape. In white, the original protein; in rainbow, the AlphaFold-predicted protein [blue = highest confidence, red = lowest confidence]. (Illustration by the author)

In this series of articles, I'll go through protein folding and deep learning models such as AlphaFold, OmegaFold, and ESMFold. We'll start with AlphaFold2!

Proteins are molecules that perform most of the biochemical functions in living organisms. They are involved in digestion (enzymes), structural processes (keratin in skin), and photosynthesis, and are also used extensively in the pharmaceutical industry [2].

The 3D structure of a protein is fundamental to its function. Proteins are built from 20 types of subunits called amino acids (or residues), each with different properties such as charge, polarity, length, and number of atoms. Amino acids consist of a backbone, common to all amino acids, and a side chain, unique to each amino acid. They are connected by peptide bonds [2].

Illustration of a protein polypeptide with 4 residues. (Illustration by the author)
Illustration of a protein backbone with 4 residues. (Illustration by the author)

Proteins contain residues oriented at specific torsion angles, called φ and ψ, which give rise to the protein's 3D shape.

Illustration of φ and ψ angles. (Illustration by the author)

The main problem every biologist faces is obtaining this 3D shape, which normally requires a crystal of the protein and X-ray crystallography. Proteins have varying properties; for instance, membrane proteins tend to be hydrophobic, meaning it is difficult to identify the conditions at which they crystallize [2]. Obtaining crystals is therefore a tedious and (arguably) highly random process that can take anywhere from days to many years, and it can be regarded as more of an art than a science. This means that many biologists may spend the entire duration of their Ph.D. attempting to crystallize a protein.

If you are lucky enough to get a crystal of your protein, you can upload the solved structure to the Protein Data Bank, a large database of protein structures.

This begs the question: can we simulate folding to obtain a 3D structure from a sequence? Short answer: yes, sort of. Long answer: we can use molecular simulations to attempt to fold proteins, but these are often computationally heavy. Hence, projects like Folding@home attempt to distribute the problem over many computers to obtain a dynamics simulation of a protein.

A competition, the Critical Assessment of Protein Structure Prediction (CASP), was created in which some newly solved 3D structures are held out so that participants can test their protein folding models against them. In 2020, DeepMind participated with AlphaFold2, beating the state of the art and obtaining outstanding performance.

Median Global Distance Test (GDT) for the CASP competition from 2008–2020. A score of about 90 is considered roughly equivalent to the crystal structure. AlphaFold2 outperformed all previous models, achieving state-of-the-art performance. (Illustration by the author, based on [4])

In this blog post, I'll go over AlphaFold2, explain its inner workings, and conclude with how it has revolutionized my work as a Ph.D. student working on protein design and machine learning.

Before we start, I would like to give a shoutout to OpenFold by the AQ Laboratory, an open-source implementation of AlphaFold that includes training code, in which I double-checked the sizes of the tensors I refer to in this article. Most of this article's information comes from the Supplementary Information of the original paper.

Let's begin with an overview. This is what the overall structure of the model looks like:

Overview of the AlphaFold Architecture [1]

Typically, you start with the amino acid sequence of your protein of interest. Note that a crystal is not necessary to obtain the amino acid sequence: it is usually obtained from DNA sequencing (if you know the gene of the protein) or protein sequencing, where the protein can be broken into smaller -mers and analysed by mass spectrometry, for instance.

The aim is to prepare two key pieces of information: the Multiple Sequence Alignment (MSA) representation and a pair representation. For simplicity, I'll skip the use of templates.

The MSA representation is obtained by searching for similar sequences in genetic databases. As the image shows, the sequences can also come from different organisms, e.g., a fish. Here we try to get general information about each index position of the protein and understand, in the context of evolution, how the protein has changed across different organisms. Proteins like Rubisco (involved in photosynthesis) are generally highly conserved and therefore show few differences across plants. Others, like the spike protein of a virus, are highly variable.

In the pair representation, we try to infer relationships between the sequence elements. For instance, position 54 of the protein may interact with position 1.

Throughout the network, these representations are updated several times. First, they are embedded to create a representation of the data. Then they go through the EvoFormer, which extracts information about sequences and pairs, and finally through the Structure Module, which builds the 3D structure of the protein.

The input embedder attempts to create a unique representation of the data. For MSA data, AlphaFold uses an arbitrary number of clusters rather than the full MSA to reduce the number of sequences that go through the transformer, thus decreasing computation. The MSA data input msa_feat (N_clust, N_res, 49) consists of the following features (a shape sketch follows the list):

  • cluster_msa (N_clust, N_res, 23): a one-hot encoding of the MSA cluster center sequences (20 amino acids + 1 unknown + 1 gap + 1 masked_msa_token)
  • cluster_profile (N_clust, N_res, 23): amino acid type distribution for each residue position in the MSA (20 amino acids + 1 unknown + 1 gap + 1 masked_msa_token)
  • cluster_deletion_mean (N_clust, N_res, 1): average number of deletions for each residue in each cluster (scaled to range 0–1)
  • cluster_deletion_value (N_clust, N_res, 1): number of deletions in the MSA (scaled to range 0–1)
  • cluster_has_deletion (N_clust, N_res, 1): binary feature indicating whether there are deletions
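To make the shapes concrete, here is a minimal sketch (not the official feature pipeline; the tensors are placeholders named after the bullet points above) of how these five features concatenate into the 49-channel msa_feat:

```python
import torch

N_clust, N_res = 128, 350  # example sizes, chosen arbitrarily

# Per-cluster features described above (zero placeholders here)
cluster_msa            = torch.zeros(N_clust, N_res, 23)  # one-hot cluster centre sequences
cluster_profile        = torch.zeros(N_clust, N_res, 23)  # amino acid distribution per column
cluster_deletion_mean  = torch.zeros(N_clust, N_res, 1)   # average deletions, scaled to 0-1
cluster_deletion_value = torch.zeros(N_clust, N_res, 1)   # deletions in the MSA, scaled to 0-1
cluster_has_deletion   = torch.zeros(N_clust, N_res, 1)   # binary deletion indicator

# Concatenate along the channel dimension: 23 + 23 + 1 + 1 + 1 = 49
msa_feat = torch.cat(
    [cluster_msa, cluster_profile, cluster_deletion_mean,
     cluster_deletion_value, cluster_has_deletion], dim=-1
)
assert msa_feat.shape == (N_clust, N_res, 49)
```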

For the pair representation, each amino acid is given a unique index in the sequence, and relpos encodes the relative distance between positions in the sequence. This is represented as a matrix of relative offsets between every pair of residues, with the offsets clipped to the range [-32, 32]: larger separations are capped at ±32, so the one-hot dimension is effectively 32 + 32 + 1 = 65.
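A minimal sketch of this relative-position encoding (my own simplification; the real model additionally feeds the one-hot offsets through a learned linear layer):

```python
import torch
import torch.nn.functional as F

def relpos_one_hot(n_res: int, clip: int = 32) -> torch.Tensor:
    """Return a (n_res, n_res, 2*clip + 1) one-hot encoding of clipped residue offsets."""
    idx = torch.arange(n_res)
    offset = idx[None, :] - idx[:, None]       # signed distance between residues i and j
    offset = offset.clamp(-clip, clip) + clip  # clip to [-32, 32], shift to [0, 64]
    return F.one_hot(offset, num_classes=2 * clip + 1).float()

rel = relpos_one_hot(350)
print(rel.shape)  # torch.Size([350, 350, 65])
```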

Both the MSA representation and the pair representation go through several independent linear layers and are passed to the EvoFormer.

Architecture of the EvoFormer [1]

The EvoFormer consists of 48 blocks that use self-attention to allow the MSA and pair representations to communicate. We first look at the MSA stack and then at how it is merged into the pair representation.

2.1 MSA Stack

MSA Stack of the EvoFormer. [1] (edited by the author)

This consists of row-wise gated self-attention with pair bias, column-wise gated self-attention, transition, and outer product mean blocks.

2.1A Row-Wise Gated Self-Attention with Pair Bias

Row-Wise Gated Self-Attention with Pair Bias in the EvoFormer. [1]

The key point here is to allow the MSA and pair representations to exchange information with each other.

First, multi-head attention is used to calculate dot-product affinities (N_res, N_res, N_heads) from an MSA representation row, meaning the amino acids in the sequence learn a kind of "conceptual importance" between pairs; in essence, how important one amino acid is for another.

Then, the pair representation goes through a linear layer without bias, meaning only a weight matrix is learned. The linear layer outputs N_heads dimensions, producing the pair bias matrix (N_res, N_res, N_heads). Remember that the relative positions were initially clipped at ±32, so amino acids more than 32 indices apart all fall into the same extreme bin.

At this point, we have two matrices of shape (N_res, N_res, N_heads) that we can simply add together and pass through a softmax to get values between 0 and 1: these are the attention weights. The MSA row is separately passed through a linear layer to produce the values.

Now we calculate the dot product between the following (a single-head sketch follows the list):

  • the attention weights, and
  • the MSA row passed through a linear layer + sigmoid (I think the sigmoid here acts as a gate, returning a probability-like array ranging from 0 to 1)
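Here is a rough single-head sketch of this operation for one MSA row, under my own simplified notation (multi-head bookkeeping omitted, learned weights replaced by plain nn.Linear stand-ins):

```python
import math
import torch
import torch.nn as nn

c_m, c_z, c = 256, 128, 32          # MSA channels, pair channels, head dim (example sizes)
N_res = 350

to_q = nn.Linear(c_m, c, bias=False)
to_k = nn.Linear(c_m, c, bias=False)
to_v = nn.Linear(c_m, c, bias=False)
to_b = nn.Linear(c_z, 1, bias=False)   # pair bias projection (one head here)
to_g = nn.Linear(c_m, c)               # gate projection

m_row = torch.randn(N_res, c_m)        # one row of the MSA representation
z = torch.randn(N_res, N_res, c_z)     # pair representation

q, k, v = to_q(m_row), to_k(m_row), to_v(m_row)
bias = to_b(z).squeeze(-1)                       # (N_res, N_res) pair bias
affinity = q @ k.T / math.sqrt(c) + bias         # dot-product affinities + pair bias
attn = affinity.softmax(dim=-1)                  # attention weights in [0, 1]
gate = torch.sigmoid(to_g(m_row))                # sigmoid gate in [0, 1]
out = gate * (attn @ v)                          # gated, attention-weighted update of the row
```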

2.1B Column-Wise Gated Self-Attention

Column-Wise Gated Self-Attention in the EvoFormer. [1]

The key point here is that the MSA is an aligned version of all the sequences related to the input sequence. This means that index X corresponds to the same region of the protein in every sequence.

By doing this operation column-wise, we make sure we get a general understanding of which residues are more likely at each position. This also means the model should be robust to similar sequences with small differences that produce similar 3D shapes.

2.1C MSA Transition

MSA Transition in Evoformer. [1]

This is a simple 2-layer MLP that first increases the channel dimension by a factor of 4 and then reduces it back down to the original dimension.
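As a sketch, the transition block is essentially the following (channel size is illustrative, and I assume the usual layer norm in front, as in OpenFold):

```python
import torch.nn as nn

c_m = 256  # example channel dimension

msa_transition = nn.Sequential(
    nn.LayerNorm(c_m),
    nn.Linear(c_m, 4 * c_m),  # expand channels by a factor of 4
    nn.ReLU(),
    nn.Linear(4 * c_m, c_m),  # project back to the original dimension
)
```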

2.1D Outer Product Mean

Outer Product Mean in the EvoFormer. [1]

This operation aims at keeping a continuous flow of information between the MSA and the pair representation. Each column in the MSA corresponds to an index position in the protein sequence.

  • Here, we select columns i and j, which we independently send through a linear layer. This linear layer uses c=32, which is smaller than c_m.
  • The outer product is then calculated, averaged over the sequences, flattened, and passed through another linear layer.

We now have an updated entry ij in the pair representation. We repeat this for all pairs, as sketched below.
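A condensed sketch of the idea, averaging over the cluster/sequence dimension (c=32 as mentioned above; the projection layers are plain nn.Linear stand-ins and the sizes are examples):

```python
import torch
import torch.nn as nn

N_clust, N_res, c_m, c, c_z = 16, 64, 256, 32, 128   # example sizes

proj_a = nn.Linear(c_m, c)
proj_b = nn.Linear(c_m, c)
to_pair = nn.Linear(c * c, c_z)

msa = torch.randn(N_clust, N_res, c_m)

a = proj_a(msa)   # (N_clust, N_res, c)
b = proj_b(msa)   # (N_clust, N_res, c)

# Outer product for every residue pair (i, j), averaged over the sequences s
outer = torch.einsum('sic,sjd->ijcd', a, b) / N_clust   # (N_res, N_res, c, c)
pair_update = to_pair(outer.flatten(start_dim=-2))      # (N_res, N_res, c_z)
```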

2.2 Pairs Stack

Pairs Stack of the EvoFormer. [1]

Our pair representation can technically be interpreted as a distance matrix. Earlier, we saw how each amino acid starts out with up to 32 neighbours on either side. We can therefore construct triangle graphs based on triples of indices of the pair representation.

For instance, nodes i, j, and k have edges ij, ik, and jk. Each edge is updated with information from the other two edges of all the triangles it is part of.

Triangle Multiplicative Update and Triangle Self-Attention. [1]

2.2A Triangular Multiplicative Update

We have two types of updates: one for outgoing edges and one for incoming edges.

Triangular Multiplicative Update. [1]

For outgoing edges, the full rows i and j of the pair representation are first independently passed through a linear layer, producing representations of the left edges and the right edges.

Then, we compute the product between the corresponding representation of the ij pair and the left and right edge representations independently.

Finally, we take the product of the left and right edge representations, followed by a final product with the ij pair representation.

For incoming edges, the algorithm is very similar, but keep in mind that where previously we considered the edge ik, we now go in the opposite direction, ki. In the OpenFold code, this is implemented simply as a permutation.
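A stripped-down sketch of the "outgoing edges" update (gating and layer norms omitted, layer sizes are examples; the incoming-edge version would combine a_ki and b_kj instead, which OpenFold obtains by permuting the tensors):

```python
import torch
import torch.nn as nn

N_res, c_z, c = 64, 128, 128   # example sizes

left_proj  = nn.Linear(c_z, c)
right_proj = nn.Linear(c_z, c)
out_proj   = nn.Linear(c, c_z)

z = torch.randn(N_res, N_res, c_z)    # pair representation

a = left_proj(z)    # "left edges"  a_ik, shape (N_res, N_res, c)
b = right_proj(z)   # "right edges" b_jk, shape (N_res, N_res, c)

# Outgoing edges: edge ij is updated from edges ik and jk, summed over k
update = torch.einsum('ikc,jkc->ijc', a, b)
z = z + out_proj(update)               # residual update of the pair representation
```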

2.2B Triangular Self-Attention

Triangular Self-Attention. [1]

This operation aims at updating the pair representation by using self-attention. The main goal is to update each edge with the most relevant edges, i.e., which amino acids in the protein are most likely to interact with the current node.

With self-attention, we learn the best way to update the edge through:

  • (query-key) similarity between edges that contain the node of interest; for example, for node i, all edges that share that node (e.g., ij, ik).
  • a third edge (e.g., jk), which, even though it does not directly connect to node i, is part of the triangle.

This last operation is similar in style to a graph message-passing algorithm, where, even when nodes are not directly connected, information from other nodes in the graph is weighted and passed on.
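A single-head sketch of the "starting node" flavour of this attention (my simplification; multi-head logic and gating omitted, sizes are examples):

```python
import math
import torch
import torch.nn as nn

N_res, c_z, c = 64, 128, 32   # example sizes

to_q = nn.Linear(c_z, c, bias=False)
to_k = nn.Linear(c_z, c, bias=False)
to_v = nn.Linear(c_z, c, bias=False)
to_b = nn.Linear(c_z, 1, bias=False)   # bias contributed by the third edge of the triangle

z = torch.randn(N_res, N_res, c_z)     # pair representation

q, k, v = to_q(z), to_k(z), to_v(z)    # (N_res, N_res, c)
bias = to_b(z).squeeze(-1)             # (N_res, N_res): bias[j, k] comes from edge jk

# For edge ij, attend over all edges ik sharing node i, biased by the third edge jk
affinity = torch.einsum('ijc,ikc->ijk', q, k) / math.sqrt(c) + bias[None, :, :]
attn = affinity.softmax(dim=-1)                    # (N_res, N_res, N_res)
z_update = torch.einsum('ijk,ikc->ijc', attn, v)   # weighted message passed to edge ij
```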

2.2C Transition Block

Identical to the transition block in the MSA trunk: a 2-layer MLP where the channel dimension is first expanded by a factor of 4 and then reduced back to the original number.

The output of the EvoFormer block is an updated representation of both the MSA and the pairs (with the same dimensionality).

Structure Module of AlphaFold. [1]

The Structure Module is the final part of the model and converts the pair representation and the single sequence representation (corresponding to a row of the MSA representation) into a 3D structure. It consists of 8 layers with shared weights, and the pair representation is used to bias the attention operations in the Invariant Point Attention (IPA) module.

The outputs are:

  • Backbone frames (r, 3×3): frames represent a Euclidean transform of atomic positions from a local frame of reference to a global one. This is a free-floating body representation (blue triangles) composed of N-Cα-C; thus, each residue (r_i) has three sets of (x, y, z) coordinates.
  • χ angles of the side chains (r, 3): represent the angle of each rotatable bond of the side chain. The angles define the rotational isomer (rotamer) of a residue; therefore, one can derive the exact positions of its atoms. Up to χ1, χ2, χ3, χ4.

Note that χ refers to the dihedral angle of each rotatable bond of the side chain. Shorter amino acids do not have all 4 χ angles, as shown below:

Side-chain angles for lysine and tyrosine. Tyrosine is shorter and doesn't have χ3, χ4. (Illustration by the author)

3.1 Invariant Point Attention (IPA)

Invariant Point Attention (IPA) of Structure Module. [1]

In general, this type of attention is designed to be invariant to Euclidean transformations such as translations and rotations.

  • We first update the single representation with self-attention, as explained in previous sections.
  • We also feed information about the backbone frames of each residue to produce query points, key points, and value points in the local frame. These are then projected into the global frame, where they interact with other residues, and then projected back to the local frame.
  • The word "invariant" refers to the fact that global and local reference points are enforced to be invariant by using squared distances and coordinate transformations in 3D space.

3.2 Predict side chain and backbone torsion angles

The single representation goes through a couple of MLPs and outputs the torsion angles ω, φ, ψ, χ1, χ2, χ3, χ4.

3.3 Backbone Update

This block returns two updates: a rotation, represented by a quaternion (1, a, b, c, where the first value is fixed to 1 and a, b, and c correspond to the Euler axis predicted by the network), and a translation, represented by a vector.
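The non-normalized quaternion (1, a, b, c) can be turned into a rotation matrix and combined with the predicted translation. A sketch following the standard quaternion-to-rotation-matrix formula (not the exact AlphaFold code):

```python
import torch

def backbone_update(quat_abc: torch.Tensor, translation: torch.Tensor):
    """quat_abc: (..., 3) predicted (a, b, c); translation: (..., 3) predicted offset."""
    a, b, c = quat_abc.unbind(dim=-1)
    w = torch.ones_like(a)                    # first quaternion component fixed to 1
    q = torch.stack([w, a, b, c], dim=-1)
    q = q / q.norm(dim=-1, keepdim=True)      # normalize to a unit quaternion
    w, x, y, z = q.unbind(dim=-1)

    # Standard unit-quaternion -> rotation matrix conversion
    rot = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)
    return rot, translation

rot, trans = backbone_update(torch.randn(8, 3), torch.randn(8, 3))  # 8 example residues
print(rot.shape, trans.shape)  # torch.Size([8, 3, 3]) torch.Size([8, 3])
```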

3.4 All Atom Coordinates

At this point, we have both the backbone frames and the torsion angles, and we would like to obtain the exact atom coordinates of each amino acid. Amino acids have a very specific arrangement of atoms, and we know their identities from the input sequence. We therefore apply the torsion angles to the idealized atom positions of each amino acid.
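Conceptually, each χ angle rotates all side-chain atoms beyond a given bond about that bond's axis. A toy sketch using Rodrigues' rotation formula (idealized geometry and per-residue atom bookkeeping omitted; coordinates below are made up):

```python
import torch

def rotate_about_axis(points, origin, axis, angle):
    """Rotate (N, 3) points about a bond axis through `origin` by `angle` (radians)."""
    k = axis / axis.norm()                        # unit vector along the bond
    p = points - origin
    cos, sin = torch.cos(angle), torch.sin(angle)
    # Rodrigues' formula: p*cos + (k x p)*sin + k*(k.p)*(1 - cos)
    rotated = p * cos + torch.cross(k.expand_as(p), p, dim=-1) * sin \
              + k * (p @ k).unsqueeze(-1) * (1 - cos)
    return rotated + origin

# Example: rotate two downstream side-chain atoms about the CA-CB bond by chi1
ca = torch.tensor([0.0, 0.0, 0.0])
cb = torch.tensor([1.5, 0.0, 0.0])
atoms = torch.tensor([[2.0, 1.0, 0.0], [2.5, 1.5, 0.5]])
new_atoms = rotate_about_axis(atoms, origin=cb, axis=cb - ca, angle=torch.tensor(1.0))
```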

Note that you will often see structural violations in the raw output of AlphaFold, such as those depicted below. This is because the model itself does not enforce physical energy constraints. To alleviate this problem, an AMBER relaxation force field is run to minimize the energy of the protein.

Before AMBER relaxation, a methionine and a tryptophan are predicted too close to each other, forming an impossible bond. After AMBER relaxation, the rotamers are adjusted to minimize energy and steric clashes. (Illustration by the author)

The AlphaFold model contains several self-attention layers and very large activations due to the size of the MSAs. Classical backpropagation is optimized to reduce the total number of computations per node by storing intermediate activations. However, in the case of AlphaFold, for a protein of 384 residues, this would require more than the memory available on a TPU core (16 GiB).

Instead, AlphaFold uses gradient checkpointing (also called rematerialization). Activations are recomputed one layer at a time, bringing memory consumption down to around 0.4 GiB.

This GIF shows what backpropagation normally looks like:

Backpropagation. (Illustration by the author, based on https://github.com/cybertronai/gradient-checkpointing)

By checkpointing, we reduce memory usage, though this has the unfortunate side effect of increasing training time by 33%:

Fixing a layer for checkpointing. (Illustration by the author, based on https://github.com/cybertronai/gradient-checkpointing)
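In PyTorch, the same memory-for-compute trade-off can be sketched with torch.utils.checkpoint (this illustrates the technique only; it is not AlphaFold's actual implementation, and the block sizes are arbitrary):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A stand-in stack of "EvoFormer-like" blocks
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 128)) for _ in range(48)]
)

x = torch.randn(350, 128, requires_grad=True)

for block in blocks:
    # Activations inside each block are not stored; they are recomputed
    # during the backward pass, trading extra compute for lower memory.
    x = checkpoint(block, x, use_reentrant=False)

x.sum().backward()
```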

What if, rather than a sequence of amino acids, you had the model of a cool protein you designed with a dynamics simulation? Or one that you modeled to bind another protein, like the COVID spike protein? Ideally, you would want to predict the sequence necessary to fold into an input 3D shape that may or may not exist in nature (i.e., it could be a completely new protein). Let me introduce you to the world of protein design, which is also the subject of my Ph.D. project, TIMED (Three-dimensional Inference Method for Efficient Design):

The Inverse Folding Problem. (Illustration by the author)

This problem is arguably harder than the folding problem, as multiple sequences can fold to the same shape. This is because there is redundancy in amino acid types, and there are also areas of a protein that are less critical for a particular fold.

The cool thing about AlphaFold is that we can use it to double-check whether our models work well:

The Folding and Inverse Folding Problem. (Illustration by the author)

If you would like to know more about this model, have a look at my GitHub repository, which also includes a little UI demo!

UI demo of TIMED for solving the Inverse Folding Problem. (Illustration by the author). GitHub: https://github.com/wells-wood-research/timed-design

In this article, we saw how AlphaFold (partially) solves a clear problem for biologists, namely obtaining 3D structures from an amino acid sequence.

We broke the model down into the Input Embedder, the EvoFormer, and the Structure Module. Each of these uses several self-attention layers, along with many tricks to optimize performance.

AlphaFold works well, but is this it for biology? No. AlphaFold is still computationally very expensive, and there is no straightforward way to use it (no, Google Colab is not easy; it's clunky). Several alternatives, like OmegaFold and ESMFold, try to solve these problems.

These models still don't explain how a protein folds over time. There are also plenty of challenges in designing proteins, where inverse folding models can use AlphaFold to double-check that designed proteins fold to a specific shape.

In the next articles of this series, we will look into OmegaFold and ESMFold!

[1] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature (2021). DOI: 10.1038/s41586-021-03819-2

[2] Alberts B. Molecular Biology of the Cell. (2015) Sixth edition. New York, NY: Garland Science, Taylor and Francis Group.

[3] Ahdritz G, Bouatta N, Kadyan S, Xia Q, Gerecke W, O'Donnell TJ, Berenberg D, Fisk I, Zanichelli N, Zhang B, et al. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization (2022). bioRxiv. DOI: 10.1101/2022.11.20.517210

[4] Callaway E. "It will change everything": DeepMind's AI makes gigantic leap in solving protein structures (2020). Nature 588(7837):203–204. DOI: 10.1038/d41586-020-03348-4
