Large Language Models in Molecular Biology
Introduction
Large Language Models
The Genetic Dogma
Variation in our DNA
Language Models in Molecular Biology
Looking Forward
Conclusion
Acknowledgments
References

One essential clue in determining whether a given variant is benign, or at least not too deleterious, comes from comparing human genetics to the genetics of close relatives such as chimpanzees and other primates (Figure 12). Our genome closely resembles the genomes of other primates: it is 98.8% identical to the genome of chimpanzees, 98.4% identical to the genome of gorillas, and 97% identical to the genome of orangutans, for instance. Proteins, which are conserved by evolution, are even more similar on average. Our biology is very similar, and when a mutation in a human protein is lethal or causes a serious genetic disease, the same mutation in the corresponding primate protein is likely to be harmful as well. Conversely, protein variants that are observed in healthy primates are likely to be benign in humans as well. Therefore, the more primate genomes we can access, the more information we can gather about the human genome: we can compile a list of protein variants that are frequently observed in primates and deduce that these variants are likely benign in humans. Hence, the search for mutations that confer serious genetic disease should start from mutations not on this list.

Such a list of variants in primate proteins can never be enough to classify human mutations as benign or pathogenic. Simply put, there will be too many benign human mutations that have not had the chance to appear on the list of variants observed in primates. However, this list can be used in a more productive way: by observing the patterns within protein sequences and structures that tend to tolerate variants, and the patterns that tend not to tolerate variants. By learning to distinguish between these two classes of protein positions, we can gain the ability to annotate variants in proteins as likely benign or likely pathogenic.
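To make the idea concrete, here is a minimal, purely illustrative Python sketch of how a catalog of primate-observed variants could be used to label human missense variants. The catalog entries and the classify_variant helper are hypothetical, and absence from the catalog is treated as uninformative rather than as evidence of pathogenicity.

```python
# Minimal sketch (hypothetical data): variants observed in healthy primates
# are treated as likely benign in humans; everything else stays unresolved.

# Hypothetical catalog of common primate missense variants, keyed by
# (protein_id, position) -> set of observed alternate amino acids.
primate_variants = {
    ("BRCA1", 1699): {"Q"},
    ("BRCA1", 871): {"L", "P"},
    ("HBB", 6): {"V"},
}

def classify_variant(protein_id, position, alt_aa):
    """Label a human missense variant using the primate catalog.

    A variant seen in healthy primates is treated as likely benign;
    anything else is 'unresolved' -- absence from the list is NOT
    evidence of pathogenicity.
    """
    observed = primate_variants.get((protein_id, position), set())
    return "likely_benign" if alt_aa in observed else "unresolved"

print(classify_variant("HBB", 6, "V"))        # likely_benign
print(classify_variant("BRCA1", 1699, "W"))   # unresolved
```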

The Illumina AI lab headed by Kyle Farh, which developed the SpliceAI method, adopted this approach to annotate variants in human proteins (Gao et al. 2023). Initially, in collaboration with others, they collected primate blood samples and sequenced the genomes of as many primates as they could access, including 809 individuals from 233 distinct primate species. This sequencing effort is also an important conservation initiative: some primate species are endangered, and preserving the wealth of genetic information in these species is crucial for basic science as well as for informing human genetics.

The team identified a catalog of 4.3 million common protein variants in primates, with the corresponding protein also being present in humans. Then, they constructed a transformer that learns to differentiate between benign and pathogenic variants in human proteins. This was achieved by learning the patterns of protein positions where primate variants tend to be present, in contrast to protein positions where primate variants tend to be absent. The transformer, named PrimateAI-3D, is a new version of a previous deep learning tool, PrimateAI (Sundaram et al. 2018), developed by the same laboratory. PrimateAI-3D utilizes both protein sequence data and protein 3D models that are either experimentally reconstructed or computationally predicted by tools like AlphaFold and HHpred, voxelized at 2 Angstrom resolution (Figure 13).
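To illustrate the voxelization step, here is a toy Python sketch that maps atom coordinates around a target residue onto a 2 Angstrom occupancy grid. The coordinates, box size, and occupancy scheme are assumptions for illustration; this is not the actual PrimateAI-3D preprocessing code.

```python
# Sketch of voxelizing protein atom coordinates onto a 2 Angstrom grid
# (illustrative only; not the actual PrimateAI-3D pipeline).
import numpy as np

def voxelize(coords, box_size=32, resolution=2.0):
    """Map (N, 3) atom coordinates, centered on a target residue,
    into a box_size^3 occupancy grid with `resolution` Angstrom voxels."""
    grid = np.zeros((box_size, box_size, box_size), dtype=np.float32)
    center = coords.mean(axis=0)
    idx = np.floor((coords - center) / resolution).astype(int) + box_size // 2
    keep = np.all((idx >= 0) & (idx < box_size), axis=1)
    for i, j, k in idx[keep]:
        grid[i, j, k] += 1.0          # simple atom-count occupancy
    return grid

# Hypothetical coordinates for a handful of atoms around a target residue.
coords = np.random.default_rng(0).normal(scale=8.0, size=(50, 3))
print(voxelize(coords).sum())         # number of atoms that landed inside the box
```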

Human protein structures are voxelized, and along with multiple sequence alignments are passed as input to a 3D convolutional neural network that predicts the pathogenicity of all possible point mutations of a target residue. The network is trained using a loss function with three components: (1) a language model predicting a missing human or primate amino acid using the surrounding multiple alignment as input; (2) a 3D convolutional "fill-in-the-blank" model predicting a missing amino acid in the 3D structure; and (3) a language model score trained to classify between observed variants and random variants with matching statistical properties. Figure created by Tobias Hemp and included with permission.

On the ClinVar data set of human-annotated variants and their effects, PrimateAI-3D achieved 87.3% recall and 80.2% precision, with an AUC of 0.843, which was the best among state-of-the-art methods, even though, unlike other methods, it was not trained on ClinVar. Furthermore, examining corrections to ClinVar across its versions hints that some proportion of the variants where PrimateAI-3D and ClinVar disagree could be accurately called by PrimateAI-3D.

PrimateAI-3D can be applied to the diagnosis of rare disease, where it can prioritize variants that are likely deleterious and filter out likely benign variants. Another application is the discovery of genes associated with complex diseases: in a cohort of patients with a given disease, one can search for variants that are likely deleterious according to PrimateAI-3D, and then look for an abundance of such variants within a particular gene across the cohort. Genes that exhibit this pattern of being hit by many likely deleterious variants in patients with a given disease are said to carry a genetic "burden" that is a signal of playing a role in the disease. Gao and colleagues from the PrimateAI-3D team studied several genetic diseases with this technique and discovered many genes previously not known to be associated with these diseases. Using PrimateAI-3D, Fiziev et al. (2023) developed improved rare variant polygenic risk score (PRS) models to identify individuals at high disease risk. They also integrated PrimateAI-3D into rare variant burden tests within UK Biobank and identified promising novel drug target candidates.
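The burden idea can be sketched with a very simple statistical test: count how many cases versus controls carry a likely deleterious variant in a gene and ask whether the excess is significant. The counts below are made up, and real analyses (for example those run on UK Biobank) use richer models with covariates; this is only a minimal sketch of the concept.

```python
# Sketch of a gene-level rare-variant burden test: do patients carry
# significantly more likely-deleterious variants in a gene than controls?
from scipy.stats import fisher_exact

def burden_test(carriers_cases, n_cases, carriers_controls, n_controls):
    """2x2 Fisher exact test on carrier counts (one-sided: enrichment in cases)."""
    table = [
        [carriers_cases, n_cases - carriers_cases],
        [carriers_controls, n_controls - carriers_controls],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value

# Hypothetical example: 30 of 2,000 patients vs 25 of 10,000 controls carry
# a variant flagged as likely deleterious in the gene of interest.
print(burden_test(30, 2000, 25, 10000))
```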

Modeling gene regulation

As outlined earlier, the intricate process of gene regulation encompasses many interacting molecular components: the DNA chromatin structure, the chemical alterations in the histones that DNA wraps around, the attachment of transcription factors to promoters and enhancers, and the establishment of 3D DNA structure involving promoters, enhancers, certain transcription factors, and the recruitment of RNA polymerase. Theoretically, the precise DNA sequence in the vicinity of a gene carries all the information needed for this machinery to be triggered at the right time, in the right amount, and in the right cell type. In practice, predicting gene expression from the DNA sequence alone is a formidable task. Yet, language models have recently achieved significant progress in this area.

Over the past 20 years, genomic researchers have undertaken monumental efforts to produce the right kinds of large-scale molecular data for understanding gene regulation. Hundreds of different assays have been developed that inform various aspects of the central dogma, too numerous to detail here. Here are some examples of the information obtained, always associated with a human cell line or tissue type (the former often being immortalized cell lines, and the latter often sourced from deceased donors): (1) Identifying the precise locations across the entire genome that have open chromatin and those that have tightly packed chromatin. Two relevant assays for this are DNase-seq and ATAC-seq. (2) Pinpointing all locations in the genome where a particular transcription factor is bound. (3) Identifying all locations in the genome where a particular histone chemical modification has occurred. (4) Determining the level of mRNA available for a given gene, i.e., the expression level of a particular gene. This kind of data has been obtained for hundreds of human and mouse cell lines from numerous individuals. In total, several thousand such experiments have already been collected under multi-year international projects like ENCODE, modENCODE, Roadmap Epigenomics, Human Cell Atlas, and others. Each experiment, in turn, has tens to hundreds of thousands of data points across the entire human or model organism genome.
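One common way to organize such data for modeling is to turn each assay into a numeric "track" binned along the genome, so that a DNA window can be paired with the aligned slice of every track. The bin size, track names, and values below are hypothetical; this is only a sketch of the data layout, not any project's actual format.

```python
# Sketch: each assay becomes a genome-wide numeric track binned at a fixed
# resolution (toy values; track names and bin size are illustrative).
import numpy as np

BIN_SIZE = 128                      # nucleotides per bin
CHROM_LEN = 1_000_000               # toy chromosome length
n_bins = CHROM_LEN // BIN_SIZE

rng = np.random.default_rng(0)
tracks = {
    "DNase-seq_K562":   rng.poisson(1.0, n_bins),   # open chromatin
    "ATAC-seq_GM12878": rng.poisson(1.0, n_bins),
    "CTCF_ChIP_K562":   rng.poisson(0.3, n_bins),   # a bound transcription factor
    "H3K27ac_K562":     rng.poisson(0.5, n_bins),   # a histone modification
    "CAGE_K562":        rng.poisson(0.2, n_bins),   # expression proxy
}

# A training example pairs a DNA window with the aligned slice of every track.
start_bin, width = 2000, 896
targets = np.stack([t[start_bin:start_bin + width] for t in tracks.values()])
print(targets.shape)                # (num_tracks, 896)
```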

A lineage of language models, culminating in the transformer-based Enformer tool (Avsec et al. 2021), have been developed to accept the DNA sequence near a gene as input and output the cell type-specific expression level of that gene, for any gene in the genome. Enformer is trained on the following task: given a genome region of 100,000 nucleotides and a particular cell type, it is trained to predict each of the available kinds of experimental data for this region, including the status of open or packed chromatin, the present histone modifications, the specific bound transcription factors, and the level of gene expression. A language model is well suited to this task: instead of masked language modeling, Enformer is trained in a supervised way, predicting all of the tracks simultaneously from DNA sequence. By incorporating attention mechanisms, it can efficiently collate information from distant regions (up to 100,000 nucleotides away) to predict the status of a given location. In effect, Enformer learns all the intricate correlations between these diverse molecular entities.
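The following heavily simplified PyTorch sketch conveys the sequence-to-tracks idea: convolutions compress one-hot DNA into coarse bins, self-attention mixes information across the whole window, and a shared head predicts all tracks at every bin. The layer sizes, pooling factor, and track count are illustrative assumptions, not Enformer's real architecture or hyperparameters.

```python
# Toy sequence-to-tracks model in the spirit of Enformer (not the real thing).
import torch
import torch.nn as nn

class TinySeqToTracks(nn.Module):
    def __init__(self, n_tracks=64, d_model=128, pool=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, d_model, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.MaxPool1d(pool),                    # one bin per `pool` nucleotides
        )
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.attn = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, n_tracks)

    def forward(self, onehot_dna):                  # (batch, 4, seq_len)
        x = self.conv(onehot_dna).transpose(1, 2)   # (batch, bins, d_model)
        x = self.attn(x)                            # attention across distant bins
        return self.head(x)                         # (batch, bins, n_tracks)

seq = torch.randn(1, 4, 100_000)                    # stand-in for one-hot DNA
print(TinySeqToTracks()(seq).shape)                 # torch.Size([1, 781, 64])
```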

Image included with permission from the corresponding author, Žiga Avsec.

Enformer performs reasonably well in predicting gene expression from sequence alone. If we measure gene expression across all genes in the same cell line using a particular experimental assay (for instance, the CAGE assay), two replicates of the same experiment typically correlate at around 0.94. A computational method performing at this level could arguably reduce the need for collecting experimental data. Enformer does not quite achieve this yet, correlating at a level of 0.85 with experimental data, which is about three times the error compared to two experimental replicates. However, this performance is expected to improve as more data are incorporated and enhancements are made to the model. Notably, Enformer can predict the changes in gene expression caused by mutations found in different individuals, as well as by mutations artificially introduced through CRISPR experiments. It still has its limitations, though, such as performing poorly in predicting the effects of distal enhancers (enhancers that are far from the gene start) (Karollus et al. 2023) and in accurately determining the direction of the effect of personal variants on gene expression (Sasse et al. 2023). Such shortcomings are likely due to insufficient training data. With data generation proceeding at an accelerated pace, it is not unreasonable to anticipate that in the foreseeable future we will have LLMs capable of predicting gene expression from sequence alone with experimental-level accuracy, and consequently models that accurately and comprehensively depict the complex molecular mechanisms involved in the central dogma of molecular biology.

As discussed above, DNA inside cells is arranged in a complex, hierarchical 3D chromatin structure, which plays a role in gene regulation because only genes within open chromatin are expressed. Orca (Zhou 2022) is a recent language model, based on a convolutional encoder-decoder architecture, that predicts 3D genome structure from proximity data provided by Hi-C experiments. These are datasets spanning the entire genome of a cell line or tissue sample, in which pairs of genomic positions that are close to each other are revealed as DNA fragments that join together a piece of DNA from each region. The Orca model combines a hierarchical multi-level convolutional encoder with a multi-level decoder, which predicts DNA structure at 9 levels of resolution, from 4 kb (kilobase pairs) to 1024 kb, for input DNA sequences that are as long as the longest human chromosome.
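The multi-resolution idea can be illustrated with a toy example: a Hi-C-style contact matrix at fine bins can be summarized at progressively coarser bin sizes, which is the kind of hierarchy Orca predicts from sequence. The matrix below is random toy data, and the bin sizes are assumptions for illustration only.

```python
# Sketch: summarizing a Hi-C-style contact matrix at coarser resolutions.
import numpy as np

def coarsen(contacts, factor):
    """Sum `factor` x `factor` blocks of a square contact matrix."""
    n = contacts.shape[0] // factor
    return contacts[:n * factor, :n * factor] \
        .reshape(n, factor, n, factor).sum(axis=(1, 3))

rng = np.random.default_rng(0)
fine = rng.poisson(1.0, size=(1024, 1024))      # e.g. 4 kb bins (toy data)
for level, factor in enumerate([1, 2, 4, 8]):
    m = coarsen(fine, factor) if factor > 1 else fine
    print(f"level {level}: {m.shape[0]} x {m.shape[1]} bins")
```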

Foundation Models

Foundation models are large deep learning architectures, such as the transformer-based GPT models by OpenAI, that encode a vast amount of knowledge from diverse sources. Researchers and practitioners can fine-tune these pre-trained models for specific tasks, leading to high-performance systems for a wide range of downstream applications. Several foundation models have begun to emerge in molecular biology. Here, we will briefly introduce two such models that recently appeared as preprints on bioRxiv. (Since the papers have not been peer reviewed yet, we refrain from reporting on their performance compared to other state-of-the-art methods.)

scGPT (Cui et al. 2023) is a foundation model designed for single-cell transcriptomics, chromatin accessibility, and protein abundance. This model is trained on single-cell data from 10 million human cells. Each cell contains expression values for a fraction of the roughly 20,000 human genes. The model learns embeddings of this huge cell × gene matrix, which provide insights into the underlying cellular states and active biological pathways. The authors innovatively adapted the GPT methodology to this vastly different setting (Figure 15). Specifically, the ordering of genes in the genome, unlike the ordering of words in a sentence, is not as meaningful. Therefore, while GPT models are trained to predict the next word, the concept of the "next gene" is unclear in single-cell data. The authors solve this problem by training the model to generate data based on a gene prompt (a set of known gene values) and a cell prompt. Starting from the known genes, the model predicts the remaining genes along with their confidence values. Over K iterations, it divides those into K bins, and the top 1/K most confident genes are fixed as known genes for the next iteration. Once trained, scGPT is fine-tuned for various downstream tasks: batch correction, cell annotation (where the ground truth is annotated collections of different cell types), perturbation prediction (predicting the cell state after a given set of genes are experimentally perturbed), multiomics (where each layer, namely transcriptome, chromatin, and proteome, is treated as a different language), prediction of biological pathways, and more.
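The iterative, confidence-binned generation scheme described above can be sketched as a simple loop. Here model_predict is a hypothetical stand-in for the trained transformer, and the gene names and values are made up; the sketch only shows the control flow of promoting the most confident predictions to "known" context over K rounds.

```python
# Sketch of confidence-binned iterative generation (scGPT-style control flow).
import numpy as np

def model_predict(known, unknown_genes):
    """Hypothetical stand-in: return (value, confidence) per unknown gene."""
    rng = np.random.default_rng(len(known))
    return {g: (rng.random(), rng.random()) for g in unknown_genes}

def iterative_generate(known, unknown_genes, K=4):
    unknown = set(unknown_genes)
    per_round = -(-len(unknown) // K)              # ceil(|unknown| / K)
    for _ in range(K):
        if not unknown:
            break
        preds = model_predict(known, unknown)
        # promote the top 1/K most confident predictions to known genes
        by_conf = sorted(preds, key=lambda g: preds[g][1], reverse=True)
        for g in by_conf[:per_round]:
            known[g] = preds[g][0]
            unknown.remove(g)
    return known

cell = iterative_generate({"CD3E": 2.1, "MS4A1": 0.0},
                          [f"GENE{i}" for i in range(20)])
print(len(cell))                                   # all genes filled in after K rounds
```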

A. Workflow of scGPT. The model is trained on a large number of cells from cell atlases, and is then fine-tuned for downstream applications such as clustering, batch correction, cell annotation, perturbation prediction, and gene network inference. B. Input embeddings. There are gene tokens, gene expression values, and condition tokens. C. The transformer layer. Image provided by Bo Wang.

The Nucleotide Transformer (Dalla-Torre et al. 2023) is a foundation model that focuses on raw DNA sequences. These sequences are tokenized into words of six characters each (k-mers of length 6) and trained using the BERT methodology. The training data consist of the reference human genome, 3,200 additional diverse human genomes to capture variation across human populations, and the genomes of 850 other species. The Nucleotide Transformer is then applied to 18 downstream tasks that encompass many of the previously discussed ones: promoter prediction, splice site donor and acceptor prediction, histone modifications, and more. Predictions are made either through probing, in which embeddings at different layers are used as features for simple classifiers (such as logistic regression or perceptrons), or through light, computationally inexpensive fine-tuning.
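For intuition, here is a minimal sketch of k-mer tokenization with non-overlapping 6-mers. The vocabulary handling (a single catch-all unknown token, no special tokens) is a simplification and not the model's actual tokenizer.

```python
# Sketch of non-overlapping 6-mer tokenization for DNA (simplified).
from itertools import product

K = 6
vocab = {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=K))}
UNK = len(vocab)                      # catch-all for Ns or other non-ACGT chunks

def tokenize(seq, k=K):
    chunks = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    return [vocab.get(c, UNK) for c in chunks]

dna = "ACGTACGTTGCAAGGTNNACGT"
print(tokenize(dna))                  # list of 6-mer token ids
```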

Deciphering the biomolecular code that connects our genomes to the intricate biomolecular pathways in our body's various cells, and subsequently to our physiology along with environmental interactions, does not require AGI. While there are many AI tasks that may or may not be on the horizon, I argue that understanding molecular biology and linking it to human health is not one of them. LLMs are already proving adequate for this general aspiration.

Here are some tasks that we are not asking the AI to do. We are not asking it to generate new content; rather, we are asking it to learn the complex statistical properties of existing biological systems. We are not requesting it to navigate intricate environments in a goal-oriented manner, maintain an internal state, form goals and subgoals, or learn through interaction with the environment. We are not asking it to solve mathematical problems or to develop deep counterfactual reasoning. We do, however, expect it to learn one-step causality relationships: if a certain mutation occurs, a particular gene malfunctions; if this gene is under-expressed, other genes in the cascade increase or decrease. Through simple one-step causal relationships, which can be learned by triangulating between correlations across modalities such as DNA variation, protein abundance, and phenotype (a technique known as Mendelian randomization), and from large-scale perturbation experiments that are becoming increasingly common, LLMs will effectively model cellular states. This connection extends from the genome at one end to the phenotype at the other.
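As a concrete illustration of the triangulation idea, one common Mendelian randomization estimate is the Wald ratio: a variant's effect on an exposure and on an outcome are combined to estimate the causal effect of the exposure on the outcome. The effect sizes below are hypothetical, and real analyses combine many variants and check the method's assumptions.

```python
# Sketch of the Wald ratio, the simplest Mendelian randomization estimate.
def wald_ratio(beta_variant_on_exposure, beta_variant_on_outcome):
    """Causal effect of exposure on outcome implied by a single variant."""
    return beta_variant_on_outcome / beta_variant_on_exposure

# Hypothetical: a variant raises a protein level by 0.30 SD and disease risk
# (log-odds scale) by 0.06, implying ~0.2 log-odds per SD of protein.
print(wald_ratio(0.30, 0.06))   # 0.2
```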

In summary, today's LLMs are sufficiently advanced to model molecular biology. Further methodological improvements are always welcome. However, the barrier is no longer deep learning methodology; the more significant gatekeeper is data.

Fortunately, data are becoming both cheaper and richer. Advances in DNA sequencing technology have reduced the cost of sequencing a human genome from roughly $3 billion for the first genome, to roughly $1,000 a few years ago, and now to as little as $200 today. The same cost reductions apply to all molecular assays that use DNA sequencing as their primary readout. This includes assays for quantifying gene expression, chromatin structure, histone modifications, transcription factor binding, and hundreds of other ingenious assays developed over the past 10–20 years. Further innovations in single-cell technologies, as well as in proteomics, metabolomics, lipidomics, and other -omic assays, allow for increasingly detailed and efficient measurements of the various molecular layers between DNA and human physiology.

The UK Biobank is a large-scale biomedical database and research resource, containing in-depth genetic and health information from around 500,000 UK volunteers. The participants were all between the ages of 40 and 69 years when they were recruited between 2006 and 2010. The data collected include blood, urine, and saliva samples, detailed information about the participants' backgrounds, lifestyle, and health, and subsequent medical histories accessed through health records. For a subset of participants, imaging data (brain, heart, abdomen, bones, and joints) have also been collected. The exomes of 470,000 individuals were released in June 2022, and the whole genomes of all individuals are expected by the end of 2023. Images provided by UK Biobank and included with permission.

So, how can all this be put together? A key type of data initiative is one that brings together a large group of volunteer participants for deep exploration of their -omic data, phenotypes, and health records. A leading example of this is the UK Biobank, a large-scale biobank, biomedical database, and research resource containing comprehensive genetic and health information from half a million UK participants (Figure 16). Participant biosamples have been collected with broad consent, and a wealth of data is continuously being generated. The exomes (protein-coding parts of the genome) of virtually all participants have been released, with whole genomes to follow. In addition, various types of data are available, including COVID-19 antibody data, metabolomic, telomere, imaging, genotype, clinical measurements, primary care, pain questionnaires, and more. Additional data types are continuously added. UKB data are available to anyone for research purposes. All of Us is a similar initiative in the US, which so far has sequenced the genomes of 250,000 participants. FinnGen (Finland Genomics) aims to create a similar biobank of 500,000 Finnish participants, which is incredibly helpful because genetic studies become much easier in a cohort that is genetically more homogeneous. deCODE Genetics leads a similar effort in Iceland, with more than two-thirds of the adult population of Iceland participating in the effort. Additional cohorts of sequenced participants exist, including tens of millions of exomes sequenced by Regeneron Pharmaceuticals (a private initiative), and many national initiatives worldwide.

Cancer in particular is a disease of the genome, and many companies are building a wealth of genomic information on cancer patients and cancer samples, along with additional clinical information. Covering this field is beyond our scope, but it is worth mentioning Tempus, an AI-based precision medicine company with a large and growing library of clinical and molecular data on cancer; Foundation Medicine, a molecular information company that provides comprehensive genomic profiling assays to identify the molecular alterations in a patient's cancer and match them with relevant targeted therapies, immunotherapies, and clinical trials; and GRAIL and Guardant Health, two pioneering diagnostic companies that focus on early tumor detection from "liquid biopsies," or analysis of the genomic content of patient blood samples, which often contain molecular shedding of cancer cells. Each of these companies has data on large and growing cohorts of patients.

In addition to these cohort initiatives, there are many other large-scale data initiatives. Notably, the Human Cell Atlas project has already produced gene expression data for 42 million human cells from 6,300 donor individuals. The ENCODE Project, a vast functional genomics dataset on hundreds of human cell lines and various molecular quantities, has generated data on gene expression, chromatin accessibility, transcription factor binding, histone marks, DNA methylation, and more.

LLMs are perfectly suited to integrate these data. Looking to the future, we could envision a mammoth LLM integrating across all such datasets. So, what might the architecture and training of such a model look like? Let's engage in a thought experiment and try to piece it together (a minimal sketch of a combined token vocabulary follows the list):

  • Genes in the genome, including important variants such as different isoforms of the resulting proteins, are tokenized.
  • Different types of cells and tissues are tokenized.
  • Human phenotypes, such as disease states, clinical indications, and adherence to drug regimens, are also tokenized.
  • DNA sequences are tokenized at a fixed-length nucleotide level.
  • Positional information in the genome connects genes with nucleotide content.
  • Protein sequences are tokenized using the amino acid alphabet.
  • Data from the Human Cell Atlas and other single-cell datasets train the LLM in an autoregressive manner akin to GPT, or with masked language modeling akin to BERT, highlighting cell-type-specific and cell-state-specific gene pathways.
  • ENCODE and similar data teach the LLM to associate different molecular information layers, such as raw DNA sequence and its variants, gene expression, methylation, histone modifications, chromatin accessibility, etc., in a cell-type-specific manner. Each layer is a distinct "language," with varying richness and vocabulary, providing unique information. The LLM learns to translate between these languages.
  • Projects like the PrimateAI-3D primate genomics initiative and other species sequencing efforts inform the LLM about the potential benign or harmful effects of mutations in the human genome.
  • The complete proteomes, including protein variants, are enriched with protein 3D structural information that is either experimentally obtained or predicted by AlphaFold, RoseTTAFold, and other structural prediction methods.
  • Datasets from the UK Biobank (UKB) and other cohorts allow the LLM to associate genomic variant information and other molecular data with human health information.
  • The LLM leverages the complete clinical records of participants to understand common practice and its effects, and connects this with other "languages" across all datasets.
  • The LLM harnesses the vast existing literature on basic biology, genetics, molecular science, and clinical practice, including all known associations of genes and phenotypes.
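As promised above, here is a minimal sketch of what a unified token space mixing these modalities could look like. All names, prefixes, and sizes are hypothetical, and a real model would need far richer vocabularies and positional schemes; the point is only that heterogeneous entities can share one indexed vocabulary.

```python
# Sketch (hypothetical names and sizes) of a unified token vocabulary mixing
# genes, cell types, phenotypes, DNA k-mers, and amino acids via modality tags.
from itertools import product

def build_vocab():
    genes       = [f"GENE:{g}" for g in ("TP53", "BRCA1", "CFTR")]        # ~20k in reality
    cell_types  = [f"CELL:{c}" for c in ("T_cell", "hepatocyte")]
    phenotypes  = [f"PHENO:{p}" for p in ("type2_diabetes", "asthma")]
    dna_kmers   = [f"DNA:{''.join(k)}" for k in product("ACGT", repeat=3)] # 6-mers in practice
    amino_acids = [f"AA:{a}" for a in "ACDEFGHIKLMNPQRSTVWY"]
    special     = ["[CLS]", "[MASK]", "[SEP]"]
    tokens = special + genes + cell_types + phenotypes + dna_kmers + amino_acids
    return {tok: i for i, tok in enumerate(tokens)}

vocab = build_vocab()
example = ["[CLS]", "CELL:hepatocyte", "GENE:CFTR", "PHENO:asthma", "[SEP]"]
print([vocab[t] for t in example])
```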

Developing such an LLM presents a significant challenge, of a different kind than that of the GPT line of LLMs. It requires technical innovation to represent and integrate the various information layers, as well as scaling up the number of tokens processed by the model. Potential applications of such an LLM are vast. To list a few:

  • It could leverage all available patient information, including their genome, other measurements, entire clinical history, and family health information, aiding doctors in making precise diagnoses, even for rare conditions. It could be particularly useful in diagnosing rare diseases and subtyping cancers.
  • The LLM could help identify promising gene and pathway targets for various clinical indications, individuals likely to respond to certain drugs, and those unlikely to benefit, thereby increasing the success of clinical trials. It could also assist in drug molecule development and drug repurposing.
  • Each of the layers of molecular information would be connected to other layers in a way similar to language translation, and the LLM could be probed for features that provide substantial predictive power. Whereas interpretation of deep learning models is a challenge, impressive advances are continuously made by a research community that is eager to make AI interpretable. In a recent such advance by OpenAI, GPT-4 was deployed to explain the behavior of each of the neurons of GPT-2 (https://openai.com/research/language-models-can-explain-neurons-in-language-models).
  • The model can be leveraged to identify the "gaps" in the training data, in the form of cell types, molecular layers, or even individuals of specific genetic backgrounds or disease indications, that are predicted with poor confidence from other data.

While developing these technologies, it is essential to consider potential risks, including those related to privacy and clinical practice. Patient privacy remains a significant concern. This is especially true for LLMs because, depending on the capacity of the model, in principle the data of participants used to train the model can be retrieved through a prompt that includes part of that data or other information that hones in on a specific patient. Therefore, it is especially important when training LLMs with participant data to have proper informed consent for the intended use of and access to these models.

Nevertheless, many individuals, exemplified by the participants in the UK Biobank cohort, are motivated to share their data and biosamples generously, providing immense benefits for research and society. As for clinical practice, it is unclear whether LLMs can independently be used for diagnosis and treatment recommendations. The primary purpose of these models is not to replace, but to assist healthcare professionals, offering powerful tools that doctors can use to verify and audit medical information. To quote Isaac Kohane, "trust, but verify" (Lee, Goldberg, and Kohane 2023).

So, what are the hurdles to fully implementing an LLM that bridges genetics, molecular biology, and human health? The main obstacle is data availability. The production of functional genomic data, such as those from ENCODE and the Human Cell Atlas, must be accelerated. Fortunately, the cost of generating such data is rapidly decreasing. Concurrently, multiomic cohort and clinical data should be produced and made publicly accessible. This process requires participants' consent, taking into consideration legitimate privacy concerns. Nevertheless, alongside the inalienable right to privacy, there is an equally essential right to participant data transparency: many individuals want to contribute by sharing their data. This is especially true for patients with rare genetic diseases and cancer, who want to help other patients by contributing to the study of the disease and the development of treatments. The success of the UK Biobank is a testament to participants' generosity in data sharing, aiming to make a positive impact on human health.

Molecular biology is not a set of neat concepts and clear principles, but a collection of trillions of little facts assembled over eons of trial and error. Human biologists excel at storytelling, putting these facts into descriptions and narratives that help with intuition and experimental planning. However, making biology into a computational science requires a combination of massive data acquisition and computational models of the right capacity to distill the trillions of biological facts from data. With LLMs and the accelerating pace of data acquisition, we are indeed a few years away from having accurate in silico predictive models of the primary biomolecular information highway, connecting our DNA, cellular biology, and health. We can reasonably expect that over the next 5–10 years a wealth of biomedical diagnostic, drug discovery, and health span companies and initiatives will bring these models to application in human health and medicine, with enormous impact. We will also likely witness the development of open foundation models that integrate across data spanning from genomes all the way to medical information. Such models will vastly accelerate research and innovation, and foster precision medicine.

I thank Eric Schadt and Bo Wang for various suggestions and edits to the document. I thank Anshul Kundaje, Bo Wang, and Kyle Farh for providing thoughts, comments, and figures. I thank Lukas Kuderna for creating the Primate Phylogeny figure for this manuscript. I am an employee of Seer, Inc.; however, all opinions expressed here are my own.

Avsec Ž et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods 2021.

Baek M et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021.

Baek M et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. bioRxiv doi: https://doi.org/10.1101/2023.05.24.542179, 2023.

Bubeck S et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712, 2023.

Cui et al. scGPT: Toward Building a Foundation Model for Single-Cell Multi-omics Using Generative AI. bioRxiv https://doi.org/10.1101/2023.04.30.538439, 2023.

Dalla-Torre H et al. The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics. bioRxiv https://doi.org/10.1101/2023.01.11.523679, 2023.

Devlin J et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018.

Fiziev P et al. Rare penetrant mutations confer severe risk of common diseases. Science 2023.

Gao et al. The landscape of tolerated genetic variation in humans and primates. Science 2023.

Jaganathan et al. Predicting splicing from primary sequence with deep learning. Cell 2019.

Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589, 2021.

Karollus et al. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biology 2023.

Kong et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature 2012.

Lee P, Goldberg C, Kohane I. The AI Revolution in Medicine: GPT-4 and Beyond. Pearson, 2023.

Lin Z et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023.

Lyayuga Lisanza S et al. Joint generation of protein sequence and structure with RoseTTAFold sequence space diffusion. bioRxiv https://doi.org/10.1101/2023.05.08.539766, 2023.

Sasse et al. How far are we from personalized gene expression prediction using sequence-to-expression deep neural networks? bioRxiv https://doi.org/10.1101/2023.03.16.532969, 2023.

Sundaram et al. Predicting the clinical impact of human mutation with deep neural networks. Nature Genetics 2018.

Varadi M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 2021.

Wang S et al. Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. PLoS Computational Biology 2017.

Wolfram S. What Is ChatGPT Doing… and Why Does It Work? Wolfram Media, Inc. 2023.

Zhou J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nature Genetics 2022.
