3 Questions: On biology and medicine’s “data revolution”


Q: The Eric and Wendy Schmidt Center has four distinct areas of focus structured around four natural levels of biological organization: proteins, cells, tissues, and organisms. What, within the current landscape of machine learning, makes now the best time to work on these specific problem classes?

A: Biology and medicine are currently undergoing a “data revolution.” The availability of large-scale, diverse datasets — ranging from genomics and multi-omics to high-resolution imaging and electronic health records — makes this an opportune time. Inexpensive and accurate DNA sequencing is a reality, advanced molecular imaging has become routine, and single-cell genomics is enabling the profiling of tens of millions of cells. These innovations — and the large datasets they produce — have brought us to the brink of a new era in biology, one in which we will be able to move beyond characterizing the units of life (such as all proteins, genes, and cell types) to understanding the “programs of life,” such as the logic of gene circuits and cell-cell communication that underlies tissue patterning, and the molecular mechanisms that underlie the genotype-phenotype map.

At the same time, machine learning has seen remarkable progress over the past decade, with models like BERT, GPT-3, and ChatGPT demonstrating advanced capabilities in text understanding and generation, while vision transformers and multimodal models like CLIP have achieved human-level performance in image-related tasks. These breakthroughs provide powerful architectural blueprints and training strategies that can be adapted to biological data. For instance, transformers can model genomic sequences much like language, and vision models can analyze medical and microscopy images.
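As a toy illustration of the "genomic sequences as language" idea, a DNA string can be tokenized into overlapping k-mers, just as text is split into subword tokens before being fed to a transformer. The choice of k = 3 and the vocabulary below are illustrative, not those of any particular genomics model:

```python
# Tokenize a DNA sequence into overlapping k-mers, the genomic
# analogue of splitting text into subword tokens for a transformer.
from itertools import product

def build_kmer_vocab(k: int) -> dict:
    """Map every possible k-mer over {A, C, G, T} to an integer id."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

def tokenize(seq: str, k: int, vocab: dict) -> list:
    """Slide a window of width k over the sequence, stride 1."""
    return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]

vocab = build_kmer_vocab(3)          # 4**3 = 64 possible 3-mers
ids = tokenize("ACGTAC", 3, vocab)   # 4 overlapping 3-mers
print(len(vocab), ids)               # 64 [6, 27, 44, 49]
```

The resulting integer ids play the same role as token ids in a language model; a transformer then learns contextual embeddings over them.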

Importantly, biology is poised to be not only a beneficiary of machine learning, but also a major source of inspiration for new ML research. Much as agriculture and breeding spurred modern statistics, biology has the potential to inspire new, and perhaps even more profound, avenues of ML research. Unlike fields such as recommender systems and online advertising, where there are no natural laws to discover and predictive accuracy is the ultimate measure of value, in biology phenomena are physically interpretable, and causal mechanisms are the ultimate goal. Moreover, biology boasts genetic and chemical tools that enable perturbational screens on a scale unparalleled in other fields. These combined features make biology uniquely suited both to benefit greatly from ML and to serve as a profound wellspring of inspiration for it.

Q: Taking a somewhat different tack, what problems in biology are still really resistant to our current tool set? Are there areas, perhaps specific challenges in disease or in wellness, that you feel are ripe for problem-solving?

A: Machine learning has demonstrated remarkable success in predictive tasks across domains such as image classification, natural language processing, and clinical risk modeling. In the biological sciences, however, predictive accuracy is often insufficient. The fundamental questions in these fields are inherently causal: How does a perturbation to a particular gene or pathway affect downstream cellular processes? What is the mechanism by which an intervention leads to a phenotypic change? Traditional machine learning models, which are primarily optimized for capturing statistical associations in observational data, often fail to answer such interventional queries. There is therefore a strong need for biology and medicine to also inspire new foundational developments in machine learning.

The field is now equipped with high-throughput perturbation technologies — such as pooled CRISPR screens, single-cell transcriptomics, and spatial profiling — that generate rich datasets under systematic interventions. These data modalities naturally call for the development of models that go beyond pattern recognition to support causal inference, active experimental design, and representation learning in settings with complex, structured latent variables. From a mathematical perspective, this requires tackling core questions of identifiability, sample efficiency, and the integration of combinatorial, geometric, and probabilistic tools. I believe that addressing these challenges will not only unlock new insights into the mechanisms of cellular systems, but also push the theoretical boundaries of machine learning.
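A minimal simulation (all numbers invented) of why interventional data matter: a hidden confounder makes two variables correlate in observational data, while randomizing one of them, as a perturbation screen effectively does, reveals that there is no causal effect at all.

```python
# Toy structural causal model: a hidden confounder Z drives both X and Y,
# while X has NO causal effect on Y.  Regressing Y on X in observational
# data finds a spurious association; the same regression on interventional
# data (X randomized, breaking the Z -> X edge) recovers the true effect ~0.
import random

random.seed(0)

def ols_slope(xs, ys):
    """Ordinary-least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

n = 20000
z = [random.gauss(0, 1) for _ in range(n)]

# Observational regime: X = Z + noise, Y = 2Z + noise (no X -> Y edge).
x_obs = [zi + random.gauss(0, 1) for zi in z]
y_obs = [2 * zi + random.gauss(0, 1) for zi in z]

# Interventional regime: do(X) -- X is randomized, independent of Z.
x_int = [random.gauss(0, 1) for _ in range(n)]
y_int = [2 * zi + random.gauss(0, 1) for zi in z]

print(round(ols_slope(x_obs, y_obs), 2))  # close to 1.0: spurious
print(round(ols_slope(x_int, y_int), 2))  # close to 0.0: true causal effect
```

The same logic is why models trained purely on observational associations can fail on interventional queries.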

With respect to foundation models, the consensus in the field is that we are still far from creating a holistic foundation model for biology across scales, similar to what ChatGPT represents in the language domain: a kind of digital organism capable of simulating all biological phenomena. While new foundation models emerge almost weekly, these models have thus far been specialized for a particular scale and question, and focus on one or a few modalities.

Significant progress has been made in predicting protein structures from their sequences. This success has highlighted the importance of iterative machine learning challenges, such as CASP (the Critical Assessment of Structure Prediction), which have been instrumental in benchmarking state-of-the-art algorithms for protein structure prediction and driving their improvement.

The Schmidt Center is organizing challenges to increase awareness in the ML field and to drive progress on methods for the causal prediction problems that are so critical to the biomedical sciences. With the increasing availability of single-gene perturbation data at the single-cell level, I believe that predicting the effect of single or combinatorial perturbations, and determining which perturbations could drive a desired phenotype, are solvable problems. With our Cell Perturbation Prediction Challenge (CPPC), we aim to provide the means to objectively test and benchmark algorithms for predicting the effect of new perturbations.
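CPPC's actual evaluation protocol is not described here; as a generic sketch of how a benchmark of this kind can score submissions, one can compare each method's predicted post-perturbation expression profile against the measured one on held-out perturbations (the gene panel, numbers, and metric below are invented for illustration):

```python
# Generic sketch of scoring a perturbation-prediction method: per-
# perturbation root-mean-square error between predicted and measured
# mean expression profiles, averaged into an overall score.
import math

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# Measured mean expression of 3 genes after each held-out perturbation.
measured = {"KO_geneA": [1.0, 0.2, 0.5], "KO_geneB": [0.3, 1.1, 0.4]}
# One method's predictions for the same held-out perturbations.
predicted = {"KO_geneA": [0.9, 0.3, 0.5], "KO_geneB": [0.3, 0.8, 0.6]}

per_perturbation = {p: rmse(predicted[p], measured[p]) for p in measured}
overall = sum(per_perturbation.values()) / len(per_perturbation)
print({p: round(s, 3) for p, s in per_perturbation.items()}, round(overall, 3))
```

Holding out entire perturbations (rather than cells) is what forces methods to generalize to interventions they have never observed.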

Another area where the field has made remarkable strides is disease diagnostics and patient triage. Machine learning algorithms can integrate different sources of patient information (data modalities), generate missing modalities, identify patterns that may be difficult for us to detect, and help stratify patients based on their disease risk. While we must remain cautious about potential biases in model predictions, the risk of models learning shortcuts instead of true correlations, and the danger of automation bias in clinical decision-making, I believe this is an area where machine learning is already having a significant impact.

Q: Let’s talk about some of the headlines coming out of the Schmidt Center recently. What current research do you think people should be particularly excited about, and why?

A: In collaboration with Dr. Fei Chen at the Broad Institute, we recently developed a method, called PUPS, for predicting the subcellular localization of unseen proteins. Many existing methods can only make predictions for the specific proteins and cell data on which they were trained. PUPS, however, combines a protein language model with an image in-painting model to exploit both protein sequences and cellular images. We show that the protein sequence input enables generalization to unseen proteins, while the cellular image input captures single-cell variability, enabling cell-type-specific predictions. The model learns how relevant each amino acid residue is for the predicted subcellular localization, and it can predict changes in localization caused by mutations in the protein sequence. Since a protein’s function is closely tied to its subcellular localization, our predictions could provide insights into potential mechanisms of disease. In the future, we aim to extend this method to predict the localization of multiple proteins in a cell and possibly understand protein-protein interactions.
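The key distinction, why a sequence-based encoder generalizes to unseen proteins while a per-protein model cannot, can be sketched with toy amino-acid-composition features (the real PUPS uses a learned protein language model, not this featurizer):

```python
# A lookup-table model only covers proteins seen in training; a featurizer
# defined over the sequence itself returns an embedding for ANY protein.
# (Toy composition features; PUPS itself uses a protein language model.)
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_embedding(seq: str) -> list:
    """Fraction of each standard amino acid: defined for any sequence."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]

trained_lookup = {"PROT1": [0.1] * 20}   # toy per-protein table

unseen_seq = "MKTAYIAKQR"                # hypothetical unseen protein
print("PROT_NEW" in trained_lookup)      # False: the lookup cannot predict
emb = composition_embedding(unseen_seq)  # the featurizer still works
print(len(emb), round(sum(emb), 2))      # 20-dimensional, fractions sum to 1.0
```

Any model downstream of such a sequence encoder inherits this ability to score proteins it has never been trained on.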

Together with Professor G.V. Shivashankar, a long-time collaborator at ETH Zürich, we previously showed that simple images of cells stained with fluorescent DNA-intercalating dyes to label the chromatin can, when combined with machine learning algorithms, yield a great deal of information about the state and fate of a cell in health and disease. Recently, we built on this observation and demonstrated the deep link between chromatin organization and gene regulation by developing Image2Reg, a method that enables the prediction of unseen genetic or chemical perturbations from chromatin images. Image2Reg uses convolutional neural networks to learn an informative representation of the chromatin images of perturbed cells. It also employs a graph convolutional network to create a gene embedding that captures the regulatory effects of genes based on protein-protein interaction data, integrated with cell-type-specific transcriptomic data. Finally, it learns a map between the resulting physical and biochemical representations of cells, allowing us to predict the perturbed gene modules from chromatin images.
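A schematic of that final matching step, with made-up three-dimensional embeddings and arbitrary gene names (the real method learns high-dimensional embeddings with a CNN and a graph neural network): map a perturbed cell's image embedding into the gene-embedding space and rank candidate perturbed genes by similarity.

```python
# Rank candidate perturbed genes by cosine similarity between a chromatin-
# image embedding (mapped into gene space) and each gene's embedding.
# All vectors below are invented toy values.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

gene_embeddings = {            # toy regulatory embeddings
    "TP53":  [0.9, 0.1, 0.0],
    "MYC":   [0.0, 1.0, 0.2],
    "BRCA1": [0.5, 0.5, 0.7],
}
image_embedding = [0.8, 0.2, 0.1]   # mapped chromatin-image representation

ranked = sorted(gene_embeddings,
                key=lambda g: cosine(image_embedding, gene_embeddings[g]),
                reverse=True)
print(ranked[0])   # best-matching candidate perturbed gene
```

Because scoring is done against embeddings rather than a fixed class list, perturbations absent from training can still be ranked, which is what "unseen" means here.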

Moreover, we recently finalized the development of MORPH, a method for predicting the outcomes of unseen combinatorial gene perturbations and identifying the kinds of interactions occurring between the perturbed genes. MORPH can guide the design of the most informative perturbations for lab-in-the-loop experiments. In addition, its attention-based framework provably enables the method to identify causal relations among the genes, providing insights into the underlying gene regulatory programs. Finally, thanks to its modular structure, we can apply MORPH to perturbation data measured in various modalities, including not only transcriptomics but also imaging. We are very excited about the potential of this method to enable efficient exploration of the perturbation space and advance our understanding of cellular programs, bridging causal theory and important applications, with implications for both basic research and therapeutics.
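A classical baseline for characterizing the interaction between two perturbed genes (this is standard genetic-interaction scoring, not MORPH's attention-based model, and sign conventions vary across the literature) is the deviation of the measured double-perturbation effect from the sum of the single-perturbation effects:

```python
# Deviation-from-additivity score for a pair of gene perturbations:
# ~0 means the effects combine additively; a nonzero score indicates
# a genetic interaction.  Effects are, e.g., log-fold changes of a readout.
def interaction_score(effect_a, effect_b, effect_ab):
    """Measured double-perturbation effect minus the additive expectation."""
    return effect_ab - (effect_a + effect_b)

def classify(score, tol=0.1):
    if abs(score) <= tol:
        return "additive"
    return "positive (alleviating)" if score > 0 else "negative (aggravating)"

# Invented toy effects for two gene pairs.
print(classify(interaction_score(-0.5, -0.4, -0.9)))   # additive
print(classify(interaction_score(-0.5, -0.4, -1.6)))   # negative (aggravating)
```

A learned model's advantage over this baseline is that it can predict such scores for combinations that were never measured, which is what makes exploring the combinatorial perturbation space feasible.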
