Dr. Serafim Batzoglou, Chief Data Officer at Seer – Interview Series

Serafim Batzoglou is Chief Data Officer at Seer. Prior to joining Seer, Serafim served as Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. Prior to Insitro, he served as VP of Applied and Computational Biology at Illumina, leading research and technology development of AI and molecular assays for making genomic data more interpretable in human health.

What initially attracted you to the field of genomics?

I became interested in the field of computational biology at the start of my PhD in computer science at MIT, when I took a class on the subject taught by Bonnie Berger, who became my PhD advisor, and David Gifford. The human genome project was picking up pace during my PhD. Eric Lander, who was heading the Genome Center at MIT, became my PhD co-advisor and involved me in the project. Motivated by the human genome project, I worked on whole-genome assembly and comparative genomics of human and mouse DNA.

I then moved to Stanford University as faculty in the Computer Science department, where I spent 15 years and was privileged to have advised about 30 incredibly talented PhD students and many postdoctoral researchers and undergraduates. My team's focus has been the application of algorithms, machine learning and software tool building to the analysis of large-scale genomic and biomolecular data. I left Stanford in 2016 to lead a research and technology development team at Illumina. Since then, I have enjoyed leading R&D teams in industry. I find that teamwork, the business aspect, and a more direct impact on society are characteristic of industry compared with academia. I have worked at innovative companies throughout my career: DNAnexus, which I co-founded in 2009, Illumina, insitro and now Seer. Computation and machine learning are essential across the technology chain in biotech, from technology development, to data acquisition, to biological data interpretation and translation to human health.

Over the past 20 years, sequencing the human genome has become vastly cheaper and faster. This has led to dramatic growth in the genome sequencing market and broader adoption in the life sciences industry. We are now on the cusp of having population genomic, multi-omic and phenotypic data of sufficient size to meaningfully revolutionize healthcare, including prevention, diagnosis, treatment and drug discovery. We will increasingly be able to identify the molecular underpinnings of disease for individuals through computational analysis of genomic data, and patients will have the chance to receive treatments that are personalized and targeted, especially in the areas of cancer and rare genetic disease. Beyond the obvious uses in medicine, machine learning coupled with genomic information allows us to gain insights into other areas of our lives, such as our genealogy and nutrition. The next several years will see the adoption of personalized, data-driven healthcare, first for select groups of individuals, such as rare disease patients, and increasingly for the broad public.

Prior to your current role you were Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. What were some of your key takeaways from that period about how machine learning can be used to speed up drug discovery?

The traditional drug discovery and development "trial-and-error" paradigm is plagued with inefficiencies and extremely lengthy timelines. For one drug to get to market, it can take upwards of $1 billion and over a decade. By incorporating machine learning into these efforts, we can dramatically reduce costs and timeframes in several steps along the way. One step is target identification, where a gene or set of genes that modulate a disease phenotype or revert a disease cellular state to a healthier state can be identified through large-scale genetic and chemical perturbations, and phenotypic readouts such as imaging and functional genomics. Another step is compound identification and optimization, where a small molecule or other modality can be designed by machine-learning-driven in silico prediction as well as in vitro screening, and furthermore desired properties of a drug such as solubility, permeability, specificity and non-toxicity can be optimized. The hardest as well as most important aspect is perhaps translation to humans. Here, selection of the right model—induced pluripotent stem cell-derived lines versus primary patient cell lines and tissue samples versus animal models—for the right disease poses an incredibly important set of tradeoffs that ultimately reflect on the ability of the resulting data plus machine learning to translate to patients.

Seer Bio is pioneering new ways to decode the secrets of the proteome to improve human health. For readers who are unfamiliar with this term, what is the proteome?

The proteome is the changing set of proteins produced or modified by an organism over time and in response to environment, nutrition and health state. Proteomics is the study of the proteome within a given cell type or tissue sample. The genome of a human or other organism is static: with the important exception of somatic mutations, the genome at birth is the genome one has for their entire life, copied exactly in each cell of the body. The proteome is dynamic and changes over time spans of years, days and even minutes. As such, proteomes are vastly closer to phenotype, and ultimately to health status, than genomes are, and consequently more informative for monitoring health and understanding disease.

At Seer, we’ve got developed a latest method to access the proteome that gives deeper insights into proteins and proteoforms in complex samples similar to plasma, which is a highly accessible sample that unfortunately to-date has posed a terrific challenge for conventional mass spectrometry proteomics.

What’s the Seer’s Proteograph™ platform and the way does it offer a latest view of the proteome?

Seer’s Proteograph platform leverages a library of proprietary engineered nanoparticles, powered by a straightforward, rapid, and automatic workflow, enabling deep and scalable interrogation of the proteome.

The Proteograph platform shines in interrogating plasma and other complex samples that exhibit large dynamic range—many orders of magnitude difference in the abundance of various proteins in the sample—where conventional mass spectrometry methods are unable to detect the low-abundance part of the proteome. Seer's nanoparticles are engineered with tunable physicochemical properties that gather proteins across the dynamic range in an unbiased manner. In typical plasma samples, our technology enables detection of 5x to 8x more proteins than when processing neat plasma without the Proteograph. As a result, from sample prep to instrumentation to data analysis, our Proteograph Product Suite helps scientists find proteome disease signatures that would otherwise be undetectable. We like to say that at Seer, we are opening up a new gateway to the proteome.

Moreover, we’re allowing scientists to simply perform large-scale proteogenomic studies. Proteogenomics is the combining of genomic data with proteomic data to discover and quantify protein variants, link genomic variants with protein abundance levels, and ultimately link the genome and the proteome to phenotype and disease, and begin disentangling the causal and downstream genetic pathways related to disease.

Can you discuss some of the machine learning technology that is currently used at Seer Bio?

Seer is leveraging machine learning at all steps from technology development to downstream data analysis. These steps include: (1) design of our proprietary nanoparticles, where machine learning helps us determine which physicochemical properties and combinations of nanoparticles will work with specific product lines and assays; (2) detection and quantification of peptides, proteins, variants and proteoforms from the readout data produced by the MS instruments; and (3) downstream proteomic and proteogenomic analyses in large-scale population cohorts.

Last year, we published a paper in Advanced Materials combining proteomics methods, nanoengineering and machine learning to improve our understanding of the mechanisms of protein corona formation. This paper uncovered nano-bio interactions and is informing Seer in the creation of improved future nanoparticles and products.

Beyond nanoparticle development, we have been developing novel algorithms to identify variant peptides and post-translational modifications (PTMs). We recently developed a method for detection of protein quantitative trait loci (pQTLs) that is robust to protein variants, which are a known confounder for affinity-based proteomics. We are extending this work to directly identify these peptides from the raw spectra using deep learning-based de novo sequencing methods, allowing search without inflating the size of spectral libraries.
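To make the pQTL idea concrete, here is a minimal sketch of a single-variant association test, assuming simulated genotype dosages and protein abundances rather than Seer's actual pipeline; in a variant-robust workflow, peptides overlapping the coding variant would be excluded before summarizing to the protein level.

```python
# Minimal, illustrative pQTL association test on simulated data (not Seer's pipeline).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_samples = 500

# Simulated allele dosage (0/1/2) for one hypothetical SNP, and a log protein
# abundance that depends weakly on genotype plus noise.
dosage = rng.integers(0, 3, size=n_samples)
log_abundance = 0.3 * dosage + rng.normal(0.0, 1.0, size=n_samples)

# In a variant-robust pipeline, peptides overlapping the coding variant would be
# dropped before peptides are rolled up into this protein-level value.
slope, intercept, r_value, p_value, stderr = stats.linregress(dosage, log_abundance)
print(f"beta = {slope:.3f}, p = {p_value:.2e}")
```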

Our team is also developing methods to enable scientists without deep expertise in machine learning to optimally tune and utilize machine learning models in their discovery work. This is accomplished via a Seer ML framework based on the AutoML tool, which allows efficient hyperparameter tuning via Bayesian optimization.
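Seer's ML framework itself is not public, but the general pattern can be sketched with the open-source Optuna library, whose default TPE sampler is a form of Bayesian optimization; the dataset and model below are placeholders.

```python
# Hypothetical hyperparameter-tuning sketch using Optuna (Bayesian-style TPE sampler).
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

def objective(trial):
    # The sampler proposes promising points in this search space based on past trials.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```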

Finally, we’re developing methods to cut back the batch effect and increase the quantitative accuracy of the mass spec readout by modeling the measured quantitative values to maximise expected metrics similar to correlation of intensity values across peptides inside a protein group.

Hallucinations are a common issue with LLMs. What are some of the solutions to prevent or mitigate this?

LLMs are generative methods that are given a large corpus and are trained to generate similar text. They capture the underlying statistical properties of the text they are trained on, from simple local properties such as how often certain combinations of words (or tokens) are found together, to higher-level properties that emulate understanding of context and meaning.
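A toy way to see what "capturing how often combinations of tokens occur together" means is a bigram model: count adjacent word pairs in a corpus and sample the next word in proportion to those counts. Real LLMs learn far richer structure, but the statistical spirit is similar; the corpus below is purely illustrative.

```python
# Toy bigram language model: next-word generation from co-occurrence counts.
import random
from collections import Counter, defaultdict

corpus = ("the proteome is dynamic and the genome is static and "
          "the proteome reflects health state").split()

bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def sample_next(word, rng=random.Random(0)):
    counts = bigrams.get(word)
    if not counts:
        return None  # no observed successor for this word
    words, weights = zip(*counts.items())
    return rng.choices(words, weights=weights, k=1)[0]

word, text = "the", ["the"]
for _ in range(8):
    word = sample_next(word)
    if word is None:
        break
    text.append(word)
print(" ".join(text))
```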

However, LLMs are not primarily trained to be correct. Reinforcement learning with human feedback (RLHF) and other techniques help train them for desirable properties, including correctness, but are not fully successful. Given a prompt, LLMs will generate text that most closely resembles the statistical properties of the training data. Often, this text will be correct. For example, if asked "when was Alexander the Great born," the correct answer is 356 BC (or BCE), and an LLM is likely to give that answer because within the training data Alexander the Great's birth often appears with this value. However, when asked "when was Empress Reginella born," a fictional character not present in the training corpus, the LLM is likely to hallucinate and create a story of her birth. Similarly, when asked a question that the LLM may not be able to retrieve a correct answer for (either because the right answer does not exist, or for other statistical reasons), it is likely to hallucinate and answer as if it knows. This creates hallucinations that are an obvious problem for serious applications, such as "how can such and such cancer be treated."

There are no perfect solutions yet for hallucinations. They are endemic to the design of LLMs. One partial solution is proper prompting, such as asking the LLM to "think carefully, step by step," and so on. This increases the likelihood that the LLM will not concoct stories. A more sophisticated approach that is being developed is the use of knowledge graphs. Knowledge graphs provide structured data: entities in a knowledge graph are connected to other entities in a predefined, logical manner. Constructing a knowledge graph for a given domain is of course a challenging task but doable with a combination of automated and statistical methods and curation. With a built-in knowledge graph, LLMs can cross-check the statements they generate against the structured set of known facts, and can be constrained to not generate a statement that contradicts or is not supported by the knowledge graph.
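A minimal sketch of that cross-checking idea, with hypothetical triples and claims rather than any production system, is to store known facts as subject-predicate-object triples and flag generated statements that contradict or are not supported by the graph:

```python
# Toy knowledge-graph cross-check: flag generated claims not supported by known triples.
knowledge_graph = {
    ("Alexander the Great", "born_in", "356 BC"),
    ("BRCA1", "associated_with", "breast cancer"),
}

def check_claim(subject, predicate, obj):
    """Return 'supported', 'contradicted', or 'unsupported' for a generated claim."""
    if (subject, predicate, obj) in knowledge_graph:
        return "supported"
    # Contradiction: the graph asserts a different object for the same subject/predicate.
    if any(s == subject and p == predicate and o != obj for s, p, o in knowledge_graph):
        return "contradicted"
    return "unsupported"

print(check_claim("Alexander the Great", "born_in", "356 BC"))  # supported
print(check_claim("Alexander the Great", "born_in", "300 BC"))  # contradicted
print(check_claim("Empress Reginella", "born_in", "1200 AD"))   # unsupported -> flag or decline
```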

Because of the fundamental issue of hallucinations, and arguably because of their lack of sufficient reasoning and judgment abilities, LLMs are today powerful for retrieving, connecting and distilling information, but cannot replace human experts in serious applications such as medical diagnosis or legal advice. Still, they can tremendously enhance the efficiency and capability of human experts in these domains.

Can you share your vision for a future where biology is steered by data rather than hypotheses?

The traditional hypothesis-driven approach, which involves researchers finding patterns, developing hypotheses, performing experiments or studies to test them, and then refining theories based on the data, is being supplanted by a new paradigm based on data-driven modeling.

In this emerging paradigm, researchers start with hypothesis-free, large-scale data generation. Then, they train a machine learning model such as an LLM with the objective of accurate reconstruction of occluded data, or strong regression or classification performance across a number of downstream tasks. Once the machine learning model can accurately predict the data, and achieves fidelity comparable to the similarity between experimental replicates, researchers can interrogate the model to extract insight about the biological system and discern the underlying biological principles.
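The "reconstruct occluded data" objective is essentially masked modeling. Below is a minimal PyTorch sketch on synthetic data with a deliberately tiny, hypothetical network: random entries of expression-like vectors are masked and the model is trained to fill them in; real foundation models differ greatly in architecture and scale.

```python
# Minimal masked-reconstruction sketch on synthetic "expression-like" vectors (PyTorch).
import torch
import torch.nn as nn

torch.manual_seed(0)
n_samples, n_features = 512, 100
data = torch.randn(n_samples, n_features)  # stand-in for normalized omics profiles

model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Randomly occlude 20% of entries and train the model to reconstruct them.
    mask = torch.rand(n_samples, n_features) < 0.2
    corrupted = data.masked_fill(mask, 0.0)
    recon = model(corrupted)
    loss = ((recon - data)[mask] ** 2).mean()  # loss only on occluded positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final masked-reconstruction loss: {loss.item():.4f}")
```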

LLMs are proving to be especially good at modeling biomolecular data, and are poised to fuel a shift from hypothesis-driven to data-driven biological discovery. This shift will become increasingly pronounced over the next 10 years and will allow accurate modeling of biomolecular systems at a granularity that goes well beyond human capability.

What’s the potential impact for disease diagnosis and drug discovery?

I believe LLMs and generative AI will lead to significant changes in the life sciences industry. One area that will benefit greatly from LLMs is clinical diagnosis, specifically for rare, difficult-to-diagnose diseases and cancer subtypes. There are tremendous amounts of comprehensive patient information that we can tap into – from genomic profiles, treatment responses, medical records and family history – to drive accurate and timely diagnosis. If we can find a way to compile all this data such that it is easily accessible, and not siloed by individual health organizations, we can dramatically improve diagnostic precision. This is not to imply that the machine learning models, including LLMs, will be able to operate autonomously in diagnosis. Because of their technical limitations, in the foreseeable future they will not be autonomous, but instead they will augment human experts. They will be powerful tools to help the doctor provide superbly informed assessments and diagnoses in a fraction of the time needed to date, and to properly document and communicate their diagnoses to the patient as well as to the entire network of health providers connected through the machine learning system.

The industry is already leveraging machine learning for drug discovery and development, touting its ability to reduce costs and timelines compared with the traditional paradigm. LLMs further add to the available toolbox, and are providing excellent frameworks for modeling large-scale biomolecular data including genomes, proteomes, functional genomic and epigenomic data, single-cell data, and more. In the foreseeable future, foundation LLMs will undoubtedly connect across all these data modalities and across large cohorts of individuals whose genomic, proteomic and health information is collected. Such LLMs will aid in the generation of promising drug targets, identify likely pockets of activity of proteins associated with biological function and disease, or suggest pathways and more complex cellular functions that can be modulated in a specific way with small molecules or other drug modalities. We can also tap into LLMs to identify drug responders and non-responders based on genetic susceptibility, or to repurpose drugs for other disease indications. Many of the current innovative AI-based drug discovery companies are undoubtedly already starting to think and develop in this direction, and we should expect to see the formation of additional companies as well as public efforts aimed at the deployment of LLMs in human health and drug discovery.
