Every cell in your body contains the same genetic sequence, yet each cell expresses only a subset of those genes. These cell-specific gene expression patterns, which ensure that a brain cell is different from a skin cell, are partly determined by the three-dimensional structure of the genetic material, which controls the accessibility of each gene.
MIT chemists have now come up with a new way to determine those 3D genome structures, using generative artificial intelligence. Their technique can predict thousands of structures in just minutes, making it much faster than existing experimental methods for analyzing the structures.
Using this method, researchers could more easily study how the 3D organization of the genome affects individual cells’ gene expression patterns and functions.
“Our goal was to try to predict the three-dimensional genome structure from the underlying DNA sequence,” says Bin Zhang, an associate professor of chemistry and the senior author of the study. “Now that we can do that, which puts this technique on par with the cutting-edge experimental techniques, it can really open up a lot of interesting opportunities.”
MIT graduate students Greg Schuette and Zhuohan Lao are the lead authors of the paper, which appears today in .
From sequence to structure
Inside the cell nucleus, DNA and proteins form a complex called chromatin, which has several levels of organization, allowing cells to cram 2 meters of DNA into a nucleus that is only one-hundredth of a millimeter in diameter. Long strands of DNA wind around proteins called histones, giving rise to a structure somewhat like beads on a string.
Chemical tags known as epigenetic modifications can be attached to DNA at specific locations, and these tags, which vary by cell type, affect the folding of the chromatin and the accessibility of nearby genes. These differences in chromatin conformation help determine which genes are expressed in different cell types, or at different times within a given cell.
Over the past 20 years, scientists have developed experimental techniques for determining chromatin structures. One widely used technique, known as Hi-C, works by linking together neighboring DNA strands in the cell’s nucleus. Researchers can then determine which segments are located near each other by shredding the DNA into many tiny pieces and sequencing it.
This method can be used on large populations of cells to calculate an average structure for a section of chromatin, or on single cells to determine structures within that specific cell. However, Hi-C and similar techniques are labor-intensive, and it can take about a week to generate data from one cell.
To overcome those limitations, Zhang and his students developed a model that takes advantage of recent advances in generative AI to create a fast, accurate way to predict chromatin structures in single cells. The AI model that they designed can quickly analyze DNA sequences and predict the chromatin structures that those sequences might produce in a cell.
“Deep learning is really good at pattern recognition,” Zhang says. “It allows us to analyze very long DNA segments, thousands of base pairs, and figure out what is the important information encoded in those DNA base pairs.”
ChromoGen, the model that the researchers created, has two components. The first component, a deep learning model taught to “read” the genome, analyzes the information encoded in the underlying DNA sequence and chromatin accessibility data, the latter of which is widely available and cell type-specific.
The second component is a generative AI model that predicts physically accurate chromatin conformations, having been trained on more than 11 million chromatin conformations. These data were generated from experiments using Dip-C (a variant of Hi-C) on 16 cells from a line of human B lymphocytes.
When integrated, the first component informs the generative model how the cell type-specific environment influences the formation of different chromatin structures, and this scheme effectively captures sequence-structure relationships. For each sequence, the researchers use their model to generate many possible structures. That’s because DNA is a very disordered molecule, so a single DNA sequence can give rise to many different possible conformations.
“A major complicating factor of predicting the structure of the genome is that there isn’t a single solution that we’re aiming for. There’s a distribution of structures, no matter what portion of the genome you’re looking at. Predicting that very complicated, high-dimensional statistical distribution is something that is incredibly difficult to do,” Schuette says.
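The relationship between a distribution of single-cell structures and a population-averaged picture can be illustrated with a toy sketch. The Python below is not ChromoGen or any published pipeline. It assumes each structure is a list of 3D bead coordinates and that two beads count as "in contact" when they lie within an arbitrary cutoff distance. The function names, the coordinates, and the cutoff are all hypothetical.

```python
import math


def contact_map(coords, cutoff):
    """Binary matrix for one structure: 1 where two beads are within `cutoff`."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]


def contact_frequency(structures, cutoff):
    """Fraction of structures in which each pair of beads is in contact."""
    if not structures:
        raise ValueError("need at least one structure")
    n = len(structures[0])
    totals = [[0] * n for _ in range(n)]
    for coords in structures:
        if len(coords) != n:
            raise ValueError("all structures must have the same number of beads")
        single = contact_map(coords, cutoff)
        for i in range(n):
            for j in range(n):
                totals[i][j] += single[i][j]
    return [[count / len(structures) for count in row] for row in totals]


# Two made-up conformations of the same three-bead segment (arbitrary units).
stretched = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
folded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

freq = contact_frequency([stretched, folded], cutoff=1.5)
print(freq[0][2])  # 0.5: beads 0 and 2 touch in only one of the two structures
```

The averaged map records that the two end beads touch half the time, but it cannot say which individual conformations produced that number. That lost detail is the reason for generating many structures per sequence.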
Rapid evaluation
Once trained, the model can generate predictions on a much faster timescale than Hi-C or other experimental techniques.
“Whereas you might spend six months running experiments to get a few dozen structures in a given cell type, you can generate a thousand structures in a particular region with our model in 20 minutes on just one GPU,” Schuette says.
After training their model, the researchers used it to generate structure predictions for more than 2,000 DNA sequences, then compared them to the experimentally determined structures for those sequences. They found that the structures generated by the model were the same or very similar to those seen in the experimental data.
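The article does not say which similarity measure the researchers used. As a hedged illustration only, one generic way to compare two 3D structures is to compare their pairwise distance maps, which do not change when a structure is shifted or rotated. The function names and coordinates below are hypothetical and are not taken from the study.

```python
import math


def distance_map(coords):
    """Matrix of pairwise distances between the beads of one structure."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def mean_distance_difference(coords_a, coords_b):
    """Average absolute difference between two distance maps, over unique pairs."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of beads")
    n = len(coords_a)
    if n < 2:
        raise ValueError("need at least two beads to compare distances")
    map_a, map_b = distance_map(coords_a), distance_map(coords_b)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(abs(map_a[i][j] - map_b[i][j]) for i, j in pairs) / len(pairs)


# Made-up three-bead structures (arbitrary units).
linear = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
shifted = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (7.0, 5.0, 5.0)]  # same shape, moved
bent = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]     # different shape

print(mean_distance_difference(linear, shifted))  # 0.0: identical internal geometry
print(mean_distance_difference(linear, bent) > 0)  # True: the shapes differ
```

Because the model produces an ensemble rather than one answer, a measure like this would in practice be applied across many generated and experimental structures, not to a single pair.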
“We typically look at hundreds or thousands of conformations for each sequence, and that gives you a reasonable representation of the diversity of the structures that a particular region can have,” Zhang says. “If you repeat your experiment multiple times, in different cells, you will very likely end up with a very different conformation. That’s what our model is trying to predict.”
The researchers also found that the model could make accurate predictions for data from cell types other than the one it was trained on. This suggests that the model could be useful for analyzing how chromatin structures differ between cell types, and how those differences affect their function. The model could also be used to explore different chromatin states that can exist within a single cell, and how those changes affect gene expression.
Another possible application would be to explore how mutations in a particular DNA sequence change the chromatin conformation, which could shed light on how such mutations may cause disease.
“There are a lot of interesting questions that I think we can address with this type of model,” Zhang says.
The researchers have made all of their data and the model available to others who wish to use it.
The research was funded by the National Institutes of Health.