Using machine learning, MIT chemical engineers have created a computational model that can predict how well any given molecule will dissolve in an organic solvent, a key step in the synthesis of nearly any pharmaceutical. This kind of prediction could make it much easier to develop new ways to produce drugs and other useful molecules.
The new model, which predicts how much of a solute will dissolve in a particular solvent, should help chemists choose the best solvent for any given reaction in their synthesis, the researchers say. Common organic solvents include ethanol and acetone, and there are hundreds of others that can also be used in chemical reactions.
“Predicting solubility really is a rate-limiting step in synthetic planning and manufacturing of chemicals, especially drugs, so there’s been a longstanding interest in being able to make better predictions of solubility,” says Lucas Attia, an MIT graduate student and one of the lead authors of the new study.
The researchers have made their model freely available, and many companies and labs have already started using it. The model could be particularly useful for identifying solvents that are less hazardous than some of the most commonly used industrial solvents, the researchers say.
“There are some solvents that are known to dissolve most things. They’re really useful, but they’re damaging to the environment, and they’re damaging to people, so many companies require that you have to minimize the amount of those solvents that you use,” says Jackson Burns, an MIT graduate student who is also a lead author of the paper. “Our model is extremely useful in being able to identify the next-best solvent, which is hopefully much less damaging to the environment.”
William Green, the Hoyt Hottel Professor of Chemical Engineering and director of the MIT Energy Initiative, is the senior author of the study, which appears today. Patrick Doyle, the Robert T. Haslam Professor of Chemical Engineering, is also an author of the paper.
Solving solubility
The new model grew out of a project that Attia and Burns worked on together in an MIT course on applying machine learning to chemical engineering problems. Traditionally, chemists have predicted solubility with a tool known as the Abraham Solvation Model, which can be used to estimate a molecule’s overall solubility by adding up the contributions of chemical structures within the molecule. While these predictions are useful, their accuracy is limited.
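As a rough illustration of the additive form such group-contribution approaches take, the sketch below sums solute descriptors weighted by solvent-specific coefficients, following the general linear free-energy shape of the Abraham model. All numerical values here are made-up placeholders for illustration, not fitted Abraham parameters.

```python
def abraham_log_s(coeffs, descriptors):
    """Additive estimate: c + e*E + s*S + a*A + b*B + v*V."""
    c, e, s, a, b, v = coeffs          # solvent-specific coefficients
    E, S, A, B, V = descriptors        # solute descriptors
    return c + e * E + s * S + a * A + b * B + v * V


# Hypothetical solvent coefficients and solute descriptors (illustrative only).
solvent_coeffs = (0.3, -0.1, 0.5, 1.2, 0.8, -0.9)
solute_descriptors = (0.8, 1.0, 0.5, 0.4, 1.3)

print(round(abraham_log_s(solvent_coeffs, solute_descriptors), 3))  # → 0.47
```

Because the estimate is a fixed linear sum over precomputed group contributions, it cannot capture interactions the fitted coefficients miss, which is one reason its accuracy is limited.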
In the past few years, researchers have begun using machine learning to try to make more accurate solubility predictions. Before Burns and Attia began working on their new model, the state-of-the-art model for predicting solubility was one developed in Green’s lab in 2022.
That model, known as SolProp, works by predicting a set of related properties and combining them, using thermodynamics, to ultimately predict the solubility. However, the model has difficulty predicting solubility for solutes that it hasn’t seen before.
“For drug and chemical discovery pipelines where you’re developing a new molecule, you want to be able to predict ahead of time what its solubility looks like,” Attia says.
Part of the reason that existing solubility models haven’t worked well is that there wasn’t a comprehensive dataset to train them on. However, in 2023 a new dataset called BigSolDB was released, which compiled data from nearly 800 published papers, including solubility measurements for about 800 molecules dissolved in more than 100 organic solvents that are commonly used in synthetic chemistry.
Attia and Burns decided to try training two different types of models on this data. Both of these models represent the chemical structures of molecules using numerical representations known as embeddings, which incorporate information such as the number of atoms in a molecule and which atoms are bound to which other atoms. Models can then use these representations to predict a variety of chemical properties.
One of the models used in this study, known as FastProp and developed by Burns and others in Green’s lab, incorporates “static embeddings.” This means the model already knows the embedding for each molecule before it starts doing any kind of analysis.
The other model, ChemProp, learns an embedding for each molecule during training, at the same time that it learns to associate the features of the embedding with a trait such as solubility. This model, developed across multiple MIT labs, has already been used for tasks such as antibiotic discovery, lipid nanoparticle design, and predicting chemical reaction rates.
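The contrast between the two approaches can be shown with a deliberately tiny NumPy sketch: in the static route, only the regression weights are trained against fixed, precomputed descriptors; in the learned route, the embedding itself is a trainable parameter updated alongside the weights. This toy uses a free embedding row per molecule as a stand-in for ChemProp's graph network, and all data values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four hypothetical "molecules": fixed 3-dim descriptors and target solubilities.
X_static = rng.normal(size=(4, 3))     # precomputed, never updated (FastProp-style)
y = np.array([-1.0, 0.5, 2.0, -0.3])   # illustrative log-solubility targets

# Static-embedding route: only the weight vector w is trained.
w = np.zeros(3)
for _ in range(500):
    grad_w = 2 * X_static.T @ (X_static @ w - y) / len(y)
    w -= 0.05 * grad_w
static_err = float(np.mean((X_static @ w - y) ** 2))

# Learned-embedding route: the embedding E and the weights v are trained jointly.
E = rng.normal(scale=0.1, size=(4, 3))  # one trainable row per molecule
v = np.zeros(3)
for _ in range(500):
    resid = E @ v - y
    grad_v = 2 * E.T @ resid / len(y)
    grad_E = 2 * np.outer(resid, v) / len(y)
    v -= 0.05 * grad_v
    E -= 0.05 * grad_E
learned_err = float(np.mean((E @ v - y) ** 2))

print(static_err, learned_err)  # both drop below the initial error of mean(y**2)
```

The point of the sketch is only where the gradient flows: in the first loop the representation is frozen, while in the second the representation changes to suit the prediction task, which is what "learning an embedding during training" means.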
The researchers trained both types of models on over 40,000 data points from BigSolDB, including information on the effects of temperature, which plays a significant role in solubility. Then, they tested the models on about 1,000 solutes that had been withheld from the training data. They found that the models’ predictions were two to three times more accurate than those of SolProp, the previous best model, and the new models were especially accurate at predicting variations in solubility due to temperature.
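A key detail of this evaluation is that entire solutes are withheld, not just random rows, so the models are scored on molecules they have never seen during training. A minimal sketch of that kind of group-wise split, using stand-in data rather than BigSolDB itself:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in dataset: 1,000 measurements spread over 100 hypothetical solutes.
n_rows = 1000
solute_ids = rng.integers(0, 100, size=n_rows)

# Withhold whole solutes (here ~10% of them), not individual measurements.
unique = np.unique(solute_ids)
rng.shuffle(unique)
held_out = set(unique[:10].tolist())

test_mask = np.isin(solute_ids, list(held_out))
train_rows = np.flatnonzero(~test_mask)
test_rows = np.flatnonzero(test_mask)

# No solute appears on both sides of the split.
assert not (set(solute_ids[train_rows].tolist()) & set(solute_ids[test_rows].tolist()))
print(len(train_rows), len(test_rows))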
“Being able to accurately reproduce those small variations in solubility due to temperature, even when the overarching experimental noise is very large, was a really positive sign that the network had correctly learned an underlying solubility prediction function,” Burns says.
Accurate predictions
The researchers had expected that the model based on ChemProp, which is able to learn new representations as it goes along, would be able to make more accurate predictions. However, to their surprise, they found that the two models performed essentially the same. That suggests that the major limitation on their performance is the quality of the data, and that the models are performing as well as theoretically possible based on the data they’re using, the researchers say.
“ChemProp should always outperform any static embedding when you have sufficient data,” Burns says. “We were blown away to see that the static and learned embeddings were statistically indistinguishable in performance across all the different subsets, which indicates to us that the data limitations present in this space dominated the model performance.”
The models could become more accurate, the researchers say, if better training and testing data were available: ideally, data obtained by one person or a group of people all trained to perform the experiments the same way.
“One of the big limitations of using these kinds of compiled datasets is that different labs use different methods and experimental conditions when they perform solubility tests. That contributes to the variability between different datasets,” Attia says.
Because the model based on FastProp makes its predictions faster and has code that is easier for other users to adapt, the researchers decided to make that one, known as FastSolv, available to the public. Several pharmaceutical companies have already begun using it.
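A workflow like the one Burns describes, ranking candidate solvents and skipping hazardous ones, might look like the following sketch. Here `predict_log_solubility` is a hypothetical stand-in for a trained model such as FastSolv, not its actual API, and the scores and hazard list are invented for illustration.

```python
def predict_log_solubility(solute: str, solvent: str, temp_k: float) -> float:
    # Placeholder predictions; a real workflow would call a trained model here.
    scores = {"DMF": 1.2, "ethanol": 0.4, "acetone": 0.7, "ethyl acetate": 0.6}
    return scores[solvent]


hazardous = {"DMF"}  # e.g., solvents restricted by company policy
candidates = ["DMF", "ethanol", "acetone", "ethyl acetate"]

# Filter out restricted solvents, then rank the rest by predicted solubility.
allowed = [s for s in candidates if s not in hazardous]
ranked = sorted(
    allowed,
    key=lambda s: predict_log_solubility("aspirin", s, 298.15),
    reverse=True,
)
print(ranked[0])  # → acetone
```

The best overall solvent may be excluded by policy; the ranking surfaces the next-best permissible choice, which is the use case described above.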
“There are applications throughout the drug discovery pipeline,” Burns says. “We’re also excited to see, outside of formulation and drug discovery, where people may use this model.”
The research was funded, in part, by the U.S. Department of Energy.