Learning the language of molecules to predict their properties
Date:
July 7, 2023
Source:
Massachusetts Institute of Technology
Summary:
A new framework uses machine learning to simultaneously predict
molecular properties and generate new molecules using only a small
amount of data for training.
==========================================================================
FULL STORY
==========================================================================
Discovering new materials and drugs typically involves a manual,
trial-and-error process that can take decades and cost millions
of dollars. To streamline this process, scientists often use machine
learning to predict molecular properties and narrow down the molecules
they need to synthesize and test in the lab.
Researchers from MIT and the MIT-IBM Watson AI Lab have developed a new,
unified framework that can simultaneously predict molecular properties
and generate new molecules much more efficiently than these popular deep-learning approaches.
To teach a machine-learning model to predict a molecule's biological
or mechanical properties, researchers must show it millions of labeled molecular structures -- a process known as training. Due to the expense
of discovering molecules and the challenges of hand-labeling millions
of structures, large training datasets are often hard to come by, which
limits the effectiveness of machine-learning approaches.
By contrast, the system created by the MIT researchers can effectively
predict molecular properties using only a small amount of data. Their
system has an underlying understanding of the rules that dictate how
building blocks combine to produce valid molecules. These rules capture
the similarities between molecular structures, which helps the system
generate new molecules and predict their properties in a data-efficient
manner.
This method outperformed other machine-learning approaches on both
small and large datasets, and was able to accurately predict molecular properties and generate viable molecules when given a dataset with fewer
than 100 samples.
"Our goal with this project is to use some data-driven methods to
speed up the discovery of new molecules, so you can train a model to do
the prediction without all of these cost-heavy experiments," says lead
author Minghao Guo, an electrical engineering and computer science (EECS) graduate student.
Guo's co-authors include MIT-IBM Watson AI Lab research staff members
Veronika Thost, Payel Das, and Jie Chen; recent MIT graduates Samuel Song
'23 and Adithya Balachandran '23; and senior author Wojciech Matusik, a professor of electrical engineering and computer science and a member
of the MIT-IBM Watson AI Lab, who leads the Computational Design
and Fabrication Group within the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the International Conference on Machine Learning.

Learning the language of molecules

To achieve the best results
with machine-learning models, scientists need training datasets with
millions of molecules that have similar properties to those they hope to discover. In reality, these domain-specific datasets are usually very
small. So, researchers use models that have been pretrained on large
datasets of general molecules, which they apply to a much smaller,
targeted dataset. However, because these models haven't acquired much
domain-specific knowledge, they tend to perform poorly.
The MIT team took a different approach. They created a machine-learning
system that automatically learns the "language" of molecules -- what
is known as a molecular grammar -- using only a small, domain-specific
dataset. It uses this grammar to construct viable molecules and predict
their properties.
In language theory, one generates words, sentences, or paragraphs based
on a set of grammar rules. You can think of a molecular grammar the
same way. It is a set of production rules that dictate how to generate molecules or polymers by combining atoms and substructures.
Just like a language grammar, which can generate a plethora of sentences
using the same rules, one molecular grammar can represent a vast number
of molecules.
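As a rough illustration of what production rules look like, here is a minimal Python sketch of a toy string grammar. It is not the researchers' grammar, which this article does not spell out. The rule table `RULES`, the symbols `MOL` and `CHAIN`, and the function `derive` are hypothetical names chosen for this example, and the outputs are SMILES-like strings for simple carbon chains with an optional oxygen at the end.

```python
# Toy grammar: keys of RULES are nonterminals; every other symbol is a terminal.
# These rules are illustrative assumptions, not rules learned by the MIT system.
RULES = {
    "MOL": [["CHAIN"], ["CHAIN", "O"]],   # a chain, optionally ending in an oxygen
    "CHAIN": [["C"], ["C", "CHAIN"]],     # one carbon, or a carbon plus a longer chain
}


def derive(choices, start="MOL"):
    """Expand the leftmost nonterminal until none remain.

    `choices` lists which rule to apply at each step. Returns the generated
    string and the (symbol, rule_index) pairs that were applied, in order.
    """
    symbols = [start]
    applied = []
    picks = iter(choices)
    while any(s in RULES for s in symbols):
        i = next(idx for idx, s in enumerate(symbols) if s in RULES)
        head = symbols[i]
        k = next(picks)
        symbols[i:i + 1] = RULES[head][k]  # replace the nonterminal with the rule body
        applied.append((head, k))
    return "".join(symbols), applied


ethanol, ethanol_rules = derive([1, 1, 0])
propane, propane_rules = derive([0, 1, 1, 0])
print(ethanol)  # CCO
print(propane)  # CCC
```

Two rules per symbol are enough to produce carbon chains of any length, which is the sense in which a small grammar can stand for a very large set of molecules.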
Molecules with similar structures use the same grammar production rules,
and the system learns to understand these similarities.
Since structurally similar molecules often have similar properties,
the system uses its underlying knowledge of molecular similarity to
predict properties of new molecules more efficiently.
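One simple way to picture "molecules that use the same production rules are similar" is to compare the lists of rules used to build them. The sketch below uses a multiset Jaccard score as a stand-in; the article does not describe the similarity measure the researchers actually use, and the rule names (`add_carbon`, `add_hydroxyl`, `add_ring`) and the function `rule_similarity` are hypothetical.

```python
from collections import Counter


def rule_similarity(rules_a, rules_b):
    """Multiset Jaccard similarity between two lists of applied production rules."""
    a, b = Counter(rules_a), Counter(rules_b)
    shared = sum((a & b).values())  # rule uses common to both derivations
    total = sum((a | b).values())   # rule uses appearing in either derivation
    return shared / total if total else 1.0


# Hypothetical derivations: each entry names one production rule used to build a molecule.
ethanol = ["add_carbon", "add_carbon", "add_hydroxyl"]
propanol = ["add_carbon", "add_carbon", "add_carbon", "add_hydroxyl"]
benzene = ["add_ring"]

print(rule_similarity(ethanol, propanol))  # 0.75
print(rule_similarity(ethanol, benzene))   # 0.0
```

Under this toy score, the two alcohols land close together and the ring lands far away, so a property known for one alcohol becomes a reasonable first guess for the other.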
"Once we have this grammar as a representation for all the different
molecules, we can use it to boost the process of property prediction,"
Guo says.
The system learns the production rules for a molecular grammar using reinforcement learning -- a trial-and-error process where the model is
rewarded for behavior that gets it closer to achieving a goal.
But because there could be billions of ways to combine atoms and
substructures, the process to learn grammar production rules would be
too computationally expensive for anything but the tiniest dataset.
The researchers decoupled the molecular grammar into two parts. The first
part, called a metagrammar, is a general, widely applicable grammar
they design manually and give the system at the outset. Then it only
needs to learn a much smaller, molecule-specific grammar from the domain dataset. This hierarchical approach speeds up the learning process.

Big results, small datasets

In experiments, the researchers' new system simultaneously generated viable molecules and polymers, and predicted
their properties more accurately than several popular machine-learning approaches, even when the domain-specific datasets had only a few hundred samples. Some other methods also required a costly pretraining step that
the new system avoids.
The technique was especially effective at predicting physical properties
of polymers, such as the glass transition temperature, which is
the temperature at which a polymer changes from a hard, glassy state to
a soft, rubbery one. Obtaining this information manually is often extremely costly
because the experiments require extremely high temperatures and pressures.
To push their approach further, the researchers cut one training set
down by more than half -- to just 94 samples. Their model still achieved results that were on par with methods trained using the entire dataset.
"This grammar-based representation is very powerful. And because the
grammar itself is a very general representation, it can be deployed
to different kinds of graph-form data. We are trying to identify other applications beyond chemistry or material science," Guo says.
In the future, they also want to extend their current molecular grammar
to include the 3D geometry of molecules and polymers, which is key
to understanding the interactions between polymer chains. They are
also developing an interface that would show a user the learned grammar production rules and solicit feedback to correct rules that may be wrong, boosting the accuracy of the system.
This work is funded, in part, by the MIT-IBM Watson AI Lab and its member company, Evonik.
Paper: "Hierarchical Grammar-Induced Geometry for Data-Efficient Molecular Property Prediction"
==========================================================================
Story Source: Materials provided by
Massachusetts Institute of Technology. Original written by Adam
Zewe. Note: Content may be edited for style and length.
==========================================================================
Link to news story:
https://www.sciencedaily.com/releases/2023/07/230707153847.htm