How AI learned to create proteins that do not exist in nature: a unique project by MIT scientists
Artificial intelligence can help develop drugs, vaccines and biosensors.
Proteins are the basis of life. They are encoded by DNA and are responsible for many of the biological functions that sustain life in the human body. But our body is a fragile system exposed to various factors: pathogens, viruses, diseases and cancer.
Imagine if we could speed up the process of creating vaccines or drugs for new pathogens. What if we had gene-editing technology that could automatically produce proteins to correct DNA errors that cause cancer? Finding proteins that can bind strongly to targets or speed up chemical reactions is vital for drug development, diagnostics, and many industrial applications, but it is often a time-consuming and costly process.
To expand our capabilities in protein engineering, researchers at the MIT Computer Science and Artificial Intelligence Laboratory (MIT CSAIL) developed FrameDiff, a computational tool for creating new protein structures beyond what nature has produced. The machine learning approach generates “frames” that match the intrinsic properties of protein structures, allowing it to design new proteins independently of existing designs and to produce protein structures that have not been seen before.
“In nature, protein design is a slow process that takes millions of years. Our technique aims to provide an answer to human-made problems that evolve much faster than the speed of nature,” says MIT CSAIL graduate student Jason Yim, one of the lead authors of the new paper on the work. “The goal is that with this new ability to generate synthetic protein structures, many improved possibilities open up, such as better binders. This means engineering proteins that can attach to other molecules more efficiently and selectively, with broad implications for targeted drug delivery and biotechnology, where it could lead to the development of better biosensors. It could also have implications for the biomedical field and beyond, offering opportunities such as developing more efficient photosynthetic proteins, making more efficient antibodies, and engineering nanoparticles for gene therapy.”
Proteins have complex structures consisting of many atoms linked by chemical bonds. The most important atoms that define the 3D shape of a protein are called the “backbone”, a kind of spine for the protein. Each triple of atoms along the backbone has the same pattern of bonds and atom types. The researchers noticed that this pattern could be used to build machine learning algorithms using ideas from differential geometry and probability. This is where frames come in: mathematically, these triples can be modeled as rigid bodies called “frames” (common in physics) that have a position and a rotation in 3D.
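To make the idea of a rigid body attached to each backbone triple concrete, here is a minimal sketch in plain Python. It assumes the three atoms are the N, CA and C atoms of one residue and uses one common convention (origin at CA, first axis pointing toward C, second axis obtained by Gram-Schmidt from the direction toward N). The function names and the sample coordinates are illustrative assumptions, not taken from the FrameDiff code.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def scale(a, s):
    return [x * s for x in a]

def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def frame_from_backbone(n_atom, ca_atom, c_atom):
    """Build a rigid frame (three orthonormal axes, origin) from one triple.

    Assumed convention: origin at CA, first axis along CA->C, second axis
    in the N-CA-C plane, third axis perpendicular to that plane.
    """
    e1 = normalize(sub(c_atom, ca_atom))
    v2 = sub(n_atom, ca_atom)
    # Gram-Schmidt: remove the part of v2 that lies along e1.
    e2 = normalize(sub(v2, scale(e1, dot(e1, v2))))
    e3 = cross(e1, e2)  # right-handed third axis
    return [e1, e2, e3], list(ca_atom)

def to_local(frame, point):
    """Express a global point in the coordinates of a frame."""
    axes, origin = frame
    d = sub(point, origin)
    return [dot(axis, d) for axis in axes]

# Illustrative coordinates in angstroms (hypothetical values).
n_atom, ca_atom, c_atom = [1.0, 2.0, 3.0], [2.0, 3.5, 3.2], [3.1, 3.0, 4.0]
frame = frame_from_backbone(n_atom, ca_atom, c_atom)
print("axes:", frame[0])
print("C in local coordinates:", to_local(frame, c_atom))
```

Because every triple has the same bond pattern, its three atoms sit at nearly the same local coordinates in every frame, so a position plus a rotation per residue is enough to describe the backbone.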
These frames provide each triple with enough information to be aware of its spatial environment. The challenge is for the machine learning algorithm to learn how to move each frame to build the protein backbone. By learning to construct existing proteins, the algorithm will hopefully generalize and be able to create new proteins that have never existed in nature.
Teaching a model to construct proteins using “diffusion” involves introducing noise that randomly moves all the frames and blurs out what the original protein looked like. The algorithm’s job is to move and rotate each frame until it looks like the original protein. Although the idea is simple, developing diffusion on frames requires techniques from stochastic calculus on Riemannian manifolds. On the theoretical side, the researchers developed “SE(3) diffusion” for learning probability distributions that non-trivially connect the translation and rotation components of each frame.
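The noising step described above can be illustrated with a toy sketch in plain Python. It perturbs a single frame by adding Gaussian noise to its position and composing its rotation with a random small rotation. This is a deliberately simplified stand-in: the rotation noise here (a uniformly random axis with a Gaussian angle) is an assumption chosen for readability and is not the rotation distribution used in SE(3) diffusion. All function and parameter names are hypothetical.

```python
import math
import random

def random_unit_vector(rng):
    # Normalizing a 3D Gaussian sample gives a uniformly random direction.
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            return [x / norm for x in v]

def axis_angle_to_matrix(axis, angle):
    # Rodrigues' rotation formula, returned as a 3x3 matrix (list of rows).
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def noise_frame(rotation, translation, sigma_rot, sigma_trans, rng):
    """Return a noised copy of one frame (rotation matrix, position).

    Toy noise model (an assumption, not the SE(3) diffusion process):
    Gaussian jitter on the position, plus a rotation by a Gaussian
    angle (radians) about a uniformly random axis.
    """
    noisy_translation = [t + rng.gauss(0.0, sigma_trans) for t in translation]
    angle = rng.gauss(0.0, sigma_rot)
    perturbation = axis_angle_to_matrix(random_unit_vector(rng), angle)
    noisy_rotation = matmul(perturbation, rotation)
    return noisy_rotation, noisy_translation

# Larger sigmas blur the original frame more strongly.
rng = random.Random(0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for sigma in (0.1, 0.5, 1.5):
    rot, pos = noise_frame(identity, [0.0, 0.0, 0.0], sigma, sigma, rng)
    print(sigma, pos)
```

The perturbed rotation is still a valid rotation matrix, which is the point of working on frames directly: noise has to respect the geometry of rotations rather than being added entry by entry. A trained model would then learn to reverse such perturbations step by step.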
The Subtle Art of Diffusion
In 2021, DeepMind introduced AlphaFold2, a deep learning algorithm for predicting the 3D structures of proteins from their sequences. When creating synthetic proteins, there are two main steps: generation and prediction. “Generation” means creating new protein structures and sequences, and “prediction” means working out the 3D structure of a sequence. Not coincidentally, AlphaFold2 also used frames to model proteins. SE(3) diffusion and FrameDiff take the idea further by incorporating frames into diffusion models, a generative AI technique that has become extremely popular in image generation, in tools such as Midjourney.
The shared framework and principles behind protein structure generation and prediction meant that the best models from both sides were compatible. In collaboration with the Institute for Protein Design at the University of Washington, SE(3) diffusion is already being used to design and experimentally test new proteins. Specifically, they combined SE(3) diffusion with RoseTTAFold2, a protein structure prediction tool very similar to AlphaFold2, resulting in “RFdiffusion”. This new tool has brought protein designers closer to solving important problems in biotechnology, including the development of highly specific protein binders for accelerated vaccine design, the engineering of symmetric proteins for gene delivery, and robust motif scaffolding for precise enzyme design.
Future challenges for FrameDiff include improving its generality for problems that combine multiple requirements for biologics such as drugs. Another extension is to generalize the models to all biological modalities, including DNA and small molecules. The team believes that by training FrameDiff on larger datasets and improving its optimization process, it could generate foundational structures with RFdiffusion-level design capabilities while preserving FrameDiff’s inherent simplicity.
“Abandoning the pretrained structure prediction model [in FrameDiff] opens up the possibility of rapidly generating structures that extend to large lengths,” says Harvard University computational biologist Sergey Ovchinnikov. “The researchers’ innovative approach represents a promising step towards overcoming the limitations of current structure prediction models. Although this is still preliminary work, it is an encouraging step in the right direction. Thus, the vision of protein design playing a key role in solving humanity’s most pressing problems seems increasingly achievable thanks to the groundbreaking work of this team of researchers at MIT.”
Jason Yim co-wrote the paper with Columbia University postdoc Brian Trippe; Valentin De Bortoli, a researcher at the Center for Sciences of Data of the French National Centre for Scientific Research (CNRS) in Paris; Cambridge University postdoc Emile Mathieu; and Arnaud Doucet, Oxford University professor of statistics and senior research scientist at DeepMind. MIT professors Regina Barzilay and Tommi Jaakkola advised the study.
The team’s work was supported in part by the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, EPSRC grants and a partnership between Microsoft Research and the University of Cambridge, the National Science Foundation Graduate Research Fellowship Program, an NSF Expeditions grant, the Machine Learning for Pharmaceutical Discovery and Synthesis Consortium, the DTRA Discovery of Medical Countermeasures Against New and Emerging Threats program, the DARPA Accelerated Molecular Discovery program, and the Sanofi Computational Antibody Design grant. This research will be presented at the International Conference on Machine Learning in July.