Building better proteins with machine learning

Assistant professor Philip Romero.

Neural networks and machine learning once seemed like far-fetched futuristic concepts, but they are now proven tools that help scientists approach big problems (and big datasets). A major advantage of machine learning is that it can analyze high-throughput datasets and pull the best predictions out of millions of sequence combinations, a needle-in-a-haystack process that would be impossible to carry out experimentally in the lab.

In a recent study published in the journal PNAS, Morgridge Institute for Research investigator Anthony Gitter and his lab demonstrated that a machine learning model can predict new protein sequences that improve protein function. The Gitter Lab worked in close collaboration with biochemistry assistant professor Phil Romero’s lab to fully realize this intersection of machine learning and protein engineering.

A protein is a sequence that can be up to thousands of characters long, with each character drawn from the 20 different amino acids that serve as building blocks. The sequence determines how the protein folds into a three-dimensional shape, and the shape determines its function.
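To make the "sequence of characters" idea concrete, here is a minimal Python sketch of how a protein sequence might be turned into numbers for a machine learning model, and how quickly the number of possible sequences grows. The one-hot encoding shown is a common general-purpose choice and is an assumption here, not necessarily the encoding used in the study; the example sequence and the function names are hypothetical.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence):
    """Encode each amino acid as a 20-element row containing a single 1."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        encoded.append(row)
    return encoded


def sequence_space_size(length):
    """Number of distinct sequences of a given length (20 choices per position)."""
    return len(AMINO_ACIDS) ** length


# Hypothetical five-residue sequence, used only for illustration.
example = "MKTAY"
encoded = one_hot_encode(example)
print(len(encoded), "positions x", len(encoded[0]), "amino acids")
print("Possible sequences of length 5:", sequence_space_size(5))
```

Even at five positions there are already millions of possible sequences, which is why testing every option in the lab is not feasible.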

“Almost everything that happens in a cell is because some protein has the right shape to do that job,” Gitter says.

Morgridge investigator Anthony Gitter.

Changing even a single amino acid in the sequence could drastically alter the shape and function of a protein. In most cases, the protein is either unaffected or simply falls apart because its structure becomes unstable. But what if there were a way to home in on changes that would make the protein better at its job?

A fluorescent protein could shine brighter and improve the way biologists visualize cellular activity under a microscope. Or a protein receptor could bind more efficiently to important biological molecules.

Sam Gelman, a graduate student in the Gitter Lab, led the design and testing of multiple neural network models that learn about biological structure and function from existing datasets for several well-known proteins, such as green fluorescent protein (GFP) and the protein G B1 domain (GB1), which binds to immunoglobulin G (IgG) antibodies.

Once the machine learning software could extrapolate meaningful information from the sequences, the next step was to test its ability to make predictions about how to design new protein sequences that could affect function.
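The general idea of using a model to pick promising designs can be sketched as follows: generate candidate variants, score each with a predictive model, and keep the top few for lab testing. This is only an illustration under stated assumptions. The `toy_score` function is a placeholder with no biological meaning that stands in for a trained neural network, and the function names and the short input sequence are hypothetical, not taken from the study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(sequence):
    """Return every variant that differs from `sequence` at exactly one position."""
    variants = []
    for pos, original in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != original:
                variants.append(sequence[:pos] + aa + sequence[pos + 1:])
    return variants


def toy_score(sequence):
    """Placeholder for a trained model's prediction (hypothetical).

    It simply counts tryptophan (W) residues so the example is runnable;
    a real model would predict a measured property such as binding.
    """
    return sequence.count("W")


def rank_variants(sequence, score_fn, top_k):
    """Score all single mutants and return the `top_k` highest-scoring ones."""
    candidates = single_mutants(sequence)
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:top_k]


# Hypothetical three-residue sequence, used only for illustration.
top = rank_variants("MKT", toy_score, top_k=3)
print(top)
```

In practice the scoring function would be the trained model, and candidates could carry many mutations at once rather than a single change.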

The researchers identified five variations of the GB1 protein sequence, which the Romero Lab synthesized into proteins to test their function.

“We actually came up with a new version of a protein that works much better than anything that’s been observed naturally before or anything that’s been engineered before,” Gitter says.

Their new protein — identified as Design10 because it contained 10 amino acid mutations — had a similar structure to GB1, but could bind to IgG antibodies with more than 20 times the affinity of the natural protein.

This computational work is built on the foundation of basic research and the experimental work done by groups like the Romero Lab. And the best machine learning models are only as good as the datasets on which they are trained.

“The best models can learn a more accurate predictive model with less example data than other models can,” Gitter says. “Where we would really like to go with this in the future is doing less wet lab experimental work to build up these experimental datasets — let the machine learning models step in so that you’re getting to the needle in the haystack a lot sooner.”

While this research is proof-of-concept, protein engineering has huge potential in biomedical research, Gitter says, perhaps by creating new ways to treat disease.

This story was adapted from one written by Mariel Mohns at the Morgridge Institute for Research.