The new technology has the potential to surpass directed evolution, the protein design method that won the Nobel Prize, and it will revitalize the 50-year-old field of protein engineering by accelerating the creation of novel proteins

Published: 2023-03-11

Scientists have developed an AI system that can produce synthetic enzymes from scratch. According to a University of California article titled "AI technology generates original proteins from scratch: Natural language model jumpstarts protein design with the creation of active enzymes," cited on ScienceDaily.com, several of these enzymes performed as well in laboratory tests as those found in nature, even though their artificially generated amino acid sequences differed markedly from those of any known natural protein.

The experiment shows that natural language processing, although it was developed to read and write English text, can pick up on at least some of the underlying principles of biology. ProGen, the AI tool created by Salesforce Research, assembles amino acid sequences into synthetic proteins using next-token prediction.
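To make "next-token prediction" concrete, here is a minimal sketch of the idea using a toy bigram model: it counts which residue tends to follow which, then samples one residue at a time. This is not ProGen, which is a large neural language model. The training strings, the start and end tokens, and the function names are all hypothetical and chosen only for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
START, END = "^", "$"  # hypothetical start and end-of-sequence tokens


def train_bigram(sequences):
    """Count how often each token follows each other token."""
    counts = {}
    for seq in sequences:
        prev = START
        for tok in list(seq) + [END]:
            counts.setdefault(prev, {})
            counts[prev][tok] = counts[prev].get(tok, 0) + 1
            prev = tok
    return counts


def generate(counts, rng, max_len=50):
    """Sample a sequence one token at a time, conditioning on the previous token."""
    out = []
    prev = START
    while len(out) < max_len:
        followers = counts.get(prev)
        if not followers:
            break
        tokens = list(followers)
        weights = [followers[t] for t in tokens]
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        if tok == END:
            break
        out.append(tok)
        prev = tok
    return "".join(out)


# Made-up strings for illustration only; these are not real protein sequences.
toy_training_set = ["MKALIV", "MKVLAA", "MRALIG"]
model = train_bigram(toy_training_set)
sample = generate(model, random.Random(0))
print(sample)
```

A real protein language model conditions on the entire preceding sequence rather than only the previous residue, but the generation loop has the same shape: predict a distribution over the next token, sample from it, append, and repeat.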

The new technology, according to scientists, has the potential to surpass directed evolution, the protein design method that won the Nobel Prize, and it will revitalize the 50-year-old field of protein engineering by accelerating the creation of novel proteins with applications ranging from therapeutics to the degradation of plastic.

James Fraser, Ph.D., professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy and an author of the study, which was published on Jan. 26 in Nature Biotechnology, noted that the artificial designs perform much better than designs that were inspired by the evolutionary process.

Although it differs from the normal evolutionary process, the language model is learning aspects of evolution, according to Fraser. "We may now adjust the creation of these traits for particular impacts. For instance, a very thermostable enzyme, one that prefers acidic conditions, or one that won't interact with other proteins."

To build the model, the scientists simply fed the amino acid sequences of 280 million different proteins of all kinds into the machine learning model and gave it a few weeks to process the data. They then fine-tuned the model with 56,000 sequences from five lysozyme families, along with some contextual information about these particular proteins.

The model quickly generated a million sequences, and the study team chose 100 of them to test, based on how closely they resembled the sequences of natural proteins and how naturalistic the underlying amino acid "grammar" and "semantics" of the AI proteins were.

From this initial batch of 100 proteins, which Tierra Biosciences screened in vitro, the team made five artificial proteins to test in cells and compared their activity to that of hen egg white lysozyme (HEWL), an enzyme found in the whites of chicken eggs. Human tears, saliva, and milk all contain similar lysozymes, which act as antimicrobial defenses against bacteria and fungi.

Two of the artificial enzymes were able to degrade bacterial cell walls with activity comparable to that of HEWL, even though their sequences were only about 18% similar to one another. The two sequences were about 90% and 70% similar, respectively, to any known protein.

A single mutation in a natural protein can cause it to stop functioning. Yet in a subsequent round of screening, the scientists discovered that the AI-generated enzymes remained active even when as little as 31.4% of their sequence resembled any known natural protein.
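Percentages like these are typically derived by aligning two sequences and counting matching positions. The sketch below shows one simple way to compute percent identity from a pair of already-aligned sequences. It is an illustration under stated assumptions, not the method used in the study: the example strings are made up, the `percent_identity` helper is hypothetical, and the choice of denominator (all columns where at least one sequence has a residue) is only one of several conventions in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of compared alignment columns in which both residues match.

    Assumes the two inputs are already aligned (equal length, gaps as '-').
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # nothing to compare in this column
        compared += 1
        if a == b:
            matches += 1  # a gap cannot match here, since double gaps were skipped
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Made-up aligned strings for illustration only.
identity = percent_identity("MKV-LA", "MKVALG")
print(f"{identity:.1f}% identical")
```

In practice the alignment itself would come from a dedicated tool, and the reported figure depends on both the alignment and the denominator convention, so numbers from different sources are not always directly comparable.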

Simply by analyzing the raw sequence data, the AI was even able to learn how the enzymes should be shaped. The atomic structures of the artificial proteins, as determined by X-ray crystallography, looked just as they should, even though their sequences were novel.

Salesforce Research created ProGen in 2020, basing it on a type of natural language processing that its researchers had originally developed to generate English-language text.

They already knew from their earlier research that the AI system was capable of teaching itself the fundamental principles of good composition, including syntax and word meaning.

"Sequence-based models are incredibly strong in understanding structure and rules when you train them with masses of data," said Nikhil Naik, Ph.D., Director of AI Research at Salesforce Research and the paper's senior author. The models learn compositionality and which words can appear together.

With proteins, the design options are almost endless. As far as proteins go, lysozymes are tiny, containing up to 300 amino acids. Yet, given that there are 20 different amino acids, there are a staggering 20^300 potential combinations. That is more than the number of all the people who have ever lived, multiplied by the number of sand grains on Earth, multiplied by the number of atoms in the cosmos.
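The size of that number, 20 raised to the 300th power for a chain of exactly 300 residues, can be checked with a few lines of arithmetic. The sketch below is only a sanity check. The three comparison figures are rough order-of-magnitude assumptions, not values from the study, and they are deliberately multiplied together to show that even their product falls far short.

```python
import math

# Number of distinct sequences of length 300 over a 20-letter amino acid alphabet.
n_sequences = 20 ** 300

# Python integers are arbitrary precision, so the digits can be counted directly.
digits = len(str(n_sequences))
exponent = 300 * math.log10(20)  # base-10 exponent of 20**300

# Rough order-of-magnitude assumptions, for comparison only.
humans_ever = 10 ** 11      # assumed: on the order of 100 billion people
sand_grains = 10 ** 19      # assumed: on the order of 10^19 grains
atoms_universe = 10 ** 80   # assumed: on the order of 10^80 atoms
product = humans_ever * sand_grains * atoms_universe

print(f"20^300 has {digits} digits (about 10^{exponent:.1f})")
print("Exceeds the product of the three estimates:", n_sequences > product)
```

Under these assumptions the product of the three estimates is about 10^110, while 20^300 is roughly 10^390, so the comparison holds with hundreds of orders of magnitude to spare.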

Given these near-limitless possibilities, it is remarkable that the model can produce functional enzymes so easily.

"The capacity to build functional proteins from scratch out-of-the-box proves we are going into a new age of protein design," stated Ali Madani, Ph.D., founder of Profluent Bio, former research scientist at Salesforce Research, and first author of the study. "Protein engineers now have a flexible new tool at their disposal, and we're eager to explore the therapeutic uses."