- A joint research team led by Distinguished Professor Sang Yup Lee of the Department of Chemical and Biomolecular Engineering at KAIST and Professor Bernhard Palsson of UCSD developed ‘DeepECtransformer’, an artificial intelligence that can predict the Enzyme Commission (EC) numbers of proteins.
- The AI can identify enzymes that have not yet been discovered and can predict a total of 5,360 Enzyme Commission (EC) numbers.
- As a core technology for analyzing the metabolic network of a genome, it is expected to be used in the development of microbial cell factories that produce environmentally friendly chemicals.
Although E. coli is one of the most studied organisms, the functions of 30% of its proteins have not yet been clearly revealed. The researchers used the artificial intelligence to predict 464 enzymes among these proteins of unknown function, and verified the predictions for 3 of the proteins through in vitro enzyme assays.
KAIST (President Kwang-Hyung Lee) announced on the 24th that a joint research team has developed DeepECtransformer, an artificial intelligence that predicts enzyme functions from protein sequences, and has used the AI to establish a prediction system that identifies enzyme functions quickly and accurately. The team comprises Gi Bae Kim, Ji Yeon Kim, Dr. Jong An Lee and Distinguished Professor Sang Yup Lee of the Department of Chemical and Biomolecular Engineering at KAIST, and Dr. Charles J. Norsigian and Professor Bernhard O. Palsson of the Department of Bioengineering at UCSD.
Enzymes are proteins that catalyze biological reactions, and identifying the function of each enzyme is essential to understanding the chemical reactions that occur in living organisms and the metabolic characteristics of those organisms. The Enzyme Commission (EC) number is an enzyme function classification system designed by the International Union of Biochemistry and Molecular Biology. To understand the metabolic characteristics of various organisms, a technology is needed that can quickly identify the enzymes encoded in a genome and assign their EC numbers.
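For readers unfamiliar with the notation, an EC number is written as four dot-separated integers (class, subclass, sub-subclass, serial number), and the first integer selects one of seven top-level enzyme classes. The following is a minimal illustrative sketch of that structure; the names `ECNumber` and `parse_ec_number` are hypothetical helpers introduced here and are not part of DeepECtransformer.

```python
from dataclasses import dataclass

# Top-level EC classes in the IUBMB enzyme nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


@dataclass(frozen=True)
class ECNumber:
    """Hypothetical container for a fully specified, four-level EC number."""
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int

    @property
    def class_name(self) -> str:
        # Name of the top-level class, e.g. "Oxidoreductases" for class 1.
        return EC_CLASSES[self.enzyme_class]

    def __str__(self) -> str:
        return f"{self.enzyme_class}.{self.subclass}.{self.sub_subclass}.{self.serial}"


def parse_ec_number(text: str) -> ECNumber:
    """Parse a string such as 'EC 1.1.1.1' or '1.1.1.1' into an ECNumber."""
    cleaned = text.strip()
    if cleaned.upper().startswith("EC"):
        cleaned = cleaned[2:].strip()
    parts = cleaned.split(".")
    # Only complete four-level numbers are accepted in this sketch.
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a complete four-level EC number: {text!r}")
    numbers = [int(p) for p in parts]
    if numbers[0] not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class in {text!r}")
    return ECNumber(*numbers)
```

For example, `parse_ec_number("EC 1.1.1.1")` (alcohol dehydrogenase) yields an object whose `class_name` is `"Oxidoreductases"`.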
Various deep learning-based methodologies have been developed to analyze the features of biological sequences, including for protein function prediction, but most of them suffer from the black-box problem, in which the AI's inference process cannot be interpreted. Various AI-based prediction systems for enzyme function have also been reported, but they either do not solve this black-box problem or cannot interpret the reasoning process at a fine-grained level (e.g., the level of individual amino acid residues in the enzyme sequence).
The joint team developed DeepECtransformer, an AI that combines deep learning with a protein homology analysis module to predict the enzyme function of a given protein sequence. To better capture the features of protein sequences, the team also adopted the transformer architecture, which is commonly used in natural language processing, to extract features important for enzyme function in the context of the entire protein sequence. This enabled the team to accurately predict the EC number of an enzyme. DeepECtransformer can predict a total of 5,360 EC numbers.
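The release does not say how the deep learning model and the homology analysis module are combined. One plausible arrangement, shown here purely as an assumption, is to trust the neural network when it produces a confident prediction and to fall back to a homology search otherwise. In the sketch below, every name (`predict_ec`, `homology_lookup`, `sequence_identity`), the thresholds, and the toy position-wise similarity measure are hypothetical stand-ins; a real system would use a trained transformer and a proper sequence alignment tool.

```python
from typing import Callable, Dict, List, Tuple


def sequence_identity(a: str, b: str) -> float:
    """Toy similarity: matching positions divided by the longer length.

    This stands in for a real alignment-based homology score.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def homology_lookup(
    query: str,
    reference: Dict[str, List[str]],
    min_identity: float = 0.5,
) -> List[str]:
    """Return the EC numbers of the most similar reference sequence, if any
    reference reaches the (assumed) minimum identity."""
    best_ecs: List[str] = []
    best_score = min_identity
    for ref_seq, ecs in reference.items():
        score = sequence_identity(query, ref_seq)
        if score >= best_score:
            best_score = score
            best_ecs = ecs
    return list(best_ecs)


def predict_ec(
    query: str,
    model: Callable[[str], Dict[str, float]],
    reference: Dict[str, List[str]],
    score_threshold: float = 0.5,
) -> Tuple[List[str], str]:
    """Assumed two-stage prediction: neural network first, homology second.

    `model` maps a protein sequence to a score per EC number.
    Returns the predicted EC numbers and the stage that produced them.
    """
    scores = model(query)
    predicted = sorted(ec for ec, s in scores.items() if s >= score_threshold)
    if predicted:
        return predicted, "neural_network"
    fallback = homology_lookup(query, reference)
    if fallback:
        return fallback, "homology"
    return [], "none"
```

Returning the stage alongside the prediction keeps the provenance of each annotation visible, which matters when predictions are later checked experimentally.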
The joint team further analyzed the transformer architecture to understand the inference process of DeepECtransformer and found that the AI uses information on catalytic active sites and/or cofactor binding sites, which are important for enzyme function. This analysis of DeepECtransformer's black box confirmed that the AI had learned on its own, during training, to identify the features that are important for enzyme function.
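The release does not describe the exact analysis procedure. A common way to inspect a transformer, assumed here for illustration only, is to take per-residue attention weights, pick the most strongly attended positions, and check how many of them coincide with annotated functional sites. The function names and all numbers below are hypothetical.

```python
import math
from typing import List, Sequence, Set


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attended_positions(weights: Sequence[float], k: int) -> List[int]:
    """Return the 1-based residue positions of the k largest weights,
    sorted by position."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(i + 1 for i in ranked[:k])


def overlap_with_annotated_sites(
    top_positions: Sequence[int], annotated_sites: Set[int]
) -> float:
    """Fraction of highly attended positions that are annotated as, for
    example, catalytic or cofactor-binding residues."""
    if not top_positions:
        return 0.0
    hits = sum(1 for p in top_positions if p in annotated_sites)
    return hits / len(top_positions)
```

A high overlap would be consistent with the model attending to functionally important residues, although attention weights alone are only indirect evidence of what a network has learned.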
"By utilizing the prediction system we developed, we were able to predict the functions of enzymes that had not yet been identified and verify them experimentally," said Gi Bae Kim, the first author of the paper. "By using DeepECtransformer to identify previously unknown enzymes in living organisms, we will be able to more accurately analyze various facets involved in the metabolic processes of organisms, such as the enzymes needed to biosynthesize various useful compounds or the enzymes needed to biodegrade plastics," he added.
"DeepECtransformer, which quickly and accurately predicts enzyme functions, is a key technology in functional genomics, enabling us to analyze the function of entire enzymes at the systems level," said Professor Sang Yup Lee. He added, "We will be able to use it to develop eco-friendly microbial factories based on comprehensive genome-scale metabolic models, potentially minimizing missing information about metabolism."
The joint team’s work on DeepECtransformer is described in the paper titled "Functional annotation of enzyme-encoding genes using deep learning with transformer layers", written by Gi Bae Kim, Professor Sang Yup Lee of the Department of Chemical and Biomolecular Engineering at KAIST, and their colleagues. The peer-reviewed paper was published in Nature Communications on the 14th of November.
This research was supported by the “Development of next-generation biorefinery platform technologies for leading bio-based chemicals industry” project (2022M3J5A1056072) and the “Development of platform technologies of microbial cell factories for the next-generation biorefineries” project (2022M3J5A1056117) of the National Research Foundation, supported by the Korean Ministry of Science and ICT (Project Leader: Distinguished Professor Sang Yup Lee, KAIST).
< Figure 1. The structure of DeepECtransformer's artificial neural network >