The algorithm is the first full-fledged short-read alignment software to leverage learned indices for solving the exact match search problem, enabling efficient seeding
< Image: Scientists from KAIST develop a new machine-learning-based approach to speed up DNA sequencing. >
The human genome consists of a complete set of DNA, which is about 6.4 billion letters long. Because of its size, reading the whole genome sequence at once is challenging. So scientists use DNA sequencers to produce hundreds of millions of DNA sequence fragments, or short reads, up to 300 letters long. The short reads are then assembled like a giant jigsaw puzzle to reconstruct the entire genome sequence. Even with very fast computers, this job can take hours to complete.
A research team at KAIST has achieved up to 3.45x faster speeds by developing the first short-read alignment software that uses a recent advance in machine learning called a learned index.
The research team reported their findings on March 7, 2022 in the journal Bioinformatics. The software has been released as open source and can be found on GitHub (https://github.com/kaist-ina/BWA-MEME).
Next-generation sequencing (NGS) is a state-of-the-art DNA sequencing method. Projects are underway with the goal of producing genome sequencing at population scale. Modern NGS hardware is capable of generating billions of short reads in a single run. The short reads then have to be aligned with the reference DNA sequence. With large-scale DNA sequencing operations running hundreds of next-generation sequencers, the need for an efficient short-read alignment tool has become even more critical. Accelerating DNA sequence alignment would be a step toward achieving the goal of population-scale sequencing. However, existing algorithms are limited in their performance because of their frequent memory accesses.
BWA-MEM2 is a popular short-read alignment software package currently used in DNA sequencing. However, it has its limitations. State-of-the-art alignment has two phases: seeding and extending. During the seeding phase, searches find exact matches of short reads in the reference DNA sequence. During the extending phase, the matches found in the seeding phase are extended. In the current process, the bottleneck is the seeding phase: finding the exact matches slows the whole process.
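To make the seeding phase concrete, here is a minimal sketch of exact-match search over a reference sequence. It assumes a naively built, sorted suffix array and plain binary search, chosen only for readability; it is not the index that BWA-MEM2 or BWA-MEME actually uses, and the function names (`build_suffix_array`, `find_exact_matches`, `seed_read`) and the toy reference are hypothetical.

```python
def build_suffix_array(ref):
    # Naive construction: sort every suffix start position by its suffix.
    # Acceptable for a toy reference only; real genomes need far better methods.
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def find_exact_matches(ref, sa, pattern):
    # Binary search for the first suffix whose prefix is >= pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes starting with the pattern are adjacent in the sorted order.
    hits = []
    while lo < len(sa) and ref[sa[lo]:sa[lo] + m] == pattern:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def seed_read(ref, sa, read, k):
    # Map each k-letter window of the read to its exact-match positions.
    return {off: find_exact_matches(ref, sa, read[off:off + k])
            for off in range(len(read) - k + 1)}


reference = "ACGTACGTGACG"           # toy reference (assumption)
suffix_array = build_suffix_array(reference)
seeds = seed_read(reference, suffix_array, "CGTGA", 3)
```

Every lookup touches several scattered positions of the index, which hints at why memory accesses dominate the cost of seeding on a genome-sized reference.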
The researchers set out to accelerate DNA sequence alignment. To speed up the process, they applied machine learning techniques to create an algorithmic improvement. Their algorithm, BWA-MEME (BWA-MEM emulated), leverages learned indices to solve the exact match search problem. The original software compared one character at a time for an exact match search. The team’s new algorithm achieves up to 3.45x faster speeds in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60x and memory accesses by 8.77x. “Through this study, it has been shown that full genome big data analysis can be performed faster and less costly than conventional methods by applying machine learning technology,” said Professor Dongsu Han from the School of Electrical Engineering at KAIST.
The researchers’ ultimate goal was to develop efficient software that scientists from academia and industry could use on a daily basis for analyzing big data in genomics. “With the recent advances in artificial intelligence and machine learning, we see so many opportunities for designing better software for genomic data analysis. The potential is there for accelerating existing analysis as well as enabling new types of analysis, and our goal is to develop such software,” added Han.
Whole genome sequencing has traditionally been used for discovering genomic mutations and identifying the root causes of diseases, which leads to the discovery and development of new drugs and cures. There could be many potential applications. Whole genome sequencing is used not only for research, but also for clinical purposes. “The science and technology for analyzing genomic data is making rapid progress to make it more accessible for scientists and patients. This will enhance our understanding about diseases and develop a better cure for patients of various diseases.”
The research was funded by the National Research Foundation of the Korean government’s Ministry of Science and ICT.
Youngmok Jung, Dongsu Han, “BWA-MEME: BWA-MEM emulated with a machine learning approach,” Bioinformatics, Volume 38, Issue 9, May 2022
Professor Dongsu Han
School of Electrical Engineering