KAIST

NEWS


Machine Learning-Based Algorithm to Speed up DNA Sequencing

The algorithm presents the first full-fledged, short-read alignment software that leverages learned indices for solving the exact match search problem for efficient seeding

The human genome consists of a complete set of DNA, which is about 6.4 billion letters long. Because of its size, reading the whole genome sequence at once is challenging. So scientists use DNA sequencers to produce hundreds of millions of DNA sequence fragments, or short reads, up to 300 letters long. Then the DNA sequencer assembles all the short reads like a giant jigsaw puzzle to reconstruct the entire genome sequence. Even with very fast computers, this job can take hours to complete.

A research team at KAIST has achieved up to 3.45x faster speeds by developing the first short-read alignment software that uses a recent advance in machine learning called a learned index. The research team reported their findings on March 7, 2022 in the journal Bioinformatics. The software has been released as open source and can be found on GitHub (https://github.com/kaist-ina/BWA-MEME).

Next-generation sequencing (NGS) is a state-of-the-art DNA sequencing method. Projects are underway with the goal of producing genome sequencing at population scale. Modern NGS hardware is capable of generating billions of short reads in a single run. The short reads then have to be aligned with the reference DNA sequence. With large-scale DNA sequencing operations running hundreds of next-generation sequencers, the need for an efficient short-read alignment tool has become even more critical.

Accelerating DNA sequence alignment would be a step toward achieving the goal of population-scale sequencing. However, existing algorithms are limited in their performance because of their frequent memory accesses. BWA-MEM2 is a popular short-read alignment software package currently used in DNA sequencing. However, it has its limitations.
State-of-the-art alignment has two phases: seeding and extending. During the seeding phase, searches find exact matches of short reads in the reference DNA sequence. During the extending phase, the matches found in the seeding phase are extended. In the current process, the bottleneck is the seeding phase, because finding the exact matches is slow.

The researchers set out to accelerate DNA sequence alignment. To speed up the process, they applied machine learning techniques to create an algorithmic improvement. Their algorithm, BWA-MEME (BWA-MEM emulated), leverages learned indices to solve the exact match search problem. The original software compared one character at a time for an exact match search. The team’s new algorithm achieves up to 3.45x faster seeding throughput than BWA-MEM2 by reducing the number of instructions by 4.60x and memory accesses by 8.77x.

“Through this study, it has been shown that full genome big data analysis can be performed faster and less costly than conventional methods by applying machine learning technology,” said Professor Dongsu Han from the School of Electrical Engineering at KAIST.

The researchers’ ultimate goal was to develop efficient software that scientists from academia and industry could use on a daily basis for analyzing big data in genomics. “With the recent advances in artificial intelligence and machine learning, we see so many opportunities for designing better software for genomic data analysis. The potential is there for accelerating existing analysis as well as enabling new types of analysis, and our goal is to develop such software,” added Han.

Whole genome sequencing has traditionally been used for discovering genomic mutations and identifying the root causes of diseases, which leads to the discovery and development of new drugs and cures. There could be many potential applications. Whole genome sequencing is used not only for research, but also for clinical purposes.
“The science and technology for analyzing genomic data is making rapid progress to make it more accessible for scientists and patients. This will enhance our understanding about diseases and develop a better cure for patients of various diseases.”

The research was funded by the National Research Foundation of the Korean government’s Ministry of Science and ICT.

-Publication
Youngmok Jung, Dongsu Han, “BWA-MEME: BWA-MEM emulated with a machine learning approach,” Bioinformatics, Volume 38, Issue 9, May 2022 (https://doi.org/10.1093/bioinformatics/btac137)

-Profile
Professor Dongsu Han
School of Electrical Engineering
KAIST
2022.05.10
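The BWA-MEME article above says that seeding finds exact matches of short reads in the reference and that a learned index speeds up this search, but it does not describe the internals. The following is a minimal sketch of the general learned-index idea only, not the BWA-MEME implementation: a single linear model predicts where a fixed-length seed sits in a sorted table of reference substrings, and a binary search bounded by the model's worst training error finishes the lookup. The names `encode` and `LearnedKmerIndex`, the fixed seed length `k`, and the single-model design are all assumptions made for illustration.

```python
import bisect

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = value * 4 + BASE_CODE[base]
    return value


class LearnedKmerIndex:
    """Toy learned index (hypothetical): a linear model predicts a k-mer's rank."""

    def __init__(self, reference, k):
        if len(reference) < k:
            raise ValueError("reference is shorter than k")
        self.k = k
        # Sorted table of (encoded k-mer, start position) for the whole reference.
        pairs = sorted(
            (encode(reference[i:i + k]), i)
            for i in range(len(reference) - k + 1)
        )
        self.keys = [key for key, _ in pairs]
        self.positions = [pos for _, pos in pairs]

        # Fit rank ~ slope * key + intercept by ordinary least squares.
        n = len(self.keys)
        mean_key = sum(self.keys) / n
        mean_rank = (n - 1) / 2
        var = sum((key - mean_key) ** 2 for key in self.keys)
        cov = sum(
            (key - mean_key) * (rank - mean_rank)
            for rank, key in enumerate(self.keys)
        )
        self.slope = cov / var if var else 0.0
        self.intercept = mean_rank - self.slope * mean_key
        # Worst prediction error on the table bounds the search window.
        self.max_err = max(
            abs(rank - self._predict(key))
            for rank, key in enumerate(self.keys)
        )

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, seed):
        """Return reference positions where `seed` (length k) occurs exactly."""
        if len(seed) != self.k:
            raise ValueError("seed length must equal k")
        key = encode(seed)
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(n, max(0, guess - self.max_err))
        hi = max(lo, min(n, guess + self.max_err + 1))
        # Binary search only inside the window the model points at.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        hits = []
        while i < n and self.keys[i] == key:
            hits.append(self.positions[i])
            i += 1
        return hits
```

For example, `LearnedKmerIndex("ACGTACGTTAGC", 4).lookup("ACGT")` returns `[0, 4]`, the two places where that seed occurs. The point of the design is that the model replaces most of the search steps with one arithmetic prediction, which is the kind of saving in instructions and memory accesses the article attributes to learned indices.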
Genomic Data Reveals New Insights into Human Embryonic Development

KAIST researchers have used whole-genome sequencing to track the development from a single fertilized egg to a human body

Genomic scientists at KAIST have revealed new insights into the process of human embryonic development using large-scale, whole-genome sequencing of cells and tissues from adult humans. The study, published in Nature on Aug. 25, is the first to analyse somatic mutations in normal tissue across multiple organs within and between humans.

An adult human body comprises trillions of cells of more than 200 types. How a human develops from a single fertilized egg to a fully grown adult is a fundamental question in biomedical science. Due to the ethical challenges of performing studies on human embryos, however, the details of this process remain largely unknown.

To overcome these issues, the research team took a different approach. They analysed genetic mutations in cells taken from adult human post-mortem tissue. Specifically, they identified mutations that occur spontaneously in early developmental cell divisions. These mutations, also called genomic scars, act like unique genetic fingerprints that can be used to trace the embryonic development process.

The study, which looked at 334 single-cell colonies and 379 tissue samples from seven recently deceased human body donors, is the largest single-cell, whole-genome analysis carried out to date. The researchers examined the genomic scars of each individual in order to reconstruct their early embryonic cellular dynamics.

The result revealed several key characteristics of the human embryonic development process. Firstly, mutation rates are higher in the first cell division, but then decrease to approximately one mutation per cell during later cell division.
Secondly, early cells contributed unequally to the development of the embryo in all informative donors. For example, at the two-cell stage, one of the cells always left more progeny cells than the other. The ratio differed from person to person, implying that the process varies between individuals and is not fully deterministic.

The researchers were also able to deduce when cells begin to differentiate into individual organ-specific cells. They found that within three days of fertilization, embryonic cells began to be distributed asymmetrically into tissues for the left and right sides of the body, followed by differentiation into three germ layers, and then differentiation into specific tissues and organs.

“It is an impressive scientific achievement that, within 20 years of the completion of human genome project, genomic technology has advanced to the extent that we are now able to accurately identify mutations in a single-cell genome,” said Professor Young Seok Ju from the Graduate School of Medical Science and Engineering at KAIST. “This technology will enable us to track human embryogenesis at even higher resolutions in the future.”

The techniques used in this study could be used to improve our understanding of rare diseases caused by abnormalities in embryonic development, and to design new precision diagnostics and treatments for patients.

The research was completed in collaboration with Kyungpook National University Hospital, the Korea Institute of Science and Technology Information, Catholic University of Korea School of Medicine, Genome Insights Inc., and Immune Square Inc. This work was supported by the Suh Kyungbae Foundation, the Ministry of Health and Welfare of Korea, and the National Research Foundation of Korea.

-Publication
Seongyeol Park, Nanda Mali, Ryul Kim et al., ‘Clonal dynamics in early human embryogenesis inferred from somatic mutation’, Nature, Online ahead of print, Aug. 25, 2021 (https://doi.org/10.1038/s41586-021-03786-8)

-Profile
Professor Young Seok Ju
Lab of Cancer Genomics (https://www.julab.kaist.ac.kr/)
Graduate School of Medical Science and Engineering
KAIST
2021.08.31
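The embryonic development article above reports that the two cells of the two-cell stage left unequal numbers of progeny, inferred from mutations shared among single-cell colonies. The article does not describe the analysis pipeline, so the following is only a toy sketch of the counting idea: if each early lineage is tagged by a mutation, the fraction of colonies carrying that mutation approximates the lineage's share of the body. The function name `lineage_shares`, the colony and mutation identifiers, and the counts are all made up for illustration and are not data from the study.

```python
from collections import Counter


def lineage_shares(colonies):
    """Fraction of colonies carrying each early mutation.

    `colonies` maps a colony id to the set of early mutations found in it.
    """
    counts = Counter(m for muts in colonies.values() for m in muts)
    total = len(colonies)
    return {mutation: count / total for mutation, count in counts.items()}


# Hypothetical donor: "m1" tags one cell of the two-cell stage, "m2" the other.
toy_colonies = {f"colony_{i}": {"m1"} for i in range(6)}
toy_colonies.update({f"colony_{i}": {"m2"} for i in range(6, 10)})

shares = lineage_shares(toy_colonies)
print(shares)  # {'m1': 0.6, 'm2': 0.4}
```

In this made-up donor the two lineages split 60:40 rather than 50:50, which is the kind of asymmetry the article describes. A real analysis would also have to handle lineages with no tagging mutation and sequencing errors, which this sketch ignores.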

KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea T.042-350-2114 F.042-350-2210(2220)