KAIST Develops Multimodal AI That Understands Text and Images Like Humans
<(From left) M.S. candidate Soyoung Choi, Ph.D. candidate Seong-Hyeon Hwang, Professor Steven Euijong Whang>
Just as human eyes tend to focus on pictures before reading accompanying text, multimodal artificial intelligence (AI), which processes multiple types of sensory data at once, also tends to depend more heavily on certain types of data. KAIST researchers have now developed a multimodal AI training technique that lets models draw on text and images evenly, leading to more accurate predictions.
KAIST (President Kwang Hyung Lee) announced on the 14th that a research team led by Professor Steven Euijong Whang from the School of Electrical Engineering has developed a novel data augmentation method that enables multimodal AI systems—those that must process multiple data types simultaneously—to make balanced use of all input data.
Multimodal AI combines various forms of information, such as text and video, to make judgments. However, AI models often show a tendency to rely excessively on one particular type of data, resulting in degraded prediction performance.
To solve this problem, the research team deliberately trained AI models using mismatched or incongruent data pairs. By doing so, the model learned to rely on all modalities—text, images, and even audio—in a balanced way, regardless of context.
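The mismatched-pair idea can be sketched roughly as follows. This is a minimal illustration, not the MIDAS implementation: the function name `make_misaligned_pairs`, the (text, image, label) tuple layout, and the fixed 50/50 soft label are all assumptions made for this example, and the paper should be consulted for how labels are actually assigned to mismatched samples.

```python
import random


def make_misaligned_pairs(samples, num_classes, seed=0):
    """Build deliberately mismatched (text, image) pairs.

    samples: list of (text, image, label) tuples with integer labels.
    Returns a list of (text, image, soft_label) tuples in which the image
    comes from a different sample with a different label.
    """
    rng = random.Random(seed)
    augmented = []
    for text, _image, text_label in samples:
        # Candidate donors: samples whose label differs from this text's label.
        donors = [s for s in samples if s[2] != text_label]
        if not donors:
            continue  # no mismatched partner available for this sample
        _donor_text, donor_image, image_label = rng.choice(donors)

        # Hypothetical labeling rule: split the target evenly between the
        # class suggested by the text and the class suggested by the image.
        soft_label = [0.0] * num_classes
        soft_label[text_label] += 0.5
        soft_label[image_label] += 0.5
        augmented.append((text, donor_image, soft_label))
    return augmented


if __name__ == "__main__":
    data = [
        ("a photo of a cat", "cat.png", 0),
        ("a photo of a dog", "dog.png", 1),
        ("a photo of a bird", "bird.png", 2),
    ]
    for text, image, soft in make_misaligned_pairs(data, num_classes=3):
        print(text, image, soft)
```

Because neither modality alone accounts for the full target on such a pair, a model that ignores one input cannot fit it well, which is the intuition behind training on incongruent data.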
The team further improved performance stability by incorporating a training strategy that compensates for low-quality data while emphasizing more challenging examples. The method is not tied to any specific model architecture and can be easily applied to various data types, making it highly scalable and practical.
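One common way to emphasize challenging examples is to scale each sample's loss by how poorly the model currently predicts it. The sketch below uses a focal-style weight as a stand-in; it is not the weighting scheme from the paper, it does not cover the low-quality-data compensation, and the names `hard_sample_weight`, `weighted_loss`, and the `gamma` parameter are assumptions for illustration.

```python
import math


def cross_entropy(probs, soft_label):
    """Cross-entropy between predicted probabilities and a (soft) target."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(probs, soft_label))


def hard_sample_weight(probs, soft_label, gamma=2.0):
    """Focal-style weight: near 0 for well-predicted samples, near 1 for hard ones."""
    agreement = sum(p * t for p, t in zip(probs, soft_label))
    return (1.0 - agreement) ** gamma


def weighted_loss(batch_probs, batch_labels, gamma=2.0):
    """Average of per-sample losses, weighted toward harder samples."""
    weights = [hard_sample_weight(p, t, gamma)
               for p, t in zip(batch_probs, batch_labels)]
    losses = [cross_entropy(p, t) for p, t in zip(batch_probs, batch_labels)]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every sample is already predicted perfectly
    return sum(w * l for w, l in zip(weights, losses)) / total


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.2, 0.8]]   # model outputs for two samples
    labels = [[1.0, 0.0], [1.0, 0.0]]  # both samples belong to class 0
    print(weighted_loss(probs, labels))
```

In this toy batch the second sample is badly misclassified, so it dominates the weighted average; a larger `gamma` pushes even more of the loss onto such samples.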
<Model Prediction Changes with a Data-Centric Multimodal AI Training Framework>
Professor Steven Euijong Whang explained, “Improving AI performance is not just about changing model architectures or algorithms—it’s much more important how we design and use the data for training.” He continued, “This research demonstrates that designing and refining the data itself can be an effective approach to help multimodal AI utilize information more evenly, without becoming biased toward a specific modality such as images or text.”
The study was co-led by doctoral student Seong-Hyeon Hwang and master’s student Soyoung Choi, with Professor Steven Euijong Whang serving as the corresponding author. The results will be presented at NeurIPS 2025 (Conference on Neural Information Processing Systems), the world’s premier conference in the field of AI, which will be held this December in San Diego, USA, and Mexico City, Mexico.
※ Paper title: “MIDAS: Misalignment-based Data Augmentation Strategy for Imbalanced Multimodal Learning,” Original paper: https://arxiv.org/pdf/2509.25831
The research was supported by the Institute for Information & Communications Technology Planning & Evaluation (IITP) under the projects “Robust, Fair, and Scalable Data-Centric Continual Learning” (RS-2022-II220157) and “AI Technology for Non-Invasive Near-Infrared-Based Diagnosis and Treatment of Brain Disorders” (RS-2024-00444862).
Approaches to Human-Robot Interaction Using Biosignals
<(From left) Dr. Hwa-young Jeong, Professor Kyung-seo Park, Dr. Yoon-tae Jeong, Dr. Ji-hoon Seo, Professor Min-kyu Je, Professor Jung Kim>
A joint research team led by Professor Jung Kim of the KAIST Department of Mechanical Engineering and Professor Min-kyu Je of the Department of Electrical and Electronic Engineering recently published a review paper in the internationally renowned journal Nature Reviews Electrical Engineering on the latest trends and advances in intuitive Human-Robot Interaction (HRI) using bio-potential and bio-impedance.
This review paper is the result of a collaborative effort by Dr. Kyung-seo Park (DGIST, co-first author), Dr. Hwa-young Jeong (EPFL, co-first author), Dr. Yoon-tae Jeong (IMEC), and Dr. Ji-hoon Seo (UCSD), all doctoral graduates of the two laboratories. Nature Reviews Electrical Engineering is a review journal covering electrical, electronic, and artificial intelligence technology, launched by Nature Publishing Group last year. It is known for inviting world-renowned scholars in the field under strict selection criteria. The paper from Professor Jung Kim's research team, titled "Using bio-potential and bio-impedance for intuitive human-robot interaction," was published on July 18, 2025. (DOI: https://doi.org/10.1038/s44287-025-00191-5)
This review paper explains how biosignals can be used to quickly and accurately detect movement intentions and introduces advancements in movement prediction technology based on neural signals and muscle activity. It also focuses on the crucial role of integrated circuits (ICs) in maximizing low-noise performance and energy efficiency in biosignal sensing, covering the latest development trends in low-noise, low-power designs for accurately measuring bio-potential and impedance signals.
The review emphasizes the importance of hybrid and multi-modal sensing approaches, which open the possibility of building robust, intuitive, and scalable HRI systems. The research team stressed that collaboration between the sensor and IC design fields is essential for putting biosignal-based HRI systems into practical use, and that interdisciplinary collaboration will play a significant role in the development of next-generation HRI technology.
Dr. Hwa-young Jeong, a co-first author of the paper, highlighted the potential of bio-potential and impedance signals to make human-robot interaction more intuitive and efficient, and predicted that they will contribute significantly to biosignal-based HRI technologies such as rehabilitation robots and robotic prostheses.
This research was supported by several research projects, including the Human Plus Project of the National Research Foundation of Korea.