KAIST

NEWS


research

World No. 1 AI Technology for Finding the Exact Moment in a Video

Date: 2025-12-01 Writer: PR Office

< (From left) Professor Joon Hyuk Noh (Assistant Professor, Department of Artificial Intelligence, Ewha Womans University), Seojin Hwan, Yoonki Cho (Ph.D. Candidate), Professor Sung-Eui Yoon (School of Computing, KAIST) >

When faced with a complex question such as 'What object disappeared while the camera was pointing elsewhere?', AI models often rely on language patterns to guess a plausible answer instead of observing what actually happens in the video. To overcome this limitation, a KAIST research team developed a technology that enables the AI to identify the exact critical moment (the 'Trigger moment') within the video on its own, and the team demonstrated its effectiveness by winning an international AI competition with it.

KAIST announced on the 28th that the research team led by Professor Sung-Eui Yoon of the School of Computing, in collaboration with Professor Joon Hyuk Noh's team at Ewha Womans University, took 1st place in the Grounded Video Question Answering track of the Perception Test Challenge held at ICCV 2025, a world-renowned computer vision conference.

The Perception Test Challenge at ICCV 2025 was organized by Google DeepMind with a total prize pool of 50,000 euros (approximately 83 million KRW). It assesses the cognitive and reasoning abilities of multimodal AI, which must jointly understand several kinds of data, including video, audio, and text. The core evaluation criterion is the ability to make judgments based on actual video evidence rather than on language-centric bias.

Unlike conventional methods that analyze the entire video indiscriminately, the team's technology instructs the AI to first locate the core scene (Trigger moment) essential for finding the correct answer. Simply put, it is designed so that the AI determines for itself: "This scene is decisive for answering this question!" The research team calls this framework CORTEX (Chain-of-Reasoning for Trigger Moment Extraction).

The system has a three-stage structure in which three models with different functions operate sequentially.
First, the Reasoning AI (Gemini 2.5 Pro) reasons about which moment is needed to answer the question and proposes candidate Trigger moments. Next, the Grounding Model (Molmo-7B) identifies the location (coordinates) of people, cars, and objects on the screen at the selected moment. Finally, the Tracking Model (SAM2) uses the selected scene as a reference and tracks the objects' movement through the frames before and after it, thereby reducing errors. In short, pinpointing a key scene and tracking the evidence for the answer around that scene significantly reduced problems such as initial misjudgment and occlusion in the video.

In the Grounded Video Question Answering (Grounded VideoQA) track, which had 23 participating teams, the KAIST team SGVR Lab (Scalable Graphics, Vision & Robotics Lab) scored 0.4968 on the HOTA (Higher Order Tracking Accuracy) metric, well ahead of the 2nd-place score of 0.4304 from Columbia University, USA, to secure 1st place. The score is about 1.8 times the previous year's winning score of 0.2704.

This technology has wide-ranging real-world applications. Autonomous vehicles can identify moments of potential accident risk, robots can understand their surroundings more intelligently, security and surveillance systems can rapidly locate critical scenes, and media analysis can track the actions of people or objects in chronological order.

This is a core technology that enables AI to judge based on "actual evidence in the video." The ability to pinpoint how objects behave over time in a video is expected to greatly expand the application of AI in real-world scenarios.

< Pipeline image of the grounding framework for video question answering proposed by the research team >
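
For readers who want a concrete picture of the three-stage flow, the following is a minimal Python sketch, not the team's implementation. It only shows how a reasoning stage, a grounding stage, and a tracking stage could be chained around a trigger moment. Every name here (`TriggerMoment`, `answer_with_grounding`, and the `toy_*` stand-ins) is hypothetical, the real models (Gemini 2.5 Pro, Molmo-7B, SAM2) are replaced by placeholder functions, and the sketch assumes a single trigger moment and a single point coordinate for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinate; an assumed representation


@dataclass
class TriggerMoment:
    """Hypothetical container for the frame the reasoning stage selects."""
    frame_index: int
    target: str  # text description of the object to locate


# Assumed stage signatures; the real systems expose different interfaces.
Reasoner = Callable[[str, int], TriggerMoment]           # question, num_frames
Grounder = Callable[[int, str], Point]                   # frame index, target
Tracker = Callable[[int, Point, int], Dict[int, Point]]  # anchor, point, num_frames


def answer_with_grounding(question: str, num_frames: int,
                          reason: Reasoner, ground: Grounder,
                          track: Tracker) -> Dict[int, Point]:
    """Chain the three stages: pick a moment, locate the object, track it."""
    moment = reason(question, num_frames)
    if not 0 <= moment.frame_index < num_frames:
        raise ValueError("trigger moment lies outside the video")
    point = ground(moment.frame_index, moment.target)
    # Tracking starts from the trigger frame and covers frames before and after it.
    return track(moment.frame_index, point, num_frames)


# Placeholder stages so the sketch runs without any model.
def toy_reasoner(question: str, num_frames: int) -> TriggerMoment:
    return TriggerMoment(frame_index=num_frames // 2, target="cup")


def toy_grounder(frame_index: int, target: str) -> Point:
    return (100.0, 50.0)


def toy_tracker(anchor: int, point: Point, num_frames: int) -> Dict[int, Point]:
    # Pretend the object drifts one pixel to the right per frame.
    x, y = point
    return {t: (x + (t - anchor), y) for t in range(num_frames)}


track = answer_with_grounding("What object disappeared?", 5,
                              toy_reasoner, toy_grounder, toy_tracker)
print(track[2])  # position at the trigger frame
```

Passing the stages in as plain functions keeps the sketch independent of any particular model API; swapping in real models would mean writing adapters with these assumed signatures.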

This research was presented on October 19th at the 3rd Perception Test Challenge, held at ICCV 2025. The work was supported by the Ministry of Science and ICT's Basic Research Program (Mid-Career Researcher), the SW Star Lab Project 'Development of Perception, Action, and Interaction Algorithms for Open-World Robot Services,' and the AGI Project 'Reality Construction and Bi-directional Capability Approach based on Cognitive Agents for Embodied AGI.'

