
Researchers from MIT and the MIT-IBM Watson AI Lab have introduced an efficient method to train machine-learning models to identify specific actions in videos, making use of the videos' automatically generated transcripts. The method performs spatio-temporal grounding, which helps the model understand a video in fine detail by analysing it from two perspectives: spatial information (where objects are located) and temporal information (when actions occur).
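At inference time, a grounding model of this kind can be thought of as taking video features and a text query and returning both a per-frame relevance score (when) and a per-region spatial map (where). The sketch below illustrates that interface only; the module names, feature sizes, and fusion step are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Toy head that scores 'when' (per frame) and 'where' (per region) for a text query."""
    def __init__(self, dim=512):
        super().__init__()
        self.temporal_scorer = nn.Linear(dim, 1)   # one relevance score per frame
        self.spatial_proj = nn.Linear(dim, dim)    # project the query into region-feature space

    def forward(self, frame_feats, region_feats, query_feat):
        # frame_feats:  (T, dim)     one pooled feature per frame
        # region_feats: (T, R, dim)  R patch/region features per frame
        # query_feat:   (dim,)       embedding of a text query, e.g. "flip the pancake"
        fused = frame_feats * query_feat                      # simple feature fusion
        when = self.temporal_scorer(fused).squeeze(-1)        # (T,)   temporal relevance
        where = torch.softmax(region_feats @ self.spatial_proj(query_feat), dim=-1)  # (T, R)
        return when, where

head = GroundingHead()
when, where = head(torch.randn(64, 512), torch.randn(64, 49, 512), torch.randn(512))
print(when.shape, where.shape)   # torch.Size([64]) torch.Size([64, 49])
```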

This technique significantly improves the accuracy of action identification in longer videos that contain multiple activities. Training on spatial and temporal information simultaneously also makes the model better at recognising each individually. The researchers suggest the approach could be useful in health care settings, for example by quickly locating key moments in videos of diagnostic procedures.

Traditionally, researchers have trained models to carry out spatio-temporal grounding using videos in which humans have annotated the start and end of particular tasks. Generating this data is expensive, and it can be hard for annotators to decide exactly which actions to label.

To overcome this problem, the researchers used freely available, unlabeled instructional videos and their accompanying text transcripts, drawn from platforms like YouTube. Training is split into two parts: the model learns a global representation, capturing the overall video and the order in which actions occur, and a local representation, focusing on the specific action being performed. An additional component of the framework reconciles misalignments between the narration and what is shown in the video.
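As a rough illustration of this two-level setup, the sketch below pairs a global contrastive loss (matching whole videos with whole transcripts across a batch) with a local loss that matches short clips against individual transcript sentences. For the misalignment issue it uses a MIL-NCE-style objective, which treats any sentence spoken near a clip as a positive; this is a common technique from the literature and stands in for whatever mechanism the paper actually uses. The encoders, shapes, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def global_loss(video_emb, transcript_emb, temperature=0.07):
    # video_emb, transcript_emb: (B, dim), one embedding per whole video / whole transcript.
    # Standard InfoNCE: matching video/transcript pairs sit on the diagonal.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(transcript_emb, dim=-1)
    logits = v @ t.T / temperature
    return F.cross_entropy(logits, torch.arange(len(v)))

def local_loss(clip_emb, sentence_emb, pos_mask, temperature=0.07):
    # clip_emb:     (T, dim) features for short clips of one long video
    # sentence_emb: (S, dim) features for the transcript's sentences
    # pos_mask:     (T, S) bool, True where a sentence was spoken near a clip.
    # Counting every nearby sentence as a positive (MIL-NCE style) is one way to
    # tolerate narration that is slightly out of step with the visuals.
    c = F.normalize(clip_emb, dim=-1)
    s = F.normalize(sentence_emb, dim=-1)
    sim = c @ s.T / temperature
    pos = torch.logsumexp(sim.masked_fill(~pos_mask, float("-inf")), dim=-1)
    return (torch.logsumexp(sim, dim=-1) - pos).mean()

# Toy usage with random features standing in for encoder outputs.
T, S, dim = 64, 12, 512
pos_mask = torch.zeros(T, S, dtype=torch.bool)
pos_mask[torch.arange(T), torch.randint(0, S, (T,))] = True   # each clip gets one toy positive
loss = global_loss(torch.randn(8, dim), torch.randn(8, dim)) \
     + local_loss(torch.randn(T, dim), torch.randn(S, dim), pos_mask)
print(float(loss))
```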

The team aimed for a more realistic setting and focused on uncut videos that are several minutes long. Because existing benchmarks were poorly suited to testing a model on such long, unedited videos, they built a new benchmark for this purpose in a way that keeps human labor and costs low. On this benchmark, their approach located actions more precisely than other AI techniques and was better at focusing on human-object interactions.
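Precision at locating actions in long videos is commonly measured with temporal intersection over union (IoU) between predicted and ground-truth time intervals. The short sketch below shows that metric in generic form; it is not the benchmark's official evaluation code.

```python
def temporal_iou(pred, gt):
    # pred, gt: (start_seconds, end_seconds) intervals within a long video.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold=0.5):
    # Fraction of ground-truth actions whose predicted interval overlaps enough.
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

print(recall_at_iou([(12.0, 18.5), (40.0, 55.0)], [(11.0, 19.0), (70.0, 80.0)]))  # 0.5
```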

In future work, the researchers aim to extend their approach to automatically detect when the text and narration are out of sync. They also plan to expand the framework to audio data, exploring correlations between actions and the sounds that objects generate. The research is funded, in part, by the MIT-IBM Watson AI Lab.
