Sound plays a crucial role in human experience, communication, and the emotional impact of media. Despite AI's broad advances, generating audio for video that matches the richness of human-created content remains a hard problem. A critical next step for video generation models, whose outputs are otherwise silent, is producing soundtracks for those clips.
Google DeepMind is addressing this by introducing video-to-audio (V2A) technology. It combines video pixels with natural-language text prompts to create immersive audio that is synchronized with the on-screen action. The team evaluated autoregressive and diffusion approaches in search of the most scalable AI architecture, and the diffusion-based approach produced the most convincing results for synchronizing audio with visuals.
The V2A system starts by compressing the input video into an encoded representation. The diffusion model then iteratively refines the audio from random noise, guided by the visual input and natural-language prompts, to produce realistic audio that is synchronized with the action and follows the instructions. The process ends with decoding the audio output, generating a waveform, and combining it with the video data.
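To make that pipeline flow concrete, here is a minimal, illustrative Python sketch of a conditional diffusion loop for video-to-audio generation. Every component (encode_video, encode_text, denoise_step, decode_audio) is a hypothetical stand-in; DeepMind has not published the V2A implementation, so this only mirrors the described stages: compress the video, start from random noise, iteratively refine the audio under visual and text conditioning, then decode a waveform.

```python
import numpy as np

# Placeholder components: the real V2A encoders, diffusion network, and
# decoder are not public, so these stubs only illustrate the data flow.

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Compress the video frames into a conditioning vector (placeholder)."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)

def encode_text(prompt: str) -> np.ndarray:
    """Embed the natural-language prompt (placeholder hashing of bytes)."""
    vec = np.zeros(64)
    for i, ch in enumerate(prompt.encode()):
        vec[i % 64] += ch / 255.0
    return vec

def denoise_step(audio_latent, video_cond, text_cond, t, steps):
    """One reverse-diffusion step (placeholder). A real model would predict
    and remove the noise component conditioned on video and text."""
    guidance = np.tanh(video_cond.mean() + text_cond.mean())
    return audio_latent * (1.0 - 1.0 / (steps - t + 1)) + guidance * 0.01

def decode_audio(audio_latent: np.ndarray, sample_rate=16_000) -> np.ndarray:
    """Turn the refined latent into an audible waveform (placeholder)."""
    return np.interp(np.linspace(0, len(audio_latent) - 1, sample_rate),
                     np.arange(len(audio_latent)), audio_latent)

def generate_audio(frames, prompt, steps=50, latent_len=1024, seed=0):
    rng = np.random.default_rng(seed)
    video_cond = encode_video(frames)                # 1. compress the video input
    text_cond = encode_text(prompt)                  # 2. encode the text prompt
    audio_latent = rng.standard_normal(latent_len)   # 3. start from random noise
    for t in range(steps):                           # 4. iteratively refine the audio
        audio_latent = denoise_step(audio_latent, video_cond, text_cond, t, steps)
    return decode_audio(audio_latent)                # 5. decode to a waveform

waveform = generate_audio(np.zeros((24, 64, 64)), "a wolf howling at the moon")
```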
Before the diffusion loop, V2A encodes the video input and any prompts; once the loop finishes, the refined audio is decoded into a waveform and combined with the video. The training process is strengthened by additional information, such as transcripts of spoken dialogue and AI-generated annotations that describe sounds in detail, which helps the model produce higher-quality audio and learn to generate specific sounds.
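As an illustration of how such training pairs might be organized, the sketch below defines a hypothetical record that bundles a clip with its soundtrack, AI-generated sound annotations, and a dialogue transcript. The field names, paths, and annotations are assumptions, not DeepMind's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class V2ATrainingExample:
    """Hypothetical record for one training pair; layout is an assumption."""
    video_path: str                # source video clip
    audio_path: str                # ground-truth soundtrack for the clip
    sound_annotations: List[str] = field(default_factory=list)  # AI-generated descriptions of audible events
    dialogue_transcript: str = ""  # transcript of spoken dialogue, if any

# Illustrative example (paths and annotations are made up):
example = V2ATrainingExample(
    video_path="clips/rainy_street.mp4",
    audio_path="clips/rainy_street.wav",
    sound_annotations=["rain on pavement", "distant thunder", "passing car"],
    dialogue_transcript="",
)
```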
By training on video, audio, and these added annotations, the technology learns to associate specific audio events with different visual scenes. Pairing V2A with video generation models such as Veo could add dramatic scores, realistic sound effects, and dialogue that match a video's characters and tone.
Furthermore, V2A opens up broad creative opportunities: it can generate multiple soundtracks for a wide variety of footage, including silent films and archival material. Users can steer the output toward desired sounds with a "positive prompt" or away from undesired sounds with a "negative prompt". This flexibility lets users experiment quickly and find the best match for their creative vision.
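A hedged sketch of how such steering could look in code: the steer_conditioning helper below is hypothetical, and the simple "subtract the negative prompt embedding" scheme is an assumption used only to illustrate pushing generation toward wanted sounds and away from unwanted ones.

```python
import numpy as np

def encode_prompt(prompt: str) -> np.ndarray:
    """Placeholder text encoder, same role as encode_text in the earlier sketch."""
    vec = np.zeros(64)
    for i, ch in enumerate(prompt.encode()):
        vec[i % 64] += ch / 255.0
    return vec

def steer_conditioning(positive_prompt: str, negative_prompt: str,
                       weight: float = 0.5) -> np.ndarray:
    """Push the conditioning toward the positive prompt and away from the
    negative one. The subtraction scheme is an illustrative assumption."""
    return encode_prompt(positive_prompt) - weight * encode_prompt(negative_prompt)

# Example: favour score and ambience, avoid dialogue.
conditioning = steer_conditioning(
    positive_prompt="cinematic orchestral score, soft rain ambience",
    negative_prompt="speech, dialogue",
)
# `conditioning` would replace the single text embedding fed into the
# diffusion loop of the earlier sketch.
```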
However, challenges remain. Audio quality depends on the quality of the video input: artefacts or distortions in the video can lead to a noticeable drop in audio quality. For videos with speech, V2A can use input transcripts to generate dialogue that lines up with characters' mouth movements, but the paired video generation model may not be conditioned on those transcripts, and the resulting mismatch can produce awkward lip-syncing. The team is working to correct this and is committed to maintaining high standards while continuously improving the technology.
The team is also inviting input from prominent creators and filmmakers to shape the development of V2A technology, aiming to serve the creative community by meeting its needs and enhancing its work. As part of their commitment to ethical technology use, they have incorporated the SynthID toolkit into the V2A research and watermark all AI-generated content, helping to safeguard it against potential misuse.