See-2-Sound: How Spatial Audio Has Potential for Clinical Applications
Work led by: Rishit Dagli, CS, University of Toronto
The world of generative AI is constantly expanding, with models now capable of creating high-resolution content across multiple modalities, including images, text, speech, and video. However, one area that has lagged behind is the generation of high-quality spatial audio that complements these visuals. This is where SEE-2-SOUND comes in, a novel approach that generates spatial audio from images, animated images, and videos.
Bridging the Gap Between Visuals and Immersive Audio
SEE-2-SOUND is designed to fill the gap in generating spatial audio, which is crucial for creating truly immersive experiences. Current audio generation models excel at producing natural audio, speech, or music, but they often fall short of integrating the spatial cues needed for realistic sound perception. The ability to pinpoint the location of a sound source is a key element of human perception, and SEE-2-SOUND aims to replicate this in generated audio.
How Does SEE-2-SOUND Work?
The SEE-2-SOUND method breaks the process down into several key stages (a rough code sketch follows the list):
- Source Estimation: The model first identifies regions of interest within the input visual content (image or video). It then estimates the 3D positions of these regions on a viewing sphere. This process includes using a monocular depth map to refine the spatial information.
- Mono Audio Generation: For each identified region of interest, the model generates a mono audio clip using a pre-trained CoDi model. The audio can also be conditioned on a text prompt.
- Spatial Audio Integration: The generated mono audio clips are combined with the spatial information to create a 4D representation for each region. The model then places these sound sources in a virtual room and computes Room Impulse Responses (RIRs) for each source-microphone pair. The microphones are positioned according to the 5.1 channel configuration, ensuring compatibility with common audio systems. This generates a 5.1 surround sound spatial audio output.
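To make the geometric side of this pipeline concrete, here is a minimal sketch of the last two steps: mapping a region's pixel centroid and estimated depth to a position on a viewing sphere, and turning the per-region mono clips into a 5.1 mix via simulated Room Impulse Responses. This is not the authors' implementation; the room size, microphone angles, sphere parameterization, and the `pixel_depth_to_position` and `spatialize` helpers are illustrative assumptions, with pyroomacoustics standing in as the room simulator.

```python
# Minimal sketch of the geometric side of the pipeline (not the authors' code).
# Assumptions: pyroomacoustics as a stand-in room simulator, a simple viewing-
# sphere parameterization, an 8 m x 8 m x 3 m room, and illustrative 5.1 angles.
import numpy as np
import pyroomacoustics as pra

FS = 16000  # sample rate of the generated mono clips (assumed)


def pixel_depth_to_position(u, v, depth, width, height, center, radius=2.0):
    """Map a region's pixel centroid (u, v) and relative monocular depth in
    (0, 1] to a 3D point on a viewing sphere around the listener."""
    azimuth = (u / width - 0.5) * np.pi            # left (-) / right (+)
    elevation = (0.5 - v / height) * (np.pi / 2)   # up (+) / down (-)
    r = radius * max(depth, 0.1)                   # farther objects sit farther out
    return center + r * np.array([
        np.cos(elevation) * np.sin(azimuth),       # x: left-right
        np.cos(elevation) * np.cos(azimuth),       # y: front-back
        np.sin(elevation),                         # z: up-down
    ])


def spatialize(sources, room_dims=(8.0, 8.0, 3.0)):
    """sources: list of (mono_audio, (u, v, depth, width, height)) tuples,
    where mono_audio is a 1-D float array at FS. Returns a (6, n_samples)
    array, one waveform per 5.1 channel."""
    center = np.array(room_dims) / 2.0

    # Six microphones around the listener position, roughly in a 5.1 layout:
    # front-left, front-right, center, LFE (co-located with center here),
    # surround-left, surround-right. The exact angles are an assumption.
    angles = np.deg2rad([30, -30, 0, 0, 110, -110])
    mics = np.stack(
        [center + 0.5 * np.array([np.sin(a), np.cos(a), 0.0]) for a in angles],
        axis=1,  # shape (3, 6), as pyroomacoustics expects
    )

    room = pra.ShoeBox(list(room_dims), fs=FS, max_order=10)
    room.add_microphone_array(pra.MicrophoneArray(mics, FS))
    for mono, (u, v, depth, w, h) in sources:
        pos = pixel_depth_to_position(u, v, depth, w, h, center)
        room.add_source(pos.tolist(), signal=mono)

    # Computes an RIR for every source-microphone pair, then convolves each
    # mono clip with its RIRs and sums the results per channel.
    room.simulate()
    return room.mic_array.signals
```

In the actual method, the regions, depths, and mono clips come from the upstream segmentation, depth, and CoDi models; the sketch only shows how per-region positions and simulated RIRs can be combined into a multi-channel surround mix.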
Zero-Shot Approach
A key advantage of SEE-2-SOUND is that it is a zero-shot approach: it composes pre-trained models rather than being trained end-to-end for the task, so it can generate spatial audio without needing specific training data for every type of visual input. This makes it highly versatile and applicable to a wide range of content, including images from the web, videos generated by models like OpenAI’s Sora, and other dynamic visuals.
Evaluation and Results
Evaluating spatial audio generation is challenging, as there are no direct metrics to measure its quality. Therefore, the researchers employed a combination of methods to assess their approach:
- Human Evaluation: Human evaluators rated the realism, immersion, and accuracy of generated audio when paired with visual content using semantic differential scales. They also performed tasks such as identifying the direction and distance of sounds and matching audio clips to their corresponding images or videos.
- Marginal Scene Guidance: A new evaluation protocol was developed to measure how well the generated audio is guided by the visual scene. This protocol uses another model, AViTAR, to modify audio to match the image, and then assesses the similarity between the modified audio and the original generated audio (a sketch of one way to score this follows below).
The results from these evaluations indicate that SEE-2-SOUND performs well in generating compelling spatial audio.
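As a rough illustration of the second protocol, the snippet below scores how much the scene-adjusted audio differs from the originally generated audio. The sample rate, mel settings, and the `log_mel_distance` helper are assumptions made for illustration; the paper's exact similarity measure may differ.

```python
# Minimal sketch of one plausible similarity score for this protocol (not the
# paper's exact metric): a log-mel spectrogram distance between the generated
# audio and its AViTAR-adjusted counterpart.
import numpy as np
import librosa


def log_mel_distance(generated, adjusted, sr=16000, n_mels=64):
    """Mean squared error between log-mel spectrograms; a small value means
    the acoustic-matching model had little to change, suggesting the generated
    audio already fit the visual scene."""
    n = min(len(generated), len(adjusted))
    mel_g = librosa.feature.melspectrogram(y=generated[:n], sr=sr, n_mels=n_mels)
    mel_a = librosa.feature.melspectrogram(y=adjusted[:n], sr=sr, n_mels=n_mels)
    return float(np.mean((np.log1p(mel_g) - np.log1p(mel_a)) ** 2))
```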
Future Directions and Potential Applications
While SEE-2-SOUND shows promising results, there are several avenues for future improvement:
- Fine Details: The model may miss fine details in images and videos, so not every element of a scene gets a corresponding sound.
- Motion Cues: The model does not currently use motion cues when generating audio; adding motion-aware backbones might improve the results.
- Real-Time Capabilities: The method does not yet run in real time, even on an NVIDIA A100 80 GB GPU. Swapping in faster models for the individual subproblems could bring it closer to real-time performance.
Despite these limitations, the potential applications of SEE-2-SOUND are vast:
- Enhancing Generated Visuals: It can add spatial audio to images and videos generated by AI models, making them more immersive.
- Interactive Real Images: It can make real images interactive through sound.
- Human-Computer Interaction: It can improve human-computer interaction by adding realistic spatial audio cues.
- Accessibility: It can enhance accessibility by providing audio information about visual content.
A Step Towards Complete Generation
SEE-2-SOUND is a step towards truly complete generation, bridging the gap between visual and auditory experiences. By enabling the creation of spatial audio from visual content, it opens up exciting new possibilities for immersive content creation and interaction. To the best of the authors’ knowledge, this approach is the first to generate spatial audio from images and videos. The team hopes to inspire future work that will lead to the generation of truly immersive digital content.
Relevant Links