SelVA: Hear What Matters!
Text-conditioned Selective Video-to-Audio Generation
Junwon Lee♯, Juhan Nam♯, Jiyoung Lee♭
♯MAC Lab, KAIST, ♭MMAI Lab, Ewha Womans Univ.
Paper · Code · 🤗 Checkpoints · Benchmark (TBA)
TL;DR
The text prompt serves as an explicit selector of the target sound source, ensuring controllability for compositional workflows.
By extracting intent-focused video features, the text-conditioned video encoder conditions the generator to synthesize only the user-specified sound source (e.g., ‘cat meowing’ vs. ‘dog barking’).
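One way to picture the selection mechanism described above is text-query cross-attention pooling over per-frame video features: the prompt embedding attends over the video, so frames relevant to the target source dominate the conditioning vector passed to the audio generator. This is a minimal illustrative sketch, not the paper's exact architecture; the function names and the single-query pooling choice are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_pool(video_feats, text_query):
    """Pool per-frame video features under a text query.

    video_feats: (T, d) array of per-frame features.
    text_query:  (d,) prompt embedding (e.g., for 'cat meowing').
    Returns a (d,) conditioning vector weighted toward frames
    that match the prompt.
    """
    d = video_feats.shape[-1]
    scores = video_feats @ text_query / np.sqrt(d)   # (T,) relevance per frame
    weights = softmax(scores)                        # (T,) attention weights
    return weights @ video_feats                     # (d,) pooled conditioning

# Toy usage: two different prompts select different frames of the same video.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))        # 8 frames, 16-dim features
cat_query = rng.normal(size=16)         # hypothetical 'cat meowing' embedding
dog_query = rng.normal(size=16)         # hypothetical 'dog barking' embedding
cond_cat = text_conditioned_pool(video, cat_query)
cond_dog = text_conditioned_pool(video, dog_query)
```

Because the pooled vector depends on the prompt, the same video yields different conditioning (and hence different synthesized audio) for 'cat meowing' vs. 'dog barking'.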
DEMO
Real-world examples
Comparison with state-of-the-art models
Citation
@article{selva,
  title={Hear What Matters! Text-conditioned Selective Video-to-Audio Generation},
  author={Lee, Junwon and Nam, Juhan and Lee, Jiyoung},
  journal={arXiv preprint arXiv:2512.02650},
  year={2025}
}
References
Video sources
https://openai.com/index/sora/
https://www.youtube.com/shorts/X_-EDUEiOUw
https://www.youtube.com/watch?v=Le1GEAHnaGo
https://www.youtube.com/shorts/8r-wXBIt95s
https://www.youtube.com/shorts/4neHTD-ak7I
https://openai.com/index/sora-2/