SelVA: Hear What Matters!
Text-conditioned Selective Video-to-Audio Generation
Junwon Lee♯, Juhan Nam♯, Jiyoung Lee♭
♯MAC Lab, KAIST, ♭MMAI Lab, Ewha Womans Univ.
Paper | Code | 🤗 Checkpoints | Benchmark (TBA)
TL;DR
The text prompt serves as an explicit selector of the target sound source, ensuring controllability for compositional workflows.
By extracting intent-focused video features, the text-conditioned video encoder conditions the generator to synthesize only the user-specified sound source (e.g., ‘cat meowing’ vs. ‘dog barking’).
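To make the idea of a text prompt acting as a selector concrete, here is a toy sketch. It is not SelVA's architecture, which this page does not describe. It assumes, purely for illustration, that the text embedding is used as an attention query over per-region video tokens. The function `text_focused_feature` and the 2-D embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_focused_feature(text_emb, video_tokens):
    """Pool video tokens using attention weights derived from the text prompt.

    Hypothetical helper: scaled dot-product attention with the text
    embedding as the single query. This is an assumption for illustration,
    not the method used by SelVA.
    """
    dim = len(text_emb)
    scores = [dot(text_emb, tok) / math.sqrt(dim) for tok in video_tokens]
    weights = softmax(scores)
    pooled = [
        sum(w * tok[i] for w, tok in zip(weights, video_tokens))
        for i in range(dim)
    ]
    return pooled, weights


# Toy embeddings: axis 0 stands for "cat", axis 1 for "dog".
video_tokens = [[4.0, 0.0], [0.0, 4.0]]  # one token per on-screen source
cat_prompt = [1.0, 0.0]  # stands in for 'cat meowing'
dog_prompt = [0.0, 1.0]  # stands in for 'dog barking'

cat_feature, cat_weights = text_focused_feature(cat_prompt, video_tokens)
dog_feature, dog_weights = text_focused_feature(dog_prompt, video_tokens)
```

With the same video tokens, the two prompts put most of the attention weight on different regions, so a downstream generator would receive different conditioning features for each prompt.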
DEMO
Real-world examples
Comparison with SoTAs
Citation
@article{selva,
title={},
author={},
journal={},
year={2025},
publisher={}
}
References
video sources
https://openai.com/index/sora/
https://www.youtube.com/shorts/X_-EDUEiOUw
https://www.youtube.com/watch?v=Le1GEAHnaGo
https://www.youtube.com/shorts/8r-wXBIt95s
https://www.youtube.com/shorts/4neHTD-ak7I
https://openai.com/index/sora-2/