SelVA: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation


Junwon Lee♯, Juhan Nam♯, Jiyoung Lee♭
♯MAC Lab, KAIST, ♭MMAI Lab, Ewha Womans Univ.


TL;DR: The text prompt acts as an explicit selector of the target sound source, which makes generation controllable in compositional workflows.
The text-conditioned video encoder extracts intent-focused video features and uses them to condition the generator, so that it synthesizes only the user-specified sound source (e.g., ‘cat meowing’ vs. ‘dog barking’).
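
To make the idea of a text prompt acting as a selector concrete, here is a toy sketch in NumPy. It is not the SelVA architecture: the function `text_selected_video_features`, the dot-product attention pooling, and the synthetic "cat" and "dog" feature directions are all assumptions chosen for illustration. It only shows how a text embedding can re-weight per-frame video patch features so that the pooled features follow the prompted source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_selected_video_features(video_feats, text_emb):
    """Hypothetical helper: pool patches by similarity to the prompt.

    video_feats: (frames, patches, dim) patch features per frame
    text_emb:    (dim,) embedding of the text prompt
    returns:     pooled features (frames, dim), weights (frames, patches)
    """
    dim = video_feats.shape[-1]
    scores = video_feats @ text_emb / np.sqrt(dim)    # (frames, patches)
    weights = softmax(scores, axis=-1)                # attention over patches
    pooled = np.einsum("fp,fpd->fd", weights, video_feats)
    return pooled, weights

# Synthetic scene (assumption): patch 0 shows a "cat", patch 3 shows a "dog".
rng = np.random.default_rng(0)
frames, patches, dim = 4, 6, 8
cat_dir = np.zeros(dim); cat_dir[0] = 1.0
dog_dir = np.zeros(dim); dog_dir[1] = 1.0
video = 0.01 * rng.standard_normal((frames, patches, dim))
video[:, 0] += 5.0 * cat_dir
video[:, 3] += 5.0 * dog_dir

# Stand-ins for the text embeddings of 'cat meowing' and 'dog barking'.
pooled_cat, w_cat = text_selected_video_features(video, 5.0 * cat_dir)
pooled_dog, w_dog = text_selected_video_features(video, 5.0 * dog_dir)

print(w_cat.argmax(axis=-1))  # patch index attended most, per frame
print(w_dog.argmax(axis=-1))
```

In this toy setup the same video yields different pooled features depending on the prompt, and a downstream generator conditioned on them would receive evidence for only one of the two sources.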

DEMO

Real-world examples

Comparison with SoTAs

Text prompt	G.T.	MMAudio	VOS+MMAudio	SelVA (ours)
*dog barking*
*bird chirping*
*baby crying*
*child singing*
*playing accordion*
*playing harmonica*
*waterfall burbling*
*underwater bubbling*
*fireworks banging*
*machine gun shooting*
*church bell ringing*
*basketball bounce*
*skateboarding*
*chainsawing trees*
*typing keyboard*
*driving buses*
*car passing by*

Citation


@article{selva,
  title={Hear What Matters! Text-conditioned Selective Video-to-Audio Generation},
  author={Lee, Junwon and Nam, Juhan and Lee, Jiyoung},
  journal={arXiv preprint arXiv:2512.02650},
  year={2025}
}

References

video sources
https://openai.com/index/sora/
https://www.youtube.com/shorts/X_-EDUEiOUw
https://www.youtube.com/watch?v=Le1GEAHnaGo
https://www.youtube.com/shorts/8r-wXBIt95s
https://www.youtube.com/shorts/4neHTD-ak7I
https://openai.com/index/sora-2/

1 "bird squawking"	"stream burbling"
2 "cat meowing growling"	"robot vacuum cleaner cleaning floors"
3 "cat meowing hissing"	"dog barking bow-wow growling"
4 "car passing by"	"raining"
5 "car passing by"	"people running"
6 "firework banging"	"human speech, people speaking"
