SelVA: Hear What Matters!
Text-conditioned Selective Video-to-Audio Generation

Junwon Lee♯, Juhan Nam♯, Jiyoung Lee♭

♯MAC Lab, KAIST, ♭MMAI Lab, Ewha Womans Univ.


Paper · Code · 🤗 Checkpoints · Benchmark (TBA)



TL;DR The text prompt serves as an explicit selector of the target sound source, ensuring controllability for compositional workflows.
By extracting intent-focused video features, the text-conditioned video encoder conditions the generator to synthesize only the user-specified sound source (e.g., ‘cat meowing’ vs. ‘dog barking’).
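Conceptually, the selection step can be pictured as a text query attending over per-frame video features, so that frames relevant to the prompted sound source dominate the conditioning signal. The following is a minimal numpy sketch of that idea, not the paper's implementation; all function and variable names here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_video_features(video_feats, text_emb):
    """Illustrative cross-attention: a text embedding (the 'selector')
    attends over per-frame video features, producing an intent-focused
    summary that emphasizes frames relevant to the prompted source.

    video_feats: (T, D) array of per-frame features.
    text_emb:    (D,) text prompt embedding.
    Returns (attention weights over frames, pooled feature).
    """
    scores = video_feats @ text_emb / np.sqrt(video_feats.shape[-1])
    weights = softmax(scores)       # (T,) attention over frames
    pooled = weights @ video_feats  # (D,) intent-focused video summary
    return weights, pooled
```

In a full model, the pooled (or per-frame reweighted) features would condition the audio generator, so that only the user-specified source (e.g., 'cat meowing' rather than 'dog barking') is synthesized.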

DEMO

Real-world examples


Comparison with SoTAs


Citation


        @article{selva,
          title={Hear What Matters! Text-conditioned Selective Video-to-Audio Generation},
          author={Lee, Junwon and Nam, Juhan and Lee, Jiyoung},
          journal={arXiv preprint arXiv:2512.02650},
          year={2025}
        }
      
References

Video sources
https://openai.com/index/sora/
https://www.youtube.com/shorts/X_-EDUEiOUw
https://www.youtube.com/watch?v=Le1GEAHnaGo
https://www.youtube.com/shorts/8r-wXBIt95s
https://www.youtube.com/shorts/4neHTD-ak7I
https://openai.com/index/sora-2/