SelVA: Hear What Matters!
Text-conditioned Selective Video-to-Audio Generation

Junwon Lee♯, Juhan Nam♯, Jiyoung Lee♭

♯MAC Lab, KAIST, ♭MMAI Lab, Ewha Womans Univ.


Paper · Code · 🤗 Checkpoints · Benchmark (TBA)


SelVA

TL;DR: The text prompt acts as an explicit selector of the target sound source, giving users fine-grained control in compositional workflows.
A text-conditioned video encoder extracts intent-focused video features and conditions the generator to synthesize only the user-specified sound source (e.g., ‘cat meowing’ vs. ‘dog barking’).
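To make the idea concrete, here is a minimal, hypothetical sketch of a text-conditioned video encoder in which video tokens cross-attend to the prompt tokens; the class name, dimensions, and generator interface are illustrative assumptions, not the SelVA implementation.

```python
import torch
import torch.nn as nn


class TextConditionedVideoEncoder(nn.Module):
    """Illustrative sketch: video tokens attend to text tokens so the encoder
    emphasizes content related to the sound source named in the prompt."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the video; keys/values come from the text prompt,
        # so the attention weights select prompt-relevant video content.
        attended, _ = self.cross_attn(video_tokens, text_tokens, text_tokens)
        return self.norm(video_tokens + attended)


# Toy usage with random features standing in for real video/text embeddings.
encoder = TextConditionedVideoEncoder()
video_tokens = torch.randn(1, 64, 512)   # e.g., 64 frame/patch tokens
text_tokens = torch.randn(1, 8, 512)     # e.g., tokens for "cat meowing"
cond = encoder(video_tokens, text_tokens)  # (1, 64, 512) intent-focused features
# `cond` would then condition an audio generator (e.g., a diffusion model).
```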

DEMO

Real-world examples


Comparison with state-of-the-art methods


Citation


        @article{selva,
          title={},
          author={},
          journal={},
          year={2025},
          publisher={}
        }
      
References

Video sources
https://openai.com/index/sora/
https://www.youtube.com/shorts/X_-EDUEiOUw
https://www.youtube.com/watch?v=Le1GEAHnaGo
https://www.youtube.com/shorts/8r-wXBIt95s
https://www.youtube.com/shorts/4neHTD-ak7I
https://openai.com/index/sora-2/