Abstract: In this paper, we explore the cross-modal adaptation of pre-trained Vision Transformers (ViTs) for the audio-visual domain by incorporating a limited set of trainable parameters. To this end ...