VirConv-L: A lightweight multimodal 3D detector based on Virtual Sparse Convolution. VirConv-T: An improved multimodal 3D detector based on Virtual Sparse Convolution and a transformed refinement scheme ...
Abstract: Currently prevalent multi-modal 3D detection methods rely on dense detectors that usually use dense Bird’s-Eye-View (BEV) feature maps. However, the cost of such BEV feature maps is ...
Abstract: Human perception is multimodal and can comprehend a mixture of vision, natural language, speech, etc. Multimodal Transformer (MulT, Fig. 16.1.1) models introduce a cross-modal attention ...