Abstract: With the increasing social needs, single-modal data have been unable to provide sufficient information for object detection. Reasonably processing multimodal information and fusing specific ...