Abstract: Siamese network based trackers describe object tracking as a similarity matching problem and these trackers achieve state-of-the-art performance on multiple benchmarks. However, due to the ...
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object ...