Abstract: Recently, large multimodal models (LMMs) have been successfully adopted in various fields due to their outstanding adaptability and reasoning abilities. Despite their potential to automate ...
Understanding real-world videos with complex semantics and long temporal dependencies remains a fundamental challenge in computer vision. Recent progress in multimodal large language models (MLLMs) ...
Recent advances in Video Large Language Models (VLLMs) have significantly enhanced their ability to understand video content. Nonetheless, processing long videos remains challenging due to high ...