Abstract: Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs with a simple visual mapping network such as a linear ...