Vision Language Model OpenCV

Google’s Gemini Omni turns images, audio, and text into video — and that’s just the start

Google's Gemini Omni is a new multimodal model that reasons across text, images, audio, and video to generate and edit videos ...

Semiconductor Engineering

Vision-Language-Action Models Arrive

A vision-language-action model is an end-to-end neural network that takes sensor inputs—camera images, joint positions, ...

Xiaomi open-sources OneVL, a model for autonomous driving

Xiaomi has open-sourced OneVL, a new AI framework that combines VLA, world models, and latent space reasoning for autonomous driving.

EurekAlert!

Novel vision-language model to support diagnosis using computed tomography scans

Lung cancer diagnosis relies heavily on interpreting complex computed tomography (CT) images, where accuracy can vary ...

Morning Overview on MSN

Google’s Gemma 4 ships with 256K context, native vision and audio, and 140+ languages under an Apache 2.0 license

Google released Gemma 4 in early April 2026, and the most striking thing about it isn’t the 256K-token context window or the native audio processing. It’s the license. The entire family of open-weight ...

IEEE

Soccer-CLIP: Vision Language Model for Soccer Action Spotting

Abstract: In the rapidly advancing field of computer vision, the application of multimodal models—specifically, vision-language frameworks—has shown substantial promise for complex tasks such as video ...

9to5Mac

Apple researchers unveil LGTM, a potential boost for Apple Vision Pro graphics

A team of Apple researchers has developed a new framework that enables high-resolution 3D scene rendering with far greater efficiency. Here are the details of the new study. In a new study titled Less ...

SiliconANGLE

Microsoft open-sources multimodal reasoning model with 15B parameters

Microsoft Corp. today released a hardware-efficient reasoning model, Phi-4-reasoning-vision-15B, that can process multimodal files such as scientific charts. The model is based on two existing ...

VentureBeat

Microsoft built Phi-4-reasoning-vision-15B to know when to think — and when thinking is a waste of time

Microsoft on Tuesday released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that the company says matches or exceeds the performance of systems many times its size — while ...

Microsoft

Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our ...

The Robot Report

Vision-language-action models are the next leap in autonomous robotics

Robotics has traditionally used modular pipelines. Perception, planning, and control sit in separate systems and connect through hand-tuned interfaces. This approach works for simple, well-defined ...
