A new community-driven initiative evaluates large language models using Italian-native tasks, with AI translation among the ...
The first evaluation of the "National Representative AI" revealed that, alongside common benchmarks, individual benchmarks selected by each company were introduced as criteria for ...
Beijing-based Ubiquant launches code-focused systems claiming benchmark wins over US peers despite using far fewer parameters ...
This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...
Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...
Large language models frequently misrepresent verbal risk terms used in medicine, potentially amplifying patient misunderstandings and diverging from established clinical definitions, according to a ...
The testing sparked internal frustration about the progress of the Llama models. Yann LeCun, Meta’s outgoing chief AI ...
Joining the ranks of a growing number of smaller, powerful reasoning models is MiroThinker 1.5 from MiroMind, with just 30 ...
Open-weight LLMs can unlock significant strategic advantages, delivering customization and independence in an increasingly AI ...