benchmark
30 articles about benchmarks in AI news
Claude Haiku 4.5 Costs $10.21 to Breach, 10x Harder Than Rivals in ACE Benchmark
Fabraix's ACE benchmark measures the dollar cost to break AI agents. Claude Haiku 4.5 required a mean adversarial cost of $10.21, making it 10x more resistant than the next best model, GPT-5.4 Nano ($1.15).
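How a dollar-denominated breach metric works is worth unpacking. Below is a minimal Python sketch under the assumption (ours, not Fabraix's published definition) that "mean adversarial cost" is the average spend on attack queries up to and including the first successful breach:

```python
# Hypothetical sketch of a "mean adversarial cost" metric. The definition
# below is an assumption for illustration, not Fabraix's documented method.

def mean_adversarial_cost(runs: list[list[tuple[float, bool]]]) -> float:
    """Each run is a sequence of (query_cost_usd, breached) attack attempts;
    a run's cost is the total spend up to its first successful breach."""
    per_run_costs = []
    for run in runs:
        spent = 0.0
        for cost_usd, breached in run:
            spent += cost_usd
            if breached:
                break
        per_run_costs.append(spent)
    return sum(per_run_costs) / len(per_run_costs)

# Three attack runs against one model, each ending in a breach:
runs = [
    [(0.40, False), (0.35, False), (0.50, True)],  # $1.25
    [(0.30, True)],                                # $0.30
    [(0.25, False), (0.20, True)],                 # $0.45
]
print(f"${mean_adversarial_cost(runs):.2f}")  # $0.67
```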
MIT and Anthropic Release New Benchmark Revealing AI Coding Limitations
Researchers from MIT and Anthropic have developed a new benchmark that systematically identifies significant limitations in current AI coding assistants. The benchmark reveals specific categories of coding tasks where large language models consistently fail, providing concrete data on their weaknesses.
QUMPHY Project's D4 Report Establishes Six Benchmark Problems and Datasets for ML on PPG Signals
A new report from the EU-funded QUMPHY project establishes six benchmark problems and associated datasets for evaluating machine and deep learning methods on photoplethysmography (PPG) signals. This standardization effort is a foundational step for quantifying uncertainty in medical AI applications.
From BM25 to Corrective RAG: A Benchmark Study Challenges the Dominance of Semantic Search for Tabular Data
A systematic benchmark of 10 RAG retrieval strategies on a financial QA dataset reveals that a two-stage hybrid + reranking pipeline performs best. Crucially, the classic BM25 algorithm outperformed modern dense retrieval models, challenging a core assumption in semantic search. The findings provide actionable, cost-aware guidance for building retrieval systems over heterogeneous documents.
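The winning pattern is simple to prototype. Here is a minimal sketch of the first stage of such a two-stage hybrid + reranking pipeline, assuming pre-tokenized documents and precomputed embeddings; rank_bm25 is a real package (pip install rank-bm25), while the data layout and blending weights are illustrative, not the study's code:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # classic sparse lexical retrieval

def hybrid_retrieve(query_tokens, query_emb, corpus_tokens, corpus_embs,
                    k=20, alpha=0.5):
    """Stage 1: blend normalized BM25 and dense cosine scores, return the
    indices of the top-k candidate documents."""
    sparse = np.asarray(BM25Okapi(corpus_tokens).get_scores(query_tokens))
    dense = corpus_embs @ query_emb / (
        np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    # Min-max normalize so the two score scales are comparable before blending.
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-9)
    blended = alpha * norm(sparse) + (1 - alpha) * norm(dense)
    return np.argsort(blended)[::-1][:k]

# Stage 2 (not shown) would rescore these k candidates with a cross-encoder
# reranker and return them sorted by the reranker's relevance scores.
```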
Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test
A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.
PhAIL: Open Benchmark for Robot AI on Real Hardware Shows Best Model at 5% of Human Throughput
Researchers have launched PhAIL (phail.ai), an open benchmark for evaluating robot AI systems on real hardware using the DROID platform, with the best-performing model achieving only 5% of human throughput and requiring intervention every 4 minutes.
Agentic AI Systems Failing in Production: New Research Reveals Benchmark Gaps
New research reveals that agentic AI systems are failing in production environments in ways not captured by current benchmarks, including alignment drift and context loss during handoffs between agents.
OmniSch Benchmark Exposes Major Gaps in LMMs for PCB Schematic Understanding
Researchers introduced OmniSch, a benchmark of 1,854 real PCB schematics, to evaluate large multimodal models (LMMs) on converting schematic diagrams to netlist graphs. Results show that current models exhibit unreliable grounding, brittle parsing, and inconsistent connectivity reasoning on engineering artifacts.
Agent Psychometrics: New Framework Predicts Task-Level Success in Agentic Coding Benchmarks with 0.81 AUC
A new research paper introduces a framework using Item Response Theory and task features to predict success on individual agentic coding tasks, achieving 0.81 AUC. This enables benchmark designers to calibrate difficulty without expensive evaluations.
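As background, Item Response Theory models the chance that a given agent solves a given task from latent ability and difficulty parameters. A minimal sketch of the textbook two-parameter logistic (2PL) form, not necessarily the paper's exact parameterization:

```python
import math

def p_success(ability: float, difficulty: float,
              discrimination: float = 1.0) -> float:
    """2PL IRT: probability that an agent with latent `ability` solves a task
    of latent `difficulty`; `discrimination` sets how sharply the probability
    changes as ability crosses difficulty."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A task one unit above the agent's ability is solved ~27% of the time:
print(round(p_success(ability=0.0, difficulty=1.0), 2))  # 0.27
```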
arXiv Paper Proposes 'Connections' Word Game as New Benchmark for AI Agent Social Intelligence
A new arXiv preprint introduces the improvisational word game 'Connections' as a benchmark for evaluating social intelligence in AI agents. It requires agents to gauge the cognitive states of others, testing collaborative reasoning beyond individual knowledge retrieval.
QAsk-Nav Benchmark Enables Separate Scoring of Navigation and Dialogue for Collaborative AI Agents
A new benchmark called QAsk-Nav enables separate evaluation of navigation and question-asking for collaborative embodied AI agents. The accompanying Light-CoNav model outperforms state-of-the-art methods while being significantly more efficient.
Emergence WebVoyager: A New Benchmark Exposes Inconsistencies in Web Agent Evaluation
A new study introduces Emergence WebVoyager, a standardized benchmark for evaluating web-based AI agents. It reveals significant performance inconsistencies, showing OpenAI Operator's success rate is 68.6%, not 87%. This highlights a critical need for rigorous, transparent testing in agent development.
New Benchmark and Methods Target Few-Shot Text-to-Image Retrieval for Complex Queries
Researchers introduce FSIR-BD, a benchmark for few-shot text-to-image retrieval, and two optimization methods to improve performance on compositional and out-of-distribution queries. This addresses a key weakness in pre-trained vision-language models.
ReCUBE Benchmark Reveals GPT-5 Scores Only 37.6% on Repository-Level Code Generation
Researchers introduce ReCUBE, a benchmark isolating LLMs' ability to use repository-wide context for code generation. GPT-5 achieves just a 37.57% strict pass rate, showing the task remains highly challenging.
GUIDE: A New Benchmark Reveals AI's Struggle to Understand User Intent in GUI Software
Researchers introduce GUIDE, a benchmark for evaluating AI's ability to understand user behavior and intent in open-ended GUI tasks. Across 10 software applications, state-of-the-art models struggled, highlighting a critical gap between automation and true collaborative assistance.
MemoryCD: New Benchmark Tests LLM Agents on Real-World, Lifelong User Memory for Personalization
Researchers introduce MemoryCD, the first large-scale benchmark for evaluating LLM agents' long-context memory using real Amazon user data across 12 domains. It reveals current methods are far from satisfactory for lifelong personalization.
RealChart2Code Benchmark Exposes Major Weakness in Vision-Language Models for Complex Data Visualization
A new benchmark reveals state-of-the-art Vision-Language Models struggle to generate code for complex, multi-panel charts from real-world data. Proprietary models outperform open-weight ones, but all show significant degradation versus simpler tasks.
Frontier AI Models Reportedly Score Below 1% on ARC-AGI-3 Benchmark
A social media post claims frontier AI models score below 1% on the ARC-AGI-3 benchmark, suggesting a possible ceiling for current scaling approaches. No specific models or scores were disclosed.
Ego2Web Benchmark Bridges Egocentric Video and Web Agents, Exposing Major Performance Gaps
Researchers introduce Ego2Web, the first benchmark requiring AI agents to understand real-world first-person video and execute related web tasks. Their novel Ego2WebJudge evaluation method achieves 84% human agreement, while state-of-the-art agents perform poorly across all task categories.
Tessera Launches Open-Source Framework for 32 OWASP AI Security Tests, Benchmarks GPT-4o, Claude, Gemini, Llama 3
Tessera introduces the first open-source framework to run all 32 OWASP AI security tests against any model with a single CLI command. It provides benchmark results for GPT-4o, Claude, Gemini, Llama 3, and Mistral across the 21 of those tests that are model-specific.
ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%
Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model achieved only 37%.
ItinBench Benchmark Reveals LLMs Struggle with Multi-Dimensional Planning, Scoring Below 50% on Combined Tasks
Researchers introduced ItinBench, a benchmark testing LLMs on trip planning requiring simultaneous verbal and spatial reasoning. Models like GPT-4o and Gemini 1.5 Pro showed inconsistent performance, highlighting a gap in integrated cognitive capabilities.
Gastric-X: New 1.7K-Case Multimodal Benchmark Challenges VLMs on Realistic Gastric Cancer Diagnosis Workflow
Researchers introduce Gastric-X, a comprehensive multimodal benchmark of 1.7K gastric cancer cases spanning CT scans, endoscopy, lab data, and expert notes. It evaluates VLMs on five clinical tasks to test whether they can correlate biochemical signals with tumor features as physicians do.
Nobody Warns You About Eval Drift: 7 Ways Benchmarks Rot
A critical examination of how AI evaluation benchmarks degrade over time, losing their ability to reflect real-world performance. This 'eval drift' poses a silent risk to any team relying on static metrics for model validation and deployment decisions.
RedNote's 3B-Parameter Multimodal OCR Model Ranks Second to Gemini 3 Pro on Document Parsing Benchmarks
RedNote has released a 3-billion parameter multimodal OCR model that converts text, charts, diagrams, and tables into structured formats like Markdown and HTML. It reportedly ranks second only to Google's Gemini 3 Pro on OCR benchmarks.
Research Identifies 'Giant Blind Spot' in AI Scaling: Models Improve on Benchmarks Without Understanding
A new research paper argues that current AI scaling approaches have a fundamental flaw: models improve on narrow benchmarks without developing genuine understanding, creating a 'giant blind spot' in progress measurement.
ARC-AGI-3 AI Benchmark Launch Announced for Next Week
The ARC-AGI-3 benchmark for evaluating advanced AI reasoning is launching next week. The announcement has sparked speculation about Google's potential performance.
Health AI Benchmarks Show 'Validity Gap': 0.6% of Queries Use Raw Medical Records, 5.5% Cover Chronic Care
Analysis of 18,707 health queries across six public benchmarks reveals a structural misalignment with clinical reality. Benchmarks over-index on wellness data (17.7%) while under-representing lab values (5.2%), imaging (3.8%), and safety-critical scenarios.
DEAF Benchmark Reveals Audio MLLMs Rely on Text, Not Sound, Scoring Below 50% on Acoustic Faithfulness
Researchers introduce DEAF, a 2,700-stimulus benchmark testing Audio MLLMs' acoustic processing. Evaluation of seven models shows a consistent pattern of text dominance, with models scoring below 50% on acoustic faithfulness metrics.
Cursor Launches Composer 2 with $0.50/M Input Token Pricing, Claims Major Benchmark Gains
Cursor has released Composer 2, a coding AI model priced at $0.50 per million input tokens and $2.50 per million output tokens. The company reports significant benchmark improvements over previous versions across CursorBench, Terminal-Bench 2.0, and SWE-bench Multilingual.
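At those rates, per-request cost is simple arithmetic; a quick back-of-envelope calculator (the token counts in the example are hypothetical):

```python
# Composer 2 list prices quoted above, in USD per million tokens.
INPUT_PER_M, OUTPUT_PER_M = 0.50, 2.50

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. a 100k-token prompt producing a 5k-token completion:
print(f"${request_cost_usd(100_000, 5_000):.4f}")  # $0.0625
```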