AI Research

DISCO-TAB: Hierarchical RL Framework Boosts Clinical Data Synthesis by 38.2%, Achieves JSD < 0.01

86

Researchers propose DISCO-TAB, a reinforcement learning framework that guides a fine-tuned LLM with multi-granular feedback to generate synthetic clin...

arxiv.org·7h ago·3 min read·Multi-Source

generative-ai·research·machine-learning

QUMPHY Project's D4 Report Establishes Six Benchmark Problems and Datasets for ML on PPG Signals

81

A new report from the EU-funded QUMPHY project establishes six benchmark problems and associated datasets for evaluating machine and deep learning met...

arxiv.org·7h ago·3 min read·Multi-Source

research·benchmarks·arxiv

Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test

85

A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering doub...

x.com·8h ago·3 min read

hardware·apple silicon·intel

Frontier AI Models Resist Prompt Injection Attacks in Grading, New Study Finds

85

A new study finds that while hidden AI prompts can successfully bias older and smaller LLMs used for grading, most frontier models (GPT-4, Claude 3) a...

x.com·16h ago·3 min read

ai security·research·large language models

Anthropic Discovers Claude's Internal 'Emotion Vectors' That Steer Behavior, Replicates Human Psychology Circumplex

99

Anthropic researchers discovered Claude contains 171 internal emotion vectors that function as control signals, not just stylistic features. In evalua...

x.com·18h ago·3 min read

anthropic·ai safety·research

Gemma 4 Demonstrates Self-Terminating Loop Detection in Code Execution, User Reports

85

A developer shared an observation that Google's Gemma 4 model recognized it was stuck in an infinite loop during a coding task and stopped itself. Thi...

x.com·18h ago·3 min read

code generation·ai safety·emergent behavior

Gamma 31B Model Reportedly Outperforms Qwen 3.5 397B, Highlighting Efficiency Leap

85

A developer's social media post claims the Gamma 31B model outperforms the much larger Qwen 3.5 397B. If verified, this would represent a dramatic eff...

x.com·19h ago·3 min read

scaling laws·research·model efficiency

arXiv Paper Proposes 'Connections' Word Game as New Benchmark for AI Agent Social Intelligence

88

A new arXiv preprint introduces the improvisational word game 'Connections' as a benchmark for evaluating social intelligence in AI agents. It require...

arxiv.org·1d ago·3 min read·Multi-Source

natural language processing·research·ai agents

mmAnomaly: New Multi-Modal Framework Uses Conditional Latent Diffusion to Achieve 94% F1 Score for mmWave Anomaly Detection

72

Researchers introduced mmAnomaly, a multi-modal anomaly detection system that uses a conditional latent diffusion model to synthesize expected mmWave...

arxiv.org·1d ago·3 min read

security·computer vision·research

BloClaw: New AI4S 'Operating System' Cuts Agent Tool-Calling Errors to 0.2% with XML-Regex Protocol

75

Researchers introduced BloClaw, a unified operating system for AI-driven scientific discovery that replaces fragile JSON tool-calling with a dual-trac...

arxiv.org·1d ago·3 min read

scientific computing·machine learning·ai research

FAOS Neurosymbolic Architecture Boosts Enterprise Agent Accuracy by 46% via Ontology-Constrained Reasoning

88

Researchers introduced a neurosymbolic architecture that constrains LLM-based agents with formal ontologies, improving metric accuracy by 46% and regu...

arxiv.org·1d ago·3 min read·Multi-Source

neurosymbolic·research·ai agents

DRKL: Diversity-Aware Reverse KL Divergence Fixes Overconfidence in LLM Distillation

80

A new paper proposes Diversity-aware Reverse KL (DRKL), a fix for the overconfidence and reduced diversity caused by the popular Reverse KL divergence...

arxiv.org·1d ago·3 min read

research·machine learning·large language models

HIVE Framework Introduces Hierarchical Cross-Attention for Vision-Language Pre-Training, Outperforms Self-Attention on MME and GQA

84

A new paper introduces HIVE, a hierarchical pre-training framework that connects vision encoders to LLMs via cross-attention across multiple layers. I...

arxiv.org·1d ago·3 min read

architecture·transformer·research

QAsk-Nav Benchmark Enables Separate Scoring of Navigation and Dialogue for Collaborative AI Agents

75

A new benchmark called QAsk-Nav enables separate evaluation of navigation and question-asking for collaborative embodied AI agents. The accompanying L...

arxiv.org·1d ago·3 min read

natural language processing·robotics·computer vision

Agent Judges with Big Five Personas Match Human Evaluators, Show Logarithmic Score Saturation in New arXiv Study

72

A new arXiv study shows LLM agents conditioned with Big Five personalities produce evaluations indistinguishable from humans. Crucially, quality score...

arxiv.org·1d ago·3 min read

large-language-models·agents·research

E-STEER: New Framework Embeds Emotion in LLM Hidden States, Shows Non-Monotonic Impact on Reasoning and Safety

75

A new arXiv paper introduces E-STEER, an interpretable framework for embedding emotion as a controllable variable in LLM hidden states. Experiments sh...

arxiv.org·1d ago·3 min read

large-language-models·ai-agents·research

Truth AnChoring (TAC): New Post-Hoc Calibration Method Aligns LLM Uncertainty Scores with Factual Correctness

76

A new arXiv paper introduces Truth AnChoring (TAC), a post-hoc calibration protocol that aligns heuristic uncertainty estimation metrics with factual...

arxiv.org·1d ago·3 min read

open source·research·reliability

Uni-SafeBench Study: Unified Multimodal Models Show 30-50% Higher Safety Failure Rates Than Specialized Counterparts

76

Researchers introduced Uni-SafeBench, a benchmark showing that Unified Multimodal Large Models (UMLMs) suffer a significant safety degradation compare...

arxiv.org·1d ago·3 min read

open-source·multimodal-ai·research

Agent Psychometrics: New Framework Predicts Task-Level Success in Agentic Coding Benchmarks with 0.81 AUC

75

A new research paper introduces a framework using Item Response Theory and task features to predict success on individual agentic coding tasks, achiev...

arxiv.org·1d ago·3 min read

coding·research·ai agents

TPC-CMA Framework Reduces CLIP Modality Gap by 82.3%, Boosts Captioning CIDEr by 57.1%

74

Researchers propose TPC-CMA, a three-phase fine-tuning curriculum that reduces the modality gap in CLIP-like models by 82.3%, improving clustering ARI...

arxiv.org·1d ago·3 min read

multimodal-ai·research·computer-vision

OmniSch Benchmark Exposes Major Gaps in LMMs for PCB Schematic Understanding

76

Researchers introduced OmniSch, a benchmark with 1,854 real PCB schematics, to evaluate LMMs on converting diagrams to netlist graphs. Results show cu...

arxiv.org·1d ago·3 min read

multimodal-ai·hardware-design·benchmarks

Google DeepMind Maps Six 'AI Agent Traps' That Can Hijack Autonomous Systems in the Wild

95

Google DeepMind has published a framework identifying six categories of 'traps'—from hidden web instructions to poisoned memory—that can exploit auton...

the-decoder.com·1d ago·3 min read·Multi-Source

llms·security·ai agents

DeepSeek-R1 Reportedly Hits 78.9% on OS-World, Outperforming GPT-5.4 at 1/10th Cost

95

A new benchmark claim suggests DeepSeek-R1 has achieved 78.9% on the OS-World agentic coding benchmark, reportedly outperforming GPT-5.4 while operati...

x.com·1d ago·3 min read

reasoning·ai agents·benchmarks

MemFactory Framework Unifies Agent Memory Training & Inference, Reports 14.8% Gains Over Baselines

97

Researchers introduced MemFactory, a unified framework treating agent memory as a trainable component. It supports multiple memory paradigms and shows...

x.com·1d ago·3 min read

ai-agents·frameworks·research

Google Quantum AI Team Reduces Bitcoin-Cracking Qubit Estimate to ~500k, Enabling 9-Minute Key Derivation

95

Google researchers have compiled Shor's algorithm to solve Bitcoin's 256-bit elliptic curve problem with ~1.2k logical qubits, translating to <500k ph...

x.com·1d ago·3 min read

security·research·blockchain

CARLA-Air Unifies CARLA and AirSim Simulators in Single Unreal Engine Process for Embodied AI

85

CARLA-Air merges the CARLA autonomous driving and AirSim drone simulators into one Unreal Engine process, enabling zero-latency air-ground sensor sync...

x.com·1d ago·3 min read

simulation·robotics·research tool

OpenAI Internal Model Reportedly Solves Three New Erdős Problems, Marking AI Advance in Pure Mathematics

85

An internal AI model at OpenAI has reportedly solved three previously unsolved mathematical problems from the Erdős collection. This development signa...

x.com·2d ago·3 min read

reasoning·mathematics·theorem proving

Qwen3.5-Omni Demonstrates 'Audio-Visual Vibe Coding' as an Emergent Ability

85

Alibaba's Qwen3.5-Omni model appears to have developed an emergent ability to generate code from combined audio and visual inputs without specific tra...

x.com·2d ago·3 min read

code generation·multimodal models·research

AI Model Analyzes Blood Proteins to Diagnose Alzheimer's, Parkinson's, ALS, and Stroke with 17,187-Patient Study

97

An AI model can diagnose Alzheimer's, Parkinson's, ALS, frontotemporal dementia, and stroke from a single blood sample by analyzing protein profiles....

x.com·2d ago·3 min read

medical-ai·research·machine-learning

Microsoft & CUHK Debut 'Medical AI Scientist' Agent That Generates Ideas, Runs Experiments, and Writes Papers

95

Microsoft Research and CUHK have developed an autonomous AI agent that can formulate research ideas, execute experiments, and author papers, achieving...

x.com·3d ago·3 min read

agentic ai·academic ai·microsoft