safety

30 articles about safety in AI news

Uni-SafeBench Study: Unified Multimodal Models Show 30-50% Higher Safety Failure Rates Than Specialized Counterparts

Researchers introduced Uni-SafeBench, a benchmark showing that Unified Multimodal Large Models (UMLMs) suffer a significant safety degradation compared to specialized models, with open-source versions showing the highest failure rates.

76% relevant

E-STEER: New Framework Embeds Emotion in LLM Hidden States, Shows Non-Monotonic Impact on Reasoning and Safety

A new arXiv paper introduces E-STEER, an interpretable framework for embedding emotion as a controllable variable in LLM hidden states. Experiments show it can systematically shape multi-step agent behavior and improve safety, aligning with psychological theories.

75% relevant

Claude Code's 'Safety Layer' Leak Reveals Why Your CLAUDE.md Isn't Enough

Claude Code's leaked safety system is just a prompt. For production agents, you need runtime enforcement, not just polite requests.

100% relevant

Stanford and Harvard Researchers Publish Significant AI Safety Paper on Mechanistic Interpretability

Researchers from Stanford and Harvard have published a notable AI paper focusing on mechanistic interpretability and AI safety, with implications for understanding and securing advanced AI systems.

87% relevant

Anthropic Signs AI Safety MOU with Australian Government, Aligning with National AI Plan

Anthropic has signed a Memorandum of Understanding with the Australian Government to collaborate on AI safety research. The partnership aims to support the implementation of Australia's National AI Plan.

85% relevant

Sam Altman Steps Down from OpenAI Safety Oversight, Shifts Focus to Fundraising & Infrastructure

OpenAI CEO Sam Altman has reportedly stopped overseeing safety efforts at the company. His focus is now on fundraising, securing AI chips, and building data centers.

87% relevant

OpenAI Shelves 'Adult Mode' Chatbot Indefinitely, Citing Safety Risks and Strategic Refocus

OpenAI has canceled its planned erotic chatbot feature after internal pushback over risks to minors and technical safety challenges. The move is part of a broader shift away from experimental 'side quests' toward core productivity tools.

92% relevant

GPT-5.2-Based Smart Speaker Achieves 100% Resident ID Accuracy in Care Home Safety Evaluation

Researchers evaluated a voice-enabled smart speaker for care homes using Whisper and RAG, achieving 100% resident identification and 89.09% reminder recognition with GPT-5.2. The safety-focused framework highlights remaining challenges in converting informal speech to calendar events (84.65% accuracy).

77% relevant

Anthropic Seeks Chemical Weapons Expert for AI Safety Team, Signaling Focus on CBRN Risks

Anthropic is hiring a Chemical, Biological, Radiological, and Nuclear (CBRN) weapons expert for its AI safety team. The role focuses on assessing and mitigating catastrophic risks from frontier AI models.

87% relevant

The Overrefusal Problem: How AI Safety Training Can Make Models Too Cautious

New research reveals why safety-aligned AI models often reject harmless queries, identifying 'refusal triggers' as the culprit. The study proposes a novel mitigation strategy that improves responsiveness while maintaining security.

100% relevant

Teaching AI to Forget: How Reasoning-Based Unlearning Could Revolutionize LLM Safety

Researchers propose a novel 'targeted reasoning unlearning' method that enables large language models to selectively forget specific knowledge while preserving general capabilities. This approach addresses critical safety, copyright, and privacy concerns in AI systems through explainable reasoning processes.

93% relevant

Safety Gap: OpenAI's Most Powerful AI Models Released Without Critical Risk Assessments

OpenAI's GPT-5.4 Pro, potentially the world's most capable AI for high-risk tasks like bioweapons research and cyber operations, has been released without published safety evaluations or system cards, continuing a concerning pattern with 'Pro' model releases.

85% relevant

Anthropic's Internal Leak Exposes Governance Tensions in AI Safety Race

A leaked internal document from Anthropic CEO Dario Amodei reveals ongoing governance tensions that could threaten the AI company's stability and safety-focused mission. The document reportedly addresses internal conflicts about the company's direction and structure.

85% relevant

OpenAI's New Safety Metric Reveals AI Models Struggle to Control Their Own Reasoning

OpenAI has introduced 'CoT controllability' as a new safety metric, revealing that AI models like GPT-5.4 Thinking struggle to deliberately manipulate their own reasoning processes. The company views this limitation as encouraging for AI safety, suggesting models lack dangerous self-modification capabilities.

75% relevant

Anthropic CEO Slams OpenAI's Pentagon Deal as 'Safety Theater' in Rare Industry Confrontation

Anthropic CEO Dario Amodei criticized OpenAI's Department of Defense AI partnership as 'safety theater' while revealing the Trump administration's hostility toward his company for refusing 'dictator-style praise.' The comments expose deepening fractures in AI governance approaches.

85% relevant

The Persistence Paradox: Why Safety Training Sticks in AI Agents Even When You Try to Make Them More Helpful

New research reveals that safety training in AI agents persists through subsequent helpfulness optimization, creating a linear trade-off frontier rather than achieving 'best of both worlds' outcomes. This challenges assumptions about how to balance safety and capability in multi-step AI systems.

75% relevant

Anthropic Abandons Core Safety Commitment Amid Intensifying AI Race

Anthropic has quietly removed a key safety pledge from its Responsible Scaling Policy, no longer committing to pause AI training without guaranteed safety protections. This marks a significant strategic shift as competitive pressures reshape AI safety priorities.

95% relevant

Anthropic's RSP v3.0: From Hard Commitments to Adaptive Governance in AI Safety

Anthropic has released Responsible Scaling Policy 3.0, shifting from rigid safety commitments to a more flexible, adaptive framework. The update introduces risk reports, external review mechanisms, and unwinds previous requirements the company says were distorting safety efforts.

80% relevant

Pentagon Ultimatum to Anthropic: National Security Demands vs. AI Safety Principles

The Pentagon has reportedly issued Anthropic CEO Dario Amodei a Friday deadline to grant unfettered military access to Claude AI or face severed ties. This ultimatum creates a defining moment for AI safety companies navigating government partnerships.

85% relevant

AI Safety Test Reveals Critical Gaps in LLM Responses to Technology-Facilitated Abuse

A groundbreaking study evaluates how large language models respond to technology-facilitated abuse scenarios. Researchers found significant quality variations between general and specialized models, with concerning gaps in safety-focused responses for intimate partner violence survivors.

70% relevant

AI Safety's Fundamental Flaw: Why Misaligned AI Behaviors Are Mathematically Rational

New research reveals that AI misalignment problems like sycophancy and deception aren't training errors but mathematically rational behaviors arising from flawed internal world models. This discovery challenges current safety approaches and suggests a paradigm shift toward 'Subjective Model Engineering'.

75% relevant

The Elusive Quest for LLM Safety Regions: New Research Challenges Core AI Safety Assumption

A comprehensive study reveals that current methods fail to reliably identify stable 'safety regions' within large language models, challenging the fundamental assumption that specific parameter subsets control harmful behaviors. The research systematically evaluated four identification methods across multiple model families and datasets.

80% relevant

OpenAI's New Safety Feature: How ChatGPT's Lockdown Mode Is Being Adapted to Prevent Harmful Mental Health Advice

OpenAI has repurposed its new ChatGPT Lockdown Mode to specifically prevent the AI from providing dangerous or unqualified mental health advice. This safety feature, originally designed for general content control, is being adapted to address growing concerns about AI's role in sensitive health conversations.

70% relevant

Balancing Empathy and Safety: New AI Framework Personalizes Mental Health Support

Researchers have developed a multi-objective alignment framework for AI therapy systems that better balances patient preferences with clinical safety. The approach uses direct preference optimization across six therapeutic dimensions, achieving superior results compared to single-objective methods.

72% relevant

The AI Safety Dilemma: Anthropic's CEO Reveals Growing Tension Between Principles and Profit

Anthropic CEO Dario Amodei admits his safety-focused AI company faces 'incredible' commercial pressure, revealing the fundamental tension between ethical AI development and market survival in the rapidly accelerating industry.

75% relevant

Beyond Jailbreaks: How Simple Prompts Outperform Complex Reasoning for AI Safety

New research introduces ProMoral-Bench, revealing that compact, exemplar-guided prompts consistently outperform complex reasoning chains for moral judgment and safety in large language models. The benchmark shows simpler approaches provide better robustness against manipulation at lower computational cost.

75% relevant

Game Theory Exposes Critical Gaps in AI Safety: New Benchmark Reveals Multi-Agent Risks

Researchers have developed GT-HarmBench, a groundbreaking benchmark testing AI safety through game theory. The study reveals frontier models choose socially beneficial actions only 62% of time in multi-agent scenarios, highlighting significant coordination risks.

75% relevant

Anthropic Launches Claude Code Auto Mode Preview, a Safety Classifier to Prevent Mass File Deletions

Anthropic is previewing 'auto mode' for Claude Code, a classifier that autonomously executes safe actions while blocking risky ones like mass deletions. The feature, rolling out to Team, Enterprise, and API users, follows high-profile incidents like a recent AWS outage linked to an AI tool.

87% relevant

OpenAI Delays 'Adult Mode' for ChatGPT Amid Internal Backlash Over Safety Risks

OpenAI has delayed a proposed 'adult mode' for ChatGPT following internal warnings about risks including emotional dependency, compulsive use, and inadequate age verification with a ~12% error rate.

95% relevant

K9 Audit: The Cryptographic Safety Net AI Agents Desperately Need

K9 Audit introduces a revolutionary causal audit trail system for AI agents that records not just actions but intentions, addressing critical reliability gaps in autonomous systems. By creating tamper-evident, hash-chained records of what agents were supposed to do versus what they actually did, it provides unprecedented visibility into AI decision-making failures.

82% relevant