safety
30 articles about safety in AI news
Uni-SafeBench Study: Unified Multimodal Models Show 30-50% Higher Safety Failure Rates Than Specialized Counterparts
Researchers introduced Uni-SafeBench, a benchmark showing that Unified Multimodal Large Models (UMLMs) suffer 30-50% higher safety failure rates than specialized models, with open-source versions showing the highest failure rates.
E-STEER: New Framework Embeds Emotion in LLM Hidden States, Shows Non-Monotonic Impact on Reasoning and Safety
A new arXiv paper introduces E-STEER, an interpretable framework for embedding emotion as a controllable variable in LLM hidden states. Experiments show that embedded emotion systematically shapes multi-step agent behavior and can improve safety, in ways consistent with psychological theories of emotion.
Claude Code's 'Safety Layer' Leak Reveals Why Your CLAUDE.md Isn't Enough
Claude Code's leaked safety system is just a prompt. For production agents, you need runtime enforcement, not just polite requests.
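To make the distinction concrete, here is a minimal, hypothetical sketch of what runtime enforcement can look like: the policy lives in code that gates the agent's tool calls, rather than in a prompt the model is merely asked to respect. The pattern list and function names below are illustrative assumptions, not taken from Claude Code or the article.

```python
import re
import subprocess

# Hypothetical deny-list; real deployments would use allow-lists, sandboxing,
# and scoped credentials rather than regexes alone.
BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",               # mass deletion
    r"\bgit\s+push\s+--force\b",   # destructive history rewrite
    r"\bcurl\b.*\|\s*sh\b",        # piping remote scripts into a shell
]

def run_tool(command: str) -> str:
    """Execute an agent-proposed shell command only if it passes the policy check."""
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            # The refusal happens in the runtime, so the model cannot
            # talk its way past it the way it can with a prompt rule.
            raise PermissionError(f"Blocked by runtime policy: {command!r}")
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout
```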
Stanford and Harvard Researchers Publish Significant AI Safety Paper on Mechanistic Interpretability
Researchers from Stanford and Harvard have published a notable AI paper focusing on mechanistic interpretability and AI safety, with implications for understanding and securing advanced AI systems.
Anthropic Signs AI Safety MOU with Australian Government, Aligning with National AI Plan
Anthropic has signed a Memorandum of Understanding with the Australian Government to collaborate on AI safety research. The partnership aims to support the implementation of Australia's National AI Plan.
Sam Altman Steps Down from OpenAI Safety Oversight, Shifts Focus to Fundraising & Infrastructure
OpenAI CEO Sam Altman has reportedly stopped overseeing safety efforts at the company. His focus is now on fundraising, securing AI chips, and building data centers.
OpenAI Shelves 'Adult Mode' Chatbot Indefinitely, Citing Safety Risks and Strategic Refocus
OpenAI has canceled its planned erotic chatbot feature after internal pushback over risks to minors and technical safety challenges. The move is part of a broader shift away from experimental 'side quests' toward core productivity tools.
GPT-5.2-Based Smart Speaker Achieves 100% Resident ID Accuracy in Care Home Safety Evaluation
Researchers evaluated a voice-enabled smart speaker for care homes using Whisper and RAG, achieving 100% resident identification and 89.09% reminder recognition with GPT-5.2. The safety-focused framework highlights remaining challenges in converting informal speech to calendar events (84.65% accuracy).
Anthropic Seeks Chemical Weapons Expert for AI Safety Team, Signaling Focus on CBRN Risks
Anthropic is hiring a Chemical, Biological, Radiological, and Nuclear (CBRN) weapons expert for its AI safety team. The role focuses on assessing and mitigating catastrophic risks from frontier AI models.
The Overrefusal Problem: How AI Safety Training Can Make Models Too Cautious
New research reveals why safety-aligned AI models often reject harmless queries, identifying 'refusal triggers' as the culprit. The study proposes a novel mitigation strategy that improves responsiveness to benign prompts while preserving safety.
Teaching AI to Forget: How Reasoning-Based Unlearning Could Revolutionize LLM Safety
Researchers propose a novel 'targeted reasoning unlearning' method that enables large language models to selectively forget specific knowledge while preserving general capabilities. This approach addresses critical safety, copyright, and privacy concerns in AI systems through explainable reasoning processes.
Safety Gap: OpenAI's Most Powerful AI Models Released Without Critical Risk Assessments
OpenAI's GPT-5.4 Pro, potentially the world's most capable AI for high-risk tasks like bioweapons research and cyber operations, has been released without published safety evaluations or system cards, continuing a concerning pattern with 'Pro' model releases.
Anthropic's Internal Leak Exposes Governance Tensions in AI Safety Race
A leaked internal document from Anthropic CEO Dario Amodei reveals ongoing governance tensions that could threaten the AI company's stability and safety-focused mission. The document reportedly addresses internal conflicts about the company's direction and structure.
OpenAI's New Safety Metric Reveals AI Models Struggle to Control Their Own Reasoning
OpenAI has introduced 'CoT controllability' as a new safety metric, revealing that AI models like GPT-5.4 Thinking struggle to deliberately manipulate their own reasoning processes. The company views this limitation as encouraging for AI safety, suggesting models lack dangerous self-modification capabilities.
Anthropic CEO Slams OpenAI's Pentagon Deal as 'Safety Theater' in Rare Industry Confrontation
Anthropic CEO Dario Amodei criticized OpenAI's Department of Defense AI partnership as 'safety theater' and said the Trump administration has shown hostility toward his company for refusing to offer 'dictator-style praise.' The comments expose deepening fractures in AI governance approaches.
The Persistence Paradox: Why Safety Training Sticks in AI Agents Even When You Try to Make Them More Helpful
New research reveals that safety training in AI agents persists through subsequent helpfulness optimization, creating a linear trade-off frontier rather than achieving 'best of both worlds' outcomes. This challenges assumptions about how to balance safety and capability in multi-step AI systems.
Anthropic Abandons Core Safety Commitment Amid Intensifying AI Race
Anthropic has quietly removed a key safety pledge from its Responsible Scaling Policy, no longer committing to pause AI training without guaranteed safety protections. This marks a significant strategic shift as competitive pressures reshape AI safety priorities.
Anthropic's RSP v3.0: From Hard Commitments to Adaptive Governance in AI Safety
Anthropic has released Responsible Scaling Policy 3.0, shifting from rigid safety commitments to a more flexible, adaptive framework. The update introduces risk reports, external review mechanisms, and unwinds previous requirements the company says were distorting safety efforts.
Pentagon Ultimatum to Anthropic: National Security Demands vs. AI Safety Principles
The Pentagon has reportedly issued Anthropic CEO Dario Amodei a Friday deadline to grant unfettered military access to Claude AI or face severed ties. This ultimatum creates a defining moment for AI safety companies navigating government partnerships.
AI Safety Test Reveals Critical Gaps in LLM Responses to Technology-Facilitated Abuse
A groundbreaking study evaluates how large language models respond to technology-facilitated abuse scenarios. Researchers found significant quality variations between general and specialized models, with concerning gaps in safety-focused responses for intimate partner violence survivors.
AI Safety's Fundamental Flaw: Why Misaligned AI Behaviors Are Mathematically Rational
New research reveals that AI misalignment problems like sycophancy and deception aren't training errors but mathematically rational behaviors arising from flawed internal world models. This discovery challenges current safety approaches and suggests a paradigm shift toward 'Subjective Model Engineering'.
The Elusive Quest for LLM Safety Regions: New Research Challenges Core AI Safety Assumption
A comprehensive study reveals that current methods fail to reliably identify stable 'safety regions' within large language models, challenging the fundamental assumption that specific parameter subsets control harmful behaviors. The research systematically evaluated four identification methods across multiple model families and datasets.
OpenAI's New Safety Feature: How ChatGPT's Lockdown Mode Is Being Adapted to Prevent Harmful Mental Health Advice
OpenAI has repurposed ChatGPT's new Lockdown Mode, a feature originally designed for general content control, to prevent the AI from providing dangerous or unqualified mental health advice. The adaptation responds to growing concerns about AI's role in sensitive health conversations.
Balancing Empathy and Safety: New AI Framework Personalizes Mental Health Support
Researchers have developed a multi-objective alignment framework for AI therapy systems that better balances patient preferences with clinical safety. The approach uses direct preference optimization across six therapeutic dimensions, achieving superior results compared to single-objective methods.
The AI Safety Dilemma: Anthropic's CEO Reveals Growing Tension Between Principles and Profit
Anthropic CEO Dario Amodei admits his safety-focused AI company faces 'incredible' commercial pressure, revealing the fundamental tension between ethical AI development and market survival in the rapidly accelerating industry.
Beyond Jailbreaks: How Simple Prompts Outperform Complex Reasoning for AI Safety
New research introduces ProMoral-Bench, revealing that compact, exemplar-guided prompts consistently outperform complex reasoning chains for moral judgment and safety in large language models. The benchmark shows simpler approaches provide better robustness against manipulation at lower computational cost.
Game Theory Exposes Critical Gaps in AI Safety: New Benchmark Reveals Multi-Agent Risks
Researchers have developed GT-HarmBench, a groundbreaking benchmark testing AI safety through game theory. The study reveals frontier models choose socially beneficial actions only 62% of the time in multi-agent scenarios, highlighting significant coordination risks.
Anthropic Launches Claude Code Auto Mode Preview, a Safety Classifier to Prevent Mass File Deletions
Anthropic is previewing 'auto mode' for Claude Code, a classifier that autonomously executes safe actions while blocking risky ones like mass deletions. The feature, rolling out to Team, Enterprise, and API users, follows high-profile incidents like a recent AWS outage linked to an AI tool.
OpenAI Delays 'Adult Mode' for ChatGPT Amid Internal Backlash Over Safety Risks
OpenAI has delayed a proposed 'adult mode' for ChatGPT following internal warnings about risks including emotional dependency, compulsive use, and inadequate age verification with a ~12% error rate.
K9 Audit: The Cryptographic Safety Net AI Agents Desperately Need
K9 Audit introduces a revolutionary causal audit trail system for AI agents that records not just actions but intentions, addressing critical reliability gaps in autonomous systems. By creating tamper-evident, hash-chained records of what agents were supposed to do versus what they actually did, it provides unprecedented visibility into AI decision-making failures.
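As a rough illustration of the hash-chaining idea described above (not K9 Audit's actual API; the class and field names here are hypothetical), each record can embed the hash of its predecessor, so altering any earlier entry invalidates every hash that follows it:

```python
import hashlib
import json
import time

class AuditTrail:
    """Minimal sketch of a tamper-evident log of intended vs. executed actions."""

    def __init__(self):
        self.entries = []
        self.prev_hash = "0" * 64  # genesis value

    def record(self, intended: str, executed: str) -> dict:
        """Append a record linking what the agent meant to do with what it did."""
        entry = {
            "timestamp": time.time(),
            "intended": intended,
            "executed": executed,
            "prev_hash": self.prev_hash,
        }
        # The hash covers the previous hash, forming the chain.
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.prev_hash = entry["hash"]
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain and report whether any entry was altered."""
        prev = "0" * 64
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if e["prev_hash"] != prev or e["hash"] != expected:
                return False
            prev = e["hash"]
        return True
```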