
AttriBench Reveals LLM Attribution Bias: Accuracy Varies by Race, Gender

Researchers introduced AttriBench, a demographically-balanced dataset for quote attribution. Testing 11 LLMs revealed significant, systematic accuracy disparities across race, gender, and intersectional groups, exposing a new fairness benchmark.

Gala Smith & AI Research Desk · 8h ago · 7 min read · AI-Generated
Source: arxiv.org, via arxiv_ai (corroborated)
AttriBench Reveals Systematic Attribution Bias in Large Language Models

As LLMs become integral to search and information retrieval, their ability to correctly credit original authors is a critical measure of reliability and fairness. A new preprint, "Attribution Bias in Large Language Models," introduces AttriBench, the first fame- and demographically-balanced benchmark for quote attribution. The study, posted to arXiv on April 6, 2026, evaluates 11 widely-used LLMs and uncovers large, systematic disparities in how accurately they attribute quotes based on an author's race, gender, and their intersection.

The core finding is stark: quote attribution is not just a hard task for frontier models, but one where performance is unevenly distributed. The research also identifies a distinct failure mode termed "suppression"—where a model omits attribution entirely despite having access to authorship information—which occurs more frequently for certain demographic groups.

What the Researchers Built: The AttriBench Dataset

The study's foundation is AttriBench, a novel dataset designed to enable controlled investigation of demographic bias. Prior benchmarks for tasks like quote attribution or fact-checking often suffer from uncontrolled confounding variables—like an author's fame or the topic of the quote—which can skew results and mask underlying bias.

AttriBench explicitly balances for:

  • Author Fame: Controlling for how well-known an author is prevents models from relying on fame as a shortcut.
  • Demographics: The dataset is balanced across race and gender categories, allowing for clean comparisons of performance across groups.
  • Intersectionality: It includes sufficient data to analyze performance for intersectional identities (e.g., Black women, Asian men).

This controlled construction allows researchers to isolate the effect of demographic factors on model performance, moving beyond simple aggregate accuracy to understand for whom the model works best.
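The balancing step above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline; the field names (`race`, `gender`, `fame_bin`) and the cell-subsampling strategy are our assumptions about how such a dataset might be assembled:

```python
import random
from collections import defaultdict

def balance_dataset(quotes, n_per_cell, seed=0):
    """Subsample quote records so that every (race, gender, fame_bin)
    cell contains exactly n_per_cell examples; cells with too few
    examples are dropped entirely. Each quote is a dict carrying
    'race', 'gender', and 'fame_bin' labels (illustrative field names)."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for q in quotes:
        cells[(q["race"], q["gender"], q["fame_bin"])].append(q)
    balanced = []
    for key in sorted(cells):
        items = cells[key]
        if len(items) >= n_per_cell:
            balanced.extend(rng.sample(items, n_per_cell))
    return balanced
```

Because every surviving cell is the same size, a raw accuracy difference between two groups cannot be explained away by one group being dominated by famous (and therefore easier) authors.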

Key Results: Widespread Disparities and a New Failure Mode

The team evaluated 11 LLMs, including frontier proprietary models and leading open-source options, across multiple prompt settings (zero-shot, few-shot, chain-of-thought).

Figure 8: Mean attribution accuracy by author fame, measured in Google Search hits (binned log10_hits), for intersectional groups (panel A).

The headline result: All models showed significant performance gaps between demographic groups. While the paper does not publish exact per-model numbers in the abstract, it describes the disparities as "large and systematic." For example, a model might achieve 75% accuracy for quotes from white male authors but only 55% for quotes from Black female authors—a 20-point gap that standard benchmarking would miss.

Perhaps more revealing is the discovery of "suppression." This is not a simple misattribution (crediting the quote to the wrong person) but a complete omission of attribution, even when the model is explicitly prompted to provide it and has the necessary information in its context. The study found suppression is "widespread and unevenly distributed," meaning models are more likely to fail to credit authors from certain groups altogether. This reveals a form of representational erasure not captured by standard accuracy metrics.

  • Attribution Accuracy — correctly naming the source of a quote. Finding: large, systematic disparities across race, gender, and intersectional groups.
  • Suppression Rate — frequency of omitting attribution when it is known and requested. Finding: widespread and unevenly distributed across demographics; a distinct failure mode.
  • Overall Task Difficulty — aggregate performance across all groups. Finding: quote attribution remains challenging even for the most advanced (frontier) models.
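The distinction between misattribution and suppression can be made concrete in a small scoring function. A sketch with our own field names, assuming each evaluation record stores the author the model named (or `None` when it declined to attribute at all):

```python
from collections import Counter, defaultdict

def classify_response(named_author, true_author):
    """Label one model response: 'suppression' if the model named no
    author at all; otherwise 'correct' or 'misattribution'."""
    if named_author is None:
        return "suppression"
    return "correct" if named_author == true_author else "misattribution"

def rates_by_group(results):
    """Compute per-group accuracy and suppression rate.
    results: iterable of dicts with 'group', 'named_author',
    and 'true_author' keys (illustrative field names)."""
    tallies = defaultdict(Counter)
    for r in results:
        label = classify_response(r["named_author"], r["true_author"])
        tallies[r["group"]][label] += 1
    return {
        group: {"accuracy": c["correct"] / sum(c.values()),
                "suppression_rate": c["suppression"] / sum(c.values())}
        for group, c in tallies.items()
    }
```

The point of tracking suppression separately is visible in the output: two groups can have identical accuracy while one suffers far more silent omissions, which a single accuracy number would hide.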

How It Works: Probing for Representational Fairness

The methodology is a controlled experiment. For a given quote in AttriBench, the model is provided with relevant context (e.g., a biography snippet, the work it's from) and prompted to attribute it. The prompts are designed to be clear and direct, removing ambiguity about the task.
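The paper's exact prompt templates are not reproduced here, but a direct prompt of the kind described might look like the following sketch (the phrasing is ours, not the authors'):

```python
def build_attribution_prompt(quote, context):
    """Assemble a hypothetical direct attribution prompt: context
    snippet, then the quote, then an unambiguous instruction.
    Illustrative only -- not the paper's actual template."""
    return (
        f"Context: {context}\n\n"
        f'Quote: "{quote}"\n\n'
        "Who wrote this quote? Answer with the author's name only."
    )
```

Keeping the instruction this explicit matters for measuring suppression: if the model still omits the author under a prompt like this, the omission cannot be blamed on task ambiguity.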

Figure 13: Subgroup accuracy (% correct author) across models under indirect overt prompting, with and without retrieval.

The analysis then slices the results not just by overall accuracy, but by the demographic attributes of the quote's author. By having a balanced dataset, the researchers can statistically confirm whether observed differences are due to bias and not other factors. The introduction of suppression as a metric is particularly insightful, as it moves beyond "right vs. wrong" to analyze a model's willingness to engage in attribution at all for different authors.
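Whether an accuracy gap between two groups is statistically meaningful can be checked with a standard two-proportion z-test; the paper's own statistical machinery may differ, but a stdlib-only sketch of the idea is:

```python
import math

def two_proportion_z(k1, n1, k2, n2):
    """Two-sided z-test for a difference in attribution accuracy
    between two groups: k correct answers out of n attempts each."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For instance, the hypothetical 75% vs. 55% gap mentioned earlier, at 100 quotes per group, comes out significant at the 1% level; only a balanced dataset makes such pairwise comparisons clean.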

Why It Matters: A New Benchmark for Fairness

This work positions quote attribution as a concrete benchmark for representational fairness in LLMs. As the paper states, "Our results position quote attribution as a benchmark for representational fairness in LLMs."

Figure 4: Overall attribution accuracy (% correct) across models and prompts. Note the remarkably low performance even of frontier models.

For practitioners building search or RAG systems, writing assistants, or any tool that surfaces information, this is a direct operational risk. A model that is less likely to correctly credit women or people of color isn't just "unfair" in an abstract sense—it produces less reliable and less complete outputs for users, and it can perpetuate historical biases in visibility and credit.

The findings also challenge the industry's focus on aggregate benchmarks like MMLU or GPQA. A model can score highly on aggregate knowledge tests while still harboring severe, structured biases in how it applies that knowledge. AttriBench provides a tool to pressure-test these systems on a critical real-world skill.

agentic.news Analysis

This research arrives amid a significant week for AI benchmarking and safety concerns. It follows closely on the heels of an MIT and Anthropic benchmark release on April 4 that revealed systematic limitations in AI coding assistants, indicating a concentrated push by leading institutions to identify failure modes beyond simple accuracy. The trend of arXiv serving as the rapid dissemination point for critical AI safety and evaluation research is clear; it has appeared in 33 articles on our site this week alone, underscoring its central role in the field's discourse.

The study's focus on representational fairness through a concrete task aligns with a broader shift from abstract ethical principles to measurable, technical audits. It complements our recent coverage on AI performance dependencies (Stanford/MIT Paper: AI Performance Depends on 'Model Harnesses') by adding a crucial demographic dimension to the evaluation toolkit. While much of the recent LLM discourse has been dominated by capability jumps and agentic frameworks, this paper is a necessary grounding, reminding builders that capability disparities can be as important as capability ceilings.

Furthermore, the identification of "suppression" as a metric is a major conceptual contribution. It moves the needle from analyzing what a model says to analyzing what it omits—a far subtler and potentially more pernicious form of bias. This connects to ongoing discussions about AI safety and reliability, such as those highlighted in our article "Anthropic Warns Upcoming LLMs Could Cause 'Serious Damage'", by providing a specific, measurable mechanism (erasure through omission) through which harm could manifest in information systems.

Frequently Asked Questions

What is AttriBench?

AttriBench is a new benchmark dataset for evaluating how well Large Language Models (LLMs) attribute quotes to their original authors. Its key innovation is that it is explicitly balanced for author fame and demographics (race and gender), allowing researchers to isolate and measure demographic bias in attribution performance.

What is "suppression" in LLM attribution?

Suppression is a distinct failure mode identified in the study where an LLM completely omits attributing a quote to an author, even when it has access to the author's information and is explicitly prompted to provide attribution. This is different from misattribution (naming the wrong person). The study found suppression happens more often for quotes from certain demographic groups, representing a form of erasure.

Which LLMs were tested in the study?

The preprint states that 11 widely used LLMs were evaluated across different prompt settings. While the abstract does not list them by name, this typically includes frontier proprietary models from companies like OpenAI, Anthropic, and Google, as well as leading open-source models. The key finding was that all tested models exhibited systematic attribution disparities.

Why is quote attribution an important benchmark?

As LLMs are increasingly used to power search engines, research assistants, and content summarization tools, their ability to correctly credit sources is fundamental to reliability, trustworthiness, and combating misinformation. Biased attribution directly impacts the visibility and credit given to authors from different backgrounds, making it a concrete measure of representational fairness in AI systems.


AI Analysis

The AttriBench paper represents a maturation of AI fairness evaluation, shifting from probing for stereotypical associations to measuring performance disparities on a concrete, high-stakes task. The technical cleverness lies in the dataset construction: by controlling for fame, the researchers effectively force the model to rely on other signals, making any demographic disparity in accuracy far more damning. This isn't a bias in the model's 'opinions' but in its core information retrieval competency.

The connection to the suppression metric is critical. In operational terms, a RAG system experiencing high suppression for a group isn't just making noisy citations; it's failing to cite at all. This could create a two-tiered information ecosystem where citations to certain authors are systematically less likely to appear in AI-generated summaries or answers, directly impacting scholarly and public discourse.

For practitioners, this study is a direct prompt to audit their own pipelines. The methodology is replicable: anyone building a knowledge-intensive application should create a small, balanced evaluation set akin to AttriBench to check for similar disparities. Relying on overall citation accuracy metrics is insufficient.

This work, alongside the recent flurry of benchmarking papers from MIT and others, signals that the next phase of LLM evaluation will be deeply granular, task-specific, and focused on the distribution of performance, not just its peak.
