From BM25 to Corrective RAG: A Benchmark Study Challenges the Dominance of Semantic Search for Tabular Data

A systematic benchmark of 10 RAG retrieval strategies on a financial QA dataset reveals that a two-stage hybrid + reranking pipeline performs best. Crucially, the classic BM25 algorithm outperformed modern dense retrieval models, challenging a core assumption in semantic search. The findings provide actionable, cost-aware guidance for building retrieval systems over heterogeneous documents.

GAla Smith & AI Research Desk · 12h ago · 4 min read · AI-Generated
Source: arxiv.org (via arxiv_ir), single source

What Happened

A new technical paper, "From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents," was posted to the arXiv preprint server. The research addresses a critical gap: while Retrieval-Augmented Generation (RAG) systems are ubiquitous, there has been no systematic comparison of modern retrieval methods for the complex, heterogeneous documents common in business—those containing both free-form text and structured tabular data.

The authors constructed a challenging financial Question-Answering (QA) benchmark comprising 23,088 queries over 7,318 documents with mixed content. They then benchmarked ten distinct retrieval strategies, spanning the full modern arsenal:

  • Sparse Retrieval: The classic lexical search algorithm, BM25.
  • Dense Retrieval: Semantic search using state-of-the-art embedding models.
  • Hybrid Fusion: Combining scores from sparse and dense retrievers.
  • Cross-Encoder Reranking: Using a more computationally expensive neural model to reorder an initial set of candidate documents.
  • Query Expansion: Techniques like HyDE (Hypothetical Document Embeddings) and multi-query generation.
  • Adaptive & Contextual Retrieval: Methods that adjust the retrieval based on context or previous steps.
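
To make the first three strategy families concrete, here is a minimal sketch of Okapi BM25 scoring and rank-based hybrid fusion. It is not the paper's code. The whitespace tokenizer, the parameter defaults (k1=1.5, b=0.75, k=60), and the function names are assumptions chosen for illustration. Reciprocal Rank Fusion (RRF) is used because the Figure 1 caption names it as the hybrid method.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real systems use richer analyzers).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization penalizes long documents via `b`.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return [d for d, _ in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))]
```

RRF needs only the rank positions from each retriever, so BM25 scores and embedding similarities never have to be put on a common scale.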

Performance was evaluated on both retrieval quality (Recall@k, MRR, nDCG) and end-to-end generation accuracy (via a "Number Match" metric suitable for financial data), with statistical significance testing.
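
For readers less familiar with the retrieval metrics named above, the following sketch shows the standard per-query definitions under binary relevance. The function names are illustrative, and the paper's exact evaluation code may differ (for example, in how it handles graded relevance). A benchmark score is the mean of these values over all queries.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For example, if the only relevant document sits at rank 2, Recall@1 is 0, Recall@3 is 1, and MRR@3 is 0.5.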

Technical Details & Key Findings

The results deliver several actionable and somewhat counter-intuitive insights for AI practitioners:

  1. The Two-Stage Pipeline is King: The most effective strategy was a two-stage pipeline that first uses a hybrid retrieval method (fusing BM25 and dense embedding scores) to fetch a broad set of candidates (e.g., top 100), then applies a neural cross-encoder reranker to select the final top passages. This pipeline achieved a Recall@5 of 0.816 and an MRR@3 of 0.605, significantly outperforming any single-stage method.

  2. BM25 Challenges Semantic Dominance: In a major finding, the simple, decades-old BM25 algorithm consistently outperformed state-of-the-art dense retrieval models on this financial document corpus. This directly challenges the common assumption that semantic (vector) search universally dominates keyword-based search. The authors attribute this to the precise, often numeric nature of financial queries, where exact term matching remains highly effective.

  3. Not All Advanced Techniques Pay Off: For this domain of precise numerical QA, query expansion methods (HyDE, multi-query) and adaptive retrieval provided limited to no benefit. However, contextual retrieval—where the system uses information from initially retrieved documents to refine a follow-up search—yielded consistent gains.

  4. Cost-Accuracy Trade-offs are Explicit: The paper provides practical guidance. If maximum accuracy is critical, invest in the two-stage hybrid+reranking pipeline. If latency and cost are primary constraints, a well-tuned BM25 system might be the most efficient choice, outperforming more expensive dense models.
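
The two-stage pipeline from finding 1 can be sketched as below. This is an illustrative skeleton, not the authors' implementation. The rankers and the reranker are passed in as plain callables, and the `toy_*` functions are hypothetical stand-ins for a real BM25 index, an embedding model, and a cross-encoder. Fusion uses RRF, as named in the Figure 1 caption.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda d: (-fused[d], d))


def two_stage_retrieve(query, corpus, rankers, rerank_score,
                       n_candidates=100, top_k=5):
    # Stage 1: fuse the cheap rankers and keep a broad candidate pool.
    candidates = rrf([ranker(query) for ranker in rankers])[:n_candidates]
    # Stage 2: rescore each (query, passage) pair with the expensive model.
    rescored = sorted(candidates, key=lambda d: -rerank_score(query, corpus[d]))
    return rescored[:top_k]


# Hypothetical toy corpus and stand-in models, for illustration only.
corpus = {
    "t1": "FY2019 net revenue table: 4,200",
    "t2": "FY2018 net revenue table: 3,900",
    "p1": "Management discussion of revenue growth drivers",
}


def toy_sparse(query):
    # Stand-in for BM25: rank by number of shared tokens.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: (-len(q & set(corpus[d].lower().split())), d))


def toy_dense(query):
    # Stand-in for an embedding model: a fixed ranking that favors the prose passage.
    return ["p1", "t2", "t1"]


def toy_reranker(query, passage):
    # Stand-in for a cross-encoder: fraction of query tokens found in the passage.
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / len(q)


top = two_stage_retrieve("FY2019 net revenue", corpus,
                         [toy_sparse, toy_dense], toy_reranker, top_k=2)
```

The point of the design is cost control: the reranker only ever sees `n_candidates` passages, so its per-query cost stays bounded no matter how large the corpus grows.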

Retail & Luxury Implications

The benchmark uses financial documents, but its conclusions are likely to transfer to core retail and luxury AI use cases that rely on RAG over complex, mixed-format data. The key insight is that the optimal retrieval architecture is not a one-size-fits-all semantic search; it depends on your data and query profile.

[Figure 1: Recall@k curves for BM25, dense (text-embedding-3-large), and hybrid RRF retrieval.]

Potential Applications & Architectural Guidance:

  • Product Information & Customer Service Chatbots: Knowledge bases contain product descriptions (text), technical specifications (tables), pricing histories, and inventory logs. A customer asking "What were the price changes for the Lady Dior bag in Q4 2025?" is making a precise, quasi-tabular query. This research suggests a hybrid BM25/dense first stage would likely retrieve the correct pricing table, which a reranker could then confirm.

  • Internal Enterprise Search: Merchandising plans, global sales reports, and supply chain documents are quintessential text-and-table documents. An analyst searching for "SKU 78945 sell-through in Paris in December" needs pinpoint accuracy. Relying solely on a semantic embedding might miss the specific SKU number; BM25 would catch it. The recommended two-stage pipeline is ideal for such internal intelligence systems.

  • Sustainability & Compliance Reporting: Generating reports from ESG data, material sourcing ledgers, and audit trails involves extracting precise figures from structured tables referenced in narrative text. The benchmark's finding that query expansion adds little value for numerical queries is crucial here—it prevents teams from over-engineering their RAG pipelines with ineffective complexity.
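
One way to see why exact identifiers favor lexical matching is to look at how a table might be indexed. The sketch below is an assumption, not something the paper prescribes. It flattens each table row into a "column: value" passage so that a SKU survives as a whole token, then filters passages by exact token match. The table title, column names, and values are invented for illustration.

```python
def serialize_rows(title, header, rows):
    """Turn each table row into a standalone passage of 'column: value' pairs."""
    passages = []
    for row in rows:
        cells = "; ".join(f"{col}: {val}" for col, val in zip(header, row))
        passages.append(f"{title} | {cells}")
    return passages


def exact_token_hits(query_terms, passages):
    """Return the passages that contain every query term as a whole token."""
    hits = []
    for passage in passages:
        cleaned = passage.lower()
        for sep in (";", ":", "|"):
            cleaned = cleaned.replace(sep, " ")
        tokens = set(cleaned.split())
        if all(term.lower() in tokens for term in query_terms):
            hits.append(passage)
    return hits


# Hypothetical sell-through table; every value here is made up.
header = ["SKU", "City", "Month", "Sell-through"]
rows = [
    ["78945", "Paris", "December", "62%"],
    ["78945", "Milan", "December", "48%"],
    ["78946", "Paris", "December", "55%"],
]
passages = serialize_rows("Sell-through by store", header, rows)
hits = exact_token_hits(["78945", "Paris"], passages)
```

Here `hits` contains only the Paris row for SKU 78945. The near-identical SKU 78946 is excluded because the match is on whole tokens, which is the kind of distinction an embedding model may blur.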

Implementation Consideration: For luxury houses, where product catalogs are smaller but richer in detail (heritage, craftsmanship notes, material provenance), dense semantic retrieval will still be vital for understanding conceptual queries like "bags suitable for gifting a diplomat." The lesson is not to abandon embeddings, but to default to a hybrid approach where BM25 ensures precision on key entities (product codes, names, numbers) and embeddings capture broader semantic intent. The reranking stage acts as a high-confidence arbiter.

AI Analysis

This paper provides empirical validation for a growing practitioner sentiment: the rush to implement pure vector search for all RAG applications is often misguided. For retail and luxury, domains awash in structured product data, SKUs, dates, and prices, lexical search remains indispensable. The research empowers technical leaders to push back on blanket "semantic search" mandates and design architectures based on data morphology.

The timing is pertinent. This follows a recent cautionary tale shared by a developer on March 25 about RAG system failure at production scale, highlighting the real-world risks of poorly configured retrieval. It also complements our recent coverage of Nemotron ColEmbed V2, NVIDIA's new state-of-the-art embedding models for visual document retrieval. The present study acts as a crucial reminder that even the best embeddings are not a panacea; they are one component in a larger, thoughtfully designed retrieval pipeline.

Furthermore, the paper's focus on heterogeneous data aligns with the industry's trajectory toward multi-modal RAG, combining text, tables, and images (e.g., for visual search or document understanding). As noted in our KG intelligence, arXiv activity on Retrieval-Augmented Generation and Vision-Language Models is trending upward, with 18 and 5 related articles this week, respectively. This benchmark provides a methodological foundation for evaluating these more complex systems as they evolve.

For implementation, teams should start by auditing their internal knowledge sources: what percentage is unstructured text versus semi-structured tables or lists? For queries demanding numerical precision, piloting a hybrid retriever with BM25 should be a priority. The released benchmark code offers a reproducible starting point for creating internal validation suites, moving beyond generic RAG evaluations to tests that reflect the specific data landscape of a luxury group.
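
An internal validation suite for numerical QA needs an answer check that tolerates formatting differences. The sketch below is one plausible reading of a "Number Match" style metric. The paper's exact definition is not reproduced in this article, so the regular expression, the tolerance, and the function names are all assumptions.

```python
import re

# Matches integers and decimals with optional sign and thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Pull every numeric literal out of free text, dropping thousands separators."""
    return [float(m.replace(",", "")) for m in _NUMBER.findall(text)]


def number_match(answer, gold, rel_tol=1e-3):
    """True if any number in the generated answer matches the gold value within tolerance."""
    target = float(gold)
    allowed = rel_tol * max(1.0, abs(target))
    return any(abs(x - target) <= allowed for x in extract_numbers(answer))
```

For example, `number_match("Revenue rose to 4,200.5 million in 2019.", 4200.5)` is true. A production version would also need to handle units and scale words ("million", "%"), which this sketch ignores.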