Why 90% Accuracy in Text-to-SQL is 100% Useless


I have been working in the Analytics space for over 20 years. Back then, it was not called "analytics"; it was "Business Intelligence", and in earlier times "Decision Support Systems". The terms change, from data warehouses to Big Data, to lakehouses, and now AI, but the essence and the everlasting promise of self-service Analytics remains the same: extracting truth from data to empower users without relying on someone from the data team. AI without humans in the loop? That sounds controversial.

With the advent of Large Language Models (LLMs), one use case I find fascinating is building conversational interfaces to talk with databases (Text-to-SQL). The potential here is immense, promising to democratize data access across organizations.

However, for this specific use case, the answer needs to be binary. It either works or it doesn't.

An accuracy of 80% or even 90% is, unfortunately, not enough. Giving your end users an AI analytical application that hallucinates tables or misinterprets filters is no joke. You cannot compromise on accuracy, because every error immediately erodes trust. And what happens when a system loses trust? It stops being used. Adoption declines, not to mention the catastrophic risk of business decisions being made on incorrect data.

The Complexity of the RAG Pipeline

I began my research on this topic over a year and a half ago, and it quickly became clear that orchestrating a robust Text-to-SQL RAG (Retrieval-Augmented Generation) application is not trivial. You need multiple components in your pipeline, working in perfect harmony:

  • An intent classifier to detect the goal of the query.
  • A vector database to store additional context (like business definitions) that the language models need.
  • An embeddings model to vectorize this extra knowledge.
  • A retrieval mechanism for the stored data.
  • Access to the database.
  • The ability to generate SQL in the specific dialect of the database.
  • And the ability to evaluate the results.

This last part, evaluation, is often omitted or treated as an afterthought, yet I believe it is the most crucial component for ensuring the reliability needed in an enterprise setting. The sketch below shows how these stages fit together.
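To make the moving parts concrete, here is a minimal sketch of such a pipeline as plain Python. Every function body is a stub, and all the names are my own, not from any particular framework; a real system would back each stage with an LLM call, a vector store, and a database driver.

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    sql: str
    rows: list
    passed_eval: bool

def classify_intent(question: str) -> str:
    # Stub: an LLM would label the question (analytical, chit-chat, unsupported...).
    return "analytical"

def retrieve_context(question: str) -> list[str]:
    # Stub: embed the question and search a vector store of business definitions.
    return ["Churn Rate = customers lost in period / customers at period start"]

def generate_sql(question: str, context: list[str], dialect: str = "bigquery") -> str:
    # Stub: prompt an LLM with the question, retrieved context, and target dialect.
    return "SELECT 1"

def execute(sql: str) -> list:
    # Stub: run the generated SQL against the warehouse.
    return []

def evaluate(rows: list, gold_rows: list) -> bool:
    # The often-forgotten stage: compare results against a gold standard.
    return sorted(rows) == sorted(gold_rows)

def answer(question: str, gold_rows: list) -> PipelineResult:
    if classify_intent(question) != "analytical":
        raise ValueError("Out of scope for the SQL pipeline")
    context = retrieve_context(question)
    sql = generate_sql(question, context)
    rows = execute(sql)
    return PipelineResult(sql, rows, passed_eval=evaluate(rows, gold_rows))
```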

BigQuery: A Case Study in Native AI Integration

Managing this complex pipeline often requires integrating multiple platforms. I was recently impressed by how BigQuery has merged Analytics and Generative AI natively in its platform.

You can work with your SQL in the BigQuery IDE and use Gen AI directly without going to another platform or product. For instance, you can query the database and send the retrieved results straight to Gemini (or, through Vertex AI, add other models). You can use Gemini to classify intent, create embeddings and store them using BigQuery's vector database capabilities, run a semantic search, and generate SQL.
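As a rough illustration of how these pieces stay inside one platform, here is a minimal sketch using the google-cloud-bigquery Python client. The project, dataset, table, and remote-model names are hypothetical placeholders, and the exact options of the ML functions may vary by BigQuery release, so treat this as a shape, not a recipe.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my_project")  # hypothetical project id

# 1) Embed business definitions with a Gemini-backed remote model
#    (the model and tables below are placeholders).
embed_sql = """
CREATE OR REPLACE TABLE docs.term_embeddings AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_project.docs.embedding_model`,
  (SELECT definition AS content FROM `my_project.docs.business_terms`)
)
"""
client.query(embed_sql).result()

# 2) Semantic search over those embeddings for a user question,
#    all without leaving BigQuery.
search_sql = """
SELECT base.content
FROM VECTOR_SEARCH(
  TABLE `my_project.docs.term_embeddings`,
  'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `my_project.docs.embedding_model`,
     (SELECT 'What is our churn rate?' AS content))),
  top_k => 3
)
"""
for row in client.query(search_sql).result():
    print(row.content)
```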

All of that within a single platform, without the hassle of managing multiple subscriptions.

Of course, like everything in life, it has its drawbacks.

One of the main cons is that BigQuery may not be the cheapest database, and I have heard stories of startups where a single careless query drained the credit card. It has not happened to me, but I can see how it could. Another con is that you get entirely locked in with Google. Maybe that is not a bad thing; we are all locked in with Gmail in much the same way. Perhaps in the future, AI will be a commodity, the way email is now.

Another drawback is the lack of granular traceability of token costs, and of a kind of "mock LLM" for development; you do not want to call the real, expensive LLM during development. A sketch of the kind of workaround I mean follows.
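This gap can be papered over in your own code. Below is a minimal sketch of a swap-in mock plus a cost-trail wrapper; the interface and class names are my own invention, not part of any Google SDK.

```python
from typing import Protocol

class LLMClient(Protocol):
    def complete(self, prompt: str) -> str: ...

class MockLLM:
    # Development stand-in: deterministic, free, and offline.
    def complete(self, prompt: str) -> str:
        return "SELECT 1 AS placeholder"

class TokenCountingLLM:
    # Wraps any client to record the per-call cost trail the platform lacks.
    def __init__(self, inner: LLMClient):
        self.inner, self.calls = inner, []

    def complete(self, prompt: str) -> str:
        response = self.inner.complete(prompt)
        # Character counts as a rough proxy; a real tracker would read the
        # usage metadata returned by the model API.
        self.calls.append({"prompt_chars": len(prompt), "response_chars": len(response)})
        return response

llm = TokenCountingLLM(MockLLM())  # dev: mock; prod: swap in the real Gemini client
print(llm.complete("Generate SQL for total sales by region"))
print(llm.calls)
```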

If you are okay with the cons above, you get a powerful product that combines multiple tools into a single cloud platform that can handle big data at massive scale.

I created the following repo as part of a Kaggle hackathon, where I explored these native BigQuery capabilities further. For more information, please visit the repo here:

https://github.com/garyzava/bigq-ethereum-rag

The Missing Piece: Rigorous Evaluation

A whimsical visualization of the multi-layered evaluation framework required for robust Text-to-SQL systems, assessing predicted answers against gold standards using metrics ranging from Execution Accuracy to LLM-based judging. Image by the author using Google's Gemini.

Now, going back to evaluation frameworks. Platforms like BigQuery simplify the pipeline, but they do not automatically solve the problem. I see multiple solutions on the market, but most of them lack robust evaluation capabilities.

If we accept that Text-to-SQL must be binary (correct or incorrect), we need evaluation strategies that reflect the messy reality of enterprise data, not the pristine environments of academic or demo datasets.

Evaluating a Text-to-SQL system is notoriously difficult because of the declarative nature of SQL and the complexity of your database schema. Does it have hundreds of tables? Are those tables well documented? Probably not. Are naming conventions consistent across all tables? Two queries can look completely different syntactically (e.g., different join orders, aliasing, or use of CTEs) yet produce equivalent results, as the small demo below shows.
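Here is a tiny, self-contained demonstration of that point, using an in-memory SQLite database with made-up data: two queries with very different SQL text, identical executed results. This is exactly why comparing query strings fails as an evaluation method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
""")

# Version 1: plain join with table aliases.
q1 = """
SELECT c.name, SUM(o.amount) AS total
FROM orders o JOIN customers c ON c.id = o.customer_id
GROUP BY c.name
"""

# Version 2: same logic via a CTE and reversed join order.
q2 = """
WITH totals AS (
  SELECT customer_id, SUM(amount) AS total
  FROM orders GROUP BY customer_id
)
SELECT c.name, t.total
FROM customers c JOIN totals t ON t.customer_id = c.id
"""

r1 = sorted(conn.execute(q1).fetchall())
r2 = sorted(conn.execute(q2).fetchall())
print(q1.strip() == q2.strip())  # False: the SQL strings differ completely
print(r1 == r2)                  # True: the result sets are identical
```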

To truly benchmark your RAG application during development and in production, you need to use the right metrics.

Metrics That Matter

Going back to the promise of self-service BI or analytics: this means the end user is relying 100% on themselves; unfortunately, there is no human in the loop or data expert to validate the results. For this reason, we need to establish an explainable AI or evaluation framework with a set of metrics to measure the quality of the generated SQL.

  1. The Shift to Execution Accuracy (EX): Early benchmarks relied on Exact Match (EM), which compared the predicted SQL string to the ground truth. This was deeply flawed, because it penalized valid syntactic variations. The modern standard is Execution Accuracy (EX). This metric executes both the predicted SQL and the "Gold" (ground truth) SQL against the actual database and compares the returned result sets. This correctly validates queries regardless of how they are written.
  2. Focused Evaluation: In enterprise contexts, a query might return extra, non-essential columns (e.g., an ID column used for a join). Strict execution accuracy might mark this as a failure. "Execution-based focused evaluation" allows for a more nuanced comparison, checking whether the target columns and values are correct, while being more lenient on extraneous data or row ordering.
  3. The "Soft-F1" Metric: To mitigate the binary nature of Execution Accuracy (where one incorrect cell fails the entire test), Soft-F1 is increasingly used. This metric provides partial credit by calculating the overlap between the predicted and gold results. If a query returns 99 out of 100 correct rows, Soft-F1 reflects high performance, whereas EX would return 0. This is crucial for debugging (see the sketch after this list).
  4. LLM-as-a-Judge: Sometimes execution is impossible (e.g., missing private data, environment errors). In these cases, a capable LLM can be prompted to compare the semantic logic of the predicted SQL against the Gold SQL. While less objective than execution, it correlates highly with human judgment.
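As a minimal sketch of how EX and Soft-F1 relate, the snippet below compares two executed result sets as multisets of rows. This is my own simplified reading of the two metrics, not a reference implementation from any benchmark.

```python
from collections import Counter

def execution_accuracy(pred_rows, gold_rows):
    # Binary EX: 1 only if the result multisets match exactly (order-insensitive).
    return int(Counter(pred_rows) == Counter(gold_rows))

def soft_f1(pred_rows, gold_rows):
    # Partial credit: F1 over the row-level overlap of the two result multisets.
    if not pred_rows and not gold_rows:
        return 1.0
    overlap = sum((Counter(pred_rows) & Counter(gold_rows)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_rows)
    recall = overlap / len(gold_rows)
    return 2 * precision * recall / (precision + recall)

# A predicted query that got 99 of 100 rows right:
gold = [(i,) for i in range(100)]
pred = [(i,) for i in range(99)] + [(999,)]
print(execution_accuracy(pred, gold))  # 0    - one wrong row fails EX entirely
print(round(soft_f1(pred, gold), 2))   # 0.99 - Soft-F1 still reflects high quality
```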

Spider 2.0: The Enterprise Reality Check

Currently there are three notable evaluation frameworks: Spider 2.0, BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL), and SynSQL (based on synthetic data). However, the industry has suffered from a false sense of security created by outdated benchmarks. For years, it relied on Spider 1.0, which focused on small, clean SQLite databases (averaging fewer than 10 tables). Models were achieving 90%+ accuracy, leading many to believe the problem was "solved."

The framework that I always emphasize, which includes these modern metrics and truly tests enterprise readiness, is Spider 2.0.

Spider 2.0 (accepted at ICLR 2025) is a paradigm shift, designed to address this "reality gap" by introducing the complexities that break LLMs in production:

  1. Massive Scale: Enterprise schemas are huge. Spider 2.0 databases average 812 columns, with some exceeding 3,000. This scale often exceeds the LLM's context limits, forcing models to employ "Schema Linking" (retrieval) strategies simply to identify the relevant tables before generating SQL (a minimal retrieval sketch follows this list).
  2. Dialect Diversity: Real companies use Snowflake, BigQuery, and T-SQL, not just SQLite. Spider 2.0 enforces dialect diversity, requiring models to master specific syntax (e.g., handling nested JSON data using UNNEST or FLATTEN).
  3. External Knowledge: Business logic (like the definition of "Churn Rate") resides in documentation or project codebases (like dbt), not the schema. Spider 2.0 simulates this by providing external files (Markdown, YAML) the model must read to ground its reasoning.
  4. The Agentic Workflow: Crucially, Spider 2.0 models the workflow of a modern data engineer. It moves beyond static translation, evaluating the model's ability to explore the file system, read documentation, interact with live database instances, and debug errors iteratively.
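To illustrate what "Schema Linking" means in practice, here is a minimal sketch of ranking table descriptions against the question and keeping only the top-k tables for the prompt. The toy catalog is entirely made up, and a simple lexical-overlap score stands in for the embedding similarity a real system would use.

```python
# Hypothetical catalog mapping table names to short descriptions.
SCHEMA_CATALOG = {
    "fct_orders": "order transactions: order_id, customer_id, amount, order_date",
    "dim_customers": "customer master data: customer_id, name, segment, region",
    "fct_web_events": "clickstream events: session_id, page, event_ts",
    "dim_products": "product catalog: product_id, category, unit_price",
}

def score(question: str, description: str) -> float:
    # Fraction of question words that appear in the table description.
    q_words = set(question.lower().split())
    d_words = set(description.lower().replace(",", " ").split())
    return len(q_words & d_words) / max(len(q_words), 1)

def link_schema(question: str, top_k: int = 2) -> list[str]:
    # Rank all tables and keep only the few that fit in the LLM's context.
    ranked = sorted(SCHEMA_CATALOG,
                    key=lambda t: score(question, SCHEMA_CATALOG[t]),
                    reverse=True)
    return ranked[:top_k]

print(link_schema("total amount of orders per customer segment"))
# ['dim_customers', 'fct_orders'] - only these two schemas go into the prompt
```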

The difference in difficulty is stark. Models that dominate Spider 1.0 see their success rates drop to 10-20% on the complete Spider 2.0 benchmark, highlighting the deficiencies of current LLMs when faced with real-world complexity.

Conclusion: The Binary Bar for Enterprise Data

The journey from Business Intelligence to AI-driven analytics has been marked by increasing abstraction, but the fundamental requirement for data integrity remains unchanged. While the promise of Text-to-SQL is closer than ever, we must resist the allure of high scores on outdated benchmarks.

Achieving 90% accuracy may be academically interesting, but in the enterprise, it is industrially useless. The bar is binary: it works or it breaks trust.

As platforms like BigQuery simplify the integration of AI and data, it is imperative that we simultaneously adopt sophisticated evaluation methodologies and rigorous benchmarks like Spider 2.0. Only by testing against the messy reality of enterprise data can we develop Text-to-SQL applications reliable enough to bet the business on.

Until next time. I hope you have found this topic as fascinating as I do.

Further Reading

Fangyu Lei, Jixuan Chen, Yuxiao Ye, et al. "Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows." arXiv (Nov 2024), accepted to ICLR 2025 (Oral). https://arxiv.org/abs/2411.07763
