Governed Text-to-SQL: Fan-Trap Prevention & MCP

Q: Why does naive text-to-SQL fail on production analytical queries?

Naive text-to-SQL gives the LLM raw schema and asks it to write SQL. The LLM doesn't see business rules, doesn't know your join paths, and silently fan-traps on multi-fact queries. The output is non-deterministic, prone to hallucinated columns, and unauditable.

Q: How does ontology-grounded text-to-SQL differ?

Ontology-grounded text-to-SQL replaces 'give the LLM the schema' with 'give the LLM a semantic graph of the schema'. The agent navigates concepts and relationships (via GraphRAG or SPARQL) instead of guessing tables and joins. The result: correct multi-table joins, no fan-traps, and queries that respect declared business rules.

Q: How does OrionBelt prevent fan-traps?

OrionBelt Semantic Layer (OBSL) detects multi-fact queries via the Composite Fact Layer (CFL), splits them, runs each independently, and combines via UNION ALL. SQL is built through a custom AST, not string templating, so the output is guaranteed correct and injection-safe. OrionBelt Analytics adds OBQC, a deterministic ontology-based validator that catches fan-traps, bad joins, and type mismatches before the query ever runs.

Q: Why MCP for text-to-SQL?

MCP (Model Context Protocol) standardizes how AI clients call remote tools. Exposing governed text-to-SQL via MCP means Claude, ChatGPT, Copilot, Cursor, Windsurf, and other MCP clients all get the same correct behavior. The agent calls a semantic query tool instead of writing freeform SQL it might get wrong.

Why Naive Text-to-SQL Fails in Production

Hand an LLM a database schema and a natural-language question, and it will return SQL. The SQL will often run. The result will sometimes be right. In production analytical workloads, three failure modes dominate:

Hallucinated schema: columns that don't exist, joins on keys that aren't keys, filters that reference the wrong grain.
Non-deterministic output: the same question produces a different SQL plan, a different filter ordering, and occasionally a different number on different runs.
Silent fan-traps: the killer. The query runs, returns a number, and the number is over-counted by a factor of how many rows fan out through the shared dimension.

The Fan-Trap, Concretely

Suppose you have two fact tables: orders and support_tickets. Both relate to customers. You ask: "Total order revenue and total support tickets, by customer, last quarter."

Anti-pattern: naive text-to-SQL

SELECT c.customer_id,
       SUM(o.amount)        AS revenue,
       COUNT(t.ticket_id)   AS tickets
FROM   customers   c
JOIN   orders      o ON o.customer_id = c.customer_id
JOIN   support_tickets t ON t.customer_id = c.customer_id
WHERE  o.created_at >= '2026-01-01'
  AND  t.created_at >= '2026-01-01'
GROUP  BY c.customer_id;

This runs cleanly. It returns plausible-looking numbers. Revenue is multiplied by the number of tickets per customer. Tickets is multiplied by the number of orders per customer. The CFO will not laugh when this hits a quarterly report.

Governed pattern: split via Composite Fact Layer

-- OBSL detects multi-fact and emits:
SELECT customer_id, SUM(revenue) AS revenue, MAX(tickets) AS tickets
FROM (
  SELECT customer_id, SUM(amount) AS revenue, NULL AS tickets
  FROM   orders
  WHERE  created_at >= '2026-01-01'
  GROUP  BY customer_id

  UNION ALL

  SELECT customer_id, NULL AS revenue, COUNT(*) AS tickets
  FROM   support_tickets
  WHERE  created_at >= '2026-01-01'
  GROUP  BY customer_id
)
GROUP BY customer_id;

The key insight: fan-trap prevention is a property of the SQL generator, not the LLM. You cannot prompt-engineer your way out of a fan-trap reliably. You need a generator that recognizes multi-fact queries and splits them by construction — this is exactly what the Composite Fact Layer (CFL) does.

From Schema-Grounded to Ontology-Grounded

The deeper fix is to stop giving the LLM raw schema. Schemas are noisy: column names like fk_user_id_v2, surrogate keys, audit columns, deprecated tables. An ontology, by contrast, models concepts: Customer, Order, Revenue, with declared relationships and constraints.

GraphRAG: discover the right concepts

OrionBelt Analytics generates an RDF/OWL ontology from your database schema and serves it via an MCP server. When the agent asks a question, it queries the ontology graph (up to 12-hop traversal) combined with vector embeddings (ChromaDB) for semantic schema discovery. The agent finds the right concepts, with the right join paths, before any SQL is written.

OBQC: validate before execution

Even with concept-level grounding, you want a deterministic safety net. OBQC (the Ontology-Based Query Checker) validates candidate SQL against the ontology and catches fan-traps, bad joins, and type mismatches before the query runs. Failure is loud and structured, not silent and statistical.

OBSL: compile via a custom SQL AST

Once the concepts are known, OrionBelt Semantic Layer compiles the request through its custom SQL Abstract Syntax Tree, emitting dialect-specific SQL for PostgreSQL, Snowflake, BigQuery, ClickHouse, Databricks, DuckDB/MotherDuck, Dremio, or MySQL. Because the AST is custom (not string templating), the output is guaranteed syntactically valid and injection-safe.

Serving Governed Text-to-SQL Through MCP and Postgres Wire

The transport matters. Exposing governed text-to-SQL through the Model Context Protocol means:

Any MCP-capable client (Claude, ChatGPT, Copilot, Cursor, Windsurf) gets the same correct behavior, without bespoke integration.
The agent sees stable, semantically-rich tools, not a raw database connection.
Streamable HTTP makes deployment trivial: stateless, scalable, behind any reverse proxy.
The full result envelope (compiled SQL, query plan, warnings, timing) flows back to the agent, so it can self-correct and so you can audit.

For tools that don't speak MCP, OBSL also exposes a PostgreSQL Wire Protocol endpoint. Any psql, JDBC, or ODBC client (and therefore most BI tools) can connect to OBSL as if it were a Postgres database. The same governed semantic model, the same fan-trap-free SQL generation, but reachable from anything that knows how to talk to Postgres.

Naive vs Governed Text-to-SQL

Property	Naive text-to-SQL	Governed text-to-SQL (OrionBelt)
Grounding	Raw schema	RDF/OWL ontology + semantic model
SQL construction	String generation by LLM	Custom SQL AST, dialect-specific emitter
Fan-trap handling	Silent over-count	Detected and split via CFL
Multi-table joins	Hopeful	Follows declared join paths
Validation	Run and pray	OBQC validates before execution
Determinism	Varies per call	Same input, same output
Injection safety	Depends on prompt	Guaranteed by AST
Audit	Question text only	Question + compiled SQL + plan + warnings + timing

Frequently Asked Questions

What is a fan-trap in SQL?

A fan-trap is the SQL anti-pattern where joining two independent fact tables through a shared dimension multiplies rows and produces silent over-counts. Naive text-to-SQL falls into this trap routinely on multi-fact analytical queries.

Why does naive text-to-SQL fail on production analytical queries?

It gives the LLM raw schema and asks it to write SQL. The LLM doesn't see business rules, doesn't know your join paths, and silently fan-traps on multi-fact queries. The output is non-deterministic, prone to hallucinated columns, and unauditable.

How does ontology-grounded text-to-SQL differ?

It replaces "give the LLM the schema" with "give the LLM a semantic graph of the schema". The agent navigates concepts and relationships instead of guessing tables and joins. Result: correct multi-table joins, no fan-traps, queries that respect declared business rules.

How does OrionBelt prevent fan-traps?

OBSL detects multi-fact queries via the Composite Fact Layer (CFL), splits them, runs each independently, and combines via UNION ALL. SQL is built through a custom AST, so output is guaranteed correct and injection-safe. OrionBelt Analytics adds OBQC, a deterministic ontology-based validator that catches fan-traps, bad joins, and type mismatches before the query ever runs.

Why MCP for text-to-SQL?

MCP standardizes how AI clients call remote tools. Exposing governed text-to-SQL via MCP means Claude, ChatGPT, Copilot, Cursor, Windsurf, and other MCP clients all get the same correct behavior. The agent calls a semantic query tool instead of writing freeform SQL.

Why Naive Text-to-SQL Fails in Production

The Fan-Trap, Concretely

From Schema-Grounded to Ontology-Grounded

GraphRAG: discover the right concepts

OBQC: validate before execution

OBSL: compile via a custom SQL AST

Serving Governed Text-to-SQL Through MCP and Postgres Wire

Naive vs Governed Text-to-SQL

Frequently Asked Questions

Related Topics

Try governed text-to-SQL live