Combining Text-to-SQL with Semantic Seek for Retrieval Augmented Generation

LlamaIndex Blog

In this post, we showcase a powerful new query engine ( SQLAutoVectorQueryEngine ) in LlamaIndex that can leverage both a SQL database as well as a vector store to fulfill complex natural language queries over a mix of structured and unstructured data. This query engine can leverage the expressivity of SQL over structured data, and join it with unstructured context from a vector database. We showcase this query engine on a few examples and show that it can handle queries that make use of structured data, unstructured data, or both.

Check out the full guide here: https://gpt-index.readthedocs.io/en/latest/examples/query_engine/SQLAutoVectorQueryEngine.html.

Data lakes in enterprises typically contain both structured and unstructured data. Structured data is often stored in a tabular format in SQL databases, organized into tables with predefined schemas and relationships between entities. In contrast, unstructured data in data lakes lacks a predefined structure and doesn’t fit neatly into traditional databases. This type of data includes text documents, but also other multimodal formats such as audio recordings, videos, and more.

Large Language Models (LLMs) have the ability to extract insights from both structured and unstructured data. Some initial tooling and stacks have emerged for tackling both types of data:

  • Text-to-SQL: given a collection of table schemas, convert natural language into a SQL statement which can then be executed against the database.
  • Semantic search: store unstructured documents together with their embeddings in a vector database (e.g. Pinecone, Chroma, Milvus, Weaviate, etc.). At query time, fetch the most relevant documents by embedding similarity, and then insert them into the LLM input prompt to synthesize a response.

Each of these stacks solves particular use cases.
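As a minimal sketch of the first stack — here `fake_llm_to_sql` is a hypothetical stand-in for the actual LLM call, and the table contents are invented for illustration:

```python
import sqlite3

# Set up a toy table to run the synthesized SQL against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city TEXT, population INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2930000, "Canada"),
        ("Tokyo", 13960000, "Japan"),
        ("Berlin", 3645000, "Germany"),
    ],
)

def fake_llm_to_sql(question: str) -> str:
    # Hypothetical stand-in: a real system would prompt an LLM with the
    # table schema plus the question, and get a SQL string back.
    return "SELECT city FROM city_stats ORDER BY population DESC LIMIT 1"

sql = fake_llm_to_sql("Which city has the highest population?")
result = conn.execute(sql).fetchall()
print(result)  # [('Tokyo',)]
```

The key design point is that the LLM only produces the query; the database engine does the actual computation.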

Text-to-SQL Over Structured Data

In the structured setting, SQL is an extremely expressive language for operating over tabular data — in the case of analytics, you can compute aggregations, join information across multiple tables, sort by timestamp, and much more. Using an LLM to convert natural language to SQL can be thought of as a program synthesis “cheat code” — just let the LLM compile to the right SQL query, and let the SQL engine in the database handle the rest!

Text-to-SQL queries are well-suited for analytics use cases where the answer can be found by executing a SQL statement. They are not suited for cases where you need more detail than what is present in a structured table, or where you need more sophisticated ways of determining relevance to the query beyond simple constructs like WHERE conditions. Some example queries that text-to-SQL handles well:

  • “What is the average population of cities in North America?”
  • “What are the largest cities and populations in each respective continent?”
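Both example questions above compile to straightforward SQL. As a sketch — run here against a hypothetical city_stats table with invented numbers, using SQLite for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city TEXT, population INTEGER, continent TEXT)")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("New York", 8500000, "North America"),
        ("Toronto", 2930000, "North America"),
        ("Tokyo", 13960000, "Asia"),
    ],
)

# "What is the average population of cities in North America?"
avg = conn.execute(
    "SELECT AVG(population) FROM city_stats WHERE continent = 'North America'"
).fetchone()[0]
print(avg)  # 5715000.0

# "What are the largest cities and populations in each respective continent?"
largest = conn.execute(
    "SELECT continent, city, MAX(population) FROM city_stats GROUP BY continent"
).fetchall()
print(largest)
```

Aggregations and group-bys like these are exactly the operations a pure embedding-similarity lookup cannot express.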

Semantic Search over Unstructured Data

In the unstructured setting, the standard behavior for retrieval-augmented generation systems is to first perform retrieval and then synthesis. During retrieval, we look up the most relevant documents to the query by embedding similarity. Some vector stores also support additional metadata filters for retrieval. We can choose to manually specify the set of required filters, or have the LLM “infer” what the query string and metadata filters should be (see our auto-retrieval modules in LlamaIndex or LangChain’s self-query module).

Retrieval-augmented generation is well-suited for queries where the answer can be found within some sections of unstructured text. Most existing vector stores (e.g. Pinecone, Chroma) don’t offer a SQL-like interface; hence they are less suited for queries that involve aggregations, joins, sums, etc. Some example queries that this stack handles well:

  • “Tell me about the historical museums in Berlin”
  • “What does Jordan ask from Nick on behalf of Gatsby?”
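The retrieve-then-synthesize flow can be sketched with a toy in-memory store. The hand-written embeddings and the title metadata key below are illustrative stand-ins for a real embedding model and a real vector database:

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy document store: (text, embedding, metadata).
docs = [
    ("Berlin has many historical museums.", [0.9, 0.1, 0.0], {"title": "Berlin"}),
    ("Tokyo hosts many festivals.",         [0.1, 0.9, 0.0], {"title": "Tokyo"}),
    ("Toronto sits on Lake Ontario.",       [0.0, 0.1, 0.9], {"title": "Toronto"}),
]

def retrieve(query_emb, top_k=1, filters=None):
    # Optionally narrow the candidate set by metadata filters, then
    # rank the remaining documents by embedding similarity.
    candidates = [
        d for d in docs
        if filters is None or all(d[2].get(k) == v for k, v in filters.items())
    ]
    ranked = sorted(candidates, key=lambda d: cosine(query_emb, d[1]), reverse=True)
    return [d[0] for d in ranked[:top_k]]

# Query embedding close to the "Berlin museums" direction.
print(retrieve([1.0, 0.0, 0.0], top_k=1))
# ['Berlin has many historical museums.']

# Metadata filters restrict retrieval to a single city's chunks.
print(retrieve([1.0, 0.0, 0.0], top_k=1, filters={"title": "Tokyo"}))
# ['Tokyo hosts many festivals.']
```

A real system would then place the retrieved text into the LLM prompt to synthesize the final answer.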

Combining These Two Systems

For some queries, we may want to use information from both sources in order to give the best answer. Ideally this gives us the best of both worlds: the analytics capabilities over structured data, and semantic understanding over unstructured data.

Here’s an example use case:

  • You have access to a collection of articles about different cities, stored in a vector database
  • You also have access to a structured table containing statistics for each city.

Given this data collection, let’s take an example query: “Tell me about the arts and culture of the city with the highest population.”

The “proper” way to answer this query is roughly as follows:

  • Query the structured table for the city with the highest population:
SELECT city, population FROM city_stats ORDER BY population DESC LIMIT 1
  • Convert the original query into a more detailed query: “Tell me about the arts and culture of Tokyo.”
  • Ask the new query against your vector database.
  • Use the original query plus the intermediate queries/responses to the SQL db and vector db to synthesize the answer.
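The sequence above can be sketched as follows; both engines are stubbed out, and the query transform is done by simple string substitution rather than an LLM call:

```python
def sql_engine(question: str) -> str:
    # Stand-in for text-to-SQL + execution; a real engine hits the database.
    return "Tokyo"

def vector_engine(question: str) -> str:
    # Stand-in for retrieval + synthesis over the vector store.
    return f"Response about: {question}"

def transform_query(original: str, sql_answer: str) -> str:
    # Fold the SQL result back into the original question.
    return original.replace("the city with the highest population", sql_answer)

original = "Tell me about the arts and culture of the city with the highest population"
city = sql_engine(original)                  # step 1: structured retrieval
new_query = transform_query(original, city)  # step 2: query transformation
vector_response = vector_engine(new_query)   # step 3: semantic retrieval
final = vector_response                      # step 4: synthesis (stubbed here)
print(new_query)  # Tell me about the arts and culture of Tokyo
```

The real engine replaces each stub with an LLM-powered component, but the control flow is the same.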

Let’s think about some of the high-level implications of such a sequence:

  • Instead of doing embedding search (optionally with metadata filters) to retrieve relevant context, we want to somehow use a SQL query as a first “retrieval” step.
  • We want to make sure that we can somehow “join” the results from the SQL query with the context stored in the vector database. There is no existing language to “join” information between a SQL database and a vector database; we will have to implement this behavior ourselves.
  • Neither data source can answer this query on its own. The structured table only contains population information. The vector database contains city information but no easy way to query for the city with the maximum population.

We’ve created a brand-new query engine ( SQLAutoVectorQueryEngine ) that can query, join, sequence, and combine structured data from your SQL database and unstructured data from your vector database in order to synthesize the final answer.

The SQLAutoVectorQueryEngine is initialized by passing in a SQL query engine ( GPTNLStructStoreQueryEngine ) as well as a query engine that uses our vector store auto-retriever module ( VectorIndexAutoRetriever ). Both the SQL query engine and the vector query engine are wrapped as “Tool” objects containing a name and a description field.

Reminder: the VectorIndexAutoRetriever takes in a natural language query as input. Given some knowledge of the metadata schema of the vector database, the auto-retriever first infers the other necessary query parameters (e.g. the top-k value and metadata filters), and then executes a query against the vector database with all of those parameters.
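As a rough illustration of what the auto-retriever infers — here a toy keyword scan stands in for the LLM inference step, and the KNOWN_CITIES set is invented for the example:

```python
KNOWN_CITIES = {"Toronto", "Tokyo", "Berlin"}

def infer_query_params(natural_language_query: str) -> dict:
    # Toy stand-in for the LLM inference step: the real module prompts an
    # LLM with the metadata schema; here we just scan for known values.
    filters = {}
    for city in KNOWN_CITIES:
        if city in natural_language_query:
            filters["title"] = city
    return {
        "query_str": natural_language_query,
        "similarity_top_k": 2,
        "filters": filters,
    }

params = infer_query_params("Tell me about the arts and culture of Tokyo")
print(params["filters"])  # {'title': 'Tokyo'}
```

The real module produces the same shape of output (query string, top-k, metadata filters), but infers it with an LLM rather than a lookup table.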

Diagram of the flow for SQLAutoVectorQueryEngine

At query time, we run the following steps:

  1. A selector prompt (similar to the one used in our RouterQueryEngine , see guide) first chooses whether we should query the SQL database or the vector database. If it chooses the vector query engine, then the rest of the execution is the same as querying the RetrieverQueryEngine with VectorIndexAutoRetriever .
  2. If it chooses to query the SQL database, it will execute a text-to-SQL operation against the database, and (optionally) synthesize a natural language output.
  3. A query transformation is run to convert the original query into a more detailed query, given the results from the SQL query. For instance, if the original query is “Tell me about the arts and culture of the city with the highest population.” and the SQL query returns Tokyo as the city with the highest population, then the new query is “Tell me about the arts and culture of Tokyo.” The one exception is if the SQL query itself is sufficient to answer the original query; if it is, then execution returns with the SQL query as the response.
  4. The new query is then run through the vector store query engine, which performs retrieval from the vector store and then LLM response synthesis. We implement this using the VectorIndexAutoRetriever module. This allows us to automatically infer the right query parameters (query string, top-k, metadata filters), given the result of the SQL query. For instance, with the example above, we may infer the query to be something like query_str="arts and culture" and filters={"title": "Tokyo"} .
  5. The original query, SQL query, SQL response, vector store query, and vector store response are combined into a prompt to synthesize the final answer.

Taking a step back, here are some general comments about this approach:

  • Using our auto-retrieval module is our way of simulating a join between the SQL database and the vector database. We effectively use the results of the SQL query to determine the parameters with which to query the vector database.
  • This also means that there doesn’t need to be an explicit mapping between items in the SQL database and metadata in the vector database, since we can rely on the LLM to come up with the right query for different items. It would be interesting to model explicit relationships between structured tables and document store metadata though; that way we wouldn’t have to spend an extra LLM call in the auto-retrieval step inferring the right metadata filters.

So how well does this work? It works surprisingly well across a broad range of queries, from queries that leverage both structured and unstructured data to queries that are specific to either the structured or the unstructured data collection alone.

Setup

Our experiment setup is very simple. We have a SQL table called city_stats which contains the city name, population, and country of three different cities: Toronto, Tokyo, and Berlin.

We also use a Pinecone index to store Wikipedia articles corresponding to the three cities. Each article is chunked up and stored as a separate “Node” object; each chunk also contains a title metadata attribute containing the city name.
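A minimal sketch of that chunking scheme, with a hypothetical chunk_article helper (a real pipeline would use LlamaIndex’s node parsers and actual article text):

```python
def chunk_article(title: str, text: str, chunk_size: int = 100):
    # Split an article into fixed-size word chunks; each chunk becomes a
    # "Node"-like record carrying the city name as metadata, so filtered
    # retrieval can target a single city's article.
    words = text.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    return [{"text": c, "metadata": {"title": title}} for c in chunks]

# Illustrative 250-word placeholder article.
nodes = chunk_article("Berlin", "word " * 250)
print(len(nodes))              # 3
print(nodes[0]["metadata"])    # {'title': 'Berlin'}
```

Attaching the city name to every chunk is what later lets the auto-retriever narrow the search to one city.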

We then derive the VectorIndexAutoRetriever and RetrieverQueryEngine from the Pinecone vector index.

from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo
from llama_index.query_engine.retriever_query_engine import RetrieverQueryEngine

vector_store_info = VectorStoreInfo(
    content_info='articles about different cities',
    metadata_info=[
        MetadataInfo(
            name='title',
            type='str',
            description='The name of the city'
        ),
    ]
)
vector_auto_retriever = VectorIndexAutoRetriever(
    vector_index, vector_store_info=vector_store_info
)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)

You can also get the SQL query engine as follows:

sql_query_engine = sql_index.as_query_engine()

Both the SQL query engine and the vector query engine can be wrapped as QueryEngineTool objects.

sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        'Useful for translating a natural language query into a SQL query over a table containing: '
        'city_stats, containing the population/country of each city'
    )
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description='Useful for answering semantic questions about different cities',
)

Finally, we can define our SQLAutoVectorQueryEngine :

query_engine = SQLAutoVectorQueryEngine(
    sql_tool,
    vector_tool,
    service_context=service_context
)

Results

We run some example queries.

query_engine.query(
    'Tell me about the arts and culture of the city with the highest population'
)

Intermediate steps:

Final Response:

Tokyo is the city with the highest population, with 13.96 million people. It is a vibrant city with a rich culture and a variety of art forms. From traditional Japanese art such as calligraphy and woodblock prints to modern art galleries and museums, Tokyo has something for everyone. There are also many festivals and events throughout the year that celebrate the city's culture and art. Some popular festivals and events in Tokyo include the Sannō Festival at Hie Shrine, the Sanja Festival at Asakusa Shrine, and the biennial Kanda Festivals. These events often feature parades with elaborately decorated floats and thousands of people. Additionally, a massive fireworks display over the Sumida River takes place annually on the last Saturday of July, attracting over a million viewers. During spring, when cherry blossoms bloom, many residents gather in Ueno Park, Inokashira Park, and the Shinjuku Gyoen National Garden for picnics under the blossoms.

This query runs through the full flow of the SQLAutoVectorQueryEngine . It first queries the SQL database for the city with the highest population (“Tokyo”), and then queries the vector database with the new query. The results are combined into the final response.

query_engine.query("Tell me about the history of Berlin")

Intermediate Steps:

Final Response:

Berlin's history dates back to the early 13th century when it was founded as a small settlement. In 1618, the Margraviate of Brandenburg entered into a personal union with the Duchy of Prussia, and in 1701, they formed the Kingdom of Prussia with Berlin as its capital. The city grew and merged with neighboring cities, becoming a center of the Enlightenment under the rule of Frederick the Great in the 18th century.

The Industrial Revolution in the 19th century transformed Berlin, expanding its economy, population, and infrastructure. In 1871, it became the capital of the newly founded German Empire. The early 20th century saw Berlin as a hub for the German Expressionist movement and a major world capital known for its contributions to science, technology, arts, and other fields.

In 1933, Adolf Hitler and the Nazi Party came to power, leading to a decline in Berlin's Jewish community and the city's involvement in World War II. After the war, Berlin was divided into East and West Berlin, with the former under Soviet control and the latter under the control of the US, United Kingdom, and France. The Berlin Wall was built in 1961, physically and ideologically dividing the city until its fall in 1989. Following the reunification of Germany in 1990, Berlin once again became the capital of a unified Germany and has since continued to grow and develop as a major global city.

This query only requires the vector database, not the SQL database. The initial selector correctly identifies that we should just query the vector database and return the result.

query_engine.query('Can you give me the country corresponding to each city?')

Intermediate Steps:

Final Response:

 Toronto is in Canada, Tokyo is in Japan, and Berlin is in Germany.

This query can be answered by just querying the SQL database; it doesn't need additional information from the vector database. The query transform step correctly identifies “None” as the followup query, indicating that the original query has already been answered.

So far, the stacks around LLMs + unstructured data and LLMs + structured data have largely been separate. We're excited about how combining LLMs on top of both structured and unstructured data can unlock new retrieval/query capabilities in novel and interesting ways!

We'd love for you to try out the SQLAutoVectorQueryEngine and let us know what you think.

The full notebook walkthrough can be found in this guide (associated notebook).
