Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation

LlamaIndex Blog

In this post, we showcase a powerful new query engine ( SQLAutoVectorQueryEngine ) in LlamaIndex that can leverage both a SQL database and a vector store to fulfill complex natural language queries over a mix of structured and unstructured data. This query engine leverages the expressivity of SQL over structured data and joins it with unstructured context from a vector database. We showcase it on a few examples and show that it can handle queries that make use of structured data, unstructured data, or both.

Check out the full guide here: https://gpt-index.readthedocs.io/en/latest/examples/query_engine/SQLAutoVectorQueryEngine.html.

Data lakes in enterprises typically contain both structured and unstructured data. Structured data is typically stored in tabular format in SQL databases, organized into tables with predefined schemas and relationships between entities. Unstructured data, on the other hand, lacks a predefined structure and doesn't fit neatly into traditional databases. This kind of data includes text documents, but also other multimodal formats such as audio recordings, videos, and more.

Large Language Models (LLMs) have the ability to extract insights from both structured and unstructured data. Some initial tooling and stacks have emerged for tackling both kinds of data:

  • Given a collection of tabular schemas, convert natural language into a SQL statement which can then be executed against the database.
  • Store unstructured documents along with their embeddings in a vector database (e.g. Pinecone, Chroma, Milvus, Weaviate, etc.). At query time, fetch the relevant documents by embedding similarity, and then put them into the LLM input prompt to synthesize a response.

Each of these stacks solves particular use cases.

Text-to-SQL Over Structured Data

In the structured setting, SQL is an extremely expressive language for operating over tabular data: in the case of analytics, you can compute aggregations, join information across multiple tables, sort by timestamp, and much more. Using the LLM to convert natural language to SQL can be thought of as a program synthesis "cheat code": just let the LLM compile to the right SQL query, and let the SQL engine on the database handle the rest!

Text-to-SQL queries are well suited for analytics use cases where the answer can be found by executing a SQL statement. They are not suited for cases where you'd need more detail than what's present in a structured table, or where you'd need more sophisticated ways of determining relevance to the query beyond simple constructs like WHERE conditions.

  • “What is the average population of cities in North America?”
  • “What are the largest cities and populations in each respective continent?”
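The first query above, for instance, maps directly onto a SQL aggregation. Here is a minimal sketch using Python's built-in sqlite3 module; the table layout and population figures are made up purely for illustration:

```python
import sqlite3

# Toy city_stats table with made-up figures, purely to illustrate why a
# SQL engine can answer aggregation questions directly.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE city_stats (city TEXT, continent TEXT, population INTEGER)"
)
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", "North America", 2930000),
        ("Tokyo", "Asia", 13960000),
        ("Berlin", "Europe", 3645000),
    ],
)

# "What is the average population of cities in North America?"
avg_pop, = conn.execute(
    "SELECT AVG(population) FROM city_stats WHERE continent = 'North America'"
).fetchone()
print(avg_pop)
```

A text-to-SQL system would generate the SELECT statement from the natural language question; the database engine does the rest.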

Semantic Search over Unstructured Data

In the unstructured setting, the standard behavior for retrieval-augmented generation systems is to first perform retrieval and then synthesis. During retrieval, we first look up the most relevant documents to the query by embedding similarity. Some vector stores support additional metadata filters for retrieval. We can choose to manually specify the set of required filters, or have the LLM "infer" what the query string and metadata filters should be (see our auto-retrieval modules in LlamaIndex or LangChain's self-query module).

Retrieval Augmented Generation is well suited for queries where the answer can be obtained within some sections of unstructured text data. Most existing vector stores (e.g. Pinecone, Chroma) don't offer a SQL-like interface; hence they're less suited for queries that involve aggregations, joins, sums, etc.

  • “Tell me about the historical museums in Berlin”
  • “What does Jordan ask from Nick on behalf of Gatsby?”
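The retrieval step for queries like these can be sketched in a few lines of plain Python. The hand-made 3-dimensional "embeddings" below stand in for a real embedding model and vector store; only the cosine-similarity mechanics are real:

```python
import math

# Toy corpus: each doc has a hand-made vector standing in for a learned
# embedding, plus its text content.
docs = {
    "berlin_museums": ([1.0, 0.0, 0.2], "Berlin hosts many historical museums ..."),
    "gatsby_ch3": ([0.0, 1.0, 0.1], "Jordan relays Gatsby's request to Nick ..."),
}

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend embedding of "Tell me about the historical museums in Berlin"
query_vec = [0.9, 0.1, 0.2]
best_id = max(docs, key=lambda doc_id: cosine(query_vec, docs[doc_id][0]))
print(best_id)
```

A real pipeline would then place the retrieved text into the LLM prompt to synthesize the answer.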

Combining These Two Systems

For some queries, we may want to make use of information from both sources in order to give the best answer. Ideally this gives us the best of both worlds: analytics capabilities over structured data, and semantic understanding over unstructured data.

Here’s an example use case:

  • You have access to a collection of articles about different cities, stored in a vector database.
  • You also have access to a structured table containing statistics for each city.

Given this data collection, let's take an example query: “Tell me about the arts and culture of the city with the highest population.”

The “proper” way to answer this query is roughly as follows:

  • Query the structured table for the city with the highest population:
SELECT city, population FROM city_stats ORDER BY population DESC LIMIT 1
  • Convert the original query into a more detailed query: “Tell me about the arts and culture of Tokyo.”
  • Ask the new query over your vector database.
  • Use the original query plus the intermediate queries/responses to the SQL db and vector db to synthesize the answer.

Let's think about some of the high-level implications of such a sequence:

  • Instead of doing embedding search (and optionally metadata filters) to retrieve relevant context, we want to somehow have a SQL query as a first “retrieval” step.
  • We want to make sure that we can somehow “join” the results from the SQL query with the context stored in the vector database. There is no existing language to “join” information between a SQL and vector database; we have to implement this behavior ourselves.
  • Neither data source can answer this query on its own. The structured table only contains population information. The vector database contains city information but no easy way to query for the city with the maximum population.

We've created a brand-new query engine ( SQLAutoVectorQueryEngine ) that can query, join, sequence, and combine both structured data from your SQL database and unstructured data from your vector database in order to synthesize the final answer.

The SQLAutoVectorQueryEngine is initialized by passing in a SQL query engine ( GPTNLStructStoreQueryEngine ) as well as a query engine that uses our vector store auto-retriever module ( VectorIndexAutoRetriever ). Both the SQL query engine and vector query engine are wrapped as “Tool” objects containing a name and description field.

Reminder: the VectorIndexAutoRetriever takes a natural language query as input. Given some knowledge of the metadata schema of the vector database, the auto-retriever first infers the other necessary query parameters to pass in (e.g. top-k value and metadata filters), and then executes a query against the vector database with all of the query parameters.
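Concretely, the inferred parameters look roughly like the structure below. This is an illustrative sketch only; the actual LlamaIndex objects differ, but the values match the running example in this post:

```python
# Roughly what the auto-retriever infers for the query
# "Tell me about the arts and culture of Tokyo":
inferred_params = {
    "query_str": "arts and culture",   # semantic part of the question
    "top_k": 2,                        # how many chunks to retrieve
    "filters": {"title": "Tokyo"},     # metadata filter from the schema
}
print(inferred_params)
```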

Diagram of the flow for SQLAutoVectorQueryEngine

During query time, we run the following steps:

  1. A selector prompt (similar to the one used in our RouterQueryEngine , see guide) first chooses whether we should query the SQL database or the vector database. If it chooses the vector query engine, then the rest of the function execution is the same as querying the RetrieverQueryEngine with VectorIndexAutoRetriever .
  2. If it chooses to query the SQL database, it will execute a text-to-SQL query operation against the database, and (optionally) synthesize a natural language output.
  3. A query transform is run to convert the original query into a more detailed query, given the results from the SQL query. For instance, if the original query is “Tell me about the arts and culture of the city with the highest population.” and the SQL query returns Tokyo as the city with the highest population, then the new query is “Tell me about the arts and culture of Tokyo.” The one exception is if the SQL query by itself is enough to answer the original query; if it is, then function execution returns with the SQL response as the answer.
  4. The new query is then run through the vector store query engine, which performs retrieval from the vector store followed by LLM response synthesis. We enforce using a VectorIndexAutoRetriever module. This allows us to automatically infer the right query parameters (query string, top-k, metadata filters), given the result of the SQL query. For instance, with the example above, we may infer the query to be something like query_str="arts and culture" and filters={"title": "Tokyo"} .
  5. The original query, SQL query, SQL response, vector store query, and vector store response are combined into a prompt to synthesize the final answer.
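The routing decision in step 1 can be made concrete with a toy stand-in. In the real engine the selector is an LLM prompt over the two tool descriptions; the keyword heuristic below is purely illustrative and `select_engine` is a hypothetical helper:

```python
# Hypothetical stand-in for the LLM selector: route queries that mention
# tabular attributes to the SQL engine, everything else to the vector store.
def select_engine(query: str) -> str:
    structured_hints = ("population", "country", "average", "highest", "largest")
    if any(hint in query.lower() for hint in structured_hints):
        return "sql"
    return "vector"

print(select_engine("Tell me about the history of Berlin"))
print(select_engine("Can you give me the country corresponding to each city?"))
```

The first query routes to the vector store, the second to SQL, mirroring the experiments later in this post.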

Taking a step back, here are some general comments about this approach:

  • Using our auto-retrieval module is our way of simulating a join between the SQL database and vector database. We effectively use the results from our SQL query to determine the parameters to query the vector database with.
  • This also means that there doesn't need to be an explicit mapping between the items in the SQL database and the metadata in the vector database, since we can rely on the LLM to come up with the right query for different items. It would be interesting to model explicit relationships between structured tables and document store metadata though; that way we don't have to spend an extra LLM call in the auto-retrieval step inferring the right metadata filters.
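One possible shape for such an explicit mapping is simply a static table from SQL rows to vector-store metadata filters, so no LLM call is needed to infer them. The names below are illustrative, not part of LlamaIndex:

```python
# Hypothetical explicit mapping from SQL result values to the metadata
# filters the vector store expects; replaces the LLM inference step.
city_to_filters = {
    "Toronto": {"title": "Toronto"},
    "Tokyo": {"title": "Tokyo"},
    "Berlin": {"title": "Berlin"},
}

def filters_for(city: str) -> dict:
    # Look up the vector-store filter for a city returned by the SQL query.
    return city_to_filters[city]

print(filters_for("Tokyo"))
```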

So how well does this work? It works surprisingly well across a broad range of queries, from queries that can leverage both structured and unstructured data to queries that are specific to a structured or unstructured data collection.

Setup

Our experiment setup is very simple. We have a SQL table called city_stats which contains the city, population, and country of three different cities: Toronto, Tokyo, and Berlin.

We also use a Pinecone index to store Wikipedia articles corresponding to the three cities. Each article is chunked up and stored as a separate “Node” object; each chunk also contains a title metadata attribute containing the city name.
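The chunking described above can be sketched in plain Python. `chunk_article` is a hypothetical helper, not the LlamaIndex node parser, and the fixed character-based chunk size is a simplification:

```python
# Split an article into fixed-size chunks, attaching the city name as
# "title" metadata on every chunk (illustrative sketch only).
def chunk_article(city: str, text: str, chunk_size: int = 40):
    return [
        {"text": text[i:i + chunk_size], "metadata": {"title": city}}
        for i in range(0, len(text), chunk_size)
    ]

nodes = chunk_article("Berlin", "Berlin is the capital of Germany. " * 3)
print(len(nodes), nodes[0]["metadata"])
```

Because every chunk carries its city, a metadata filter like `{"title": "Berlin"}` restricts retrieval to that city's article.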

We then derive the VectorIndexAutoRetriever and RetrieverQueryEngine from the Pinecone vector index.

from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo
from llama_index.query_engine.retriever_query_engine import RetrieverQueryEngine

vector_store_info = VectorStoreInfo(
    content_info='articles about different cities',
    metadata_info=[
        MetadataInfo(
            name='city',
            type='str',
            description='The name of the city'
        ),
    ]
)
vector_auto_retriever = VectorIndexAutoRetriever(
    vector_index, vector_store_info=vector_store_info
)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)

You can also get the SQL query engine as follows:

sql_query_engine = sql_index.as_query_engine()

Both the SQL query engine and vector query engine can be wrapped as QueryEngineTool objects.

sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        'Useful for translating a natural language query into a SQL query '
        'over a table containing: city_stats, containing the '
        'population/country of each city'
    )
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description='Useful for answering semantic questions about different cities',
)

Finally, we can define our SQLAutoVectorQueryEngine:

query_engine = SQLAutoVectorQueryEngine(
    sql_tool,
    vector_tool,
    service_context=service_context
)

Results

We run some example queries.

query_engine.query(
    'Tell me about the arts and culture of the city with the highest population'
)

Intermediate steps:

Final Response:

Tokyo is the city with the highest population, with 13.96 million people. It is a vibrant city with a rich culture and a wide variety of art forms. From traditional Japanese art such as calligraphy and woodblock prints to modern art galleries and museums, Tokyo has something for everyone. There are also many festivals and events throughout the year that celebrate the city's culture and art. Some popular festivals and events in Tokyo include the Sannō Festival at Hie Shrine, the Sanja Festival at Asakusa Shrine, and the biennial Kanda Festivals. These events often feature parades with elaborately decorated floats and thousands of people. Additionally, a vast fireworks display over the Sumida River takes place annually on the last Saturday of July, attracting over a million viewers. During spring, when cherry blossoms bloom, many residents gather in Ueno Park, Inokashira Park, and the Shinjuku Gyoen National Garden for picnics under the blossoms.

This query runs through the full flow of the SQLAutoVectorQueryEngine . It first queries the SQL database for the city with the highest population (“Tokyo”), and then queries the vector database with the new query. The results are combined into a final response.

query_engine.query("Tell me about the history of Berlin")

Intermediate Steps:

Final Response:

Berlin's history dates back to the early 13th century when it was founded as a small settlement. In 1618, the Margraviate of Brandenburg entered into a personal union with the Duchy of Prussia, and in 1701, they formed the Kingdom of Prussia with Berlin as its capital. The city grew and merged with neighboring cities, becoming a center of the Enlightenment under the rule of Frederick the Great in the 18th century.

The Industrial Revolution in the 19th century transformed Berlin, expanding its economy, population, and infrastructure. In 1871, it became the capital of the newly founded German Empire. The early 20th century saw Berlin as a hub for the German Expressionist movement and a major world capital known for its contributions to science, technology, arts, and other fields.

In 1933, Adolf Hitler and the Nazi Party came to power, leading to a decline in Berlin's Jewish community and the city's involvement in World War II. After the war, Berlin was divided into East and West Berlin, with the former under Soviet control and the latter under the control of the United States, United Kingdom, and France. The Berlin Wall was built in 1961, physically and ideologically dividing the city until its fall in 1989. Following the reunification of Germany in 1990, Berlin once again became the capital of a unified Germany and has since continued to grow and develop as a major global city.

This query only requires the vector database, not the SQL database. The initial selector correctly identifies that we should just query the vector database and return the result.

query_engine.query('Can you give me the country corresponding to each city?')

Intermediate Steps:

Final Response:

Toronto is in Canada, Tokyo is in Japan, and Berlin is in Germany.

This query can be answered by just querying the SQL database; it doesn't need additional information from the vector database. The query transform step correctly identifies “None” as the followup question, indicating that the original query has been answered.

So far, the stacks around LLMs + unstructured data and LLMs + structured data have largely been separate. We're excited about how combining LLMs on top of both structured and unstructured data can unlock new retrieval/query capabilities in novel and interesting ways!

We'd love for you to try out the SQLAutoVectorQueryEngine and let us know what you think.

The full notebook walkthrough can be found in this guide (associated notebook).
