I loaded a dataset into both a graph database and a SQL database, then used various large language models (LLMs) to answer questions about the data through a retrieval-augmented generation (RAG) approach. By using the same dataset and questions across both systems, I evaluated which database paradigm delivers more accurate and insightful results.
Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language models (LLMs) by letting them retrieve relevant external information before generating an answer. Instead of relying solely on what the model was trained on, RAG dynamically queries a knowledge source (in this article a SQL or graph database) and integrates those results into its response. An introduction to RAG can be found here.
SQL databases organize data into tables made up of rows and columns. Each row represents a record, and each column represents an attribute. Relationships between tables are defined using keys and joins, and all data follows a fixed schema. SQL databases are ideal for structured, transactional data where consistency and precision are essential, for example finance, inventory, or patient records.
Graph databases store data as nodes (entities) and edges (relationships), with optional properties attached to each. Instead of joining tables, they represent relationships directly, allowing fast traversal across connected data. Graph databases are ideal for modelling networks and relationships, such as social graphs, knowledge graphs, or molecular interaction maps, where connections are as important as the entities themselves.
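To make the difference concrete, here is a generic illustration of how the same question (which constructor did a driver race for?) could be expressed in each paradigm. The table, column, and label names below are made up for illustration and are not necessarily the schema used later in this article.

# Illustrative only: generic SQL vs. Cypher formulations of the same question.
# Names like drivers, constructors, or :DROVE_FOR are placeholders.
sql_query = """
SELECT d.surname, c.name
FROM results r
JOIN drivers d ON d.driverId = r.driverId
JOIN constructors c ON c.constructorId = r.constructorId;
"""

cypher_query = """
MATCH (d:Driver)-[:DROVE_FOR]->(c:Constructor)
RETURN d.surname, c.name
"""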
Data
The dataset I used to compare the performance of RAGs contains Formula 1 results from 1950 to 2024. It includes detailed race results of drivers and constructors (teams), covering qualifying, sprint races, the main race, and even lap times and pit stop times. The standings of the drivers' and constructors' championships after every race are also included.
SQL Schema
This dataset is already structured in tables with keys, so a SQL database can be set up easily. The database's schema is shown below:
The races table is the central table, which is linked with all types of results as well as additional information such as seasons and circuits. The result tables are also linked with the drivers and constructors tables to record their result at each race. The championship standings after each race are stored in the driver standings and constructor standings tables.
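If you want to inspect the SQLite file yourself, here is a minimal sketch that lists all tables, assuming the same DATABASE_PATH that the code further below uses:

import sqlite3

from config import DATABASE_PATH  # same path used by the RAG chain below

# List all tables of the SQLite database (this should show the 14 tables)
conn = sqlite3.connect(DATABASE_PATH)
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print([name for (name,) in tables])
conn.close()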
Graph Schema
The schema of the graph database is shown below:

Since graph databases can store information in nodes and relationships, only six node types are needed compared to the 14 tables of the SQL database. An intermediate node is used to model that a driver drove a car of a constructor at a specific race. Since driver–constructor pairings change over time, this relationship has to be defined for each race. The race results are stored in the relationships, for example between the intermediate node and the race node, while other relationships contain the driver and constructor championship standings after each race.
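To give a feel for how such a graph can be queried directly, here is a minimal sketch using the official neo4j Python driver. The connection details, node labels, relationship types, and property names are hypothetical placeholders; the actual graph schema may differ.

from neo4j import GraphDatabase

# Hypothetical Cypher query: labels, relationship types, and property names
# are placeholders and may not match the actual graph schema.
cypher = """
MATCH (d:Driver)<-[:DRIVEN_BY]-(entry)-[res:RACED_IN]->(r:Race {year: 1992, name: 'Belgian Grand Prix'})
WHERE res.position = 1
RETURN d.forename, d.surname
"""

# Connection details are placeholders as well
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
records, _, _ = driver.execute_query(cypher)
print([record.data() for record in records])
driver.close()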
Querying the Database
I used LangChain to build a RAG chain for both database types that generates a query based on a user question, runs the query, and converts the query result into an answer for the user. The code can be found in this repo. I defined a generic system prompt that could be used to generate queries for any SQL or graph database. The only data-specific information was included by inserting the auto-generated database schema into the prompt. The system prompts can be found here.
Here is an example of how to initialize the model chain and ask the question: “What driver won the 92 Grand Prix in Belgium?”
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from qa_chain import GraphQAChain
from config import DATABASE_PATH
# connect to the database
connection_string = f"sqlite:///{DATABASE_PATH}"
db = SQLDatabase.from_uri(connection_string)
# initialize LLM
llm = ChatOpenAI(temperature=0, model="gpt-5")
# initialize qa chain
chain = GraphQAChain(llm, db, db_type='SQL', verbose=True)
# ask a question
chain.invoke("What driver won the 92 Grand Prix in Belgium?")
Which returns:
{'write_query': {'query': "SELECT d.forename, d.surname
FROM results r
JOIN races ra ON ra.raceId = r.raceId
JOIN drivers d ON d.driverId = r.driverId
WHERE ra.year = 1992
AND ra.name = 'Belgian Grand Prix'
AND r.positionOrder = 1
LIMIT 10;"}}
{'execute_query': {'result': "[('Michael', 'Schumacher')]"}}
{'generate_answer': {'answer': 'Michael Schumacher'}}
The SQL query joins the results, races, and drivers tables, selects the 1992 Belgian Grand Prix, and picks the driver who finished first. The LLM converted the year 92 to 1992 and the race name from “Grand Prix in Belgium” to “Belgian Grand Prix”. It derived these conversions from the database schema, which included three sample rows of each table. The query result is “Michael Schumacher”, which the LLM returned as the answer.
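Under the hood, the three keys in the output above (write_query, execute_query, and generate_answer) correspond to three chained steps. Here is a minimal sketch of how these steps could be wired together with standard LangChain components for the SQL case; the actual GraphQAChain in the repo may be implemented differently.

from langchain.chains import create_sql_query_chain
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# llm and db are the ChatOpenAI model and SQLDatabase created above
write_query = create_sql_query_chain(llm, db)    # step 1: question -> SQL query
execute_query = QuerySQLDataBaseTool(db=db)      # step 2: run the query against the database
answer_prompt = ChatPromptTemplate.from_template(
    "Answer the user question given the SQL query and its result.\n"
    "Question: {question}\nQuery: {query}\nResult: {result}"
)
generate_answer = answer_prompt | llm | StrOutputParser()  # step 3: turn the result into an answer

question = "What driver won the 92 Grand Prix in Belgium?"
query = write_query.invoke({"question": question})
result = execute_query.invoke(query)
print(generate_answer.invoke({"question": question, "query": query, "result": result}))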
Evaluation
Now the question I want to answer is whether an LLM is better at querying the SQL or the graph database. I defined three difficulty levels (easy, medium, and hard): easy questions could be answered by querying data from just one table or node, medium questions required one or two links among tables or nodes, and hard questions required more links or subqueries. For each difficulty level I defined five questions. Additionally, I defined five questions that could not be answered with data from the database.
I answered each question with three LLM models (GPT-5, GPT-4, and GPT-3.5-turbo) to analyze whether the most advanced models are needed or whether older and cheaper models can also produce satisfactory results. If a model gave the correct answer, it got 1 point; if it replied that it could not answer the question, it got 0 points; and if it gave a wrong answer, it got -1 point. All questions and answers are listed here. Below are the scores of all models and database types:
| Model | Graph DB | SQL DB |
|---|---|---|
| GPT-3.5-turbo | -2 | 4 |
| GPT-4 | 7 | 9 |
| GPT-5 | 18 | 18 |
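Expressed in code, the scoring scheme described above is just a sum over the graded answers; here is a small sketch with made-up labels, not the actual evaluation data:

# Scoring: +1 for a correct answer, 0 if the model says it cannot answer, -1 for a wrong answer
POINTS = {"correct": 1, "cannot_answer": 0, "wrong": -1}

def score(graded_answers: list[str]) -> int:
    return sum(POINTS[label] for label in graded_answers)

# Hypothetical example, not the real grading data
print(score(["correct", "correct", "cannot_answer", "wrong"]))  # -> 1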
It is remarkable how much more advanced models outperform simpler ones: GPT-3.5-turbo got about half of the questions wrong, GPT-4 got 2 to 3 questions wrong but could not answer 6 to 7 questions, and GPT-5 got all except one question correct. Simpler models seem to perform better with the SQL database than with the graph database, while GPT-5 achieved the same score with either database.
The only question GPT-5 got wrong using the SQL database was “Which driver won the most world championships?”. The answer “Lewis Hamilton, with 7 world championships” is not correct because both Lewis Hamilton and Michael Schumacher won 7 world championships. The generated SQL query aggregated the number of championships by driver, sorted them in descending order, and selected only the first row, although the driver in the second row had the same number of championships.
Using the graph database, the only question GPT-5 got wrong was “Who won the Formula 2 championship in 2017?”, which was answered with “Lewis Hamilton” (Lewis Hamilton won the Formula 1 but not the Formula 2 championship that year). This is a tricky question because the database only contains Formula 1 results, not Formula 2. The expected behavior would have been to reply that this question could not be answered based on the provided data. However, considering that the system prompt did not contain any specific information about the dataset, it is understandable that this question was not answered correctly.
Interestingly, using the SQL database, GPT-5 gave the correct answer “Charles Leclerc”. The generated SQL query only searched the drivers table for the name “Charles Leclerc”. Here the LLM must have recognized that the database does not contain Formula 2 results and answered this question from its general knowledge. Although this led to the correct answer in this case, it can be dangerous when the LLM does not use the provided data to answer questions. One way to reduce this risk could be to explicitly state in the system prompt that the database must be the only source used to answer questions.
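For example, a constraint along these lines could be appended to the system prompt; the wording is purely illustrative:

# Illustrative instruction that could be appended to the system prompt so the
# LLM does not fall back on its own world knowledge
GROUNDING_RULE = (
    "Use only the query results from the database to answer the question. "
    "If the database does not contain the required information, reply that "
    "the question cannot be answered from the available data."
)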
Conclusion
This comparison of RAG performance using a Formula 1 results dataset shows that the latest LLMs perform exceptionally well, producing highly accurate and contextually aware answers without any additional prompt engineering. While simpler models struggle, newer ones like GPT-5 handle complex queries with near-perfect precision. Importantly, there was no significant difference in performance between the graph and SQL database approaches: users can simply choose the database paradigm that best fits the structure of their data.
The dataset used here serves only as an illustrative example; results may differ for other datasets, especially those that require specialized domain knowledge or access to private data sources. Overall, these findings highlight how far retrieval-augmented LLMs have advanced in integrating structured data with natural language reasoning.
