Connecting the Dots for Better Movie Recommendations


One of the promises of retrieval-augmented generation (RAG) is that it allows AI systems to answer questions using up-to-date or domain-specific information, without retraining the model. But most RAG pipelines still treat documents and knowledge as flat and disconnected, retrieving isolated chunks based on vector similarity, with no sense of how those chunks relate.

To remedy RAG's ignorance of often-obvious connections between documents and chunks, developers have turned to graph RAG approaches, but they have often found that the benefits of graph RAG were not worth the added complexity of implementing it.

In our recent article on the open-source Graph RAG Project and GraphRetriever, we introduced a new, simpler approach that combines your existing vector search with lightweight, metadata-based graph traversal, and requires no graph construction or storage. The graph connections can be defined at runtime, or even at query time, by specifying which document metadata values you want to use to define graph "edges," and these connections are traversed during retrieval.

In this article, we expand on one of the use cases in the Graph RAG Project documentation (a demo notebook can be found here), a simple but illustrative example: searching movie reviews from a Rotten Tomatoes dataset, automatically connecting each review with its local subgraph of related information, and then putting together query responses with full context and relationships between movies, reviews, reviewers, and other data and metadata attributes.

The dataset: Rotten Tomatoes reviews and movie metadata

The dataset used in this case study comes from a public Kaggle dataset titled "Massive Rotten Tomatoes Movies and Reviews". It includes two primary CSV files:

  • rotten_tomatoes_movies.csv: structured information on over 200,000 movies, including fields like title, cast, directors, genres, language, release date, runtime, and box office earnings.
  • rotten_tomatoes_movie_reviews.csv: a collection of nearly 2 million user-submitted movie reviews, with fields such as review text, rating (e.g., 3/5), sentiment classification, review date, and a reference to the associated movie.

Each review is linked to a movie via a shared movie_id, creating a natural relationship between unstructured review content and structured movie metadata. This makes it an ideal candidate for demonstrating GraphRetriever's ability to traverse document relationships using metadata alone, with no need to manually construct or store a separate graph.

By treating metadata fields such as movie_id, genre, or even shared actors and directors as graph edges, we can build a connected retrieval flow that automatically enriches each query with related context.
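GraphRetriever handles this traversal for us later in the article; purely to build intuition, here is a toy, stdlib-only sketch of what "a metadata value match is an edge" means. The field names follow the dataset, but this is not GraphRetriever's actual implementation:

```python
# Toy documents: plain dicts of metadata mirroring the dataset's fields.
docs = [
    {"doc_type": "movie_info", "movie_id": "addams_family", "genre": "Comedy"},
    {"doc_type": "movie_review", "reviewed_movie_id": "addams_family", "rating": "3/5"},
    {"doc_type": "movie_review", "reviewed_movie_id": "toy_story", "rating": "4/5"},
]

def follow_edge(start, docs, edge):
    """An 'edge' is just a pair of metadata fields: the value of the
    source field on `start` must equal the target field on a neighbor."""
    source_field, target_field = edge
    value = start.get(source_field)
    return [d for d in docs if value is not None and d.get(target_field) == value]

review = docs[1]
related = follow_edge(review, docs, ("reviewed_movie_id", "movie_id"))
# `related` now contains the movie_info document for "addams_family"
```

Nothing here requires a graph database: the "graph" exists only as matching metadata values, which is exactly what makes this approach cheap to adopt.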

The challenge: putting movie reviews in context

A common goal in AI-powered search and recommendation systems is to let users ask natural, open-ended questions and get meaningful, contextual results. With a large dataset of movie reviews and metadata, we want to support full-context responses to prompts like:

  • “What are some good family movies?”
  • “What are some recommendations for exciting action movies?”
  • “What are some classic movies with amazing cinematography?”

A good answer to each of these prompts requires subjective review content together with semi-structured attributes like genre, audience, or visual style. To give a good answer with full context, the system must:

  1. Retrieve the most relevant reviews based on the user’s query, using vector-based semantic similarity
  2. Enrich each review with full movie details (title, release year, genre, director, etc.) so the model can present a complete, grounded recommendation
  3. Connect this information with other reviews or movies that provide even broader context, such as: What are other reviewers saying? How do other movies in the genre compare?

A standard RAG pipeline might handle step 1 well, pulling relevant snippets of text. But without knowledge of how the retrieved chunks relate to other information in the dataset, the model’s responses can lack context, depth, or accuracy.

How graph RAG addresses the challenge

Given a user’s query, a plain RAG system might recommend a movie based on a small set of directly semantically relevant reviews. But graph RAG and GraphRetriever can easily pull in relevant context, for instance other reviews of the same movies or other movies in the same genre, to compare and contrast before making recommendations.

From an implementation standpoint, graph RAG provides a clean, two-step solution:

Step 1: Build a standard RAG system

First, just as with any RAG system, we embed the document text using a language model and store the embeddings in a vector database. Each embedded review includes structured metadata, such as reviewed_movie_id, rating, and sentiment: information we’ll use to define relationships later. Each embedded movie description includes metadata such as movie_id, genre, release_year, director, etc.

This allows us to handle typical vector-based retrieval: when a user enters a query like “What are some good family movies?”, we can quickly fetch reviews from the dataset that are semantically related to family movies. Connecting these with broader context happens in the next step.
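Under the hood, this step is nearest-neighbor search over embedding vectors. As a toy illustration, with made-up three-dimensional vectors standing in for real model embeddings (a real system uses a learned embedding model and a vector store, as in the setup code later in this article):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-d "embeddings" standing in for real model outputs.
doc_vectors = {
    "review: fun for the whole family": [0.9, 0.1, 0.0],
    "review: terrifying slasher flick": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "good family movies"

# Retrieval = rank documents by similarity to the query vector.
best = max(doc_vectors, key=lambda d: cosine(query_vector, doc_vectors[d]))
```

The family-comedy review wins because its vector points in nearly the same direction as the query vector; real embedding models produce the same effect in hundreds or thousands of dimensions.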

Step 2: Add graph traversal with GraphRetriever

Once the semantically relevant reviews are retrieved in step 1 using vector search, we can then use GraphRetriever to traverse connections between reviews and their related movie records.

Specifically, the GraphRetriever:

  • Fetches relevant reviews via semantic search (RAG)
  • Follows metadata-based edges (like reviewed_movie_id) to retrieve information that is directly related to each review, such as movie descriptions and attributes, data about the reviewer, etc.
  • Merges the content into a single context window for the language model to use when generating an answer

A key point: no pre-built knowledge graph is required. The graph is defined entirely in terms of metadata and traversed dynamically at query time. If you want to expand the connections to include shared actors, genres, or time periods, you simply update the edge definitions in the retriever config; there is no need to reprocess or reshape the data.
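For example, a hypothetical configuration that also connects movies sharing a genre or a director needs only a longer edge list; everything else stays the same. This is a sketch with illustrative parameter values, where `vectorstore` is the store created in the setup code later in this article:

```python
from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

# Sketch: extra same-field edges link movies to each other,
# in addition to the review -> movie edge used in this article.
retriever = GraphRetriever(
    store=vectorstore,
    edges=[
        ("reviewed_movie_id", "movie_id"),  # review -> the movie it reviews
        ("genre", "genre"),                 # movie -> other movies in the same genre
        ("director", "director"),           # movie -> other movies by the same director
    ],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=2),
)
```

Note that max_depth is raised to 2 here so that traversal can hop from a review to its movie and then onward to related movies.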

So, when a user asks about exciting action movies with specific qualities, the system can bring in data points like the movie’s release year, genre, and cast, improving both relevance and readability. When someone asks about classic movies with amazing cinematography, the system can draw on reviews of older movies and pair them with metadata like genre or era, giving responses that are both subjective and grounded in facts.

In short, GraphRetriever bridges the gap between unstructured opinions (subjective text) and structured context (connected metadata), producing query responses that are more intelligent, trustworthy, and complete.

GraphRetriever in action

To show how GraphRetriever can connect unstructured review content with structured movie metadata, we walk through a basic setup using a sample of the Rotten Tomatoes dataset. This involves three main steps: creating a vector store, converting raw data into LangChain documents, and configuring the graph traversal strategy.

See the example notebook in the Graph RAG Project for complete, working code.

Create the vector store and embeddings

We start by embedding and storing the documents, just as we would in any RAG system. Here, we’re using OpenAIEmbeddings and the Astra DB vector store:

from langchain_astradb import AstraDBVectorStore
from langchain_openai import OpenAIEmbeddings

COLLECTION = "movie_reviews_rotten_tomatoes"

# Assumes Astra DB credentials (API endpoint and application token)
# are available to the client, e.g., via environment variables.
vectorstore = AstraDBVectorStore(
    embedding=OpenAIEmbeddings(),
    collection_name=COLLECTION,
)

The structure of data and metadata

We store and embed document content as we normally would for any RAG system, but we also preserve structured metadata for use in graph traversal. The document content is kept minimal (review text, movie title, description), while the rich structured data is stored in the “metadata” fields of the stored document object.
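The notebook builds these documents from the two CSV files; here is a stdlib-only sketch of the same split between minimal content and rich metadata, using one illustrative row per file. The column names `reviewText` and `scoreSentiment` are assumptions about the dataset, not verbatim from the notebook:

```python
import csv
import io

# One illustrative row per file (the real files hold over 200,000 movies
# and nearly 2 million reviews); column names here are assumptions.
movies_csv = """movie_id,title,genre,director
addams_family,The Addams Family,Comedy,Barry Sonnenfeld
"""
reviews_csv = """movie_id,reviewText,scoreSentiment
addams_family,A witty family comedy.,POSITIVE
"""

def rows(text):
    return list(csv.DictReader(io.StringIO(text)))

documents = []
for m in rows(movies_csv):
    documents.append({
        "page_content": m["title"],  # minimal text to embed
        "metadata": {**m, "doc_type": "movie_info"},
    })
for r in rows(reviews_csv):
    documents.append({
        "page_content": r["reviewText"],  # the review text is what gets embedded
        "metadata": {
            "doc_type": "movie_review",
            # Renamed so reviews and movies can be joined as a graph edge.
            "reviewed_movie_id": r["movie_id"],
            "sentiment": r["scoreSentiment"],
        },
    })
```

In the real pipeline, each of these dicts would become a LangChain `Document` and be passed to `vectorstore.add_documents(...)`.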

Here is example JSON from one movie document in the vector store:

> pprint(documents[0].metadata)

{'audienceScore': '66',
 'boxOffice': '$111.3M',
 'director': 'Barry Sonnenfeld',
 'distributor': 'Paramount Pictures',
 'doc_type': 'movie_info',
 'genre': 'Comedy',
 'movie_id': 'addams_family',
 'originalLanguage': 'English',
 'rating': '',
 'ratingContents': '',
 'releaseDateStreaming': '2005-08-18',
 'releaseDateTheaters': '1991-11-22',
 'runtimeMinutes': '99',
 'soundMix': 'Surround, Dolby SR',
 'title': 'The Addams Family',
 'tomatoMeter': '67.0',
 'author': 'Charles Addams,Caroline Thompson,Larry Wilson'}

Note that graph traversal with GraphRetriever uses only the attributes in this metadata field; it doesn’t require a specialized graph DB, and it doesn’t use any LLM calls or other expensive operations.

Configure and run GraphRetriever

The GraphRetriever traverses a simple graph defined by metadata connections. In this case, we define an edge from each review to its corresponding movie using the directional relationship between reviewed_movie_id (in reviews) and movie_id (in movie descriptions).

We use an “eager” traversal strategy, one of the simplest traversal strategies. See the Graph RAG Project documentation for more details about strategies.

from graph_retriever.strategies import Eager
from langchain_graph_retriever import GraphRetriever

retriever = GraphRetriever(
    store=vectorstore,
    edges=[("reviewed_movie_id", "movie_id")],
    strategy=Eager(start_k=10, adjacent_k=10, select_k=100, max_depth=1),
)

In this configuration:

  • start_k=10: retrieves 10 review documents using semantic search
  • adjacent_k=10: allows up to 10 adjacent documents to be pulled at each step of graph traversal
  • select_k=100: up to 100 total documents can be returned
  • max_depth=1: the graph is traversed only one level deep, from review to movie

Note that because each review links to exactly one reviewed movie, the graph traversal in this simple example would have stopped at depth 1 regardless of this parameter. See the Graph RAG Project for examples of more sophisticated traversal.

Invoking a query

You can now run a natural language query, such as:

INITIAL_PROMPT_TEXT = "What are some good family movies?"

query_results = retriever.invoke(INITIAL_PROMPT_TEXT)

And with a little sorting and reformatting of the text (see the notebook for details), we can print a basic list of the retrieved movies and reviews, for example:

 Movie Title: The Addams Family
 Movie ID: addams_family
 Review: A witty family comedy that has enough sly humour to keep adults chuckling throughout.

 Movie Title: The Addams Family
 Movie ID: the_addams_family_2019
 Review: ...The film's simplistic and episodic plot put a serious dampener on what might have been a welcome breath of fresh air for family animation.

 Movie Title: The Addams Family 2
 Movie ID: the_addams_family_2
 Review: This serviceable animated sequel focuses on Wednesday's feelings of alienation and benefits from the family's kid-friendly jokes and road trip adventures.
 Review: The Addams Family 2 repeats what the first movie did by taking the popular family and turning them into one of the most boringly generic kids movies in recent years.

 Movie Title: Addams Family Values
 Movie ID: addams_family_values
 Review: The title is apt. Using those morbidly sensual cartoon characters as pawns, the new movie Addams Family Values launches a witty assault on those with fixed ideas about what constitutes a loving family.
 Review: Addams Family Values has its moments -- rather a lot of them, actually. You knew that just from the title, which is a nice way of turning Charles Addams' family of ghouls, monsters and vampires loose on Dan Quayle.

We can then pass the above output to the LLM to generate a final response, using the full set of information from the reviews as well as the linked movies.
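The notebook contains the actual sorting-and-formatting code; a rough stand-in, assuming retrieved documents are represented as dicts with `page_content` and `metadata` (the notebook works with LangChain `Document` objects), might look like:

```python
from collections import defaultdict

def format_results(docs):
    """Group retrieved review documents under their movie's title and ID."""
    movies = {}                  # movie_id -> title
    reviews = defaultdict(list)  # movie_id -> review texts
    for doc in docs:
        meta = doc["metadata"]
        if meta.get("doc_type") == "movie_info":
            movies[meta["movie_id"]] = meta.get("title", meta["movie_id"])
        else:
            reviews[meta["reviewed_movie_id"]].append(doc["page_content"])
    lines = []
    for movie_id, title in movies.items():
        lines.append(f" Movie Title: {title}")
        lines.append(f" Movie ID: {movie_id}")
        lines.extend(f" Review: {text}" for text in reviews[movie_id])
        lines.append("")
    return "\n".join(lines)

sample = [
    {"page_content": "The Addams Family",
     "metadata": {"doc_type": "movie_info", "movie_id": "addams_family",
                  "title": "The Addams Family"}},
    {"page_content": "A witty family comedy.",
     "metadata": {"doc_type": "movie_review",
                  "reviewed_movie_id": "addams_family"}},
]
formatted_text = format_results(sample)
```

The resulting string is what gets interpolated into the prompt template below as `movie_reviews`.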

Setting up the final prompt and LLM call looks like this:

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

MODEL = ChatOpenAI(model="gpt-4o", temperature=0)

VECTOR_ANSWER_PROMPT = PromptTemplate.from_template("""

A list of Movie Reviews appears below. Please answer the Initial Prompt text
(below) using only the listed Movie Reviews.

Please include all movies that may be helpful to someone searching for movie
recommendations.

Initial Prompt:
{initial_prompt}

Movie Reviews:
{movie_reviews}
""")

formatted_prompt = VECTOR_ANSWER_PROMPT.format(
    initial_prompt=INITIAL_PROMPT_TEXT,
    movie_reviews=formatted_text,
)

result = MODEL.invoke(formatted_prompt)

print(result.content)

And the final response from the graph RAG system might look like this:

Based on the reviews provided, "The Addams Family" and "Addams Family Values" are recommended as good family movies. "The Addams Family" is described as a witty family comedy with enough humor to entertain adults, while "Addams Family Values" is noted for its clever take on family dynamics and its entertaining moments.

Keep in mind that this final response was the result of the initial semantic search for reviews mentioning family movies, plus expanded context from documents that are directly related to those reviews. By expanding the window of relevant context beyond simple semantic search, the LLM and the overall graph RAG system can put together more complete and more helpful responses.

Try It Yourself

The case study in this article shows how to:

  • Combine unstructured and structured data in your RAG pipeline
  • Use metadata as a dynamic knowledge graph without constructing or storing one
  • Improve the depth and relevance of AI-generated responses by surfacing connected context

In short, this is graph RAG in action: adding structure and relationships to help LLMs not just retrieve, but build context and reason more effectively. If you’re already storing rich metadata alongside your documents, GraphRetriever gives you a practical way to put that metadata to work, with no additional infrastructure.

We hope this inspires you to try GraphRetriever on your own data (it’s all open-source), especially if you’re already working with documents that are implicitly connected through shared attributes, links, or references.

You can explore the full notebook and implementation details here: Graph RAG on Movie Reviews from Rotten Tomatoes.
