Context Engineering — A Comprehensive Hands-On Tutorial with DSPy


You have probably heard the term Context Engineering by now. This article will cover the key ideas behind creating LLM applications using Context Engineering principles, visually explain these workflows, and share code snippets that apply these concepts practically.

Don't worry about copy-pasting the code from this article into your editor. At the end of this article, I'll share the GitHub link to the open-source code repository and a link to my 1-hour 20-minute YouTube course that explains the concepts presented here in greater detail.

Let’s begin!


What is Context Engineering?

There is a big gap between writing simple prompts and building production-ready applications. Context Engineering is an umbrella term that refers to the delicate art and science of fitting information into the context window of an LLM as it works on a task.

The precise scope of where Context Engineering begins and ends is debatable, but based on this tweet from Andrej Karpathy, we can identify the following key points:

  • It isn't atomic prompt engineering, where you ask the LLM one question and get a response
  • It is a holistic approach that breaks a larger problem up into multiple subproblems
  • These subproblems can be solved by multiple LLMs (or agents) in isolation. Each agent is supplied with the right context to perform its task
  • Each agent can be of appropriate capability and size, depending on the complexity of the task
  • Each agent can take intermediate steps to complete its task – the context is not just the information we input, it also includes everything the LLM sees during generation (e.g. reasoning steps, tool results, etc.)
  • The agents are connected with control flows, and we orchestrate exactly how information flows through our system
  • The information available to the agents can come from multiple sources – external databases with Retrieval-Augmented Generation (RAG), tool calls (like web search), memory systems, or classic few-shot examples
  • Agents can take actions while generating responses. Each action the agent can take must be well-defined so the LLM can interact with it through reasoning and acting
  • Finally, systems must be evaluated with metrics and maintained with observability. Monitoring token usage, latency, and cost against output quality is a key consideration

Important: How this article is structured


But before that, we must ask ourselves one question…

Why not pass everything into the LLM?

Research has shown that cramming every piece of information into the context of an LLM is far from ideal. Even though many frontier models claim to support "long-context" windows, they still suffer from issues like context rot, where performance degrades as the input grows longer.

A recent report from Chroma describes how increasing tokens can negatively impact LLM performance
(Source: Chroma)

Too much unnecessary information in an LLM's context can pollute the model's understanding, lead to hallucinations, and result in poor performance.

This is why simply having a large context window isn't enough. We need systematic approaches to context engineering.

Why DSPy?


For this tutorial, I have chosen the DSPy framework. I'll explain the reasoning for this choice shortly, but let me assure you that the concepts presented here apply to almost any prompting framework, including writing prompts in pure English.

DSPy is a declarative framework for building modular AI software. It neatly separates the two key elements of any LLM task:
(a) the input and output contracts passed into a module,
and (b) the logic that governs how information flows.

Let’s see an example!

Imagine we want to use an LLM to write a joke. Specifically, we want it to generate a setup, a punchline, and the full delivery in a comedian's voice.

Oh, and we also want the output in JSON format so that we can post-process individual fields of the dictionary after generation. For instance, perhaps we want to print the punchline on a T-shirt (assume someone has already written a convenient function for that).

import json
import openai

system_prompt = """
You are a comedian who tells jokes. You are always funny.
Generate the setup, punchline, and full delivery in the comedian's voice.

Output in the following JSON format:
{
  "setup": "...",
  "punchline": "...",
  "delivery": "..."
}

Your response must be parsable without errors in Python using json.loads().
"""

client = openai.Client()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=1,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Write a joke about AI"}
    ]
)

joke = json.loads(response.choices[0].message.content)  # Hope for the best

print_on_a_tshirt(joke["punchline"])

Notice how we post-process the LLM's response to extract the dictionary? What if something "bad" happened, like the LLM failing to generate the response in the desired format? Our entire code would fail, and there would be no printing on any T-shirts!

The above code is also quite difficult to extend. For instance, if we wanted the LLM to do chain-of-thought reasoning before generating the answer, we would need to write additional logic to parse that reasoning text correctly.

Moreover, it can be difficult to look at plain English prompts like these and understand what the inputs and outputs of the system are. DSPy solves all of the above. Let's rewrite the example using DSPy.

import dspy

class JokeGenerator(dspy.Signature):
    """You are a comedian who tells jokes. You are always funny."""
    query: str = dspy.InputField()

    setup: str = dspy.OutputField()
    punchline: str = dspy.OutputField()
    delivery: str = dspy.OutputField()

joke_gen = dspy.Predict(JokeGenerator)
joke_gen.set_lm(lm=dspy.LM("openai/gpt-4.1-mini", temperature=1))

result = joke_gen(query="Write a joke about AI")
print(result)
print_on_a_tshirt(result.punchline)

This approach gives you structured, predictable outputs that you can work with programmatically, eliminating the need for manual parsing or error-prone string manipulation.

DSPy Signatures make you explicitly define the inputs to the system ("query" in the above example) and the outputs (setup, punchline, and delivery), as well as their data types. They also tell the LLM the order in which you want them to be generated.

The output of the previous code block (minus the t-shirt stuff)

The dspy.Predict object is an example of a DSPy Module. With modules, you define how the LLM converts inputs to outputs. dspy.Predict is the most basic one – you can pass the query to it, as in joke_gen(query="Write a joke about AI"), and it will create a basic prompt to send to the LLM. Internally, DSPy just creates a prompt, as you can see below.

Once the LLM responds, DSPy creates Pydantic BaseModel objects that perform automatic schema validation and send back the output. If errors occur during this validation process, DSPy automatically attempts to repair them by re-prompting the LLM, thereby significantly reducing the chance of a program crash.

dspy.Predict vs dspy.ChainOfThought
In chain of thought, we ask the LLM to generate reasoning text before generating the answer (Source: Author)

Another common theme in context engineering is Chain of Thought. Here, we want the LLM to generate reasoning text before providing its final answer. This allows the LLM's context to be populated with its self-generated reasoning before it generates the final output tokens.

To do that, you can simply replace dspy.Predict with dspy.ChainOfThought in the example above. The rest of the code stays the same. You will now see that the LLM generates reasoning before the defined output fields.
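
Here is a minimal sketch of that swap, reusing the JokeGenerator signature from earlier:

joke_gen = dspy.ChainOfThought(JokeGenerator)  # was dspy.Predict(JokeGenerator)
joke_gen.set_lm(lm=dspy.LM("openai/gpt-4.1-mini", temperature=1))

result = joke_gen(query="Write a joke about AI")
print(result.reasoning)   # ChainOfThought adds a "reasoning" output field
print(result.punchline)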

Multi-Step Interactions and Agentic Workflows

The best part of DSPy's approach is how it decouples system dependencies (Signatures) from control flows (Modules), which makes writing code for multi-step interactions trivial (and fun!). In this section, let's see how we can build some simple agentic flows.

Sequential Processing

Let's remind ourselves of one of the key components of Context Engineering.

Let's continue with our joke generation example. We can easily separate out two subproblems: generating the idea is one, writing the joke is another.

Sequential flows
Sequential flows let us design LLM systems in a modular way, where each agent can be of appropriate strength/size and is given context and tools that are appropriate for its task (Illustrated by Author)

Let's have two agents then: the first agent generates a joke idea (setup and punchline) from a query. A second agent then generates the joke from this idea.

We're also running the first agent with gpt-4.1-mini and the second agent with the more powerful gpt-4.1.

Notice how we wrote our own dspy.Module called JokeGenerator below. It uses two separate DSPy modules, query_to_idea and idea_to_joke, to convert our original query into a JokeIdea and subsequently into a joke (as pictured above).

from pydantic import BaseModel

class JokeIdea(BaseModel):
    setup: str
    contradiction: str
    punchline: str

class QueryToIdea(dspy.Signature):
    """Generate a joke idea with setup, contradiction, and punchline."""
    query = dspy.InputField()
    joke_idea: JokeIdea = dspy.OutputField()

class IdeaToJoke(dspy.Signature):
    """Convert a joke idea into a full comedian delivery."""
    joke_idea: JokeIdea = dspy.InputField()
    joke = dspy.OutputField()

class JokeGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.query_to_idea = dspy.Predict(QueryToIdea)
        self.idea_to_joke = dspy.Predict(IdeaToJoke)

        # A smaller model for ideation, a stronger one for the final delivery
        self.query_to_idea.set_lm(lm=dspy.LM("openai/gpt-4.1-mini"))
        self.idea_to_joke.set_lm(lm=dspy.LM("openai/gpt-4.1"))

    def forward(self, query):
        idea = self.query_to_idea(query=query)
        joke = self.idea_to_joke(joke_idea=idea.joke_idea)
        return joke
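
A quick usage sketch (assuming the classes above are defined): calling the module invokes forward under the hood.

generator = JokeGenerator()
prediction = generator(query="Write a joke about AI")  # runs query_to_idea, then idea_to_joke
print(prediction.joke)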

Iterative Refinement

You can also implement iterative refinement, where the LLM reflects on and refines its outputs. For instance, we can write a refinement module whose context is the output of a previous LM and whose job is to provide feedback. The first LM can take this feedback as input and iteratively improve its response.

Iterative refinement
An illustration of iterative refinement. The Idea LM produces a "Setup", "Contradiction", and "Punchline" for a joke. The Joke LM generates a joke out of it. The Refinement LM provides feedback to the Joke LM to iteratively improve the final joke. (Source: Author)
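
Here is a minimal sketch of that loop. The JokeCritic and JokeRefiner signatures below are my own illustrative names (not from the repo), and IdeaToJoke is the signature we defined earlier:

class JokeCritic(dspy.Signature):
    """Give short, constructive feedback on how to make the joke funnier."""
    joke: str = dspy.InputField()
    feedback: str = dspy.OutputField()

class JokeRefiner(dspy.Signature):
    """Rewrite the joke, incorporating the feedback."""
    joke: str = dspy.InputField()
    feedback: str = dspy.InputField()
    improved_joke: str = dspy.OutputField()

class RefinedJokeGenerator(dspy.Module):
    def __init__(self, n_rounds: int = 2):
        super().__init__()
        self.idea_to_joke = dspy.ChainOfThought(IdeaToJoke)
        self.critic = dspy.Predict(JokeCritic)
        self.refiner = dspy.Predict(JokeRefiner)
        self.n_rounds = n_rounds

    def forward(self, joke_idea):
        joke = self.idea_to_joke(joke_idea=joke_idea).joke
        for _ in range(self.n_rounds):  # feedback loop
            feedback = self.critic(joke=joke).feedback
            joke = self.refiner(joke=joke, feedback=feedback).improved_joke
        return dspy.Prediction(joke=joke)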

Conditional Branching and Multi-Output Systems

Sometimes you want your agent to output multiple variations and then select the best among them. Let's look at an example of that.

Here we first define a joke judge: it takes several joke ideas as input and picks the index of the best one. That idea is then passed to the next stage.

import asyncio

num_samples = 5

class JokeJudge(dspy.Signature):
    """Given a list of joke ideas, you must pick the best joke."""
    joke_ideas: list[JokeIdea] = dspy.InputField()
    best_idx: int = dspy.OutputField(
        le=num_samples,
        ge=1,
        description="The index of the funniest joke (1-based)")

class ConditionalJokeGenerator(dspy.Module):
    def __init__(self):
        super().__init__()
        self.query_to_idea = dspy.ChainOfThought(QueryToIdea)
        self.judge = dspy.ChainOfThought(JokeJudge)
        self.idea_to_joke = dspy.ChainOfThought(IdeaToJoke)

    async def forward(self, query):
        # Generate multiple ideas in parallel
        idea_preds = await asyncio.gather(*[
            self.query_to_idea.acall(query=query)
            for _ in range(num_samples)
        ])
        joke_ideas = [p.joke_idea for p in idea_preds]

        # Judge and rank ideas
        best_idx = (await self.judge.acall(joke_ideas=joke_ideas)).best_idx

        # Select the best idea (the judge returns a 1-based index)
        best_idea = joke_ideas[best_idx - 1]

        # Convert from idea to joke
        return await self.idea_to_joke.acall(joke_idea=best_idea)
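
Since forward is async here, a minimal way to run it looks like this (a sketch, assuming a recent DSPy version with async support via acall):

generator = ConditionalJokeGenerator()
prediction = asyncio.run(generator.forward(query="Write a joke about AI"))
print(prediction.joke)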

Tool Calling

LLM applications often need to interact with external systems. This is where tool calling steps in. You can think of a tool as any Python function. You only need two things to define a Python function as an LLM tool:

  • A description of what the function does
  • A list of inputs and their data types
Tool Calling
An example of a tool: web search. Given a query, the LLM decides if a web search is necessary, generates a query for the web if so, and then incorporates the search results to generate the final answer (Illustration by Author)

Let's see an example of fetching news. We first write a simple Python function that uses Tavily. The function takes a search query and fetches recent news articles from the last 7 days.

import os
from tavily import TavilyClient

tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

def fetch_recent_news(query: str) -> list[str]:
    """Inputs a query string, searches for news, and returns the top results."""
    response = tavily_client.search(query, search_depth="advanced",
                                    topic="news", days=7, max_results=3)
    return [x["content"] for x in response["results"]]

Now let's use dspy.ReAct (REasoning and ACTing). The module automatically reasons about the user's query, decides when to call which tools, and incorporates the tool results into the final response. Doing this is pretty easy:

class HaikuGenerator(dspy.Signature):
    """
    Generates a haiku about the latest news on the query.
    Also create a simple file where you save the final summary.
    """
    query = dspy.InputField()
    summary = dspy.OutputField(desc="A summary of the latest news")
    haiku = dspy.OutputField()

program = dspy.ReAct(signature=HaikuGenerator,
                     tools=[fetch_recent_news],
                     max_iters=2)

program.set_lm(lm=dspy.LM("openai/gpt-4.1", temperature=0.7))
pred = program(query="OpenAI")

When the above code runs, the LLM first reasons about what the user wants and which tool to call (if any). Then it generates the name of the function and the arguments to call the function.

The news function is then called with the generated arguments, and the results are passed back into the LLM. The LLM then decides whether to call more tools or to "finish". If it reasons that it has enough information to answer the user's original request, it chooses to finish and generates the answer.
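
To see what actually happened, you can print the outputs and inspect the intermediate steps. The snippet below is a sketch assuming a recent DSPy version, where ReAct predictions carry a trajectory of the reasoning, tool calls, and observations:

print(pred.summary)
print(pred.haiku)
print(pred.trajectory)      # the thought / tool-call / observation steps ReAct took
dspy.inspect_history(n=1)   # the raw prompt and response of the last LLM call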

Advanced Tool Usage — Scratchpad and File I/O

An evolving standard for modern applications is to allow LLMs access to the file system: reading and writing files, moving between directories (with appropriate restrictions), grepping and searching text within files, and even running terminal commands!

This pattern opens up a ton of possibilities. It transforms the LLM from a passive text generator into an active agent capable of performing complex, multi-step tasks directly inside a user's environment. For instance, just displaying the list of tools available to Gemini CLI reveals a short but incredibly powerful collection of tools.

Gemini CLI Tools
A screenshot of the default tools available via Gemini CLI
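
As a flavor of this pattern, here is a minimal, deliberately restricted sketch of scratchpad-style file tools you could hand to dspy.ReAct. The function names are my own, not from any library:

import pathlib

SCRATCH_DIR = pathlib.Path("./scratchpad")
SCRATCH_DIR.mkdir(exist_ok=True)

def write_file(filename: str, content: str) -> str:
    """Writes content to a file inside the scratchpad directory."""
    path = SCRATCH_DIR / pathlib.Path(filename).name  # confine writes to the scratchpad
    path.write_text(content)
    return f"Wrote {len(content)} characters to {path}"

def read_file(filename: str) -> str:
    """Reads and returns the content of a file inside the scratchpad directory."""
    path = SCRATCH_DIR / pathlib.Path(filename).name
    return path.read_text()

agent = dspy.ReAct(signature=HaikuGenerator,
                   tools=[fetch_recent_news, write_file, read_file],
                   max_iters=5)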

A quick word on MCP Servers

Another recent paradigm in the space of agentic systems is MCP servers. MCPs need their own dedicated article, so I won't go over them in detail in this one.

MCP has quickly become the industry-standard way to serve specialized tools to LLMs. It follows the classic client-server architecture: the LLM (a client) sends a request to the MCP server, and the MCP server carries out the requested action and returns a result back to the LLM for downstream processing. MCPs are great for context engineering in particular because you can declare system prompt formats, resources, restricted database access, etc., for your application.

This repository has a great list of MCP servers that you can study to make your LLM applications connect with a wide range of applications.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation has become a cornerstone of modern AI application development. It is an architectural approach that injects external, up-to-date information that is contextually relevant to the user's query into the Large Language Model's (LLM's) context.

RAG pipelines consist of a preprocessing phase and an inference-time phase. During preprocessing, we process the reference data corpus and save it in a queryable format. In the inference phase, we process the user query, retrieve relevant documents from our database, and pass them into the LLM to generate a response.
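
The inference-time half can be sketched in a few lines of DSPy. The retriever below is a stand-in for whatever vector store or search index you use (its .search method is an assumption, not a DSPy API):

class AnswerWithContext(dspy.Signature):
    """Answer the question using only the provided context."""
    context: list[str] = dspy.InputField(description="retrieved documents")
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class SimpleRAG(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever  # any object with a .search(query, k) method
        self.respond = dspy.ChainOfThought(AnswerWithContext)

    def forward(self, question):
        docs = self.retriever.search(question, k=5)  # retrieve relevant chunks
        return self.respond(context=docs, question=question)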

Building RAG systems is complicated, and there has been a lot of great research and many engineering optimizations that have made life easier. I made a 17-minute video that covers all the elements of building a reliable RAG pipeline.

Some practical tips for good RAG

  • When preprocessing, generate additional metadata per chunk. This can be as simple as "questions this chunk answers". When saving the chunks to your database, also save the generated metadata!
class ChunkAnnotator(dspy.Signature):
    chunk: str = dspy.InputField()
    possible_questions: list[str] = dspy.OutputField(
           description="list of questions that this chunk answers"
           )
  • Query Rewriting: Directly using the user's query for RAG retrieval is often a bad idea. Users write pretty random things, which may not match the distribution of text in your corpus. Query rewriting does what it says – it "rewrites" the query, perhaps fixing grammar and spelling errors, contextualizing it with the past conversation, or even adding additional keywords that make querying easier.
class QueryRewriting(dspy.Signature):
    user_query: str = dspy.InputField()
    conversation: str = dspy.InputField(
           description="The conversation so far")
    modified_query: str = dspy.OutputField(
           description="a query that contextualizes the user query with the conversation history and is optimized for retrieval search"
           )
  • HyDE, or Hypothetical Document Embedding, is a type of query rewriting. In HyDE, we generate an artificial (or hypothetical) answer from the LLM's internal knowledge. This response often contains important keywords that directly match text in the answers database. Vanilla query rewriting is great for searching a database of questions, while HyDE is great for searching a database of answers (see the sketch after the figure below).
Direct retrieval vs Query rewriting vs HyDE
Direct Retrieval vs Query Rewriting vs HyDE (Source: Author)
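A minimal HyDE sketch: generate a hypothetical answer, then search with that text instead of the raw query (the retriever's semantic_search method is an assumed interface):
class HypotheticalAnswer(dspy.Signature):
    """Write a short, plausible answer to the question from your own knowledge."""
    question: str = dspy.InputField()
    hypothetical_answer: str = dspy.OutputField()

hyde = dspy.Predict(HypotheticalAnswer)

def hyde_search(retriever, question, k=5):
    fake_answer = hyde(question=question).hypothetical_answer
    # Search the *answers* database with the hypothetical answer, not the raw user query
    return retriever.semantic_search(fake_answer, k=k)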
  • Hybrid search is almost always better than purely semantic or purely keyword-based search. For semantic search, I'd use cosine-similarity nearest-neighbor search with vector embeddings, and for keyword search, BM25.
  • RRF: You can use multiple strategies to retrieve documents and then use Reciprocal Rank Fusion to merge them into one unified list (a minimal implementation sketch follows below)!
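Here is a minimal sketch of RRF, using the standard formula with the usual k=60 constant; it assumes each input is a ranked list of document ids or strings:
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists into one, scoring each doc by the sum of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)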
Multi-Hop Retrieval
Multi-Hop Retrieval and Hybrid HyDE Search (Illustrated by Author)
  • Multi-Hop Search is an option to consider as well if you can afford additional latency. Here, you pass the retrieved documents back into the LLM to generate new queries, which are used to conduct additional searches on the database.
class MultiHopHyDESearch(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        # QueryGeneration is a signature (defined elsewhere) that takes the original
        # query plus previously retrieved results and outputs semantic_query and bm25_query
        self.generate_queries = dspy.ChainOfThought(QueryGeneration)
        self.retriever = retriever

    def forward(self, query, n_hops=3):
        results = []

        for hop in range(n_hops):  # Notice we loop multiple times

            # Generate optimized search queries, conditioned on what we've already retrieved
            search_queries = self.generate_queries(
                query=query,
                previous_results=results
            )

            # Retrieve using both semantic and keyword search
            semantic_results = self.retriever.semantic_search(
                search_queries.semantic_query
            )
            bm25_results = self.retriever.bm25_search(
                search_queries.bm25_query
            )

            # Fuse results
            hop_results = reciprocal_rank_fusion([
                semantic_results, bm25_results
            ])
            results.extend(hop_results)

        return results
  • Citations: When asking the LLM to generate responses from the retrieved documents, we can also ask it to cite the documents it found useful. This allows the LLM to first generate a plan of how it is going to use the retrieved content (a sketch follows below).
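For example, something like this hypothetical signature (the field names are my own, for illustration):
class CitedAnswer(dspy.Signature):
    """Answer the question using the retrieved documents, citing the ones you relied on."""
    question: str = dspy.InputField()
    documents: list[str] = dspy.InputField()
    cited_indices: list[int] = dspy.OutputField(description="indices of the documents actually used")
    answer: str = dspy.OutputField()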
  • Memory: If you are building a chatbot, it is crucial to figure out the question of memory. You can think of memory as a combination of retrieval and tool calling. A well-known system is Mem0. The LLM observes new data and calls tools to decide whether it needs to add to or modify its existing memories. During question-answering, it retrieves relevant memories using RAG to generate answers.
The Mem0 architecture (Source: The Mem0 paper)

Best Practices and Production Considerations

This section isn't directly about Context Engineering, but more about best practices for building LLM apps for production.

1. Design Evaluation First

Before building features, determine how you'll measure success. This helps scope your application and guides optimization decisions.

Hyperparameters impacting LLM outputs
A lot of parameters impact the quality of an LLM's outputs (Illustrated by the Author)
  • If you can design verifiable or objective rewards, that's the best option (example: classification tasks where you have a validation dataset)
  • If not, can you define functions that heuristically evaluate LLM responses for your use case? (example: the number of times a particular chunk is retrieved for a given question)
  • If not, can you get humans to annotate your LLM's responses?
  • If nothing else works, use an LLM as a judge to evaluate responses. Usually, you want to set up your evaluation as a comparison study, where the judge receives multiple responses produced using different hyperparameters/prompts and must rank which ones are best (a minimal judge sketch follows after the flowchart below).
Evaluation of LLM apps
A simple flowchart about evaluating LLM apps (Illustration by Author)
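
A minimal sketch of such a comparison-style judge (a hypothetical signature, not from the repo):

class ResponseRanker(dspy.Signature):
    """Rank the candidate responses to the question from best to worst."""
    question: str = dspy.InputField()
    candidates: list[str] = dspy.InputField(description="responses produced with different prompts/hyperparameters")
    ranking: list[int] = dspy.OutputField(description="candidate indices, best first")

judge = dspy.ChainOfThought(ResponseRanker)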

3. Use Structured Outputs Almost Everywhere

Always prefer structured outputs over free-form text. It makes your system more reliable and easier to debug. You can add validation and retries as well!

4. Design for failure

When designing prompts or DSPy modules, make sure you always consider: "What happens if things go wrong?"

Like any good software, minimizing error states and failing gracefully is the ideal scenario.

5. Monitor Everything

DSPy integrates with MLflow to track:

  • Individual prompts passed into the LLM and their responses
  • Token usage and costs
  • Latency per module
  • Success/failure rates
  • Model performance over time
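
As a sketch, enabling MLflow tracing can be as simple as the following (assuming a recent MLflow version with DSPy autologging support):

import mlflow

mlflow.set_experiment("context-engineering")
mlflow.dspy.autolog()  # traces DSPy module calls with their prompts and responses

# Any DSPy program you run afterwards is traced automatically
result = joke_gen(query="Write a joke about AI")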

Langfuse and Logfire are equally great alternatives.

Outro

Context engineering represents a paradigm shift from simple prompt engineering to building comprehensive and modular LLM applications.

The DSPy framework provides the tools and abstractions needed to implement these patterns systematically. As LLM capabilities continue to evolve, context engineering will become increasingly crucial for building applications that effectively leverage the power of large language models.

To watch the full video course on which this article is based, please visit this YouTube link.

To access the full GitHub repo, visit:

https://github.com/avbiswas/context-engineering-dspy


References

Author's YouTube channel: https://www.youtube.com/@avb_fj

Author's Patreon: https://www.patreon.com/NeuralBreakdownwithAVB

Author's Twitter (X) account: https://x.com/neural_avb

Full Context Engineering video course: https://youtu.be/5Bym0ffALaU

GitHub link: https://github.com/avbiswas/context-engineering-dspy
