Optimize LLMs with DSPy: A Step-by-Step Guide to Building, Optimizing, and Evaluating AI Systems


As the capabilities of large language models (LLMs) continue to expand, developing robust AI systems that leverage their potential has become increasingly complex. Conventional approaches often involve intricate prompting techniques, data generation for fine-tuning, and manual guidance to ensure adherence to domain-specific constraints. However, this process can be tedious, error-prone, and heavily reliant on human intervention.

Enter DSPy, a framework designed to streamline the development of AI systems powered by LLMs. DSPy introduces a systematic approach to optimizing LM prompts and weights, enabling developers to build sophisticated applications with minimal manual effort.

In this comprehensive guide, we'll explore the core principles of DSPy, its modular architecture, and the array of powerful features it offers. We'll also dive into practical examples, demonstrating how DSPy can transform the way you develop AI systems with LLMs.

What Is DSPy, and Why Do You Need It?

DSPy is a framework that separates the flow of your program (modules) from the parameters (LM prompts and weights) of each step. This separation allows for the systematic optimization of LM prompts and weights, enabling you to build complex AI systems with greater reliability, predictability, and adherence to domain-specific constraints.

Traditionally, developing AI systems with LLMs involved a laborious process of breaking the problem down into steps, crafting intricate prompts for each step, generating synthetic examples for fine-tuning, and manually guiding the LMs to adhere to specific constraints. This approach was not only time-consuming but also prone to errors, as even minor changes to the pipeline, LM, or data could necessitate extensive rework of prompts and fine-tuning steps.

DSPy addresses these challenges by introducing a new paradigm: optimizers. These LM-driven algorithms can tune the prompts and weights of your LM calls, given a metric you want to maximize. By automating the optimization process, DSPy empowers developers to build robust AI systems with minimal manual intervention, enhancing the reliability and predictability of LM outputs.

DSPy’s Modular Architecture

At the heart of DSPy lies a modular architecture that facilitates the composition of complex AI systems. The framework provides a set of built-in modules that abstract various prompting techniques, such as dspy.ChainOfThought and dspy.ReAct. These modules can be combined and composed into larger programs, allowing developers to build intricate pipelines tailored to their specific requirements.

Each module encapsulates learnable parameters, including the instructions, few-shot examples, and LM weights. When a module is invoked, DSPy's optimizers can fine-tune these parameters to maximize the specified metric, ensuring that the LM's outputs adhere to the desired constraints and requirements.
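
To make this concrete, here is a minimal sketch of composing built-in modules into a larger program. The pipeline, module names, and signature strings below are illustrative assumptions, not from the original tutorial:

import dspy

class SummarizeThenAnswer(dspy.Module):
    # Hypothetical two-stage pipeline: condense a document, then answer from the summary
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought("document -> summary")
        self.answer = dspy.ChainOfThought("summary, question -> answer")
    def forward(self, document, question):
        summary = self.summarize(document=document).summary
        return self.answer(summary=summary, question=question)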

Optimizing with DSPy

DSPy introduces a range of powerful optimizers designed to enhance the performance and reliability of your AI systems. These optimizers leverage LM-driven algorithms to tune the prompts and weights of your LM calls, maximizing the desired metric while adhering to domain-specific constraints.

Some of the key optimizers available in DSPy include:

  1. BootstrapFewShot: This optimizer extends the signature by automatically generating and including optimized examples within the prompt sent to the model, implementing few-shot learning.
  2. BootstrapFewShotWithRandomSearch: Applies BootstrapFewShot several times with random search over generated demonstrations, selecting the best program found over the course of the optimization.
  3. MIPRO: Generates instructions and few-shot examples in each step, with the instruction generation being data-aware and demonstration-aware. It uses Bayesian Optimization to effectively search over the space of generation instructions and demonstrations across your modules.
  4. BootstrapFinetune: Distills a prompt-based DSPy program into weight updates for smaller LMs, allowing you to fine-tune the underlying LLM(s) for enhanced efficiency.

By leveraging these optimizers, developers can systematically optimize their AI systems, ensuring high-quality outputs while adhering to domain-specific constraints and requirements.

Getting Started with DSPy

To illustrate the power of DSPy, let's walk through a practical example of building a retrieval-augmented generation (RAG) system for question answering.

Step 1: Setting Up the Language Model and Retrieval Model

The first step involves configuring the language model (LM) and retrieval model (RM) within DSPy.

To install DSPy, run:

pip install dspy-ai

DSPy supports multiple LM and RM APIs, as well as local model hosting, making it easy to integrate your chosen models.

import dspy
# Configure the LM and RM
turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)
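
If you prefer to host models locally, a client such as dspy.OllamaLocal can stand in for the OpenAI client. Treat the client name and model below as assumptions that depend on your installed DSPy version and local setup:

# Hedged alternative: a locally hosted model via Ollama
# (client name and model are assumptions; check your DSPy version)
local_lm = dspy.OllamaLocal(model='llama3')
dspy.settings.configure(lm=local_lm, rm=colbertv2_wiki17_abstracts)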

Step 2: Loading the Dataset

Next, we'll load the HotPotQA dataset, which contains a collection of complex question-answer pairs typically answered in a multi-hop fashion.

from dspy.datasets import HotPotQA
# Load the dataset
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
# Specify the 'question' field as the input
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]
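
As a quick sanity check, you can inspect one of the loaded examples; the question and answer fields follow the HotPotQA schema:

# Inspect a single training example
train_example = trainset[0]
print(f"Question: {train_example.question}")
print(f"Answer: {train_example.answer}")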

Step 3: Building Signatures

DSPy uses signatures to define the behavior of modules. In this example, we'll define a signature for the answer generation task, specifying the input fields (context and question) and the output field (answer).

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
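
Before wiring the signature into a pipeline, you can exercise it directly with dspy.Predict; the context and question below are made up for illustration:

# Minimal sketch: use the signature on its own (example inputs are hypothetical)
generate_answer = dspy.Predict(GenerateAnswer)
pred = generate_answer(context="Gary Zukav's first book was The Dancing Wu Li Masters.",
                       question="What was Gary Zukav's first book?")
print(pred.answer)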

Step 4: Building the Pipeline

We'll build our RAG pipeline as a DSPy module, which consists of an initialization method (__init__) to declare the sub-modules (dspy.Retrieve and dspy.ChainOfThought) and a forward method (forward) to describe the control flow for answering the question using these modules.

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
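
You can smoke-test the unoptimized pipeline before compiling it; the sample question below is an arbitrary illustration:

# Run the pipeline once before optimization (sample question is hypothetical)
uncompiled_rag = RAG()
print(uncompiled_rag("When was the first FIFA World Cup held?").answer)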

Step 5: Optimizing the Pipeline

With the pipeline defined, we can now optimize it using DSPy's optimizers. In this example, we'll use the BootstrapFewShot optimizer, which generates and selects effective prompts for our modules based on a training set and a validation metric.

from dspy.teleprompt import BootstrapFewShot
# Validation metric
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
# Set up the optimizer
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
# Compile the program
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

Step 6: Evaluating the Pipeline

After compiling the program, it is important to evaluate its performance on a development set to ensure it meets the desired accuracy and reliability.

from dspy.evaluate import Evaluate
# Set up the evaluator
evaluate = Evaluate(devset=devset, metric=validate_context_and_answer, num_threads=4, display_progress=True, display_table=0)
# Evaluate the compiled RAG program
evaluation_result = evaluate(compiled_rag)
print(f"Evaluation Result: {evaluation_result}")

Step 7: Inspecting Model History

For a deeper understanding of the model's interactions, you can review the most recent generations by inspecting the model's history.

# Inspect the model's history
turbo.inspect_history(n=1)

Step 8: Making Predictions

With the pipeline optimized and evaluated, you can now use it to make predictions on new questions.

# Example question
question = "Which award did Gary Zukav's first book receive?"
# Make a prediction using the compiled RAG program
prediction = compiled_rag(question)
print(f"Question: {question}")
print(f"Answer: {prediction.answer}")
print(f"Retrieved Contexts: {prediction.context}")

Minimal Working Example with DSPy

Now, let's walk through another minimal working example using the GSM8K dataset and the OpenAI GPT-3.5-turbo model to demonstrate prompting tasks within DSPy.

Setup

First, ensure your environment is correctly configured:

import dspy
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
# Set up the LM
turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct', max_tokens=250)
dspy.settings.configure(lm=turbo)
# Load math questions from the GSM8K dataset
gsm8k = GSM8K()
gsm8k_trainset, gsm8k_devset = gsm8k.train[:10], gsm8k.dev[:10]
print(gsm8k_trainset)

The gsm8k_trainset and gsm8k_devset datasets contain a list of examples, each with a question and an answer field.
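
You can confirm this by accessing the fields of a single example:

# Access individual fields of a GSM8K example
example = gsm8k_trainset[0]
print(example.question)
print(example.answer)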

Define the Module

Next, define a custom program that uses the ChainOfThought module for step-by-step reasoning:

class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")
    def forward(self, question):
        return self.prog(question=question)

Compile and Evaluate the Model

Now compile it with the BootstrapFewShot teleprompter:

from dspy.teleprompt import BootstrapFewShot
# Set up the optimizer
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4)
# Optimize using the gsm8k_metric
teleprompter = BootstrapFewShot(metric=gsm8k_metric, **config)
optimized_cot = teleprompter.compile(CoT(), trainset=gsm8k_trainset)
# Set up the evaluator
from dspy.evaluate import Evaluate
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)
evaluate(optimized_cot)
# Inspect the model's history
turbo.inspect_history(n=1)

This example demonstrates how to set up your environment, define a custom module, compile a program, and rigorously evaluate its performance using the provided dataset and teleprompter configuration.

Data Management in DSPy

DSPy operates with training, development, and test sets. For each example in your data, you typically have three kinds of values: inputs, intermediate labels, and final labels. While intermediate or final labels are optional, having at least a few example inputs is essential.

Creating Example Objects

Example objects in DSPy are similar to Python dictionaries but include useful utilities:

qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")
print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)

Output:

Example({'question': 'This is a question?', 'answer': 'This is an answer.'}) (input_keys=None)
This is a question?
This is an answer.

Specifying Input Keys

In DSPy, Example objects have a with_inputs() method to mark specific fields as inputs:

print(qa_pair.with_inputs("question"))
print(qa_pair.with_inputs("question", "answer"))

Values can be accessed using the dot operator, and the inputs() and labels() methods return new Example objects containing only the input or non-input keys, respectively.
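
A short sketch of these utilities (the field names below are illustrative):

# Mark 'article' as the input; 'summary' then counts as a label
article_summary = dspy.Example(article="This is an article.", summary="This is a summary.").with_inputs("article")
print(article_summary.inputs())   # Example with input keys only
print(article_summary.labels())   # Example with non-input keys only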

Optimizers in DSPy

A DSPy optimizer tunes the parameters of a DSPy program (i.e., prompts and/or LM weights) to maximize a specified metric. DSPy offers various built-in optimizers, each employing different strategies.

Available Optimizers

  • BootstrapFewShot: Generates few-shot examples using provided labeled input and output data points.
  • BootstrapFewShotWithRandomSearch: Applies BootstrapFewShot multiple times with random search over generated demonstrations.
  • COPRO: Generates and refines new instructions for each step, optimizing them with coordinate ascent.
  • MIPRO: Optimizes instructions and few-shot examples using Bayesian Optimization.

Selecting an Optimizer

If you're unsure where to start, let the size of your dataset guide you:

For very little data (around 10 examples), use BootstrapFewShot.
For slightly more data (around 50 examples), use BootstrapFewShotWithRandomSearch.
For larger datasets (300+ examples), use MIPRO (see the sketch after this list).
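
For the larger-data regime, a MIPRO run looks roughly like the following. Its constructor and compile arguments have varied across DSPy releases, so treat the parameter names and values below as assumptions and check your installed version:

from dspy.teleprompt import MIPRO
# Sketch only: argument names and values are assumptions for your DSPy version
teleprompter = MIPRO(metric=YOUR_METRIC_HERE, num_candidates=10)
optimized_program = teleprompter.compile(YOUR_PROGRAM_HERE, trainset=YOUR_TRAINSET_HERE, num_trials=20, max_bootstrapped_demos=4, max_labeled_demos=4)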

Here's how to use BootstrapFewShotWithRandomSearch:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4, num_candidate_programs=10, num_threads=4)
teleprompter = BootstrapFewShotWithRandomSearch(metric=YOUR_METRIC_HERE, **config)
optimized_program = teleprompter.compile(YOUR_PROGRAM_HERE, trainset=YOUR_TRAINSET_HERE)

Saving and Loading Optimized Programs

After running a program through an optimizer, save it for future use:

optimized_program.save(YOUR_SAVE_PATH)

Load a saved program:

loaded_program = YOUR_PROGRAM_CLASS()
loaded_program.load(path=YOUR_SAVE_PATH)

Advanced Features: DSPy Assertions

DSPy Assertions automate the enforcement of computational constraints on LMs, enhancing the reliability, predictability, and correctness of LM outputs.

Using Assertions

Define validation functions and declare assertions immediately after the corresponding model generation. For example:

dspy.Suggest(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
)
dspy.Suggest(
    validate_query_distinction_local(prev_queries, query),
    "Query should be distinct from: " + "; ".join(f"{i+1}) {q}" for i, q in enumerate(prev_queries)),
)
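
DSPy also provides dspy.Assert for hard constraints: where a failed Suggest only logs a hint and continues after retries, a failed Assert keeps backtracking and ultimately raises an error. A minimal sketch:

# Hard constraint: unlike dspy.Suggest, a persistent failure here halts execution
dspy.Assert(
    len(query) <= 100,
    "Query should be short and less than 100 characters",
)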

Transforming Programs with Assertions

from dspy.primitives.assertions import assert_transform_module, backtrack_handler
baleen_with_assertions = assert_transform_module(SimplifiedBaleenAssertions(), backtrack_handler)

Alternatively, activate assertions directly on the program:

baleen_with_assertions = SimplifiedBaleenAssertions().activate_assertions()

Assertion-Driven Optimizations

DSPy Assertions work with DSPy optimizations, particularly with BootstrapFewShotWithRandomSearch, including settings like the following (a sketch appears after this list):

  • Compilation with Assertions
  • Compilation + Inference with Assertions
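
A sketch of the compilation setting, following the pattern in the DSPy assertions documentation; the metric placeholder and demo counts below are assumptions:

# Compilation with assertions: the assertion-activated program serves as the teacher
teleprompter = BootstrapFewShotWithRandomSearch(metric=YOUR_METRIC_HERE, max_bootstrapped_demos=2, num_candidate_programs=6)
compiled_with_assertions = teleprompter.compile(student=SimplifiedBaleenAssertions(), teacher=baleen_with_assertions, trainset=trainset, valset=devset)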

Conclusion

DSPy offers a powerful and systematic approach to optimizing language models and their prompts. By following the steps outlined in these examples, you can build, optimize, and evaluate complex AI systems with ease. DSPy's modular design and advanced optimizers allow for efficient and effective integration of various language models, making it a valuable tool for anyone working in the field of NLP and AI.

Whether you're building a simple question-answering system or a more complex pipeline, DSPy provides the flexibility and robustness needed to achieve high performance and reliability.
