Every request to an LLM is a fresh start. Unless you explicitly supply information from previous sessions, the model has no built‑in sense of continuity across requests or sessions. This stateless design is great for parallelism and safety, but it poses an enormous challenge for chat applications that require user-level personalization.
If your chatbot treats the user as a stranger each time they log in, how can it ever generate personalized responses?
In this article, we'll build a simple memory system from scratch, inspired by the popular Mem0 architecture.
Unless otherwise mentioned, all illustrations embedded here were created by me, the author.
The goal of this article is to teach readers about memory management as a context engineering problem. At the end of the article you will also find:
- A GitHub link to the full memory project, which you can host yourself
- An in-depth YouTube tutorial that goes over the concepts line by line.
Memory as a Context Engineering problem
Context Engineering is the process of filling the context of an LLM with all the relevant information it needs to complete a task. In my opinion, memory is one of the hardest and most interesting context engineering problems.
Tackling memory introduces you (as a developer) to some of the most important techniques required in nearly all context engineering problems, namely:
- Extracting structured information from raw text streams
- Summarization
- Vector databases
- Query generation and similarity search
- Query post-processing and re-ranking
- Agentic tool calling
And much more.
High‑level architecture
At a glance, the system should be able to do four things: extract, embed, retrieve, and maintain. Let's sketch the high-level plan before we start the implementation.
Components
• Extraction: Extracts candidate atomic memories from the current user-assistant messages.
• Vector DB: Embed the extracted factoids into continuous vectors and store them in a vector database.
• Retrieval: When the user asks a question, we'll generate a query with an LLM and retrieve memories similar to that query.
• Maintenance: Using a ReAct (Reasoning and Acting) loop, the agent decides whether to add, update, delete, or no-op based on the turn and contradictions with existing facts.
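Before we implement each piece, here is a minimal sketch of how these four components could fit together in a single chat turn. The helper names here (generate_response, update_memories) are placeholders for the functions we build in the sections below, not a finished API.

# Illustrative outline of one chat turn; the real pieces are built step by step below.
async def handle_turn(user_id: int, past_messages: list[dict], query: str) -> str:
    # Retrieval + response: the agent may call a memory-search tool, then answers
    # and flags whether this turn contains anything worth remembering.
    response, save_memory = await generate_response(past_messages, query)  # placeholder helper

    past_messages.extend([
        {"role": "user", "content": query},
        {"role": "assistant", "content": response},
    ])

    if save_memory:
        # Extraction + maintenance: turn the new messages into factoids, then
        # add / update / delete / no-op against the existing memory store.
        await update_memories(user_id=user_id, messages=past_messages)

    return response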

Let's see this in action!
2) Memory Extraction with DSPy: From Transcript to Factoids
In this section, let's design a robust extraction step that converts conversation transcripts into a handful of atomic, categorized factoids.

What we’re extracting and why it matters
The goal is to build a memory store that is a per-user, persistent, vector-backed database.
What’s a “good” memory?
A short, self-contained fact (an atomic unit) that can be embedded and retrieved later with high precision. For example, "User is allergic to peanuts" is a good memory; a whole paragraph of chat history is not.
With DSPy, extracting structured information is very straightforward. Consider the code snippet below.
- We define a DSPy signature called MemoryExtract.
- The inputs of this signature (annotated as InputField) are the transcript,
- and the expected output (annotated as OutputField) is a list of strings containing each factoid.
Context string in, list of memory strings out.
# ... other imports
import dspy
from pydantic import BaseModel

class MemoryExtract(dspy.Signature):
    """
    Extract relevant information from the conversation.
    Memories are atomic, independent factoids that we should learn about the user.
    If the transcript does not contain any information worth extracting, return an empty list.
    """
    transcript: str = dspy.InputField()
    memories: list[str] = dspy.OutputField()

memory_extractor = dspy.Predict(MemoryExtract)
In DSPy, the signature's docstring is used as a system prompt. We can customize the docstring to explicitly tailor the kind of information that the LLM will extract from the conversation.
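For example, if you only cared about long-lived user preferences, a hypothetical variant of the signature (not the one we use in this project) could narrow the instructions like this:

class PreferenceExtract(dspy.Signature):
    """
    Extract only long-lived user preferences (likes, dislikes, dietary needs, hobbies).
    Ignore small talk, one-off requests, and anything about the assistant itself.
    """
    transcript: str = dspy.InputField()
    memories: list[str] = dspy.OutputField()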
Finally, to extract memories, we pass the conversation history into the memory extractor as a JSON string. Take a look at the code snippet below.
import json
import asyncio

async def extract_memories_from_messages(messages):
    transcript = json.dumps(messages)
    with dspy.context(lm=dspy.LM(model=MODEL_NAME)):  # MODEL_NAME is your configured LLM
        out = await memory_extractor.acall(transcript=transcript)
    return out.memories  # returns a list of memories
That’s it! Let’s run the code with a dummy conversation and see what happens.
if __name__ == "__main__":
    messages = [
        {
            "role": "user",
            "content": "I like coffee"
        },
        {
            "role": "assistant",
            "content": "Got it!"
        },
        {
            "role": "user",
            "content": "actually, no I like tea more. I also like football"
        }
    ]
    memories = asyncio.run(extract_memories_from_messages(messages))
    print(memories)
'''
Outputs:
[
    "User used to like coffee, but does not anymore",
    "User likes tea",
    "User likes football"
]
'''
As you can see, we can extract independent, atomic factoids from conversations, with contradictions in the conversation resolved into current facts.
If DSPy interests you, check out this Context Engineering with DSPy article that goes deeper into the concept. Or watch the video below.
Embedding extracted memories
So we can extract memories from conversations. Next, let's embed them so we can eventually store them in a vector database.
In this project, we'll use QDrant as our vector database – they have a nice free tier that is incredibly fast and supports additional features like hybrid filtering (where you can pass SQL "where"-like attribute filters to your vector search queries).

Selecting the embedding model and fixing the dimension
For cost, speed, and solid quality on short factoids, we choose text-embedding-3-small. We pin the vector size to 64, which lowers storage and speeds up search while remaining expressive enough for concise memories. This is a hyperparameter we can tune later to fit our needs.
import openai

client = openai.AsyncClient()

async def generate_embeddings(strings: list[str]):
    out = await client.embeddings.create(
        input=strings,
        model="text-embedding-3-small",
        dimensions=64
    )
    embeddings = [item.embedding for item in out.data]
    return embeddings
To insert into QDrant, let's first create our collection and add an index on user_id. This will allow us to quickly filter our records by user.
from qdrant_client import AsyncQdrantClient, models
from qdrant_client.models import Distance, VectorParams

# In a separate module from the OpenAI client; fill in your own Qdrant URL / API key
client = AsyncQdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)

COLLECTION_NAME = "memories"

async def create_memory_collection():
    if not (await client.collection_exists(COLLECTION_NAME)):
        await client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(size=64, distance=Distance.DOT),
        )
        await client.create_payload_index(
            collection_name=COLLECTION_NAME,
            field_name="user_id",
            field_schema=models.PayloadSchemaType.INTEGER
        )
I like to define contracts using Pydantic at the top so that other modules know the output shape of these functions.
from pydantic import BaseModel

class EmbeddedMemory(BaseModel):
    user_id: int
    memory_text: str
    date: str
    embedding: list[float]

class RetrievedMemory(BaseModel):
    point_id: str
    user_id: int
    memory_text: str
    date: str
    score: float
Next, let’s write helper functions to insert, delete, and update memories.
from uuid import uuid4

async def insert_memories(memories: list[EmbeddedMemory]):
    """
    Given a list of memories, insert them into the database
    """
    await client.upsert(
        collection_name=COLLECTION_NAME,
        points=[
            models.PointStruct(
                id=uuid4().hex,
                payload={
                    "user_id": memory.user_id,
                    "memory_text": memory.memory_text,
                    "date": memory.date
                },
                vector=memory.embedding
            )
            for memory in memories
        ]
    )

async def delete_records(point_ids):
    """
    Delete a list of point ids from the database
    """
    await client.delete(
        collection_name=COLLECTION_NAME,
        points_selector=models.PointIdsList(
            points=point_ids
        )
    )
Similarly, let’s write one for searching. This accepts a search vector and a user_id, and fetches nearest neighbors to that vector.
from qdrant_client import models
from qdrant_client.models import Filter

async def search_memories(
    search_vector: list[float],
    user_id: int,
    topk_neighbors=5
):
    # Filter by user_id
    must_conditions: list[models.Condition] = [
        models.FieldCondition(
            key="user_id",
            match=models.MatchValue(value=user_id)
        )
    ]
    outs = await client.query_points(
        collection_name=COLLECTION_NAME,
        query=search_vector,
        with_payload=True,
        query_filter=Filter(must=must_conditions),
        score_threshold=0.1,
        limit=topk_neighbors
    )
    return [
        convert_retrieved_records(point)
        for point in outs.points
        if point is not None
    ]
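The convert_retrieved_records helper isn't shown in this article; below is a minimal sketch of what it could look like, assuming the RetrievedMemory contract defined earlier.

def convert_retrieved_records(point) -> RetrievedMemory:
    # Map a Qdrant ScoredPoint into our Pydantic contract
    return RetrievedMemory(
        point_id=str(point.id),
        user_id=point.payload["user_id"],
        memory_text=point.payload["memory_text"],
        date=point.payload["date"],
        score=point.score,
    )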
Notice how we can set hybrid query filters like the models.MatchValue filter. Creating the index on user_id allows us to run these queries quickly against our data. You can extend this idea to include category tags, date ranges, and any other metadata that your application cares about. Just be sure to create an index on each field for faster retrieval performance.
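For instance, if each memory also carried a category payload field (an assumption for illustration; the base schema above doesn't include it), you could index it and add it to the filter like this, reusing the client and COLLECTION_NAME from earlier:

async def create_category_index():
    # One-time setup: index the extra payload field (keyword type for exact tag matches)
    await client.create_payload_index(
        collection_name=COLLECTION_NAME,
        field_name="category",
        field_schema=models.PayloadSchemaType.KEYWORD
    )

def build_conditions(user_id: int, categories: list[str]) -> list[models.Condition]:
    # Restrict the search to one user AND a set of category tags
    return [
        models.FieldCondition(key="user_id", match=models.MatchValue(value=user_id)),
        models.FieldCondition(key="category", match=models.MatchAny(any=categories)),
    ]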
In the next section, we'll connect this storage layer to our agent loop using DSPy Signatures and ReAct (Reasoning and Acting).
Memory Retrieval
In this section, we build a clean retrieval interface that pulls the most relevant per-user memories for a given turn.
Our algorithm is straightforward: we'll create a tool-calling chatbot agent. At every turn, the agent receives the transcript of the conversation and must generate a response. Let's define the DSPy signature.
class ResponseGenerator(dspy.Signature):
    """
    You are given a past conversation transcript between the user and an AI agent, along with the latest query from the user.
    You have the option to look up past memories from a vector database to fetch relevant context if required.
    If you cannot find the answer to the user's query in the transcript or in your own internal knowledge, use the provided search tool calls to look for information.
    You must output the final response, and also decide whether the latest interaction should be recorded in the memory database. New memories are meant to store new information that the user provides.
    New memories should be made when the USER provides new info. They are not for saving information about the AI or the assistant.
    """
    transcript: list[dict] = dspy.InputField()
    query: str = dspy.InputField()
    response: str = dspy.OutputField()
    save_memory: bool = dspy.OutputField(description=
        "True if a new memory record should be created for the latest interaction"
    )
The docstring of the DSPy Signature acts as additional instructions we pass to the LLM to help it pick its actions. Also, notice the save_memory flag we marked as an OutputField. Along with the answer, we're asking the LLM to output whether a new memory should be saved because of the latest interaction.
We also need to decide how we want to fetch relevant memories into the agent's context. One option is to always execute the search_memories function, but there are two big problems with this:
- Not all user questions need a memory retrieval.
- While the search_memories function expects a search vector, it isn't always obvious what text to embed. It might be the whole transcript, or just the user's latest message, or a transformation of the current conversation context.
Thankfully, we can default to tool-calling. When the agent thinks it lacks context to fulfill a request, it can invoke a tool call to fetch relevant memories related to the conversation's context. In DSPy, tools can be created by just writing a vanilla Python function with a docstring. The LLM reads this docstring to decide when and how to call the tool.
async def fetch_similar_memories(search_text: str):
    """
    Search memories from the vector database if the conversation requires additional context.
    Args:
    - search_text : The string to embed and do vector similarity search with
    """
    search_vector = (await generate_embeddings([search_text]))[0]
    memories = await search_memories(search_vector,
                                     user_id=user_id)  # user_id is tracked outside the LLM
    memories_str = [
        f"id={m_.point_id}\ntext={m_.memory_text}\ncreated_at={m_.date}"
        for m_ in memories
    ]
    return {
        "memories": memories_str
    }
Note that we keep track of the user's id externally and use it from our source of truth without asking the LLM to generate it. This guarantees that retrieval stays isolated to the user in the current chat session.
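One way to bind the session's user_id is to build the tool inside a per-session factory, so the id is captured by the closure and never exposed to the model. This is a sketch under that assumption; the repo may structure it differently.

def make_memory_tool(user_id: int):
    async def fetch_similar_memories(search_text: str):
        """Search memories from the vector database if the conversation needs extra context."""
        search_vector = (await generate_embeddings([search_text]))[0]
        memories = await search_memories(search_vector, user_id=user_id)  # user_id from the closure
        return {
            "memories": [
                f"id={m.point_id}\ntext={m.memory_text}\ncreated_at={m.date}"
                for m in memories
            ]
        }
    return fetch_similar_memories

# Per chat session, pass make_memory_tool(current_user_id) into the agent's tools list.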

Next, let's create a ReAct agent with DSPy. ReAct stands for "Reasoning and Acting". Basically, the LLM agent observes the data (in this case, the conversation history), reasons about it, and then acts.
An action could be to generate an answer directly or to attempt to retrieve memories first.
response_generator = dspy.ReAct(
    ResponseGenerator,
    tools=[fetch_similar_memories],
    max_iters=4
)
In an agentic flow, the DSPy ReAct policy can craft a concise search_text from the current turn and the known task. The ReAct agent can call fetch_similar_memories up to 4 times to search for memories before it must answer the user's query.
Other Retrieval Strategies
You can also choose retrieval strategies other than plain similarity search. Here are some ideas:
- Keyword Search – Look into algorithms like BM25 or TF-IDF (a minimal sketch follows this list)
- Category Filtering – If you force every memory to have clear metadata tagging (like "food", "sports", "habits"), the agent can generate queries against these specific subcategories instead of the whole memory stack.
- Time Queries – Allow the agent to retrieve records from specific time ranges!
These choices largely depend on your application.
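As a rough illustration of the keyword-search idea from the list above, here is a minimal BM25 sketch using the rank_bm25 package (an extra dependency, not used in this project):

from rank_bm25 import BM25Okapi

def keyword_search(memory_texts: list[str], query: str, topk: int = 5) -> list[str]:
    # Naive whitespace tokenization; a real system would lowercase, strip punctuation, stem, etc.
    corpus = [text.lower().split() for text in memory_texts]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(memory_texts)), key=lambda i: scores[i], reverse=True)
    return [memory_texts[i] for i in ranked[:topk]]

# keyword_search(["User likes tea", "User plays football"], "what sports does the user play?")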
Whatever your retrieval strategy is, once the tool fetches the memories, the agent generates answers from the retrieved data! Remember the save_memory flag it also outputs? We can trigger our custom update logic when it is set to true.
out = await response_generator.acall(
    transcript=past_messages,
    query=query,
)
response = out.response          # the response
save_memory = out.save_memory    # the LLM's decision to save a memory or not

past_messages.extend(
    [
        {"role": "user", "content": query},
        {"role": "assistant", "content": response},
    ]
)  # update the conversation stack

if save_memory:  # Update memories only if the LLM outputs this flag as true
    update_result = await update_memories(
        user_id=user_id,
        messages=past_messages,
    )
Let’s see how the update step works.
Memory Maintenance
Memory isn't a simple log of records. It's an ever-evolving pool of knowledge. Some memories should be deleted because they are no longer relevant. Some memories need to be updated because the underlying world conditions have changed.
For instance, suppose we had a memory for "user loves tea", and we just learned that the "user hates tea". Instead of creating a brand-new, conflicting memory, we should delete the old memory and create a new one.

When the response generator agent decides to save new memories, we'll use a separate agentic flow to decide how to do the updates. The update-memory agent receives as input the latest conversation and a list of existing memories similar to the conversation state.
# ... if save_memory is True
response = await update_memories_agent(
    user_id=user_id,
    existing_memories=similar_memories,
    messages=messages
)
Once we have decided to update the memory database, there are four logical things the memory manager agent can do:
• add_memory(text): Inserts a brand-new atomic factoid. It computes a fresh embedding and writes the record for the current user. It should also apply deduplication logic before insertion.
• update_memory(id, updated_text): Replaces an existing memory's text. It deletes the old point, re-embeds the new text, and reinserts it under the same user, optionally preserving or adjusting categories. This is the canonical way to handle refinements or corrections.
• delete_memories(ids): Removes one or more memories that are no longer valid due to contradictions or obsolescence.
• no_op(): Explicitly does nothing if the maintenance agent decides that the new memory is irrelevant or already fully captured in the database state.
Again, this architecture is inspired by the Mem0 research paper.
The code below shows these tools integrated into a DSPy ReAct agent with a structured signature and tool-selection loop.
class MemoryWithIds(BaseModel):
    memory_id: int
    memory_text: str

class UpdateMemorySignature(dspy.Signature):
    """
    You are given the conversation between user and assistant and some similar memories from the database. Your goal is to decide how to merge the new information into the database alongside the existing memories.
    Actions meaning:
    - ADD: add new information into the database as a new memory
    - UPDATE: update an existing memory with richer information.
    - DELETE: remove memory items from the database that are no longer required due to new information
    - NOOP: No need to take any action
    If no action is required you can finish.
    Think less and take actions.
    """
    messages: list[dict] = dspy.InputField()
    existing_memories: list[MemoryWithIds] = dspy.InputField()
    summary: str = dspy.OutputField(
        description="Summarize what you did. Very short (less than 10 words)"
    )
Next, let's write the tools our maintenance agent needs. We need functions to add, delete, and update memories, plus a dummy no_op function the LLM can call when it wants to "pass".
from datetime import datetime

async def update_memories_agent(
    user_id: int,
    messages: list[dict],
    existing_memories: list[RetrievedMemory]
):
    def get_point_id_from_memory_id(memory_id):
        return existing_memories[memory_id].point_id

    async def add_memory(memory_text: str) -> str:
        """
        Add the memory_text into the database as a new memory.
        """
        embeddings = await generate_embeddings(
            [memory_text]
        )
        await insert_memories(
            memories=[
                EmbeddedMemory(
                    user_id=user_id,
                    memory_text=memory_text,
                    date=datetime.now().strftime("%Y-%m-%d %H:%M"),
                    embedding=embeddings[0]
                )
            ]
        )
        return f"Memory: '{memory_text}' was added to DB"

    async def update(memory_id: int,
                     updated_memory_text: str,
                     ):
        """
        Update memory_id to use updated_memory_text
        Args:
            memory_id: integer index of the memory to replace
            updated_memory_text: Simple atomic factoid to replace the old memory with
        """
        point_id = get_point_id_from_memory_id(memory_id)
        await delete_records([point_id])
        embeddings = await generate_embeddings(
            [updated_memory_text]
        )
        await insert_memories(
            memories=[
                EmbeddedMemory(
                    user_id=user_id,
                    memory_text=updated_memory_text,
                    date=datetime.now().strftime("%Y-%m-%d %H:%M"),
                    embedding=embeddings[0]
                )
            ]
        )
        return f"Memory {memory_id} has been updated to: '{updated_memory_text}'"

    async def noop():
        """
        Call this if no action is required
        """
        return "No action taken"

    async def delete(memory_ids: list[int]):
        """
        Remove these memory_ids from the database
        """
        point_ids = [get_point_id_from_memory_id(mid) for mid in memory_ids]
        await delete_records(point_ids)
        return f"Memories {memory_ids} deleted"

    memory_updater = dspy.ReAct(
        UpdateMemorySignature,
        tools=[add_memory, update, delete, noop],
        max_iters=3
    )
    out = await memory_updater.acall(
        messages=messages,
        existing_memories=[
            MemoryWithIds(memory_id=i, memory_text=m.memory_text)
            for i, m in enumerate(existing_memories)
        ]
    )
    return out.summary
And that's it! Depending on what action the ReAct agent chooses, we can simply insert, delete, update, or ignore the new memories. Below you can see a simple example of how things look when we run the code.

The full version of the code also has additional features, like metadata tagging for accurate retrieval, which I didn't cover in this article to keep it beginner-friendly. Be sure to check out the GitHub repo below or the YouTube tutorial to explore the full project!
What’s next
You can watch the full video tutorial that goes into more detail about building memory agents here.
The code repo can be found here: https://github.com/avbiswas/mem0-dspy
This tutorial explained the building blocks of a memory system. Here are some ideas on how to expand this concept:
- A Graph Memory system – instead of using a vector database, store memories in a graph database. This means your DSPy modules should extract triplets instead of flat strings to represent memories (a small sketch follows this list).
- Metadata – Alongside text, insert additional attribute filters. For example, you can group all "food"-related memories. This allows the LLM agents to query specific tags while fetching memories, instead of querying all memories at once.
- Optimizing prompts per user: You can keep track of essential information in your memory database and directly inject it into the system prompt. This gets passed with each message as session memory.
- File-Based Systems: Another common pattern that's emerging is file-based retrieval. The core principles remain the same as what we discussed here, but instead of a vector database, you can use a file system. Inserting and updating records means writing .md files, and querying often involves additional indexing steps or simply uses tools like search or grep.
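As a rough sketch of the graph-memory idea from the first bullet above (hypothetical, not part of the repo), the extraction signature could output triplets instead of flat strings:

class Triplet(BaseModel):
    subject: str   # e.g. "user"
    relation: str  # e.g. "likes"
    object: str    # e.g. "tea"

class GraphMemoryExtract(dspy.Signature):
    """Extract (subject, relation, object) triplets about the user from the conversation."""
    transcript: str = dspy.InputField()
    triplets: list[Triplet] = dspy.OutputField()

graph_extractor = dspy.Predict(GraphMemoryExtract)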
My Patreon:
https://www.patreon.com/NeuralBreakdownwithAVB
My YouTube channel:
https://www.youtube.com/@avb_fj
Follow me on Twitter:
https://x.com/neural_avb
Read my articles:
https://towardsdatascience.com/creator/neural-avb/
