This began because my Obsidian assistant kept getting amnesia. I didn’t want to stand up Pinecone or Redis just so Claude could remember that Alice approved the Q3 budget last week. Turns out, with 200K+ context windows, you may not need any of that.
I want to share a new mechanism that I’ve started running. It’s a system built on SQLite and direct LLM reasoning: no vector databases, no embedding pipeline. Vector search was mostly a workaround for tiny context windows and for keeping prompts from getting messy. With modern context sizes, you can often skip that and just let the model read your memories directly.
The Setup
I take detailed notes, both in my personal life and at work. I used to scrawl in notebooks that would get misplaced or get stuck on a shelf and never be referenced again. A couple of years ago, I moved to Obsidian for everything, and it has been incredible. In the last year, I’ve started hooking up genAI to my notes. Today I run both Claude Code (for my personal notes) and Kiro-CLI (for my work notes). I can ask questions, get them to do roll-ups for leadership, track my goals, and write my reports. But it has always had one big Achilles’ heel: memory. When I ask about a meeting, it uses an Obsidian MCP to search my vault. It’s time-consuming, error-prone, and I want it to be better.
The obvious fix is a vector database. Embed the memories. Store the vectors. Do a similarity search at query time. It works. But it also means a Redis stack, a Pinecone account, or a locally running Chroma instance, plus an embedding API, plus pipeline code to stitch it all together. For a personal tool, that’s a lot, and there is a real risk that it won’t work exactly the way I want it to. I want to ask what happened on ‘Feb 1 2026’ or ‘recap the last meeting I had with this person’, things that embeddings and RAG aren’t great with.
Then I ran across Google’s always-on-memory agent (https://github.com/GoogleCloudPlatform/generative-ai/tree/main/gemini/agents/always-on-memory-agent). The idea is pretty simple: don’t do a similarity search at all; just give the LLM your recent memories directly and let it reason over them.
I wanted to know if that held up on AWS Bedrock with Claude Haiku 4.5. So I built it (with Claude Code, of course) and added some extra bells and whistles.
Visit my GitHub repo, but make sure to come back!
https://github.com/ccrngd1/ProtoGensis/tree/main/memory-agent-bedrock
An Insight That Changes the Math
Older models topped out at 4K or 8K tokens. You couldn’t fit more than a few documents in a prompt. Embeddings let you retrieve the relevant documents without loading everything. That was genuinely necessary. Haiku 4.5 offers a 200K-token context window, so what can we do with that?
A structured memory (summary, entities, topics, importance score) runs about 300 tokens. That means we can fit about 650 memories before hitting the ceiling. In practice, it’s a bit less, since the system prompt and query also consume tokens, but for a personal assistant that tracks meetings, notes, and conversations, that’s months of context.
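The back-of-the-envelope math is straightforward; the 5,000-token overhead reserved for the system prompt and query is my own assumed figure:

```python
CONTEXT_WINDOW = 200_000   # Haiku 4.5 context window, in tokens
TOKENS_PER_MEMORY = 300    # rough size of one formatted memory record
OVERHEAD = 5_000           # assumed budget for system prompt + query

# Ceiling if the whole window held memories.
theoretical_ceiling = CONTEXT_WINDOW // TOKENS_PER_MEMORY

# Ceiling after reserving room for the prompt scaffolding.
practical_ceiling = (CONTEXT_WINDOW - OVERHEAD) // TOKENS_PER_MEMORY

print(theoretical_ceiling)  # 666 -- "about 650"
print(practical_ceiling)    # 650
```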
No embeddings, no vector indexes, no cosine similarity.
The LLM reasons directly over semantics, and it’s better at that than cosine similarity.
The Architecture
The orchestrator isn’t a separate service. It’s a Python class inside the FastAPI process that coordinates the three agents.
The IngestAgent’s job is simple: take raw text and ask Haiku what’s worth remembering. It extracts a summary, entities (names, places, things), topics, and an importance score from 0 to 1. That package goes into the `memories` table.
The ConsolidateAgent runs on an intelligent schedule: at startup if any unconsolidated memories exist, when a threshold is reached (5+ memories by default), and daily as a forced pass. When triggered, it batches unconsolidated memories and asks Haiku to find cross-cutting connections and generate insights. Results land in a `consolidations` table. The system tracks the last consolidation timestamp to ensure regular processing even with low memory accumulation.
The QueryAgent reads recent memories plus consolidation insights into a single prompt and returns a synthesized answer with citation IDs. That’s the entire query path.
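As a sketch, the QueryAgent’s prompt assembly might look like the following. The function name, field layout, and prompt wording are my assumptions for illustration, not the repo’s exact code:

```python
def build_query_prompt(memories: list[dict], consolidations: list[dict],
                       question: str) -> str:
    """Pack recent memories and consolidation insights into one prompt."""
    lines = ["Answer from these memories only. Cite IDs like [memory:<id>]."]
    for m in memories:
        lines.append(f"[memory:{m['id']}] {m['summary']} "
                     f"(entities: {', '.join(m['entities'])})")
    for c in consolidations:
        lines.append(f"[consolidation:{c['id']}] {c['insights']}")
    lines.append(f"\nQuestion: {question}")
    return "\n".join(lines)

prompt = build_query_prompt(
    [{"id": "a3f1c9d2", "summary": "Alice confirmed Q3 budget approval of $2.4M",
      "entities": ["Alice", "Q3 budget"]}],
    [{"id": "3c765a26", "insights": "Budget oversight is a recurring priority"}],
    "What did Alice say about the budget?",
)
```

The whole prompt then goes to Haiku in a single call; there is no retrieval step between the database read and the model.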
What Actually Gets Stored
When you ingest text like “Met with Alice today. Q3 budget is approved, $2.4M,” the system doesn’t just dump that raw string into the database. Instead, the IngestAgent sends it to Haiku and asks, “What’s important here?”
The LLM extracts structured metadata:
{
"id": "a3f1c9d2-...",
"summary": "Alice confirmed Q3 budget approval of $2.4M",
"entities": ["Alice", "Q3 budget"],
"topics": ["finance", "meetings"],
"importance": 0.82,
"source": "notes",
"timestamp": "2026-03-27T14:23:15.123456+00:00",
"consolidated": 0
}
The memories table holds these individual records. At ~300 tokens per memory when formatted into a prompt (including the metadata), the theoretical ceiling is around 650 memories in Haiku’s 200K context window. I intentionally set the default to 50 recent memories, so I’m well short of that ceiling.
When the ConsolidateAgent runs, it doesn’t just summarize memories. It reasons over them. It finds patterns, draws connections, and generates insights about what the memories mean together. Those insights get stored as separate records in the consolidations table:
{
"id": "3c765a26-...",
"memory_ids": ["a3f1c9d2-...", "b7e4f8a1-...", "c9d2e5b3-..."],
"connections": "All three meetings with Alice mentioned budget concerns...",
"insights": "Budget oversight appears to be a recurring priority...",
"timestamp": "2026-03-27T14:28:00.000000+00:00"
}
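The batch prompt that produces records like this might be framed as follows; the helper name and wording are my assumptions based on the description, not the repo’s actual prompt:

```python
def build_consolidation_prompt(batch: list[dict]) -> str:
    """Ask the model for cross-cutting connections and insights over a batch."""
    listing = "\n".join(f"- [{m['id']}] {m['summary']}" for m in batch)
    return (
        "You are consolidating an agent's memories.\n"
        "For the batch below, answer two questions:\n"
        "1. connections: what links these memories?\n"
        "2. insights: what do they mean taken together?\n"
        "Return JSON with keys 'memory_ids', 'connections', 'insights'.\n\n"
        f"Memories:\n{listing}"
    )

prompt = build_consolidation_prompt([
    {"id": "a3f1c9d2", "summary": "Alice confirmed Q3 budget approval of $2.4M"},
    {"id": "b7e4f8a1", "summary": "Alice flagged budget overrun risk in sync"},
])
```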
When you query, the system loads both the raw memories and the consolidation insights into the same prompt. The LLM reasons over both layers directly: recent facts plus synthesized patterns. That’s how you get answers like “Alice has raised budget concerns in three separate meetings [memory:a3f1c9d2, memory:b7e4f8a1] and the pattern suggests this is a high priority [consolidation:3c765a26].”
This two-table design is the whole persistence layer. A single SQLite file. No Redis. No Pinecone. No embedding pipeline. Just structured records that an LLM can reason over directly.
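A minimal sketch of that two-table layer, with columns matching the JSON records above (the repo’s exact DDL may differ):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # the real system uses one file on disk
conn.executescript("""
CREATE TABLE memories (
    id TEXT PRIMARY KEY,
    summary TEXT NOT NULL,
    entities TEXT,          -- JSON array
    topics TEXT,            -- JSON array
    importance REAL,
    source TEXT,
    timestamp TEXT,
    consolidated INTEGER DEFAULT 0
);
CREATE TABLE consolidations (
    id TEXT PRIMARY KEY,
    memory_ids TEXT,        -- JSON array of memories.id values
    connections TEXT,
    insights TEXT,
    timestamp TEXT
);
""")
conn.execute(
    "INSERT INTO memories VALUES (?,?,?,?,?,?,?,?)",
    ("a3f1c9d2", "Alice confirmed Q3 budget approval of $2.4M",
     json.dumps(["Alice", "Q3 budget"]), json.dumps(["finance", "meetings"]),
     0.82, "notes", "2026-03-27T14:23:15+00:00", 0),
)
row = conn.execute("SELECT summary, importance FROM memories").fetchone()
```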
What the Consolidation Agent Actually Does
Most memory systems are purely retrieval. They store, search, and return similar text. The consolidation agent works differently: it reads a batch of unconsolidated memories and asks, “What connects these?”, “What do these have in common?”, “How do these relate?”
Those insights get written as a separate consolidations record. When you query, you get both the raw memories and the synthesized insights. The agent isn’t just recalling. It’s reasoning.
The sleeping-brain analogy from the original Google implementation seems pretty accurate. During idle time, the system is processing rather than just waiting. This is something I often struggle with when building agents: how can I make them more autonomous so they’ll work when I don’t? Idle-time consolidation is a good use of that “downtime”.
For a personal tool, this matters. “You’ve had three meetings with Alice this month, and all of them mentioned budget concerns” is more useful than three individual recall hits.
The original design used a simple threshold for consolidation: it waited for five memories before consolidating. That works for active use. But if you’re only ingesting sporadically, a note here, a picture there, you might wait days before hitting the threshold. Meanwhile, those memories sit unprocessed, and queries don’t benefit from the consolidation agent’s pattern recognition.
So I decided to add two more triggers. When the server starts, it checks for unconsolidated memories from the previous session and processes them immediately. No waiting. And on a daily timer (configurable), it forces a consolidation pass if anything is waiting, regardless of whether the 5-memory threshold has been met. So even a single note per week still gets consolidated within 24 hours.
The original threshold-based mode still runs for active use. But now there’s a safety net underneath it. If you’re actively ingesting, the threshold catches it. If you’re not, the daily pass does. And on restart, nothing falls through the cracks.
File Watching and Change Detection
I have an Obsidian vault with hundreds of notes, and I don’t want to manually ingest each one. I want to point the watcher at the vault and let it handle the rest. That’s exactly what this does.
On startup, the watcher scans the directory and ingests everything it hasn’t seen before. It then runs two modes in the background: a quick scan every 60 seconds checks for new files (fast, no hash calculation, just “is this path in the database?”), and a full scan every 30 minutes calculates SHA256 hashes and compares them to stored values. If a file has changed, the system deletes the old memories, cleans up any consolidations that referenced them, re-ingests the new version, and updates the tracking record. No duplicates. No stale data.
For personal note workflows, the watcher covers what you’d expect:
- Text files (.txt, .md, .json, .csv, .log, .yaml, .yml)
- Images (.png, .jpg, .jpeg, .gif, .webp), analyzed via Claude Haiku’s vision capabilities
- PDFs (.pdf), text extracted via PyPDF2
Recursive scanning and directory exclusions are configurable. Edit a note in Obsidian, and within 30 minutes, the agent’s memory reflects the change.
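The full-scan change detection can be sketched like this; the helper names and the `known` hash-map shape are illustrative, not the repo’s actual API:

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """SHA256 of the file contents, matching the full-scan comparison."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_changes(root: Path, known: dict[str, str]) -> tuple[list, list]:
    """Return (new_files, changed_files) relative to the `known` hash map."""
    new, changed = [], []
    for p in root.rglob("*.md"):    # the real watcher covers more extensions
        key, h = str(p), file_hash(p)
        if key not in known:
            new.append(key)         # quick scan would also catch this
        elif known[key] != h:
            changed.append(key)     # triggers delete + re-ingest upstream
    return new, changed
```

The quick 60-second scan is the `key not in known` path alone, skipping the hash; only the 30-minute pass pays the hashing cost.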
Why No Vector DB
Whether you need embeddings for your personal notes boils down to two things: how many notes you have and how you want to search them.
Vector search is genuinely necessary when you have millions of documents and can’t fit the relevant ones in context. It’s a retrieval optimization for large-scale problems.
At personal scale, you’re working with hundreds of memories, not millions. Vector search means you’re running an embedding pipeline, paying for the API calls, managing the index, and implementing similarity search to solve a problem that a 200K context window already solves.
Here’s how I think about the tradeoffs:
- Complexity: direct context is a single SQLite file; vector search adds an embedding pipeline, an index, and similarity code to maintain.
- Accuracy: the LLM reasons over the memories directly, which handles date and “last meeting with this person” queries that embeddings aren’t great with.
- Scale: vector search wins at millions of documents; at hundreds of memories, the context window is plenty.
I couldn’t justify having to set up and maintain a vector database, even FAISS, for the few notes that I generate.
On top of that, this approach gives me better accuracy for the way I want to search my notes.
Seeing It in Action
Here’s what using it actually looks like. Configuration is handled via a .env file with sensible defaults. You can copy the example directly and start using it (assuming you’ve already run aws configure on your machine).
cp .env.example .env
Then, start the server with the file watcher active:
./scripts/run-with-watcher.sh
To test a sample ingestion, curl the /ingest endpoint. This step is optional, just to show how it works; you can skip it if you’re setting up for real use. Assuming the server runs on the default localhost port:
curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{"text": "Met with Alice today. Q3 budget is approved, $2.4M.", "source": "notes"}'
The response will look like:
{
"id": "a3f1c9d2-...",
"summary": "Alice confirmed Q3 budget approval of $2.4M.",
"entities": ["Alice", "Q3 budget"],
"topics": ["finance", "meetings"],
"importance": 0.82,
"source": "notes"
}
To query it later, curl the query endpoint (same host and port):
curl "http://localhost:8000/query?q=What+did+Alice+say+about+the+budget"
Or use the CLI:
python cli.py ingest "Paris is the capital of France." --source wikipedia
python cli.py query "What do you already know about France?"
python cli.py consolidate # trigger manually
python cli.py status # see memory count, consolidation state
Making It Useful Beyond curl
curl works, but you’re not going to curl your memory system at 2 am when you have an idea, so the project has two integration paths.
Claude Code / Kiro-CLI skill. I added a native skill that auto-activates when relevant. Say “remember that Alice approved the Q3 budget” and it stores it without you needing to invoke anything. Ask “what did Alice say about the budget?” next week, and it checks memory before answering. It handles ingestion, queries, file uploads, and status checks through natural conversation. This is how I interact with the memory system most often, since I tend to live in CC/Kiro most of the time.
CLI. For terminal users or scripting:
python cli.py ingest "Paris is the capital of France." --source wikipedia
python cli.py query "What do you already know about France?"
python cli.py consolidate
python cli.py status
python cli.py list --limit 10
The CLI talks to the same SQLite database, so you can mix API, CLI, and skill usage interchangeably. Ingest from a script, query from Claude Code, and check status from the terminal. It all hits the same store.
What’s Next
The good news: the system works, and I’m using it today. But here are a few additions it could benefit from.
Importance-weighted query filtering. Right now, the query agent reads the N most recent memories. That means old but important memories can get pushed out by recent noise. I want to filter by importance score before building the context, but I’m not sure yet how aggressive to be. I don’t want a high-importance memory from two months ago to vanish just because I ingested a bunch of meeting notes this week.
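One possible scoring scheme, entirely my own speculation rather than anything in the repo: blend each memory’s recency rank with its stored importance score, so an old high-importance memory can outrank a flood of recent low-value notes without disappearing entirely.

```python
def select_memories(memories: list[dict], limit: int = 50,
                    recency_weight: float = 0.5) -> list[dict]:
    """memories are newest-first; score blends recency rank with importance."""
    n = len(memories)
    def score(item):
        idx, m = item
        recency = (n - idx) / n  # 1.0 for newest, approaching 0 for oldest
        return recency_weight * recency + (1 - recency_weight) * m["importance"]
    ranked = sorted(enumerate(memories), key=score, reverse=True)
    return [m for _, m in ranked[:limit]]

# Ten fresh low-importance notes plus one old high-importance fact:
recent_noise = [{"id": f"n{i}", "importance": 0.2} for i in range(10)]
old_key_fact = [{"id": "budget", "importance": 0.95}]
picked = select_memories(recent_noise + old_key_fact, limit=5)
```

Tuning `recency_weight` is exactly the “how aggressive to be” question: at 1.0 this degenerates to the current N-most-recent behavior, at 0.0 it ignores recency entirely.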
Metadata filtering. Similarly, since each memory has associated metadata, I could use that metadata to filter out memories that are obviously irrelevant. If I’m asking questions about Alice, I don’t need memories that only involve Bob or Charlie. For my use case, this could be based on my note hierarchy, since I keep notes aligned to customers and/or specific projects.
Delete and update endpoints. The store is append-only right now. That’s fine until you ingest something wrong and need to fix it. DELETE /memory/{id} is an obvious gap. I just haven’t needed it badly enough yet to build it.
MCP integration. Wrapping this as an MCP server would let any Claude-compatible client use it as persistent memory. That’s probably the highest-value item on this list, but it’s also the most work.
Try It
The project is up on GitHub as part of an ongoing series I started, where I implement research papers, explore leading-edge ideas, and repurpose handy tools for Bedrock (https://github.com/ccrngd1/ProtoGensis/tree/main/memory-agent-bedrock).
It’s Python with no exotic dependencies: just boto3, FastAPI, and SQLite.
The default model is `us.anthropic.claude-haiku-4-5-20251001-v1:0` (Bedrock cross-region inference profile), configurable via .env.
A note on security: the server has no authentication by default; it’s designed for local use. If you expose it on a network, add auth first. The SQLite database will contain everything you’ve ever ingested, so treat it accordingly (chmod 600 memory.db is a good start).
If you’re building personal AI tooling and stalling on the memory problem, this pattern is worth a look. Let me know if you decide to try it out, how it works for you, and which project you’re using it on.
About
Nicholaus Lawson is a Solution Architect with a background in software engineering and AI/ML. He has worked across many verticals, including Industrial Automation, Health Care, Financial Services, and Software companies, from start-ups to large enterprises.
This article and any opinions expressed by Nicholaus are his own and not a reflection of his current, past, or future employers or any of his colleagues or affiliates.
Feel free to connect with Nicholaus via LinkedIn at https://www.linkedin.com/in/nicholaus-lawson/
