Mnemo Local Memory Layer: Portable Recall for Ollama and Custom LLM Apps
"The Mnemo GitHub README is used for the project positioning, Docker + Ollama quickstart, Python SDK example, Rust crate architecture, and benchmark table."
"The Ollama official site and GitHub repository provide context for running local LLMs."
"OpenAI Codex Skills documentation explains how agent skills package instructions, resources, and scripts into reusable capabilities."
"Spec Kit integration docs show how spec-driven tools write commands and context structure for different AI coding agents."
Searches for Mnemo usually come from a practical pain, not from curiosity about another local LLM tool. You may already have Ollama running. You may already call a local model through an OpenAI-compatible API. The awkward part comes next: after the session ends, the model loses project decisions, user preferences, API constraints, and the debugging note you gave it yesterday. You can keep pasting that material into the prompt, but the prompt gets longer, and stale information starts fighting with current information.
Mnemo sits in that gap. It keeps long-term memory on your own machine, stores entities and chunks through SQLite and graph relationships, then returns retrieved context to Ollama, OpenAI, Anthropic, or another compatible backend. It does not give the model native memory, and it is not a universal knowledge base. Treat it as a local memory service that you can replace, back up, inspect, and roll back.
Which layer of memory Mnemo handles
Most local LLM workflows have three layers: the model generates, the application coordinates the flow, and the memory layer brings useful information back across sessions. Ollama handles the first layer by running models locally. RAG usually answers a different question: “Which document chunks are similar to this question?” Mnemo aims at the third layer, the narrower job of carrying facts and decisions from one session to the next.
A local coding assistant is a good example. Today you tell it, “This project is not using LangChain for now; keep the API layer on FastAPI and SQLite.” Tomorrow you start a new session and ask how to add retrieval. A useful assistant should recover that decision instead of proposing a completely different stack.
| Option | Good for | Common failure mode |
|---|---|---|
| Long prompt | A few fixed preferences and project rules | Keeps growing, and old rules are hard to update |
| Markdown memory | Human-readable decisions and notes | Weak automatic recall, relationship tracking stays manual |
| Vector-store RAG | Docs, FAQ pages, and knowledge-base chunks | Similarity does not tell you which fact is still valid |
| Mnemo-style memory layer | Entities, relationships, session facts, and retrieved context | Needs governance; bad memories can pollute later answers |
That makes Mnemo a better fit after you have the basics in place, such as calling the Ollama API. First make the model answer reliably. Then decide which pieces of information deserve to become memory. Reversing that order turns a fragile demo into a hard-to-debug state machine.
Architecture: Rust API, SQLite, and graph traversal
The Mnemo README splits the repository into four Rust crates. `mnemo-core` owns entity extraction, graph operations, retrieval, and the database layer. `mnemo-api` exposes an Axum REST API. `mnemo-cli` is the command-line client. `mnemo-bench` holds the benchmark suites. For a local tool, that structure matters because it shows the project is more than a prompt that summarizes old conversations.
SQLite stores state, graph links add clues
Many memory tools chunk each conversation turn, create embeddings, and retrieve by similarity. That works for some jobs, but two issues show up quickly. The same person, project, or decision can appear in several sessions. Two facts can conflict, and vector similarity will not decide which one should win.
Mnemo’s public description puts more weight on entity deduplication and graph-first retrieval. In practice, it extracts entities from text, merges them with existing entities, and uses relationship edges during retrieval. If “API gateway,” “auth middleware,” and “FastAPI service” appear in different sessions, the graph can connect them when you ask about the system later.
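The merge step is easier to picture with a toy model. The sketch below is not Mnemo's extraction or deduplication code: the `EntityGraph` class, the `normalize` rule, and the co-occurrence edges are all assumptions made for illustration, using only the Python standard library.

```python
from collections import defaultdict
from itertools import combinations


def normalize(name: str) -> str:
    """Hypothetical dedup key: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


class EntityGraph:
    """Toy entity store: one node per normalized name, edges for co-occurrence."""

    def __init__(self) -> None:
        self.aliases = defaultdict(set)  # key -> surface forms seen so far
        self.edges = defaultdict(set)    # key -> neighbouring keys

    def add_session(self, entity_names) -> None:
        """Merge mentions into existing nodes and link entities seen together."""
        keys = []
        for name in entity_names:
            key = normalize(name)
            self.aliases[key].add(name)
            keys.append(key)
        for a, b in combinations(sorted(set(keys)), 2):
            self.edges[a].add(b)
            self.edges[b].add(a)

    def neighbours(self, name: str) -> set:
        """Entities that shared a session with the given one."""
        return set(self.edges.get(normalize(name), set()))


graph = EntityGraph()
graph.add_session(["API gateway", "auth middleware"])
graph.add_session(["Auth Middleware", "FastAPI service"])
```

After the two sessions, "auth middleware" is a single node with two spellings, and it links "api gateway" to "fastapi service" even though those two never appeared together. A real system would need a stronger merge rule than string normalization; this only shows the shape of the idea.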
Graph expansion still needs a leash. The README says graph-expanded results participate with a lower score so direct matches rank ahead of inferred context. That is a useful trade-off: graph links should bring in clues, not bury the evidence that directly matched the query.
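One way to express that trade-off is a damping factor on inferred results. This is a sketch under assumptions, not the project's scoring code: the function name, the one-hop expansion, and the `0.5` factor are invented here, since the README only says that expanded results score lower.

```python
GRAPH_DAMPING = 0.5  # assumed value; the README only says expanded results score lower


def rank_with_expansion(direct_hits, edges, damping=GRAPH_DAMPING):
    """Rank direct matches plus their one-hop neighbours.

    direct_hits: {entity: similarity score in [0, 1]}
    edges: {entity: set of related entities}
    Returns (entity, score, origin) tuples sorted by score, highest first.
    """
    scored = {name: (score, "direct") for name, score in direct_hits.items()}
    for name, score in direct_hits.items():
        for neighbour in edges.get(name, ()):
            inferred = score * damping
            # An inferred score never replaces a direct match; it only
            # replaces a weaker inferred score for the same entity.
            if neighbour not in scored or (
                scored[neighbour][1] == "graph" and scored[neighbour][0] < inferred
            ):
                scored[neighbour] = (inferred, "graph")
    return sorted(
        ((name, score, origin) for name, (score, origin) in scored.items()),
        key=lambda item: item[1],
        reverse=True,
    )


edges = {"auth middleware": {"api gateway", "fastapi service"}}
ranked = rank_with_expansion({"auth middleware": 0.9, "api gateway": 0.6}, edges)
```

With these inputs, "fastapi service" enters the results only through the graph and lands below both direct matches. A damped neighbour always scores below the direct hit it was expanded from, which is the property the README describes.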
Treat benchmark numbers as project measurements
The README benchmark table is specific: Apple M2, SQLite WAL, in-memory petgraph, and a retrieval pipeline measured at around 4.2 ms in a debug build, with the note that release builds are faster. That tells us the local path has been measured. It does not prove your setup will behave the same way. Your result depends on data volume, extraction calls, disk speed, model backend, and retrieval policy.
I would watch three things before latency: whether written memories can be replayed, whether retrieved results explain their source, and whether wrong memories can be deleted or corrected. Slow code can often be optimized. A wrong memory with no source is much harder to trust.
Run the smallest Docker + Ollama path
Do not connect Mnemo to your main project on the first day. Use a temporary folder, follow the Docker + Ollama route in the upstream README, and decide later whether it belongs in your application.
```shell
git clone https://github.com/zaydmulani09/mnemo
cd mnemo
docker compose up -d

# Pull the README example model the first time
docker exec mnemo-ollama ollama pull llama3

# Check the API service
curl http://localhost:8080/health
```
If you have already worked through the Ollama beginner guide, this flow will feel familiar. The difference is that Mnemo starts a memory API beside the Ollama container. Later, your app talks to the memory service instead of stuffing every past decision into the model context.
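Containers often need a few seconds before the API answers, so a single `curl` can fail on a healthy setup. The helper below is a minimal sketch that polls the health URL from the quickstart. It assumes a healthy service answers with HTTP 200; the function names and retry settings are illustrative, not part of Mnemo.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # endpoint from the quickstart above


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_health(url=HEALTH_URL, attempts=10, delay=1.0, probe=http_ok, sleep=time.sleep):
    """Poll until the probe succeeds; return the attempt number, or None if it never did."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_for_health()` after `docker compose up -d` returns the attempt on which the service first responded, or `None` if it never did. The `probe` and `sleep` parameters exist so the loop can be exercised without a running service.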
Use the Python SDK for a smoke test
The README also gives a tiny Python SDK path. It tests one thing: write a memory, then ask a natural-language question and see whether it comes back.
```python
from mnemo import MnemoClient

client = MnemoClient()
client.ingest("I am building a Rust vector database called vecdb")
print(client.get_context("Which database project am I working on?"))
```
When you run this, do not judge only the final model response. Check the service logs, database files, API response structure, and whether the memory survives a restart. The baseline for long-term memory is not polished wording. It is durable, inspectable, recoverable state.
Use the binary path when Ollama already exists
If Ollama is already running on your machine, the README also describes a binary route:
```shell
cargo install --path crates/mnemo-api
export MNEMO_LLM_BASE_URL=http://localhost:11434/v1
mnemo-api
```
This path fits an existing local LLM setup. Keep your own models, ports, and monitoring, and add Mnemo as a separate service. If you later move to a cloud backend, configure an OpenAI-compatible base URL, API key, model, and provider instead.
Check the fit before adopting it
The Mnemo README gives a useful boundary: if you already use a managed agent harness that handles memory well, you may not need Mnemo. That warning matters. More memory layers can mean more hidden state.
| Your situation | Suggested move |
|---|---|
| A local Ollama script where you keep pasting project background | Try Mnemo |
| A custom support or coding agent that needs cross-session decisions | Run a small pilot |
| One-off Q&A over a document set | Start with Ollama embeddings and RAG |
| A mature platform already provides export, correction, and audit | Avoid a second memory layer for now |
| Team data has complex permissions and no access-control plan | Define permissions before adding memory |
Local-first has a clear benefit: the data stays on your machine, the SQLite file is easy to back up, and you do not have to send every project conversation to a cloud service. It still needs security work. Decide who can read the database, where backups live, whether logs contain API keys, and how bad memories get removed.
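The log question can be checked mechanically before anything else. The following is a rough sketch of a scanner for key-like strings in log lines. The patterns are illustrative assumptions, not a complete secret detector, and they say nothing about what Mnemo itself writes to its logs.

```python
import re

# Illustrative patterns only; extend them for the providers you actually use.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),                        # "sk-" style key prefix
    re.compile(r"(?i)api[_-]?key\s*[=:]\s*['\"]?[^\s'\"]{8,}"),    # api_key=... assignments
    re.compile(r"(?i)\bauthorization:\s*bearer\s+\S{8,}"),         # bearer tokens in headers
]


def find_secret_lines(lines):
    """Return (line_number, line) pairs that look like they contain a credential."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Feed it the lines of a service log or a database export and review every hit by hand. A regex pass produces false positives and misses unusual formats, so treat it as a tripwire rather than proof that the logs are clean.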
Three guardrails for the memory layer
Long-term memory is useful only while it stays governable. Before connecting Mnemo to a real agent, I would write three rules into the project itself.
Every memory needs a source
A memory should not end as a lonely summary sentence. It should point back to a conversation, file, task, or API result. If an agent says, “This project uses FastAPI,” you should be able to trace where that claim came from.
That is also the main lesson from the earlier AI agent memory guide. Long-term memory is not a larger clipboard. Without source, time, and validity, old conclusions start wearing new clothes.
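Source, time, and validity can be made mandatory in the record itself. This is a hypothetical record shape for your own application code, not Mnemo's schema; the field names and the source string format are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical record shape; Mnemo's real schema may differ."""
    text: str
    source: str                              # e.g. "session:kickoff" or "file:docs/adr-003.md"
    recorded_at: datetime
    valid_until: Optional[datetime] = None   # None means "no known expiry"

    def __post_init__(self) -> None:
        if not self.source.strip():
            raise ValueError("a memory without a source cannot be traced")

    def is_active(self, now: datetime) -> bool:
        """A memory is active until its expiry time, if it has one."""
        return self.valid_until is None or now < self.valid_until

    def cite(self) -> str:
        """Render the memory with its provenance for use in a prompt."""
        return f"{self.text} [source: {self.source}, recorded {self.recorded_at:%Y-%m-%d}]"


record = MemoryRecord(
    "This project uses FastAPI",
    "session:kickoff",
    datetime(2026, 6, 5, tzinfo=timezone.utc),
)
```

Rejecting an empty source at construction time means an untraceable memory fails loudly when it is written, not quietly when it is retrieved.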
Retrieved context needs a budget
A fast local service should still return a small set of evidence. For many tasks, 5 to 15 high-relevance memories with source hints are enough. If the model needs more, let it query again instead of pushing dozens of possibly related notes into the prompt.
This keeps context rot down. Agents often fail with plenty of material in hand because the material is stale, duplicated, or contradictory. The memory layer should filter before the prompt grows.
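A budget like this can live in the application, between retrieval and prompt assembly. The sketch below assumes candidates arrive as `(score, text, source)` tuples. That shape, the function name, and the character limit are illustrative choices; only the item count echoes the 5 to 15 range above.

```python
def select_context(candidates, max_items=10, max_chars=2000):
    """Pick the highest-scoring memories that fit both an item and a character budget.

    candidates: iterable of (score, text, source) tuples.
    Texts that match after lowercasing and whitespace collapsing are kept only once.
    """
    chosen, seen, used = [], set(), 0
    for score, text, source in sorted(candidates, key=lambda c: c[0], reverse=True):
        key = " ".join(text.lower().split())
        if key in seen:
            continue  # duplicate of a higher-scoring memory
        if len(chosen) >= max_items:
            break
        if used + len(text) > max_chars:
            continue  # too long for what is left; a shorter memory may still fit
        seen.add(key)
        chosen.append((score, text, source))
        used += len(text)
    return chosen
```

Keeping the source in each tuple means the prompt can carry a source hint next to every memory. Counting characters is a crude stand-in for tokens; swap in a tokenizer if your backend exposes one.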
Bad memories need a withdrawal path
The most dangerous memory is wrong, not missing. Suppose the model once stored “production schema can be changed directly,” and the team later requires migration review. If that old memory stays active, the agent will eventually make a risky suggestion.
So the pilot needs withdrawal actions: delete a memory, mark it expired, re-extract one project, or clear one user space. Without those moves, long-term memory becomes debt.
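To make those actions concrete, here is a sketch against an in-memory SQLite database. The table and column names are hypothetical and are not Mnemo's actual schema; the point is the difference between expiring a memory, deleting it, and clearing one user space. Re-extraction is left out because it depends on the extraction pipeline.

```python
import sqlite3

# Hypothetical table for illustration; it is not Mnemo's actual schema.
SCHEMA = """
CREATE TABLE memories (
    id         INTEGER PRIMARY KEY,
    user_space TEXT NOT NULL,
    text       TEXT NOT NULL,
    source     TEXT NOT NULL,
    expired_at TEXT            -- NULL while the memory is active
)
"""


def expire_memory(conn, memory_id, when):
    """Soft-delete: keep the row for audit, hide it from retrieval."""
    conn.execute(
        "UPDATE memories SET expired_at = ? WHERE id = ? AND expired_at IS NULL",
        (when, memory_id),
    )


def delete_memory(conn, memory_id):
    """Hard-delete a single memory."""
    conn.execute("DELETE FROM memories WHERE id = ?", (memory_id,))


def clear_user_space(conn, user_space):
    """Remove every memory that belongs to one user space."""
    conn.execute("DELETE FROM memories WHERE user_space = ?", (user_space,))


def active_memories(conn, user_space):
    """Return the texts retrieval is still allowed to use."""
    rows = conn.execute(
        "SELECT text FROM memories WHERE user_space = ? AND expired_at IS NULL ORDER BY id",
        (user_space,),
    )
    return [text for (text,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO memories (user_space, text, source) VALUES (?, ?, ?)",
    [
        ("proj-a", "production schema can be changed directly", "session:3"),
        ("proj-a", "schema changes require migration review", "session:9"),
    ],
)
expire_memory(conn, 1, "2026-06-08T00:00:00Z")
```

After the last line, the risky memory is still in the table for audit but no longer shows up in `active_memories(conn, "proj-a")`. Whichever tool you pilot, check that it offers an equivalent of each of these moves.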
Troubleshooting checklist
Problems with a tool like Mnemo usually live between the local service, the model backend, and retrieval. Check them in this order:
- `curl http://localhost:8080/health` fails: check whether the Docker containers are running and whether the port is already occupied.
- Ollama cannot pull the model: run `ollama list` inside the container and confirm the model exists; use a smaller model if the network is slow.
- API calls hang: verify that `MNEMO_LLM_BASE_URL` points to an OpenAI-compatible endpoint. Ollama commonly listens on `11434`.
- The answer ignores memory: confirm that `ingest` succeeded, then inspect the retrieval context instead of judging only the final response.
- Memory disappears after restart: check whether the SQLite data path is mounted to a persistent volume.
- Results get messy: reduce retrieved context, deduplicate entities, and expire outdated project decisions.
These checks beat prompt tinkering. A prompt can change how the model talks. It cannot fix a service that never started or a database path that disappears with the container.
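The first checks in that list can be automated with a TCP probe. This is a minimal sketch: the ports are the ones used in this guide (8080 for the Mnemo API, 11434 for Ollama), and it assumes both are reachable from the host, which depends on how your containers publish ports. The function names are illustrative.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(checks, probe=port_open):
    """Run the checks in order and report the first one that fails.

    checks: list of (label, host, port). Returns the failing label, or None.
    """
    for label, host, port in checks:
        if not probe(host, port):
            return label
    return None


# Ports taken from this guide; adjust them to your own setup.
DEFAULT_CHECKS = [
    ("mnemo-api", "localhost", 8080),
    ("ollama", "localhost", 11434),
]
```

`preflight(DEFAULT_CHECKS)` returns `None` when both ports accept connections, or the label of the first service that does not. An open port only proves that something is listening, so follow it with the health request and a real retrieval call.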
Suggested reading and a safe pilot
If you have not run a local model yet, start with the Ollama beginner guide. Once model calls are stable, move to Ollama API calls and Ollama embeddings. Mnemo fits after those basics as an agent-memory pilot.
A 7-day pilot can stay small:
- Pick one project, not your whole machine.
- Write 30 to 50 real memories about the stack, rejected options, common errors, and API constraints.
- Ask the same 10 replay questions each day and record correct, missing, and wrong retrieval.
- Delete or expire bad memories, then check whether the next answer changes.
- Restart the container and the machine, then confirm the memories remain.
- Add database backups, permission checks, and secret scanning.
- Decide only then whether your main agent should use it.
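The daily replay step in the list above is easier to keep honest if it is scripted. Here is a small sketch of a replay harness. It takes any `retrieve` callable, so nothing in it depends on Mnemo's API; the question format and the substring-based grading are assumptions chosen for simplicity.

```python
from collections import Counter


def grade(context: str, expected: str, forbidden: str = "") -> str:
    """Classify one retrieval: 'wrong' if a retired fact shows up, else 'correct' or 'missing'."""
    lowered = context.lower()
    if forbidden and forbidden.lower() in lowered:
        return "wrong"
    return "correct" if expected.lower() in lowered else "missing"


def run_replay(questions, retrieve):
    """Ask every replay question and tally the outcomes.

    questions: list of dicts with 'question', 'expected', and optional 'forbidden'.
    retrieve: any callable that maps a question string to retrieved context text.
    """
    tally = Counter()
    for item in questions:
        context = str(retrieve(item["question"]))
        tally[grade(context, item["expected"], item.get("forbidden", ""))] += 1
    return tally
```

With the SDK from the smoke test, `retrieve` could be `client.get_context`, assuming its return value converts to readable text with `str()`. Substring matching is crude, but it is enough to see whether the counts of correct, missing, and wrong retrievals move in the right direction across the week.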
Mnemo’s useful target is modest: bring a small set of important context back into the next task, while leaving humans able to inspect, edit, back up, and withdraw that context. Once you can do that, a local LLM starts to feel less like a disposable chat window and more like a sustainable tool.
Run a 7-day Mnemo trial for your local LLM workflow
Test Mnemo on one project with a small memory set and replayable questions before connecting it to your main agent.
⏱️ Estimated time: 7 days
- Step 1: Run the official quickstart. Use the Docker + Ollama path from the GitHub README, pull a small model, and confirm that `/health` responds correctly.
- Step 2: Write a small set of real memories. Use one project only. Add 30 to 50 memories covering the stack, rejected options, API constraints, and troubleshooting notes.
- Step 3: Prepare replay questions. Ask the same 10 cross-session questions each day and record correct retrieval, missing retrieval, and wrong retrieval.
- Step 4: Verify persistence after restart. Restart the container and the machine. Confirm the SQLite data remains and the same memories can still be retrieved.
- Step 5: Practice deletion and expiration. Write one intentionally wrong memory, then delete it or mark it expired. Confirm later answers stop using the old fact.
- Step 6: Add backup and secret checks. Check database permissions, backup location, and logs for API keys before connecting Mnemo to your main agent.
9 min read · Published on: Jun 5, 2026 · Modified on: Jun 8, 2026