Mnemo Local Memory Layer: Portable Recall for Ollama and Custom LLM Apps

"The Mnemo GitHub README is used for the project positioning, Docker + Ollama quickstart, Python SDK example, Rust crate architecture, and benchmark table."

"The Ollama official site and GitHub repository provide context for running local LLMs."

"OpenAI Codex Skills documentation explains how agent skills package instructions, resources, and scripts into reusable capabilities."

"Spec Kit integration docs show how spec-driven tools write commands and context structure for different AI coding agents."

Searches for Mnemo usually come from a practical pain, not from curiosity about another local LLM tool. You may already have Ollama running. You may already call a local model through an OpenAI-compatible API. The awkward part comes next: after the session ends, the model loses project decisions, user preferences, API constraints, and the debugging note you gave it yesterday. You can keep pasting that material into the prompt, but the prompt gets longer, and stale information starts fighting with current information.

Mnemo sits in that gap. It keeps long-term memory on your own machine, stores entities and chunks through SQLite and graph relationships, then returns retrieved context to Ollama, OpenAI, Anthropic, or another compatible backend. It does not give the model native memory, and it is not a universal knowledge base. Treat it as a local memory service that you can replace, back up, inspect, and roll back.

Which layer of memory Mnemo handles

Most local LLM workflows have three layers: the model generates, the application coordinates the flow, and the memory layer brings useful information back across sessions. Ollama handles the first layer by running models locally. Your own script or agent is the second. RAG usually answers, “Which document chunks are similar to this question?” Mnemo aims at the third, narrower job: carrying facts and decisions from one session into the next.

A local coding assistant is a good example. Today you tell it, “This project is not using LangChain for now; keep the API layer on FastAPI and SQLite.” Tomorrow you start a new session and ask how to add retrieval. A useful assistant should recover that decision instead of proposing a completely different stack.

Option | Good for | Common failure mode
Long prompt | A few fixed preferences and project rules | Keeps growing, and old rules are hard to update
Markdown memory | Human-readable decisions and notes | Weak automatic recall, relationship tracking stays manual
Vector-store RAG | Docs, FAQ pages, and knowledge-base chunks | Similarity does not tell you which fact is still valid
Mnemo-style memory layer | Entities, relationships, session facts, and retrieved context | Needs governance; bad memories can pollute later answers

That makes Mnemo a better fit after you have the basics in place, such as calling the Ollama API. First make the model answer reliably. Then decide which pieces of information deserve to become memory. Reversing that order turns a fragile demo into a hard-to-debug state machine.

Architecture: Rust API, SQLite, and graph traversal

The Mnemo README splits the repository into four Rust crates. mnemo-core owns entity extraction, graph operations, retrieval, and the database layer. mnemo-api exposes an Axum REST API. mnemo-cli is the command-line client. mnemo-bench holds the benchmark suites. For a local tool, that structure matters because it shows the project is more than a prompt that summarizes old conversations.

Many memory tools chunk each conversation turn, create embeddings, and retrieve by similarity. That works for some jobs, but two issues show up quickly. The same person, project, or decision can appear in several sessions. Two facts can conflict, and vector similarity will not decide which one should win.

Mnemo’s public description puts more weight on entity deduplication and graph-first retrieval. In practice, it extracts entities from text, merges them with existing entities, and uses relationship edges during retrieval. If “API gateway,” “auth middleware,” and “FastAPI service” appear in different sessions, the graph can connect them when you ask about the system later.
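The idea of merging mentions across sessions can be illustrated with a small sketch. This is not Mnemo's extraction or merge logic, which lives in its Rust crates; the `EntityGraph` class, the `normalize` helper, and the session ids are all hypothetical, and real deduplication needs far more than case and whitespace folding.

```python
import re
from collections import defaultdict


def normalize(name: str) -> str:
    """Fold case and collapse whitespace so trivial variants share one key."""
    return re.sub(r"\s+", " ", name.strip()).casefold()


class EntityGraph:
    """Hypothetical in-memory stand-in for an entity store with relationship edges."""

    def __init__(self):
        self.mentions = defaultdict(set)  # canonical name -> session ids
        self.edges = defaultdict(set)     # canonical name -> related canonical names

    def add_mention(self, name: str, session_id: str) -> str:
        key = normalize(name)
        self.mentions[key].add(session_id)
        return key

    def relate(self, left: str, right: str) -> None:
        a, b = normalize(left), normalize(right)
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, name: str) -> list:
        return sorted(self.edges.get(normalize(name), set()))


graph = EntityGraph()
graph.add_mention("API Gateway", "session-1")
graph.add_mention("  api   gateway", "session-2")  # merges with the first mention
graph.relate("API gateway", "auth middleware")
graph.relate("auth middleware", "FastAPI service")
```

Asking for the neighbors of "auth middleware" returns both connected entities, which is the kind of link a later question about the system could follow.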

Graph expansion still needs a leash. The README says graph-expanded results participate with a lower score so direct matches rank ahead of inferred context. That is a useful trade-off: graph links should bring in clues, not bury the evidence that directly matched the query.
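One way to picture that trade-off is a merge step that discounts graph-expanded hits before ranking. The sketch below is only an illustration: the `rank_hits` function, the `graph_discount` value of 0.5, and the memory ids are assumptions, not the scoring Mnemo actually uses.

```python
def rank_hits(direct, expanded, graph_discount=0.5, limit=10):
    """Merge (memory_id, score) pairs, keeping the best score per memory.

    direct:   hits that matched the query itself
    expanded: hits reached through relationship edges; their scores are
              multiplied by graph_discount (an assumed value) before ranking
    """
    best = {}
    for memory_id, score in direct:
        best[memory_id] = max(best.get(memory_id, 0.0), score)
    for memory_id, score in expanded:
        discounted = score * graph_discount
        best[memory_id] = max(best.get(memory_id, 0.0), discounted)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical ids and scores for illustration only.
direct_hits = [("fastapi-decision", 0.8)]
expanded_hits = [("auth-middleware-note", 0.9), ("fastapi-decision", 0.6)]
ranking = rank_hits(direct_hits, expanded_hits)
```

Even though the graph-expanded note started with a higher raw score, the direct match stays first after the discount, and a memory found both ways keeps its stronger direct score.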

Treat benchmark numbers as project measurements

The README benchmark table is specific: Apple M2, SQLite WAL, in-memory petgraph, and a debug-build retrieval pipeline around 4.2 ms, with the note that release builds are faster. That tells us the local path has been measured. It does not prove your setup will behave the same way. Your result depends on data volume, extraction calls, disk speed, model backend, and retrieval policy.
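To get numbers for your own setup, a small timing harness is usually enough. The sketch below assumes nothing about Mnemo itself: `measure_latency_ms` is a hypothetical helper, and `call` stands for whatever retrieval call you want to time.

```python
import statistics
import time


def measure_latency_ms(call, runs=20, warmup=3):
    """Time a zero-argument callable and report simple latency statistics."""
    for _ in range(warmup):
        call()  # warm caches and connections before measuring
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Stand-in workload; replace the lambda with your own retrieval call.
stats = measure_latency_ms(lambda: sum(range(1000)), runs=10, warmup=1)
```

Run it against the same question set at different data volumes, since a figure measured on ten memories says little about ten thousand.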

I would watch three things before latency: whether written memories can be replayed, whether retrieved results explain their source, and whether wrong memories can be deleted or corrected. Slow code can often be optimized. A wrong memory with no source is much harder to trust.

Run the smallest Docker + Ollama path

Do not connect Mnemo to your main project on the first day. Use a temporary folder, follow the Docker + Ollama route in the upstream README, and decide later whether it belongs in your application.

git clone https://github.com/zaydmulani09/mnemo
cd mnemo
docker compose up -d

# Pull the README example model the first time
docker exec mnemo-ollama ollama pull llama3

# Check the API service
curl http://localhost:8080/health

If you have already worked through the Ollama beginner guide, this flow will feel familiar. The difference is that Mnemo starts a memory API beside the Ollama container. Later, your app talks to the memory service instead of stuffing every past decision into the model context.
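Containers often need a few seconds before the API answers, so a single curl right after startup can mislead. The sketch below polls the same `/health` URL used above with the standard library; the helper names, retry count, and delay are assumptions, and it treats only HTTP 200 as healthy.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code, or None when the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return None      # connection refused, DNS failure, timeout, ...


def wait_for_health(url, attempts=10, delay=1.0, fetch=http_status, sleep=time.sleep):
    """Poll url until it returns 200; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

Calling `wait_for_health("http://localhost:8080/health")` returns the attempt on which the service answered, or `None` if it never did, which is the cue to check the container status and logs.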

Use the Python SDK for a smoke test

The README also gives a tiny Python SDK path. It tests one thing: write a memory, then ask a natural-language question and see whether it comes back.

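from mnemo import MnemoClient

# Connect to the local Mnemo API started by the quickstart
client = MnemoClient()

# Write one memory, then ask for it back in natural language
client.ingest("I am building a Rust vector database called vecdb")
print(client.get_context("Which database project am I working on?"))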

When you run this, do not judge only the final model response. Check the service logs, database files, API response structure, and whether the memory survives a restart. The baseline for long-term memory is not polished wording. It is durable, inspectable, recoverable state.
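Inspecting the database file does not require knowing Mnemo's schema in advance. The sketch below lists every user table in a SQLite file with its row count; `table_row_counts` is a hypothetical helper, and the file path is something you have to look up in your own container or volume configuration.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'  # safe identifier quoting
            counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
```

Compare the counts before and after an ingest, and again after a restart. Because the database runs in WAL mode, it is safer to inspect a copy of the file or to stop the service first.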

Use the binary path when Ollama already exists

If Ollama is already running on your machine, the README also describes a binary route:

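# Run from the root of the cloned mnemo repository
cargo install --path crates/mnemo-api
# Point Mnemo at Ollama's OpenAI-compatible endpoint
export MNEMO_LLM_BASE_URL=http://localhost:11434/v1
mnemo-api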

This path fits an existing local LLM setup. Keep your own models, ports, and monitoring, and add Mnemo as a separate service. If you later move to a cloud backend, configure an OpenAI-compatible base URL, API key, model, and provider instead.
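Before starting mnemo-api against an existing backend, it can help to confirm that the base URL really speaks the OpenAI-compatible dialect. The sketch below queries the `/models` route under that base URL and returns the model ids; the helper names are hypothetical, and it assumes the backend returns the usual `{"data": [{"id": ...}]}` list shape without requiring an API key.

```python
import json
import urllib.request


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def list_model_ids(base_url, fetch=fetch_json):
    """Return model ids from an OpenAI-compatible /models listing."""
    payload = fetch(base_url.rstrip("/") + "/models")
    return [item["id"] for item in payload.get("data", [])]
```

With Ollama running locally, `list_model_ids("http://localhost:11434/v1")` should include the model you pulled. An empty list or a connection error points at the backend rather than at Mnemo.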

Check the fit before adopting it

The Mnemo README gives a useful boundary: if you already use a managed agent harness that handles memory well, you may not need Mnemo. That warning matters. More memory layers can mean more hidden state.

Your situation | Suggested move
A local Ollama script where you keep pasting project background | Try Mnemo
A custom support or coding agent that needs cross-session decisions | Run a small pilot
One-off Q&A over a document set | Start with Ollama embeddings and RAG
A mature platform already provides export, correction, and audit | Avoid a second memory layer for now
Team data has complex permissions and no access-control plan | Define permissions before adding memory

Local-first has a clear benefit: the data stays on your machine, the SQLite file is easy to back up, and you do not have to send every project conversation to a cloud service. It still needs security work. Decide who can read the database, where backups live, whether logs contain API keys, and how bad memories get removed.
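The log check is easy to automate in a rough form. The sketch below flags log lines that look like they contain credentials; the three patterns are illustrative assumptions, not a complete secret scanner, and they will produce both false positives and misses.

```python
import re

# Illustrative patterns only; extend them for the providers you actually use.
KEY_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),                 # "sk-" style API keys
    re.compile(r"(?i)bearer\s+[A-Za-z0-9._-]{20,}"),      # Authorization headers
    re.compile(r"(?i)api[_-]?key\s*[=:]\s*\S{8,}"),       # api_key=... assignments
]


def find_secret_lines(lines):
    """Return 1-based line numbers that match any key-like pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in KEY_PATTERNS):
            hits.append(number)
    return hits


# Made-up log lines for illustration.
sample_log = [
    "GET /health 200",
    "Authorization: Bearer abcdefghijklmnopqrstuvwxyz123456",
    "config loaded: api_key=supersecret123",
    "ingest ok",
]
flagged = find_secret_lines(sample_log)
```

Run a check like this over service logs and backup directories before the database or its logs leave your machine.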

Three guardrails for the memory layer

Long-term memory is useful only while it stays governable. Before connecting Mnemo to a real agent, I would write three rules into the project itself.

Every memory needs a source

A memory should not end as a lonely summary sentence. It should point back to a conversation, file, task, or API result. If an agent says, “This project uses FastAPI,” you should be able to trace where that claim came from.

That is also the main lesson from the earlier AI agent memory guide. Long-term memory is not a larger clipboard. Without source, time, and validity, old conclusions start wearing new clothes.
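As a sketch of what "source, time, and validity" could look like in your own wrapper code, the record below refuses to exist without a source. The `MemoryRecord` type, its field names, and the source string format are hypothetical and do not describe Mnemo's storage schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MemoryRecord:
    """Hypothetical memory record that always carries provenance."""

    text: str
    source: str                       # e.g. a conversation, file, task, or API result
    recorded_at: datetime
    expires_at: Optional[datetime] = None

    def __post_init__(self):
        if not self.source.strip():
            raise ValueError("a memory without a source cannot be traced")

    def is_valid(self, now: datetime) -> bool:
        return self.expires_at is None or now < self.expires_at


def cite(record: MemoryRecord) -> str:
    """Render the memory with its source so the prompt shows where it came from."""
    day = record.recorded_at.date().isoformat()
    return f"{record.text} [source: {record.source}, recorded: {day}]"


record = MemoryRecord(
    text="This project uses FastAPI",
    source="session:example-1",  # made-up source id
    recorded_at=datetime(2026, 6, 1, tzinfo=timezone.utc),
)
```

Passing `cite(record)` into the prompt instead of the bare text lets both the model and the reviewer see how old a claim is and where to verify it.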

Retrieved context needs a budget

A fast local service should still return a small set of evidence. For many tasks, 5 to 15 high-relevance memories with source hints are enough. If the model needs more, let it query again instead of pushing dozens of possibly related notes into the prompt.

This keeps context rot down. Agents often fail with plenty of material in hand because the material is stale, duplicated, or contradictory. The memory layer should filter before the prompt grows.
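A budget can be enforced in a few lines on the application side. The sketch below keeps the highest-scoring memories, drops near-verbatim duplicates, and stops at an item and character limit; `select_context` and its default limits are assumptions for illustration, not a Mnemo setting.

```python
def select_context(candidates, max_items=10, max_chars=2000):
    """Pick a small, de-duplicated set of memories from (score, text) pairs."""
    chosen, seen, used = [], set(), 0
    for score, text in sorted(candidates, key=lambda pair: pair[0], reverse=True):
        if len(chosen) >= max_items:
            break
        key = " ".join(text.split()).casefold()  # ignore case and spacing differences
        if key in seen:
            continue  # duplicate of something already selected
        if used + len(text) > max_chars:
            continue  # too large for the remaining budget; try smaller items
        chosen.append(text)
        seen.add(key)
        used += len(text)
    return chosen


# Made-up candidates for illustration.
candidates = [
    (0.9, "Use FastAPI for the API layer"),
    (0.8, "use  fastapi for the API layer"),
    (0.7, "SQLite is the storage backend"),
    (0.2, "x" * 5000),
]
context = select_context(candidates, max_items=2)
```

If an answer still lacks evidence, a second, narrower query is usually a better move than raising the limits.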

Bad memories need a withdrawal path

The most dangerous memory is wrong, not missing. Suppose the model once stored “production schema can be changed directly,” and the team later requires migration review. If that old memory stays active, the agent will eventually make a risky suggestion.

So the pilot needs withdrawal actions: delete a memory, mark it expired, re-extract one project, or clear one user space. Without those moves, long-term memory becomes debt.
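The withdrawal actions can be written down as an interface before you know how the real store implements them. The in-memory `MemoryStore` below is a hypothetical stand-in for rehearsing those moves, not Mnemo's API; check the upstream README for the operations the service actually exposes.

```python
from datetime import datetime, timezone


class MemoryStore:
    """Hypothetical in-memory store that supports withdrawal actions."""

    def __init__(self):
        self._items = {}  # memory_id -> {"user", "text", "expired_at"}

    def add(self, memory_id, user, text):
        self._items[memory_id] = {"user": user, "text": text, "expired_at": None}

    def delete(self, memory_id):
        """Remove one memory entirely; return True if it existed."""
        return self._items.pop(memory_id, None) is not None

    def expire(self, memory_id, when=None):
        """Keep the record for audit, but exclude it from retrieval."""
        item = self._items.get(memory_id)
        if item is None:
            return False
        item["expired_at"] = when or datetime.now(timezone.utc)
        return True

    def clear_user(self, user):
        """Drop every memory in one user space; return how many were removed."""
        doomed = [key for key, item in self._items.items() if item["user"] == user]
        for key in doomed:
            del self._items[key]
        return len(doomed)

    def active(self, user):
        return [
            item["text"]
            for item in self._items.values()
            if item["user"] == user and item["expired_at"] is None
        ]


store = MemoryStore()
store.add("m1", "alice", "production schema can be changed directly")
store.add("m2", "alice", "schema changes require migration review")
store.expire("m1")  # the old rule stays on record but stops being retrieved
```

Expiring rather than deleting keeps an audit trail, which helps when you later need to explain why an agent once gave a risky suggestion.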

Troubleshooting checklist

Problems with a tool like Mnemo usually live between the local service, the model backend, and retrieval. Check them in this order:

  • curl http://localhost:8080/health fails: check whether the Docker containers are running and whether the port is already occupied.
  • Ollama cannot pull the model: run ollama list inside the container (for the Docker route, docker exec mnemo-ollama ollama list) and confirm the model exists; use a smaller model if the network is slow.
  • API calls hang: verify that MNEMO_LLM_BASE_URL points to an OpenAI-compatible endpoint. Ollama listens on port 11434 by default and serves its OpenAI-compatible routes under /v1.
  • The answer ignores memory: confirm that ingest succeeded, then inspect the retrieval context instead of judging only the final response.
  • Memory disappears after restart: check whether the SQLite data path is mounted to a persistent volume.
  • Results get messy: reduce retrieved context, deduplicate entities, and expire outdated project decisions.

These checks beat prompt tinkering. A prompt can change how the model talks. It cannot fix a service that never started or a database path that disappears with the container.

Suggested reading and a safe pilot

If you have not run a local model yet, start with the Ollama beginner guide. Once model calls are stable, move to Ollama API calls and Ollama embeddings. Mnemo fits after those basics as an agent-memory pilot.

A 7-day pilot can stay small:

  1. Pick one project, not your whole machine.
  2. Write 30 to 50 real memories about the stack, rejected options, common errors, and API constraints.
  3. Ask the same 10 replay questions each day and record correct, missing, and wrong retrieval.
  4. Delete or expire bad memories, then check whether the next answer changes.
  5. Restart the container and the machine, then confirm the memories remain.
  6. Add database backups, permission checks, and secret scanning.
  7. Decide only then whether your main agent should use it.
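The daily replay in step 3 is easier to keep honest with a tiny tally. The sketch below counts the three outcomes named above and computes the share of correct retrievals; `summarize_day` and the sample outcomes are hypothetical.

```python
from collections import Counter

LABELS = ("correct", "missing", "wrong")


def summarize_day(outcomes):
    """Count replay outcomes for one day and compute the correct-retrieval rate."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    summary = {label: counts[label] for label in LABELS}
    summary["accuracy"] = counts["correct"] / total if total else 0.0
    return summary


# Made-up results for one day of 10 replay questions.
day_one = summarize_day(["correct"] * 7 + ["missing"] * 2 + ["wrong"])
```

Track the wrong count separately from the missing count across the week. A missing memory costs a repeated explanation, while a wrong one can steer the agent into a bad change.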

Mnemo’s useful target is modest: bring a small set of important context back into the next task, while leaving humans able to inspect, edit, back up, and withdraw that context. Once you can do that, a local LLM starts to feel less like a disposable chat window and more like a sustainable tool.

Run a 7-day Mnemo trial for your local LLM workflow

Test Mnemo on one project with a small memory set and replayable questions before connecting it to your main agent.

⏱️ Estimated time: 7 days

  1. Run the official quickstart
     Use the Docker + Ollama path from the GitHub README, pull a small model, and confirm that `/health` responds correctly.
  2. Write a small set of real memories
     Use one project only. Add 30 to 50 memories covering the stack, rejected options, API constraints, and troubleshooting notes.
  3. Prepare replay questions
     Ask the same 10 cross-session questions each day and record correct retrieval, missing retrieval, and wrong retrieval.
  4. Verify persistence after restart
     Restart the container and the machine. Confirm the SQLite data remains and the same memories can still be retrieved.
  5. Practice deletion and expiration
     Write one intentionally wrong memory, then delete it or mark it expired. Confirm later answers stop using the old fact.
  6. Add backup and secret checks
     Check database permissions, backup location, and logs for API keys before connecting Mnemo to your main agent.

FAQ

How is Mnemo different from ordinary RAG?
Ordinary RAG usually retrieves document chunks by text similarity. Mnemo focuses more on entity deduplication, graph relationships, and cross-session facts, so it is better suited for project decisions, preferences, API constraints, and memories that change over time.

Does Mnemo have to run with Ollama?
No. The GitHub README says it can work with Ollama, OpenAI, Anthropic, or an OpenAI-compatible backend. Ollama is simply the most convenient local path for a free first test.

Should I treat Mnemo's benchmark table as a production guarantee?
No. The README benchmark is a project benchmark under Apple M2, SQLite WAL, and in-memory petgraph. It shows the local path has been measured, but your own results depend on data size, hardware, model backend, and retrieval policy.

Is a local memory layer automatically safer?
Local-first keeps the data on your machine and makes backup and auditing easier. You still need to handle database file permissions, sensitive log content, backup locations, and cleanup for incorrect memories.

When should I skip Mnemo?
If you only need one-off document Q&A, ordinary RAG is simpler. If your managed agent platform already provides exportable, correctable, auditable memory, do not rush to add a second state layer.

9 min read · Published on: Jun 5, 2026 · Modified on: Jun 8, 2026