Agent Memory System Design: From Session to Long-Term Memory
You spent 30 minutes discussing project details with an AI Agent, covering architecture decisions, tech stack choices, and risk assessments. The next day, you open the same conversation, and it asks: “What would you like to discuss?”
Everything from yesterday—your preferences, discussion conclusions, progress tracking—is gone.
Honestly, this isn’t an Agent capability issue. The problem lies in architectural design: you gave it a brain, but forgot to give it memory.
LLMs are stateless by default. Each request starts with a blank slate, unless you actively build a memory system. I’ve seen too many teams discover this problem only after launching their Agent: users complain “why are you asking again?” or “why did you forget what we agreed on?”—and then they scramble to fix it.
This article shares a complete blueprint for Agent memory system design: how to choose between four memory types, how to build a five-stage pipeline, which framework to pick, and how to control costs.
Chapter 1: Why Agents Need Memory Systems
LLMs naturally have “goldfish memory.” You send a request, it gives a response, done. The next request arrives, and it’s in a fresh state with no recollection. This isn’t a flaw—it’s a design feature. Independent inference per request ensures predictable outputs.
But in Agent scenarios, this feature is a disaster.
Imagine a customer service Agent. The user says “I want to change my address.” The Agent replies “Sure, please provide the new address.” The user says “Just use the warehouse address from last time.” Now the Agent is completely lost: which time? Which warehouse? It knows nothing.
Even worse is “Context Rot”—you keep stuffing things into the conversation, the context grows longer, and garbage information accumulates. The user asks a simple question, but the Agent has to dig through dozens of conversation rounds. According to Redis’s official blog, full-context solutions can push p95 latency to 17.12 seconds with 14x token overhead.
The cost difference is staggering. I’ve seen a comparison: a full-context solution burns $1 million per month, while a selective memory approach costs just $100K. That’s a 10x gap.
The core contradiction: you want the Agent to remember everything, but you can’t stuff everything into the context window. The solution is simple—give it a memory system.
Memory systems solve three core problems:
Cross-session Continuity: The user says they prefer Chinese responses today. Tomorrow, next week, next month—they open it, and the Agent should remember this preference.
Personalized Experience: Every user has different usage habits, business contexts, and histories. A memory system lets the Agent “recognize” users.
Crash Recovery: An Agent fails halfway through execution. With a memory system, it can resume from where it left off after restart—no need to start over.
Chapter 2: Four Memory Types—From Cognitive Science to Technical Architecture
Memory isn’t a single storage space—it has layers and divisions. Cognitive scientists divide human memory into working memory, episodic memory, semantic memory, and long-term memory. Agent architecture design can draw from this model.
Working Memory
Working memory is the Agent’s “mind” during the current session. While you converse with it, everything it processes lives here—what the user just said, current task progress, intermediate reasoning results.
Storage is straightforward: the context window. Lifecycle is short—when the conversation ends, working memory clears. The next conversation starts fresh.
Technically, most frameworks use Redis or a KV store as the backend, plus a checkpointer that periodically saves state. LangGraph's MemorySaver is a typical example: after each node executes, it saves a state snapshot to memory or a database.
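To make this concrete, here is a minimal working-memory checkpoint sketch using LangGraph's MemorySaver. Imports and method names follow the LangGraph documentation at the time of writing, so verify them against the version you install:

```python
# Minimal LangGraph checkpointing sketch; verify against your installed langgraph version.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    notes: str

def step(state: State) -> State:
    # A real node would call the LLM; here we just mark the state as processed.
    return {"notes": state["notes"] + " | processed"}

builder = StateGraph(State)
builder.add_node("step", step)
builder.add_edge(START, "step")
builder.add_edge("step", END)

# MemorySaver snapshots the graph state after every node, keyed by thread_id.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "session-42"}}
graph.invoke({"notes": "user prefers Chinese responses"}, config)

# The snapshot can later be inspected or resumed for the same thread_id.
print(graph.get_state(config).values)
```

MemorySaver keeps snapshots in process memory, so it only suits development; production deployments typically swap in a database-backed checkpointer.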
Episodic Memory
Episodic memory records “what happened.” What questions the user asked last time, how the Agent responded, what decisions were made—these events are stored chronologically, like a running log.
Unlike working memory, episodic memory persists across sessions. Today’s conversation ends, tomorrow’s conversation can still query yesterday’s event records.
Storage typically uses event streams (Redis Streams) or time-series databases. A key optimization strategy is “summary compression”—raw events can be verbose. Using LLM to condense them into compact versions preserves key information while saving storage and retrieval costs.
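For the event-stream option, episodic logging with redis-py can look roughly like this; the key naming scheme is just an illustration:

```python
# Sketch: episodic memory as a Redis Stream, one stream per user (assumes a local Redis and redis-py).
import redis

r = redis.Redis(decode_responses=True)
stream = "agent:episodes:user_123"   # hypothetical key naming scheme

# Append events: Redis assigns monotonically increasing, time-based IDs.
r.xadd(stream, {"role": "user", "summary": "asked to reuse the warehouse address from the last order"})
r.xadd(stream, {"role": "agent", "summary": "confirmed the address change"})

# Replay the most recent events in chronological order (here: the last 50).
recent = r.xrevrange(stream, count=50)
for event_id, fields in reversed(recent):
    print(event_id, fields["summary"])
```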
Semantic Memory
Semantic memory is “what is known.” It stores abstract knowledge and facts—“user prefers Chinese responses,” “company headquarters is in Shanghai,” “Product A pricing is 500 yuan.”
These don’t care when they were learned or which conversation they came from—only the knowledge itself matters.
Storage primarily uses vector databases (Pinecone, Weaviate, Milvus) plus knowledge graphs. After vectorization, HNSW or IVF indexes accelerate retrieval. Knowledge graphs store entity relationships—“User A prefers B,” “Company C is located in D.”
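To show the idea without tying it to a specific vector database, here is a toy sketch of semantic memory as normalized embeddings plus cosine-similarity recall. embed() is a placeholder for a real embedding model, and a production system would use one of the databases above with an HNSW or IVF index instead of brute force:

```python
# Toy semantic-memory store; embed() is a placeholder for an actual embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: returns a deterministic random vector; use a real embedding API in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

facts = [
    "user prefers Chinese responses",
    "company headquarters is in Shanghai",
    "Product A pricing is 500 yuan",
]
matrix = np.stack([embed(f) for f in facts])

def recall(query: str, top_k: int = 2) -> list[str]:
    scores = matrix @ embed(query)            # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [facts[i] for i in best]

print(recall("where is the company based?"))
```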
Long-Term Memory
Long-term memory is “who the user is.” It stores user profiles, preference settings, long-term domain knowledge—things that don’t change easily and are valid across all sessions.
Storage uses persistent databases—PostgreSQL, MongoDB, or cloud provider solutions (Alibaba Cloud AnalyticDB, PolarDB). Retrieval typically uses semantic search + RAG, combined with attribute filtering (e.g., filtering by user ID).
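A minimal sketch of the long-term layer as a relational upsert keyed by user and attribute; sqlite3 is used only to keep the example self-contained, but the same schema maps directly onto PostgreSQL:

```python
# Long-term memory as a relational table, always filtered by user_id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_profile (
        user_id    TEXT,
        key        TEXT,
        value      TEXT,
        updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, key)
    )
""")

def remember(user_id: str, key: str, value: str) -> None:
    # Upsert: a new value for the same (user_id, key) replaces the old one.
    conn.execute(
        "INSERT INTO user_profile (user_id, key, value) VALUES (?, ?, ?) "
        "ON CONFLICT(user_id, key) DO UPDATE SET value = excluded.value, updated_at = CURRENT_TIMESTAMP",
        (user_id, key, value),
    )

remember("user_123", "preferred_language", "zh-CN")
remember("user_123", "preferred_language", "zh-CN (concise)")  # later refinement overwrites

row = conn.execute(
    "SELECT value FROM user_profile WHERE user_id = ? AND key = ?",
    ("user_123", "preferred_language"),
).fetchone()
print(row[0])  # -> zh-CN (concise)
```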
These four memory types aren’t isolated—they form a pyramid structure: working memory at the bottom, fastest but shortest-lived; long-term memory at the top, most persistent but slowest to retrieve. The Agent pulls information from different layers based on task requirements.
Chapter 3: Five-Stage Memory Pipeline—From Extraction to Forgetting
Memory isn’t just about storing conversations. It requires a complete pipeline: extraction, consolidation, storage, retrieval, and forgetting. Each stage has its nuances.
Stage 1: Extraction
Not every sentence in a conversation needs to be remembered. “Hello,” “Thanks,” “Hold on a second”—these noise messages waste space if stored.
The extraction stage’s task is identifying which information is worth preserving. The typical approach is LLM classification + rule filtering. LLM judges whether a message has long-term value (“user prefers Chinese responses” vs “user said hello”), while rule filtering handles obvious patterns (e.g., too short, pure greetings).
```python
# Extraction stage pseudocode
def extract_memories(conversation):
    candidates = []
    for message in conversation:
        # LLM classification: worth remembering?
        classification = llm.classify(message, "memory_candidate")
        if classification == "worth_remembering":
            candidates.append(message)
    # Rule filtering: remove obvious noise
    candidates = filter_noise(candidates)
    return candidates
```
Stage 2: Consolidation
Extracted information might be duplicated. “User likes Chinese” might have appeared in three different conversations—you don’t need to store it three times.
Consolidation’s tasks: merge duplicates, update old memories, build knowledge graph triples.
For example:
- Old memory: “User prefers Chinese responses”
- New extraction: “User says they prefer concise Chinese responses”
- Consolidated: “User prefers concise Chinese responses” (merged and refined)
```python
# Consolidation stage pseudocode
def consolidate_memories(new_memories, existing_memories):
    for new in new_memories:
        # Check if duplicate of, or related to, an existing memory
        similar = find_similar(new, existing_memories)
        if similar:
            # Merge or update
            merged = llm.merge(new, similar)
            update_memory(similar.id, merged)
        else:
            # Add new memory
            add_memory(new)
```
Stage 3: Storage
Storage involves two key decisions: storage format and indexing method.
Storage format depends on memory type: working memory uses KV Store, episodic memory uses event streams, semantic memory uses vector databases, long-term memory uses relational databases.
Indexing affects retrieval performance. Mainstream choices are HNSW and IVF:
- HNSW: Suitable for small-to-medium datasets (100K to millions), higher recall at the same latency, but higher memory consumption.
- IVF: Suitable for large datasets (millions to billions), high memory efficiency, but slightly lower precision—relevant vectors might be missed if not in target buckets.
According to Redis blog data, HNSW typically achieves higher recall at the same latency target, while IVF saves memory at large scale. The choice depends on your data volume and precision requirements.
Stage 4: Retrieval
When the Agent needs to use memories, the retrieval stage pulls relevant information.
Pure vector search sometimes lacks precision. A better approach is “hybrid retrieval”—a combination of vector search + full-text search + attribute filtering.
For example, if the user asks “What was that warehouse address from last time?”:
- Vector search: Find semantically similar memories (“warehouse address,” “logistics info”)
- Attribute filtering: Only look at this user’s memories
- Temporal sorting: Prioritize the most recent memories
```python
# Hybrid retrieval pseudocode
def retrieve_memories(query, user_id):
    # Vector search
    vector_results = vector_db.search(query, top_k=20)
    # Attribute filtering: only the current user
    filtered = [m for m in vector_results if m.user_id == user_id]
    # Temporal sorting: prioritize recent memories
    sorted_results = sort_by_time(filtered, descending=True)
    return sorted_results[:5]
```
Stage 5: Forgetting
Forgetting sounds negative, but in memory systems, it’s crucial. Without forgetting, storage expands infinitely, and noise drowns out valuable information.
Two main forgetting strategies:
Temporal Decay: Memory importance decreases over time. A preference setting from a month ago might be outdated—weight automatically decreases.
Importance-Based Eviction: Evaluate importance based on access frequency, user feedback, validation count. Low-importance memories get periodically cleaned.
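One way to combine the two strategies is a single retention score that a periodic cleanup job can threshold on; the half-life, weights, and threshold below are illustrative knobs, not recommended values:

```python
# Illustrative retention score combining temporal decay with access-based importance.
import math
import time

HALF_LIFE_DAYS = 30.0

def retention_score(last_accessed_ts: float, access_count: int, user_confirmed: bool) -> float:
    age_days = (time.time() - last_accessed_ts) / 86400
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)          # halves every HALF_LIFE_DAYS
    importance = 1.0 + math.log1p(access_count)         # baseline 1 so fresh memories aren't evicted at once
    bonus = 1.5 if user_confirmed else 1.0              # confirmed facts decay more slowly
    return decay * importance * bonus

def should_evict(memory: dict, threshold: float = 0.05) -> bool:
    # Periodic cleanup: evict memories whose score falls below the threshold.
    return retention_score(memory["last_accessed"], memory["access_count"], memory["confirmed"]) < threshold
```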
An easily overlooked problem is “one-time error solidification.” A user casually mentions incorrect information, and the Agent stores it as “fact”—this is dangerous. The solution is adding validation logic in the consolidation stage, or marking low-confidence memories as “pending confirmation.”
Build an Agent Memory System: Step-by-Step Checklist
The five-stage pipeline (extraction, consolidation, storage, retrieval, and forgetting) condensed into actionable steps.

Step 1. Extraction: Identify Valuable Information
Identify information worth remembering from conversations:
• Use LLM classification to determine long-term value
• Apply rule filtering to remove obvious noise (greetings, meaningless short sentences)
• Key indicators: user preferences, business constraints, decision conclusions

Step 2. Consolidation: Merge and Update
Process extracted results and avoid duplicate storage:
• Detect similar memories and merge duplicates
• Update old memories (e.g., preference refinement)
• Build knowledge graph triples
• Mark low-confidence information as "pending confirmation"

Step 3. Storage: Choose an Appropriate Backend
Select storage and indexing based on memory type:
• Working memory: Redis/KV Store (sub-millisecond)
• Episodic memory: Redis Streams/time-series database
• Semantic memory: vector database + HNSW/IVF indexing
• Long-term memory: PostgreSQL/MongoDB + RAG retrieval

Step 4. Retrieval: Hybrid Retrieval Strategy
Combine multiple retrieval methods for precision:
• Vector search: semantic similarity matching
• Full-text search: exact keyword matching
• Attribute filtering: filter by user_id, time range
• Temporal sorting: prioritize recent memories

Step 5. Forgetting: Prevent Bloat and Noise
Periodically clean low-value memories:
• Temporal decay: importance decreases over time
• Access-frequency eviction: demote long-unaccessed memories
• Prevent error solidification: validation logic + "pending confirmation" tags
• Periodic reflection: have the LLM check for contradictions or outdated info
Chapter 4: Framework Comparison—Mem0 vs Zep vs LangMem vs LangChain
Mature memory frameworks exist—no need to reinvent the wheel. The question is: which one to choose?
These four frameworks each have their strengths. Let me start with a comparison table:
| Dimension | Mem0 | Zep | LangMem | LangChain Native |
|---|---|---|---|---|
| Type | Managed platform (has open-source version) | Context engineering platform | LangGraph library | Base framework |
| Knowledge Graph | Pro version supports | Core feature | Not supported | Needs plugin |
| Self-hosted | Open-source version available | Cloud only | Fully local | Fully local |
| SDK | Python, JS, MCP Server | Python, TS, Go | Python only | Python |
| Pricing | Free → $19 → $249/month | $25/month+ | Free | Free |
Selection Decision Tree
Before choosing a framework, ask yourself three questions:
Q1: Need knowledge graphs?
→ Yes → Mem0 Pro or Zep (both have mature graph capabilities)
→ No → Continue to Q2
Q2: Need managed service?
→ Yes → Mem0 (simple onboarding) or MemoClaw (no API key setup)
→ No → Continue to Q3
Q3: Using LangGraph?
→ Yes → LangMem (native integration, no extra dependencies)
→ No → Build yourself (use LangChain Checkpointer + vector database)
Real-World Scenario Recommendations
Intelligent Customer Service → Mem0
Customer service Agents need to remember user preferences, order history, complaint records. This information suits knowledge graph storage—“User A purchased Product B,” “User A complained about Issue C.” Mem0’s managed version saves ops costs; the Pro version provides graph capabilities.
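For reference, storing and recalling memories with the open-source Mem0 client looks roughly like this; method names follow the Mem0 documentation, so double-check them against the version you install:

```python
# Sketch of the open-source Mem0 client (pip install mem0ai); verify the API against your installed version.
from mem0 import Memory

memory = Memory()

# Store facts scoped to a user; Mem0 handles extraction and consolidation behind the scenes.
memory.add("Customer prefers delivery to the Pudong warehouse", user_id="user_123")
memory.add("Customer complained twice about late deliveries for Product B", user_id="user_123")

# Later, in a new session, recall what is relevant to the current question.
results = memory.search("Which warehouse should we ship to?", user_id="user_123")
print(results)  # the result payload structure varies by mem0 version
```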
Medical Diagnosis Agent → Zep
Medical scenarios involve complex entity relationships and timelines—when symptoms appeared, when medications were adjusted, when test results changed. Zep’s core advantage is “Temporal Facts,” precisely tracking time dimensions of events, suitable for scenarios requiring medical history reasoning.
Internal Tool Agent → LangMem
If your Agent is already built with LangGraph, adding LangMem is simplest. It’s a native LangGraph library with no extra dependencies—Checkpointer and memory storage in one.
Rapid Prototype Validation → MemoClaw
Want to try memory system effects without registering accounts or configuring API keys? MemoClaw provides “memory as a service”—just call store/recall interfaces. Suitable for prototyping; production-grade projects may need stronger frameworks.
Mem0’s Integration Ecosystem
Worth mentioning is Mem0’s integration coverage. According to Mem0’s official blog data from early 2026, it supports integrations with 21 frameworks and platforms—including OpenAI, LangChain, LlamaIndex, CrewAI, AutoGen, and more. If you’re using mainstream frameworks, there’s likely a ready-made integration package.
Chapter 5: Production Implementation—Cost Control and Performance Optimization
A working demo doesn't mean a working production system. Before launching a memory system, three questions need answers: Is performance fast enough? Are costs low enough? Is security tight enough?
Index Selection: Precision vs Scale
Vector retrieval’s performance bottleneck is indexing. Three mainstream choices:
FLAT: Brute-force search, perfect precision, but slow. Suitable for small-scale data (under 10K), or scenarios requiring 100% accuracy.
HNSW: Hierarchical Navigable Small World graph, high recall, fast speed. Suitable for small-to-medium scale (100K to millions), but higher memory consumption—millions of vectors need several GB of memory.
IVF: Inverted File index, buckets vectors, searches only a few buckets during retrieval. Suitable for large scale (millions to billions), high memory efficiency, but slightly lower precision—relevant vectors might be missed if not in target buckets.
Selection logic is straightforward: small data volume, choose FLAT or HNSW; large data volume, choose IVF. If you have extremely high precision requirements (e.g., medical diagnosis), prioritize high recall over speed—choose HNSW.
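To make the trade-off concrete, here is what the three index types look like in FAISS; the parameters (neighbors per node, nlist, nprobe) and data sizes are illustrative rather than tuned values:

```python
# FLAT vs HNSW vs IVF in FAISS; parameters below are illustrative, not tuned.
import faiss
import numpy as np

d = 384                                                  # embedding dimension
xb = np.random.rand(50_000, d).astype("float32")         # stand-in for your memory embeddings

# FLAT: exact brute-force search, fine for small collections.
flat = faiss.IndexFlatL2(d)
flat.add(xb)

# HNSW: graph-based ANN, high recall, memory-hungry. 32 = neighbors per node.
hnsw = faiss.IndexHNSWFlat(d, 32)
hnsw.add(xb)

# IVF: clusters vectors into nlist buckets, searches only nprobe of them.
nlist = 512
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(xb)                                            # IVF needs a training pass to learn the buckets
ivf.add(xb)
ivf.nprobe = 16                                          # more probes = better recall, higher latency

query = np.random.rand(1, d).astype("float32")
distances, ids = hnsw.search(query, 5)
```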
Latency Optimization: From Seconds to Milliseconds
A user asks a question, the Agent retrieves memories, reasons, generates a response—latency stacks at each step. Full-context solutions are slow because they process ultra-long contexts before inference, with p95 latency reaching 17 seconds.
Optimization approach: place retrieval before inference, and make it fast.
Redis as a unified platform achieves sub-millisecond queries. It simultaneously supports vector search, event streams, KV storage—working memory, episodic memory, semantic memory can all live in one place, eliminating cross-service network latency.
Another pitfall is stacking multiple inference passes. Some designs go: retrieve → have an LLM organize the results → then reason to produce the answer. That's two LLM calls and double the latency. The better approach: inject the retrieval results directly into the context and answer in a single inference pass.
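A sketch of that single-pass pattern, reusing the retrieve_memories function from the retrieval stage; llm.chat and the memory's text field are placeholders:

```python
# One-pass pattern: inject retrieved memories into the prompt instead of adding a second LLM call.
# retrieve_memories() is the hybrid-retrieval sketch from earlier; llm.chat() is a placeholder client.
def answer(query: str, user_id: str) -> str:
    memories = retrieve_memories(query, user_id)           # fast retrieval happens before inference
    memory_block = "\n".join(f"- {m.text}" for m in memories)
    prompt = (
        "Relevant things you already know about this user:\n"
        f"{memory_block}\n\n"
        f"User question: {query}"
    )
    return llm.chat(prompt)                                # single inference pass
```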
Cost Control: The Secret to 10x Savings
I mentioned earlier: full-context vs selective memory has a 10x cost gap. How?
Three core strategies:
Selective Memory: Only store valuable information, don’t stuff all conversation history. Extraction stage filters noise; storage stage controls memory quantity.
Summary Compression: Raw conversations might be thousands of words, summaries can be hundreds. Periodically use LLM to compress episodic memories into compact versions, reducing token consumption.
Smart Forgetting: Storage expands infinitely; periodically clean low-importance memories. Temporal decay + access frequency eviction keeps the memory pool at a controllable scale.
According to Mem0’s official team estimates, selective memory can compress monthly costs from $1 million to $100K—mainly from reduced token overhead and storage costs.
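A sketch of the summary-compression step, where a window of raw episodic events is condensed into one record before storage; llm.summarize and the event fields are placeholders for your actual model call and schema:

```python
# Sketch: compress a window of raw episodic events into one short summary record.
def compress_episodes(events: list[dict], max_words: int = 120) -> dict:
    transcript = "\n".join(f'{e["role"]}: {e["content"]}' for e in events)
    summary = llm.summarize(
        transcript,
        instruction=f"Keep decisions, preferences and open issues; at most {max_words} words.",
    )
    return {
        "type": "episode_summary",
        "covers": (events[0]["ts"], events[-1]["ts"]),   # time range the summary replaces
        "content": summary,
    }
```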
Security and Privacy: Memory Isolation
Memory systems store user data—security design can’t be sloppy.
Memory Isolation: Each user’s memories are stored independently; retrieval strictly filters by user_id. Absolutely no “User A retrieved User B’s memories” incidents.
Memory Poisoning Defense: Malicious users might deliberately input false information, hoping the Agent stores wrong facts. Consolidation stage needs validation logic—mark low-confidence information as “pending confirmation,” don’t write directly to long-term memory.
Data Masking: Sensitive information (phone numbers, ID numbers) must be masked before storage, and restoring it after retrieval should require a permission check.
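A simplistic masking sketch for Chinese mobile numbers and 18-digit ID numbers; the patterns are illustrative and should be adapted to the formats you actually handle:

```python
# Illustrative masking of phone numbers and ID-like digit runs before a memory is persisted.
import re

def mask_sensitive(text: str) -> str:
    # CN mobile numbers: keep the first 3 and last 4 digits.
    text = re.sub(r"\b1\d{10}\b", lambda m: m.group()[:3] + "****" + m.group()[-4:], text)
    # 18-character ID numbers: redact entirely.
    text = re.sub(r"\b\d{17}[\dXx]\b", "[ID_REDACTED]", text)
    return text

print(mask_sensitive("Call me at 13812345678, ID 11010119900101123X"))
# -> "Call me at 138****5678, ID [ID_REDACTED]"
```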
Consistency Maintenance: Distributed Locks + Reflection
With multi-instance deployment, memory consistency becomes an issue. Instance A updates a memory, Instance B might still be using the old version.
Two mechanisms solve this:
Distributed Locks + Version Control: Lock before updating memory, write new version after updating. Retrieval defaults to latest version, avoiding reading stale data.
Periodic Reflection: Periodically let LLM check the memory base, discover contradictions or outdated information, actively clean or update. Alibaba Cloud AnalyticDB’s solution has this built-in.
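A sketch of the first mechanism (a Redis lock guarding the update plus a version counter) using redis-py; key names and timeouts are illustrative:

```python
# Optimistic update of a memory record guarded by a Redis lock plus a version counter.
import json
import redis

r = redis.Redis(decode_responses=True)

def update_memory(memory_id: str, new_content: str) -> bool:
    lock = r.lock(f"lock:memory:{memory_id}", timeout=5)    # auto-expires if the holder crashes
    if not lock.acquire(blocking=True, blocking_timeout=2):
        return False                                         # someone else is updating; retry later
    try:
        key = f"memory:{memory_id}"
        current = json.loads(r.get(key) or '{"version": 0}')
        current["content"] = new_content
        current["version"] += 1                              # readers always take the highest version
        r.set(key, json.dumps(current))
        return True
    finally:
        lock.release()
```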
Conclusion
At its core, a memory system isn’t an “optional feature” for Agents—it’s the core capability that distinguishes them from ordinary LLM interfaces. An Agent without memory starts fresh with every conversation. It can never truly “understand” users or maintain coherence in long-horizon tasks.
But installing a memory system isn’t a one-and-done deal. You need to think clearly: do you need knowledge graphs? Can you accept managed services? What’s your current framework binding? Once these questions are answered, framework selection becomes clear.
If you’re still unsure, I recommend starting experiments with LangMem or Mem0’s open-source version—minimum investment, most intuitive results. Once you’re comfortable with working memory, then consider expanding to episodic and long-term memory.