Gloss Key Takeaways

Classic RAG breaks down for agents because much of what they need is session-born state (preferences, partial work, tool results) that isn’t in a static corpus and doesn’t embed well.
Treat memory as a first-class system: agents should write back, consolidate, and forget—not just retrieve top-k chunks by similarity.
A workable agent memory stack has three distinct layers with different lifetimes and access patterns: episodic (timestamped events), semantic (stable facts/preferences), and working (task scratchpad).
Episodic memory is best queried by recency/session/entity, semantic memory is where embeddings and deduplication shine, and working memory should be summarized into longer-term stores at task end.
You can implement this with a simple reference architecture: an agent loop plus separate working/episodic/semantic stores and an async consolidator, using Postgres (and pgvector for semantic) rather than a single noisy vector store.

From RAG to Agentic Memory, a Working Blueprint

The 2026 consensus among people actually shipping agents is that classic RAG is hitting a wall. Stuff documents into a vector store, retrieve top-k by cosine similarity, paste into the prompt, hope the model picks the right sentences. It works for static FAQs. It falls apart the moment your agent needs to act over time, remember a user, or correct itself after a wrong move. The fix is not a bigger embedding model. The fix is treating memory as a first-class system, not a search index bolted onto a chat loop.

This is what people mean when they talk about agentic or contextual memory. The agent does not just retrieve, it remembers, forgets, consolidates, and writes back. Below is a working blueprint you can build this week, with the layers, the runtime, and a minimal code example.

Why RAG quietly broke

RAG assumes the right answer already exists somewhere in your corpus and the only problem is finding it. Agents violate that assumption immediately. Half of what an agent needs is information that did not exist before this session: the user's stated preferences, partial work from earlier turns, the result of a tool call that failed two minutes ago, the running plan it is halfway through executing. None of that lives in your wiki. None of it has a useful embedding.

Pile that into a single vector store and one of two things happens. Either the retrieval drowns in noise, because every recent conversation gets embedded and ranked the same way as your product docs. Or you keep your store clean and the agent forgets everything the user said five turns ago. Neither is a memory system. Both are excuses dressed up as architecture.

The three layers that actually matter

Borrow the structure from cognitive science, not because brains are LLMs but because the categories are useful.

Episodic memory stores specific events. The user asked X at timestamp T. The tool returned this error. The agent decided to take this branch. Episodic entries are append-only, timestamped, and contextual. You query them by recency, by session, by entity, rarely by raw similarity.
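
As a rough illustration of that access pattern, here is a minimal in-memory sketch of an append-only episode log. The `Episode` and `EpisodeLog` names, the `entity` field, and the default limit are assumptions made for this example, not part of any library. The point is only that lookups filter by session, entity, and time, and never touch an embedding.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Episode:
    ts: datetime
    session_id: str
    actor: str                 # e.g. "user", "agent", or "tool"
    entity: Optional[str]      # hypothetical: what the event is about
    payload: dict[str, Any]


@dataclass
class EpisodeLog:
    _rows: list[Episode] = field(default_factory=list)

    def append(self, episode: Episode) -> None:
        # Append-only: the log exposes no update or delete.
        self._rows.append(episode)

    def recent(self, *, session_id=None, entity=None, since=None, limit=12):
        # Filter by session, entity, and time window; no similarity involved.
        hits = [
            e for e in self._rows
            if (session_id is None or e.session_id == session_id)
            and (entity is None or e.entity == entity)
            and (since is None or e.ts >= since)
        ]
        hits.sort(key=lambda e: e.ts, reverse=True)  # newest first
        return hits[:limit]
```

A call such as `log.recent(session_id="s1", since=an_hour_ago)` answers "what just happened here?", which is the question episodic memory exists to answer.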

Semantic memory stores facts and stable preferences. The user prefers metric units. This customer is on the enterprise plan. The codebase uses pnpm, not npm. Semantic entries are deduplicated, refined over time, and queried by topic or entity. This is the layer where embeddings actually earn their keep.
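
One way to picture the deduplication is a store with a single slot per (entity, key) pair. The sketch below is an in-memory stand-in under that assumption. `Fact`, `SemanticStore`, and the `confirmations` counter are hypothetical names, and the embedding column is left out to keep the dedup logic visible.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Fact:
    entity: str
    key: str
    value: str
    updated_at: datetime
    confirmations: int = 1


@dataclass
class SemanticStore:
    _facts: dict[tuple[str, str], Fact] = field(default_factory=dict)

    def upsert(self, entity: str, key: str, value: str) -> Fact:
        now = datetime.now(timezone.utc)
        # Normalize the slot so "User:42"/"Units" and "user:42"/"units" collide.
        slot = (entity.strip().lower(), key.strip().lower())
        existing = self._facts.get(slot)
        if existing is not None and existing.value == value:
            # Same fact seen again: count it as a confirmation, not a new row.
            existing.confirmations += 1
            existing.updated_at = now
            return existing
        # New fact, or a changed value: the latest statement replaces the old one.
        fact = Fact(slot[0], slot[1], value, now)
        self._facts[slot] = fact
        return fact

    def about(self, entity: str) -> list[Fact]:
        wanted = entity.strip().lower()
        return [f for (ent, _), f in self._facts.items() if ent == wanted]
```

Letting the newest value win is a deliberate simplification. A real store might keep the old value around with lower confidence instead of overwriting it.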

Working memory is the scratchpad for the current task. The plan, the intermediate results, the next tool call. It lives for the duration of one task and gets summarized into episodic or semantic memory when the task ends. Working memory is the thing most agents skip, which is why they lose the plot two tool calls in.
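
A scratchpad can be as small as the sketch below. `WorkingMemory` and its fields are assumptions for illustration. The only behavior that matters is that the plan and intermediate results live in one place, and that `summarize()` produces something you can hand to a longer-term store when the task ends.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkingMemory:
    goal: str
    plan: list[str] = field(default_factory=list)
    results: dict[str, Any] = field(default_factory=dict)
    cursor: int = 0  # index of the next step to run

    def next_step(self):
        return self.plan[self.cursor] if self.cursor < len(self.plan) else None

    def record(self, result: Any) -> None:
        step = self.next_step()
        if step is None:
            raise RuntimeError("no step left to record a result for")
        self.results[step] = result
        self.cursor += 1

    def summarize(self) -> dict[str, Any]:
        # Snapshot to write into episodic or semantic memory at task end.
        return {
            "goal": self.goal,
            "completed": self.plan[: self.cursor],
            "remaining": self.plan[self.cursor:],
            "results": dict(self.results),
        }
```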

A useful mental model: episodic is the journal, semantic is the address book, working is the sticky note on the desk. Different access patterns, different lifetimes, different stores.

Reference architecture

You do not need a new framework. You need four boxes around the agent loop (three stores and a consolidator) and a clear contract between them.

            +------------------+
            |   Agent Loop     |
            | (LangGraph or    |
            |  custom runtime) |
            +--------+---------+
                     |
        +------------+-------------+
        |            |             |
+-------v----+ +-----v-----+ +-----v------+
|  Working   | | Episodic  | |  Semantic  |
|  Memory    | |  Store    | |  Store     |
| (in-proc)  | | (Postgres)| | (Postgres  |
|            | |           | | + pgvector)|
+------------+ +-----------+ +------------+
                     |             |
                     +------+------+
                            |
                     +------v------+
                     | Consolidator|
                     |  (async)    |
                     +-------------+

Postgres for everything durable. One table for episodes with a timestamp, session id, actor, and event payload. One table for semantic facts with an entity, a key, a value, and an embedding. Working memory stays in process and never touches disk unless the task is interrupted. The consolidator is a background job that reads recent episodes, extracts stable facts, deduplicates them against semantic memory, and writes back. It runs every few minutes, not on every turn.
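
A possible starting schema for those two tables is sketched below as a Python constant holding Postgres DDL. The table names, column types, index, and the 1536-dimension embedding are all assumptions for this example, so match the dimension to whatever embedding model you actually use. The `vector` type comes from the pgvector extension.

```python
# Hypothetical schema: names, types, and the embedding dimension are assumptions.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS episodes (
    id          BIGSERIAL PRIMARY KEY,
    ts          TIMESTAMPTZ NOT NULL DEFAULT now(),
    session_id  TEXT NOT NULL,
    actor       TEXT NOT NULL,
    payload     JSONB NOT NULL
);
CREATE INDEX IF NOT EXISTS episodes_session_ts ON episodes (session_id, ts DESC);

CREATE TABLE IF NOT EXISTS facts (
    id          BIGSERIAL PRIMARY KEY,
    entity      TEXT NOT NULL,
    key         TEXT NOT NULL,
    value       TEXT NOT NULL,
    embedding   vector(1536),
    UNIQUE (entity, key)
);
"""


def schema_statements(sql: str = SCHEMA_SQL) -> list[str]:
    # Naive split on ';' is fine here because no statement contains one.
    return [stmt.strip() for stmt in sql.split(";") if stmt.strip()]
```

The `UNIQUE (entity, key)` constraint is what lets the consolidator upsert rather than pile up duplicate facts. Feed each statement from `schema_statements()` to whatever Postgres client you already use.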

LangGraph fits this shape cleanly because it already models state as a typed object that flows through nodes. If you do not want the dependency, a 200-line Python loop with explicit state works fine. The runtime choice is the least interesting decision in this stack.
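
For a sense of what "a Python loop with explicit state" can mean, here is a deliberately tiny sketch. `AgentState`, `run`, and the two toy steps are hypothetical. A real loop would call a model and tools inside the steps, but the shape stays the same: state goes in, state comes out, and the next step is named in the state itself.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class AgentState:
    query: str
    working: dict[str, Any] = field(default_factory=dict)
    done: bool = False
    answer: Optional[str] = None


Step = Callable[[AgentState], AgentState]


def run(state: AgentState, steps: dict[str, Step], start: str, max_turns: int = 10) -> AgentState:
    state.working["next"] = start
    for _ in range(max_turns):  # hard cap so a confused agent cannot spin forever
        if state.done:
            break
        state = steps[state.working["next"]](state)
    return state


# Toy steps standing in for model and tool calls.
def plan(state: AgentState) -> AgentState:
    state.working["plan"] = ["lookup"]
    state.working["next"] = "act"
    return state


def act(state: AgentState) -> AgentState:
    state.answer = f"handled: {state.query}"
    state.done = True
    return state
```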

Minimal code example

Here is the read path, which is where most teams overcomplicate things.

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any

@dataclass
class MemoryContext:
    working: dict[str, Any] = field(default_factory=dict)
    recent_episodes: list[dict] = field(default_factory=list)
    relevant_facts: list[dict] = field(default_factory=list)

def build_context(session_id: str, user_id: str, query: str, db, embed) -> MemoryContext:
    ctx = MemoryContext()
    # Working memory is whatever the current graph node holds; the caller
    # assigns it to ctx.working, so nothing is fetched for it here.
    # Episodic: up to 12 events from this session within the last hour,
    # keyed by session and time rather than by similarity.
    ctx.recent_episodes = db.fetch_episodes(
        session_id=session_id,
        since=datetime.now(timezone.utc) - timedelta(hours=1),  # timezone-aware UTC
        limit=12,
    )
    # Semantic: top-k facts owned by this user, ranked by similarity to the query.
    qvec = embed(query)
    ctx.relevant_facts = db.search_facts(
        owner=user_id,
        query_vector=qvec,
        limit=8,
        min_score=0.78,  # drop weak matches instead of padding the prompt
    )
    return ctx

def render_prompt(ctx: MemoryContext, query: str) -> str:
    parts = []
    if ctx.relevant_facts:
        parts.append("Known facts:\n" + "\n".join(f"- {f['key']}: {f['value']}" for f in ctx.relevant_facts))
    if ctx.recent_episodes:
        parts.append("Recent activity:\n" + "\n".join(f"[{e['ts']}] {e['summary']}" for e in ctx.recent_episodes))
    parts.append(f"User: {query}")
    return "\n\n".join(parts)

Two things to notice. First, the prompt has structure. Facts are labeled as facts, episodes are labeled as episodes, the user's query is the user's query. The model is much better at using context when the context is honest about what it is. Second, there is no single retrieval call. Episodic retrieval is keyed by session and time. Semantic retrieval is keyed by user and vector similarity. Merging them into one ranked list brings back the noise problem, with recent chatter competing against stable facts for the same top-k slots.
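
The read path assumes a `db` object with `fetch_episodes` and `search_facts` methods. Neither is a real library API. They are whatever thin wrapper you put over your tables. To try the read path without a database, an in-memory stand-in along these lines is enough. The dict shapes and the hand-rolled cosine function are assumptions for the sketch.

```python
import math
from datetime import datetime, timedelta, timezone


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InMemoryDB:
    def __init__(self):
        self.episodes = []  # dicts with "session_id", "ts", "summary"
        self.facts = []     # dicts with "owner", "key", "value", "vector"

    def fetch_episodes(self, session_id, since, limit):
        # Keyed by session and time; assumes limit is a positive integer.
        rows = [e for e in self.episodes
                if e["session_id"] == session_id and e["ts"] >= since]
        rows.sort(key=lambda e: e["ts"])
        return rows[-limit:]  # the most recent `limit`, oldest first

    def search_facts(self, owner, query_vector, limit, min_score):
        # Keyed by owner first, then ranked by similarity above a floor.
        scored = [(cosine(query_vector, f["vector"]), f)
                  for f in self.facts if f["owner"] == owner]
        scored = [(s, f) for s, f in scored if s >= min_score]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [f for _, f in scored[:limit]]
```

Note that the owner filter runs before any scoring. Another user's facts never enter the ranking, however similar their vectors are.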

The write path is where the consolidator earns its keep. After each turn, append an episode. Periodically, run a small extraction prompt over recent episodes that asks "what stable facts about this user or these entities were established here?" and upsert the answers into semantic memory with a confidence score. Decay confidence over time. Drop facts that have not been confirmed in N days. This is the part that turns a chat log into a memory.
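
The decay-and-drop half of that loop fits in a few lines. The sketch below assumes an exponential half-life for decay and fixed thresholds for dropping. Those specific choices (30-day half-life, 90-day cutoff, 0.2 floor, 0.2 boost) are placeholders picked for the example, not recommendations, and `ScoredFact` is a hypothetical type. The extraction prompt itself is left out because it depends on your model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ScoredFact:
    key: str
    value: str
    confidence: float
    last_confirmed: datetime


def decayed_confidence(fact, now, half_life_days=30.0):
    # Confidence halves every `half_life_days` since the last confirmation.
    age_days = (now - fact.last_confirmed).total_seconds() / 86400
    return fact.confidence * 0.5 ** (age_days / half_life_days)


def sweep(facts, now, max_age_days=90, min_confidence=0.2):
    # Keep facts that are both recently confirmed and still confident enough.
    kept = []
    for f in facts:
        if now - f.last_confirmed > timedelta(days=max_age_days):
            continue  # not confirmed in N days
        if decayed_confidence(f, now) < min_confidence:
            continue  # decayed below the floor
        kept.append(f)
    return kept


def confirm(fact, now, boost=0.2):
    # Called when the extraction pass sees the same fact again.
    fact.confidence = min(1.0, decayed_confidence(fact, now) + boost)
    fact.last_confirmed = now
```

Run `sweep` from the consolidator job, not from the request path, so forgetting never adds latency to a turn.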

What this buys you

An agent built this way does three things RAG cannot. It improves over a session, because working and episodic memory carry forward. It improves over a user's lifetime, because semantic memory accumulates without you manually curating it. And it can be debugged, because every claim the agent makes traces back to a specific episode or fact with a timestamp.

You will still use vector search. You will still index documents. RAG is not gone, it is just one feature of a larger system. The system is the memory, and once you have it, the agents you can build stop feeling like clever search engines and start feeling like collaborators who remember the last conversation.

Build the four boxes. Keep them separate. Let the consolidator do the slow work in the background. The hard part of agentic memory is not the embeddings, it is admitting that not all information is the same shape.

Gloss What This Means For You

If you’re building an agent, stop treating “memory” as one vector index and instead split it into working, episodic, and semantic layers with clear rules for what gets written where and how it’s retrieved. Keep working memory in-process for the current task, append events to an episodic journal keyed by session and time, and maintain a deduped semantic store for stable facts and user preferences where embeddings actually help. Add a simple async consolidator that periodically summarizes and promotes useful working/episodic details into semantic memory, so the agent improves over time without drowning retrieval in noise.