Dilip Singh logo
All posts
AI ArchitectureAdvanced2026-04-20·11 min read

Designing Agent Memory: Short-Term, Long-Term, Episodic & Semantic

How to architect memory for AI agents that need to learn from past interactions. Short-term context windows, long-term vector memory, episodic memory, and semantic distillation patterns.

Memory is the Hardest Agent Problem

LLMs are stateless. Every "memory" your agent has is something you explicitly retrieve and inject into its context. Designing that retrieval well is what separates a chatbot from an actual assistant.

I split agent memory into four layers, each with its own storage and access pattern.

Layer 1: Short-Term (Working Memory)

The last N turns of the current conversation. Lives in Redis with a TTL.

python
async def append_turn(thread_id: str, role: str, content: str):
    await redis.lpush(f"thread:{thread_id}:turns",
                       json.dumps({"role": role, "content": content, "ts": time.time()}))
    await redis.ltrim(f"thread:{thread_id}:turns", 0, 19)  # Keep last 20
    await redis.expire(f"thread:{thread_id}:turns", 86400)  # 24h TTL

Layer 2: Long-Term (Semantic Memory)

Facts about the user, distilled across all their sessions. Stored as embeddings.

python
async def extract_facts(thread_id: str):
    turns = await get_turns(thread_id)
    facts = await llm.extract_structured(
        prompt=FACT_EXTRACTION_PROMPT,
        text="\n".join(t["content"] for t in turns),
        schema=FactList,
    )
    for fact in facts.facts:
        await qdrant.upsert("user_facts", [{
            "id": uuid4().hex,
            "vector": await embed(fact.statement),
            "payload": {
                "user_id": user_id, "fact": fact.statement,
                "confidence": fact.confidence, "source_thread": thread_id,
                "ts": time.time(),
            },
        }])

async def recall_facts(user_id: str, query: str, k: int = 5) -> list[str]: qv = await embed(query) results = qdrant.search("user_facts", qv, query_filter={"must": [{"key": "user_id", "match": {"value": user_id}}]}, limit=k) return [r.payload["fact"] for r in results] ```

Layer 3: Episodic Memory

Specific past events with timestamps — "Last Tuesday we discussed X". Used for temporal recall:

python
class Episode(BaseModel):
    user_id: str
    summary: str
    happened_at: datetime
    participants: list[str]
    outcome: str | None

Stored in Postgres (not vector DB) because temporal queries dominate semantic ones.

Layer 4: Procedural Memory

Patterns the agent has learned about how to do its job. This is your evolving system prompt, examples library, and tool selection heuristics.

python
SUCCESSFUL_PATTERNS = await db.fetch("""
    SELECT pattern, success_rate
    FROM agent_patterns
    WHERE task_type = $1 AND success_rate > 0.85
    ORDER BY success_rate DESC LIMIT 5
""", task_type)

Inject these as few-shot examples in the system prompt.

Putting It Together: Context Assembly

python
async def build_context(thread_id: str, user_id: str, current_query: str) -> str:
    short_term = await get_turns(thread_id, n=10)
    long_term = await recall_facts(user_id, current_query, k=5)
    episodic = await recall_episodes(user_id, current_query, k=2)
    patterns = await get_patterns(task_type)

return f""" [User Facts] {format_facts(long_term)}

[Recent Episodes] {format_episodes(episodic)}

[Conversation So Far] {format_turns(short_term)}

[Current Query] {current_query} """.strip() ```

Privacy Hygiene

  • Source (which session/document?)
  • Confidence (how sure are we?)
  • Expiration (when does this become stale?)
  • User-controlled deletion (one click forgets everything)
DS
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.