AI Architecture · Advanced · 2026-06-20 · 13 min read

LangGraph for Production: Stateful Multi-Agent Workflows That Actually Ship

LangGraph adds graph-based state machines to LangChain. Learn how to model multi-agent coordination, conditional branching, human-in-the-loop, and persistent state for production AI workflows.

Why LangGraph and Not Just LangChain?

LangChain chains are linear. Real production agents need cycles: an agent calls a tool, evaluates the result, decides whether to call another tool, and only stops when a condition is met. That's a graph, not a chain — and LangGraph models it natively.

After shipping LangGraph in three production systems at Hureka, I now reach for it whenever a workflow has branching, retries, or multiple agents collaborating.

Modeling State as a Graph

```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage

# llm_plan, execute_step, and human_review are application-specific helpers
# defined elsewhere.

class AgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], "Conversation history"]
    plan: list[str]
    completed: list[str]
    needs_human: bool

def planner(state: AgentState) -> dict:
    # Nodes return a partial update, so the return type is a plain dict.
    plan = llm_plan(state["messages"])
    return {"plan": plan, "completed": [], "needs_human": False}

def executor(state: AgentState) -> dict:
    # Run the next unfinished step of the plan.
    next_step = state["plan"][len(state["completed"])]
    result = execute_step(next_step)
    return {"completed": state["completed"] + [result]}

def router(state: AgentState) -> str:
    if state["needs_human"]:
        return "human"
    if len(state["completed"]) < len(state["plan"]):
        return "executor"
    return END

graph = StateGraph(AgentState)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_node("human", human_review)
graph.set_entry_point("planner")
graph.add_conditional_edges("executor", router)
graph.add_edge("planner", "executor")
graph.add_edge("human", "executor")
app = graph.compile()
```
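Each node above returns only the keys it changes, and the framework merges that partial update into the shared state. Here is a rough plain-Python model of the merge, assuming overwrite-by-default with an optional per-key reducer. `REDUCERS` and `apply_update` are hypothetical names for this sketch, not LangGraph APIs, and the real merge logic is more involved.

```python
import operator
from typing import Any, Callable

# Hypothetical reducer table for this sketch: keys listed here are combined
# with the existing value; every other key is simply overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": operator.add}


def apply_update(state: dict, update: dict) -> dict:
    """Return a new state with a node's partial update merged in."""
    merged = dict(state)
    for key, value in update.items():
        if key in REDUCERS and key in state:
            merged[key] = REDUCERS[key](state[key], value)  # e.g. list concatenation
        else:
            merged[key] = value  # last write wins
    return merged


state = {"messages": ["hi"], "plan": [], "completed": [], "needs_human": False}
# Planner-style update: only the plan changes.
state = apply_update(state, {"plan": ["fetch", "summarize"]})
# Executor-style update: one message is appended, completed is overwritten.
state = apply_update(state, {"messages": ["step done"], "completed": ["fetch: ok"]})
```

In LangGraph itself a reducer is declared in the state annotation, for example `Annotated[list, operator.add]`. A plain string annotation only documents the field, so as far as I can tell such a key is overwritten on every update rather than appended to.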

Persistence and Resumability

LangGraph's checkpointer saves state after every node — your workflow survives crashes, restarts, and long-running human review delays.

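```python
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver

# In recent langgraph-checkpoint-postgres releases, from_conn_string is a
# context manager, and the async saver is the one that pairs with ainvoke.
async with AsyncPostgresSaver.from_conn_string(DB_URL) as checkpointer:
    await checkpointer.setup()  # creates the checkpoint tables on first run
    app = graph.compile(checkpointer=checkpointer)

    # Reusing a thread_id continues from that thread's last saved checkpoint
    config = {"configurable": {"thread_id": "user-abc-session-42"}}
    result = await app.ainvoke({"messages": [user_input]}, config=config)
```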
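To make the save-after-every-node idea concrete, here is a toy model that runs on the standard library alone. It is a sketch of the concept, not how LangGraph's checkpointers are implemented: `ToyCheckpointer`, `run_workflow`, and the three step names are all hypothetical, and an in-memory SQLite database stands in for Postgres.

```python
import json
import sqlite3

STEPS = ["fetch", "summarize", "email"]  # hypothetical workflow steps


class ToyCheckpointer:
    """Keeps the latest state per thread_id in SQLite. Illustration only."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )

    def save(self, thread_id: str, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id: str):
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def run_workflow(thread_id: str, cp: ToyCheckpointer, crash_at=None) -> dict:
    # Start from the saved state if this thread already has one.
    state = cp.load(thread_id) or {"completed": []}
    for step in STEPS[len(state["completed"]):]:
        if step == crash_at:
            raise RuntimeError(f"simulated crash before {step!r}")
        state = {"completed": state["completed"] + [step]}
        cp.save(thread_id, state)  # checkpoint after every step
    return state


cp = ToyCheckpointer(sqlite3.connect(":memory:"))
thread = "user-abc-session-42"
try:
    run_workflow(thread, cp, crash_at="summarize")
except RuntimeError:
    pass  # the run died after finishing only the first step
after_crash = cp.load(thread)
resumed = run_workflow(thread, cp)  # second run skips the completed step
```

The second call finds the checkpoint written before the crash and runs only the remaining steps, which is the property that lets a workflow sit through a long human review without losing progress.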

Human-in-the-Loop Without Polling

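```python
# Placeholder review node: clears the flag so the router does not send the
# run straight back here after the next executor step.
graph.add_node("human", lambda s: {"needs_human": False})

app = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["human"],  # pause before the human node, return control
)

# The run returns as soon as it pauses, so inspect the saved state
paused_state = await app.aget_state(config)
if paused_state.next == ("human",):
    human_decision = await get_human_approval(paused_state)  # app-specific helper
    await app.aupdate_state(config, {"needs_human": False})
    await app.ainvoke(None, config=config)  # None as input resumes the thread
```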
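The pause-and-resume flow above can be pictured with a small stand-alone runner. This is a conceptual sketch under simplifying assumptions (a fixed linear node order, no persistence), not LangGraph's scheduler. `run_until_interrupt` and the three toy nodes are hypothetical.

```python
from typing import Callable

Node = Callable[[dict], dict]


def run_until_interrupt(
    nodes: dict[str, Node],
    order: list[str],
    state: dict,
    start: int = 0,
    interrupt_before: frozenset = frozenset(),
    resuming: bool = False,
) -> tuple[dict, int]:
    """Run nodes in order and stop before any node named in interrupt_before.

    Returns (state, next_index); next_index == len(order) means the run finished.
    """
    i = start
    while i < len(order):
        name = order[i]
        # When resuming, the node we paused at is allowed to run.
        if name in interrupt_before and not (resuming and i == start):
            return state, i
        state = {**state, **nodes[name](state)}
        i += 1
    return state, i


nodes = {
    "draft": lambda s: {"draft": "refund $40"},
    "human": lambda s: {"reviewed": True},
    "send": lambda s: {"sent": s["approved"]},
}
order = ["draft", "human", "send"]
pause = frozenset({"human"})

state, next_index = run_until_interrupt(nodes, order, {}, interrupt_before=pause)
paused_at = order[next_index]  # "human": control is back with the caller

state = {**state, "approved": True}  # reviewer's decision written into the state
state, next_index = run_until_interrupt(
    nodes, order, state, start=next_index, interrupt_before=pause, resuming=True
)
```

The caller gets control back at the pause point, writes the reviewer's decision into the state, and then resumes. Nothing inside the runner waits or loops while the human is deciding.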

Lessons from Production

  1. Type your state: TypedDict catches 80% of bugs before runtime
  2. Keep nodes pure: a node should take state and return a partial update, nothing else
  3. Use checkpointers from day one: adding persistence later means rewriting
  4. Visualize the graph: `app.get_graph().draw_mermaid()` saves hours in code review
  5. Test the router functions separately: routing logic is the most error-prone part
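
Lesson 5 is cheap to act on because a router is a plain function of state. Below is a minimal table-driven check of the routing rules used earlier. The string `END` is a stand-in so the sketch runs without LangGraph installed (an assumption of this example; a real test would import the constant from `langgraph.graph`), and `CASES` and `check_router` are hypothetical names.

```python
END = "__end__"  # stand-in for langgraph.graph.END in this sketch


def router(state: dict) -> str:
    """Same routing rules as the router defined earlier in the post."""
    if state["needs_human"]:
        return "human"
    if len(state["completed"]) < len(state["plan"]):
        return "executor"
    return END


# (state, expected next node) pairs covering each branch and two edge cases.
CASES = [
    ({"needs_human": True, "plan": ["a"], "completed": []}, "human"),
    ({"needs_human": True, "plan": ["a"], "completed": ["a"]}, "human"),
    ({"needs_human": False, "plan": ["a", "b"], "completed": ["a"]}, "executor"),
    ({"needs_human": False, "plan": ["a"], "completed": ["a"]}, END),
    ({"needs_human": False, "plan": [], "completed": []}, END),
]


def check_router() -> int:
    """Assert every case routes as expected; return the number of cases checked."""
    for state, expected in CASES:
        actual = router(state)
        assert actual == expected, f"{state} routed to {actual}, expected {expected}"
    return len(CASES)


checked = check_router()
```

The second case pins down a precedence rule that is easy to miss: a pending human review wins even when the plan is already complete.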
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.