Dilip Singh logo
All posts
AI ArchitectureIntermediate2026-06-05·10 min read

Prompt Engineering in Production: Templates, Versioning & A/B Testing

Production prompt engineering is software engineering. Learn how to template, version, evaluate, and A/B test prompts so your AI features improve continuously instead of regressing silently.

Prompts Are Code

The biggest mistake teams make is treating prompts as configuration strings. Prompts deserve version control, code review, tests, and rollout strategies — the same as any other code.

Template Everything

python
from jinja2 import Template

CLASSIFY_INTENT = Template(""" You are an intent classifier for a {{ domain }} support system.

Classify the message into one of: {{ ", ".join(intents) }}.

Recent context (last {{ context_n }} messages): {% for m in recent_messages %} {{ m.role }}: {{ m.content }} {% endfor %}

Message: {{ user_message }}

Respond with ONLY the intent label. """.strip())

prompt = CLASSIFY_INTENT.render( domain="healthcare", intents=["appointment", "billing", "clinical", "other"], context_n=5, recent_messages=history[-5:], user_message=current_msg, ) ```

Version Every Prompt

python
PROMPTS = {
    "classify_intent": {
        "v1": "Classify: {{ message }}",
        "v2": CLASSIFY_INTENT,  # Above template
        "v3": CLASSIFY_INTENT_WITH_EXAMPLES,
    }
}

def render(name: str, version: str, **vars) -> str: template = PROMPTS[name][version] return template.render(vars) if hasattr(template, "render") else template.format(vars) ```

A/B Test Prompts in Production

python
import hashlib

def select_prompt_version(name: str, user_id: str, experiment: dict) -> str: """Deterministic A/B split based on user_id.""" bucket = int(hashlib.md5(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100 cumulative = 0 for version, percentage in experiment["splits"].items(): cumulative += percentage if bucket < cumulative: return version return experiment["control"]

EXPERIMENTS = { "classify_intent": { "control": "v2", "splits": {"v2": 90, "v3": 10}, } } ```

Track Quality Per Version

Every LLM call logs the prompt version to LangFuse:

python
trace = langfuse.trace(
    name="classify_intent",
    metadata={"prompt_version": version, "user_id": user_id}
)
generation = trace.generation(
    name="classify",
    model="claude-sonnet-4-6",
    input=rendered_prompt,
    output=response,
)

# Later: compare accuracy across versions SELECT prompt_version, AVG(human_correct::int) FROM generations WHERE name = 'classify_intent' AND created_at > NOW() - INTERVAL '7 days' GROUP BY prompt_version; ```

Prompt Test Suite

Before promoting v3 to 100%, run it through a regression suite:

python
test_cases = [
    {"message": "I need to reschedule my appointment", "expected": "appointment"},
    {"message": "Why was I charged $250?", "expected": "billing"},
    {"message": "My blood pressure is 140/90", "expected": "clinical"},
]

def test_prompt_version(version: str) -> float: correct = 0 for case in test_cases: prompt = render("classify_intent", version, user_message=case["message"]) result = llm(prompt).strip().lower() if result == case["expected"]: correct += 1 return correct / len(test_cases) ```

Set a minimum bar (95% on suite) before any version reaches > 10% traffic.

DS
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.