Prompt Engineering in Production: Templates, Versioning & A/B Testing
Production prompt engineering is software engineering. Learn how to template, version, evaluate, and A/B test prompts so your AI features improve continuously instead of regressing silently.
Prompts Are Code
The biggest mistake teams make is treating prompts as configuration strings. Prompts deserve version control, code review, tests, and rollout strategies — the same as any other code.
Template Everything
from jinja2 import TemplateCLASSIFY_INTENT = Template(""" You are an intent classifier for a {{ domain }} support system.
Classify the message into one of: {{ ", ".join(intents) }}.
Recent context (last {{ context_n }} messages): {% for m in recent_messages %} {{ m.role }}: {{ m.content }} {% endfor %}
Message: {{ user_message }}
Respond with ONLY the intent label. """.strip())
prompt = CLASSIFY_INTENT.render( domain="healthcare", intents=["appointment", "billing", "clinical", "other"], context_n=5, recent_messages=history[-5:], user_message=current_msg, ) ```
Version Every Prompt
PROMPTS = {
"classify_intent": {
"v1": "Classify: {{ message }}",
"v2": CLASSIFY_INTENT, # Above template
"v3": CLASSIFY_INTENT_WITH_EXAMPLES,
}
}def render(name: str, version: str, **vars) -> str: template = PROMPTS[name][version] return template.render(vars) if hasattr(template, "render") else template.format(vars) ```
A/B Test Prompts in Production
import hashlibdef select_prompt_version(name: str, user_id: str, experiment: dict) -> str: """Deterministic A/B split based on user_id.""" bucket = int(hashlib.md5(f"{name}:{user_id}".encode()).hexdigest(), 16) % 100 cumulative = 0 for version, percentage in experiment["splits"].items(): cumulative += percentage if bucket < cumulative: return version return experiment["control"]
EXPERIMENTS = { "classify_intent": { "control": "v2", "splits": {"v2": 90, "v3": 10}, } } ```
Track Quality Per Version
Every LLM call logs the prompt version to LangFuse:
trace = langfuse.trace(
name="classify_intent",
metadata={"prompt_version": version, "user_id": user_id}
)
generation = trace.generation(
name="classify",
model="claude-sonnet-4-6",
input=rendered_prompt,
output=response,
)# Later: compare accuracy across versions SELECT prompt_version, AVG(human_correct::int) FROM generations WHERE name = 'classify_intent' AND created_at > NOW() - INTERVAL '7 days' GROUP BY prompt_version; ```
Prompt Test Suite
Before promoting v3 to 100%, run it through a regression suite:
test_cases = [
{"message": "I need to reschedule my appointment", "expected": "appointment"},
{"message": "Why was I charged $250?", "expected": "billing"},
{"message": "My blood pressure is 140/90", "expected": "clinical"},
]def test_prompt_version(version: str) -> float: correct = 0 for case in test_cases: prompt = render("classify_intent", version, user_message=case["message"]) result = llm(prompt).strip().lower() if result == case["expected"]: correct += 1 return correct / len(test_cases) ```
Set a minimum bar (95% on suite) before any version reaches > 10% traffic.