# LLM Prompt Engineering Kit
Prompts are the new code — and they deserve the same rigor. This kit provides a structured prompt template library, proven chain-of-thought patterns, a version control system for prompts, and an A/B testing framework to measure which prompts actually perform better. Stop guessing and start engineering your prompts systematically.
## Key Features
- Prompt Template Library — 50+ battle-tested templates for classification, extraction, summarization, code generation, and creative writing with variable interpolation
- Chain-of-Thought Patterns — Ready-to-use CoT, Tree-of-Thought, and self-consistency templates that can improve reasoning accuracy by 15-40%
- Few-Shot Example Management — Dynamic example selection based on semantic similarity to the input query
- Prompt Versioning — Git-like version control for prompts with diff, rollback, and deployment tagging
- A/B Testing Framework — Split traffic between prompt variants, measure performance metrics, and declare winners with statistical significance
- Prompt Optimization — Automated prompt refinement using DSPy-style optimization loops
- Output Format Enforcement — JSON schema constraints, regex validation, and structured output parsing built into templates
## Quick Start
```python
from prompt_kit import PromptTemplate, PromptRegistry, FewShotSelector

# 1. Define a prompt template with variables
template = PromptTemplate(
    name="customer_classifier",
    version="1.2",
    system="""You are a customer intent classifier. Classify the customer message
into exactly one of these categories: {categories}.
Respond with JSON: {{"intent": "<category>", "confidence": <0.0-1.0>}}""",
    user="{customer_message}",
    variables={
        "categories": ["billing", "technical", "sales", "cancellation", "general"],
    },
    output_schema={
        "type": "object",
        "properties": {
            "intent": {"type": "string"},
            "confidence": {"type": "number"},
        },
        "required": ["intent", "confidence"],
    },
)

# 2. Render and use
prompt = template.render(customer_message="I can't log into my account")
print(prompt.system)  # Fully interpolated system prompt
print(prompt.user)    # "I can't log into my account"

# 3. Parse and validate output
result = template.parse_output('{"intent": "technical", "confidence": 0.92}')
print(result.intent)      # "technical"
print(result.confidence)  # 0.92
```
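For intuition, the schema check behind `parse_output` can be sketched in a few lines of standard-library Python. This is an illustrative subset of JSON Schema validation (types and required keys only), not the kit's actual implementation:

```python
import json

def parse_and_validate(raw: str, schema: dict) -> dict:
    """Parse model output and check it against a minimal JSON Schema subset.
    Illustrative sketch only -- a real validator covers far more keywords."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    type_map = {"object": dict, "string": str, "number": (int, float)}
    if not isinstance(data, type_map[schema["type"]]):
        raise TypeError(f"expected {schema['type']}")
    for key in schema.get("required", []):
        if key not in data:
            raise KeyError(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], type_map[spec["type"]]):
            raise TypeError(f"{key}: expected {spec['type']}")
    return data

schema = {
    "type": "object",
    "properties": {"intent": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["intent", "confidence"],
}
result = parse_and_validate('{"intent": "technical", "confidence": 0.92}', schema)
```

A strict parse like this is what lets a pipeline fail fast (or trigger a retry) instead of silently passing malformed model output downstream.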
## Architecture

```
┌────────────────────────────────────────────┐
│              Prompt Registry               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │Template A│  │Template B│  │Template C│  │
│  │   v1.0   │  │   v2.1   │  │   v1.3   │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │        │
│  ┌────▼─────────────▼─────────────▼──────┐ │
│  │            Version Control            │ │
│  │   Diff │ Rollback │ Tags │ History    │ │
│  └────────────────┬──────────────────────┘ │
│                   │                        │
│  ┌────────────────▼──────────────────────┐ │
│  │            A/B Test Engine            │ │
│  │    Split │ Measure │ Significance     │ │
│  └───────────────────────────────────────┘ │
└────────────────────────────────────────────┘
                      │
                      ▼
  ┌────────────────┐      ┌────────────────┐
  │   Few-Shot     │      │    Output      │
  │   Selector     │      │    Parser      │
  │ (semantic sim) │      │ (JSON schema)  │
  └────────────────┘      └────────────────┘
```
## Usage Examples

### Chain-of-Thought Prompting
```python
from prompt_kit.patterns import ChainOfThought, SelfConsistency

# Standard CoT
cot = ChainOfThought(
    task="Solve this math problem step by step.",
    examples=[
        {"input": "What is 15% of 240?", "reasoning": "15% = 0.15. 0.15 × 240 = 36.", "answer": "36"},
    ],
)
prompt = cot.render(input="A store has a 25% off sale. The original price is $80. What is the sale price?")

# Self-consistency: run CoT N times, take the majority vote
sc = SelfConsistency(
    chain=cot,
    num_samples=5,
    aggregation="majority_vote",  # majority_vote | weighted
)
result = sc.run(input="If a train travels 60 mph for 2.5 hours, how far does it go?")
print(result.answer)       # "150 miles"
print(result.consistency)  # 1.0 (all 5 samples agreed)
```
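The aggregation step is simple to sketch: collect the N sampled answers, count them, and report the winning answer along with its vote share as the consistency score. A stdlib sketch of the idea, not the kit's internals:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Pick the most common answer; consistency is its share of all samples.
    Assumes exact-match answers (real pipelines often normalize first)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Four of five samples agree -> consistency 0.8
answer, consistency = majority_vote(["150 miles"] * 4 + ["125 miles"])
```

A low consistency score is a useful signal on its own: it flags inputs where the model's reasoning is unstable, even when the majority answer happens to be right.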
### Dynamic Few-Shot Selection
```python
from prompt_kit import FewShotSelector

selector = FewShotSelector(
    examples_path="examples/classification_examples.jsonl",
    embedding_model="text-embedding-3-small",
    num_examples=3,  # Select the 3 most similar examples
)

# Automatically picks the most relevant examples for this input
examples = selector.select("My payment was declined but the charge still shows up")
# Returns the 3 examples semantically closest to billing/payment issues
```
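Under the hood, "semantic similarity" selection reduces to ranking stored example embeddings by cosine similarity to the query embedding. A dependency-free sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_top_k(query_vec: list[float], examples: list[dict], k: int = 3) -> list[dict]:
    """Rank stored examples by similarity to the query embedding, keep the top k."""
    ranked = sorted(examples, key=lambda ex: cosine(query_vec, ex["embedding"]), reverse=True)
    return ranked[:k]

examples = [
    {"text": "card was charged twice", "embedding": [0.9, 0.1, 0.0]},
    {"text": "app crashes on launch",  "embedding": [0.0, 0.2, 0.9]},
    {"text": "refund not received",    "embedding": [0.8, 0.3, 0.1]},
]
# A billing-flavored query vector pulls in the two payment examples
top = select_top_k([1.0, 0.2, 0.0], examples, k=2)
```

At pool sizes in the hundreds, a brute-force scan like this is usually fast enough; vector indexes only matter once the pool grows much larger.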
### Prompt Versioning
```python
from prompt_kit import PromptRegistry

registry = PromptRegistry(storage="prompts/")

# Save a version
registry.save(template, tag="production")

# List versions
versions = registry.history("customer_classifier")
for v in versions:
    print(f"v{v.version} | {v.timestamp} | {v.tag or 'untagged'}")

# Roll back
registry.rollback("customer_classifier", to_version="1.1")

# Diff two versions
diff = registry.diff("customer_classifier", "1.1", "1.2")
print(diff)  # Shows exact prompt text changes
```
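A registry diff like this can be produced with the standard library's `difflib`; whether the kit does exactly this internally is an assumption, but the sketch shows the idea:

```python
import difflib

def prompt_diff(old: str, new: str) -> str:
    """Unified diff between two prompt versions -- one plausible way
    a registry diff could be implemented (internals may differ)."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="v1.1", tofile="v1.2", lineterm="",
    ))

print(prompt_diff(
    "Classify the message into: billing, technical, sales.",
    "Classify the message into: billing, technical, sales, cancellation.",
))
```

Because prompts are plain text, they inherit the whole text-diff toolchain for free: line-level diffs, blame-style history, and code review on prompt changes.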
### A/B Testing Prompts
```python
from prompt_kit.ab_test import ABTest

test = ABTest(
    name="classifier_v1_vs_v2",
    variants={
        "control": registry.get("customer_classifier", version="1.1"),
        "treatment": registry.get("customer_classifier", version="1.2"),
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    metric="accuracy",
    min_samples=500,
    significance_level=0.05,
)

# Route each request through the test
variant, prompt = test.assign(user_id="user_456")
# ... use prompt, collect result ...
test.record(user_id="user_456", metric_value=1.0)  # Correct classification

# Check results
report = test.report()
print(f"Control:   {report.control.mean:.3f} ± {report.control.ci:.3f}")
print(f"Treatment: {report.treatment.mean:.3f} ± {report.treatment.ci:.3f}")
print(f"Winner: {report.winner} (p={report.p_value:.4f})")
```
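When the metric is a success rate like accuracy, the significance check behind a report like this is typically a two-proportion z-test (whether the kit uses exactly this test is an assumption). A stdlib-only sketch:

```python
from statistics import NormalDist

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test: p-value for the difference
    between two variants' success rates (normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)     # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided

# 82% vs 89% accuracy over 500 samples each
p = two_proportion_p_value(410, 500, 445, 500)
```

With 500 samples per variant, a 7-point accuracy gap comfortably clears p < 0.05; much smaller gaps would need considerably more traffic.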
## Configuration
```yaml
# prompt_kit_config.yaml
registry:
  storage: "prompts/"          # Local directory for prompt files
  format: "yaml"               # yaml | json
  auto_version: true           # Auto-increment version on save

few_shot:
  examples_dir: "examples/"
  embedding_model: "text-embedding-3-small"
  num_examples: 3
  similarity_metric: "cosine"  # cosine | euclidean

ab_testing:
  storage: "sqlite:///ab_tests.db"
  default_traffic_split: 0.5
  min_samples_per_variant: 500
  significance_level: 0.05
  auto_promote_winner: false   # Require manual promotion

output_parsing:
  strict_mode: true            # Fail on schema violation
  max_retries: 3               # Retry with a corrective prompt
  retry_model: "gpt-4o-mini"   # Cheap model for format fixing

templates_dir: "templates/"    # Pre-built template library
```
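The `strict_mode`/`max_retries` settings describe a parse-validate-retry loop. A minimal sketch, where `call_model` is a hypothetical callable (prompt in, raw text out) standing in for a real LLM call, not part of any real API:

```python
import json

def parse_with_retries(call_model, prompt: str, max_retries: int = 3) -> dict:
    """On invalid JSON, re-ask the model with a corrective instruction.
    `call_model` is a hypothetical stand-in for an LLM request."""
    raw = call_model(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(raw)  # JSONDecodeError is a ValueError
        except ValueError:
            raw = call_model(
                f"Your previous reply was not valid JSON:\n{raw}\n"
                "Respond again with ONLY valid JSON."
            )
    raise ValueError("output never matched the expected format")

# Stub model: returns truncated JSON once, then a valid reply
replies = iter(['{"intent": "billing",',
                '{"intent": "billing", "confidence": 0.9}'])
result = parse_with_retries(lambda p: next(replies), "classify: ...")
```

Routing the corrective retry to a cheaper model (as `retry_model` suggests) keeps format-fixing from multiplying the cost of every request.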
## Best Practices
- Version every prompt change — A single word change in a system prompt can shift behavior dramatically. Track every edit.
- Use few-shot examples over long instructions — Models learn patterns from examples better than from verbose rules.
- Test with edge cases — Your prompt works on happy paths. Test it with empty inputs, adversarial inputs, and multi-language content.
- Measure before declaring winners — Run A/B tests to statistical significance (p < 0.05). Gut feeling is not a metric.
- Separate concerns in prompts — Keep persona, task instructions, format requirements, and examples in distinct sections.
- Pin model versions — A prompt optimized for `gpt-4o-2024-08-06` may perform differently on the next model release.
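Tied to "Measure before declaring winners" and the `min_samples` setting above: it helps to size an A/B test before running it. The normal-approximation formula below is standard, but the helper itself is illustrative, not part of the kit:

```python
import math
from statistics import NormalDist

def samples_per_variant(baseline: float, min_detectable_delta: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-variant sample size for a two-proportion A/B test,
    via the normal approximation (a back-of-envelope sketch, not a
    replacement for a proper power-analysis library)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p = baseline + min_detectable_delta / 2        # midpoint variance estimate
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p)
                     / min_detectable_delta ** 2)

# Detecting a 5-point lift over an 85% baseline at alpha=0.05, power=0.8
n = samples_per_variant(baseline=0.85, min_detectable_delta=0.05)
```

If the required `n` is far beyond the traffic you have, the honest options are to test a bigger prompt change or accept that the variants are practically equivalent.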
## Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Output doesn't match JSON schema | Model ignores schema constraints | Add "You MUST respond with valid JSON" to the system prompt and enable `max_retries` |
| Few-shot selector returns irrelevant examples | Embedding model mismatch or too few examples | Use `text-embedding-3-small` and add 50+ diverse examples to the pool |
| A/B test never reaches significance | Effect size too small or too few samples | Increase `min_samples` or accept that the variants are equivalent |
| Prompt version rollback doesn't change behavior | Application caching the old prompt | Clear the application-level prompt cache after rollback |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Prompt Engineering Kit] with all files, templates, and documentation for $39.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.