
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

LLM Prompt Engineering Kit


Prompts are the new code — and they deserve the same rigor. This kit provides a structured prompt template library, proven chain-of-thought patterns, a version control system for prompts, and an A/B testing framework to measure which prompts actually perform better. Stop guessing and start engineering your prompts systematically.

Key Features

  • Prompt Template Library — 50+ battle-tested templates for classification, extraction, summarization, code generation, and creative writing with variable interpolation
  • Chain-of-Thought Patterns — Ready-to-use CoT, Tree-of-Thought, and self-consistency templates that improve reasoning accuracy by 15-40%
  • Few-Shot Example Management — Dynamic example selection based on semantic similarity to the input query
  • Prompt Versioning — Git-like version control for prompts with diff, rollback, and deployment tagging
  • A/B Testing Framework — Split traffic between prompt variants, measure performance metrics, and declare winners with statistical significance
  • Prompt Optimization — Automated prompt refinement using DSPy-style optimization loops
  • Output Format Enforcement — JSON schema constraints, regex validation, and structured output parsing built into templates

Quick Start

from prompt_kit import PromptTemplate, PromptRegistry, FewShotSelector

# 1. Define a prompt template with variables
template = PromptTemplate(
    name="customer_classifier",
    version="1.2",
    system="""You are a customer intent classifier. Classify the customer message
into exactly one of these categories: {categories}.
Respond with JSON: {{"intent": "<category>", "confidence": <0.0-1.0>}}""",
    user="{customer_message}",
    variables={
        "categories": ["billing", "technical", "sales", "cancellation", "general"],
    },
    output_schema={
        "type": "object",
        "properties": {
            "intent": {"type": "string"},
            "confidence": {"type": "number"},
        },
        "required": ["intent", "confidence"],
    },
)

# 2. Render and use
prompt = template.render(customer_message="I can't log into my account")
print(prompt.system)  # Fully interpolated system prompt
print(prompt.user)    # "I can't log into my account"

# 3. Parse and validate output
result = template.parse_output('{"intent": "technical", "confidence": 0.92}')
print(result.intent)      # "technical"
print(result.confidence)  # 0.92
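
For a sense of what `parse_output`-style validation involves, here is a stdlib-only sketch, assuming a hand-rolled schema check (the kit's actual parser, with retries and regex validation, is richer; all names below are illustrative):

```python
import json

# Illustrative schema: required keys mapped to expected Python types.
SCHEMA = {
    "required": ["intent", "confidence"],
    "types": {"intent": str, "confidence": (int, float)},
}

def parse_output(raw: str, schema=SCHEMA) -> dict:
    """Parse a JSON reply and enforce required keys and their types."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for key in schema["required"]:
        if key not in data:
            raise ValueError(f"missing required key: {key}")
        if not isinstance(data[key], schema["types"][key]):
            raise ValueError(f"wrong type for key: {key}")
    return data

result = parse_output('{"intent": "technical", "confidence": 0.92}')
print(result["intent"])  # technical
```

A real implementation would also handle the retry loop (re-prompting the model with the validation error), which the config section below controls via max_retries.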

Architecture

┌────────────────────────────────────────────┐
│            Prompt Registry                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐   │
│  │Template A│ │Template B│ │Template C│   │
│  │ v1.0     │ │ v2.1     │ │ v1.3     │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘   │
│       │            │            │          │
│  ┌────▼────────────▼────────────▼──────┐   │
│  │         Version Control             │   │
│  │  Diff │ Rollback │ Tags │ History   │   │
│  └────────────────┬────────────────────┘   │
│                   │                        │
│  ┌────────────────▼────────────────────┐   │
│  │         A/B Test Engine             │   │
│  │  Split │ Measure │ Significance     │   │
│  └─────────────────────────────────────┘   │
└────────────────────────────────────────────┘
        │
        ▼
┌────────────────┐    ┌────────────────┐
│ Few-Shot       │    │ Output         │
│ Selector       │    │ Parser         │
│ (semantic sim) │    │ (JSON schema)  │
└────────────────┘    └────────────────┘

Usage Examples

Chain-of-Thought Prompting

from prompt_kit.patterns import ChainOfThought, SelfConsistency

# Standard CoT
cot = ChainOfThought(
    task="Solve this math problem step by step.",
    examples=[
        {"input": "What is 15% of 240?", "reasoning": "15% = 0.15. 0.15 × 240 = 36.", "answer": "36"},
    ],
)
prompt = cot.render(input="A store has a 25% off sale. The original price is $80. What is the sale price?")

# Self-consistency: Run CoT N times, take majority vote
sc = SelfConsistency(
    chain=cot,
    num_samples=5,
    aggregation="majority_vote",  # majority_vote | weighted
)
result = sc.run(input="If a train travels 60mph for 2.5 hours, how far does it go?")
print(result.answer)      # "150 miles"
print(result.consistency)  # 1.0 (all 5 samples agreed)
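
Self-consistency is essentially sampling plus aggregation: run the same chain-of-thought prompt N times and keep the answer most samples agree on. The majority-vote step can be sketched in a few lines of stdlib Python (a simplification; the kit's weighted mode would presumably score samples differently):

```python
from collections import Counter

def majority_vote(samples: list[str]) -> tuple[str, float]:
    """Pick the most common answer and report the agreement ratio."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Five CoT runs on the train problem, one of which went off the rails:
answer, consistency = majority_vote(["150 miles"] * 4 + ["125 miles"])
print(answer, consistency)  # 150 miles 0.8
```

The consistency ratio doubles as a cheap confidence signal: low agreement across samples usually means the question deserves a closer look.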

Dynamic Few-Shot Selection

from prompt_kit import FewShotSelector

selector = FewShotSelector(
    examples_path="examples/classification_examples.jsonl",
    embedding_model="text-embedding-3-small",
    num_examples=3,   # Select 3 most similar examples
)

# Automatically picks the most relevant examples for this input
examples = selector.select("My payment was declined but the charge still shows up")
# Returns 3 examples semantically closest to billing/payment issues
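
The selection itself reduces to cosine similarity over embeddings. A minimal sketch with toy 2-D vectors (in practice the vectors would come from an embedding model such as text-embedding-3-small; the helper names here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_examples(query_vec, pool, k=3):
    """pool: list of (example, embedding) pairs; return the k most similar examples."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [example for example, _ in ranked[:k]]

pool = [
    ("refund request", [0.9, 0.1]),
    ("password reset", [0.1, 0.9]),
    ("charge dispute", [0.8, 0.2]),
]
print(select_examples([1.0, 0.0], pool, k=2))
# → ['refund request', 'charge dispute']
```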

Prompt Versioning

from prompt_kit import PromptRegistry

registry = PromptRegistry(storage="prompts/")

# Save a version
registry.save(template, tag="production")

# List versions
versions = registry.history("customer_classifier")
for v in versions:
    print(f"v{v.version} | {v.timestamp} | {v.tag or 'untagged'}")

# Roll back
registry.rollback("customer_classifier", to_version="1.1")

# Diff two versions
diff = registry.diff("customer_classifier", "1.1", "1.2")
print(diff)  # Shows exact prompt text changes
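
Conceptually, diff can be built directly on the standard library. A sketch using difflib (the kit's storage format and diff rendering may differ; the version labels are illustrative):

```python
import difflib

def prompt_diff(old: str, new: str) -> str:
    """Line-level diff of two prompt versions, git-style."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="v1.1", tofile="v1.2",
    ))

v1 = "Classify the message.\nRespond with JSON.\n"
v2 = "Classify the message into exactly one category.\nRespond with JSON.\n"
print(prompt_diff(v1, v2))
```

Even a one-line diff like this matters in practice: tightening "Classify the message" to "exactly one category" can noticeably change how often the model returns multiple labels.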

A/B Testing Prompts

from prompt_kit.ab_test import ABTest

test = ABTest(
    name="classifier_v1_vs_v2",
    variants={
        "control": registry.get("customer_classifier", version="1.1"),
        "treatment": registry.get("customer_classifier", version="1.2"),
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    metric="accuracy",
    min_samples=500,
    significance_level=0.05,
)

# Route each request through the test
variant, prompt = test.assign(user_id="user_456")
# ... use prompt, collect result ...
test.record(user_id="user_456", metric_value=1.0)  # Correct classification

# Check results
report = test.report()
print(f"Control: {report.control.mean:.3f} ± {report.control.ci:.3f}")
print(f"Treatment: {report.treatment.mean:.3f} ± {report.treatment.ci:.3f}")
print(f"Winner: {report.winner} (p={report.p_value:.4f})")
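
The significance check behind report() can be approximated with a standard two-proportion z-test when the metric is a success rate like accuracy. A stdlib-only sketch, assuming this is not the kit's actual statistics code:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in success rate between two variants."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Control: 410/500 correct classifications; treatment: 445/500
p = two_proportion_p_value(410, 500, 445, 500)
print(f"p = {p:.4f}")  # well below 0.05, so the treatment prompt wins
```

This also shows why min_samples matters: with small n, the standard error term dominates and even a real improvement will not reach significance.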

Configuration

# prompt_kit_config.yaml
registry:
  storage: "prompts/"            # Local directory for prompt files
  format: "yaml"                 # yaml | json
  auto_version: true             # Auto-increment version on save

few_shot:
  examples_dir: "examples/"
  embedding_model: "text-embedding-3-small"
  num_examples: 3
  similarity_metric: "cosine"    # cosine | euclidean

ab_testing:
  storage: "sqlite:///ab_tests.db"
  default_traffic_split: 0.5
  min_samples_per_variant: 500
  significance_level: 0.05
  auto_promote_winner: false     # Require manual promotion

output_parsing:
  strict_mode: true              # Fail on schema violation
  max_retries: 3                 # Retry with corrective prompt
  retry_model: "gpt-4o-mini"    # Cheap model for format fixing

templates_dir: "templates/"      # Pre-built template library

Best Practices

  1. Version every prompt change — A single word change in a system prompt can shift behavior dramatically. Track every edit.
  2. Use few-shot examples over long instructions — Models learn patterns from examples better than from verbose rules.
  3. Test with edge cases — Your prompt works on happy paths. Test it with empty inputs, adversarial inputs, and multi-language content.
  4. Measure before declaring winners — Run A/B tests to statistical significance (p < 0.05). Gut feeling is not a metric.
  5. Separate concerns in prompts — Keep persona, task instructions, format requirements, and examples in distinct sections.
  6. Pin model versions — A prompt optimized for gpt-4o-2024-08-06 may perform differently on the next model release.
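
Practice 5 can be made concrete with a tiny helper that assembles a system prompt from labeled sections, so each concern can be edited and versioned independently. A sketch (the section markers and helper name are made up, not part of the kit's API):

```python
def build_prompt(persona, task, format_rules, examples):
    """Assemble a system prompt from clearly separated sections."""
    sections = [
        f"# Persona\n{persona}",
        f"# Task\n{task}",
        f"# Output format\n{format_rules}",
        "# Examples\n" + "\n".join(examples),
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    persona="You are a customer intent classifier.",
    task="Classify each message into one category.",
    format_rules='Respond with JSON: {"intent": "...", "confidence": 0.0}',
    examples=['Input: "Cancel my plan" -> {"intent": "cancellation", "confidence": 0.97}'],
)
print(prompt)
```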

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Output doesn't match the JSON schema | Model ignores schema constraints | Add "You MUST respond with valid JSON" to the system prompt and enable max_retries |
| Few-shot selector returns irrelevant examples | Embedding-model mismatch or too few examples | Use text-embedding-3-small and add 50+ diverse examples to the pool |
| A/B test never reaches significance | Effect size too small or too few samples | Increase min_samples or accept that the variants are equivalent |
| Prompt version rollback doesn't change behavior | Application is caching the old prompt | Clear the application-level prompt cache after rollback |

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete LLM Prompt Engineering Kit with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

