# LLM Prompt Engineering Kit
Prompts are the new code — and they deserve the same rigor. This kit provides a structured prompt template library, proven chain-of-thought patterns, a version control system for prompts, and an A/B testing framework to measure which prompts actually perform better. Stop guessing and start engineering your prompts systematically.
## Key Features
- Prompt Template Library — 50+ battle-tested templates for classification, extraction, summarization, code generation, and creative writing with variable interpolation
- Chain-of-Thought Patterns — Ready-to-use CoT, Tree-of-Thought, and self-consistency templates that can improve reasoning accuracy by 15-40%
- Few-Shot Example Management — Dynamic example selection based on semantic similarity to the input query
- Prompt Versioning — Git-like version control for prompts with diff, rollback, and deployment tagging
- A/B Testing Framework — Split traffic between prompt variants, measure performance metrics, and declare winners with statistical significance
- Prompt Optimization — Automated prompt refinement using DSPy-style optimization loops
- Output Format Enforcement — JSON schema constraints, regex validation, and structured output parsing built into templates
## Quick Start
```python
from prompt_kit import PromptTemplate, PromptRegistry, FewShotSelector

# 1. Define a prompt template with variables
template = PromptTemplate(
    name="customer_classifier",
    version="1.2",
    system="""You are a customer intent classifier. Classify the customer message
into exactly one of these categories: {categories}.
Respond with JSON: {{"intent": "<category>", "confidence": <0.0-1.0>}}""",
    user="{customer_message}",
    variables={
        "categories": ["billing", "technical", "sales", "cancellation", "general"],
    },
    output_schema={
        "type": "object",
        "properties": {
            "intent": {"type": "string"},
            "confidence": {"type": "number"},
        },
        "required": ["intent", "confidence"],
    },
)

# 2. Render and use
prompt = template.render(customer_message="I can't log into my account")
print(prompt.system)  # Fully interpolated system prompt
print(prompt.user)    # "I can't log into my account"

# 3. Parse and validate output
result = template.parse_output('{"intent": "technical", "confidence": 0.92}')
print(result.intent)      # "technical"
print(result.confidence)  # 0.92
```
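For intuition, the schema check behind `parse_output` can be sketched in a few lines of standard-library Python. This is an illustrative subset of JSON Schema validation (types and required keys only), not the kit's actual implementation:

```python
import json

def parse_and_validate(raw: str, schema: dict) -> dict:
    """Parse model output and check it against a minimal JSON Schema subset.
    Illustrative sketch only -- a real validator covers far more keywords."""
    data = json.loads(raw)  # raises ValueError on malformed JSON
    type_map = {"object": dict, "string": str, "number": (int, float)}
    if not isinstance(data, type_map[schema["type"]]):
        raise TypeError(f"expected {schema['type']}")
    for key in schema.get("required", []):
        if key not in data:
            raise KeyError(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key in data and not isinstance(data[key], type_map[spec["type"]]):
            raise TypeError(f"{key}: expected {spec['type']}")
    return data

schema = {
    "type": "object",
    "properties": {"intent": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["intent", "confidence"],
}
result = parse_and_validate('{"intent": "technical", "confidence": 0.92}', schema)
```

A strict parse like this is what lets a pipeline fail fast (or trigger a retry) instead of silently passing malformed model output downstream.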
## Architecture

```
┌────────────────────────────────────────────┐
│              Prompt Registry               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │Template A│  │Template B│  │Template C│  │
│  │   v1.0   │  │   v2.1   │  │   v1.3   │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │        │
│  ┌────▼─────────────▼─────────────▼──────┐ │
│  │            Version Control            │ │
│  │   Diff │ Rollback │ Tags │ History    │ │
│  └────────────────┬──────────────────────┘ │
│                   │                        │
│  ┌────────────────▼──────────────────────┐ │
│  │            A/B Test Engine            │ │
│  │    Split │ Measure │ Significance     │ │
│  └───────────────────────────────────────┘ │
└────────────────────────────────────────────┘
                      │
                      ▼
  ┌────────────────┐      ┌────────────────┐
  │   Few-Shot     │      │    Output      │
  │   Selector     │      │    Parser      │
  │ (semantic sim) │      │ (JSON schema)  │
  └────────────────┘      └────────────────┘
```
## Usage Examples

### Chain-of-Thought Prompting
```python
from prompt_kit.patterns import ChainOfThought, SelfConsistency

# Standard CoT
cot = ChainOfThought(
    task="Solve this math problem step by step.",
    examples=[
        {"input": "What is 15% of 240?", "reasoning": "15% = 0.15. 0.15 × 240 = 36.", "answer": "36"},
    ],
)
prompt = cot.render(input="A store has a 25% off sale. The original price is $80. What is the sale price?")

# Self-consistency: run CoT N times, take the majority vote
sc = SelfConsistency(
    chain=cot,
    num_samples=5,
    aggregation="majority_vote",  # majority_vote | weighted
)
result = sc.run(input="If a train travels 60 mph for 2.5 hours, how far does it go?")
print(result.answer)       # "150 miles"
print(result.consistency)  # 1.0 (all 5 samples agreed)
```
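The aggregation step is simple to sketch: collect the N sampled answers, count them, and report the winning answer along with its vote share as the consistency score. A stdlib sketch of the idea, not the kit's internals:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Pick the most common answer; consistency is its share of all samples.
    Assumes exact-match answers (real pipelines often normalize first)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Four of five samples agree -> consistency 0.8
answer, consistency = majority_vote(["150 miles"] * 4 + ["125 miles"])
```

A low consistency score is a useful signal on its own: it flags inputs where the model's reasoning is unstable, even when the majority answer happens to be right.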
### Dynamic Few-Shot Selection
```python
from prompt_kit import FewShotSelector

selector = FewShotSelector(
    examples_path="examples/classification_examples.jsonl",
    embedding_model="text-embedding-3-small",
    num_examples=3,  # Select the 3 most similar examples
)

# Automatically picks the most relevant examples for this input
examples = selector.select("My payment was declined but the charge still shows up")
# Returns the 3 examples semantically closest to billing/payment issues
```
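Under the hood, "semantic similarity" selection reduces to ranking stored example embeddings by cosine similarity to the query embedding. A dependency-free sketch with toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_top_k(query_vec: list[float], examples: list[dict], k: int = 3) -> list[dict]:
    """Rank stored examples by similarity to the query embedding, keep the top k."""
    ranked = sorted(examples, key=lambda ex: cosine(query_vec, ex["embedding"]), reverse=True)
    return ranked[:k]

examples = [
    {"text": "card was charged twice", "embedding": [0.9, 0.1, 0.0]},
    {"text": "app crashes on launch",  "embedding": [0.0, 0.2, 0.9]},
    {"text": "refund not received",    "embedding": [0.8, 0.3, 0.1]},
]
# A billing-flavored query vector pulls in the two payment examples
top = select_top_k([1.0, 0.2, 0.0], examples, k=2)
```

At pool sizes in the hundreds, a brute-force scan like this is usually fast enough; vector indexes only matter once the pool grows much larger.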
### Prompt Versioning
```python
from prompt_kit import PromptRegistry

registry = PromptRegistry(storage="prompts/")

# Save a version
registry.save(template, tag="production")

# List versions
versions = registry.history("customer_classifier")
for v in versions:
    print(f"v{v.version} | {v.timestamp} | {v.tag or 'untagged'}")

# Roll back
registry.rollback("customer_classifier", to_version="1.1")

# Diff two versions
diff = registry.diff("customer_classifier", "1.1", "1.2")
print(diff)  # Shows exact prompt text changes
```
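A registry diff like this can be produced with the standard library's `difflib`; whether the kit does exactly this internally is an assumption, but the sketch shows the idea:

```python
import difflib

def prompt_diff(old: str, new: str) -> str:
    """Unified diff between two prompt versions -- one plausible way
    a registry diff could be implemented (internals may differ)."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="v1.1", tofile="v1.2", lineterm="",
    ))

print(prompt_diff(
    "Classify the message into: billing, technical, sales.",
    "Classify the message into: billing, technical, sales, cancellation.",
))
```

Because prompts are plain text, they inherit the whole text-diff toolchain for free: line-level diffs, blame-style history, and code review on prompt changes.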
### A/B Testing Prompts
```python
from prompt_kit.ab_test import ABTest

test = ABTest(
    name="classifier_v1_vs_v2",
    variants={
        "control": registry.get("customer_classifier", version="1.1"),
        "treatment": registry.get("customer_classifier", version="1.2"),
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    metric="accuracy",
    min_samples=500,
    significance_level=0.05,
)

# Route each request through the test
variant, prompt = test.assign(user_id="user_456")
# ... use prompt, collect result ...
test.record(user_id="user_456", metric_value=1.0)  # Correct classification

# Check results
report = test.report()
print(f"Control:   {report.control.mean:.3f} ± {report.control.ci:.3f}")
print(f"Treatment: {report.treatment.mean:.3f} ± {report.treatment.ci:.3f}")
print(f"Winner: {report.winner} (p={report.p_value:.4f})")
```
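When the metric is a success rate like accuracy, the significance check behind a report like this is typically a two-proportion z-test (whether the kit uses exactly this test is an assumption). A stdlib-only sketch:

```python
from statistics import NormalDist

def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided two-proportion z-test: p-value for the difference
    between two variants' success rates (normal approximation)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)     # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided

# 82% vs 89% accuracy over 500 samples each
p = two_proportion_p_value(410, 500, 445, 500)
```

With 500 samples per variant, a 7-point accuracy gap comfortably clears p < 0.05; much smaller gaps would need considerably more traffic.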
## Configuration
```yaml
# prompt_kit_config.yaml
registry:
  storage: "prompts/"          # Local directory for prompt files
  format: "yaml"               # yaml | json
  auto_version: true           # Auto-increment version on save

few_shot:
  examples_dir: "examples/"
  embedding_model: "text-embedding-3-small"
  num_examples: 3
  similarity_metric: "cosine"  # cosine | euclidean

ab_testing:
  storage: "sqlite:///ab_tests.db"
  default_traffic_split: 0.5
  min_samples_per_variant: 500
  significance_level: 0.05
  auto_promote_winner: false   # Require manual promotion

output_parsing:
  strict_mode: true            # Fail on schema violation
  max_retries: 3               # Retry with a corrective prompt
  retry_model: "gpt-4o-mini"   # Cheap model for format fixing

templates_dir: "templates/"    # Pre-built template library
```
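The `strict_mode`/`max_retries` settings describe a parse-validate-retry loop. A minimal sketch, where `call_model` is a hypothetical callable (prompt in, raw text out) standing in for a real LLM call, not part of any real API:

```python
import json

def parse_with_retries(call_model, prompt: str, max_retries: int = 3) -> dict:
    """On invalid JSON, re-ask the model with a corrective instruction.
    `call_model` is a hypothetical stand-in for an LLM request."""
    raw = call_model(prompt)
    for _ in range(max_retries):
        try:
            return json.loads(raw)  # JSONDecodeError is a ValueError
        except ValueError:
            raw = call_model(
                f"Your previous reply was not valid JSON:\n{raw}\n"
                "Respond again with ONLY valid JSON."
            )
    raise ValueError("output never matched the expected format")

# Stub model: returns truncated JSON once, then a valid reply
replies = iter(['{"intent": "billing",',
                '{"intent": "billing", "confidence": 0.9}'])
result = parse_with_retries(lambda p: next(replies), "classify: ...")
```

Routing the corrective retry to a cheaper model (as `retry_model` suggests) keeps format-fixing from multiplying the cost of every request.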
## Best Practices
- Version every prompt change — A single word change in a system prompt can shift behavior dramatically. Track every edit.
- Use few-shot examples over long instructions — Models learn patterns from examples better than from verbose rules.
- Test with edge cases — Your prompt works on happy paths. Test it with empty inputs, adversarial inputs, and multi-language content.
- Measure before declaring winners — Run A/B tests to statistical significance (p < 0.05). Gut feeling is not a metric.
- Separate concerns in prompts — Keep persona, task instructions, format requirements, and examples in distinct sections.
- Pin model versions — A prompt optimized for `gpt-4o-2024-08-06` may perform differently on the next model release.
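Tied to "Measure before declaring winners" and the `min_samples` setting above: it helps to size an A/B test before running it. The normal-approximation formula below is standard, but the helper itself is illustrative, not part of the kit:

```python
import math
from statistics import NormalDist

def samples_per_variant(baseline: float, min_detectable_delta: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough per-variant sample size for a two-proportion A/B test,
    via the normal approximation (a back-of-envelope sketch, not a
    replacement for a proper power-analysis library)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p = baseline + min_detectable_delta / 2        # midpoint variance estimate
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p)
                     / min_detectable_delta ** 2)

# Detecting a 5-point lift over an 85% baseline at alpha=0.05, power=0.8
n = samples_per_variant(baseline=0.85, min_detectable_delta=0.05)
```

If the required `n` is far beyond the traffic you have, the honest options are to test a bigger prompt change or accept that the variants are practically equivalent.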
## Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Output doesn't match JSON schema | Model ignores schema constraints | Add "You MUST respond with valid JSON" to the system prompt and enable `max_retries` |
| Few-shot selector returns irrelevant examples | Embedding model mismatch or too few examples | Use `text-embedding-3-small` and add 50+ diverse examples to the pool |
| A/B test never reaches significance | Effect size too small or too few samples | Increase `min_samples` or accept that the variants are equivalent |
| Prompt version rollback doesn't change behavior | Application caching the old prompt | Clear the application-level prompt cache after rollback |
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Prompt Engineering Kit] with all files, templates, and documentation for $39.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.