丁久

Posted on • Originally published at dingjiu1989-hue.github.io

Prompt Management: Versioning, Testing, Collaboration, Deployment

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Introduction

Prompts are the primary interface for controlling LLM behavior, yet most teams manage them as copy-pasted text files or hardcoded strings in source code. As AI applications grow, prompts need the same rigor as application code: versioning, testing, review, staging, and deployment pipelines. This article covers the tools and workflows for professional prompt management.

Prompt as Code

Store prompts in a structured, version-controlled format:

```yaml
# prompts/summarization.yaml
name: document_summarizer
version: 2.3.0
model: claude-sonnet-4-20250514
parameters:
  temperature: 0.3
  max_tokens: 1024
system_prompt: |
  You are a technical document summarizer. Follow these rules:
  1. Extract the core thesis and key supporting points
  2. Preserve technical accuracy - do not simplify concepts
  3. Maintain the original document's structure
  4. Output in the requested format
  5. Never add information not present in the source
user_template: |
  Document: {document_text}
  Format: {output_format}
  Max length: {max_length} words
  Summary:
tests:
  - input:
      document_text: "Kubernetes is a container orchestration platform..."
      output_format: bullet_points
      max_length: 100
    expected_output_contains: ["container orchestration", "pods"]
    min_length: 50
    max_length: 150
```
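
At runtime, a thin loader turns this file into an API call. Here is a minimal sketch, assuming the Anthropic Python SDK; the `load_prompt` and `run_prompt` helpers are illustrative names, not part of any library:

```python
import yaml
import anthropic  # assumed provider SDK; swap in your own client

def load_prompt(path: str) -> dict:
    """Load a versioned prompt definition from YAML."""
    with open(path) as f:
        return yaml.safe_load(f)

def run_prompt(client, prompt: dict, **inputs) -> str:
    """Render the user template and call the configured model."""
    params = prompt.get("parameters", {})
    response = client.messages.create(
        model=prompt["model"],
        system=prompt["system_prompt"],
        max_tokens=params.get("max_tokens", 1024),
        temperature=params.get("temperature", 0.0),
        messages=[{"role": "user",
                   "content": prompt["user_template"].format(**inputs)}],
    )
    return response.content[0].text

# Usage:
# prompt = load_prompt("prompts/summarization.yaml")
# summary = run_prompt(anthropic.Anthropic(), prompt,
#                      document_text=doc, output_format="bullet_points",
#                      max_length=100)
```

Because the model and sampling parameters live in the file rather than in code, changing them is a reviewable diff instead of a silent code edit.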

Prompt Registry

A central registry stores all prompt versions with metadata:

```python
import difflib
import hashlib
import yaml
from datetime import datetime

class PromptRegistry:
    def __init__(self, storage_backend):
        self.storage = storage_backend

    def register_prompt(self, name: str, prompt_data: dict) -> str:
        version = prompt_data.get("version", "1.0.0")
        prompt_hash = hashlib.sha256(yaml.dump(prompt_data).encode()).hexdigest()[:12]
        entry = {
            "name": name,
            "version": version,
            "hash": prompt_hash,
            "prompt": prompt_data,
            "created_at": datetime.now().isoformat(),
            "status": "draft",
        }
        self.storage.save(f"prompts/{name}/{version}", entry)
        return prompt_hash

    def get_prompt(self, name: str, version: str = "latest") -> dict:
        if version == "latest":
            versions = self.storage.list(f"prompts/{name}")
            # Compare version parts numerically so 1.10.0 ranks above 1.9.0
            version = max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))
        return self.storage.load(f"prompts/{name}/{version}")

    def promote_to_production(self, name: str, version: str):
        entry = self.storage.load(f"prompts/{name}/{version}")
        entry["status"] = "production"
        entry["promoted_at"] = datetime.now().isoformat()
        self.storage.save(f"prompts/{name}/{version}", entry)

    def diff(self, name: str, version_a: str, version_b: str) -> str:
        prompt_a = self.get_prompt(name, version_a)["prompt"]
        prompt_b = self.get_prompt(name, version_b)["prompt"]
        return self._compute_diff(prompt_a, prompt_b)

    def _compute_diff(self, prompt_a: dict, prompt_b: dict) -> str:
        # Line-level diff over the YAML serializations of the two versions
        a_lines = yaml.dump(prompt_a).splitlines(keepends=True)
        b_lines = yaml.dump(prompt_b).splitlines(keepends=True)
        return "".join(difflib.unified_diff(a_lines, b_lines))
```
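
Any object with `save`, `list`, and `load` methods works as the storage backend. For local experiments, a throwaway file-based backend might look like this (an illustrative sketch, not a production store):

```python
import json
import yaml
from pathlib import Path

class FileStorage:
    """Minimal file-backed store matching the save/list/load interface above."""
    def __init__(self, root: str = ".prompt_registry"):
        self.root = Path(root)

    def save(self, key: str, entry: dict):
        path = self.root / f"{key}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(entry, indent=2))

    def list(self, prefix: str) -> list[str]:
        # Version strings are the file stems under prompts/<name>/
        return [p.stem for p in (self.root / prefix).glob("*.json")]

    def load(self, key: str) -> dict:
        return json.loads((self.root / f"{key}.json").read_text())

registry = PromptRegistry(FileStorage())
with open("prompts/summarization.yaml") as f:
    registry.register_prompt("document_summarizer", yaml.safe_load(f))
latest = registry.get_prompt("document_summarizer")
```

In production you would back this with a database or object store, but the interface stays the same.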

Automated Prompt Testing

Test prompts against a suite of evaluation cases:

```python
class PromptTester:
    def __init__(self, llm_fn):
        self.llm = llm_fn

    def run_tests(self, prompt_entry: dict) -> dict:
        prompt_data = prompt_entry["prompt"]
        tests = prompt_data.get("tests", [])
        results = {"passed": 0, "failed": 0, "details": []}
        for test in tests:
            try:
                result = self._run_single_test(prompt_data, test)
                results["details"].append(result)
                if result["passed"]:
                    results["passed"] += 1
                else:
                    results["failed"] += 1
            except Exception as e:
                results["failed"] += 1
                results["details"].append({
                    "test": test,
                    "passed": False,
                    "error": str(e),
                })
        results["pass_rate"] = results["passed"] / len(tests) if tests else 1.0
        return results

    def _run_single_test(self, prompt_data: dict, test: dict) -> dict:
        # Build the prompt
        system = prompt_data.get("system_prompt", "")
        template = prompt_data.get("user_template", "")
        inputs = test.get("input", {})
        full_prompt = template.format(**inputs) if inputs else template
        # Run the model
        response = self.llm(system, full_prompt, prompt_data.get("parameters", {}))
        # Check assertions from the test definition
        failures = []
        if "expected_output_contains" in test:
            for expected in test["expected_output_contains"]:
                if expected.lower() not in response.lower():
                    failures.append(f"missing expected substring: {expected!r}")
        word_count = len(response.split())
        if "min_length" in test and word_count < test["min_length"]:
            failures.append(f"output too short: {word_count} words")
        if "max_length" in test and word_count > test["max_length"]:
            failures.append(f"output too long: {word_count} words")
        return {"test": test, "passed": not failures, "failures": failures}
```
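
Wired into CI, this tester becomes a deployment gate: a prompt change only merges when every test case passes. A hedged sketch of such a gate script (the `call_model` adapter and the hardcoded model ID are assumptions for illustration, not from the original):

```python
import sys
import yaml
import anthropic  # assumed provider SDK, as in the loader sketch above

client = anthropic.Anthropic()

def call_model(system: str, user: str, params: dict) -> str:
    """Adapter matching the (system, prompt, parameters) signature PromptTester expects."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # in practice, read this from the prompt file
        system=system,
        max_tokens=params.get("max_tokens", 1024),
        temperature=params.get("temperature", 0.0),
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        prompt_data = yaml.safe_load(f)
    results = PromptTester(call_model).run_tests({"prompt": prompt_data})
    print(f"pass rate: {results['pass_rate']:.0%}")
    # Gate: block the merge unless every test case passes
    sys.exit(0 if results["pass_rate"] == 1.0 else 1)
```

Run this as the test step of a pull-request pipeline, and call `promote_to_production` only after the gate passes.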

Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.

Found this useful? Check out more developer guides and tool comparisons on AI Study Room.
