LLM Application Framework
Building a demo with an LLM API takes an afternoon. Building a production LLM application that's reliable, cost-effective, and safe takes months — unless you have the right scaffolding. This framework provides prompt management with versioning and A/B testing, RAG pipelines with chunking strategies and retrieval evaluation, output guardrails that catch hallucinations and policy violations, evaluation harnesses for automated quality testing, and cost tracking that prevents budget surprises. Ship LLM features with the same engineering rigor you apply to traditional ML.
Key Features
- Prompt Management System — Version-controlled prompt templates with variable injection, A/B testing support, and rollback. No more prompts buried in application code.
- RAG Pipeline Builder — Modular retrieval-augmented generation with configurable chunking (fixed, semantic, recursive), embedding models, vector stores, and reranking.
- Output Guardrails — Content filters, hallucination detection via citation verification, PII redaction, and custom policy rules that run before responses reach users.
- Evaluation Harness — Automated test suites for relevance, faithfulness, toxicity, and task-specific metrics with human-eval calibration.
- Cost Tracker — Per-request token counting, cost attribution by feature/user/team, budget alerts, and optimization recommendations.
- Caching Layer — Semantic similarity cache that avoids redundant API calls, reducing costs by 30-50% on repetitive queries.
- Fallback Chain — Automatic model failover (e.g., GPT-4 → GPT-3.5 → local model) with latency and quality tradeoff controls.
- Structured Output — JSON schema enforcement, retry-with-correction loops, and Pydantic model validation for LLM outputs.
Quick Start
```bash
unzip llm-application-framework.zip && cd llm-application-framework
pip install -r requirements.txt

# Set up your API key
export LLM_API_KEY=YOUR_API_KEY_HERE

# Run the example RAG pipeline
python src/llm_framework/core.py --config config.example.yaml --mode rag
```
```yaml
# config.example.yaml
llm:
  provider: openai              # openai | anthropic | local
  model: gpt-4
  temperature: 0.1
  max_tokens: 2048
  fallback_model: gpt-3.5-turbo
  api_base: https://api.example.com/v1/

rag:
  chunking:
    strategy: recursive         # fixed | semantic | recursive
    chunk_size: 512
    chunk_overlap: 50
  embedding_model: text-embedding-3-small
  vector_store: chromadb        # chromadb | faiss | pinecone
  top_k: 5
  reranker: cross-encoder

guardrails:
  content_filter: true
  hallucination_check: true
  pii_redaction: true
  max_output_tokens: 1024
  blocked_topics: [medical_advice, legal_advice, financial_advice]

cost_tracking:
  enabled: true
  budget_limit_daily: 50.00
  alert_threshold: 0.8
  log_every_request: true

caching:
  enabled: true
  similarity_threshold: 0.95
  ttl_hours: 24
  max_cache_size_mb: 500
```
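A config like this is typically overlaid on built-in defaults, so users only specify what they change. A stdlib-only sketch of that merge step (the `merge_config` helper and `DEFAULTS` dict are hypothetical, not the framework's actual loader):

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user-supplied values on top of defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value  # scalar or new key: user value wins
    return merged

DEFAULTS = {
    "llm": {"provider": "openai", "model": "gpt-4", "temperature": 0.1},
    "caching": {"enabled": True, "similarity_threshold": 0.95},
}

# A partial user config only overrides what it mentions.
user_config = {"llm": {"temperature": 0.0}, "caching": {"enabled": False}}
config = merge_config(DEFAULTS, user_config)
print(config["llm"]["model"])        # untouched default: gpt-4
print(config["llm"]["temperature"])  # overridden: 0.0
```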
Architecture
```
┌────────────┐     ┌──────────────┐     ┌──────────────┐
│    User    │────>│    Prompt    │────>│    Cache     │──hit──> Response
│  Request   │     │   Manager    │     │    Layer     │
└────────────┘     └──────────────┘     └──────┬───────┘
                                          miss │
                   ┌──────────────┐     ┌──────▼───────┐
                   │  Guardrails  │<────│  LLM Chain   │
                   │    (Post)    │     │    + RAG     │
                   └──────┬───────┘     └──────────────┘
                          │
                   ┌──────▼───────┐     ┌──────────────┐
                   │     Cost     │────>│   Response   │
                   │   Tracker    │     │   to User    │
                   └──────────────┘     └──────────────┘
```
Usage Examples
Build a RAG Pipeline
```python
from llm_framework.core import RAGPipeline

rag = RAGPipeline.from_config("config.example.yaml")
rag.index_documents(documents_dir="./knowledge_base/", file_types=["pdf", "md", "txt"])

response = rag.query(
    question="What is the refund policy for enterprise customers?",
    top_k=5,
    include_sources=True,
)
print(response.answer)
print(f"Sources: {[s.title for s in response.sources]} | "
      f"Confidence: {response.confidence:.2f} | Cost: ${response.cost:.4f}")
```
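The `chunk_size`/`chunk_overlap` settings used when indexing follow the standard sliding-window scheme: consecutive chunks share a margin so sentences straddling a boundary stay retrievable. A minimal fixed-strategy sketch (hypothetical helper, not the framework's actual chunker, which also offers semantic and recursive modes):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # advance by size minus overlap each window
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

doc = "x" * 1000
chunks = chunk_text(doc, chunk_size=512, chunk_overlap=50)
print(len(chunks))     # 3 windows starting at 0, 462, 924
print(len(chunks[0]))  # 512
```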
Prompt Management with A/B Testing
```python
from llm_framework.core import PromptManager

pm = PromptManager(store_path="./prompts/")
pm.register("summarize", version="v1",
            template="Summarize in {num_sentences} sentences.\n\nText: {text}")
pm.register("summarize", version="v2",
            template="Extract {num_sentences} key points as bullets.\n\nText: {text}")

# A/B test with 70/30 split
response = pm.run_ab_test(
    prompt_name="summarize",
    variants={"v1": 0.7, "v2": 0.3},
    variables={"text": document, "num_sentences": 3},
)
```
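Under the hood, a 70/30 split like this amounts to weighted random selection per request. A sketch using `random.choices` (hypothetical internals; seeded here so the demo is reproducible):

```python
import random
from collections import Counter

def pick_variant(variants: dict[str, float], rng: random.Random) -> str:
    """Select one prompt version according to its traffic weight."""
    names = list(variants)
    return rng.choices(names, weights=[variants[n] for n in names], k=1)[0]

rng = random.Random(42)  # fixed seed for a reproducible demo
picks = Counter(pick_variant({"v1": 0.7, "v2": 0.3}, rng) for _ in range(10_000))
print(picks["v1"] > picks["v2"])  # v1 receives roughly 70% of traffic → True
```

In production you would also log which variant served each request, so downstream quality metrics can be attributed back to a prompt version.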
Guardrails and Output Validation
```python
from pydantic import BaseModel
from llm_framework.core import GuardrailChain

class ProductReview(BaseModel):
    sentiment: str
    key_points: list[str]
    rating: float

guardrails = GuardrailChain(
    pii_redaction=True,
    hallucination_check=True,
    output_schema=ProductReview,
    max_retries=2,
)
result = guardrails.run(prompt="Analyze this review: {review}", variables={"review": review_text})
print(f"Sentiment: {result.sentiment}, Rating: {result.rating}")
```
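The retry-with-correction loop behind `max_retries` can be sketched in a few lines. This stdlib-only version only checks that the output parses as JSON; a real implementation would also validate against the Pydantic schema and feed the validation error back to the model. The `generate` callable stands in for an LLM call and is a hypothetical interface:

```python
import json

def parse_with_retries(generate, max_retries: int = 2) -> dict:
    """Call an LLM-like generate(feedback) function until its output parses as JSON."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Turn the parse error into a correction prompt for the next attempt.
            feedback = f"Your previous output was invalid JSON ({err}). Return only valid JSON."
    raise ValueError("model never produced valid JSON")

# Fake model: fails once (truncated JSON), then corrects itself when given feedback.
attempts = iter(['{"sentiment": "positive", "rating": 4.5',
                 '{"sentiment": "positive", "rating": 4.5}'])
result = parse_with_retries(lambda feedback: next(attempts))
print(result["rating"])  # 4.5
```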
Configuration Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `llm.provider` | str | `openai` | LLM provider: `openai`, `anthropic`, `local` |
| `rag.chunking.strategy` | str | `recursive` | Document chunking strategy |
| `rag.top_k` | int | `5` | Number of retrieved context chunks |
| `guardrails.hallucination_check` | bool | `true` | Verify claims against retrieved sources |
| `cost_tracking.budget_limit_daily` | float | `50.00` | Daily cost cap in USD |
| `caching.similarity_threshold` | float | `0.95` | Minimum similarity for a cache hit |
Best Practices
- Version your prompts like code — Store prompts in files, not strings. Track changes in git. A prompt change can shift model behavior more than a code change.
- Always set a cost budget — LLM costs compound fast. Set daily limits and alert at 80%. One runaway loop can burn through a month's budget in hours.
- Cache aggressively for internal tools — FAQ bots, documentation search, and report generation have high query repetition. Semantic caching cuts costs 30-50%.
- Evaluate before deploying prompt changes — Run your evaluation harness on 100+ test cases before pushing a prompt update to production.
- Use structured output for downstream processing — Always enforce a JSON schema when LLM output feeds into application logic. Retry with correction on parse failures.
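To make the budget advice concrete: per-request cost is just token counts times per-token prices. A small calculator sketch, using illustrative prices (not any provider's current rates):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Dollar cost of one request, with separate input and output per-1K-token prices."""
    return (prompt_tokens / 1000) * price_in_per_1k + (completion_tokens / 1000) * price_out_per_1k

# Example: 1,500 prompt tokens and 500 completion tokens at $0.03 / $0.06 per 1K tokens
cost = request_cost(1500, 500, price_in_per_1k=0.03, price_out_per_1k=0.06)
print(f"${cost:.4f}")  # $0.0750
# At 10,000 such requests per day, that is $750/day -- far above a $50 daily cap,
# which is exactly the kind of surprise budget alerts are meant to catch.
```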
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| RAG returns irrelevant context | Chunk size too large or embedding mismatch | Reduce chunk_size to 256-512, ensure query and docs use the same embedding model |
| High hallucination rate | Insufficient context or temperature too high | Increase `top_k`, reduce `temperature` to 0.0-0.1, enable `hallucination_check` |
| Cost exceeding budget | No caching, verbose prompts, or large `max_tokens` | Enable caching, trim system prompts, set `max_tokens` to actual need |
| Structured output parsing fails | Model not following schema | Use `max_retries: 3` with a correction prompt, or switch to a more capable model |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete LLM Application Framework with all files, templates, and documentation for $59.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.