Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

LLM Application Framework

Building a demo with an LLM API takes an afternoon. Building a production LLM application that's reliable, cost-effective, and safe takes months — unless you have the right scaffolding. This framework provides prompt management with versioning and A/B testing, RAG pipelines with chunking strategies and retrieval evaluation, output guardrails that catch hallucinations and policy violations, evaluation harnesses for automated quality testing, and cost tracking that prevents budget surprises. Ship LLM features with the same engineering rigor you apply to traditional ML.

Key Features

  • Prompt Management System — Version-controlled prompt templates with variable injection, A/B testing support, and rollback. No more prompts buried in application code.
  • RAG Pipeline Builder — Modular retrieval-augmented generation with configurable chunking (fixed, semantic, recursive), embedding models, vector stores, and reranking.
  • Output Guardrails — Content filters, hallucination detection via citation verification, PII redaction, and custom policy rules that run before responses reach users.
  • Evaluation Harness — Automated test suites for relevance, faithfulness, toxicity, and task-specific metrics with human-eval calibration.
  • Cost Tracker — Per-request token counting, cost attribution by feature/user/team, budget alerts, and optimization recommendations.
  • Caching Layer — Semantic similarity cache that avoids redundant API calls, reducing costs by 30-50% on repetitive queries.
  • Fallback Chain — Automatic model failover (e.g., GPT-4 → GPT-3.5 → local model) with latency and quality tradeoff controls.
  • Structured Output — JSON schema enforcement, retry-with-correction loops, and Pydantic model validation for LLM outputs.
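The fallback-chain behavior above (e.g., GPT-4 → GPT-3.5 → local model) boils down to trying models in priority order and falling through on errors or a blown latency budget. A minimal sketch, assuming a generic `call_fn(model, prompt)` callable — `call_with_fallback` is an illustrative name, not the framework's actual API:

```python
import time

def call_with_fallback(prompt, models, call_fn, max_latency_s=10.0):
    """Try each model in order; fall through on error or latency overrun.

    Simplification: a slow-but-successful call is discarded and the next
    model is tried, which a real implementation might handle differently.
    """
    last_error = None
    for model in models:
        start = time.monotonic()
        try:
            result = call_fn(model, prompt)
            if time.monotonic() - start <= max_latency_s:
                return model, result
            last_error = TimeoutError(f"{model} exceeded {max_latency_s}s")
        except Exception as exc:  # provider errors, rate limits, timeouts
            last_error = exc
    raise RuntimeError("all models in fallback chain failed") from last_error
```

The quality/latency tradeoff lives in the ordering of `models` and the `max_latency_s` budget.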

Quick Start

unzip llm-application-framework.zip && cd llm-application-framework
pip install -r requirements.txt

# Set up your API key
export LLM_API_KEY=YOUR_API_KEY_HERE

# Run the example RAG pipeline
python src/llm_framework/core.py --config config.example.yaml --mode rag
# config.example.yaml
llm:
  provider: openai  # openai | anthropic | local
  model: gpt-4
  temperature: 0.1
  max_tokens: 2048
  fallback_model: gpt-3.5-turbo
  api_base: https://api.example.com/v1/

rag:
  chunking:
    strategy: recursive  # fixed | semantic | recursive
    chunk_size: 512
    chunk_overlap: 50
  embedding_model: text-embedding-3-small
  vector_store: chromadb  # chromadb | faiss | pinecone
  top_k: 5
  reranker: cross-encoder

guardrails:
  content_filter: true
  hallucination_check: true
  pii_redaction: true
  max_output_tokens: 1024
  blocked_topics: [medical_advice, legal_advice, financial_advice]

cost_tracking:
  enabled: true
  budget_limit_daily: 50.00
  alert_threshold: 0.8
  log_every_request: true

caching:
  enabled: true
  similarity_threshold: 0.95
  ttl_hours: 24
  max_cache_size_mb: 500
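As a rough illustration of what `chunk_size: 512` with `chunk_overlap: 50` implies for the `fixed` strategy, here is a minimal character-based chunker. It is a sketch, not the framework's implementation — a production chunker would count tokens rather than characters, and `chunk_fixed` is a hypothetical helper name:

```python
def chunk_fixed(text, chunk_size=512, chunk_overlap=50):
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one, so context
    straddling a boundary still appears intact in some chunk."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```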

Architecture

┌────────────┐     ┌──────────────┐     ┌──────────────┐
│  User      │────>│  Prompt      │────>│  Cache       │──hit──> Response
│  Request   │     │  Manager     │     │  Layer       │
└────────────┘     └──────────────┘     └──────┬───────┘
                                           miss │
                   ┌──────────────┐     ┌──────▼───────┐
                   │  Guardrails  │<────│  LLM Chain   │
                   │  (Post)      │     │  + RAG       │
                   └──────┬───────┘     └──────────────┘
                          │
                   ┌──────▼───────┐     ┌──────────────┐
                   │  Cost        │────>│  Response    │
                   │  Tracker     │     │  to User     │
                   └──────────────┘     └──────────────┘

Usage Examples

Build a RAG Pipeline

from llm_framework.core import RAGPipeline

rag = RAGPipeline.from_config("config.example.yaml")
rag.index_documents(documents_dir="./knowledge_base/", file_types=["pdf", "md", "txt"])

response = rag.query(question="What is the refund policy for enterprise customers?", top_k=5, include_sources=True)
print(f"{response.answer}\nSources: {[s.title for s in response.sources]} | Confidence: {response.confidence:.2f} | Cost: ${response.cost:.4f}")

Prompt Management with A/B Testing

from llm_framework.core import PromptManager

pm = PromptManager(store_path="./prompts/")
pm.register("summarize", version="v1", template="Summarize in {num_sentences} sentences.\n\nText: {text}")
pm.register("summarize", version="v2", template="Extract {num_sentences} key points as bullets.\n\nText: {text}")

# A/B test with 70/30 split
response = pm.run_ab_test(
    prompt_name="summarize", variants={"v1": 0.7, "v2": 0.3},
    variables={"text": document, "num_sentences": 3},
)
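Under the hood, a 70/30 split amounts to weighted random selection over the registered variants. A minimal sketch of how `run_ab_test` might pick a version — `pick_variant` is illustrative, not part of the framework's API:

```python
import random

def pick_variant(variants, rng=random):
    """Choose a prompt version by weight, e.g. {"v1": 0.7, "v2": 0.3}.

    Accepts an injectable `rng` (anything with a .choices method) so
    the split can be made deterministic in tests.
    """
    names, weights = zip(*variants.items())
    return rng.choices(names, weights=weights, k=1)[0]
```

A real A/B harness would also log which variant served each request so downstream quality metrics can be attributed per version.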

Guardrails and Output Validation

from llm_framework.core import GuardrailChain
from pydantic import BaseModel

class ProductReview(BaseModel):
    sentiment: str
    key_points: list[str]
    rating: float

guardrails = GuardrailChain(pii_redaction=True, hallucination_check=True,
                             output_schema=ProductReview, max_retries=2)
result = guardrails.run(prompt="Analyze this review: {review}", variables={"review": review_text})
print(f"Sentiment: {result.sentiment}, Rating: {result.rating}")
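The retry-with-correction loop behind `max_retries=2` can be sketched with stdlib JSON validation. The framework presumably validates against the Pydantic model directly; `parse_with_retry` and `validate_review` below are hypothetical stand-ins, shown stdlib-only so the sketch is self-contained:

```python
import json

REQUIRED = {"sentiment": str, "key_points": list, "rating": float}

def validate_review(raw):
    """Parse JSON and check each required field exists with the right type."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"field {field!r} missing or not {typ.__name__}")
    return data

def parse_with_retry(call_llm, prompt, max_retries=2):
    """Ask the model for JSON; on failure, feed the error message back."""
    attempt, last_err = prompt, None
    for _ in range(max_retries + 1):
        raw = call_llm(attempt)
        try:
            return validate_review(raw)
        except ValueError as exc:  # JSONDecodeError is a ValueError subclass
            last_err = exc
            attempt = (f"{prompt}\n\nPrevious reply was invalid ({exc}). "
                       "Return only valid JSON matching the schema.")
    raise ValueError("model never produced valid structured output") from last_err
```

Feeding the validation error back in the correction prompt is what makes the retry meaningfully different from simply re-sending the same request.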

Configuration Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| llm.provider | str | openai | LLM provider: openai, anthropic, local |
| rag.chunking.strategy | str | recursive | Document chunking strategy |
| rag.top_k | int | 5 | Number of retrieved context chunks |
| guardrails.hallucination_check | bool | true | Verify claims against retrieved sources |
| cost_tracking.budget_limit_daily | float | 50.00 | Daily cost cap in USD |
| caching.similarity_threshold | float | 0.95 | Minimum similarity for cache hit |

Best Practices

  1. Version your prompts like code — Store prompts in files, not strings. Track changes in git. A prompt change can shift model behavior more than a code change.
  2. Always set a cost budget — LLM costs compound fast. Set daily limits and alert at 80%. One runaway loop can burn through a month's budget in hours.
  3. Cache aggressively for internal tools — FAQ bots, documentation search, and report generation have high query repetition. Semantic caching cuts costs 30-50%.
  4. Evaluate before deploying prompt changes — Run your evaluation harness on 100+ test cases before pushing a prompt update to production.
  5. Use structured output for downstream processing — Always enforce a JSON schema when LLM output feeds into application logic. Retry with correction on parse failures.
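The semantic caching in point 3 boils down to a nearest-neighbor lookup over query embeddings with a similarity cutoff. A minimal in-memory sketch, assuming embeddings arrive as plain float lists — `SemanticCache` is illustrative, not the framework's class:

```python
import math

class SemanticCache:
    """Cache keyed on query embeddings: a hit is any stored entry whose
    cosine similarity to the new query meets the threshold. A production
    cache would use a vector index and TTL eviction instead of a scan."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, embedding):
        best, best_sim = None, self.threshold
        for emb, response in self.entries:
            sim = self._cosine(embedding, emb)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best  # None on a cache miss

    def put(self, embedding, response):
        self.entries.append((embedding, response))
```

Paraphrased queries land near each other in embedding space, which is why this catches repetition that an exact-string cache would miss.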

Troubleshooting

| Issue | Cause | Fix |
| --- | --- | --- |
| RAG returns irrelevant context | Chunk size too large or embedding mismatch | Reduce chunk_size to 256-512; ensure query and docs use the same embedding model |
| High hallucination rate | Insufficient context or temperature too high | Increase top_k, reduce temperature to 0.0-0.1, enable hallucination_check |
| Cost exceeding budget | No caching, verbose prompts, or large max_tokens | Enable caching, trim system prompts, set max_tokens to actual need |
| Structured output parsing fails | Model not following schema | Use max_retries: 3 with a correction prompt, or switch to a more capable model |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete [LLM Application Framework] with all files, templates, and documentation for $59.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

