LLM Application Framework
Building a demo with an LLM API takes an afternoon. Building a production LLM application that's reliable, cost-effective, and safe takes months — unless you have the right scaffolding. This framework provides prompt management with versioning and A/B testing, RAG pipelines with chunking strategies and retrieval evaluation, output guardrails that catch hallucinations and policy violations, evaluation harnesses for automated quality testing, and cost tracking that prevents budget surprises. Ship LLM features with the same engineering rigor you apply to traditional ML.
Key Features
- Prompt Management System — Version-controlled prompt templates with variable injection, A/B testing support, and rollback. No more prompts buried in application code.
- RAG Pipeline Builder — Modular retrieval-augmented generation with configurable chunking (fixed, semantic, recursive), embedding models, vector stores, and reranking.
- Output Guardrails — Content filters, hallucination detection via citation verification, PII redaction, and custom policy rules that run before responses reach users.
- Evaluation Harness — Automated test suites for relevance, faithfulness, toxicity, and task-specific metrics with human-eval calibration.
- Cost Tracker — Per-request token counting, cost attribution by feature/user/team, budget alerts, and optimization recommendations.
- Caching Layer — Semantic similarity cache that avoids redundant API calls, reducing costs by 30-50% on repetitive queries.
- Fallback Chain — Automatic model failover (e.g., GPT-4 → GPT-3.5 → local model) with latency and quality tradeoff controls.
- Structured Output — JSON schema enforcement, retry-with-correction loops, and Pydantic model validation for LLM outputs.
Quick Start
```bash
unzip llm-application-framework.zip && cd llm-application-framework
pip install -r requirements.txt

# Set up your API key
export LLM_API_KEY=YOUR_API_KEY_HERE

# Run the example RAG pipeline
python src/llm_framework/core.py --config config.example.yaml --mode rag
```
```yaml
# config.example.yaml
llm:
  provider: openai              # openai | anthropic | local
  model: gpt-4
  temperature: 0.1
  max_tokens: 2048
  fallback_model: gpt-3.5-turbo
  api_base: https://api.example.com/v1/

rag:
  chunking:
    strategy: recursive         # fixed | semantic | recursive
    chunk_size: 512
    chunk_overlap: 50
  embedding_model: text-embedding-3-small
  vector_store: chromadb        # chromadb | faiss | pinecone
  top_k: 5
  reranker: cross-encoder

guardrails:
  content_filter: true
  hallucination_check: true
  pii_redaction: true
  max_output_tokens: 1024
  blocked_topics: [medical_advice, legal_advice, financial_advice]

cost_tracking:
  enabled: true
  budget_limit_daily: 50.00
  alert_threshold: 0.8
  log_every_request: true

caching:
  enabled: true
  similarity_threshold: 0.95
  ttl_hours: 24
  max_cache_size_mb: 500
```
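A config like this is typically overlaid on built-in defaults, so users only specify what they change. A stdlib-only sketch of that merge step (the `merge_config` helper and `DEFAULTS` dict are hypothetical, not the framework's actual loader):

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user-supplied values on top of defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # recurse into nested sections
        else:
            merged[key] = value  # scalar or new key: user value wins
    return merged

DEFAULTS = {
    "llm": {"provider": "openai", "model": "gpt-4", "temperature": 0.1},
    "caching": {"enabled": True, "similarity_threshold": 0.95},
}

# A partial user config only overrides what it mentions.
user_config = {"llm": {"temperature": 0.0}, "caching": {"enabled": False}}
config = merge_config(DEFAULTS, user_config)
print(config["llm"]["model"])        # untouched default: gpt-4
print(config["llm"]["temperature"])  # overridden: 0.0
```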
Architecture
```
┌────────────┐     ┌──────────────┐     ┌──────────────┐
│    User    │────>│    Prompt    │────>│    Cache     │──hit──> Response
│  Request   │     │   Manager    │     │    Layer     │
└────────────┘     └──────────────┘     └──────┬───────┘
                                          miss │
                   ┌──────────────┐     ┌──────▼───────┐
                   │  Guardrails  │<────│  LLM Chain   │
                   │    (Post)    │     │    + RAG     │
                   └──────┬───────┘     └──────────────┘
                          │
                   ┌──────▼───────┐     ┌──────────────┐
                   │     Cost     │────>│   Response   │
                   │   Tracker    │     │   to User    │
                   └──────────────┘     └──────────────┘
```
Usage Examples
Build a RAG Pipeline
```python
from llm_framework.core import RAGPipeline

rag = RAGPipeline.from_config("config.example.yaml")
rag.index_documents(documents_dir="./knowledge_base/", file_types=["pdf", "md", "txt"])

response = rag.query(
    question="What is the refund policy for enterprise customers?",
    top_k=5,
    include_sources=True,
)
print(response.answer)
print(f"Sources: {[s.title for s in response.sources]} | "
      f"Confidence: {response.confidence:.2f} | Cost: ${response.cost:.4f}")
```
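The `chunk_size`/`chunk_overlap` settings used when indexing follow the standard sliding-window scheme: consecutive chunks share a margin so sentences straddling a boundary stay retrievable. A minimal fixed-strategy sketch (hypothetical helper, not the framework's actual chunker, which also offers semantic and recursive modes):

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # advance by size minus overlap each window
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

doc = "x" * 1000
chunks = chunk_text(doc, chunk_size=512, chunk_overlap=50)
print(len(chunks))     # 3 windows starting at 0, 462, 924
print(len(chunks[0]))  # 512
```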
Prompt Management with A/B Testing
```python
from llm_framework.core import PromptManager

pm = PromptManager(store_path="./prompts/")
pm.register("summarize", version="v1",
            template="Summarize in {num_sentences} sentences.\n\nText: {text}")
pm.register("summarize", version="v2",
            template="Extract {num_sentences} key points as bullets.\n\nText: {text}")

# A/B test with 70/30 split
response = pm.run_ab_test(
    prompt_name="summarize",
    variants={"v1": 0.7, "v2": 0.3},
    variables={"text": document, "num_sentences": 3},
)
```
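Under the hood, a 70/30 split like this amounts to weighted random selection per request. A sketch using `random.choices` (hypothetical internals; seeded here so the demo is reproducible):

```python
import random
from collections import Counter

def pick_variant(variants: dict[str, float], rng: random.Random) -> str:
    """Select one prompt version according to its traffic weight."""
    names = list(variants)
    return rng.choices(names, weights=[variants[n] for n in names], k=1)[0]

rng = random.Random(42)  # fixed seed for a reproducible demo
picks = Counter(pick_variant({"v1": 0.7, "v2": 0.3}, rng) for _ in range(10_000))
print(picks["v1"] > picks["v2"])  # v1 receives roughly 70% of traffic → True
```

In production you would also log which variant served each request, so downstream quality metrics can be attributed back to a prompt version.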
Guardrails and Output Validation
```python
from pydantic import BaseModel
from llm_framework.core import GuardrailChain

class ProductReview(BaseModel):
    sentiment: str
    key_points: list[str]
    rating: float

guardrails = GuardrailChain(
    pii_redaction=True,
    hallucination_check=True,
    output_schema=ProductReview,
    max_retries=2,
)
result = guardrails.run(prompt="Analyze this review: {review}", variables={"review": review_text})
print(f"Sentiment: {result.sentiment}, Rating: {result.rating}")
```
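The retry-with-correction loop behind `max_retries` can be sketched in a few lines. This stdlib-only version only checks that the output parses as JSON; a real implementation would also validate against the Pydantic schema and feed the validation error back to the model. The `generate` callable stands in for an LLM call and is a hypothetical interface:

```python
import json

def parse_with_retries(generate, max_retries: int = 2) -> dict:
    """Call an LLM-like generate(feedback) function until its output parses as JSON."""
    feedback = None
    for _ in range(max_retries + 1):
        raw = generate(feedback)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            # Turn the parse error into a correction prompt for the next attempt.
            feedback = f"Your previous output was invalid JSON ({err}). Return only valid JSON."
    raise ValueError("model never produced valid JSON")

# Fake model: fails once (truncated JSON), then corrects itself when given feedback.
attempts = iter(['{"sentiment": "positive", "rating": 4.5',
                 '{"sentiment": "positive", "rating": 4.5}'])
result = parse_with_retries(lambda feedback: next(attempts))
print(result["rating"])  # 4.5
```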
Configuration Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `llm.provider` | str | `openai` | LLM provider: `openai`, `anthropic`, `local` |
| `rag.chunking.strategy` | str | `recursive` | Document chunking strategy |
| `rag.top_k` | int | `5` | Number of retrieved context chunks |
| `guardrails.hallucination_check` | bool | `true` | Verify claims against retrieved sources |
| `cost_tracking.budget_limit_daily` | float | `50.00` | Daily cost cap in USD |
| `caching.similarity_threshold` | float | `0.95` | Minimum similarity for a cache hit |
Best Practices
- Version your prompts like code — Store prompts in files, not strings. Track changes in git. A prompt change can shift model behavior more than a code change.
- Always set a cost budget — LLM costs compound fast. Set daily limits and alert at 80%. One runaway loop can burn through a month's budget in hours.
- Cache aggressively for internal tools — FAQ bots, documentation search, and report generation have high query repetition. Semantic caching cuts costs 30-50%.
- Evaluate before deploying prompt changes — Run your evaluation harness on 100+ test cases before pushing a prompt update to production.
- Use structured output for downstream processing — Always enforce a JSON schema when LLM output feeds into application logic. Retry with correction on parse failures.
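To make the budget advice concrete: per-request cost is just token counts times per-token prices. A small calculator sketch, using illustrative prices (not any provider's current rates):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Dollar cost of one request, with separate input and output per-1K-token prices."""
    return (prompt_tokens / 1000) * price_in_per_1k + (completion_tokens / 1000) * price_out_per_1k

# Example: 1,500 prompt tokens and 500 completion tokens at $0.03 / $0.06 per 1K tokens
cost = request_cost(1500, 500, price_in_per_1k=0.03, price_out_per_1k=0.06)
print(f"${cost:.4f}")  # $0.0750
# At 10,000 such requests per day, that is $750/day -- far above a $50 daily cap,
# which is exactly the kind of surprise budget alerts are meant to catch.
```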
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| RAG returns irrelevant context | Chunk size too large or embedding mismatch | Reduce chunk_size to 256-512, ensure query and docs use the same embedding model |
| High hallucination rate | Insufficient context or temperature too high | Increase `top_k`, reduce `temperature` to 0.0-0.1, enable `hallucination_check` |
| Cost exceeding budget | No caching, verbose prompts, or large `max_tokens` | Enable caching, trim system prompts, set `max_tokens` to actual need |
| Structured output parsing fails | Model not following schema | Use `max_retries: 3` with a correction prompt, or switch to a more capable model |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete LLM Application Framework with all files, templates, and documentation for $59.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.