Abdessamad Ammi

Posted on Jan 5 • Originally published at bcloud.consulting

How to Cut LLM Costs by 80% Without Sacrificing Quality in Production

#ai #llm


TL;DR

• 7 techniques cut LLM costs by 80% while maintaining quality
• Real case: 45K → 9K per month at 500K queries/day
• Model routing + caching = 65% immediate savings
• Fine-tuned small models can outperform GPT-4 on specific use cases
• Typical ROI: positive within 2-3 weeks of implementation

The LLM Cost Problem

LLM costs can destroy an otherwise viable business model. At 500K queries per day, the monthly bill easily exceeds 40K.

But with the right architecture, that same volume can cost less than 10K.

The 7 Proven Optimization Techniques

1. Intelligent Model Routing

Not every query needs GPT-4:

class IntelligentRouter:
    def __init__(self):
        self.classifiers = {
            'complexity': ComplexityClassifier(),
            'intent': IntentClassifier(),
            'domain': DomainClassifier()
        }

    def route_query(self, query: str) -> str:
        complexity = self.classifiers['complexity'].classify(query)
        intent = self.classifiers['intent'].classify(query)

        # Routing logic
        if complexity == 'simple' and intent == 'factual':
            return 'gpt-3.5-turbo'  # $0.002/1K tokens

        if intent == 'creative':
            return 'claude-3-sonnet'  # Better for creative

        if complexity == 'complex':
            return 'gpt-4'  # $0.03/1K tokens only when needed

        if intent == 'code':
            return 'deepseek-coder'  # Specialized and cheaper

        return 'llama-3-70b'  # Default open source

# Result: 80% of queries go to cheap models
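
To see what that routing split means for the bill, here is a minimal sketch of the blended price per 1K tokens. It assumes only the two prices quoted in the router comments and an 80/20 split between the cheap model and GPT-4; the function name `blended_price` is illustrative, and real traffic would spread over more than two models.

```python
# Prices per 1K tokens, as quoted in the router comments
PRICES = {'gpt-3.5-turbo': 0.002, 'gpt-4': 0.03}

def blended_price(traffic_share: dict) -> float:
    """Weighted average price per 1K tokens for a given traffic split."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError('traffic shares must sum to 1')
    return sum(PRICES[model] * share for model, share in traffic_share.items())

# Assumed split: 80% of traffic on the cheap model, 20% on GPT-4
price = blended_price({'gpt-3.5-turbo': 0.8, 'gpt-4': 0.2})
savings = 1 - price / PRICES['gpt-4']
print(f'{price:.4f} per 1K tokens, {savings:.0%} cheaper than all-GPT-4')
```

Under those assumptions routing alone lands at roughly three quarters off the all-GPT-4 price, before any caching or compression.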

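2. Aggressive Semantic Caching

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticCache: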
    def __init__(self, similarity_threshold=0.95):
        self.cache = {}
        self.embedder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold

    async def get_or_generate(self, query: str, generator_func):
        # Generate embedding
        query_embedding = self.embedder.encode(query)

        # Check cache
        for cached_query, (cached_embedding, response) in self.cache.items():
            similarity = cosine_similarity(
                [query_embedding],
                [cached_embedding]
            )[0][0]

            if similarity > self.threshold:
                return {
                    'response': response,
                    'cached': True,
                    'similarity': similarity
                }

        # Not in cache, generate
        response = await generator_func(query)

        # Store in cache
        self.cache[query] = (query_embedding, response)

        return {
            'response': response,
            'cached': False
        }

# 65% hit rate in production = 65% direct savings
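
The cache above grows without limit and returns the first entry over the threshold. As one possible refinement, the sketch below caps the number of entries with least-recently-used eviction and returns the best match instead of the first. It is standard-library only and takes embeddings as plain lists of floats; `BoundedSemanticStore`, `max_entries`, and the `cosine` helper are hypothetical names, not part of the original implementation.

```python
import math
from collections import OrderedDict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class BoundedSemanticStore:
    def __init__(self, max_entries=10_000, threshold=0.95):
        self.entries = OrderedDict()  # query -> (embedding, response)
        self.max_entries = max_entries
        self.threshold = threshold

    def lookup(self, embedding):
        """Return the response of the most similar entry above the threshold, or None."""
        best_key, best_score = None, self.threshold
        for key, (cached_embedding, _) in self.entries.items():
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is None:
            return None
        self.entries.move_to_end(best_key)  # mark as recently used
        return self.entries[best_key][1]

    def store(self, query, embedding, response):
        self.entries[query] = (embedding, response)
        self.entries.move_to_end(query)
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used entry
```

The scan is still linear in the number of entries; at larger cache sizes a vector index would be the usual next step.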

3. Lossless Prompt Compression

class PromptCompressor:
    def compress(self, prompt: str) -> str:
        # 1. Remove redundant whitespace
        prompt = ' '.join(prompt.split())

        # 2. Abbreviate common terms
        abbreviations = {
            'por favor': 'pls',
            'información': 'info',
            'descripción': 'desc',
            'configuración': 'config'
        }
        for full, abbr in abbreviations.items():
            prompt = prompt.replace(full, abbr)

        # 3. Remove filler words
        filler_words = ['muy', 'realmente', 'básicamente', 'simplemente']
        for word in filler_words:
            prompt = prompt.replace(f' {word} ', ' ')

        # 4. Use references instead of repetition
        prompt = self.replace_repetitions_with_references(prompt)

        return prompt

    def replace_repetitions_with_references(self, text):
        # Detects repeated text and replaces it with references
        # "The user Juan García... Juan García..." → "The user Juan García [U1]... [U1]..."
        return text  # Simplified implementation

# Typical reduction: 40-50% of tokens
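
The filler-word step above only matches words surrounded by single spaces, so it misses fillers at the start of a prompt or with different casing. A minimal sketch of a word-boundary version follows, together with a rough way to measure the reduction. `remove_fillers` and `word_reduction` are illustrative names, and whitespace-separated word counts are only a proxy for tokens; a real measurement would use the target model's tokenizer.

```python
import re

# Same filler list as in the compressor above
FILLER_WORDS = ['muy', 'realmente', 'básicamente', 'simplemente']

# \b anchors keep the match to whole words, so "simple" is left untouched
_FILLER_RE = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, FILLER_WORDS)) + r')\b\s*',
    re.IGNORECASE,
)

def remove_fillers(prompt: str) -> str:
    """Drop filler words and normalize the remaining whitespace."""
    return ' '.join(_FILLER_RE.sub('', prompt).split())

def word_reduction(original: str, compressed: str) -> float:
    """Fraction of whitespace-separated words removed (a proxy for tokens)."""
    before = len(original.split())
    after = len(compressed.split())
    return 1 - after / before if before else 0.0

original = 'Muy bien, esto es realmente simple'
compressed = remove_fillers(original)
print(compressed, f'{word_reduction(original, compressed):.0%}')
```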

4. Strategic Batch Processing

import asyncio
import io
import json

from openai import AsyncOpenAI

class BatchProcessor:
    def __init__(self):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.batch_queue = []
        self.batch_size = 100
        self.batch_wait_time = 60  # seconds

    async def add_to_batch(self, query: dict):
        self.batch_queue.append(query)

        if len(self.batch_queue) >= self.batch_size:
            return await self.process_batch()

        # Wait for more queries or timeout
        await asyncio.sleep(self.batch_wait_time)
        return await self.process_batch()

    async def process_batch(self):
        if not self.batch_queue:
            return []

        # Take ownership of the queue so concurrent callers do not resubmit it
        queued, self.batch_queue = self.batch_queue, []

        # The Batch API expects a JSONL file: one request object per line
        lines = [
            json.dumps({
                'custom_id': q['id'],
                'method': 'POST',
                'url': '/v1/chat/completions',
                'body': {
                    'model': q['model'],
                    'messages': q['messages']
                }
            }) for q in queued
        ]
        payload = io.BytesIO('\n'.join(lines).encode('utf-8'))

        # Upload the file first, then create the batch from its ID
        input_file = await self.client.files.create(
            file=('batch.jsonl', payload),
            purpose='batch'
        )

        # Submit batch (50% discount for 24h turnaround)
        return await self.client.batches.create(
            input_file_id=input_file.id,
            endpoint='/v1/chat/completions',
            completion_window='24h'
        )
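
In the processor above every caller sleeps for the full wait time before flushing. One alternative design, sketched here under the assumption that a single background task owns the flush, is to drain an `asyncio.Queue` until either the batch is full or a deadline passes. `collect_batch`, `max_size`, and `max_wait` are hypothetical names used only for this illustration.

```python
import asyncio

async def collect_batch(queue: asyncio.Queue, max_size: int, max_wait: float) -> list:
    """Drain up to max_size items, waiting at most max_wait seconds in total."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait
    batch = []
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # deadline reached: flush whatever has arrived
        try:
            item = await asyncio.wait_for(queue.get(), timeout=remaining)
        except asyncio.TimeoutError:
            break  # no more items arrived before the deadline
        batch.append(item)
    return batch

async def _demo():
    queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(i)
    first = await collect_batch(queue, max_size=3, max_wait=0.05)
    second = await collect_batch(queue, max_size=3, max_wait=0.05)
    return first, second

first, second = asyncio.run(_demo())
```

Producers then only call `queue.put`, and the returned list can be handed to whatever submits the batch.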

5. Response Streaming with Early Stopping

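from openai import AsyncOpenAI

class SmartStreaming:
    def __init__(self, quality_threshold=0.9, estimated_full_tokens=500):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.quality_checker = QualityChecker()  # not defined in this summary
        self.threshold = quality_threshold
        # Assumed typical length of an uninterrupted response; tune per workload
        self.estimated_full_tokens = estimated_full_tokens

    async def generate_with_early_stop(self, prompt: str):
        response_chunks = []
        total_tokens = 0
        current_response = ''
        quality_score = 0.0

        stream = await self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        async for chunk in stream:
            # Some chunks carry no text (role header, final chunk)
            if not chunk.choices or not chunk.choices[0].delta.content:
                continue
            response_chunks.append(chunk.choices[0].delta.content)
            total_tokens += 1  # approximation: one chunk is roughly one token

            # Check if response quality sufficient
            current_response = ''.join(response_chunks)
            quality_score = self.quality_checker.score(
                current_response,
                prompt
            )

            if quality_score > self.threshold:
                # Stop generation early and release the connection
                await stream.close()
                break

        return {
            'response': current_response,
            'tokens_saved': max(self.estimated_full_tokens - total_tokens, 0),
            'quality_score': quality_score
        }

# Typical savings: 20-40% of output tokens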
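
Scoring the partial response on every chunk can cost more than it saves if the quality check is itself expensive. A minimal sketch of the same early-stopping loop that only checks every few chunks is shown below. It works on any iterable of text chunks, so it can be tried without an API key; `consume_until_good`, `is_good_enough`, and `check_every` are hypothetical names, and the length-based check in the usage lines is a stand-in for a real quality scorer.

```python
from typing import Callable, Iterable, Tuple

def consume_until_good(
    chunks: Iterable[str],
    is_good_enough: Callable[[str], bool],
    check_every: int = 8,
) -> Tuple[str, int]:
    """Accumulate chunks, testing quality only every `check_every` chunks."""
    parts = []
    consumed = 0
    for chunk in chunks:
        parts.append(chunk)
        consumed += 1
        # Join and score only at checkpoints to keep the check cost bounded
        if consumed % check_every == 0 and is_good_enough(''.join(parts)):
            break
    return ''.join(parts), consumed

# Stand-in check: stop once at least 10 characters have been produced
text, consumed = consume_until_good(['a'] * 20, lambda t: len(t) >= 10, check_every=4)
```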

6. Fine-Tuning Small Models

# Fine-tune Phi-3 for a specific domain
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def fine_tune_small_model(dataset):
    # Load a small base model; `dataset` must already be tokenized
    model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

    training_args = TrainingArguments(
        output_dir="./phi3-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        save_steps=1000,
        save_total_limit=2,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
    )

    trainer.train()
    return trainer

# Result: a model 90% cheaper than GPT-4
# For in-domain queries, equal or better quality
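
Whether a self-hosted fine-tuned model pays off depends on volume, because hosting adds a fixed monthly cost that an API does not have. The sketch below computes the break-even query volume under that simple model. All three inputs are assumptions to be replaced with your own figures, `break_even_queries` is an illustrative name, and the numbers in the usage lines are placeholders rather than measurements.

```python
import math

def break_even_queries(
    fixed_monthly_cost: float,
    api_cost_per_query: float,
    self_hosted_cost_per_query: float,
) -> float:
    """Monthly query volume above which self-hosting is cheaper than the API."""
    saving_per_query = api_cost_per_query - self_hosted_cost_per_query
    if saving_per_query <= 0:
        return math.inf  # self-hosting never pays off
    return fixed_monthly_cost / saving_per_query

# Placeholder inputs: 1000/month hosting, per-query cost 90% below the API price
threshold = break_even_queries(
    fixed_monthly_cost=1000.0,
    api_cost_per_query=0.003,
    self_hosted_cost_per_query=0.0003,
)
print(f'Break-even at about {threshold:,.0f} queries per month')
```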

7. Granular Monitoring and Alerts

from collections import defaultdict
from datetime import datetime

class CostMonitor:
    def __init__(self):
        self.metrics = defaultdict(lambda: {
            'tokens': 0,
            'cost': 0,
            'queries': 0
        })

    def track(self, endpoint: str, user: str, tokens: int, model: str):
        cost = self.calculate_cost(tokens, model)

        # Track across multiple dimensions
        self.metrics[f'endpoint:{endpoint}']['tokens'] += tokens
        self.metrics[f'endpoint:{endpoint}']['cost'] += cost
        self.metrics[f'endpoint:{endpoint}']['queries'] += 1
        self.metrics[f'user:{user}']['cost'] += cost
        self.metrics[f'model:{model}']['tokens'] += tokens
        self.metrics[f'hour:{datetime.now().hour}']['cost'] += cost

        # Alerts (counters must be reset daily for the daily limit to hold)
        if self.metrics[f'user:{user}']['cost'] > 100:  # Daily limit
            self.send_alert(f"User {user} exceeded daily limit")

        if cost > 10:  # Single query cost
            self.send_alert(f"Expensive query: {cost}")

    def generate_report(self):
        return {
            'top_endpoints': self.get_top_by_cost('endpoint'),
            'top_users': self.get_top_by_cost('user'),
            'hourly_pattern': self.get_hourly_pattern(),
            'model_distribution': self.get_model_distribution()
        }
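
The monitor calls `calculate_cost` and `get_top_by_cost` without defining them. A minimal sketch of what those helpers could look like follows, written as standalone functions. The price table reuses the two per-1K-token prices quoted in the routing section; the function bodies and the `top_by_cost` signature are assumptions, not the original implementation.

```python
# Price per 1K tokens, from the routing section; extend for other models
PRICE_PER_1K = {'gpt-4': 0.03, 'gpt-3.5-turbo': 0.002}

def calculate_cost(tokens: int, model: str) -> float:
    """Cost of a call, assuming a single flat price per 1K tokens."""
    return tokens / 1000 * PRICE_PER_1K[model]

def top_by_cost(metrics: dict, dimension: str, n: int = 5) -> list:
    """Top n keys of one dimension (e.g. 'user'), sorted by accumulated cost."""
    prefix = f'{dimension}:'
    rows = [
        (key[len(prefix):], values['cost'])
        for key, values in metrics.items()
        if key.startswith(prefix)
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]

metrics = {
    'user:ana': {'cost': 5.0},
    'user:bob': {'cost': 12.0},
    'endpoint:/chat': {'cost': 99.0},
}
top_users = top_by_cost(metrics, 'user')
```

Real pricing usually differs between input and output tokens, so a production version would track the two counts separately.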

Real Case: B2B SaaS Platform

Before optimization:

45K/month in LLM costs
500K queries/day
Everything on GPT-4
No caching
No monitoring

Implementation (4 weeks):

Week 1: Monitoring + analysis
Week 2: Model routing + caching
Week 3: Batch processing + compression
Week 4: Fine-tuning + optimization

After:

9K/month (80% reduction)
Same query volume
Improved latency
Quality maintained
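
The before and after figures can be turned into a per-query cost, which is often the easier number to track over time. This is a small worked calculation using only the numbers stated in the case above and assuming a 30-day month; the currency is whatever unit the 45K and 9K figures are in, and `cost_per_query` is an illustrative name.

```python
def cost_per_query(monthly_cost: float, daily_queries: int, days: int = 30) -> float:
    """Average cost of one query over a month of `days` days."""
    return monthly_cost / (daily_queries * days)

before = cost_per_query(45_000, 500_000)
after = cost_per_query(9_000, 500_000)
reduction = (45_000 - 9_000) / 45_000

print(f'before: {before:.4f}, after: {after:.4f}, reduction: {reduction:.0%}')
```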

ROI Calculator

# Price per 1K tokens, taken from the routing example
MODEL_PRICES = {'gpt-4': 0.03, 'gpt-3.5-turbo': 0.002}

def calculate_savings(
    daily_queries: int,
    current_model: str = "gpt-4",
    average_tokens: int = 1000
):
    # Current costs
    current_cost_per_query = (average_tokens / 1000) * MODEL_PRICES[current_model]
    monthly_current = daily_queries * 30 * current_cost_per_query

    # Optimized costs, each technique applied on its own
    optimized_costs = {
        'model_routing': monthly_current * 0.3,  # 70% savings
        'caching': monthly_current * 0.35,  # 65% cache hit
        'compression': monthly_current * 0.6,  # 40% token reduction
        'batch': monthly_current * 0.7,  # 30% batch discount
    }

    # Rough estimate: average of the per-technique scenarios
    total_optimized = sum(optimized_costs.values()) / len(optimized_costs)

    return {
        'current_monthly': monthly_current,
        'optimized_monthly': total_optimized,
        'savings': monthly_current - total_optimized,
        'savings_percentage': ((monthly_current - total_optimized) / monthly_current) * 100,
        'annual_savings': (monthly_current - total_optimized) * 12
    }
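
The calculator averages the four per-technique scenarios. An alternative way to model stacked techniques, sketched here, is to multiply their remaining-cost factors, on the assumption that each technique acts independently on whatever cost is left after the previous one. That assumption is optimistic (a cached query cannot also be routed or batched for extra savings), so the result is better read as an upper bound on savings. `stacked_cost_factor` is a hypothetical helper, not part of the original calculator.

```python
from math import prod

def stacked_cost_factor(factors) -> float:
    """Remaining cost fraction when independent techniques are applied in sequence."""
    return prod(factors)

# Remaining-cost factors from the calculator: routing and caching
routing_and_caching = stacked_cost_factor([0.3, 0.35])

# All four techniques from the calculator
all_four = stacked_cost_factor([0.3, 0.35, 0.6, 0.7])

print(f'routing + caching: {1 - routing_and_caching:.1%} savings')
print(f'all four: {1 - all_four:.1%} savings')
```

Comparing this figure with the averaged one gives a range, and measured production numbers should decide where in that range a given workload falls.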

Conclusions

→ An 80% cost reduction is realistic and achievable
→ It does not require sacrificing quality or speed
→ ROI typically turns positive in 2-3 weeks
→ Monitoring is critical to keep the savings in place
→ A combination of techniques beats any single technique

Full Article

This is a summary. For the complete implementation:

👉 Read the full article

It includes:

Complete implementation code
Monitoring dashboards
Interactive ROI calculator
Step-by-step migration guide

How much do you spend on LLMs? Share your experience 👇
