Building ServiceLens: From Intern Project to Production-Ready Observability Platform
"When your microservices fail at 3 AM, every second counts. ServiceLens cuts debugging time from hours to minutes."
What I Built: ServiceLens at a Glance
- Full-stack observability platform for microservices monitoring and debugging
- Self-service onboarding - developers register services in seconds, get instant health monitoring
- AI-powered incident analysis using LLMs for root cause suggestions
- Predictive load forecasting with Facebook Prophet for capacity planning
- Tech Stack: React + FastAPI + MongoDB + Prometheus + Docker
The Challenge: When Monitoring Becomes the Problem
During my summer internship, I started with a simple goal: monitor some FastAPI microservices. But as I dove deeper, I realized traditional monitoring tools create their own problems:
- Metric explosion: High-cardinality metrics crash Prometheus
- Alert fatigue: Too much noise, not enough signal
- Slow debugging: Engineers spend hours correlating logs, metrics, and traces
- Capacity surprises: Traffic spikes catch teams off-guard
This became my mission: build an observability platform that actually helps developers ship faster.
The Journey: From Simple Monitoring to Intelligent Platform
Phase 1: Building a Realistic Testbed
I didn't want to monitor toy services, so I engineered a smart traffic generator:
- Adaptive rate limiting: Simulates organic growth (20% increase every 5 minutes)
- Strategic error injection: Tests alerting without overwhelming systems
- UUID correlation IDs: Enables distributed tracing across service calls
This gave me realistic data to work with - messy, high-volume, real-world stuff.
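The growth curve and error injection above can be sketched in a few lines. This is an illustrative reconstruction, not the actual generator; the function and header names (`ramped_rate`, `X-Inject-Error`) are my own:

```python
import random
import uuid

def ramped_rate(base_rps: float, elapsed_s: float,
                growth: float = 0.20, interval_s: float = 300.0) -> float:
    """Target request rate after elapsed_s seconds: +20% every 5 minutes."""
    steps = int(elapsed_s // interval_s)
    return base_rps * (1 + growth) ** steps

def request_headers(error_rate: float = 0.05) -> dict:
    """Attach a fresh UUID correlation ID; occasionally flag an injected error."""
    headers = {"X-Correlation-ID": str(uuid.uuid4())}
    if random.random() < error_rate:  # strategic error injection
        headers["X-Inject-Error"] = "true"
    return headers
```

Because the correlation ID rides along in a header on every call, downstream services can log it and a single grep reconstructs the whole request path.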
Phase 2: The Platform Architecture
I rebuilt ServiceLens as a centralized, multi-tenant platform with a resilient FastAPI backend:
```python
# Example: Graceful background task orchestration
async def orchestrate_monitoring_tasks():
    """Manages metric scraping and log scanning across tenants"""
    while not shutdown_event.is_set():
        try:
            await asyncio.gather(
                scrape_prometheus_metrics(),
                scan_application_logs(),
                update_health_checks(),
            )
        except Exception as e:
            logger.error(f"Task orchestration failed: {e}")
            await asyncio.sleep(30)  # Backoff on failure
```
Key architectural decisions:
- Async-first: Handle thousands of concurrent health checks
- Resource-aware: Prevent system overload during metric ingestion
- Graceful degradation: No data loss during deployments
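As one illustration of the async-first, resource-aware approach, a bounded fan-out of health checks might look like the following sketch (the semaphore cap and the sleep standing in for an HTTP probe are assumptions, not the platform's real values):

```python
import asyncio

MAX_CONCURRENT_PROBES = 100  # assumed cap; keeps ingestion resource-aware

async def check_health(service: str, sem: asyncio.Semaphore) -> tuple:
    async with sem:  # bound concurrent probes so bursts can't overload the host
        await asyncio.sleep(0.001)  # stand-in for a real HTTP health probe
        return service, True

async def check_all(services: list) -> dict:
    sem = asyncio.Semaphore(MAX_CONCURRENT_PROBES)
    results = await asyncio.gather(*(check_health(s, sem) for s in services))
    return dict(results)

statuses = asyncio.run(check_all([f"svc-{i}" for i in range(500)]))
```

The semaphore is the whole trick: thousands of checks can be scheduled at once, but only a bounded number hold sockets and memory at any instant.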
Core Features That Solve Real Problems
1. Intelligent Metric Aggregation
The biggest challenge was Prometheus cardinality explosion. My custom scraper doesn't just ingest data - it intelligently processes it:
The Problem: High-cardinality metrics like http_requests_total{user_id="..."} kill Prometheus
My Solution: Smart aggregation parser that:
- Sums counters correctly across labels (http_requests_total → total requests)
- Preserves individual gauges for important metrics (cpu_usage)
- Deduplicates labels and filters noise
This reduced our metric cardinality by 80% while preserving essential insights.
2. High-Performance Log Analysis
Built for scale with real-world constraints:
- Virtualized scrolling: Handle millions of log entries without browser crashes
- Debounced search: Sub-second filtering across structured logs
- Smart filtering: By service, log level, and time range
- Export capabilities: CSV download for deeper analysis
- Live streaming: Auto-refresh shows latest logs in real-time
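The server side of that filtering reduces to a predicate chain over structured entries. A minimal sketch, assuming a `LogEntry` shape that is my guess at the stored schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogEntry:
    service: str
    level: str
    timestamp: datetime
    message: str

def filter_logs(entries, service=None, level=None, start=None, end=None):
    """Filter structured logs by service, log level, and time range."""
    def keep(e: LogEntry) -> bool:
        if service is not None and e.service != service:
            return False
        if level is not None and e.level != level:
            return False
        if start is not None and e.timestamp < start:
            return False
        if end is not None and e.timestamp > end:
            return False
        return True
    return [e for e in entries if keep(e)]
```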
3. AI-Powered Incident Diagnostics
Instead of a generic ChatGPT integration, I engineered for reliability:
```python
# Structured LLM output with fallback parsing
def analyze_incident(metrics, logs, error_patterns):
    prompt = f"""
    Analyze this incident data and respond in valid JSON:
    {{
        "root_cause": "primary issue identified",
        "confidence": 0.85,
        "recommendations": ["action 1", "action 2"]
    }}
    Metrics: {metrics}
    Logs: {logs}
    Error patterns: {error_patterns}
    """
    # Resilient parsing handles inconsistent LLM responses
    return parse_llm_response_with_fallback(llm_call(prompt))
```
Key engineering decisions:
- Enforced JSON output: Strict prompt constraints, backed by fallback parsing, keep malformed responses from breaking the pipeline
- Cost optimization: TTL-based caching (saves ~60% on API costs)
- Graceful degradation: Works even when LLM APIs are down
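The fallback parser might look roughly like this. This is my reconstruction of the idea; the real `parse_llm_response_with_fallback` may differ:

```python
import json
import re

def parse_llm_response_with_fallback(raw: str) -> dict:
    """Try strict JSON first, then a JSON object buried in prose,
    then a safe default so callers never crash on a malformed reply."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # JSON wrapped in chatter
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Graceful degradation: return a usable default instead of raising
    return {"root_cause": "unknown", "confidence": 0.0, "recommendations": []}
```

The three-tier fallback is what makes the feature trustworthy: LLMs love to wrap JSON in "Sure, here you go!" prose, and the middle tier recovers exactly that case.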
4. Predictive Load Forecasting
Using Facebook Prophet for capacity planning:
- Traffic trend prediction: Spot usage patterns before they become problems
- Resource planning: Know when to scale before you need to
- Interactive visualization: Built with Streamlit, embedded in React app
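Prophet wants a two-column frame (`ds` timestamps, `y` values), so the main integration work is a reshaping step. A sketch under assumed names (`to_prophet_rows` is illustrative, not the project's actual code):

```python
from datetime import datetime

def to_prophet_rows(samples):
    """Reshape (timestamp, value) metric samples into the ds/y records
    Prophet expects, sorted by time."""
    return [{"ds": ts.strftime("%Y-%m-%d %H:%M:%S"), "y": float(v)}
            for ts, v in sorted(samples)]

# With pandas and prophet installed, the rows feed straight in:
#   model = Prophet()
#   model.fit(pd.DataFrame(to_prophet_rows(samples)))
#   future = model.make_future_dataframe(periods=60, freq="min")
#   forecast = model.predict(future)
```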
Technical Deep-Dive: Solving Metric Cardinality
Here's the core challenge I solved - Prometheus cardinality management:
```python
from collections import defaultdict

def aggregate_metrics(raw_metrics):
    """
    Intelligently aggregate high-cardinality metrics
    """
    aggregated = {}
    for metric_name, samples in raw_metrics.items():
        if not samples:  # skip empty series defensively
            continue
        if metric_name.endswith('_total'):  # Counter
            # Sum all counter values across labels
            aggregated[metric_name] = sum(sample.value for sample in samples)
        elif 'user_id' in samples[0].labels:  # High cardinality
            # Aggregate by service, drop user-specific labels
            by_service = defaultdict(list)
            for sample in samples:
                service = sample.labels.get('service', 'unknown')
                by_service[service].append(sample.value)
            aggregated[f"{metric_name}_by_service"] = {
                service: sum(values) for service, values in by_service.items()
            }
        else:  # Low cardinality - preserve as-is
            aggregated[metric_name] = samples
    return aggregated
```
This approach reduced our Prometheus storage by 75% while maintaining query performance.
Architecture & Tech Stack
| Layer | Technologies | Why I Chose Them |
|---|---|---|
| Frontend | React.js, Tailwind CSS, Plotly | Fast development, rich visualizations |
| Backend | FastAPI, asyncio, Pydantic | High performance, async-native |
| Database | MongoDB (with Motor) | Flexible schema for diverse log formats |
| AI/ML | Groq API, Ollama, Facebook Prophet | Cost-effective LLM access, proven forecasting |
| Monitoring | Prometheus, Custom Scraper | Industry standard + custom optimizations |
| Security | JWT Authentication | Stateless, scalable auth |
| Infrastructure | Docker, Docker Compose | Reproducible deployments |
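For the infrastructure layer, the whole stack comes up with one `docker compose up`. A minimal sketch of what such a compose file could look like (service names, ports, and image tags here are my assumptions, not the project's actual file):

```yaml
services:
  backend:
    build: ./backend          # FastAPI app
    ports: ["8000:8000"]
    environment:
      - MONGO_URL=mongodb://mongo:27017
    depends_on: [mongo, prometheus]
  frontend:
    build: ./frontend         # React app
    ports: ["3000:3000"]
  mongo:
    image: mongo:7
    volumes: [mongo_data:/data/db]
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
volumes:
  mongo_data:
```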
Key Lessons: Building for Reality
1. Production-Ready Means Details Matter
The difference between a demo and a tool people trust:
- Handling edge cases in LLM response parsing
- Graceful shutdowns that preserve data integrity
- Resource management that prevents system overload
- Error handling that fails safely
2. Choose Tools for the Job, Not the Resume
- Streamlit for data science dashboards (rapid prototyping)
- FastAPI for high-performance APIs (async-native)
- React for complex SPAs (component reusability)
- MongoDB for flexible log storage (schema evolution)
3. User Experience Drives Technical Decisions
Every technical choice served user needs:
- Virtualized scrolling → developers can browse massive log files
- Metric aggregation → dashboards load in seconds, not minutes
- LLM caching → instant incident analysis during outages
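The TTL cache behind that last point can be as small as this sketch (the 10-minute default TTL is an assumption):

```python
import time

class TTLCache:
    """Tiny TTL cache for LLM analyses: identical incidents within the
    window are served from memory instead of a paid API call."""
    def __init__(self, ttl_s: float = 600.0):
        self.ttl_s = ttl_s
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict stale entries
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)
```

Lazy eviction on read keeps the class tiny; at this scale there is no need for a background reaper thread.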
What's Next: Beyond the Internship
I'm excited to bring this systems thinking and user-focused engineering approach to a full-time role in DevOps/Observability. The challenges are just getting started.
Let's connect! Reach out on LinkedIn to discuss observability, distributed systems, or building tools that developers actually love using.