Building ServiceLens: From Intern Project to Production-Ready Observability Platform
"When your microservices fail at 3 AM, every second counts. ServiceLens cuts debugging time from hours to minutes."
What I Built: ServiceLens at a Glance
- Full-stack observability platform for microservices monitoring and debugging
- Self-service onboarding - developers register services in seconds, get instant health monitoring
- AI-powered incident analysis using LLMs for root cause suggestions
- Predictive load forecasting with Facebook Prophet for capacity planning
- Tech Stack: React + FastAPI + MongoDB + Prometheus + Docker
The Challenge: When Monitoring Becomes the Problem
During my summer internship, I started with a simple goal: monitor some FastAPI microservices. But as I dove deeper, I realized traditional monitoring tools create their own problems:
- Metric explosion: High-cardinality metrics crash Prometheus
- Alert fatigue: Too much noise, not enough signal
- Slow debugging: Engineers spend hours correlating logs, metrics, and traces
- Capacity surprises: Traffic spikes catch teams off-guard
This became my mission: build an observability platform that actually helps developers ship faster.
The Journey: From Simple Monitoring to Intelligent Platform
Phase 1: Building a Realistic Testbed
I didn't want to monitor toy services, so I engineered a smart traffic generator:
- Adaptive rate limiting: Simulates organic growth (20% increase every 5 minutes)
- Strategic error injection: Tests alerting without overwhelming systems
- UUID correlation IDs: Enables distributed tracing across service calls
This gave me realistic data to work with - messy, high-volume, real-world stuff.
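The growth curve and error injection above can be sketched in a few lines. This is an illustrative reconstruction, not the actual generator; the function and header names (`ramped_rate`, `X-Inject-Error`) are my own:

```python
import random
import uuid

def ramped_rate(base_rps: float, elapsed_s: float,
                growth: float = 0.20, interval_s: float = 300.0) -> float:
    """Target request rate after elapsed_s seconds: +20% every 5 minutes."""
    steps = int(elapsed_s // interval_s)
    return base_rps * (1 + growth) ** steps

def request_headers(error_rate: float = 0.05) -> dict:
    """Attach a fresh UUID correlation ID; occasionally flag an injected error."""
    headers = {"X-Correlation-ID": str(uuid.uuid4())}
    if random.random() < error_rate:  # strategic error injection
        headers["X-Inject-Error"] = "true"
    return headers
```

Because the correlation ID rides along in a header on every call, downstream services can log it and a single grep reconstructs the whole request path.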
Phase 2: The Platform Architecture
I rebuilt ServiceLens as a centralized, multi-tenant platform with a resilient FastAPI backend:
```python
# Example: Graceful background task orchestration
async def orchestrate_monitoring_tasks():
    """Manages metric scraping and log scanning across tenants"""
    while not shutdown_event.is_set():
        try:
            await asyncio.gather(
                scrape_prometheus_metrics(),
                scan_application_logs(),
                update_health_checks(),
            )
        except Exception as e:
            logger.error(f"Task orchestration failed: {e}")
            await asyncio.sleep(30)  # Backoff on failure
```
Key architectural decisions:
- Async-first: Handle thousands of concurrent health checks
- Resource-aware: Prevent system overload during metric ingestion
- Graceful degradation: No data loss during deployments
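As one illustration of the async-first, resource-aware approach, a bounded fan-out of health checks might look like the following sketch (the semaphore cap and the sleep standing in for an HTTP probe are assumptions, not the platform's real values):

```python
import asyncio

MAX_CONCURRENT_PROBES = 100  # assumed cap; keeps ingestion resource-aware

async def check_health(service: str, sem: asyncio.Semaphore) -> tuple:
    async with sem:  # bound concurrent probes so bursts can't overload the host
        await asyncio.sleep(0.001)  # stand-in for a real HTTP health probe
        return service, True

async def check_all(services: list) -> dict:
    sem = asyncio.Semaphore(MAX_CONCURRENT_PROBES)
    results = await asyncio.gather(*(check_health(s, sem) for s in services))
    return dict(results)

statuses = asyncio.run(check_all([f"svc-{i}" for i in range(500)]))
```

The semaphore is the whole trick: thousands of checks can be scheduled at once, but only a bounded number hold sockets and memory at any instant.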
Core Features That Solve Real Problems
1. Intelligent Metric Aggregation
The biggest challenge was Prometheus cardinality explosion. My custom scraper doesn't just ingest data - it intelligently processes it:
The Problem: High-cardinality metrics like http_requests_total{user_id="..."} kill Prometheus
My Solution: Smart aggregation parser that:
- Sums counters correctly across labels (http_requests_total → total requests)
- Preserves individual gauges for important metrics (cpu_usage)
- Deduplicates labels and filters noise
This reduced our metric cardinality by 80% while preserving essential insights.
2. High-Performance Log Analysis
Built for scale with real-world constraints:
- Virtualized scrolling: Handle millions of log entries without browser crashes
- Debounced search: Sub-second filtering across structured logs
- Smart filtering: By service, log level, and time range
- Export capabilities: CSV download for deeper analysis
- Live streaming: Auto-refresh shows latest logs in real-time
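The server side of that filtering reduces to a predicate chain over structured entries. A minimal sketch, assuming a `LogEntry` shape that is my guess at the stored schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogEntry:
    service: str
    level: str
    timestamp: datetime
    message: str

def filter_logs(entries, service=None, level=None, start=None, end=None):
    """Filter structured logs by service, log level, and time range."""
    def keep(e: LogEntry) -> bool:
        if service is not None and e.service != service:
            return False
        if level is not None and e.level != level:
            return False
        if start is not None and e.timestamp < start:
            return False
        if end is not None and e.timestamp > end:
            return False
        return True
    return [e for e in entries if keep(e)]
```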
3. AI-Powered Incident Diagnostics
Instead of a generic ChatGPT integration, I engineered for reliability:
```python
# Structured LLM output with fallback parsing
def analyze_incident(metrics, logs, error_patterns):
    prompt = f"""
    Analyze this incident data and respond in valid JSON:
    {{
        "root_cause": "primary issue identified",
        "confidence": 0.85,
        "recommendations": ["action 1", "action 2"]
    }}
    Metrics: {metrics}
    Logs: {logs}
    Error patterns: {error_patterns}
    """
    # Resilient parsing handles inconsistent LLM responses
    return parse_llm_response_with_fallback(llm_call(prompt))
```
Key engineering decisions:
- Enforced JSON output: Strict prompt constraints, backed by fallback parsing, keep malformed responses from breaking the pipeline
- Cost optimization: TTL-based caching (saves ~60% on API costs)
- Graceful degradation: Works even when LLM APIs are down
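The fallback parser might look roughly like this. This is my reconstruction of the idea; the real `parse_llm_response_with_fallback` may differ:

```python
import json
import re

def parse_llm_response_with_fallback(raw: str) -> dict:
    """Try strict JSON first, then a JSON object buried in prose,
    then a safe default so callers never crash on a malformed reply."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # JSON wrapped in chatter
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Graceful degradation: return a usable default instead of raising
    return {"root_cause": "unknown", "confidence": 0.0, "recommendations": []}
```

The three-tier fallback is what makes the feature trustworthy: LLMs love to wrap JSON in "Sure, here you go!" prose, and the middle tier recovers exactly that case.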
4. Predictive Load Forecasting
Using Facebook Prophet for capacity planning:
- Traffic trend prediction: Spot usage patterns before they become problems
- Resource planning: Know when to scale before you need to
- Interactive visualization: Built with Streamlit, embedded in React app
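Prophet wants a two-column frame (`ds` timestamps, `y` values), so the main integration work is a reshaping step. A sketch under assumed names (`to_prophet_rows` is illustrative, not the project's actual code):

```python
from datetime import datetime

def to_prophet_rows(samples):
    """Reshape (timestamp, value) metric samples into the ds/y records
    Prophet expects, sorted by time."""
    return [{"ds": ts.strftime("%Y-%m-%d %H:%M:%S"), "y": float(v)}
            for ts, v in sorted(samples)]

# With pandas and prophet installed, the rows feed straight in:
#   model = Prophet()
#   model.fit(pd.DataFrame(to_prophet_rows(samples)))
#   future = model.make_future_dataframe(periods=60, freq="min")
#   forecast = model.predict(future)
```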
Technical Deep-Dive: Solving Metric Cardinality
Here's the core challenge I solved - Prometheus cardinality management:
```python
from collections import defaultdict

def aggregate_metrics(raw_metrics):
    """
    Intelligently aggregate high-cardinality metrics
    """
    aggregated = {}
    for metric_name, samples in raw_metrics.items():
        if not samples:  # skip empty series defensively
            continue
        if metric_name.endswith('_total'):  # Counter
            # Sum all counter values across labels
            aggregated[metric_name] = sum(sample.value for sample in samples)
        elif 'user_id' in samples[0].labels:  # High cardinality
            # Aggregate by service, drop user-specific labels
            by_service = defaultdict(list)
            for sample in samples:
                service = sample.labels.get('service', 'unknown')
                by_service[service].append(sample.value)
            aggregated[f"{metric_name}_by_service"] = {
                service: sum(values) for service, values in by_service.items()
            }
        else:  # Low cardinality - preserve as-is
            aggregated[metric_name] = samples
    return aggregated
```
This approach reduced our Prometheus storage by 75% while maintaining query performance.
Architecture & Tech Stack
| Layer | Technologies | Why I Chose Them |
|---|---|---|
| Frontend | React.js, Tailwind CSS, Plotly | Fast development, rich visualizations |
| Backend | FastAPI, asyncio, Pydantic | High performance, async-native |
| Database | MongoDB (with Motor) | Flexible schema for diverse log formats |
| AI/ML | Groq API, Ollama, Facebook Prophet | Cost-effective LLM access, proven forecasting |
| Monitoring | Prometheus, Custom Scraper | Industry standard + custom optimizations |
| Security | JWT Authentication | Stateless, scalable auth |
| Infrastructure | Docker, Docker Compose | Reproducible deployments |
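For the infrastructure layer, the whole stack comes up with one `docker compose up`. A minimal sketch of what such a compose file could look like (service names, ports, and image tags here are my assumptions, not the project's actual file):

```yaml
services:
  backend:
    build: ./backend          # FastAPI app
    ports: ["8000:8000"]
    environment:
      - MONGO_URL=mongodb://mongo:27017
    depends_on: [mongo, prometheus]
  frontend:
    build: ./frontend         # React app
    ports: ["3000:3000"]
  mongo:
    image: mongo:7
    volumes: [mongo_data:/data/db]
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
volumes:
  mongo_data:
```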
Key Lessons: Building for Reality
1. Production-Ready Means Details Matter
The difference between a demo and a tool people trust:
- Handling edge cases in LLM response parsing
- Graceful shutdowns that preserve data integrity
- Resource management that prevents system overload
- Error handling that fails safely
2. Choose Tools for the Job, Not the Resume
- Streamlit for data science dashboards (rapid prototyping)
- FastAPI for high-performance APIs (async-native)
- React for complex SPAs (component reusability)
- MongoDB for flexible log storage (schema evolution)
3. User Experience Drives Technical Decisions
Every technical choice served user needs:
- Virtualized scrolling → developers can browse massive log files
- Metric aggregation → dashboards load in seconds, not minutes
- LLM caching → instant incident analysis during outages
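The TTL cache behind that last point can be as small as this sketch (the 10-minute default TTL is an assumption):

```python
import time

class TTLCache:
    """Tiny TTL cache for LLM analyses: identical incidents within the
    window are served from memory instead of a paid API call."""
    def __init__(self, ttl_s: float = 600.0):
        self.ttl_s = ttl_s
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict stale entries
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)
```

Lazy eviction on read keeps the class tiny; at this scale there is no need for a background reaper thread.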
What's Next: Beyond the Internship
I'm excited to bring this systems thinking and user-focused engineering approach to a full-time role in DevOps/Observability. The challenges are just getting started.
Let's connect! Reach out on LinkedIn to discuss observability, distributed systems, or building tools that developers actually love using.