Day 15: From 'Works on My Machine' to Bulletproof CI/CD - Building Development Insurance
The Plan: Continue advanced AI feature development
The Reality: "Sometimes the most important work is building bulletproof infrastructure"
Welcome to Day 15 of building an AI-native observability platform in 30 days! Today's focus was implementing comprehensive CI/CD infrastructure: a systematic move from "works on my machine" to production-ready automation that exposed critical issues and led to major architectural improvements.
The GitHub Actions Implementation: Building Development Insurance
Rather than continuing with feature development, Day 15 focused on establishing bulletproof CI/CD infrastructure. This proved to be the right decision as it immediately exposed issues that would have caused problems later.
Primary Workflow: claude-code-integration.yml
The main workflow provides comprehensive automation with multiple triggers:
name: Claude Code Integration Pipeline
on:
  pull_request:
    types: [opened, synchronize, reopened]
  issue_comment:
    types: [created]
  push:
    branches: [main, test/*, feat/*]
  workflow_dispatch:
Key Features Implemented:
- Multi-trigger automation: PR comments, PRs, pushes, manual dispatch
- Claude Code integration: Automated PR reviews with AI assistance
- Comprehensive test pipeline: TypeScript, ESLint, Prettier, unit, integration, E2E
- Docker services orchestration: Full-stack testing with real services
- Coverage reporting: Integrated with PR comments for immediate feedback
Protection Workflow: never-break-main.yml
The secondary workflow provides production-grade main branch protection:
name: Never Break Main - Comprehensive Validation
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
Production-Ready Validation:
- 30-minute comprehensive testing with real services
- Database migration validation with ClickHouse
- OpenTelemetry demo integration testing
- Docker build verification across all services
- Coverage thresholds with automated reporting
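To make the "coverage thresholds" gate concrete, here is a minimal sketch of the kind of check such a step could run. The metric names mirror what common coverage reporters emit, but the function name and the threshold numbers are assumptions for illustration, not values from this project.

```typescript
// Hypothetical coverage gate: the names and numbers are illustrative only.
interface CoverageSummary {
  readonly lines: number
  readonly branches: number
  readonly functions: number
  readonly statements: number
}

type Metric = keyof CoverageSummary

// Return one message per metric that falls below its configured threshold.
// Metrics without a threshold are not checked.
function coverageFailures(
  actual: CoverageSummary,
  thresholds: Partial<CoverageSummary>
): string[] {
  const metrics: Metric[] = ['lines', 'branches', 'functions', 'statements']
  const failures: string[] = []
  for (const metric of metrics) {
    const minimum = thresholds[metric]
    if (minimum !== undefined && actual[metric] < minimum) {
      failures.push(`${metric}: ${actual[metric]}% is below the ${minimum}% threshold`)
    }
  }
  return failures
}
```

A CI step could fail the job whenever the returned list is non-empty and post the messages as a PR comment.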
The "Works on My Machine" Problem Discovery
The moment we implemented CI/CD, several critical issues became apparent:
1. Docker Volume Mount Pollution
Issue Discovered: The UI development setup was creating .pnpm-store directories on the host system during Docker builds.
# The problematic volume mount
volumes:
  - ./ui:/app
  - /app/node_modules
Root Cause: pnpm's default store directory was being created in the mounted volume, polluting the host repository.
Solution Implemented:
# Configure pnpm to use isolated store directory
RUN pnpm config set store-dir /tmp/pnpm-store
2. Integration Test Architecture Issues
Issue Discovered: Tests passed locally but failed in CI due to service connectivity problems.
Problems Found:
- Container orchestration timing issues
- Port conflicts between services
- Database connection string inconsistencies
Solutions Applied:
- Strategic service startup delays with health checks
- Standardized environment variable patterns
- Comprehensive infrastructure validation commands
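The "startup delays with health checks" idea can be sketched as a polling loop that waits for a service to report healthy instead of sleeping for a fixed time. This is a minimal illustration, not the project's code: `waitForHealthy`, `HealthProbe`, and the option names are hypothetical.

```typescript
// Hypothetical helper: poll a health probe until it succeeds or attempts run out.
// None of these names come from the project; they only illustrate the pattern.
type HealthProbe = () => Promise<boolean>

interface WaitOptions {
  readonly retries: number // maximum number of probe attempts
  readonly intervalMs: number // pause between attempts
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms))

// Resolves with the number of attempts used; rejects if the service never reports healthy.
async function waitForHealthy(probe: HealthProbe, options: WaitOptions): Promise<number> {
  for (let attempt = 1; attempt <= options.retries; attempt++) {
    // A probe that rejects (e.g. connection refused) counts as "not healthy yet".
    const healthy = await probe().catch(() => false)
    if (healthy) return attempt
    if (attempt < options.retries) await sleep(options.intervalMs)
  }
  throw new Error(`service not healthy after ${options.retries} attempts`)
}
```

In an integration test setup, the probe would typically wrap a cheap query such as `SELECT 1` against the database, so tests only start once the dependency actually answers.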
3. Build System Inconsistencies
Issue Discovered: Different build behaviors between local and CI environments.
# Local (works)
pnpm install
# CI (failed initially)
pnpm install --frozen-lockfile
Root Cause: Lockfile inconsistencies and node-gyp compilation issues in CI environment.
Solution: Strategic use of the --ignore-scripts and --no-frozen-lockfile flags, based on context.
The Storage Architecture Consolidation
While fixing CI issues, we discovered architectural complexity that needed addressing:
Eliminating Duplicate Storage Layers
Before: Multiple storage implementations with inconsistent patterns
// Multiple storage classes with different approaches
class SimpleStorage { /* custom implementation */ }
class StorageAPIClient { /* Effect-TS patterns */ }
After: Unified Effect-TS architecture throughout
// Single source of truth with consistent patterns
export interface StorageAPIClient {
  readonly writeOTLP: (data: OTLPData, encodingType?: 'protobuf' | 'json') => Effect.Effect<void, StorageError>
  readonly queryRaw: (sql: string) => Effect.Effect<unknown[], StorageError>
  readonly healthCheck: () => Effect.Effect<{ clickhouse: boolean; s3: boolean }, StorageError>
}
Type Safety Improvements
The CI/CD implementation exposed numerous type safety issues that were silently failing locally:
Issues Found:
- 15+ instances of any types across frontend and backend
- Missing null safety patterns
- Inconsistent error handling approaches
Solutions Applied:
// Before: Type safety compromises
const result: any = response.data
const items: any[] = result.items
// After: an explicit row type instead of any (the cast is checked at compile time only)
interface TraceQueryResult {
  trace_id: string
  service_name: string
  encoding_type: string
}
const result = response.data as TraceQueryResult[]
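A type assertion tells the compiler what shape to expect but does not check the data at runtime. One way to close that gap is a type guard, sketched below. The `TraceQueryResult` shape is taken from the snippet above; `isTraceQueryResult` and `parseTraceRows` are hypothetical names, and the project may validate responses differently.

```typescript
// Row shape taken from the post; the guard and parser names are hypothetical.
interface TraceQueryResult {
  trace_id: string
  service_name: string
  encoding_type: string
}

// Narrow an unknown value to TraceQueryResult by checking each field at runtime.
function isTraceQueryResult(value: unknown): value is TraceQueryResult {
  if (typeof value !== 'object' || value === null) return false
  const row = value as Record<string, unknown>
  return (
    typeof row['trace_id'] === 'string' &&
    typeof row['service_name'] === 'string' &&
    typeof row['encoding_type'] === 'string'
  )
}

// Validate a whole response payload, failing loudly on the first malformed row.
function parseTraceRows(data: unknown): TraceQueryResult[] {
  if (!Array.isArray(data)) throw new TypeError('expected an array of rows')
  return data.map((row, index) => {
    if (!isTraceQueryResult(row)) {
      throw new TypeError(`row ${index} is not a TraceQueryResult`)
    }
    return row
  })
}
```

With this in place, `parseTraceRows(response.data)` would replace the cast, and a malformed row surfaces as an error at the boundary instead of as an undefined field deeper in the UI.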
Comprehensive Test Coverage Enhancement
The CI/CD pipeline exposed gaps in test coverage:
New Test Categories Added:
- Encoding type validation: JSON vs protobuf ingestion testing
- Storage consolidation tests: Effect-TS pattern validation
- Integration connectivity: Service-to-service communication testing
- Docker volume behavior: Build system artifact testing
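As an illustration of the encoding type validation category, the sketch below maps an HTTP Content-Type header to the 'protobuf' | 'json' union used by writeOTLP. The media types are the ones OTLP/HTTP uses; the function name and the idea of deriving the encoding from the header are assumptions, since the post does not show how the project decides.

```typescript
type EncodingType = 'protobuf' | 'json'

// Hypothetical helper: derive the encoding from a Content-Type header value.
// Returns undefined for anything that is not a recognised OTLP/HTTP media type.
function encodingFromContentType(contentType: string | undefined): EncodingType | undefined {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const mediaType = (contentType ?? '').split(';')[0]?.trim().toLowerCase()
  if (mediaType === 'application/x-protobuf') return 'protobuf'
  if (mediaType === 'application/json') return 'json'
  return undefined
}
```

A table-driven test over header values (valid, parameterised, mixed case, missing, unrelated) then covers both ingestion paths with very little code.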
Measurable Results: The CI/CD Impact
The systematic approach delivered concrete improvements:
Test Suite Excellence
✅ Unit Tests: 140/140 passing (100% success rate)
✅ Integration Tests: Comprehensive storage and encoding validation
✅ E2E Tests: 36/39 passing (92% success rate)
✅ Type Safety: All ESLint violations resolved, zero `any` types
Infrastructure Reliability
- Build consistency: Same results in local and CI environments
- Clean repository: No build artifacts or pollution
- Service orchestration: Reliable multi-container testing
- Automated quality gates: Broken code blocked from main branch
Developer Experience Improvements
- Fast feedback: PR-level testing with 5-minute results
- Clear error reporting: Detailed failure analysis with line-by-line coverage
- Automated documentation: Screenshot integration and visual updates
- AI-assisted reviews: Claude Code integration for code quality suggestions
Technical Deep Dive: Critical Fixes Applied
1. Docker Configuration Optimization
# UI Dockerfile improvements
FROM node:18-alpine AS development
# Keep the pnpm store outside the bind-mounted /app so it cannot pollute the host
RUN pnpm config set store-dir /tmp/pnpm-store
WORKDIR /app
2. Service Health Check Strategy
# docker-compose.yml health check implementation
healthcheck:
  test: ['CMD', 'clickhouse-client', '--user', 'otel', '--password', 'otel123', '--query', 'SELECT 1']
  interval: 10s
  timeout: 5s
  retries: 10
  start_period: 30s
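These numbers also imply how long a CI job may have to wait before Docker gives up on the container. The sketch below computes a rough upper bound from the values above. It is an approximation based on Docker's documented healthcheck behaviour (failures during the start period are not counted, and the next check is scheduled one interval after the previous one completes), not a figure from this project, and the names are hypothetical.

```typescript
// Values copied from the healthcheck above, expressed in seconds.
interface HealthcheckTiming {
  readonly intervalSec: number
  readonly timeoutSec: number
  readonly retries: number
  readonly startPeriodSec: number
}

// Rough upper bound on how long Docker may take to mark a container unhealthy:
// the start period passes first, then each of the `retries` consecutive failures
// can take up to one interval plus one timeout.
function worstCaseUnhealthySec(t: HealthcheckTiming): number {
  return t.startPeriodSec + t.retries * (t.intervalSec + t.timeoutSec)
}

const clickhouseTiming: HealthcheckTiming = {
  intervalSec: 10,
  timeoutSec: 5,
  retries: 10,
  startPeriodSec: 30
}

// 30 + 10 * (10 + 5) = 180 seconds
const waitBudgetSec = worstCaseUnhealthySec(clickhouseTiming)
```

Any timeout the pipeline uses while waiting for ClickHouse should sit comfortably above this estimate, otherwise the job can fail before the healthcheck has had its full set of retries.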
3. Test Infrastructure Commands
// package.json - standardized test commands
{
  "scripts": {
    "dev:validate": "node test/validate-infrastructure.js",
    "test:integration": "vitest --config vitest.integration.config.ts",
    "test:e2e": "playwright test --reporter=line"
  }
}
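The post does not show what test/validate-infrastructure.js contains. As a sketch of the kind of check a dev:validate step could perform, the snippet below verifies that required environment variables are set before any tests run. The variable names and function names are hypothetical.

```typescript
// Hypothetical variable names; the post does not list the ones the project uses.
const REQUIRED_ENV = ['CLICKHOUSE_HOST', 'CLICKHOUSE_PORT', 'CLICKHOUSE_USER'] as const

type Env = Readonly<Record<string, string | undefined>>

// Return the names of required variables that are unset or blank.
function missingEnv(required: readonly string[], env: Env): string[] {
  return required.filter((name) => (env[name] ?? '').trim() === '')
}

// Collect every problem instead of stopping at the first, so one run reports them all.
function validateEnv(env: Env): { ok: boolean; problems: string[] } {
  const problems = missingEnv(REQUIRED_ENV, env).map((name) => `missing ${name}`)
  const rawPort = (env['CLICKHOUSE_PORT'] ?? '').trim()
  if (rawPort !== '') {
    const port = Number(rawPort)
    if (!(Number.isInteger(port) && port > 0 && port < 65536)) {
      problems.push(`invalid CLICKHOUSE_PORT: ${rawPort}`)
    }
  }
  return { ok: problems.length === 0, problems }
}
```

A script would call `validateEnv(process.env)`, print the problems, and exit non-zero when `ok` is false, so a misconfigured environment fails in seconds instead of halfway through the integration suite.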
Strategic Implications: Why Infrastructure First Matters
This diversion from feature development to infrastructure proved essential:
1. Hidden Issue Discovery
CI/CD immediately exposed problems that would have caused deployment failures later.
2. Quality Gate Establishment
Code that fails the pipeline is blocked from reaching the main branch, which keeps development velocity sustainable.
3. Team Collaboration Readiness
Clean CI/CD enables future team members to contribute confidently.
4. Production Deployment Foundation
Infrastructure patterns established today scale directly to enterprise deployment.
Looking Ahead: The Halfway Point Tomorrow
Day 15's infrastructure work positions us perfectly for Day 16 - the halfway milestone:
✅ Bulletproof CI/CD: Automated testing and quality gates operational
✅ Clean Architecture: Unified storage patterns with Effect-TS throughout
✅ Type Safety: Zero any types, comprehensive error handling
✅ Production Readiness: Infrastructure patterns ready for enterprise scale
✅ Developer Experience: Fast feedback loops and automated workflows
The remaining 15 days can focus on advanced AI features with confidence that our foundation is rock-solid.
Key Takeaways for AI-Native Development
- CI/CD reveals truth: "Works on my machine" problems become apparent immediately with proper automation
- Infrastructure first: Invest in bulletproof foundations before advanced features
- Systematic fixes: Root cause analysis prevents cascading issues later
- Type safety pays: Comprehensive typing eliminates entire categories of bugs
- Effect-TS scales: Functional patterns provide structure that grows with complexity
Day 15 proves that sometimes the most important development work isn't writing new features - it's building the infrastructure that makes everything else possible.
This post is part of my 30-day challenge to build an AI-native observability platform. Follow along as we explore how systematic infrastructure development creates the foundation for advanced AI features.
Previously: Day 14: AI Model Differentiation
Next: Day 16: The Halfway Milestone - Advanced Features Begin
Source Code: GitHub Repository
Top comments (2)
Love how you reframe CI/CD as “development insurance”—a fresh, practical take on the classic “works on my machine” problem. Using GitHub Actions to expose real issues, then consolidating storage with Effect-TS and solid health checks shows real engineering discipline. Appreciate the emphasis on type safety and reproducibility!
@voncartergriffen You know... I think this was my biggest fear and experience using LLMs for code generation - when you follow known engineering disciplines such as:
A lot of this project is proving that, by using well-known engineering disciplines, I can "tame the beast" of a seemingly runaway LLM code generation approach. The documentation furthers this: in the context engineering arena, only the docs need to be consulted (akin to executable specs), which reduces the context window the LLM needs to still generate good code.
Thanks for following along the journey!