Day 23: LLM Manager Service Layer Refactor - Consolidating Multi-Model AI Integration
September 4th, 2025
Day 23 was an intensive 10-hour development sprint focused on consolidating multiple redundant LLM manager implementations into a unified Effect-TS service layer. This refactor resolved performance issues, fixed broken multi-model routing, and established AI integration patterns for the final week of development.
The Problem: Technical Debt from Rapid Prototyping
After 22 days of rapid development, the LLM integration had accumulated significant technical debt:
# Multiple competing implementations
src/llm-manager/llm-manager.ts # Original implementation
src/llm-manager/simple-manager.ts # Simplified version
src/llm-manager/llm-manager-live.ts # Effect-TS attempt
src/ui-generator/query-generator/*.ts # Duplicate LLM logic
# Result: 3+ different ways to call LLMs
# Only local models working, GPT/Claude routing broken
# 25+ second timeouts on integration tests
Phase 1: Performance Issue Resolution (Morning)
The day began with integration tests timing out after 25+ seconds. Investigation revealed our diagnostic prompts had grown to over 9,000 characters.
Initial query generation showing verbose SQL with problematic service name handling and malformed queries
// Before: Overly verbose instructions
export const DIAGNOSTIC_QUERY_INSTRUCTIONS = `
You are an expert ClickHouse SQL query generator for OpenTelemetry trace analysis.
CRITICAL REQUIREMENTS:
1. Generate ONLY valid ClickHouse SQL - no markdown, no explanations
2. Use the exact schema provided
3. Focus on traces with actual issues (errors, high latency, unusual patterns)
4. Create CTEs for complex filtering logic
5. Apply trace-level filtering using problematic_traces CTE
[... 9,000+ more characters of instructions ...]
`;
The Solution: Streamlined Prompting
We simplified to focused, directive prompts:
// After: Concise, focused instructions
const CORE_SQL_RULES = `
Generate ClickHouse SQL for OpenTelemetry traces.
Schema: trace_id, span_id, service_name, operation_name, duration_ns, status_code
Focus on: errors (status_code != 'STATUS_CODE_OK'), high latency (duration_ns > 1000000000)
Format: Raw SQL only, no markdown
`;
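As a rough sketch of how a prompt this small can be combined with a request before it is sent to a model (the buildDiagnosticPrompt helper and request shape are illustrative assumptions, not the project's actual API):
// Hypothetical request shape and helper, shown only to illustrate prompt assembly
interface DiagnosticRequest {
  question: string           // e.g. "Which services show elevated error rates?"
  timeWindowMinutes: number
}

const buildDiagnosticPrompt = (request: DiagnosticRequest): string =>
  [
    CORE_SQL_RULES,
    `Time window: last ${request.timeWindowMinutes} minutes`,
    `Question: ${request.question}`
  ].join("\n")

// The assembled prompt stays a few hundred characters instead of 9,000+
const prompt = buildDiagnosticPrompt({
  question: "Which services show elevated error rates?",
  timeWindowMinutes: 15
})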
Result: 25+ seconds → 2-3 seconds (roughly a 10x improvement)
Successful query results after optimization showing percentile analysis across services
Phase 2: Service Layer Consolidation - PR #46 (Afternoon)
The main achievement of Day 23 was consolidating all LLM implementations into a unified Effect-TS Layer architecture. This refactor was crucial for establishing proper dependency injection patterns and making the codebase more maintainable:
Before: Fragmented Implementation
// Multiple competing patterns across the codebase
class LLMManager { /* Original approach */ }
class SimpleManager { /* Simplified but limited */ }
const LLMManagerLive = /* Effect-TS but incomplete */
// Each with different:
// - Configuration patterns
// - Error handling approaches
// - Model routing logic
// - API client implementations
After: Unified Effect-TS Layer Architecture
The key innovation in PR #46 was adopting Effect-TS Layer patterns throughout the LLM manager, enabling proper dependency injection and testability:
// Layer-based architecture with proper dependency injection
export const LLMManagerLive = Layer.succeed(
LLMManager,
LLMManager.of({
generateSQL: (request) =>
Effect.gen(function* () {
const model = yield* selectOptimalModel(request)
const result = yield* executeWithModel(model, request)
return yield* validateAndReturn(result)
}).pipe(
Effect.timeout("30 seconds"),
Effect.retry({ times: 2 })
),
analyzeTraces: (traces) =>
Effect.all([
gptAnalysis(traces),
claudeAnalysis(traces),
llamaAnalysis(traces)
], {
concurrency: "unbounded",
discard: false
}).pipe(
Effect.map(consolidateAnalysis)
)
})
)
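Consumers depend only on the LLMManager tag and receive the live implementation when the layer is provided at the application edge. A minimal usage sketch, assuming a request object along these lines (the shape is illustrative, not the actual interface):
import { Effect } from "effect"
// LLMManager and LLMManagerLive come from the consolidated llm-manager module

const program = Effect.gen(function* () {
  const manager = yield* LLMManager
  // Hypothetical request payload, for illustration only
  return yield* manager.generateSQL({
    question: "Find the slowest operations in the last 15 minutes"
  })
})

// Supply the live layer once, at the edge of the application
const runnable = program.pipe(Effect.provide(LLMManagerLive))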
Key Refactoring Achievements
- Code Reduction: 809 lines deleted (net), ~50% redundancy eliminated
- Effect-TS Layer Architecture: Proper dependency injection and composition patterns
- Fixed Multi-Model Routing: Previously only worked with local models
- Structured Error Handling: Effect-TS patterns for graceful degradation
- Type Safety: Eliminated TypeScript compilation errors
- Testability: Mock layers can be swapped in for tests (see the sketch after this list)
- Test Coverage: 178/179 tests passing with the mock layer implementation
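As a sketch of what that swap can look like in practice, a test layer can satisfy the same LLMManager tag with canned responses (the responses and shapes below are illustrative assumptions, not the project's actual test code):
// Hypothetical mock layer: no network calls, deterministic output
const LLMManagerTest = Layer.succeed(
  LLMManager,
  LLMManager.of({
    generateSQL: () =>
      Effect.succeed("SELECT service_name, count() AS errors FROM traces GROUP BY service_name"),
    analyzeTraces: () =>
      Effect.succeed({ summary: "no anomalies detected", confidence: 1 })
  })
)

// The same program from the earlier usage sketch runs against mock or live
// infrastructure by swapping a single layer
const testRun = program.pipe(Effect.provide(LLMManagerTest))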
Phase 3: Testing Strategy Documentation - ADR-015 (Evening)
Architectural Decision Record ADR-015 was created to document a multi-level testing strategy. It proposes using Effect-TS Layer patterns to enable testing levels with different speed/realism trade-offs; the implementation itself is planned for a later phase of development.
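One way the proposed levels could eventually map onto Layer selection is sketched below; the level names, the LLMManagerLocal layer, and the environment variable are assumptions used only to illustrate the ADR's idea:
// Sketch of ADR-015's proposal: pick a layer per test level (not implemented yet)
const layerForTestLevel = (level: string) => {
  switch (level) {
    case "unit":        return LLMManagerTest   // canned responses, milliseconds
    case "integration": return LLMManagerLocal  // hypothetical local-model layer, seconds
    case "e2e":         return LLMManagerLive   // real providers, slowest and costliest
    default:            return LLMManagerTest
  }
}

const testLayer = layerForTestLevel(process.env.TEST_LEVEL ?? "unit")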
Phase 4: Comprehensive Test Suite Expansion
Created 6 new test suites validating AI diagnostic capabilities.
UI component with integrated "Generate Diagnostic Query" button for critical path analysis
The test suites were created to validate the entire diagnostic pipeline from UI interaction to query execution.
Test Suite Expansion
describe("Diagnostic Query Generation", () => {
test("generates valid ClickHouse SQL", async () => {
const query = await generateDiagnosticQuery(PROBLEMATIC_TRACES)
// Syntax validation
expect(query).toMatch(/^WITH problematic_traces AS/)
expect(query).not.toMatch(/```/) // No markdown fences
// Schema compliance
expect(query).toMatch(/FROM traces/)
expect(query).toMatch(/status_code != 'STATUS_CODE_OK'/)
// Performance patterns
expect(query).toMatch(/start_time >= now\(\) - INTERVAL 15 MINUTE/)
})
test("focuses on actual problems", async () => {
const traces = generateProblematicTraceScenarios()
const query = await generateDiagnosticQuery(traces)
const results = await executeQuery(query)
expect(results.problematic_count).toBeGreaterThan(0)
expect(results.health_status).toBe('unhealthy')
})
})
Phase 5: Unit Test Coverage Improvement
The final phase addressed CI/CD failures due to low test coverage:
Coverage Improvement
# Before
File | % Stmts | % Lines | % Funcs
-------------------|---------|---------|--------
llm-manager/ | 0.83 | 0.46 | 0.00
# After
File | % Stmts | % Lines | % Funcs
-------------------|---------|---------|--------
llm-manager/ | 48.21 | 42.33 | 35.71
# Significant improvement in line coverage
39 New Unit Tests Added
Focus areas for unit testing (a sample test follows the list):
- Configuration Management: Environment variable handling and validation
- Model Registry: Model metadata and capability tracking
- API Client Abstraction: HTTP client behavior and error scenarios
- Route Management: Intelligent model selection logic
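A configuration test in this spirit might look like the following; the loadLLMConfig function, environment variable name, and default value are illustrative assumptions:
// Hypothetical config-loading tests: names and defaults are assumptions
describe("Configuration Management", () => {
  test("falls back to defaults when env vars are missing", () => {
    delete process.env.LLM_TIMEOUT_MS
    const config = loadLLMConfig()
    expect(config.timeoutMs).toBe(30_000)
  })

  test("rejects malformed numeric values", () => {
    process.env.LLM_TIMEOUT_MS = "not-a-number"
    expect(() => loadLLMConfig()).toThrow()
  })
})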
Technical Lessons Learned
1. Consolidation Before Innovation
The refactor taught us that technical debt compounds quickly in AI systems. By consolidating first, we:
- Reduced complexity by 50%
- Fixed previously hidden bugs
- Established consistent patterns
- Improved performance significantly
2. Effect-TS Layer Pattern for AI Orchestration
// Complex AI workflows become elegant
const parallelAnalysis = Effect.all(
models.map(model =>
analyzeWithModel(model, data).pipe(
Effect.timeout("30 seconds"),
Effect.retry({ times: 2 })
)
),
{ concurrency: "unbounded" }
).pipe(
Effect.map(consolidateResults),
Effect.catchAll(() => Effect.succeed(fallbackAnalysis))
)
The Effect-TS Layer pattern provides type safety, timeout handling, and structured error management, all of which were central to the LLM manager refactor in PR #46.
3. Testing AI Systems Requires Multiple Strategies
The ADR-015 testing strategy proposes a multi-level approach that balances speed, accuracy, and cost, though it remains to be implemented in a later phase of development.
4. Prompt Optimization Impacts Performance
The most impactful optimization was simplifying prompts. Verbose instructions not only slow responses but also degrade model output quality.
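One lightweight way to keep prompt bloat from creeping back in is a size-budget test on the assembled prompt, sketched here reusing the hypothetical buildDiagnosticPrompt helper from the Phase 1 sketch (the 2,000-character budget is an arbitrary illustrative threshold):
// Hypothetical regression guard against prompt bloat
test("diagnostic prompt stays within its size budget", () => {
  const prompt = buildDiagnosticPrompt({
    question: "Which services show elevated error rates?",
    timeWindowMinutes: 15
  })
  expect(prompt.length).toBeLessThan(2_000) // budget value is an assumption
})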
Progress Update: Day 23 of 30
We're now 78% complete (up from 73% this morning), entering the final week with:
Technical Foundation:
- ✅ Unified LLM integration architecture
- ✅ Sub-3-second response times
- ✅ Comprehensive testing strategy
- ✅ 178/179 tests passing consistently
Quality Metrics Achieved:
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Integration Tests | 169 passing | ✅ 169/169 | MET |
| LLM Performance | <10s response | ✅ <3s response | EXCEEDED |
| Test Coverage | >5% LLM manager | ✅ 42.33% | EXCEEDED |
| Code Quality | TypeScript clean | ✅ All compile | MET |
What's Next: 4-Day Break, Then Final Sprint
After this 10-hour sprint, a 4-day break begins (family visiting). The project resumes Monday in excellent technical position:
Week 4 Focus:
- Production deployment automation
- Performance monitoring integration
- Documentation completion
- Demo preparation and showcase
Key Takeaways for AI System Development
- Consolidate Early: Address technical debt in AI integration layers before it compounds
- Use Effect-TS Layers: The Layer pattern provides excellent dependency injection for AI services
- Test Strategically: Multiple testing levels help balance speed and accuracy
- Optimize Prompts: Prompt length and complexity directly impact performance
- Measure Everything: AI system behavior needs continuous monitoring
The refactoring work on Day 23 focused on architectural improvements rather than new features, establishing the technical foundation needed for the final week's development. The Effect-TS Layer refactor in PR #46 particularly improved the codebase's maintainability and testability.
This post is part of the "30-Day AI-Native Observability Platform" series, documenting the complete development journey from concept to production deployment.