Day 10: Infrastructure Crisis & Strategic Foundation - When Everything Comes Together

Clay Roach

The Plan: Continue LLM Manager implementation and begin AI-powered analytics

The Morning Reality: "Wait... why are all our service names showing as protobuf JSON objects instead of strings?!"

The Afternoon Discovery: "Hold on, we have 5 major strategic ADRs and design documents sitting unmerged!"

Welcome to Day 10 of building an AI-native observability platform in 30 days. Some development days give you exciting new features, others give you critical infrastructure fixes, but the truly valuable days give you both—along with the strategic clarity to move forward confidently.

The Crisis: When Data Structure Assumptions Break

Picture this: You're building an AI-native observability platform, feeling confident about your foundation, ready to implement sophisticated analytics. You spin up your OpenTelemetry demo, check the ingested data, and find this in your database:

{
  "service_name": {
    "$typeName": "opentelemetry.proto.common.v1.AnyValue",
    "stringValue": "frontend"
  },
  "span_name": {
    "$typeName": "opentelemetry.proto.common.v1.AnyValue", 
    "stringValue": "/api/products"
  }
}

Instead of clean values like:

{
  "service_name": "frontend",
  "span_name": "/api/products"
}

This is what we call a "data structure crisis"—the kind that makes you question every assumption you've built your system on.

Root Cause Analysis: The @bufbuild/protobuf Format

The issue traced back to commit 64be377 where we upgraded to @bufbuild/protobuf for better TypeScript integration. This library represents protobuf data with explicit type metadata and case-based value extraction—excellent for type safety, challenging for direct data processing.

Here's what @bufbuild/protobuf gives you:

// Instead of simple values
const serviceName = "frontend";

// You get complex nested objects
const serviceName = {
  $typeName: "opentelemetry.proto.common.v1.AnyValue",
  stringValue: "frontend"
};

// Or for integers (the BigInt surprise)
const duration = {
  $typeName: "opentelemetry.proto.common.v1.AnyValue", 
  intValue: 1500000n  // Note: BigInt, not number!
};
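
For context on the "case-based value extraction" mentioned above: in the TypeScript types that @bufbuild/protobuf generates, AnyValue's oneof is modeled with a case discriminant; the flattened $typeName shape shown here is how the data looked once it landed in our pipeline. A simplified sketch of that generated union:

// Sketch of how @bufbuild/protobuf models the AnyValue oneof in its
// generated TypeScript types (simplified; not our pipeline code).
type AnyValueCase =
  | { case: 'stringValue'; value: string }
  | { case: 'intValue'; value: bigint }
  | { case: 'boolValue'; value: boolean }
  | { case: 'doubleValue'; value: number }
  | { case: undefined; value?: undefined };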

The Fix: Recursive Protobuf Value Extraction

The solution required building a comprehensive recursive extraction function that handles all protobuf value types while managing JavaScript's quirks:

function extractProtobufValue(value: any): any {
  // Handle protobuf objects with type metadata
  if (value && typeof value === 'object' && value.$typeName) {
    // String values
    if (value.stringValue !== undefined) return value.stringValue;

    // Integer values (convert BigInt to string for JSON compatibility)
    if (value.intValue !== undefined) return String(value.intValue);

    // Boolean values
    if (value.boolValue !== undefined) return value.boolValue;

    // Double/float values
    if (value.doubleValue !== undefined) return value.doubleValue;

    // Array values (recursive processing)
    if (value.arrayValue) {
      return value.arrayValue.values?.map(extractProtobufValue) || [];
    }

    // Key-value list values (nested object processing)
    if (value.kvlistValue) {
      const result: Record<string, any> = {};
      value.kvlistValue.values?.forEach((kv: any) => {
        if (kv.key) {
          result[kv.key] = extractProtobufValue(kv.value);
        }
      });
      return result;
    }
  }

  // Handle JavaScript BigInt serialization issues
  if (typeof value === 'bigint') {
    return value.toString();
  }

  // Handle Buffer trace/span IDs (convert to hex strings)
  if (Buffer.isBuffer(value)) {
    return value.toString('hex');
  }

  return value;
}
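
With that in place, flattening a raw record is a single pass over its fields. A minimal usage sketch (rawSpan and the field names here are illustrative, not our exact ingestion code):

// Hypothetical usage: flatten one raw span before writing it to storage.
// rawSpan stands in for a record decoded by @bufbuild/protobuf.
declare const rawSpan: Record<string, unknown>;

const cleanSpan = {
  service_name: extractProtobufValue(rawSpan.service_name),  // "frontend"
  span_name: extractProtobufValue(rawSpan.span_name),        // "/api/products"
  duration: extractProtobufValue(rawSpan.duration)           // "1500000" (BigInt converted to string)
};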

The JavaScript BigInt Challenge

One fascinating aspect of this fix was dealing with BigInt serialization. Modern protobuf libraries use BigInt for integer values to handle the full range of 64-bit integers, but JavaScript's JSON.stringify() doesn't natively support BigInt:

// This fails with "TypeError: Do not know how to serialize a BigInt"
JSON.stringify({ duration: 1500000n });

// This works
JSON.stringify({ duration: "1500000" });

Our solution handles this transparently in the extraction layer, converting all BigInt values to strings while preserving their numeric meaning for later processing.
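
We do the conversion in the extraction layer, but the same guard can also live at serialization time. A minimal alternative sketch (not the approach we shipped) using a JSON.stringify replacer:

// Alternative sketch: convert BigInt values at serialization time
// with a JSON.stringify replacer instead of in the extraction layer.
const bigintSafe = (_key: string, value: unknown) =>
  typeof value === 'bigint' ? value.toString() : value;

JSON.stringify({ duration: 1500000n }, bigintSafe);  // '{"duration":"1500000"}'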

UI Improvements: Making Long Status Codes Readable

While fixing the backend, we also tackled a UI usability issue. OpenTelemetry status codes like STATUS_CODE_UNSET were causing column width overflow issues. The solution was elegant:

// Display mapping for better UI
const statusDisplayMap: Record<string, string> = {
  'STATUS_CODE_UNSET': 'UNSET',
  'STATUS_CODE_OK': 'OK',
  'STATUS_CODE_ERROR': 'ERROR'
};

// But preserve full semantic meaning in tooltips
<span title={fullStatusCode}>
  {statusDisplayMap[status] || status}
</span>

This maintains OpenTelemetry compliance in the backend while providing a clean, readable UI experience.

Testing Strategy: Validation Through Real Data

The fix was validated through end-to-end integration with the OpenTelemetry demo, successfully processing telemetry from its 16 services, including:

  • Frontend Services: frontend, frontend-proxy
  • Business Logic: ad, cart, checkout, payment, recommendation
  • Data Services: product-catalog, currency, shipping
  • Infrastructure: email, accounting, fraud-detection
  • Support: load-generator, flagd

Each service now contributes clean, properly extracted telemetry data ready for AI processing.
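
Beyond the demo integration, the extraction logic itself is easy to pin down with unit tests. A minimal vitest sketch (the import path is illustrative):

import { describe, expect, it } from 'vitest';
import { extractProtobufValue } from './protobuf-utils'; // hypothetical path

describe('extractProtobufValue', () => {
  it('unwraps AnyValue strings', () => {
    expect(extractProtobufValue({
      $typeName: 'opentelemetry.proto.common.v1.AnyValue',
      stringValue: 'frontend'
    })).toBe('frontend');
  });

  it('converts BigInt integers to strings', () => {
    expect(extractProtobufValue({
      $typeName: 'opentelemetry.proto.common.v1.AnyValue',
      intValue: 1500000n
    })).toBe('1500000');
  });
});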

Development Workflow Improvements

This debugging session also led to several workflow improvements:

// Enhanced vitest configuration
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    // Prevent false test discovery in dependencies
    exclude: ['**/node_modules/**', '**/dist/**']
  }
});
We also added a quick validation entry point:
# New validation script for ad-hoc testing
pnpm dev:validate  # Quick ingestion validation
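
A rough sketch of what such a validation script can do (the endpoint and wiring here are assumptions, not our exact implementation): fetch recently ingested spans and fail loudly if any protobuf wrappers survived extraction.

// Hypothetical validation sketch: assert no AnyValue wrappers leaked through.
const res = await fetch('http://localhost:4318/api/spans/recent'); // assumed endpoint
const spans: Array<Record<string, unknown>> = await res.json();

for (const span of spans) {
  for (const [key, value] of Object.entries(span)) {
    if (value && typeof value === 'object' && '$typeName' in value) {
      throw new Error(`Unextracted protobuf value in field "${key}"`);
    }
  }
}
console.log(`Validated ${spans.length} spans: all values cleanly extracted.`);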

The AI-Native Observability Connection

This fix was crucial for our AI-native vision. Clean, properly structured telemetry data is essential for:

  • Pattern Recognition: AI models need consistent data formats
  • Anomaly Detection: Statistical analysis requires numeric values, not nested objects
  • Dashboard Generation: LLMs generating queries need predictable schema
  • Context Understanding: Attribute extraction enables semantic analysis

Without this fix, our planned AI features would have been processing garbage data wrapped in protobuf metadata.

Key Takeaways for Complex System Development

1. Library Upgrades Have Consequences

Every library change can introduce subtle breaking changes in data processing pipelines. Always validate end-to-end data flow after upgrades.

2. Type Safety vs. Processing Simplicity

@bufbuild/protobuf provides excellent TypeScript integration but requires extraction layers for data processing. The type safety is worth the complexity.

3. JavaScript's Serialization Quirks

BigInt support in JavaScript is powerful but comes with JSON serialization challenges. Plan for explicit conversion in data processing layers.

4. UI/Backend Separation

Keep display formatting separate from semantic data storage. Tooltips can bridge the gap between usability and debugging needs.

5. Real-World Validation

Integration testing with actual OpenTelemetry demo services reveals issues that unit tests miss. Always validate with realistic data.

The Strategic Foundation: Recovered ADRs and Design Vision

While debugging protobuf issues, we discovered something equally important: a treasure trove of strategic work sitting unmerged in a forgotten branch. Five major Architecture Decision Records (ADRs) and comprehensive design documents that provide crucial guidance for the platform's future:

The Strategic Documents Recovered

  1. ADR-007: MCP Server Architecture - Standardized LLM interfaces for observability data
  2. ADR-008: Automated Market Intelligence - AI-powered competitive analysis system
  3. ADR-009: Project Guardian Optimization - GitHub Actions security and workflow automation
  4. ADR-010: Enhanced EUM Framework - Advanced end-user monitoring capabilities
  5. ADR-011: Blockchain Business Model - Customer-investor participation strategy

Team Evolution and Development Philosophy

The strategic documents also captured key insights from discussions about the future of software development:

The Junior Developer Advantage: Recent college graduates are better positioned for LLM-native development than mid-career developers. They have no preconceived notions about "what a backend developer should be" and natural comfort with AI tools.

Dual 4-Hour Segments Philosophy:

  • Segment 1: Deep focus development with AI-assisted coding
  • Segment 2: Optimization, customer engagement, and process improvement

The Automation Engineer Paradigm: Developers can't compete with LLMs for code generation, but excel as automation engineers who work at the edges of AI generation, ensure quality, and eliminate redundancies.

Rapid Development Cognitive Advantages

A key insight emerged: when development timescales shift from weeks/months to hours/days, it fundamentally changes how we approach software design. You can keep the entire feature in your head from conception to completion, get immediate integration learning while context is fresh, and act as the first consumer of your own features.

This enabled better architectural decisions—for example, choosing LLM-based topology analysis over complex graph-based data stores, because we could test the effectiveness immediately rather than committing to unnecessary infrastructure.

The Combined Impact: Infrastructure + Strategy

The day's dual achievements—fixing critical infrastructure AND recovering strategic guidance—created a powerful foundation:

Clean Data Foundation: All 16 OpenTelemetry demo services now contribute properly formatted telemetry data, enabling reliable AI processing.

Strategic Clarity: Five ADRs provide architectural guidance for advanced features like MCP servers, market intelligence, and customer-investor participation models.

Development Philosophy: Clear understanding of how AI-assisted rapid development creates cognitive advantages and enables better architecture through immediate validation cycles.

Tomorrow's Focus: Analytics Implementation with Strategic Guidance

With both reliable protobuf parsing AND strategic direction in place, Day 11 will focus on implementing AI-powered analytics features:

  • Anomaly detection using clean telemetry data
  • Pattern recognition across service interactions guided by ADR strategies
  • Dashboard generation with MCP server architecture considerations
  • Foundation for LLM-driven insights with market intelligence integration

The crisis is resolved, the strategy is clear, and we're ready to build sophisticated AI features on both reliable data and sound architectural principles.

The Bigger Picture: AI-Native Development Evolution

This day reinforces two key principles of AI-native development:

  1. Data quality is everything - No amount of sophisticated ML models can compensate for corrupted input data
  2. Strategic documentation is as critical as code - ADRs and design documents provide the architectural guidance that prevents technical debt and enables scalable development

In traditional development, you might accept some data quality issues and work around them. Strategic documents often become stale or forgotten. In an AI-native platform, both clean data AND clear strategic guidance are essential for the AI to understand and effectively contribute to your platform's evolution.

The time invested in bulletproof data ingestion AND comprehensive strategic documentation pays dividends in every AI feature built on top.


Day 10 Status: Critical Infrastructure + Strategic Foundation Complete ✅

Next Challenge: AI-Powered Analytics Implementation with Clear Architectural Guidance

Key Learnings: Data quality and strategic clarity are both foundations of AI-native observability

Part of the "30-Day AI-Native Observability Platform" series. Follow along as we build a complete observability platform using AI-assisted development in just 30 days.
