Harish Kotra (he/him)

Building a Living Data DNA Platform: Lineage, Time Travel, and AI-Guided Incident Response

Data platforms rarely fail in one obvious place. A small schema change in one table can cascade across transformations, dashboards, and executive reports.

This project, the Living Data DNA Platform, was built to make that chain visible and actionable.

In this post, I’ll walk through:

  • What the app does
  • Architecture and data flow
  • Backend implementation details
  • Frontend implementation details
  • Deployment
  • Lessons learned

What This Project Solves

The platform ingests metadata from OpenMetadata, converts it into a normalized “DNA snapshot” model, stores time-based changes, and runs an AI agent pipeline to:

  • Detect issues
  • Explain root cause
  • Suggest fixes
  • Produce stakeholder-ready summaries

It also includes a guided “incident replay” mode so teams can demo a complete incident lifecycle in under a minute.

Tech Stack

  • Backend: FastAPI, SQLAlchemy
  • Frontend: Next.js (App Router), TypeScript, React Flow
  • Database: PostgreSQL (local/dev), SQLite (Cloud Run lightweight mode)
  • Metadata source: OpenMetadata REST APIs
  • LLM layer: OpenAI-compatible endpoint
  • External context: Tavily + BrightData MCP
  • Infra: Docker Compose (local), Cloud Run (deployed)

Architecture

Architecture Diagram

Core Backend Design

1) Metadata ingestion from OpenMetadata

The ingestion service fetches and normalizes metadata into an internal shape used by the DNA builder.

# backend/app/services/ingestion.py
class MetadataIngestionService:
    async def sync(self, db: Session) -> dict:
        normalized = await self.client.fetch_normalized_metadata()
        for item in normalized:
            ...  # upsert the Dataset row for this item (elided)
            dna = build_dna(item)
            snapshot = DnaSnapshot(
                dataset_id=dataset.id,
                trust_score=dna["trust_score"],
                genes=dna["genes"],
            )
            db.add(snapshot)
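
fetch_normalized_metadata isn't shown in the post, but here's a minimal sketch of what it could look like against OpenMetadata's REST API (GET /api/v1/tables). The field selection and normalization keys are assumptions based on the shapes used later in this post:

# backend/app/services/openmetadata_client.py (hypothetical sketch)
import httpx

class OpenMetadataClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {token}"}

    async def fetch_normalized_metadata(self) -> list[dict]:
        # Pull table entities along with the extra fields the DNA builder needs.
        async with httpx.AsyncClient(headers=self.headers) as client:
            resp = await client.get(
                f"{self.base_url}/api/v1/tables",
                params={"fields": "columns,owner,usageSummary", "limit": 100},
            )
            resp.raise_for_status()
            tables = resp.json().get("data", [])
        # Normalize each table into the internal shape consumed by build_dna.
        return [
            {
                "dataset": t.get("fullyQualifiedName", t["name"]),
                "schema": t.get("columns", []),
                "lineage": [],  # lineage comes from a separate endpoint (elided)
                "usage": t.get("usageSummary", {}),
                "owner": (t.get("owner") or {}).get("name", "unknown"),
                "description": t.get("description", ""),
            }
            for t in tables
        ]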

2) DNA Builder

DNA snapshots represent metadata “genes”:

  • schema_gene
  • lineage_gene
  • usage_gene
  • ownership_gene

# backend/app/services/dna_builder.py
def build_dna(metadata: dict) -> dict:
    trust_score = calculate_trust_score(metadata)
    genes = {
        "schema_gene": metadata.get("schema", []),
        "lineage_gene": metadata.get("lineage", []),
        "usage_gene": metadata.get("usage", {}),
        "ownership_gene": {
            "owner": metadata.get("owner", "unknown"),
            "description": metadata.get("description", ""),
        },
    }
    return {"dataset": metadata["dataset"], "trust_score": trust_score, "genes": genes}
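
calculate_trust_score isn't shown in the post; here's an illustrative sketch of the kind of heuristic it might apply. The specific signals and weights are assumptions, not the real scoring logic:

def calculate_trust_score(metadata: dict) -> float:
    # Start from full trust and subtract for missing governance signals.
    score = 100.0
    if not metadata.get("owner"):
        score -= 20  # unowned datasets are riskier
    if not metadata.get("description"):
        score -= 10  # undocumented datasets are harder to trust
    if not metadata.get("lineage"):
        score -= 15  # no lineage means no blast-radius visibility
    return max(score, 0.0)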

3) Temporal Engine

Each new snapshot is compared against prior snapshots to produce schema and lineage diffs.

# backend/app/services/temporal_engine.py
def compute_schema_diff(previous_genes: dict, current_genes: dict) -> dict:
    ...

def compute_lineage_diff(previous_genes: dict, current_genes: dict) -> dict:
    ...
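
The diff bodies are elided above; as a rough illustration, a schema diff could be computed like this, assuming schema_gene is a list of column dicts with name and dataType keys (the real field names may differ):

def compute_schema_diff(previous_genes: dict, current_genes: dict) -> dict:
    prev = {c["name"]: c for c in previous_genes.get("schema_gene", [])}
    curr = {c["name"]: c for c in current_genes.get("schema_gene", [])}
    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    # A column counts as changed if its declared type differs between snapshots.
    changed = sorted(
        name for name in prev.keys() & curr.keys()
        if prev[name].get("dataType") != curr[name].get("dataType")
    )
    return {"added": added, "removed": removed, "changed": changed}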

4) Agent Orchestration

The orchestrator runs a staged pipeline:

  1. Observer detects issues
  2. Analyst performs root cause analysis
  3. Fixer simulates remediation
  4. Explainer produces final narrative

# backend/app/services/agents/orchestrator.py
issues = self.observer.detect_issues(dataset_name, latest, previous, edges)
analysis = await self.analyst.analyze_issue(...)
fix = self.fixer.suggest_fix(issue, analysis)
narrative, sections = await self.explainer.explain(...)
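
Putting the four stages together, the orchestration loop looks roughly like this; the result shape and any method arguments beyond those shown above are assumptions:

async def run_pipeline(self, dataset_name, latest, previous, edges):
    results = []
    # 1. Observer: detect issues by diffing the latest snapshot against history.
    for issue in self.observer.detect_issues(dataset_name, latest, previous, edges):
        # 2. Analyst: LLM-backed root cause analysis.
        analysis = await self.analyst.analyze_issue(issue)
        # 3. Fixer: simulate and suggest remediation.
        fix = self.fixer.suggest_fix(issue, analysis)
        # 4. Explainer: stakeholder-ready narrative.
        narrative, sections = await self.explainer.explain(issue, analysis, fix)
        results.append({
            "issue": issue, "analysis": analysis,
            "fix": fix, "narrative": narrative, "sections": sections,
        })
    return results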

5) Demo Orchestration Endpoint

The POST /demo/magic-run endpoint creates a reproducible incident path:

  • Upstream schema mutation
  • Downstream break
  • Propagated risk
  • Boardroom brief payload for UI

# backend/app/services/demo_magic.py
def run_magic_demo(db: Session) -> dict:
    ...
    return {
        "datasets": [...],
        "lineage": [...],
        "incident": {...},
        "metrics": {...},
        "boardroom_brief": {...},
        "timeline_events": [...],
    }
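
Wiring the demo service into FastAPI is then a thin route handler. A sketch, assuming a get_db dependency for the SQLAlchemy session (module paths here are guesses):

# backend/app/api/demo.py (hypothetical wiring)
from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session

from app.db import get_db  # assumed session dependency
from app.services.demo_magic import run_magic_demo

router = APIRouter()

@router.post("/demo/magic-run")
def magic_run(db: Session = Depends(get_db)):
    # Seed the reproducible incident path and return the full UI payload.
    return run_magic_demo(db)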

API Surface

Main routes:

  • GET /dna/{dataset}
  • GET /timeline/{dataset}
  • GET /graph
  • POST /analyze
  • POST /simulate-fix
  • POST /refresh-openmetadata
  • POST /demo/magic-run

Example:

curl -X POST http://localhost:8000/analyze \
  -H "content-type: application/json" \
  -d '{"dataset":"sales.orders","question":"Why did this dataset break?"}'
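
The request body for /analyze maps onto a small Pydantic model. A sketch inferred from the payload above (the model name is an assumption):

from pydantic import BaseModel

class AnalyzeRequest(BaseModel):
    dataset: str   # e.g. "sales.orders"
    question: str  # free-form question routed to the agent pipeline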

Frontend Design

The UI is split into purpose-driven views:

  • Dashboard: trust/risk overview and active incidents
  • Graph: lineage map with mutation/blast-radius signal
  • Timeline: event progression and snapshot history
  • Copilot: structured incident explanation

To keep Copilot calls off the browser-to-backend CORS path, Next.js route handlers proxy requests to the backend server-side:

// frontend/app/api/analyze/route.ts
import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const body = await req.json();
  // Forward the request to the FastAPI backend and relay its response.
  const upstream = await fetch(`${API_BASE}/analyze`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return NextResponse.json(await upstream.json(), { status: upstream.status });
}

And the client calls same-origin API routes:

// frontend/components/CopilotClient.tsx
const res = await fetch("/api/analyze", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ dataset, question }),
});

Local Development

cp .env.example .env
docker compose up --build

Open:

  • Frontend: http://localhost:3000
  • Backend: http://localhost:8000/docs

Deployment Notes (Cloud Run)

Backend and frontend are deployed as separate Cloud Run services.

Important details:

  • Backend Docker entrypoint binds to ${PORT} for Cloud Run compatibility.
  • Source deploy uses .gcloudignore to avoid uploading large local folders (node_modules, .next, etc.).
  • Frontend is built with NEXT_PUBLIC_API_BASE pointing to backend Cloud Run URL.
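
For reference, the backend source deploy boils down to something like this; the service name and region are placeholders, not the actual values used:

gcloud run deploy dna-backend \
  --source backend \
  --region us-central1 \
  --allow-unauthenticated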

What Worked Well

  • Having a deterministic demo path (/demo/magic-run) made testing and presenting much easier.
  • Separating ingestion, DNA modeling, and agent logic kept backend maintainable.
  • Next.js API proxy routes reduced client-side networking friction.

What I’d Improve Next

  • Add persistent managed Postgres for production Cloud Run data retention.
  • Add auth/tenant separation for multi-user usage.
  • Add CI checks for contract drift between backend response schemas and frontend types.
  • Add richer graph layouts and node-level drilldowns.

The Living Data DNA Platform started as a hackathon build, but the architecture is practical for real metadata operations: ingest, detect, explain, and recover, all with clear lineage context.

If you’re building on top of OpenMetadata, this pattern can be a solid base for reliability workflows, incident response, and metadata governance operations.

Video Demo

GitHub: https://github.com/harishkotra/living-data-dna-platform
