Harish Kotra (he/him)

Building a Living Data DNA Platform: Lineage, Time Travel, and AI-Guided Incident Response

Data platforms rarely fail in one obvious place. A small schema change in one table can cascade across transformations, dashboards, and executive reports.

This project, the Living Data DNA Platform, was built to make that chain visible and actionable.

In this post, I’ll walk through:

  • What the app does
  • Architecture and data flow
  • Backend implementation details
  • Frontend implementation details
  • Deployment
  • Lessons learned

What This Project Solves

The platform ingests metadata from OpenMetadata, converts it into a normalized “DNA snapshot” model, stores time-based changes, and runs an AI agent pipeline to:

  • Detect issues
  • Explain root cause
  • Suggest fixes
  • Produce stakeholder-ready summaries

It also includes a guided “incident replay” mode so teams can demo a complete incident lifecycle in under a minute.

Tech Stack

  • Backend: FastAPI, SQLAlchemy
  • Frontend: Next.js (App Router), TypeScript, React Flow
  • Database: PostgreSQL (local/dev), SQLite (Cloud Run lightweight mode)
  • Metadata source: OpenMetadata REST APIs
  • LLM layer: OpenAI-compatible endpoint
  • External context: Tavily + BrightData MCP
  • Infra: Docker Compose (local), Cloud Run (deployed)

Architecture

Architecture Diagram

Core Backend Design

1) Metadata ingestion from OpenMetadata

The ingestion service fetches and normalizes metadata into an internal shape used by the DNA builder.

# backend/app/services/ingestion.py
class MetadataIngestionService:
    async def sync(self, db: Session) -> dict:
        normalized = await self.client.fetch_normalized_metadata()
        for item in normalized:
            ...  # upsert the Dataset row for this item (elided)
            dna = build_dna(item)
            snapshot = DnaSnapshot(
                dataset_id=dataset.id,
                trust_score=dna["trust_score"],
                genes=dna["genes"],
            )
            db.add(snapshot)
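
fetch_normalized_metadata isn't shown in the post, but here's a minimal sketch of what it could look like against OpenMetadata's REST API (GET /api/v1/tables). The field selection and normalization keys are assumptions based on the shapes used later in this post:

# backend/app/services/openmetadata_client.py (hypothetical sketch)
import httpx

class OpenMetadataClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.headers = {"Authorization": f"Bearer {token}"}

    async def fetch_normalized_metadata(self) -> list[dict]:
        # Pull table entities along with the extra fields the DNA builder needs.
        async with httpx.AsyncClient(headers=self.headers) as client:
            resp = await client.get(
                f"{self.base_url}/api/v1/tables",
                params={"fields": "columns,owner,usageSummary", "limit": 100},
            )
            resp.raise_for_status()
            tables = resp.json().get("data", [])
        # Normalize each table into the internal shape consumed by build_dna.
        return [
            {
                "dataset": t.get("fullyQualifiedName", t["name"]),
                "schema": t.get("columns", []),
                "lineage": [],  # lineage comes from a separate endpoint (elided)
                "usage": t.get("usageSummary", {}),
                "owner": (t.get("owner") or {}).get("name", "unknown"),
                "description": t.get("description", ""),
            }
            for t in tables
        ]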

2) DNA Builder

DNA snapshots represent metadata “genes”:

  • schema_gene
  • lineage_gene
  • usage_gene
  • ownership_gene

# backend/app/services/dna_builder.py
def build_dna(metadata: dict) -> dict:
    trust_score = calculate_trust_score(metadata)
    genes = {
        "schema_gene": metadata.get("schema", []),
        "lineage_gene": metadata.get("lineage", []),
        "usage_gene": metadata.get("usage", {}),
        "ownership_gene": {
            "owner": metadata.get("owner", "unknown"),
            "description": metadata.get("description", ""),
        },
    }
    return {"dataset": metadata["dataset"], "trust_score": trust_score, "genes": genes}
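
calculate_trust_score isn't shown in the post; here's an illustrative sketch of the kind of heuristic it might apply. The specific signals and weights are assumptions, not the real scoring logic:

def calculate_trust_score(metadata: dict) -> float:
    # Start from full trust and subtract for missing governance signals.
    score = 100.0
    if not metadata.get("owner"):
        score -= 20  # unowned datasets are riskier
    if not metadata.get("description"):
        score -= 10  # undocumented datasets are harder to trust
    if not metadata.get("lineage"):
        score -= 15  # no lineage means no blast-radius visibility
    return max(score, 0.0)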

3) Temporal Engine

Each new snapshot is compared against prior snapshots to produce schema and lineage diffs.

# backend/app/services/temporal_engine.py
def compute_schema_diff(previous_genes: dict, current_genes: dict) -> dict:
    ...

def compute_lineage_diff(previous_genes: dict, current_genes: dict) -> dict:
    ...
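
The diff bodies are elided above; as a rough illustration, a schema diff could be computed like this, assuming schema_gene is a list of column dicts with name and dataType keys (the real field names may differ):

def compute_schema_diff(previous_genes: dict, current_genes: dict) -> dict:
    prev = {c["name"]: c for c in previous_genes.get("schema_gene", [])}
    curr = {c["name"]: c for c in current_genes.get("schema_gene", [])}
    added = sorted(curr.keys() - prev.keys())
    removed = sorted(prev.keys() - curr.keys())
    # A column counts as changed if its declared type differs between snapshots.
    changed = sorted(
        name for name in prev.keys() & curr.keys()
        if prev[name].get("dataType") != curr[name].get("dataType")
    )
    return {"added": added, "removed": removed, "changed": changed}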

4) Agent Orchestration

The orchestrator runs a staged pipeline:

  1. Observer detects issues
  2. Analyst performs root cause analysis
  3. Fixer simulates remediation
  4. Explainer produces final narrative

# backend/app/services/agents/orchestrator.py
issues = self.observer.detect_issues(dataset_name, latest, previous, edges)
analysis = await self.analyst.analyze_issue(...)
fix = self.fixer.suggest_fix(issue, analysis)
narrative, sections = await self.explainer.explain(...)
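
Putting the four stages together, the orchestration loop looks roughly like this; the result shape and any method arguments beyond those shown above are assumptions:

async def run_pipeline(self, dataset_name, latest, previous, edges):
    results = []
    # 1. Observer: detect issues by diffing the latest snapshot against history.
    for issue in self.observer.detect_issues(dataset_name, latest, previous, edges):
        # 2. Analyst: LLM-backed root cause analysis.
        analysis = await self.analyst.analyze_issue(issue)
        # 3. Fixer: simulate and suggest remediation.
        fix = self.fixer.suggest_fix(issue, analysis)
        # 4. Explainer: stakeholder-ready narrative.
        narrative, sections = await self.explainer.explain(issue, analysis, fix)
        results.append({
            "issue": issue, "analysis": analysis,
            "fix": fix, "narrative": narrative, "sections": sections,
        })
    return results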

5) Demo Orchestration Endpoint

The POST /demo/magic-run endpoint creates a reproducible incident path:

  • Upstream schema mutation
  • Downstream break
  • Propagated risk
  • Boardroom brief payload for UI

# backend/app/services/demo_magic.py
def run_magic_demo(db: Session) -> dict:
    ...
    return {
        "datasets": [...],
        "lineage": [...],
        "incident": {...},
        "metrics": {...},
        "boardroom_brief": {...},
        "timeline_events": [...],
    }
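
Wiring the demo service into FastAPI is then a thin route handler. A sketch, assuming a get_db dependency for the SQLAlchemy session (module paths here are guesses):

# backend/app/api/demo.py (hypothetical wiring)
from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session

from app.db import get_db  # assumed session dependency
from app.services.demo_magic import run_magic_demo

router = APIRouter()

@router.post("/demo/magic-run")
def magic_run(db: Session = Depends(get_db)):
    # Seed the reproducible incident path and return the full UI payload.
    return run_magic_demo(db)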

API Surface

Main routes:

  • GET /dna/{dataset}
  • GET /timeline/{dataset}
  • GET /graph
  • POST /analyze
  • POST /simulate-fix
  • POST /refresh-openmetadata
  • POST /demo/magic-run

Example:

curl -X POST http://localhost:8000/analyze \
  -H "content-type: application/json" \
  -d '{"dataset":"sales.orders","question":"Why did this dataset break?"}'
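
The request body for /analyze maps onto a small Pydantic model. A sketch inferred from the payload above (the model name is an assumption):

from pydantic import BaseModel

class AnalyzeRequest(BaseModel):
    dataset: str   # e.g. "sales.orders"
    question: str  # free-form question routed to the agent pipeline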

Frontend Design

The UI is split into purpose-driven views:

  • Dashboard: trust/risk overview and active incidents
  • Graph: lineage map with mutation/blast-radius signal
  • Timeline: event progression and snapshot history
  • Copilot: structured incident explanation

To keep Copilot calls off the browser-to-backend CORS path, Next.js route handlers proxy requests to the backend server-side:

// frontend/app/api/analyze/route.ts
import { NextRequest, NextResponse } from "next/server";

export async function POST(req: NextRequest) {
  const body = await req.json();
  // Forward the request to the FastAPI backend and relay its response.
  const upstream = await fetch(`${API_BASE}/analyze`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return NextResponse.json(await upstream.json(), { status: upstream.status });
}

And the client calls same-origin API routes:

// frontend/components/CopilotClient.tsx
const res = await fetch("/api/analyze", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ dataset, question }),
});

Local Development

cp .env.example .env
docker compose up --build

Open:

  • Frontend: http://localhost:3000
  • Backend: http://localhost:8000/docs

Deployment Notes (Cloud Run)

Backend and frontend are deployed as separate Cloud Run services.

Important details:

  • Backend Docker entrypoint binds to ${PORT} for Cloud Run compatibility.
  • Source deploy uses .gcloudignore to avoid uploading large local folders (node_modules, .next, etc.).
  • Frontend is built with NEXT_PUBLIC_API_BASE pointing to backend Cloud Run URL.
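
For reference, the backend source deploy boils down to something like this; the service name and region are placeholders, not the actual values used:

gcloud run deploy dna-backend \
  --source backend \
  --region us-central1 \
  --allow-unauthenticated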

What Worked Well

  • Having a deterministic demo path (/demo/magic-run) made testing and presenting much easier.
  • Separating ingestion, DNA modeling, and agent logic kept backend maintainable.
  • Next.js API proxy routes reduced client-side networking friction.

What I’d Improve Next

  • Add persistent managed Postgres for production Cloud Run data retention.
  • Add auth/tenant separation for multi-user usage.
  • Add CI checks for contract drift between backend response schemas and frontend types.
  • Add richer graph layouts and node-level drilldowns.

The Living Data DNA Platform started as a hackathon build, but the architecture is practical for real metadata operations: ingest, detect, explain, and recover, all with clear lineage context.

If you’re building on top of OpenMetadata, this pattern can be a solid base for reliability workflows, incident response, and metadata governance operations.

Video Demo

GitHub: https://github.com/harishkotra/living-data-dna-platform
