<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mona Hamid</title>
    <description>The latest articles on DEV Community by Mona Hamid (@mona_hamid).</description>
    <link>https://dev.to/mona_hamid</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1992400%2F08b3c4e4-fb11-409c-a6e7-159712144992.jpg</url>
      <title>DEV Community: Mona Hamid</title>
      <link>https://dev.to/mona_hamid</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mona_hamid"/>
    <language>en</language>
    <item>
      <title>Build a Data-to-Graph Pipeline with DLT, DuckDB &amp; Cognee 🧠📈</title>
      <dc:creator>Mona Hamid</dc:creator>
      <pubDate>Wed, 09 Jul 2025 04:56:27 +0000</pubDate>
      <link>https://dev.to/mona_hamid/build-a-data-to-graph-pipeline-with-dlt-duckdb-cognee-59kg</link>
      <guid>https://dev.to/mona_hamid/build-a-data-to-graph-pipeline-with-dlt-duckdb-cognee-59kg</guid>
      <description>&lt;p&gt;What We’ll Build&lt;br&gt;
In this post, we’ll show how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load NYC Taxi data via a REST API&lt;/li&gt;
&lt;li&gt;Store it in DuckDB using DLT&lt;/li&gt;
&lt;li&gt;Visualize the relationships using Cognee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Step 1 – Ingest Data with DLT&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@dlt.resource(write_disposition="replace", name="zoomcamp_data")
def zoomcamp_data():
    url = "https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api"
    response = requests.get(url)
    df = pd.DataFrame(response.json())
    df['Trip_Pickup_DateTime'] = pd.to_datetime(df['Trip_Pickup_DateTime'])

    df['tag'] = pd.cut(
        df['Trip_Pickup_DateTime'],
        bins=[
            pd.Timestamp("2009-06-01"),
            pd.Timestamp("2009-06-10"),
            pd.Timestamp("2009-06-20"),
            pd.Timestamp("2009-06-30")
        ],
        labels=["first_10_days", "second_10_days", "last_10_days"]
    )
    yield df[df['tag'].notnull()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
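&lt;p&gt;As a quick sanity check on the &lt;code&gt;pd.cut&lt;/code&gt; tagging above, here is a tiny standalone sketch with a few made-up pickup timestamps. Note that a trip outside June falls into no bin and gets &lt;code&gt;NaN&lt;/code&gt;, which is exactly why the final &lt;code&gt;notnull()&lt;/code&gt; filter matters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Hypothetical pickups: one in each 10-day window, plus one outside June
df = pd.DataFrame({"Trip_Pickup_DateTime": pd.to_datetime(
    ["2009-06-05", "2009-06-15", "2009-06-25", "2009-07-05"])})

df["tag"] = pd.cut(
    df["Trip_Pickup_DateTime"],
    bins=[pd.Timestamp("2009-06-01"), pd.Timestamp("2009-06-10"),
          pd.Timestamp("2009-06-20"), pd.Timestamp("2009-06-30")],
    labels=["first_10_days", "second_10_days", "last_10_days"],
)
print(df["tag"].tolist())   # the July trip comes back as NaN and is dropped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;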



&lt;p&gt;Step 2 – Run Pipeline to DuckDB&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline = dlt.pipeline(
    pipeline_name="zoomcamp_pipeline",
    destination="duckdb",
    dataset_name="zoomcamp_tagged_data"
)
pipeline.run(zoomcamp_data())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
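&lt;p&gt;To verify what landed in DuckDB, you can open the pipeline's database directly. A minimal sketch, assuming dlt's default local destination file &lt;code&gt;zoomcamp_pipeline.duckdb&lt;/code&gt; in the working directory (dlt maps the dataset name to a schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import duckdb

# dlt's duckdb destination writes to "&lt;pipeline_name&gt;.duckdb" by default,
# with the dataset name as the schema
con = duckdb.connect("zoomcamp_pipeline.duckdb")
con.sql(
    "SELECT tag, COUNT(*) AS trips "
    "FROM zoomcamp_tagged_data.zoomcamp_data "
    "GROUP BY tag ORDER BY tag"
).show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;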



&lt;p&gt;Step 3 – Enrich and Visualize with Cognee&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wait cognee.add(df_set1_json, node_set=["first_10_days"])
await cognee.add(df_set2_json, node_set=["second_10_days"])
await cognee.add(df_set3_json, node_set=["last_10_days"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
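&lt;p&gt;The &lt;code&gt;add&lt;/code&gt; calls above only stage the tagged data; in cognee the graph is built and queried in separate steps. A hedged sketch based on cognee's documented API at the time of writing (the &lt;code&gt;cognify&lt;/code&gt; and &lt;code&gt;search&lt;/code&gt; names may differ in your installed version, and the query string is made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Build the knowledge graph from everything added so far
await cognee.cognify()

# Query across the tagged node sets
results = await cognee.search("How do trips differ across the three June windows?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;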



&lt;p&gt;Result 🎉&lt;br&gt;
Upload your notebook and see interactive graphs emerge from your dataset.&lt;/p&gt;

&lt;p&gt;🧪 DuckDB + 🧵 DLT + 🧠 Cognee = Magic!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6asao0lr522e275udf9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6asao0lr522e275udf9z.png" alt=" " width="800" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building an MLOps Monitoring Architecture That Actually Works</title>
      <dc:creator>Mona Hamid</dc:creator>
      <pubDate>Mon, 23 Jun 2025 15:11:48 +0000</pubDate>
      <link>https://dev.to/mona_hamid/-building-an-mlops-monitoring-architecture-that-actually-works-4j0f</link>
      <guid>https://dev.to/mona_hamid/-building-an-mlops-monitoring-architecture-that-actually-works-4j0f</guid>
      <description>&lt;p&gt;The Problem 😅&lt;/p&gt;

&lt;p&gt;You've probably been here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy ML model ✅&lt;/li&gt;
&lt;li&gt;Model works great initially ✅
&lt;/li&gt;
&lt;li&gt;Stakeholders are happy ✅&lt;/li&gt;
&lt;li&gt;Then... 📉 silent degradation&lt;/li&gt;
&lt;li&gt;Business metrics drop 📊&lt;/li&gt;
&lt;li&gt;"Why didn't we know sooner?" 🤔&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Traditional monitoring doesn't work for ML models.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;The Architecture 🏗️&lt;/h2&gt;

&lt;p&gt;Built a 3-layer monitoring system:&lt;/p&gt;

&lt;h3&gt;Layer 1: Models &amp;amp; Data 🤖&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐    ┌─────────────────┐
│   ML Model      │    │  Data Storage   │
│   (FastAPI)     │◄───┤ (PostgreSQL/S3) │
└─────────────────┘    └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Layer 2: Processing ⚙️&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐    ┌─────────────────┐
│ Drift Detection │    │  Orchestration  │
│ (Evidently AI)  │◄───┤   (Prefect)     │
└─────────────────┘    └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;Layer 3: Alerts &amp;amp; Viz 📊&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐    ┌─────────────────┐
│   Dashboards    │    │     Alerts      │
│   (Grafana)     │◄───┤(Slack/PagerDuty)│
└─────────────────┘    └─────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Key Monitoring Metrics 📈&lt;/h2&gt;

&lt;h3&gt;🎯 Prediction Drift&lt;/h3&gt;

&lt;p&gt;Detect when model outputs change distribution:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
from evidently.metrics import DatasetDriftMetric

def check_prediction_drift(reference, current):
    metric = DatasetDriftMetric()
    result = metric.calculate(reference, current)
    return result.drift_detected
📊 Feature Drift
Monitor input feature distributions:

Mean/median shifts
Standard deviation changes
Quantile-based detection

❌ Data Quality
Real-time validation:

Missing value %
Outlier detection
Schema changes

📉 Performance Metrics
When ground truth available:

Accuracy trends
F1-score evolution
Business KPI correlation

Implementation Example 💻
pythonclass MLMonitor:
    def __init__(self, reference_data):
        self.reference_data = reference_data
        self.slack_webhook = os.getenv('SLACK_WEBHOOK')

    def monitor_predictions(self, current_data):
        """Main monitoring function"""

        # 1. Check for drift
        drift_result = self.check_drift(current_data)

        # 2. Validate data quality  
        quality_result = self.check_quality(current_data)

        # 3. Send alerts if needed
        if drift_result['drift_detected']:
            self.send_alert(f"🚨 Drift detected: {drift_result['drift_score']:.3f}")

        # 4. Update dashboards
        self.update_metrics(drift_result, quality_result)

    def check_drift(self, current_data):
        """Drift detection with Evidently"""
        from evidently.report import Report
        from evidently.metric_preset import DataDriftPreset

        report = Report(metrics=[DataDriftPreset()])
        report.run(self.reference_data, current_data)

        return report.as_dict()

    def send_alert(self, message):
        """Send Slack notification"""
        import requests

        payload = {
            "text": message,
            "channel": "#ml-alerts",
            "username": "ML Monitor Bot"
        }

        requests.post(self.slack_webhook, json=payload)
Results 📊
After implementing this system:
MetricBeforeAfterDetection Time2-3 days2-3 hoursMonthly Incidents83False Positive Rate40%5%Stakeholder Confidence😐😍
Tech Stack Choices 🛠️
Why Evidently AI?

Open source &amp;amp; flexible
Excellent drift algorithms
Great documentation
Active community

Why Grafana?

Beautiful dashboards
Real-time capabilities
PostgreSQL integration
Industry standard

Why Prefect over Airflow?

Modern Python-first approach
Better error handling
Easier Kubernetes deployment
Superior observability

Lessons Learned 💡
✅ What Worked

Start simple - Basic drift detection first
Tune thresholds - Avoid alert fatigue
Pretty dashboards - Stakeholders love visuals
Automation - Let system handle simple fixes

❌ What Failed

Too many alerts initially - Alert fatigue is real
Complex metrics upfront - Confused the team
Manual processes - Doesn't scale


What's Next? 🔮
Planning to add:

Automated retraining triggers
A/B testing integration
Cost monitoring per prediction
Explainability tracking with SHAP

Conclusion 🎉
ML monitoring isn't optional anymore. This architecture has:

Caught issues 10x faster
Reduced incidents by 60%
Improved stakeholder trust
Made our ML systems actually reliable

Key takeaway: Treat monitoring as a first-class citizen in your ML pipeline.

What monitoring challenges are you facing? Share in the comments! 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
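&lt;p&gt;As a closing illustration of the quantile-based drift idea from the metrics section, here is a toy, dependency-free sketch. It is not Evidently's algorithm; the function name and the 0.5-standard-deviation threshold are made up for the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import statistics

def quantile_shift(reference, current, qs=(0.25, 0.5, 0.75), tol=0.5):
    """Flag drift when any quantile of `current` moves more than
    `tol` reference standard deviations from the matching reference quantile."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    ref_std = statistics.stdev(reference)

    def quantile(xs, p):
        return xs[min(int(p * len(xs)), len(xs) - 1)]

    shifts = [abs(quantile(cur_sorted, p) - quantile(ref_sorted, p)) / ref_std
              for p in qs]
    return max(shifts) &gt; tol

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(1000)]
shifted = [x + 2.0 for x in reference]   # simulate a clear mean shift

print(quantile_shift(reference, reference))  # False: identical distribution
print(quantile_shift(reference, shifted))    # True: every quantile moved ~2 std
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;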

</description>
    </item>
  </channel>
</rss>
