Saman Tabatabaeian

Posted on Feb 20

How I Built a Self-Correcting ML Pipeline That Runs for Days Without Human Intervention

#python #machinelearning #opensource #astronomy

Most ML pipelines are batch jobs: data in, predictions out, human reviews results. But what if you need a pipeline that runs autonomously for days, continuously ingesting data from external sources, making decisions, and — critically — correcting its own mistakes without anyone watching?

That's what I built for AstroLens, an open-source tool for detecting astronomical anomalies in sky survey images. Version 1.1.0 introduces Streaming Discovery: a mode where the system runs for days, downloads tens of thousands of images from sky surveys, analyzes each one with an ML ensemble, and generates daily reports — all while continuously adjusting its own parameters.

In a 3-day validation run, it processed 20,997 images, flagged 3,458 anomaly candidates, independently found known supernovae and gravitational lenses, and ran 140 self-correction cycles with zero errors and zero human intervention.

Here's how the self-correcting architecture works.

The core pipeline

┌─────────────────────────────────────────────────────┐
│                 STREAMING DISCOVERY                 │
│                                                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────────────┐ │
│  │  Survey   │──▶│    ML    │──▶│    Catalog        │
│  │ Ingestor  │   │ Ensemble │   │  Cross-reference ││
│  └─────┬────┘   └────┬─────┘   └────────┬─────────┘ │
│        │              │                   │         │
│        ▼              ▼                   ▼         │
│  ┌─────────────────────────────────────────────────┐│
│  │             State Manager (aiosqlite)           ││
│  └──────────────────────┬──────────────────────────┘│
│                         │                           │
│  ┌──────────────────────▼──────────────────────────┐│
│  │          Self-Correction Controller             ││
│  │  • Threshold decay    • Source rebalancing      ││
│  │  • OOD recalibration  • Error recovery          ││
│  │  • YOLO auto-retrain  • Health monitoring       ││
│  └──────────────────────────────────────────── ────┘│
│                         │                           │
│                    ┌────▼────┐                      │
│                    │ Reports │                      │
│                    └─────────┘                      │
└─────────────────────────────────────────────────────┘

The ML ensemble itself has three stages: a Vision Transformer (ViT-B/16) for feature extraction, an out-of-distribution detection ensemble (Mahalanobis + energy-based + kNN with majority voting), and YOLOv8 for transient localization. But the architecture decisions that matter most for autonomous operation aren't in the models — they're in everything around them.

Self-correction #1: Adaptive threshold decay

The first problem with running OOD detection for days is threshold drift. Your initial thresholds are calibrated on whatever reference data you start with, but as you ingest thousands of images, your understanding of "normal" changes.

The pipeline uses a decay function that gradually adjusts OOD thresholds based on the accumulated distribution of scores:

threshold_adjustment = base_threshold * (1 - decay_rate * log(1 + images_processed))

Early in a run, thresholds are tight — you'd rather miss anomalies than flood yourself with false positives while the reference distribution is thin. As more data comes in, thresholds relax proportionally to the log of processed images. The logarithmic curve is deliberate: you want rapid early adjustment but diminishing change as the distribution stabilizes.

The system also tracks the running mean and variance of OOD scores per source. If a source's score distribution shifts significantly (measured by a simple KL-divergence check against the last calibration window), it triggers a recalibration.

Self-correction #2: Source rebalancing

AstroLens downloads from multiple sky surveys: SDSS, ZTF, DECaLS, Pan-STARRS, Hubble, Galaxy Zoo, and specialized catalogs. Not all sources yield anomalies equally.

During the 3-day run, ZTF produced a 32.5% anomaly rate (expected — it's a transient survey designed to catch changing objects). Gravitational lens catalogs hit 60%. General-purpose survey sources were much lower.

The pipeline tracks anomaly yield per source over a rolling window and rebalances query allocation proportionally. Sources producing more interesting candidates get more queries. Sources producing mostly normal galaxies get fewer.

This is a simple but powerful optimization: instead of equal time across all sources, the pipeline automatically concentrates effort where it's productive.

# Simplified rebalancing logic
def compute_source_weights(source_stats, window_size=500):
    weights = {}
    for source, stats in source_stats.items():
        recent = stats.last_n(window_size)
        anomaly_rate = recent.anomaly_count / max(recent.total_count, 1)
        error_rate = recent.error_count / max(recent.total_count, 1)
        weights[source] = anomaly_rate * (1 - error_rate)
    total = sum(weights.values()) or 1
    return {s: w / total for s, w in weights.items()}

A floor prevents any source from being starved entirely — you always want some diversity in your input to avoid feedback loops.

Self-correction #3: YOLO auto-retrain

This was the most dramatic result. The YOLOv8 transient detection model started the 3-day run at 51.5% mAP50. By the end, it reached 99.5%.

How: as the pipeline processes images, anomaly candidates with high-confidence OOD scores and successful catalog cross-references become pseudo-labeled training data. When enough new labeled samples accumulate (configurable threshold), the pipeline triggers a YOLO fine-tuning cycle on the expanded dataset.

The key safeguard is that training data only comes from candidates that were independently validated — either by the OOD ensemble's majority vote or by matching a known object in SIMBAD/NED. This avoids the obvious failure mode of training on your own false positives.

Each retrain cycle checkpoints the previous model. If the new model's validation mAP drops below the previous checkpoint, the system rolls back automatically.

Self-correction #4: Error recovery

External APIs fail. Survey servers rate-limit you. Images download corrupted. Over a 3-day run, things go wrong constantly.

The error recovery system operates at multiple levels:

Request level: Exponential backoff with jitter on transient HTTP failures. Configurable retry limits per source.
Image level: Corrupted or unparseable images are logged and skipped. If a source's error rate exceeds a threshold, it's temporarily deprioritized (feeds into source rebalancing).
Pipeline level: The state manager persists every pipeline decision to aiosqlite. If the process crashes — power failure, OOM, whatever — it restarts and picks up from the last committed state. No image is processed twice, no candidate is lost.
Session level: Each streaming session tracks cumulative health metrics. If aggregate error rates or processing latency breach configurable bounds, the pipeline can pause, generate an interim report, and alert.

During the 3-day run, 140 self-correction events were triggered across these systems. All were handled autonomously.

Daily reporting

At configurable intervals (default: 24 hours), the pipeline generates a Markdown report containing:

Total images processed, anomaly candidates found, unique sky regions covered
Per-source breakdown: anomaly rates, error rates, top candidates
YOLO training status and performance trajectory
Threshold adjustment history
Pipeline health: uptime, memory usage, processing rate trends
Highlighted candidates with coordinates, survey thumbnails, and catalog cross-matches

These reports serve as both audit logs and scientific outputs.

What went wrong (and how the system handled it)

Survey API rate limits: ZTF and SDSS both throttled requests during peak hours. The backoff system handled this gracefully, but it meant processing throughput varied by time of day. The pipeline adapted by shifting more queries to less-loaded sources during peak hours.

YOLO training on small batches: Early retrain cycles had very few samples, risking overfitting. The minimum sample threshold prevents this from happening too early, but there's a tension between waiting for enough data and getting improved detection sooner. The current heuristic works but could be more sophisticated.

OOD threshold sensitivity: The initial thresholds matter more than expected. If they're too tight at the start, you miss anomalies that would have improved downstream training data. If too loose, early YOLO training data is noisy. The logarithmic decay is a compromise, but the cold-start problem is real.

The Most Significant Detections

Found images are published here
Every detection was cross-referenced against SIMBAD, the international astronomical reference database maintained by the Centre de Données astronomiques de Strasbourg. All 269 queried positions returned matches to known catalogued objects, validating that the pipeline is looking at real astronomical sources.

Key takeaways

State persistence is non-negotiable for long-running pipelines. Every decision, every intermediate result, every parameter change must be recoverable. aiosqlite with WAL mode handles this well for single-node operation.
Self-correction beats monitoring. Instead of alerting a human when thresholds drift, let the system adjust. Save human attention for genuinely novel decisions.
Majority voting across diverse methods is more robust than any single OOD detection approach. Mahalanobis, energy-based, and kNN each fail differently — their intersection is much more reliable.
Logarithmic schedules appear everywhere in this system. Threshold decay, rebalancing windows, retrain triggers — diminishing adjustment as data accumulates is a recurring pattern.
Test with known objects. The pipeline's ability to independently find SN 2014J, NGC 3690, and SDSS J0252+0039 is validation, not discovery. Building this kind of ground truth into autonomous systems is essential for trust.

Try it

AstroLens is MIT licensed and runs on any laptop (CPU, Apple Silicon, or NVIDIA GPU). Desktop app, web interface, and CLI.

git clone https://github.com/samantaba/astroLens.git
cd astroLens
pip install -r requirements.txt
python main.py --mode web

GitHub: https://github.com/samantaba/astroLens

Contributions welcome — especially around new survey integrations, alternative OOD methods, and visualization improvements. If you've built long-running ML pipelines, I'd love to hear what patterns worked for you.

Built by Saman Tabatabaeian — AWS Solutions Architect, Cloud & DevOps, ML Engineer.

Top comments (0)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.