Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Labeling Pipeline

A comprehensive annotation workflow system that brings structure and quality control to the most time-consuming part of any ML project: labeling data. This pipeline includes annotation interface templates, quality assurance scripts that catch labeler disagreements early, active learning selectors that prioritize the most informative samples, and export tools that convert annotations into every major training format. Stop burning budget on redundant labels — build a labeling operation that scales.

Key Features

  • Annotation Workflow Engine — Define multi-stage labeling pipelines with review gates, consensus requirements, and automatic task routing.
  • Quality Assurance Suite — Inter-annotator agreement (Cohen's Kappa, Fleiss' Kappa), outlier detection, and automatic flagging of low-confidence labels.
  • Active Learning Selectors — Uncertainty sampling, query-by-committee, and diversity sampling to reduce labeling cost by 40-60%.
  • Label Studio Integration — Pre-built project templates for image classification, NER, text classification, and bounding box tasks.
  • Format Converters — Export to COCO, VOC, YOLO, spaCy, HuggingFace Datasets, and CSV formats with a single command.
  • Labeler Performance Tracking — Per-annotator accuracy, speed, and agreement metrics with dashboard-ready JSON output.
  • Pre-labeling Pipeline — Use a weak model to generate initial labels, reducing annotator effort to verification rather than creation.
  • Version Control — Track annotation schema changes, label corrections, and dataset snapshots with full audit trails.

Quick Start

unzip data-labeling-pipeline.zip && cd data-labeling-pipeline
pip install -r requirements.txt

# Initialize a labeling project
python src/data_labeling_pipeline/core.py init --config config.example.yaml
# config.example.yaml
project:
  name: product_classification
  task_type: image_classification  # image_classification | ner | bbox | text_classification
  classes: [electronics, clothing, furniture, food, other]

quality:
  min_annotators_per_sample: 2
  agreement_threshold: 0.75  # Cohen's Kappa
  review_gate: true
  auto_flag_low_confidence: true

active_learning:
  enabled: true
  strategy: uncertainty  # uncertainty | committee | diversity | hybrid
  model_checkpoint: ./models/weak_classifier.pt
  query_batch_size: 100
  retrain_every_n_labels: 500

export:
  format: coco  # coco | voc | yolo | huggingface | csv
  output_dir: ./labeled_data/
  train_val_split: 0.85
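To make the quality settings above concrete, here is a minimal sketch of the consensus gate they imply: a sample is auto-approved only when enough annotators have labeled it and they agree strongly enough, otherwise it is routed to review. The function name is hypothetical, and for simplicity the threshold is applied to the raw majority fraction rather than Cohen's Kappa, which the toolkit uses.

```python
from collections import Counter

# Hypothetical sketch of the consensus gate implied by the quality settings
# above; the real engine scores agreement with Cohen's Kappa, here we use
# the simpler majority fraction for illustration.
def consensus_decision(labels, min_annotators=2, agreement_threshold=0.75):
    """Return (label, "approved") or (None, "needs_review")."""
    if len(labels) < min_annotators:
        return None, "needs_review"          # not enough annotations yet
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Fraction of annotators agreeing with the majority label
    if top_count / len(labels) >= agreement_threshold:
        return top_label, "approved"
    return None, "needs_review"

# 2 of 3 annotators agree (0.67 < 0.75), so this sample goes to review
print(consensus_decision(["electronics", "electronics", "clothing"]))
```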

Architecture

┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│  Raw Data    │────>│  Pre-Labeler  │────>│  Task Queue  │
│  Ingestion   │     │ (Weak Model)  │     │  (Priority)  │
└──────────────┘     └───────────────┘     └──────┬───────┘
                                                   │
┌──────────────┐     ┌───────────────┐     ┌──────▼───────┐
│  QA & Review │<────│  Consensus    │<────│  Annotators  │
│  Dashboard   │     │  Engine       │     │  (Human)     │
└──────┬───────┘     └───────────────┘     └──────────────┘
       │
┌──────▼───────┐     ┌───────────────┐
│  Active      │────>│  Export &     │
│  Learning    │     │  Versioning   │
└──────────────┘     └───────────────┘

Usage Examples

Calculate Inter-Annotator Agreement

from data_labeling_pipeline.core import QualityAssurance

qa = QualityAssurance()

# Load annotations from two annotators
annotations_a = qa.load_annotations("annotator_1.json")
annotations_b = qa.load_annotations("annotator_2.json")

# Calculate agreement
report = qa.compute_agreement(annotations_a, annotations_b)
print(f"Cohen's Kappa: {report['cohens_kappa']:.3f}")
print(f"Percent Agreement: {report['percent_agreement']:.1%}")
print(f"Disagreement samples: {len(report['disagreements'])}")

# Flag samples needing re-review
qa.flag_for_review(report["disagreements"], reason="annotator_disagreement")
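If you want to see what a report like this computes under the hood, here is a minimal, dependency-free sketch of Cohen's Kappa (the toolkit's internals may differ; scikit-learn's `cohen_kappa_score` produces the same number):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's Kappa for two annotators labeling the same samples."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: both annotators independently pick the same class
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

labels_a = ["cat", "cat", "dog", "dog", "cat", "dog"]
labels_b = ["cat", "dog", "dog", "dog", "cat", "dog"]

disagreements = [i for i, (x, y) in enumerate(zip(labels_a, labels_b)) if x != y]
print(f"Cohen's Kappa: {cohens_kappa(labels_a, labels_b):.3f}")  # 0.667
print(f"Disagreement samples: {disagreements}")                  # [1]
```

Note that Kappa corrects observed agreement for chance: 5 of 6 labels match (83%), but because both annotators favor the same classes, the chance-corrected score drops to 0.667.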

Active Learning Sample Selection

from data_labeling_pipeline.core import ActiveLearner
import torch

# Load your current weak model
model = torch.load("./models/weak_classifier.pt")
model.eval()

learner = ActiveLearner(
    model=model,
    strategy="uncertainty",
    pool_size=10000,
)

# Select the 100 most informative unlabeled samples
selected_indices = learner.query(n_instances=100)
print(f"Selected {len(selected_indices)} samples for labeling")
print(f"Average uncertainty: {learner.last_uncertainty_mean:.4f}")

# Export selected samples to labeling tool
learner.export_to_label_studio(selected_indices, project_id="product_cls")
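One plausible reading of `strategy="uncertainty"` above is entropy-based sampling: rank unlabeled samples by the entropy of the weak model's predicted class probabilities and take the top n. A self-contained sketch (function names here are illustrative, not the toolkit's API):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_query(pool_probs, n_instances):
    """Return indices of the n most uncertain samples, most uncertain first."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:n_instances]

pool = [
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> high entropy, most informative
    [0.70, 0.20, 0.10],   # moderately uncertain
]
print(uncertainty_query(pool, 2))  # [1, 2]
```

The intuition: samples the model is already sure about add little signal, so labeling budget goes to the ones near the decision boundary.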

Convert Between Annotation Formats

from data_labeling_pipeline.utils import FormatConverter

converter = FormatConverter()

# COCO to YOLO
converter.convert(
    input_path="./annotations/coco_annotations.json",
    input_format="coco",
    output_path="./annotations/yolo/",
    output_format="yolo",
    image_dir="./images/",
)

# Export to HuggingFace Datasets
converter.to_huggingface(
    annotations_path="./annotations/coco_annotations.json",
    output_dir="./hf_dataset/",
    push_to_hub=False,
)
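The core coordinate math behind a COCO-to-YOLO conversion like the one above is small enough to show inline: COCO stores boxes as `[x_min, y_min, width, height]` in absolute pixels, while YOLO expects `[x_center, y_center, width, height]` normalized to the image size. A sketch (the helper name is illustrative):

```python
def coco_bbox_to_yolo(bbox, img_w, img_h):
    """Convert a COCO [x_min, y_min, w, h] pixel box to YOLO's
    normalized [x_center, y_center, w, h] format."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    return [x_center, y_center, w / img_w, h / img_h]

# A 100x50 box with top-left corner at (200, 150) in a 640x480 image
print(coco_bbox_to_yolo([200, 150, 100, 50], 640, 480))
```

Getting this convention wrong (corner vs. center, pixels vs. normalized) is the most common cause of silently misplaced boxes after conversion, which is why a single tested converter beats ad-hoc scripts.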

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `quality.min_annotators_per_sample` | int | 2 | Annotations required before consensus |
| `quality.agreement_threshold` | float | 0.75 | Minimum Kappa for auto-approval |
| `active_learning.strategy` | str | uncertainty | Sample selection strategy |
| `active_learning.query_batch_size` | int | 100 | Samples per active learning round |
| `export.train_val_split` | float | 0.85 | Train/validation split ratio |

Best Practices

  1. Start with clear labeling guidelines — Write a labeling guide with edge case examples before any annotation begins. Ambiguous guidelines are the #1 source of label noise.
  2. Use pre-labeling for > 1000 samples — Even a weak model (70% accuracy) cuts annotator time in half. Humans are faster at verification than creation.
  3. Monitor agreement continuously — Don't wait until the end to check quality. Run QA after every 200 labels and retrain annotators if Kappa drops below 0.7.
  4. Version your label schema — When you add or rename classes mid-project, track the change and re-map historical annotations.
  5. Sample for review, don't review everything — Spot-check 10-15% of labels from each annotator. Full review is cost-prohibitive and unnecessary with good annotators.
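Practice 5 is easy to operationalize. A sketch of seeded spot-check sampling, so audits are reproducible (function and parameter names are illustrative):

```python
import random

def sample_for_review(sample_ids, fraction=0.12, seed=42):
    """Pick a reproducible random subset (~12% by default) for spot-checking."""
    rng = random.Random(seed)              # fixed seed -> repeatable audits
    k = max(1, round(len(sample_ids) * fraction))
    return sorted(rng.sample(sample_ids, k))

batch = list(range(200))                   # one annotator's latest 200 labels
review_set = sample_for_review(batch)
print(f"Reviewing {len(review_set)} of {len(batch)} labels")
```

Seeding matters: if a dispute arises later, you can regenerate exactly which samples were audited.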

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Low inter-annotator agreement (<0.5) | Ambiguous label guidelines | Revise guidelines with concrete examples for each class boundary |
| Active learning selects similar samples | Diversity component missing | Switch to the hybrid strategy, which combines uncertainty and diversity sampling |
| Export fails with missing images | Image paths in annotations are absolute | Use the `--rebase-paths` flag to convert them to relative paths |
| Pre-labeling accuracy too low | Weak model undertrained | Train the weak model on at least 500 manually labeled samples first |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Data Labeling Pipeline with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

