Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Data Labeling Pipeline

A comprehensive annotation workflow system that brings structure and quality control to the most time-consuming part of any ML project: labeling data. This pipeline includes annotation interface templates, quality assurance scripts that catch labeler disagreements early, active learning selectors that prioritize the most informative samples, and export tools that convert annotations into every major training format. Stop burning budget on redundant labels — build a labeling operation that scales.

Key Features

  • Annotation Workflow Engine — Define multi-stage labeling pipelines with review gates, consensus requirements, and automatic task routing.
  • Quality Assurance Suite — Inter-annotator agreement (Cohen's Kappa, Fleiss' Kappa), outlier detection, and automatic flagging of low-confidence labels.
  • Active Learning Selectors — Uncertainty sampling, query-by-committee, and diversity sampling to reduce labeling cost by 40-60%.
  • Label Studio Integration — Pre-built project templates for image classification, NER, text classification, and bounding box tasks.
  • Format Converters — Export to COCO, VOC, YOLO, spaCy, HuggingFace Datasets, and CSV formats with a single command.
  • Labeler Performance Tracking — Per-annotator accuracy, speed, and agreement metrics with dashboard-ready JSON output.
  • Pre-labeling Pipeline — Use a weak model to generate initial labels, reducing annotator effort to verification rather than creation.
  • Version Control — Track annotation schema changes, label corrections, and dataset snapshots with full audit trails.

Quick Start

unzip data-labeling-pipeline.zip && cd data-labeling-pipeline
pip install -r requirements.txt

# Initialize a labeling project
python src/data_labeling_pipeline/core.py init --config config.example.yaml
# config.example.yaml
project:
  name: product_classification
  task_type: image_classification  # image_classification | ner | bbox | text_classification
  classes: [electronics, clothing, furniture, food, other]

quality:
  min_annotators_per_sample: 2
  agreement_threshold: 0.75  # Cohen's Kappa
  review_gate: true
  auto_flag_low_confidence: true

active_learning:
  enabled: true
  strategy: uncertainty  # uncertainty | committee | diversity | hybrid
  model_checkpoint: ./models/weak_classifier.pt
  query_batch_size: 100
  retrain_every_n_labels: 500

export:
  format: coco  # coco | voc | yolo | huggingface | csv
  output_dir: ./labeled_data/
  train_val_split: 0.85
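To make the quality settings above concrete, here is a minimal sketch of the consensus gate they imply: a sample is auto-approved only when enough annotators have labeled it and they agree strongly enough, otherwise it is routed to review. The function name is hypothetical, and for simplicity the threshold is applied to the raw majority fraction rather than Cohen's Kappa, which the toolkit uses.

```python
from collections import Counter

# Hypothetical sketch of the consensus gate implied by the quality settings
# above; the real engine scores agreement with Cohen's Kappa, here we use
# the simpler majority fraction for illustration.
def consensus_decision(labels, min_annotators=2, agreement_threshold=0.75):
    """Return (label, "approved") or (None, "needs_review")."""
    if len(labels) < min_annotators:
        return None, "needs_review"          # not enough annotations yet
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Fraction of annotators agreeing with the majority label
    if top_count / len(labels) >= agreement_threshold:
        return top_label, "approved"
    return None, "needs_review"

# 2 of 3 annotators agree (0.67 < 0.75), so this sample goes to review
print(consensus_decision(["electronics", "electronics", "clothing"]))
```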

Architecture

┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│  Raw Data    │────>│  Pre-Labeler  │────>│  Task Queue  │
│  Ingestion   │     │ (Weak Model)  │     │  (Priority)  │
└──────────────┘     └───────────────┘     └──────┬───────┘
                                                   │
┌──────────────┐     ┌───────────────┐     ┌──────▼───────┐
│  QA & Review │<────│  Consensus    │<────│  Annotators  │
│  Dashboard   │     │  Engine       │     │  (Human)     │
└──────┬───────┘     └───────────────┘     └──────────────┘
       │
┌──────▼───────┐     ┌───────────────┐
│  Active      │────>│  Export &     │
│  Learning    │     │  Versioning   │
└──────────────┘     └───────────────┘

Usage Examples

Calculate Inter-Annotator Agreement

from data_labeling_pipeline.core import QualityAssurance

qa = QualityAssurance()

# Load annotations from two annotators
annotations_a = qa.load_annotations("annotator_1.json")
annotations_b = qa.load_annotations("annotator_2.json")

# Calculate agreement
report = qa.compute_agreement(annotations_a, annotations_b)
print(f"Cohen's Kappa: {report['cohens_kappa']:.3f}")
print(f"Percent Agreement: {report['percent_agreement']:.1%}")
print(f"Disagreement samples: {len(report['disagreements'])}")

# Flag samples needing re-review
qa.flag_for_review(report["disagreements"], reason="annotator_disagreement")
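If you want to see what a report like this computes under the hood, here is a minimal, dependency-free sketch of Cohen's Kappa (the toolkit's internals may differ; scikit-learn's `cohen_kappa_score` produces the same number):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's Kappa for two annotators labeling the same samples."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    # chance agreement: both annotators independently pick the same class
    p_e = sum((ca[c] / n) * (cb[c] / n) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

labels_a = ["cat", "cat", "dog", "dog", "cat", "dog"]
labels_b = ["cat", "dog", "dog", "dog", "cat", "dog"]

disagreements = [i for i, (x, y) in enumerate(zip(labels_a, labels_b)) if x != y]
print(f"Cohen's Kappa: {cohens_kappa(labels_a, labels_b):.3f}")  # 0.667
print(f"Disagreement samples: {disagreements}")                  # [1]
```

Note that Kappa corrects observed agreement for chance: 5 of 6 labels match (83%), but because both annotators favor the same classes, the chance-corrected score drops to 0.667.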

Active Learning Sample Selection

from data_labeling_pipeline.core import ActiveLearner
import torch

# Load your current weak model
model = torch.load("./models/weak_classifier.pt")
model.eval()

learner = ActiveLearner(
    model=model,
    strategy="uncertainty",
    pool_size=10000,
)

# Select the 100 most informative unlabeled samples
selected_indices = learner.query(n_instances=100)
print(f"Selected {len(selected_indices)} samples for labeling")
print(f"Average uncertainty: {learner.last_uncertainty_mean:.4f}")

# Export selected samples to labeling tool
learner.export_to_label_studio(selected_indices, project_id="product_cls")
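One plausible reading of `strategy="uncertainty"` above is entropy-based sampling: rank unlabeled samples by the entropy of the weak model's predicted class probabilities and take the top n. A self-contained sketch (function names here are illustrative, not the toolkit's API):

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_query(pool_probs, n_instances):
    """Return indices of the n most uncertain samples, most uncertain first."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:n_instances]

pool = [
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> high entropy, most informative
    [0.70, 0.20, 0.10],   # moderately uncertain
]
print(uncertainty_query(pool, 2))  # [1, 2]
```

The intuition: samples the model is already sure about add little signal, so labeling budget goes to the ones near the decision boundary.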

Convert Between Annotation Formats

from data_labeling_pipeline.utils import FormatConverter

converter = FormatConverter()

# COCO to YOLO
converter.convert(
    input_path="./annotations/coco_annotations.json",
    input_format="coco",
    output_path="./annotations/yolo/",
    output_format="yolo",
    image_dir="./images/",
)

# Export to HuggingFace Datasets
converter.to_huggingface(
    annotations_path="./annotations/coco_annotations.json",
    output_dir="./hf_dataset/",
    push_to_hub=False,
)
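The core coordinate math behind a COCO-to-YOLO conversion like the one above is small enough to show inline: COCO stores boxes as `[x_min, y_min, width, height]` in absolute pixels, while YOLO expects `[x_center, y_center, width, height]` normalized to the image size. A sketch (the helper name is illustrative):

```python
def coco_bbox_to_yolo(bbox, img_w, img_h):
    """Convert a COCO [x_min, y_min, w, h] pixel box to YOLO's
    normalized [x_center, y_center, w, h] format."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / img_w
    y_center = (y_min + h / 2) / img_h
    return [x_center, y_center, w / img_w, h / img_h]

# A 100x50 box with top-left corner at (200, 150) in a 640x480 image
print(coco_bbox_to_yolo([200, 150, 100, 50], 640, 480))
```

Getting this convention wrong (corner vs. center, pixels vs. normalized) is the most common cause of silently misplaced boxes after conversion, which is why a single tested converter beats ad-hoc scripts.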

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `quality.min_annotators_per_sample` | int | 2 | Annotations required before consensus |
| `quality.agreement_threshold` | float | 0.75 | Minimum Kappa for auto-approval |
| `active_learning.strategy` | str | uncertainty | Sample selection strategy |
| `active_learning.query_batch_size` | int | 100 | Samples per active learning round |
| `export.train_val_split` | float | 0.85 | Train/validation split ratio |

Best Practices

  1. Start with clear labeling guidelines — Write a labeling guide with edge case examples before any annotation begins. Ambiguous guidelines are the #1 source of label noise.
  2. Use pre-labeling for > 1000 samples — Even a weak model (70% accuracy) cuts annotator time in half. Humans are faster at verification than creation.
  3. Monitor agreement continuously — Don't wait until the end to check quality. Run QA after every 200 labels and retrain annotators if Kappa drops below 0.7.
  4. Version your label schema — When you add or rename classes mid-project, track the change and re-map historical annotations.
  5. Sample for review, don't review everything — Spot-check 10-15% of labels from each annotator. Full review is cost-prohibitive and unnecessary with good annotators.
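Practice 5 is easy to operationalize. A sketch of seeded spot-check sampling, so audits are reproducible (function and parameter names are illustrative):

```python
import random

def sample_for_review(sample_ids, fraction=0.12, seed=42):
    """Pick a reproducible random subset (~12% by default) for spot-checking."""
    rng = random.Random(seed)              # fixed seed -> repeatable audits
    k = max(1, round(len(sample_ids) * fraction))
    return sorted(rng.sample(sample_ids, k))

batch = list(range(200))                   # one annotator's latest 200 labels
review_set = sample_for_review(batch)
print(f"Reviewing {len(review_set)} of {len(batch)} labels")
```

Seeding matters: if a dispute arises later, you can regenerate exactly which samples were audited.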

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| Low inter-annotator agreement (<0.5) | Ambiguous label guidelines | Revise guidelines with concrete examples for each class boundary |
| Active learning selects similar samples | Diversity component missing | Switch to the hybrid strategy, which combines uncertainty and diversity sampling |
| Export fails with missing images | Image paths in annotations are absolute | Use the `--rebase-paths` flag to convert them to relative paths |
| Pre-labeling accuracy too low | Weak model undertrained | Train the weak model on at least 500 manually labeled samples first |

This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Data Labeling Pipeline with all files, templates, and documentation for $29.

Get the Full Kit →

Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.

Get the Complete Bundle →

