Data Labeling Pipeline
A comprehensive annotation workflow system that brings structure and quality control to the most time-consuming part of any ML project: labeling data. This pipeline includes annotation interface templates, quality assurance scripts that catch labeler disagreements early, active learning selectors that prioritize the most informative samples, and export tools that convert annotations into every major training format. Stop burning budget on redundant labels — build a labeling operation that scales.
Key Features
- Annotation Workflow Engine — Define multi-stage labeling pipelines with review gates, consensus requirements, and automatic task routing.
- Quality Assurance Suite — Inter-annotator agreement (Cohen's Kappa, Fleiss' Kappa), outlier detection, and automatic flagging of low-confidence labels.
- Active Learning Selectors — Uncertainty sampling, query-by-committee, and diversity sampling to reduce labeling cost by 40-60%.
- Label Studio Integration — Pre-built project templates for image classification, NER, text classification, and bounding box tasks.
- Format Converters — Export to COCO, VOC, YOLO, spaCy, HuggingFace Datasets, and CSV formats with a single command.
- Labeler Performance Tracking — Per-annotator accuracy, speed, and agreement metrics with dashboard-ready JSON output.
- Pre-labeling Pipeline — Use a weak model to generate initial labels, reducing annotator effort to verification rather than creation.
- Version Control — Track annotation schema changes, label corrections, and dataset snapshots with full audit trails.
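The consensus engine's exact logic isn't documented here, but the core idea behind consensus requirements and review gates can be sketched as a majority vote that abstains on ties or too few annotations. A simplified illustration (`consensus_label` is a hypothetical helper, not part of the package API):

```python
from collections import Counter

def consensus_label(annotations, min_annotators=2):
    """Majority-vote consensus over one sample's annotations.

    Returns the winning label, or None (flag for review) when there
    are too few annotations or the top two labels are tied.
    """
    if len(annotations) < min_annotators:
        return None  # wait for more annotators before deciding
    counts = Counter(annotations).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie -> route to the review gate
    return counts[0][0]
```

Samples that return None would be routed back into the task queue for an additional annotator or a reviewer.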
Quick Start
unzip data-labeling-pipeline.zip && cd data-labeling-pipeline
pip install -r requirements.txt
# Initialize a labeling project
python src/data_labeling_pipeline/core.py init --config config.example.yaml
# config.example.yaml
project:
  name: product_classification
  task_type: image_classification  # image_classification | ner | bbox | text_classification
  classes: [electronics, clothing, furniture, food, other]

quality:
  min_annotators_per_sample: 2
  agreement_threshold: 0.75  # Cohen's Kappa
  review_gate: true
  auto_flag_low_confidence: true

active_learning:
  enabled: true
  strategy: uncertainty  # uncertainty | committee | diversity | hybrid
  model_checkpoint: ./models/weak_classifier.pt
  query_batch_size: 100
  retrain_every_n_labels: 500

export:
  format: coco  # coco | voc | yolo | huggingface | csv
  output_dir: ./labeled_data/
  train_val_split: 0.85
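The `init` command presumably validates this file on load. As a sketch of what such validation covers, a hypothetical standalone checker for the enumerated options (operating on the already-parsed config dict) might look like:

```python
# Allowed values, taken from the comments in config.example.yaml
VALID_TASKS = {"image_classification", "ner", "bbox", "text_classification"}
VALID_STRATEGIES = {"uncertainty", "committee", "diversity", "hybrid"}
VALID_FORMATS = {"coco", "voc", "yolo", "huggingface", "csv"}

def validate_config(cfg):
    """Return a list of human-readable errors; empty means the config is sane."""
    errors = []
    if cfg["project"]["task_type"] not in VALID_TASKS:
        errors.append(f"unknown task_type: {cfg['project']['task_type']}")
    if not 0.0 <= cfg["quality"]["agreement_threshold"] <= 1.0:
        errors.append("agreement_threshold must be in [0, 1]")
    if cfg["active_learning"]["strategy"] not in VALID_STRATEGIES:
        errors.append(f"unknown strategy: {cfg['active_learning']['strategy']}")
    if cfg["export"]["format"] not in VALID_FORMATS:
        errors.append(f"unknown export format: {cfg['export']['format']}")
    if not 0.0 < cfg["export"]["train_val_split"] < 1.0:
        errors.append("train_val_split must be in (0, 1)")
    return errors
```

Failing fast on a bad config is cheaper than discovering it after annotators have started working against the wrong schema.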
Architecture
┌──────────────┐     ┌───────────────┐     ┌──────────────┐
│   Raw Data   │────>│  Pre-Labeler  │────>│  Task Queue  │
│  Ingestion   │     │ (Weak Model)  │     │  (Priority)  │
└──────────────┘     └───────────────┘     └──────┬───────┘
                                                  │
┌──────────────┐     ┌───────────────┐     ┌──────▼───────┐
│ QA & Review  │<────│   Consensus   │<────│  Annotators  │
│  Dashboard   │     │    Engine     │     │   (Human)    │
└──────┬───────┘     └───────────────┘     └──────────────┘
       │
┌──────▼───────┐     ┌───────────────┐
│    Active    │────>│   Export &    │
│   Learning   │     │  Versioning   │
└──────────────┘     └───────────────┘
Usage Examples
Calculate Inter-Annotator Agreement
from data_labeling_pipeline.core import QualityAssurance
qa = QualityAssurance()
# Load annotations from two annotators
annotations_a = qa.load_annotations("annotator_1.json")
annotations_b = qa.load_annotations("annotator_2.json")
# Calculate agreement
report = qa.compute_agreement(annotations_a, annotations_b)
print(f"Cohen's Kappa: {report['cohens_kappa']:.3f}")
print(f"Percent Agreement: {report['percent_agreement']:.1%}")
print(f"Disagreement samples: {len(report['disagreements'])}")
# Flag samples needing re-review
qa.flag_for_review(report["disagreements"], reason="annotator_disagreement")
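For reference, Cohen's Kappa corrects raw percent agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the chance agreement implied by each annotator's label frequencies. A minimal self-contained implementation (illustrative only; the package's own computation may differ in edge-case handling):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same samples."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of samples labeled identically
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of marginal frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Note that Kappa can be much lower than percent agreement when one class dominates, which is exactly why it is the better auto-approval criterion.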
Active Learning Sample Selection
from data_labeling_pipeline.core import ActiveLearner
import torch
# Load your current weak model (a pickled full module, so weights_only=False
# is required on PyTorch >= 2.6, where the default changed)
model = torch.load("./models/weak_classifier.pt", weights_only=False)
model.eval()
learner = ActiveLearner(
    model=model,
    strategy="uncertainty",
    pool_size=10000,
)
# Select the 100 most informative unlabeled samples
selected_indices = learner.query(n_instances=100)
print(f"Selected {len(selected_indices)} samples for labeling")
print(f"Average uncertainty: {learner.last_uncertainty_mean:.4f}")
# Export selected samples to labeling tool
learner.export_to_label_studio(selected_indices, project_id="product_cls")
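The uncertainty strategy typically ranks unlabeled samples by predictive entropy and queries the top n. A standalone sketch of that selection, operating on softmax outputs (not the package's actual implementation):

```python
import math

def entropy(probs):
    """Shannon entropy of one sample's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_query(pred_probs, n_instances):
    """Indices of the n samples the model is least certain about."""
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]),
                    reverse=True)
    return ranked[:n_instances]
```

A near-uniform distribution like [0.5, 0.5] has maximum entropy and is queried first; a confident [0.9, 0.1] prediction is left in the pool.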
Convert Between Annotation Formats
from data_labeling_pipeline.utils import FormatConverter
converter = FormatConverter()
# COCO to YOLO
converter.convert(
    input_path="./annotations/coco_annotations.json",
    input_format="coco",
    output_path="./annotations/yolo/",
    output_format="yolo",
    image_dir="./images/",
)
# Export to HuggingFace Datasets
converter.to_huggingface(
    annotations_path="./annotations/coco_annotations.json",
    output_dir="./hf_dataset/",
    push_to_hub=False,
)
Configuration Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| quality.min_annotators_per_sample | int | 2 | Annotations required before consensus |
| quality.agreement_threshold | float | 0.75 | Minimum Kappa for auto-approval |
| active_learning.strategy | str | uncertainty | Sample selection strategy |
| active_learning.query_batch_size | int | 100 | Samples per active learning round |
| export.train_val_split | float | 0.85 | Train/validation split ratio |
Best Practices
- Start with clear labeling guidelines — Write a labeling guide with edge case examples before any annotation begins. Ambiguous guidelines are the #1 source of label noise.
- Use pre-labeling for > 1000 samples — Even a weak model (70% accuracy) cuts annotator time in half. Humans are faster at verification than creation.
- Monitor agreement continuously — Don't wait until the end to check quality. Run QA after every 200 labels and retrain annotators if Kappa drops below 0.7.
- Version your label schema — When you add or rename classes mid-project, track the change and re-map historical annotations.
- Sample for review, don't review everything — Spot-check 10-15% of labels from each annotator. Full review is cost-prohibitive and unnecessary with good annotators.
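The spot-check practice above is easy to automate with a reproducible per-annotator sample (a hypothetical helper; the fraction and seed defaults are assumptions):

```python
import random

def spot_check_sample(labels_by_annotator, fraction=0.1, seed=42):
    """Draw a reproducible spot-check subset (e.g. 10%) of each
    annotator's completed labels for manual review."""
    rng = random.Random(seed)  # fixed seed so reviews are auditable
    return {
        annotator: rng.sample(labels, max(1, round(len(labels) * fraction)))
        for annotator, labels in labels_by_annotator.items()
    }
```

Fixing the seed means a reviewer can later reproduce exactly which labels were checked, which matters for the audit trail.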
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Low inter-annotator agreement (<0.5) | Ambiguous label guidelines | Revise guidelines with concrete examples for each class boundary |
| Active learning selects similar samples | Diversity component missing | Switch to hybrid strategy which combines uncertainty + diversity |
| Export fails with missing images | Image paths in annotations are absolute | Use --rebase-paths flag to convert to relative paths |
| Pre-labeling accuracy too low | Weak model undertrained | Train weak model on at least 500 manually labeled samples first |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Data Labeling Pipeline with all files, templates, and documentation for $29.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.