<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michal S</title>
    <description>The latest articles on DEV Community by Michal S (@michalsegal11).</description>
    <link>https://dev.to/michalsegal11</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3650702%2Ffdc71f29-cf73-4c52-a7e9-4edc43de885b.png</url>
      <title>DEV Community: Michal S</title>
      <link>https://dev.to/michalsegal11</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michalsegal11"/>
    <language>en</language>
    <item>
      <title>Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task</title>
      <dc:creator>Michal S</dc:creator>
      <pubDate>Sun, 07 Dec 2025 20:25:54 +0000</pubDate>
      <link>https://dev.to/michalsegal11/building-a-unified-benchmarking-pipeline-for-computer-vision-without-rewriting-code-for-every-task-3978</link>
      <guid>https://dev.to/michalsegal11/building-a-unified-benchmarking-pipeline-for-computer-vision-without-rewriting-code-for-every-task-3978</guid>
<description>&lt;p&gt;This project was developed as part of the Extra-Tech Computer Vision Bootcamp, in collaboration with Applied Materials and ExtraTech.&lt;/p&gt;

&lt;p&gt;I would like to acknowledge the mentors and instructors who supported this work throughout the bootcamp,&lt;br&gt;
particularly Daniel Berger, Sara Polikman (Applied Materials), and Sara Shimon (ExtraTech),&lt;br&gt;
for their guidance, technical insights, and continuous feedback.&lt;/p&gt;
&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;In working with advanced Computer Vision models, one challenge keeps resurfacing:&lt;br&gt;
the models evolve quickly — but the evaluation and comparison workflow remains &lt;strong&gt;fragmented&lt;/strong&gt;, cumbersome, and &lt;strong&gt;inconsistent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Classification, Detection, and Segmentation each come with &lt;strong&gt;different data formats&lt;/strong&gt;, &lt;strong&gt;different adapters&lt;/strong&gt;, and entirely &lt;strong&gt;different benchmark structures&lt;/strong&gt;.&lt;br&gt;
When every task “&lt;em&gt;speaks a different language&lt;/em&gt;,” even something as simple as comparing two models becomes &lt;strong&gt;non-trivial&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At some point, it became clear that the real challenge wasn’t the models —&lt;br&gt;
it was the &lt;strong&gt;infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;How do you build a single engine capable of running models across tasks,&lt;br&gt;
while still enforcing a critical principle:&lt;br&gt;
comparisons happen only within the &lt;strong&gt;same task&lt;/strong&gt; and the &lt;strong&gt;same benchmark&lt;/strong&gt;,&lt;br&gt;
in a way that is &lt;strong&gt;reliable&lt;/strong&gt;, consistent, and fully &lt;strong&gt;reproducible&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;The goal wasn’t to “&lt;em&gt;unify the entire world&lt;/em&gt;,”&lt;br&gt;
but to establish a &lt;strong&gt;shared language&lt;/strong&gt; within each task,&lt;br&gt;
where different models operate under the same &lt;strong&gt;evaluation environment&lt;/strong&gt; —&lt;br&gt;
the same data, the &lt;strong&gt;same metrics&lt;/strong&gt; —&lt;br&gt;
so that comparisons finally become what they should be:&lt;br&gt;
clean, fair, and &lt;strong&gt;data-driven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post presents the system I built:&lt;br&gt;
a &lt;strong&gt;Unified Benchmarking Pipeline&lt;/strong&gt; that consolidates everything needed to run and compare Computer Vision models —&lt;br&gt;
at the Task level, at the Benchmark level, and with a streamlined Developer Experience.&lt;/p&gt;

&lt;p&gt;If you’ve ever found yourself writing new scripts for every model,&lt;br&gt;
&lt;strong&gt;switching between COCO, YOLO, and PNG masks&lt;/strong&gt;,&lt;br&gt;
or trying to reproduce a past run that was never properly documented —&lt;br&gt;
this post is for you.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. The Problem: Fragmentation at Scale
&lt;/h2&gt;

&lt;p&gt;When I started running real-world experiments across &lt;strong&gt;classification, detection, and segmentation&lt;/strong&gt; tasks, I kept hitting the same wall:&lt;br&gt;&lt;br&gt;
each task had its &lt;strong&gt;own data format, its own scripts, and its own metric implementations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even a “simple” question like &lt;em&gt;“Which model is actually better?”&lt;/em&gt; turned into a manual, error-prone investigation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Data Format&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;Folder structure&lt;/td&gt;
&lt;td&gt;Label index&lt;/td&gt;
&lt;td&gt;Accuracy, F1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detection&lt;/td&gt;
&lt;td&gt;COCO/YOLO JSON/TXT&lt;/td&gt;
&lt;td&gt;Bounding boxes&lt;/td&gt;
&lt;td&gt;mAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Segmentation&lt;/td&gt;
&lt;td&gt;PNG masks&lt;/td&gt;
&lt;td&gt;Pixel-level mask&lt;/td&gt;
&lt;td&gt;IoU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Over time, this fragmentation had very concrete consequences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-reproducible experiments&lt;/strong&gt; – small differences in scripts, preprocessing, or metric code lead to results that cannot be trusted or repeated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted engineering time&lt;/strong&gt; – every new benchmark requires writing yet another custom integration instead of reusing a stable pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent comparisons&lt;/strong&gt; – Model A is evaluated with script X, Model B with script Y, so numbers look “precise” but are not truly comparable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor scalability&lt;/strong&gt; – adding a new task or dataset means duplicating logic instead of plugging into a shared evaluation engine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcp9w0sw0ojesa2sp59c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcp9w0sw0ojesa2sp59c.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  1.1 Design Goals
&lt;/h2&gt;

&lt;p&gt;To address this, I defined a set of architectural principles meant to standardize evaluation across all CV tasks while keeping the system flexible enough for real-world research workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single source of truth&lt;/strong&gt; — a benchmark should be defined entirely through configuration, not code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task-agnostic execution&lt;/strong&gt; — classification, detection, and segmentation should share the same runner interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong validation&lt;/strong&gt; — configuration errors must be detected early, before any computation begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade reliability&lt;/strong&gt; — support for concurrent executions, deterministic outputs, and clear traceability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code extensibility&lt;/strong&gt; — adding a new benchmark should require only a new configuration file, not changes to the system itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these principles established, the next challenge was determining how to represent benchmarks in a way that could generalize across task types without introducing new code paths for each one.&lt;br&gt;&lt;br&gt;
This requirement led directly to the system’s first foundational layer.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Layer 1: A Declarative Approach — One YAML Defines the Entire Benchmark
&lt;/h2&gt;

&lt;p&gt;From the start, it was clear that the system needed a way to describe any benchmark —&lt;br&gt;
classification, detection, or segmentation — &lt;strong&gt;without introducing new code paths each time&lt;/strong&gt;.&lt;br&gt;
The solution was to adopt a fully declarative representation: a benchmark defined entirely in YAML.&lt;/p&gt;

&lt;p&gt;This choice established a single, consistent interface for the entire evaluation pipeline.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Every benchmark is a self-contained YAML specification&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Below is an actual example taken from the production environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: plantdoc_cls
task: classification
domain: "plant_disease"

benchmark:
  id: "plantdoc_cls"
  name: "PlantDoc-Classification"
  task: "classification"
  version: "v1"

dataset:
  kind: "classification_folder"
  remote:
    bucket: "datasets"
    prefix: "PlantDoc-Classification/v1/PlantDoc-Dataset"
  cache_dir: "~/.cache/agvision/datasets/plantdoc_cls"
  train_dir: "train"
  val_dir: "test"
  extensions: [".jpg", ".jpeg", ".png"]
  class_names_file: null

eval:
  batch_size: 16
  average: "macro"
  metrics: ["accuracy", "precision", "recall", "f1"]
  device: "auto"
  params:
    num_workers: 2
    shuffle: false

outputs:
  root_dir: "runs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
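&lt;p&gt;Because the spec is plain data, it can be inspected with any standard YAML loader before a single model is touched. A minimal sketch (assuming PyYAML is available; the snippet inlines only the &lt;code&gt;eval&lt;/code&gt; fragment of the spec above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

# A fragment of the benchmark spec shown above, inlined for the sketch.
spec = yaml.safe_load(
    "eval:\n"
    "  batch_size: 16\n"
    "  metrics: [accuracy, precision, recall, f1]\n"
)

print(spec["eval"]["batch_size"])   # 16
print(spec["eval"]["metrics"])      # ['accuracy', 'precision', 'recall', 'f1']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same loader handles a detection or segmentation spec unchanged; the task-specific meaning is applied later, by the schema and the runners.&lt;/p&gt;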



&lt;h2&gt;
  
  
  2.1 Why YAML?
&lt;/h2&gt;

&lt;p&gt;A unified pipeline cannot rely on task-specific Python scripts, because any code-level definition introduces&lt;br&gt;
inconsistencies, version drift, and duplicated logic.&lt;/p&gt;

&lt;p&gt;YAML provides several advantages that directly support reproducible evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable structure&lt;/strong&gt; — Engineers can review, edit, and reason about benchmarks without diving into code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control compatibility&lt;/strong&gt; — Benchmark definitions live in Git, enabling consistent experiments across users and environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear separation of concerns&lt;/strong&gt; — The dataset, evaluation rules, and output structure are declared as data, not hard-coded in the logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict validation&lt;/strong&gt; — Each configuration is validated against a typed schema before execution, eliminating malformed or incomplete definitions early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This declarative model ensures that the &lt;em&gt;definition&lt;/em&gt; of a benchmark is independent from its &lt;em&gt;execution&lt;/em&gt;,&lt;br&gt;
which is essential when supporting multiple tasks and dataset formats through a single unified engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  2.2 Configuration as a Contract
&lt;/h2&gt;

&lt;p&gt;One of the most important architectural insights was recognizing that the YAML file acts as a &lt;strong&gt;contract&lt;/strong&gt;&lt;br&gt;
between all components of the system.&lt;/p&gt;

&lt;p&gt;Each benchmark specification encapsulates the expectations and responsibilities of the following roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark authors&lt;/strong&gt; — define the task type, dataset layout, and evaluation criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution engine&lt;/strong&gt; — interprets the validated configuration and runs the evaluation deterministically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model providers&lt;/strong&gt; — supply models compatible with a given task without needing to adjust code for each dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI/API clients&lt;/strong&gt; — trigger runs, compare results, and inspect outputs through a stable, configuration-driven interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This contract-based structure ensures consistent behavior across tasks, datasets, and users.&lt;br&gt;
Even as new benchmarks are introduced, the underlying pipeline remains unchanged.&lt;/p&gt;


&lt;h3&gt;
  
  
  The practical impact
&lt;/h3&gt;

&lt;p&gt;Defining benchmarks declaratively leads to a major simplification:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Adding a new benchmark requires only providing a new YAML file.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No new scripts.&lt;br&gt;&lt;br&gt;
No branching logic.&lt;br&gt;&lt;br&gt;
No duplicated preprocessing or metric code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fdr8wknc0aurr90pao7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fdr8wknc0aurr90pao7.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This design directly addresses the scalability issues described earlier and removes a significant amount of engineering overhead.&lt;/p&gt;


&lt;h2&gt;
  
  
  2.3 From YAML to a Typed, Executable Specification
&lt;/h2&gt;

&lt;p&gt;While YAML is expressive and accessible, it is inherently untyped.&lt;br&gt;
To ensure that evaluations are reliable and deterministic, the system transforms each YAML file into&lt;br&gt;
a &lt;strong&gt;strongly typed configuration object&lt;/strong&gt; as part of the AppConfig layer.&lt;/p&gt;

&lt;p&gt;This conversion step provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt; — catching missing fields, incompatible types, or invalid structures before execution begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; — resolving paths, defaults, and device selection in a predictable way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-field consistency checks&lt;/strong&gt; — ensuring, for example, that the task type matches the dataset adapter and metric set.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer is foundational for the system’s reliability and is what enables the benchmark pipeline&lt;br&gt;
to scale across tasks, datasets, and model types without sacrificing correctness.&lt;/p&gt;

&lt;p&gt;The next section describes this layer in detail.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Layer 2: The AppConfig Layer — From YAML to Executable Specification
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7smla4kdrq0zn1xms8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7smla4kdrq0zn1xms8n.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AppConfig layer validates all inputs, blocking invalid YAML before execution.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;3.1 The AppConfig Architecture&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PathsConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatasetConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AppConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Full application configuration object consumed by the UI worker and runners.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskType&lt;/span&gt;
    &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Domain&lt;/span&gt;
    &lt;span class="n"&gt;benchmark_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DatasetConfig&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelConfig&lt;/span&gt;
    &lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvalConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EvalConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PathsConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PathsConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LoggingConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LoggingConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F198f9wjoi7fd9bpgvuey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F198f9wjoi7fd9bpgvuey.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DatasetConfig, EvalConfig, and TaskConfig are validated independently and then fused into a single typed AppConfig — the unified configuration object that drives the entire benchmarking pipeline.&lt;/p&gt;
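&lt;p&gt;To make the cross-field check concrete, here is a hedged sketch of what a task-vs-dataset-kind validator could look like (assuming Pydantic v2; the &lt;code&gt;COMPATIBLE_KINDS&lt;/code&gt; mapping and the reduced field set are illustrative, not the production schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from enum import Enum
from pydantic import BaseModel, model_validator

class TaskType(str, Enum):
    CLASSIFICATION = "classification"
    DETECTION = "detection"
    SEGMENTATION = "segmentation"

# Illustrative mapping: which dataset kinds are valid for each task.
COMPATIBLE_KINDS = {
    TaskType.CLASSIFICATION: {"classification_folder"},
    TaskType.DETECTION: {"coco_detection", "yolo_detection"},
    TaskType.SEGMENTATION: {"mask_segmentation"},
}

class DatasetConfig(BaseModel):
    kind: str

class AppConfig(BaseModel):
    task: TaskType
    dataset: DatasetConfig

    @model_validator(mode="after")
    def task_matches_dataset(self):
        # Cross-field check: reject a config whose dataset kind
        # does not belong to the declared task.
        if self.dataset.kind not in COMPATIBLE_KINDS[self.task]:
            raise ValueError(
                f"dataset kind {self.dataset.kind!r} is invalid "
                f"for task {self.task.value!r}"
            )
        return self
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this in place, a detection task pointed at a &lt;code&gt;classification_folder&lt;/code&gt; dataset fails at load time with an explicit message, instead of failing mid-run.&lt;/p&gt;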


&lt;h3&gt;
  
  
  &lt;strong&gt;3.2 What the AppConfig Layer Provides&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type Safety&lt;/strong&gt; — Python’s type system guarantees that each field adheres to the expected structure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Validation&lt;/strong&gt; — Invalid configurations are rejected early with clear, actionable error messages.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; — Paths are resolved, defaults applied, and device selection handled consistently.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-field Validation&lt;/strong&gt; — Ensures consistency across related fields (e.g., task type ↔ dataset kind).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-documenting Structure&lt;/strong&gt; — Field descriptions act as built-in documentation for maintainers and users.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE Support&lt;/strong&gt; — Full autocomplete, static analysis, and type hints improve the development experience.
&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;3.3 Loading and Validation Flow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The configuration lifecycle includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading the YAML file
&lt;/li&gt;
&lt;li&gt;Schema validation
&lt;/li&gt;
&lt;li&gt;Normalization of paths and defaults
&lt;/li&gt;
&lt;li&gt;Creation of the strongly typed &lt;code&gt;AppConfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Execution only after the configuration is fully validated
&lt;/li&gt;
&lt;/ul&gt;
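&lt;p&gt;The five steps above can be sketched end to end. This is a deliberately reduced stand-in (a stdlib dataclass plus PyYAML, not the real &lt;code&gt;AppConfig&lt;/code&gt;), but it shows the order of operations: nothing executes until a typed object exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSpec:
    # Minimal typed stand-in for the real EvalConfig, used only in this sketch.
    batch_size: int
    metrics: tuple

def load_benchmark(text):
    raw = yaml.safe_load(text)                    # 1. load the YAML
    if "eval" not in raw:                         # 2. schema validation
        raise ValueError("missing required section: eval")
    ev = raw["eval"]
    batch = int(ev.get("batch_size", 16))         # 3. normalize and apply defaults
    spec = EvalSpec(batch_size=batch,             # 4. strongly typed object
                    metrics=tuple(ev.get("metrics", ())))
    return spec                                   # 5. execution may start only now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;load_benchmark("name: broken")&lt;/code&gt; raises immediately, before any dataset download or GPU work begins.&lt;/p&gt;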

&lt;p&gt;&lt;strong&gt;Key benefit:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Invalid configurations are caught immediately, with detailed error messages — long before any GPU time is wasted.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Layer 3: Dataset Adapters — Unifying Heterogeneous Formats&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once a benchmark configuration is validated, the next challenge is handling &lt;strong&gt;heterogeneous dataset formats&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Each task type relies on completely different on-disk structures, annotation schemas, and iteration patterns.&lt;br&gt;&lt;br&gt;
To unify these differences, the system applies a consistent &lt;strong&gt;Adapter-based architecture&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;4.1 The Adapter Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Adapter pattern provides a uniform iteration interface for all dataset types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4.2 Adapters Include&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClassificationFolderAdapter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CocoDetectionAdapter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YoloDetectionAdapter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MaskSegmentationAdapter&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each adapter encapsulates dataset-specific loading logic and exposes a unified interface to the evaluation pipeline.&lt;/p&gt;
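&lt;p&gt;As an illustration of the pattern, a stdlib-only sketch of a folder-per-class adapter (the real &lt;code&gt;ClassificationFolderAdapter&lt;/code&gt; also handles the remote cache and &lt;code&gt;class_names_file&lt;/code&gt; options shown in the YAML earlier; both are omitted here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

class ClassificationFolderAdapter:
    """Iterates a folder-per-class layout, yielding (image_path, class_index)."""

    def __init__(self, root, extensions=(".jpg", ".jpeg", ".png")):
        self.root = Path(root)
        # Sorted class names give deterministic, reproducible label indices.
        self.class_names = sorted(p.name for p in self.root.iterdir() if p.is_dir())
        self._index = {name: i for i, name in enumerate(self.class_names)}
        self.extensions = set(extensions)

    def __iter__(self):
        for cls in self.class_names:
            for f in sorted((self.root / cls).iterdir()):
                if f.suffix.lower() in self.extensions:
                    yield f, self._index[cls]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A COCO or mask adapter would differ internally (JSON parsing, PNG decoding) but expose the same &lt;code&gt;for image, target in adapter&lt;/code&gt; loop to the pipeline.&lt;/p&gt;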




&lt;h3&gt;
  
  
  &lt;strong&gt;4.3 What Each Adapter Does&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every adapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt; the dataset (images, annotations, metadata)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalizes&lt;/strong&gt; annotation formats into a consistent internal structure
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposes standardized outputs&lt;/strong&gt; across different CV tasks
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design removes dozens of conditional branches and eliminates format-specific parsing inside the runners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xf00stlmh89dfsqgpgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xf00stlmh89dfsqgpgf.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4.4 Why This Design Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single execution model&lt;/strong&gt; — the &lt;code&gt;run()&lt;/code&gt; method is identical across all tasks.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated task-specific logic&lt;/strong&gt; — only preprocessing, postprocessing, and metrics differ.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy extensibility&lt;/strong&gt; — adding a new task requires implementing only a small abstract interface.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Highly testable&lt;/strong&gt; — each adapter can be independently unit tested.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintainable&lt;/strong&gt; — changes to the evaluation flow propagate uniformly across all tasks.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Layer 4: Task Runners — Executing Models Consistently Across Benchmarks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once datasets are unified through adapters, the next layer is responsible for &lt;strong&gt;executing models in a consistent and task-agnostic way&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The system includes three modular runners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClassifierRunner&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DetectorRunner&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SegmenterRunner&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All runners expose the same execution API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;What Each Runner Handles&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every runner is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forward passes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output normalization&lt;/strong&gt; — mapping raw model outputs into a unified internal format
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prediction logging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric computation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact generation&lt;/strong&gt; — saving predictions, overlays, and runtime metadata
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time UI reporting&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This unified execution model ensures that &lt;strong&gt;any model&lt;/strong&gt; can run on &lt;strong&gt;any benchmark&lt;/strong&gt;, as long as the configuration matches the required task type.&lt;/p&gt;
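&lt;p&gt;The shared &lt;code&gt;run()&lt;/code&gt; contract can be sketched as a template method: the evaluation loop is written once, and only the task-specific hooks differ. The hook names below are illustrative, not the production interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from abc import ABC, abstractmethod

class BaseRunner(ABC):
    """Template method: run() is identical for every task; subclasses
    supply only preprocessing, postprocessing, and metrics."""

    def run(self, dataset, model, config):
        preds, targets = [], []
        for image, target in dataset:             # unified adapter interface
            x = self.preprocess(image, config)
            raw = model(x)                        # forward pass
            preds.append(self.postprocess(raw))
            targets.append(target)
        return self.compute_metrics(preds, targets)

    @abstractmethod
    def preprocess(self, image, config): ...

    @abstractmethod
    def postprocess(self, raw): ...

    @abstractmethod
    def compute_metrics(self, preds, targets): ...

class ClassifierRunner(BaseRunner):
    def preprocess(self, image, config):
        return image                              # identity transform, for the sketch

    def postprocess(self, raw):
        return raw

    def compute_metrics(self, preds, targets):
        correct = sum(p == t for p, t in zip(preds, targets))
        return {"accuracy": correct / len(targets)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A &lt;code&gt;DetectorRunner&lt;/code&gt; or &lt;code&gt;SegmenterRunner&lt;/code&gt; would override the same three hooks (box decoding and mAP, or mask thresholding and IoU) while reusing the loop unchanged.&lt;/p&gt;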

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka9uhv0cuf9rgrg7ceq6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka9uhv0cuf9rgrg7ceq6.jpg" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Bringing All Layers Together&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After breaking down the system into its individual layers — &lt;strong&gt;Dataset Adapters&lt;/strong&gt;, &lt;strong&gt;Task-specific Runners&lt;/strong&gt;, the &lt;strong&gt;YAML-driven AppConfig&lt;/strong&gt;, and the &lt;strong&gt;evaluation engine&lt;/strong&gt; — it becomes useful to step back and look at the architecture from a higher level.&lt;/p&gt;

&lt;p&gt;The diagram below illustrates how all components interact within a single unified pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how datasets flow into adapters,
&lt;/li&gt;
&lt;li&gt;how models are loaded and normalized,
&lt;/li&gt;
&lt;li&gt;how runners orchestrate the evaluation,
&lt;/li&gt;
&lt;li&gt;and how results propagate back into metrics, artifacts, and the UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5vhxywqxs9fbgrln1ru.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5vhxywqxs9fbgrln1ru.jpg" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;6. From Script to System: Client–Server Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To support multiple users and parallel evaluations, the project evolved from a local script into a fully scalable &lt;strong&gt;Client–Server architecture&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This shift enabled distributed execution, resource sharing, and robust management of concurrent evaluation workloads.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Server Responsibilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The server layer centralizes orchestration and reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Job scheduling&lt;/strong&gt; — organizing evaluation tasks and assigning them to available workers
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue management&lt;/strong&gt; — ensuring ordered, predictable processing of multiple evaluation requests
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; — distributing workloads efficiently across workers or compute nodes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact storage (MinIO)&lt;/strong&gt; — storing predictions, logs, and evaluation outputs as versioned artifacts
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model and version tracking&lt;/strong&gt; — maintaining reproducible mappings between models, benchmarks, and outputs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure isolation&lt;/strong&gt; — preventing individual crashes from affecting the broader system
&lt;/li&gt;
&lt;/ul&gt;
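&lt;p&gt;The queue-management and failure-isolation responsibilities can be sketched with the standard library alone. This is a toy single-process analogue of the real server, which runs separate workers and stores artifacts in MinIO:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import queue
import threading

def worker(jobs, results):
    """Drain the job queue forever; a crash in one job is recorded,
    not propagated, so the worker keeps serving later requests."""
    while True:
        job = jobs.get()
        if job is None:                      # sentinel: orderly shutdown
            jobs.task_done()
            break
        try:
            results.append(("ok", job()))
        except Exception as exc:             # failure isolation
            results.append(("failed", type(exc).__name__))
        finally:
            jobs.task_done()

jobs = queue.Queue()
results = []
threading.Thread(target=worker, args=(jobs, results)).start()
jobs.put(lambda: 21 * 2)                     # a job that succeeds
jobs.put(lambda: 1 / 0)                      # a job that crashes
jobs.put(None)                               # shut the worker down
jobs.join()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After the queue drains, &lt;code&gt;results&lt;/code&gt; contains one success and one recorded failure; the crashing job never took the worker down with it.&lt;/p&gt;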




&lt;h2&gt;
  
  
  &lt;strong&gt;7. Client (PyQt) Responsibilities&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The PyQt-based desktop client provides an accessible front end for researchers and engineers, handling all user-driven interactions with the evaluation pipeline.&lt;/p&gt;

&lt;p&gt;Its responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uploading models&lt;/strong&gt; — loading ONNX or task-specific formats into the system
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selecting benchmarks&lt;/strong&gt; — choosing the appropriate YAML specification for each evaluation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring runs&lt;/strong&gt; — device selection, batch size, metric presets, and overrides
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time logs&lt;/strong&gt; — streaming progress, status messages, and intermediate outputs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparing metrics across runs&lt;/strong&gt; — visualizing performance differences between models and benchmarks
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downloading prediction artifacts&lt;/strong&gt; — retrieving images, overlays, and structured outputs generated by the runners
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This client–server architecture transforms the pipeline from a single-use script into a &lt;strong&gt;scalable, interactive research tool&lt;/strong&gt; that supports parallel experimentation and consistent evaluation across teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. Key Engineering Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Throughout the development of this system, several engineering principles proved consistently valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configuration should drive execution&lt;/strong&gt; — not the other way around. A declarative benchmark definition ensures reproducibility and removes hidden logic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong validation (Pydantic) prevents hours of debugging&lt;/strong&gt; — catching structural errors before execution dramatically improves reliability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapters normalize complexity&lt;/strong&gt; — avoiding format-specific logic scattered throughout the codebase.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular runners keep task logic replaceable&lt;/strong&gt; — enabling clean extensions and isolating preprocessing, postprocessing, and metrics per task.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental evaluation is essential for real-world datasets&lt;/strong&gt; — allowing resumes, caching, and faster experimentation cycles.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client–Server separation transforms a pipeline into a production-grade system&lt;/strong&gt; — supporting parallel workloads, shared resources, and failure isolation.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;By structuring the pipeline around clear boundaries — declarative configuration, strict validation, normalized datasets, and modular execution — the system achieves something simple but important: a consistent and reproducible way to evaluate Computer Vision models across tasks and benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This foundation keeps the pipeline stable, easy to extend, and practical for real research work, without requiring new code each time the problem or dataset changes.&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
