<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michal S</title>
    <description>The latest articles on DEV Community by Michal S (@michalsegal11).</description>
    <link>https://dev.to/michalsegal11</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3650702%2Ffdc71f29-cf73-4c52-a7e9-4edc43de885b.png</url>
      <title>DEV Community: Michal S</title>
      <link>https://dev.to/michalsegal11</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michalsegal11"/>
    <language>en</language>
    <item>
      <title>Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task</title>
      <dc:creator>Michal S</dc:creator>
      <pubDate>Sun, 07 Dec 2025 20:25:54 +0000</pubDate>
      <link>https://dev.to/michalsegal11/building-a-unified-benchmarking-pipeline-for-computer-vision-without-rewriting-code-for-every-task-3978</link>
      <guid>https://dev.to/michalsegal11/building-a-unified-benchmarking-pipeline-for-computer-vision-without-rewriting-code-for-every-task-3978</guid>
<description>&lt;p&gt;This project was developed as part of the Extra-Tech Computer Vision Bootcamp, in collaboration with Applied Materials and ExtraTech.&lt;/p&gt;

&lt;p&gt;I would like to acknowledge the mentors and instructors who supported this work throughout the bootcamp,&lt;br&gt;
particularly Daniel Berger, Sara Polikman (Applied Materials), and Sara Shimon (ExtraTech),&lt;br&gt;
for their guidance, technical insights, and continuous feedback.&lt;/p&gt;
&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;In working with advanced Computer Vision models, one challenge keeps resurfacing:&lt;br&gt;
the models evolve quickly — but the evaluation and comparison workflow remains &lt;strong&gt;fragmented&lt;/strong&gt;, cumbersome, and &lt;strong&gt;inconsistent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Classification, Detection, and Segmentation each come with &lt;strong&gt;different data formats&lt;/strong&gt;, &lt;strong&gt;different adapters&lt;/strong&gt;, and entirely &lt;strong&gt;different benchmark structures&lt;/strong&gt;.&lt;br&gt;
When every task “&lt;em&gt;speaks a different language&lt;/em&gt;,” even something as simple as comparing two models becomes &lt;strong&gt;non-trivial&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At some point, it became clear that the real challenge wasn’t the models —&lt;br&gt;
it was the &lt;strong&gt;infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;How do you build a single engine capable of running models across tasks,&lt;br&gt;
while still enforcing a critical principle:&lt;br&gt;
comparisons happen only within the &lt;strong&gt;same task&lt;/strong&gt; and the &lt;strong&gt;same benchmark&lt;/strong&gt;,&lt;br&gt;
in a way that is &lt;strong&gt;reliable&lt;/strong&gt;, consistent, and fully &lt;strong&gt;reproducible&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;The goal wasn’t to “&lt;em&gt;unify the entire world&lt;/em&gt;,”&lt;br&gt;
but to establish a &lt;strong&gt;shared language&lt;/strong&gt; within each task,&lt;br&gt;
where different models operate under the same &lt;strong&gt;evaluation environment&lt;/strong&gt; —&lt;br&gt;
the same data, the &lt;strong&gt;same metrics&lt;/strong&gt; —&lt;br&gt;
so that comparisons finally become what they should be:&lt;br&gt;
clean, fair, and &lt;strong&gt;data-driven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post presents the system I built:&lt;br&gt;
a &lt;strong&gt;Unified Benchmarking Pipeline&lt;/strong&gt; that consolidates everything needed to run and compare Computer Vision models —&lt;br&gt;
at the Task level, at the Benchmark level, and with a streamlined Developer Experience.&lt;/p&gt;

&lt;p&gt;If you’ve ever found yourself writing new scripts for every model,&lt;br&gt;
&lt;strong&gt;switching between COCO, YOLO, and PNG masks&lt;/strong&gt;,&lt;br&gt;
or trying to reproduce a past run that was never properly documented —&lt;br&gt;
this post is for you.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. The Problem: Fragmentation at Scale
&lt;/h2&gt;

&lt;p&gt;When I started running real-world experiments across &lt;strong&gt;classification, detection, and segmentation&lt;/strong&gt; tasks, I kept hitting the same wall:&lt;br&gt;&lt;br&gt;
each task had its &lt;strong&gt;own data format, its own scripts, and its own metric implementations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Even a “simple” question like &lt;em&gt;“Which model is actually better?”&lt;/em&gt; turned into a manual, error-prone investigation.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Data Format&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification&lt;/td&gt;
&lt;td&gt;Folder structure&lt;/td&gt;
&lt;td&gt;Label index&lt;/td&gt;
&lt;td&gt;Accuracy, F1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detection&lt;/td&gt;
&lt;td&gt;COCO/YOLO JSON/TXT&lt;/td&gt;
&lt;td&gt;Bounding boxes&lt;/td&gt;
&lt;td&gt;mAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Segmentation&lt;/td&gt;
&lt;td&gt;PNG masks&lt;/td&gt;
&lt;td&gt;Pixel-level mask&lt;/td&gt;
&lt;td&gt;IoU&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Over time, this fragmentation had very concrete consequences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-reproducible experiments&lt;/strong&gt; – small differences in scripts, preprocessing, or metric code lead to results that cannot be trusted or repeated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasted engineering time&lt;/strong&gt; – every new benchmark requires writing yet another custom integration instead of reusing a stable pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent comparisons&lt;/strong&gt; – Model A is evaluated with script X, Model B with script Y, so numbers look “precise” but are not truly comparable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor scalability&lt;/strong&gt; – adding a new task or dataset means duplicating logic instead of plugging into a shared evaluation engine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcp9w0sw0ojesa2sp59c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcp9w0sw0ojesa2sp59c.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  1.1 Design Goals
&lt;/h2&gt;

&lt;p&gt;To address this, I defined a set of architectural principles meant to standardize evaluation across all CV tasks while keeping the system flexible enough for real-world research workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single source of truth&lt;/strong&gt; — a benchmark should be defined entirely through configuration, not code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task-agnostic execution&lt;/strong&gt; — classification, detection, and segmentation should share the same runner interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong validation&lt;/strong&gt; — configuration errors must be detected early, before any computation begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-grade reliability&lt;/strong&gt; — support for concurrent executions, deterministic outputs, and clear traceability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-code extensibility&lt;/strong&gt; — adding a new benchmark should require only a new configuration file, not changes to the system itself.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these principles established, the next challenge was determining how to represent benchmarks in a way that could generalize across task types without introducing new code paths for each one.&lt;br&gt;&lt;br&gt;
This requirement led directly to the system’s first foundational layer.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Layer 1: A Declarative Approach — One YAML Defines the Entire Benchmark
&lt;/h2&gt;

&lt;p&gt;From the start, it was clear that the system needed a way to describe any benchmark —&lt;br&gt;
classification, detection, or segmentation — &lt;strong&gt;without introducing new code paths each time&lt;/strong&gt;.&lt;br&gt;
The solution was to adopt a fully declarative representation: a benchmark defined entirely in YAML.&lt;/p&gt;

&lt;p&gt;This choice established a single, consistent interface for the entire evaluation pipeline.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Every benchmark is a self-contained YAML specification&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Below is an actual example taken from the production environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: plantdoc_cls
task: classification
domain: "plant_disease"

benchmark:
  id: "plantdoc_cls"
  name: "PlantDoc-Classification"
  task: "classification"
  version: "v1"

dataset:
  kind: "classification_folder"
  remote:
    bucket: "datasets"
    prefix: "PlantDoc-Classification/v1/PlantDoc-Dataset"
  cache_dir: "~/.cache/agvision/datasets/plantdoc_cls"
  train_dir: "train"
  val_dir: "test"
  extensions: [".jpg", ".jpeg", ".png"]
  class_names_file: null

eval:
  batch_size: 16
  average: "macro"
  metrics: ["accuracy", "precision", "recall", "f1"]
  device: "auto"
  params:
    num_workers: 2
    shuffle: false

outputs:
  root_dir: "runs"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
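&lt;p&gt;Because the spec is plain data, it can be inspected with any standard YAML loader before a single model is touched. A minimal sketch (assuming PyYAML is available; the snippet inlines only the &lt;code&gt;eval&lt;/code&gt; fragment of the spec above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml

# A fragment of the benchmark spec shown above, inlined for the sketch.
spec = yaml.safe_load(
    "eval:\n"
    "  batch_size: 16\n"
    "  metrics: [accuracy, precision, recall, f1]\n"
)

print(spec["eval"]["batch_size"])   # 16
print(spec["eval"]["metrics"])      # ['accuracy', 'precision', 'recall', 'f1']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same loader handles a detection or segmentation spec unchanged; the task-specific meaning is applied later, by the schema and the runners.&lt;/p&gt;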



&lt;h2&gt;
  
  
  2.1 Why YAML?
&lt;/h2&gt;

&lt;p&gt;A unified pipeline cannot rely on task-specific Python scripts, because any code-level definition introduces&lt;br&gt;
inconsistencies, version drift, and duplicated logic.&lt;/p&gt;

&lt;p&gt;YAML provides several advantages that directly support reproducible evaluation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human-readable structure&lt;/strong&gt; — Engineers can review, edit, and reason about benchmarks without diving into code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control compatibility&lt;/strong&gt; — Benchmark definitions live in Git, enabling consistent experiments across users and environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear separation of concerns&lt;/strong&gt; — The dataset, evaluation rules, and output structure are declared as data, not hard-coded in the logic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict validation&lt;/strong&gt; — Each configuration is validated against a typed schema before execution, eliminating malformed or incomplete definitions early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This declarative model ensures that the &lt;em&gt;definition&lt;/em&gt; of a benchmark is independent from its &lt;em&gt;execution&lt;/em&gt;,&lt;br&gt;
which is essential when supporting multiple tasks and dataset formats through a single unified engine.&lt;/p&gt;


&lt;h2&gt;
  
  
  2.2 Configuration as a Contract
&lt;/h2&gt;

&lt;p&gt;One of the most important architectural insights was recognizing that the YAML file acts as a &lt;strong&gt;contract&lt;/strong&gt;&lt;br&gt;
between all components of the system.&lt;/p&gt;

&lt;p&gt;Each benchmark specification encapsulates the expectations and responsibilities of the following roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark authors&lt;/strong&gt; — define the task type, dataset layout, and evaluation criteria.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution engine&lt;/strong&gt; — interprets the validated configuration and runs the evaluation deterministically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model providers&lt;/strong&gt; — supply models compatible with a given task without needing to adjust code for each dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI/API clients&lt;/strong&gt; — trigger runs, compare results, and inspect outputs through a stable, configuration-driven interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This contract-based structure ensures consistent behavior across tasks, datasets, and users.&lt;br&gt;
Even as new benchmarks are introduced, the underlying pipeline remains unchanged.&lt;/p&gt;


&lt;h3&gt;
  
  
  The practical impact
&lt;/h3&gt;

&lt;p&gt;Defining benchmarks declaratively leads to a major simplification:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Adding a new benchmark requires only providing a new YAML file.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No new scripts.&lt;br&gt;&lt;br&gt;
No branching logic.&lt;br&gt;&lt;br&gt;
No duplicated preprocessing or metric code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fdr8wknc0aurr90pao7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fdr8wknc0aurr90pao7.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This design directly addresses the scalability issues described earlier and removes a significant amount of engineering overhead.&lt;/p&gt;


&lt;h2&gt;
  
  
  2.3 From YAML to a Typed, Executable Specification
&lt;/h2&gt;

&lt;p&gt;While YAML is expressive and accessible, it is inherently untyped.&lt;br&gt;
To ensure that evaluations are reliable and deterministic, the system transforms each YAML file into&lt;br&gt;
a &lt;strong&gt;strongly typed configuration object&lt;/strong&gt; as part of the AppConfig layer.&lt;/p&gt;

&lt;p&gt;This conversion step provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt; — catching missing fields, incompatible types, or invalid structures before execution begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; — resolving paths, defaults, and device selection in a predictable way.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-field consistency checks&lt;/strong&gt; — ensuring, for example, that the task type matches the dataset adapter and metric set.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer is foundational for the system’s reliability and is what enables the benchmark pipeline&lt;br&gt;
to scale across tasks, datasets, and model types without sacrificing correctness.&lt;/p&gt;

&lt;p&gt;The next section describes this layer in detail.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Layer 2: The AppConfig Layer — From YAML to Executable Specification
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7smla4kdrq0zn1xms8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7smla4kdrq0zn1xms8n.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AppConfig layer validates all inputs, blocking invalid YAML before execution.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;3.1 The AppConfig Architecture&lt;/strong&gt;
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PathsConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DatasetConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AppConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Full application configuration object consumed by the UI worker and runners.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TaskType&lt;/span&gt;
    &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Domain&lt;/span&gt;
    &lt;span class="n"&gt;benchmark_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DatasetConfig&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelConfig&lt;/span&gt;
    &lt;span class="nb"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;EvalConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EvalConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PathsConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PathsConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LoggingConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;LoggingConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F198f9wjoi7fd9bpgvuey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F198f9wjoi7fd9bpgvuey.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DatasetConfig, EvalConfig, and TaskConfig are validated independently and then fused into a single typed AppConfig — the unified configuration object that drives the entire benchmarking pipeline.&lt;/p&gt;
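&lt;p&gt;To make the cross-field check concrete, here is a hedged sketch of what a task-vs-dataset-kind validator could look like (assuming Pydantic v2; the &lt;code&gt;COMPATIBLE_KINDS&lt;/code&gt; mapping and the reduced field set are illustrative, not the production schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from enum import Enum
from pydantic import BaseModel, model_validator

class TaskType(str, Enum):
    CLASSIFICATION = "classification"
    DETECTION = "detection"
    SEGMENTATION = "segmentation"

# Illustrative mapping: which dataset kinds are valid for each task.
COMPATIBLE_KINDS = {
    TaskType.CLASSIFICATION: {"classification_folder"},
    TaskType.DETECTION: {"coco_detection", "yolo_detection"},
    TaskType.SEGMENTATION: {"mask_segmentation"},
}

class DatasetConfig(BaseModel):
    kind: str

class AppConfig(BaseModel):
    task: TaskType
    dataset: DatasetConfig

    @model_validator(mode="after")
    def task_matches_dataset(self):
        # Cross-field check: reject a config whose dataset kind
        # does not belong to the declared task.
        if self.dataset.kind not in COMPATIBLE_KINDS[self.task]:
            raise ValueError(
                f"dataset kind {self.dataset.kind!r} is invalid "
                f"for task {self.task.value!r}"
            )
        return self
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this in place, a detection task pointed at a &lt;code&gt;classification_folder&lt;/code&gt; dataset fails at load time with an explicit message, instead of failing mid-run.&lt;/p&gt;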


&lt;h3&gt;
  
  
  &lt;strong&gt;3.2 What the AppConfig Layer Provides&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Type Safety&lt;/strong&gt; — Python’s type system guarantees that each field adheres to the expected structure.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Validation&lt;/strong&gt; — Invalid configurations are rejected early with clear, actionable error messages.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; — Paths are resolved, defaults applied, and device selection handled consistently.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-field Validation&lt;/strong&gt; — Ensures consistency across related fields (e.g., task type ↔ dataset kind).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-documenting Structure&lt;/strong&gt; — Field descriptions act as built-in documentation for maintainers and users.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE Support&lt;/strong&gt; — Full autocomplete, static analysis, and type hints improve the development experience.
&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;3.3 Loading and Validation Flow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The configuration lifecycle includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading the YAML file
&lt;/li&gt;
&lt;li&gt;Schema validation
&lt;/li&gt;
&lt;li&gt;Normalization of paths and defaults
&lt;/li&gt;
&lt;li&gt;Creation of the strongly typed &lt;code&gt;AppConfig&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Execution only after the configuration is fully validated
&lt;/li&gt;
&lt;/ul&gt;
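&lt;p&gt;The five steps above can be sketched end to end. This is a deliberately reduced stand-in (a stdlib dataclass plus PyYAML, not the real &lt;code&gt;AppConfig&lt;/code&gt;), but it shows the order of operations: nothing executes until a typed object exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalSpec:
    # Minimal typed stand-in for the real EvalConfig, used only in this sketch.
    batch_size: int
    metrics: tuple

def load_benchmark(text):
    raw = yaml.safe_load(text)                    # 1. load the YAML
    if "eval" not in raw:                         # 2. schema validation
        raise ValueError("missing required section: eval")
    ev = raw["eval"]
    batch = int(ev.get("batch_size", 16))         # 3. normalize and apply defaults
    spec = EvalSpec(batch_size=batch,             # 4. strongly typed object
                    metrics=tuple(ev.get("metrics", ())))
    return spec                                   # 5. execution may start only now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;load_benchmark("name: broken")&lt;/code&gt; raises immediately, before any dataset download or GPU work begins.&lt;/p&gt;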

&lt;p&gt;&lt;strong&gt;Key benefit:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Invalid configurations are caught immediately, with detailed error messages — long before any GPU time is wasted.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Layer 3: Dataset Adapters — Unifying Heterogeneous Formats&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once a benchmark configuration is validated, the next challenge is handling &lt;strong&gt;heterogeneous dataset formats&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Each task type relies on completely different on-disk structures, annotation schemas, and iteration patterns.&lt;br&gt;&lt;br&gt;
To unify these differences, the system applies a consistent &lt;strong&gt;Adapter-based architecture&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;4.1 The Adapter Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Adapter pattern provides a uniform iteration interface for all dataset types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;adapter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4.2 Adapters Include&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClassificationFolderAdapter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CocoDetectionAdapter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YoloDetectionAdapter&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MaskSegmentationAdapter&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each adapter encapsulates dataset-specific loading logic and exposes a unified interface to the evaluation pipeline.&lt;/p&gt;
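&lt;p&gt;As an illustration of the pattern, a stdlib-only sketch of a folder-per-class adapter (the real &lt;code&gt;ClassificationFolderAdapter&lt;/code&gt; also handles the remote cache and &lt;code&gt;class_names_file&lt;/code&gt; options shown in the YAML earlier; both are omitted here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

class ClassificationFolderAdapter:
    """Iterates a folder-per-class layout, yielding (image_path, class_index)."""

    def __init__(self, root, extensions=(".jpg", ".jpeg", ".png")):
        self.root = Path(root)
        # Sorted class names give deterministic, reproducible label indices.
        self.class_names = sorted(p.name for p in self.root.iterdir() if p.is_dir())
        self._index = {name: i for i, name in enumerate(self.class_names)}
        self.extensions = set(extensions)

    def __iter__(self):
        for cls in self.class_names:
            for f in sorted((self.root / cls).iterdir()):
                if f.suffix.lower() in self.extensions:
                    yield f, self._index[cls]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A COCO or mask adapter would differ internally (JSON parsing, PNG decoding) but expose the same &lt;code&gt;for image, target in adapter&lt;/code&gt; loop to the pipeline.&lt;/p&gt;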




&lt;h3&gt;
  
  
  &lt;strong&gt;4.3 What Each Adapter Does&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every adapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reads&lt;/strong&gt; the dataset (images, annotations, metadata)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalizes&lt;/strong&gt; annotation formats into a consistent internal structure
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exposes standardized outputs&lt;/strong&gt; across different CV tasks
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This design removes dozens of conditional branches and eliminates format-specific parsing inside the runners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xf00stlmh89dfsqgpgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xf00stlmh89dfsqgpgf.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;4.4 Why This Design Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single execution model&lt;/strong&gt; — the &lt;code&gt;run()&lt;/code&gt; method is identical across all tasks.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolated task-specific logic&lt;/strong&gt; — only preprocessing, postprocessing, and metrics differ.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy extensibility&lt;/strong&gt; — adding a new task requires implementing only a small abstract interface.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Highly testable&lt;/strong&gt; — each adapter can be independently unit tested.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintainable&lt;/strong&gt; — changes to the evaluation flow propagate uniformly across all tasks.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Layer 4: Task Runners — Executing Models Consistently Across Benchmarks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once datasets are unified through adapters, the next layer is responsible for &lt;strong&gt;executing models in a consistent and task-agnostic way&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The system includes three modular runners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ClassifierRunner&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DetectorRunner&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;SegmenterRunner&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All runners expose the same execution API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;What Each Runner Handles&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every runner is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forward passes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output normalization&lt;/strong&gt; — mapping raw model outputs into a unified internal format
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prediction logging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric computation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact generation&lt;/strong&gt; — saving predictions, overlays, and runtime metadata
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time UI reporting&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This unified execution model ensures that &lt;strong&gt;any model&lt;/strong&gt; can run on &lt;strong&gt;any benchmark&lt;/strong&gt;, as long as the configuration matches the required task type.&lt;/p&gt;
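&lt;p&gt;The shared &lt;code&gt;run()&lt;/code&gt; contract can be sketched as a template method: the evaluation loop is written once, and only the task-specific hooks differ. The hook names below are illustrative, not the production interface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from abc import ABC, abstractmethod

class BaseRunner(ABC):
    """Template method: run() is identical for every task; subclasses
    supply only preprocessing, postprocessing, and metrics."""

    def run(self, dataset, model, config):
        preds, targets = [], []
        for image, target in dataset:             # unified adapter interface
            x = self.preprocess(image, config)
            raw = model(x)                        # forward pass
            preds.append(self.postprocess(raw))
            targets.append(target)
        return self.compute_metrics(preds, targets)

    @abstractmethod
    def preprocess(self, image, config): ...

    @abstractmethod
    def postprocess(self, raw): ...

    @abstractmethod
    def compute_metrics(self, preds, targets): ...

class ClassifierRunner(BaseRunner):
    def preprocess(self, image, config):
        return image                              # identity transform, for the sketch

    def postprocess(self, raw):
        return raw

    def compute_metrics(self, preds, targets):
        correct = sum(p == t for p, t in zip(preds, targets))
        return {"accuracy": correct / len(targets)}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A &lt;code&gt;DetectorRunner&lt;/code&gt; or &lt;code&gt;SegmenterRunner&lt;/code&gt; would override the same three hooks (box decoding and mAP, or mask thresholding and IoU) while reusing the loop unchanged.&lt;/p&gt;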

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka9uhv0cuf9rgrg7ceq6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka9uhv0cuf9rgrg7ceq6.jpg" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Bringing All Layers Together&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After breaking down the system into its individual layers — &lt;strong&gt;Dataset Adapters&lt;/strong&gt;, &lt;strong&gt;Task-specific Runners&lt;/strong&gt;, the &lt;strong&gt;YAML-driven AppConfig&lt;/strong&gt;, and the &lt;strong&gt;evaluation engine&lt;/strong&gt; — it becomes useful to step back and look at the architecture from a higher level.&lt;/p&gt;

&lt;p&gt;The diagram below illustrates how all components interact within a single unified pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how datasets flow into adapters,
&lt;/li&gt;
&lt;li&gt;how models are loaded and normalized,
&lt;/li&gt;
&lt;li&gt;how runners orchestrate the evaluation,
&lt;/li&gt;
&lt;li&gt;and how results propagate back into metrics, artifacts, and the UI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5vhxywqxs9fbgrln1ru.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5vhxywqxs9fbgrln1ru.jpg" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;6. From Script to System: Client–Server Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To support multiple users and parallel evaluations, the project evolved from a local script into a fully scalable &lt;strong&gt;Client–Server architecture&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
This shift enabled distributed execution, resource sharing, and robust management of concurrent evaluation workloads.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Server Responsibilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The server layer centralizes orchestration and reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Job scheduling&lt;/strong&gt; — organizing evaluation tasks and assigning them to available workers
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queue management&lt;/strong&gt; — ensuring ordered, predictable processing of multiple evaluation requests
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; — distributing workloads efficiently across workers or compute nodes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact storage (MinIO)&lt;/strong&gt; — storing predictions, logs, and evaluation outputs as versioned artifacts
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model and version tracking&lt;/strong&gt; — maintaining reproducible mappings between models, benchmarks, and outputs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure isolation&lt;/strong&gt; — preventing individual crashes from affecting the broader system
&lt;/li&gt;
&lt;/ul&gt;
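&lt;p&gt;The queue-management and failure-isolation responsibilities can be sketched with the standard library alone. This is a toy single-process analogue of the real server, which runs separate workers and stores artifacts in MinIO:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import queue
import threading

def worker(jobs, results):
    """Drain the job queue forever; a crash in one job is recorded,
    not propagated, so the worker keeps serving later requests."""
    while True:
        job = jobs.get()
        if job is None:                      # sentinel: orderly shutdown
            jobs.task_done()
            break
        try:
            results.append(("ok", job()))
        except Exception as exc:             # failure isolation
            results.append(("failed", type(exc).__name__))
        finally:
            jobs.task_done()

jobs = queue.Queue()
results = []
threading.Thread(target=worker, args=(jobs, results)).start()
jobs.put(lambda: 21 * 2)                     # a job that succeeds
jobs.put(lambda: 1 / 0)                      # a job that crashes
jobs.put(None)                               # shut the worker down
jobs.join()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After the queue drains, &lt;code&gt;results&lt;/code&gt; contains one success and one recorded failure; the crashing job never took the worker down with it.&lt;/p&gt;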




&lt;h2&gt;
  
  
  &lt;strong&gt;7. Client (PyQt) Responsibilities&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The PyQt-based desktop client provides an accessible front end for researchers and engineers, handling all user-driven interactions with the evaluation pipeline.&lt;/p&gt;

&lt;p&gt;Its responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Uploading models&lt;/strong&gt; — loading ONNX or task-specific formats into the system
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selecting benchmarks&lt;/strong&gt; — choosing the appropriate YAML specification for each evaluation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuring runs&lt;/strong&gt; — device selection, batch size, metric presets, and overrides
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time logs&lt;/strong&gt; — streaming progress, status messages, and intermediate outputs
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparing metrics across runs&lt;/strong&gt; — visualizing performance differences between models and benchmarks
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Downloading prediction artifacts&lt;/strong&gt; — retrieving images, overlays, and structured outputs generated by the runners
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This client–server architecture transforms the pipeline from a single-use script into a &lt;strong&gt;scalable, interactive research tool&lt;/strong&gt; that supports parallel experimentation and consistent evaluation across teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. Key Engineering Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Throughout the development of this system, several engineering principles proved consistently valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Configuration should drive execution&lt;/strong&gt; — not the other way around. A declarative benchmark definition ensures reproducibility and removes hidden logic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong validation (Pydantic) prevents hours of debugging&lt;/strong&gt; — catching structural errors before execution dramatically improves reliability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapters normalize complexity&lt;/strong&gt; — avoiding format-specific logic scattered throughout the codebase.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular runners keep task logic replaceable&lt;/strong&gt; — enabling clean extensions and isolating preprocessing, postprocessing, and metrics per task.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental evaluation is essential for real-world datasets&lt;/strong&gt; — allowing resumes, caching, and faster experimentation cycles.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client–Server separation transforms a pipeline into a production-grade system&lt;/strong&gt; — supporting parallel workloads, shared resources, and failure isolation.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;By structuring the pipeline around clear boundaries — declarative configuration, strict validation, normalized datasets, and modular execution — the system achieves something simple but important: a consistent and reproducible way to evaluate Computer Vision models across tasks and benchmarks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This foundation keeps the pipeline stable, easy to extend, and practical for real research work, without requiring new code each time the problem or dataset changes.&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
