DEV Community

Ye Allen

Build a Config-Driven Evaluation Harness for Multimodal AI Models

AI applications rarely depend on a single model forever.

A product may begin with text generation, then add document analysis, image creation, audio processing, video generation, or agent workflows. As these requirements grow, developers need a repeatable way to test models without scattering provider-specific logic across the codebase.

This tutorial presents a simple, config-driven evaluation harness for comparing AI models by workflow.

The goal is not to create a universal benchmark. It is to make model decisions measurable, repeatable, and easier to update.

What the Evaluation Harness Should Do

A practical evaluation harness should be able to:

  • define models and routes in configuration
  • load realistic test cases
  • run the same workflow against multiple models
  • record latency and success status
  • validate structured outputs
  • estimate or record usage cost
  • support text and asynchronous media jobs
  • export comparable results

The product should not need to know which provider serves a model. It should request a capability through a common internal interface.

Define the Core Types

Start with a few TypeScript types:

type Modality = "text" | "image" | "video" | "audio";

interface ModelTarget {
  id: string;
  model: string;
  route?: string;
  modality: Modality;
  enabled: boolean;
}

interface TestCase {
  id: string;
  workflow: string;
  modality: Modality;
  input: unknown;
  expected?: {
    requiredFields?: string[];
    maxLatencyMs?: number;
  };
}

interface EvaluationResult {
  testCaseId: string;
  workflow: string;
  targetId: string;
  model: string;
  route?: string;
  success: boolean;
  latencyMs: number;
  formatValid: boolean;
  error?: string;
  output?: unknown;
}

These types separate three concerns:

  1. The model being tested
  2. The product workflow
  3. The recorded result

That separation becomes important when one model is tested across several workflows or when multiple models are evaluated for the same task.

Keep Model Targets in Configuration

Avoid hardcoding model decisions throughout the application.

const targets: ModelTarget[] = [
  {
    id: "fast-support-model",
    model: process.env.SUPPORT_MODEL ?? "configured-text-model",
    modality: "text",
    enabled: true,
  },
  {
    id: "rag-reasoning-model",
    model: process.env.RAG_MODEL ?? "configured-reasoning-model",
    modality: "text",
    enabled: true,
  },
  {
    id: "product-image-model",
    model: process.env.IMAGE_MODEL ?? "configured-image-model",
    modality: "image",
    enabled: true,
  },
];

The model identifiers above are placeholders. In a real application, use identifiers supported by the selected AI API platform.

Configuration makes it easier to test new models, compare routes, or respond to availability changes without rewriting business logic.
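One way to take this a step further is to load the targets from a JSON document and reject malformed entries at startup. The following is a minimal sketch, not a definitive implementation: `parseTargets` and `enabledTargets` are hypothetical helper names, and the choice to treat a missing `enabled` flag as `true` is an assumption made for illustration.

```typescript
type Modality = "text" | "image" | "video" | "audio";

interface ModelTarget {
  id: string;
  model: string;
  route?: string;
  modality: Modality;
  enabled: boolean;
}

const MODALITIES: readonly string[] = ["text", "image", "video", "audio"];

// Hypothetical helper: parse a JSON array of targets and fail fast on bad entries.
function parseTargets(json: string): ModelTarget[] {
  const raw: unknown = JSON.parse(json);
  if (!Array.isArray(raw)) {
    throw new Error("Target config must be a JSON array");
  }

  return raw.map((entry: unknown, index: number): ModelTarget => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`Target at index ${index} is not an object`);
    }
    const fields = entry as Record<string, unknown>;
    const id = fields.id;
    const model = fields.model;
    const route = fields.route;
    const modality = fields.modality;

    if (
      typeof id !== "string" ||
      typeof model !== "string" ||
      typeof modality !== "string" ||
      MODALITIES.indexOf(modality) === -1
    ) {
      throw new Error(`Target at index ${index} has missing or invalid fields`);
    }

    return {
      id,
      model,
      route: typeof route === "string" ? route : undefined,
      modality: modality as Modality,
      // Assumption: only an explicit `false` disables a target.
      enabled: fields.enabled !== false,
    };
  });
}

// Hypothetical helper: keep only the targets that should be evaluated.
function enabledTargets(targets: ModelTarget[]): ModelTarget[] {
  return targets.filter((target) => target.enabled);
}
```

Failing at load time keeps a typo in the config from surfacing later as a confusing request error in the middle of a test run.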

Create Workflow-Based Test Cases

Public benchmarks are useful for discovery, but internal tests should represent the actual product.

const testCases: TestCase[] = [
  {
    id: "support-001",
    workflow: "support_chat",
    modality: "text",
    input: {
      messages: [
        {
          role: "user",
          content: "Explain how to reset an API credential safely.",
        },
      ],
    },
    expected: {
      maxLatencyMs: 5000,
    },
  },
  {
    id: "agent-001",
    workflow: "agent_structured_output",
    modality: "text",
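    // Uses the same chat `messages` shape as support-001 so a chat-completion
    // adapter can send this input as the request body.
    input: {
      messages: [
        {
          role: "user",
          content: "Return a support ticket as JSON with title, priority, and summary.",
        },
      ],
    },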
    expected: {
      requiredFields: ["title", "priority", "summary"],
      maxLatencyMs: 8000,
    },
  },
  {
    id: "image-001",
    workflow: "product_image",
    modality: "image",
    input: {
      prompt: "A clean studio product image on a neutral background",
    },
    expected: {
      maxLatencyMs: 60000,
    },
  },
];

A useful dataset should contain normal requests, difficult inputs, formatting requirements, multilingual examples, and known failure cases.

Start with 10 to 30 examples for each important workflow. A small, relevant dataset is more useful than a large collection of unrelated prompts.
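With targets and test cases defined, the harness needs to decide which combinations to run. A minimal sketch of that pairing is shown here, assuming a target should only receive test cases of its own modality. The `buildRunPlan` helper and the trimmed-down `TargetRef`, `CaseRef`, and `PlannedRun` shapes are hypothetical names introduced for this example.

```typescript
type Modality = "text" | "image" | "video" | "audio";

// Trimmed-down shapes so the sketch stands alone; the full types carry more fields.
interface TargetRef {
  id: string;
  modality: Modality;
  enabled: boolean;
}

interface CaseRef {
  id: string;
  workflow: string;
  modality: Modality;
}

interface PlannedRun {
  targetId: string;
  testCaseId: string;
  workflow: string;
}

// Hypothetical helper: pair every test case with each enabled target
// that handles the same modality.
function buildRunPlan(targets: TargetRef[], testCases: CaseRef[]): PlannedRun[] {
  const plan: PlannedRun[] = [];

  for (const test of testCases) {
    for (const target of targets) {
      if (!target.enabled || target.modality !== test.modality) {
        continue;
      }
      plan.push({
        targetId: target.id,
        testCaseId: test.id,
        workflow: test.workflow,
      });
    }
  }

  return plan;
}
```

Building the plan up front also makes it easy to print the number of planned requests before spending any credits.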

Create an Adapter Interface

Text, image, video, and audio APIs may use different request formats. Hide those differences behind an adapter.

interface ModelAdapter {
  run(target: ModelTarget, test: TestCase): Promise<unknown>;
}

A text adapter could use a familiar chat-completion format:

class TextModelAdapter implements ModelAdapter {
  constructor(
    private baseUrl: string,
    private apiKey: string
  ) {}

  async run(target: ModelTarget, test: TestCase): Promise<unknown> {
    const response = await fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: target.model,
        ...(test.input as object),
      }),
    });

    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }

    return response.json();
  }
}

Set baseUrl, credentials, model names, and endpoint paths according to the documentation of the platform being used.

OpenAI-compatible request formats can simplify many text integrations because familiar SDKs and tools may already support them. However, image, video, audio, and specialized models may require separate endpoints or asynchronous processing.
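Once there is more than one adapter, the harness needs a way to pick the right one for each target. The sketch below shows one possible approach, assuming a single adapter per modality. `AdapterRegistry` and `EchoAdapter` are hypothetical names, and `EchoAdapter` is only a stub for dry runs that never calls a real API.

```typescript
type Modality = "text" | "image" | "video" | "audio";

// Trimmed-down shapes so the sketch stands alone.
interface ModelTarget {
  id: string;
  model: string;
  modality: Modality;
}

interface TestCase {
  id: string;
  input: unknown;
}

interface ModelAdapter {
  run(target: ModelTarget, test: TestCase): Promise<unknown>;
}

// Hypothetical registry: maps each modality to the adapter that handles it.
class AdapterRegistry {
  private adapters = new Map<Modality, ModelAdapter>();

  register(modality: Modality, adapter: ModelAdapter): this {
    this.adapters.set(modality, adapter);
    return this;
  }

  resolve(target: ModelTarget): ModelAdapter {
    const adapter = this.adapters.get(target.modality);
    if (!adapter) {
      throw new Error(`No adapter registered for modality "${target.modality}"`);
    }
    return adapter;
  }
}

// Stub adapter for dry runs: returns the input without calling any API.
class EchoAdapter implements ModelAdapter {
  async run(target: ModelTarget, test: TestCase): Promise<unknown> {
    return { model: target.model, echoed: test.input };
  }
}
```

A stub like this is handy for checking the runner, validation, and reporting code before pointing the harness at a paid endpoint.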

Run and Record Each Evaluation

The runner measures latency and captures errors without stopping the entire test suite.

async function evaluate(
  adapter: ModelAdapter,
  target: ModelTarget,
  test: TestCase
): Promise<EvaluationResult> {
  const startedAt = performance.now();

  try {
    const output = await adapter.run(target, test);
    const latencyMs = Math.round(performance.now() - startedAt);

    return {
      testCaseId: test.id,
      workflow: test.workflow,
      targetId: target.id,
      model: target.model,
      route: target.route,
      success: true,
      latencyMs,
      formatValid: validateOutput(output, test.expected),
      output,
    };
  } catch (error) {
    return {
      testCaseId: test.id,
      workflow: test.workflow,
      targetId: target.id,
      model: target.model,
      route: target.route,
      success: false,
      latencyMs: Math.round(performance.now() - startedAt),
      formatValid: false,
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }
}
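To run a whole suite, the single-evaluation function can be called for every planned combination. The sketch below is one way to do that with a small concurrency limit, on the assumption that sending every request at once could hit provider rate limits. `runWithLimit` is a hypothetical helper; it is generic, so the worker can be any function that returns a promise.

```typescript
// Hypothetical helper: run `worker` over all jobs with at most `limit`
// in flight at a time, returning results in the same order as `jobs`.
async function runWithLimit<T, R>(
  jobs: readonly T[],
  limit: number,
  worker: (job: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(jobs.length);
  let next = 0;

  // Each lane repeatedly claims the next unclaimed job until none are left.
  async function drain(): Promise<void> {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await worker(jobs[index]);
    }
  }

  const lanes = Math.max(1, Math.min(limit, jobs.length));
  await Promise.all(Array.from({ length: lanes }, () => drain()));

  return results;
}
```

Because the evaluation function catches its own errors and returns a failed result, a worker built on it should never reject. If a worker does reject, the promise returned by `runWithLimit` rejects as well.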

The validation function can begin simply:

function validateOutput(
  output: unknown,
  expected?: TestCase["expected"]
): boolean {
  if (!expected?.requiredFields) {
    return true;
  }

  if (!output || typeof output !== "object") {
    return false;
  }

  return expected.requiredFields.every(
    (field) => field in (output as Record<string, unknown>)
  );
}

For production evaluations, add schema validation with a library such as Zod or a JSON Schema validator such as Ajv.
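One detail to watch: the field check only inspects the top-level keys of whatever the adapter returns. If an adapter returns a raw chat-completion response, the model's JSON usually sits inside the message text, so it has to be parsed first. The sketch below assumes the OpenAI-compatible shape `choices[0].message.content` and that some models wrap JSON in a Markdown code fence. `extractJsonContent` is a hypothetical helper; adjust it to the response format documented by your platform.

```typescript
// Hypothetical helper: pull the model's text out of a chat-completion style
// response and parse it as JSON. Returns undefined when anything is missing
// or the text is not valid JSON.
function extractJsonContent(response: unknown): unknown {
  const choices = (response as { choices?: unknown } | null | undefined)?.choices;
  if (!Array.isArray(choices) || choices.length === 0) {
    return undefined;
  }

  const first = choices[0] as { message?: { content?: unknown } } | null | undefined;
  const content = first?.message?.content;
  if (typeof content !== "string") {
    return undefined;
  }

  // Strip a leading and trailing Markdown code fence if the model added one.
  const stripped = content
    .trim()
    .replace(/^`{3}(?:json)?\s*/i, "")
    .replace(/\s*`{3}$/, "");

  try {
    return JSON.parse(stripped);
  } catch (error) {
    return undefined;
  }
}
```

The parsed value can then be passed to the validation function in place of the raw response.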

Handle Asynchronous Media Jobs

Video and some image or audio APIs may return a job identifier instead of the final asset.

The adapter should then:

  1. Submit the generation request.
  2. Store the returned job ID.
  3. Poll the documented status endpoint.
  4. Stop after a configured timeout.
  5. Record the completion time.
  6. Save the resulting asset URL and metadata.
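Steps 3 to 6 can be sketched as a small polling loop. This is a minimal example under stated assumptions: the `JobStatus` states (`pending`, `succeeded`, `failed`), the `pollJob` helper, and the `checkStatus` callback are all hypothetical, since real status endpoints and field names vary by platform. Submitting the request and storing the job ID (steps 1 and 2) are left to the caller.

```typescript
// Hypothetical status shape; real APIs use their own state names and fields.
type JobStatus =
  | { state: "pending" }
  | { state: "succeeded"; assetUrl: string }
  | { state: "failed"; reason: string };

interface PollOptions {
  intervalMs: number; // wait between status checks
  timeoutMs: number; // give up after this much total time
}

interface PollOutcome {
  assetUrl: string;
  jobCompletionMs: number;
}

// Poll `checkStatus` until the job succeeds, fails, or the timeout is reached.
async function pollJob(
  jobId: string,
  checkStatus: (jobId: string) => Promise<JobStatus>,
  options: PollOptions
): Promise<PollOutcome> {
  const startedAt = Date.now();

  for (;;) {
    const status = await checkStatus(jobId);

    if (status.state === "succeeded") {
      return {
        assetUrl: status.assetUrl,
        jobCompletionMs: Date.now() - startedAt,
      };
    }
    if (status.state === "failed") {
      throw new Error(`Job ${jobId} failed: ${status.reason}`);
    }

    // Stop if waiting one more interval would pass the timeout.
    if (Date.now() - startedAt + options.intervalMs > options.timeoutMs) {
      throw new Error(`Job ${jobId} did not finish within ${options.timeoutMs} ms`);
    }

    await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
  }
}
```

Passing `checkStatus` in as a function keeps the loop independent of any one provider and makes it easy to test with a fake status source.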

Media evaluation records may include:

interface MediaMetadata {
  width?: number;
  height?: number;
  durationSeconds?: number;
  format?: string;
  jobCompletionMs?: number;
  assetUrl?: string;
}

This makes it possible to compare operational behavior as well as creative quality.

Compare Models by Workflow

Do not produce one global model ranking.

A model that performs well for document reasoning may not be the best option for support chat. A high-quality image model may be too slow for an interactive editing workflow.

Summarize results by workflow:

interface WorkflowSummary {
  workflow: string;
  model: string;
  successRate: number;
  averageLatencyMs: number;
  formatSuccessRate: number;
  averageCost?: number;
}
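A summary like this can be computed by grouping recorded results by workflow and model. The following is a rough sketch, with `summarize` as a hypothetical helper. One design choice here is an assumption, not a rule: average latency is calculated over successful requests only, so fast failures do not make a model look quicker than it is.

```typescript
// Trimmed-down result shape so the sketch stands alone.
interface EvaluationResult {
  workflow: string;
  model: string;
  success: boolean;
  latencyMs: number;
  formatValid: boolean;
}

interface WorkflowSummary {
  workflow: string;
  model: string;
  successRate: number;
  averageLatencyMs: number;
  formatSuccessRate: number;
  averageCost?: number;
}

// Hypothetical helper: one summary row per (workflow, model) pair.
function summarize(results: EvaluationResult[]): WorkflowSummary[] {
  const groups = new Map<string, EvaluationResult[]>();

  for (const result of results) {
    const key = JSON.stringify([result.workflow, result.model]);
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(result);
    } else {
      groups.set(key, [result]);
    }
  }

  const summaries: WorkflowSummary[] = [];
  groups.forEach((group) => {
    const succeeded = group.filter((r) => r.success);
    const totalLatency = succeeded.reduce((sum, r) => sum + r.latencyMs, 0);

    summaries.push({
      workflow: group[0].workflow,
      model: group[0].model,
      successRate: succeeded.length / group.length,
      // Assumption: latency is averaged over successful requests only.
      averageLatencyMs: succeeded.length > 0 ? totalLatency / succeeded.length : 0,
      formatSuccessRate: group.filter((r) => r.formatValid).length / group.length,
    });
  });

  return summaries;
}
```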

The final selection should balance:

  • output quality
  • successful request rate
  • response latency
  • formatting reliability
  • route availability
  • usage cost
  • workflow requirements
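Of these factors, usage cost is often the easiest to estimate from recorded data. The sketch below assumes token-based billing, which applies to many text models but not necessarily to image, video, or audio generation. `TokenUsage`, `TokenPrice`, `estimateCost`, and `averageCost` are hypothetical names, and any prices passed in should come from your platform's published rates.

```typescript
// Assumed usage shape; map it from whatever your API reports.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Hypothetical price entry, expressed per one million tokens.
interface TokenPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Estimate the cost of a single request from its token counts.
function estimateCost(usage: TokenUsage, price: TokenPrice): number {
  return (
    (usage.promptTokens * price.inputPerMillion +
      usage.completionTokens * price.outputPerMillion) /
    1e6
  );
}

// Average a list of per-request costs; undefined when there is nothing to average.
function averageCost(costs: readonly number[]): number | undefined {
  if (costs.length === 0) {
    return undefined;
  }
  return costs.reduce((sum, cost) => sum + cost, 0) / costs.length;
}
```

The result of `averageCost` can fill the optional `averageCost` field in a workflow summary.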

Keep the selected model and route configurable after evaluation.

Continue Testing After Launch

Initial evaluation is only the beginning.

Production traffic will expose new inputs, failure patterns, and user expectations. Add difficult production examples back into the test dataset and rerun them when model settings change.

Useful production metrics include:

  • request success rate
  • latency percentiles
  • cost by workflow
  • invalid structured outputs
  • timeout and retry frequency
  • media generation failures
  • route availability
  • user corrections

This creates an evaluation process based on real product behavior instead of a one-time demonstration.
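Latency percentiles, for example, can be computed directly from recorded request times. Here is a minimal sketch using the nearest-rank method, which is one of several common percentile definitions. `percentile` is a hypothetical helper name.

```typescript
// Nearest-rank percentile: the smallest recorded value such that at least
// p percent of all values are less than or equal to it.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) {
    throw new Error("Cannot compute a percentile of an empty list");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be greater than 0 and at most 100");
  }

  // Copy before sorting so the caller's array is not modified.
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);

  return sorted[rank - 1];
}
```

Tracking a high percentile such as p95 alongside the average helps reveal slow outliers that an average alone would hide.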

Using VectorNode for Model Evaluation

VectorNode is a pay-as-you-go multi-model AI API platform for independent developers and small AI teams building with text, image, video, and audio models.

It provides one account for testing and accessing GPT, Claude, Gemini, DeepSeek, Qwen, and hundreds of other supported models through developer-friendly APIs.

Developers can use its Playground for initial testing, compare available models and routes, and then move representative evaluations into their own test harness.

This approach is useful for AI applications, agents, RAG systems, chatbots, automation workflows, developer tools, and multimodal products.

Learn more:

https://www.vectronode.com/

Start testing with VectorNode.
