Ye Allen

Posted on Jun 14

Build a Config-Driven Evaluation Harness for Multimodal AI Models

AI applications rarely depend on a single model forever.

A product may begin with text generation, then add document analysis, image creation, audio processing, video generation, or agent workflows. As these requirements grow, developers need a repeatable way to test models without scattering provider-specific logic across the codebase.

This tutorial presents a simple, config-driven evaluation harness for comparing AI models by workflow.

The goal is not to create a universal benchmark. It is to make model decisions measurable, repeatable, and easier to update.

What the Evaluation Harness Should Do

A practical evaluation harness should be able to:

define models and routes in configuration
load realistic test cases
run the same workflow against multiple models
record latency and success status
validate structured outputs
estimate or record usage cost
support text and asynchronous media jobs
export comparable results

The product should not need to know which provider serves a model. It should request a capability through a common internal interface.

Define the Core Types

Start with a few TypeScript types:

type Modality = "text" | "image" | "video" | "audio";

interface ModelTarget {
  id: string;
  model: string;
  route?: string;
  modality: Modality;
  enabled: boolean;
}

interface TestCase {
  id: string;
  workflow: string;
  modality: Modality;
  input: unknown;
  expected?: {
    requiredFields?: string[];
    maxLatencyMs?: number;
  };
}

interface EvaluationResult {
  testCaseId: string;
  workflow: string;
  targetId: string;
  model: string;
  route?: string;
  success: boolean;
  latencyMs: number;
  formatValid: boolean;
  error?: string;
  output?: unknown;
}

These types separate three concerns:

The model being tested
The product workflow
The recorded result

That separation becomes important when one model is tested across several workflows or when multiple models are evaluated for the same task.

Keep Model Targets in Configuration

Avoid hardcoding model decisions throughout the application.

const targets: ModelTarget[] = [
  {
    id: "fast-support-model",
    model: process.env.SUPPORT_MODEL ?? "configured-text-model",
    modality: "text",
    enabled: true,
  },
  {
    id: "rag-reasoning-model",
    model: process.env.RAG_MODEL ?? "configured-reasoning-model",
    modality: "text",
    enabled: true,
  },
  {
    id: "product-image-model",
    model: process.env.IMAGE_MODEL ?? "configured-image-model",
    modality: "image",
    enabled: true,
  },
];

The model identifiers above are placeholders. In a real application, use identifiers supported by the selected AI API platform.

Configuration makes it easier to test new models, compare routes, or respond to availability changes without rewriting business logic.
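If the targets live in a JSON file or environment variable rather than in source code, they need a runtime check before use. The following is a minimal sketch, assuming the `ModelTarget` shape defined earlier; `isModelTarget` and `parseTargets` are hypothetical helper names, not part of any library.

```typescript
// Types repeated from earlier so the snippet stands alone.
type Modality = "text" | "image" | "video" | "audio";

interface ModelTarget {
  id: string;
  model: string;
  route?: string;
  modality: Modality;
  enabled: boolean;
}

const MODALITIES: readonly string[] = ["text", "image", "video", "audio"];

// Hypothetical type guard: checks that an unknown value has the ModelTarget shape.
function isModelTarget(value: unknown): value is ModelTarget {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.id === "string" &&
    typeof v.model === "string" &&
    (v.route === undefined || typeof v.route === "string") &&
    typeof v.modality === "string" &&
    MODALITIES.includes(v.modality) &&
    typeof v.enabled === "boolean"
  );
}

// Hypothetical loader: parses a JSON array, rejects malformed entries,
// and returns only the targets that are enabled.
function parseTargets(json: string): ModelTarget[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) {
    throw new Error("Expected an array of model targets");
  }
  const invalidIndex = parsed.findIndex((entry) => !isModelTarget(entry));
  if (invalidIndex !== -1) {
    throw new Error(`Invalid model target at index ${invalidIndex}`);
  }
  return (parsed as ModelTarget[]).filter((target) => target.enabled);
}
```

Failing loudly on a malformed entry is a design choice: a silently skipped target would otherwise look like a model that was never evaluated.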

Create Workflow-Based Test Cases

Public benchmarks are useful for discovery, but internal tests should represent the actual product.

const testCases: TestCase[] = [
  {
    id: "support-001",
    workflow: "support_chat",
    modality: "text",
    input: {
      messages: [
        {
          role: "user",
          content: "Explain how to reset an API credential safely.",
        },
      ],
    },
    expected: {
      maxLatencyMs: 5000,
    },
  },
  {
    id: "agent-001",
    workflow: "agent_structured_output",
    modality: "text",
    input: {
      task: "Return a support ticket with title, priority, and summary.",
    },
    expected: {
      requiredFields: ["title", "priority", "summary"],
      maxLatencyMs: 8000,
    },
  },
  {
    id: "image-001",
    workflow: "product_image",
    modality: "image",
    input: {
      prompt: "A clean studio product image on a neutral background",
    },
    expected: {
      maxLatencyMs: 60000,
    },
  },
];

A useful dataset should contain normal requests, difficult inputs, formatting requirements, multilingual examples, and known failure cases.

Start with 10 to 30 examples for each important workflow. A small, relevant dataset is more useful than a large collection of unrelated prompts.
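A quick coverage check can flag workflows whose dataset is still too small. This is a sketch under the assumption that each test case carries a `workflow` string as in the `TestCase` type above; `findThinWorkflows` is a hypothetical helper name.

```typescript
// Only the fields this check needs; a full TestCase also satisfies this shape.
interface WorkflowCase {
  id: string;
  workflow: string;
}

// Hypothetical helper: returns the workflows that have fewer than `minimum`
// test cases, sorted alphabetically.
function findThinWorkflows(cases: WorkflowCase[], minimum: number): string[] {
  const counts = new Map<string, number>();
  for (const testCase of cases) {
    counts.set(testCase.workflow, (counts.get(testCase.workflow) ?? 0) + 1);
  }
  const thin: string[] = [];
  counts.forEach((count, workflow) => {
    if (count < minimum) thin.push(workflow);
  });
  return thin.sort();
}
```

Calling `findThinWorkflows(testCases, 10)` before a run would list any workflow that falls below the lower end of the suggested range.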

Create an Adapter Interface

Text, image, video, and audio APIs may use different request formats. Hide those differences behind an adapter.

interface ModelAdapter {
  run(target: ModelTarget, test: TestCase): Promise<unknown>;
}

A text adapter could use a familiar chat-completion format:

class TextModelAdapter implements ModelAdapter {
  constructor(
    private baseUrl: string,
    private apiKey: string
  ) {}

  async run(target: ModelTarget, test: TestCase): Promise<unknown> {
    const response = await fetch(`${this.baseUrl}/chat/completions`, {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${this.apiKey}`,
      },
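      body: JSON.stringify({
        model: target.model,
        // Assumes test.input is already shaped like a chat-completion body,
        // for example { messages: [...] }. Inputs such as { task: "..." }
        // must be converted to messages before they reach this adapter.
        ...(test.input as object),
      }),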
    });

    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }

    return response.json();
  }
}

Set baseUrl, credentials, model names, and endpoint paths according to the documentation of the platform being used.

OpenAI-compatible request formats can simplify many text integrations because familiar SDKs and tools may already support them. However, image, video, audio, and specialized models may require separate endpoints or asynchronous processing.
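Because different modalities may need different adapters, the harness needs some way to pick one per target. One possible approach is a small registry keyed by modality. The sketch below assumes the types defined earlier; `AdapterRegistry` and `echoAdapter` are hypothetical names used for illustration.

```typescript
// Types repeated from earlier so the snippet stands alone.
type Modality = "text" | "image" | "video" | "audio";

interface ModelTarget {
  id: string;
  model: string;
  route?: string;
  modality: Modality;
  enabled: boolean;
}

interface TestCase {
  id: string;
  workflow: string;
  modality: Modality;
  input: unknown;
}

interface ModelAdapter {
  run(target: ModelTarget, test: TestCase): Promise<unknown>;
}

// Hypothetical registry: maps each modality to the adapter that handles it.
class AdapterRegistry {
  private readonly adapters = new Map<Modality, ModelAdapter>();

  register(modality: Modality, adapter: ModelAdapter): this {
    this.adapters.set(modality, adapter);
    return this;
  }

  // Throws instead of returning undefined so a missing adapter is caught early.
  resolve(target: ModelTarget): ModelAdapter {
    const adapter = this.adapters.get(target.modality);
    if (!adapter) {
      throw new Error(`No adapter registered for modality "${target.modality}"`);
    }
    return adapter;
  }
}

// Stub adapter used only to demonstrate the lookup; it makes no network calls.
const echoAdapter: ModelAdapter = {
  async run(target, test) {
    return { model: target.model, echoed: test.input };
  },
};

const registry = new AdapterRegistry().register("text", echoAdapter);
```

With this in place, the calling code asks `registry.resolve(target)` and never branches on provider or endpoint details itself.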

Run and Record Each Evaluation

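The runner measures latency and captures errors without stopping the entire test suite. It records `latencyMs` but does not compare it with `expected.maxLatencyMs`, so apply that budget when summarizing or reporting results.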

async function evaluate(
  adapter: ModelAdapter,
  target: ModelTarget,
  test: TestCase
): Promise<EvaluationResult> {
  const startedAt = performance.now();

  try {
    const output = await adapter.run(target, test);
    const latencyMs = Math.round(performance.now() - startedAt);

    return {
      testCaseId: test.id,
      workflow: test.workflow,
      targetId: target.id,
      model: target.model,
      route: target.route,
      success: true,
      latencyMs,
      formatValid: validateOutput(output, test.expected),
      output,
    };
  } catch (error) {
    return {
      testCaseId: test.id,
      workflow: test.workflow,
      targetId: target.id,
      model: target.model,
      route: target.route,
      success: false,
      latencyMs: Math.round(performance.now() - startedAt),
      formatValid: false,
      error: error instanceof Error ? error.message : "Unknown error",
    };
  }
}
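To cover a whole suite, each enabled target is paired with every test case of the same modality. The following sketch shows one way to do that; `runSuite` is a hypothetical name, and the per-pair evaluation is passed in as a callback (for example, a wrapper around `evaluate`) so the snippet stands alone.

```typescript
// Minimal shapes; the ModelTarget and TestCase types above satisfy them.
interface SuiteTarget {
  modality: string;
  enabled: boolean;
}

interface SuiteCase {
  modality: string;
}

// Hypothetical suite runner: pairs each enabled target with every test case
// of the same modality and collects the results in order.
async function runSuite<T extends SuiteTarget, C extends SuiteCase, R>(
  targets: T[],
  tests: C[],
  evaluateOne: (target: T, test: C) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const target of targets) {
    if (!target.enabled) continue;
    for (const test of tests) {
      if (test.modality !== target.modality) continue;
      // Awaiting inside the loop runs evaluations one at a time.
      results.push(await evaluateOne(target, test));
    }
  }
  return results;
}
```

Running the pairs sequentially is a deliberate choice here: concurrent requests can compete for bandwidth or rate limits and distort the latency measurements. A concurrency limit could be added later if suite duration becomes a problem.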

The validation function can begin simply:

function validateOutput(
  output: unknown,
  expected?: TestCase["expected"]
): boolean {
  if (!expected?.requiredFields) {
    return true;
  }

  if (!output || typeof output !== "object") {
    return false;
  }

  return expected.requiredFields.every(
    (field) => field in (output as Record<string, unknown>)
  );
}

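This check only inspects the top-level keys of whatever the adapter returns. If the adapter returns a raw API response, the required fields must first be extracted from the model's message content. For production evaluations, add schema validation with a library such as Zod or a JSON Schema validator.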
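With an OpenAI-compatible text adapter, the fields named in `requiredFields` usually sit inside the assistant message as a JSON string, not at the top level of the response. The sketch below unwraps that content before validation. It assumes the response follows the common `choices[0].message.content` layout; other platforms may differ, and `extractStructuredOutput` is a hypothetical helper name.

```typescript
// Hypothetical helper: pulls the model's reply out of an OpenAI-compatible
// chat-completion response and parses it as JSON when possible.
// Anything that does not match the assumed layout is returned unchanged.
function extractStructuredOutput(response: unknown): unknown {
  if (typeof response !== "object" || response === null) return response;

  const choices = (response as { choices?: unknown }).choices;
  if (!Array.isArray(choices) || choices.length === 0) return response;

  const first = choices[0] as { message?: { content?: unknown } } | null | undefined;
  const content = first?.message?.content;
  if (typeof content !== "string") return response;

  try {
    return JSON.parse(content);
  } catch (e) {
    // The reply was plain text rather than JSON; return it as a string.
    return content;
  }
}
```

The result of `extractStructuredOutput(output)` can then be passed to `validateOutput` in place of the raw response.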

Handle Asynchronous Media Jobs

Video and some image or audio APIs may return a job identifier instead of the final asset.

The adapter should then:

Submit the generation request.
Store the returned job ID.
Poll the documented status endpoint.
Stop after a configured timeout.
Record the completion time.
Save the resulting asset URL and metadata.
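The polling steps above can be sketched as a single loop. This is a minimal illustration, not a specific platform's API: the `JobStatus` shape, the `pollJob` name, and the `checkStatus` callback are all assumptions, and the real status values and endpoint come from the platform's documentation. The clock and sleep functions are injectable so the loop can be tested without real delays.

```typescript
// Assumed status shape; real platforms use their own field names and states.
type JobStatus =
  | { state: "pending" }
  | { state: "succeeded"; assetUrl: string }
  | { state: "failed"; error: string };

interface PollOptions {
  intervalMs: number;
  timeoutMs: number;
  now?: () => number; // defaults to Date.now
  sleep?: (ms: number) => Promise<void>; // defaults to a setTimeout-based wait
}

// Hypothetical poller: calls checkStatus until the job finishes, fails,
// or the configured timeout has elapsed.
async function pollJob(
  checkStatus: () => Promise<JobStatus>,
  options: PollOptions
): Promise<{ assetUrl: string; jobCompletionMs: number }> {
  const now = options.now ?? (() => Date.now());
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const startedAt = now();

  while (true) {
    const status = await checkStatus();
    if (status.state === "succeeded") {
      return { assetUrl: status.assetUrl, jobCompletionMs: now() - startedAt };
    }
    if (status.state === "failed") {
      throw new Error(`Job failed: ${status.error}`);
    }
    if (now() - startedAt >= options.timeoutMs) {
      throw new Error(`Job timed out after ${options.timeoutMs} ms`);
    }
    await sleep(options.intervalMs);
  }
}
```

In an adapter, `checkStatus` would wrap the request to the documented status endpoint using the stored job ID, and the returned values would populate the evaluation record.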

Media evaluation records may include:

interface MediaMetadata {
  width?: number;
  height?: number;
  durationSeconds?: number;
  format?: string;
  jobCompletionMs?: number;
  assetUrl?: string;
}

This makes it possible to compare operational behavior as well as creative quality.

Compare Models by Workflow

Do not produce one global model ranking.

A model that performs well for document reasoning may not be the best option for support chat. A high-quality image model may be too slow for an interactive editing workflow.

Summarize results by workflow:

interface WorkflowSummary {
  workflow: string;
  model: string;
  successRate: number;
  averageLatencyMs: number;
  formatSuccessRate: number;
  averageCost?: number;
}
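One way to produce these summaries is to group results by workflow and model, then compute the rates per group. The sketch below is an assumption about how the aggregation could work; `summarize` is a hypothetical name, `averageCost` is left out because the results shown earlier do not record cost, and latency is averaged over successful requests only.

```typescript
// Only the EvaluationResult fields this aggregation needs.
interface ResultRow {
  workflow: string;
  model: string;
  success: boolean;
  latencyMs: number;
  formatValid: boolean;
}

interface WorkflowSummary {
  workflow: string;
  model: string;
  successRate: number;
  averageLatencyMs: number;
  formatSuccessRate: number;
}

// Hypothetical aggregator: one summary per (workflow, model) pair.
function summarize(results: ResultRow[]): WorkflowSummary[] {
  const groups = new Map<string, ResultRow[]>();
  for (const row of results) {
    // JSON.stringify of a tuple avoids key collisions between names.
    const key = JSON.stringify([row.workflow, row.model]);
    const group = groups.get(key) ?? [];
    group.push(row);
    groups.set(key, group);
  }

  const summaries: WorkflowSummary[] = [];
  groups.forEach((rows) => {
    const succeeded = rows.filter((r) => r.success);
    const latencyTotal = succeeded.reduce((sum, r) => sum + r.latencyMs, 0);
    summaries.push({
      workflow: rows[0].workflow,
      model: rows[0].model,
      successRate: succeeded.length / rows.length,
      // Reported as 0 when no request succeeded; read it with successRate.
      averageLatencyMs: succeeded.length > 0 ? latencyTotal / succeeded.length : 0,
      formatSuccessRate: rows.filter((r) => r.formatValid).length / rows.length,
    });
  });
  return summaries;
}
```

Failed requests are excluded from the latency average because a model that errors quickly would otherwise appear faster than one that answers correctly.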

The final selection should balance:

output quality
successful request rate
response latency
formatting reliability
route availability
usage cost
workflow requirements

Keep the selected model and route configurable after evaluation.

Continue Testing After Launch

Initial evaluation is only the beginning.

Production traffic will expose new inputs, failure patterns, and user expectations. Add difficult production examples back into the test dataset and rerun them when model settings change.

Useful production metrics include:

request success rate
latency percentiles
cost by workflow
invalid structured outputs
timeout and retry frequency
media generation failures
route availability
user corrections

This creates an evaluation process based on real product behavior instead of a one-time demonstration.
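Of the metrics listed above, latency percentiles are simple to compute from recorded `latencyMs` values. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; `percentile` is a hypothetical helper name.

```typescript
// Hypothetical helper: nearest-rank percentile of a list of numbers.
// p is a percentage in the range (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error("Cannot compute a percentile of an empty list");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be greater than 0 and at most 100");
  }
  // Copy before sorting so the caller's array is not mutated.
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

const latencies = [120, 80, 400, 200, 95, 150, 3000, 110, 130, 170];
const p50 = percentile(latencies, 50); // 130
const p95 = percentile(latencies, 95); // 3000
```

In this sample the median is 130 ms while the 95th percentile is 3000 ms, which shows why a single average can hide slow outliers that users will notice.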

Using VectorNode for Model Evaluation

VectorNode is a pay-as-you-go multi-model AI API platform for independent developers and small AI teams building with text, image, video, and audio models.

It provides one account for testing and accessing GPT, Claude, Gemini, DeepSeek, Qwen, and hundreds of other supported models through developer-friendly APIs.

Developers can use its Playground for initial testing, compare available models and routes, and then move representative evaluations into their own test harness.

This approach is useful for AI applications, agents, RAG systems, chatbots, automation workflows, developer tools, and multimodal products.

Learn more:

https://www.vectronode.com/

Start testing with VectorNode.