Manas Mudbari

Building a Multi-Provider LLM Benchmark with Automated GitHub Actions

A core problem we tackled when building real-time, LLM-based signal analysis is token efficiency: when you're feeding time-series data (stock prices, IoT sensors, blockchain events) into LLMs, the serialization format matters. A lot.

We needed hard numbers to prove it. So we built an automated benchmark system that runs every two weeks, tests four data formats across four major LLM providers, and publishes live results on our website.

Here's how we built it.


Proving Token Efficiency at Scale

Time-series data is structurally simple but verbose. JSON, the industry default, repeats keys on every row. CSV is better, but each row still carries a full timestamp and an absolute value rather than just the change from the previous point. For LLMs, this repetition translates directly into tokens, and tokens cost money.

We developed TSLN (Time-Series Lean Notation), a format that exploits temporal regularity and delta encoding to reduce token count by up to 87%. But claiming efficiency isn't enough. We needed:

  1. Reproducible benchmarks across multiple LLM providers
  2. Automated execution so results stay current
  3. Public transparency so developers can verify our claims

The result: an open-source benchmark suite that runs automatically via GitHub Actions and displays live results on turboline.ai.


Architecture Overview

┌─────────────────────────────────────────────────────┐
│  GitHub Actions (Bi-weekly cron + manual trigger)   │
│  ┌───────────────────────────────────────────────┐  │
│  │  1. Checkout repo                             │  │
│  │  2. Install Python deps (openai, anthropic...)│  │
│  │  3. Run benchmark script                      │  │
│  │  4. Commit results to public/data/*.json      │  │
│  │  5. Push to main branch                       │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                        │
                        │ Git push triggers Railway
                        ▼
┌─────────────────────────────────────────────────────┐
│  Railway (Automated CI/CD)                          │
│  ┌───────────────────────────────────────────────┐  │
│  │  1. Detect commit to main                     │  │
│  │  2. Build Next.js site                        │  │
│  │  3. Deploy to production                      │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
                        │
                        ▼
          Website auto-loads /data/benchmark-results.json

Key components:

  • Python benchmark runner (benchmark/run_full_benchmark.py)
  • GitHub Actions workflow (.github/workflows/run-benchmark.yml)
  • Next.js frontend (React component with Recharts visualization)
  • Railway deployment (automatic on git push)
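
Put together, the relevant pieces of the repository look roughly like this (only the paths mentioned in this post; the rest of the Next.js app is omitted):

.
├── .github/workflows/run-benchmark.yml    # bi-weekly benchmark workflow
├── benchmark/run_full_benchmark.py        # Python benchmark runner
├── lib/benchmark-types.ts                 # shared result type definitions
└── public/data/benchmark-results.json     # committed results, served statically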

The Benchmark Script

The core script tests four serialization formats:

Data Formats Tested

  1. JSON - Baseline format with full object notation
  2. CSV - Header row with comma-separated values
  3. TSLN - Time-Series Lean Notation (our format)
  4. TOON - Token-Oriented Object Notation (pipe-delimited)

Sample Data Generation

import json


def generate_sample_data(format_name: str) -> str:
    """Generate 100 data points (one per minute) in different formats."""

    if format_name == "json":
        return json.dumps({
            "data": [
                # Minutes roll over into the next hour once i reaches 60
                {"timestamp": f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z",
                 "value": 150.0 + i}
                for i in range(100)
            ]
        })

    elif format_name == "csv":
        rows = ["timestamp,value"]
        rows.extend([
            f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z,{150.0 + i}"
            for i in range(100)
        ])
        return "\n".join(rows)

    elif format_name == "tsln":
        # Compact format: start timestamp, interval in seconds, comma-joined values
        values = [str(150.0 + i) for i in range(100)]
        return "t:2024-01-01T09:00:00Z|i:60|v:" + ",".join(values)

    elif format_name == "toon":
        # Pipe-delimited format
        return "timestamp|value\n" + "\n".join([
            f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z|{150.0 + i}"
            for i in range(100)
        ])

    raise ValueError(f"Unknown format: {format_name}")
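
To get a concrete feel for the size difference, here is what two of those strings look like when generated by the function above (trimmed):

>>> generate_sample_data("tsln")[:60]
't:2024-01-01T09:00:00Z|i:60|v:150.0,151.0,152.0,153.0,154.0,'
>>> generate_sample_data("csv").splitlines()[:2]
['timestamp,value', '2024-01-01T09:00:00Z,150.0']

The CSV version repeats a 20-character timestamp on every row; TSLN states the start time and interval once, then lists only the values.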

Token Counting & Cost Calculation

Each benchmark:

  1. Generates sample data (100 stock price data points)
  2. Counts tokens using a simple heuristic (~4 chars/token)
  3. Calculates costs using provider-specific pricing:
    • OpenAI GPT-4o-mini: $0.15/1M tokens
    • Anthropic Claude Haiku: $0.80/1M tokens
    • Google Gemini 1.5 Flash: $0.075/1M tokens
    • Deepseek: $0.14/1M tokens

def run_single_benchmark(provider: str, model: str,
                         format_name: str, data: str):
    prompt = "Analyze this time-series data and summarize trends."
    full_prompt = f"{prompt}\n\n{data}"
    input_tokens = estimate_tokens(full_prompt)

    # Provider-specific cost rates
    cost_per_1m_tokens = {
        "openai": {"gpt-4o-mini": 0.15},
        "anthropic": {"claude-haiku-4-5-20251001": 0.8},
        "google": {"gemini-1.5-flash": 0.075},
        "deepseek": {"deepseek-chat": 0.14}
    }

    rate = cost_per_1m_tokens[provider][model]
    cost_usd = (input_tokens / 1_000_000) * rate
    cost_per_100k_datapoints = cost_usd * 100

    return {
        "provider": provider,
        "model": model,
        "format": format_name.upper(),
        "input_tokens": input_tokens,
        "cost_per_100k_datapoints": cost_per_100k_datapoints,
        # ... more metadata
    }
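
estimate_tokens isn't shown above; given the ~4 characters-per-token heuristic mentioned earlier, it could be as simple as this sketch (an assumption, not necessarily the repo's exact implementation):

import math


def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / 4)

A real tokenizer (e.g., OpenAI's tiktoken) would give exact counts for that provider's models, but for comparing formats against each other a uniform heuristic is enough, since it applies the same ratio to every format.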

Summary Statistics

After running all combinations (4 formats × 4 providers = 16 tests), we aggregate results:

def calculate_summary(results):
    """Calculate per-format averages and savings vs JSON."""
    format_groups = {}
    for r in results:
        fmt = r["format"]
        if fmt not in format_groups:
            format_groups[fmt] = []
        format_groups[fmt].append(r)

    format_stats = {}
    for fmt, items in format_groups.items():
        avg_tokens = sum(r["input_tokens"] for r in items) / len(items)
        avg_cost = sum(r["cost_per_100k_datapoints"] for r in items) / len(items)

        format_stats[fmt.lower()] = {
            "avg_input_tokens": round(avg_tokens),
            "avg_cost_per_100k": round(avg_cost, 4),
            "sample_count": len(items),
            "savings_vs_json_percent": 0.0
        }

    # Calculate savings relative to JSON baseline
    if "json" in format_stats:
        json_cost = format_stats["json"]["avg_cost_per_100k"]
        for fmt, stats in format_stats.items():
            if fmt != "json":
                savings = ((json_cost - stats["avg_cost_per_100k"]) / json_cost) * 100
                stats["savings_vs_json_percent"] = round(savings, 1)

    return format_stats
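
The surrounding driver code isn't reproduced here; a minimal sketch of how these pieces could be wired together (provider/model pairs taken from the pricing table above, the names PROVIDER_MODELS, FORMATS, and run_all_benchmarks are assumptions):

PROVIDER_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-haiku-4-5-20251001",
    "google": "gemini-1.5-flash",
    "deepseek": "deepseek-chat",
}
FORMATS = ["json", "csv", "tsln", "toon"]


def run_all_benchmarks():
    """Run every provider x format combination and aggregate the results."""
    results = []
    for provider, model in PROVIDER_MODELS.items():
        for fmt in FORMATS:
            data = generate_sample_data(fmt)
            results.append(run_single_benchmark(provider, model, fmt, data))
    return results, calculate_summary(results)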

GitHub Actions Automation

The workflow runs on a bi-weekly schedule but can also be triggered manually:

name: Run Benchmark Bi-weekly

on:
  schedule:
    # 1st and 15th of each month at 00:00 UTC (roughly every two weeks;
    # standard cron cannot express "every other Sunday")
    - cron: '0 0 1,15 * *'
  workflow_dispatch:  # Manual trigger via GitHub UI

permissions: 
  contents: write  # Required to commit results

jobs:
  run-benchmark:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with: 
          token: ${{ secrets.GITHUB_TOKEN }}
          persist-credentials: true

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Python dependencies
        run: |
          pip install openai anthropic google-generativeai

      - name: Run benchmark script
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          DEEPSEEK_API_KEY: ${{ secrets.DEEPSEEK_API_KEY }}
        run: |
          python benchmark/run_full_benchmark.py

      - name: Commit and push results
        run: |
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git config --local user.name "GitHub Actions"
          git add public/data/benchmark-results.json
          git diff --staged --quiet || git commit -m "Update benchmark results [automated]"
          git push

Key Features

API Secrets Management: API keys are stored as GitHub repository secrets and injected as environment variables during workflow execution.

Conditional Commits: The git diff --staged --quiet || pattern ensures we only commit when results actually change.

Automated Deployment: After pushing to main, Railway automatically detects the change and redeploys the Next.js site within ~2 minutes.


Real Results from Latest Run

Here's what our latest benchmark (January 20, 2026) shows for 100 stock price data points:

Format   Avg Tokens   Cost/100k Points   Savings vs JSON
JSON     1,397        $0.0404            — (baseline)
CSV      698          $0.0202            50.0%
TOON     698          $0.0202            50.0%
TSLN     177          $0.0052            87.3%

Key findings:

  • TSLN uses 87.3% fewer tokens than JSON across all providers
  • CSV and TOON are equivalent at ~50% savings (both avoid JSON's key repetition)
  • Savings are consistent across OpenAI, Anthropic, Google, and Deepseek
  • Per the table above, 100k data points cost ~$0.04 in JSON versus ~$0.005 in TSLN (averaged across providers), roughly an 8× cost gap
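
The 87.3% figure can be sanity-checked straight from the average token counts in the table, since cost scales linearly with input tokens at a fixed per-token rate:

>>> round((1397 - 177) / 1397 * 100, 1)
87.3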

Frontend Visualization

The benchmark results are visualized on our homepage using React + Recharts:

Features

Provider Tabs: Switch between aggregated view and provider-specific breakdowns (OpenAI, Anthropic, Google, Deepseek).

Interactive Table: Shows format comparison with highlighting for best performance.

Cost Comparison Chart: Bar chart using Recharts with color-coded formats:

  • 🔴 JSON (red) - baseline
  • 🟠 CSV (orange)
  • 🔵 TOON (blue)
  • 🟢 TSLN (green) - most efficient

Stats Cards: Display best format, max savings %, and test success rate.

Implementation Snippet

import { useEffect, useState } from 'react'
import Image from 'next/image'
import { ResponsiveContainer, BarChart, Bar, Cell } from 'recharts'

// PROVIDERS, chartData, formatColors and the BenchmarkData type are defined
// elsewhere in the component file; trimmed here for brevity.

export default function LLMBenchmark() {
  const [data, setData] = useState<BenchmarkData | null>(null)
  const [activeTab, setActiveTab] = useState('average')

  useEffect(() => {
    // Load from static JSON generated by GitHub Actions
    fetch('/data/benchmark-results.json')
      .then(res => res.json())
      .then(setData)
  }, [])

  // Calculate provider-specific or averaged stats
  const getProviderStats = (providerId: string) => {
    if (providerId === 'average') {
      return data?.summary.format_stats
    }

    // Filter and aggregate by provider
    const providerResults = data?.results.filter(
      r => r.provider === providerId
    ) ?? []
    // ... aggregate by format
  }

  return (
    <section>
      {/* Provider tabs */}
      <div className="flex gap-2">
        {PROVIDERS.map(provider => (
          <button key={provider.id} onClick={() => setActiveTab(provider.id)}>
            {provider.logo && <Image src={provider.logo} alt={provider.name} />}
            {provider.name}
          </button>
        ))}
      </div>

      {/* Results table and chart */}
      <ResponsiveContainer>
        <BarChart data={chartData}>
          <Bar dataKey="cost">
            {chartData.map((entry, i) => (
              <Cell key={i} fill={formatColors[entry.format]} />
            ))}
          </Bar>
        </BarChart>
      </ResponsiveContainer>
    </section>
  )
}

CI/CD Pipeline: GitHub Actions → Railway

Our deployment pipeline is fully automated:

1. GitHub Actions Runs Benchmark

  • Trigger: Cron schedule (bi-weekly) or manual dispatch
  • Actions: Run Python benchmark, commit JSON results
  • Output: public/data/benchmark-results.json committed to main

2. Railway Detects Git Push

  • Connected to GitHub: Railway monitors our main branch
  • Auto-build: Detects commit, runs npm run build
  • Auto-deploy: Ships new build to production

3. Next.js Loads Static JSON

  • Static file: Results JSON is in /public, served directly
  • Client-side fetch: React component loads on mount
  • Fast & cacheable: No backend needed for benchmark data

This architecture is serverless-friendly: the benchmark results are just static JSON, so we avoid database costs and API rate limits.


TypeScript Type Safety

We maintain type definitions shared between Python output and TypeScript frontend:

// lib/benchmark-types.ts
export interface BenchmarkResult {
  provider: string
  model: string
  format: string
  input_tokens: number
  output_tokens: number
  cost_usd: number
  cost_per_100k_datapoints: number
  timestamp: string
  success: boolean
}

export interface FormatSummaryStats {
  avg_input_tokens: number
  avg_cost_per_100k: number
  sample_count: number
  savings_vs_json_percent?: number
}

export interface BenchmarkData {
  benchmark_date: string
  job_id: string
  config: { formats: string[]; providers: string[]; /* ... */ }
  results: BenchmarkResult[]
  summary: {
    format_stats: Record<string, FormatSummaryStats>
    best_format: string
    max_savings_percent: number
  }
}

This ensures the Python output schema matches what the React component expects.
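
For illustration, a Python-side writer that emits exactly those field names might look like this. It's a sketch, not the repo's actual code: the function name write_results is assumed, and the real job_id and config presumably come from the workflow run.

import json
from datetime import datetime, timezone


def write_results(results, format_stats, path="public/data/benchmark-results.json"):
    """Write the payload consumed by the React component (fields mirror BenchmarkData)."""
    best = min(format_stats, key=lambda f: format_stats[f]["avg_cost_per_100k"])
    payload = {
        "benchmark_date": datetime.now(timezone.utc).isoformat(),
        "job_id": "local-run",  # placeholder; assumed to come from the CI run in practice
        "config": {
            "formats": ["json", "csv", "tsln", "toon"],
            "providers": ["openai", "anthropic", "google", "deepseek"],
        },
        "results": results,
        "summary": {
            "format_stats": format_stats,
            "best_format": best.upper(),
            "max_savings_percent": max(
                s["savings_vs_json_percent"] for s in format_stats.values()
            ),
        },
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)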


Live Visualization Deep Dive

The frontend uses Framer Motion for animations and Recharts for data visualization:

Provider Switching

Users can toggle between:

  • Average - Aggregated stats across all providers
  • OpenAI - GPT-4o-mini specific results
  • Anthropic - Claude Haiku specific results
  • Google - Gemini Flash specific results
  • Deepseek - Deepseek Chat specific results

When switching tabs, we filter data.results by provider and recalculate format averages dynamically:

const getProviderStats = (providerId: string) => {
  if (providerId === 'average') {
    return data?.summary.format_stats
  }

  const providerResults = data?.results.filter(
    r => r.provider === providerId
  ) ?? []

  // Group results by format
  const formatGroups: Record<string, BenchmarkResult[]> = {}
  providerResults.forEach(result => {
    const fmt = result.format.toLowerCase()
    if (!formatGroups[fmt]) formatGroups[fmt] = []
    formatGroups[fmt].push(result)
  })

  // Average a numeric field across a group of results
  const avg = (
    items: BenchmarkResult[],
    key: 'input_tokens' | 'cost_per_100k_datapoints'
  ) => items.reduce((sum, r) => sum + r[key], 0) / items.length

  // JSON is the baseline for the savings calculation
  const jsonCost = formatGroups.json
    ? avg(formatGroups.json, 'cost_per_100k_datapoints')
    : 0

  return Object.entries(formatGroups).reduce((stats, [fmt, results]) => {
    const cost = avg(results, 'cost_per_100k_datapoints')
    stats[fmt] = {
      avg_input_tokens: avg(results, 'input_tokens'),
      avg_cost_per_100k: cost,
      sample_count: results.length,
      savings_vs_json_percent: jsonCost ? ((jsonCost - cost) / jsonCost) * 100 : 0
    }
    return stats
  }, {} as Record<string, FormatSummaryStats>)
}

Color Coding

Each format has a semantic color:

  • 🔴 JSON (red) - Most expensive baseline
  • 🟠 CSV (orange) - Moderate efficiency
  • 🔵 TOON (blue) - Moderate efficiency
  • 🟢 TSLN (green) - Highest efficiency

Responsive Design

The chart uses ResponsiveContainer from Recharts to adapt to mobile/tablet/desktop. Table columns stack on smaller screens.


Lessons Learned

1. Static JSON > Database for Benchmark Results

We initially considered storing results in a database, but realized:

  • Results only update bi-weekly
  • No user-specific data
  • Static files are faster and free

2. GitHub Actions Auto-Commit is Powerful

The pattern of running a script, committing output, and pushing back to the repo unlocks:

  • Automated data pipelines
  • Version-controlled results
  • GitOps-style transparency

3. Railway's GitHub Integration is Seamless

We didn't write a single line of deploy config. Railway just watches our main branch and redeploys on every push. Perfect for small teams.

4. Token Efficiency Compounds Quickly

At 100 data points, TSLN saves ~$0.035 per run. At 10,000 data points, that becomes $3.50 per run. For production systems ingesting millions of time-series events, the savings are material.
