A core problem we tackled when building real-time, LLM-based signal analysis is token efficiency: when you feed time-series data (stock prices, IoT sensors, blockchain events) into LLMs, the serialization format matters. A lot.
We needed hard numbers to prove it. So we built an automated benchmark system that runs every two weeks, tests four data formats across four major LLM providers, and publishes live results on our website.
Here's how we built it.
Proving Token Efficiency at Scale
Time-series data is structurally simple but verbose. JSON, the industry default, repeats keys on every row. CSV is better, but still repeats full timestamps and values. For LLMs, this repetition directly translates to tokens—and tokens cost money.
We developed TSLN (Time-Series Lean Notation), a format that exploits temporal regularity and delta encoding to reduce token count by up to 87%. But claiming efficiency isn't enough. We needed:
- Reproducible benchmarks across multiple LLM providers
- Automated execution so results stay current
- Public transparency so developers can verify our claims
The result: an open-source benchmark suite that runs automatically via GitHub Actions and displays live results on turboline.ai.
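Before getting into the architecture, here's the repetition problem made concrete. The sketch below shows the same three one-minute readings in JSON and in the TSLN style used throughout this post (simplified; the exact strings the benchmark generates appear in the script further down):

# JSON repeats both keys on every row...
json_style = (
    '{"data": [{"timestamp": "2024-01-01T09:00:00Z", "value": 150.0}, '
    '{"timestamp": "2024-01-01T09:01:00Z", "value": 151.0}, '
    '{"timestamp": "2024-01-01T09:02:00Z", "value": 152.0}]}'
)

# ...while TSLN states the start time and interval once, then lists bare values.
tsln_style = "t:2024-01-01T09:00:00Z|i:60|v:150.0,151.0,152.0"

Same information, a fraction of the characters — and characters are what the tokenizer sees.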
Architecture Overview
┌─────────────────────────────────────────────────────┐
│ GitHub Actions (Bi-weekly cron + manual trigger) │
│ ┌───────────────────────────────────────────────┐ │
│ │ 1. Checkout repo │ │
│ │ 2. Install Python deps (openai, anthropic...) │ │
│ │ 3. Run benchmark script │ │
│ │ 4. Commit results to public/data/*.json │ │
│ │ 5. Push to main branch │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
│
│ Git push triggers Railway
▼
┌─────────────────────────────────────────────────────┐
│ Railway (Automated CI/CD) │
│ ┌───────────────────────────────────────────────┐ │
│ │ 1. Detect commit to main │ │
│ │ 2. Build Next.js site │ │
│ │ 3. Deploy to production │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
│
▼
Website auto-loads /data/benchmark-results.json
Key components:
- Python benchmark runner (benchmark/run_full_benchmark.py)
- GitHub Actions workflow (.github/workflows/run-benchmark.yml)
- Next.js frontend (React component with Recharts visualization)
- Railway deployment (automatic on git push)
The Benchmark Script
The core script tests four serialization formats:
Data Formats Tested
- JSON - Baseline format with full object notation
- CSV - Header row with comma-separated values
- TSLN - Time-Series Lean Notation (our format)
- TOON - Token-Oriented Object Notation (pipe-delimited)
Sample Data Generation
import json

def generate_sample_data(format_name: str) -> str:
    """Generate 100 data points (one per minute, starting 09:00 UTC) in different formats."""
    # Timestamps roll over into the next hour so minutes stay in the valid 00-59 range
    if format_name == "json":
        return json.dumps({
            "data": [
                {"timestamp": f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z",
                 "value": 150.0 + i}
                for i in range(100)
            ]
        })
    elif format_name == "csv":
        rows = ["timestamp,value"]
        rows.extend([
            f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z,{150.0 + i}"
            for i in range(100)
        ])
        return "\n".join(rows)
    elif format_name == "tsln":
        # Compact format: start timestamp and 60-second interval stated once, then values only
        values = [str(150.0 + i) for i in range(100)]
        return "t:2024-01-01T09:00:00Z|i:60|v:" + ",".join(values)
    elif format_name == "toon":
        # Pipe-delimited rows with a single header line
        return "timestamp|value\n" + "\n".join([
            f"2024-01-01T{9 + i // 60:02d}:{i % 60:02d}:00Z|{150.0 + i}"
            for i in range(100)
        ])
    raise ValueError(f"Unknown format: {format_name}")
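A quick size check against the generator makes the gap visible before any model is involved (this reuses the ~4 characters/token heuristic described in the next section):

for fmt in ("json", "csv", "tsln", "toon"):
    sample = generate_sample_data(fmt)
    print(f"{fmt:>4}: {len(sample):>5} chars  (~{len(sample) // 4} tokens)")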
Token Counting & Cost Calculation
Each benchmark:
- Generates sample data (100 stock price data points)
- Counts tokens using a simple heuristic (~4 chars/token)
- Calculates costs using provider-specific pricing (per 1M input tokens):
- OpenAI GPT-4o-mini: $0.15/1M tokens
- Anthropic Claude Haiku: $0.80/1M tokens
- Google Gemini 1.5 Flash: $0.075/1M tokens
- Deepseek: $0.14/1M tokens
def run_single_benchmark(provider: str, model: str,
format_name: str, data: str):
prompt = "Analyze this time-series data and summarize trends."
full_prompt = f"{prompt}\n\n{data}"
input_tokens = estimate_tokens(full_prompt)
# Provider-specific cost rates
cost_per_1m_tokens = {
"openai": {"gpt-4o-mini": 0.15},
"anthropic": {"claude-haiku-4-5-20251001": 0.8},
"google": {"gemini-1.5-flash": 0.075},
"deepseek": {"deepseek-chat": 0.14}
}
rate = cost_per_1m_tokens[provider][model]
cost_usd = (input_tokens / 1_000_000) * rate
cost_per_100k_datapoints = cost_usd * 100
return {
"provider": provider,
"model": model,
"format": format_name.upper(),
"input_tokens": input_tokens,
"cost_per_100k_datapoints": cost_per_100k_datapoints,
# ... more metadata
}
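The estimate_tokens helper isn't shown above; a minimal version of the ~4 characters/token heuristic could be as simple as this (a sketch — swap in tiktoken or a provider tokenizer when exact counts matter):

def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly 4 characters per token for English-like text
    return max(1, len(text) // 4)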
Summary Statistics
After running all combinations (4 formats × 4 providers = 16 tests), we aggregate results:
def calculate_summary(results):
"""Calculate per-format averages and savings vs JSON."""
format_groups = {}
for r in results:
fmt = r["format"]
if fmt not in format_groups:
format_groups[fmt] = []
format_groups[fmt].append(r)
format_stats = {}
for fmt, items in format_groups.items():
avg_tokens = sum(r["input_tokens"] for r in items) / len(items)
avg_cost = sum(r["cost_per_100k_datapoints"] for r in items) / len(items)
format_stats[fmt.lower()] = {
"avg_input_tokens": round(avg_tokens),
"avg_cost_per_100k": round(avg_cost, 4),
"sample_count": len(items),
"savings_vs_json_percent": 0.0
}
# Calculate savings relative to JSON baseline
if "json" in format_stats:
json_cost = format_stats["json"]["avg_cost_per_100k"]
for fmt, stats in format_stats.items():
if fmt != "json":
savings = ((json_cost - stats["avg_cost_per_100k"]) / json_cost) * 100
stats["savings_vs_json_percent"] = round(savings, 1)
return format_stats
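A driver then runs every provider/format pair and writes the aggregate where the site expects it. Roughly like this (a sketch: field names follow the TypeScript types shown later, and the real benchmark/run_full_benchmark.py carries more metadata such as job_id, config, and per-test details):

import json
from datetime import datetime, timezone

def main():
    providers = {
        "openai": "gpt-4o-mini",
        "anthropic": "claude-haiku-4-5-20251001",
        "google": "gemini-1.5-flash",
        "deepseek": "deepseek-chat",
    }
    formats = ["json", "csv", "tsln", "toon"]

    # 4 formats x 4 providers = 16 tests
    results = []
    for fmt in formats:
        data = generate_sample_data(fmt)
        for provider, model in providers.items():
            results.append(run_single_benchmark(provider, model, fmt, data))

    output = {
        "benchmark_date": datetime.now(timezone.utc).isoformat(),
        "results": results,
        "summary": {"format_stats": calculate_summary(results)},
    }
    # Written into /public so the Next.js site can serve it as a static file
    with open("public/data/benchmark-results.json", "w") as f:
        json.dump(output, f, indent=2)

if __name__ == "__main__":
    main()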
GitHub Actions Automation
The workflow runs on a bi-weekly schedule but can also be triggered manually:
name: Run Benchmark Bi-weekly
on:
schedule:
    # 1st and 15th of each month at 00:00 UTC (approximately every two weeks)
    - cron: '0 0 1,15 * *'
workflow_dispatch: # Manual trigger via GitHub UI
permissions:
contents: write # Required to commit results
jobs:
run-benchmark:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
with:
token: ${{ secrets.GITHUB_TOKEN }}
persist-credentials: true
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install Python dependencies
run: |
pip install openai anthropic google-generativeai
- name: Run benchmark script
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
DEEPSEEK_API_KEY: ${{ secrets.DEEPSEEK_API_KEY }}
run: |
python benchmark/run_full_benchmark.py
- name: Commit and push results
run: |
git config --local user.email "github-actions[bot]@users.noreply.github.com"
git config --local user.name "GitHub Actions"
git add public/data/benchmark-results.json
git diff --staged --quiet || git commit -m "Update benchmark results [automated]"
git push
Key Features
API Secrets Management: API keys are stored as GitHub repository secrets and injected as environment variables during workflow execution.
Conditional Commits: The git diff --staged --quiet || pattern ensures we only commit when results actually change.
Automated Deployment: After pushing to main, Railway automatically detects the change and redeploys the Next.js site within ~2 minutes.
Real Results from Latest Run
Here's what our latest benchmark (January 20, 2026) shows for 100 stock price data points:
| Format | Avg Tokens | Cost/100k Points | Savings vs JSON |
|---|---|---|---|
| JSON | 1,397 | $0.0404 | — (baseline) |
| CSV | 698 | $0.0202 | 50.0% |
| TOON | 698 | $0.0202 | 50.0% |
| TSLN | 177 | $0.0052 | 87.3% ✨ |
Key findings:
- TSLN uses 87.3% fewer tokens than JSON across all providers
- CSV and TOON are equivalent at ~50% savings (both avoid JSON's key repetition)
- Savings are consistent across OpenAI, Anthropic, Google, and Deepseek
- For 100k data points, JSON costs ~$4 while TSLN costs ~$0.52 (average across providers)
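The headline figure falls straight out of the token averages in the table, using the same savings-vs-JSON calculation as calculate_summary:

json_tokens, tsln_tokens = 1397, 177  # average input tokens for 100 data points
savings = (json_tokens - tsln_tokens) / json_tokens * 100
print(f"TSLN vs JSON: {savings:.1f}% fewer tokens")  # -> 87.3%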
Frontend Visualization
The benchmark results are visualized on our homepage using React + Recharts:
Features
Provider Tabs: Switch between aggregated view and provider-specific breakdowns (OpenAI, Anthropic, Google, Deepseek).
Interactive Table: Shows format comparison with highlighting for best performance.
Cost Comparison Chart: Bar chart using Recharts with color-coded formats:
- 🔴 JSON (red) - baseline
- 🟠 CSV (orange)
- 🔵 TOON (blue)
- 🟢 TSLN (green) - most efficient
Stats Cards: Display best format, max savings %, and test success rate.
Implementation Snippet
import { useState, useEffect } from 'react'
import Image from 'next/image'
import { ResponsiveContainer, BarChart, Bar, Cell } from 'recharts'
// BenchmarkData comes from lib/benchmark-types.ts; PROVIDERS, chartData and
// formatColors are defined elsewhere in the component (omitted for brevity).

export default function LLMBenchmark() {
  const [data, setData] = useState<BenchmarkData | null>(null)
  const [activeTab, setActiveTab] = useState('average')

  useEffect(() => {
    // Load the static JSON generated by GitHub Actions
    fetch('/data/benchmark-results.json')
      .then(res => res.json())
      .then(setData)
  }, [])

  // Calculate provider-specific or averaged stats
  const getProviderStats = (providerId: string) => {
    if (!data) return null
    if (providerId === 'average') {
      return data.summary.format_stats
    }
    // Filter and aggregate by provider
    const providerResults = data.results.filter(
      r => r.provider === providerId
    )
    // ... aggregate by format
  }

  return (
    <section>
      {/* Provider tabs */}
      <div className="flex gap-2">
        {PROVIDERS.map(provider => (
          <button key={provider.id} onClick={() => setActiveTab(provider.id)}>
            {provider.logo && <Image src={provider.logo} alt={provider.name} />}
            {provider.name}
          </button>
        ))}
      </div>
      {/* Results table and chart */}
      <ResponsiveContainer>
        <BarChart data={chartData}>
          <Bar dataKey="cost">
            {chartData.map(entry => (
              <Cell key={entry.format} fill={formatColors[entry.format]} />
            ))}
          </Bar>
        </BarChart>
      </ResponsiveContainer>
    </section>
  )
}
CI/CD Pipeline: GitHub Actions → Railway
Our deployment pipeline is fully automated:
1. GitHub Actions Runs Benchmark
- Trigger: Cron schedule (bi-weekly) or manual dispatch
- Actions: Run Python benchmark, commit JSON results
- Output: public/data/benchmark-results.json committed to main
2. Railway Detects Git Push
- Connected to GitHub: Railway monitors our main branch
- Auto-build: Detects commit, runs npm run build
- Auto-deploy: Ships new build to production
3. Next.js Loads Static JSON
- Static file: Results JSON lives in /public, served directly
- Client-side fetch: React component loads it on mount
- Fast & cacheable: No backend needed for benchmark data
This architecture is serverless-friendly: the benchmark results are just static JSON, so we avoid database costs and API rate limits.
TypeScript Type Safety
We maintain type definitions shared between Python output and TypeScript frontend:
// lib/benchmark-types.ts
export interface BenchmarkResult {
provider: string
model: string
format: string
input_tokens: number
output_tokens: number
cost_usd: number
cost_per_100k_datapoints: number
timestamp: string
success: boolean
}
export interface FormatSummaryStats {
avg_input_tokens: number
avg_cost_per_100k: number
sample_count: number
savings_vs_json_percent?: number
}
export interface BenchmarkData {
benchmark_date: string
job_id: string
config: { formats: string[]; providers: string[]; /* ... */ }
results: BenchmarkResult[]
summary: {
format_stats: Record<string, FormatSummaryStats>
best_format: string
max_savings_percent: number
}
}
This keeps the Python output schema and the React component's expectations aligned, as long as both sides are updated together.
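Nothing enforces that alignment automatically, though. A lightweight guard on the Python side, run before committing results, could assert the keys the frontend relies on (a sketch, mirroring the interfaces above — not something the current pipeline does):

import json

REQUIRED_TOP_LEVEL = {"benchmark_date", "job_id", "config", "results", "summary"}
REQUIRED_RESULT_KEYS = {"provider", "model", "format", "input_tokens",
                        "cost_per_100k_datapoints", "success"}

def validate_output(path: str = "public/data/benchmark-results.json") -> None:
    """Fail loudly if the generated JSON drifts from what the frontend expects."""
    with open(path) as f:
        payload = json.load(f)
    missing = REQUIRED_TOP_LEVEL - payload.keys()
    assert not missing, f"missing top-level keys: {missing}"
    for result in payload["results"]:
        assert REQUIRED_RESULT_KEYS <= result.keys(), f"incomplete result: {result}"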
Live Visualization Deep Dive
The frontend uses Framer Motion for animations and Recharts for data visualization:
Provider Switching
Users can toggle between:
- Average - Aggregated stats across all providers
- OpenAI - GPT-4o-mini specific results
- Anthropic - Claude Haiku specific results
- Google - Gemini Flash specific results
- Deepseek - Deepseek Chat specific results
When switching tabs, we filter data.results by provider and recalculate format averages dynamically:
const getProviderStats = (providerId: string) => {
if (providerId === 'average') {
return data?.summary.format_stats
}
const providerResults = data.results.filter(
r => r.provider === providerId
)
// Group by format and calculate averages
const formatGroups = {}
providerResults.forEach(result => {
const fmt = result.format.toLowerCase()
if (!formatGroups[fmt]) formatGroups[fmt] = []
formatGroups[fmt].push(result)
})
  // avg(items, field) is a small helper (not shown) that averages a numeric field
  return Object.entries(formatGroups).reduce((stats, [fmt, results]) => {
stats[fmt] = {
avg_input_tokens: avg(results, 'input_tokens'),
avg_cost_per_100k: avg(results, 'cost_per_100k_datapoints'),
savings_vs_json_percent: /* calculated vs JSON */
}
return stats
}, {})
}
Color Coding
Each format has a semantic color:
- 🔴 JSON (red) - Most expensive baseline
- 🟠 CSV (orange) - Moderate efficiency
- 🔵 TOON (blue) - Moderate efficiency
- 🟢 TSLN (green) - Highest efficiency
Responsive Design
The chart uses ResponsiveContainer from Recharts to adapt to mobile/tablet/desktop. Table columns stack on smaller screens.
Lessons Learned
1. Static JSON > Database for Benchmark Results
We initially considered storing results in a database, but realized:
- Results only update bi-weekly
- No user-specific data
- Static files are faster and free
2. GitHub Actions Auto-Commit is Powerful
The pattern of running a script, committing output, and pushing back to the repo unlocks:
- Automated data pipelines
- Version-controlled results
- GitOps-style transparency
3. Railway's GitHub Integration is Seamless
We didn't write a single line of deploy config. Railway just watches our main branch and redeploys on every push. Perfect for small teams.
4. Token Efficiency Compounds Quickly
At 100 data points, TSLN saves ~$0.035 per run. At 10,000 data points, that becomes $3.50 per run. For production systems ingesting millions of time-series events, the savings are material.