Yuri Tománek

Posted on Jan 19

Building Ahoj Metrics: Nearly 2 Years, Multiple Rewrites, One Rails SaaS

#performance #rails #saas #showdev

TL;DR: I spent nearly 2 years building a SaaS that runs Lighthouse audits from 18 global regions in about 2 minutes on average. After exploring Go, Rust, and TypeScript, I came back to Rails. Here's the journey, the tech stack, and what I learned.

The Problem

As a developer, I've always struggled to answer one question: "How fast is my site... really?"

Sure, you can run Lighthouse locally. But your MacBook on fast WiFi doesn't represent your users in Sydney on 3G, or customers in São Paulo on a typical mobile connection.

I wanted a tool that could:

Test from multiple global regions simultaneously
Run automated monitoring with alerts
Track performance over time
Be fast and reliable (no 10-minute waits)

Nothing on the market hit all these points at a reasonable price, so in February 2024 I started building Ahoj Metrics.

The Journey: Nearly 2 Years of Iteration

I started this project on February 20th, 2024 with a TypeScript version. Since then, I've:

Rewritten the backend multiple times (lost count honestly)
Explored Go, Rust, and pure TypeScript for the backend
Kept coming back to Rails - I've been using it since version 1.0 in early 2006 (20+ years!)
Completely overhauled the UI 3-4 times

Why did I keep rewriting? I was chasing "the perfect stack." Go felt too verbose. Rust was overkill. TypeScript on the backend felt like reinventing the wheel.

Final decision: Rails. I have 20+ years of experience with it, the ecosystem is mature, and I can ship fast. Sometimes the best tool is the one you know deeply.

The Tech Stack

Backend: Rails 8.1

Why Rails? 20+ years of experience, mature ecosystem, fast prototyping
Solid Queue for background jobs (no Redis needed)
Nanoid IDs instead of sequential integers, for cleaner URLs and IDs that can't be guessed by counting up
Multi-database setup: Primary, Cache, Queue, Cable databases
JWT authentication for API access
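To make the Nanoid point concrete, here is a minimal sketch of generating a Nanoid-style ID in plain Ruby. The helper name `generate_public_id` and the default length of 21 are assumptions for illustration; the app itself may well use a gem instead.

```ruby
require "securerandom"

# URL-safe alphabet in the style of Nanoid: 64 characters.
NANOID_ALPHABET = [*("A".."Z"), *("a".."z"), *("0".."9"), "_", "-"].freeze

# Hypothetical helper: builds a random, non-sequential ID.
# Unlike an auto-increment integer, it reveals nothing about row counts
# and cannot be enumerated by adding 1.
def generate_public_id(size = 21)
  Array.new(size) { NANOID_ALPHABET[SecureRandom.random_number(NANOID_ALPHABET.size)] }.join
end
```

A URL then carries this opaque string in place of a numeric ID such as `/reports/42`.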

Frontend: React + TypeScript + Vite

Hosted separately on Cloudflare Pages for edge performance
Brutalist design system (clean, fast, no-nonsense)
DaisyUI for component primitives
PostHog for product analytics

Infrastructure: Fly.io + AWS ECS

Fly.io Machines API for ephemeral Lighthouse workers: spawn on demand, test from 18 regions, destroy immediately
ARM64 architecture (Graviton) for cost efficiency
Self-hosted Raspberry Pi runner for CI/CD (roughly 4x faster builds, $0 cost)

Payments: Polar

Clean API, developer-friendly webhooks
Supports subscription lifecycle (upgrades, downgrades, failed payments)

Key Technical Decisions

1. Ephemeral Workers on Fly.io

Instead of keeping workers running 24/7, we spawn Fly.io machines on demand using their Machines API.

Here's the actual code:

class FlyMachinesService
  API_BASE_URL = "https://api.machines.dev/v1"

  def self.create_machine(region:, env:, app_name:)
    url = "#{API_BASE_URL}/apps/#{app_name}/machines"

    body = {
      region: region,
      config: {
        image: ENV.fetch("WORKER_IMAGE", "registry.fly.io/am-worker:latest"),
        size: "performance-2x",  # 2 vCPU, 4GB RAM
        auto_destroy: true,      # Key: destroy after completion
        restart: {
          policy: "no"
        },
        stop_config: {
          timeout: "30s",
          signal: "SIGTERM"
        },
        env: env,
        services: []
      }
    }

    response = HTTParty.post(
      url,
      headers: headers,
      body: body.to_json,
      timeout: 30
    )

    if response.success?
      Response.new(success: true, data: response.parsed_response)
    else
      Response.new(
        success: false,
        error: "API error: #{response.code} - #{response.body}"
      )
    end
  end
end

Why this works:

No idle worker costs (only pay for seconds of actual usage)
~2 minute average audit runtime
Scales to 18 regions simultaneously
Clean slate for every test (no cache pollution)
performance-2x size gives Lighthouse enough resources to run smoothly
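The excerpt above relies on a `headers` helper and a `Response` object that aren't shown. A minimal sketch of how they could look, assuming the API token lives in an environment variable named `FLY_API_TOKEN`. That variable name, the `machines_api_headers` name, and the shape of `Response` are all assumptions for illustration; the Machines API itself does expect a bearer token in the `Authorization` header.

```ruby
# Hypothetical result object; the real Response class is not shown in the excerpt.
Response = Struct.new(:success, :data, :error, keyword_init: true) do
  def success?
    success == true
  end
end

# Hypothetical helper: builds the request headers for the Machines API.
# Assumes the token is stored in FLY_API_TOKEN (an illustrative name).
def machines_api_headers(token = ENV.fetch("FLY_API_TOKEN"))
  {
    "Authorization" => "Bearer #{token}",
    "Content-Type" => "application/json"
  }
end
```

Returning a small result object instead of raising lets the caller mark a single region as failed without aborting the other regions.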

2. Self-Hosted ARM64 CI/CD

Building Docker images on GitHub-hosted x86 runners using QEMU emulation was painfully slow (18+ minutes).

Solution: Raspberry Pi 4/5 as a self-hosted runner with native ARM64 builds.

Results:

Build time: 4m 25s (down from 18+ min)
Cost: $0/build (vs $0.035/build on GitHub-hosted)
Deploys to AWS ECS Graviton (ARM64)

3. Solid Queue for Background Jobs

Rails 8 ships with Solid Queue, and I leaned in hard:

# config/recurring.yml
production:
  monitor_scheduler:
    class: MonitorSchedulerJob
    queue: default
    schedule: every minute

No Redis, no Sidekiq, no extra infrastructure. Just PostgreSQL.

Jobs run every minute to check which monitors are due for testing, spawn Fly.io workers, and aggregate results.
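The "which monitors are due" check can be sketched in plain Ruby. This is a rough illustration, not the real `MonitorSchedulerJob`: the `SiteMonitor` struct and its `interval_minutes` and `last_run_at` fields are assumed names.

```ruby
# Hypothetical monitor record; field names are assumptions for illustration.
SiteMonitor = Struct.new(:name, :interval_minutes, :last_run_at, keyword_init: true) do
  # A monitor is due if it has never run or its interval has fully elapsed.
  def due?(now)
    last_run_at.nil? || (now - last_run_at) >= interval_minutes * 60
  end
end

# Returns the monitors that should be tested on this scheduler tick.
def due_monitors(monitors, now: Time.now)
  monitors.select { |monitor| monitor.due?(now) }
end
```

Because the job fires every minute, the effective resolution of any monitor interval is one minute; a monitor that becomes due between ticks is picked up on the next one.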

Challenges & Lessons Learned

1. Quota Management: Simpler Than Expected

Free users get 20 audits/month. I initially overcomplicated this.

Key insights:

An audit is an audit, regardless of how many regions you test
- Test from 1 region? 1 audit consumed.
- Test from 5 regions? Still 1 audit consumed.
Monitor runs don't count toward quota; the background job records them separately, outside the quota system
Monitors are only available on paid plans (Starter+) anyway

The tier limits are defined in a simple config:

module HasTierLimits
  TIER_CONFIG = {
    free:       { quota: 20,             retention_days: 30,  monitors: 0,              team_members: 0,              max_regions: 5 },
    starter:    { quota: 100,            retention_days: 90,  monitors: 5,              team_members: 3,              max_regions: 5 },
    pro:        { quota: 500,            retention_days: nil, monitors: 20,             team_members: 10,             max_regions: 5 },
    enterprise: { quota: Float::INFINITY, retention_days: nil, monitors: Float::INFINITY, team_members: Float::INFINITY, max_regions: 18 }
  }.freeze
end

Quota tracking only happens for user-initiated audits:

def track_usage
  period = Time.current.strftime("%Y-%m")
  billable_user = quota_owner  # Team owner or current user
  usage = billable_user.usage_records.find_or_create_by(period: period)
  usage.increment!(:reports_count)
end
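Put together, the rules above amount to a small pre-flight check before an audit is created. A minimal sketch, with the quota and region limits copied from the tier config so it runs on its own; the `audit_allowed?` name and its parameters are hypothetical.

```ruby
# Limits copied from the tier config above so the sketch is self-contained.
TIER_QUOTAS = { free: 20, starter: 100, pro: 500, enterprise: Float::INFINITY }.freeze
TIER_MAX_REGIONS = { free: 5, starter: 5, pro: 5, enterprise: 18 }.freeze

# Hypothetical check: one audit costs one unit of quota,
# no matter how many regions it is tested from.
def audit_allowed?(tier:, used_this_period:, regions:)
  return false if regions.empty? || regions.size > TIER_MAX_REGIONS.fetch(tier)

  used_this_period < TIER_QUOTAS.fetch(tier)
end
```

For example, a free user with 19 audits used can still run one more audit across 5 regions, and the 21st request in the same month is refused.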

2. Multi-Region Testing Architecture

Testing from multiple regions simultaneously is the core feature. Here's how it works:

ReportRequest → Multiple Reports

User creates 1 ReportRequest (e.g., "Test example.com from 5 regions")
System creates 5 Report records (one per region)
5 Fly.io workers spawn simultaneously
Each worker runs Lighthouse and reports back independently
ReportRequest aggregates results and determines overall status
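The fan-out in the steps above can be sketched in plain Ruby with threads. This is an illustration under assumptions, not the app's code: `RegionReport` stands in for a Report row, and `spawner` is any callable that takes a region and returns a machine ID (in the real app that role would fall to the Fly.io service).

```ruby
# Hypothetical in-memory stand-in for a Report row.
RegionReport = Struct.new(:region, :status, :machine_id, keyword_init: true)

# Creates one report per region and launches all workers concurrently.
def fan_out(regions, spawner:)
  reports = regions.map { |region| RegionReport.new(region: region, status: "pending") }

  threads = reports.map do |report|
    Thread.new do
      report.machine_id = spawner.call(report.region)
      report.status = "running"
    rescue StandardError
      # A failed spawn marks only this region as failed.
      report.status = "failed"
    end
  end
  threads.each(&:join)

  reports
end
```

Each thread touches only its own report, so no locking is needed, and one region's spawn error never prevents the others from starting.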

The challenge: handling timeouts, retries, and partial failures.

class ReportRequest < ApplicationRecord
  has_many :reports

  # After each report updates, check if all are done
  def check_completion!
    return unless reports.all?(&:completed?)

    update!(status: "completed")
    update_cached_stats!  # Calculate avg performance across regions
    check_monitor_alert if site_monitor.present?  # Alert if below threshold
  end
end

Key decisions:

Each report is independent (can fail without blocking others)
Cached stats on ReportRequest for fast dashboard queries
Monitor alerts trigger after aggregating all regional results
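One way to reason about partial failures is to treat every report as finished once it reaches a terminal status, then summarize. The sketch below is an assumption-heavy illustration: the `failed`, `timed_out`, and `partial` status names and the `summarize` helper are hypothetical and not taken from the app.

```ruby
# Statuses that mean a regional report will not change any more.
# "failed" and "timed_out" are assumed names, not taken from the article.
TERMINAL_STATUSES = %w[completed failed timed_out].freeze

# Hypothetical summary of one request from its regional reports.
# Each report is a hash such as { status: "completed", performance: 97 }.
def summarize(reports)
  finished = !reports.empty? && reports.all? { |r| TERMINAL_STATUSES.include?(r[:status]) }
  return { status: "running", avg_performance: nil } unless finished

  succeeded = reports.select { |r| r[:status] == "completed" }
  status =
    if succeeded.size == reports.size
      "completed"
    elsif succeeded.empty?
      "failed"
    else
      "partial"
    end

  # Average only over regions that actually produced a score.
  avg = succeeded.empty? ? nil : (succeeded.sum { |r| r[:performance] }.to_f / succeeded.size).round(1)
  { status: status, avg_performance: avg }
end
```

With this shape, a single timed-out region yields a partial result with an average over the remaining regions instead of leaving the whole request stuck in progress.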

3. Frontend Performance Matters

We're a performance monitoring tool. If our site is slow, we lose credibility.

Stats (Lighthouse from 18 regions):

Performance: 95-100 across all regions
LCP: < 1.2s globally
CLS: 0.001

How:

Static React frontend on Cloudflare Pages (edge network)
Aggressive code splitting
No heavy meta-framework (no Next.js, just React on Vite)

What I'd Do Differently

1. Stop Chasing Perfection, Start Shipping

This is the biggest lesson. I wasted nearly 2 years rewriting and polishing, trying to build the "perfect" stack and the "perfect" UI.

Then I heard a quote by Eugène Delacroix: "The artist who aims at perfection in everything achieves it in nothing."

That hit hard. I realized that was me - constantly chasing perfection, never shipping.

So I stopped. I gave myself 2-3 days to finish core features, make sure it worked, and shipped it last Saturday. Any issues can be fixed while it's live.

Lesson: Done is better than perfect. Ship, iterate, improve based on real feedback.

2. Stop Rewriting, Use What You Know

I wasted months rewriting in different languages. Go, Rust, TypeScript - none of them were fundamentally better than Rails for this use case.

Lesson: Use what you know. Ship fast. Iterate based on real feedback, not hypothetical performance gains.

3. Start with Waitlist + Landing Page

I built the entire product before validating demand. Bad move.

Better approach: Landing page → waitlist → validate → build MVP.

4. Simpler Pricing Tiers

Four tiers (Free, Starter, Pro, Enterprise) is too many. Should've started with Free + Pro.

5. Public API from Day 1

Everyone asks for API access. Should've prioritized this earlier.

What's Next

Now that it's live, here's what I'm working on:

Webhooks & API - Full REST API for programmatic access and CI/CD integration
CrUX API Integration - Real User Monitoring data from Google's Chrome UX Report
Custom RUM Tool - Installable Real User Monitoring for your own customers
AI Insights - Automated performance recommendations and anomaly detection
Improved Graphs - Better data visualization for performance trends over time
More Regions - Expanding beyond 18 regions to cover more edge locations

Want to see something specific? Let me know in the comments!

Current Status

✅ Launched on ahojmetrics.com
✅ Free tier: 20 audits/month
✅ Paid plans: $35-$299/month
✅ 18 global regions available
✅ ~2 minute average audit runtime
✅ Automated monitoring with alerts
✅ Team collaboration

Try It Out

You can sign up for free at ahojmetrics.com (no credit card required).

Test your site from Sydney, London, Tokyo, São Paulo, and 14 other regions. See how Core Web Vitals, performance scores, and load times differ globally.

I'd love feedback from the dev community! What features would make this more useful for you?

Questions? Drop them in the comments or connect with me on LinkedIn.

Happy to share more details about any part of the stack!

DEV Community