DEV Community: Ariel Frischer

Agentic Coding Strategy: What Works, What Backfires

Ariel Frischer — Sun, 31 May 2026 19:58:32 +0000

The honest answer is not "use more agents" or "buy the biggest model." The best agentic coding strategy depends on the shape of the task.

The pattern I trust right now:

Small, sequential work: one strong agent with tight context.
Large, parallel work: specialized agents with clear handoffs.
Production work: specs before code, tests after code, and deterministic checks between phases.
Cost-sensitive work: route easy steps to cheaper models and reserve frontier models for ambiguity, architecture, and review.
Familiar codebases: use AI carefully. It can slow experienced developers down.

The biggest mistake is treating all coding work as the same workload.

1. Multi-agent teams help when the work is genuinely parallel

Multi-agent coding systems look strongest when the task can be split into roles: planner, coder, reviewer, tester, docs writer, migration specialist, security reviewer.

The best evidence I found points to specialization and cross-validation as the mechanism that helps. A multi-agent comparison from vibecoding.app reports 72.2% on SWE-bench Verified for multi-agent teams versus about 65% for single-agent baselines using similar model classes. The same writeup reports stronger review performance: 60.1% code-review F1 versus about 51% for single-agent review, plus better critical bug detection.

I would not interpret that as "always run five agents." I read it as:

Separate planning from execution when scope is broad.
Use a reviewer agent when correctness matters.
Fan out only when tasks do not need the same mutable context.
Keep handoffs explicit: changed files, task intent, tests run, known risks.
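The handoff checklist above can be made concrete as a small record each agent must fill in before passing work on. This is a minimal sketch; the Handoff class and its field names are hypothetical, not taken from any particular agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical handoff record one agent fills in before passing work on."""

    task_intent: str
    changed_files: list[str] = field(default_factory=list)
    tests_run: list[str] = field(default_factory=list)
    # Ask agents to write "none identified" rather than leave this empty,
    # so an empty list means the question was skipped, not answered.
    known_risks: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the checklist items that were left empty."""
        missing = []
        if not self.task_intent.strip():
            missing.append("task_intent")
        for name in ("changed_files", "tests_run", "known_risks"):
            if not getattr(self, name):
                missing.append(name)
        return missing


handoff = Handoff(
    task_intent="Add rate limiting to the login endpoint",
    changed_files=["auth/login.py", "tests/test_login.py"],
    tests_run=["pytest tests/test_login.py"],
    known_risks=["none identified"],
)
print(handoff.missing_fields())  # []
```

A receiving agent (or the orchestrator) can refuse to start until missing_fields() is empty, which turns "keep handoffs explicit" into a deterministic check.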

The tradeoff is real. Multi-agent runs cost more tokens, create coordination overhead, and make failures harder to debug. If one agent can hold the whole problem and the task is sequential, adding agents usually adds latency.

2. Hierarchical decomposition beats one giant plan

Long-horizon work fails when the agent tries to hold the entire plan in one flat list. The useful move is hierarchy:

Goal.
Milestones.
Interfaces between milestones.
File-level tasks.
Verification gates.

This is not ceremony. It limits compounding error.
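The hierarchy above maps naturally onto nested data with a gate at each milestone. This is a minimal sketch, assuming each verification gate is a zero-argument check (for example, a wrapper around your test command); the class and function names are hypothetical, and interfaces between milestones are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    file: str
    description: str


@dataclass
class Milestone:
    name: str
    tasks: list[Task]
    # Verification gate: a deterministic check such as "tests pass" or "schema validates".
    gate: Callable[[], bool]


@dataclass
class Goal:
    name: str
    milestones: list[Milestone] = field(default_factory=list)


def run_plan(goal: Goal, execute: Callable[[Task], None]) -> list[str]:
    """Run milestones in order and stop at the first milestone whose gate fails.

    Returns the names of milestones that passed their gate.
    """
    passed = []
    for milestone in goal.milestones:
        for task in milestone.tasks:
            execute(task)
        if not milestone.gate():
            # Stop before later milestones build on a broken foundation.
            break
        passed.append(milestone.name)
    return passed
```

The early break is the point: an error caught at one gate never reaches the file-level tasks of the next milestone.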

Spec Kit Agents is a good example of this direction. The paper describes a staged workflow with SPEC, PLAN, TASKS, and IMPLEMENT phases, plus context-grounding hooks before each stage and validation hooks after. The reported result is modest but useful: context-grounding improved judged quality by 0.15 on a 1-5 composite score and improved SWE-bench Lite Pass@1 by 1.7 percentage points, reaching 58.2%.

That is the right lesson: specs do not magically make agents brilliant. They make failures visible earlier.
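The staged shape described in the Spec Kit Agents paper can be sketched as a loop with a grounding hook before each stage and a validation hook after it. This is not the paper's code; the stage callables, ground, and validate are hypothetical stand-ins for model calls and deterministic checks.

```python
from typing import Callable

STAGES = ["SPEC", "PLAN", "TASKS", "IMPLEMENT"]

# A stage takes (grounded context, previous artifact) and returns a new artifact.
Stage = Callable[[str, str], str]


def run_stages(
    stages: dict[str, Stage],
    ground: Callable[[str], str],
    validate: Callable[[str, str], bool],
) -> dict[str, str]:
    """Run stages in order with a grounding hook before and a validation hook after each.

    Raises ValueError at the first stage whose artifact fails validation.
    """
    artifacts: dict[str, str] = {}
    previous = ""
    for name in STAGES:
        context = ground(name)  # e.g. relevant files, repo conventions, prior decisions
        artifact = stages[name](context, previous)
        if not validate(name, artifact):
            # Surface the failure at this stage instead of letting it reach IMPLEMENT.
            raise ValueError(f"{name} artifact failed validation")
        artifacts[name] = artifact
        previous = artifact
    return artifacts
```

Raising at the failing stage is what "make failures visible earlier" means in practice: a bad PLAN stops the run before any code is written.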

3. Spec-driven and test-driven workflows solve different problems

Specs define intent. Tests define success.

For agentic coding, I want both:

A short spec that names the behavior, constraints, non-goals, and acceptance criteria.
A plan that lists touched files and risky assumptions.
Tests or checks that prove the change works.
A final review pass that reads the diff against the original spec.
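One cheap, deterministic part of that final review is comparing the files the plan said it would touch against the files the diff actually changed. A minimal sketch, assuming the changed-file list comes from git diff --name-only; the function names are hypothetical.

```python
def changed_files(name_only_output: str) -> set[str]:
    """Parse the output of `git diff --name-only` into a set of paths."""
    return {line.strip() for line in name_only_output.splitlines() if line.strip()}


def review_scope(planned: set[str], changed: set[str]) -> dict[str, set[str]]:
    """Compare the plan's touched files with the files the diff actually changed."""
    return {
        # Changed but never mentioned in the plan: possible scope drift.
        "unplanned": changed - planned,
        # Planned but not changed: possibly unfinished work.
        "untouched": planned - changed,
    }
```

Neither set being non-empty is automatically wrong, but each entry is a concrete question for the reviewer to answer against the spec.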

Spec-driven development is most useful when the agent might hallucinate APIs, ignore repo conventions, or drift from the original goal. Test-driven development is most useful when the expected behavior can be encoded as an executable signal.

Where this backfires: small changes with obvious local scope. Do not write a five-page spec for a two-line validation fix.

4. The harness matters more than people want to admit

MindStudio summarized a harsh result: the same model can show up to 6x performance variation from harness design alone. Their practical recommendations are worth operationalizing:

Remove tools the agent does not need.
Keep irrelevant context out of the prompt.
Test whether verifiers and search loops help your workload before assuming they do.
Put orchestration logic where the model can understand it.
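The first recommendation, removing tools the agent does not need, can be applied per task before the tool list ever reaches the prompt. A minimal sketch; the tool registry and its entries are hypothetical.

```python
# Hypothetical tool registry; real tool specs would carry JSON schemas.
TOOLS = {
    "read_file": {"description": "Read a file from the repo"},
    "run_tests": {"description": "Run the test suite"},
    "browser": {"description": "Drive a headless browser"},
    "send_email": {"description": "Send an email"},
}


def prune_tools(tools: dict[str, dict], needed: set[str]) -> dict[str, dict]:
    """Return only the tool definitions the current task needs, in their original order."""
    unknown = needed - set(tools)
    if unknown:
        # Fail loudly rather than silently dropping a tool the task expects.
        raise KeyError(f"task requests unknown tools: {sorted(unknown)}")
    return {name: spec for name, spec in tools.items() if name in needed}
```

For a bug fix, prune_tools(TOOLS, {"read_file", "run_tests"}) keeps the browser and email tools out of the model's view entirely.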

This matches my own bias: agent performance is often a systems problem, not just a model-selection problem.

The question is not "what is the best model?" It is:

What context does the model see?
What tools can it call?
What work is deterministic outside the model?
What gets verified before the next step?
How easy is it to inspect why the run failed?

If those answers are bad, a stronger model mostly gives you a more expensive failure.

5. Model tiering is the cost control strategy

Do not send every step to your most expensive model.

A practical routing policy:

Cheap or local model: search, summarization, boilerplate, formatting, doc drafts.
Workhorse model: normal implementation, test generation, straightforward refactors.
Frontier model: ambiguous architecture, hard debugging, security-sensitive changes, final review.

The exact model names will change. The routing rule should not.
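The routing policy above fits in a lookup table. A minimal sketch, assuming you classify each step into a task kind first; the kind names and the placeholder model strings are hypothetical, and unknown kinds fall back to the workhorse tier.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap-or-local"
    WORKHORSE = "workhorse"
    FRONTIER = "frontier"


# Task kinds from the policy above; the names are illustrative.
ROUTES = {
    "search": Tier.CHEAP,
    "summarization": Tier.CHEAP,
    "boilerplate": Tier.CHEAP,
    "formatting": Tier.CHEAP,
    "doc_draft": Tier.CHEAP,
    "implementation": Tier.WORKHORSE,
    "test_generation": Tier.WORKHORSE,
    "refactor": Tier.WORKHORSE,
    "architecture": Tier.FRONTIER,
    "hard_debugging": Tier.FRONTIER,
    "security_change": Tier.FRONTIER,
    "final_review": Tier.FRONTIER,
}

# Placeholder model names: swap these as models change; the routing table stays put.
MODELS = {Tier.CHEAP: "local-small", Tier.WORKHORSE: "mid-model", Tier.FRONTIER: "frontier-model"}


def route(task_kind: str) -> str:
    """Return the model for a task kind; unknown kinds default to the workhorse tier."""
    return MODELS[ROUTES.get(task_kind, Tier.WORKHORSE)]


print(route("formatting"))  # local-small
```

Keeping tiers and model names in separate tables is what lets "the exact model names change" without touching the routing rule.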

Agentic work can burn tokens unpredictably. A 2026 arXiv paper on token consumption in agentic coding tasks reports that agentic tasks can consume far more tokens than simple code chat, that runs on the same task can vary dramatically in token usage, and that higher token spend does not reliably mean higher accuracy.

So the default should be measured escalation, not automatic frontier-model usage.
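Measured escalation can be expressed as a loop over tiers that stops at the first output passing a deterministic check. A sketch under stated assumptions: attempt and verify are hypothetical callables standing in for a model call and your tests or linters.

```python
from typing import Callable, Optional


def escalate(
    task: str,
    tiers: list[str],
    attempt: Callable[[str, str], str],
    verify: Callable[[str], bool],
) -> tuple[Optional[str], list[str]]:
    """Try the cheapest tier first and move up only when verification fails.

    Returns (accepted output or None, list of tiers tried).
    """
    tried = []
    for model in tiers:
        tried.append(model)
        output = attempt(model, task)
        if verify(output):
            return output, tried
    # Every tier failed verification: hand this one to a human.
    return None, tried
```

Logging the tried list per task also gives you the data to tell whether a given task kind should start at a higher tier next time.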

6. Experienced developers on familiar codebases should be selective

The METR randomized controlled trial is the caution flag. In their early-2025 study, 16 experienced open-source developers worked on 246 real issues in repositories they knew well. Allowing AI tools increased completion time by 19%, even though developers believed AI had made them faster.

That does not prove AI slows everyone down. METR is careful about that. It does show a real failure mode:

The developer already knows the system.
The task depends on local conventions.
The AI produces plausible code that requires review and cleanup.
The review cost exceeds the generation savings.

For senior developers in familiar code, AI should often be scoped to narrow support tasks: search, test scaffolds, migration drafts, alternative designs, and review checklists.

My decision framework

Situation	Default strategy	Why
Solo dev, small project	Single strong agent	Lower overhead, faster iteration
Familiar codebase, precise edit	Minimal AI assistance	Review cost can exceed generation savings
Large feature across subsystems	Planner plus parallel implementers	Real parallelism and scoped context
Production code review	Specialized reviewer agents	Fresh passes catch different bug classes
Long-horizon project	Hierarchical decomposition	Prevents flat-plan drift
Clear behavior with known tests	Spec plus tests	Intent and success are both explicit
Cost-sensitive pipeline	Model tiering	Spend frontier tokens only where needed

The bucket answer

For most solo developers, the best default is still a single strong agent with excellent context management.

For complex, parallelizable production work, the best default is:

A short spec.
Hierarchical task decomposition.
Specialized agents only where the work naturally splits.
Deterministic checks between handoffs.
A reviewer agent before human review.
Model routing by task difficulty.

More agents are not the strategy. Better task boundaries are the strategy.

Sources:

Multi-agent benchmark and tradeoff summary: https://vibecoding.app/blog/multi-agent-vs-single-agent-coding
Spec Kit Agents paper: https://arxiv.org/html/2604.05278v1
Harness design discussion: https://www.mindstudio.ai/blog/better-model-vs-better-harness-agent-benchmark-score
METR experienced developer RCT: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Agentic token consumption paper: https://arxiv.org/abs/2604.22750

arc-agent: AI System Design Generator

Ariel Frischer — Sun, 17 May 2026 01:41:38 +0000

Finally, a structured approach to system design: arc-agent.

It is a Go CLI for turning a system design prompt into a real design workspace: requirements, entities, APIs, high-level design, optional deep dives, and architecture diagrams.

The core idea is simple: do not ask an agent for one big architecture document and hope it is coherent. Split the work into stages, validate every artifact, repair failures, and render the result into files that are easy to review.
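arc-agent's internals are not described here, so the following is a generic sketch of the stage, validate, repair pattern rather than its actual implementation. The stage names loosely follow the workspace files it produces; generate, validate, and repair are hypothetical callables (in practice, a model call, a schema check, and a model call given the errors).

```python
from typing import Callable

ARC_STAGES = ["requirements", "entities", "api", "high-level-design", "diagram"]


def generate_design(
    generate: Callable[[str, dict], dict],
    validate: Callable[[str, dict], list[str]],
    repair: Callable[[str, dict, list[str]], dict],
    max_repairs: int = 2,
) -> dict[str, dict]:
    """Generate each stage from earlier artifacts, validating and repairing as needed."""
    workspace: dict[str, dict] = {}
    for stage in ARC_STAGES:
        # Each stage sees every artifact produced so far.
        artifact = generate(stage, workspace)
        errors = validate(stage, artifact)
        repairs = 0
        while errors and repairs < max_repairs:
            artifact = repair(stage, artifact, errors)
            errors = validate(stage, artifact)
            repairs += 1
        if errors:
            raise RuntimeError(f"{stage} still invalid after {max_repairs} repairs: {errors}")
        workspace[stage] = artifact
    return workspace
```

Bounding the repair loop matters: an artifact that cannot be fixed in a couple of passes is better surfaced to a human than retried indefinitely.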

What it produces

An arc-agent workspace contains structured YAML artifacts plus generated outputs:

arc.yaml
01-requirements.yaml
02-entities.yaml
03-api.yaml
04-high-level-design.yaml
06-diagram.yaml
DESIGN.md
diagram.mmd
diagram.excalidraw.png

That gives you a readable design doc, Mermaid source, and image artifacts without losing the structured intermediate data.

Quickstart

Install the CLI:

curl -fsSL https://raw.githubusercontent.com/ariel-frischer/arc-agent/main/install.sh | sh

Then ask it for a design:

arc-agent new "Design a URL shortener like Bitly" --out designs/bitly
arc-agent inspect designs/bitly
arc-agent validate designs/bitly
arc-agent render designs/bitly --format all

You can run it through an agent provider such as OpenCode, Codex, or Claude Code. You can also use direct mode with OpenCode Go structured outputs when you want stricter, faster generation.

Repo: https://github.com/ariel-frischer/arc-agent

Autospec: Spec-Driven Development for AI Coding Agents

Ariel Frischer — Thu, 14 May 2026 19:54:25 +0000

AI coding agents are powerful, but the workflow can get messy fast: vague prompts, drifting context, half-finished plans, and implementation work that starts before the requirements are clear.

autospec is a CLI for bringing structure back into that loop.

It turns feature work into a repeatable flow:

specify -> plan -> tasks -> implement

Each stage produces YAML artifacts, so the output is structured, reviewable, and easy for tools to validate.

Install

curl -fsSL https://raw.githubusercontent.com/ariel-frischer/autospec/main/install.sh | sh

You will also need Git and one supported coding agent:

Claude Code
Codex CLI
OpenCode

Set up a project

From inside your repo:

autospec doctor
autospec init
autospec constitution

doctor checks dependencies, init sets up autospec config, and constitution creates project-level principles for future specs.

The basic workflow

Start by generating only the spec:

autospec run -s "Add user authentication with OAuth"

Autospec creates a feature directory like:

specs/
  001-user-authentication/
    spec.yaml

Review and edit the spec before continuing. Then run the rest:

autospec run -pti

That continues through:

plan -> tasks -> implement

The resulting directory grows into:

specs/
  001-user-authentication/
    spec.yaml
    plan.yaml
    tasks.yaml

One-command mode

For smaller features, you can run the full flow at once:

autospec run -a "Add a caching layer for API responses"

-a means all core stages:

-spti

That expands to:

specify -> plan -> tasks -> implement
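The flag letters map onto stages in a fixed order: -s specify, -p plan, -t tasks, -i implement, and -a for all four. The sketch below mirrors only that documented mapping; it is not autospec's actual argument parser.

```python
ORDER = ["specify", "plan", "tasks", "implement"]
FLAGS = {"s": "specify", "p": "plan", "t": "tasks", "i": "implement"}


def expand(flag_cluster: str) -> list[str]:
    """Expand a short-flag cluster such as "-pti" or "-a" into stages, in pipeline order."""
    letters = flag_cluster.lstrip("-")
    if "a" in letters:
        letters = "spti"  # -a is shorthand for -spti
    unknown = set(letters) - set(FLAGS)
    if unknown:
        raise ValueError(f"unknown stage flags: {sorted(unknown)}")
    selected = {FLAGS[letter] for letter in letters}
    # Stages run in pipeline order no matter how the flags were typed.
    return [stage for stage in ORDER if stage in selected]
```

So expand("-pti") gives plan, tasks, implement, which is the "continue after reviewing the spec" step shown earlier.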

Useful commands

# Planning only: specify, plan, tasks
autospec prep "Add billing webhooks"

# Full workflow shortcut
autospec all "Add team invitations"

# Implementation only
autospec implement

# Resume from a specific phase
autospec implement --from-phase 3

# Run a single task
autospec implement --task T003

# Show progress
autospec status
autospec st -v
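The resume options boil down to choosing which tasks still need to run. A hedged sketch of that selection, assuming each task has an ID, a phase number, and a done flag; autospec's real tasks.yaml schema may differ, and TaskItem and select_tasks are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskItem:
    id: str
    phase: int
    done: bool = False


def select_tasks(
    tasks: list[TaskItem],
    from_phase: Optional[int] = None,
    task_id: Optional[str] = None,
) -> list[str]:
    """Pick task IDs to run: one named task, or every unfinished task from a phase onward."""
    if task_id is not None:
        # Like --task: run exactly the requested task.
        return [t.id for t in tasks if t.id == task_id]
    # Like --from-phase: skip earlier phases and anything already completed.
    start = from_phase if from_phase is not None else 1
    return [t.id for t in tasks if t.phase >= start and not t.done]
```

Because completion state lives in a file rather than a chat transcript, a failed run can pick up from exactly this list.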

Pick your agent

Autospec can run with different coding agents:

autospec run -a --agent claude "Add unit tests"
autospec run -a --agent codex "Add CLI smoke tests"
autospec run -a --agent opencode "Add REST API endpoints"

You can also configure the agent during init:

autospec init --ai codex
autospec init --ai claude,codex,opencode

Why it is useful

The main benefit is not that it writes code for you. The benefit is that it slows the agent down in the right places.

Instead of jumping straight from prompt to patch, autospec gives you review points:

Is the requirement correct?
Is the plan sane?
Are the tasks ordered properly?
Can implementation resume cleanly if something fails?

Because the artifacts are YAML-first, they are easier to inspect, validate, diff, and update than a long chat transcript.

What Happens If Mythos Ships Before the Patches Do

Ariel Frischer — Fri, 17 Apr 2026 01:37:59 +0000

Anthropic did not ship Claude Mythos Preview to the public. They staged it through Project Glasswing, a coordinated disclosure program routing the model to critical-infrastructure operators and upstream open-source maintainers first. The public gets the model after the patches land, not before.

It is worth asking what the other timeline looks like. Same model, same capabilities, but pushed to the API on launch day. What actually happens?

What the weapon does

Mythos Preview is not a better fuzzer. It reasons about code. The published evaluations are the relevant data:

Thousands of previously unknown vulnerabilities across every major operating system (Linux, Windows, macOS, OpenBSD, FreeBSD) and every major browser (Chrome, Safari, Edge, Firefox).
Tier-5 control-flow hijack on ten separate, fully patched OSS-Fuzz targets. Opus 4.6, for comparison, reached tier-3 on one.
Multi-bug exploit chains against the Linux kernel, the kind of work previously associated with elite human researchers.
A guest-to-host memory-corruption flaw in a production hypervisor. That one matters because it breaks the boundary cloud providers sell you.
A 27-year-old OpenBSD TCP SACK kernel-crash chain and a 16-year-old FFmpeg H.264 decoder flaw, both hiding in plain sight.
Roughly $20,000 for one thousand agent runs against OpenBSD, surfacing dozens of findings. The marginal cost per exploit is dinner money.

Exploit capability was not explicitly trained. It emerged as a downstream consequence of code-reasoning improvements, which is the part that should concern anyone modeling where capability is headed.

The defender has a structural problem

Every serious answer to "what would happen" turns on one number: the gap between how fast an attacker can weaponize and how fast a defender can patch.

The attacker cycle, with Mythos in hand, is minutes per target. Spin up a hundred parallel agents and it is seconds per target in aggregate.

The defender cycle is this:

Browser emergency patch: three to seven days to ship, then weeks for users to actually apply it.
Enterprise Windows rollout: thirty to ninety days is routine.
Embedded systems, routers, IoT, industrial control: months to never.

That gap is not a detail. It is the entire game.

Who gets hit

Drive-by browser compromise is the unsexy answer that matters most. Every consumer device on the internet runs one of four browsers that have known-exploitable zero-days in the public release scenario. Malvertising networks and watering-hole campaigns do not require users to make mistakes. They require users to load a webpage.

Four concentric rings of harm, from center out:

Consumers. Info-stealers, banking trojans, ransomware delivered via ordinary web traffic. Tens to hundreds of millions of endpoints touched in the first month. Not a guess. The browsers in question have billions of users between them, and the exploits work before patches ship.
Cloud tenants. The hypervisor escape means a ten-dollar-per-hour attacker VM on the same physical host as your production workload can pivot to it. Multi-tenant isolation was the architectural assumption underneath the entire public-cloud industry.
Critical infrastructure. Hospitals, utilities, municipal government, school districts. The organizations least equipped to patch in days rather than months. Every Change Healthcare, every Colonial Pipeline, but concurrent.
The long tail. Home routers, consumer IoT, industrial controllers, embedded medical devices. These never fully patch. They become a permanent botnet substrate.

Timeline

Rough, but grounded in how past mass-exploitation events actually unfolded:

Hour 0 to 24. Proofs of concept spread on private channels. Nation-state actors scale first because they already have the infrastructure.
Day 1 to 7. First malvertising waves. Browser vendors push emergency patches. Adoption is days to weeks behind.
Week 1 to 4. Enterprise ransomware wave hits before patch rollouts complete. Cloud tenant breaches start surfacing in disclosures.
Month 1 to 3. Hospitals, schools, small businesses without patching discipline absorb the impact. Long-tail exploitation of infrastructure that will never get patched begins.

The right comparison is not a single past event

Every analogy people reach for undershoots. EternalBlue gave us WannaCry. Heartbleed exposed roughly seventeen percent of secure web servers. Log4Shell touched hundreds of millions of devices. Stagefright covered most of Android. Spectre covered most CPUs.

The counterfactual Mythos release is not any one of those. It is all of them simultaneously, plus an agent that weaponizes each one autonomously for the price of a coffee. The direct-harm population, meaning people who lose money, have data stolen, or lose access to services they need, is plausibly north of one hundred million in the first quarter. The indirect-harm population, through degraded healthcare and finance and utilities, is effectively everyone connected to the internet.

Why Glasswing is the actual story

The conversation around Mythos has focused on whether Anthropic is being paternalistic by withholding it. That framing misses the point. The model is withheld because the defender patch cycle cannot keep up with the attacker weaponization cycle, and the only way to close that gap is to patch before the weapon is public.

Project Glasswing is the patch window. The public release is delayed because a staged release hurts fewer people.

The counterfactual question is useful mostly because it makes the existing decision legible. The decision is not "do we want this capability in the world." The capability is coming, from Anthropic or from someone else, with or without coordinated disclosure. The decision is whether the first day it exists in the open is a day defenders have had a chance to prepare for.

That is the whole argument.

OpenClaw vs Hermes Agent: A Comprehensive Comparison

Ariel Frischer — Thu, 09 Apr 2026 02:34:35 +0000

Both connect LLMs to messaging platforms and let agents run code, manage memory, and automate tasks. But they come at the problem from opposite ends.

OpenClaw is a TypeScript gateway. One daemon manages connections to WhatsApp, Telegram, Discord, and a dozen other platforms, routing messages to isolated agents with separate workspaces, tools, and memory. You define what each agent can do. The gateway handles the rest — five minutes to first message.

Hermes Agent is Nous Research's Python agent runtime. It ships with 47 tools, agent-managed memory, and a self-improvement loop where the agent creates its own skills from experience. Where OpenClaw separates the gateway from the agent, Hermes bundles everything into a single monolithic class.

The core tension: OpenClaw optimizes for operational control across multiple agents. Hermes optimizes for single-agent depth and adaptability.

	OpenClaw	Hermes Agent
GitHub	openclaw/openclaw	NousResearch/hermes-agent
Stars	~352K	~37.5K
Contributors	360+	210+
Language	TypeScript (Node.js)	Python
First release	Nov 2025	Jul 2025
Latest version	2026.4.8	v0.8.0 (2026.4.8)
License	MIT	MIT
npm downloads	~1.6M/week	N/A (shell installer)
Docs	docs.openclaw.ai	hermes-agent.nousresearch.com
Skills marketplace	ClawHub (5,700+ skills)	Skills Hub (643 skills)

Architecture

OpenClaw runs a single Gateway daemon that owns all channel connections, routes messages to agent sessions, and manages state. Agents are defined in config — each with its own workspace, model, tool policies, and memory. The Gateway doesn't care what agents do internally; you can swap implementations, run different models per agent, and serve multiple users from one process. An ACP Bridge provides IDE integration with Zed, Codex, and Claude Code.

Hermes puts the agent at the center. CLI, Gateway, ACP adapter, and API server all instantiate the same AIAgent class — conversation loop, tool dispatch, memory management, skill creation all in one place. This centralizes logic but couples everything tightly.

OpenClaw separates concerns. Hermes consolidates them.

In practice: separated layers can be swapped, scaled, and debugged independently. Consolidated systems are simpler to start with, but changes ripple across the whole stack.

Messaging platforms

OpenClaw: WhatsApp, Telegram, Discord, Slack, Signal, iMessage, Matrix, MS Teams, Google Chat, IRC, Nostr, Twitch, WebChat, Zalo — 14 platforms via plugins.

Hermes: Telegram, Discord, Slack, WhatsApp, Signal, Matrix, Mattermost, Email, SMS, DingTalk, Feishu, WeCom, Home Assistant, Webhooks — 14 platforms.

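Both cover WhatsApp, Telegram, Discord, Slack, Signal, and Matrix. Only OpenClaw supports iMessage, MS Teams, Google Chat, IRC, Nostr, Twitch, WebChat, and Zalo. Only Hermes supports Mattermost, Email, SMS, Home Assistant, Webhooks, and the enterprise platforms DingTalk, Feishu, and WeCom.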

Model providers

OpenClaw uses a config-driven approach with fallback chains per agent. Any OpenAI-compatible API works, plus explicit handling for Anthropic and OpenRouter. Switching models requires editing config in multiple places — a known friction point.

Hermes has 18+ built-in providers including Google AI Studio, Nous Portal, GLM, Kimi, and MiniMax. Switch models mid-session with the hermes model command. Three internal API modes handle format differences transparently.
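A per-agent fallback chain, as OpenClaw configures it, amounts to trying models in order until one answers. OpenClaw's actual config format is not shown here, so this is a generic sketch; ProviderError and the call function are hypothetical.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a hypothetical provider call when a model is unavailable or rate limited."""


def complete_with_fallback(
    prompt: str, chain: list[str], call: Callable[[str, str], str]
) -> tuple[str, str]:
    """Try each model in order and return (model, reply) from the first that succeeds."""
    errors = []
    for model in chain:
        try:
            return model, call(model, prompt)
        except ProviderError as exc:
            # Record why this model failed, then move on to the next one.
            errors.append(f"{model}: {exc}")
    raise ProviderError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the reply keeps it visible when a fallback, rather than the primary, produced the answer.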

Skills

Both use the agentskills.io standard — SKILL.md files with YAML frontmatter, portable between frameworks.

OpenClaw loads skills from six sources with per-agent allowlists. Install from ClawHub (5,700+ community skills). Skills gate on OS, required binaries, and environment variables.

Hermes adds self-improving skills. Every 15 turns, the agent considers creating a skill from what it just learned. The skill_manage tool lets the agent create, update, and delete skills during use. Skills Hub has 643 skills (77 built-in and 505 community, with the remainder optional).

The self-improvement loop is Hermes' headline feature. Whether it produces useful skills depends on the model and how repetitive your workflows are.
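OpenClaw's skill gating (on OS, required binaries, and environment variables) can be sketched with the standard library alone. The requirement fields below are hypothetical and not OpenClaw's real frontmatter schema; the checks use sys.platform, shutil.which, and os.environ.

```python
import os
import shutil
import sys
from dataclasses import dataclass, field


@dataclass
class SkillRequirements:
    # Hypothetical fields; an empty list means "no requirement".
    platforms: list[str] = field(default_factory=list)  # sys.platform values, e.g. "linux", "darwin"
    bins: list[str] = field(default_factory=list)       # executables that must be on PATH
    env: list[str] = field(default_factory=list)        # environment variables that must be set


def skill_eligible(req: SkillRequirements) -> bool:
    """Return True only if every declared requirement is met on this machine."""
    if req.platforms and sys.platform not in req.platforms:
        return False
    if any(shutil.which(binary) is None for binary in req.bins):
        return False
    # An empty string counts as unset.
    return all(os.environ.get(name) for name in req.env)
```

Gating at load time keeps skills that cannot run here out of the agent's context, which is the same principle as pruning unused tools.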

Memory

Memory is where the philosophies diverge most.

OpenClaw uses flat-file memory with no hard limits — MEMORY.md for curated long-term memory, daily logs, and semantic vector search. Each agent has isolated memory; cross-agent search is opt-in. You curate what stays.

Hermes uses bounded, agent-managed memory: 2,200 characters for MEMORY.md, 1,375 for USER.md. The agent decides what to remember, consolidates entries, and prunes old ones. Memory freezes at session start to preserve prompt cache. All sessions live in SQLite with FTS5 — every past conversation is searchable.

Hermes also integrates with 8 external memory providers (Honcho, Mem0, Hindsight, plus vector databases). OpenClaw has none built in.

The trade-off: OpenClaw gives you control — you know exactly what's in memory. Hermes automates curation but may forget things you wanted kept.
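Hermes lets the agent itself decide what to keep and consolidate, so the following is only a stand-in that illustrates the trade-off: it enforces a character budget like the 2,200-character MEMORY.md limit by evicting the oldest entries first. The BoundedMemory class is hypothetical.

```python
class BoundedMemory:
    """Keep newline-joined memory entries under a character budget, evicting oldest first."""

    def __init__(self, limit: int = 2200):
        self.limit = limit
        self.entries: list[str] = []

    def size(self) -> int:
        # Characters in all entries plus one newline between each pair.
        return sum(len(e) for e in self.entries) + max(len(self.entries) - 1, 0)

    def add(self, entry: str) -> None:
        if len(entry) > self.limit:
            raise ValueError("entry alone exceeds the memory budget")
        self.entries.append(entry)
        while self.size() > self.limit:
            self.entries.pop(0)  # silently drops the oldest entry

    def render(self) -> str:
        return "\n".join(self.entries)
```

The pop(0) line is the "may forget things you wanted kept" risk in one statement; a curated, unbounded file avoids it at the cost of manual upkeep.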

Tools and sandboxing

OpenClaw ships core tools: file ops, shell, browser (Chrome DevTools), web fetch, messaging, cron, and subagent spawning. Each agent gets an explicit tool allowlist. Sandbox runtimes: Docker, SSH, OpenShell.

Hermes ships 47 tools across 40 toolsets — including 11 browser automation tools, image generation, TTS, Home Assistant control, and RL trajectory generation. Six terminal backends, with Modal and Daytona offering serverless environments that hibernate when idle.

OpenClaw is lean by design — add what you need via skills. Hermes ships everything.

For security, OpenClaw uses policy-based tool filtering with allowlists/denylists. Hermes has pattern-based dangerous command detection plus an optional LLM auto-approval system.
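Pattern-based dangerous command detection can be sketched with a handful of regular expressions that route matching commands to an approval step. Hermes's actual pattern list is not given here; the patterns below are illustrative and far from complete.

```python
import re

# Illustrative patterns only; a real deny list needs much broader coverage.
DANGEROUS_PATTERNS = [
    re.compile(r"\brm\s+-[a-z]*r[a-z]*f|\brm\s+-[a-z]*f[a-z]*r", re.IGNORECASE),  # rm -rf, -fr, -Rf
    re.compile(r"\bmkfs\b"),  # formatting a filesystem
    re.compile(r"\bdd\b.*\bof=/dev/"),  # raw writes to a block device
    re.compile(r"\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba|z)?sh\b"),  # piping a download into a shell
]


def flag_command(command: str) -> list[str]:
    """Return the patterns a shell command matches; an empty list means no match."""
    return [p.pattern for p in DANGEROUS_PATTERNS if p.search(command)]
```

The last pattern also flags curl ... | sh installers like the ones earlier in this feed, which is arguably the right call: piping a remote script into a shell is exactly the kind of command a human should approve.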

Sessions

OpenClaw sessions reset daily by default. Stored as JSON + JSONL transcripts. Four isolation modes from shared to per-account-channel-peer. Auto-cleanup after 30 days.

Hermes sessions persist in SQLite with lineage tracking — when context compresses, old turns get summarized and the new session chains to the old one. Profile isolation (hermes -p profilename) gives completely separate configs, memory, and sessions.

Multi-agent

OpenClaw was built for this. The Gateway routes to multiple agents with specificity-weighted matching. Each agent is fully isolated. Subagents spawn with configurable cleanup and timeout tiers (5 minutes to 2 hours). Run a coding agent for one user, research agents for cron jobs, and restricted agents for public channels — all from one Gateway.

Hermes focuses on single-agent depth. delegate_task spawns temporary subagents with separate iteration budgets. execute_code lets Python scripts call tools via RPC, collapsing pipelines into zero-context-cost turns. For multiple independent agents, use profiles — each a separate installation.

One Gateway serving different agents to different users? OpenClaw. One capable agent that delegates subtasks? Hermes.
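Specificity-weighted matching means the most specific binding that matches a message wins. OpenClaw's real matching rules and weights are not described here, so this is a hypothetical sketch; the channel, account, and peer fields borrow the per-account-channel-peer terms from the Sessions section, with None acting as a wildcard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Binding:
    agent: str
    channel: Optional[str] = None  # None matches any value
    account: Optional[str] = None
    peer: Optional[str] = None

    def matches(self, channel: str, account: str, peer: str) -> bool:
        pairs = ((self.channel, channel), (self.account, account), (self.peer, peer))
        return all(want is None or want == got for want, got in pairs)

    def specificity(self) -> int:
        # Assumed weights: peer beats account, which beats channel.
        return (
            (4 if self.peer is not None else 0)
            + (2 if self.account is not None else 0)
            + (1 if self.channel is not None else 0)
        )


def route_message(bindings: list[Binding], channel: str, account: str, peer: str) -> Optional[str]:
    """Return the agent of the most specific matching binding, or None if nothing matches."""
    candidates = [b for b in bindings if b.matches(channel, account, peer)]
    if not candidates:
        return None
    return max(candidates, key=Binding.specificity).agent
```

A catch-all binding with no fields set acts as the default agent, while a peer-specific binding overrides it for one person.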

Automation

OpenClaw runs cron through the Gateway — one-shot, fixed interval, or cron expression — with four execution styles. Jobs run as isolated subagent sessions with model overrides and deliver to any channel.

Hermes has cron with natural language scheduling ("every Monday at 9am"). Jobs are agent tasks with skill attachments. Inactivity-based timeouts track actual tool activity, not wall clock time.

Both support heartbeats for periodic self-checks.
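The difference between an inactivity timeout and a wall-clock timeout is where the timer resets. A minimal sketch of the inactivity version; InactivityTimer is hypothetical, and the clock is injectable so it can be tested without sleeping.

```python
import time
from typing import Callable


class InactivityTimer:
    """Expire only after idle_limit seconds with no tool activity, however long the job has run."""

    def __init__(self, idle_limit: float, clock: Callable[[], float] = time.monotonic):
        self.idle_limit = idle_limit
        self.clock = clock
        self.last_activity = clock()

    def record_activity(self) -> None:
        # Call this whenever the job invokes a tool.
        self.last_activity = self.clock()

    def expired(self) -> bool:
        return self.clock() - self.last_activity > self.idle_limit
```

A long job that keeps calling tools never expires, while a stuck job is cut off after idle_limit seconds of silence.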

CLI and developer experience

OpenClaw has three interfaces. openclaw tui connects to the Gateway from the terminal — session selection, history replay, deliver mode (forwards replies to messaging channels), and remote gateway support. openclaw dashboard opens a browser Control UI for chat, sessions, and channel config. The CLI handles onboarding, diagnostics (openclaw doctor), channel login, and skill management.

Hermes centers on a terminal TUI — hermes drops you in. Multiline editing, streaming output, mid-conversation interruption (redirect the agent while it's working), conversation history, and slash commands. The CLI covers setup, model switching, session search, and profile management.

Both have TUIs and CLIs. OpenClaw adds a browser UI. Hermes' TUI is richer with mid-conversation interruption and inline streaming.

Self-improvement

OpenClaw skills are static unless you edit them. Predictable by design.

Hermes learns from experience — creating and refining skills via skill_manage. The RL integration (trajectory generation, Atropos) records agent runs as training data.

This is the fundamental split. OpenClaw: you control the agent's capabilities. Hermes: the agent grows its own.

Pick OpenClaw for multi-agent routing, operational control, or gateway/agent separation. Pick Hermes for self-improving skills, serverless execution, or Python/ML extensibility.

References

OpenClaw GitHub — source, issues, releases
OpenClaw Documentation — setup guides, channel config, API reference
ClawHub — OpenClaw skills marketplace (5,700+ community skills)
OpenClaw npm — install stats, version history
Hermes Agent GitHub — source, issues, releases
Hermes Agent Documentation — user guide, developer guide, skills reference
Hermes Skills Hub — 643 skills (built-in, optional, community)
agentskills.io — open standard for portable agent skills
Hermes v0.8.0 Release Notes — MCP OAuth 2.1, Google AI Studio provider

I Built an AI Rental Management Platform for My Brother. Here's What Actually Happened.

Ariel Frischer — Wed, 08 Apr 2026 22:21:04 +0000

My brother manages rental properties on the side while working full-time as a real estate agent. I watched him spend 8-10 hours per week repeating the same ten screening questions over the phone -- income, pets, move-in date, credit, rental history -- between showings, after dinner, every single day. Nothing recorded anywhere.

71 percent of landlords rank tenant screening among their top three burdens. 45 percent of renters expect a reply within hours. Miss that window and the prospect moves on.

I didn't set out to build a product. I set out to solve my brother's problem.

Three Weeks of Real Data

Metric	Result
Prospect sessions	126
Contacts captured	71
Completed pre-screenings	55
Response time	Hours → under 30 seconds
Time saved	8-10 hrs/week

126 conversations my brother didn't have to initiate, manage, or follow up on. 71 contacts captured automatically as a byproduct of the conversation. 55 pre-qualified prospects with structured, comparable data.

The pre-screening flow is still in beta and has been improved since these early numbers.

How It Works

A prospect reaches out -- text, email, wherever. You send them your pre-screening link, or they find you on your public agent page where all your listings live in one place. Instead of waiting for a callback, they're talking to voice AI in seconds -- in their preferred language.

The AI walks them through screening conversationally, not with a robotic form. You get a clean email summary with every answer organized, contact info captured, and the full transcript saved to your dashboard. All while you're showing another property, driving, or sleeping.

What Rentalot Does

Voice AI pre-screening -- handles calls in four languages, transcribes everything in real time
AI management chat -- draft emails, look up contacts, check showings, manage properties from a web chat interface
CLI + MCP tools -- plug Claude, Gemini, Codex, or any AI agent directly into your rental data
Automated email follow-ups -- references the actual conversation, not generic templates
Google Calendar + Cal.com integration -- AI schedules showings directly from your availability
Public agent page -- one link shows all your listings, prospects self-select and submit
Organized records -- every contact, every answer, every interaction saved and searchable

Already have listings on Zillow or Apartments.com? Export your properties and bulk-import them into Rentalot using an AI agent with our open-source rentalot skill -- no manual data entry.

The Point

Leasing is repetitive. Same questions, same follow-ups, same scheduling back-and-forth -- day after day. Most leasing agents didn't get into real estate to spend their evenings on phone tag and data entry.

AI doesn't replace the human side of leasing. It removes the grind so you can get back to the parts you actually enjoy -- working with people, closing deals, building something.

Looking for Early Users

Rentalot is in alpha -- free trial, no credit card. I'm looking for my first 20 customers to iterate with. If you give me real feedback that shapes the product, I'll personally help with any technical issues and build the features you need.

If you manage rentals on the side and your evenings look like my brother's used to -- buried in screening calls, losing track of prospects, watching leads go cold overnight -- that's exactly the problem this was built to solve.

rentalot.ai

Stop Vibe Coding. Start Spec-Driven Development.

Ariel Frischer — Tue, 31 Mar 2026 23:02:46 +0000

If you type "add user auth" into Claude and ship whatever comes back, you're not engineering. You're contributing to AI slop. Stop it.

Andrej Karpathy coined "vibe coding" in early 2025 — type a prompt, accept the output, move on. It felt like a superpower. Then the data came in. Experienced developers using AI tools were 19% slower on real codebases¹, and AI co-authored PRs had 1.7x more major issues². Faster keystrokes, worse software.

The models keep improving — but better generation doesn't fix misaligned intent or the cascade of design decisions that follow. That's what autospec solves.

Vibe coding fails at alignment, not generation

Modern models can reason about architecture, decompose problems, and generate plausible code. None of that matters if they're solving the wrong problem.

When you type "add user auth," the model guesses: OAuth or email/password? Sessions or JWTs? Middleware placement? Error response format? You discover which guesses were wrong after the code exists. That's the misalignment problem. No amount of model intelligence fixes it because the model never had your intent in the first place.

Spec-driven development solves this. The workflow is spec → plan → tasks → implement. Instead of jumping straight to code, the first step generates a spec.yaml, a structured artifact with requirements, acceptance criteria, edge cases, and constraints, all shaped by your project's constitution.yaml. From there you iterate on the spec: edit it by hand, or use autospec clarify to open an interactive session where you and the AI refine scope, resolve ambiguities, and tighten requirements until the spec actually captures your intent. Only then do planning and implementation begin, carrying that alignment forward.

How autospec enforces this

autospec is a streamlined open-source spec-driven workflow that orchestrates Claude Code and/or OpenCode agents.

Constitution first. A constitution defines your project's non-negotiable rules — quality standards, architectural constraints, security requirements — with explicit priority levels and enforcement mechanisms. autospec infers initial principles from your codebase (Makefile targets, CI config, README) and you refine from there. Every command runs under these constraints.

# .autospec/constitution.yaml (trimmed)
preamble: |
  autospec is a Go CLI that orchestrates AI-driven specification workflows.
  These principles ensure code quality, maintainability, and reliable execution.

principles:
  - name: "Test-First Development"
    id: "PRIN-001"
    priority: "NON-NEGOTIABLE"
    description: "Tests written before implementation. Tests define behavior."
    enforcement:
      - mechanism: "CI pipeline"
        description: "Build fails if tests fail"
    exceptions:
      - "Prototype/spike code explicitly marked as such"

  - name: "Idiomatic Go"
    id: "PRIN-002"
    priority: "MUST"
    description: "Follow Go community conventions."
    enforcement:
      - mechanism: "Code review and linting"
        description: "golangci-lint + reviewer verification"

  - name: "Performance Standards"
    id: "PRIN-003"
    priority: "MUST"
    description: "Validation <10ms, config <100ms, user ops <1s."

  - name: "Actionable Errors"
    id: "PRIN-007"
    priority: "MUST"
    description: "Errors include context, expected vs actual, and fix hints."

sections:
  - name: "Go Idioms"
    content: |
      Error handling: Wrap with context using fmt.Errorf("doing X: %w", err).
      Table tests: Use map[string]struct{} with t.Run and t.Parallel().
      Functions: Keep under 40 lines, extract helpers as needed.
      Interfaces: Accept interfaces, return concrete types.

Every principle has an ID, a priority level (NON-NEGOTIABLE, MUST, SHOULD, MAY), enforcement mechanisms, and documented exceptions. The constitution also includes project-specific sections — coding idioms, naming conventions, quality gates — that get injected into every autospec session so the AI operates under the same constraints your team does.
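The sketch below shows how a tool could render constitution principles into a prompt preamble, strongest rule first. It is written in TypeScript and is an illustration, not autospec's implementation (autospec is a Go binary). The `Principle` type, `PRIORITY_RANK`, and `renderConstitution` are hypothetical names. The real constitution schema also carries enforcement mechanisms and sections, which are omitted here.

```typescript
// Hypothetical types mirroring the constitution.yaml fields shown above.
type Priority = "NON-NEGOTIABLE" | "MUST" | "SHOULD" | "MAY";

interface Principle {
  id: string;
  name: string;
  priority: Priority;
  description: string;
  exceptions?: string[];
}

// Lower rank = stronger rule, so it appears earlier in the preamble.
const PRIORITY_RANK: Record<Priority, number> = {
  "NON-NEGOTIABLE": 0,
  MUST: 1,
  SHOULD: 2,
  MAY: 3,
};

// Renders principles as one plain-text rule per line, strongest first.
function renderConstitution(principles: Principle[]): string {
  const sorted = principles
    .slice()
    .sort((a, b) => PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority]);
  return sorted
    .map((p) => {
      const rule = `[${p.priority}] ${p.id} ${p.name}: ${p.description}`;
      const exceptions = p.exceptions ?? [];
      return exceptions.length > 0
        ? `${rule} (exceptions: ${exceptions.join("; ")})`
        : rule;
    })
    .join("\n");
}

const preamble = renderConstitution([
  {
    id: "PRIN-002",
    name: "Idiomatic Go",
    priority: "MUST",
    description: "Follow Go community conventions.",
  },
  {
    id: "PRIN-001",
    name: "Test-First Development",
    priority: "NON-NEGOTIABLE",
    description: "Tests written before implementation.",
    exceptions: ["Prototype/spike code explicitly marked as such"],
  },
]);
console.log(preamble);
```

Sorting by an explicit rank keeps the NON-NEGOTIABLE rules at the top of the preamble, however the YAML happens to be ordered.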

Structured stages. The core workflow runs spec → plan → tasks → implement. From a plain-English feature description, autospec generates a spec.yaml, then a plan.yaml, then tasks.yaml. Code only gets written after all three exist and are valid.

Here's what a real spec.yaml looks like for "add user authentication":

# specs/001-user-auth/spec.yaml (trimmed)
feature:
  branch: "001-user-auth"
  status: "Draft"
  input: "Add user authentication to the application"

user_stories:
  - id: "US-001"
    title: "User can log in with email and password"
    priority: "P1"
    as_a: "registered user"
    i_want: "to log in with my email and password"
    so_that: "I can access my account"
    acceptance_scenarios:
      - given: "I have a registered account"
        when: "I submit valid credentials"
        then: "I am logged in and redirected to dashboard"

requirements:
  functional:
    - id: "FR-001"
      description: "MUST support email/password authentication"
      testable: true
      acceptance_criteria: "Users can log in with valid email and password"
    - id: "FR-002"
      description: "MUST hash passwords before storage"
      testable: true
      acceptance_criteria: "Passwords are stored using bcrypt with cost factor 12"
  non_functional:
    - id: "NFR-002"
      category: "security"
      description: "Must rate limit login attempts"
      measurable_target: "Max 5 attempts per minute per IP"

edge_cases:
  - scenario: "User enters email with different case"
    expected_behavior: "Email comparison is case-insensitive"
  - scenario: "Session token expires during active use"
    expected_behavior: "User is prompted to log in again"

out_of_scope:
  - "OAuth/social login integration"
  - "Two-factor authentication"

Every assumption, constraint, edge case, and requirement is explicit YAML — not markdown you eyeball, but structured data you can validate programmatically. Schema validation catches missing fields and invalid references before the next stage runs. When validation fails, autospec feeds specific errors back to the AI and it self-corrects. No manual prompt editing.
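This is a minimal sketch of the kind of check that structured YAML makes possible. It assumes the file has already been parsed into a plain object. The types, the error wording, and the rules (for example, requiring `acceptance_criteria` when `testable` is true) are illustrative assumptions, not autospec's actual schema.

```typescript
// Hypothetical, trimmed shape of spec.yaml after YAML parsing.
interface FunctionalRequirement {
  id?: string;
  description?: string;
  testable?: boolean;
  acceptance_criteria?: string;
}

interface Spec {
  feature?: { branch?: string; status?: string };
  requirements?: { functional?: FunctionalRequirement[] };
}

// Returns specific, fixable error messages; an empty array means the spec passed.
function validateSpec(spec: Spec): string[] {
  const errors: string[] = [];
  if (!spec.feature?.branch) {
    errors.push("feature.branch: required field is missing");
  }
  const functional = spec.requirements?.functional ?? [];
  if (functional.length === 0) {
    errors.push("requirements.functional: at least one requirement is required");
  }
  const seen = new Set<string>();
  functional.forEach((req, i) => {
    const path = `requirements.functional[${i}]`;
    if (!req.id || !/^FR-\d{3}$/.test(req.id)) {
      errors.push(`${path}.id: expected format FR-NNN, got "${req.id ?? ""}"`);
    } else if (seen.has(req.id)) {
      errors.push(`${path}.id: duplicate id ${req.id}`);
    } else {
      seen.add(req.id);
    }
    if (req.testable && !req.acceptance_criteria) {
      errors.push(`${path}.acceptance_criteria: required when testable is true`);
    }
  });
  return errors;
}

const errors = validateSpec({
  feature: { branch: "001-user-auth", status: "Draft" },
  requirements: {
    functional: [
      { id: "FR-001", testable: true, acceptance_criteria: "Users can log in" },
      { id: "FR-001", testable: true }, // duplicate id, missing criteria
    ],
  },
});
console.log(errors); // two errors, both pointing at requirements.functional[1]
```

Each message names the exact field path. That precision is what makes the errors useful as feedback to a model.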

Per-phase isolation. Tasks in tasks.yaml are grouped into phases — logical units like "setup," "core logic," "tests." Each phase runs in a fresh context window, so the agent isn't dragging 10,000 tokens of prior work into every call. We estimate a 38-task feature drops from ~$257 to ~$42 (83% cost reduction) with this approach, and it prevents context degradation — phase 4 executes with the same clarity as phase 1. As each task completes, autospec updates its status directly in tasks.yaml — progress is always visible and resumable. If a session gets interrupted, run autospec implement and it picks up exactly where you left off.
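Per-task status is what makes a run resumable. The rough sketch below groups tasks by phase and hands the first unfinished phase to a fresh session. The `Task` shape, the status values, and `nextPhase` are hypothetical, and the real tasks.yaml fields may differ.

```typescript
// Hypothetical, trimmed shape of tasks.yaml after parsing.
type TaskStatus = "Pending" | "InProgress" | "Completed";

interface Task {
  id: string;
  phase: number;
  status: TaskStatus;
}

// Returns the first phase that still has unfinished tasks, plus those tasks.
// Completed phases are skipped, so an interrupted run picks up where it stopped.
function nextPhase(tasks: Task[]): { phase: number; remaining: string[] } | null {
  const phases = Array.from(new Set(tasks.map((t) => t.phase))).sort((a, b) => a - b);
  for (const phase of phases) {
    const remaining = tasks
      .filter((t) => t.phase === phase && t.status !== "Completed")
      .map((t) => t.id);
    if (remaining.length > 0) return { phase, remaining };
  }
  return null; // everything is done
}

const tasks: Task[] = [
  { id: "T001", phase: 1, status: "Completed" },
  { id: "T002", phase: 1, status: "Completed" },
  { id: "T003", phase: 2, status: "Completed" },
  { id: "T004", phase: 2, status: "InProgress" },
  { id: "T005", phase: 3, status: "Pending" },
];

// Only this phase's tasks would go into the new session's context.
console.log(nextPhase(tasks)); // phase 2, with T004 remaining
```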

Non-interactive by default. No back-and-forth chatting, no manual approvals mid-session. The AI gets instructions and builds. You review artifacts between stages, not during. Interactive mode (autospec clarify) exists for when you actually want a conversation to refine the spec.

Why autospec over GitHub Spec Kit?

autospec was inspired by GitHub Spec Kit and stays true to its core workflow: spec → plan → tasks → implement. That flow is the right idea. But Spec Kit's execution has real gaps that autospec closes:

	GitHub Spec Kit	autospec
Output format	Markdown	YAML — machine-readable, schema-validated
Validation	Manual review	Automatic with retry logic on failure
Context efficiency	Full prompt each time	Per-phase/task session isolation (80%+ cost savings)
Phase orchestration	Manual	Automated with dependency ordering
Status tracking	Manual	Auto-updates `spec.yaml` and `tasks.yaml` as work progresses
Implementation	Shell scripts	Go binary — type-safe, single install, cross-platform

The biggest difference is validation. Because every artifact is structured YAML, autospec can programmatically validate each stage before the next one runs. Schema validation catches missing fields, invalid references, and structural errors — things you can't check against markdown. When validation fails, autospec feeds the specific errors back into the next AI call and the model self-corrects. No manual prompt editing.
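The retry loop can be sketched as follows. It assumes an async `generate` step, standing in for the AI call, and a synchronous `validate` step that returns error strings. Both are hypothetical placeholders, not autospec's API. The `maxAttempts` cap is an assumption added so that a model which never converges cannot loop forever.

```typescript
// Hypothetical stand-ins: `Generate` for an AI call that produces an artifact
// (given the previous round's errors), `Validate` for a schema check.
type Generate = (feedback: string[]) => Promise<string>;
type Validate = (artifact: string) => string[];

// Generate -> validate -> retry with the specific errors, up to maxAttempts.
async function generateValidated(
  generate: Generate,
  validate: Validate,
  maxAttempts = 3,
): Promise<string> {
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const artifact = await generate(feedback);
    feedback = validate(artifact);
    if (feedback.length === 0) return artifact;
  }
  throw new Error(`still invalid after ${maxAttempts} attempts: ${feedback.join("; ")}`);
}

// Fake generator: omits the branch until it is told about the error.
const fakeGenerate: Generate = async (feedback) =>
  feedback.length === 0 ? "status: Draft" : "branch: 001-user-auth\nstatus: Draft";
const requireBranch: Validate = (artifact) =>
  artifact.includes("branch:") ? [] : ["feature.branch: required field is missing"];

generateValidated(fakeGenerate, requireBranch).then((spec) => console.log(spec));
```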

The other difference is streamlined developer productivity. autospec runs the agent in non-interactive mode by default — no waiting on chat responses, no accepting edits one by one, no answering a stream of clarifying questions. It just generates what's needed at every stage. You review artifacts between stages, not during. When you do want a conversation — to refine scope or resolve ambiguities — every stage is also available as a Claude Code slash command (/autospec.clarify, /autospec.specify, etc.) for interactive use.

Vibe coding has no feedback loop except you re-reading output and rewording prompts. Spec-driven development has automated quality gates at every stage.

When specs are worth it

Not everything needs a spec. Here's the quick decision test:

Skip autospec — the task has zero design decisions and you can finish in under 30 minutes. Fix a typo, bump a dependency, add a nil check, rename a variable. Just do it.

Use autospec — the task involves 3+ design decisions and touches 3+ files. Adding rate limiting to an API? You're choosing between token bucket and sliding window, deciding on storage, bypass rules, error formats. A webhook delivery system? Signature scheme, retry policy, timeout handling. These are the tasks where vibe coding silently makes the wrong choice on decision #3 and you find out two hours later. Autospec surfaces all of them in the spec before any code exists.

Split first — the task would take more than two days or bundles 3+ independent features. "Add OAuth with Google, GitHub, SAML, and LDAP" is four specs, not one. Split by feature slice, layer, or user journey, then run autospec on each part.
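The three rules can be written down as a small function. This sketch rests on a few assumptions:

- The `TaskEstimate` fields are things you estimate yourself.
- "Two days" is taken to mean 16 working hours.
- Cases the rules don't cover return "judgment call" rather than a guess.

```typescript
// Hypothetical estimates you would make before starting a task.
interface TaskEstimate {
  designDecisions: number;
  filesTouched: number;
  estimatedHours: number;
  independentFeatures: number;
}

type Approach = "split first" | "use autospec" | "skip autospec" | "judgment call";

// Assumption: "two days" means two 8-hour working days.
const TWO_DAYS_HOURS = 16;

function chooseApproach(t: TaskEstimate): Approach {
  // Oversized or bundled work gets split before anything else.
  if (t.estimatedHours > TWO_DAYS_HOURS || t.independentFeatures >= 3) {
    return "split first";
  }
  if (t.designDecisions >= 3 && t.filesTouched >= 3) return "use autospec";
  if (t.designDecisions === 0 && t.estimatedHours < 0.5) return "skip autospec";
  // The rules leave a middle ground (e.g. 1-2 decisions) to your judgment.
  return "judgment call";
}

// Typo fix -> "skip autospec"
console.log(chooseApproach({ designDecisions: 0, filesTouched: 1, estimatedHours: 0.1, independentFeatures: 1 }));
// API rate limiting -> "use autospec"
console.log(chooseApproach({ designDecisions: 4, filesTouched: 5, estimatedHours: 6, independentFeatures: 1 }));
// OAuth with Google, GitHub, SAML, and LDAP -> "split first"
console.log(chooseApproach({ designDecisions: 8, filesTouched: 20, estimatedHours: 40, independentFeatures: 4 }));
```

The split check runs first on purpose: deciding whether to spec a task only makes sense once the task is a single feature slice.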

AWS warned in 2026 that review capacity — not developer output — is now the bottleneck in delivery. With 46% of new code AI-generated, pipelines weren't designed for this volume. Specs give reviewers something to review besides a thousand-line diff. They read the spec, verify the intent, then check that the code matches. That's a fundamentally different (and faster) review loop.

Get started

GitHub: github.com/ariel-frischer/autospec

curl -fsSL https://raw.githubusercontent.com/ariel-frischer/autospec/main/install.sh | sh
cd your-project/                       # any git repo
autospec init                          # project setup: agent, permissions, constitution
autospec run -s "your feature here"   # generate spec → review it
autospec clarify                       # optional: interactive refinement with Claude
autospec run -pti                      # plan → tasks → implement

The first run takes a few minutes longer than vibe coding. Every run after saves you hours of debugging and rework.

Stop slop coding. Start using autospec.

autospec on GitHub

¹ METR, "Early 2025 AI & Experienced Open-Source Dev Study"
² CodeRabbit, "State of AI vs Human Code Generation Report"

Making Your SaaS AI-Agent Ready: A Practical Guide

Ariel Frischer — Thu, 12 Feb 2026 04:50:08 +0000

AI agents are becoming the primary interface between developers and APIs. Tools like Claude Code, OpenClaw, and MCP clients don't read your marketing site—they consume your API spec, documentation structure, and machine-readable metadata.

This guide covers the layered approach to making any SaaS API agent-consumable, from discovery to execution.

Overview: All 7 Phases

Phase	Focus	Impact
1	`llms.txt` — Discovery	High
2	OpenAPI spec enhancements	Very High
3	OpenClaw skill	High
4	MCP server (local)	High
5	TypeScript SDK	Medium
6	Remote MCP server	High
7	JSON-LD & polish	Medium

Start with Phases 1–2 for the highest ROI. Add an MCP server when you're ready for Claude Code and Cursor integration.

The AI-Agent Stack

The ecosystem is converging on a standard layered architecture:

llms.txt              → Discovery ("what does this product do?")
OpenAPI spec          → Foundation (schema, types, descriptions)
  ├── MCP server      → Agent tool execution
  ├── TypeScript SDK  → Typed client for developers
  ├── OpenClaw skill  → Natural language API guide
  └── agents.json     → Multi-step flow orchestration
JSON-LD               → AI search visibility

The OpenAPI spec is the keystone. Everything else either generates from it or supplements it.

Phase 1: Discovery Layer

Create `llms.txt`

Serve /llms.txt from your public directory. This is a curated index of what your SaaS does, what the API offers, and links to key documentation.

# YourSaaS

> One-line description of what your product does.

Longer description covering key value props and use cases.

## API Reference

- [Authentication](https://yoursaas.com/docs/auth): API keys, OAuth, rate limits
- [Resource A](https://yoursaas.com/docs/resource-a): What it does
- [Resource B](https://yoursaas.com/docs/resource-b): What it does

## Guides

- [Getting Started](https://yoursaas.com/docs/quickstart)
- [OpenAPI Spec](https://yoursaas.com/api/openapi.json)

Why it matters: AI agents and coding tools increasingly look for llms.txt when they encounter a new product or API. It's on its way to becoming the robots.txt for AI agents.

Create `llms-full.txt` (Optional)

An expanded version with full API reference content inlined—every endpoint, request/response schema, and example. Generate this from your OpenAPI spec at build time.
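Here is a minimal build-step sketch. It assumes your spec is already loaded as a JavaScript object, for example by reading openapi.json in a build script. It renders only titles, operation IDs, summaries, and descriptions; a real llms-full.txt generator would also resolve `$ref`s and inline parameters, schemas, and examples. `renderLlmsFull` is a hypothetical name.

```typescript
// Minimal subset of an OpenAPI 3 document; real specs also have parameters,
// schemas, examples, and $ref pointers that a full generator must resolve.
interface Operation {
  operationId?: string;
  summary?: string;
  description?: string;
}
type Method = "get" | "post" | "put" | "patch" | "delete";
type PathItem = Partial<Record<Method, Operation>>;
interface OpenApiDoc {
  info: { title: string; description?: string };
  paths: Record<string, PathItem>;
}

const METHODS: Method[] = ["get", "post", "put", "patch", "delete"];

// Renders every operation as a markdown section, llms-full.txt style.
function renderLlmsFull(doc: OpenApiDoc): string {
  const out: string[] = [`# ${doc.info.title}`, ""];
  if (doc.info.description) out.push(`> ${doc.info.description}`, "");
  for (const [path, item] of Object.entries(doc.paths)) {
    for (const method of METHODS) {
      const op = item[method];
      if (!op) continue;
      out.push(`## ${method.toUpperCase()} ${path}`);
      if (op.operationId) out.push(`operationId: ${op.operationId}`);
      if (op.summary) out.push("", op.summary);
      if (op.description) out.push("", op.description);
      out.push("");
    }
  }
  return out.join("\n");
}

const exampleDoc: OpenApiDoc = {
  info: { title: "YourSaaS", description: "One-line description of what your product does." },
  paths: {
    "/users/{id}": {
      get: {
        operationId: "getUser",
        summary: "Retrieve a user by ID",
        description: "Use when you need to verify user details before performing actions.",
      },
    },
  },
};
console.log(renderLlmsFull(exampleDoc));
```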

Phase 2: Enhance Your OpenAPI Spec

Your OpenAPI spec is the foundation. Optimize it for LLM consumption:

Agent-Oriented Descriptions

Write descriptions that tell an agent when to use an endpoint:

# Bad
summary: Get user

# Good
summary: Retrieve a user by ID
description: >
  Use to retrieve detailed information about a specific user by their ID.
  Returns profile data, permissions, and account status.
  Use when you need to verify user details before performing actions.

Required Enhancements

Element	Why It Matters
`operationId`	Clean, camelCase names become MCP tool names (`createUser`, not `postApiV1Users`)
Realistic examples	Agents generate better requests when they see real values
Documented enums	Prevents invalid values in generated requests
Side effects	Tell agents what changes ("Sends welcome email", "Charges credit card")
Rate limits	Per-endpoint documentation prevents hammering
Read-only vs write	Helps agents understand safe exploration vs mutations
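Some of these checks are easy to automate. Here is a minimal lint sketch over a simplified `paths` object. It assumes every key under a path is an HTTP method, which real specs don't guarantee, since path items can also hold `parameters`. The version-segment heuristic for spotting path-derived IDs like `postApiV1Users` is an assumption, not a standard rule.

```typescript
// Minimal subset of an OpenAPI operation; real specs carry much more.
interface Operation {
  operationId?: string;
  description?: string;
}
type Paths = Record<string, Record<string, Operation>>;

const CAMEL_CASE = /^[a-z][a-zA-Z0-9]*$/;
// Heuristic (an assumption): a version segment such as "V1" followed by a
// capital letter suggests the ID was generated from the URL path.
const PATH_DERIVED = /V\d+[A-Z]/;

function lintForAgents(paths: Paths): string[] {
  const problems: string[] = [];
  for (const [path, item] of Object.entries(paths)) {
    for (const [method, op] of Object.entries(item)) {
      const where = `${method.toUpperCase()} ${path}`;
      const id = op.operationId;
      if (!id) {
        problems.push(`${where}: missing operationId`);
      } else if (!CAMEL_CASE.test(id)) {
        problems.push(`${where}: operationId "${id}" is not camelCase`);
      } else if (PATH_DERIVED.test(id)) {
        problems.push(`${where}: operationId "${id}" looks derived from the URL path`);
      }
      if (!op.description) {
        problems.push(`${where}: missing description (say when to use it and what it changes)`);
      }
    }
  }
  return problems;
}

const problems = lintForAgents({
  "/api/v1/users": {
    post: { operationId: "postApiV1Users", description: "Use to create a user. Sends a welcome email." },
  },
  "/api/v1/users/{id}": {
    get: { operationId: "get_user" },
  },
});
console.log(problems.join("\n"));
```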

Document Rate Limits in Spec

Add response headers and info section documentation:

headers:
  X-RateLimit-Limit:
    description: Request limit per minute for your tier
    schema:
      type: integer
      example: 100
  X-RateLimit-Remaining:
    description: Requests remaining in current window
    schema:
      type: integer
      example: 87
  Retry-After:
    description: Seconds to wait before retry (on 429)
    schema:
      type: integer
      example: 45
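On the consuming side, an agent or SDK can turn these headers into a wait time. This minimal sketch makes two assumptions. First, `Retry-After` is sent in seconds, as documented above; HTTP also permits a date form, which this does not handle. Second, it falls back to a one-minute wait when the header is missing. `backoffMs` is a hypothetical helper.

```typescript
// How long (ms) to wait before the next request, based on the headers above.
// Assumptions: Retry-After is in seconds, and the rate window is one minute.
const DEFAULT_WAIT_MS = 60000;

function backoffMs(status: number, headers: Record<string, string>): number {
  // HTTP header names are case-insensitive, so normalize keys first.
  const h: Record<string, string> = {};
  for (const [name, value] of Object.entries(headers)) h[name.toLowerCase()] = value;

  if (status === 429) {
    const seconds = Number(h["retry-after"]);
    return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : DEFAULT_WAIT_MS;
  }
  // Quota exhausted but not yet rejected: pause proactively instead of hammering.
  if (h["x-ratelimit-remaining"] === "0") return DEFAULT_WAIT_MS;
  return 0;
}

console.log(backoffMs(429, { "Retry-After": "45" })); // 45000
console.log(backoffMs(200, { "X-RateLimit-Remaining": "87" })); // 0
```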

Phase 3: OpenClaw Skill

OpenClaw agents can use your API directly, but a skill makes it natural-language accessible. Create a SKILL.md that teaches agents how to use your API.

Structure:

Authentication setup
Rate limits per tier
All endpoints with request/response examples
Common workflows ("list active users", "create and send invoice")
Error handling guidance

Distribution: Publish to ClawdHub for discoverability.

Phase 4: MCP Server

The Model Context Protocol (MCP) is becoming the standard for agent-tool integration. Claude Code, Claude Desktop, Cursor, and OpenClaw all support MCP servers.

Build a Local MCP Server

Start with a stdio server published as an npm package:

npx @yoursaas/mcp-server --api-key=ys_xxxxx

Tool inventory: Map your API endpoints 1:1 to MCP tools:

Resource	Example Tools
Users	`list_users`, `get_user`, `create_user`, `update_user`
Projects	`list_projects`, `create_project`, `delete_project`
Webhooks	`list_webhooks`, `create_webhook`, `test_webhook`
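The `operationId` table earlier recommends camelCase IDs like `createUser`, while this inventory uses snake_case tool names, so a generator needs a small conversion step. The minimal sketch below assumes you derive tools from your OpenAPI `paths`. `toToolName`, `toolsFromPaths`, and the `ToolSpec` shape are hypothetical, and registering the tools with an MCP SDK is not shown.

```typescript
// Converts a camelCase operationId ("createUser") into a snake_case tool name
// ("create_user"), including acronyms ("getHTTPStatus" -> "get_http_status").
function toToolName(operationId: string): string {
  return operationId
    .replace(/([a-z0-9])([A-Z])/g, "$1_$2")
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1_$2")
    .toLowerCase();
}

// Hypothetical intermediate shape; an MCP server would register one tool per entry.
interface ToolSpec {
  name: string;
  description: string;
  method: string;
  path: string;
}

type Paths = Record<string, Record<string, { operationId: string; description?: string }>>;

// One tool per operation: the 1:1 mapping described above.
function toolsFromPaths(paths: Paths): ToolSpec[] {
  const tools: ToolSpec[] = [];
  for (const [path, item] of Object.entries(paths)) {
    for (const [method, op] of Object.entries(item)) {
      tools.push({
        name: toToolName(op.operationId),
        description: op.description ?? "",
        method: method.toUpperCase(),
        path,
      });
    }
  }
  return tools;
}

const tools = toolsFromPaths({
  "/users": {
    get: { operationId: "listUsers", description: "Use to list users." },
    post: { operationId: "createUser", description: "Use to create a user. Sends a welcome email." },
  },
  "/webhooks/{id}/test": {
    post: { operationId: "testWebhook" },
  },
});
console.log(tools.map((t) => `${t.name}: ${t.method} ${t.path}`).join("\n"));
```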

Resources: Expose documentation as readable resources:

docs://api-reference — Full API docs
docs://rate-limits — Tier limits and usage
schema://enums — Valid enum values

Document MCP Setup

Add an example .mcp.json to your docs:

{
  "mcpServers": {
    "yoursaas": {
      "command": "npx",
      "args": ["-y", "@yoursaas/mcp-server"],
      "env": {
        "YOURSAAS_API_KEY": "${YOURSAAS_API_KEY}"
      }
    }
  }
}

Register your MCP server in registries:

Official MCP Registry
mcpservers.org
Smithery

Phase 5: TypeScript SDK

Generate TypeScript types from your OpenAPI spec using openapi-typescript. Wrap with openapi-fetch for a typed client:

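import createClient from "openapi-fetch";
// Types generated from your OpenAPI spec with openapi-typescript, published as the SDK package
import type { paths } from "@yoursaas/sdk";

// Assumes a Node.js runtime with the key in an environment variable
const apiKey = process.env.YOURSAAS_API_KEY;

const client = createClient<paths>({
  baseUrl: "https://yoursaas.com/api/v1",
  headers: { Authorization: `Bearer ${apiKey}` },
});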

const { data } = await client.GET("/users/{id}", {
  params: { path: { id: "abc123" } },
});

Phase 6: Remote MCP Server (Advanced)

For zero-install experience, host a remote MCP server at https://mcp.yoursaas.com/mcp. Users add a URL and authenticate via OAuth.

Hosting options:

Cloudflare Worker — Edge deployment, handles sessions, cheap
Next.js API route — Simpler, but watch cold starts
Separate Vercel project — Dedicated subdomain

Offer both:

Remote (zero install, OAuth)
Local (API key, air-gapped environments)

Phase 7: Polish & Future-Proofing

JSON-LD Structured Data

Add schema.org markup for AI search visibility:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "YourSaaS",
  "applicationCategory": "BusinessApplication",
  "offers": {
    "@type": "Offer",
    "price": "29.00",
    "priceCurrency": "USD"
  }
}
</script>

"Build with AI" Documentation Page

Create a dedicated page linking to:

MCP server setup
OpenClaw skill
TypeScript SDK
OpenAPI spec
llms.txt

Follow Stripe's lead: docs.stripe.com/building-with-llms

Key Takeaways

Start with llms.txt and OpenAPI — highest ROI for effort
Write agent-oriented descriptions — "Use to..." not just "Get..."
MCP server is table stakes — Claude Code and Cursor users expect it
Document everything — Agents can't guess what your API does
Stay current — The agent ecosystem moves fast; watch for new standards

Code Coverage Best Practices for Agentic Development

Ariel Frischer — Sun, 11 Jan 2026 18:15:55 +0000

The Core Problem

When AI agents generate code, two opposing forces create tension:

High coverage slows you down - More tests mean longer iteration cycles and higher token costs
Low coverage risks regressions - Agents lack institutional memory; they don't know if they broke something previously built

Key Insight: Tests Are Institutional Memory

Human developers remember past bugs and context. Agents don't. Tests become the primary mechanism for regression detection, not a secondary safety net.

This shifts the value proposition: testing agent-generated code is MORE important than testing human-written code, because it's the only way agents learn what to preserve.

The Middle Ground: Tiered Coverage Strategy

Risk Tier	Coverage Target	What Goes Here
Critical	85-95%	Business logic, security, data integrity, money flows, API contracts
Core	70-80%	Main workflows, state management, integrations
Low-risk	50-60%	Getters/setters, DTOs, glue code, logging
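Used as a diagnostic, the table translates into a simple check over a coverage report. This minimal sketch assumes you maintain the module-to-tier mapping yourself and take the lower end of each range as the floor. `belowTarget` and the module names are hypothetical.

```typescript
type Tier = "critical" | "core" | "low-risk";

// Floors taken from the lower end of each range in the table above.
const MIN_COVERAGE: Record<Tier, number> = { critical: 85, core: 70, "low-risk": 50 };
const TIER_ORDER: Record<Tier, number> = { critical: 0, core: 1, "low-risk": 2 };

interface ModuleCoverage {
  module: string;
  tier: Tier; // assigned by you; coverage tools do not know risk
  linePct: number; // from your coverage report
}

// Lists modules below their tier's floor: critical ones first, then by shortfall.
function belowTarget(modules: ModuleCoverage[]): string[] {
  const gap = (m: ModuleCoverage) => m.linePct - MIN_COVERAGE[m.tier];
  return modules
    .filter((m) => gap(m) < 0)
    .sort((a, b) => TIER_ORDER[a.tier] - TIER_ORDER[b.tier] || gap(a) - gap(b))
    .map((m) => `${m.module} (${m.tier}): ${m.linePct}% < ${MIN_COVERAGE[m.tier]}%`);
}

const report = belowTarget([
  { module: "billing", tier: "critical", linePct: 78 },
  { module: "auth", tier: "critical", linePct: 91 },
  { module: "workflows", tier: "core", linePct: 72 },
  { module: "dto", tier: "low-risk", linePct: 40 },
]);
console.log(report.join("\n"));
```

The output is a list of places to investigate, not a gate. A module can sit below its floor for a legitimate reason, such as recently deleted dead branches.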

Principles

1. Intention-Based > High Percentage

Every test should answer: "What regression would this catch if it failed?"

Tests that can't name a concrete regression are coverage-chasing noise. Focus on:

Core business rules
Edge cases that have caused bugs before
Integration points between components
Security and data integrity boundaries

2. The Real Cost Is Churn, Not Writing Tests

Implementation-coupled tests that break on refactors cause agent thrashing. This is far more expensive than writing fewer, behavior-focused tests.

Behavior-focused tests verify observable outputs and side effects rather than implementation details. They test WHAT the code does, not HOW it does it - so they survive refactoring.

Behavior-Focused	Implementation-Coupled
Tests inputs → outputs	Tests internal method calls
Breaks when behavior changes	Breaks when code is refactored
Mocks external dependencies only	Mocks internal collaborators
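The left-hand column looks like this in practice. The checks below exercise a hypothetical `orderTotalCents` function only through its inputs and outputs. A tiny `check` helper stands in for a real test runner so the sketch runs on its own.

```typescript
// A hypothetical unit under test: computes an order total in cents.
function orderTotalCents(items: { priceCents: number; qty: number }[], discountPct = 0): number {
  const subtotal = items.reduce((sum, i) => sum + i.priceCents * i.qty, 0);
  return Math.round(subtotal * (1 - discountPct / 100));
}

// Minimal assertion helper in place of a test framework.
function check(name: string, actual: number, expected: number): void {
  if (actual !== expected) throw new Error(`${name}: expected ${expected}, got ${actual}`);
}

// Behavior-focused: only inputs -> outputs. These survive refactors such as
// replacing reduce with a loop or extracting a helper, because they never
// inspect how the total is computed.
check("sums price x quantity", orderTotalCents([{ priceCents: 500, qty: 2 }, { priceCents: 250, qty: 1 }]), 1250);
check("applies percentage discount", orderTotalCents([{ priceCents: 1000, qty: 1 }], 10), 900);
check("empty cart is zero", orderTotalCents([]), 0);
console.log("all behavior checks passed");
```

An implementation-coupled version would instead assert that some internal helper was called. It would break as soon as that helper is inlined, even though every total stays correct.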

3. Every Bug Becomes a Regression Test

When something breaks, the fix includes a test that would have caught it. This encodes institutional memory into the test suite.

4. Coverage as Diagnostic, Not KPI

Use coverage reports to find untested critical paths, not to hit arbitrary numbers.

80% with meaningful tests beats 95% with brittle ones
Low coverage in a critical module is a signal to investigate
Improving code (deleting dead branches) can legitimately lower coverage while increasing safety

5. Fast Feedback Loop Is Non-Negotiable

If your test suite runs in under 3 minutes, run it on every agent iteration. The cost is worth the regression protection.

For longer suites:

Run focused tests during iteration
Run full suite before commit/merge
Consider test parallelization
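The policy above fits in a few lines. This sketch assumes you know the suite's typical wall-clock time. `planTests` and the 180-second budget (the "under 3 minutes" rule) are illustrative, not taken from any particular tool.

```typescript
type Stage = "iteration" | "pre-merge";

interface TestPlan {
  scope: "full" | "focused";
  parallel: boolean;
}

// "Under 3 minutes" from the text above.
const FULL_SUITE_BUDGET_SECONDS = 180;

// Decides what to run, given the suite's typical wall-clock time.
function planTests(suiteSeconds: number, stage: Stage): TestPlan {
  const slow = suiteSeconds >= FULL_SUITE_BUDGET_SECONDS;
  if (stage === "pre-merge") {
    // Always run the full suite before commit/merge; parallelize slow suites.
    return { scope: "full", parallel: slow };
  }
  // Fast suites run in full on every agent iteration; slow ones run focused tests.
  return slow ? { scope: "focused", parallel: false } : { scope: "full", parallel: false };
}

console.log(planTests(90, "iteration")); // full suite every iteration
console.log(planTests(600, "iteration")); // focused tests while iterating
console.log(planTests(600, "pre-merge")); // full suite, parallelized
```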

Diminishing Returns After 80%

Industry guidance commonly reports diminishing returns after 80-90% coverage:

Aspect	80-90% Coverage	100% Coverage
Bug-finding value	High early, then flattens	Minimal for final 10-20%
Engineering effort	Proportional to benefit	Disproportionately high
Test complexity	Moderate, behavior-focused	Often brittle, over-specified
Design impact	Encourages testable code	Can incentivize design compromises

The last 10-20% requires:

Intricate test setups
Heavy mocking
Tests tightly coupled to implementation
Covering defensive/error paths rarely hit in practice

Agent-Specific Practices

Minimum Rule

ALL agent-generated code must have at least one intention-revealing test before merge.

Not necessarily high coverage, but something that would fail if the core behavior broke.

Have Agents Write Tests Too

Since agents can generate tests, the marginal cost of test creation is low. Prompt agents to:

Write tests alongside implementation
Generate edge case tests for their own code
Create regression tests when fixing bugs

Run Tests in the Agent Loop

Include test execution as part of the agent's implementation cycle:

implement → run tests → fix failures → verify tests pass → commit

This catches regressions before they propagate.
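Here is a minimal sketch of that loop, with hypothetical hooks standing in for your agent and test runner. The round limit is an assumption added so that a stuck agent hands control back instead of retrying forever. Commits happen only on a green suite.

```typescript
// Hypothetical hooks: in practice these would call your agent and test runner.
interface LoopHooks {
  implement: (failures: string[]) => Promise<void>;
  runTests: () => Promise<string[]>; // names of failing tests; empty = green
  commit: () => Promise<void>;
}

// implement -> run tests -> fix failures -> verify tests pass -> commit
async function agentLoop(hooks: LoopHooks, maxRounds = 3): Promise<boolean> {
  let failures: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    await hooks.implement(failures); // first round implements; later rounds fix
    failures = await hooks.runTests();
    if (failures.length === 0) {
      await hooks.commit(); // only commit on a green suite
      return true;
    }
  }
  return false; // hand back to a human with the last failures
}

// Fake hooks: the suite fails once, then passes after the "fix".
let runs = 0;
let committed = false;
agentLoop({
  implement: async () => {},
  runTests: async () => (++runs === 1 ? ["login rejects bad password"] : []),
  commit: async () => {
    committed = true;
  },
}).then((ok) => console.log({ ok, runs, committed }));
```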

Test Known Issue Scenarios

Maintain a list of known system failures and ensure tests cover these scenarios. Agents don't know your system's historical pain points unless tests encode them.

Practical Recommendation

Aim for 75-85% coverage on critical code with behavior-focused tests. Accept 50-70% elsewhere. Always run tests in the agent loop.

The balance between cost/duration and regression protection favors FAST, MEANINGFUL tests over comprehensive slow tests.

When Higher Coverage Makes Sense

Push toward 95%+ in these contexts:

Safety-critical or regulated domains (medical, fintech, automotive)
Strong TDD culture where tests come naturally with design
Public APIs with stability guarantees
Security-sensitive code paths

Stop Building, Start Shipping: The Minimal Startup Toolkit

Ariel Frischer — Mon, 01 Sep 2025 01:55:47 +0000

Essential, simple, cost-effective startup tools.

As someone who's built multiple startups and watched countless others struggle with the same decisions, I'm consistently amazed by how many founders waste precious runway building infrastructure instead of products. You're not Google. You don't need to build your own authentication system, payment processor, or email server. You need to validate your idea and get to market—fast.

After building RepoBird.ai and several other SaaS products, I've discovered a core set of tools that streamline the process of building from scratch to MVP - while giving you the flexibility to scale. Here's the stack that lets you focus on what matters: building features your customers actually want.

1. OpenTofu: Infrastructure as Code Without the License Anxiety

Core Value: Infrastructure as code! Open-source Terraform fork with zero vendor lock-in and full compatibility

Have you ever followed a tutorial for setting up some AWS service manually? It is actual PAIN: you have to click through and type in a bunch of settings and configuration one at a time. Now imagine setting up your entire cloud infrastructure with simple blocks of code in a couple of files. You can spin it all up with one CLI command or tear it down the same way. This is the simplest way to centralize your infra, and AI agents can help if you have no infra experience (just beware of the costs you might incur: know what you're deploying).

If you're using Terraform, you should have switched to OpenTofu yesterday. When HashiCorp changed Terraform's license to BSL (Business Source License) in 2023, they inadvertently created a ticking time bomb for startups. OpenTofu, backed by the Linux Foundation, gives you everything Terraform does—but without the legal uncertainty or potential future licensing costs.

Why it beats the alternatives:

Terraform compatible - Configs written for Terraform 1.5.x and earlier work as-is
Truly open source (MPL-2.0) - No surprise license changes or fees as you scale
Community-driven - Features driven by users, not corporate roadmaps
Zero cost forever - No licensing fees, ever

The killer feature: For most existing projects, you can literally swap the Terraform binary for OpenTofu and everything just works. No migration, no rewrites, no drama.

Building infrastructure management from scratch would take months and create massive technical debt. AWS CloudFormation locks you into AWS. Terraform might cost you later. OpenTofu gives you enterprise-grade infrastructure management that remains free as you scale from 1 to 1,000 servers. If your startup doesn't require any special infra, you may not need this at all! One example is a simple Next.js site that handles all of its server-side logic itself, without needing special AWS services.

2. PurelyMail: Email That Costs $10/Year, Not $10/User/Month

Core Value: Unlimited users and domains for the price of a coffee

Every startup needs email. Most pay $6-12 per user per month for Google Workspace or Microsoft 365. At the top of that range, that's $720/year for just 5 email addresses (5 users × $12 × 12 months). PurelyMail? $10/year total. Not per user. Total.

Why it beats the alternatives:

Unlimited email addresses - Create dev@, support@, sales@, noreply@, and 50 more without paying extra
Unlimited domains - Run email for all your projects and brands from one account
Zero complexity - Set up in 5 minutes, works with any email client
Actual privacy - Your data isn't being mined for advertising

The killer feature: The pricing model. While competitors charge per user (forcing you to share accounts or limit access), PurelyMail lets you create proper email addresses for every function, service, and team member.

Yes, you lose the Google Docs integration. But for pure email? Depending on team size, you're saving hundreds to thousands of dollars per year that can go toward actual product development.

3. Supabase: The Open-Source Firebase That Uses Real SQL

Core Value: PostgreSQL backend with instant APIs, auth, and real-time—all open source

Firebase seemed revolutionary until you hit its limitations. Supabase gives you Firebase's developer experience with PostgreSQL's power and zero lock-in. The UI is simple and easy to navigate, and you can run queries yourself or have the AI assistant generate them for you. Incredible value for the free plan.

Why it beats the alternatives:

Real PostgreSQL - Use actual SQL, relations, transactions, and decades of database best practices
Instant APIs - Every table automatically gets REST and GraphQL endpoints
Auth included - User management, social logins, and MFA without another service
Self-hostable - Your data, your servers, your control when you need it

The killer feature: Elegant, simple UI. Auto-generated APIs. Generous free tier.

Building a custom backend would take months. Firebase locks you into Google and NoSQL. AWS requires stitching together a dozen services. Supabase gives you a production-ready backend database in minutes that scales to millions of users.

4. Clerk: Authentication in 5 Minutes, Not 5 Months

Core Value: Modern auth with pre-built UI components and 10,000 free users

Authentication is a tarpit. It seems simple until you're implementing MFA, social logins, organization management, webhooks, and compliance. Clerk handles all of this with literally 5 lines of code. The free tier has worked well for me, with no trial deadline. Third-party sign-in is built in, so users can sign in with Google or GitHub easily.

Why it beats the alternatives:

Generous free tier - Up to 10,000 monthly active users and 100 active organizations free; most startups never pay
Pre-built React components - Drop in <SignIn/> and you're done
B2B ready - Organizations, roles, and invites built-in

The killer feature: The pre-built UI components. While Auth0 and Cognito give you APIs, Clerk gives you production-ready React components that look good out of the box.

Custom auth would take months and create security vulnerabilities. Auth0 gets expensive fast. AWS Cognito requires deep AWS knowledge. Clerk gets you to market with enterprise-grade auth before lunch.

5. Stripe: Payments That Actually Work

Core Value: Developer-first payments with APIs for everything

Everyone knows Stripe, but not everyone appreciates why it's revolutionary. It's not about accepting payments—PayPal did that 20 years ago. It's about programmatic control over every aspect of the payment flow.

Why it beats the alternatives:

APIs for everything - Subscriptions, invoices, taxes, payouts—all programmable
Global by default - 135+ currencies and local payment methods built-in
Testing paradise - Comprehensive test mode with fake card numbers for every scenario
Compliance handled - PCI, SCA, tax calculation—Stripe manages the complexity

The killer feature: The subscription API. Managing recurring billing yourself is a nightmare of edge cases. Stripe handles prorations, trials, upgrades, downgrades, pauses, and tax calculation automatically.

PayPal is user-hostile for subscriptions. Square is built for retail. Building payment processing yourself is literally illegal without proper licenses. Stripe lets you accept money from anywhere, in any way, with a few API calls.

6. Next.js + Vercel: The Full-Stack Framework That Actually Ships

Core Value: Frontend, backend, and deployment in one seamless package

In a nutshell: Next.js provides a powerful way to build fast, scalable, SEO-friendly web applications with a seamless developer experience. Vercel is the recommended host for Next.js. It is created by the team behind the framework, serves as its default hosting solution, and offers seamless deployment, strong performance, and full-stack capabilities optimized specifically for it.

Why it beats the alternatives:

Full-stack in one repo - Frontend and backend APIs in one codebase
Automatic everything - SSL, CDN, scaling, previews—all handled by Vercel
Git-based deployment - Push to main, site is live globally in 30 seconds
Built-in optimization - Image optimization, code splitting, caching—all automatic

The killer feature: API routes. Write your backend endpoints in the same repo as your frontend. No CORS issues, no separate deployment, no coordination headaches.

Plain React requires assembling a build pipeline. Rails/Laravel weren't built for modern frontends. AWS requires DevOps expertise. Next.js + Vercel gets you from idea to production URL in under an hour.

My only gripe with Vercel is the cost: a startup needs at least the Pro plan ($20 per seat per month). The free Hobby plan is strictly for personal, non-commercial use, and commercial activity such as selling products or services is not permitted under Hobby.

PS: Code in TypeScript, not JavaScript.

The Hidden Cost of "Building It Yourself"

Every tool above could be built in-house. Also, you could be the next president of the United States. But just because you could doesn't mean you should. Don't be the stuck-up, do-everything-from-scratch dev, or you'll never get your core product out there. Once you're in the big leagues, you can think about stripping out dependencies; until then, use the existing frameworks and services for the time savings.

Start Shipping, Stop Building Infrastructure

The startups that win aren't the ones with the most elegant infrastructure. They're the ones that reach product-market fit before running out of money. Every hour you spend building an authentication system is an hour not spent talking to customers or building features they'll pay for.

These six tools eliminate 90% of infrastructure complexity while maintaining the flexibility to scale. They're production-tested by thousands of companies, documented extensively, and supported by thriving communities.

Your startup's success won't come from having built your own email server or authentication system. It'll come from solving a real problem for real customers. These tools let you focus on exactly that.

Stop reinventing wheels. Start shipping products.

What tools have transformed your startup development? What infrastructure decisions do you wish you'd made differently? Share your stack in the comments.


Agentic Coding Strategy: What Works, What Backfires

1. Multi-agent teams help when the work is genuinely parallel

2. Hierarchical decomposition beats one giant plan

3. Spec-driven and test-driven workflows solve different problems

4. The harness matters more than people want to admit

5. Model tiering is the cost control strategy

6. Experienced developers on familiar codebases should be selective

My decision framework

The bucket answer

arc-agent: AI System Design Generator

What it produces

Quickstart

Autospec: Spec-Driven Development for AI Coding Agents

Install

Set up a project

The basic workflow

One-command mode

Useful commands

Pick your agent

Why it is useful

Links

What Happens If Mythos Ships Before the Patches Do

What the weapon does

The defender has a structural problem

Who gets hit

Timeline

The right comparison is not a single past event

Why Glasswing is the actual story

OpenClaw vs Hermes Agent: A Comprehensive Comparison

Architecture

Messaging platforms

Model providers

Skills

Memory

Tools and sandboxing

Sessions

Multi-agent

Automation

CLI and developer experience

Self-improvement

References

I Built an AI Rental Management Platform for My Brother. Here's What Actually Happened.

Three Weeks of Real Data

How It Works

What Rentalot Does

The Point

Looking for Early Users

Stop Vibe Coding. Start Spec-Driven Development.

Vibe coding fails at alignment, not generation

How autospec enforces this

Why autospec over GitHub Spec Kit?

When specs are worth it

Get started

Making Your SaaS AI-Agent Ready: A Practical Guide

Overview: All 7 Phases

The AI-Agent Stack

Phase 1: Discovery Layer

Create llms.txt

Create llms-full.txt (Optional)

Phase 2: Enhance Your OpenAPI Spec

Agent-Oriented Descriptions

Required Enhancements

Document Rate Limits in Spec

Phase 3: OpenClaw Skill

Phase 4: MCP Server

Build a Local MCP Server

Document MCP Setup

Phase 5: TypeScript SDK

Phase 6: Remote MCP Server (Advanced)

Phase 7: Polish & Future-Proofing

JSON-LD Structured Data

"Build with AI" Documentation Page

Key Takeaways

References

Code Coverage Best Practices for Agentic Development

The Core Problem

Key Insight: Tests Are Institutional Memory

The Middle Ground: Tiered Coverage Strategy

Principles