DEV Community


I Built a Local AI Agent That Audits My Own Articles. It Flagged Every Single One.

Daniel Nwaneri on March 30, 2026

Not as a gotcha. As a result. Seven URLs. Seven FAILs. My Hashnode profile is missing an H1. Three freeCodeCamp tutorials have meta descriptions ...
Pascal CESCATO

Interesting, but you don't need an LLM for this. Looking at your code, everything you're sending to Claude can be done directly in Python — with two advantages: zero cost, and a fully deterministic approach with no hallucination risk.

Daniel Nwaneri

You're right that the extraction logic is deterministic — PASS/FAIL on character counts doesn't need a model. But the flags array is where it breaks down. "Title is 67 characters and reads like a navigation label rather than a page description" requires judgment a regex doesn't have. I wanted the output to be actionable, not just binary.
The cost argument holds though. For a pure character-count audit, Haiku at $0.001/URL is already trivial, but zero is less than that.
Where does your Python-only approach handle the ambiguous cases — pages where the title length passes but the content is clearly wrong for the query?

Pascal CESCATO

Fair point on the flags — if they're meant to carry semantic judgment ("reads like a nav label"), then yes, a model earns its place. But looking at your schema, the flags are still field-level: "title exceeds 60 characters", not "title is semantically weak". The ambiguous cases you mention — title length passes but content is wrong for the query — aren't in scope here.
That's actually a different tool. A two-pass approach makes more sense: deterministic Python for the binary checks, model call only on pages that pass the mechanical audit but need a second look. You pay per genuinely ambiguous case, not per URL.
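Pascal's two-pass idea can be sketched in a few lines. This is illustrative, not the article's actual code — `model_review` stands in for the Claude call, and the mechanical checks shown are a minimal subset:

```python
# Two-pass audit sketch: deterministic checks first, model only on survivors.
# `model_review` is a stand-in for the real Claude call (injected so the
# deterministic layer stays testable without an API key).

def mechanical_check(page: dict) -> list[str]:
    """Binary, regex-grade checks. Returns a list of hard failures."""
    failures = []
    if len(page.get("title", "")) > 60:
        failures.append("title exceeds 60 characters")
    if not page.get("meta_description"):
        failures.append("meta description missing")
    return failures

def audit(pages: list[dict], model_review) -> dict:
    results = {}
    for page in pages:
        failures = mechanical_check(page)
        if failures:
            # Hard failure: no model call needed, the verdict is already known.
            results[page["url"]] = {"status": "FAIL", "flags": failures}
        else:
            # Passed the mechanical pass — only now pay for model judgment.
            results[page["url"]] = model_review(page)
    return results
```

With this shape you pay per page that survives the mechanical filter, not per URL — exactly the cost profile described above.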

Daniel Nwaneri

The 2-pass framing is better than what I shipped. Deterministic filter first, model only on the survivors — you pay per genuinely ambiguous case, not per URL. That's the right architecture and I didn't build it that way.
The honest reason: I wanted one code path, not two. The added complexity of "run Python, decide if a model is needed, run the model conditionally" felt like scope creep for a tutorial. In production you're right. In a showcase meant to demonstrate the LLM layer, one pass made the demo cleaner.

Worth a follow-up piece though — "when to add a model to automation that already works."

Pascal CESCATO

That's an honest answer — and a better reason than the architecture. "One code path for a tutorial" is defensible; "LLM for character counts in production" isn't.
The follow-up angle is good. Another framing: "the cheapest model that solves the problem" — which sometimes is a regex, sometimes Haiku, occasionally something bigger. Cost and complexity as a sliding scale rather than a binary choice.

Daniel Nwaneri

"Cheapest model that solves the problem" is a better frame than two-pass because it generalizes. Regex → Haiku → Sonnet isn't a decision tree, it's a cost curve. You route based on what the task actually requires, not on a predetermined architecture.
The piece writes itself: start with the character-count example, work up through cases where Haiku is enough, find the edge where Sonnet earns it. Foundation does something like this implicitly — short queries hit a lighter path — but I've never written it out explicitly.
Adding it to the queue.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

Great work! Hope you are well and it's been awhile.

The process you mentioned — "open a spreadsheet, visit each client URL, check the title tag, check the description, check the H1, note broken links, paste everything into a report. Repeat weekly." — is a real pain. It's very time-consuming, and I'm glad you made a project that tackles this big issue. Well done! :D

Daniel Nwaneri

That weekly spreadsheet ritual is the thing nobody talks about when they pitch agency life. Good to hear from you, Francis — been a while indeed.

Apex Stack

The HITL pattern is the most underrated part of this. I run automated SEO audits daily across a large multilingual site (~140K pages, 12 languages) and the biggest lesson was exactly this — the agent needs to know when to stop and flag something for a human instead of silently failing or making up an answer.

The debate in the comments about LLM vs pure Python is interesting. In my experience you need both: deterministic checks for the binary stuff (title length, meta desc present, H1 count, hreflang tags) and the LLM for qualitative judgment calls ("is this title actually descriptive or just keyword-stuffed?"). Running pure regex on 140K pages is cheap but misses the nuance. Running Claude on every page is accurate but expensive. The sweet spot is deterministic first pass → LLM only on the flagged subset.
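The deterministic half of that split needs nothing beyond the standard library. A rough sketch of the binary checks using `html.parser` (field names and thresholds are the common SEO conventions, not anyone's production code):

```python
from html.parser import HTMLParser

class SeoFacts(HTMLParser):
    """Collects the fields the deterministic pass needs:
    title text, meta description, H1 count."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = None
        self.h1_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.meta_description = attrs.get("content", "")
        elif tag == "h1":
            self.h1_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def deterministic_checks(html: str) -> dict:
    """Pure string/DOM checks — zero cost, no hallucination risk."""
    p = SeoFacts()
    p.feed(html)
    return {
        "title_length_ok": 0 < len(p.title) <= 60,
        "meta_description_present": bool(p.meta_description),
        "single_h1": p.h1_count == 1,
    }
```

Anything this function can decide never needs to reach a model; only pages that pass here but still look questionable go to the LLM subset.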

The $0.002/page cost is solid. At scale that math changes fast though — curious if you've thought about batching the Claude calls or using a cheaper model (Haiku) for the initial classification pass?

Daniel Nwaneri

The 140K number is where my cost math breaks down fast. At that scale the architecture you described — deterministic first, model on the flagged subset — isn't optional, it's survival. The Haiku question is actually what the follow-up piece became: the cost curve runs regex → Haiku → Sonnet based on what the task requires, and on my last 50-URL run, 8 reached Sonnet. The rest resolved cheaper. Your site would probably produce a different ratio — multilingual with hreflang complexity means more pages that pass binary checks but need semantic judgment. Curious what percentage of your 140K actually reach LLM evaluation on a given run.
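The cost-curve math from that 50-URL run is easy to sanity-check. The Haiku per-URL figure comes from earlier in the thread ($0.001/URL); the Sonnet per-URL figure and the regex/Haiku split are placeholder assumptions, since the comment only says 8 of 50 reached Sonnet:

```python
def blended_cost_per_url(tier_counts: dict, tier_cost: dict) -> float:
    """Average cost per URL when URLs resolve at different tiers."""
    total_urls = sum(tier_counts.values())
    spend = sum(tier_cost[t] * n for t, n in tier_counts.items())
    return spend / total_urls

# Illustrative numbers: 8 of 50 reached Sonnet (from the comment);
# the regex/Haiku split and Sonnet price are assumptions.
run = {"regex": 30, "haiku": 12, "sonnet": 8}
cost = {"regex": 0.0, "haiku": 0.001, "sonnet": 0.006}
```

Under these assumptions the blended cost lands around $0.0012/URL — roughly a fifth of sending every URL to Sonnet, which is the whole argument for routing.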

Pavel Ishchin

the "one code path" reason is probably the most honest part here. seen too many demos where everything works until you pull apart the deterministic stuff from the model, and then it's like: wait, why is the model even here? did the 50-URL run change how much of this you'd actually keep in a real system?

Daniel Nwaneri

The 50-URL run answered it. Extraction is pure Python now — title, description, H1, canonical, all deterministic. The model only touches pages where something passes the mechanical check but needs judgment. In production the split is cleaner than I expected and the places where the model still earns it are exactly the messy cases you'd guess: inconsistent templates, third-party injections, content that has to be read rather than located. Wrote up the full architecture here if you want the details: dev.to/dannwaneri/i-was-paying-0006-per-url-for-seo-audits-until-i-realized-most-needed-0-132j

Apex Stack

This resonates hard. The deterministic-first approach is exactly what I landed on too. I run automated audits across thousands of pages and the same pattern holds — about 80% of SEO checks are pure string matching and DOM parsing. No model needed.

The messy cases you mentioned (third-party injections, content that has to be read rather than located) — that's where I still lean on the model too. Cookie consent scripts injecting rogue H1 tags, for example. A regex can find the H1 but only the model can tell you "this H1 is from a third-party widget, not your actual page title."

The cost savings from that split are massive at scale. When you're auditing 50+ URLs per run, paying for model inference on every single check would add up fast. Smart architecture. Will check out your full writeup.

Daniel Nwaneri

The "regex finds the H1, model tells you whose H1 it is" distinction is the cleanest way I've heard it put. That's going in the docs. Glad the writeup is useful — the cost savings at your scale dwarf mine, which is probably why you landed on this architecture before I did.

Lavie

This is exactly why 'human-in-the-loop' is still the most critical part of AI workflows. I've been doing something similar with Next.js 15 dev cycles — building an automated system to catch hallucinations before they hit production. It's fascinating that even when the agent flags everything, the real work is in the systematic prevention of those patterns. Great read on the 'local-first' audit approach!

Daniel Nwaneri

Hallucination detection before production is the harder version of the same problem — the audit has to be more reliable than the thing it's auditing.

Lavie

Exactly right — and that is the core tension. My approach shifts the problem: instead of detecting hallucinations after generation, I use .mdc rules to constrain what the model can generate in the first place. It is not perfect but it converts a detection problem into a prevention problem, which is a lot more tractable.

Daniel Nwaneri

The demo problem is real. Most agent tutorials audit a toy site specifically because it passes cleanly. Running it on your own published work means you can't curate the results — whatever the agent finds is what gets reported. The seven FAILs weren't staged.

Lavie

Exactly, the real world is messy and that is where these agents usually fall apart. By showing the failures, it's easier to see exactly where the logic breaks down and how to improve the prompts or constraints to handle those edge cases.

Lavie

Exactly. Hallucination detection before production is the harder version of the same problem - the audit has to be more reliable than the thing it's auditing. This is why I've moved away from "audit-after-the-fact" and more towards "generator-constraints" with .mdc files. If you can force the AI to follow the rule during generation, the audit becomes a lot simpler because you're already within the standard. It's the difference between testing for bugs and formal verification.

Apex Stack

The HITL design is the part that really stands out here. Most agent tutorials treat failure as an edge case — your skip/retry/quit approach treats it as a first-class workflow state. That's a huge difference in production.

I run a similar pattern on my own site (89K+ pages across 12 languages). The SEO audit agent checks GSC data, crawls pages, and files tickets — but the key insight I learned early was exactly what you described: the agent needs to know when it's out of its depth and flag for human review instead of guessing.

On the LLM vs deterministic debate in the comments — I think you nailed the response. The binary checks (title length, meta desc presence) don't need a model. But the qualitative flags ("this title reads like a navigation label") are where the LLM earns its $0.002/page. The hybrid approach is underrated.

Curious about your state.json approach — do you version it or just overwrite? I've found that keeping a rolling history of audit results is useful for tracking whether SEO issues are getting better or worse over time.

Daniel Nwaneri

89K pages across 12 languages is a different beast entirely. The GSC integration is the piece I deliberately left out of v1 because it changes the architecture. You're not just auditing what's there, you're correlating with what Google sees. That's where the tool gets genuinely useful for agencies and genuinely complex to build.

On state.json — currently just overwrites. Your point about rolling history is the obvious v2. Even a simple append-per-run with a timestamp key would let you track PASS→FAIL regressions over time. That's probably more valuable than the initial audit for most clients.
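The append-per-run idea is small enough to sketch. A hypothetical shape — run results keyed by ISO timestamp so lexicographic sort is also chronological sort (the per-URL values are simplified to bare PASS/FAIL strings):

```python
import json
import time
from pathlib import Path

def append_run(state_path: Path, results: dict) -> None:
    """Append this run's results under a timestamp key instead of overwriting."""
    history = json.loads(state_path.read_text()) if state_path.exists() else {}
    history[time.strftime("%Y-%m-%dT%H:%M:%S")] = results
    state_path.write_text(json.dumps(history, indent=2))

def regressions(history: dict) -> list:
    """URLs that went PASS -> FAIL between the two most recent runs."""
    if len(history) < 2:
        return []
    # ISO timestamps sort chronologically as strings.
    prev_key, last_key = sorted(history)[-2:]
    prev, last = history[prev_key], history[last_key]
    return [url for url, status in last.items()
            if status == "FAIL" and prev.get(url) == "PASS"]
```

The regression list is the report most clients actually want: not "what's wrong" but "what broke since last week."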

What does your ticket-filing look like — do you route by severity or just dump everything into a queue?

Apex Stack

Severity-based routing, but simplified. The agent categorizes into three buckets: broken (5xx errors, missing hreflang on entire page types, broken JSON-LD), degraded (short meta descriptions, thin content under 200 words), and cosmetic (title slightly over 60 chars, minor formatting). Broken gets a ticket filed immediately in Linear with a DEV- prefix. Degraded gets batched into a weekly ticket. Cosmetic gets logged but no ticket unless it affects a page that's actually ranking.
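The three-bucket routing reads naturally as a small classifier. This is a sketch of the scheme described above, not Apex's actual implementation — the issue-kind names and the dedup check are illustrative:

```python
# Issue kinds per bucket, taken from the description above (names illustrative).
BROKEN = {"5xx", "missing_hreflang", "broken_jsonld"}
DEGRADED = {"short_meta_description", "thin_content"}

def route(issue: dict) -> str:
    """Map an issue to a triage action: immediate ticket, weekly batch, or log."""
    kind = issue["kind"]
    if kind in BROKEN:
        return "ticket_now"       # filed immediately (e.g. Linear, DEV- prefix)
    if kind in DEGRADED:
        return "weekly_batch"     # batched into a weekly ticket
    # Cosmetic: only escalate if the affected page actually ranks.
    return "weekly_batch" if issue.get("page_ranks") else "log_only"

def is_duplicate(issue: dict, open_issues: list) -> bool:
    """Dedup against already-open issues before filing — the noise killer."""
    return any(o["kind"] == issue["kind"] and o["url"] == issue["url"]
               for o in open_issues)
```

The dedup gate is doing the heavy lifting: routing decides *where* an issue goes, dedup decides *whether* it goes anywhere at all.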

The key insight was that filing a ticket for every issue creates noise. When I first set it up, the agent generated 40+ tickets in a single run — nobody triages that. Now it deduplicates against existing open issues before creating new ones, which cut ticket volume by about 60%.

The GSC correlation you mentioned is where it gets interesting though. A page can pass every on-page audit but still sit in "crawled - not indexed" for weeks. That's where the tool stops being an auditor and starts being a diagnostic — and that's the harder problem to solve.

Daniel Nwaneri

The three-bucket system is the right call. Binary pass/fail at scale is just noise with extra steps — the signal is in the severity routing, not the detection.
The deduplication against open issues is the piece I hadn't thought through. Filing a ticket for every run means the same issue gets reopened weekly until someone fixes it. Checking first whether it's already tracked changes the agent from a reporter into something closer to a monitor.

The crawled-not-indexed problem is a different class entirely. On-page signals are visible to the agent. GSC indexing state requires the API, a time dimension, and context the agent doesn't have — why was it crawled, when did status change, what changed on the page between crawl attempts. That's where you stop auditing and start investigating. Have you found a pattern in what actually resolves it, or is it mostly waiting and hoping Google recrawls?

Apex Stack

Some patterns have emerged after watching 135K URLs go through the GSC pipeline over 3 months:

  1. Content length matters more than people admit. Pages under 300 words almost never escape "crawled - not indexed." Once I expanded stock page analyses from ~200 to 600-800 words, indexed count jumped 81% in a single week (1,335 → 2,425).

  2. Internal linking is the underrated lever. Adding "Related Stocks" and "Popular in Sector" widgets — basically creating a web of cross-links between stock → sector → ETF pages — seemed to help Google decide individual pages were worth indexing. The pages themselves didn't change, just their connectedness.

  3. Hreflang cleanup had an outsized effect. My "alternate canonical" errors dropped from 682 to 83 after fixing hreflang tags. That correlated with the indexing spike, though causation is hard to prove.

What doesn't seem to work: just waiting. Pages that sat in "crawled - not indexed" for 6+ weeks without any changes rarely moved on their own. The trigger was always a content or structural change that gave Google a reason to re-evaluate.

Daniel Nwaneri

The content length finding is the most actionable thing in this thread. An 81% indexing jump from 200 → 600-800 words is a number worth putting in front of anyone who thinks thin pages are a technical problem rather than a content problem. The agent can flag under-300-word pages trivially — that's a len(text.split()) < 300 check, not a model call.

The internal linking point reframes what the audit should actually measure. Right now the agent checks whether links are broken. What it doesn't check is whether the page is sufficiently connected to the rest of the site. Connectedness isn't an on-page signal — it requires a graph, not a snapshot.

That's the v2 architecture: page-level audit for on-page signals, site-level graph for structural signals. The crawled-not-indexed diagnostic lives in the second layer.
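The site-level graph check is simple once you have the crawl edges. A minimal sketch — the in-link threshold is an illustrative cutoff, not a known ranking rule:

```python
from collections import Counter

def underlinked(edges: list, pages: list, min_inlinks: int = 3) -> list:
    """Site-level check: pages whose internal in-degree falls below a threshold.

    `edges` is a list of (source_url, destination_url) pairs from the crawl.
    The threshold is illustrative — tune it per site, it is not a Google rule.
    """
    indegree = Counter(dst for _src, dst in edges)
    return [p for p in pages if indegree[p] < min_inlinks]
```

This is the structural signal the page-level audit can't see: a page can pass every on-page check and still be nearly unreachable from the rest of the site.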

Mykola Kondratiuk

Love this. There's something genuinely useful about using the same tool to audit the work it helped you produce - catches the patterns you've normalized. Curious what the most common flag was. Was it style/tone or more structural things like missing context or weak conclusions?

Daniel Nwaneri

Mostly structural — missing meta descriptions, titles over 60 characters, H1 count issues. Nothing about style or tone because the agent isn't reading for quality, it's checking against standards. The interesting case was freeCodeCamp's own template truncating descriptions on article listing pages — the agent flagged it as a FAIL and it technically is, even though it's platform-level and outside my control. Auditing your own work with your own tool finds the things you'd rationalized as acceptable.

Mykola Kondratiuk

The structural stuff makes sense - those are measurable so the agent can actually flag them. But that freeCodeCamp case is the interesting one. Platform-imposed truncation showing up as a personal FAIL is exactly the kind of thing you'd normally just rationalize away. The agent doesn't know context, so it flags it anyway. Weirdly that's the most honest kind of audit.

Daniel Nwaneri

Context-free is the feature, not the limitation. A human auditor would see "freeCodeCamp template" and mark it acceptable. The agent sees a missing meta description and marks it FAIL. Both are correct; they're answering different questions.
The agent answers: does this page meet the standard? The human answers: is this worth fixing given the constraints? You need both. The agent's job is to surface everything. Your job is to triage what actually matters.

The platform-imposed FAIL is useful precisely because it forces the triage decision to be explicit rather than assumed. You either fix it, escalate it, or document why it's acceptable. Any of those is better than normalizing it silently.

Mykola Kondratiuk

The agent/human split you're describing is exactly right. The agent answers "does this meet the standard" - the human answers "is this worth fixing given the context". Those are genuinely different questions and both useful. The platform-imposed failures are actually good signal - they're showing you the gap between your setup and the standard, even if you consciously chose that gap.

Daniel Nwaneri • Edited

"Consciously chose that gap" is the useful distinction. There's a difference between a FAIL you didn't know about and a FAIL you accepted. The agent can't tell which is which, but surfacing both forces you to be explicit about which category each one falls into. The ones you assumed were acceptable without ever deciding they were — that's where the audit earns its cost.

Jonathan Murray

"It flagged every single one" is a satisfying result in a perverse way — it means the audit is actually working rather than being sycophantic. Self-critique tools that only flag obvious problems quickly become noise that you tune out. The ones that surface genuine issues in work you thought was solid are the ones worth keeping.

The interesting design question for a writing auditor is what dimensions to evaluate — clarity, factual accuracy, logical consistency, missing context, tone — because the failure modes are very different and some are more automatable than others. Factual claims and structural logic are relatively easy for an LLM to flag; things like "is this actually useful to the intended reader" are much harder because they require modeling the reader's specific knowledge state.

What were the most common categories of issues it surfaced? I'm curious whether the flags were mostly stylistic or whether it caught substantive gaps in the reasoning.

Daniel Nwaneri

The flags were almost entirely structural, not substantive — which is both honest and limiting. Missing meta descriptions, titles over 60 characters, absent canonicals. The agent found what was technically wrong, not what was argumentatively weak. "This title is 67 characters" is a different class of finding than "this piece buries the actual insight in paragraph four."
The dimension you're pointing at — is this actually useful to the intended reader — is the one I haven't touched yet, and probably can't with the current architecture. Modeling a reader's knowledge state requires knowing who the reader is, which the agent doesn't. What it can do is flag when the structure makes the useful part hard to find. That's a narrower version of the same problem, but it's automatable. The substantive gaps are still a human job.

Michael Weber

Solid architecture, Daniel. The use of flat JSON for state persistence is a smart move for local agents—keeps things portable and debuggable without the overhead of a database. It’s also interesting to see how Claude handles the accessibility tree instead of raw HTML. Definitely a more resilient way to build scrapers/auditors today.

Daniel Nwaneri

The accessibility tree point is worth expanding on. Raw HTML gives you structure — the accessibility tree gives you intent. A div styled to look like a button is invisible to a scraper. Browser Use sees it the same way a screen reader would. That shift from parsing markup to reading meaning is what makes the extraction prompt reliable across different site architectures.
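The "intent, not markup" distinction can be shown with a toy traversal. The node shape here ({role, name, children}) is a simplification of what a real accessibility snapshot returns, not the Browser Use format:

```python
# A toy accessibility-tree walk. The node shape is a simplification,
# not the actual Browser Use or Playwright snapshot schema.

def find_by_role(node: dict, role: str) -> list:
    """Collect the accessible names of all nodes with a given role."""
    hits = []
    if node.get("role") == role:
        hits.append(node.get("name", ""))
    for child in node.get("children", []):
        hits.extend(find_by_role(child, role))
    return hits

# A <div> styled as a button is invisible to a tag-based scraper, but it
# surfaces here because the tree records intent (role), not markup.
tree = {
    "role": "WebArea", "name": "Pricing", "children": [
        {"role": "heading", "name": "Plans"},
        {"role": "button", "name": "Start trial"},  # actually a styled <div>
    ],
}
```

A tag-based scraper looking for `<button>` would miss "Start trial" entirely; the role-based walk finds it the same way a screen reader would.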

Apex Stack

This resonates hard. I run a programmatic SEO site with 89K+ pages across 12 languages, and the exact pattern you describe — deterministic checks that someone does manually every week — is what pushed me to build similar audit agents.

The part about the brittleness shift is the key insight. Traditional scrapers break when the DOM changes. With an LLM in the loop, the agent can adapt to layout changes and still extract the right signals. The tradeoff is cost per audit, but when you're checking thousands of pages, you quickly learn which checks need reasoning and which are pure regex.

One thing I've found useful: tiering the audit. Run the deterministic checks (title length, meta desc, H1 count, status codes) with plain Python first, then only send the pages that need judgment calls to the LLM. Cuts API costs by ~80% while keeping the quality assessment where it matters.

Curious about your experience with Browser Use specifically — how does it handle JavaScript-heavy SPAs compared to something like Playwright? That's where most of my rendering issues come from.

Daniel Nwaneri

Browser Use is Playwright underneath — it's the agent orchestration layer, not a separate browser engine. So the SPA rendering question is really a Playwright question: does the page fully hydrate before the agent reads it? In my testing, Browser Use waits for network idle before extraction, which handles most JS-heavy cases. The failure mode I've hit isn't rendering — it's timing. Pages that load a skeleton first and populate content asynchronously can get extracted mid-hydration. The fix I've been using: a short explicit wait after navigation before triggering the Claude extraction call. Not elegant, but reliable.
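A slightly less arbitrary version of that fixed wait: poll until two consecutive reads of the page agree, so you never extract mid-hydration. This is a generic sketch with an injected `read_page` callable, not Browser Use's API:

```python
import time

def read_when_stable(read_page, interval: float = 0.5, max_tries: int = 10):
    """Guard against mid-hydration extraction: return the page content only
    once two consecutive reads agree.

    `read_page` is whatever fetches the rendered DOM or accessibility tree
    (a Playwright call in practice; injected here so the logic is testable).
    """
    previous = read_page()
    for _ in range(max_tries - 1):
        time.sleep(interval)
        current = read_page()
        if current == previous:
            return current  # content has stopped changing
        previous = current
    return previous  # best effort after max_tries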

Where are your rendering failures actually showing up — missing content, wrong H1 (skeleton vs hydrated), or something else?

Apex Stack

Good insight on the Playwright timing issue. My site is static-generated (Astro), so SPA hydration isn't a factor — but the rendering failures I see are different: a third-party cookie consent script (Silktide) injects a hidden H1 tag into the DOM after page load, which means the agent sees 2 H1s on every page when it reads the accessibility tree. That's a template-level bug that silently affects all 100K+ pages.

The other big one is content language mismatch — the page is tagged as Japanese or Spanish via hreflang, but the AI-generated body text is still in English because the content pipeline hadn't caught up. An agent auditing the rendered DOM catches this instantly because it can pattern-match English sentence structures on a page that should be entirely in another language.
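The language-mismatch check doesn't even need a model for the blatant cases. A crude heuristic sketch — the script sets and threshold are illustrative assumptions, and a real pipeline would use proper language detection:

```python
def latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are ASCII (Latin) letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

def language_mismatch(text: str, expected_lang: str,
                      threshold: float = 0.5) -> bool:
    """Crude heuristic: a page tagged (via hreflang) as a non-Latin-script
    language shouldn't be mostly ASCII letters. Sets/threshold illustrative."""
    non_latin = {"ja", "zh", "ko", "ru", "ar"}
    return expected_lang in non_latin and latin_ratio(text) > threshold
```

This catches "English body on a page tagged Japanese" for free; the model earns its keep on the subtler cases, like a Spanish page with machine-translated filler.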

For your timing fix, have you tried waitForLoadState('networkidle') in Playwright? That's usually more reliable than an arbitrary delay for async content.

Daniel Nwaneri

The language mismatch case is the one I want to add to the docs. That's not an SEO audit finding; it's a content pipeline failure that only surfaces when you read the rendered page semantically. A character count check passes it. A human reviewer might miss it at scale. The agent catches it because it knows English sentence structure doesn't belong on a page tagged as Japanese. That's the argument for LLM-in-the-loop that no tutorial example captures cleanly.
On waitForLoadState('networkidle') — I've avoided it because on pages with persistent polling or analytics pings it never resolves. The arbitrary delay is inelegant but predictable. If your Astro site is genuinely static after the consent script fires, networkidle probably works cleanly. I'll test it against the edge cases and see if I can drop the fixed wait.

Ella

This is good. I like how you tested it on your own work instead of just using it on a demo. That makes it more real. I also like how you included HITL, as most people would skip this part and their script would just break.

Daniel Nwaneri

The demo problem is real. Most agent tutorials audit a toy site specifically because it passes cleanly. Running it on your own published work means you can't curate the results — whatever the agent finds is what gets reported. The seven FAILs weren't staged.

Apex Stack

The framing of "brittleness moved from selectors to prompts" really resonates. I run a daily automated audit agent across ~90K pages on a multilingual Astro site, and the hardest bugs to catch programmatically are exactly the semantic ones — like a cookie consent script injecting a hidden H1 that passes every HTML validator but tanks your SEO because Google sees two competing H1s. A regex would never catch that. The LLM spots it because it understands what an H1 is for, not just what it looks like in markup.

Curious about your --auto mode and the needs_human[] pattern. At scale, how do you handle the backlog of URLs that need human review? Do you batch them into a report or is there a triage workflow?

Daniel Nwaneri

The cookie consent H1 is the best example of this I've seen in the comments. A validator passes it. Regex passes it. The LLM catches it because it's reading intent, not markup. That's the whole argument for semantic extraction in one case.
The needs_human[] backlog is the weakest part of the current build and I'll say that plainly. Right now it's a flat list in the summary. At 90K pages that's not a workflow, it's a graveyard. What I haven't shipped yet: severity tiers in the backlog. A 404 is different from a login wall is different from a redirect chain — triaging a flat list at scale just means nothing gets reviewed.

Apex Stack

The cookie consent H1 example is chef's kiss — that's exactly the kind of issue where regex and validators give you a false clean bill of health because they're checking syntax, not semantics. An LLM catches it because it understands that a cookie banner shouldn't be an H1.

Your point about severity tiers in the backlog is spot on too. I'm dealing with this at ~90K pages myself — a flat issue list becomes completely unactionable. Right now I bucket things manually (P1 = broken pages, P2 = content quality, P3 = nice-to-have), but automating that triage based on SEO impact would be the real unlock. A redirect chain on a high-traffic page is fundamentally different from a missing alt tag on a deep page nobody visits.

Julian Oczkowski

This is one of the more nuanced discussions I've seen on AI tooling. The "cost curve" framing Pascal and Daniel landed on is exactly the right mental model — not every task needs the same level of intelligence, and the real engineering challenge is routing to the cheapest model that solves the problem at each step.

I'd add that this pattern extends beyond content auditing. In production AI systems, we often see a tiered approach: deterministic rules first, lightweight models for triage, and larger models reserved for genuinely ambiguous edge cases. It keeps latency low, costs predictable, and reduces unnecessary LLM dependency.

Looking forward to the follow-up on finding that inflection point — measuring where the cost delta stops justifying the upgrade is something most teams skip but is critical for sustainable AI adoption.

Daniel Nwaneri

The 3-tier pattern is the production version of what Pascal and I landed on in the abstract. Deterministic rules → lightweight triage → frontier model for edge cases maps directly to the cost curve: floor, middle, and the inflection point where the upgrade justifies itself.

The latency argument is the one I hadn't foregrounded. Cost is measurable upfront. Latency compounds in ways that aren't obvious until you're watching a 7-URL audit take 4 minutes because every page hits Sonnet regardless of complexity. Routing by task type fixes both problems simultaneously.
The follow-up piece has a natural structure now — build the three tiers explicitly, measure where each plateau hits, find the inflection empirically rather than guessing. That's a more useful article than "here's when to use Haiku."

TAMSIV

Love the meta aspect of this — using AI to improve the content you write about AI. That feedback loop is underrated.

I've been exploring a similar angle but for a different source: git commits as content. Instead of auditing existing articles, I use the project's git history as raw material for new ones. Every commit message, every PR description, every refactor decision is a micro-story waiting to be told.

The interesting parallel with your approach is that both methods treat your own work as a dataset. You're mining your articles for quality signals; I'm mining my codebase for narrative signals. Both beat staring at a blank page trying to brainstorm "what should I write about."

Your point about the agent flagging every single article is humbling but honest. I'd be curious to know: what was the most common issue it found? Was it structural (flow, readability) or more about content depth?

Also — did you consider letting the agent suggest rewrites, or intentionally keep it as an auditor only to preserve your voice?

Daniel Nwaneri

Git history as content source is the one I haven't tried. The interesting thing about that approach is the signal is already structured — commit messages have implicit categories (fix, feat, refactor), PR descriptions have implicit narrative (problem, approach, tradeoff), and the diff is the evidence. You're not generating content from nothing, you're surfacing decisions that were already made.
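The implicit categories make that approach very codeable. A sketch of classifying conventional-commit messages into narrative angles — the angle names are made up for illustration:

```python
import re

# Hypothetical mapping from conventional-commit prefixes to narrative angles.
ANGLES = {
    "fix": "debugging story",
    "feat": "feature walkthrough",
    "refactor": "design-decision retrospective",
}

def classify_commit(message: str) -> str:
    """Classify a commit message by its conventional-commit prefix,
    e.g. 'feat(audit): ...' or 'fix: ...'."""
    m = re.match(r"^(\w+)(\(.+\))?!?:", message)
    if m and m.group(1) in ANGLES:
        return ANGLES[m.group(1)]
    return "uncategorized"
```

Run that over `git log --format=%s` and the "what should I write about" question becomes a sorted backlog.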

The most common flag was missing or over-length meta descriptions — structural SEO, not content quality. The agent doesn't read for depth or clarity, only against measurable standards. Style and readability would require a different audit layer entirely.
On keeping it as auditor only: deliberate. The moment it suggests rewrites it's optimizing for the standard, not for the voice. Those aren't the same thing.

Collapse
 
pjdeveloper896 profile image
Prasoon Jadon

This is one of those builds where the result matters more than the tool—and the result here is brutally honest: automation doesn’t just scale work, it exposes it.

Running it on your own content first is the part most people skip—and it’s exactly why this feels credible.

A few thoughts that stood out:

“An agent that knows its limits” is the real innovation here
Everyone’s obsessed with autonomy, but in practice, graceful failure + HITL is what makes systems usable in the real world.
You didn’t just automate SEO—you productized a role
That weekly spreadsheet job you described? This replaces not just effort, but process. That’s a much bigger shift than “AI saves time.”

The semantic vs selector shift is huge
Moving from brittle selectors to reasoning via something like Claude API is basically:

from “tell the computer where to look”
to “let it understand what it’s seeing”

That’s a fundamental change in how automation is built.

Also worth calling out: your cost model (~$0.002/page) quietly kills a lot of SaaS in this space. A lot of “SEO audit tools” are now competing with… a weekend project.

If I had to compress the deeper insight:

Old automation scaled actions.
New agents scale judgment.

And your system shows something even more important:

judgment doesn’t need to be perfect—just consistent and cheap enough to run continuously.

Curious where you take this next—especially if you layer in diff tracking over time or auto-fix suggestions. That’s where it goes from audit tool → autonomous optimization system.

Collapse
 
dannwaneri profile image
Daniel Nwaneri

"Scales judgment not actions" is the sharpest compression of what's different here. Old automation needed you to specify the action precisely. The agent needs you to specify the standard . what good looks like and it handles the execution. That's a fundamentally different thing to maintain.
The cost model point is the one I think about most. The SEO audit SaaS market is built on the assumption that this problem requires infrastructure. A weekend project running at $0.12 per 20-URL audit doesn't kill the enterprise tools but it does kill the mid-market ones that charge $99/month for what is essentially a scheduled crawler with a dashboard...
On where it goes next. diff tracking is the natural v2. The audit is only interesting on the first run. What becomes interesting over time is whether issues are getting fixed, regressing, or accumulating. That's a monitoring system, not an audit tool and it's a much more valuable thing to sell.

Collapse
 
arkforge-ceo profile image
ArkForge

State persistence with flat JSON is underrated. The checkpoint-and-resume pattern also makes the audit trail cleaner — you can replay exactly what the agent evaluated and when.

Collapse
 
dannwaneri profile image
Daniel Nwaneri

The audit trail angle is the part I didn't write up. Right now state.json just tracks what's done but timestamped, append-only entries would let you reconstruct the full session. Which URL, what result, exactly when. Useful for debugging a run that produced unexpected results three weeks later.
The flat file keeps it portable too. No database dependency means anyone can clone the repo and inspect the history in any text editor.

Collapse
 
arkforge-ceo profile image
ArkForge

Yeah, that's where it gets interesting once your state.json is the audit record, it's self-attested. Works perfectly for your own debugging, but the moment a third party needs to verify what actually happened, they're trusting your file.
We ended up solving it by signing at write time each append gets a receipt the instant it's written. Same flat-file approach, no database, but tamper-detectable. If someone replays the log three weeks later, they can verify none of it was touched.

Thread Thread
 
dannwaneri profile image
Daniel Nwaneri

Self-attested is the right framing . I hadn't named it that way but that's exactly the failure mode. The file says what happened, and the file is controlled by the process doing the writing.
Signing at write time is the clean fix. The receipt exists independently of the record, so a replay can verify without trusting the record itself. Same flat-file portability, but the trust model is different.
The interesting edge case: what happens when the signing key is on the same machine as the agent? You've moved the trust boundary but not eliminated it. How are you managing key custody in your setup?

Collapse
 
vibestackdev profile image
Lavie

I really like the choice of using Browser Use for this instead of a traditional headless scraper. Handling real rendering is a game-changer for audits where dynamic content might be missed. I've been deep in the agent space lately too, specifically building .mdc rules for Cursor to prevent Next.js 15 hallucinations (like the new async params requirement). It's interesting to see how we're all moving towards these specialized agent systems for deterministic validation. Great write-up!

Collapse
 
apex_stack profile image
Apex Stack

The $0.002-per-page cost breakdown really puts this in perspective. I run a similar scheduled audit setup on a large programmatic SEO site — checking schema markup, hreflang tags, meta descriptions, and H1 structure across thousands of pages daily. The pattern of using an agent to audit your own output is incredibly powerful because it catches the kind of drift that humans miss over time.

Your point about platform-level issues (freeCodeCamp controlling meta descriptions, Dev.to title tags exceeding 60 chars) is an underappreciated nuance. The audit isn't just about your content quality — it's about understanding which problems you can actually fix vs. which are platform constraints you need to work around.

Curious about one thing: how do you handle false positives over time? When I run audits at scale, I've found that certain "issues" are actually acceptable tradeoffs (like longer titles that perform better in clicks despite exceeding character limits). Do you have a way to mark known exceptions so they don't clutter future reports?

Collapse
 
nube_colectiva_nc profile image
Nube Colectiva

Cool 🔥