Tracking the "Invisible Web": How I built an engine to detect traffic from ChatGPT, Claude, and Perplexity

#showdev #webdev #analytics #architecture

The Problem: The "Black Box" of Modern Analytics

I spent years frustrated with the same two problems in marketing tech:

The "Direct" Lie: Google Analytics (GA4) dumps huge chunks of traffic into "Direct" whenever the referrer is missing or unrecognized. In reality, much of it is "Dark Social" (Slack, Discord) or the new wave of AI answer engines.

Fragile A/B Tests: Client-side testing tools rely on CSS selectors that break the moment a developer changes a class name.

I decided to build Zyro to fix this. It’s an optimization platform that combines a server-side A/B tester with a finance-grade attribution ledger.

Here is a breakdown of the architecture and the specific technical challenges I solved.

Catching the "Invisible" Traffic (AI & LLMs)

Standard analytics tools look for standard referrers, but traffic from LLMs often arrives with the referrer stripped. I built a Universal Source Detector that recognizes more than 50 specific traffic signatures.

Instead of a single regex, I built a "Traffic Brain" that specifically identifies:

AI Engines: ChatGPT, Perplexity, Claude, Gemini, Copilot.

Ad Identifiers: Parsing gbraid, wbraid (Google), fbclid (Meta), and ttclid (TikTok).

Email Platforms: Detecting _kx (Klaviyo) and mc_cid (Mailchimp).

The Tech: These parameter-heavy URLs can get very long, so we store them in NVARCHAR(MAX) columns in a custom SQL Server schema. That avoids the silent truncation you get when a schema caps URL columns at a fixed length.
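
To make the detection order concrete, here is a minimal sketch in Python (not the production .NET code). The click-ID parameters come from the list above; the referrer hostnames, the utm_source check, and the names detect_source and _ai_engine_for are assumptions for illustration only.

```python
from urllib.parse import urlparse, parse_qs

# Click identifiers named in the post, mapped to (platform, channel).
CLICK_ID_PARAMS = {
    "gbraid": ("Google", "ad"),
    "wbraid": ("Google", "ad"),
    "fbclid": ("Meta", "ad"),
    "ttclid": ("TikTok", "ad"),
    "_kx": ("Klaviyo", "email"),
    "mc_cid": ("Mailchimp", "email"),
}

# ASSUMPTION: illustrative hostnames, not the real signature library.
AI_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}


def _ai_engine_for(host: str):
    """Return the AI engine name for a hostname, or None if it is unknown."""
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    for known, name in AI_HOSTS.items():
        if host == known or host.endswith("." + known):
            return name
    return None


def detect_source(landing_url: str, referrer: str = "") -> dict:
    """Classify one visit from its landing URL and (possibly empty) referrer."""
    query = parse_qs(urlparse(landing_url).query, keep_blank_values=True)

    # 1. Click IDs live in the landing URL, so they survive a stripped referrer.
    for param, (platform, channel) in CLICK_ID_PARAMS.items():
        if param in query:
            return {"channel": channel, "source": platform, "matched": param}

    # 2. AI engines: check utm_source first (assumed convention), then the referrer host.
    utm_source = query.get("utm_source", [""])[0]
    ref_host = urlparse(referrer).hostname or ""
    for label, candidate in (("utm_source", utm_source), ("referrer", ref_host)):
        engine = _ai_engine_for(candidate)
        if engine:
            return {"channel": "ai", "source": engine, "matched": label}

    # 3. Fall back to a plain referral, or "direct" when nothing is left to inspect.
    if ref_host:
        return {"channel": "referral", "source": ref_host, "matched": "referrer"}
    return {"channel": "direct", "source": "direct", "matched": None}
```

Under these assumptions, detect_source("https://example.com/pricing?utm_source=chatgpt.com") is classified as an "ai" visit from ChatGPT even though no referrer was sent, which is exactly the case that would otherwise land in "Direct".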

The "Visual Intelligent" Editor (No More Broken Selectors)

Most A/B tools force you to "blind click" on elements, generating fragile CSS paths. I took a different approach using HtmlAgilityPack.

Backend Crawler: Instead of relying on the browser, our backend crawls the live URL and parses the DOM tree server-side.

X-Ray Vision: On the frontend, we render elements in a sandboxed iframe on hover. This allows us to "see" the code structure before editing.

Asset Pipeline: When a user swaps an image, we don't hotlink. We automatically upload it to AWS S3 and serve it via CloudFront. As a result, the "test" variation can load faster than the original site.
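
As a rough illustration of the server-side parsing step, the sketch below walks an HTML string and produces a flat outline of elements that an editor could show on hover. It uses Python's standard html.parser as a stand-in for HtmlAgilityPack, skips the crawling step entirely, and the record fields and the names OutlineParser and parse_outline are assumptions, not Zyro's actual data model.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the open stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OutlineParser(HTMLParser):
    """Collects one record per element, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open elements
        self.elements = []   # flat outline handed to the editor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        record = {
            "tag": tag,
            "id": attrs.get("id"),
            "classes": (attrs.get("class") or "").split(),
            "src": attrs.get("src"),
            "depth": len(self.stack),
            "text": "",
        }
        self.elements.append(record)
        if tag not in VOID_TAGS:
            self.stack.append(record)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach visible text to the innermost open element.
        if self.stack and data.strip():
            self.stack[-1]["text"] += data.strip()


def parse_outline(html: str) -> list:
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.elements
```

Because the outline is built on the backend from the fetched markup, the editor can present tag, id, class, and text together instead of asking the user to click blindly in the browser.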

"God Mode" Attribution (Linear Multi-Touch)

The biggest challenge was attribution. "Last Click" is lazy. I built a Linear Multi-Touch Model with a 30-day lookback window.

The Logic: When a transaction happens (via Stripe webhook or Bank Wire), the engine:

Fetches the user's entire 30-day history.

Filters "Direct" Noise: If other marketing sources exist (e.g., a Facebook ad click 2 weeks ago), "Direct" is explicitly ignored.

Splits the Revenue: It divides the dollar value evenly across all valid touchpoints and writes one row per touchpoint to a dbo.AttributionLedger table.
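
The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the production engine: the function name attribute_linear, the tuple shape of a touchpoint, and the choice to put the rounding remainder on the last touch are all assumptions.

```python
from datetime import datetime, timedelta
from decimal import Decimal, ROUND_DOWN

LOOKBACK = timedelta(days=30)
CENT = Decimal("0.01")


def attribute_linear(amount: Decimal, purchased_at: datetime, touches: list) -> list:
    """Split `amount` evenly across touchpoints; `touches` holds (timestamp, source) tuples."""
    # Step 1: keep only touches inside the 30-day lookback window, oldest first.
    window_start = purchased_at - LOOKBACK
    in_window = sorted(t for t in touches if window_start <= t[0] <= purchased_at)

    # Step 2: drop "direct" only when at least one marketing source exists.
    marketing = [t for t in in_window if t[1] != "direct"]
    valid = marketing or in_window
    if not valid:
        return []

    # Step 3: equal shares, rounded down to the cent.
    share = (amount / len(valid)).quantize(CENT, rounding=ROUND_DOWN)
    rows = [(source, share) for _, source in valid]

    # ASSUMPTION: leftover cents go to the last touch so the rows sum exactly to `amount`.
    remainder = amount - share * len(valid)
    last_source, last_share = rows[-1]
    rows[-1] = (last_source, last_share + remainder)
    return rows
```

Using Decimal rather than float keeps the ledger exact: a 100.00 sale split across three touchpoints becomes 33.33, 33.33, and 33.34, never 33.333333.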

Syncing "Intent" to Ad Pixels

We don't just track pageviews. We track Intent Scores (0–100) based on micro-behaviors:

High Intent: Copying text, dwelling on pricing, reading reviews.

Friction: Rage clicks, dead clicks, form abandonment.
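
One simple way to turn these micro-behaviors into a 0–100 score is a weighted sum that is clamped to the range. The sketch below is only an illustration: the event names, the weights, and the assumption that friction events subtract from the score are all made up for the example, not Zyro's real scoring rules.

```python
# ASSUMPTION: hypothetical event names and weights, chosen only for illustration.
SIGNAL_WEIGHTS = {
    # High-intent behaviors add to the score.
    "copy_text": 15,
    "pricing_dwell": 25,
    "read_reviews": 20,
    # Friction behaviors subtract from it.
    "rage_click": -15,
    "dead_click": -10,
    "form_abandon": -20,
}


def intent_score(events: list) -> int:
    """Sum the weights of known events and clamp the result to 0..100."""
    raw = sum(SIGNAL_WEIGHTS.get(event, 0) for event in events)  # unknown events count as 0
    return max(0, min(100, raw))


def is_high_intent(events: list, threshold: int = 60) -> bool:
    """ASSUMPTION: 60 is an arbitrary cut-off for building a retargeting audience."""
    return intent_score(events) >= threshold
```

With these example weights, a visitor who dwells on pricing, copies text, and reads reviews scores 60, and a single rage click afterwards pulls that down to 45.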

The Cool Part: The system dispatches these events server-side to ad platforms (Meta CAPI, Google Ads). This lets you retarget users who showed "High Intent" but didn't buy, which can lower CAC.

The Stack

Backend: .NET / SQL Server

Crawling: HtmlAgilityPack

Algo: Thompson Sampling (Multi-Armed Bandit) for auto-optimizing traffic.

Infra: AWS S3 + CloudFront.
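
For readers who haven't met Thompson Sampling, here is a minimal Beta-Bernoulli version in Python. It is a textbook sketch rather than Zyro's implementation; the class name ThompsonSampler and the uniform Beta(1, 1) prior are assumptions.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a fixed set of variants."""

    def __init__(self, variants, seed=None):
        self.rng = random.Random(seed)
        # Track conversions and non-conversions per variant.
        self.stats = {v: {"conversions": 0, "misses": 0} for v in variants}

    def choose(self) -> str:
        """Draw a plausible conversion rate for each variant and serve the best draw."""
        draws = {
            v: self.rng.betavariate(s["conversions"] + 1, s["misses"] + 1)  # Beta(1, 1) prior
            for v, s in self.stats.items()
        }
        return max(draws, key=draws.get)

    def record(self, variant: str, converted: bool) -> None:
        """Update the chosen variant's counts after the visit ends."""
        key = "conversions" if converted else "misses"
        self.stats[variant][key] += 1
```

Each visitor is assigned by calling choose(), and record() is called once the outcome is known. Early on the draws are noisy, so every variant gets traffic; as evidence builds up, the better variant wins most draws and traffic shifts toward it automatically.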

What’s Next?

I’m currently refining the "Anti-Flicker" engine (customizable timeouts to prevent a flash of original content, or FOOC) and expanding the AI detection library.

I’d love to hear how you guys are handling attribution for ChatGPT traffic. Are you seeing it show up as "Direct" in GA4 too?

🚀 Try Zyro "God Mode" & Fix Your Attribution