I wanted a practical answer to one question:
How do we measure web tracking signals in a way that is reproducible, explainable, and non-invasive?
This post walks through the approach, what we built, and what we learned from a 10-site batch run.
## TL;DR
FlowLens-Web is a TypeScript CLI that:
- records browser sessions with Playwright + HAR,
- extracts identifier-like request signals,
- scores evidence levels (L1-L5),
- reports cross-domain reuse and cross-run persistence,
- outputs Markdown + Mermaid summaries.
It is a research/measurement tool, not a blocker.
## Architecture
Core stack:
- Node.js + TypeScript
- Playwright (Chromium)
- tldts (eTLD+1 classification)
- SHA-256 hashing for safe identifier matching
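The eTLD+1 grouping is what decides whether a request counts as first- or third-party. A minimal sketch of that classification — note this uses a naive "last two labels" guess as an illustrative stand-in for tldts, which resolves real public suffixes correctly:

```typescript
// Sketch: first- vs third-party classification relative to the page origin.
// FlowLens uses tldts for real eTLD+1 resolution; the naive guess below
// (last two hostname labels) is an illustrative stand-in and is wrong for
// multi-part public suffixes such as co.uk.
function naiveRegistrableDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

function isThirdParty(pageUrl: string, requestUrl: string): boolean {
  const pageDomain = naiveRegistrableDomain(new URL(pageUrl).hostname);
  const reqDomain = naiveRegistrableDomain(new URL(requestUrl).hostname);
  return pageDomain !== reqDomain;
}
```

This is exactly the kind of edge case that justifies pulling in tldts: `shop.example.co.uk` and `ads.example.co.uk` share an eTLD+1 only if you know `co.uk` is a public suffix.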
Pipeline:
- run scripted browsing scenario
- save HAR
- parse entries + normalize request metadata
- extract candidate identifier fields
- compute reuse/persistence signals
- assign evidence levels
- generate reports (case, matrix, A/B, funnel, longitudinal)
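As a rough sketch of the parse/normalize step — the HAR field names follow the HAR 1.2 format (`log.entries[].request`), but the `NormalizedRequest` shape here is a simplification, not the actual FlowLens type:

```typescript
// Sketch: pull the minimal request metadata the later analysis stages need
// out of a HAR log. HAR 1.2 stores each captured request under
// log.entries[].request.
interface HarEntry {
  request: {
    url: string;
    queryString: { name: string; value: string }[];
  };
}

interface NormalizedRequest {
  hostname: string;
  path: string;
  params: Record<string, string>;
}

function normalizeEntries(har: { log: { entries: HarEntry[] } }): NormalizedRequest[] {
  return har.log.entries.map((e) => {
    const u = new URL(e.request.url);
    const params: Record<string, string> = {};
    for (const { name, value } of e.request.queryString) params[name] = value;
    return { hostname: u.hostname, path: u.pathname, params };
  });
}
```

Everything downstream (candidate extraction, reuse detection) works off these normalized records rather than the raw HAR.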
## Evidence Model
We use explicit confidence tiers:
- L1: third-party domain observed
- L2: identifier-like field observed
- L3: repeated within run
- L4: cross-domain hash reuse
- L5: cross-run persistence
This keeps interpretation honest: a higher level means stronger network-level evidence, not proof of the platform's internal ad-decision logic.
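The tiers can be read as a simple ladder over observed properties of a signal. A hypothetical sketch — the flag names are illustrative, not the real FlowLens fields:

```typescript
// Sketch: map observed properties of a candidate signal to an evidence tier.
// Each tier presumes the ones below it: a cross-run match (L5) is also an
// identifier-like observation that was repeated and reused.
interface SignalObservation {
  thirdParty: boolean;          // L1: seen on a third-party domain
  identifierLike: boolean;      // L2: field value looks like an identifier
  repeatedInRun: boolean;       // L3: same value repeated within one run
  crossDomainReuse: boolean;    // L4: same hash seen on multiple eTLD+1s
  crossRunPersistence: boolean; // L5: same hash across independent runs
}

function evidenceLevel(s: SignalObservation): number {
  if (s.crossRunPersistence) return 5;
  if (s.crossDomainReuse) return 4;
  if (s.repeatedInRun) return 3;
  if (s.identifierLike) return 2;
  if (s.thirdParty) return 1;
  return 0; // no tracking-relevant observation
}
```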
## CLI Workflows
### Matrix (multi-site)

```bash
npm run flowlens -- study-matrix \
  --sites https://www.google.com,https://www.youtube.com \
  --scenarios baseline,engaged,ad-click \
  --runs 3
```
### A/B (causal contrast)

```bash
npm run flowlens -- study-ab \
  --url https://www.youtube.com \
  --control baseline \
  --treatment ad-click \
  --runs 3
```
### Funnel (stage deltas)

```bash
npm run flowlens -- study-funnel \
  --url https://www.google.com \
  --query running+shoes \
  --runs 3
```
### Longitudinal (stability over samples)

```bash
npm run flowlens -- study-longitudinal \
  --url https://www.wikipedia.org \
  --samples 7 \
  --runs 1
```
## Full-Batch Findings (Current Run)
Batch design:
- 10 sites
- 3 scenarios
- target 3 runs/scenario
Outcome:
- 9/10 sites produced complete scenario outputs
- Amazon repeatedly failed under the runtime constraints of this environment (timeouts, session closure) and was recorded as an explicit failure case rather than silently dropped
Pattern-level observations:
- signal intensity varied strongly by site/scenario
- deeper interaction stages often increased observed signal metrics
- some content-centric cases remained low-signal across repeated runs
## Why the Redaction Layer Matters
Raw tokens are not published.
Instead, FlowLens stores:
- redacted preview
- token length
- stable hash for equality/reuse checks
That gives us reproducibility without leaking sensitive raw values.
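A minimal sketch of that redaction record, using Node's built-in crypto module (the field names here are illustrative):

```typescript
import { createHash } from "node:crypto";

// Sketch: keep enough about a token to compare it later without ever
// persisting the raw value. Hash equality implies reuse of the same token.
interface RedactedToken {
  preview: string; // first few characters only, for human-readable reports
  length: number;  // original token length
  sha256: string;  // stable hash used for reuse/persistence matching
}

function redact(raw: string, previewLen = 4): RedactedToken {
  return {
    preview: raw.slice(0, previewLen) + "…",
    length: raw.length,
    sha256: createHash("sha256").update(raw).digest("hex"),
  };
}
```

Two redacted records match exactly when their hashes match, which is all the cross-domain reuse and cross-run persistence checks need.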
## What You Can Claim Responsibly
From this tooling and dataset, you can claim:
- network-observed data-flow signals vary by context,
- controlled behavior changes can shift measured signals,
- reuse/persistence patterns are measurable in a repeatable way.
You cannot claim from network traces alone:
- definitive platform-internal ad decision logic,
- person-level identity resolution.
## Engineering Notes
What worked well:
- the modular analysis pipeline
- the evidence-level abstraction, which made findings easier to communicate
- the matrix, funnel, A/B, and longitudinal studies complementing each other
What remains hard:
- large-site reliability under fixed timeouts
- anti-bot/session constraints
- balancing coverage vs runtime cost
## Read the Full Materials
- Repository: https://github.com/yul761/FlowLens
- Full-batch summary: data/reports/published/formal-v1-full-overall-summary.md
- Academic-style article: data/reports/published/public-v1-academic-article.md
## If You Want to Build on This
Next useful extensions:
- stronger single-variable controls (consent, login, click-id toggles)
- bootstrap confidence intervals on key deltas
- cross-environment runs (device profile/region)
- publication-grade data manifests
## Closing
A lot of tracking debates are stuck between oversimplified claims and opaque internals.
A HAR-first, evidence-tier approach gives a practical middle path: measurable, repeatable, and honest about uncertainty.