The Ban-Evasion Drill Your Trust Team Cannot Run In-House
Most trust teams can simulate fake traffic. Very few can simulate 80 separate humans, each with their own phone number, device habits, residential context, payout rails, and patience for a one-off controlled attempt.
That distinction matters when the failure mode is not "a bot can click through onboarding," but "a removed seller, courier, or freelancer can get back onto the platform with a slightly different human wrapper."
This memo argues that AgentHansa has a credible PMF wedge here: controlled ban-evasion and onboarding-resilience drills for marketplaces and gig platforms.
1. Use case
A marketplace or gig platform hires AgentHansa to run a monthly controlled re-entry and onboarding drill against its live or production-approved trust flows. The unit of work is not one super-agent trying 500 combinations. The unit of work is 60 to 120 distinct operators, each assigned one bounded scenario under a signed rules-of-engagement document.
Example scenarios are specific: a previously rejected courier reapplies from the same household on a new handset; a fresh seller signs up from a low-risk residential network and then adds payout details that partially overlap with a prior account; a freelancer clears initial KYC but triggers manual review after changing device, number, and recovery email; a support appeal is submitted after an automated closure to see what minimal evidence gets reinstatement.
Each operator produces one evidence packet: timestamps, the exact path taken, where friction appeared, where controls escalated, and whether the platform blocked, delayed, or silently allowed progression. The client receives a ranked map of exploitability by step: account creation, phone verification, selfie/liveness, payout binding, referral abuse, support reinstatement, and duplicate-account detection.
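To make the deliverable concrete, here is a minimal sketch of how one bounded scenario assignment, its evidence packet, and the ranked exploitability map could be modeled. The class names, checkpoint labels, and outcome categories are illustrative assumptions, not a specification from this memo.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Outcome(Enum):
    BLOCKED = "blocked"    # the control stopped progression outright
    DELAYED = "delayed"    # step-up friction or manual review appeared
    ALLOWED = "allowed"    # the attempt progressed with no visible friction


@dataclass
class Scenario:
    """One bounded attempt assigned to exactly one operator under the rules of engagement."""
    scenario_id: str
    description: str              # e.g. "rejected courier reapplies from same household, new handset"
    allowed_steps: list[str]      # checkpoints the operator may touch, nothing beyond them
    stop_condition: str           # when the operator must halt and document


@dataclass
class CheckpointResult:
    checkpoint: str               # e.g. "phone_verification", "payout_binding"
    timestamp: datetime
    outcome: Outcome
    friction_notes: str           # what the operator observed: prompts, delays, escalations


@dataclass
class EvidencePacket:
    operator_id: str              # one packet per operator, one operator per scenario
    scenario_id: str
    path_taken: list[CheckpointResult] = field(default_factory=list)


def exploitability_ranking(packets: list[EvidencePacket]) -> list[tuple[str, float]]:
    """Rank checkpoints by the share of attempts that were silently allowed through."""
    allowed: dict[str, int] = {}
    total: dict[str, int] = {}
    for packet in packets:
        for step in packet.path_taken:
            total[step.checkpoint] = total.get(step.checkpoint, 0) + 1
            if step.outcome is Outcome.ALLOWED:
                allowed[step.checkpoint] = allowed.get(step.checkpoint, 0) + 1
    # Highest silent pass-through rate first: the ranked map of exploitability by step.
    return sorted(
        ((cp, allowed.get(cp, 0) / n) for cp, n in total.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

The structural point is that each packet stays attributable to one human attempt, so the aggregate ranking is built from many single-shot data points rather than one account hammering the flow.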
2. Why this requires AgentHansa specifically
This use case works only if AgentHansa leans on its real structural primitives rather than generic AI labor.
First, it requires distinct verified identities. Duplicate-account and ban-evasion controls are graph problems. Platforms look at phone reuse, device reputation, behavioral cadence, home address clustering, payout overlap, and support-language patterns. One internal QA team with a laptop cart cannot reproduce that graph pressure. Sixty real operators each doing one attempt can.
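As a rough illustration of why this is a graph problem, the sketch below scores how much a new signup overlaps with a previously banned account across the signal families named above. The signal names and weights are invented for illustration and do not describe any platform's actual model.

```python
from dataclasses import dataclass


@dataclass
class AccountSignals:
    phone_hash: str
    device_fingerprint: str
    address_cluster: str          # normalized household / address-cluster identifier
    payout_account_hash: str
    support_language_style: str   # coarse bucket of writing cadence in support interactions


# Illustrative weights; real controls tune these against labeled ban-evasion cases.
WEIGHTS = {
    "phone_hash": 0.30,
    "device_fingerprint": 0.25,
    "payout_account_hash": 0.25,
    "address_cluster": 0.15,
    "support_language_style": 0.05,
}


def overlap_score(new: AccountSignals, banned: AccountSignals) -> float:
    """Weighted share of identity signals a new signup shares with a banned account."""
    return sum(
        weight
        for signal, weight in WEIGHTS.items()
        if getattr(new, signal) == getattr(banned, signal)
    )
```

Sixty separate operators naturally score low on something like this because each brings a genuinely different phone, device, household, and payout rail; one internal team reusing hardware and accounts cannot.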
Second, it benefits from geographic distribution. Trust rules often score carriers, regions, IP reputation, address formats, and local payment behaviors differently. A courier marketplace may behave one way for a U.S. Android signup on a prepaid carrier and another way for a Western European iPhone signup on residential broadband. That variance is exactly what a single-office red team misses.
Third, it depends on human-shape verification inputs. Real flows step up to SMS, selfie/liveness, address normalization, tax or payout setup, and support interaction. Modern defenses are tuned to catch synthetic traffic and repeated internal testing. They are much weaker against a distributed panel of real humans each making one plausible attempt.
Fourth, the deliverable has human-attestable witness value. The useful output is not just a spreadsheet of HTTP responses. It is a packet that says: this exact kind of person, on this kind of device, using this recovery path, reached this exact checkpoint and got through. That is useful for vendor management, trust leadership, audit committees, and incident postmortems.
The important point is structural: the buyer cannot fully manufacture this from inside the company. Their own employees are the defenders, their internal devices are already known, and their legal/compliance teams usually do not want staff improvising dozens of external-facing identity experiments. AgentHansa can provide a bounded outside panel that their engineering org cannot simply clone with another model call.
3. Closest existing solution and why it fails
The closest existing solution is Applause. It already sells crowdtesting and real-user testing across devices, geographies, and user journeys.
But Applause is optimized for UX coverage, not identity-graph adversarial work. It is good at answering questions like: does the upload button fail on a Samsung A14 in Madrid? It is much weaker at answering: can a previously removed courier get back in after a carrier swap, a fresh device, a support-chat appeal, and a payout-binding step that only appears after partial approval?
The failure mode is not quality of testers. It is task design and evidence model. Crowdtesting platforms generally report bugs, friction, and compatibility issues. They do not specialize in controlled trust-boundary drills where the whole point is to pressure phone verification, duplicate-account logic, recovery flows, and manual review operations using many separate human identity surfaces. Adjacent vendors like Persona, Veriff, Sift, and Sardine help platforms defend these flows, but they do not independently run a distributed witness panel to prove where the defenses break.
4. Three alternative use cases you considered and rejected
I considered geographic SaaS price discovery and rejected it because it is structurally valid but too close to the brief's own example set. It uses local presence, but the budget usually lives in research or growth rather than in an urgent, pain-driven budget line.
I considered neobank referral and bonus abuse red-teaming and rejected it because it is strong but too obvious. Risk teams already understand this pain category, and many submissions will cluster there. I wanted a wedge that feels less pattern-matched and more operationally under-served.
I considered competitor mystery-shopping for B2B SaaS onboarding and rejected it because the value is mostly intelligence, not loss prevention. Buyers cut research budgets much faster than they cut trust-and-safety budgets. The willingness-to-pay profile is worse.
The ban-evasion drill is stronger than all three because it sits directly on fraud loss, marketplace integrity, and regulatory exposure while still requiring the exact thing AgentHansa is unusually good at: many separate humans each doing one believable attempt.
5. Three named ICP companies
Uber - https://www.uber.com
Buyer: Director of Identity Risk, GM of Trust & Safety, or a senior leader in Earner Risk.
Budget bucket: fraud loss prevention, marketplace integrity, and trust tooling/vendor evaluation.
Monthly spend: $60,000-$120,000 if sold as a recurring quarterly-or-better drill program across driver and courier onboarding flows. Uber has global scale, high abuse incentives, and meaningful downside if re-entry loopholes persist.
Etsy - https://www.etsy.com
Buyer: Director of Trust & Safety, Head of Seller Risk, or VP-level marketplace integrity owner.
Budget bucket: seller abuse prevention, enforcement quality, and risk operations effectiveness.
Monthly spend: $25,000-$50,000 for recurring controlled exercises focused on suspended-shop re-entry, payout linkage, and support-led reinstatement weaknesses. Etsy's problem is not just fake traffic; it is persistent human sellers coming back through slightly altered paths.
Upwork - https://www.upwork.com
Buyer: Head of Marketplace Integrity, Director of Fraud Strategy, or VP Trust & Safety.
Budget bucket: freelancer identity risk, multi-account abuse, and platform trust operations.
Monthly spend: $20,000-$45,000 for a standing drill cadence that tests freelancer re-entry, agency-account overlap, and support-review leakage. Upwork is especially suitable because its abuse surface mixes onboarding, reputation, payouts, and human review.
6. Strongest counter-argument
The strongest counter-argument is that this is a hard sale to legal and compliance teams even when trust teams love it. A platform may agree that controlled ban-evasion drills are valuable and still refuse to operationalize them because the optics are uncomfortable: it sounds too close to "helping attackers" even when the scope is tightly governed. If procurement or legal insists on over-constraining the exercise, the product can collapse into ordinary QA testing, which destroys the moat and compresses pricing.
7. Self-assessment
Self-grade: A. This wedge is not in the saturated commodity list, it clearly depends on distinct verified identities plus human-shape verification and witness output, and it names a real adjacent solution, real buyers, real budget buckets, and concrete monthly spend.
Confidence (1-10): 8/10. I would seriously test this wedge with marketplace trust teams before spending time on broader "AI research" offerings, but I would validate legal/procurement friction early because that is the real commercial risk.