Florian Demartini

Posted on May 27

My autonomous agent scraped 35 real questions from French renters — then rewrote our homepage

#ai #python #saas #automation

Every product landing page I've seen — including mine — had a section that said something like "you might be worried about your lease, your deposit, your DPE rating." Pure copywriter hypothesis. This week, my autonomous agent replaced every word of it with data scraped from Reddit. Here's the exact pipeline.

Background: 370 autonomous cycles, one French housing rights tool

BailleurVérif is a French housing rights SaaS I run solo. The agent wakes every 2 hours via cron, reads a strategic critic audit every 12 hours, and executes prescriptions autonomously — no human in the loop per wake cycle. As of today: 370 wakes, 27/27 strategic prescriptions honored, 0 ScheduleWakeup calls (external cron handles pacing).

Stack: Python + Anthropic Claude API for the critic/executor pattern + SQLite funnel tracker + static HTML server. The agent commits to GitHub, pings the Indexing API, publishes datasets on data.gouv.fr. This week it made a product decision I'd been procrastinating on for weeks.

The problem: our homepage copy was entirely hypothetical

Strategic critic audit-26 (2026-05-26T21:55Z) flagged it directly:

"Homepage copy is 100% hypothetical — 'vous vous demandez peut-être si votre loyer est légal...' — while Reddit has thousands of real renter questions publicly accessible."

The prescription: scrape 30+ questions from French subreddits, build a data-driven page, publish the JSON dataset as CC-BY 4.0, then use that corpus to swap the homepage hero. Builder-only. Zero human action required from me.

The scraping pipeline

The agent wrote scrape_reddit_locataires_run367.py. It ran in two rounds, paused 2 seconds between requests, and sent a self-identifying bot User-Agent:

import urllib.request, urllib.parse, json, time

SUBREDDITS = "france+paris+immobilier+vosfinances+AskFrance"
UA = "BailleurVerifBot/0.1 (+bailleurverif.fr; contact@bailleurverif.fr)"

TAG_QUERIES = {
    "loyer-abusif": [
        "loyer abusif encadrement refus propriétaire",
        "loyer trop élevé que faire locataire",
    ],
    "dpe-invalide": [
        "DPE faux fraude loyer augmenté",
        "dpe invalide recours locataire",
    ],
    "depot-garantie-non-restitue": [
        "caution non rendue délai légal",
        "dépôt garantie non restitué propriétaire",
    ],
}

results = []
for tag, queries in TAG_QUERIES.items():
    for q in queries:
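        # Scope the search through the /r/<subs>/search.json path with
        # restrict_sr=1; a bare "subreddit=" query parameter is not part of
        # Reddit's search API and would leave the search site-wide.
        url = (
            f"https://www.reddit.com/r/{SUBREDDITS}/search.json"
            f"?q={urllib.parse.quote(q)}"
            f"&restrict_sr=1"
            f"&sort=top&limit=25&t=month"
        )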
        req = urllib.request.Request(url, headers={"User-Agent": UA})
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read())
        for post in data["data"]["children"]:
            p = post["data"]
            results.append({
                "tag": tag,
                "title": p["title"],
                "subreddit": p["subreddit"],
                "score": p["score"],
                "num_comments": p["num_comments"],
                "id_hash": p["id"],  # no username stored
            })
        time.sleep(2)

Round 1 produced 50 raw results. Round 2 added r/Locataires and tightened the queries. Final dataset: 35 questions.
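Several queries can surface the same post, so merging two rounds plausibly needs a dedupe pass. The post does not describe how the rounds were merged; the sketch below is one assumed approach, and `dedupe_by_id` and the sample rows are hypothetical.

```python
def dedupe_by_id(rows):
    """Keep the first occurrence of each post, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id_hash"] in seen:
            continue
        seen.add(row["id_hash"])
        unique.append(row)
    return unique


# Made-up rows standing in for two scrape rounds.
round_1 = [
    {"id_hash": "abc", "tag": "loyer-abusif", "title": "Loyer trop élevé ?"},
    {"id_hash": "def", "tag": "dpe-invalide", "title": "DPE faux, que faire ?"},
]
round_2 = [
    {"id_hash": "abc", "tag": "loyer-abusif", "title": "Loyer trop élevé ?"},
    {"id_hash": "ghi", "tag": "loyer-abusif", "title": "Encadrement non respecté"},
]

merged = dedupe_by_id(round_1 + round_2)
print(len(merged))
```

Keeping the first occurrence means a post keeps the tag of the first query that found it, which may or may not match what the real script does.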

The quality filter: why 50 → 35 matters

The filter pass mattered more than the scrape volume:

Title-anchor required: must contain a tag-specific word (loyer, DPE, caution, garantie, encadrement)
Renter scope only: not a landlord asking about their property
Anti-noise blacklist: "copropriété", "syndic", "locaux commerciaux" — adjacent topics, out of scope

What got cut: 8 landlord questions, 4 commercial lease questions, 3 homeowner renovation questions.
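The title-anchor and blacklist rules above can be sketched as a small predicate. This is a minimal illustration, not the agent's actual filter: `normalize` and `passes_filter` are hypothetical names, the prefix matching on tokens is an assumption, and the renter-scope rule is left out because it needs judgment that a keyword check cannot give.

```python
import re
import unicodedata

# Anchor words and blacklist terms are the ones listed in the post
# (blacklist written without accents to match the normalized title).
ANCHOR_WORDS = ("loyer", "dpe", "caution", "garantie", "encadrement")
BLACKLIST = ("copropriete", "syndic", "locaux commerciaux")


def normalize(text):
    """Lowercase and strip accents so matching ignores case and diacritics."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def passes_filter(title):
    """Title-anchor check plus blacklist; renter scope is not covered here."""
    text = normalize(title)
    tokens = re.findall(r"[a-z0-9]+", text)
    # Prefix match so plurals such as "loyers" still count as an anchor.
    has_anchor = any(tok.startswith(word) for tok in tokens for word in ANCHOR_WORDS)
    is_noise = any(term in text for term in BLACKLIST)
    return has_anchor and not is_noise


print(passes_filter("Mais qui respecte l'encadrement des loyers pour les petites surfaces ?"))
print(passes_filter("Syndic de copropriété : hausse de loyer"))
```

Matching on token prefixes rather than raw substrings avoids false anchors such as "précaution" counting as "caution".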

Final distribution: loyer-abusif: 26 / dpe-invalide: 6 / depot-garantie: 3.

Score range: 8 to 347 upvotes. The agent sorted by score within each tag and picked the top-scoring question from each of the 3 tags for the homepage hero.
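The per-tag selection can be sketched as follows, taking the "one per tag, highest score" description at face value. `top_per_tag` is a hypothetical helper, and apart from the r/paris question with 93 votes, the sample rows are made up.

```python
def top_per_tag(questions):
    """Return a dict mapping each tag to its highest-scoring question."""
    best = {}
    for q in questions:
        current = best.get(q["tag"])
        if current is None or q["score"] > current["score"]:
            best[q["tag"]] = q
    return best


sample = [
    {"tag": "loyer-abusif", "subreddit": "paris", "score": 93},
    {"tag": "loyer-abusif", "subreddit": "france", "score": 12},  # made up
    {"tag": "dpe-invalide", "subreddit": "immobilier", "score": 20},  # made up
]

hero = top_per_tag(sample)
for tag, q in hero.items():
    print(tag, q["subreddit"], q["score"])
```

Ties keep the first question seen, so the input order decides between equal scores.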

The output: 776 lines of HTML + a CC-BY 4.0 JSON dataset

build_questions_reelles_page_run367.py generated /questions-reelles-locataires-fr.html with:

JSON-LD: WebPage + BreadcrumbList + FAQPage (3 legal questions) + Dataset + Organization
35 questions by category, anonymized (id_hash only, subreddit + score shown)
Direct links to the legal templates matching each question tag
CC-BY 4.0 downloadable JSON dataset (27 KB)

Dataset schema:

{
  "metadata": {
    "title": "Questions réelles de locataires FR — Reddit 2026-05",
    "source": "Reddit public search API",
    "license": "CC-BY 4.0",
    "scrape_date": "2026-05-27",
    "filter_methodology": "tag-anchor title match + renter scope + blacklist anti-noise",
    "total_questions": 35
  },
  "questions": [
    {
      "id_hash": "e95b9fed0029",
      "tag": "loyer-abusif",
      "title": "Mais qui respecte l'encadrement des loyers pour les petites surfaces ?",
      "subreddit": "paris",
      "score": 93,
      "num_comments": 41
    }
  ]
}
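The scraper stores the raw Reddit post ID under id_hash, while the published sample shows a 12-character hex value, so some anonymization step presumably sits in between. The post does not say which one. The sketch below assumes a truncated SHA-256; `anonymize_id`, `build_dataset`, and the sample post are hypothetical.

```python
import hashlib
import json


def anonymize_id(post_id, length=12):
    """Assumed scheme: first `length` hex chars of SHA-256(post_id)."""
    return hashlib.sha256(post_id.encode("utf-8")).hexdigest()[:length]


def build_dataset(raw_posts, scrape_date):
    """Copy only the published fields, so usernames never reach the output."""
    questions = [
        {
            "id_hash": anonymize_id(p["id"]),
            "tag": p["tag"],
            "title": p["title"],
            "subreddit": p["subreddit"],
            "score": p["score"],
            "num_comments": p["num_comments"],
        }
        for p in raw_posts
    ]
    return {
        "metadata": {
            "source": "Reddit public search API",
            "license": "CC-BY 4.0",
            "scrape_date": scrape_date,
            "total_questions": len(questions),
        },
        "questions": questions,
    }


# Made-up raw post; "author" is present to show that it gets dropped.
raw = [{
    "id": "1abcd2", "author": "someone", "tag": "loyer-abusif",
    "title": "Loyer trop élevé ?", "subreddit": "paris",
    "score": 10, "num_comments": 4,
}]
dataset = build_dataset(raw, "2026-05-27")
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Building each record from an explicit field list, rather than deleting unwanted keys, means a new field in Reddit's response cannot leak into the dataset by accident.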

The homepage hero swap — 12 hours later

Strategic audit-27 arrived at 10:00Z the next morning with a single prescription: use the corpus to replace the hypothetical homepage intro. The agent picked 3 questions (one per tag, highest score), added citation badges, and modified 22 lines of index.html.

Before:

<p>Vous vous demandez peut-être si votre loyer respecte l'encadrement légal,
si votre DPE est valide, si votre propriétaire peut garder votre caution...</p>

After: 3 real questions with r/paris · 93 votes badges. A CTA linking to all 35 questions. Commit 47404ed, smoke-tested via curl -s https://bailleurverif.fr/ | grep "loyer abusif" ✅.
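Scraped titles are untrusted text, so they should be escaped before they land in index.html. A minimal sketch of rendering one hero item with its citation badge follows; the markup and class names are hypothetical, not the site's actual HTML.

```python
from html import escape


def render_question(question):
    """Render one hero item; the class names are hypothetical."""
    title = escape(question["title"])
    subreddit = escape(question["subreddit"])
    score = int(question["score"])
    return (
        f'<li class="real-question">{title} '
        f'<span class="badge">r/{subreddit} · {score} votes</span></li>'
    )


item = render_question({
    "title": "Mais qui respecte l'encadrement des loyers pour les petites surfaces ?",
    "subreddit": "paris",
    "score": 93,
})
print(item)
```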

The whole cycle — audit-26 prescribed, agent built and shipped, audit-27 prescribed homepage swap, agent shipped — completed in under 16 hours, with zero messages from me.

The side discovery: 60% of "direct" visits were bots

While auditing the funnel this week, the agent cross-referenced session IDs against raw UA strings in visits.jsonl. The result: 159 raw "direct" sessions → 63 plausibly human after UA filtering.

The filter:

import re

BOT_UA_PATTERNS = [
    "Googlebot", "Applebot", "Yandex", "HeadlessChrome", "GoogleOther",
    "PerplexityBot", "GPTBot", "ClaudeBot", "CCBot", "Bytespider",
    "bot", "crawler", "spider",
]
HEADLESS_CLEAN_VERSION = re.compile(r"Chrome/14[5-9]\.0\.0\.0")
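# Heuristic used here: a zeroed build number such as "148.0.0.0" is treated as a headless signal.
# Caveat: since Chrome's User-Agent Reduction, regular Chrome also reports "<major>.0.0.0" in its
# UA string, so this pattern can flag real visitors; treat it as a weak signal on its own.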

def is_bot(ua: str) -> bool:
    ua_lower = ua.lower()
    if any(b.lower() in ua_lower for b in BOT_UA_PATTERNS):
        return True
    return bool(HEADLESS_CLEAN_VERSION.search(ua))

Two visits the agent had labeled "plausible human" in earlier runs turned out to be Googlebot Nexus 5X and YandexRenderResourcesBot. The agent now tracks direct_humans_after_ua_filter_lifetime = 63 as a separate metric from raw visit counts.

The 60% noise estimate is probably conservative: visits with sessionId=null (JS not executed) add another layer of bot signal that the agent is now checking.
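The two signals, UA markers and a missing session ID, can be combined in one pass over the log. This is a sketch under assumptions: the `ua` field name is a guess (only `sessionId` is named in the post), the marker list is shortened, and the sample lines are made up. The last lines redo the arithmetic on the post's own numbers.

```python
import json

BOT_MARKERS = ("bot", "crawler", "spider", "headlesschrome")


def looks_human(visit):
    """True when the UA has no bot marker and JS set a session ID."""
    ua = (visit.get("ua") or "").lower()
    if any(marker in ua for marker in BOT_MARKERS):
        return False
    return visit.get("sessionId") is not None


def human_share(jsonl_lines):
    """Return (plausibly human visits, total visits) for JSONL input."""
    visits = [json.loads(line) for line in jsonl_lines if line.strip()]
    humans = sum(1 for v in visits if looks_human(v))
    return humans, len(visits)


# Made-up log lines in the assumed visits.jsonl shape.
lines = [
    '{"ua": "Mozilla/5.0 (compatible; Googlebot/2.1)", "sessionId": null}',
    '{"ua": "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0", "sessionId": "s1"}',
    '{"ua": "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0", "sessionId": null}',
]
humans, total = human_share(lines)
print(f"{humans}/{total} plausibly human")

# Arithmetic from the post: 159 raw sessions, 63 left after UA filtering.
bot_share = (159 - 63) / 159
print(f"{bot_share:.1%}")
```

Treating a null session ID as a bot signal will also drop real visitors who block JavaScript, so the resulting count is a lower bound on humans rather than an exact figure.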

Lessons from this cycle

Reddit public JSON is underused for user research. No auth was needed for this scrape, a self-imposed 2-second delay between requests kept the load polite, and you get score + comment count as a free engagement proxy. Query it before writing your landing page copy.
Filter harder than you scrape. 50 → 35 with strict title-anchor + scope criteria. The 15 removed would have diluted the page signal and made the homepage swap weaker. Quality over volume.
Real user questions beat copywriter hypotheses. We had no surveys and no user interviews, but 26 questions tagged loyer-abusif, with scores ranging from 8 to 347, are more actionable than "you might be wondering if..."
Track humans_after_ua_filter from day one. Raw visit counts are noise-heavy. Cross-reference UA strings. Ship a derived counter early.

🔗 Source code (MIT): github.com/Creariax5/bailleurverif · Site: bailleurverif.fr · Wikidata: Q139857638
