
New Way Capital Advisory


We caught ChatGPT answering property questions with our data -- here's the nginx log proof

The moment we noticed

Yesterday, while running our routine nginx log analysis, we spotted something unusual:

51.107.70.192 - [30/Mar/2026:19:42:35 +0000] "property.nwc-advisory.com"
"GET /prices/sk10-1ae HTTP/2.0" 200 4623 "-"
"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible;
ChatGPT-User/1.0; +https://openai.com/bot"

A ChatGPT-User bot fetched one of our UK property landing pages. Not GPTBot (the training crawler). Not OAI-SearchBot (the indexer). The ChatGPT-User agent -- the one that fetches pages in real time to answer a user's question.

Someone asked ChatGPT about house prices near Macclesfield (SK10 postcode), and ChatGPT pulled the answer from our page.
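That log line comes from a custom nginx log_format (the vhost sits in its own quoted field), so stock parsers may trip on it. Here is a minimal parser for that layout, assuming the field order shown above:

```python
import re

# Field order of the custom log_format shown above:
# ip - [time] "host" "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) - \[(?P<time>[^\]]+)\] '
    r'"(?P<host>[^"]+)" "(?P<request>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line: str):
    """Split one access-log line into named fields; None if it doesn't match."""
    m = LOG_RE.search(line)
    return m.groupdict() if m else None
```

Run it over the line above and the `ua` field contains `ChatGPT-User/1.0` -- that substring is the signal everything below filters on.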

What's the difference between OpenAI's bots?

OpenAI operates three distinct crawlers. In the same 2-hour window, we saw all three:

Bot            User-Agent         Purpose                   Our logs
GPTBot         GPTBot/1.3         Training data collection  Crawling sitemaps, homepages
OAI-SearchBot  OAI-SearchBot/1.3  Search index building     Crawling UK landing pages (SK postcodes), robots.txt across all 11 domains
ChatGPT-User   ChatGPT-User/1.0   Live query answering      Fetched /prices/sk10-1ae for a real user

The ChatGPT-User hit is the interesting one. It means our data is being served to end users in real time through ChatGPT's search feature. The user never visits our site -- they get the answer inside their chat.
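A quick way to separate the three agents in your own logs is a substring tally. A sketch (it matches anywhere in the line, which is fine because these tokens only ever appear in the User-Agent field):

```python
from collections import Counter

OPENAI_AGENTS = ("ChatGPT-User", "OAI-SearchBot", "GPTBot")

def tally_openai(log_lines) -> Counter:
    """Count requests per OpenAI agent by User-Agent substring."""
    counts = Counter()
    for line in log_lines:
        for agent in OPENAI_AGENTS:
            if agent in line:
                counts[agent] += 1
                break  # at most one agent per request
    return counts
```

A non-zero `counts["ChatGPT-User"]` is the "someone is asking about our data right now" signal.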

The numbers

In a 2-hour window on March 30, 2026:

OpenAI bots:  22 requests (GPTBot + OAI-SearchBot + ChatGPT-User)
Googlebot:     4 requests

In that window, OpenAI crawled our pages more than five times as often as Google (22 requests vs 4).

Here's what OAI-SearchBot was systematically crawling:

property.nwc-advisory.com       GET /prices/sk1-1aa
property.nwc-advisory.com       GET /prices/sk10-1ae
property.nwc-advisory.com       GET /prices/sk11-0aa
property.nwc-advisory.com       GET /prices/sk12-1aa
property.nwc-advisory.com       GET /prices/sk13-0aa
property.nwc-advisory.com       GET /prices/sk14-1aa
property.nwc-advisory.com       GET /sitemap.xml
property.nwc-advisory.com       GET /robots.txt
property-chi.nwc-advisory.com   GET /robots.txt
property-dxb.nwc-advisory.com   GET /robots.txt
property-ie.nwc-advisory.com    GET /robots.txt
property-miami.nwc-advisory.com GET /robots.txt
property-phl.nwc-advisory.com   GET /robots.txt
property-sg.nwc-advisory.com    GET /robots.txt
property-tw.nwc-advisory.com    GET /robots.txt

It's reading the sitemaps, then crawling individual landing pages. Systematically.

What we built (and why AI can read it)

We operate property comparable sales apps across 11 markets (UK, France, Singapore, NYC, Chicago, Miami, Philadelphia, Connecticut, Dubai, Ireland, Taiwan), backed by 35M+ government-source transactions.

For SEO, we generated 9,100+ static landing pages -- one per postcode/ZIP/area:

/prices/sw1a-1aa    → London Westminster
/prices/sk10-1ae    → Macclesfield
/prix/75001         → Paris 1er
/comps/10001        → Midtown Manhattan

Each page contains:

  • Median price, average price, price range
  • Price per sqft/m2 statistics
  • Recent comparable sales with addresses
  • Stamp duty / notary fee calculators
  • FAQ schema (JSON-LD)
  • Nearby area links

The key design decision: no login walls, no JavaScript-rendered content, no gated data. Every landing page is server-rendered HTML with structured data that any crawler can parse.
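Generating thousands of such pages is mostly a loop over a stats table. A minimal sketch -- the template and `stats` fields here are stand-ins, not our actual pipeline:

```python
from pathlib import Path
from string import Template

# Hypothetical one-line template; the real pages carry the full stats grid,
# calculators, and JSON-LD blocks described in this post.
PAGE = Template(
    "<h1>Property Prices in $postcode</h1>\n"
    '<span class="value">£$median</span>\n'
)

def build_pages(stats_by_postcode: dict, out_dir: str) -> int:
    """Write one static HTML file per postcode; return how many were built."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for postcode, stats in stats_by_postcode.items():
        slug = postcode.lower().replace(" ", "-")  # "SK10 1AE" -> "sk10-1ae"
        html = PAGE.substitute(postcode=postcode, median=f"{stats['median']:,}")
        (out / f"{slug}.html").write_text(html, encoding="utf-8")
    return len(stats_by_postcode)
```

Because the output is plain files, nginx serves them with no application server in the hot path.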

The page structure that AI loves

<!-- Clean semantic HTML -->
<h1>Property Prices in SK10 1AE, Macclesfield</h1>

<div class="stats-grid">
  <div class="stat">
    <span class="label">Median Price</span>
    <span class="value">£285,000</span>
  </div>
  <div class="stat">
    <span class="label">Price per sqft</span>
    <span class="value">£198</span>
  </div>
</div>

<!-- JSON-LD structured data -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebApplication",
  "name": "Property Comparable Sales - SK10",
  "description": "Recent property sales near SK10 1AE, Macclesfield",
  "url": "https://property.nwc-advisory.com/prices/sk10-1ae"
}
</script>

<!-- FAQ schema for rich snippets -->
<script type="application/ld+json">
{
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the average house price in SK10?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The average house price in SK10 is £312,450..."
      }
    }
  ]
}
</script>
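Rather than hand-writing those blocks, they can be emitted from the stats themselves. A sketch of such a generator (the function name and signature are illustrative):

```python
import json

def faq_jsonld(postcode: str, avg_price: int) -> str:
    """Render a FAQPage JSON-LD block for one landing page."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [{
            "@type": "Question",
            "name": f"What is the average house price in {postcode}?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": f"The average house price in {postcode} is £{avg_price:,}.",
            },
        }],
    }
    # ensure_ascii=False keeps "£" readable instead of "\u00a3"
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'
```

Generating the markup from the same numbers that feed the visible stats grid also guarantees the structured data never drifts out of sync with the page.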

The SEO stack

nginx (SSL, bot-block with search engine whitelist)
  │
  ├── /prices/{postcode}  →  Static HTML (2,308 UK pages)
  ├── /prix/{code_postal} →  Static HTML (5,851 FR pages)
  ├── /comps/{zip}        →  Static HTML (per US market)
  │
  ├── sitemap.xml          →  All page URLs
  ├── robots.txt           →  Allow everything except /v1/ and /api/
  └── IndexNow key         →  Instant Bing/Yandex notification

Key technical decisions:

1. robots.txt allows AI bots

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /v1/
Disallow: /api/

Many sites block GPTBot. We deliberately allow it. The API endpoints are disallowed for the catch-all group, while the landing pages are open to everyone.
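It's worth checking a robots.txt with the standard library's parser before shipping it, because it surfaces a subtlety in the file above (the bot names in the asserts are just examples):

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /v1/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# The * disallows protect /v1/ and /api/ from bots without their own group...
assert not rp.can_fetch("SomeRandomBot", "/api/comps")
assert rp.can_fetch("SomeRandomBot", "/prices/sk10-1ae")

# ...but a bot that matches a named group ignores the * group entirely, so
# GPTBot's blanket "Allow: /" opens /api/ to it as well. If the API should be
# closed to AI bots too, repeat the Disallow lines inside their groups.
assert rp.can_fetch("GPTBot", "/prices/sk10-1ae")
assert rp.can_fetch("GPTBot", "/api/comps")
```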

2. Server-rendered, not SPA

Each landing page is a complete HTML document. No client-side rendering, no JavaScript required to see the data. ChatGPT-User fetches the HTML and can immediately extract the statistics.

3. Canonical URLs and sitemaps

Every page has <link rel="canonical"> and is listed in the sitemap. This tells crawlers exactly which pages exist and which URL is authoritative.
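A sitemap for this many pages is easy to generate programmatically. A sketch using only the standard library, with URLs following the /prices/{slug} pattern above:

```python
import xml.etree.ElementTree as ET

def build_sitemap(base_url: str, slugs) -> bytes:
    """Render a minimal sitemap.xml for the /prices/ landing pages."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for slug in slugs:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{base_url}/prices/{slug}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

Regenerating the sitemap in the same job that builds the pages keeps the two from drifting apart.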

4. Structured data (JSON-LD)

FAQPage schema means the questions and answers are machine-readable. When ChatGPT needs "What is the average house price in SK10?", it can extract the answer directly from the structured data.
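To see what a crawler can pull out, here is a sketch of the extraction from the consumer side: find the JSON-LD blocks, keep the FAQPage ones, and map questions to answers (we don't know OpenAI's actual pipeline; this just shows the data is trivially reachable):

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collect every <script type="application/ld+json"> payload in a page."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buf)))

def faq_answers(html: str) -> dict:
    """Map each FAQ question to its answer text, the way a crawler might."""
    parser = JSONLDExtractor()
    parser.feed(html)
    return {
        q["name"]: q["acceptedAnswer"]["text"]
        for block in parser.blocks if block.get("@type") == "FAQPage"
        for q in block.get("mainEntity", [])
    }
```

No rendering, no heuristics: the question/answer pairs fall straight out of the markup.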

How we detect AI bot traffic

Our visitor analysis pipeline (Python, runs against nginx logs) classifies every request:

# Bot classification
AI_BOTS = {
    'ChatGPT-User':   'Live query - someone asked ChatGPT',
    'GPTBot':         'Training data collection',
    'OAI-SearchBot':  'Search index building',
    'ClaudeBot':      'Anthropic training',
    'PerplexityBot':  'Perplexity search',
    'Bytespider':     'TikTok/ByteDance',
}

SEARCH_BOTS = {
    'Googlebot':      'Google Search indexing',
    'bingbot':        'Bing Search indexing',
    'Applebot':       'Apple/Siri search',
    'DuckDuckBot':    'DuckDuckGo indexing',
}

def classify_request(user_agent: str) -> str:
    ua_lower = user_agent.lower()
    for bot, purpose in AI_BOTS.items():
        if bot.lower() in ua_lower:
            return f"AI Bot: {purpose}"
    for bot, purpose in SEARCH_BOTS.items():
        if bot.lower() in ua_lower:
            return f"Search: {purpose}"
    return "Human visitor"

We run this daily against our nginx logs. The trend over the past few weeks: AI bot traffic is growing faster than search bot traffic.

What this means for developers

If you're building a data-heavy application, the old playbook was:

Build app → SEO → Rank on Google → Users find you → They visit

The new playbook is:

Build app → Structured data → AI crawls you → Users get your data via ChatGPT
                                              (they may never visit your site)

This is both exciting and unsettling. Exciting because your data reaches users through a completely new channel. Unsettling because those users never see your UI, your brand, or your conversion funnel.

Practical takeaways

If you want AI to use your data:

  1. Don't block AI bots in robots.txt (unless you have a reason to)
  2. Server-render your pages -- ChatGPT-User can't execute JavaScript
  3. Use structured data (JSON-LD, schema.org) -- makes extraction trivial
  4. Generate landing pages for long-tail queries -- AI searches the same way humans do
  5. Keep data ungated -- if it's behind a login, AI can't reach it
  6. Maintain sitemaps -- AI bots follow them just like Googlebot

If you want to track it:

  • Parse your nginx/access logs for ChatGPT-User, GPTBot, OAI-SearchBot
  • ChatGPT-User = your data is being served to real users right now
  • OAI-SearchBot = your pages are being indexed for future queries
  • GPTBot = your content may be used for model training

The broader picture

Google is still the primary search engine. But in our 2-hour sample, OpenAI's crawlers outnumbered Googlebot by more than 5:1. And while Google has been crawling our 11 domains for weeks without indexing most pages (a domain-authority problem), ChatGPT is already serving our data to users.

For a small startup with no domain authority, no backlinks, and a brand-new domain, this is significant. The traditional SEO path (build authority → get indexed → rank → get traffic) takes months to years. The AI search path (structured data → get crawled → get cited) can happen in weeks.

We're not saying Google doesn't matter. We're saying there's a new, parallel distribution channel, and it favors clean data over domain authority.


Our stack: Python (FastAPI) + SQLite + nginx + static HTML generators. 11 property markets, 35M+ transactions from government open data sources. Free to search at property.nwc-advisory.com.

The API is also available on RapidAPI and as an MCP server for AI agents.


What AI bot traffic are you seeing in your logs? Have you caught ChatGPT-User fetching your pages? Drop a comment -- I'm curious whether this is a broader trend or we're just early.
