Caspar Bannink

Posted on • Originally published at homescout.io

Building a Real-Time Listing Alert System: Polling, Webhooks, and Monitoring 90+ Sites

The alert system is the most user-critical feature in a rental aggregator. Users care most about being notified quickly when something matches their criteria. Here's how I built it, what I evaluated first, and where the interesting tradeoffs are.

The options: webhooks, RSS, polling

Webhooks would be ideal. Source site posts a listing, fires a webhook to your system, you process and alert within seconds. Zero wasted requests, minimal latency.

Reality: almost no rental portals expose webhooks. This isn't a technical limitation on their end. They just haven't built it, and in many cases scraping is technically prohibited by their ToS. The webhook path is effectively unavailable for most sources.

RSS feeds are available from a small number of sources. Daft used to have them. A few smaller sites still publish them. Where they exist they're great: structured, cacheable, low overhead. But coverage is limited.

Polling is what you actually use. You hit each source on a schedule, parse the results, diff against what you've already seen, and trigger alerts for new listings. It's the slowest and most resource-intensive option, but it's the one that works across all sources.

The system design challenge with polling is making it fast enough to be useful without hammering sources.

The polling architecture

Each source has a crawler that runs on a schedule. Crawlers are classified by priority:

  • High-priority sources (main portals, high listing volume): crawl every 15-30 minutes
  • Medium-priority sources (secondary portals, mid-tier agencies): crawl every 1-2 hours
  • Low-priority sources (small agencies, long-tail sites): crawl every 4-12 hours
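The tiers above translate fairly directly into a small scheduling table. Here is a minimal sketch, not the production scheduler: `CRAWL_INTERVALS` and `next_crawl_at` are hypothetical names, and drawing a random point inside each tier's range is an assumption made here so that crawls do not all land at the same moment.

```python
import random
from datetime import datetime, timedelta
from typing import Optional

# Interval ranges in minutes per priority tier, taken from the list above.
CRAWL_INTERVALS = {
    "high": (15, 30),
    "medium": (60, 120),
    "low": (240, 720),
}


def next_crawl_at(
    priority: str,
    last_crawl: datetime,
    rng: Optional[random.Random] = None,
) -> datetime:
    """Pick the next crawl time inside the tier's interval range.

    The random draw is an assumption of this sketch: it spreads requests
    out instead of hitting every source in a tier at once.
    """
    rng = rng or random.Random()
    low, high = CRAWL_INTERVALS[priority]
    return last_crawl + timedelta(minutes=rng.uniform(low, high))
```

A fixed interval per tier would work just as well; the range form simply keeps the numbers from the list visible in one place.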

Within a crawl, the crawler fetches the listing index (search results page or feed), extracts listing IDs and basic metadata, and compares against the stored state. Only changed or new listings trigger a full extraction. This keeps full-page requests proportional to the change rate, not the crawl rate.

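async def crawl_source(source_config: SourceConfig) -> list[RawListing]:
    """Return full records for listings not seen before on this source.

    SourceConfig, RawListing, fetch_index, fetch_full_listing and db are
    defined elsewhere. Changed listings are picked up by the hash check
    covered under "Diffing and change detection".
    """
    # Fetch listing index (search results or feed)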
    index_items = await fetch_index(source_config)

    # Load stored state for this source
    stored = await db.get_listing_ids(source_config.source_id)
    stored_set = {item.external_id for item in stored}

    new_ids = {item.external_id for item in index_items} - stored_set

    # Only fetch full listing pages for new listings
    new_listings = []
    for item in index_items:
        if item.external_id in new_ids:
            listing = await fetch_full_listing(item.url, source_config)
            new_listings.append(listing)

    return new_listings

This approach means most crawl cycles result in zero full-page fetches. The overhead scales with the listing change rate of the source, not the total inventory size.

Diffing and change detection

When a listing is fetched, I store a content hash alongside the structured data. On subsequent crawls:

  • Same hash: no change, skip
  • Different hash: fetch full listing, update stored record, check if the change affects any active alerts
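The hash comparison can be sketched as follows. This is an illustration under assumptions rather than the actual pipeline code: `content_hash` and `classify` are hypothetical names, and the hash is assumed to cover whatever fields are available at comparison time, serialized as canonical JSON so that key order does not affect the result.

```python
import hashlib
import json


def content_hash(fields: dict) -> str:
    """Stable SHA-256 over the listing fields; key order does not matter."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(
    index_items: dict[str, dict],
    stored_hashes: dict[str, str],
) -> dict[str, list[str]]:
    """Split the listing IDs from one crawl into new / changed / unchanged."""
    result: dict[str, list[str]] = {"new": [], "changed": [], "unchanged": []}
    for external_id, fields in index_items.items():
        digest = content_hash(fields)
        if external_id not in stored_hashes:
            result["new"].append(external_id)
        elif stored_hashes[external_id] != digest:
            result["changed"].append(external_id)
        else:
            result["unchanged"].append(external_id)
    return result
```

Only the IDs in `new` and `changed` would then go on to a full-page fetch.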

Price changes get special handling. A listing that drops in price might not be "new" but it becomes relevant to users who were previously priced out. The alert system checks price-sensitive alerts when price changes are detected, not just on new listings.
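A small sketch of the price-drop check, with hypothetical names (`PriceAlert`, `newly_affordable`) and the simplifying assumption that each price-sensitive alert is just an upper price limit:

```python
from dataclasses import dataclass


@dataclass
class PriceAlert:
    search_id: str
    max_price: int  # the user's upper price limit


def newly_affordable(
    old_price: int,
    new_price: int,
    alerts: list[PriceAlert],
) -> list[str]:
    """Return IDs of saved searches the old price excluded but the new price fits."""
    if new_price >= old_price:
        return []  # price rose or stayed flat: nobody becomes newly eligible
    return [a.search_id for a in alerts if new_price <= a.max_price < old_price]
```

Searches whose limit was already above the old price are left out on purpose, since those users would have been alerted when the listing first appeared.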

Listings that disappear from the index are marked as potentially taken. I don't immediately remove them from user views, because sources sometimes de-list a listing temporarily without it actually having been let. I wait until the listing has been absent for two consecutive crawl cycles before marking it as gone and stopping alerts on it.
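The two-cycle rule amounts to a per-listing counter of consecutive misses. A minimal sketch, assuming a hypothetical `update_absence` helper and an in-memory dict standing in for what would be a database column:

```python
def update_absence(
    miss_counts: dict[str, int],
    seen_ids: set[str],
    threshold: int = 2,
) -> set[str]:
    """Update miss counters after one crawl and return the IDs now considered gone.

    miss_counts maps external_id -> number of consecutive crawls in which the
    listing was absent from the index. It is mutated in place.
    """
    gone: set[str] = set()
    for external_id in list(miss_counts):
        if external_id in seen_ids:
            miss_counts[external_id] = 0  # reappeared: reset the streak
        else:
            miss_counts[external_id] += 1
            if miss_counts[external_id] >= threshold:
                gone.add(external_id)
                del miss_counts[external_id]
    for external_id in seen_ids:
        miss_counts.setdefault(external_id, 0)  # start tracking new listings
    return gone
```

A listing that vanishes for a single crawl and then comes back has its counter reset, so it never reaches the threshold.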

The alert matching engine

Users set up saved searches with typed criteria: price range, beds, area, transport proximity, pet policy, and optionally a freetext description preference.

When new listings come in, each one is scored against all active saved searches. The matching is two-stage:

  1. Structured filter check (hard pass/fail): does the listing meet the hard criteria (price, beds, area)? If not, stop. This is fast and handles most rejections.

  2. Soft preference scoring (for listings that pass the filter): score the listing against any freetext preferences using the embedding similarity approach described in the search article. Listings above a similarity threshold trigger an alert. Listings below it go into a "possible match" digest rather than an instant alert.

This prevents freetext preferences from flooding users with weak matches while still surfacing them at lower priority.
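The two stages can be sketched like this. Everything here is illustrative: `Listing`, `SavedSearch` and `match` are hypothetical names, the 0.75 threshold is a placeholder rather than the tuned value, embeddings are assumed to be precomputed, and a search without a freetext preference is assumed to alert as soon as the hard filters pass.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listing:
    price: int
    beds: int
    area: str
    embedding: list[float]


@dataclass
class SavedSearch:
    max_price: int
    min_beds: int
    areas: set[str]
    preference_embedding: Optional[list[float]] = None


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match(listing: Listing, search: SavedSearch, threshold: float = 0.75) -> str:
    """Return 'reject', 'instant' or 'digest' for one listing/search pair."""
    # Stage 1: hard filters, cheap and decisive.
    if (
        listing.price > search.max_price
        or listing.beds < search.min_beds
        or listing.area not in search.areas
    ):
        return "reject"
    # Assumption: with no freetext preference, passing the filters is enough.
    if search.preference_embedding is None:
        return "instant"
    # Stage 2: soft score against the freetext preference.
    score = cosine(listing.embedding, search.preference_embedding)
    return "instant" if score >= threshold else "digest"
```

Because stage 1 runs first, the similarity computation only happens for the minority of pairs that survive the hard filters.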

Delivery and deduplication

Alerts go via email by default. The delivery layer handles:

  • Per-user rate limiting: a user with five saved searches shouldn't get fifteen emails in a minute if fifteen listings come in. I batch alerts within a short window and send a digest.
  • Cross-search deduplication: the same listing can match multiple saved searches for the same user. It gets mentioned once in the alert, with a note about which searches it matched.
  • Platform deduplication: the same physical property sometimes appears on multiple sources. After normalization and deduplication, only one record per property exists. Alert matching happens on deduplicated records, so users don't get alerted twice for the same apartment.
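Batching and cross-search deduplication reduce to grouping matches by user and then by listing. A minimal sketch, assuming a hypothetical `build_digests` helper that receives all the matches collected during one batching window as `(user_id, search_name, listing_id)` tuples:

```python
from collections import defaultdict


def build_digests(
    matches: list[tuple[str, str, str]],
) -> dict[str, dict[str, list[str]]]:
    """Group (user_id, search_name, listing_id) matches into one digest per user.

    Each listing appears once per user, together with the names of every
    saved search it matched.
    """
    digests: dict[str, dict[str, list[str]]] = defaultdict(dict)
    for user_id, search_name, listing_id in matches:
        searches = digests[user_id].setdefault(listing_id, [])
        if search_name not in searches:
            searches.append(search_name)
    return dict(digests)
```

Each top-level key then becomes one email, so a user gets a single message per window no matter how many searches or listings were involved.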

Where the latency actually lives

The end-to-end latency from a listing being posted to a user getting an alert is the sum of:

  • Time until next crawl of that source (0-30 minutes for high-priority sources)
  • Extraction and normalization time (~seconds)
  • Alert matching time (~seconds for the active alert set)
  • Email delivery (~seconds to minutes depending on provider)

For high-priority sources the realistic alert latency is 5-35 minutes. That's competitive with Daft's own alert system, and it covers sources Daft doesn't.

The gap I haven't solved: if a listing goes live and gets taken at 2:00am while the next crawl is at 2:20am, users will get an alert for something that's already gone. This is a fundamental limitation of polling-based systems. I note it clearly in the product rather than pretending it doesn't happen.

Infrastructure

The crawl jobs run as async workers. I use a task queue (Celery with Redis as broker) to schedule crawls, with priority queues for high-priority sources. The queue approach means I can add crawl workers horizontally when source count grows.

Postgres handles all storage. The listings table has a GIN index on the JSONB amenities column, and standard B-tree indexes on price, beds, area, and source. Alert matching queries run in under 100ms for the current dataset size.

I wrote a user-facing version of how alerts work at https://homescout.io/guide/free-rental-alerts-dublin. This post is the internals behind it.


Caspar Bannink. Founder of HomeScout.io. Building AI-powered rental search for Dublin.
