Ben Utting

Posted on • Originally published at ctrlaltautomate.com

I Built a Skill That Pulls Any Australian Real Estate Agent's Sales History in 60 Seconds

Researching a real estate agent's track record in Australia is painful. You open Domain, search the agent's name, click through to their profile, scroll the sold tab, copy details into a spreadsheet. Then you do it again on realestate.com.au because the two portals show different listings. Then maybe RateMyAgent for a third pass. Cross-reference the addresses, figure out which sales appear on both, fill in the gaps where one portal shows the price and the other says "Contact Agent."

For one agent, that's 30 to 90 minutes of clicking, scrolling, and copy-pasting. For a shortlist of five agents, it's an afternoon gone. And you still end up with a messy spreadsheet that's missing half the prices.

I built this tool for a client who needed to vet agents across multiple suburbs in Sydney. After watching them do it manually twice (once for Paddington, once for Mosman), I automated the whole pipeline. Input an agent's name, get back structured data: every recent sale with address, price, date, bedrooms, bathrooms, car spaces, days on market, and a link back to the original listing.

Now I'm giving it away for free. Here's how it works and how to set it up.

The problem in detail

The Australian real estate market has no single source of truth for agent performance. Domain and realestate.com.au both show sold listings, but they show different ones. An agent might have 12 sold properties on Domain and 8 on REA, with only 5 overlapping. Prices are inconsistent too: Domain shows a sold figure while REA says "Contact Agent" for the same property. Neither portal offers a structured API for agent-level sold data.

If you're a vendor choosing which agent to list with, you're supposed to make a decision worth tens of thousands in commission based on whatever curated stats the agent puts in their pitch deck. If you're a buyer's agent who needs to know who dominates a suburb, you're doing this research weekly across dozens of agents.

The data exists. It's just trapped behind JavaScript-rendered portals with no export button.

What the tool does

You give it an agent's name. It gives you their recent sold listings as clean, structured data.

uv run main.py search --agent "Sarah Mitchell" --agency "Ray White" --suburb Paddington --state NSW

Out comes a JSON array:

[
  {
    "agent_name": "Sarah Mitchell",
    "agency_name": "Ray White Paddington",
    "property_address": "14 Glenmore Road",
    "suburb": "Paddington",
    "state": "NSW",
    "postcode": "2021",
    "property_type": "house",
    "bedrooms": 3,
    "bathrooms": 2,
    "car_spaces": 1,
    "sold_price": 2150000,
    "sold_date": "2024-11-03",
    "days_on_market": 22,
    "listing_url": "https://www.domain.com.au/...",
    "source_portal": "domain.com.au"
  },
  {
    "agent_name": "Sarah Mitchell",
    "agency_name": "Ray White Paddington",
    "property_address": "7/88 Oxford Street",
    "suburb": "Paddington",
    "state": "NSW",
    "postcode": "2021",
    "property_type": "unit",
    "bedrooms": 2,
    "bathrooms": 1,
    "car_spaces": 1,
    "sold_price": 980000,
    "sold_date": "2024-10-18",
    "days_on_market": 14,
    "listing_url": "https://www.realestate.com.au/...",
    "source_portal": "realestate.com.au"
  }
]

Up to 15 listings per search: results from Domain and REA combined, deduplicated, with price data merged from whichever portal actually shows it. Output goes to the terminal as JSON by default, or you can export to CSV, append to Google Sheets, or fire a webhook to n8n.

Who this is for

Vendors choosing an agent. You're about to sign an exclusive agreement worth 2% of your property's value. Instead of trusting the agent's own marketing, pull their actual sales data. How many properties have they sold in the last 6 months? What's their average days on market? Are they actually selling in your suburb or three suburbs away?

Buyer's agents. You need to know who dominates a pocket. Run it across the top 5 agents in a suburb and compare sold volumes, property types, and price ranges. Do it weekly and you'll spot patterns before your competitors do.

Agency owners. Benchmarking a recruit's claims against reality. They say they sold $40M last year. Pull the data and check.

Mortgage brokers and valuers. Recent comparable sales filtered by the selling agent, not just by suburb. Useful for building a picture of market activity in a specific area.

Proptech teams. Building agent comparison products, performance databases, or market intelligence dashboards. This tool gives you the raw data layer.

Anyone who currently does this research with 6 browser tabs and a spreadsheet.

How it works under the hood

The tool runs a five-stage pipeline. Each stage is its own module, testable independently.

Stage 1: Agent discovery

The first challenge is finding the agent's canonical profile URL on each portal. The tool searches Domain's agent directory (domain.com.au/find-agent) and REA's agent pages using httpx and BeautifulSoup. It parses the search results to extract the profile link and internal agent ID.

Common names are a problem. "John Smith" at "Ray White" might match four different agents across NSW, VIC, and QLD. Adding a suburb narrows it significantly. If the direct lookup fails entirely (some agents have unusual profile URL structures), the tool falls back to a Google search via SerpAPI, looking for profile pages across both portals.
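
For a rough idea of the shape of this stage, here's a minimal sketch with httpx and BeautifulSoup. The query parameter and result-card selector are illustrative guesses, not Domain's actual markup; that markup changes without notice, which is exactly why discovery lives in its own module.

# Sketch of Stage 1: resolve an agent's Domain profile URL.
# The "query" param and the CSS selector below are illustrative,
# not Domain's real interface.
import httpx
from bs4 import BeautifulSoup

def find_domain_profile(agent: str, suburb: str | None = None) -> str | None:
    query = f"{agent} {suburb}" if suburb else agent
    resp = httpx.get(
        "https://www.domain.com.au/find-agent/",
        params={"query": query},
        headers={"User-Agent": "Mozilla/5.0"},
        follow_redirects=True,
        timeout=15,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Hypothetical: take the first result card's profile link.
    link = soup.select_one("a[href*='/real-estate-agent/']")
    return link["href"] if link else None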

Stage 2: Sold listings scrape

Once the profile URLs are resolved, Playwright launches headless Chromium and loads the agent's "Sold" tab on each portal. This is where it gets tricky: both portals render listing data with JavaScript. A simple HTTP request gets you an empty page. You need a real browser.

Domain uses traditional pagination with a "Next" button. The scraper clicks through up to 2 pages. REA uses a lazy-load scroll pattern: new listings only appear when you scroll to the bottom. The scraper scrolls incrementally, waiting for new cards to render after each scroll.

Both portals deploy Cloudflare and behavioural fingerprinting. The scraper uses randomised user agents, realistic viewport sizes, and randomised delays of 1.5 to 3 seconds between actions. It also patches the navigator.webdriver property so Chromium doesn't announce itself as automated.
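
In Playwright terms, the REA scroll loop plus those stealth tweaks look roughly like this. It's a sketch: the listing-card selector and the scroll cap are placeholders, not the tool's actual values.

# Sketch of Stage 2: headless Chromium with basic stealth, plus the
# lazy-load scroll pattern for REA's "Sold" tab.
import random
import time
from playwright.sync_api import sync_playwright

def scrape_sold_cards(profile_url: str, user_agents: list[str]) -> list[str]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent=random.choice(user_agents),     # pool of realistic UAs
            viewport={"width": 1366, "height": 824},
        )
        # Hide the usual headless giveaway before any page script runs.
        context.add_init_script(
            "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
        )
        page = context.new_page()
        page.goto(profile_url)

        seen, cards = 0, []
        for _ in range(20):                            # hard cap on scroll rounds
            page.mouse.wheel(0, 2000)                  # trigger the lazy load
            time.sleep(random.uniform(1.5, 3.0))       # human-ish pause
            cards = page.query_selector_all("[data-testid='listing-card']")
            if len(cards) == seen:                     # nothing new rendered; done
                break
            seen = len(cards)
        html = [card.inner_html() for card in cards]
        browser.close()
        return html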

The output from this stage is raw HTML: the inner content of each listing card element. Messy, unstructured, full of nested divs and CSS classes that change without notice.

Stage 3: LLM extraction

This is where the tool gets interesting. Instead of writing brittle CSS selectors that break every time Domain updates their frontend, the raw HTML cards get sent to Claude Haiku with a structured extraction prompt.

The prompt tells Haiku exactly what fields to extract (address, suburb, postcode, property type, bedrooms, bathrooms, cars, price, date, URL) and how to handle edge cases: "Contact Agent" means null price, "terrace" maps to "house", price ranges use the lower value. The response comes back as JSON, validated against a Pydantic model.
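
Here's a sketch of that step, using the Anthropic Python SDK and a Pydantic model whose fields mirror the output shown earlier. The prompt is paraphrased, and the exact Haiku model id is an assumption:

# Sketch of Stage 3: send a raw listing card to Claude Haiku and
# validate the reply against a Pydantic model. Prompt is paraphrased.
import anthropic
from pydantic import BaseModel

class Listing(BaseModel):
    property_address: str
    suburb: str
    postcode: str | None = None
    property_type: str | None = None
    bedrooms: int | None = None
    bathrooms: int | None = None
    car_spaces: int | None = None
    sold_price: int | None = None   # null when the portal says "Contact Agent"
    sold_date: str | None = None
    listing_url: str | None = None

PROMPT = (
    "Extract this sold listing card as JSON with keys: property_address, "
    "suburb, postcode, property_type, bedrooms, bathrooms, car_spaces, "
    "sold_price, sold_date, listing_url. 'Contact Agent' means sold_price "
    "is null. 'terrace' maps to property_type 'house'. For a price range, "
    "use the lower value. Reply with JSON only."
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract(card_html: str) -> Listing:
    msg = client.messages.create(
        model="claude-3-5-haiku-latest",  # "Claude Haiku"; exact id may differ
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{card_html}"}],
    )
    return Listing.model_validate_json(msg.content[0].text)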

Cost per run: roughly $0.001 for 10 listings. That's a fraction of a cent. The token usage gets logged so you can track spend over time.

If the Anthropic API is unreachable or returns malformed JSON, a regex fallback parser handles the most common HTML patterns from both portals. It's less accurate but better than nothing.
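
The fallback is nothing clever; something in this spirit, with illustrative patterns rather than the tool's actual ones:

# Sketch of the regex fallback: pull the obvious fields straight from
# the raw card HTML when the LLM path is unavailable.
import re

def fallback_parse(card_html: str) -> dict:
    price = re.search(r"\$([\d,]+)", card_html)
    beds = re.search(r"(\d+)\s*(?:Beds?|Bedrooms?)", card_html, re.I)
    return {
        "sold_price": int(price.group(1).replace(",", "")) if price else None,
        "bedrooms": int(beds.group(1)) if beds else None,
    }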

Stage 4: Deduplication and merge

The same property often appears on both Domain and REA with slightly different formatting. "14 Glenmore Road" on one, "14 Glenmore Rd" on the other. The pipeline normalises addresses (expands or contracts street type abbreviations, strips punctuation, lowercases) and deduplicates on normalised address plus sold date.

When a duplicate is found, Domain data takes priority for price (it's more frequently visible), but null fields get filled from the REA listing. The result is a single clean record that combines the best data from both sources.
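
A sketch of the normalise-and-merge logic. The abbreviation table is trimmed to a few entries, and the merge rule follows the priority just described: start from the REA record, then overlay every non-null Domain field.

# Sketch of Stage 4: normalise addresses so "14 Glenmore Rd" and
# "14 Glenmore Road" collapse to one dedup key. Table is trimmed.
import re

STREET_TYPES = {"rd": "road", "st": "street", "ave": "avenue", "pde": "parade"}

def normalise(address: str) -> str:
    addr = re.sub(r"[^\w\s/]", "", address.lower())        # strip punctuation, keep unit "/"
    words = [STREET_TYPES.get(w, w) for w in addr.split()]  # expand abbreviations
    return " ".join(words)

def dedup_key(listing: dict) -> tuple[str, str]:
    return (normalise(listing["property_address"]), listing["sold_date"])

def merge(domain_rec: dict, rea_rec: dict) -> dict:
    # Domain's non-null fields win; REA fills whatever Domain left null.
    merged = dict(rea_rec)
    merged.update({k: v for k, v in domain_rec.items() if v is not None})
    return merged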

Stage 5: Output

JSON to stdout by default. Also supports:

  • CSV export via pandas, saved to an output directory
  • Google Sheets append via gspread, adding rows to a configured spreadsheet
  • Webhook POST for piping results into n8n or any other automation tool

Every run gets logged to a local SQLite database: timestamp, agent name, records found, sources hit, and API cost. You can view history with uv run main.py history.
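
The log is a single table; a sketch with illustrative column names (the tool's actual schema may differ):

# Sketch of the run log: one SQLite row per search, matching the
# fields listed above. Column names are mine.
import sqlite3

con = sqlite3.connect("runs.db")
con.execute("""CREATE TABLE IF NOT EXISTS runs (
    ts TEXT, agent TEXT, records INTEGER, sources TEXT, cost_usd REAL
)""")
con.execute(
    "INSERT INTO runs VALUES (datetime('now'), ?, ?, ?, ?)",
    ("Sarah Mitchell", 12, "domain,rea", 0.001),
)
con.commit()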

Setup takes about 3 minutes

git clone https://github.com/ben-utting/claude-skills.git
cd claude-skills/au-agent-sales-miner
cp .env.example .env
# Add your Anthropic API key to .env
uv sync
playwright install chromium

That's it. The only required key is your Anthropic API key (for the Haiku extraction step). SerpAPI, Google Sheets, and webhook credentials are all optional extras.

If you use Claude Code, it's even simpler. Point it at the folder and ask "search for Sarah Mitchell's recent sales in Paddington." The skill.md file tells Claude what the tool does and how to invoke it. No manual commands, no remembering flags.

Things to know before you run it

Anti-bot measures. Domain and REA both use Cloudflare. From a home/office IP you'll be fine for personal research. If you're planning to run this at scale (hundreds of agents per day), you'll want a residential proxy. The tool supports proxy configuration via the .env file.

Price visibility. REA frequently suppresses sold prices at the vendor's request. The tool handles this gracefully with null values, but don't expect 100% price coverage. Domain is significantly better for price data. In some suburbs, up to 40% of REA listings hide the sold figure.

Agent name disambiguation. Common names with common agencies can match multiple agents across states. Always add a suburb when you can. The tool picks the best match from the search results, but suburb context makes the difference between the right Sarah Mitchell and the wrong one.

Terms of service. Both portals prohibit automated scraping in their ToS. This tool is provided for research and educational purposes. Use it responsibly.

The repo: github.com/ben-utting/claude-skills

Over to you

How long does it take you to research a real estate agent right now? And if you could pull structured data from any Australian portal that doesn't have a public API, what would you build first?
