
Meng Lin

Building NewHomie property analytics tool — Part 1

From Scrappy Scraper to Production Pipeline

It all started with a question.

“How am I supposed to afford a house?”


So I set out to transfigure my anxiety into a software product.

But first, I needed data I could trust.

So I could build this insanity 😉:

Cross-highlighting tens of thousands of properties across charts and attributes in real time.

1. Validate scrapeability

There is no point building a scraper locally only to discover it breaks the moment you deploy it.

I chose to scrape Domain for Australian property data; Realestate would probably send an endless wave of lawyers my way for bypassing its picky Kasada bot protection.

Before writing much pipeline code, I wanted to validate two things: whether the site was practically scrapeable, and whether it could be scraped reliably in a deployed environment.

Scrapeability factors to consider:

  • Frontend framework
    • Domain is built with Next.js.
    • Next.js ships the page’s source data as a JSON payload inside a <script id="__NEXT_DATA__" type="application/json"> tag.
    • Extracting that JSON is much more stable and structured than parsing rendered HTML, because it sits closer to the frontend’s source data.
  • Bot detection
    • Domain uses Akamai bot protection, which blocks simple HTTP requests unless they come from a real browser.
    • I did not observe stronger protections like IP blocking, fingerprinting or human cursor detection.
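The `__NEXT_DATA__` extraction described above can be sketched without any dependencies. In the real scraper the HTML comes from a Playwright-rendered page; here it is a plain string so the idea stands alone. The payload shape is illustrative, not Domain's actual schema.

```typescript
// Sketch: pull the Next.js __NEXT_DATA__ JSON payload out of a page's HTML.
// In production the HTML would come from Playwright; here we parse a string.
function extractNextData(html: string): unknown | null {
  const marker = '<script id="__NEXT_DATA__" type="application/json">';
  const start = html.indexOf(marker);
  if (start === -1) return null;
  const jsonStart = start + marker.length;
  const end = html.indexOf("</script>", jsonStart);
  if (end === -1) return null;
  try {
    return JSON.parse(html.slice(jsonStart, end));
  } catch {
    return null; // malformed payload: treat as not scrapeable
  }
}

// Example with a minimal, made-up payload:
const html =
  '<html><body><script id="__NEXT_DATA__" type="application/json">' +
  '{"props":{"pageProps":{"suburb":"Richmond"}}}</script></body></html>';
const data = extractNextData(html) as { props?: { pageProps?: { suburb?: string } } } | null;
```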

To test quickly, I deployed Playwright on AWS Lambda using SST so I could iterate against a live environment. For stronger bot protection, I would consider something like Camoufox, but at a performance cost.

2. Validate scraped data at boundary

Once I knew I could scrape the site, I needed a way to enforce data integrity.

Validating data early in the scrape process simplified the rest of the pipeline. My first pass used LLM-generated scraper code, which duplicated validation logic in multiple places and made the transformation layer harder to reason about.

Introducing a schema validator at the boundary fixed a lot of that. Using Zod made it easier to distinguish expected data from unexpected data, catch bad assumptions early, and keep the downstream transformation logic much simpler.

After deploying, I noticed some localities were not being scraped because certain paths in the extracted JSON did not exist. It turned out those pages did not exist at all. Once I confirmed that, I added a validator to detect the 404 page shape so the pipeline could handle it gracefully instead of failing deeper in the system.

Boundary validation made debugging easier and simplified the downstream transformation logic.

3. Insert data with raw SQL

SQL abstractions can hide important insertion logic and encourage assumptions that only surface later as dirty data.

I was tempted to use the Kysely query builder for inserts. It reduced boilerplate, but it also made it too easy to assume every table shared the same conflict-handling logic. In practice, each table needed different upsert and deduplication behaviour.

That mismatch introduced bad data which only became obvious later when I started exploring the dataset. Cleaning it up was expensive. I had to write careful migration scripts to transform or delete rows that should never have been inserted in the first place.

One example was property listings that shared the same address and overwrote each other. Another was duplicate listings with different prices, including oddly precise unrounded numbers. These cases occurred less than 1% of the time, but they still produced enough junk data to cost me one to two weeks of cleanup work.

For this part of the pipeline, raw SQL was more verbose, but it made conflict handling explicit.
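To make that concrete, here is a sketch of the kind of explicit per-table conflict handling raw SQL forces you to write down. The table and column names are illustrative, not the real schema.

```typescript
// Sale listings: a re-scrape should update the price, but only if the new
// row is fresher — stale data must never overwrite newer data.
const upsertListingSql = `
  INSERT INTO sale_listings (listing_id, address, price, scraped_at)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT (listing_id) DO UPDATE
    SET price = EXCLUDED.price,
        scraped_at = EXCLUDED.scraped_at
    WHERE sale_listings.scraped_at < EXCLUDED.scraped_at
`;

// Suburb profiles: insert once and never overwrite on re-scrape.
// A query-builder default would happily give both tables the same policy.
const insertSuburbSql = `
  INSERT INTO suburbs (slug, name, state)
  VALUES ($1, $2, $3)
  ON CONFLICT (slug) DO NOTHING
`;
```

The point is not the specific clauses but that each table's dedup policy is stated in one visible place instead of being implied by a builder's defaults.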

4. Add observability and iterate before scaling

Observability on Grafana LGTM made it possible to see where the pipeline was slow, fragile, or built on bad assumptions. That sped up architectural iteration by exposing bottlenecks and clarifying the real requirements.

Once the scraper moved into a real deployed environment, I added an SQS queue in front of the workers and started tracking a few key signals:

Worker duration:

  • Workers are split by locality.
  • This made it easier to see whether cold starts and setup overhead were dominating runtime.
  • In practice, that pushed me toward batching work where possible.


CPU and memory usage

  • I wanted workers to use available CPU efficiently rather than sit idle.
  • Higher utilization generally meant better cost efficiency, as long as memory stayed within safe limits.


Pipeline fragility:

  • I wanted to know where the pipeline failed most often, so I added class and method names to the OTEL code_function_name attribute.
  • I also logged ambiguous or partially extracted data so bad assumptions were visible earlier.
  • Error occurrence patterns over time and space can be visualized on a heat map.
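A dependency-free sketch of attaching class and method names to failures. In the real pipeline this value would feed the OTEL code_function_name attribute mentioned above; here it goes to a plain in-memory log so the idea is self-contained.

```typescript
// Record which ClassName.methodName a failure occurred in, so error
// dashboards can group failures by location in the code.
type ErrorRecord = { functionName: string; message: string; at: Date };
const errorLog: ErrorRecord[] = [];

async function traced<T>(
  className: string,
  methodName: string,
  fn: () => Promise<T>,
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    errorLog.push({
      functionName: `${className}.${methodName}`, // e.g. fed to OTEL attributes
      message: err instanceof Error ? err.message : String(err),
      at: new Date(),
    });
    throw err; // still let retry logic see the failure
  }
}
```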


Anomalies:

  • Unusual worker duration.
  • Unusual resource usage.
  • Unexpected failure patterns.

These signals made the pipeline easier to iterate on. Instead of guessing where the bottlenecks and fragile spots were, I could observe them directly and improve the system from there. Once failure modes were better understood and error rates came down, I could scale the pipeline with much more confidence.

5. Design for iteration speed first

To validate scrapeability quickly, I needed infrastructure that could deploy fast. I intentionally traded long-term flexibility for iteration speed. At this stage, the main requirements were simple: keep scrape speed within rate limits, keep SQL inserts idempotent, and retry workers on failure.

The first version used a cron job to trigger scrape work through an SQS queue.

Once the scraper worked reliably on 3 localities, I scaled it to 362 localities near the CBD. At that scale, rarer problems started to appear in the observability dashboard (one run surfaced 367 errors):

Browser process was hanging

This showed up in the heat map as repeated failures in ScrapeController.tryExtractSuburbPage across many localities after scrape attempts. Eventually the browser process would restart.

Once the browser hung, it affected subsequent workers because Lambda reuses the execution context across invocations.

Errors:

  • TimeoutError — BrowserService.getHTML: Navigation timeout of 10000 ms exceeded
  • ProtocolError — ScrapeController.tryExtractSuburbPage: Target.createTarget timed out. Increase the protocolTimeout setting in launch/connect calls for a higher timeout if needed.
  • ProtocolError — ScrapeController.tryExtractRentsPage: Target.createTarget timed out. Increase the protocolTimeout setting in launch/connect calls for a higher timeout if needed.

Fix:

  • I reworked the browser error-handling and retry logic so failures triggered a full browser restart instead of leaving the process in a bad state. This greatly decreased the average scrape worker duration.

Workers with ProtocolError wasted about 3 minutes of compute each.
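The restart-on-failure policy can be sketched as follows. The `Browser` interface is a stand-in for Playwright's; the key point is that a failed attempt always tears the browser down rather than reusing a possibly-hung process.

```typescript
// Any failure during a scrape attempt forces a full browser restart before
// the next attempt, instead of leaving a hung process in the warm context.
interface Browser {
  close(): Promise<void>;
}

async function scrapeWithRestart<T>(
  launch: () => Promise<Browser>,
  scrape: (browser: Browser) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const browser = await launch(); // fresh process every attempt
    try {
      return await scrape(browser);
    } catch (err) {
      lastError = err; // possibly-hung browser: never reuse it
    } finally {
      await browser.close().catch(() => {}); // force a clean slate
    }
  }
  throw lastError;
}
```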

Price extraction logic was flawed, but caught by a database constraint

Errors:

  • Warn — DomainListingsService.tryTransformSalePrice: no price in listing.listingModel.price: "2bedroom + 1bedroom (study)"
  • Warn — ScrapeModel.tryUpdateSaleListing: value "640000680000" is out of range for type integer

Fix:

  • I added tests from production logs so more valid price strings are accepted while invalid price strings are rejected.
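A hypothetical re-implementation of the price transform, hardened against the two production failures above: free text like "2bedroom + 1bedroom (study)" must be rejected, and a range like "$640,000 - $680,000" must never be collapsed by stripping non-digits into "640000680000". The midpoint rule is a modelling choice of this sketch, not necessarily the author's.

```typescript
// Parse a display price string into an integer dollar amount, or null if
// the string does not actually contain a price.
function tryTransformSalePrice(display: string): number | null {
  // Match each standalone dollar amount, keeping range endpoints separate
  // (never concatenating their digits).
  const matches = display.match(/\$?\d{1,3}(?:,\d{3})+|\$\d+/g);
  if (!matches || matches.length === 0) return null;
  const values = matches.map((m) => Number(m.replace(/[$,]/g, "")));
  // For a price range, take the midpoint (illustrative modelling choice).
  const price = values.reduce((a, b) => a + b, 0) / values.length;
  return Number.isFinite(price) ? Math.round(price) : null;
}
```

Production log lines like the two warnings above become regression tests directly: the accepted strings pin down valid formats, and the rejected strings pin down the failure cases.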

Non-existent pages were being scraped without enough context in the logs

Errors:

  • ZodError — DomainSuburbService.tryExtractProfile
  • ZodError — DomainListingsService.tryExtractListings

Fix:

  • The error logs needed to include the locality that caused the failure. Once I added that context, I found that some of the localities did not exist at all.

This version of the architecture did what it needed to do: it validated scrapeability quickly and exposed real failure modes early. Its weaknesses only became obvious once scale increased, which was acceptable for a design optimized for learning speed.

Fast validation was the right trade at the start, because it exposed real failure modes before the architecture was worth hardening.

6. Design for observed requirements next

Once the pipeline became stable, with fewer than 50 errors per run, iteration speed was no longer the main priority. At that point, I could trade some of it away to meet the requirements the system had actually revealed:

  • Orchestration for pre and post processing.
  • Smaller blast radius when scrape workers failed.
  • Independent workflow execution.
  • Scrape workers that were easier to test and deploy.
  • Timeout flexibility above Lambda’s 15 minutes.
  • Full workflow completion within 1 day.

The next architecture centered on Step Functions for orchestration, with the plan of running the scraper in Docker containers on Fargate. This made the workflow easier to reason about, and the Step Functions visualizer was especially useful for debugging and manually retrying failed runs.

However, I could not work out how to inject environment variables into Fargate tasks from Step Functions using SST, so I temporarily kept Lambda workers despite their limitations.

Ideal architecture: Cron job → Step Functions workflow → Fargate task

However, Step Functions introduced its own constraints. The 256 KB message limit and 25,000 event history limit added complexity to larger runs. The simplest workaround I found at the time was to trigger two workflows in parallel.
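The workaround amounts to splitting the locality list across parallel executions so each one stays under the 25,000-event history and 256 KB payload limits. A minimal sketch, where starting the actual executions (the AWS SDK `StartExecution` call) is left out:

```typescript
// Round-robin the work items across N parallel Step Functions executions,
// so each execution's payload and event history stay under the limits.
function splitForWorkflows<T>(items: T[], workflows: number): T[][] {
  const parts: T[][] = Array.from({ length: workflows }, () => []);
  items.forEach((item, i) => parts[i % workflows].push(item));
  return parts;
}

// e.g. splitForWorkflows(localities, 2) → two inputs, one per execution.
```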

Once this architecture looked stable in a preview branch, I scaled it from 362 to 4,491 localities. That was the point where Step Functions began to hit its practical limits and forced a temporary redesign.

Once the system revealed its real constraints, the architecture had to evolve around them rather than the assumptions that shaped the first version.

7. Optimize for cost last

After observing the cost of the previous design, I realized it was more expensive than expected, partly because Lambda’s free tier had hidden some of the true cost earlier on.

At this stage, I wanted the cheapest compute that still fit the workload. In theory, that meant Spot instances or Fargate Spot on Arm64. In practice, that would reduce scrape worker availability and increase the chance of interrupted runs and forced restarts.

My target was to keep the overhead of batch locality scraping below 10%. Since AWS Batch on Fargate adds roughly 1 minute of provisioning overhead, each worker needed to run for around 10 minutes to make that overhead acceptable. Based on a median scrape time of 15 seconds per locality, I designed each worker to handle 50 localities.
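That sizing can be worked through explicitly. With roughly 60 seconds of provisioning overhead per worker and a median scrape of 15 seconds per locality, the minimum batch size that keeps overhead at or below 10% of useful runtime is:

```typescript
// How many localities must each worker handle so that fixed provisioning
// overhead stays within the target fraction of scraping time?
// Constraint: overheadSec <= maxOverheadFraction * batch * medianScrapeSec
function minBatchSize(
  overheadSec: number,
  medianScrapeSec: number,
  maxOverheadFraction: number,
): number {
  return Math.ceil(overheadSec / (maxOverheadFraction * medianScrapeSec));
}

// 60 s overhead, 15 s median, 10% target → at least 40 localities per worker;
// 50 adds headroom for slower-than-median localities.
const batch = minBatchSize(60, 15, 0.1);
```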

The result was a conceptually simple architecture with a gnarly IaC definition.

Although Spot instances were cheap, they introduced their own costs and complexity, including public IPv4 charges and a more involved IaC definition.

Cost only became worth optimizing once the workload was understood well enough to separate real savings from premature complexity.

Conclusion

What began as a scrappy scraper gradually became a production pipeline. Each stage exposed a different class of problem: scrapeability, data integrity, insertion rules, observability, and finally architecture itself.

The main lesson was to validate assumptions early, especially through observability. Production data exposed where those assumptions failed, and each redesign became a response to that reality rather than guesswork.

The result is a scraper that has been in production since October 2025 and has been operating with an average of fewer than 100 warnings and errors per run.

The biggest benefit was being able to explore tens of thousands of properties with a single SQL query instead of being constrained by the limited interfaces of property listing websites. That made the engineering effort worthwhile: it turned messy public listing data into something I could reason about quickly.

What’s Next?

I could use AWS FIS to test the resilience of the pipeline by deliberately injecting faults in the spirit of Netflix-style chaos engineering.

The next obvious step is to explore the data itself, but that is a separate problem.

That opens the door to a different set of questions:

  • How to design an interactive UX for exploring large property datasets?
  • How much complexity does local-first caching introduce?
