The Proxy Problem: When Your Build Passes But Your Geography Fails

This is part of an ongoing series about building and monetizing Korean data scrapers on Apify. This post continues from #47: Korea's #1 Real Estate Platform Has No Official API — So I Built a Scraper. Then Got Blocked.


The build took about four hours. The code was clean. The Actor ran locally, returned exactly the data I expected, and passed every test I threw at it. I deployed it to Apify, clicked Run, and waited.

net::ERR_CONNECTION_CLOSED

Not a bug. Not a logic error. The scraper was working. The problem was where the request was coming from.


What Actually Happened

The target: Naver Land, Korea's dominant real estate platform. No official API. The data — apartment listings, price ranges, floor counts, transaction history — lives behind a web interface that assumes you're in Seoul, not a U.S. data center in Iowa.

Apify runs on cloud servers in the US and Europe. Korean web platforms know this. Naver Land blocks those IP ranges at the network level. The error isn't a timeout or a captcha — it's a connection refusal before the handshake completes.

ERR_CONNECTION_CLOSED

Three words. Zero ambiguity. The request never reached the application layer.


The Geography Problem in Web Scraping

Data center IPs are easy to enumerate. Services like MaxMind maintain databases of ASNs (Autonomous System Numbers) that map IP ranges to their owners — AWS, Google Cloud, Apify, Browserless. Korean platforms license these databases and apply block rules at the infrastructure level.
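
To make the mechanism concrete, here is a toy sketch of that kind of block rule. The organization names and the lookup are illustrative only; real deployments query a licensed MaxMind-style database at the firewall or load balancer and drop the connection before it ever reaches the application:

// Toy illustration of ASN-based blocking, not any platform's real code.
// In practice the organization name comes from a licensed GeoIP/ASN database.
const DATACENTER_ORGS = ['AMAZON', 'GOOGLE', 'MICROSOFT', 'DIGITALOCEAN', 'OVH'];

function isDataCenterIp(asnOrganization: string): boolean {
  const org = asnOrganization.toUpperCase();
  return DATACENTER_ORGS.some((dc) => org.includes(dc));
}

// A firewall rule keyed on this check closes the TCP connection outright,
// which is exactly what ERR_CONNECTION_CLOSED looks like from the client side.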

This is not new. Streaming services have done it for years. But the pattern is more aggressive in Korea for two reasons:

  1. Limited API alternatives: Unlike US tech giants, Korean platforms rarely publish official APIs. The web is often the only access point.
  2. High-value commercial data: Naver Land's data powers real estate apps that charge for access. Blocking scrapers protects a revenue stream.

A data center IP gets blocked because the platform can. A residential IP — assigned by an ISP to an actual home subscriber — is harder to block without collateral damage. You can't block all of KT Telecom without also blocking actual customers.


The Fix: 3 Lines of Code

The code change to route through a Korean residential proxy is minimal. In Apify's SDK:

// Both snippets assume the usual imports:
// import { Actor } from 'apify';
// import { PlaywrightCrawler } from 'crawlee';

// Before
const crawler = new PlaywrightCrawler({
  requestHandlerTimeoutSecs: 60,
});

// After: route every request through a Korean residential exit node
const crawler = new PlaywrightCrawler({
  requestHandlerTimeoutSecs: 60,
  proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'], // Apify's residential pool instead of data-center IPs
    countryCode: 'KR',       // exit IPs assigned to Korean ISPs
  }),
});

Three lines. The logic stays identical. The origin IP changes from an Iowa data center to a Korean residential address.

With the proxy in place, the same request that returned ERR_CONNECTION_CLOSED should come back as a 200 with the full apartment listing data.


The Real Problem: The Cost

Residential proxies aren't cheap. Apify's Korean residential proxy runs at approximately $8/GB of transferred data.

For reference:

  • A single Naver Land API response for one apartment complex: ~15-50 KB
  • 1,000 apartment complexes: ~15-50 MB
  • Cost at $8/GB: $0.12 - $0.40 per 1,000 requests

At low volumes, this is negligible. But web scraping at scale changes the math quickly. If the Actor becomes popular and a power user runs it against every apartment complex in Seoul (roughly 10,000 complexes), the API responses alone add up to 150-500 MB, or about $1.20-$4.00 in proxy transfer; fold in the page weight a Playwright crawler drags along (HTML, scripts, map tiles) and a single run can plausibly reach $4-20, and that's before Apify's compute fees.

This creates a pricing problem. If I charge $5 per 1,000 results and proxy transfer alone (with that browser overhead) eats $3-4 of it, the margin disappears before accounting for any other overhead.
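
Writing the arithmetic down makes the margin problem obvious. A minimal cost model, where the per-result transfer figures are assumptions to be replaced with measured values from real runs:

// Rough proxy-cost model. kbPerResult is an assumption until measured.
const PROXY_USD_PER_GB = 8;

interface RunEstimate {
  results: number;
  kbPerResult: number; // average transfer per result, including page weight
}

function proxyCostUsd({ results, kbPerResult }: RunEstimate): number {
  const gb = (results * kbPerResult) / (1024 * 1024);
  return gb * PROXY_USD_PER_GB;
}

// API payload only, at the top of the observed 15-50 KB range:
proxyCostUsd({ results: 10_000, kbPerResult: 50 });  // ~$3.81
// With browser page weight folded in (assumed ~250 KB per result):
proxyCostUsd({ results: 10_000, kbPerResult: 250 }); // ~$19.07

At those numbers, a flat per-result price only holds if the measured transfer stays close to the raw API payload size.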


The Decision Framework

I ended up with three questions:

1. Is the data worth the proxy cost?

Real estate data in Korea has genuine commercial value. Agents, developers, and investors pay for access to this data elsewhere. The proxy cost is a pass-through that's recoverable in the price.

2. Does my current pricing model absorb it?

Apify charges users per compute unit and data transfer. Proxy costs can be partially passed through via Apify's billing model, but the math needs to be explicit before launch — not discovered post-launch when users are running 10,000 requests at your advertised price.

3. Is the volume predictable?

For a niche scraper targeting Korean real estate researchers, usage patterns are likely bounded. Unlike a news scraper that someone might run continuously, real estate data is usually collected in batches. Bounded request volume, more predictable total spend.

My decision: proceed, but price per request at a level that covers a residential proxy for every run. $0.01/result at the high end of expected transfer rates covers the proxy with margin.


What I Was Wrong About

I assumed the technical problem was the hard part.

Building a Playwright-based scraper that correctly navigates Naver Land's tile-based map interface, sends authenticated requests to the internal complex API, and parses the nested JSON response — that took real effort. The proxy configuration took ten minutes.
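
For context, the interception pattern looks roughly like this. It is a sketch only: the URL filter and the response handling are placeholders, not Naver Land's actual routes or schema.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData }) {
    // Capture the XHR responses the map UI fires as complexes load.
    page.on('response', async (response) => {
      if (!response.url().includes('/api/complexes/')) return; // placeholder pattern
      if (!response.ok()) return;
      try {
        await pushData(await response.json()); // store the nested payload as-is
      } catch {
        // non-JSON or aborted response; ignore in this sketch
      }
    });

    // Panning and zooming the map is what triggers those requests.
    await page.waitForLoadState('networkidle');
  },
});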

But pricing a scraper that runs on expensive residential proxies at a sustainable margin took three days of thinking. And I won't know whether I got it right until real volume arrives.

The pattern here: the technical problem has a known solution. The economics problem has a correct answer that only becomes visible in production.


The Status: Still Waiting

The Actor is deployed (ID: 8R9tvPV1BrKDShW29). The proxy configuration is written. It's waiting on one thing: a cost decision.

Korean residential proxies run at $8/GB. Before I enable that in production, I need to be certain the pricing model holds at scale. Code ready, deployment done — blocked not by geography but by economics. The scraper can pass the build check. It can't pass the margin check yet.

There's something almost poetic about a scraper designed to get around one kind of gate being stopped by another.

I'll publish the results once the proxy is approved — whether it works, what the real per-request cost ends up being, and whether the pricing held.


Takeaway

If you're building a scraper for Korean platforms:

  1. Test with a residential KR proxy from day one (a minimal smoke test is sketched after this list). Don't wait until you have a "working" build to add the proxy: the build isn't working without it.
  2. Price the proxy cost before you price the product. $8/GB sounds small until you multiply it by user behavior.
  3. ERR_CONNECTION_CLOSED is a geography signal, not a code bug. Nine times out of ten, if the code works locally and fails on cloud infrastructure, the problem is the IP range.
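
A minimal smoke test for point 1, assuming Apify Proxy residential access is enabled on your account. The IP-echo service is just a convenient way to see the exit address; any equivalent works:

import { Actor } from 'apify';
import { gotScraping } from 'got-scraping';

// Confirm requests actually leave through a Korean residential IP
// before pointing the crawler at the real target.
await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
  countryCode: 'KR',
});
const proxyUrl = await proxyConfiguration?.newUrl();
if (!proxyUrl) throw new Error('No proxy available; check Apify Proxy access.');

const { body } = await gotScraping({ url: 'https://api.ipify.org?format=json', proxyUrl });
console.log('Exit IP as seen by the target:', body);

await Actor.exit();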

The proxy problem is solvable. The cost problem requires more data.


Actor on Apify: naver-land-scraper — in beta pending proxy approval.

Tags: webscraping, korea, apify, typescript, devtools
