I scored 8,000 Seoul cafes for laptop-workability from review keywords

#datascience #sideprojects #webdev #seo

I wanted one boring thing: a map that tells me which cafe near me is actually good for sitting with a laptop for three hours. Not "4.5 stars" — that rates the latte. I wanted "wide seats, quiet, has outlets, won't glare at you for staying."

That signal exists. It's just buried in the text of visitor reviews, not in any star rating. So I spent a few weeks pulling it out for ~8,000 cafes in Seoul, turning it into a single number, and — because the data turned out to be interesting on its own — publishing it as a free report. Here's what was annoying, what I learned, and what the data said.

Why not just use the existing ratings

Naver Place and Kakao Map already have ratings and reviews for basically every cafe in Korea. I tried to lean on them and bounced off for three reasons:

Star ratings answer the wrong question. A 4.6-star cafe can have zero outlets and a 60-minute soft time limit. The rating measures "did people enjoy their visit," not "can I work here."
The useful signal is in keywords, not stars. Korean review platforms surface visitor keyword tags such as "좌석이 편해요" ("the seats are comfortable"), "콘센트가 많아요" ("there are lots of outlets"), and "오래 있기 좋아요" ("good for staying a long time"). That's the gold. But it's per-cafe, unaggregated, and not comparable across places.
No ownership. I can't rank, filter, or build a map on top of a number I don't control. I wanted a score I could recompute and stand behind.

So the job became: convert fuzzy keyword tags + a few structured fields into one comparable, defensible score.

The parts that ate the time

Turning keywords into axes (~the longest part)

Visitor keywords are wonderfully human and terrible as data. "콘센트 넉넉" ("plenty of outlets"), "콘센트 많아요" ("lots of outlets"), and "충전 가능" ("charging available") all point to the same axis (outlets) in different surface forms, and the count of each keyword matters more than its mere presence. I ended up mapping keyword clusters to eight axes, each normalized to 0–100: price, pressure (the "are they rushing you out" vibe), outlets, wifi, seat space, focus/noise, refills, and restrooms.
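A minimal sketch of that keyword-to-axis folding. The three outlet phrases are the ones quoted above; the table name, the function name, and the normalization rule (share of reviews mentioning the cluster, capped at 100) are assumptions for illustration, not the actual pipeline.

```python
from collections import defaultdict

# Hypothetical surface-form -> axis table. A real table would cover all
# eight axes and many more phrasings.
KEYWORD_TO_AXIS = {
    "콘센트 넉넉": "outlets",
    "콘센트 많아요": "outlets",
    "충전 가능": "outlets",
    "좌석이 편해요": "seat_space",
}


def axis_scores(keyword_counts, total_reviews):
    """Fold raw keyword counts into per-axis scores on a 0-100 scale.

    Assumption: a score is the share of reviews that mention any keyword
    in the axis cluster, capped at 100. Axes with no matching keyword are
    omitted (unknown) rather than reported as 0.
    """
    if total_reviews <= 0:
        return {}
    mentions = defaultdict(int)
    for keyword, count in keyword_counts.items():
        axis = KEYWORD_TO_AXIS.get(keyword)
        if axis is not None:  # unmapped keywords are ignored
            mentions[axis] += count
    return {
        axis: min(100.0, 100.0 * n / total_reviews)
        for axis, n in mentions.items()
    }
```

Summing counts across surface forms is what makes "콘센트 넉넉" and "충전 가능" reinforce each other instead of competing as separate signals.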

The trap that cost me a day: never treat a missing axis as zero. A cafe with no wifi data isn't a 0-wifi cafe; it's an unknown. If you fill nulls with 0 (or with a flat 50 "average"), every sparse cafe collapses toward the same mediocre score and the ranking turns to mush. The fix was to drop null axes out of the weighted average and renormalize the remaining weights per cafe. Boring, but it's the difference between a real ranking and noise.
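The null-handling fix can be sketched as follows. The function name and the dict-based inputs are assumptions; the logic is just "skip unknown axes, then divide by the weight that is actually present."

```python
def weighted_score(axes, weights):
    """Weighted average over known axes only.

    `axes` maps axis name -> score in 0-100, or None when unknown.
    `weights` maps axis name -> weight. Unknown axes are dropped and the
    remaining weights are renormalized, so a missing axis neither drags
    the score toward 0 nor pulls it toward a flat 50.
    """
    known = {
        name: value
        for name, value in axes.items()
        if value is not None and name in weights
    }
    total_weight = sum(weights[name] for name in known)
    if total_weight == 0:
        return None  # nothing known: the cafe has no score at all
    weighted_sum = sum(weights[name] * value for name, value in known.items())
    return weighted_sum / total_weight
```

For example, with weights of 0.45 / 0.40 / 0.15 and a cafe whose focus axis is unknown, the score is computed over the remaining 0.60 of weight rather than treating focus as 0.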

Picking a formula I could defend

I split it into two scores instead of one mega-number:

Workability = seat space × 0.45 + focus × 0.40 + price × 0.15, bucketed into S/A/B/C grades. This is the "can I actually work here" score, and it leans hard on space + quiet because that's what the reviews say matters.
Survival = an 8-axis weighted blend (price, inverted-pressure, outlets, wifi, space, focus, refill, restroom) for "how long can I camp here."
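As a sketch, the Workability score with the weights stated above looks like this. The grade cutoffs are hypothetical (the post does not give them), and the example assumes all three axes are known.

```python
# Weights as stated for the Workability score.
WORKABILITY_WEIGHTS = {"seat_space": 0.45, "focus": 0.40, "price": 0.15}

# Hypothetical cutoffs, highest first; anything below the last one is "C".
GRADE_CUTOFFS = [(80, "S"), (65, "A"), (50, "B")]


def workability(seat_space, focus, price):
    """Combine three 0-100 axis scores into one 0-100 Workability score."""
    return (
        WORKABILITY_WEIGHTS["seat_space"] * seat_space
        + WORKABILITY_WEIGHTS["focus"] * focus
        + WORKABILITY_WEIGHTS["price"] * price
    )


def grade(score):
    """Bucket a Workability score into S/A/B/C using the assumed cutoffs."""
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "C"
```

Keeping the weights and cutoffs in named constants is what makes the "stated opinions" part practical: changing an opinion is a one-line edit followed by a recompute.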

The weights are opinions, not physics — but they're stated opinions. Anyone can disagree with "space is 45%," and that's fine; the point is the number is reproducible and the methodology is on the page.

The SEO sandbox surprise (the genuinely humbling part)

I rendered ~8,000 cafe pages plus district/station/purpose landing pages server-side, submitted clean sitemaps, valid canonicals, JSON-LD — textbook. Google crawled them and indexed almost none. "Crawled – currently not indexed."

That's not a bug you can fix in code. A brand-new domain emitting thousands of structurally similar pages reads as low-value, and as far as I can tell Google holds back indexing until the domain earns some authority. The lesson that reframed the whole project: for programmatic pages, the bottleneck isn't crawlability, it's per-page uniqueness and domain trust. Which is exactly why the next section exists.

What I extracted

Two things came out of this that are reusable on their own:

A per-page aggregate. Instead of every district page being the same template with a different list of cafes, each one now states its own computed numbers — average workability, grade distribution, the strongest axis, average americano price. Computed from real data, so each page is genuinely different.
A public district report. I aggregated all 25 Seoul districts into one page and published it free (CC BY 4.0): Seoul cafe workability report. It's the linkable, citeable version of the dataset — and honestly the most fun output to read.
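The per-district aggregate described above could be computed along these lines. The record shape (keys `workability`, `grade`, `axes`, `americano_price`) and the function name are assumptions for the sketch.

```python
from collections import Counter
from statistics import mean


def district_summary(cafes):
    """Summarize one district from a list of per-cafe records.

    Each record is assumed to be a dict with `workability` (number),
    `grade` ("S"/"A"/"B"/"C"), `axes` (name -> score or None), and
    `americano_price` (number or None).
    """
    if not cafes:
        return None

    # Collect known values per axis; unknown (None) values are skipped.
    axis_values = {}
    for cafe in cafes:
        for name, value in cafe["axes"].items():
            if value is not None:
                axis_values.setdefault(name, []).append(value)
    axis_means = {name: mean(vals) for name, vals in axis_values.items()}

    prices = [
        c["americano_price"]
        for c in cafes
        if c.get("americano_price") is not None
    ]
    return {
        "avg_workability": mean(c["workability"] for c in cafes),
        "grade_distribution": dict(Counter(c["grade"] for c in cafes)),
        "strongest_axis": max(axis_means, key=axis_means.get) if axis_means else None,
        "avg_americano_price": mean(prices) if prices else None,
    }
```

Because every field is derived from that district's own cafes, two district pages built from this summary differ in their numbers, not just in their list of names.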

What the data actually said

A few things that surprised me once it was aggregated by district:

The highest-average-workability districts are not the trendy ones — quieter residential-leaning districts beat the Instagram hotspots, because hotspot cafes optimize for vibe and turnover, not for someone parked with a laptop.
Outlet density varies wildly by district: some districts average meaningfully higher than others, which tracks with how many study/work-oriented cafes cluster there.
Average americano price spreads further than I expected across districts, which makes the price axis carry real signal.

Honest limits

Keyword bias. The score is only as good as what visitors chose to mention. Cafes with few reviews get thin, less-trustworthy scores — so I noindex/exclude the very sparse ones rather than pretend.
Scores are proxies. "Focus 80" is a model output from review language, not a decibel meter. It's directionally right, not laboratory-precise.
Coverage is Seoul-only, and freshness depends on review volume, which is uneven.
A higher-authority competitor (Naver itself) will always have more raw coverage. My bet is on the angle — workability, not popularity — not on out-crawling them.
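The "noindex the very sparse ones" rule from the first limit above can be sketched as a simple threshold. The cutoff of 20 reviews and the function name are hypothetical; the returned strings are standard values for a robots meta tag.

```python
# Hypothetical cutoff: below this many reviews the score is treated as
# too thin to publish as an indexable page.
MIN_REVIEWS_FOR_INDEX = 20


def robots_directive(review_count, min_reviews=MIN_REVIEWS_FOR_INDEX):
    """Return the robots meta content for a cafe page.

    Sparse cafes stay reachable for visitors and crawlers ("follow") but
    are kept out of the index ("noindex").
    """
    if review_count < min_reviews:
        return "noindex, follow"
    return "index, follow"
```

The result would go into the page head as `<meta name="robots" content="...">`.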

The thing I'm still chewing on

For data-heavy sites, the honest growth lever turned out to be "make each page worth citing," not "publish more pages." If you've shipped a programmatic/data site: did unique per-page aggregates actually move indexing for you, or did it only move once real backlinks showed up? I'd love to hear which one broke the logjam first.
