Note: This article is also available in Japanese.
The Setup
Japan's Ministry of Health publishes a list of ~10,000 pharmacies that dispense emergency contraception. I built a search tool for it.
The dataset has an hours field. Business hours. How bad could it be?
Mon-Fri:9:00-18:00,Sat:9:00-13:00
Split on ",", split on ":", parse the range. One regex. Done.
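A reconstruction of what that first version might have looked like (a sketch, not the actual code; the regex and function name are mine):

```javascript
// Naive first pass: split segments on ",", then match "days:open-close"
// with one regex. Handles the clean case above and nothing else.
const SEG = /^([^:]+):(\d{1,2}:\d{2})-(\d{1,2}:\d{2})$/;

function parseNaive(hours) {
  return hours.split(",").map((seg) => {
    const m = seg.match(SEG);
    if (!m) throw new Error(`unparseable segment: ${seg}`);
    return { days: m[1], open: m[2], close: m[3] };
  });
}
```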
First version coverage: 89.4%.
Over 10% of entries failed to parse. Here's why.
The Horror: Free-Text Entry With No Schema
There's no format specification. Each pharmacy across 47 prefectures types whatever they want. Here are real entries that all mean "Monday to Friday, 9:00 to 18:00":
月-金:9:00-18:00 ← clean
月~金:9:00~18:00 ← full-width everything
⽉-⾦:9:00-18:00 ← ...what?
月曜日~金曜日 9時~18時 ← kanji time notation
(月火水木金)9:00-18:00 ← parenthesis grouping
平日:9:00-18:00 ← "weekdays" in Japanese
月から金は9時から18時 ← literal prose
All the same meaning.
My job: funnel all of these into a single canonical form. The function that does this calls .replace() 82 times.
Hell #1: Characters That Look Identical But Aren't
"月" // U+6708 — correct
"⽉" // U+2F49 — KANGXI RADICAL MOON, a lookalike
Can you tell them apart? You can't. But /[月火水木金土日]/ only matches the first one.
One entire prefecture's data used Kangxi radical characters for all day-of-week kanji. Every entry looked perfect to the human eye. The parser skipped all of them. Debugging took half a day.
.replace(/⽉/g, "月").replace(/⽕/g, "火").replace(/⽔/g, "水")
.replace(/⽊/g, "木").replace(/⾦/g, "金").replace(/⼟/g, "土").replace(/⽇/g, "日")
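Worth noting as an alternative to the replace chain: Unicode NFKC normalization folds Kangxi-radical lookalikes (and fullwidth punctuation) into their ordinary forms in one call. The caveat is that it also folds fullwidth digits, letters, and more, which may or may not be what the rest of the pipeline expects:

```javascript
// NFKC compatibility normalization maps Kangxi radicals (U+2F00-U+2FDF)
// to the ordinary unified ideographs, and fullwidth punctuation to ASCII.
function foldLookalikes(s) {
  return s.normalize("NFKC");
}
```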
Hell #2: Five Kinds of Colons
":" // U+FF1A Fullwidth colon (3,587 entries — 36% of all data)
"∶" // U+2236 RATIO
"︓" // U+FE13 PRESENTATION FORM
"ː" // U+02D0 MODIFIER LETTER TRIANGULAR COLON
":" // U+003A Normal colon
All intended as colons. 36% of the entire dataset doesn't use the normal colon.
Dashes are worse — nine varieties:
"-" "‐" "−" "–" "—" "ー" "―" "ー" "-"
That ー is a katakana vowel lengthener. People use it as a hyphen.
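The colon and dash lookalikes can be collapsed with character-class replaces. A sketch covering just the codepoints listed above (the function name is mine; the real parser's lists are longer):

```javascript
// Fold separator lookalikes to canonical ASCII.
function normalizeSeparators(s) {
  return s
    // U+FF1A fullwidth, U+2236 ratio, U+FE13 presentation form,
    // U+02D0 modifier letter triangular colon
    .replace(/[\uFF1A\u2236\uFE13\u02D0]/g, ":")
    // U+2010 hyphen, U+2212 minus, U+2013 en dash, U+2014 em dash,
    // U+30FC katakana vowel lengthener, U+2015 horizontal bar
    .replace(/[\u2010\u2212\u2013\u2014\u30FC\u2015]/g, "-");
}
```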
Hell #3: "FridaySunday"
To normalize days, I strip 曜日 (day-of-week suffix). Simple enough:
月曜日-金曜日,日曜日 → 月-金,日
But the data contains 金曜日曜 — "Friday" and "Sunday" concatenated without a separator.
月曜-金曜日曜9:00-18:00
Naive stripping with /曜日?/:
月曜-金曜日曜 ← input
↓ match 曜日?
月 - 金 ← "金曜日" matched, ate "日" (Sunday)
Sunday vanished.
Fix: lookahead assertion.
.replace(/曜(?:日(?!曜))?/g, "")
Don't eat 日 if it's followed by 曜 (meaning it's the start of the next day name).
But now the result is 月-金日 — "Friday" and "Sunday" are glued together. Another fix:
.replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3")
Every bug fix spawns a new bug.
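The two fixes from this section, wrapped together in a helper (same regexes as above; the function name is mine):

```javascript
// 1. Strip 曜/曜日 with a lookahead so a trailing 日 (Sunday) survives.
// 2. Re-separate day names that end up glued together.
function stripYoubi(s) {
  return s
    .replace(/曜(?:日(?!曜))?/g, "")
    .replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3");
}
```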
Hell #4: Commas Mean Three Different Things
月-金:9:00-18:00,土:9:00-13:00 ← segment separator
月-金:9:00-13:00,14:00-18:00 ← AM/PM split shift
月,水,金:9:00-18:00 ← day enumeration
I started with .split(","). The second pattern splits AM and PM into separate segments — PM has no day, parse fails. The third splits into 月, 水, 金:9:00-18:00 — Monday and Wednesday have no hours.
You have to determine the meaning of each comma from context.
// Heuristics:
// - "day+time" after comma → new segment
// - "time only" after comma → append to previous segment (split shift)
// - "day only" after comma → merge with next element (enumeration)
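Those heuristics can be sketched as a chunk classifier. The regexes here are deliberately simplified; the real parser has to recognize far more day and time forms:

```javascript
// Classify what a comma-separated chunk means from its content.
const HAS_DAY = /[月火水木金土日]/;
const HAS_TIME = /\d{1,2}:\d{2}/;

function classifyChunk(chunk) {
  const day = HAS_DAY.test(chunk);
  const time = HAS_TIME.test(chunk);
  if (day && time) return "new-segment";   // e.g. "土:9:00-13:00"
  if (time) return "split-shift";          // e.g. "14:00-18:00"
  if (day) return "day-enumeration";       // e.g. "水"
  return "unknown";
}
```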
Hell #5: Seven Ways to Separate Days From Hours
月-金:9:00-18:00 ← colon
月-金 9:00-18:00 ← space
月-金9:00-18:00 ← direct concatenation
(月火金)9:00-18:00 ← parentheses
月-金/9:00-18:00 ← slash
[月-金]9:00-18:00 ← brackets
月-金;9:00-18:00 ← semicolon
Hell #6: Typos in Time Values
9:00-8:00 ← closing before opening (probably 18:00)
8:30-6:00 ← probably 16:00 or 18:00
80:30-88:50 ← how
If you parse a typo and display "closes at 8:00," someone will show up at 4 PM to an open pharmacy and... wait, no, the opposite. Someone will not show up because they think it's closed.
Validation rejects anything outside 0:00–29:59, falling back to raw data display.
(29:59 because Japan uses "25:00" to mean 1:00 AM the next day for late-night businesses.)
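The validation gate is simple to sketch (the helper name is mine): accept times within 0:00–29:59, require close after open, and treat anything else as "fall back to showing the raw string":

```javascript
// Sanity-check a parsed open/close pair under the extended-hours convention.
function validRange(open, close) {
  const toMin = (t) => {
    const [h, m] = t.split(":").map(Number);
    return h * 60 + m;
  };
  const o = toMin(open);
  const c = toMin(close);
  return o >= 0 && c <= 29 * 60 + 59 && o < c;
}
```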
The Biggest Design Decision: Wrong Is Worse Than Missing
At 97% coverage, the question becomes: what about the remaining 3%?
The answer: don't parse them. Show the raw data.
This site is a pharmacy search for emergency contraception. Users are under time pressure. Showing "this pharmacy is open on Saturday" when it isn't could mean someone misses their window. That's not recoverable.
Example:
Mon-Fri:9:00-18:00,Sat:???,Sun:9:00-14:00
If Saturday is unparseable, I have two choices:
- Skip Saturday, show Mon-Fri + Sun → user might assume "closed on Saturday"
- Fail the whole entry, show raw text → user reads it themselves
I chose the second. No partial parses. All or nothing.
The same logic applies to "nth weekday" patterns:
第1・3土曜:9:00-12:00 (= "1st and 3rd Saturday only")
The parser outputs weekly schedules ({day: "Sat", open: "9:00", close: "12:00"}). There's no field for "which week of the month."
First implementation: display as "every Saturday." But that's a lie. Someone showing up on the 2nd Saturday finds a closed pharmacy. For medication access, you don't get to lie.
Fix: skip the entire entry. Saturday info is lost. But a gap is better than a lie.
Holiday Hell: One More Layer
After hitting 97.1%, I noticed another problem:
月-金:9:00-18:00,日祝休み (= "closed on Sundays and holidays")
The parser doesn't know about holidays. It would show "Open" on a holiday Monday.
I implemented Japan's entire holiday calendar. ~60 lines, zero dependencies.
- Fixed holidays (New Year's, National Foundation Day, Emperor's Birthday...)
- Happy Monday holidays (Coming of Age Day, Marine Day, Respect for the Aged Day, Sports Day)
- Vernal/Autumnal equinox (calculated from an astronomical formula, valid through ~2099)
- Substitute holidays (if a holiday falls on Sunday, Monday becomes a holiday)
- Citizen's holidays (a weekday sandwiched between two holidays)
// Vernal equinox formula
const vernal = Math.floor(
  20.8431 + 0.242194 * (year - 1980)
    - Math.floor((year - 1980) / 4)
);
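For illustration, the Happy Monday holidays reduce to an "nth Monday of the month" computation (a sketch; the helper name is mine):

```javascript
// Day-of-month of the nth Monday, e.g. Coming of Age Day = 2nd Monday of January.
function nthMonday(year, month, n) {
  // month is 1-12; UTC avoids local-timezone surprises
  const firstDow = new Date(Date.UTC(year, month - 1, 1)).getUTCDay(); // 0 = Sunday
  const toFirstMonday = (8 - firstDow) % 7; // 0 when the 1st is already a Monday
  return 1 + toFirstMonday + (n - 1) * 7;
}
```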
Coverage Over Time
89.4% → 95.7% → 96.6% → 96.3%(!) → 96.9% → 97.1%
Notice the drop from 96.6% to 96.3%. That's where I stopped lying about "every Saturday" and started skipping nth-weekday entries. Coverage went down. Accuracy went up. Coverage is a quality indicator, not a target.
Why I Stopped at 97.1%
The remaining 2.9%:
月~土:9:00~20:00、お客様感謝デーを除く日・祝:9:00~18:00
("except Customer Appreciation Days")
Natural language conditions.
月/水/金曜日第2/4火曜日第1/3/5土曜日9:00-18:00
Days, ordinals, and times in an undifferentiated blob.
月 09:00 19:00
火 09:00 19:00
Copy-pasted from Excel. Tab-separated.
Each edge case means a new .replace() that could interfere with the existing 82. Adding one pattern for 3 entries risks regressions across 9,951. Not worth it.
Unparseable pharmacies show their raw data. Users can read Japanese. It's fine.
By The Numbers
| Metric | Value |
|---|---|
| Input records | 9,951 (non-empty hours) |
| Unique formats | 6,933 |
| .replace() calls | 82 |
| Regex patterns | ~150 |
| Parser code | ~590 lines |
| Coverage | 97.1% (9,659 entries) |
| Unparseable | 292 (2.9%) |
Lessons
Free-text fields in government data are hell. Please just use JSON. ["Mon-Fri:9:00-18:00","Sat:9:00-13:00"]. Please.
Normalization rules have order dependencies. Adding one breaks another. Arranging 82 of them in the right order is basically compiler pass optimization.
Coverage is a quality indicator, not a target. The commit that dropped coverage from 96.6% to 96.3% was the most important fix.
For health-related data, "unknown" is 100x better than "wrong." If you can't parse it, show the raw data. Partial parses are a lie factory.
The cost of the last 3% is exponential. 89→95 was character normalization. 95→97 was context-dependent comma parsing and lookahead assertions. 97→99 would mean NLP. Knowing when to stop is a skill.
Repository
odakin/mhlw-ec-pharmacy-finder
Emergency contraception pharmacy finder based on official MHLW data (Japan)
Search for pharmacies that can sell emergency contraception (the morning-after pill)
This repository reshapes the MHLW's published lists of emergency contraception pharmacies (sales requiring pharmacist guidance) and medical institutions (in-person consultation and prescription) into searchable CSV / XLSX / JSON, and adds a static web search (GitHub Pages) and a LINE Bot sample.
- Source: the official MHLW pages
- Latest ingested data: pharmacies 2026-03-10 / medical institutions 2026-02-20
- Outputs
  - data/: cleaned data (CSV/XLSX/JSON, original XLSX, geocoding cache)
  - docs/: static web search (for GitHub Pages; map and business-hours display)
  - line_bot/: LINE Bot (minimal Node.js sample)
  - scripts/update_data.py: pharmacy data update script (fetches the official XLSX)
  - scripts/update_clinics.py: medical institution data update script (parses 47 official PDFs)
  - scripts/geocode.py: address-to-coordinates geocoding (University of Tokyo CSIS API; pharmacies and institutions)
Important notes (please read)
- This repository does not provide medical advice.
- Confirm actual availability, stock, business hours, and sales conditions with each pharmacy.
- The official pages themselves recommend phoning before visiting, since stock can change. Treat the official pages above as the final authority.
1) Web search (GitHub Pages)
Everything under docs/ runs on static files alone.
Publishing:
- In GitHub, open Settings → Pages
- Set Source to "Deploy from a branch"
- Set Branch to main and Folder to /docs, then save
The published URL is usually https://<ユーザー名>.github.io/<リポジトリ名>/
Example: with the repository named mhlw-ec-pharmacy-finder → https://odakin.github.io/mhlw-ec-pharmacy-finder/
Try it locally:
cd docs
python -m http.server 8000
# open http://localhost:8000
2) Cleaned data
- data/mhlw_ec_pharmacies_cleaned_2026-03-10.xlsx
- data/mhlw_ec_pharmacies_cleaned_2026-03-10.csv (UTF-8 with BOM)
- data/data_2026-03-10.json (for the web UI and LINE Bot)
Added columns (examples):
- 市区町村_推定: municipality estimated from the address string (not perfect)
- 電話番号_数字: phone number stripped to digits, for use in call links
- 時間外の電話番号_数字: the after-hours phone number, likewise digits only
- 販売可能薬剤師数_女性 / 販売可能薬剤師数_男性 / 販売可能薬剤師数_答えたくない: the official list's "pharmacists able to sell, by gender (count)" (female / male / prefer not to say)
Web UI filters:
- Exclude pharmacies requiring advance contact
- After-hours support available
- Female pharmacist on staff
- Private room available
Web UI features:
- Map view: search results pinned with Leaflet.js + OpenStreetMap (marker clustering; pharmacies = blue, medical institutions = red)
- Sort by distance…
Parser code: docs/app.js | Design doc: docs/HOURS_PARSER.md
Live site: Emergency Contraception Pharmacy Search — searches 10,000+ pharmacies and 2,900+ clinics from Japan's official MHLW data.