DEV Community

odakin

I Wrote 82 Regex Replacements to Parse 6,933 Time Format Variations From a Government Dataset

Note: This article is also available in Japanese.

The Setup

Japan's Ministry of Health publishes a list of ~10,000 pharmacies that dispense emergency contraception. I built a search tool for it.

The dataset has an hours field. Business hours. How bad could it be?

Mon-Fri:9:00-18:00,Sat:9:00-13:00

Split on ",", split on ":", parse the range. One regex. Done.
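Version one was roughly this (a sketch, not the actual first parser; parseNaive and the exact regex are my reconstruction):

```javascript
// Naive first pass: split segments on ",", then pull days/open/close
// out of each segment with a single regex.
function parseNaive(hours) {
  return hours.split(",").map((seg) => {
    const m = /^(.+?):(\d{1,2}:\d{2})-(\d{1,2}:\d{2})$/.exec(seg);
    return m ? { days: m[1], open: m[2], close: m[3] } : null;
  });
}

parseNaive("Mon-Fri:9:00-18:00,Sat:9:00-13:00");
// [{ days: "Mon-Fri", open: "9:00", close: "18:00" },
//  { days: "Sat",     open: "9:00", close: "13:00" }]
```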

First version coverage: 89.4%.

Over 10% of entries failed to parse. Here's why.

The Horror: Free-Text Entry With No Schema

There's no format specification. Each pharmacy across 47 prefectures types whatever they want. Here are real entries that all mean "Monday to Friday, 9:00 to 18:00":

月-金:9:00-18:00           ← clean
月~金:9:00~18:00  ← full-width everything
⽉-⾦:9:00-18:00           ← ...what?
月曜日~金曜日 9時~18時    ← kanji time notation
(月火水木金)9:00-18:00     ← parenthesis grouping
平日:9:00-18:00            ← "weekdays" in Japanese
月から金は9時から18時       ← literal prose

All the same meaning.

My job: funnel all of these into a single canonical form. The function that does this calls .replace() 82 times.

Hell #1: Characters That Look Identical But Aren't

"月"  // U+6708 — the correct CJK ideograph
"⽉"  // U+2F47 — KANGXI RADICAL MOON

Can you tell them apart? You can't. But /[月火水木金土日]/ only matches the first one.

One entire prefecture's data used Kangxi radical characters for all day-of-week kanji. Every entry looked perfect to the human eye. The parser skipped all of them. Debugging took half a day.

.replace(/⽉/g, "月").replace(/⽕/g, "火").replace(/⽔/g, "水")
.replace(/⽊/g, "木").replace(/⾦/g, "金").replace(/⼟/g, "土").replace(/⽇/g, "日")
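For what it's worth, these per-character replacements could also be collapsed with Unicode normalization, since Kangxi radicals carry NFKC decompositions to the corresponding ideographs. A sketch (note that NFKC also folds fullwidth digits and punctuation, so where it sits in the 82-step chain matters):

```javascript
// Kangxi radicals (U+2F00-U+2FD5) NFKC-normalize to regular CJK ideographs.
const foldRadicals = (s) => s.normalize("NFKC");

"⽉" === "月";                    // false (U+2F47 vs U+6708)
foldRadicals("⽉") === "月";      // true
foldRadicals("⽉-⾦:9:00-18:00"); // "月-金:9:00-18:00"
```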

Hell #2: Five Kinds of Colons

":"  // U+FF1A FULLWIDTH COLON (3,587 entries — 36% of all data)
"∶"  // U+2236 RATIO
"︓"  // U+FE13 PRESENTATION FORM FOR VERTICAL COLON
"ː"  // U+02D0 MODIFIER LETTER TRIANGULAR COLON
":"  // U+003A Normal colon

All intended as colons. 36% of the entire dataset doesn't use the normal colon.

Dashes are worse — nine varieties:

"ー" "〜" "~" "~" "−" "–" "—" "-" "-"

The first one, ー (U+30FC), is a katakana vowel lengthener. People use it as a hyphen.
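A sketch of the normalization this implies (the character lists mirror the variants above; the function name is illustrative, not the article's actual code):

```javascript
// Fold colon and dash look-alikes into ASCII before any structural parsing.
const normalizeSeparators = (s) =>
  s.replace(/[:∶︓ː]/g, ":")        // fullwidth, ratio, presentation form, triangular
   .replace(/[ー〜~~−–—-]/g, "-");  // incl. the katakana vowel lengthener

normalizeSeparators("月-金:9:00~18:00"); // "月-金:9:00-18:00"
```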

Hell #3: "FridaySunday"

To normalize days, I strip 曜日 (day-of-week suffix). Simple enough:

月曜日-金曜日,日曜日  →  月-金,日

But the data contains 金曜日曜 — "Friday" and "Sunday" concatenated without a separator.

月曜-金曜日曜9:00-18:00

Naive stripping of 曜日?:

月曜-金曜日曜        ← input
  ↓ match 曜日?
月-金                ← "金曜日" matched, ate "日" (Sunday)

Sunday vanished.

Fix: lookahead assertion.

.replace(/曜(?:日(?!曜))?/g, "")

Don't eat the 日 if it's followed by 曜 (meaning it's the start of the next day name).

But now the result is 月-金日 — "Friday" and "Sunday" are glued together. Another fix:

.replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3")

Every bug fix spawns a new bug.
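Chained together, the two fixes behave like this (a sketch of just these two steps out of the 82):

```javascript
const stripDaySuffix = (s) =>
  s.replace(/曜(?:日(?!曜))?/g, "")   // drop 曜/曜日, but not a 日 that starts the next day name
   .replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3");

stripDaySuffix("月曜-金曜日曜9:00-18:00"); // "月-金・日9:00-18:00"
```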

Hell #4: Commas Mean Three Different Things

月-金:9:00-18:00,土:9:00-13:00    ← segment separator
月-金:9:00-13:00,14:00-18:00      ← AM/PM split shift
月,水,金:9:00-18:00               ← day enumeration

I started with .split(","). The second pattern splits AM and PM into separate segments — PM has no day, so the parse fails. The third splits into 月, 水, and 金:9:00-18:00 — Monday and Wednesday have no hours.

You have to determine the meaning of each comma from context.

// Heuristics:
// - "day+time" after comma → new segment
// - "time only" after comma → append to previous segment (split shift)
// - "day only" after comma → merge with next element (enumeration)
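Those heuristics can be sketched as a classifier over the comma-separated parts (names and regexes are illustrative; the real parser normalizes separators first and handles many more shapes):

```javascript
const DAY = "[月火水木金土日]";
const TIME = "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}";

// Decide what each comma-separated part means from its own shape.
function classify(part) {
  if (new RegExp(`^${DAY}.*${TIME}`).test(part)) return "segment";  // day + time
  if (new RegExp(`^(${TIME})$`).test(part)) return "shift";         // time only
  if (new RegExp(`^${DAY}+$`).test(part)) return "enumeration";     // day only
  return "unknown";
}

"月-金:9:00-13:00,14:00-18:00".split(",").map(classify);
// ["segment", "shift"]
"月,水,金:9:00-18:00".split(",").map(classify);
// ["enumeration", "enumeration", "segment"]
```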

Hell #5: Seven Ways to Separate Days From Hours

月-金:9:00-18:00     ← colon
月-金 9:00-18:00     ← space
月-金9:00-18:00      ← direct concatenation
(月火金)9:00-18:00   ← parentheses
月-金/9:00-18:00     ← slash
[月-金]9:00-18:00    ← brackets
月-金;9:00-18:00     ← semicolon
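One way to tame all seven (an illustrative pre-processing step, not the article's actual code): rewrite each variant to the colon form before splitting, so the rest of the parser only ever sees "days:hours".

```javascript
// Normalize "days ? hours" separators to "days:hours".
const unifyDayTimeSep = (s) =>
  s.replace(/^[\[(]([^\])]+)[\])]/, "$1:")             // (月火金)… / [月-金]… → 月火金:…
   .replace(/([月火水木金土日])[ /;]?(?=\d)/, "$1:");  // space, slash, semicolon, or nothing

unifyDayTimeSep("(月火金)9:00-18:00"); // "月火金:9:00-18:00"
unifyDayTimeSep("月-金 9:00-18:00");   // "月-金:9:00-18:00"
unifyDayTimeSep("月-金9:00-18:00");    // "月-金:9:00-18:00"
```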

Hell #6: Typos in Time Values

9:00-8:00    ← closing before opening (probably 18:00)
8:30-6:00    ← probably 16:00 or 18:00
80:30-88:50  ← how

If you parse a typo and display "closes at 8:00," someone who could have made it at 4 PM will stay home, even though the pharmacy was actually open until 18:00.

Validation rejects anything outside 0:00–29:59, falling back to raw data display.

(29:59 because Japan uses "25:00" to mean 1:00 AM the next day for late-night businesses.)
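The validation step, roughly (a sketch; validTime is an illustrative name):

```javascript
// Accept 0:00-29:59; hours 24-29 mean "past midnight" in Japanese
// business-hours notation. Anything else is treated as a typo.
function validTime(t) {
  const m = /^(\d{1,2}):(\d{2})$/.exec(t);
  if (!m) return false;
  return Number(m[1]) <= 29 && Number(m[2]) <= 59;
}

validTime("25:00"); // true  (1:00 AM next day)
validTime("80:30"); // false (fall back to showing raw text)
```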

The Biggest Design Decision: Wrong Is Worse Than Missing

At 97% coverage, the question becomes: what about the remaining 3%?

The answer: don't parse them. Show the raw data.

This site is a pharmacy search for emergency contraception. Users are under time pressure. Showing "this pharmacy is open on Saturday" when it isn't could mean someone misses their window. That's not recoverable.

Example:

Mon-Fri:9:00-18:00,Sat:???,Sun:9:00-14:00

If Saturday is unparseable, I have two choices:

  1. Skip Saturday, show Mon-Fri + Sun → user might assume "closed on Saturday"
  2. Fail the whole entry, show raw text → user reads it themselves

I chose 2. No partial parses. All or nothing.

The same logic applies to "nth weekday" patterns:

第1・3土曜:9:00-12:00  (= "1st and 3rd Saturday only")

The parser outputs weekly schedules ({day: "Sat", open: "9:00", close: "12:00"}). There's no field for "which week of the month."

First implementation: display as "every Saturday." But that's a lie. Someone showing up on the 2nd Saturday finds a closed pharmacy. For medication access, you don't get to lie.

Fix: skip the entire entry. Saturday info is lost. But a gap is better than a lie.
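The skip can be as simple as a guard before emitting a schedule (illustrative; fullwidth digits are included since the data mixes both):

```javascript
// If an entry carries an "nth weekday" marker (第1, 第2, 第1...), refuse to
// parse it into a weekly schedule and show the raw text instead.
const hasNthWeekday = (s) => /第[0-90-9]/.test(s);

hasNthWeekday("第1・3土曜:9:00-12:00"); // true
hasNthWeekday("月-金:9:00-18:00");      // false
```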

Holiday Hell: One More Layer

After hitting 97.1%, I noticed another problem:

月-金:9:00-18:00,日祝休み  (= "closed on Sundays and holidays")

The parser doesn't know about holidays. It would show "Open" on a holiday Monday.

I implemented Japan's entire holiday calendar. ~60 lines, zero dependencies.

  • Fixed holidays (New Year's, National Foundation Day, Emperor's Birthday...)
  • Happy Monday holidays (Coming of Age Day, Marine Day, Respect for the Aged Day, Sports Day)
  • Vernal/Autumnal equinox (calculated from an astronomical formula, valid through ~2099)
  • Substitute holidays (if a holiday falls on Sunday, Monday becomes a holiday)
  • Citizen's holidays (a weekday sandwiched between two holidays)

// Vernal equinox formula
const vernal = Math.floor(
  20.8431 + 0.242194 * (year - 1980)
  - Math.floor((year - 1980) / 4)
);
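The Happy Monday holidays need an "nth Monday of the month" helper; a minimal sketch (illustrative, not the article's actual code):

```javascript
// Day-of-month of the nth Monday of a given month (month is 1-12).
function nthMonday(year, month, n) {
  const firstDow = new Date(year, month - 1, 1).getDay(); // 0=Sun .. 6=Sat
  const offset = (8 - firstDow) % 7;                      // days from the 1st to the first Monday
  return 1 + offset + 7 * (n - 1);
}

nthMonday(2024, 1, 2); // 8 (Coming of Age Day 2024 was Monday, Jan 8)
```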

Coverage Over Time

89.4% → 95.7% → 96.6% → 96.3%(!) → 96.9% → 97.1%

Notice the drop from 96.6% to 96.3%. That's where I stopped lying about "every Saturday" and started skipping nth-weekday entries. Coverage went down. Accuracy went up. Coverage is a quality indicator, not a target.

Why I Stopped at 97.1%

The remaining 2.9%:

月~土:9:00~20:00、お客様感謝デーを除く日・祝:9:00~18:00
("except Customer Appreciation Days")

Natural language conditions.

月/水/金曜日第2/4火曜日第1/3/5土曜日9:00-18:00

Days, ordinals, and times in an undifferentiated blob.

 09:00   19:00
 09:00   19:00

Copy-pasted from Excel. Tab-separated.

Each edge case means a new .replace() that could interfere with the existing 82. Adding one pattern for 3 entries risks regressions across 9,951. Not worth it.

Unparseable pharmacies show their raw data. Users can read Japanese. It's fine.

By The Numbers

Metric Value
Input records 9,951 (non-empty hours)
Unique formats 6,933
.replace() calls 82
Regex patterns ~150
Parser code ~590 lines
Coverage 97.1% (9,659 entries)
Unparseable 292 (2.9%)

Lessons

  1. Free-text fields in government data are hell. Please just use JSON. ["Mon-Fri:9:00-18:00","Sat:9:00-13:00"]. Please.

  2. Normalization rules have order dependencies. Adding one breaks another. Arranging 82 of them in the right order is basically compiler pass optimization.

  3. Coverage is a quality indicator, not a target. The commit that dropped coverage from 96.6% to 96.3% was the most important fix.

  4. For health-related data, "unknown" is 100x better than "wrong." If you can't parse it, show the raw data. Partial parses are a lie factory.

  5. The cost of the last 3% is exponential. 89→95 was character normalization. 95→97 was context-dependent comma parsing and lookahead assertions. 97→99 would mean NLP. Knowing when to stop is a skill.


Disclosure: The regex patterns and parser code were generated with Claude Code and refined through iterative testing against the actual dataset. The design decisions, coverage targets, and "when to stop" judgments described in this article are mine.

Repository

odakin / mhlw-ec-pharmacy-finder

Emergency contraception pharmacy finder based on official MHLW data (Japan)

Pharmacy search for emergency contraception (the morning-after pill)


This repository takes the Ministry of Health, Labour and Welfare's published list of pharmacies selling emergency contraception (as a guidance-required drug) and list of medical institutions (in-person consultation and prescription), reshapes them into searchable CSV / XLSX / JSON, and adds a static web search (GitHub Pages) and a LINE Bot sample.

  • Source (official pages)
  • Latest ingested data: pharmacies 2026-03-10 / medical institutions 2026-02-20
  • Generated artifacts
    • data/ : cleaned data (CSV/XLSX/JSON, original XLSX, geocoding cache)
    • docs/ : static web search (for GitHub Pages; map and business-hours display)
    • line_bot/ : LINE Bot (minimal Node.js sample)
    • scripts/update_data.py : pharmacy data update script (fetches the official XLSX)
    • scripts/update_clinics.py : medical institution data update script (parses the 47 official PDFs)
    • scripts/geocode.py : address → latitude/longitude conversion (University of Tokyo CSIS API; pharmacies + medical institutions)

Important notes (please read)

  • This repository does not provide medical advice.
  • Confirm actual availability, stock, business hours, and sales conditions with each pharmacy.
  • Because stock can fluctuate, even the official page recommends calling before visiting. Treat the official pages above as the final authority.

1) Web search (GitHub Pages)

Everything under docs/ runs on static files alone.

Publishing

  1. GitHub Settings → Pages
  2. Set Source to "Deploy from a branch"
  3. Set Branch to main and Folder to /docs, then save

The published URL is usually https://<username>.github.io/<repository>/. Example: with the repository named mhlw-ec-pharmacy-finder → https://odakin.github.io/mhlw-ec-pharmacy-finder/

Try it locally

cd docs
python -m http.server 8000
# open http://localhost:8000

2) Cleaned data

  • data/mhlw_ec_pharmacies_cleaned_2026-03-25.xlsx
  • data/mhlw_ec_pharmacies_cleaned_2026-03-25.csv (UTF-8 BOM)
  • data/data_2026-03-25.json (for the web app / LINE Bot)

Added columns (examples):

  • 市区町村_推定: municipality estimated from the address string (not perfect)
  • 電話番号_数字: phone number with hyphens etc. stripped, for call links
  • 時間外の電話番号_数字: the after-hours phone number, likewise digits only
  • 販売可能薬剤師数_女性 / 販売可能薬剤師数_男性 / 販売可能薬剤師数_答えたくない: "pharmacists able to sell, by gender (count)" from the official list

Web UI filters:

  • Exclude "advance contact required"
  • Has a female pharmacist
  • Has a private room
  • Available now (shows open, after-hours-capable, and unknown; hides facilities that are definitely closed)

Web UI features:

  • Map display: Leaflet.js…

Parser code: docs/app.js | Design doc: docs/HOURS_PARSER.md

Live site: Emergency Contraception Pharmacy Search — searches 10,000+ pharmacies and 2,900+ clinics from Japan's official MHLW data.

Sequel: My AI Coding Assistant Misapplied the Design Principle I Gave It (link updated after publication)
