DEV Community

odakin

I Wrote 82 Regex Replacements to Parse 6,933 Time Format Variations From a Government Dataset

Note: This article is also available in Japanese.

The Setup

Japan's Ministry of Health publishes a list of ~10,000 pharmacies that dispense emergency contraception. I built a search tool for it.

The dataset has an hours field. Business hours. How bad could it be?

Mon-Fri:9:00-18:00,Sat:9:00-13:00

Split on ",", split on ":", parse the range. One regex. Done.
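Version one was roughly this (a sketch, not the actual first parser; parseNaive and the exact regex are my reconstruction):

```javascript
// Naive first pass: split segments on ",", then pull days/open/close
// out of each segment with a single regex.
function parseNaive(hours) {
  return hours.split(",").map((seg) => {
    const m = /^(.+?):(\d{1,2}:\d{2})-(\d{1,2}:\d{2})$/.exec(seg);
    return m ? { days: m[1], open: m[2], close: m[3] } : null;
  });
}

parseNaive("Mon-Fri:9:00-18:00,Sat:9:00-13:00");
// [{ days: "Mon-Fri", open: "9:00", close: "18:00" },
//  { days: "Sat",     open: "9:00", close: "13:00" }]
```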

First version coverage: 89.4%.

Over 10% of entries failed to parse. Here's why.

The Horror: Free-Text Entry With No Schema

There's no format specification. Each pharmacy across 47 prefectures types whatever they want. Here are real entries that all mean "Monday to Friday, 9:00 to 18:00":

月-金:9:00-18:00           ← clean
月~金:9:00~18:00  ← full-width everything
⽉-⾦:9:00-18:00           ← ...what?
月曜日~金曜日 9時~18時    ← kanji time notation
(月火水木金)9:00-18:00     ← parenthesis grouping
平日:9:00-18:00            ← "weekdays" in Japanese
月から金は9時から18時       ← literal prose

All the same meaning.

My job: funnel all of these into a single canonical form. The function that does this calls .replace() 82 times.

Hell #1: Characters That Look Identical But Aren't

"月"  // U+6708 — the correct CJK ideograph
"⽉"  // U+2F47 — KANGXI RADICAL MOON

Can you tell them apart? You can't. But /[月火水木金土日]/ only matches the first one.

One entire prefecture's data used Kangxi radical characters for all day-of-week kanji. Every entry looked perfect to the human eye. The parser skipped all of them. Debugging took half a day.

.replace(/⽉/g, "月").replace(/⽕/g, "火").replace(/⽔/g, "水")
.replace(/⽊/g, "木").replace(/⾦/g, "金").replace(/⼟/g, "土").replace(/⽇/g, "日")
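For what it's worth, these per-character replacements could also be collapsed with Unicode normalization, since Kangxi radicals carry NFKC decompositions to the corresponding ideographs. A sketch (note that NFKC also folds fullwidth digits and punctuation, so where it sits in the 82-step chain matters):

```javascript
// Kangxi radicals (U+2F00-U+2FD5) NFKC-normalize to regular CJK ideographs.
const foldRadicals = (s) => s.normalize("NFKC");

"⽉" === "月";                    // false (U+2F47 vs U+6708)
foldRadicals("⽉") === "月";      // true
foldRadicals("⽉-⾦:9:00-18:00"); // "月-金:9:00-18:00"
```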

Hell #2: Five Kinds of Colons

":"  // U+FF1A FULLWIDTH COLON (3,587 entries — 36% of all data)
"∶"  // U+2236 RATIO
"︓"  // U+FE13 PRESENTATION FORM FOR VERTICAL COLON
"ː"  // U+02D0 MODIFIER LETTER TRIANGULAR COLON
":"  // U+003A Normal colon

All intended as colons. 36% of the entire dataset doesn't use the normal colon.

Dashes are worse — nine varieties:

"ー" "〜" "~" "~" "−" "–" "—" "-" "-"

The first one, ー (U+30FC), is a katakana vowel lengthener. People use it as a hyphen.
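A sketch of the normalization this implies (the character lists mirror the variants above; the function name is illustrative, not the article's actual code):

```javascript
// Fold colon and dash look-alikes into ASCII before any structural parsing.
const normalizeSeparators = (s) =>
  s.replace(/[:∶︓ː]/g, ":")        // fullwidth, ratio, presentation form, triangular
   .replace(/[ー〜~~−–—-]/g, "-");  // incl. the katakana vowel lengthener

normalizeSeparators("月-金:9:00~18:00"); // "月-金:9:00-18:00"
```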

Hell #3: "FridaySunday"

To normalize days, I strip 曜日 (day-of-week suffix). Simple enough:

月曜日-金曜日,日曜日  →  月-金,日

But the data contains 金曜日曜 — "Friday" and "Sunday" concatenated without a separator.

月曜-金曜日曜9:00-18:00

Naive stripping of 曜日?:

月曜-金曜日曜        ← input
  ↓ match 曜日?
月-金                ← "金曜日" matched, ate "日" (Sunday)

Sunday vanished.

Fix: lookahead assertion.

.replace(/曜(?:日(?!曜))?/g, "")

Don't eat the 日 if it's followed by 曜 (meaning it's the start of the next day name).

But now the result is 月-金日 — "Friday" and "Sunday" are glued together. Another fix:

.replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3")

Every bug fix spawns a new bug.
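Chained together, the two fixes behave like this (a sketch of just these two steps out of the 82):

```javascript
const stripDaySuffix = (s) =>
  s.replace(/曜(?:日(?!曜))?/g, "")   // drop 曜/曜日, but not a 日 that starts the next day name
   .replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3");

stripDaySuffix("月曜-金曜日曜9:00-18:00"); // "月-金・日9:00-18:00"
```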

Hell #4: Commas Mean Three Different Things

月-金:9:00-18:00,土:9:00-13:00    ← segment separator
月-金:9:00-13:00,14:00-18:00      ← AM/PM split shift
月,水,金:9:00-18:00               ← day enumeration

I started with .split(","). The second pattern splits AM and PM into separate segments — PM has no day, so the parse fails. The third splits into 月, 水, and 金:9:00-18:00 — Monday and Wednesday have no hours.

You have to determine the meaning of each comma from context.

// Heuristics:
// - "day+time" after comma → new segment
// - "time only" after comma → append to previous segment (split shift)
// - "day only" after comma → merge with next element (enumeration)
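Those heuristics can be sketched as a classifier over the comma-separated parts (names and regexes are illustrative; the real parser normalizes separators first and handles many more shapes):

```javascript
const DAY = "[月火水木金土日]";
const TIME = "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}";

// Decide what each comma-separated part means from its own shape.
function classify(part) {
  if (new RegExp(`^${DAY}.*${TIME}`).test(part)) return "segment";  // day + time
  if (new RegExp(`^(${TIME})$`).test(part)) return "shift";         // time only
  if (new RegExp(`^${DAY}+$`).test(part)) return "enumeration";     // day only
  return "unknown";
}

"月-金:9:00-13:00,14:00-18:00".split(",").map(classify);
// ["segment", "shift"]
"月,水,金:9:00-18:00".split(",").map(classify);
// ["enumeration", "enumeration", "segment"]
```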

Hell #5: Seven Ways to Separate Days From Hours

月-金:9:00-18:00     ← colon
月-金 9:00-18:00     ← space
月-金9:00-18:00      ← direct concatenation
(月火金)9:00-18:00   ← parentheses
月-金/9:00-18:00     ← slash
[月-金]9:00-18:00    ← brackets
月-金;9:00-18:00     ← semicolon
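One way to tame all seven (an illustrative pre-processing step, not the article's actual code): rewrite each variant to the colon form before splitting, so the rest of the parser only ever sees "days:hours".

```javascript
// Normalize "days ? hours" separators to "days:hours".
const unifyDayTimeSep = (s) =>
  s.replace(/^[\[(]([^\])]+)[\])]/, "$1:")             // (月火金)… / [月-金]… → 月火金:…
   .replace(/([月火水木金土日])[ /;]?(?=\d)/, "$1:");  // space, slash, semicolon, or nothing

unifyDayTimeSep("(月火金)9:00-18:00"); // "月火金:9:00-18:00"
unifyDayTimeSep("月-金 9:00-18:00");   // "月-金:9:00-18:00"
unifyDayTimeSep("月-金9:00-18:00");    // "月-金:9:00-18:00"
```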

Hell #6: Typos in Time Values

9:00-8:00    ← closing before opening (probably 18:00)
8:30-6:00    ← probably 16:00 or 18:00
80:30-88:50  ← how

If you parse a typo and display "closes at 8:00," someone who could have made it at 4 PM will stay home, even though the pharmacy was actually open until 18:00.

Validation rejects anything outside 0:00–29:59, falling back to raw data display.

(29:59 because Japan uses "25:00" to mean 1:00 AM the next day for late-night businesses.)
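The validation step, roughly (a sketch; validTime is an illustrative name):

```javascript
// Accept 0:00-29:59; hours 24-29 mean "past midnight" in Japanese
// business-hours notation. Anything else is treated as a typo.
function validTime(t) {
  const m = /^(\d{1,2}):(\d{2})$/.exec(t);
  if (!m) return false;
  return Number(m[1]) <= 29 && Number(m[2]) <= 59;
}

validTime("25:00"); // true  (1:00 AM next day)
validTime("80:30"); // false (fall back to showing raw text)
```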

The Biggest Design Decision: Wrong Is Worse Than Missing

At 97% coverage, the question becomes: what about the remaining 3%?

The answer: don't parse them. Show the raw data.

This site is a pharmacy search for emergency contraception. Users are under time pressure. Showing "this pharmacy is open on Saturday" when it isn't could mean someone misses their window. That's not recoverable.

Example:

Mon-Fri:9:00-18:00,Sat:???,Sun:9:00-14:00

If Saturday is unparseable, I have two choices:

  1. Skip Saturday, show Mon-Fri + Sun → user might assume "closed on Saturday"
  2. Fail the whole entry, show raw text → user reads it themselves

I chose 2. No partial parses. All or nothing.

The same logic applies to "nth weekday" patterns:

第1・3土曜:9:00-12:00  (= "1st and 3rd Saturday only")

The parser outputs weekly schedules ({day: "Sat", open: "9:00", close: "12:00"}). There's no field for "which week of the month."

First implementation: display as "every Saturday." But that's a lie. Someone showing up on the 2nd Saturday finds a closed pharmacy. For medication access, you don't get to lie.

Fix: skip the entire entry. Saturday info is lost. But a gap is better than a lie.
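The skip can be as simple as a guard before emitting a schedule (illustrative; fullwidth digits are included since the data mixes both):

```javascript
// If an entry carries an "nth weekday" marker (第1, 第2, 第1...), refuse to
// parse it into a weekly schedule and show the raw text instead.
const hasNthWeekday = (s) => /第[0-90-9]/.test(s);

hasNthWeekday("第1・3土曜:9:00-12:00"); // true
hasNthWeekday("月-金:9:00-18:00");      // false
```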

Holiday Hell: One More Layer

After hitting 97.1%, I noticed another problem:

月-金:9:00-18:00,日祝休み  (= "closed on Sundays and holidays")

The parser doesn't know about holidays. It would show "Open" on a holiday Monday.

I implemented Japan's entire holiday calendar. ~60 lines, zero dependencies.

  • Fixed holidays (New Year's, National Foundation Day, Emperor's Birthday...)
  • Happy Monday holidays (Coming of Age Day, Marine Day, Respect for the Aged Day, Sports Day)
  • Vernal/Autumnal equinox (calculated from an astronomical formula, valid through ~2099)
  • Substitute holidays (if a holiday falls on Sunday, Monday becomes a holiday)
  • Citizen's holidays (a weekday sandwiched between two holidays)

// Vernal equinox formula
const vernal = Math.floor(
  20.8431 + 0.242194 * (year - 1980)
  - Math.floor((year - 1980) / 4)
);
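The Happy Monday holidays need an "nth Monday of the month" helper; a minimal sketch (illustrative, not the article's actual code):

```javascript
// Day-of-month of the nth Monday of a given month (month is 1-12).
function nthMonday(year, month, n) {
  const firstDow = new Date(year, month - 1, 1).getDay(); // 0=Sun .. 6=Sat
  const offset = (8 - firstDow) % 7;                      // days from the 1st to the first Monday
  return 1 + offset + 7 * (n - 1);
}

nthMonday(2024, 1, 2); // 8 (Coming of Age Day 2024 was Monday, Jan 8)
```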

Coverage Over Time

89.4% → 95.7% → 96.6% → 96.3%(!) → 96.9% → 97.1%

Notice the drop from 96.6% to 96.3%. That's where I stopped lying about "every Saturday" and started skipping nth-weekday entries. Coverage went down. Accuracy went up. Coverage is a quality indicator, not a target.

Why I Stopped at 97.1%

The remaining 2.9%:

月~土:9:00~20:00、お客様感謝デーを除く日・祝:9:00~18:00
("except Customer Appreciation Days")

Natural language conditions.

月/水/金曜日第2/4火曜日第1/3/5土曜日9:00-18:00

Days, ordinals, and times in an undifferentiated blob.

 09:00   19:00
 09:00   19:00

Copy-pasted from Excel. Tab-separated.

Each edge case means a new .replace() that could interfere with the existing 82. Adding one pattern for 3 entries risks regressions across 9,951. Not worth it.

Unparseable pharmacies show their raw data. Users can read Japanese. It's fine.

By The Numbers

Metric Value
Input records 9,951 (non-empty hours)
Unique formats 6,933
.replace() calls 82
Regex patterns ~150
Parser code ~590 lines
Coverage 97.1% (9,659 entries)
Unparseable 292 (2.9%)

Lessons

  1. Free-text fields in government data are hell. Please just use JSON. ["Mon-Fri:9:00-18:00","Sat:9:00-13:00"]. Please.

  2. Normalization rules have order dependencies. Adding one breaks another. Arranging 82 of them in the right order is basically compiler pass optimization.

  3. Coverage is a quality indicator, not a target. The commit that dropped coverage from 96.6% to 96.3% was the most important fix.

  4. For health-related data, "unknown" is 100x better than "wrong." If you can't parse it, show the raw data. Partial parses are a lie factory.

  5. The cost of the last 3% is exponential. 89→95 was character normalization. 95→97 was context-dependent comma parsing and lookahead assertions. 97→99 would mean NLP. Knowing when to stop is a skill.


Disclosure: The regex patterns and parser code were generated with Claude Code and refined through iterative testing against the actual dataset. The design decisions, coverage targets, and "when to stop" judgments described in this article are mine.

Repository

odakin / mhlw-ec-pharmacy-finder

Emergency contraception pharmacy finder based on official MHLW data (Japan)

Pharmacy search for emergency contraception (the morning-after pill)


This repository takes the Ministry of Health, Labour and Welfare's published list of pharmacies selling emergency contraception (as a guidance-required drug) and list of medical institutions (in-person consultation and prescription), reshapes them into searchable CSV / XLSX / JSON, and adds a static web search (GitHub Pages) and a LINE Bot sample.

  • Source (official pages)
  • Latest ingested data: pharmacies 2026-03-10 / medical institutions 2026-02-20
  • Generated artifacts
    • data/ : cleaned data (CSV/XLSX/JSON, original XLSX, geocoding cache)
    • docs/ : static web search (for GitHub Pages; map and business-hours display)
    • line_bot/ : LINE Bot (minimal Node.js sample)
    • scripts/update_data.py : pharmacy data update script (fetches the official XLSX)
    • scripts/update_clinics.py : medical institution data update script (parses the 47 official PDFs)
    • scripts/geocode.py : address → latitude/longitude conversion (University of Tokyo CSIS API; pharmacies + medical institutions)

Important notes (please read)

  • This repository does not provide medical advice.
  • Confirm actual availability, stock, business hours, and sales conditions with each pharmacy.
  • Because stock can fluctuate, even the official page recommends calling before visiting. Treat the official pages above as the final authority.

1) Web search (GitHub Pages)

Everything under docs/ runs on static files alone.

Publishing

  1. GitHub Settings → Pages
  2. Set Source to "Deploy from a branch"
  3. Set Branch to main and Folder to /docs, then save

The published URL is usually https://<username>.github.io/<repository>/. Example: with the repository named mhlw-ec-pharmacy-finder → https://odakin.github.io/mhlw-ec-pharmacy-finder/

Try it locally

cd docs
python -m http.server 8000
# open http://localhost:8000

2) Cleaned data

  • data/mhlw_ec_pharmacies_cleaned_2026-03-25.xlsx
  • data/mhlw_ec_pharmacies_cleaned_2026-03-25.csv (UTF-8 BOM)
  • data/data_2026-03-25.json (for the web app / LINE Bot)

Added columns (examples):

  • 市区町村_推定: municipality estimated from the address string (not perfect)
  • 電話番号_数字: phone number with hyphens etc. stripped, for call links
  • 時間外の電話番号_数字: the after-hours phone number, likewise digits only
  • 販売可能薬剤師数_女性 / 販売可能薬剤師数_男性 / 販売可能薬剤師数_答えたくない: "pharmacists able to sell, by gender (count)" from the official list

Web UI filters:

  • Exclude "advance contact required"
  • Has a female pharmacist
  • Has a private room
  • Available now (shows open, after-hours-capable, and unknown; hides facilities that are definitely closed)

Web UI features:

  • Map display: Leaflet.js…

Parser code: docs/app.js | Design doc: docs/HOURS_PARSER.md

Live site: Emergency Contraception Pharmacy Search — searches 10,000+ pharmacies and 2,900+ clinics from Japan's official MHLW data.

Sequel: My AI Coding Assistant Misapplied the Design Principle I Gave It (link updated after publication)
