
odakin
I Wrote 82 Regex Replacements to Parse 6,933 Time Format Variations From a Government Dataset

Note: This article is also available in Japanese.

The Setup

Japan's Ministry of Health publishes a list of ~10,000 pharmacies that dispense emergency contraception. I built a search tool for it.

The dataset has an hours field. Business hours. How bad could it be?

Mon-Fri:9:00-18:00,Sat:9:00-13:00

Split on commas, split on colons, parse the range. One regex. Done.
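That first cut looks roughly like this (a reconstruction of the description above, not the actual docs/app.js code):

```javascript
// Naive first version: split on commas, then capture a day part and a
// time range per segment. A sketch, not the real parser.
function parseNaive(hours) {
  return hours.split(",").map(seg => {
    const m = seg.match(/^(.+?):(\d{1,2}:\d{2})-(\d{1,2}:\d{2})$/);
    return m ? { days: m[1], open: m[2], close: m[3] } : null;
  });
}
```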

First version coverage: 89.4%.

Over 10% of entries failed to parse. Here's why.

The Horror: Free-Text Entry With No Schema

There's no format specification. Each pharmacy across 47 prefectures types whatever they want. Here are real entries that all mean "Monday to Friday, 9:00 to 18:00":

月-金:9:00-18:00           ← clean
月~金:9:00~18:00  ← full-width everything
⽉-⾦:9:00-18:00           ← ...what?
月曜日~金曜日 9時~18時    ← kanji time notation
(月火水木金)9:00-18:00     ← parenthesis grouping
平日:9:00-18:00            ← "weekdays" in Japanese
月から金は9時から18時       ← literal prose

All the same meaning.

My job: funnel all of these into a single canonical form. The function that does this calls .replace() 82 times.

Hell #1: Characters That Look Identical But Aren't

"月"  // U+6708 — correct
"⽉"  // U+2F47 KANGXI RADICAL MOON

Can you tell them apart? You can't. But /[月火水木金土日]/ only matches the first one.

One entire prefecture's data used CJK compatibility characters for all day-of-week kanji. Every entry looked perfect to the human eye. The parser skipped all of them. Debugging took half a day.

.replace(/⽉/g, "月").replace(/⽕/g, "火").replace(/⽔/g, "水")
.replace(/⽊/g, "木").replace(/⾦/g, "金").replace(/⼟/g, "土").replace(/⽇/g, "日")
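A note for anyone hitting the same wall: Unicode NFKC normalization folds the Kangxi radicals back into the regular ideographs, so one normalize call can stand in for the whole replace chain. Caveat: NFKC also rewrites other compatibility characters (fullwidth punctuation, for instance), which may or may not be what you want. A minimal sketch:

```javascript
// NFKC maps Kangxi radicals (U+2F00-2FDF) to their unified ideographs,
// e.g. ⽉ (U+2F47) becomes 月 (U+6708). Regular kanji pass through unchanged.
const foldRadicals = s => s.normalize("NFKC");
```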

Hell #2: Five Kinds of Colons

"："  // U+FF1A FULLWIDTH COLON (3,587 entries — 36% of all data)
"∶"  // U+2236 RATIO
"︓"  // U+FE13 PRESENTATION FORM FOR VERTICAL COLON
"ː"  // U+02D0 MODIFIER LETTER TRIANGULAR COLON
":"  // U+003A Normal colon

All intended as colons. 36% of the entire dataset doesn't use the normal colon.

Dashes are worse — nine varieties:

"ー" "―" "‐" "–" "—" "−" "〜" "~" "-"

That ー is a katakana vowel lengthener (U+30FC). People use it as a hyphen.
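Both families collapse with a pair of character-class replaces. A sketch; the dash class here is illustrative rather than the parser's full list:

```javascript
// Collapse colon and dash look-alikes to ASCII before parsing.
const normalizePunct = s => s
  .replace(/[：∶︓ː]/g, ":")           // fullwidth / ratio / vertical / triangular colons
  .replace(/[ー―‐–—−〜~~-]/g, "-");  // katakana lengthener, tildes, dash variants
```

(Non-ASCII dashes inside a character class are literals in JavaScript; only the ASCII hyphen forms ranges, so no escaping is needed.)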

Hell #3: "FridaySunday"

To normalize days, I strip 曜日 (day-of-week suffix). Simple enough:

月曜日-金曜日,日曜日  →  月-金,日

But the data contains 金曜日曜 — "Friday" and "Sunday" concatenated without a separator.

月曜-金曜日曜9:00-18:00

Naive stripping of 曜日?:

月曜-金曜日曜        ← input
  ↓ match 曜日?
月-金                ← "金曜日" matched, ate "日" (Sunday)

Sunday vanished.

Fix: lookahead assertion.

.replace(/曜(?:日(?!曜))?/g, "")

Don't eat 日 if it's followed by 曜 (meaning it's the start of the next day name).

But now the result is 月-金日 — "Friday" and "Sunday" are glued together. Another fix:

.replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3")
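The two replaces chained together, as a runnable sketch (the function name is illustrative):

```javascript
// Strip 曜/曜日 without eating a following day name, then re-separate
// days that end up glued together.
const fixDays = s => s
  .replace(/曜(?:日(?!曜))?/g, "")
  .replace(/([月火水木金土日])-([月火水木金土日])([月火水木金土日])/g, "$1-$2・$3");

fixDays("月曜-金曜日曜9:00-18:00");  // "月-金・日9:00-18:00"
```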

Every bug fix spawns a new bug.

Hell #4: Commas Mean Three Different Things

月-金:9:00-18:00,土:9:00-13:00    ← segment separator
月-金:9:00-13:00,14:00-18:00      ← AM/PM split shift
月,水,金:9:00-18:00               ← day enumeration

I started with .split(","). The second pattern splits AM and PM into separate segments — PM has no day, parse fails. The third splits into 月, 水, and 金:9:00-18:00 — Monday and Wednesday have no hours.

You have to determine the meaning of each comma from context.

// Heuristics:
// - "day+time" after comma → new segment
// - "time only" after comma → append to previous segment (split shift)
// - "day only" after comma → merge with next element (enumeration)
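A sketch of those heuristics. The names and test regexes are illustrative, not the actual implementation:

```javascript
// Classify each comma-separated part by what it contains, then regroup.
const HAS_DAY = /[月火水木金土日]|平日|祝/;
const HAS_TIME = /\d{1,2}:\d{2}/;

function splitSegments(s) {
  const segments = [];
  let pendingDays = "";                      // day-only parts waiting for their hours
  for (const part of s.split(",")) {
    const day = HAS_DAY.test(part);
    const time = HAS_TIME.test(part);
    if (day && time) {                       // day+time: a new segment
      segments.push(pendingDays + part);
      pendingDays = "";
    } else if (time && segments.length) {    // time only: split shift, extend previous
      segments[segments.length - 1] += "," + part;
    } else if (day) {                        // day only: enumeration, merge with next
      pendingDays += part + "・";
    }
  }
  return segments;
}
```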

Hell #5: Seven Ways to Separate Days From Hours

月-金:9:00-18:00     ← colon
月-金 9:00-18:00     ← space
月-金9:00-18:00      ← direct concatenation
(月火金)9:00-18:00   ← parentheses
月-金/9:00-18:00     ← slash
[月-金]9:00-18:00    ← brackets
月-金;9:00-18:00     ← semicolon
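All seven can be handled by one regex that lazily takes everything up to the first time token. Illustrative, not the parser's actual code:

```javascript
// Split a segment into its day part and hours part, tolerating
// colon, space, direct concatenation, parentheses, slash, brackets,
// and semicolon as the separator.
function splitDaysHours(seg) {
  const m = seg.match(/^[\[(]?(.*?)[)\]]?[ :\/;]?(\d{1,2}:\d{2}.*)$/);
  return m ? { days: m[1], hours: m[2] } : null;
}
```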

Hell #6: Typos in Time Values

9:00-8:00    ← closing before opening (probably 18:00)
8:30-6:00    ← probably 16:00 or 18:00
80:30-88:50  ← how

If you parse a typo and display "closes at 8:00," someone will show up at 4 PM to an open pharmacy and... wait, no, the opposite. Someone will not show up because they think it's closed.

Validation rejects anything outside 0:00–29:59, falling back to raw data display.

(29:59 because Japan uses "25:00" to mean 1:00 AM the next day for late-night businesses.)
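The validation step can be sketched like this (a hypothetical helper, not the parser's actual function):

```javascript
// Accept the Japanese over-24 clock (25:00 = 1:00 AM next day),
// reject anything outside 0:00-29:59 so typos fall back to raw display.
function isPlausibleTime(t) {
  const m = /^(\d{1,2}):(\d{2})$/.exec(t);
  if (!m) return false;
  const h = Number(m[1]), min = Number(m[2]);
  return h <= 29 && min <= 59;
}

isPlausibleTime("25:00");  // true: late-night notation
isPlausibleTime("80:30");  // false: typo, show raw data instead
```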

The Biggest Design Decision: Wrong Is Worse Than Missing

At 97% coverage, the question becomes: what about the remaining 3%?

The answer: don't parse them. Show the raw data.

This site is a pharmacy search for emergency contraception. Users are under time pressure. Showing "this pharmacy is open on Saturday" when it isn't could mean someone misses their window. That's not recoverable.

Example:

Mon-Fri:9:00-18:00,Sat:???,Sun:9:00-14:00

If Saturday is unparseable, I have two choices:

  1. Skip Saturday, show Mon-Fri + Sun → user might assume "closed on Saturday"
  2. Fail the whole entry, show raw text → user reads it themselves

I chose 2. No partial parses. All or nothing.

The same logic applies to "nth weekday" patterns:

第1・3土曜:9:00-12:00  (= "1st and 3rd Saturday only")

The parser outputs weekly schedules ({day: "Sat", open: "9:00", close: "12:00"}). There's no field for "which week of the month."

First implementation: display as "every Saturday." But that's a lie. Someone showing up on the 2nd Saturday finds a closed pharmacy. For medication access, you don't get to lie.

Fix: skip the entire entry. Saturday info is lost. But a gap is better than a lie.

Holiday Hell: One More Layer

After hitting 97.1%, I noticed another problem:

月-金:9:00-18:00,日祝休み  (= "closed on Sundays and holidays")

The parser doesn't know about holidays. It would show "Open" on a holiday Monday.

I implemented Japan's entire holiday calendar. ~60 lines, zero dependencies.

  • Fixed holidays (New Year's, National Foundation Day, Emperor's Birthday...)
  • Happy Monday holidays (Coming of Age Day, Marine Day, Respect for the Aged Day, Sports Day)
  • Vernal/Autumnal equinox (calculated from an astronomical formula, valid through ~2099)
  • Substitute holidays (if a holiday falls on Sunday, Monday becomes a holiday)
  • Citizen's holidays (a weekday sandwiched between two holidays)

// Vernal equinox formula: March day-of-month, valid through ~2099
const vernal = Math.floor(
  20.8431 + 0.242194 * (year - 1980)
  - Math.floor((year - 1980) / 4)
);
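The Happy Monday rules all reduce to "the nth Monday of a month." A small sketch (a hypothetical helper, not the repo's code):

```javascript
// Day-of-month of the nth Monday of a month (month is 1-12).
function nthMonday(year, month, n) {
  const firstDow = new Date(year, month - 1, 1).getDay(); // 0 = Sunday
  const firstMonday = 1 + ((8 - firstDow) % 7);           // always 1..7
  return firstMonday + (n - 1) * 7;
}

nthMonday(2025, 1, 2);  // 13: Coming of Age Day 2025 was January 13
```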

Coverage Over Time

89.4% → 95.7% → 96.6% → 96.3%(!) → 96.9% → 97.1%

Notice the drop from 96.6% to 96.3%. That's where I stopped lying about "every Saturday" and started skipping nth-weekday entries. Coverage went down. Accuracy went up. Coverage is a quality indicator, not a target.

Why I Stopped at 97.1%

The remaining 2.9%:

月~土:9:00~20:00、お客様感謝デーを除く日・祝:9:00~18:00
("except Customer Appreciation Days")

Natural language conditions.

月/水/金曜日第2/4火曜日第1/3/5土曜日9:00-18:00

Days, ordinals, and times in an undifferentiated blob.

 09:00   19:00
 09:00   19:00

Copy-pasted from Excel. Tab-separated.

Each edge case means a new .replace() that could interfere with the existing 82. Adding one pattern for 3 entries risks regressions across 9,951. Not worth it.

Unparseable pharmacies show their raw data. Users can read Japanese. It's fine.

By The Numbers

Metric            Value
Input records     9,951 (non-empty hours)
Unique formats    6,933
.replace() calls  82
Regex patterns    ~150
Parser code       ~590 lines
Coverage          97.1% (9,659 entries)
Unparseable       292 (2.9%)

Lessons

  1. Free-text fields in government data are hell. Please just use JSON. ["Mon-Fri:9:00-18:00","Sat:9:00-13:00"]. Please.

  2. Normalization rules have order dependencies. Adding one breaks another. Arranging 82 of them in the right order is basically compiler pass ordering.

  3. Coverage is a quality indicator, not a target. The commit that dropped coverage from 96.6% to 96.3% was the most important fix.

  4. For health-related data, "unknown" is 100x better than "wrong." If you can't parse it, show the raw data. Partial parses are a lie factory.

  5. The cost of the last 3% is exponential. 89→95 was character normalization. 95→97 was context-dependent comma parsing and lookahead assertions. 97→99 would mean NLP. Knowing when to stop is a skill.


Repository

odakin / mhlw-ec-pharmacy-finder

Emergency contraception pharmacy finder based on official MHLW data (Japan)

Search for pharmacies that can sell emergency contraception (the morning-after pill)


This repository reshapes the Ministry of Health, Labour and Welfare's published list of pharmacies selling emergency contraception (as a pharmacist-guidance drug) and list of medical institutions (in-person consultation and prescription) into easily searchable CSV / XLSX / JSON, and adds a static web search (GitHub Pages) and a LINE Bot sample.

  • Source (official pages)
  • Latest ingested data: pharmacies 2026-03-10 / medical institutions 2026-02-20
  • Generated artifacts
    • data/ : cleaned data (CSV/XLSX/JSON, original XLSX, geocoding cache)
    • docs/ : static web search (for GitHub Pages; map and business-hours display)
    • line_bot/ : LINE Bot (minimal Node.js sample)
    • scripts/update_data.py : pharmacy data update script (fetches the official XLSX)
    • scripts/update_clinics.py : medical-institution data update script (parses 47 official PDFs)
    • scripts/geocode.py : address → latitude/longitude geocoding (University of Tokyo CSIS API; covers pharmacies and medical institutions)

Important notes (please read)

  • This repository does not provide medical advice.
  • Confirm actual availability, stock, business hours, and sale conditions with each pharmacy.
  • The official pages themselves recommend calling before visiting, since stock can change. Treat the official pages above as the final authority.

1) Web search (GitHub Pages)

Everything under docs/ runs on static files alone.

Publishing

  1. GitHub Settings → Pages
  2. Set Source to "Deploy from a branch"
  3. Set Branch to main and Folder to /docs, then save

The published URL is usually https://<username>.github.io/<repository-name>/. Example: with the repository named mhlw-ec-pharmacy-finder → https://odakin.github.io/mhlw-ec-pharmacy-finder/

Try it locally

cd docs
python -m http.server 8000
# open http://localhost:8000

2) Cleaned data

  • data/mhlw_ec_pharmacies_cleaned_2026-03-10.xlsx
  • data/mhlw_ec_pharmacies_cleaned_2026-03-10.csv (UTF-8 BOM)
  • data/data_2026-03-10.json (for the web app and LINE Bot)

Added columns (examples):

  • 市区町村_推定: municipality estimated from the address string (not perfect)
  • 電話番号_数字: phone number with hyphens etc. stripped, for use in call links
  • 時間外の電話番号_数字: the after-hours phone number, likewise digits only
  • 販売可能薬剤師数_女性 / 販売可能薬剤師数_男性 / 販売可能薬剤師数_答えたくない: the official list's "pharmacists able to sell, by gender (count)" (female / male / prefer not to say)

Web UI filters:

  • Exclude "advance contact required"
  • After-hours support available
  • Female pharmacist on staff
  • Private room available

Web UI features:

  • Map display: search results pinned with Leaflet.js + OpenStreetMap (marker clustering supported; pharmacies = blue, medical institutions = red)
  • Sort by distance

Parser code: docs/app.js | Design doc: docs/HOURS_PARSER.md

Live site: Emergency Contraception Pharmacy Search — searches 10,000+ pharmacies and 2,900+ clinics from Japan's official MHLW data.
