A Regulatory Analysis Dashboard for Fast Searching NITE CHRIP Data using FTS5

#devtools #python #productivity

NITE CHRIP Data Conversion

Regulatory data provided by the Chemical Substance Risk Information Platform (CHRIP) is published in XML format. In this project, we designed a process to convert this data to CSV format and then import it into FTS5.

import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_csv(xml_path, output_csv):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    records = []
    for item in root.findall('item'):
        record = {
            'cas_number': item.find('cas_number').text if item.find('cas_number') is not None else '',
            'substance_name': item.find('substance_name').text if item.find('substance_name') is not None else '',
            'regulation_id': item.find('regulation_id').text if item.find('regulation_id') is not None else '',
            'regulation_text': item.find('regulation_text').text if item.find('regulation_text') is not None else ''
        }
        records.append(record)
    df = pd.DataFrame(records)
    df.to_csv(output_csv, index=False, encoding='utf-8')

In a WSL2 Ubuntu environment, it is best practice to unify internal processing with UTF-8 and perform encoding conversion only when necessary for integration with external systems.

FTS5 Index Construction

We leverage SQLite's FTS5 extension to enable high-speed searching of chemical substance names and regulatory documents.

CREATE VIRTUAL TABLE chemicals_fts USING fts5(
  cas_number,
  substance_name,
  regulation_text,
  tokenize='simple'
);

-- CSVから一括インポート
.mode csv
.import /tmp/chrip.csv chemicals_fts

Since FTS5 is a virtual table, you cannot directly create a regular B-tree index on it. If high-speed searching by CAS number is required, an effective method is to create a separate regular table that links CAS numbers to FTS5 rowids and then index that table.

Visualization with Streamlit

We built a Streamlit application to display CAS number search results.

import streamlit as st
from google import genai

client = genai.Client()

if 'cas_cache' not in st.session_state:
    st.session_state.cas_cache = {}

def get_cas_analysis(cas_number, substance_text):
    if cas_number in st.session_state.cas_cache:
        return st.session_state.cas_cache[cas_number]
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"以下の化学物質の法規制情報を要約してください。\n{substance_text}"
    )
    st.session_state.cas_cache[cas_number] = response.text
    return response.text

cas_input = st.text_input('CAS番号を入力してください')
if cas_input:
    # SQLite FTS5で検索
    results = search_fts5(cas_input)
    st.write(f"検索結果: {len(results)}件")

Automatic CAS Number Extraction

We implemented a template to automatically extract CAS numbers from SDS (Safety Data Sheets).

from google import genai

client = genai.Client()

def extract_cas_from_sds(sds_text):
    prompt = """以下のSDS文書からCAS番号を抽出してください。
複数存在する場合はカンマ区切りで返してください。
抽出対象は「CAS No.」または「CAS番号」表記のみとします。
文書: """ + sds_text

    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt
    )
    return response.text.strip()

Industry Implementation Examples

You can build a workflow that allows uploading product SDS, automatically extracting CAS numbers, cross-referencing them with a regulatory database, and visualizing non-compliance risks. Integration with a patent database also enables analysis of patent status for similar substances.

Summary

This dashboard streamlines chemical substance regulatory compliance by enabling high-speed searching of NITE CHRIP regulatory data with FTS5 and combining it with AI analysis via the Gemini API. For handling large datasets, optimizing SQLite's page size (pragma page_size) and loading data in chunks are effective.

This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted and validated by Gemini 2.5 Flash.