NITE CHRIP Data Conversion
Regulatory data provided by the Chemical Substance Risk Information Platform (CHRIP) is published in XML format. In this project, we designed a process to convert this data to CSV format and then import it into FTS5.
import xml.etree.ElementTree as ET
import pandas as pd
def xml_to_csv(xml_path, output_csv):
tree = ET.parse(xml_path)
root = tree.getroot()
records = []
for item in root.findall('item'):
record = {
'cas_number': item.find('cas_number').text if item.find('cas_number') is not None else '',
'substance_name': item.find('substance_name').text if item.find('substance_name') is not None else '',
'regulation_id': item.find('regulation_id').text if item.find('regulation_id') is not None else '',
'regulation_text': item.find('regulation_text').text if item.find('regulation_text') is not None else ''
}
records.append(record)
df = pd.DataFrame(records)
df.to_csv(output_csv, index=False, encoding='utf-8')
In a WSL2 Ubuntu environment, it is best practice to unify internal processing with UTF-8 and perform encoding conversion only when necessary for integration with external systems.
FTS5 Index Construction
We leverage SQLite's FTS5 extension to enable high-speed searching of chemical substance names and regulatory documents.
CREATE VIRTUAL TABLE chemicals_fts USING fts5(
cas_number,
substance_name,
regulation_text,
tokenize='simple'
);
-- CSVから一括インポート
.mode csv
.import /tmp/chrip.csv chemicals_fts
Since FTS5 is a virtual table, you cannot directly create a regular B-tree index on it. If high-speed searching by CAS number is required, an effective method is to create a separate regular table that links CAS numbers to FTS5 rowids and then index that table.
Visualization with Streamlit
We built a Streamlit application to display CAS number search results.
import streamlit as st
from google import genai
client = genai.Client()
if 'cas_cache' not in st.session_state:
st.session_state.cas_cache = {}
def get_cas_analysis(cas_number, substance_text):
if cas_number in st.session_state.cas_cache:
return st.session_state.cas_cache[cas_number]
response = client.models.generate_content(
model="gemini-2.5-flash",
contents=f"以下の化学物質の法規制情報を要約してください。\n{substance_text}"
)
st.session_state.cas_cache[cas_number] = response.text
return response.text
cas_input = st.text_input('CAS番号を入力してください')
if cas_input:
# SQLite FTS5で検索
results = search_fts5(cas_input)
st.write(f"検索結果: {len(results)}件")
Automatic CAS Number Extraction
We implemented a template to automatically extract CAS numbers from SDS (Safety Data Sheets).
from google import genai
client = genai.Client()
def extract_cas_from_sds(sds_text):
prompt = """以下のSDS文書からCAS番号を抽出してください。
複数存在する場合はカンマ区切りで返してください。
抽出対象は「CAS No.」または「CAS番号」表記のみとします。
文書: """ + sds_text
response = client.models.generate_content(
model="gemini-2.5-flash",
contents=prompt
)
return response.text.strip()
Industry Implementation Examples
You can build a workflow that allows uploading product SDS, automatically extracting CAS numbers, cross-referencing them with a regulatory database, and visualizing non-compliance risks. Integration with a patent database also enables analysis of patent status for similar substances.
Summary
This dashboard streamlines chemical substance regulatory compliance by enabling high-speed searching of NITE CHRIP regulatory data with FTS5 and combining it with AI analysis via the Gemini API. For handling large datasets, optimizing SQLite's page size (pragma page_size) and loading data in chunks are effective.
This article was generated by Nemotron-Nano-9B-v2-Japanese and formatted and validated by Gemini 2.5 Flash.
Top comments (0)