Bryan Doss

De-identifying HIPAA PHI Using Local LLMs with Ollama

Recent advancements in smaller, more efficient language models have made it possible to run sophisticated NLP tasks entirely on consumer-grade hardware. Models like the distilled DeepSeek-R1 variants (7B/14B), Phi-4 (14B), and especially the new Mistral Small 3 (24B) can now perform complex tasks locally with impressive accuracy. This is a game-changer for sensitive data processing, as it eliminates the need for expensive cloud infrastructure or specialized hardware.

In this post, I'll show you how I used these local LLMs with Ollama to identify and remove Protected Health Information (PHI) from unstructured medical texts while maintaining data consistency - all running on standard desktop GPUs.

[Image: de-identified patient chart, example output of local LLM processing]

What is HIPAA PHI?

HIPAA (Health Insurance Portability and Accountability Act) defines 18 types of PHI that need to be removed or modified for data to be considered de-identified. This includes obvious identifiers like names and addresses, but also less obvious ones like device identifiers and biometric data.
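For quick reference, the 18 Safe Harbor identifier categories can be kept as a checklist in code. The wording below is paraphrased; the regulation text at 45 CFR § 164.514(b)(2) is authoritative.

```python
# The 18 HIPAA Safe Harbor identifier categories (paraphrased).
SAFE_HARBOR_IDENTIFIERS = [
    "Names",
    "Geographic subdivisions smaller than a state",
    "Dates (except year) directly related to an individual; ages over 89",
    "Telephone numbers",
    "Fax numbers",
    "Email addresses",
    "Social Security numbers",
    "Medical record numbers",
    "Health plan beneficiary numbers",
    "Account numbers",
    "Certificate/license numbers",
    "Vehicle identifiers and serial numbers, including license plates",
    "Device identifiers and serial numbers",
    "Web URLs",
    "IP addresses",
    "Biometric identifiers, including finger and voice prints",
    "Full-face photographs and comparable images",
    "Any other unique identifying number, characteristic, or code",
]
```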

Why Use Local LLMs?

While there are many cloud-based solutions for PHI de-identification, using local models has several advantages:

  • Data never leaves your environment
  • No API costs or rate limits
  • Faster processing with lower latency
  • Full control over the model and process

The best part? All of this runs on consumer-grade GPUs. I've successfully tested this solution on two NVIDIA RTX 3060s with 20GB of VRAM combined. However, you can run it on even older GPUs, provided you have enough VRAM - or run entirely on CPU, with longer processing times.

The Solution: Ollama + Mistral

My solution uses Ollama running the Mistral Small 3 model locally to identify PHI elements. Here's how it works:

  1. I create a detailed prompt that teaches the model about HIPAA PHI elements
  2. Process the text in chunks to maintain context
  3. Build a consistent mapping of PHI elements to replacement tokens
  4. Replace the identified PHI with standardized tokens
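The four steps above can be sketched as a single driver function. This is a simplified outline, not the actual implementation: the LLM call is stubbed out with a hardcoded lookup, and chunking is a plain string split.

```python
def identify_phi_stub(chunk: str) -> dict[str, str]:
    # Stand-in for the LLM call; the real pipeline queries Ollama here.
    known = {"John A. Smith": "NAME_1", "09/28/1975": "DOB_1"}
    return {phi: token for phi, token in known.items() if phi in chunk}

def process_document(text: str, delimiter: str = "\n---\n") -> tuple[str, dict]:
    complete: dict[str, str] = {}
    # Step 2: process the text in chunks; step 3: build one consistent mapping.
    for chunk in text.split(delimiter):
        for phi, token in identify_phi_stub(chunk).items():
            complete.setdefault(phi, token)  # first token seen for a PHI wins
    # Step 4: replace each identified PHI string with its bracketed token.
    out = text
    for phi, token in complete.items():
        out = out.replace(phi, f"[{token}]")
    return out, complete
```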

The Core Components

Before trying to replicate this, make sure you have Ollama installed and that you've pulled Mistral Small 3: `ollama pull mistral-small:24b`.
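You can verify the Ollama server is up and the model is pulled with a quick check against its local API; the `/api/tags` endpoint lists installed models. This helper is my addition for convenience, not part of the original code.

```python
import requests

def ollama_model_available(model: str, host: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server is reachable and has `model` pulled."""
    try:
        resp = requests.get(f"{host}/api/tags", timeout=5)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    names = [m.get("name", "") for m in resp.json().get("models", [])]
    return any(n.startswith(model) for n in names)
```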

Let's break down the key parts of the implementation:

Only relevant code blocks included below - see full code on GitHub.

1. First, I create a detailed prompt that specifies exactly what PHI elements to look for:
def create_hipaa_prompt() -> str:
    return '''You are a HIPAA compliance expert. Analyze the text provided and identify all HIPAA Safe Harbor PHI elements including:
    - Names related to patient
    - Addresses or geographic subdivisions smaller than state
    - Dates (except years) and DOBs
    - Ages over 89
    [... other PHI elements ...]'''
2. I call Ollama's API to process the text:
import requests

def call_ollama(text: str, prompt: str, model: str = "mistral-small:24b"):
    full_prompt = f"{prompt}\n\nText to analyze:\n{text}"
    response = requests.post('http://localhost:11434/api/generate',
        json={
            "model": model,
            "options": {
                "temperature": 0.2,
            },
            "prompt": full_prompt,
            "stream": False
        }
    )
    ...
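The elided part of `call_ollama` needs to pull the model's text out of the response and parse it as JSON. Ollama's `/api/generate` endpoint returns the generated text in a `response` field; extracting the JSON object from any surrounding prose is my assumption here, since models sometimes wrap JSON in commentary. A defensive parse might look like this:

```python
import json
import re

def parse_phi_response(raw: str) -> dict[str, str]:
    """Extract the first JSON object from the model's raw text output."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # Drop empty keys/values so downstream replacement can't corrupt text.
    return {k: v for k, v in parsed.items() if k and v}
```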
3. The model returns a JSON mapping of PHI elements to standardized tokens:
{
  "John A. Smith": "NAME_1",
  "123 Main St, Chicago": "ADDRESS_1",
  "09/28/1975": "DOB_1"
}
4. I maintain consistency by tracking used tokens and incrementing numbers as needed:
import re

def deidentify_text(text: str, complete_phi_mappings: dict):
    phi_mappings = call_ollama(text, prompt)
    ...
    # De-duplicate: drop PHI keys already mapped in earlier chunks
    keys_to_remove = [key for key in phi_mappings if key in complete_phi_mappings]
    for key in keys_to_remove:
        del phi_mappings[key]
    ...
    # Increment token suffixes until each token is unique
    for key, value in phi_mappings.items():
        temp_value = value
        while temp_value in complete_phi_mappings.values() or list(phi_mappings.values()).count(temp_value) > 1:
            # If it doesn't end in a number, add _1; else increment the number
            if not re.search(r'_(\d+)$', temp_value):
                new_value = temp_value + "_1"
            else:
                new_value = re.sub(r'_(\d+)$', lambda x: f"_{int(x.group(1)) + 1}", temp_value)
            temp_value = new_value
        phi_mappings[key] = temp_value
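The incrementing logic is easier to see (and test) when factored into a small helper. This is my refactor for illustration, not the original code:

```python
import re

def next_free_token(token: str, taken: set[str]) -> str:
    """Bump the trailing _N suffix until the token is unused."""
    while token in taken:
        if not re.search(r"_(\d+)$", token):
            token = token + "_1"
        else:
            token = re.sub(r"_(\d+)$", lambda m: f"_{int(m.group(1)) + 1}", token)
    return token
```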
5. Finally, I write a function to do the final de-identification:
def sanitize_text(text: str, complete_phi_mappings: dict) -> str:
    deidentified_text = text
    for phi, replacement in complete_phi_mappings.items():
        deidentified_text = deidentified_text.replace(phi, f"[{replacement}]")

    return deidentified_text
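One caveat with plain `str.replace`: if one PHI string is a substring of another (say, "John Smith" inside "John Smith Jr."), replacement order matters. Sorting the mapping by key length, longest first, avoids partial replacements. This hardening is my addition, not part of the original code:

```python
def sanitize_text_safe(text: str, complete_phi_mappings: dict) -> str:
    """Replace longer PHI strings first so substrings don't clobber them."""
    for phi in sorted(complete_phi_mappings, key=len, reverse=True):
        text = text.replace(phi, f"[{complete_phi_mappings[phi]}]")
    return text
```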

Output

Here's a snippet of the diff output from one of my test documents:

[Image: PHI diff (all data is synthetic, generated using GPT-4o)]

Obviously, if this data were in a structured, tabular format, we could label individual columns as PHI or not. But this is unstructured text, which means the solution can apply to many different types of medical data.

Best Practices and Considerations

From my testing, I've found these practices to be crucial:

  1. Consistency: Process the entire document to build a complete mapping before making replacements. This ensures that the same PHI element gets the same replacement token throughout.

  2. Validation: Always validate the model's output. Include checks for empty keys/values and handle edge cases.

  3. Performance: Process text in manageable chunks while maintaining context. This helps avoid token limits and improves accuracy.

  4. Model Selection: Choose an appropriate model/prompt/context size combo. I found that smaller models like mistral-small:24b can be faster while still maintaining good accuracy for this task given the right prompting and document chunking.
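The chunking in point 3 can be as simple as splitting on a document delimiter and packing the pieces under a size budget. A character budget is used below as a rough stand-in for token counting, which is an assumption on my part; the delimiter default matches the `'\n---\n'` separator used elsewhere in this post.

```python
def chunk_text(text: str, delimiter: str = "\n---\n", max_chars: int = 4000) -> list[str]:
    """Split on the delimiter, then pack pieces into chunks under max_chars."""
    chunks, current = [], ""
    for piece in text.split(delimiter):
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = piece
        else:
            current = current + delimiter + piece if current else piece
    if current:
        chunks.append(current)
    return chunks
```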

Example Usage

Here's how to use the code:

import json

# Load your medical text
with open("patient_chart.txt") as f:
    sample_text = f.read()

# Process the document
deidentified_text, phi_mappings = process_document(sample_text, '\n---\n')

# Save the results
with open("deidentified_text.txt", "w") as f:
    f.write(deidentified_text)
with open("phi_mappings.json", "w") as f:
    json.dump(phi_mappings, f, indent=2)

Future Improvements

I'm planning to enhance this solution with:

  1. Creating specialized prompts for different models to compare performance (ideally smaller parameter count models that can run on even more systems)
  2. Implementing confidence scores for each identified PHI element
  3. Building a simple web interface for easy text processing

Conclusion

Using local LLMs for PHI de-identification offers a powerful, flexible, and secure solution for healthcare data processing. What excites me most is that this entire solution runs locally on consumer hardware - no cloud required. The approach I've described here provides a solid foundation that you can build upon for your specific needs.

The complete code is available in the GitHub repository. Feel free to try it out and let me know how it works for your use case!


Remember: While this solution can help with PHI identification, always have qualified personnel review the results before using de-identified data in production.


You can find me on LinkedIn | CTO & Partner @ EES.
