<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Malik B. Parker</title>
    <description>The latest articles on DEV Community by Malik B. Parker (@mparker25).</description>
    <link>https://dev.to/mparker25</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825588%2Fc43b66b3-2873-467f-a1eb-2b8ec1dc54a4.jpeg</url>
      <title>DEV Community: Malik B. Parker</title>
      <link>https://dev.to/mparker25</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mparker25"/>
    <language>en</language>
    <item>
      <title>Bypassing 2FA in Web Scraping: Why iMessage is a Local SQL Database and How That Changes Everything</title>
      <dc:creator>Malik B. Parker</dc:creator>
      <pubDate>Mon, 16 Mar 2026 23:46:07 +0000</pubDate>
      <link>https://dev.to/mparker25/bypassing-2fa-in-web-scraping-why-imessage-is-a-local-sql-database-and-how-that-changes-everything-2olg</link>
      <guid>https://dev.to/mparker25/bypassing-2fa-in-web-scraping-why-imessage-is-a-local-sql-database-and-how-that-changes-everything-2olg</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Two-factor authentication kills web scraping. Every automated login flow hits a wall when the site says "we just sent you a code." Traditional approaches fall into two camps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Email-based&lt;/strong&gt;: Spin up an IMAP listener, poll for the email, parse the code. Requires an LLM or regex to extract from HTML emails. Adds 10-30 seconds of latency and another API call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authenticator apps (TOTP)&lt;/strong&gt;: Easy if you have the secret key. Most credential managers expose TOTP generation. But many sites only offer SMS or email — no TOTP option.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SMS looks like the hardest case of all. You'd think you need a Twilio number, a webhook endpoint, and some cloud infrastructure to receive and parse incoming texts. Or you do what most people do: give up and mark the site as "can't automate."&lt;/p&gt;

&lt;p&gt;But if you're running on a Mac with iMessage, there's a third option that's almost embarrassingly simple.&lt;/p&gt;

&lt;h2&gt;
  
  
  iMessage is Just a SQLite Database
&lt;/h2&gt;

&lt;p&gt;Here's the thing nobody talks about: every iMessage and SMS message that syncs to your Mac is stored in a plain SQLite database at:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~/Library/Messages/chat.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No API. No authentication. No webhook. Just a &lt;code&gt;.db&lt;/code&gt; file sitting on your filesystem. Once your terminal has Full Disk Access, you can open it with &lt;code&gt;sqlite3&lt;/code&gt; and query your entire message history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;handle&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ROWID&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_from_me&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a 2FA SMS arrives on your phone, iMessage syncs it to your Mac within seconds (this assumes Text Message Forwarding to the Mac is enabled on the iPhone). It lands in &lt;code&gt;chat.db&lt;/code&gt; as a new row. No polling an external service, no waiting for email delivery, no parsing HTML. Just a SQL query.&lt;/p&gt;
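
&lt;p&gt;For illustration, the same query can be run from Python. This is a minimal sketch rather than the Bill Analyzer code: &lt;code&gt;latest_incoming&lt;/code&gt; is a hypothetical helper, and opening the file read-only through a SQLite URI is an assumption about how to avoid modifying Messages data. It also assumes the process already has Full Disk Access.&lt;/p&gt;

```python
import sqlite3
from pathlib import Path

# Location given in the article; reading it requires Full Disk Access.
CHAT_DB = Path.home() / "Library" / "Messages" / "chat.db"


def latest_incoming(db_path=CHAT_DB, limit=5):
    """Return (text, sender) tuples for the newest incoming messages."""
    # mode=ro opens the database read-only, so this sketch cannot write to it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return conn.execute(
            """
            SELECT m.text, h.id
            FROM message m
            JOIN handle h ON m.handle_id = h.ROWID
            WHERE m.is_from_me = 0
            ORDER BY m.date DESC
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
    finally:
        conn.close()
```

&lt;p&gt;Calling &lt;code&gt;latest_incoming()&lt;/code&gt; with no arguments reads the default path. Passing a different path makes the helper easy to try against a copy of the database.&lt;/p&gt;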

&lt;h2&gt;
  
  
  The Architecture Decision: Why macOS?
&lt;/h2&gt;

&lt;p&gt;When I was building &lt;a href="https://github.com/..." rel="noopener noreferrer"&gt;Bill Analyzer&lt;/a&gt; — an agent-based bill scraper that logs into utility company portals, handles authentication, and extracts billing data — I made a deliberate choice to target macOS as the runtime environment. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scraping stack already assumes a desktop environment.&lt;/strong&gt; We're using Playwright with a visible Chromium instance (stealth mode, anti-detection). This isn't a headless cloud scraper — it's a local automation tool that needs to handle complex SPAs, Angular dashboards, and JavaScript-heavy login flows. A Mac with a display is already the natural home for this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iMessage gives you SMS 2FA for free.&lt;/strong&gt; If you have an iPhone paired with your Mac, every SMS verification code lands in &lt;code&gt;chat.db&lt;/code&gt; within 2-5 seconds. No infrastructure needed. No third-party services. No API keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The credential pipeline stays local.&lt;/strong&gt; We pull credentials from 1Password via the &lt;code&gt;op&lt;/code&gt; CLI, fill them directly into the browser via Playwright (the agent never sees them), and now SMS codes flow through the same local-only pipeline. Nothing leaves the machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the SMS Tool Works
&lt;/h2&gt;

&lt;p&gt;The implementation is ~100 lines of Python. Here's the flow:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Baseline Snapshot (Before Login)
&lt;/h3&gt;

&lt;p&gt;Before the agent starts clicking anything, we capture the ROWID of the most recent message in &lt;code&gt;chat.db&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_latest_message_rowid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CHAT_DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT MAX(ROWID) FROM message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is our baseline. Any message with a ROWID higher than this arrived &lt;em&gt;after&lt;/em&gt; we started the session.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Agent Triggers SMS
&lt;/h3&gt;

&lt;p&gt;The agent navigates the 2FA page — clicks "Send via text," clicks "Send Code." These are normal click actions, recorded for replay caching just like any other step.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Poll for the Code
&lt;/h3&gt;

&lt;p&gt;When the agent calls &lt;code&gt;fill_sms_code&lt;/code&gt;, we poll &lt;code&gt;chat.db&lt;/code&gt; for new messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_sms_code_after&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;after_rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code_pattern&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
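    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CHAT_DB&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# sqlite3.Row lets us read columns by name, e.g. row["text"]
&lt;/span&gt;    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sqlite3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Row&lt;/span&gt;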
    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT m.text, h.id as sender_id
        FROM message m
        JOIN handle h ON m.handle_id = h.ROWID
        WHERE m.ROWID &amp;gt; ?
          AND h.id LIKE ?
          AND m.is_from_me = 0
        ORDER BY m.date DESC
        LIMIT 5
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;after_rowid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# text can be NULL for some rows, so fall back to an empty string
&lt;/span&gt;            &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code_pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sender_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We poll every 2 seconds, up to 60 seconds. In practice, the code arrives in 3-5 seconds.&lt;/p&gt;
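
&lt;p&gt;The polling loop itself is not shown in the article, so here is one way it could look. This is a sketch under assumptions: &lt;code&gt;poll_for_code&lt;/code&gt; and its parameters are hypothetical names, and &lt;code&gt;fetch_code&lt;/code&gt; stands for any zero-argument callable that returns a result or &lt;code&gt;None&lt;/code&gt;. The defaults mirror the 2-second interval and 60-second timeout described above.&lt;/p&gt;

```python
import time


def poll_for_code(fetch_code, interval=2.0, timeout=60.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch_code() until it returns a value or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        result = fetch_code()
        if result is not None:
            return result          # code found, stop polling
        if clock() >= deadline:
            return None            # give up after the timeout
        sleep(interval)            # wait before the next attempt
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; and &lt;code&gt;clock&lt;/code&gt; is a design choice that keeps the loop testable without real waiting. A monotonic clock also avoids surprises if the system time changes mid-poll.&lt;/p&gt;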

&lt;h3&gt;
  
  
  4. Fill and Continue
&lt;/h3&gt;

&lt;p&gt;The code gets filled directly into the verification input via Playwright. The agent never sees the code value — it flows opaquely, just like credentials. The tool records a &lt;code&gt;FillSmsCodeStep&lt;/code&gt; with the discovered sender number for replay caching.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Race Condition Problem
&lt;/h2&gt;

&lt;p&gt;The naive approach has a subtle race condition. The timeline looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent clicks "Send Code"&lt;/li&gt;
&lt;li&gt;SMS is dispatched by the 2FA service&lt;/li&gt;
&lt;li&gt;Agent processes the click result, decides to call &lt;code&gt;fill_sms_code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fill_sms_code&lt;/code&gt; starts polling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The code might arrive at step 2, &lt;em&gt;before&lt;/em&gt; the poll starts at step 4. If you use a timestamp-based filter ("only messages from the last 5 seconds"), you'll miss it whenever the agent takes longer than that window to reach step 4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ROWID approach eliminates this.&lt;/strong&gt; We snapshot the highest ROWID &lt;em&gt;before the agent starts any actions&lt;/em&gt; — before step 1. Any message arriving after that point, whether at step 2, 3, or 4, will have a higher ROWID. We catch it regardless of when the poll starts.&lt;/p&gt;

&lt;p&gt;The only remaining edge case: what if the SMS arrives before the baseline snapshot? This would mean the SMS arrived before we even started the login flow — which doesn't happen in practice. But as a safety net, we could subtract a small buffer from the baseline ROWID if needed.&lt;/p&gt;
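
&lt;p&gt;The timeline can be reproduced with a toy in-memory database. This is only a sketch: the single-column &lt;code&gt;message&lt;/code&gt; table is a simplified stand-in for the real &lt;code&gt;chat.db&lt;/code&gt; schema, and the message text is made up.&lt;/p&gt;

```python
import sqlite3

# Toy stand-in for chat.db (assumed, heavily simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message (ROWID INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT)"
)
conn.execute("INSERT INTO message (text) VALUES ('older chat message')")

# Snapshot taken before step 1, before the agent clicks anything.
baseline = conn.execute("SELECT MAX(ROWID) FROM message").fetchone()[0] or 0

# Step 2: the SMS lands while the agent is still deciding what to do next.
conn.execute("INSERT INTO message (text) VALUES ('Your code is 482913')")

# Step 4: polling starts late, but the ROWID filter still finds the message.
new_rows = conn.execute(
    "SELECT text FROM message WHERE ROWID > ?", (baseline,)
).fetchall()
print(new_rows)  # [('Your code is 482913',)]
```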

&lt;h2&gt;
  
  
  Staying PII-Safe
&lt;/h2&gt;

&lt;p&gt;A key design constraint: the agent should never see sensitive data. The SMS tool follows the same opaque pattern as credential filling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fill_secure_credential&lt;/code&gt;&lt;/strong&gt;: Fetches username/password from 1Password, fills directly into the page. Agent sees only "Filled 'password' into #password."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fill_sms_code&lt;/code&gt;&lt;/strong&gt;: Reads the code from iMessage, fills directly into the page. Agent sees only "Filled SMS verification code into #verificationCode."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code value exists only in the tool's execution scope. It never enters the LLM context. The recorded cache step stores only the &lt;em&gt;sender number&lt;/em&gt; and &lt;em&gt;selector&lt;/em&gt; — never the code itself.&lt;/p&gt;

&lt;p&gt;For page content, everything the agent sees goes through Presidio PII redaction. Names, addresses, account numbers, SSNs, phone numbers — all replaced with placeholders. Dollar amounts and dates are preserved because those are the extraction targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Replay: Zero-LLM 2FA on Subsequent Runs
&lt;/h2&gt;

&lt;p&gt;The first run uses an LLM agent to navigate the login flow. Every action is recorded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"selector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#chooseText"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"selector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#sendCodeButton"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fill_sms_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"selector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#verificationCode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"sender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"69525"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"click"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"selector"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#continueButton"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On subsequent runs, the replay engine executes these steps with pure Playwright — no LLM. When it hits the &lt;code&gt;fill_sms_code&lt;/code&gt; step, it polls iMessage using the cached sender number. The only wait is for the SMS to arrive (~3-5 seconds).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total replay cost: $0.00 in LLM calls.&lt;/strong&gt; The entire login + 2FA + extraction happens with Playwright + SQLite + regex.&lt;/p&gt;
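
&lt;p&gt;To make the replay idea concrete, here is a minimal dispatcher over the recorded steps. It is a sketch, not the actual replay engine: &lt;code&gt;replay&lt;/code&gt; and &lt;code&gt;get_sms_code&lt;/code&gt; are hypothetical names, and &lt;code&gt;page&lt;/code&gt; is assumed to be any object with &lt;code&gt;click(selector)&lt;/code&gt; and &lt;code&gt;fill(selector, value)&lt;/code&gt; methods, which a Playwright page provides.&lt;/p&gt;

```python
import json

# The recorded steps from the first run, wrapped in a list.
RECORDED = json.loads("""
[
  {"action": "click", "selector": "#chooseText"},
  {"action": "click", "selector": "#sendCodeButton"},
  {"action": "fill_sms_code", "selector": "#verificationCode", "sender": "69525"},
  {"action": "click", "selector": "#continueButton"}
]
""")


def replay(steps, page, get_sms_code):
    """Run recorded steps against a page object; no LLM involved."""
    for step in steps:
        if step["action"] == "click":
            page.click(step["selector"])
        elif step["action"] == "fill_sms_code":
            # get_sms_code is expected to poll for a code from this sender.
            code = get_sms_code(step["sender"])
            if code is None:
                raise TimeoutError("no SMS code received from " + step["sender"])
            page.fill(step["selector"], code)
        else:
            raise ValueError("unknown action: " + step["action"])
```

&lt;p&gt;Because the code value is passed straight from &lt;code&gt;get_sms_code&lt;/code&gt; into &lt;code&gt;page.fill&lt;/code&gt;, it never needs to be logged or stored alongside the recorded steps.&lt;/p&gt;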

&lt;h2&gt;
  
  
  What Makes This Work
&lt;/h2&gt;

&lt;p&gt;The approach works because of a few converging factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;iMessage is a local database.&lt;/strong&gt; Apple doesn't expose an API for reading messages programmatically, but they don't need to — the data is right there in SQLite. Full Disk Access permission is the only gate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;SMS 2FA codes are predictable.&lt;/strong&gt; They're almost always 4-8 digit numbers in a short text message. A simple &lt;code&gt;\b(\d{4,8})\b&lt;/code&gt; regex catches them reliably.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The sender is consistent.&lt;/strong&gt; PECO always sends from &lt;code&gt;69525&lt;/code&gt;. Once discovered on the first run, we cache it and filter by sender on replay — no false positives from other messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ROWID is monotonically increasing.&lt;/strong&gt; SQLite assigns each newly inserted message a ROWID higher than any row already in the table. This gives us a reliable "messages after this point" filter without dealing with timestamp formats, timezone issues, or Apple's nanosecond epoch offset.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
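
&lt;p&gt;Two of these points are easy to try in isolation. The sketch below applies the regex quoted above and shows the date conversion that the ROWID filter lets you skip. The helper names are hypothetical, the sample messages are made up, and the conversion assumes &lt;code&gt;message.date&lt;/code&gt; holds nanoseconds since 2001-01-01 UTC, which may not hold on older macOS versions.&lt;/p&gt;

```python
import re
from datetime import datetime, timedelta, timezone

CODE_PATTERN = r"\b(\d{4,8})\b"  # pattern quoted in the article
APPLE_EPOCH = datetime(2001, 1, 1, tzinfo=timezone.utc)


def extract_code(text):
    """Return the first standalone 4-8 digit run in text, or None."""
    match = re.search(CODE_PATTERN, text or "")
    return match.group(1) if match else None


def apple_date_to_utc(raw):
    """Convert a raw date value (assumed nanoseconds since 2001-01-01 UTC)."""
    return APPLE_EPOCH + timedelta(microseconds=raw // 1000)


print(extract_code("Your verification code is 482913"))  # 482913
print(extract_code("Call 18005551234 for help"))         # None
print(apple_date_to_utc(0).isoformat())                  # 2001-01-01T00:00:00+00:00
```

&lt;p&gt;The word boundaries are what keep the 11-digit phone number from matching. A message containing two short numbers would still return only the first, which is one reason filtering by sender matters.&lt;/p&gt;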

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS only.&lt;/strong&gt; This requires a Mac with iMessage syncing enabled. No Linux, no cloud VMs (unless you're running macOS VMs, which has its own licensing implications).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full Disk Access required.&lt;/strong&gt; The terminal or Python process needs FDA permission to read &lt;code&gt;chat.db&lt;/code&gt;. This is a one-time system preference toggle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SMS sync latency.&lt;/strong&gt; iMessage typically syncs within 2-5 seconds, but network conditions or iCloud delays could extend this. The 60-second polling timeout handles edge cases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not all 2FA is SMS.&lt;/strong&gt; Sites that only offer email or authenticator apps need different approaches. Email could use a similar local technique with Mail.app's database, but that's a future project.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;The real insight here isn't about iMessage specifically — it's about &lt;strong&gt;treating local system databases as APIs.&lt;/strong&gt; Your Mac is full of SQLite databases that applications use for storage: Messages, Safari history, Contacts, Notes, Calendar. When you need data from these apps programmatically, you don't need to build integrations or scrape UIs. You just query the database.&lt;/p&gt;

&lt;p&gt;For automated web scraping that needs to handle 2FA, this collapses what would normally be a complex infrastructure problem (SMS webhook service, message parsing, code extraction) into five lines of SQL and a polling loop.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of an ongoing series on building agent-based web scrapers with PII safety. Previously: &lt;a href="https://dev.to/mparker25/how-to-strip-sensitive-data-before-it-hits-your-llm-3pd9"&gt;PII Redaction for AI Pipelines&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
      <category>database</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Strip Sensitive Data Before It Hits Your LLM</title>
      <dc:creator>Malik B. Parker</dc:creator>
      <pubDate>Sun, 15 Mar 2026 16:46:50 +0000</pubDate>
      <link>https://dev.to/mparker25/how-to-strip-sensitive-data-before-it-hits-your-llm-3pd9</link>
      <guid>https://dev.to/mparker25/how-to-strip-sensitive-data-before-it-hits-your-llm-3pd9</guid>
      <description>&lt;p&gt;You built an AI agent that logs into your bank, navigates to billing, and extracts your bill amount. Smart. But now Claude is reading your full name, home address, account numbers, and partial SSN — all sent through an API you don't control. That's not a pipeline. That's a liability.&lt;/p&gt;

&lt;p&gt;Here's how I solved it with four regex patterns and an open-source library most people have never heard of.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Context
&lt;/h2&gt;

&lt;p&gt;I'm building &lt;strong&gt;Bill Analyzer&lt;/strong&gt; — an agentic system that automatically logs into utility and financial sites, navigates to billing pages, and extracts what I owe and when it's due. It uses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; for browser automation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude (Haiku)&lt;/strong&gt; as the AI agent for navigation and extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1Password CLI&lt;/strong&gt; for credential management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Login Agent&lt;/strong&gt; — navigates login flows with credentials handled opaquely (the agent never sees passwords)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract Agent&lt;/strong&gt; — reads post-login pages to find billing data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The extract agent needs to &lt;em&gt;read the page&lt;/em&gt; to find dollar amounts and due dates. But those pages also contain names, addresses, SSNs, account numbers — PII that has no business being sent to any external API.&lt;/p&gt;

&lt;p&gt;The constraint: &lt;strong&gt;the agent must understand the page well enough to extract billing data, without ever seeing personal information.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters Beyond My Project
&lt;/h2&gt;

&lt;p&gt;Any time you're feeding real user data into an LLM — customer support transcripts, medical records, financial documents, scraped web content — you face the same problem. The model needs the &lt;em&gt;structure&lt;/em&gt; and &lt;em&gt;relevant content&lt;/em&gt;, not the &lt;em&gt;identity&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This isn't just good practice. Depending on your industry and jurisdiction, it can be a GDPR, HIPAA, or CCPA compliance requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tool: Microsoft Presidio
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/microsoft/presidio" rel="noopener noreferrer"&gt;Presidio&lt;/a&gt; is Microsoft's open-source PII detection and anonymization library. It combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;spaCy NER&lt;/strong&gt; (Named Entity Recognition) for names and locations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern-based recognizers&lt;/strong&gt; (regex) for structured PII like SSNs, phone numbers, credit cards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom recognizers&lt;/strong&gt; you can add for domain-specific patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;presidio-analyzer presidio-anonymizer
python &lt;span class="nt"&gt;-m&lt;/span&gt; spacy download en_core_web_lg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basic usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnalyzerEngine&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_anonymizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AnonymizerEngine&lt;/span&gt;

&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnalyzerEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;anonymizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnonymizerEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John Smith, (555) 123-4567, john@example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redacted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anonymizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anonymize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analyzer_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redacted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# &amp;lt;PERSON&amp;gt;, &amp;lt;PHONE_NUMBER&amp;gt;, &amp;lt;EMAIL_ADDRESS&amp;gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Presidio Catches Out of the Box
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PERSON&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;John Smith&lt;/td&gt;
&lt;td&gt;spaCy NER&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PHONE_NUMBER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(555) 123-4567&lt;/td&gt;
&lt;td&gt;Regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EMAIL_ADDRESS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:john@example.com"&gt;john@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREDIT_CARD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4111-1111-1111-1111&lt;/td&gt;
&lt;td&gt;Regex + Luhn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;US_SSN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;123-45-6789&lt;/td&gt;
&lt;td&gt;Regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;US_BANK_NUMBER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4829184729&lt;/td&gt;
&lt;td&gt;Regex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LOCATION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Springfield (city names only)&lt;/td&gt;
&lt;td&gt;spaCy NER&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What It Misses: Street Addresses
&lt;/h2&gt;

&lt;p&gt;This is where I hit a wall. Presidio's &lt;code&gt;LOCATION&lt;/code&gt; entity catches city names &lt;em&gt;sometimes&lt;/em&gt;, but full street addresses? Completely invisible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IN:  John Smith, 123 Main St Springfield IL 62701, Balance: $142.37
OUT: &amp;lt;PERSON&amp;gt;, 123 Main St Springfield IL 62701, Balance: $142.37
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                Address passes through unredacted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bare addresses are even worse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IN:  3498 Ebenezer Ave
OUT: 3498 Ebenezer Ave    ← completely missed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a billing extraction pipeline, this is unacceptable. Your home address is on every utility bill page.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Four Custom Regex Patterns
&lt;/h2&gt;

&lt;p&gt;US street addresses follow predictable patterns. I built a custom &lt;code&gt;PatternRecognizer&lt;/code&gt; with four patterns, ordered from most specific (highest confidence) to least:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;presidio_analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PatternRecognizer&lt;/span&gt;

&lt;span class="n"&gt;US_STATES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|DC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;STREET_SUFFIXES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;address_recognizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PatternRecognizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;supported_entity&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ADDRESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;patterns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Full: "123 Main St, Apt 2, Springfield, IL 62701"
&lt;/span&gt;        &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us_address_full&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\d{{1,5}}\s+[\w\s.]+?(?:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STREET_SUFFIXES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[,.\s]+(?:[\w\s#.,]+[,.\s]+)?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;US_STATES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)\s+\d{{5}}(?:-\d{{4}})?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# 2. With state: "123 Main St, Springfield, IL"
&lt;/span&gt;        &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us_address_state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\d{{1,5}}\s+[\w\s.]+?(?:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STREET_SUFFIXES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[,.\s]+(?:[\w\s#.,]+[,.\s]+)?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;US_STATES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# 3. With ZIP: "123 Main St, Springfield 62701"
&lt;/span&gt;        &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us_address_zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\d{{1,5}}\s+[\w\s.]+?(?:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STREET_SUFFIXES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[,.\s]+[\w\s.,#]+\d{{5}}(?:-\d{{4}})?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mf"&gt;0.6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# 4. Bare: "3498 Ebenezer Ave" or "789 Elm Dr, Apt 4"
&lt;/span&gt;        &lt;span class="nc"&gt;Pattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us_address_street_only&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\d{{1,5}}\s+[\w\s.]+?(?:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;STREET_SUFFIXES&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)\b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;rf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(?:[,.\s]+(?:Apt|Suite|Ste|Unit|#)\s*[\w-]+)?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;street&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mailing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;billing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;home&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;residence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AnalyzerEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_recognizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;address_recognizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
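&lt;p&gt;To check which tier a given string lands in without spinning up Presidio, the same regexes can be exercised with Python's standard &lt;code&gt;re&lt;/code&gt; module. This is only a rough sketch: &lt;code&gt;TIERS&lt;/code&gt;, &lt;code&gt;PREFIX&lt;/code&gt;, &lt;code&gt;MIDDLE&lt;/code&gt;, and &lt;code&gt;best_tier&lt;/code&gt; are names made up for illustration, it assumes case-insensitive matching (which I believe is Presidio's default for pattern recognizers), and it returns the first tier that matches rather than reproducing how Presidio scores and merges overlapping results:&lt;/p&gt;

```python
import re

US_STATES = (
    "AL|AK|AZ|AR|CA|CO|CT|DE|FL|GA|HI|ID|IL|IN|IA|KS|KY|LA|ME|MD|"
    "MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|OH|OK|OR|PA|RI|SC|"
    "SD|TN|TX|UT|VT|VA|WA|WV|WI|WY|DC"
)
STREET_SUFFIXES = (
    r"St(?:reet)?|Ave(?:nue)?|Blvd|Boulevard|Rd|Road|Dr(?:ive)?|"
    r"Ln|Lane|Way|Ct|Court|Pl(?:ace)?|Cir(?:cle)?|Pkwy|Parkway|"
    r"Ter(?:race)?|Hwy|Highway|Loop|Run|Path|Trail"
)

# Shared building blocks (hypothetical names, same regex text as above).
PREFIX = rf"\d{{1,5}}\s+[\w\s.]+?(?:{STREET_SUFFIXES})\b"
MIDDLE = r"[,.\s]+(?:[\w\s#.,]+[,.\s]+)?"

# (name, regex, score) in the same order as the recognizer above.
TIERS = [
    ("us_address_full",
     PREFIX + MIDDLE + rf"(?:{US_STATES})\s+\d{{5}}(?:-\d{{4}})?", 0.85),
    ("us_address_state",
     PREFIX + MIDDLE + rf"(?:{US_STATES})\b", 0.7),
    ("us_address_zip",
     PREFIX + r"[,.\s]+[\w\s.,#]+\d{5}(?:-\d{4})?", 0.6),
    ("us_address_street_only",
     PREFIX + r"(?:[,.\s]+(?:Apt|Suite|Ste|Unit|#)\s*[\w-]+)?", 0.4),
]

def best_tier(text):
    """Return (name, score) of the first tier that matches, or None."""
    for name, regex, score in TIERS:
        if re.search(regex, text, re.IGNORECASE):
            return name, score
    return None

print(best_tier("123 Main St Springfield IL 62701"))  # ('us_address_full', 0.85)
print(best_tier("3498 Ebenezer Ave"))                 # ('us_address_street_only', 0.4)
```

&lt;p&gt;A check like this is handy as a quick regression test when a suffix or state list changes, but the real scores still come from Presidio, including any context boost.&lt;/p&gt;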



&lt;h3&gt;
  
  
  How the patterns work
&lt;/h3&gt;

&lt;p&gt;All four share a common prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;\d{1,5}\s+[\w\s.]+?(?:STREET_SUFFIXES)\b
│          │              │
│          │              └─ Street suffix (St, Ave, Blvd...)
│          └─ Street name (lazy match)
└─ House number (1-5 digits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
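&lt;p&gt;To see the shared prefix in isolation, here is a minimal sketch that uses only Python's standard &lt;code&gt;re&lt;/code&gt; module instead of Presidio. The shortened suffix list and the names &lt;code&gt;SUFFIXES&lt;/code&gt;, &lt;code&gt;PREFIX&lt;/code&gt;, and &lt;code&gt;find_street&lt;/code&gt; are assumptions for illustration, and &lt;code&gt;re.IGNORECASE&lt;/code&gt; is set on the assumption that Presidio matches case-insensitively by default:&lt;/p&gt;

```python
import re

# Shortened suffix list for illustration; the recognizer above uses a longer one.
SUFFIXES = r"St(?:reet)?|Ave(?:nue)?|Blvd|Rd|Road|Dr(?:ive)?|Way"

# House number, lazy street name, street suffix, word boundary.
PREFIX = re.compile(rf"\d{{1,5}}\s+[\w\s.]+?(?:{SUFFIXES})\b", re.IGNORECASE)

def find_street(text):
    """Return the first street-like match in text, or None."""
    m = PREFIX.search(text)
    return m.group(0) if m else None

print(find_street("Service address: 3498 Ebenezer Ave. Your balance is $142.37"))
# 3498 Ebenezer Ave
print(find_street("Your bill of $200.00 is due May 1. No address here."))
# None
```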



&lt;p&gt;&lt;strong&gt;Key regex decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lazy &lt;code&gt;+?&lt;/code&gt;&lt;/strong&gt; on the street name stops the prefix at the first street suffix it finds. A greedy &lt;code&gt;+&lt;/code&gt; would run on to the last suffix in the same stretch of words, digits, spaces, and periods, so &lt;code&gt;123 Main St near Oak Ave&lt;/code&gt; would match as a single address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word boundary &lt;code&gt;\b&lt;/code&gt;&lt;/strong&gt; after the suffix prevents partial matches like "Driveways" matching &lt;code&gt;Drive&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context words&lt;/strong&gt; (&lt;code&gt;"address"&lt;/code&gt;, &lt;code&gt;"billing"&lt;/code&gt;, &lt;code&gt;"mailing"&lt;/code&gt;) boost confidence for borderline matches — a bare &lt;code&gt;3498 Ebenezer Ave&lt;/code&gt; at 0.4 confidence gets boosted when near the word "address".&lt;/li&gt;
&lt;/ul&gt;
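&lt;p&gt;The first two decisions are easy to check with plain &lt;code&gt;re&lt;/code&gt;. The sketch below is a toy comparison, not the recognizer itself: the three-suffix list and the &lt;code&gt;lazy&lt;/code&gt;, &lt;code&gt;greedy&lt;/code&gt;, and &lt;code&gt;no_boundary&lt;/code&gt; variants are made up for the demo, and the sample sentences are invented:&lt;/p&gt;

```python
import re

SUFFIXES = r"St(?:reet)?|Ave(?:nue)?|Dr(?:ive)?"  # shortened for the demo

# Same prefix shape as above, plus two deliberately weakened variants.
lazy = re.compile(rf"\d{{1,5}}\s+[\w\s.]+?(?:{SUFFIXES})\b", re.IGNORECASE)
greedy = re.compile(rf"\d{{1,5}}\s+[\w\s.]+(?:{SUFFIXES})\b", re.IGNORECASE)
no_boundary = re.compile(rf"\d{{1,5}}\s+[\w\s.]+?(?:{SUFFIXES})", re.IGNORECASE)

def first_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

two_streets = "Meet at 123 Main St near Oak Ave tomorrow"
print(first_match(lazy, two_streets))    # 123 Main St
print(first_match(greedy, two_streets))  # 123 Main St near Oak Ave

plural = "We sealed 12 long Driveways today"
print(first_match(lazy, plural))         # None
print(first_match(no_boundary, plural))  # 12 long Drive
```

&lt;p&gt;The greedy variant swallows the second street, and the variant without &lt;code&gt;\b&lt;/code&gt; cuts "Driveways" down to a fake "Drive".&lt;/p&gt;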




&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IN:  3498 Ebenezer Ave
OUT: &amp;lt;ADDRESS&amp;gt;

IN:  123 Main St
OUT: &amp;lt;ADDRESS&amp;gt;

IN:  789 Elm Drive, Apt 4
OUT: &amp;lt;ADDRESS&amp;gt;

IN:  Service address: 3498 Ebenezer Ave. Your balance is $142.37
OUT: Service address: &amp;lt;ADDRESS&amp;gt;. Your balance is $142.37

IN:  John Smith, 123 Main St Springfield IL 62701, Balance: $142.37
OUT: &amp;lt;PERSON&amp;gt;, &amp;lt;ADDRESS&amp;gt;, Balance: $142.37

IN:  Jane Doe, 456 Oak Avenue, Apt 2B, Chicago, IL 60601, Amount: $98.50
OUT: &amp;lt;PERSON&amp;gt;, &amp;lt;ADDRESS&amp;gt;, Amount: $98.50

IN:  789 Broadway New York NY 10003, Phone: (555) 123-4567
OUT: &amp;lt;ADDRESS&amp;gt;, Phone: &amp;lt;PHONE_NUMBER&amp;gt;

IN:  1234 W Elm Blvd, Suite 100, Los Angeles CA 90001
OUT: &amp;lt;ADDRESS&amp;gt;

IN:  55 Park Dr, Unit 3, Denver CO 80202. Your bill is $75.00 due May 1.
OUT: &amp;lt;ADDRESS&amp;gt;. Your bill is $75.00 due May 1.

IN:  Your bill of $200.00 is due May 1. No address here.
OUT: Your bill of $200.00 is due May 1. No address here.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In these ten test cases, every address is caught, every dollar amount and date is preserved, and the non-address text produces no false positives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Survives Redaction (By Design)
&lt;/h2&gt;

&lt;p&gt;The whole point is that the LLM still gets what it needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;$142.37&lt;/code&gt;&lt;/strong&gt; — the bill amount (not PII)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;April 15, 2026&lt;/code&gt;&lt;/strong&gt; — the due date (not PII)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Balance Due:&lt;/code&gt;&lt;/strong&gt; — structural labels (not PII)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;&amp;lt;PERSON&amp;gt;&lt;/code&gt;&lt;/strong&gt; — the agent knows a name &lt;em&gt;was&lt;/em&gt; there without knowing &lt;em&gt;whose&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The redacted text preserves enough structure for the AI to do its job while stripping everything that identifies the person.&lt;/p&gt;




&lt;h2&gt;
  
  
  Known Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PO Box addresses&lt;/strong&gt; (&lt;code&gt;PO Box 1234, Springfield, IL&lt;/code&gt;) — not covered, would need another pattern&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;International addresses&lt;/strong&gt; — US only; other countries need separate recognizers&lt;/li&gt;
&lt;li&gt;
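&lt;strong&gt;No-suffix streets&lt;/strong&gt; (&lt;code&gt;123 Broadway&lt;/code&gt; works only because it ends in "way", which the case-insensitive match reads as the &lt;code&gt;Way&lt;/code&gt; suffix; &lt;code&gt;123 Maple&lt;/code&gt; does not)&lt;/li&gt;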
&lt;li&gt;
&lt;strong&gt;Account numbers&lt;/strong&gt; — Presidio sometimes tags these as &lt;code&gt;PHONE_NUMBER&lt;/code&gt; due to digit patterns; close enough for redaction purposes&lt;/li&gt;
&lt;/ul&gt;
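&lt;p&gt;For the PO Box gap, a starting point might look like the sketch below. It is hypothetical and untested against real bills: the &lt;code&gt;PO_BOX&lt;/code&gt; regex and &lt;code&gt;find_po_box&lt;/code&gt; helper are not part of my recognizer, and the six-digit cap on the box number is a guess:&lt;/p&gt;

```python
import re

# Hypothetical PO Box pattern; not part of the recognizer above.
# Accepts "PO Box 1234", "P.O. Box 1234", and "Post Office Box 1234".
PO_BOX = re.compile(
    r"\b(?:P\.?\s?O\.?|Post\s+Office)\s+Box\s+\d{1,6}\b",
    re.IGNORECASE,
)

def find_po_box(text):
    """Return the first PO Box match in text, or None."""
    m = PO_BOX.search(text)
    return m.group(0) if m else None

print(find_po_box("Remit to PO Box 1234, Springfield, IL"))  # PO Box 1234
print(find_po_box("Mail: P.O. Box 77"))                      # P.O. Box 77
print(find_po_box("Your bill of $200.00 is due May 1."))     # None
```

&lt;p&gt;If it holds up, the regex could presumably go into the recognizer as a fifth &lt;code&gt;Pattern&lt;/code&gt; entry, with a score that would need tuning.&lt;/p&gt;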




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;PII redaction isn't just about regex. It's an architectural decision. In my pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Login phase&lt;/strong&gt; — the agent never sees credentials (opaque tool fills them, snapshots redact filled fields)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Navigation phase&lt;/strong&gt; — the agent sees only links and buttons, not page content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction phase&lt;/strong&gt; — the agent sees Presidio-redacted text (this article)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three layers of privacy enforcement. The LLM is powerful but sandboxed. It can read a page without knowing who you are.&lt;/p&gt;

&lt;p&gt;If you're building any pipeline that sends real-world data through an LLM, ask yourself: &lt;strong&gt;does the model actually need to see the PII to do its job?&lt;/strong&gt; Usually the answer is no.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/presidio" rel="noopener noreferrer"&gt;Microsoft Presidio GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://microsoft.github.io/presidio/tutorial/08_no_code/" rel="noopener noreferrer"&gt;Presidio Custom Recognizers Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://spacy.io/models/en" rel="noopener noreferrer"&gt;spaCy NER Models&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>security</category>
      <category>python</category>
      <category>ai</category>
      <category>webscraping</category>
    </item>
  </channel>
</rss>
