Malik B. Parker

Posted on Mar 16

Bypassing 2FA in Web Scraping: Why iMessage is a Local SQL Database and How That Changes Everything

#webscraping #ai #database #python

The Problem

Two-factor authentication kills web scraping. Every automated login flow hits a wall when the site says "we just sent you a code." Traditional approaches fall into two camps:

Email-based: Spin up an IMAP listener, poll for the email, parse the code. Requires an LLM or regex to extract from HTML emails. Adds 10-30 seconds of latency and another API call.
Authenticator apps (TOTP): Easy if you have the secret key. Most credential managers expose TOTP generation. But many sites only offer SMS or email — no TOTP option.

SMS is the worst of both worlds. You'd think you need a Twilio number, a webhook endpoint, some cloud infrastructure to receive and parse incoming texts. Or you do what most people do: give up and mark the site as "can't automate."

But if you're running on a Mac with iMessage, there's a third option that's almost embarrassingly simple.

iMessage is Just a SQLite Database

Here's the thing nobody talks about: every iMessage and SMS message that syncs to your Mac is stored in a plain SQLite database at:

~/Library/Messages/chat.db

That's it. No API. No authentication. No webhook. Just a .db file sitting on your filesystem. You can open it with sqlite3 right now and query your entire message history:

SELECT m.text, h.id as sender, m.date
FROM message m
JOIN handle h ON m.handle_id = h.ROWID
WHERE m.is_from_me = 0
ORDER BY m.date DESC
LIMIT 5;

When a 2FA SMS arrives on your phone, iMessage syncs it to your Mac within seconds. It lands in chat.db as a new row. No polling an external service, no waiting for email delivery, no parsing HTML. Just a SQL query.

The Architecture Decision: Why macOS?

When I was building Bill Analyzer — an agent-based bill scraper that logs into utility company portals, handles authentication, and extracts billing data — I made a deliberate choice to target macOS as the runtime environment. Here's why:

The scraping stack already assumes a desktop environment. We're using Playwright with a visible Chromium instance (stealth mode, anti-detection). This isn't a headless cloud scraper — it's a local automation tool that needs to handle complex SPAs, Angular dashboards, and JavaScript-heavy login flows. A Mac with a display is already the natural home for this.

iMessage gives you SMS 2FA for free. If you have an iPhone paired with your Mac, every SMS verification code lands in chat.db within 2-5 seconds. No infrastructure needed. No third-party services. No API keys.

The credential pipeline stays local. We pull credentials from 1Password via the op CLI, fill them directly into the browser via Playwright (the agent never sees them), and now SMS codes flow through the same local-only pipeline. Nothing leaves the machine.

How the SMS Tool Works

The implementation is ~100 lines of Python. Here's the flow:

1. Baseline Snapshot (Before Login)

Before the agent starts clicking anything, we capture the ROWID of the most recent message in chat.db:

def _get_latest_message_rowid() -> int:
    conn = sqlite3.connect(CHAT_DB)
    row = conn.execute("SELECT MAX(ROWID) FROM message").fetchone()
    conn.close()
    return row[0] or 0

This is our baseline. Any message with a ROWID higher than this arrived after we started the session.

2. Agent Triggers SMS

The agent navigates the 2FA page — clicks "Send via text," clicks "Send Code." These are normal click actions, recorded for replay caching just like any other step.

3. Poll for the Code

When the agent calls fill_sms_code, we poll chat.db for new messages:

def _get_sms_code_after(after_rowid, sender, code_pattern):
    conn = sqlite3.connect(CHAT_DB)
    cursor = conn.execute("""
        SELECT m.text, h.id as sender_id
        FROM message m
        JOIN handle h ON m.handle_id = h.ROWID
        WHERE m.ROWID > ?
          AND h.id LIKE ?
          AND m.is_from_me = 0
        ORDER BY m.date DESC
        LIMIT 5
    """, (after_rowid, f"%{sender}%"))

    for row in cursor:
        match = re.search(code_pattern, row["text"])
        if match:
            return match.group(1), row["sender_id"]
    return None

We poll every 2 seconds, up to 60 seconds. In practice, the code arrives in 3-5 seconds.

4. Fill and Continue

The code gets filled directly into the verification input via Playwright. The agent never sees the code value — it flows opaquely, just like credentials. The tool records a FillSmsCodeStep with the discovered sender number for replay caching.

The Race Condition Problem

The naive approach has a subtle race condition. The timeline looks like this:

Agent clicks "Send Code"
SMS is dispatched by the 2FA service
Agent processes the click result, decides to call fill_sms_code
fill_sms_code starts polling

The code might arrive at step 2, before the poll starts at step 4. If you use a timestamp-based filter ("only messages from the last 5 seconds"), you'll miss it.

The ROWID approach eliminates this. We snapshot the highest ROWID before the agent starts any actions — before step 1. Any message arriving after that point, whether at step 2, 3, or 4, will have a higher ROWID. We catch it regardless of when the poll starts.

The only remaining edge case: what if the SMS arrives before the baseline snapshot? This would mean the SMS arrived before we even started the login flow — which doesn't happen in practice. But as a safety net, we could subtract a small buffer from the baseline ROWID if needed.

Staying PII-Safe

A key design constraint: the agent should never see sensitive data. The SMS tool follows the same opaque pattern as credential filling:

fill_secure_credential: Fetches username/password from 1Password, fills directly into the page. Agent sees only "Filled 'password' into #password."
fill_sms_code: Reads the code from iMessage, fills directly into the page. Agent sees only "Filled SMS verification code into #verificationCode."

The code value exists only in the tool's execution scope. It never enters the LLM context. The recorded cache step stores only the sender number and selector — never the code itself.

For page content, everything the agent sees goes through Presidio PII redaction. Names, addresses, account numbers, SSNs, phone numbers — all replaced with placeholders. Dollar amounts and dates are preserved because those are the extraction targets.

Replay: Zero-LLM 2FA on Subsequent Runs

The first run uses an LLM agent to navigate the login flow. Every action is recorded:

{"action": "click", "selector": "#chooseText"},
{"action": "click", "selector": "#sendCodeButton"},
{"action": "fill_sms_code", "selector": "#verificationCode", "sender": "69525"},
{"action": "click", "selector": "#continueButton"}

On subsequent runs, the replay engine executes these steps with pure Playwright — no LLM. When it hits the fill_sms_code step, it polls iMessage using the cached sender number. The only wait is for the SMS to arrive (~3-5 seconds).

Total replay cost: $0.00 in LLM calls. The entire login + 2FA + extraction happens with Playwright + SQLite + regex.

What Makes This Work

The approach works because of a few converging factors:

iMessage is a local database. Apple doesn't expose an API for reading messages programmatically, but they don't need to — the data is right there in SQLite. Full Disk Access permission is the only gate.
SMS 2FA codes are predictable. They're always 4-8 digit numbers in a short text message. A simple \b(\d{4,8})\b regex catches them reliably.
The sender is consistent. PECO always sends from 69525. Once discovered on the first run, we cache it and filter by sender on replay — no false positives from other messages.
ROWID is monotonically increasing. SQLite ROWIDs are sequential. This gives us a reliable "messages after this point" filter without dealing with timestamp formats, timezone issues, or Apple's nanosecond epoch offset.

Limitations

macOS only. This requires a Mac with iMessage syncing enabled. No Linux, no cloud VMs (unless you're running macOS VMs, which has its own licensing implications).
Full Disk Access required. The terminal or Python process needs FDA permission to read chat.db. This is a one-time system preference toggle.
SMS sync latency. iMessage typically syncs within 2-5 seconds, but network conditions or iCloud delays could extend this. The 60-second polling timeout handles edge cases.
Not all 2FA is SMS. Sites that only offer email or authenticator apps need different approaches. Email could use a similar local technique with Mail.app's database, but that's a future project.

The Bigger Picture

The real insight here isn't about iMessage specifically — it's about treating local system databases as APIs. Your Mac is full of SQLite databases that applications use for storage: Messages, Safari history, Contacts, Notes, Calendar. When you need data from these apps programmatically, you don't need to build integrations or scrape UIs. You just query the database.

For automated web scraping that needs to handle 2FA, this collapses what would normally be a complex infrastructure problem (SMS webhook service, message parsing, code extraction) into five lines of SQL and a polling loop.

This is part of an ongoing series on building agent-based web scrapers with PII safety. Previously: PII Redaction for AI Pipelines.

DEV Community