Building a Self-Hosted AI WhatsApp Agent for Structured Invoice Extraction

#ai #automation #node #architecture

As an engineering manager and developer, I constantly look for ways to eliminate repetitive business friction using automation. One of the most common manual bottlenecks is bookkeeping—specifically, reading utility bills or vendor invoices and logging them into financial trackers.

To solve this, I built a self-hosted AI agent as a proof of concept, using only a workflow automation stack with no custom backend. It allows users to simply snap a photo of an invoice, send it over WhatsApp, and have the structured data extracted and logged automatically in seconds.

Here is a breakdown of the system architecture, the n8n nodes involved, and the prompt and code that tie them together.

The System Architecture

A robust automation pipeline requires strict separation of concerns. This entire Proof of Concept (POC) runs without an external Node.js server, relying entirely on a self-hosted orchestration engine:

Ingress: WhatsApp Cloud API webhooks (configured with a temporary developer test number).
Orchestration & Data Flow: Self-hosted n8n environment.
Cognitive Layer: Gemini 1.5 Flash (chosen for its native multimodal capabilities, large context window, and fast inference speeds).
Data Formatting: Native n8n Code Nodes running isolated JavaScript.

[WhatsApp Client] ──> [Cloud Webhook] ──> [n8n HTTP Request (media download)] ──> [Gemini 1.5 Flash] ──> [n8n Code Node (JS)] ──> [Database/Sheet]

Prerequisites & WhatsApp Developer Setup

To build this yourself, you don't need a paid enterprise WhatsApp account immediately. You can start building today using Meta's developer ecosystem:

Create a Developer Account: Head to the Meta for Developers Portal and register.
App Creation: Follow the Meta App Setup Guide to create a new App. Select Other -> Business as your app type.
Add WhatsApp Product: Inside your app dashboard, add the WhatsApp product. Meta will instantly provision a free temporary test number and a test Business Account (WABA) for development.
Configure Webhooks: Point Meta's webhook callback at your n8n production webhook URL, set a verify token, and subscribe to the messages field to start capturing real-time message payloads.
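
Before Meta delivers any messages, it verifies the callback URL with a GET request carrying hub.mode, hub.verify_token, and hub.challenge as query parameters, and expects the challenge echoed back when the token matches. If your webhook node does not handle this for you, the check could look roughly like the sketch below. The function name verifyWebhook and the token value are hypothetical, chosen only for illustration.

```javascript
// Sketch of Meta's webhook verification handshake.
// `verifyWebhook` and `expectedToken` are illustrative names, not part of any API.
function verifyWebhook(query, expectedToken) {
  const mode = query['hub.mode'];
  const token = query['hub.verify_token'];
  const challenge = query['hub.challenge'];

  if (mode === 'subscribe' && token === expectedToken) {
    // Meta expects the challenge value echoed back as the plain response body
    return { status: 200, body: String(challenge) };
  }
  // Anything else is rejected so a stranger cannot register the endpoint
  return { status: 403, body: 'Forbidden' };
}

// Example: the query object a webhook node would expose for the GET request
const result = verifyWebhook(
  {
    'hub.mode': 'subscribe',
    'hub.verify_token': 'my-secret',
    'hub.challenge': '1158201444',
  },
  'my-secret'
);
console.log(result.status, result.body);
```

The token is a value you choose yourself when configuring the webhook in the Meta dashboard, so it belongs in an n8n credential or environment variable rather than in the workflow itself.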

Step 1: Handling Multimodal Input via Webhooks

When a user uploads an invoice image or PDF via WhatsApp, the webhook doesn't deliver the file directly; it delivers a media ID.

We use two HTTP Request calls, both authenticated with your Meta Access Token: the first exchanges the media ID for a short-lived download URL, and the second pulls the binary data from that URL. The file is then held in workflow memory and passed directly to our AI node.
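
To make the media ID step concrete, here is a minimal sketch of pulling the ID out of an incoming payload and describing the lookup request. The helper names extractMedia and mediaLookupRequest, the sample values, and the Graph API version are my assumptions; the payload nesting follows the Cloud API webhook format.

```javascript
// Pull the media reference out of a WhatsApp Cloud API webhook payload.
// Returns null for status updates or message types we do not process.
function extractMedia(payload) {
  const message = payload?.entry?.[0]?.changes?.[0]?.value?.messages?.[0];
  if (!message) return null;
  const media = message[message.type]; // e.g. message.image or message.document
  if (!['image', 'document'].includes(message.type) || !media?.id) return null;
  return { from: message.from, mediaId: media.id, mimeType: media.mime_type };
}

// Describe the first of the two HTTP calls: look up the short-lived download URL.
// GRAPH_VERSION is an assumption; use whichever version your app targets.
const GRAPH_VERSION = 'v19.0';
function mediaLookupRequest(mediaId, accessToken) {
  return {
    method: 'GET',
    url: `https://graph.facebook.com/${GRAPH_VERSION}/${mediaId}`,
    headers: { Authorization: `Bearer ${accessToken}` },
  };
}

// Trimmed-down sample payload with made-up values
const samplePayload = {
  object: 'whatsapp_business_account',
  entry: [{
    changes: [{
      field: 'messages',
      value: {
        messages: [{
          from: '15550001111',
          type: 'image',
          image: { id: '1234567890', mime_type: 'image/jpeg' },
        }],
      },
    }],
  }],
};

const media = extractMedia(samplePayload);
console.log(media, mediaLookupRequest(media.mediaId, 'TEST_TOKEN').url);
```

The lookup response contains a url field; the second request fetches that URL with the same Authorization header to get the actual bytes.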

Step 2: Designing the System Prompt for Document Extraction

Instead of relying on template-based OCR that breaks when a vendor changes its invoice layout, we pass the raw image straight to Gemini 1.5 Flash. The key is instructing the model to act as a strict data parser.

Here is the sample System Prompt structure used in the model configuration:

"You are an expert financial data extraction AI. Your sole task is to analyze the provided invoice or utility bill image and extract data with absolute accuracy.

Follow these strict formatting rules:

Extract the primary Vendor Name as "vendor_name".

Extract the Invoice Date as "invoice_date", normalized to standard YYYY-MM-DD format.

Extract the Grand Total Amount as "grand_total", a pure floating-point number with no currency symbol or thousands separator.

Extract individual line items as "line_items", an array of objects containing "description", "quantity", and "total_price".

Use null for any field you cannot read. Do not include any conversational text, markdown formatting, or wrappers. Your response must be a single, raw, valid JSON object."
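
Prompt instructions alone do not guarantee raw JSON. If you call the Gemini REST API from an HTTP Request node instead of a built-in model node (an assumption on my part about your setup), you can also request application/json as the response MIME type. Below is a rough sketch of the request body; buildGeminiRequest is a hypothetical helper, and the user message text and temperature are illustrative choices.

```javascript
// Build a generateContent request body for the Gemini REST API.
// Intended for a POST to
// https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent
// `buildGeminiRequest` is a hypothetical helper name.
function buildGeminiRequest(systemPrompt, fileBase64, mimeType) {
  return {
    // The extraction rules travel as the system instruction
    systemInstruction: { parts: [{ text: systemPrompt }] },
    contents: [
      {
        role: 'user',
        parts: [
          // The invoice itself, base64-encoded, with its MIME type from WhatsApp
          { inlineData: { mimeType, data: fileBase64 } },
          { text: 'Extract the invoice data.' },
        ],
      },
    ],
    generationConfig: {
      temperature: 0, // illustrative: keep extraction as repeatable as possible
      responseMimeType: 'application/json', // ask for JSON rather than prose
    },
  };
}

// Example with placeholder bytes standing in for a real image
const body = buildGeminiRequest(
  'You are an expert financial data extraction AI. ...',
  Buffer.from('fake-image-bytes').toString('base64'),
  'image/jpeg'
);
console.log(JSON.stringify(body.generationConfig));
```

Even with the JSON MIME type set, I would still parse defensively downstream, since a malformed or truncated response is always possible.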

Step 3: Formatting Data with n8n Code Nodes (JavaScript)

Once Gemini returns the text string, we avoid spinning up an external server or application layer. Instead, we pipe the model's output directly into an internal n8n Code Node configured to execute JavaScript.

This node isolates the parsing logic, falls back to safe defaults when fields are missing, and maps the properties to the shape our target database or sheet expects.

Here is the exact layout of the JavaScript snippet inside the n8n Code Node:

// Loop through incoming items from the Gemini node
for (const item of $input.all()) {
  try {
    // Strip a Markdown code fence if the model added one despite the prompt
    const raw = String(item.json.output ?? '').trim();
    const cleaned = raw.replace(/^`{3}(?:json)?\s*/i, '').replace(/\s*`{3}$/, '');

    // Parse the raw string response from the AI model
    const extractedData = JSON.parse(cleaned);
    const total = parseFloat(extractedData.grand_total);

    // Format and return the details explicitly for downstream nodes
    item.json.formattedInvoice = {
      vendor: extractedData.vendor_name || 'Unknown Vendor',
      // Keep a missing date empty instead of recording today's date as the invoice date
      date: extractedData.invoice_date || null,
      total: Number.isFinite(total) ? total : 0.0,
      lineItems: Array.isArray(extractedData.line_items) ? extractedData.line_items : []
    };
  } catch (error) {
    // Handle parsing errors safely if the response was malformed
    item.json.formattedInvoice = {
      vendor: 'Parsing Error',
      error: error.message,
      rawOutput: item.json.output
    };
  }
}

return $input.all();

By keeping this step inside the internal JavaScript node, the data mapping stays fast and self-contained, and each item's input and output can be inspected directly in the n8n execution logs.
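
Parsing successfully is not the same as being correct, so a second check before the write can catch obvious extraction mistakes. Here is a minimal sketch that validates the date format and compares the line items against the total. The names isIsoDate and validateInvoice, the total_price key on each line item, and the 0.01 tolerance are all my assumptions.

```javascript
// True only for real calendar dates written as YYYY-MM-DD
function isIsoDate(value) {
  if (typeof value !== 'string' || !/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  // The round-trip comparison rejects values like 2024-02-30 that roll over
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

// Sanity checks for a formatted invoice object ({ vendor, date, total, lineItems }).
// Returns a list of problems instead of throwing, so the workflow can route
// questionable invoices to manual review.
function validateInvoice(invoice, tolerance = 0.01) {
  const problems = [];
  const total = Number(invoice.total);

  if (!isIsoDate(invoice.date)) problems.push('date is not a valid YYYY-MM-DD value');
  if (!Number.isFinite(total) || total <= 0) problems.push('total is missing or not positive');

  if (Number.isFinite(total) && Array.isArray(invoice.lineItems) && invoice.lineItems.length > 0) {
    // `total_price` is an assumed key name for each line item
    const sum = invoice.lineItems.reduce((acc, li) => acc + (Number(li.total_price) || 0), 0);
    if (Math.abs(sum - total) > tolerance) {
      problems.push(`line items sum to ${sum.toFixed(2)}, not ${total.toFixed(2)}`);
    }
  }
  return { valid: problems.length === 0, problems };
}

const check = validateInvoice({
  vendor: 'Acme Utilities',
  date: '2024-03-15',
  total: 30,
  lineItems: [
    { description: 'Water', quantity: 1, total_price: 10 },
    { description: 'Sewer', quantity: 1, total_price: 20 },
  ],
});
console.log(check);
```

A sum mismatch can be legitimate when tax or fees appear outside the line items, so I would treat that particular problem as a review flag rather than a hard failure.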

Handling Edge Cases in Production

Building a POC is easy; making it production-ready is where the real engineering begins. When deploying this system, you must design around:

Rate Limiting: Managing concurrent incoming webhooks from active users by placing a lightweight Redis queue ahead of the API calls.
Data Security: Ensuring that processed invoice data is deleted from local temporary server storage immediately after database insertion to protect PII.
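
To illustrate the queueing idea, here is a minimal in-memory sketch that caps how many jobs run at once and makes the rest wait in FIFO order. This is a stand-in, not the Redis setup itself: createLimiter and fakeApiCall are hypothetical names, and a Redis-backed queue would replace the waiting array so that jobs survive restarts and can be shared across workers.

```javascript
// At most `limit` jobs run concurrently; extra jobs wait in arrival order.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active += 1;
    const { job, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(job) // run the job, catching synchronous throws as rejections
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next(); // a slot freed up, start the next waiting job
      });
  };

  // Returns a promise that settles with the job's own result
  return (job) =>
    new Promise((resolve, reject) => {
      waiting.push({ job, resolve, reject });
      next();
    });
}

// Demo: five simulated API calls, never more than two in flight
const limiter = createLimiter(2);
let running = 0;
let peak = 0;
const fakeApiCall = (id) => async () => {
  running += 1;
  peak = Math.max(peak, running);
  await new Promise((r) => setTimeout(r, 10)); // stands in for the model call
  running -= 1;
  return id;
};

const demo = Promise.all([1, 2, 3, 4, 5].map((id) => limiter(fakeApiCall(id)))).then(
  (results) => {
    console.log(results, 'peak concurrency:', peak);
    return { results, peak };
  }
);
```

The same shape applies with a real queue: the webhook only enqueues and acknowledges quickly, while a worker with a fixed concurrency drains the queue and calls the model.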

What's Next?

This pattern shows that a capable AI agent can be built entirely inside a visual workflow manager, given a strict system prompt and a small amount of inline JavaScript. Next up, I am scaling this to run automated cross-matching against incoming banking transactions.

The full workflow JSON export will be open-sourced on my GitHub profile soon. Follow along as I map out more production-ready AI engineering architectures.