Pirate Prentice

n8n Information Extractor Node: Extract Structured Data from Text Using AI — Free Workflow JSON

The Information Extractor node is one of n8n's most practical AI nodes. You give it unstructured text — an email, a PDF snippet, a support ticket, a scraped web page — and it uses a language model to pull out exactly the fields you define. No regex. No brittle parsing. Just structured JSON out.

This guide covers how the node works, how to define a schema, what the output looks like, the 6 gotchas that trip people up, and 3 production-ready workflow patterns with free JSON.


What the Information Extractor Node Does

The Information Extractor node wraps LangChain's structured output capability. You define a schema (field names, types, and descriptions), pass in a text input, and the connected Chat Model fills in the schema. Each input item produces one output item that carries the extracted fields.

Typical inputs:

  • Email body text
  • PDF or document content (after passing through Extract From File)
  • Support ticket descriptions
  • Web page text (after HTML Extract or HTTP Request)
  • Form submission free-text fields
  • API response descriptions

Typical outputs:

  • Structured records ready for a database insert
  • Enriched lead data
  • Classified + tagged items
  • Parsed addresses, dates, names, amounts

Node Wiring

The Information Extractor is an AI node — it sits inside n8n's LangChain sub-graph and requires a Chat Model sub-node connected to it.

[Your text source]
       ↓
[Information Extractor] ← Chat Model connection ← [OpenAI / Anthropic / Gemini Chat Model]
       ↓
[Next node: uses extracted fields]

Required connection: Chat Model (any: OpenAI, Anthropic Claude, Google Gemini, Ollama, etc.)

Optional connections: none. The Chat Model is the only sub-node this node accepts; conversation memory belongs to nodes like the AI Agent and is not needed for extraction.

Schema Definition

This is where most of the work happens. You define your extraction schema inside the node's Attributes section.

Adding Attributes

Each attribute has:

  • Name — the output field key (e.g., customer_name, invoice_total)
  • Type: String, Number, Boolean, or Date. For nested objects, switch the schema type from the attribute list to a JSON example or a JSON Schema.
  • Description — critical: the LLM uses this to understand what to extract. Be specific.
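
To make the name / type / description triplet concrete, here is a minimal Python sketch of how such an attribute list could be turned into a JSON Schema object for a model to fill in. This is an illustration only: the `ATTRIBUTES` list and the `attributes_to_json_schema` helper are hypothetical, and n8n's internal conversion may look different.

```python
from typing import Any

# Hypothetical attribute list mirroring the Name / Type / Description fields above.
ATTRIBUTES = [
    {
        "name": "customer_name",
        "type": "string",
        "description": "Full name of the customer as written in the text",
        "required": True,
    },
    {
        "name": "invoice_total",
        "type": "number",
        "description": "Total invoice amount as a number, excluding currency symbols",
    },
]


def attributes_to_json_schema(attributes: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a JSON Schema object from a flat attribute list."""
    properties = {
        attr["name"]: {"type": attr["type"], "description": attr["description"]}
        for attr in attributes
    }
    # Attributes are treated as optional unless explicitly flagged as required.
    required = [attr["name"] for attr in attributes if attr.get("required", False)]
    return {"type": "object", "properties": properties, "required": required}


schema = attributes_to_json_schema(ATTRIBUTES)
```

The point of the sketch is that the description travels with the field into the schema the model sees, which is why vague descriptions cost you accuracy.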

Good vs Bad Descriptions

| Field | Bad description | Good description |
| --- | --- | --- |
| amount | "The amount" | "Total invoice amount as a number, excluding currency symbols and commas (e.g. 1250.00)" |
| sentiment | "Sentiment" | "Customer sentiment: exactly one of positive, negative, or neutral" |
| date | "The date" | "Invoice date in ISO 8601 format (YYYY-MM-DD). If no year, assume current year." |

Rule: Write descriptions as if you're telling a smart intern exactly what to write in that field. Ambiguous descriptions produce inconsistent outputs.
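
A precise description also gives you something to verify against. The sketch below, which could live in a Code node after the extractor, checks an extracted record against the three "good" descriptions from the table. The `check_record` helper and the sample values are hypothetical, not part of n8n.

```python
from datetime import date

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def check_record(record: dict) -> list[str]:
    """Return the names of fields that break the contract stated in their descriptions."""
    problems = []

    # sentiment: exactly one of positive, negative, or neutral
    if record.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append("sentiment")

    # date: ISO 8601 (YYYY-MM-DD)
    try:
        date.fromisoformat(str(record.get("date")))
    except ValueError:
        problems.append("date")

    # amount: a plain number, not a string such as "$1,250"
    amount = record.get("amount")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        problems.append("amount")

    return problems


good = check_record({"sentiment": "positive", "date": "2024-03-05", "amount": 1250.0})
bad = check_record({"sentiment": "Happy", "date": "March 5", "amount": "$1,250"})
```

An empty list means the record matches what the descriptions asked for; anything else is a hint that a description needs tightening.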

Optional vs Required

By default, all attributes are optional, and the LLM returns null for fields it can't find. You can mark fields as required, but that pushes the model to always produce a value, so it may hallucinate one if the data isn't there.


Output Mode

The node has a single output mode: it emits one item per input item.

The extracted fields land in the output item's json. In recent n8n versions they are nested under an output key rather than sitting at the top level, so confirm the exact path in the node's output panel before wiring expressions:

{{ $json.output.customer_name }}
{{ $json.output.invoice_total }}
{{ $json.output.sentiment }}

6 Gotchas

1. The Chat Model connection is mandatory — and often forgotten

If you drop the Information Extractor node onto the canvas and run it without a Chat Model sub-node attached, the execution fails with a connection error. Wire the Chat Model first; its connector is the small input handle at the bottom of the node.

2. Input field selection matters

The node has an Input field where you specify which field of the incoming item holds the text to extract from. The default usually points at text. If your text lives in body, content, or description, change it. Otherwise the model receives an empty string and returns nulls.
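
If your sources are inconsistent about where the text lives, one option is to normalize it before the extractor. This is a rough sketch of that idea for a Code node; the `pick_text` helper and the candidate field order are assumptions, so adapt them to your own items.

```python
# Field names taken from the paragraph above, in a hypothetical priority order.
CANDIDATE_FIELDS = ("text", "body", "content", "description")


def pick_text(item: dict, candidates: tuple[str, ...] = CANDIDATE_FIELDS) -> str:
    """Return the first non-empty string found among the candidate fields."""
    for key in candidates:
        value = item.get(key)
        if isinstance(value, str) and value.strip():
            return value
    # Failing loudly beats sending an empty string to the model.
    raise ValueError(f"No text found in any of: {', '.join(candidates)}")


picked = pick_text({"text": "", "body": "Hi, I'm Jane from Acme Corp."})
```

Raising an error here is a deliberate choice: an empty input would otherwise come back as a record full of nulls that looks like a valid extraction.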

3. Long texts hit context limits

The LLM must read your entire input text, and very long documents (PDFs, long email threads) can exceed the model's context window. Shorten the input first: condense it with the Summarization Chain node, or split or truncate it in a Code node before it reaches the extractor. For multi-page PDFs, extract page by page.
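
As a sketch of the Code node approach, the function below splits text into overlapping character windows. The `chunk_text` name and the default sizes are hypothetical, and character counts are only a rough stand-in for tokens, so leave headroom below your model's real limit.

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("max_chars must be positive and overlap must be smaller than max_chars")

    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window already reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


# Truncation is just the first chunk.
first_part = chunk_text("abcdefghij", max_chars=4, overlap=1)[0]
```

The overlap keeps a value that straddles a boundary (an address, a line item) intact in at least one chunk. If you extract per chunk, you still need to decide how to merge the resulting records.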

4. Hallucination on missing fields

If a field genuinely doesn't exist in the text and you've marked it required, the model may fabricate a plausible-sounding value. Where absence is meaningful, leave the field optional and check for null downstream instead of marking it required.
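
The downstream null check can be as small as this. It is a sketch for a Code node or an external script; `missing_fields` and the sample record are hypothetical.

```python
def missing_fields(record: dict, expected: list[str]) -> list[str]:
    """List expected fields that are absent, null, or an empty string."""
    return [name for name in expected if record.get(name) in (None, "")]


lead = {"sender_name": "Jane", "company": "Acme Corp", "budget_range": None}
gaps = missing_fields(lead, ["sender_name", "company", "budget_range", "urgency"])
```

You can then branch on the result, for example sending incomplete records to a manual review step instead of writing them straight to the CRM.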

5. Nested objects need careful schema design

You can define nested objects in the schema, but the deeper the nesting, the more the model struggles to fill it correctly. Flatten your schema where possible: one level of nesting is usually fine, while two levels noticeably degrade reliability.
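
To see what a flattened schema would look like, you can run a nested example of your target record through a small helper and use the resulting keys as attribute names. The `flatten` function and the sample invoice below are hypothetical.

```python
def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Collapse nested dicts into a single level, joining key paths with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path so far as the prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


nested = {"vendor": {"name": "Acme", "address": {"city": "Berlin"}}, "total": 10}
flat = flatten(nested)
```

Here `vendor.address.city` becomes `vendor_address_city`, a single string attribute the model can fill without tracking any structure.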

6. Temperature and model choice affect consistency

Higher temperature settings produce more variable extractions. For structured extraction, pick a model and temperature that favor precision over creativity. With OpenAI, gpt-4o-mini at temperature 0 is a reliable, cost-effective choice. With Anthropic, claude-haiku-4-5 works well for simple schemas; use claude-sonnet-4-6 for complex ones.


3 Workflow Patterns

Pattern 1: Email Lead Extractor

Scenario: Inbound emails arrive at a support inbox. You want to extract lead data (name, company, use case, budget range) and write it to a CRM or Google Sheet.

Flow:

Gmail Trigger (new email)
  → Information Extractor
      Schema: sender_name (String), company (String), use_case (String), budget_range (String), urgency (String: low/medium/high)
      Input field: text (email body)
  → Google Sheets (append row) OR HTTP Request (POST to CRM API)

Why it works: Email bodies are unstructured but the signal is there. The extractor turns "Hi, I'm Jane from Acme Corp, we need to automate our invoice reconciliation, budget around $5k" into a clean structured record in one step.
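
If you post to a sheet or CRM from your own code rather than the Google Sheets node, the mapping step might look like this sketch. The column order, the `to_sheet_row` helper, and the sample values (one plausible extraction of the email above) are all assumptions.

```python
# Column order for the sheet; matches the schema fields listed in the flow above.
COLUMNS = ["sender_name", "company", "use_case", "budget_range", "urgency"]


def to_sheet_row(record: dict, columns: list[str] = COLUMNS) -> list[str]:
    """Turn an extracted record into an ordered row, writing nulls as empty cells."""
    return ["" if record.get(col) is None else str(record[col]) for col in columns]


row = to_sheet_row({
    "sender_name": "Jane",
    "company": "Acme Corp",
    "use_case": "invoice reconciliation automation",
    "budget_range": "around $5k",
    "urgency": None,  # not stated in the email, so the optional field stays null
})
```

Keeping the column list in one place means adding a schema field later is a one-line change.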

Free JSON: Download the Email Lead Extractor workflow →


Pattern 2: Support Ticket Classifier and Router

Scenario: Support tickets come in via webhook (Typeform, Jotform, or your app). You want to extract the issue category, severity, and affected product, then route to the right Slack channel or Linear project.

Flow:

Webhook Trigger (new ticket)
  → Information Extractor
      Schema: issue_category (String: billing/technical/account/feature-request), severity (String: low/medium/high/critical), affected_product (String), customer_tier (String: free/pro/enterprise)
      Input field: description
  → Switch node (route on issue_category)
      → Slack #billing-support / #tech-support / etc.
      → Linear (create issue with extracted fields)

Why it works: Routing logic that used to require keyword matching or a full classification model becomes a 2-node pattern. The extracted severity and customer_tier drive SLA assignment downstream.
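
In n8n the Switch node handles the routing, but the logic is easy to picture as a lookup with a fallback. In this sketch the two channel names come from the flow above; the `#support-triage` fallback and the `route` helper are hypothetical.

```python
CHANNELS = {
    "billing": "#billing-support",
    "technical": "#tech-support",
}
FALLBACK_CHANNEL = "#support-triage"  # hypothetical catch-all channel


def route(ticket: dict) -> str:
    """Pick a Slack channel from the extracted issue_category, tolerating case and whitespace."""
    category = str(ticket.get("issue_category") or "").strip().lower()
    return CHANNELS.get(category, FALLBACK_CHANNEL)


channel = route({"issue_category": "Billing ", "severity": "high"})
```

The fallback matters: even with a tight description, the model will occasionally return a category outside your list, and that ticket should land somewhere a human will see it.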

Free JSON: Download the Support Ticket Router workflow →


Pattern 3: Invoice Data Extractor (PDF → Database)

Scenario: You receive invoices as PDF attachments via email. You want to extract invoice number, vendor, date, line items total, and due date, then insert them into Postgres or Airtable for AP tracking.

Flow:

Gmail Trigger (attachment: PDF)
  → Extract From File node (Read PDF as text)
  → Information Extractor
      Schema: invoice_number (String), vendor_name (String), invoice_date (String: YYYY-MM-DD), total_amount (Number), due_date (String: YYYY-MM-DD), currency (String)
      Input field: text
  → Postgres node (INSERT) OR Airtable node (Create Record)
  → Gmail (Send confirmation reply to sender)

Why it works: PDF invoice parsing used to mean paid OCR services or brittle regex. With n8n's Extract From File + Information Extractor, you get structured AP data from text-based PDFs in seconds, and the model copes with layout differences between vendors without per-vendor rules. Scanned, image-only PDFs still need an OCR step first, because Extract From File only reads an existing text layer.
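
Before the INSERT, it is worth converting the extracted strings into real types so bad values fail early. The sketch below prepares parameters for a parameterized query. The `invoices` table, the `%s` placeholder style (as used by psycopg), and the helper names are assumptions, and nothing here talks to a database.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

# Hypothetical table and columns, matching the schema fields in the flow above.
INSERT_SQL = (
    "INSERT INTO invoices "
    "(invoice_number, vendor_name, invoice_date, total_amount, due_date, currency) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def parse_iso_date(value: Optional[str]) -> Optional[date]:
    """Parse YYYY-MM-DD, passing nulls through; raises ValueError on other formats."""
    return date.fromisoformat(value) if value else None


def to_insert_params(invoice: dict) -> tuple:
    """Convert an extracted invoice record into typed parameters for INSERT_SQL."""
    if not invoice.get("invoice_number"):
        raise ValueError("invoice_number is missing; refusing to insert")
    total = invoice.get("total_amount")
    return (
        invoice["invoice_number"],
        invoice.get("vendor_name"),
        parse_iso_date(invoice.get("invoice_date")),
        None if total is None else Decimal(str(total)),  # avoid float rounding on money
        parse_iso_date(invoice.get("due_date")),
        invoice.get("currency"),
    )


params = to_insert_params({
    "invoice_number": "INV-1001",
    "vendor_name": "Acme Corp",
    "invoice_date": "2024-03-05",
    "total_amount": 1250.0,
    "due_date": "2024-04-04",
    "currency": "USD",
})
```

Passing values as parameters instead of formatting them into the SQL string also keeps text pulled from an untrusted PDF out of the query itself.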

Free JSON: Download the Invoice Extractor workflow →


Information Extractor vs Other AI Nodes

| Node | Best for |
| --- | --- |
| Information Extractor | Structured field extraction from unstructured text |
| Basic LLM Chain | Free-form text generation, classification, summarization |
| Summarization Chain | Condensing long documents into shorter summaries |
| AI Agent | Multi-step reasoning, tool use, decisions |
| Sentiment Analysis (via LLM Chain) | Single-dimension classification |

Use Information Extractor when you need specific named fields out of text. Use Basic LLM Chain when you need a free-form text response.


Quick Reference

Node: Information Extractor
Requires: Chat Model sub-node (mandatory)
Input field: configure to match your data field
Output: extracted fields added to item JSON

Schema tips:
- Name: snake_case, descriptive
- Type: String / Number / Boolean / Date
- Description: be specific — the LLM reads this

Gotchas:
- Wire Chat Model first
- Set correct Input field
- Chunk long texts before extracting
- Use Optional fields for data that may be absent
- Flatten nested schemas
- Low temperature = more consistent extraction

Get the Free Workflow JSON

All three patterns above are included in the n8n Workflow Packs available on Gumroad. One download, instant access, plug the JSON into your n8n instance and go.

Download the n8n Workflow Pack


Found this useful? Drop a comment below — I'm especially curious which extraction use case you're tackling and which model you're using.
