The Information Extractor node is one of n8n's most practical AI nodes. You give it unstructured text — an email, a PDF snippet, a support ticket, a scraped web page — and it uses a language model to pull out exactly the fields you define. No regex. No brittle parsing. Just structured JSON out.
This guide covers everything: how the node works, schema definition, output modes, the 6 gotchas that trip people up, and 3 production-ready workflow patterns with free JSON.
What the Information Extractor Node Does
The Information Extractor node wraps LangChain's structured output capability. You define a JSON schema (field names + types + descriptions), pass in a text input, and the connected Chat Model fills in the schema. The output is a single n8n item with your extracted fields as top-level keys.
Typical inputs:
- Email body text
- PDF or document content (after passing through Extract From File)
- Support ticket descriptions
- Web page text (after HTML Extract or HTTP Request)
- Form submission free-text fields
- API response descriptions
Typical outputs:
- Structured records ready for a database insert
- Enriched lead data
- Classified + tagged items
- Parsed addresses, dates, names, amounts
Node Wiring
The Information Extractor is an AI node — it sits inside n8n's LangChain sub-graph and requires a Chat Model sub-node connected to it.
[Your text source]
↓
[Information Extractor]
↑ (Chat Model connection)
[OpenAI / Anthropic / Gemini Chat Model]
↓
[Next node — uses extracted fields]
Required connection: Chat Model (any: OpenAI, Anthropic Claude, Google Gemini, Ollama, etc.)
Optional connections:
- Memory node (rarely needed for extraction tasks)
Schema Definition
This is where most of the work happens. You define your extraction schema inside the node's Attributes section.
Adding Attributes
Each attribute has:
- Name: the output field key (e.g., customer_name, invoice_total)
- Type: String, Number, Boolean, or Object (nested)
- Description: critical, because the LLM uses this to understand what to extract. Be specific.
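To make the three attribute parts concrete, here is a minimal sketch of how a list of attributes could map onto a JSON Schema object. The attribute list, the `required` flag, and the `to_json_schema` helper are illustrative assumptions, not the node's internal format.

```python
import json

# Hypothetical attribute list mirroring the Name / Type / Description fields.
attributes = [
    {"name": "customer_name", "type": "string",
     "description": "Full name of the customer as written in the text",
     "required": False},
    {"name": "invoice_total", "type": "number",
     "description": "Total invoice amount as a number, excluding currency "
                    "symbols and commas (e.g. 1250.00)",
     "required": True},
]

def to_json_schema(attrs):
    """Build a JSON Schema object from a list of attribute definitions."""
    properties = {
        a["name"]: {"type": a["type"], "description": a["description"]}
        for a in attrs
    }
    required = [a["name"] for a in attrs if a.get("required")]
    return {"type": "object", "properties": properties, "required": required}

schema = to_json_schema(attributes)
print(json.dumps(schema, indent=2))
```

Seeing the description sit right next to the type is a useful reminder that it is part of what the model reads, not a comment for humans.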
Good vs Bad Descriptions
| Field | Bad description | Good description |
|---|---|---|
| amount | "The amount" | "Total invoice amount as a number, excluding currency symbols and commas (e.g. 1250.00)" |
| sentiment | "Sentiment" | "Customer sentiment: exactly one of positive, negative, or neutral" |
| date | "The date" | "Invoice date in ISO 8601 format (YYYY-MM-DD). If no year, assume current year." |
Rule: Write descriptions as if you're telling a smart intern exactly what to write in that field. Ambiguous descriptions produce inconsistent outputs.
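Even with precise descriptions, it can help to verify that the model followed them. The sketch below checks the three example fields from the table (amount, sentiment, date). It is written as standalone Python that you could adapt to a Code node; the `check_record` helper is an assumption for illustration, not part of n8n.

```python
import re
from datetime import date

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def check_record(record):
    """Return a list of problems found in an extracted record (empty = OK)."""
    problems = []

    amount = record.get("amount")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if amount is not None and (
        isinstance(amount, bool) or not isinstance(amount, (int, float))
    ):
        problems.append("amount is not a number")

    sentiment = record.get("sentiment")
    if sentiment is not None and sentiment not in ALLOWED_SENTIMENTS:
        problems.append("sentiment is outside the allowed values")

    raw_date = record.get("date")
    if raw_date is not None:
        if not (isinstance(raw_date, str) and ISO_DATE.match(raw_date)):
            problems.append("date is not in YYYY-MM-DD format")
        else:
            try:
                date.fromisoformat(raw_date)
            except ValueError:
                problems.append("date is not a real calendar date")

    return problems
```

Null values pass the check on purpose: a missing field is a separate question from a malformed one.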
Optional vs Required
By default, all attributes are optional — the LLM will return null for fields it can't find. You can mark fields as required; the model will try harder but may hallucinate if the data isn't there.
Output Mode
The node has one primary output mode: it emits one item per input item, with your defined fields added to (or replacing) the item's JSON.
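The extracted fields appear in the output item's json object. Check the node's output panel for the exact path: some n8n versions nest the fields under an output key (for example, {{ $json.output.customer_name }}). With top-level fields, the expressions are: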
{{ $json.customer_name }}
{{ $json.invoice_total }}
{{ $json.sentiment }}
6 Gotchas
1. The Chat Model connection is mandatory — and often forgotten
If you drop the Information Extractor node onto the canvas and run it without wiring a Chat Model sub-node, you'll get a connection error. Always add the Chat Model sub-node first. It attaches to a small input handle at the bottom of the node.
2. Input field selection matters
The node has an Input field where you specify which field from the incoming item contains the text to extract from. Default is often text. If your text is in body, content, or description, change this — otherwise the model gets an empty string and returns nulls.
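If upstream nodes are inconsistent about where the text lives, a small guard before the extractor can fail loudly instead of sending an empty string to the model. This is a minimal sketch; the candidate field list and the `pick_text` helper are assumptions, so adjust them to your own data.

```python
# Field names mentioned above; the order expresses preference.
CANDIDATE_FIELDS = ["text", "body", "content", "description"]

def pick_text(item, candidates=CANDIDATE_FIELDS):
    """Return (field_name, text) for the first candidate holding non-empty text."""
    for field in candidates:
        value = item.get(field)
        if isinstance(value, str) and value.strip():
            return field, value.strip()
    # Raising here stops the run before the model is called with nothing to read.
    raise ValueError(f"No non-empty text field found among {candidates}")
```

For example, `pick_text({"text": "", "body": "Hi, I'm Jane from Acme Corp"})` skips the empty `text` field and returns the `body` value.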
3. Long texts hit context limits
The LLM must read your entire input text. Very long documents (PDFs, long emails) can exceed the model's context window. Condense the text first with the Summarization Chain node, or truncate or chunk it with a Code node, before feeding it into the extractor. For multi-page PDFs, extract page by page.
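A simple way to chunk is by character count with a small overlap, so a field that straddles a boundary still appears whole in one chunk. The sketch below assumes characters are a rough stand-in for tokens; the 8000 and 200 defaults are placeholder values, not limits of any particular model.

```python
def chunk_text(text, max_chars=8000, overlap=200):
    """Split text into pieces of at most max_chars, overlapping by `overlap` chars."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so boundary-spanning text is not cut in half.
        start = end - overlap
    return chunks
```

Each chunk then becomes its own item for the extractor, and you merge the results downstream, for instance by keeping the first non-null value per field.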
4. Hallucination on missing fields
If a field genuinely doesn't exist in the text and you've marked it required, the model may fabricate a plausible-sounding value. For fields where absence is meaningful, mark them Optional and check for null downstream instead of marking them Required.
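The downstream null check can be as small as the sketch below. The `missing_fields` helper and the sample field names are illustrative assumptions; in a workflow, the same logic could live in a Code node or an If node.

```python
def missing_fields(record, expected):
    """List expected fields that are absent, None, or blank strings."""
    missing = []
    for name in expected:
        value = record.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing

# Example: decide whether a record needs human review.
record = {"invoice_number": "INV-1", "due_date": None, "vendor_name": " "}
expected = ["invoice_number", "due_date", "vendor_name", "currency"]
needs_review = missing_fields(record, expected)
```

An empty result means every expected field came back with a value; anything else is a signal to route the item to manual review rather than trust a guess.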
5. Nested objects need careful schema design
You can define Object-type attributes with nested keys, but the deeper the nesting, the more the model struggles to fill it correctly. Flatten your schema where possible; one level of nesting is usually fine, two levels degrades reliability.
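One way to plan a flat schema is to start from the nested shape you have in mind and derive the flat attribute names from it. The `flatten` helper below is a generic sketch, not an n8n feature, and the sample record is made up for illustration.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the joined key as the prefix.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

nested = {"customer": {"name": "Jane", "address": {"city": "Berlin"}}, "total": 5}
flat = flatten(nested)
```

Here the two-level `customer.address.city` becomes a single `customer_address_city` attribute, which you can define directly in the node with its own description.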
6. Temperature and model choice affect consistency
Higher-temperature models produce more variable extractions. For structured extraction, use a model/temperature that favors precision over creativity. With OpenAI, gpt-4o-mini at temperature 0 is a reliable cost-effective choice. With Anthropic, claude-haiku-4-5 works well for simple schemas; use claude-sonnet-4-6 for complex ones.
3 Workflow Patterns
Pattern 1: Email Lead Extractor
Scenario: Inbound emails arrive at a support inbox. You want to extract lead data (name, company, use case, budget range) and write it to a CRM or Google Sheet.
Flow:
Gmail Trigger (new email)
→ Information Extractor
Schema: sender_name (String), company (String), use_case (String), budget_range (String), urgency (String: low/medium/high)
Input field: text (email body)
→ Google Sheets (append row) OR HTTP Request (POST to CRM API)
Why it works: Email bodies are unstructured but the signal is there. The extractor turns "Hi, I'm Jane from Acme Corp, we need to automate our invoice reconciliation, budget around $5k" into a clean structured record in one step.
Free JSON: Download the Email Lead Extractor workflow →
Pattern 2: Support Ticket Classifier and Router
Scenario: Support tickets come in via webhook (Typeform, Jotform, or your app). You want to extract the issue category, severity, and affected product, then route to the right Slack channel or Linear project.
Flow:
Webhook Trigger (new ticket)
→ Information Extractor
Schema: issue_category (String: billing/technical/account/feature-request), severity (String: low/medium/high/critical), affected_product (String), customer_tier (String: free/pro/enterprise)
Input field: description
→ Switch node (route on issue_category)
→ Slack #billing-support / #tech-support / etc.
→ Linear (create issue with extracted fields)
Why it works: Routing logic that used to require keyword matching or a full classification model becomes a 2-node pattern. The extracted severity and customer_tier drive SLA assignment downstream.
Free JSON: Download the Support Ticket Router workflow →
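The Switch step in this pattern boils down to a lookup on issue_category with a fallback for anything unexpected. A minimal sketch of that logic follows; the first two channel names come from the flow above, while the other channels and the fallback are hypothetical.

```python
ROUTES = {
    "billing": "#billing-support",
    "technical": "#tech-support",
    "account": "#account-support",           # hypothetical channel
    "feature-request": "#product-feedback",  # hypothetical channel
}
FALLBACK_CHANNEL = "#support-triage"         # hypothetical channel

def route_ticket(extracted):
    """Pick a Slack channel from the extracted issue_category."""
    # Normalize case and whitespace, since the model may not match exactly.
    category = (extracted.get("issue_category") or "").strip().lower()
    return ROUTES.get(category, FALLBACK_CHANNEL)
```

The fallback matters: if the model returns null or a category outside your list, the ticket still lands somewhere a human will see it.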
Pattern 3: Invoice Data Extractor (PDF → Database)
Scenario: You receive invoices as PDF attachments via email. You want to extract invoice number, vendor, date, line items total, and due date, then insert them into Postgres or Airtable for AP tracking.
Flow:
Gmail Trigger (attachment: PDF)
→ Extract From File node (Read PDF as text)
→ Information Extractor
Schema: invoice_number (String), vendor_name (String), invoice_date (String: YYYY-MM-DD), total_amount (Number), due_date (String: YYYY-MM-DD), currency (String)
Input field: text
→ Postgres node (INSERT) OR Airtable node (Create Record)
→ Gmail (Send confirmation reply to sender)
Why it works: PDF invoice parsing used to require paid OCR services or brittle regex. With n8n's Extract From File + Information Extractor, you get structured AP data from any invoice format in seconds — and it handles vendor-to-vendor format differences automatically.
Free JSON: Download the Invoice Extractor workflow →
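For the database step, the important detail is binding the extracted values as parameters rather than pasting them into the SQL string. The sketch below uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere; the `invoices` table, its column types, and the sample record are assumptions.

```python
import sqlite3

# Column names match the schema fields listed in the flow above.
COLUMNS = ["invoice_number", "vendor_name", "invoice_date",
           "total_amount", "due_date", "currency"]

def insert_invoice(conn, record):
    """Insert one extracted invoice; values are bound as named parameters."""
    # Missing fields become None, which the database stores as NULL.
    row = {col: record.get(col) for col in COLUMNS}
    placeholders = ", ".join(f":{col}" for col in COLUMNS)
    # Column names come from the fixed COLUMNS list, never from model output.
    conn.execute(
        f"INSERT INTO invoices ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        row,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (invoice_number TEXT, vendor_name TEXT, "
    "invoice_date TEXT, total_amount REAL, due_date TEXT, currency TEXT)"
)
insert_invoice(conn, {
    "invoice_number": "INV-1001", "vendor_name": "Acme Corp",
    "invoice_date": "2024-03-15", "total_amount": 1250.0,
    "due_date": None, "currency": "USD",
})
stored = conn.execute("SELECT * FROM invoices").fetchall()
```

Because model output is untrusted text, parameter binding protects the query even when a vendor name contains quotes or SQL-looking fragments.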
Information Extractor vs Other AI Nodes
| Node | Best for |
|---|---|
| Information Extractor | Structured field extraction from unstructured text |
| Basic LLM Chain | Free-form text generation, classification, summarization |
| Summarization Chain | Condensing long documents into shorter summaries |
| AI Agent | Multi-step reasoning, tool use, decisions |
| Sentiment Analysis (via LLM Chain) | Single-dimension classification |
Use Information Extractor when you need specific named fields out of text. Use Basic LLM Chain when you need a free-form text response.
Quick Reference
Node: Information Extractor
Requires: Chat Model sub-node (mandatory)
Input field: configure to match your data field
Output: extracted fields added to item JSON
Schema tips:
- Name: snake_case, descriptive
- Type: String / Number / Boolean / Object
- Description: be specific — the LLM reads this
Gotchas:
- Wire Chat Model first
- Set correct Input field
- Chunk long texts before extracting
- Use Optional fields for data that may be absent
- Flatten nested schemas
- Low temperature = more consistent extraction
Get the Free Workflow JSON
All three patterns above are included in the n8n Workflow Packs available on Gumroad. One download, instant access, plug the JSON into your n8n instance and go.
→ Download the n8n Workflow Pack
Found this useful? Drop a comment below — I'm especially curious which extraction use case you're tackling and which model you're using.