I turned every n8n node into a machine-readable dataset (524 nodes, free) so agents can build workflows

#ai #n8n #opensource #automation

I have been writing agents that build n8n workflows. The hard part is not "call the n8n API and post a workflow JSON." The hard part is "pick the right node, with the right operation, with the right parameters, without hallucinating fields that do not exist."

The n8n GUI is the source of truth. The TypeScript source files are the second source of truth. Neither is a thing you can hand to an LLM at inference time.

So I extracted everything into one structured catalog and put it on HuggingFace.

524 nodes. Every operation. Every credential type. Properties schema. Free. CC-BY-4.0.

HuggingFace: automatelab/n8n-nodes-catalog
Browsable index: automatelab.tech/products/datasets/n8n-nodes-catalog/

Why a catalog and not "just scrape n8n.io"

Three real problems with leaving this implicit:

Hallucination cost is high. An LLM that invents a slack.sendDM operation will produce a workflow that imports fine and fails at runtime. Hard to detect, expensive to debug.
Context window pressure. Dropping the entire n8n source tree into a prompt is not realistic. You want a compact index the agent can search.
Coverage is non-obvious. There are two source packages (nodes-base and @n8n/nodes-langchain), and the split between them is not visible in the UI.

The catalog flattens all of that into one row per node.
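As a minimal sketch of the first point, a planner can reject any node or operation the catalog does not list. The two rows below are hand-written stand-ins shaped like catalog rows, and `check_step` is a hypothetical helper, not part of n8n or the dataset.

```python
from typing import Optional

# Hand-written stand-ins for catalog rows; real rows come from the dataset.
CATALOG = {
    "slack": {"operations_supported": ["message", "channel", "user", "reaction"]},
    "lmChatOpenAi": {"operations_supported": []},
}


def check_step(node_name: str, operation: Optional[str]) -> Optional[str]:
    """Return an error message if the step references something unknown, else None."""
    row = CATALOG.get(node_name)
    if row is None:
        return f"unknown node: {node_name}"
    ops = row["operations_supported"] or []
    # Nodes with no listed operations cannot be checked at this level.
    if operation is not None and ops and operation not in ops:
        return f"unknown operation {operation!r} for node {node_name}"
    return None


print(check_step("slack", "sendDM"))   # flagged: not in the catalog
print(check_step("slack", "message"))  # None: listed in the catalog
```

Running this check before the workflow is posted moves the failure from runtime to plan time, which is where it is cheap.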

What is in each row

Field	What it is
`node_name`	Internal id (e.g. `slack`, `airtable`, `lmChatOpenAi`)
`display_name`	UI label
`categories`	Top-level categories (Communication, AI, Data and Storage)
`subcategories`	Leaf taxonomy values
`group`	List holding `input`, `output`, or `transform`
`version`	Default version for multi-version nodes
`description`	One-liner
`credentials_required`	Credential type names (e.g. `slackApi`, `openAiApi`)
`operations_supported`	Operation values for the node
`properties_schema`	JSON describing top-level property descriptors
`source_package`	`nodes-base` or `@n8n` (the latter stands for `@n8n/nodes-langchain`)
`source_file_path`	Repo-relative path to the `.node.ts`
`github_permalink`	Pinned GitHub link to the source

Format: JSON and Parquet (Snappy). License: CC-BY-4.0. Updates monthly.
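For typed agent code, the fields above map naturally onto a record type. The sketch below assumes the value types visible in the sample rows in this post (lists of strings for `group`, a string for `version`, and `properties_schema` stored as a JSON-encoded string); check them against the actual dataset before relying on them. `NodeRow` and `unknown_fields` are hypothetical names.

```python
from typing import List, TypedDict


class NodeRow(TypedDict, total=False):
    """One catalog row. Field names come from the table above; types are assumed."""
    node_name: str
    display_name: str
    categories: List[str]
    subcategories: List[str]
    group: List[str]
    version: str
    description: str
    credentials_required: List[str]
    operations_supported: List[str]
    properties_schema: str  # JSON-encoded string, not a nested object
    source_package: str
    source_file_path: str
    github_permalink: str


EXPECTED_FIELDS = set(NodeRow.__annotations__)


def unknown_fields(row: dict) -> set:
    """Return keys in `row` that the field table does not document."""
    return set(row) - EXPECTED_FIELDS


print(unknown_fields({"node_name": "slack", "oops": 1}))  # {'oops'}
```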

A sample row

Here is the Slack node, trimmed:

{
  "node_name": "slack",
  "display_name": "Slack",
  "categories": ["Communication"],
  "group": ["transform"],
  "version": "2.3",
  "description": "Send and read messages, manage channels",
  "credentials_required": ["slackApi"],
  "operations_supported": ["message", "channel", "user", "reaction"],
  "properties_schema": "[{\"name\":\"resource\",\"type\":\"options\"},{\"name\":\"operation\",\"type\":\"options\"}]",
  "source_package": "nodes-base",
  "github_permalink": "https://github.com/n8n-io/n8n/blob/stable/packages/nodes-base/nodes/Slack/Slack.node.ts"
}

And an AI node, to show the cross-package coverage:

{
  "node_name": "lmChatOpenAi",
  "display_name": "OpenAI Chat Model",
  "categories": ["AI"],
  "subcategories": ["Language Models", "Chat Models (Recommended)"],
  "group": ["transform"],
  "version": "1.3",
  "credentials_required": ["openAiApi"],
  "source_package": "@n8n",
  "source_file_path": "packages/@n8n/nodes-langchain/nodes/llms/LMChatOpenAi/LmChatOpenAi.node.ts"
}
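Note that `properties_schema` in the Slack row is a JSON-encoded string rather than a nested object, so it needs one extra decoding step before use. A short sketch using the trimmed value from that row:

```python
import json

# Trimmed copy of the Slack row above.
slack_row = {
    "node_name": "slack",
    "properties_schema": '[{"name":"resource","type":"options"},'
                         '{"name":"operation","type":"options"}]',
}

# Decode the string into a list of property descriptors.
properties = json.loads(slack_row["properties_schema"])
property_types = {p["name"]: p["type"] for p in properties}

print(property_types)  # {'resource': 'options', 'operation': 'options'}
```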

Numbers I did not expect

A few things that fell out of the catalog once it existed:

431 nodes from nodes-base, 93 from @n8n/nodes-langchain. The langchain side is a real and growing chunk.
The single most common credential type, by a wide margin, is httpBasicAuth (because the generic HTTP Request node is everywhere). After that the long tail starts immediately.
A non-trivial number of nodes have an empty operations_supported list. Those are usually AI building-block nodes (LLMs, vector stores, output parsers; n8n calls most of these sub-nodes), where the "operation" abstraction does not apply.

Useful to know if you are writing a planner that filters by operation: a filter on operations_supported will silently drop these nodes.
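Figures like these can be recomputed from the rows themselves. The sketch below uses three hand-written rows so that it runs offline; the third row is invented, the empty operations list on `lmChatOpenAi` is an assumption, and the resulting counts are not the real ones. `summarize` is a hypothetical helper.

```python
from collections import Counter

# Hand-written rows for illustration; "exampleNode" does not exist in n8n.
rows = [
    {"node_name": "slack", "source_package": "nodes-base",
     "credentials_required": ["slackApi"],
     "operations_supported": ["message", "channel", "user", "reaction"]},
    {"node_name": "lmChatOpenAi", "source_package": "@n8n",
     "credentials_required": ["openAiApi"], "operations_supported": []},
    {"node_name": "exampleNode", "source_package": "nodes-base",
     "credentials_required": ["slackApi"], "operations_supported": []},
]


def summarize(rows):
    """Count rows per package, credential type frequency, and rows without operations."""
    packages = Counter(r["source_package"] for r in rows)
    credentials = Counter(c for r in rows for c in (r["credentials_required"] or []))
    no_ops = [r["node_name"] for r in rows if not r["operations_supported"]]
    return packages, credentials, no_ops


packages, credentials, no_ops = summarize(rows)
print(packages)
print(credentials.most_common(1))
print(no_ops)
```

With the real dataset, the same function can be fed the loaded rows instead of the hand-written list.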

How agents actually use it

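from datasets import load_dataset

# Downloads the catalog on first use; assumes the data lives in a "train" split.
ds = load_dataset("automatelab/n8n-nodes-catalog", split="train")

# Keep nodes that list "message" among their operations.
# The `or []` guards against rows where the field is missing or null.
messaging = ds.filter(
    lambda r: "message" in (r["operations_supported"] or [])
)
for row in messaging:
    print(row["node_name"], row["credentials_required"])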

Typical pipeline:

Embed every row (description, operations, credentials) into a vector store.
At plan time, retrieve the top N nodes for a user request.
Hand the agent only those rows. The context stays compact, and every operation the agent sees is one that exists.
The agent emits an n8n workflow JSON. Checking its top-level parameters against properties_schema catches unknown fields before deploy.
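The retrieval steps above can be sketched without any external service. To keep the example self-contained, it swaps the embedding model and vector store for plain token overlap, which is a much cruder ranking; the rows are hand-written, and `tokens`, `row_text`, and `retrieve` are hypothetical helpers.

```python
import re

# Hand-written rows for illustration only.
rows = [
    {"node_name": "slack", "description": "Send and read messages, manage channels",
     "operations_supported": ["message", "channel", "user", "reaction"],
     "credentials_required": ["slackApi"]},
    {"node_name": "lmChatOpenAi", "description": "",
     "operations_supported": [], "credentials_required": ["openAiApi"]},
]


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def row_text(row):
    """Flatten the fields worth indexing into one string."""
    parts = [row["node_name"], row.get("description") or ""]
    parts += row.get("operations_supported") or []
    parts += row.get("credentials_required") or []
    return " ".join(parts)


def retrieve(query, rows, top_n=5):
    """Rank rows by token overlap with the query; drop rows with no overlap."""
    q = tokens(query)
    scored = [(len(q & tokens(row_text(r))), r) for r in rows]
    scored = [(s, r) for s, r in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:top_n]]


hits = retrieve("send a message to a slack channel", rows)
print([r["node_name"] for r in hits])  # ['slack']
```

In a real pipeline, `row_text` is the string you would embed, and `retrieve` becomes a nearest-neighbor query against the vector store.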

This is the same shape as RAG over a tool catalog, which is becoming a pattern in its own right.

Caveats

The properties schema is a top-level summary, not the full recursive parameter tree. For deep parameter shapes the github_permalink is your friend.
Multi-version nodes only report the default version. If you need every version of a node, the source link covers it.
License is CC-BY-4.0 on the catalog additions; the n8n source itself is governed by n8n's own license, which you should respect when you ship.
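Given the first caveat, a validator built on `properties_schema` can only check top-level parameter names. A minimal sketch, assuming the agent's node config carries its settings under a `parameters` key as n8n workflow JSON does; `unknown_parameters` is a hypothetical helper and the config values are invented.

```python
import json


def unknown_parameters(node_config, properties_schema):
    """Return top-level parameter names that the schema does not list."""
    known = {p["name"] for p in json.loads(properties_schema)}
    return sorted(set(node_config.get("parameters", {})) - known)


# Schema string copied from the trimmed Slack row in this post.
slack_schema = ('[{"name":"resource","type":"options"},'
                '{"name":"operation","type":"options"}]')

# Hypothetical agent output with one invented field.
config = {"parameters": {"resource": "message", "operation": "send", "dmTarget": "@bob"}}

print(unknown_parameters(config, slack_schema))  # ['dmTarget']
```

The trimmed schema here lists only two properties, so a real Slack row will accept more names than this example does. Nested values are not checked at all.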

Links

Dataset: automatelab/n8n-nodes-catalog
Browsable index: automatelab.tech/products/datasets/n8n-nodes-catalog/
All open datasets: automatelab.tech/products/datasets/

If you build agent tooling on top of this, the thing I would most like to see is an open eval set: prompts in, expected n8n workflow JSON out. That is the next obvious missing piece, and I do not think anyone has shipped one yet.