DEV Community

Artyom Rabzonov


I turned every n8n node into a machine-readable dataset (524 nodes, free) so agents can build workflows

I have been writing agents that build n8n workflows. The hard part is not "call the n8n API and post a workflow JSON." The hard part is "pick the right node, with the right operation, with the right parameters, without hallucinating fields that do not exist."

The n8n GUI is the source of truth. The TypeScript source files are the second source of truth. Neither is a thing you can hand to an LLM at inference time.

So I extracted everything into one structured catalog and put it on HuggingFace.

524 nodes. Every operation. Every credential type. Properties schema. Free. CC-BY-4.0.

Why a catalog and not "just scrape n8n.io"

Three real problems with leaving this implicit:

  1. Hallucination cost is high. An LLM that invents a slack.sendDM operation will produce a workflow that imports fine and fails at runtime. Hard to detect, expensive to debug.
  2. Context window pressure. Dropping the entire n8n source tree into a prompt is not realistic. You want a compact index the agent can search.
  3. Coverage is non-obvious. There are two source packages (nodes-base and @n8n/nodes-langchain), and the split between them is not visible in the UI.

The catalog flattens all of that into one row per node.

What is in each row

Field                  What it is
node_name              Internal id (e.g. slack, airtable, lmChatOpenAi)
display_name           UI label
categories             Top-level categories (Communication, AI, Data and Storage)
subcategories          Leaf taxonomy values
group                  input, output, or transform
version                Default version for multi-version nodes
description            One-liner
credentials_required   Credential type names (e.g. slackApi, openAiApi)
operations_supported   Operation values for the node
properties_schema      JSON describing top-level property descriptors
source_package         nodes-base or @n8n
source_file_path       Repo-relative path to the .node.ts
github_permalink       Pinned GitHub link to the source

Formats: JSON and Parquet (Snappy-compressed). License: CC-BY-4.0. Updated monthly.

A sample row

Here is the Slack node, trimmed:

{
  "node_name": "slack",
  "display_name": "Slack",
  "categories": ["Communication"],
  "group": ["transform"],
  "version": "2.3",
  "description": "Send and read messages, manage channels",
  "credentials_required": ["slackApi"],
  "operations_supported": ["message", "channel", "user", "reaction"],
  "properties_schema": "[{\"name\":\"resource\",\"type\":\"options\"},{\"name\":\"operation\",\"type\":\"options\"}]",
  "source_package": "nodes-base",
  "github_permalink": "https://github.com/n8n-io/n8n/blob/stable/packages/nodes-base/nodes/Slack/Slack.node.ts"
}
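
Note that properties_schema in the row above is a JSON-encoded string, not a nested list, so it needs one extra decode before use. A minimal sketch, using a trimmed copy of the Slack row; the helper name top_level_properties is hypothetical and not part of the dataset:

```python
import json

# Trimmed copy of the Slack row shown above.
row = {
    "node_name": "slack",
    "properties_schema": '[{"name":"resource","type":"options"},'
                         '{"name":"operation","type":"options"}]',
}

def top_level_properties(row):
    """Decode properties_schema and return {property name: property type}.

    Hypothetical helper: treats a missing or empty schema as an empty list.
    """
    raw = row.get("properties_schema") or "[]"
    return {p["name"]: p.get("type") for p in json.loads(raw)}

props = top_level_properties(row)
print(props)
```

Only the top-level descriptors are available this way; nested parameter shapes are not part of the decoded result.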

And an AI node, to show the cross-package coverage:

{
  "node_name": "lmChatOpenAi",
  "display_name": "OpenAI Chat Model",
  "categories": ["AI"],
  "subcategories": ["Language Models", "Chat Models (Recommended)"],
  "group": ["transform"],
  "version": "1.3",
  "credentials_required": ["openAiApi"],
  "source_package": "@n8n",
  "source_file_path": "packages/@n8n/nodes-langchain/nodes/llms/LMChatOpenAi/LmChatOpenAi.node.ts"
}

Numbers I did not expect

A few things that fell out of the catalog once it existed:

  • 431 nodes from nodes-base, 93 from @n8n/nodes-langchain. The langchain side is a real and growing chunk.
  • The single most common credential type, by a wide margin, is httpBasicAuth (because the generic HTTP Request node is everywhere). After that the long tail starts immediately.
  • A non-trivial number of nodes have an empty operations_supported list. Those are usually root nodes (LLMs, vector stores, output parsers) where the "operation" abstraction does not apply.

Useful to know if you are writing a planner that filters by operation.
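
One way a planner could account for this is to keep nodes with an empty operations_supported list in a separate bucket instead of silently dropping them. A rough sketch under that assumption; the function name and the two sample rows are illustrative, and the empty list on the second row is assumed rather than copied from the dataset:

```python
def candidates_for(rows, operation):
    """Split rows into (matched, no_operations).

    matched: rows whose operations_supported contains `operation`.
    no_operations: rows with an empty or missing operations list, which a
    planner may want to select by category or subcategory instead.
    """
    matched, no_operations = [], []
    for r in rows:
        ops = r.get("operations_supported") or []
        if not ops:
            no_operations.append(r)
        elif operation in ops:
            matched.append(r)
    return matched, no_operations

# Illustrative rows only.
rows = [
    {"node_name": "slack", "operations_supported": ["message", "channel"]},
    {"node_name": "lmChatOpenAi", "operations_supported": []},
]

matched, no_operations = candidates_for(rows, "message")
print([r["node_name"] for r in matched])
print([r["node_name"] for r in no_operations])
```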

How agents actually use it

from datasets import load_dataset

ds = load_dataset("automatelab/n8n-nodes-catalog")["train"]

# Filter to nodes that can post messages somewhere
messaging = ds.filter(
    lambda r: "message" in (r["operations_supported"] or [])
)
for row in messaging:
    print(row["node_name"], row["credentials_required"])

Typical pipeline:

  1. Embed every row (description, operations, credentials) into a vector store.
  2. At plan time, retrieve the top N nodes for a user request.
  3. Hand the agent only those rows. Compact context, no hallucinated operations.
  4. The agent emits an n8n workflow JSON. Validation against properties_schema catches malformed configs before deploy.
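
The validation in step 4 could start as a check for parameter names the catalog does not know about. This is a sketch, not a full validator: it assumes each emitted workflow node keeps its settings in a "parameters" object, it only sees top-level names, and unknown_parameters is a hypothetical helper. The catalog row is the trimmed Slack sample, so the known set here is smaller than the real one.

```python
import json

def unknown_parameters(node, catalog_row):
    """Return top-level parameter names in `node` that the catalog row lacks.

    Assumes `node` is one entry of an emitted workflow's node list and that
    properties_schema is a JSON string of {"name": ..., "type": ...} items.
    """
    schema = json.loads(catalog_row.get("properties_schema") or "[]")
    known = {p["name"] for p in schema}
    return sorted(set(node.get("parameters", {})) - known)

# Trimmed Slack row from the sample above.
slack_row = {
    "node_name": "slack",
    "properties_schema": '[{"name":"resource","type":"options"},'
                         '{"name":"operation","type":"options"}]',
}

# Hypothetical agent output with one invented field.
emitted_node = {
    "name": "Notify team",
    "parameters": {"resource": "message", "operation": "post", "dmTarget": "@alice"},
}

problems = unknown_parameters(emitted_node, slack_row)
print(problems)
```

A non-empty result is a signal to reject or repair the workflow before deploy. Checking parameter values, or anything below the top level, would need the full source behind github_permalink.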

This is the same shape as RAG over a tool catalog, which is becoming a pattern in its own right.
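
The retrieval half of that pattern (steps 1 to 3) can be prototyped without any embedding model. The sketch below swaps the vector store for plain token overlap, purely as a stand-in so the example stays self-contained; a real pipeline would replace the scoring with embeddings. The function names and sample rows are illustrative.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def row_text(row):
    """Concatenate the fields that step 1 would embed."""
    parts = [row.get("display_name") or "", row.get("description") or ""]
    parts += row.get("operations_supported") or []
    parts += row.get("credentials_required") or []
    return " ".join(parts)

def top_n(rows, request, n=5):
    """Rank rows by token overlap with the request (stand-in for embeddings)."""
    query = tokens(request)
    scored = [(len(query & tokens(row_text(r))), r) for r in rows]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:n]]

# Illustrative rows based on the samples earlier in the post.
rows = [
    {"node_name": "slack", "display_name": "Slack",
     "description": "Send and read messages, manage channels",
     "operations_supported": ["message", "channel", "user", "reaction"],
     "credentials_required": ["slackApi"]},
    {"node_name": "lmChatOpenAi", "display_name": "OpenAI Chat Model",
     "credentials_required": ["openAiApi"]},
]

hits = top_n(rows, "send a slack message to a channel")
print([r["node_name"] for r in hits])
```

Only the rows in hits would go into the agent's prompt, which is what keeps the context compact.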

Caveats

  • The properties schema is a top-level summary, not the full recursive parameter tree. For deep parameter shapes the github_permalink is your friend.
  • Multi-version nodes only report the default version. If you need every version of a node, the source link covers it.
  • License is CC-BY-4.0 on the catalog additions; the n8n source itself is governed by n8n's own license, which you should respect when you ship.

Links

If you build agent tooling on top of this, the thing I would most like to see is an open eval set: prompts in, expected n8n workflow JSON out. That is the next obvious missing piece, and I do not think anyone has shipped one yet.
