DEV Community: DataWorkers

Building an Incident Debugging Agent: What We've Learned So Far

DataWorkers — Mon, 01 Jun 2026 02:01:13 +0000

Data pipeline downtime costs enterprises $150K-$540K per hour. The average incident takes 2-4 hours to diagnose manually — most of that time spent tracing lineage across five different tools, not actually fixing the problem. We built the Incident Debugging Agent to eliminate that diagnostic bottleneck.

Incident debugging is where we started building. Not because it is the easiest problem, but because it is the most painful. Every data engineer we talked to described the same experience: an alert fires, you open your laptop, and you spend the next two to four hours manually tracing the problem across five different tools.

What the Agent Does

When a data incident occurs, the agent:

Ingests the alert context. It records what failed, when, and which system reported it.
Runs diagnostic queries. Checks freshness, row counts, null rates, schema changes, and value distributions for the affected table and its upstream dependencies.
Traces lineage. Uses the Data Context and Catalog Agent's lineage information to identify upstream sources and downstream consumers. Maps the blast radius.
Correlates with recent changes. Checks dbt deployment logs, schema migration history, and orchestrator runs for recent changes that could explain the breakage.
Generates a diagnosis. Produces a structured incident report with probable root cause, affected assets, blast radius, and suggested remediation steps.

The engineer gets a diagnosis, not a symptom. Instead of "Table X freshness check failed," they get "Table X has not been updated in 6 hours because the upstream Airflow DAG failed at the extraction step due to a source API timeout. 3 downstream dashboards are affected. Suggested action: re-trigger the DAG with exponential backoff."
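To make the shape of such a report concrete, here is a minimal sketch. It is not the agent's actual schema: the `IncidentReport` class and every field name are hypothetical, and the sample values simply restate the Table X example above.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """Hypothetical shape of a structured diagnosis; field names are illustrative."""
    failed_asset: str
    probable_root_cause: str
    affected_assets: list[str] = field(default_factory=list)
    suggested_actions: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)  # queries and results backing the diagnosis

    def blast_radius(self) -> int:
        # Number of downstream assets touched by the incident.
        return len(self.affected_assets)

    def summary(self) -> str:
        return (
            f"{self.failed_asset}: {self.probable_root_cause}. "
            f"{self.blast_radius()} downstream assets affected."
        )


report = IncidentReport(
    failed_asset="Table X",
    probable_root_cause="upstream Airflow DAG failed at extraction (source API timeout)",
    affected_assets=["dashboard_a", "dashboard_b", "dashboard_c"],
    suggested_actions=["re-trigger the DAG with exponential backoff"],
)
print(report.summary())
```

Keeping `evidence` as a first-class field reflects the point made later in this post: a diagnosis is only useful if the engineer can check what it was based on.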

Early Results (With Caveats)

Mean time to diagnosis: minutes, compared with hours for manual debugging.
Root cause accuracy: The majority of diagnoses correctly identify the root cause. The remainder identify a contributing factor but miss the primary cause.

Important caveats: these are early results from a limited set of historical incidents across a small number of environments. We share them to show directional progress, not to claim production readiness.

Where This Fails

Novel failure modes. The agent struggles with failures whose patterns it has not seen before: a subtle data drift, a business logic change, or an infrastructure issue that manifests as a data issue.
Cross-system correlation. When the root cause spans multiple systems, the agent's ability to correlate drops significantly.
Semantic understanding. The agent can tell you that values in a column changed. It cannot tell you whether those values are wrong.
False confidence. The agent sometimes generates plausible-sounding diagnoses that are wrong.

What We Learned About Trust

Data engineers do not trust a diagnosis they cannot verify. Every design partner conversation included some version of: "This is cool, but how do I know it is right?" The evidence chain — showing every query and result — is not a nice-to-have feature. It is the feature.

We also learned that the agent needs to say "I don't know." Our early versions always produced a diagnosis, even when the evidence was ambiguous. Engineers found this less trustworthy than an agent that says "I found these anomalies but cannot determine a definitive root cause."
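One way to let an agent abstain is to require that a single hypothesis clearly dominates before naming a root cause. The sketch below is an assumption about how that could look, not a description of our implementation: `Hypothesis`, `diagnose`, the evidence score, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    score: float  # hypothetical evidence score in [0, 1]


def diagnose(hypotheses: list[Hypothesis], min_score: float = 0.7, min_margin: float = 0.15) -> str:
    """Return a root cause only when one hypothesis clearly dominates the rest."""
    if not hypotheses:
        return "No anomalies found."
    ranked = sorted(hypotheses, key=lambda h: h.score, reverse=True)
    best = ranked[0]
    runner_up = ranked[1].score if len(ranked) > 1 else 0.0
    if best.score >= min_score and best.score - runner_up >= min_margin:
        return f"Probable root cause: {best.cause}"
    # Ambiguous evidence: list what was found instead of guessing.
    found = ", ".join(h.cause for h in ranked)
    return f"I found these anomalies but cannot determine a definitive root cause: {found}"
```

The margin check matters as much as the absolute score: two strong but competing explanations should also produce an "I don't know."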

Originally published at https://dataworkers.io/blog/building-incident-debugging-agent/. Data Workers is an open-source autonomous agent swarm for data engineering — see the repo.

Why We Open-Sourced 14 Autonomous Data Engineering Agents

DataWorkers — Mon, 01 Jun 2026 02:00:55 +0000

Today we released the community edition of Data Workers: 14 autonomous agents for data engineering, open-sourced under Apache 2.0. This post explains why we made that decision, how the trust model works, and what we are looking for from the community.

What Ships in the Open

The community edition includes 14 agents covering the core data engineering lifecycle: incident debugging, quality monitoring, schema evolution, pipeline construction, data context and cataloging, governance and security, cost optimization, data migration, insights and analytics, streaming operations, orchestration coordination, connector management, observability, and usage intelligence.

In concrete terms, that is 202+ MCP tools, 15 catalog connectors (Snowflake, Databricks, BigQuery, Unity Catalog, Hive Metastore, and more), and 3,000+ passing tests. Every agent has its own MCP server. Every tool call is auditable.

A 15th agent for ML model monitoring ships as enterprise-only, alongside 35 additional enterprise connectors and features like PII detection middleware, tamper-evident audit trails, and OAuth 2.1 authentication. The community edition has no feature gates on the 14 agents it includes.

Why Open Source?

The short answer: because black-box agents and critical data infrastructure do not mix.

When an agent modifies your Airflow DAGs, evolves a schema in production, or recommends dropping an unused table that turns out to be consumed by a downstream team you did not know about, you need to understand exactly what logic drove that decision. You need to read the code. You need to audit the tool calls. You need to verify the reasoning.

That requirement is not compatible with a closed-source product. We considered offering a hosted-only service and rejected it. Data engineers are rightfully skeptical of autonomous systems they cannot inspect. We would be too. Proprietary agents also carry four costs that are easy to underestimate:

Vendor lock-in compounds over time. Once an agent manages your pipeline configurations, incident response, and governance policies, switching costs become prohibitive. Your operational knowledge lives in a system you do not own.
Customization hits walls. Every data environment is different. When a proprietary agent does not handle your specific migration pattern, you file a feature request and wait. With open source, you fix it yourself.
Audit requirements grow. Regulated industries need to demonstrate exactly how autonomous systems make decisions. Reading the actual source code satisfies auditors in a way that vendor assurances do not.
Incident response is blind. When a proprietary agent makes a bad decision at 2 AM, your on-call engineer cannot read the code to understand what happened.

Because it is Apache 2.0, your investment is protected even if we disappear tomorrow. Fork it, modify it, run it in production indefinitely. The license guarantees that.

The Trust Model: Read-Only by Default

Every agent in the swarm is designed to operate in read-only mode by default. Agents observe, diagnose, and recommend. They do not take write actions unless you explicitly opt in.

This is a deliberate architectural decision, not a temporary limitation. The trust model works in three tiers:

Observe. Agents connect to your data stack, read metadata, trace lineage, and surface findings. No write access required.
Recommend. Based on observations, agents propose specific actions: fix this query, evolve this schema, drop this unused table. Each recommendation includes the reasoning chain and the tool calls that produced it.
Act (opt-in only). With explicit configuration, agents can execute approved action types autonomously. Human approval gates are available for every write operation. You control exactly how much autonomy each agent gets.

Every MCP tool in the system is tagged as either a READ or WRITE operation. Write tools are disabled by default and require explicit enablement per agent, per environment.
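The gating rule can be sketched in a few lines. This is an illustration of the idea only: `ToolRegistry`, `Access`, and the tool and agent names are hypothetical and do not mirror the real codebase.

```python
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"


class ToolRegistry:
    """Hypothetical registry: write tools stay disabled unless enabled per agent and environment."""

    def __init__(self) -> None:
        self._tools: dict[str, Access] = {}
        self._write_enabled: set[tuple[str, str]] = set()  # (agent, environment) pairs

    def register(self, tool: str, access: Access) -> None:
        self._tools[tool] = access

    def enable_writes(self, agent: str, environment: str) -> None:
        self._write_enabled.add((agent, environment))

    def is_allowed(self, tool: str, agent: str, environment: str) -> bool:
        access = self._tools.get(tool)
        if access is None:
            return False  # unknown tools are denied
        if access is Access.READ:
            return True
        return (agent, environment) in self._write_enabled


registry = ToolRegistry()
registry.register("trace_lineage", Access.READ)
registry.register("rerun_pipeline", Access.WRITE)
registry.enable_writes("incident", "staging")
```

With this setup the incident agent can rerun a pipeline in staging but not in production, because the opt-in is scoped to the (agent, environment) pair rather than to the agent alone.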

What This Looks Like in Practice

Consider a production incident: a key pipeline fails at 2 AM. Without Data Workers, an on-call engineer wakes up, checks Airflow logs, traces the failure upstream through dbt, queries Snowflake to find the root cause, and manually applies a fix. That typically takes 30 to 90 minutes on a good night.

With the community edition, the incident agent detects the failure, traces lineage across tools, identifies the root cause, and presents a diagnosis with full evidence. The agent shows you exactly what it found, what it checked, and what it recommends. The goal is to compress diagnosis from roughly an hour to minutes.

The community edition tells you the root cause. The Pro tier lets the agent automatically apply the fix and rerun the pipeline, with approval gates you configure. That is the upgrade path: not gated features on the same work, but additional autonomy on top of full transparency.

How We Built It

The architecture is MCP-first. Each agent runs its own MCP server, exposing tools that other agents and external clients can call. Agents coordinate through shared context rather than a centralized orchestrator.

14 specialized agents, each focused on one domain of data engineering
202+ MCP tools across all agents, with clear READ/WRITE separation
15 catalog connectors for cross-platform data discovery
Factory-pattern infrastructure that auto-detects real services from environment variables and falls back to in-memory stubs for local development
3,000+ tests covering tool functionality, agent coordination, and edge cases
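The factory pattern in that list can be sketched as follows. This is a simplified illustration, not the project's code: the class names and the `WAREHOUSE_ACCOUNT` environment variable are hypothetical.

```python
import os
from typing import Mapping, Optional


class InMemoryWarehouse:
    """Stub used for local development; holds no real connection."""
    name = "in-memory"


class RealWarehouse:
    """Placeholder for a real client; a production version would open a connection here."""
    name = "real"

    def __init__(self, account: str) -> None:
        self.account = account


def make_warehouse(env: Optional[Mapping[str, str]] = None):
    """Pick a backend from the environment. WAREHOUSE_ACCOUNT is a hypothetical variable name."""
    env = os.environ if env is None else env
    account = env.get("WAREHOUSE_ACCOUNT")
    if account:
        return RealWarehouse(account)
    # No credentials configured: fall back to the stub so tests and demos still run.
    return InMemoryWarehouse()
```

Passing the environment as an argument keeps the factory easy to test without mutating `os.environ`.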

We spent 12 months on research and development before this release. The agent designs are grounded in real data engineering workflows, not hypothetical use cases. That said, we are early stage and honest about it. These agents are designed to handle production scenarios, but they have not yet been battle-tested across hundreds of environments. That is what the next phase is for.

The Business Model

The community edition is free and fully functional for the 14 agents it includes. The Pro and Enterprise tiers add operational autonomy (write actions, automated remediation), the 15th ML monitoring agent, 35 additional enterprise connectors, PII detection, tamper-evident audit logs, OAuth 2.1 authentication, and dedicated support.

The line is straightforward: transparency and diagnosis are free. Autonomy and enterprise security are paid.

We Are Looking for Design Partners

We are looking for design partners to validate these agents in real environments. If you run a data stack with more than a few pipelines and have experienced the 2 AM incident, the schema change that broke downstream consumers, or the warehouse bill that quietly doubled, we want to work with you.

What design partners get: direct access to the engineering team, influence on the roadmap, early access to Pro features during the validation period, and the knowledge that the agents are being shaped by your real-world requirements.

What we get: honest feedback on what works, what does not, and what we missed.

Clone the repo: github.com/DataWorkersProject/dataworkers-claw-community

Join the community: discord.com/invite/b8DR5J53

Read the docs and pricing: dataworkers.io

We built this in the open because we believe that is the only way autonomous agents earn trust in production. Read the code. Tell us what is wrong. Help us make it better.

Originally published at https://dataworkers.io/blog/why-we-open-sourced-14-autonomous-data-engineering-agents/.

Why We Bet on MCP (And What We're Still Figuring Out)

DataWorkers — Sun, 31 May 2026 01:51:04 +0000

When we started building Data Workers, we had to make a foundational decision: how do our AI agents connect to the dozens of tools in a modern data stack? We could build custom integrations for each tool. We could use existing orchestration frameworks. Or we could bet on the Model Context Protocol (MCP).

We bet on MCP. Here is why, and what we are still figuring out.

What MCP Actually Is

MCP is an open protocol, originally developed by Anthropic, that standardizes how AI models interact with external tools and data sources. Think of it as a USB-C port for AI — a universal connector that lets an AI agent talk to any tool that implements the protocol.

The ecosystem has exploded. There are now 12,230+ MCP servers available, covering everything from databases to CI/CD tools to cloud platforms. A year ago, this number was in the hundreds.

Why We Chose MCP Over Custom Integrations

The math is simple. Data Workers needs to connect to warehouses (Snowflake, Databricks, BigQuery, Redshift), orchestrators (Airflow, Dagster, Prefect), transformation tools (dbt, Spark), catalogs (Unity Catalog, DataHub, Hive Metastore), BI tools (Tableau, Looker, Power BI), and more.

Building and maintaining custom integrations for each of these is a full-time job for a team our size. With MCP, we get a standard interface. If a tool has an MCP server, our agents can connect to it. We are building custom MCP servers for each agent in our swarm.

What Is Working

Rapid prototyping. Our Incident Debugging Agent prototype connected to Snowflake query logs, dbt manifests, and Airflow DAGs through MCP in days, not weeks.
Composability. Because each agent has its own MCP server, agents can share context through the protocol. When the Incident Debugging Agent identifies a data quality issue, it can invoke tools from the Quality Monitoring Agent's server.
Community leverage. We do not have to build an Airflow integration from scratch because community MCP servers for Airflow already exist.

What We're Still Figuring Out

Authentication at scale. Managing credentials across dozens of tools in an enterprise environment is complex: it involves OAuth flows, service accounts, token rotation, and least-privilege access.
Latency. Each MCP call adds network overhead. When an agent needs to make 15-20 tool calls to diagnose an incident, those round trips add up.
Server quality variance. The 12,230+ MCP servers vary wildly in quality. We have had to fork and fix community servers more than we expected.
Stateful workflows. MCP is fundamentally request-response. But data engineering workflows are stateful. We are building a context layer on top of MCP to handle this.
Security surface area. Every MCP connection is an attack surface. When an agent can execute queries against your warehouse, the security implications are serious.

Our Honest Assessment

MCP is the right bet for us. The alternative — building custom integrations — would consume our entire engineering bandwidth. MCP lets a small team connect to a broad tool landscape.

But MCP is not a silver bullet. It solves the connector problem, not the intelligence problem. Our agents still need to know what queries to run, how to interpret results, and when to escalate to a human. MCP gives us the plumbing. We still have to build the logic.

Originally published at https://dataworkers.io/blog/why-we-bet-on-mcp/.

Copilots, Agents, and Swarms: A Decision Framework for Data Teams

DataWorkers — Sun, 31 May 2026 01:38:03 +0000

Every vendor in data engineering sells an 'agent' now. Every product has 'agentic capabilities.' The word has lost all meaning, which makes it harder for data teams to evaluate what they actually need and what is just marketing.

After talking to dozens of data teams, we think the confusion comes from collapsing three fundamentally different things into one buzzword. Getting the category wrong means either over-building (spending agent-level effort on a copilot problem) or under-building (slapping a chat interface on something that needs autonomous capability).

Copilots: AI as an Assistant

A copilot helps a human do their existing job faster. It responds to explicit requests. It does not take independent action. Think GitHub Copilot for pipeline code, or Databricks Assistant for SQL.

Good for: Writing SQL queries, generating dbt models, explaining error messages, exploring unfamiliar datasets. Useful — but limited to tasks where the human is always present and initiating.

The limitation that matters: Copilots do not handle multi-step workflows. They do not monitor your pipelines at 2 AM. They do not alert, triage, or take action when you are asleep. If a pipeline breaks on Saturday night, your copilot is not going to fix it.

Agents: AI as a Specialist

An agent handles a specific workflow end-to-end with limited human oversight. It operates on triggers — an alert fires, a schema changes, a query fails — rather than waiting for human prompts. It can observe, decide, and act within a defined domain.

Good for: Incident triage, data quality monitoring, schema change management, cost optimization — workflows where the trigger-observe-decide-act loop is well-defined and the patterns are repeatable.

Where it gets interesting: Databricks Genie and BigQuery Data Canvas are copilots — you ask a question, they write a query. An agent like our Data Science and Insights Agent grounds queries in a semantic layer, disambiguates business terms (is 'revenue' gross or net?), and validates results against governed definitions before returning an answer. Google's benchmarks show a 66% accuracy improvement when queries are grounded in a semantic layer. That gap is the difference between a copilot and an agent.

Swarms: Coordinated Agent Teams

A swarm is multiple agents that share context and coordinate actions. The whole is greater than the sum of the parts because agents can hand off context, trigger each other, and maintain a shared understanding of the environment.

Why this matters: When an incident spans quality, lineage, schema, and governance simultaneously, a single agent cannot solve it. You need coordinated intelligence — the Quality Agent provides diagnostic context, the Schema Agent generates the fix, the Pipeline Agent deploys it, the Catalog Agent documents what happened. Four agents, coordinated automatically, resolving what would take a human hours.

How to Decide What You Need

Ask three questions:

Does this task require autonomous action? If the human is always present, you want a copilot. If the work needs to happen when no one is watching, you want an agent.
Does this task span multiple domains? If self-contained, a single agent or copilot is fine. If it requires context from multiple systems, you want coordinated agents.
What is the cost of a wrong action? If cheap to fix, a copilot with minimal guardrails works. If expensive (production data, financial reports, compliance), you need agents with human-in-the-loop approval, audit trails, and rollback capability.

Most data teams need all three categories for different problems. The mistake is treating 'agent' as a universal solution. Match the architecture to the problem.
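One possible reading of the three questions, written as a function. The mapping is our interpretation for illustration; the function name, parameter names, and output strings are hypothetical.

```python
def recommend(needs_autonomy: bool, spans_domains: bool, costly_mistakes: bool) -> str:
    """Map the three questions to a category. Output wording is illustrative."""
    if not needs_autonomy:
        choice = "copilot"   # the human is always present and initiating
    elif spans_domains:
        choice = "swarm"     # needs context from multiple systems
    else:
        choice = "agent"     # one well-defined, trigger-driven workflow
    if costly_mistakes:
        # Expensive errors call for guardrails regardless of category.
        choice += " with human approval, audit trails, and rollback"
    return choice


print(recommend(needs_autonomy=True, spans_domains=False, costly_mistakes=True))
```

Running this per problem, rather than once per team, matches the point above: different problems on the same team will land in different categories.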

Originally published at https://dataworkers.io/blog/copilots-agents-swarms-framework/.

How to Give Claude Access to Snowflake Without Exposing PII

DataWorkers — Sun, 31 May 2026 01:25:03 +0000

You want Claude — or Cursor, or ChatGPT, or any MCP-aware agent — to answer questions about your Snowflake data. You also do not want the agent to read social security numbers, free-text customer notes, or anything subject to GDPR / HIPAA / SOC 2. The default MCP setup hands the agent everything its connection role can see. That is the problem.

This post walks through five layers of defense, ordered from cheapest to most thorough. Each is independent, so pick the ones that match your risk tolerance. Going by the per-layer estimates later in this post, the four Snowflake-side layers take an hour or two to set up on an existing account, and the catalog guardrail takes about a day.

The Default Posture (and Why It Is Wrong)

A typical MCP server for Snowflake — including the official one — connects with a service account, exposes a query tool, and lets the model run any SQL the role can run. That role is usually scoped to a warehouse and a database, but rarely to columns or row sets. The model gets a fluent SQL interface to your warehouse and the warehouse trusts every query it sees.

The blast radius is large. According to the 2024 IBM Cost of a Data Breach Report, the average cost of a data breach hit $4.88M, with breaches involving extensive cloud data exposure costing 23% more than average. Letting an AI agent run uncurated queries against a production warehouse is exactly the cloud-data-exposure category that drives the premium.

Layer 1: A Dedicated MCP Role

First step, every time: create a role that exists only for the agent. Do not reuse the analytics role, do not reuse the dbt role, and definitely do not use SYSADMIN.

Grant USAGE on the warehouse you want the agent to use. Use a small, dedicated warehouse (X-Small or Small) so a runaway query has a bounded cost ceiling.
Grant USAGE on the database and the specific schemas the agent should see.
Grant SELECT on the specific views the agent should query — not raw tables. Views give you a place to apply masking, filters, and joins without modifying the underlying data.
Never grant CREATE, INSERT, UPDATE, DELETE, or TRUNCATE. The agent is a read-only role.

A read-only role with view-only SELECT grants is roughly 80% of what most teams need. The remaining 20% is where the PII risk actually lives.
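The grants above can be generated from a short helper so the role definition lives in version control. This is a sketch under assumptions: the role, warehouse, database, schema, and view names are placeholders, and the helper interpolates identifiers directly, so only pass names you control.

```python
def agent_role_grants(role: str, warehouse: str, database: str, schema: str, views: list[str]) -> list[str]:
    """Build read-only GRANT statements for a dedicated agent role (all names are placeholders)."""
    statements = [
        f"CREATE ROLE IF NOT EXISTS {role};",
        f"GRANT USAGE ON WAREHOUSE {warehouse} TO ROLE {role};",
        f"GRANT USAGE ON DATABASE {database} TO ROLE {role};",
        f"GRANT USAGE ON SCHEMA {database}.{schema} TO ROLE {role};",
    ]
    # SELECT on specific views only, never on raw tables.
    statements += [
        f"GRANT SELECT ON VIEW {database}.{schema}.{view} TO ROLE {role};" for view in views
    ]
    return statements


for statement in agent_role_grants("MCP_AGENT", "MCP_XS_WH", "ANALYTICS", "GOVERNED", ["ORDERS_V"]):
    print(statement)
```

The helper never emits a write privilege, which makes the "read-only" property easy to assert in a test before the statements reach Snowflake.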

Layer 2: Column-Level Masking Policies

Snowflake supports masking policies that fire based on the executing role. The same SELECT statement returns the raw value for an analyst role and a masked value for the agent role. This is the single most important PII control because it does not depend on the agent or the MCP server behaving correctly.

A masking policy that returns SHA2(email) for any role except ANALYTICS_HUMAN means even if the model is jailbroken into producing a SELECT * query, it gets hashes, not addresses. The policy is enforced at the SQL engine layer, not at the application layer.
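A policy like the one just described could be generated as follows. Treat this as a sketch: the policy, table, and column names are placeholders, and the helper assumes a STRING column and trusted identifier values.

```python
def email_masking_sql(policy: str, table: str, column: str, unmasked_role: str) -> list[str]:
    """Return Snowflake statements that create a hash-based masking policy and attach it to a column."""
    create = (
        f"CREATE MASKING POLICY {policy} AS (val STRING) RETURNS STRING ->\n"
        f"  CASE WHEN CURRENT_ROLE() = '{unmasked_role}' THEN val ELSE SHA2(val) END;"
    )
    # Attaching the policy is what makes the engine enforce it on every query.
    attach = f"ALTER TABLE {table} MODIFY COLUMN {column} SET MASKING POLICY {policy};"
    return [create, attach]


for statement in email_masking_sql("EMAIL_MASK", "CUSTOMERS", "EMAIL", "ANALYTICS_HUMAN"):
    print(statement)
```

Because the policy keys on `CURRENT_ROLE()`, the agent role needs no special handling: any role other than the named one gets the hash.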

Apply masking policies to every column tagged as PII. If you do not have PII tags yet, an audit tool (or the Data Workers governance agent) can scan the schema and tag candidate columns automatically — emails, phone numbers, SSNs, free-text columns, IP addresses, dates of birth.

Layer 3: Row Access Policies

Masking hides values. Row access policies hide entire rows. For multi-tenant data — or any case where the agent should see only one customer's, one region's, or one fiscal-year's data — row access policies are the right primitive.

Common patterns: scope the agent role to the last 90 days of data, exclude rows tagged sensitive = true, restrict to a specific tenant_id. Like masking policies, these are enforced inside the engine — no application-layer code can bypass them.
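The tenant-scoping pattern might look like this. It is a minimal sketch with placeholder names: other roles keep full visibility, while the agent role sees only one tenant's rows. Identifiers are interpolated directly, so pass only trusted names; string values go through a small quoting helper.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def tenant_row_policy_sql(policy: str, table: str, column: str, agent_role: str, tenant_id: str) -> list[str]:
    """Return Snowflake statements that restrict the agent role to a single tenant's rows."""
    create = (
        f"CREATE ROW ACCESS POLICY {policy} AS ({column} VARCHAR) RETURNS BOOLEAN ->\n"
        f"  CURRENT_ROLE() <> {sql_literal(agent_role)} OR {column} = {sql_literal(tenant_id)};"
    )
    attach = f"ALTER TABLE {table} ADD ROW ACCESS POLICY {policy} ON ({column});"
    return [create, attach]


for statement in tenant_row_policy_sql("AGENT_TENANT_SCOPE", "ORDERS", "TENANT_ID", "MCP_AGENT", "acme"):
    print(statement)
```

A time-window or sensitivity filter follows the same shape: add the column to the policy signature and extend the boolean expression.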

Layer 4: Audit Logging

Every query the agent runs should be auditable for at least 30 days. Snowflake's QUERY_HISTORY view is the source of truth: it includes the SQL text, the executing role, the start and end times, and the number of rows produced. Pipe it into your SIEM or log store (Datadog, Splunk, S3+Athena) so you can answer 'what did the agent see last week' without writing custom code.

Tag every agent-driven query with a comment header (e.g., /* mcp_agent=data_workers, session=abc123 */) so you can filter QUERY_HISTORY trivially.
Set up an alert for any agent query that returns more than 10,000 rows. That is almost never the intended behavior.
Set up a hard query timeout on the agent's warehouse (try 60 seconds to start). Runaway agents are cheap when they cannot run for 30 minutes.
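The three audit habits above can be sketched as small helpers. The function names and the sample rows are hypothetical; the dictionary keys mirror the QUERY_TEXT and ROWS_PRODUCED columns of QUERY_HISTORY. It is worth verifying that a leading comment survives into the recorded query text in your setup; Snowflake's QUERY_TAG session parameter is an alternative way to label agent sessions.

```python
def tag_query(sql: str, agent: str, session: str) -> str:
    """Prefix a query with a comment header so agent traffic is easy to filter later."""
    return f"/* mcp_agent={agent}, session={session} */ {sql}"


def oversized(history: list[dict], agent: str, max_rows: int = 10_000) -> list[dict]:
    """Return the agent's queries that produced more rows than the alert threshold."""
    marker = f"mcp_agent={agent}"
    return [
        row for row in history
        if marker in row["query_text"] and row["rows_produced"] > max_rows
    ]


def timeout_sql(warehouse: str, seconds: int = 60) -> str:
    """Statement that caps how long any query on the agent's warehouse may run."""
    return f"ALTER WAREHOUSE {warehouse} SET STATEMENT_TIMEOUT_IN_SECONDS = {seconds};"


# Invented sample rows standing in for an export of QUERY_HISTORY.
history = [
    {"query_text": tag_query("SELECT * FROM ORDERS_V", "data_workers", "abc123"), "rows_produced": 250_000},
    {"query_text": tag_query("SELECT COUNT(*) FROM ORDERS_V", "data_workers", "abc123"), "rows_produced": 1},
    {"query_text": "SELECT * FROM ORDERS_V", "rows_produced": 900_000},  # human analyst, untagged
]
print(oversized(history, "data_workers"))
print(timeout_sql("MCP_XS_WH"))
```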

Layer 5: Schema-Aware Catalog as a Guardrail

The most subtle PII leak is the one that comes from the agent picking the wrong table. The agent does not know that customers_legacy was deprecated in 2024 but never deleted. It does not know that orders_raw has unredacted payment data but orders has the cleaned version. Without a catalog, the agent picks whichever table sounds right.

A data catalog that the agent reads before writing SQL solves this. The agent asks the catalog: 'Where is order data?' and the catalog responds with the governed view, the ownership, the freshness, and the PII tags. The agent never sees the legacy table because the catalog never surfaces it.

This is exactly what Data Workers' Catalog Agent does. It exposes catalog discovery as MCP tools, so when Claude queries it for 'order data', it gets the governed answer — same response shape, same masking policies applied. The catalog itself enforces what the agent can see.
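The guardrail idea reduces to a filter over catalog metadata. The sketch below uses an in-memory list in place of a real catalog; `CatalogEntry`, `discover`, and the `governed` and `deprecated` flags are hypothetical, and the table names come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    topic: str
    governed: bool
    deprecated: bool = False
    pii_tags: tuple[str, ...] = ()


CATALOG = [
    CatalogEntry("orders", "order data", governed=True),
    CatalogEntry("orders_raw", "order data", governed=False, pii_tags=("payment",)),
    CatalogEntry("customers_legacy", "customer data", governed=False, deprecated=True),
]


def discover(topic: str, catalog: list[CatalogEntry] = CATALOG) -> list[CatalogEntry]:
    """Surface only governed, non-deprecated assets for a topic."""
    return [
        entry for entry in catalog
        if entry.topic == topic and entry.governed and not entry.deprecated
    ]


print([entry.name for entry in discover("order data")])
```

The agent cannot pick `orders_raw` or `customers_legacy` by accident because neither ever appears in a discovery response.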

What Each Layer Buys You

Layer	Defends Against	Setup Time	Production Impact
Dedicated MCP role	Privilege escalation	10 min	None
Column masking	PII column exfiltration	20 min per table	<1ms per query
Row access policies	Tenant / scope leakage	30 min per table	<5ms per query
Audit logging	Detection after the fact	1 hr (with SIEM)	Storage cost
Catalog guardrail	Wrong-table selection	1 day to wire MCP	Adds 1 round-trip

Frequently Asked Questions

Do these controls work with ChatGPT and Cursor too, or just Claude? Yes. All of these are Snowflake-side controls. They apply regardless of which MCP client is connecting — Claude, Cursor, OpenClaw, ChatGPT (via remote MCP), or a custom agent.

What about BigQuery and Databricks? Same five layers. BigQuery has authorized views and column-level access controls; Databricks has Unity Catalog row filters and column masks. The naming differs, the pattern does not.

Will masking break joins or aggregations? Masking policies preserve datatype, so JOIN and GROUP BY still work — they just operate on the masked value. For HASH-based masks, this means 'group by hashed email' still gives you per-customer counts.

How do I know if my current MCP server is leaking PII? Run a query against QUERY_HISTORY filtered to your agent role, look at the SQL text, and check whether any of those queries select columns that should have been masked. If you cannot tell whether a column should be masked, you do not have PII tagging yet — start there.
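On the earlier question about joins and aggregations, a quick local check shows why deterministic hashing preserves per-customer counts. This sketch uses Python's `hashlib.sha256` as a stand-in for a hash-based mask; the email addresses are made up.

```python
import hashlib
from collections import Counter


def mask(email: str) -> str:
    # Deterministic hash: the same input always yields the same masked value.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


orders = ["ana@example.com", "bo@example.com", "ana@example.com"]

raw_counts = Counter(orders)
masked_counts = Counter(mask(email) for email in orders)

# Same group sizes, but the keys no longer reveal the addresses.
print(sorted(raw_counts.values()), sorted(masked_counts.values()))
```

The same property is what keeps joins working, as long as both sides of the join are masked with the same function.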

The Data Workers governance agent ships a pii_audit_snowflake tool that does the scan, the tag, and the policy generation in one call. It is open source. The point of this post, though, is that you do not need it: four SQL-level controls and an hour of work close the largest part of the gap. The catalog guardrail is the icing.

If you are running into specific issues setting this up, we keep notes on what works in our open-source repo at github.com/DataWorkersProject/dataworkers-claw-community. Issues and PRs welcome.

Originally published at https://dataworkers.io/blog/claude-snowflake-without-pii-exposure/.

MCP Servers for BI Tools: Looker, Tableau, Power BI, Mode (2026)

DataWorkers — Sun, 31 May 2026 01:12:02 +0000

Every AI-agent-meets-data-stack project hits the same problem in the same order. First the agent connects to the warehouse and runs raw SQL. Then someone notices it is bypassing the semantic layer and getting numbers wrong. Then someone proposes 'just point it at the BI tool' — and the project stalls for six months because the BI surface is the most heterogeneous, least API-friendly part of the modern data stack.

MCP changes that. The Model Context Protocol gives every BI vendor a way to expose dashboards, datasets, and semantic models to AI agents through a single contract. As of May 2026, three of the four BI tools covered here have working MCP coverage; Mode has none. The catch is that 'working' means different things in each ecosystem.

The Four BI Tools, Their MCP Surfaces, and Their Trade-offs

BI Tool	MCP Server	Surface Exposed	Auth	Production-Ready?
Looker	Community LookML MCP servers; no official Google server yet	Explores, dashboards, LookML measures/dimensions	API3 client_id / client_secret	Beta: broadest coverage of LookML semantics; gated behind admin API access
Tableau	Community Tableau Server / Cloud MCP servers	Workbooks, views, published data sources, VizQL	Personal access tokens	Beta: read-heavy; write actions limited
Power BI	Power BI Analyst MCP (community)	Workspaces, datasets, DAX queries, measures	Azure AD service principal	Beta: DAX execution plus large-result paging via local CSV
Mode	No official or community MCP yet	n/a (Mode's REST API is the workaround)	n/a	No: query via REST or Mode's native AI

Why BI Is Harder Than Warehouses

A Snowflake or BigQuery MCP server has it easy. The data is in tables, the query language is SQL, the auth model is roles, and the audit log lives in one place. BI tools are the opposite of all four:

The data is in projections. A 'view' or 'workbook' or 'report' is a derivation of underlying tables, often with embedded calculations the warehouse cannot see. An agent that reads only the warehouse misses the actual answer.
The query language is proprietary per vendor. LookML, DAX, VizQL, Mode's HTML/CSS-embedded SQL — each is a different surface. No common abstraction.
The auth model is per-user or per-app, with row-level security baked in. What an analyst sees in Looker is different from what an executive sees in the same dashboard. Bypassing that for an agent breaks the security model.
Audit trails are vendor-specific and often partial. Compared to Snowflake's QUERY_HISTORY, BI audit logs are inconsistent. Wiring agent access without observability is the easiest way to lose track of what the agent did.

What Each MCP Server Actually Does

Looker MCP servers (multiple community projects) expose Explores (LookML's semantic abstraction) as discoverable resources, and let agents construct queries by combining dimensions and measures. The strongest path is to expose LookML's governed metrics as tools — query_revenue(time_grain, breakdown_by) becomes a typed MCP tool rather than a raw SQL surface. This matches the semantic-layer guardrail pattern that reduces text-to-SQL hallucinations by ~66% (per Google's benchmarks).
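A typed tool of this kind might be sketched as below. The allowed values, the returned request shape, and the validation style are assumptions for illustration; only the name `query_revenue` and its two parameters come from the text above.

```python
from typing import Literal, get_args

# Hypothetical allowed values; a real server would derive these from the LookML model.
TimeGrain = Literal["day", "week", "month", "quarter"]
Breakdown = Literal["region", "product", "channel"]


def query_revenue(time_grain: TimeGrain, breakdown_by: Breakdown) -> dict:
    """Validate arguments and return a structured request for a governed metric."""
    if time_grain not in get_args(TimeGrain):
        raise ValueError(f"unsupported time_grain: {time_grain!r}")
    if breakdown_by not in get_args(Breakdown):
        raise ValueError(f"unsupported breakdown_by: {breakdown_by!r}")
    # The agent never writes SQL; it only fills in typed parameters.
    return {"measure": "revenue", "time_grain": time_grain, "breakdown_by": breakdown_by}


print(query_revenue("month", "region"))
```

Rejecting unknown values at the tool boundary is the point of the pattern: the model can only ask for combinations the semantic layer already defines.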

Tableau MCP servers are read-heavier. They expose published data sources, workbooks, and views; querying typically resolves through VizQL or the published data source's underlying connection. The practical pattern is one tool per data source, with the agent picking the right one based on the question.

Power BI Analyst MCP is the most production-ready of the community options. It connects through Azure AD service principals, lets agents browse workspaces, datasets, tables, and measures, and runs DAX queries. Notable: it pages large query results to local CSV so an agent does not blow its context window on a million rows.
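The paging idea is straightforward to sketch: write the full result to disk and hand the agent only a preview and a file path. This is not the Power BI Analyst MCP implementation; `page_to_csv` and its return shape are hypothetical.

```python
import csv
import tempfile
from pathlib import Path
from typing import Iterable, Sequence


def page_to_csv(columns: Sequence[str], rows: Iterable[Sequence], preview_rows: int = 5) -> dict:
    """Write all rows to a local CSV and return only a small preview plus the file path."""
    preview = []
    total = 0
    with tempfile.NamedTemporaryFile(
        "w", newline="", suffix=".csv", delete=False, encoding="utf-8"
    ) as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        for row in rows:
            writer.writerow(row)
            if total < preview_rows:
                preview.append(list(row))
            total += 1
        path = Path(handle.name)
    # Only this small summary goes back into the model's context.
    return {"path": str(path), "row_count": total, "preview": preview}


result = page_to_csv(["id", "amount"], ((i, i * 10) for i in range(1000)))
print(result["row_count"], result["preview"][:2])
```

Streaming rows through the writer means the full result never has to sit in memory, let alone in the context window.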

Mode: no MCP yet. The pragmatic workaround is to use Mode's REST API behind a thin MCP wrapper (5-10 tools: list reports, run a parameterized report, fetch results). Several teams have built private versions; nothing is published yet as of May 2026.

Production Checklist (Same Across All Four)

Read-only auth. Always start the agent with read-only credentials, even if the BI tool supports writes via the API. The blast radius of an agent accidentally publishing a dashboard is large.
Row-level security must pass through. Do not impersonate an admin account; pass the actual user identity or use a least-privilege service principal scoped to the questions the agent will answer.
Cache layer for large datasets. BI tools are not optimized for repeated identical queries from an agent's exploratory loop. Add a 5-15 min cache for query results unless the freshness requirement is sub-minute.
Log every MCP call to your existing observability stack (Datadog, Honeycomb, etc.). BI vendors will not give you the granularity you need.
Quota the agent's question budget. A loop agent can rack up thousands of dashboard renders without anyone noticing. Set a daily quota per agent identity.
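The cache and quota items can be combined in one thin wrapper around whatever function actually calls the BI tool. This is a sketch under assumptions: `CachedQuotaClient`, its defaults, and the injected `run_query` callable are hypothetical. The cache is keyed by agent identity as well as query text so that results shaped by row-level security are never shared across identities.

```python
import time


class CachedQuotaClient:
    """Hypothetical wrapper adding a TTL cache and a daily quota around a BI query function."""

    def __init__(self, run_query, ttl_seconds=600, daily_quota=500, clock=time.time):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._quota = daily_quota
        self._clock = clock
        self._cache = {}  # (agent, query text) -> (expires_at, result)
        self._used = {}   # (agent, day number) -> calls made

    def query(self, agent: str, text: str):
        now = self._clock()
        hit = self._cache.get((agent, text))
        if hit and hit[0] > now:
            return hit[1]  # cache hits do not consume quota
        day = int(now // 86_400)
        used = self._used.get((agent, day), 0)
        if used >= self._quota:
            raise RuntimeError(f"daily quota exhausted for {agent}")
        self._used[(agent, day)] = used + 1
        result = self._run_query(text)
        self._cache[(agent, text)] = (now + self._ttl, result)
        return result
```

Injecting the clock makes expiry and quota rollover testable without sleeping.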

Where This Is Headed

We expect every major BI vendor to ship an official MCP server by the end of 2026, and the community servers to be absorbed or formalized. What will not change quickly is the underlying complexity that makes BI integration hard: heterogeneous query languages, per-user security models, and vendor-specific audit. The MCP server is a contract, not a fix.

The teams that get this right early will be the ones whose AI agents answer business questions with the same numbers the dashboards show — not approximations from raw SQL. That alignment is what makes AI agents trustworthy to non-engineering stakeholders, and it is what determines whether AI rolls out company-wide or stays in a sandbox.

Frequently Asked Questions

Does Data Workers ship an MCP server for any of these BI tools? Catalog and lineage tools, yes. BI-specific (Looker, Tableau, Power BI, Mode), not yet — we partner with the community servers listed above and the upcoming official ones. Our Insights agent is the layer above: it composes BI server outputs with warehouse and catalog data to answer questions across the stack.

Can I use ChatGPT instead of Claude with these MCP servers? Yes. ChatGPT supports remote MCP via the Apps platform. The same servers work, but the auth flow differs (OAuth instead of local credentials).

What about Superset, Metabase, and Hex? Superset has a mature community MCP (135+ tools — bintocher/mcp-superset). Metabase has 1luvc0d3/metabase-mcp with 28 tools. Hex has no public MCP server as of May 2026. The same production checklist applies to all three.

Will MCP eat BI tools' UI surface? Long-term, partially. Most exploratory analytics will move into chat interfaces backed by MCP. Heavy-customization dashboards (executive summaries, embedded analytics) will stay in BI UIs. The split will look like terminal vs IDE — both exist, different jobs.

We track the MCP-for-BI ecosystem in the Data Workers OSS repo at github.com/DataWorkersProject/dataworkers-claw-community. PRs welcome with new servers as the community ships them.

Originally published at https://dataworkers.io/blog/mcp-servers-for-bi-tools-looker-tableau-powerbi-mode/. Data Workers is an open-source autonomous agent swarm for data engineering — see the repo.

Atlan Alternatives: 6 Open-Source Data Catalogs Compared (2026)

DataWorkers — Sun, 31 May 2026 01:11:31 +0000

Atlan does a lot of things well. It also costs $40-80k/year for mid-market deployments, and it gates several features (machine-learning auto-classification, certain integrations, advanced lineage) behind enterprise tiers. If you have a tight budget, a roadmap that cannot depend on a single vendor's velocity, or just a strong open-source preference, the alternatives are stronger in 2026 than they were even six months ago.

This is the field, ranked by what each one is actually best at — not by feature-checkbox count. We will explicitly say where Atlan is still better, because pretending otherwise wastes your time.

Quick Comparison Matrix

Tool	License	Strongest At	Weakest At	Best For
OpenMetadata	Apache 2.0	Lineage, glossary, native integrations	UI polish, real-time updates	Teams who want depth + community
DataHub (Acryl)	Apache 2.0	Streaming lineage, programmatic API	Setup complexity, learning curve	Engineering-led teams
Amundsen (Lyft)	Apache 2.0	Fast search, discovery UX	Lineage, governance workflows	Discovery-first use cases
Marquez (OpenLineage)	Apache 2.0	Lineage as a primitive, OpenLineage spec	Catalog UI, business metadata	Data engineering teams
Unity Catalog (open)	Apache 2.0	Multi-cloud governance, Iceberg native	Maturity outside Databricks	Databricks + Iceberg shops
Data Workers Catalog Agent	Apache 2.0	Cross-catalog search via MCP, agent-native	Single-pane UI (it is agent-first)	Teams using Claude/Cursor/ChatGPT

1. OpenMetadata — The Closest Open Atlan Equivalent

OpenMetadata is the most mature open-source catalog by adoption. It is backed by Collate (the company behind its commercial managed offering) and a large GitHub community (~6k stars, ~1k contributors). It covers data discovery, lineage, governance, glossary, quality, and observability in one platform.

What it does well: 90+ native connectors (Snowflake, BigQuery, Redshift, Databricks, Looker, Tableau, Power BI, Airflow, dbt, Fivetran). End-to-end lineage including column-level. Built-in tagging, glossary, classifications. Embedded data quality test framework. Active release cadence.

Where it is not Atlan: UI is less polished. Some advanced governance workflows are simpler. Real-time updates can lag in larger environments. Documentation is still catching up to the feature set.

Pick OpenMetadata if: you want the broadest feature set, are comfortable running a Postgres + Elasticsearch + service deployment, and have a team that can occasionally read Java/Python source code.

2. DataHub (Acryl) — The Engineering-Led Catalog

DataHub came out of LinkedIn and now drives Acryl's commercial offering. It is the most programmatically extensible catalog in the space: it emits CloudEvents, has a strong GraphQL API, and integrates streaming lineage via Kafka.

What it does well: real-time and streaming lineage (uniquely strong here). Programmatic ingestion is a first-class citizen — you can push metadata from any source without writing a connector. Strong RBAC. Good Snowflake / dbt / Airflow integrations.

Where it is not Atlan: steeper learning curve. The UI assumes a technical user. Setup is more involved than OpenMetadata (Kafka, MySQL, Elasticsearch, multiple services).

Pick DataHub if: your team is engineering-led, you want a catalog you can extend programmatically, and you have streaming data that needs streaming lineage.

3. Amundsen — The Discovery-First Option

Amundsen came out of Lyft and is laser-focused on data discovery — fast search, ranked results by usage, simple UX. It is intentionally less of an everything-tool than OpenMetadata or DataHub.

What it does well: search ranking is the best in the field. Sub-second discovery on millions of tables. Simple Neo4j + Elasticsearch + Flask stack. The UX gets analysts to data faster than any of the alternatives.

Where it is not Atlan: weak on governance workflows. Lineage support has improved but is still behind OpenMetadata/DataHub. Community activity has slowed since 2023 — fewer recent commits than the others on this list.

Pick Amundsen if: the problem you are solving is 'analysts cannot find data', and you are not yet trying to govern it.

4. Marquez + OpenLineage — Lineage As A First-Class Citizen

Marquez is the reference implementation of the OpenLineage spec — the emerging standard for emitting lineage events from any data tool (Airflow, dbt, Spark, Flink). It is not a full catalog, but it is the canonical way to get lineage right.

What it does well: pure lineage focus. Open standard (OpenLineage) means you are not locked in. Airflow has native OpenLineage support; dbt-OpenLineage adapter exists. Good Kubernetes deployment story.
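
To make "lineage events" concrete, here is a minimal sketch of an OpenLineage-style run event built with only the standard library. The top-level field names follow the OpenLineage RunEvent shape; the job, dataset, and producer values are hypothetical, and the `schemaURL` should be pinned to whichever spec version you actually target. In practice the official OpenLineage client libraries or the native Airflow and dbt integrations would emit this for you.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List, Tuple

DatasetRef = Tuple[str, str]  # (namespace, name)


def build_run_event(
    event_type: str,
    job_namespace: str,
    job_name: str,
    run_id: str,
    inputs: List[DatasetRef],
    outputs: List[DatasetRef],
) -> Dict[str, object]:
    """Assemble a run event dict in the OpenLineage RunEvent shape."""

    def dataset(ref: DatasetRef) -> Dict[str, str]:
        namespace, name = ref
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ref) for ref in inputs],
        "outputs": [dataset(ref) for ref in outputs],
        # Assumption: placeholder producer URI identifying the emitting code.
        "producer": "https://example.com/hypothetical-producer",
        # Assumption: replace with the spec version your backend expects.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


# Hypothetical job that reads raw orders and writes a cleaned table.
event = build_run_event(
    event_type="COMPLETE",
    job_namespace="analytics",
    job_name="clean_orders",
    run_id=str(uuid.uuid4()),
    inputs=[("warehouse", "raw.orders")],
    outputs=[("warehouse", "staging.orders_clean")],
)
payload = json.dumps(event)  # this JSON body is what a lineage backend would receive
```

Because the event names datasets by namespace and name rather than by tool-specific IDs, the same payload stays meaningful if the orchestrator or the lineage backend is swapped out.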

Where it is not Atlan: not a catalog. No glossary, classifications, or governance workflows. You will pair it with OpenMetadata, DataHub, or a similar catalog.

Pick Marquez if: lineage is the single biggest gap, and you want lineage that survives tool changes (because OpenLineage is the spec underneath it).

5. Unity Catalog (Open Source) — Multi-Cloud Governance, Iceberg-Native

Databricks open-sourced Unity Catalog in June 2024. It is the only catalog on this list that is explicitly designed for Iceberg + multi-cloud governance (Snowflake, Databricks, BigQuery all readable through one API).

What it does well: Iceberg-native. Multi-cloud table access through a single grants model. REST API is the same as Databricks' commercial Unity Catalog (so portability is real). Strong on access policies.

Where it is not Atlan: maturity outside Databricks deployments is still catching up. Discovery / search UI is minimal compared to others. Less of a business-glossary tool, more of a governance plane.

Pick Unity Catalog if: you are betting on Iceberg, want multi-cloud table access governed in one place, and care less about a discovery UI.

6. Data Workers Catalog Agent — Agent-Native, Cross-Catalog

This is us. We built the Catalog Agent because every catalog on this list assumes a human user clicking through a UI. AI agents (Claude Code, Cursor, ChatGPT) cannot click. They need catalog access through MCP tools.

What it does well: federates across OpenMetadata, DataHub, Amundsen, Unity Catalog (and Atlan via API) so a single MCP tool call resolves 'where is order data?' against whichever catalog has the answer. 18 catalog tools (entity resolution, toolsets, 4-signal RRF ranking, 200 golden queries eval suite). Apache 2.0. No vendor lock-in.
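
For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of the general technique. It is not the Catalog Agent's implementation: the four signal names and the example tables are hypothetical, and `k = 60` is simply the constant commonly used with RRF. Each signal contributes `1 / (k + rank)` for every item it ranks, and the contributions are summed.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists into one using reciprocal rank fusion.

    Items missing from a ranking simply receive no contribution from it.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Four hypothetical signals answering "where is order data?"
keyword_match = ["orders", "order_items", "customers"]
semantic_match = ["order_items", "orders", "payments"]
usage_popularity = ["orders", "payments", "order_items"]
lineage_centrality = ["orders", "order_items"]

fused = reciprocal_rank_fusion(
    [keyword_match, semantic_match, usage_popularity, lineage_centrality]
)
# fused == ["orders", "order_items", "payments", "customers"]
```

RRF works on ranks rather than raw scores, which is why it is a convenient way to combine signals whose scores are not on comparable scales.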

Where it is not Atlan: there is no standalone UI. The Catalog Agent is designed to be consumed by an AI agent or to wrap an existing catalog. If you want a single-pane-of-glass UI for humans, pair it with OpenMetadata.

Pick Data Workers Catalog Agent if: AI agents are the primary consumers of your catalog, or you want federated cross-catalog discovery.

When You Should Still Pay For Atlan

Open source is not the right answer for everyone. Pay for Atlan if:

You need a polished UI that non-technical users will adopt without training. Atlan invests heavily here; open-source catalogs are catching up but are not equivalent.
You want one vendor's roadmap to be your roadmap. Some teams legitimately do not want to assemble five tools.
You want managed deployment with SLAs. Self-hosted OpenMetadata/DataHub means you own the ops.
You need certain enterprise integrations that ship faster in commercial catalogs, such as Salesforce Data Cloud or certain deep BI tool integrations.

Frequently Asked Questions

Is Collibra a better alternative to Atlan than these? For pure governance-and-compliance use cases, sometimes. Collibra is stronger on regulated-industry workflows (banks, pharma). The open-source tools on this list cover technical metadata and discovery better. The fair comparison is Atlan vs Collibra vs Alation as commercial peers — and OpenMetadata + DataHub as the open challengers across the board.

Can I migrate from Atlan to one of these without losing my glossary and lineage? Yes for OpenMetadata and DataHub via their bulk import APIs. Atlan exports glossary, classifications, and table descriptions to JSON. Lineage is harder to migrate (graph topology) but Marquez + OpenLineage can rebuild it by re-emitting from your orchestrator.

How long does it take to stand up OpenMetadata or DataHub in production? OpenMetadata: 2-4 weeks for a real deployment including ingestion of major sources, glossary import, and team training. DataHub: similar timeline; the longer setup is offset by deeper API extensibility. Atlan's managed setup is faster (days, not weeks) — that is part of what you pay for.

Do any of these work with Snowflake Cortex, BigQuery semantic layer, or Databricks Genie? Yes. OpenMetadata, DataHub, and Unity Catalog all integrate with at least one. Data Workers Catalog Agent federates queries across them. Atlan integrates with all three.

What about Hightouch, Castor, Select Star, Secoda — are those Atlan alternatives? They are commercial peers, not open-source alternatives. Same trade-off as Atlan: faster setup, polished UX, ongoing license cost.

We track the open-source data catalog ecosystem at github.com/DataWorkersProject/dataworkers-claw-community — the Catalog Agent code, federation logic, and the 200-query eval set are all there.

Originally published at https://dataworkers.io/blog/atlan-alternatives-open-source-data-catalogs-2026/. Data Workers is an open-source autonomous agent swarm for data engineering — see the repo.