<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jose Prabhu Michael Singarayan</title>
    <description>The latest articles on DEV Community by Jose Prabhu Michael Singarayan (@josemichael).</description>
    <link>https://dev.to/josemichael</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3894969%2F28b770d9-02f0-4fa1-84fa-825f7a4ef495.png</url>
      <title>DEV Community: Jose Prabhu Michael Singarayan</title>
      <link>https://dev.to/josemichael</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josemichael"/>
    <language>en</language>
    <item>
      <title>Microsoft Fabric Data Agent: Ask Your Data Questions in Plain English</title>
      <dc:creator>Jose Prabhu Michael Singarayan</dc:creator>
      <pubDate>Thu, 07 May 2026 19:52:29 +0000</pubDate>
      <link>https://dev.to/josemichael/microsoft-fabric-data-agent-ask-your-data-questions-in-plain-english-4164</link>
      <guid>https://dev.to/josemichael/microsoft-fabric-data-agent-ask-your-data-questions-in-plain-english-4164</guid>
      <description>&lt;p&gt;If you've ever watched a business analyst wait three days for a data team to answer a simple question like "What were our top 10 products by revenue last quarter?" — you know the pain. That lag between curiosity and insight is where decisions go to die.&lt;br&gt;
Microsoft's Fabric Data Agent (currently in preview) is built to close that gap. It lets anyone in your organization ask questions about enterprise data in plain English and get structured, accurate answers — no SQL, no DAX, no KQL required.&lt;br&gt;
Let's dig into what it actually is, how it works under the hood, and what you need to know to get started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zi01l5dcomck4mz4tq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zi01l5dcomck4mz4tq.jpg" alt=" " width="800" height="542"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is a Fabric Data Agent?&lt;/strong&gt;&lt;br&gt;
A Fabric Data Agent is a conversational Q&amp;amp;A system built on top of your organization's data in Microsoft Fabric's OneLake. It uses large language models (LLMs) — specifically Azure OpenAI's Assistant APIs — to interpret natural language questions and translate them into queries against your actual data sources.&lt;br&gt;
Think of it as a smart intermediary that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Understands your question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Figures out which data source best answers it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generates and executes the right query&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Returns a human-readable answer&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent can connect to lakehouses, warehouses, Power BI semantic models, KQL databases, ontologies, and Microsoft Graph — all within the governed Fabric ecosystem.&lt;br&gt;
Within broader agentic architectures on Microsoft Fabric, data agents serve as the conversational analytics component, connecting to governed data through multiple data source types in multi-agent solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Under the Hood&lt;br&gt;
The magic happens in a well-orchestrated pipeline. Here's the flow:&lt;br&gt;
&lt;strong&gt;1. Question Parsing &amp;amp; Validation&lt;/strong&gt;&lt;br&gt;
When a user submits a question, the agent applies Azure OpenAI Assistant APIs to process it. Before anything else, it checks that the question complies with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Security protocols&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Responsible AI (RAI) policies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The requesting user's permissions&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent operates with read-only access to all data sources — it cannot write, modify, or delete data.&lt;br&gt;
&lt;strong&gt;2. Data Source Identification&lt;/strong&gt;&lt;br&gt;
The agent uses your credentials to access the schema of available data sources (not the data itself). It evaluates your question against all connected sources and determines which one is best positioned to answer it. You can even add custom instructions to guide this routing — for example: "Direct financial metric questions to the Power BI semantic model; route raw data exploration to the lakehouse."&lt;br&gt;
&lt;strong&gt;3. Query Generation&lt;/strong&gt;&lt;br&gt;
Once the right data source is identified, the agent generates the appropriate query using one of these translation tools:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Data Source&lt;/th&gt;&lt;th&gt;Query Type&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Lakehouse / Warehouse&lt;/td&gt;&lt;td&gt;NL2SQL (Natural Language → SQL)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Power BI Semantic Models&lt;/td&gt;&lt;td&gt;NL2DAX (Natural Language → DAX)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;KQL Databases&lt;/td&gt;&lt;td&gt;NL2KQL (Natural Language → KQL)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Microsoft Graph&lt;/td&gt;&lt;td&gt;Graph API queries&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;4. Query Validation &amp;amp; Execution&lt;/strong&gt;&lt;br&gt;
The generated query is validated for correctness and security compliance, then executed against the data source. Results are formatted into a human-readable response — tables, summaries, key insights — and returned to the user.&lt;/p&gt;
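
&lt;p&gt;To make the NL2SQL step concrete, here is a rough, purely illustrative sketch of the kind of query the agent might generate for the question from the introduction, runnable in a Fabric notebook against lakehouse tables. The table names, column names, and date range (fact_sales, dim_product, Q1 2026) are hypothetical placeholders, not actual agent output.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: roughly the kind of SQL an NL2SQL step might produce for
# "What were our top 10 products by revenue last quarter?"
# Table names, column names, and the date range are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Nl2SqlIllustration").getOrCreate()

generated_sql = """
    SELECT p.product_name,
           SUM(s.sales_amount) AS revenue
    FROM fact_sales AS s
    JOIN dim_product AS p
      ON s.product_id = p.product_id
    WHERE s.order_date &amp;gt;= DATE '2026-01-01'
      AND s.order_date &amp;lt; DATE '2026-04-01'
    GROUP BY p.product_name
    ORDER BY revenue DESC
    LIMIT 10
"""

# Run the query against lakehouse tables from a notebook and inspect the result.
spark.sql(generated_sql).show()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In Fabric itself you never write this query; the agent generates, validates, and executes it for you. The sketch simply shows the kind of work happening behind the conversational interface.&lt;/p&gt;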

&lt;p&gt;&lt;strong&gt;Configuring a Fabric Data Agent&lt;/strong&gt;&lt;br&gt;
Setting up an agent is similar to building a Power BI report: you design and refine it, then publish and share it. Here's what configuration involves:&lt;br&gt;
&lt;strong&gt;Select Your Data Sources&lt;/strong&gt;&lt;br&gt;
An agent supports up to five data sources in any combination — lakehouses, warehouses, KQL databases, Power BI semantic models, ontologies, or Microsoft Graph. You could have five Power BI semantic models, or a mix of two semantic models, a lakehouse, and a KQL database.&lt;br&gt;
&lt;strong&gt;Choose Relevant Tables&lt;/strong&gt;&lt;br&gt;
After adding a data source, you define which specific tables the agent can access. For lakehouses, this means lakehouse tables (not raw files). If your data lives in CSV or JSON files, you'll need to ingest it into tables first to make it available to the agent.&lt;br&gt;
&lt;strong&gt;Add Context with Instructions &amp;amp; Example Queries&lt;/strong&gt;&lt;br&gt;
This is where you fine-tune the agent for your organization:&lt;br&gt;
Data agent instructions — Tell the agent how to behave. Define which data source to use for which type of question. Clarify organizational terminology. Set custom rules.&lt;br&gt;
Example query pairs — Provide sample question-to-query mappings so the agent learns how to handle common queries in your domain. (Note: example query pairs aren't yet supported for Power BI semantic model sources.)&lt;/p&gt;
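
&lt;p&gt;As a rough illustration of the kind of guidance these settings carry, the sketch below expresses instructions and example query pairs as plain Python data. This is not the Fabric configuration UI or API; every rule, name, and query here is a hypothetical example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Purely illustrative: hypothetical content for data agent instructions and
# example query pairs, not the Fabric API.
agent_instructions = """
Route financial metric questions to the Finance semantic model.
Route raw data exploration to the Sales lakehouse.
'FY' refers to our fiscal year, which starts on 1 July.
"""

# Example question-to-query mappings (not yet supported for Power BI semantic
# model sources). Table and column names are hypothetical.
example_query_pairs = [
    {
        "question": "How many orders shipped late last month?",
        "query": (
            "SELECT COUNT(*) FROM fact_orders "
            "WHERE ship_date &amp;gt; promised_date "
            "AND ship_date &amp;gt;= DATEADD(month, -1, CAST(GETDATE() AS date))"
        ),
    },
]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The instructions steer routing and clarify terminology, while the query pairs show the agent how similar questions have been answered correctly in your domain.&lt;/p&gt;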

&lt;p&gt;&lt;strong&gt;Security &amp;amp; Governance: Built-In, Not Bolted On&lt;/strong&gt;&lt;br&gt;
One of the more impressive aspects of Fabric Data Agent is how deeply governance is integrated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Least-privilege access:&lt;/strong&gt; The agent uses the requesting user's credentials, so it can only surface data that person is already authorized to see.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Microsoft Purview integration:&lt;/strong&gt; DLP policies, access restriction policies, Insider Risk Management, and audit/eDiscovery all apply to agent interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrails on scope:&lt;/strong&gt; Queries are constrained to configured data sources — the agent can't go rogue and query things outside its defined scope.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Optional Azure AI Content Safety:&lt;/strong&gt; You can add an extra layer of content risk controls to filter harmful or out-of-policy responses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Beyond the Chat Window:&lt;/strong&gt; Copilot Studio Integration&lt;br&gt;
Fabric Data Agents aren't limited to the Fabric portal. You can consume a Fabric data agent in Copilot Studio, embedding your data agent into custom Microsoft 365 Copilot experiences, Teams bots, or other applications. This opens the door to deploying data conversations wherever your users already work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters&lt;/strong&gt;&lt;br&gt;
The Fabric Data Agent addresses a real organizational problem: data insight accessibility. Most enterprise data is technically available but practically inaccessible to the majority of people who need it, because it requires technical skills to query.&lt;br&gt;
By enabling plain-English conversations with governed data, the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Lowers the barrier for non-technical stakeholders&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reduces the bottleneck on data teams for ad-hoc queries&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fosters a culture of data-driven decision-making&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keeps everything within your existing governance and security boundaries&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a chatbot on top of a CSV. It's a governed, multi-source, enterprise-grade conversational analytics layer built into the same platform where your data already lives.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>agents</category>
      <category>fabric</category>
      <category>ai</category>
    </item>
    <item>
      <title>Modernizing Data Movement for AI-Ready Enterprises</title>
      <dc:creator>Jose Prabhu Michael Singarayan</dc:creator>
      <pubDate>Thu, 23 Apr 2026 22:46:13 +0000</pubDate>
      <link>https://dev.to/josemichael/modernizing-data-movement-for-the-ai-ready-enterprises-3odj</link>
      <guid>https://dev.to/josemichael/modernizing-data-movement-for-the-ai-ready-enterprises-3odj</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No matter what type of artificial intelligence workload your business implements, it requires high-quality data sources to operate effectively. From recommendation engines to conversational AI assistants, every AI application needs a robust data foundation that delivers information on time and preserves its integrity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4haroal0uwc94ueqekt8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4haroal0uwc94ueqekt8.jpg" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Despite this requirement, many organizations still rely on legacy batch ETL pipelines built mainly for static reporting. While such pipelines can support a traditional reporting strategy, they fall short when it comes to building AI solutions. That is why modernizing data movement should be among your top priorities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Traditional ETL Fails for AI Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkz7qd89s4xwyms5t7cn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkz7qd89s4xwyms5t7cn.png" alt=" " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Traditional ETL consists of three basic steps: extracting, transforming, and loading data. While this process worked fine for classic dashboards, it fails AI and machine learning systems in several crucial ways, because those systems often rely on near-real-time data streams.&lt;/p&gt;

&lt;p&gt;In particular, AI applications have to integrate different types of information, both streaming and historical; large and diverse datasets demand scalable processing capabilities; and upstream changes should not stop pipelines from operating effectively.&lt;/p&gt;

&lt;p&gt;If your pipeline is inefficient, you can expect poor accuracy of your ML models or late delivery of insights generated by the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Modern Data Movement Requires&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Compared to their predecessors, modern data pipelines are much more scalable and flexible: they can accommodate data of any structure, size, or velocity. In addition to traditional batch processing, they support continuous and event-driven streaming, which is essential for developing AI-driven solutions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Characteristics of a modern data pipeline include the following:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ELT over ETL so raw data can land quickly before transformation (a minimal sketch follows this list)  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;streaming ingestion for time-sensitive workloads  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;event-driven design to trigger processing as changes happen  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;lakehouse storage for unified structured and semi-structured data  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;schema evolution to handle changing source systems  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;governance and lineage for trust and compliance  &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
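
&lt;p&gt;The first characteristic, ELT over ETL, is sketched below under assumed paths and column names: raw JSON lands in a "bronze" Delta table first, and refinement into a "silver" table happens afterwards.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal ELT sketch: land raw data first, transform it afterwards in the lakehouse.
# Paths, table layout, and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("EltSketch").getOrCreate()

# Load: land raw events as-is, so ingestion is never blocked by transformation logic.
raw_df = spark.read.json("/data/incoming/clickstream")
raw_df.write.format("delta").mode("append").save("/data/lakehouse/bronze/clickstream")

# Transform: refine the landed data into an analysis-ready "silver" table.
bronze_df = spark.read.format("delta").load("/data/lakehouse/bronze/clickstream")
silver_df = (
    bronze_df
    .filter(col("user_id").isNotNull())
    .withColumn("event_date", to_date(col("event_time")))
)
silver_df.write.format("delta").mode("overwrite").save("/data/lakehouse/silver/clickstream")
&lt;/code&gt;&lt;/pre&gt;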

&lt;p&gt;&lt;strong&gt;Reference Architecture of an AI-Ready Data Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j8hely8kuvyupubya53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9j8hely8kuvyupubya53.png" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As discussed above, a typical modern data architecture follows a layered structure similar to that of a lakehouse. Let us review each layer below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; operational databases, SaaS solutions, RESTful APIs, IoT sensors, log files, external datasets;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion:&lt;/strong&gt; batch and streaming data intake with tools such as Apache Kafka, Azure Event Hubs, or cloud data integration services;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing:&lt;/strong&gt; distributed processing engines such as Apache Spark (typically via PySpark);&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage:&lt;/strong&gt; cloud lakehouse platforms such as Delta Lake or Microsoft Fabric Lakehouse;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumption:&lt;/strong&gt; Power BI dashboards, machine learning models, feature stores, notebooks, AI-driven applications.&lt;/p&gt;

&lt;p&gt;With this architecture in place, you will be able to use your unified data foundation for dashboards and machine learning as well. &lt;/p&gt;
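
&lt;p&gt;For the ingestion layer specifically, a minimal sketch of streaming intake from Kafka into a raw Delta table might look like the following. The broker address, topic name, and paths are illustrative assumptions, and the Spark Kafka connector package must be available on the cluster.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal ingestion sketch: streaming intake from Kafka into a raw Delta table.
# Broker address, topic, and paths are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IngestionSketch").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "retail-events")
    .load()
)

# Kafka delivers binary key/value columns; keep the payload as a string for now
# and defer parsing to the processing layer (ELT style).
raw_events = events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")

query = (
    raw_events.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/retail_events_raw")
    .start("/data/lakehouse/raw/retail_events")
)
&lt;/code&gt;&lt;/pre&gt;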

&lt;p&gt;&lt;strong&gt;Example of Coding Schema Evolution in Delta Lake&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As mentioned previously, schema evolution is key to handling changing data sources properly. When a source system adds or changes columns and the pipeline cannot accommodate it, schema enforcement rejects the write and the pipeline breaks.&lt;/p&gt;

&lt;p&gt;Delta Lake makes managing evolving schemas straightforward. The snippet below appends newly arrived JSON data to a Delta table and merges any new columns into the table schema automatically:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SchemaEvolutionExample").getOrCreate()

# Read newly arrived JSON files, which may contain columns the table has not seen before.
df = spark.read.json("/data/incoming/retail_events")

# mergeSchema=true tells Delta Lake to add the new columns instead of rejecting the write.
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/data/lakehouse/retail_events")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this option enabled, columns received from external data sources are added to your Delta table seamlessly instead of breaking the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of Real-Time Data Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The previous example is batch-oriented. To show what a modern pipeline adds over a traditional one, here is a small streaming example that continuously aggregates incoming order events:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RetailStreamingPipeline").getOrCreate()

# Continuously read incoming order events with an explicit schema.
stream_df = spark.readStream \
    .format("json") \
    .schema("order_id STRING, product_id STRING, quantity INT, event_time TIMESTAMP") \
    .load("/data/stream/orders")

# Maintain running totals of quantity sold per product.
aggregated_df = stream_df.groupBy("product_id").sum("quantity")

# Write the aggregates to a Delta table; the checkpoint makes the query restartable.
query = aggregated_df.writeStream \
    .format("delta") \
    .outputMode("complete") \
    .option("checkpointLocation", "/checkpoints/orders") \
    .start("/data/lakehouse/product_sales")

query.awaitTermination()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This streaming pipeline aggregates sales transactions as they arrive and keeps the results continuously up to date in the lakehouse, instead of waiting for a nightly batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of Data Quality Verification&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data quality validation is another essential step. The snippet below keeps only records that have a customer ID and a valid, non-negative transaction amount:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pyspark.sql.functions import col

# Filter out records with missing identifiers or negative amounts.
validated_df = df.filter(
    col("customer_id").isNotNull() &amp;amp;
    col("transaction_amount").isNotNull() &amp;amp;
    (col("transaction_amount") &amp;gt;= 0)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Checks like this validate the quality of incoming data and filter out untrustworthy records before they reach ML models or dashboards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example of Real-Time Retail Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzjx272y3cs4hc7d7jiz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzjx272y3cs4hc7d7jiz.png" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let us consider a retail enterprise implementing dashboards and AI-based demand forecasting in parallel. The company receives sales transactions, e-commerce data, customer engagement metrics, etc., from different sources. &lt;/p&gt;

&lt;p&gt;In a traditional reporting environment, all data might be loaded into the database once a day. For AI tasks like demand forecasting or product recommendations, such latency would be unacceptable.&lt;/p&gt;

&lt;p&gt;A retail pipeline capable of ingesting sales transactions, merging them with historical inventory and customer data, and then delivering information to various destinations looks as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Power BI dashboards for sales monitoring&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;machine learning models for demand forecasting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;recommendation systems for personalization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;alerting systems for anomaly detection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
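
&lt;p&gt;To make this concrete, here is a minimal sketch of the enrichment step under assumed paths, schemas, and column names: a stream of orders joined with static customer and inventory Delta tables before the result is written out for the consumers listed above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal enrichment sketch for the retail scenario: streaming orders joined
# with static reference data. Paths, schemas, and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RetailEnrichment").getOrCreate()

# Continuous stream of incoming order events.
orders_stream = spark.readStream.format("delta").load("/data/lakehouse/raw/orders")

# Static (slowly changing) reference tables read as ordinary batch DataFrames.
customers = spark.read.format("delta").load("/data/lakehouse/dim/customers")
inventory = spark.read.format("delta").load("/data/lakehouse/dim/inventory")

# Stream-static joins enrich each incoming order with customer and stock context.
enriched = (
    orders_stream
    .join(customers, "customer_id")
    .join(inventory, "product_id")
)

# One enriched table can then feed dashboards, forecasting models,
# recommendation systems, and anomaly alerts downstream.
query = (
    enriched.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/orders_enriched")
    .start("/data/lakehouse/curated/orders_enriched")
)
&lt;/code&gt;&lt;/pre&gt;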

&lt;p&gt;&lt;strong&gt;Governance, Security, and Control of AI Pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For any data pipeline to work effectively, especially one used for building AI applications, it must meet strict governance requirements and security controls.&lt;/p&gt;

&lt;p&gt;Here is a list of important capabilities that you should implement in your data pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;data lineage to trace data from source to model or dashboard&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;role-based access control to secure sensitive datasets&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;audit logging to monitor pipeline activity and usage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;schema governance to manage changes safely&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;API security for authenticated and authorized access&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All these mechanisms are crucial for securing your pipelines and ensuring that ML models receive high-quality data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modernizing your data movement architecture is probably the most critical step towards building an AI-ready enterprise. While classical batch ETL pipelines have proven effective for historical reporting, they hardly meet today's needs.&lt;/p&gt;

&lt;p&gt;By introducing lakehouse technologies, event-driven architecture, scalable distributed processing, and proper schema governance, you can create a robust pipeline architecture to fuel analytics and AI.&lt;/p&gt;

</description>
      <category>data</category>
      <category>analytics</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
