<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anshul Jangale</title>
    <description>The latest articles on DEV Community by Anshul Jangale (@anshul_02).</description>
    <link>https://dev.to/anshul_02</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2949416%2Ffd4711e4-ad89-4266-a8aa-a716fac4e8aa.png</url>
      <title>DEV Community: Anshul Jangale</title>
      <link>https://dev.to/anshul_02</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anshul_02"/>
    <language>en</language>
    <item>
      <title>PySpark: The Big Brain of Data Processing</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Tue, 31 Mar 2026 08:46:01 +0000</pubDate>
      <link>https://dev.to/anshul_02/pyspark-the-big-brain-of-data-processing-10pn</link>
      <guid>https://dev.to/anshul_02/pyspark-the-big-brain-of-data-processing-10pn</guid>
      <description>&lt;p&gt;Imagine you run a restaurant. On a quiet Tuesday, one chef can handle everything — take the order, cook the food, plate it, done. Easy.&lt;/p&gt;

&lt;p&gt;Now imagine it's New Year's Eve and 500 people walk in at once. One chef? Absolute chaos. You need a full kitchen team — multiple chefs working on different dishes at the same time, coordinated, fast, efficient.&lt;/p&gt;

&lt;p&gt;That's the difference between regular data tools and &lt;strong&gt;PySpark&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Even Is PySpark?
&lt;/h2&gt;

&lt;p&gt;PySpark is a tool built for processing &lt;strong&gt;huge amounts of data&lt;/strong&gt; — we're talking millions of rows, gigabytes, even terabytes of information — quickly and efficiently.&lt;/p&gt;

&lt;p&gt;The "Spark" part is the engine (Apache Spark), one of the most powerful data processing engines ever built. The "Py" part means you use it with Python, one of the most popular programming languages in the world.&lt;/p&gt;

&lt;p&gt;Together? A seriously powerful combination.&lt;/p&gt;

&lt;p&gt;But here's the key thing that makes Spark special — it doesn't do the work on one machine. It &lt;strong&gt;splits the work across many machines&lt;/strong&gt; (or many cores of the same machine) and does everything at the same time. Just like that kitchen full of chefs — everyone working in parallel, no one waiting around.&lt;/p&gt;
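
&lt;p&gt;The split-and-combine idea can be sketched in miniature with plain Python. This is a toy stand-in, using threads on one machine and invented numbers, for what Spark does with executor processes spread across a whole cluster:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def chef(orders):
    """One 'chef' (worker) totals its own slice of the orders."""
    return sum(orders)

def kitchen_total(all_orders, n_chefs=4):
    # Split the full order list into one slice per chef...
    slices = [all_orders[i::n_chefs] for i in range(n_chefs)]
    # ...let every chef work on their own slice at the same time...
    with ThreadPoolExecutor(max_workers=n_chefs) as pool:
        partials = pool.map(chef, slices)
    # ...and combine the partial results into the final answer.
    return sum(partials)

# Same answer as sum(range(1000)), just computed in parallel slices.
print(kitchen_total(list(range(1000))))
```

&lt;p&gt;Spark follows the same split, process, combine pattern, except the "chefs" are separate processes on many machines, and the "orders" are partitions of a dataset that may not fit on any single one of them.&lt;/p&gt;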




&lt;h2&gt;
  
  
  Why Does This Even Matter?
&lt;/h2&gt;

&lt;p&gt;Because data has gotten absolutely enormous.&lt;/p&gt;

&lt;p&gt;Ten years ago, a "big" dataset might be a few thousand rows in a spreadsheet. Today, companies are dealing with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Millions of customer transactions every single day&lt;/li&gt;
&lt;li&gt;Billions of social media interactions&lt;/li&gt;
&lt;li&gt;Sensor data streaming in every millisecond from thousands of devices&lt;/li&gt;
&lt;li&gt;Logs from applications that never sleep&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A regular tool chokes on this. PySpark eats it for breakfast.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Is It Different from Other Tools?
&lt;/h2&gt;

&lt;p&gt;Let's put PySpark up against the competition.&lt;/p&gt;




&lt;h3&gt;
  
  
  PySpark vs Excel / Google Sheets
&lt;/h3&gt;

&lt;p&gt;This one's almost unfair.&lt;/p&gt;

&lt;p&gt;Excel is brilliant for what it does — budgets, small reports, a few thousand rows. But try opening a 10 million row file in Excel. It either crashes or takes five minutes just to scroll. Excel is the corner shop. PySpark is the warehouse.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Excel / Sheets&lt;/th&gt;
&lt;th&gt;PySpark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max rows (practical)&lt;/td&gt;
&lt;td&gt;~1 million&lt;/td&gt;
&lt;td&gt;Effectively unlimited (scales with the cluster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed on big data&lt;/td&gt;
&lt;td&gt;Crashes&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runs on multiple machines&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Excel is for humans reading data. PySpark is for machines processing data at scale.&lt;/p&gt;




&lt;h3&gt;
  
  
  PySpark vs Pandas (Python)
&lt;/h3&gt;

&lt;p&gt;Pandas is the most popular data tool among Python developers — and it's genuinely excellent. For datasets that fit on your laptop, pandas is fast, flexible, and friendly.&lt;/p&gt;

&lt;p&gt;The problem? It only runs on &lt;strong&gt;one machine&lt;/strong&gt;, and everything has to fit in &lt;strong&gt;RAM&lt;/strong&gt; (your computer's short-term memory). Run out of RAM and the whole thing crashes.&lt;/p&gt;

&lt;p&gt;PySpark solves exactly this. Same concepts, same feel, but now your data is spread across a cluster of machines with combined memory and processing power.&lt;/p&gt;
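
&lt;p&gt;A quick taste of that "same feel": the pandas snippet below runs as-is on a tiny invented table, and the commented-out PySpark version (hypothetical session and column names) reads almost line for line the same:&lt;/p&gt;

```python
import pandas as pd

# A tiny sales table: small enough for pandas, but the same code shape
# carries over once you swap pandas for PySpark on a cluster.
df = pd.DataFrame({
    "city": ["Pune", "Mumbai", "Pune", "Delhi"],
    "amount": [120, 300, 80, 150],
})

# pandas: total sales per city
totals = df.groupby("city", as_index=False)["amount"].sum()
print(totals)

# The equivalent PySpark (sketch, assuming a running SparkSession):
#   from pyspark.sql import SparkSession, functions as F
#   spark = SparkSession.builder.getOrCreate()
#   sdf = spark.createDataFrame(df)
#   sdf.groupBy("city").agg(F.sum("amount").alias("amount")).show()
```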

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Pandas&lt;/th&gt;
&lt;th&gt;PySpark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data size limit&lt;/td&gt;
&lt;td&gt;Your RAM (~16–32 GB)&lt;/td&gt;
&lt;td&gt;Petabytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed on small data&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;td&gt;Slightly slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed on big data&lt;/td&gt;
&lt;td&gt;Crashes&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runs distributed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Pandas is your everyday car. PySpark is the truck when you need to move something massive.&lt;/p&gt;




&lt;h3&gt;
  
  
  PySpark vs SQL (Traditional Databases)
&lt;/h3&gt;

&lt;p&gt;SQL databases like MySQL or PostgreSQL are the backbone of most applications. They're great at storing data and answering questions — "show me all orders from last month", that kind of thing.&lt;/p&gt;

&lt;p&gt;But traditional SQL databases are designed to run on &lt;strong&gt;one server&lt;/strong&gt;. When data gets huge, they slow down. You can throw better hardware at the problem, but there's a hard limit.&lt;/p&gt;

&lt;p&gt;PySpark can actually run SQL queries too — but it runs them &lt;strong&gt;across a cluster&lt;/strong&gt;, making it far faster for large-scale analytical work. And it can read from almost any source: databases, files, data lakes, cloud storage.&lt;/p&gt;
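
&lt;p&gt;To make that concrete, here is the same analytical question answered by a single-node SQL engine (Python's built-in sqlite3), with the equivalent PySpark sketched in comments; the file path and table are invented for illustration:&lt;/p&gt;

```python
import sqlite3

# A single-node SQL engine (sqlite) answering an analytics question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Pune", 120), ("Mumbai", 300), ("Pune", 80)])
rows = conn.execute(
    "SELECT city, SUM(amount) FROM orders GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Mumbai', 300.0), ('Pune', 200.0)]

# The same SQL in PySpark runs distributed across a cluster (sketch):
#   df = spark.read.csv("s3://bucket/orders/*.csv", header=True, inferSchema=True)
#   df.createOrReplaceTempView("orders")
#   spark.sql("SELECT city, SUM(amount) FROM orders GROUP BY city").show()
```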

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional SQL DB&lt;/th&gt;
&lt;th&gt;PySpark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Storing + querying app data&lt;/td&gt;
&lt;td&gt;Analysing massive datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scales to&lt;/td&gt;
&lt;td&gt;One powerful server&lt;/td&gt;
&lt;td&gt;Hundreds of machines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles files (CSV, JSON)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Natively&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming data&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Built-in support&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; SQL databases are where your data lives. PySpark is how you analyse it at scale.&lt;/p&gt;




&lt;h3&gt;
  
  
  PySpark vs Hadoop (MapReduce)
&lt;/h3&gt;

&lt;p&gt;This is PySpark's actual origin story. Before Spark, the king of big data was &lt;strong&gt;Hadoop MapReduce&lt;/strong&gt;. It also processed data across multiple machines — but in a very old-fashioned way.&lt;/p&gt;

&lt;p&gt;Hadoop read data from disk, processed a bit, wrote back to disk, read it again, processed more, wrote again. Every single step meant reading and writing to disk, which is painfully slow.&lt;/p&gt;

&lt;p&gt;Spark changed everything by keeping data &lt;strong&gt;in memory&lt;/strong&gt; (RAM) as much as possible. Processing happens in RAM, results stay in RAM until you actually need them saved. The result? Spark is typically &lt;strong&gt;10 to 100 times faster&lt;/strong&gt; than Hadoop for the same job.&lt;/p&gt;
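
&lt;p&gt;You can feel the difference in a tiny pure-Python sketch. The first pipeline imitates MapReduce by writing every intermediate result to disk and reading it back; the second chains the steps in memory the way Spark does. The two steps and the data are invented for illustration:&lt;/p&gt;

```python
import json, os, tempfile

data = list(range(10))

# Hadoop-MapReduce style: every step writes its result to disk,
# and the next step reads it back in.
def disk_pipeline(values):
    path = os.path.join(tempfile.mkdtemp(), "stage.json")
    for step in (lambda x: x * 2, lambda x: x + 1):
        values = [step(v) for v in values]
        with open(path, "w") as f:   # write intermediate result to disk
            json.dump(values, f)
        with open(path) as f:        # ...then read it straight back
            values = json.load(f)
    return sum(values)

# Spark style: chain the steps in memory, touch disk only at the end.
def memory_pipeline(values):
    for step in (lambda x: x * 2, lambda x: x + 1):
        values = [step(v) for v in values]
    return sum(values)

# Same answer, very different amount of I/O along the way.
assert disk_pipeline(data) == memory_pipeline(data)
```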

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Hadoop MapReduce&lt;/th&gt;
&lt;th&gt;PySpark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Processing location&lt;/td&gt;
&lt;td&gt;Disk (slow)&lt;/td&gt;
&lt;td&gt;Memory (fast)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;10–100x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of use&lt;/td&gt;
&lt;td&gt;Complex, verbose&lt;/td&gt;
&lt;td&gt;Much simpler&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time processing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Still widely used?&lt;/td&gt;
&lt;td&gt;Fading&lt;/td&gt;
&lt;td&gt;Growing fast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Hadoop was the pioneer. Spark made it obsolete for most use cases.&lt;/p&gt;




&lt;h3&gt;
  
  
  PySpark vs Snowflake / BigQuery (Cloud Data Warehouses)
&lt;/h3&gt;

&lt;p&gt;These are the shiny modern tools — cloud-based, managed, very polished. You write SQL, they handle everything else. No servers to manage, no clusters to configure.&lt;/p&gt;

&lt;p&gt;So why would anyone use PySpark instead?&lt;/p&gt;

&lt;p&gt;Because PySpark gives you &lt;strong&gt;full control&lt;/strong&gt;. You can write custom logic, build complex pipelines, process any kind of data (not just structured tables), and integrate deeply with machine learning tools. Snowflake and BigQuery are amazing for querying structured data. PySpark is better when you need to transform, enrich, or build pipelines with complex custom logic.&lt;/p&gt;

&lt;p&gt;Many companies actually use both — Snowflake or BigQuery for storage and querying, PySpark for the heavy transformation work that feeds into them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Snowflake / BigQuery&lt;/th&gt;
&lt;th&gt;PySpark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ease of setup&lt;/td&gt;
&lt;td&gt;Very easy (fully managed)&lt;/td&gt;
&lt;td&gt;Needs configuration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom logic&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Machine learning&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Deep integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Cloud warehouses are convenient. PySpark is powerful. Often used together.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Does PySpark Actually Run?
&lt;/h2&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; install PySpark on a laptop (a simple &lt;code&gt;pip install pyspark&lt;/code&gt; is enough for learning and local testing), but its real power shows on platforms built for scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks&lt;/strong&gt; — the most popular platform, built by the creators of Spark itself&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Fabric&lt;/strong&gt; — Microsoft's modern data platform with Spark built in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EMR&lt;/strong&gt; — AWS's managed Spark service&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Dataproc&lt;/strong&gt; — Google Cloud's version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Synapse&lt;/strong&gt; — another Microsoft option&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These platforms give you a cluster of machines ready to go — you just write the code and hit run.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Should You Use PySpark?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use it when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data is too big for a normal laptop or server&lt;/li&gt;
&lt;li&gt;You need to process data fast — time is money&lt;/li&gt;
&lt;li&gt;You're building pipelines that run automatically on a schedule&lt;/li&gt;
&lt;li&gt;You're combining data from many different sources&lt;/li&gt;
&lt;li&gt;You're doing machine learning on large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't bother when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a small dataset (pandas is simpler and faster for this)&lt;/li&gt;
&lt;li&gt;You need a quick one-off analysis (just use SQL or Excel)&lt;/li&gt;
&lt;li&gt;Your team doesn't have the skills yet (there's a real learning curve)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;PySpark exists because data outgrew the tools that came before it. One machine, one processor, one chunk of RAM — simply not enough anymore.&lt;/p&gt;

&lt;p&gt;Spark took the idea of "what if many machines worked together on the same problem?" and turned it into one of the most widely used data tools in the world. PySpark put Python on top of that, making all that power accessible to millions of developers.&lt;/p&gt;

&lt;p&gt;It's not the right tool for every job. But when you have a serious data problem — the kind that makes regular tools give up and go home — PySpark is the one you call.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Apache Spark was originally created at UC Berkeley in 2009. Today it's used by thousands of companies including Netflix, Uber, Airbnb, and NASA.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>pyspark</category>
      <category>bigdata</category>
      <category>performance</category>
    </item>
    <item>
      <title>Lakehouse or Warehouse: Which one to choose?</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Sat, 21 Feb 2026 17:03:47 +0000</pubDate>
      <link>https://dev.to/anshul_02/lakehouse-or-warehouse-which-one-to-choose-in-fabric--356k</link>
      <guid>https://dev.to/anshul_02/lakehouse-or-warehouse-which-one-to-choose-in-fabric--356k</guid>
      <description>&lt;h1&gt;
  
  
  Core Concepts
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Data Warehouse
&lt;/h2&gt;

&lt;p&gt;A centralized repository for &lt;strong&gt;cleaned, integrated, structured data&lt;/strong&gt; from multiple sources, using &lt;strong&gt;schema-on-write&lt;/strong&gt; and optimized for SQL analytics and BI.&lt;/p&gt;

&lt;p&gt;It emphasizes strong data quality, conformed dimensions, historical tracking, and tight governance, typically using ETL or ELT pipelines to transform data before loading.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Lakehouse
&lt;/h2&gt;

&lt;p&gt;An architecture that builds on a &lt;strong&gt;data lake&lt;/strong&gt; (object storage) but adds &lt;strong&gt;warehouse-like capabilities&lt;/strong&gt;—ACID transactions, schema enforcement, indexing, and SQL query performance—over open table formats like Delta, Iceberg, or Hudi.&lt;/p&gt;

&lt;p&gt;It supports &lt;strong&gt;structured, semi-structured, and unstructured&lt;/strong&gt; data in one platform, enabling both BI and AI/ML workloads without separate lake + warehouse stacks.&lt;/p&gt;




&lt;h1&gt;
  
  
  Architectural Differences
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Storage &amp;amp; Schema
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stores data in &lt;strong&gt;relational structures&lt;/strong&gt; (tables, columns, indexes) using schema-on-write—data is conformed to a fixed schema before it’s stored.&lt;/li&gt;
&lt;li&gt;Often uses proprietary or tightly controlled storage engines tuned for OLAP and star schemas.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Lakehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Stores data in &lt;strong&gt;open formats&lt;/strong&gt; (e.g., Parquet + Delta/Iceberg/Hudi) on object storage, with both schema-on-write and schema-on-read patterns.&lt;/li&gt;
&lt;li&gt;Can ingest raw files (CSV, JSON, images, logs) and later layer schemas and table definitions on top for analytics.&lt;/li&gt;
&lt;/ul&gt;
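
&lt;p&gt;The two schema strategies can be illustrated in a few lines of plain Python; the records and the conform step are made up for illustration. Schema-on-write validates before storing, while schema-on-read stores the raw payload and applies the schema only at query time:&lt;/p&gt;

```python
import json

raw_records = [
    '{"id": 1, "amount": "120.5"}',
    '{"id": 2, "amount": 300, "note": "bulk order"}',  # extra field, mixed types
]

# One conforming step shared by both strategies: coerce types, drop extras.
def conform(record):
    return {"id": int(record["id"]), "amount": float(record["amount"])}

# Schema-on-write (warehouse style): conform BEFORE storing.
stored = [conform(json.loads(r)) for r in raw_records]

# Schema-on-read (lake/lakehouse style): store the raw strings untouched,
# apply the schema only when a query actually needs it.
def read_with_schema(raw):
    return [conform(json.loads(r)) for r in raw]

# Both roads lead to the same conformed rows; they differ in WHEN the
# schema is enforced and whether the raw payload is preserved.
assert stored == read_with_schema(raw_records)
```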




&lt;h2&gt;
  
  
  Compute &amp;amp; Query Engine
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses a tightly integrated SQL engine optimized for analytic workloads (columnar storage, vectorized execution, cost-based optimizer).&lt;/li&gt;
&lt;li&gt;Often separates compute/storage logically in cloud warehouses but still exposes a single “data warehouse engine” entry point.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Lakehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Typically supports &lt;strong&gt;multiple engines&lt;/strong&gt; over the same data: Spark, SQL engines, ML frameworks, streaming engines.&lt;/li&gt;
&lt;li&gt;The same Delta/Iceberg tables can be queried by BI tools and used directly in ML or streaming pipelines.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data Types &amp;amp; Workloads
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Primarily &lt;strong&gt;structured, relational&lt;/strong&gt; data from OLTP systems, ERP/CRM, etc.&lt;/li&gt;
&lt;li&gt;Optimized for &lt;strong&gt;BI, dashboards, regulatory and financial reporting&lt;/strong&gt;, ad-hoc SQL analytics by analysts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Lakehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Handles &lt;strong&gt;structured, semi-structured (JSON, logs), and unstructured&lt;/strong&gt; data (images, audio, documents) in one place.&lt;/li&gt;
&lt;li&gt;Designed for &lt;strong&gt;mixed workloads&lt;/strong&gt;: BI, data science, ML feature engineering, real-time/streaming, and advanced analytics.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Governance &amp;amp; Reliability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Strong, centralized governance with &lt;strong&gt;RBAC, fixed schemas, data quality rules, and lineage&lt;/strong&gt; built into the platform.&lt;/li&gt;
&lt;li&gt;ACID transactions and strict constraints are standard, which is why warehouses are preferred for financial/regulatory reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Lakehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Uses &lt;strong&gt;transactional table formats&lt;/strong&gt; (e.g., Delta) to bring ACID guarantees and time travel to lake data.&lt;/li&gt;
&lt;li&gt;Governance is richer than a raw data lake but generally more complex than a classic warehouse because of the broader set of data types and tools.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance &amp;amp; Cost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Warehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Highly optimized for &lt;strong&gt;star/snowflake schemas, aggregations, joins&lt;/strong&gt;, and tends to give very predictable performance for BI.&lt;/li&gt;
&lt;li&gt;Usually &lt;strong&gt;more expensive per TB&lt;/strong&gt; due to structured storage and pre-processing (ETL/ELT) but often cheaper in total for pure BI if the workload is well-modeled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Lakehouse
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Leverages &lt;strong&gt;cheap object storage&lt;/strong&gt; with decoupled compute, making storage at petabyte scale cost-effective.&lt;/li&gt;
&lt;li&gt;Query performance can be extremely good but may require careful optimization (partitioning, Z-ordering, caching) and may be less predictable for pure BI than a tuned warehouse.&lt;/li&gt;
&lt;/ul&gt;
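
&lt;p&gt;One of those optimizations, Hive-style partitioning, is easy to sketch with plain Python and CSV files standing in for Parquet on object storage (the paths and numbers are invented). A query filtered on the partition column only has to read one directory:&lt;/p&gt;

```python
import csv, os, tempfile

rows = [("2025-01", "Pune", 120), ("2025-01", "Mumbai", 300),
        ("2025-02", "Pune", 80)]

root = tempfile.mkdtemp()

# Hive-style partitioned layout: one directory per partition value,
# e.g. root/month=2025-01/part.csv. This is roughly the layout that
# df.write.partitionBy("month") produces (in Parquet) on a real lakehouse.
for month, city, amount in rows:
    part_dir = os.path.join(root, f"month={month}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part.csv"), "a", newline="") as f:
        csv.writer(f).writerow([city, amount])

# Partition pruning: a query filtered on month touches only one directory,
# no matter how many other months sit in the table.
def total_for_month(month):
    with open(os.path.join(root, f"month={month}", "part.csv")) as f:
        return sum(int(amount) for _, amount in csv.reader(f))

print(total_for_month("2025-01"))  # 420, without ever reading 2025-02
```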




&lt;h1&gt;
  
  
  Comparison Table
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Warehouse&lt;/th&gt;
&lt;th&gt;Lakehouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary data types&lt;/td&gt;
&lt;td&gt;Structured&lt;/td&gt;
&lt;td&gt;Structured + semi + unstructured&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema strategy&lt;/td&gt;
&lt;td&gt;Schema-on-write&lt;/td&gt;
&lt;td&gt;Mix of schema-on-write &amp;amp; schema-on-read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Relational DW engine&lt;/td&gt;
&lt;td&gt;Open formats on object storage (Delta/Iceberg/Hudi)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Workloads&lt;/td&gt;
&lt;td&gt;BI, reporting, SQL analytics&lt;/td&gt;
&lt;td&gt;BI + ML/AI + streaming + exploration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;Strong, centralized, rigid&lt;/td&gt;
&lt;td&gt;Strong but more complex; needs careful design&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Very strong for SQL/star schemas&lt;/td&gt;
&lt;td&gt;Strong but more tuning; multi-engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost model&lt;/td&gt;
&lt;td&gt;Higher per-TB; ETL cost&lt;/td&gt;
&lt;td&gt;Cheaper storage; more flexible ELT; ops cost shifts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team focus&lt;/td&gt;
&lt;td&gt;BI devs, SQL, data modeling&lt;/td&gt;
&lt;td&gt;Data engineers, ML, mixed SQL + Spark/ML skills&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Pros &amp;amp; Cons in Practice
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Data Warehouse – Strengths and Weaknesses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Very strong support for &lt;strong&gt;enterprise BI and reporting&lt;/strong&gt;, especially when you have conformed dimensions and consistent metrics.&lt;/li&gt;
&lt;li&gt;Predictable &lt;strong&gt;query performance and SLAs&lt;/strong&gt;, ideal for executives and operational dashboards.&lt;/li&gt;
&lt;li&gt;Mature tooling for &lt;strong&gt;governance, lineage, security&lt;/strong&gt;, and change control.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Not ideal for large volumes of raw/semi-structured data, IoT logs, clickstream, etc.&lt;/li&gt;
&lt;li&gt;ETL/ELT pipelines need to do significant up-front modeling, which can slow down onboarding new data sources.&lt;/li&gt;
&lt;li&gt;Less natural fit for heavy ML/AI workflows; data often needs to be exported to other systems.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data Lakehouse – Strengths and Weaknesses
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single platform&lt;/strong&gt; for all data types and workloads, reducing duplication between lake (for data science) and warehouse (for BI).&lt;/li&gt;
&lt;li&gt;Good support for &lt;strong&gt;AI/ML pipelines&lt;/strong&gt; and feature engineering directly on the same data used for BI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficient at scale&lt;/strong&gt;, as raw and curated data both live on cheap cloud object storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Operational complexity: more moving parts (Spark, SQL engines, catalogs, governance services).&lt;/li&gt;
&lt;li&gt;Query performance for classic star-schema BI can require more tuning than a specialized warehouse.&lt;/li&gt;
&lt;li&gt;Requires stronger &lt;strong&gt;data engineering and platform skills&lt;/strong&gt;, especially around table formats, partitioning, and governance.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  When to Choose Which
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Prefer a Warehouse When
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Primary workloads are &lt;strong&gt;classic BI and reporting on structured data&lt;/strong&gt; (ERP/CRM, membership, finance, etc.) with predictable schemas.&lt;/li&gt;
&lt;li&gt;There are &lt;strong&gt;regulatory or financial controls&lt;/strong&gt; where high trust in curated, slowly changing schemas is essential.&lt;/li&gt;
&lt;li&gt;Teams are predominantly &lt;strong&gt;SQL / BI-oriented&lt;/strong&gt;, and speed to deliver stable dashboards is more important than experimentation flexibility.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Prefer a Lakehouse When
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You need to manage &lt;strong&gt;diverse data types&lt;/strong&gt; (logs, events, documents, semi-structured API payloads) alongside relational data.&lt;/li&gt;
&lt;li&gt;There is a strong focus on &lt;strong&gt;data science, ML, and streaming analytics&lt;/strong&gt; in addition to BI.&lt;/li&gt;
&lt;li&gt;The platform must scale to very large volumes (multi-TB/PB) while keeping storage costs low.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Hybrid / Unified Architectures
&lt;/h2&gt;

&lt;p&gt;Most modern patterns recommend &lt;strong&gt;hybrid approaches&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a lakehouse (or lake + lakehouse) for &lt;strong&gt;raw and enriched layers&lt;/strong&gt; and ML/experimentation.&lt;/li&gt;
&lt;li&gt;Feed a &lt;strong&gt;curated warehouse&lt;/strong&gt; (or a warehouse-like gold layer) for “single source of truth” BI and regulated reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lakehouses are often described as the “third generation” after warehouses and lakes, combining many strengths while still leaving room for specialized warehouses in some scenarios.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Data Warehouses and Data Lakehouses serve different but often complementary purposes. Warehouses provide structured, highly governed, and predictable environments ideal for BI and reporting. Lakehouses offer flexibility, scale, and support for diverse data types and AI/ML workloads on a unified platform.&lt;/p&gt;

&lt;p&gt;The right choice depends on your primary workload and organizational goals; in many modern architectures, a thoughtful combination of both delivers the best results.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>azure</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How Power BI MCP Makes Analytics Faster and Easier</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Mon, 02 Feb 2026 07:15:43 +0000</pubDate>
      <link>https://dev.to/anshul_02/how-power-bi-mcp-makes-analytics-faster-and-easier-4n1f</link>
      <guid>https://dev.to/anshul_02/how-power-bi-mcp-makes-analytics-faster-and-easier-4n1f</guid>
      <description>&lt;p&gt;If you're a data analyst or engineer working with Power BI, you'll want to know about Microsoft's new Power BI MCP (Model Context Protocol) servers. Released in November 2025, this tool lets you talk to your Power BI models using plain English and AI assistants like Claude or GitHub Copilot. Let me show you how this can save you hours of work.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Power BI MCP?
&lt;/h2&gt;

&lt;p&gt;Think of Power BI MCP as a bridge that lets AI assistants understand and work with your Power BI data models. Instead of clicking through menus and manually configuring things, you just describe what you want in natural language, and the AI does it for you.&lt;/p&gt;

&lt;p&gt;Microsoft released two versions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Power BI Modeling MCP&lt;/strong&gt; - Runs locally and helps you build and modify Power BI models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote Power BI MCP&lt;/strong&gt; - Lets you query your data using natural language&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How It Speeds Up Your Daily Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Create Models 30x Faster
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Old way:&lt;/strong&gt; Click through Power BI Desktop, manually create a calendar table, set up columns, define hierarchies, create relationships... takes 15-30 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With MCP:&lt;/strong&gt; Type "Create a calendar table for 2025 with Year, Quarter, Month hierarchies and link it to my Sales table" - done in 30 seconds.&lt;/p&gt;

&lt;p&gt;That's a 97% time saving on routine tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Bulk Operations in Seconds
&lt;/h3&gt;

&lt;p&gt;Need to rename 50 measures to match your new naming convention? Or translate all your column descriptions into Spanish? These tasks used to take hours of repetitive work.&lt;/p&gt;

&lt;p&gt;With MCP, you just say "Rename all measures to use underscore_case" or "Translate all descriptions to Spanish" and it handles everything at once. Users report saving 10-20 hours per month on these kinds of bulk operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Instant Documentation
&lt;/h3&gt;

&lt;p&gt;Ever inherit a Power BI model with no documentation? Instead of spending hours exploring tables and relationships manually, just ask the AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Show me all the relationships in this model"&lt;/li&gt;
&lt;li&gt;"Document all the measures with their business logic"&lt;/li&gt;
&lt;li&gt;"Create a diagram of the data model"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get comprehensive documentation in minutes instead of hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Ask Questions in Plain English
&lt;/h3&gt;

&lt;p&gt;Want to know "What were total sales by product category last quarter?" Just ask. The MCP server understands your data model, generates the right DAX query, and gives you the answer instantly.&lt;/p&gt;

&lt;p&gt;No need to export data to Excel and build pivot tables. No writing complex DAX yourself. Just ask and get answers.&lt;/p&gt;
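
&lt;p&gt;Under the hood, a request like that ends up as a DAX query executed against the dataset. As a rough sketch (the DAX, the helper function, and the dataset id are illustrative; the payload shape follows Power BI's public executeQueries REST API), the generated request might look like this:&lt;/p&gt;

```python
import json

# Hypothetical sketch: the DAX below is the kind of query an AI assistant
# might generate for "total sales by product category last quarter"; the
# JSON body matches the shape of Power BI's executeQueries REST endpoint.
ENDPOINT = "https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries"

def build_query_request(dax):
    """Build the JSON body for an executeQueries call."""
    payload = {
        "queries": [{"query": dax}],
        "serializerSettings": {"includeNulls": True},
    }
    return json.dumps(payload)

dax = """
EVALUATE
SUMMARIZECOLUMNS(
    'Product'[Category],
    "Total Sales", CALCULATE(SUM('Sales'[Amount]), PREVIOUSQUARTER('Date'[Date]))
)
"""
print(build_query_request(dax)[:80])
```

&lt;p&gt;The point of MCP is that you never write this yourself: the assistant translates your question into the DAX, sends it, and turns the response back into a plain-English answer.&lt;/p&gt;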

&lt;h3&gt;
  
  
  5. Optimize Performance Automatically
&lt;/h3&gt;

&lt;p&gt;The MCP can analyze your queries and suggest improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"This measure is slow - try using SUMMARIZECOLUMNS instead"&lt;/li&gt;
&lt;li&gt;"You have a cross join here that's causing performance issues"&lt;/li&gt;
&lt;li&gt;"Add this index to speed up your queries"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get expert-level optimization advice without being a DAX expert yourself.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Automate Repetitive Work
&lt;/h3&gt;

&lt;p&gt;Set up the AI to handle entire workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check all your Power Query connections&lt;/li&gt;
&lt;li&gt;Update parameters for different environments (dev/test/prod)&lt;/li&gt;
&lt;li&gt;Validate naming conventions across the model&lt;/li&gt;
&lt;li&gt;Commit changes to Git automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What used to be manual checklists becomes automated, consistent, and error-free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Time Savings
&lt;/h2&gt;

&lt;p&gt;Here's what actual users are experiencing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model creation:&lt;/strong&gt; 15-30 minutes → 30 seconds (tasks like calendar tables, basic structures)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bulk renaming:&lt;/strong&gt; 2-3 hours → 2 minutes (renaming conventions across 50+ objects)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation:&lt;/strong&gt; 4-5 hours → 10 minutes (complete model documentation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data exploration:&lt;/strong&gt; 1-2 hours → 5 minutes (understanding new models)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance tuning:&lt;/strong&gt; Hours of analysis → Minutes with AI suggestions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Install Visual Studio Code&lt;/li&gt;
&lt;li&gt;Install GitHub Copilot extension&lt;/li&gt;
&lt;li&gt;Install the Power BI Modeling MCP extension from VS Code marketplace&lt;/li&gt;
&lt;li&gt;Open Copilot chat and connect to your Power BI model&lt;/li&gt;
&lt;li&gt;Start asking questions or giving commands in plain English&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No complicated setup, and only a gentle learning curve for the simple tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Uses for MCP
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Great for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating standard structures (calendar tables, common measures)&lt;/li&gt;
&lt;li&gt;Bulk operations (renaming, translations, formatting)&lt;/li&gt;
&lt;li&gt;Model exploration and documentation&lt;/li&gt;
&lt;li&gt;Quick data queries and analysis&lt;/li&gt;
&lt;li&gt;Learning and getting suggestions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Still needs human expertise for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex business logic with nested calculations&lt;/li&gt;
&lt;li&gt;Custom time intelligence for fiscal calendars&lt;/li&gt;
&lt;li&gt;Mission-critical measures that need validation&lt;/li&gt;
&lt;li&gt;Advanced DAX patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tips for Best Results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use clear, specific language in your requests&lt;/li&gt;
&lt;li&gt;For complex tasks, break them into smaller steps&lt;/li&gt;
&lt;li&gt;Always review AI-generated DAX before deploying to production&lt;/li&gt;
&lt;li&gt;Start with simple tasks to get comfortable with the tool&lt;/li&gt;
&lt;li&gt;Use it alongside your existing tools like DAX Studio or Tabular Editor&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Power BI MCP isn't replacing data analysts; it's making them far more productive. The tedious, repetitive parts of your job that eat up hours can now be automated, freeing you to focus on the interesting work: understanding business problems, designing solutions, and delivering insights.&lt;/p&gt;

&lt;p&gt;If you spend significant time working with Power BI models, this tool can save you 10-20 hours per month, which is up to half a work week back in your schedule.&lt;/p&gt;

&lt;p&gt;The best part? It's available now and easy to set up. Give it a try; your future self will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ready to speed up your Power BI workflow? Install the Power BI Modeling MCP extension today and start working smarter, not harder.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>analytics</category>
      <category>mcp</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Performance Analyzer &amp; DAX Studio</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Tue, 16 Dec 2025 08:43:49 +0000</pubDate>
      <link>https://dev.to/anshul_02/performance-analyzer-dax-studio-4pm9</link>
      <guid>https://dev.to/anshul_02/performance-analyzer-dax-studio-4pm9</guid>
      <description>&lt;h1&gt;
  
  
  A Simple Guide to Power BI Optimization
&lt;/h1&gt;

&lt;p&gt;If your Power BI reports are running slow, don't worry—you're not alone! The good news is that Power BI gives you two powerful tools to find and fix performance issues: Performance Analyzer (built into Power BI) and DAX Studio (a free external tool). Let me show you how to use both.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are These Tools?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performance Analyzer&lt;/strong&gt; is like a stopwatch for your Power BI report. It tells you exactly how long each visual takes to load and what's slowing it down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DAX Studio&lt;/strong&gt; is like a mechanic's diagnostic tool. It lets you test your DAX formulas, see how they perform, and understand what's happening behind the scenes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 1: Using Performance Analyzer in Power BI
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Open Performance Analyzer
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open your Power BI Desktop report&lt;/li&gt;
&lt;li&gt;Go to the &lt;strong&gt;View&lt;/strong&gt; tab on the ribbon&lt;/li&gt;
&lt;li&gt;Click on &lt;strong&gt;Performance Analyzer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A panel will open on the right side of your screen&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Start Recording
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click the &lt;strong&gt;Start recording&lt;/strong&gt; button in the Performance Analyzer panel&lt;/li&gt;
&lt;li&gt;Interact with your report—click on visuals, use slicers, switch pages&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Stop recording&lt;/strong&gt; when you're done&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 3: Read the Results
&lt;/h3&gt;

&lt;p&gt;Performance Analyzer breaks down the time for each visual into three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAX query&lt;/strong&gt;: Time spent calculating your measures and formulas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual display&lt;/strong&gt;: Time spent drawing the chart or table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Other&lt;/strong&gt;: Background tasks like sending queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What to look for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Any visual taking more than 2-3 seconds is a problem&lt;/li&gt;
&lt;li&gt;High DAX query times mean your formulas need work&lt;/li&gt;
&lt;li&gt;High visual display times mean you might have too much data in one visual&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Take Action
&lt;/h3&gt;

&lt;p&gt;Click the &lt;strong&gt;Copy query&lt;/strong&gt; button next to any slow visual. This copies the DAX query so you can analyze it further in DAX Studio.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: Using DAX Studio for Deep Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Install and Connect
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Download DAX Studio for free from daxstudio.org&lt;/li&gt;
&lt;li&gt;Install and open it&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Connect&lt;/strong&gt; and choose your Power BI file or dataset&lt;/li&gt;
&lt;li&gt;Your data model is now loaded&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 2: Test Your DAX Queries
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Paste the query you copied from Performance Analyzer (or write your own measure)&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Run&lt;/strong&gt; button (or press F5)&lt;/li&gt;
&lt;li&gt;Look at the results and timing at the bottom&lt;/li&gt;
&lt;/ol&gt;
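
&lt;p&gt;For example, a minimal test query (the table and measure names here are hypothetical) looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Every DAX query starts with EVALUATE and returns a table
EVALUATE
SUMMARIZECOLUMNS (
    'Date'[Year],            -- group by year
    "Sales", [Total Sales]   -- hypothetical measure
)
ORDER BY 'Date'[Year]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Queries copied from Performance Analyzer follow this same EVALUATE pattern, so you can paste and run them unchanged.&lt;/p&gt;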

&lt;h3&gt;
  
  
  Step 3: Use Server Timings
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on the &lt;strong&gt;Server Timings&lt;/strong&gt; button (looks like a stopwatch) in the toolbar&lt;/li&gt;
&lt;li&gt;Run your query again&lt;/li&gt;
&lt;li&gt;A new tab opens showing you exactly where time is being spent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What to look for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage Engine (SE) queries&lt;/strong&gt;: Time spent reading data from your tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Formula Engine (FE)&lt;/strong&gt;: Time spent doing calculations&lt;/li&gt;
&lt;li&gt;If SE time is high, you might need better data modeling or filters&lt;/li&gt;
&lt;li&gt;If FE time is high, your DAX formula might be inefficient&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Analyze Query Plans
&lt;/h3&gt;

&lt;p&gt;For advanced optimization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click &lt;strong&gt;Query Plan&lt;/strong&gt; button before running a query&lt;/li&gt;
&lt;li&gt;Run your query&lt;/li&gt;
&lt;li&gt;Review the physical and logical query plans to see exactly how your query is executed&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Common Performance Issues and Fixes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: Slow Measures with Iterators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Using iterator functions like SUMX, FILTER, or AVERAGEX over large tables&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-calculate values in calculated columns when possible&lt;/li&gt;
&lt;li&gt;Use variables (VAR) to avoid repeating calculations&lt;/li&gt;
&lt;li&gt;Filter data early in your formulas&lt;/li&gt;
&lt;/ul&gt;
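
&lt;p&gt;For instance, a variable lets an expensive iterator run once instead of several times (all names below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Without the variable, the SUMX iterator could be evaluated
-- up to three times inside the IF
Sales vs Target =
VAR CurrentSales =
    SUMX ( Sales, Sales[Quantity] * Sales[Unit Price] )
RETURN
    IF ( CurrentSales &amp;gt; [Sales Target], CurrentSales - [Sales Target], BLANK () )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;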

&lt;h3&gt;
  
  
  Issue 2: Too Many Visuals on One Page
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Page takes forever to load&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Split content across multiple pages&lt;/li&gt;
&lt;li&gt;Use bookmarks to show/hide sections&lt;/li&gt;
&lt;li&gt;Remove unnecessary visuals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Issue 3: Large Data Model
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; The entire report is slow&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove unused columns and tables&lt;/li&gt;
&lt;li&gt;Use aggregations for summarized data&lt;/li&gt;
&lt;li&gt;Check your relationships—avoid bi-directional filters unless necessary&lt;/li&gt;
&lt;li&gt;Consider using Import mode instead of DirectQuery when possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Issue 4: Complex DAX Formulas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Measures with nested CALCULATE statements run slowly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Break complex measures into simpler intermediate measures&lt;/li&gt;
&lt;li&gt;Use measure branching—store parts of calculations in separate measures&lt;/li&gt;
&lt;li&gt;Avoid using ALL() functions unnecessarily&lt;/li&gt;
&lt;/ul&gt;
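
&lt;p&gt;Measure branching in practice means each step lives in its own measure that the next one references (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total Sales = SUM ( Sales[Amount] )

Sales LY =
    CALCULATE ( [Total Sales], DATEADD ( 'Date'[Date], -1, YEAR ) )

Sales YoY % =
    DIVIDE ( [Total Sales] - [Sales LY], [Sales LY] )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each piece can be tested and optimized on its own in DAX Studio, and intermediate measures can be reused by other calculations.&lt;/p&gt;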

&lt;h2&gt;
  
  
  Quick Optimization Checklist
&lt;/h2&gt;

&lt;p&gt;Use this checklist after analyzing your report:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Remove unused columns and tables&lt;/li&gt;
&lt;li&gt;[ ] Hide columns not needed in reports&lt;/li&gt;
&lt;li&gt;[ ] Use star schema design (fact and dimension tables)&lt;/li&gt;
&lt;li&gt;[ ] Avoid bi-directional relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DAX Formulas:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Use variables (VAR) to store intermediate results&lt;/li&gt;
&lt;li&gt;[ ] Filter early, calculate late&lt;/li&gt;
&lt;li&gt;[ ] Replace iterators with simpler aggregations when possible&lt;/li&gt;
&lt;li&gt;[ ] Avoid nested CALCULATE statements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Visuals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Limit visuals to 10-15 per page&lt;/li&gt;
&lt;li&gt;[ ] Reduce data points in charts (use aggregation)&lt;/li&gt;
&lt;li&gt;[ ] Turn off visual interactions that aren't needed&lt;/li&gt;
&lt;li&gt;[ ] Use page navigation instead of cramming everything on one page&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Pro Tips
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tip 1:&lt;/strong&gt; Always test performance with realistic data volumes. A report that works fast with 100 rows might crawl with 100,000 rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 2:&lt;/strong&gt; Use DAX Studio's "Clear Cache" button before testing to get accurate timings without cached results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 3:&lt;/strong&gt; Start by optimizing the slowest visuals first—fixing one 10-second visual is better than fixing ten 0.5-second visuals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip 4:&lt;/strong&gt; Document your changes. Keep notes on what you changed and how it improved performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Performance optimization doesn't have to be overwhelming. Start with Performance Analyzer to identify slow visuals, then use DAX Studio to understand and fix the underlying queries. Focus on the biggest problems first, and you'll see dramatic improvements.&lt;/p&gt;

&lt;p&gt;Remember: a fast report isn't just nice to have—it's essential for user adoption and productivity. Take the time to optimize, and your users will thank you!&lt;/p&gt;

&lt;p&gt;Happy optimizing!&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>tooling</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Git Integration in Microsoft Fabric</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Tue, 30 Sep 2025 14:47:31 +0000</pubDate>
      <link>https://dev.to/anshul_02/git-integration-in-microsoft-fabric-b75</link>
      <guid>https://dev.to/anshul_02/git-integration-in-microsoft-fabric-b75</guid>
      <description>&lt;p&gt;This guide walks you through the basic tasks for using Microsoft Fabric’s Git integration tool, including how to connect a workspace to a Git repository, commit changes, update from Git, and disconnect from Git.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Fabric Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Access to a Fabric capacity is required to use all supported Fabric items. You can sign up for a free trial if you don't have one.&lt;/li&gt;
&lt;li&gt;The following tenant switches must be enabled from the Admin portal:

&lt;ul&gt;
&lt;li&gt;Users can create Fabric items&lt;/li&gt;
&lt;li&gt;Users can synchronize workspace items with their Git repositories&lt;/li&gt;
&lt;li&gt;Create workspaces (needed if branching out to a new workspace)&lt;/li&gt;
&lt;li&gt;Users can synchronize workspace items with GitHub repositories (for GitHub users)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;These switches can be enabled by tenant admin, capacity admin, or workspace admin depending on organizational settings.&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Git Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Git integration supports Azure DevOps or GitHub repositories.&lt;/li&gt;
&lt;li&gt;You must have:

&lt;ul&gt;
&lt;li&gt;An active Azure account registered to the same user as the Fabric workspace.&lt;/li&gt;
&lt;li&gt;Access to an existing Git repository in Azure DevOps or GitHub.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Connect a Workspace to a Git Repo
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Connect to a Git Repo
&lt;/h2&gt;

&lt;p&gt;Only workspace admins can connect a workspace to a Git repository, though anyone with permission can work in the connected workspace.&lt;/p&gt;

&lt;p&gt;To connect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Sign in to Microsoft Fabric and navigate to the workspace.&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Workspace settings.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Choose your Git provider—Azure DevOps or GitHub.&lt;/li&gt;
&lt;li&gt;For Azure DevOps, click &lt;strong&gt;Connect&lt;/strong&gt; to automatically sign in using the Azure Repos account associated with your Microsoft Entra user.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v58zkfd03uhwv4jjh6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v58zkfd03uhwv4jjh6g.png" alt=" " width="800" height="624"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect to a Workspace Branch
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;From the dropdown menu, specify:

&lt;ul&gt;
&lt;li&gt;Organization&lt;/li&gt;
&lt;li&gt;Project&lt;/li&gt;
&lt;li&gt;Git repository&lt;/li&gt;
&lt;li&gt;Branch (select an existing branch or create a new one)&lt;/li&gt;
&lt;li&gt;Folder (existing or new folder; blank creates content in root)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far88agmi0ag3mgjvnqey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Far88agmi0ag3mgjvnqey.png" alt=" " width="800" height="788"&gt;&lt;/a&gt;&lt;br&gt;
Click &lt;strong&gt;Connect and sync&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On initial sync, if one side (workspace or Git branch) is empty, content is copied from the nonempty side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jjmlpmbctt9uwkgwg4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jjmlpmbctt9uwkgwg4z.png" alt=" " width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If both sides have content, you choose the sync direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit Changes to Git
&lt;/h2&gt;

&lt;p&gt;After connecting, you can edit the workspace normally. Changes are saved only in the workspace until committed to the Git branch.&lt;/p&gt;

&lt;p&gt;To commit changes:&lt;/p&gt;

&lt;p&gt;Go to the workspace and click the &lt;strong&gt;Source control&lt;/strong&gt; icon, which shows the number of uncommitted changes.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Choose items to commit or select all.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add a comment (default added if empty).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click &lt;strong&gt;Commit&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6q409vzy43v7smulqel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6q409vzy43v7smulqel.png" alt=" " width="380" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After commit, items are removed from the changes list, and the status changes from &lt;strong&gt;Uncommitted&lt;/strong&gt; to &lt;strong&gt;Synced&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Update the Workspace from Git
&lt;/h2&gt;

&lt;p&gt;When others commit changes to the connected Git branch, a notification shows in the workspace.&lt;/p&gt;

&lt;p&gt;To update:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the workspace.&lt;/li&gt;
&lt;li&gt;Click the &lt;strong&gt;Source control&lt;/strong&gt; icon.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Updates&lt;/strong&gt; to see changes since the last sync.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Update all&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yymwqqpiavpx57ph0lm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yymwqqpiavpx57ph0lm.png" alt=" " width="536" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a successful update, changes are applied, and status changes to &lt;strong&gt;Synced&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disconnect a Workspace from Git
&lt;/h2&gt;

&lt;p&gt;Only workspace admins can disconnect a workspace.&lt;/p&gt;

&lt;p&gt;Steps to disconnect:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Workspace settings&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;Git integration&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Disconnect workspace&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Confirm by selecting &lt;strong&gt;Disconnect&lt;/strong&gt; again.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Manage Branches in Microsoft Fabric Workspaces
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Collaboration Workspace (Main Branch)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The collaboration workspace is connected to the &lt;strong&gt;main branch&lt;/strong&gt; of the repo.&lt;/li&gt;
&lt;li&gt;This workspace contains the consolidated, reviewed, and approved versions of the work shared by the team.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Branch Workspaces (Developer Workspaces)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Each developer can create their own &lt;strong&gt;feature branch&lt;/strong&gt; off of main (e.g., "feature1", "feature2") from the Git integration section of the workspace settings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1rc1ro5s198fi1qlfle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1rc1ro5s198fi1qlfle.png" alt=" " width="402" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each developer works in their own workspace connected to their respective feature branch.&lt;/li&gt;
&lt;li&gt;Developers make changes in their workspace and commit them to their feature branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull Requests and Merging&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Changes from feature branches are merged into the main branch via pull requests (PRs) in Azure DevOps.&lt;/li&gt;
&lt;li&gt;PRs require review and approval ensuring code quality and collaboration governance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Syncing and Updating Workspaces&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Once a PR is merged into main, the collaboration workspace can be updated to reflect the merged changes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd55ks06plz7nsq4ddmsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd55ks06plz7nsq4ddmsg.png" alt=" " width="334" height="681"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers keep their feature branches long-lived, so they occasionally need to sync them with main to pick up the latest changes. Because Fabric does not support direct branch updates inside the workspace, this sync is done by creating a PR from main back into the feature branch in Azure DevOps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch Policies and Permissions&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Branch policies, like requiring minimum reviewers, help protect the main branch from direct commits, ensuring all changes come through PRs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>fabric</category>
      <category>microsoft</category>
    </item>
    <item>
<title>Why Use Medallion Architecture?</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Mon, 01 Sep 2025 19:30:04 +0000</pubDate>
      <link>https://dev.to/anshul_02/why-to-use-medallion-architecture--21d3</link>
      <guid>https://dev.to/anshul_02/why-to-use-medallion-architecture--21d3</guid>
      <description>&lt;h1&gt;
  
  
  Understanding the Medallion Architecture: A Comprehensive Guide with a Use Case
&lt;/h1&gt;

&lt;p&gt;Data management is crucial for organizations aiming to optimize efficiency and reliability. Choosing the appropriate data architecture is vital to achieving this. One prominent architecture gaining traction is the &lt;strong&gt;Medallion Architecture&lt;/strong&gt;, often structured in three layers: bronze, silver, and gold. This approach helps organizations systematically improve data quality and usability through progressive refinement.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is the Medallion Architecture?
&lt;/h2&gt;

&lt;p&gt;The Medallion Architecture organizes data into three key layers, each with a distinct role in the data lifecycle:&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Raw Data Ingestion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Capture and store raw, unprocessed data exactly as it arrives from various sources.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Serves as a landing zone preserving original data formats and contents, including logs, streaming, batch, and unstructured data. Basic deduplication can be done here.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Collecting raw membership activity data from various platforms such as website interactions, mobile app usage, and event attendance.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users:&lt;/strong&gt; Data engineers and analysts tasked with ingesting raw data and exploratory analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Silver Layer: Cleansed and Enriched Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Clean, transform, and enrich raw data to improve quality and analytical usability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Applies data cleansing such as removing duplicates, filling missing values, and applying business rules to create a consistent dataset. Data from multiple sources may be joined or integrated here.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Filtering out incomplete membership records, standardizing member identifiers, and integrating demographic data for enriched profiles.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users:&lt;/strong&gt; Data engineers, data scientists, and analysts performing deeper analysis and feature engineering.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gold Layer: Business-Ready Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Purpose:&lt;/strong&gt; Provide highly processed, aggregated data optimized for business intelligence (BI), analytics, and machine learning.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Contains aggregated metrics, KPIs, summaries, and structured datasets tailored for end-user consumption and decision-making.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Calculating monthly active members, average membership duration, and retention rates to guide marketing and engagement strategies.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Users:&lt;/strong&gt; Business analysts, executives, data scientists, and AI/ML engineers consuming clean and ready-to-use data.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Use the Medallion Architecture?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Quality Management:&lt;/strong&gt; Ensures quality checks occur progressively, reducing errors and inconsistencies before business use.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Supports diverse data environments and reuse of transformed data, while maintaining modularity for easier maintenance.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Simplifies compliance and access control by separating raw, cleansed, and business-ready data layers.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Lineage:&lt;/strong&gt; Provides transparent data transformation tracking for auditability and trust.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  When to Use the Medallion Architecture?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Organizations handling &lt;strong&gt;large volumes&lt;/strong&gt; of data from varied sources.
&lt;/li&gt;
&lt;li&gt;Environments requiring &lt;strong&gt;high data quality and governance&lt;/strong&gt; like healthcare, finance, and regulated industries.
&lt;/li&gt;
&lt;li&gt;Companies aiming for &lt;strong&gt;scalable, maintainable data pipelines&lt;/strong&gt; supporting analytics and machine learning.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementing the Medallion Architecture: A Practical Use Case with Azure Tools
&lt;/h2&gt;

&lt;p&gt;Consider an organization analyzing membership data to gain business insights using Azure data engineering tools like Azure Data Factory (ADF) and Microsoft Fabric.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Environment Setup
&lt;/h3&gt;

&lt;p&gt;Prepare your data infrastructure using Azure Data Lake Storage for scalable storage and Azure Data Factory for orchestrating data workflows and pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Ingest Raw Data (Bronze Layer)
&lt;/h3&gt;

&lt;p&gt;Use Azure Data Factory to ingest membership activity data from various sources (e.g., web logs, app data, event registration systems) into the Bronze layer stored in Azure Data Lake. This raw data retains its original format and serves as the source of truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Clean and Enrich Data (Silver Layer)
&lt;/h3&gt;

&lt;p&gt;Transform the raw data in Azure Synapse or Fabric by cleaning (removing duplicates, handling missing values), standardizing member IDs, and enriching with additional profile data from CRM systems. This produces a high-quality curated dataset ready for analysis.&lt;/p&gt;
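
&lt;p&gt;As a sketch, the Silver-layer cleansing might look like this in a Fabric or Synapse notebook using PySpark (the table paths and column names are illustrative assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import functions as F

# Read raw membership activity from the Bronze layer (hypothetical path)
bronze = spark.read.format("delta").load("Tables/bronze_membership")

silver = (
    bronze
    .dropDuplicates(["member_id", "event_ts"])               # remove duplicates
    .na.fill({"channel": "unknown"})                         # handle missing values
    .withColumn("member_id", F.upper(F.trim("member_id")))   # standardize member IDs
)

# Enrich with CRM profile data, then write to the Silver layer
crm = spark.read.format("delta").load("Tables/crm_profiles")
silver.join(crm, "member_id", "left") \
      .write.format("delta").mode("overwrite").save("Tables/silver_membership")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;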

&lt;h3&gt;
  
  
  Step 4: Aggregate and Prepare Business Data (Gold Layer)
&lt;/h3&gt;

&lt;p&gt;Aggregate and summarize membership trends using Synapse or Fabric SQL to create business-ready datasets, such as monthly active members, average membership tenure, and retention rates. These datasets feed Power BI dashboards and support machine learning models for personalized marketing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Medallion Architecture offers a powerful framework to organize data into layers of increasing quality and business value. Its layered approach facilitates improved data governance, traceability, and scalability. Leveraging data engineering tools like Azure Data Factory and Microsoft Fabric enables organizations to build robust, scalable, and maintainable data pipelines that empower data-driven decision-making and advanced analytics.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Mastering Cursor Rules: Your Complete Guide to AI-Powered Coding Excellence</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Tue, 08 Jul 2025 17:22:11 +0000</pubDate>
      <link>https://dev.to/anshul_02/mastering-cursor-rules-your-complete-guide-to-ai-powered-coding-excellence-2j5h</link>
      <guid>https://dev.to/anshul_02/mastering-cursor-rules-your-complete-guide-to-ai-powered-coding-excellence-2j5h</guid>
      <description>&lt;p&gt;Cursor AI has revolutionized the way developers write code, but its true power lies in customization through rules. Cursor Rules provide a powerful way to give consistent, reusable instructions to Cursor's AI features, like the Agent and Cmd-K. They help the AI understand your project's context, adhere to specific coding styles, and automate workflows, making it a more effective coding partner.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Cursor Rules?
&lt;/h2&gt;

&lt;p&gt;Cursor Rules are essentially saved prompts or guidelines that are automatically included when the AI processes your requests. Think of them as a way to give the AI persistent memory and instructions tailored to your needs. Large language models don't retain memory between completions. Rules provide persistent, reusable context at the prompt level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Types of Rules in Cursor
&lt;/h2&gt;

&lt;p&gt;Cursor offers three distinct types of rules, each serving different purposes:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Global Rules
&lt;/h3&gt;

&lt;p&gt;Set in Cursor Settings under General &amp;gt; Rules for AI, these rules establish core principles for all AI interactions. They define fundamental behavior patterns and are language-agnostic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to Use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting universal coding standards across all projects&lt;/li&gt;
&lt;li&gt;Defining your preferred programming practices&lt;/li&gt;
&lt;li&gt;Establishing consistent AI behavior patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Project Rules
&lt;/h3&gt;

&lt;p&gt;Stored in &lt;code&gt;.cursor/rules&lt;/code&gt;, project rules are version-controlled and scoped to your codebase. Create them using the New Cursor Rule command or by going to Cursor Settings &amp;gt; Rules; either way, a new rule file is created in &lt;code&gt;.cursor/rules&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to Use:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project-specific guidelines and conventions&lt;/li&gt;
&lt;li&gt;Team collaboration standards&lt;/li&gt;
&lt;li&gt;Framework-specific instructions&lt;/li&gt;
&lt;/ul&gt;
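&lt;p&gt;For illustration, a project rule file in &lt;code&gt;.cursor/rules&lt;/code&gt; might look like the sketch below. The frontmatter fields follow Cursor's MDC rule format; the glob pattern and guidelines are placeholders for your own.&lt;/p&gt;

```markdown
---
description: Conventions for the API layer
globs: ["src/api/**/*.ts"]
alwaysApply: false
---

- Validate request bodies before they reach the database layer
- Return typed error objects from every handler
```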

&lt;h3&gt;
  
  
  3. Legacy .cursorrules Files
&lt;/h3&gt;

&lt;p&gt;Defined in a .cursorrules file in your project's root directory. Still supported, but deprecated. Use Project Rules instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Your First Rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quick Start with Global Rules
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open Cursor Settings (Cmd/Ctrl + ,)&lt;/li&gt;
&lt;li&gt;Navigate to General &amp;gt; Rules for AI&lt;/li&gt;
&lt;li&gt;Add your universal coding preferences:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Global Coding Standards&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use TypeScript for all new code
&lt;span class="p"&gt;-&lt;/span&gt; Follow clean code principles
&lt;span class="p"&gt;-&lt;/span&gt; Prefer async/await over callbacks
&lt;span class="p"&gt;-&lt;/span&gt; Write comprehensive error handling
&lt;span class="p"&gt;-&lt;/span&gt; Include JSDoc comments for functions
&lt;span class="p"&gt;-&lt;/span&gt; Use meaningful variable names
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating Project-Specific Rules
&lt;/h3&gt;

&lt;p&gt;Open the command palette (Cmd+Shift+P or Ctrl+Shift+P) and run the New Cursor Rule command.&lt;/p&gt;

&lt;p&gt;Example project rule structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Project: E-commerce Platform&lt;/span&gt;
&lt;span class="gu"&gt;## Technology Stack&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; React 18 with TypeScript
&lt;span class="p"&gt;-&lt;/span&gt; Node.js backend with Express
&lt;span class="p"&gt;-&lt;/span&gt; PostgreSQL database
&lt;span class="p"&gt;-&lt;/span&gt; Tailwind CSS for styling

&lt;span class="gu"&gt;## Coding Guidelines&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use functional components with hooks
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper error boundaries
&lt;span class="p"&gt;-&lt;/span&gt; Follow atomic design principles
&lt;span class="p"&gt;-&lt;/span&gt; Use React Query for data fetching
&lt;span class="p"&gt;-&lt;/span&gt; Maintain 80% test coverage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced Rule Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  React/TypeScript Project
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# React TypeScript Rules&lt;/span&gt;
&lt;span class="gu"&gt;## Component Standards&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use functional components exclusively
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper prop types with interfaces
&lt;span class="p"&gt;-&lt;/span&gt; Follow naming convention: PascalCase for components
&lt;span class="p"&gt;-&lt;/span&gt; Use custom hooks for complex logic
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper error handling

&lt;span class="gu"&gt;## State Management&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use React Query for server state
&lt;span class="p"&gt;-&lt;/span&gt; Use Zustand for client state
&lt;span class="p"&gt;-&lt;/span&gt; Avoid prop drilling beyond 2 levels
&lt;span class="p"&gt;-&lt;/span&gt; Implement optimistic updates where appropriate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Backend API Rules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Backend API Guidelines&lt;/span&gt;
&lt;span class="gu"&gt;## Architecture&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Follow RESTful conventions
&lt;span class="p"&gt;-&lt;/span&gt; Use middleware for authentication
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper logging
&lt;span class="p"&gt;-&lt;/span&gt; Use environment variables for configuration

&lt;span class="gu"&gt;## Error Handling&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Return consistent error responses
&lt;span class="p"&gt;-&lt;/span&gt; Log all errors with context
&lt;span class="p"&gt;-&lt;/span&gt; Use appropriate HTTP status codes
&lt;span class="p"&gt;-&lt;/span&gt; Implement request validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Database Rules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Database Guidelines&lt;/span&gt;
&lt;span class="gu"&gt;## Query Optimization&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use indexes for frequently queried fields
&lt;span class="p"&gt;-&lt;/span&gt; Avoid N+1 queries
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper pagination
&lt;span class="p"&gt;-&lt;/span&gt; Use connection pooling

&lt;span class="gu"&gt;## Security&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Sanitize all inputs
&lt;span class="p"&gt;-&lt;/span&gt; Use parameterized queries
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper access controls
&lt;span class="p"&gt;-&lt;/span&gt; Regular security audits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Best Practices for Effective Rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Be Specific and Actionable
&lt;/h3&gt;

&lt;p&gt;Instead of vague instructions like "write good code," provide specific guidelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Good: Specific and actionable&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use camelCase for variable names
&lt;span class="p"&gt;-&lt;/span&gt; Limit functions to 20 lines maximum
&lt;span class="p"&gt;-&lt;/span&gt; Include type annotations for all parameters

&lt;span class="gh"&gt;# Bad: Vague and unhelpful&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Write clean code
&lt;span class="p"&gt;-&lt;/span&gt; Make it efficient
&lt;span class="p"&gt;-&lt;/span&gt; Follow best practices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Include Context and Examples
&lt;/h3&gt;

&lt;p&gt;You might use rules to give the AI context on what you're building, style guidelines, or info on commonly used methods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Authentication Context&lt;/span&gt;
This project uses JWT tokens for authentication.
Example login flow:
&lt;span class="p"&gt;1.&lt;/span&gt; User submits credentials
&lt;span class="p"&gt;2.&lt;/span&gt; Server validates and returns JWT
&lt;span class="p"&gt;3.&lt;/span&gt; Client stores token in httpOnly cookie
&lt;span class="p"&gt;4.&lt;/span&gt; Include token in Authorization header
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Organize Rules Hierarchically
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project/
  .cursor/rules/        # Project-wide rules
  backend/
    server/
      .cursor/rules/    # Backend-specific rules
  frontend/
    .cursor/rules/      # Frontend-specific rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  4. Version Control Your Rules
&lt;/h3&gt;

&lt;p&gt;Since rules are stored in &lt;code&gt;.cursor/rules&lt;/code&gt;, they're automatically version-controlled, allowing your team to collaborate on AI behavior guidelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Style Enforcement
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Code Style Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use 2 spaces for indentation
&lt;span class="p"&gt;-&lt;/span&gt; No trailing whitespace
&lt;span class="p"&gt;-&lt;/span&gt; Semicolons are required
&lt;span class="p"&gt;-&lt;/span&gt; Use single quotes for strings
&lt;span class="p"&gt;-&lt;/span&gt; Maximum line length: 100 characters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Framework-Specific Guidelines
&lt;/h3&gt;

&lt;p&gt;Rules are a great place to remind the AI of conventions it tends to miss (like the 'use client' directive at the top of client components in Next.js):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Next.js 13+ App Router Rules&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Use 'use client' directive for client components
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper loading and error states
&lt;span class="p"&gt;-&lt;/span&gt; Use Server Components by default
&lt;span class="p"&gt;-&lt;/span&gt; Leverage parallel routes for complex layouts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing Standards
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Testing Guidelines&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Write tests for all public functions
&lt;span class="p"&gt;-&lt;/span&gt; Use Jest for unit tests
&lt;span class="p"&gt;-&lt;/span&gt; Implement E2E tests with Playwright
&lt;span class="p"&gt;-&lt;/span&gt; Maintain 80% code coverage
&lt;span class="p"&gt;-&lt;/span&gt; Use descriptive test names
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rules Not Working?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check rule syntax&lt;/strong&gt;: Ensure proper markdown formatting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify rule location&lt;/strong&gt;: Confirm rules are in the correct directory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restart Cursor&lt;/strong&gt;: Rules may need a restart to take effect&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check rule conflicts&lt;/strong&gt;: Global and project rules can contradict each other, producing inconsistent behavior&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Performance Considerations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Keep rules concise but comprehensive&lt;/li&gt;
&lt;li&gt;Avoid redundant instructions&lt;/li&gt;
&lt;li&gt;Use clear, unambiguous language&lt;/li&gt;
&lt;li&gt;Regular review and updates&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dynamic Rules for Different Environments
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Environment-Specific Rules&lt;/span&gt;
&lt;span class="gu"&gt;## Development&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Include detailed console logging
&lt;span class="p"&gt;-&lt;/span&gt; Use development API endpoints
&lt;span class="p"&gt;-&lt;/span&gt; Enable debug mode

&lt;span class="gu"&gt;## Production&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Minimize console output
&lt;span class="p"&gt;-&lt;/span&gt; Use production API endpoints
&lt;span class="p"&gt;-&lt;/span&gt; Implement proper error tracking
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integration with External Tools
&lt;/h3&gt;

&lt;p&gt;Reference external style guides and documentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# External References&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Follow Google TypeScript Style Guide
&lt;span class="p"&gt;-&lt;/span&gt; Use Prettier for code formatting
&lt;span class="p"&gt;-&lt;/span&gt; Adhere to ESLint configuration
&lt;span class="p"&gt;-&lt;/span&gt; Reference project's README for setup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Measuring Success
&lt;/h2&gt;

&lt;p&gt;Track the effectiveness of your rules by monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code review feedback frequency&lt;/li&gt;
&lt;li&gt;Bug reports related to style/standards&lt;/li&gt;
&lt;li&gt;Developer onboarding time&lt;/li&gt;
&lt;li&gt;Code consistency across the team&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Community Resources
&lt;/h2&gt;

&lt;p&gt;Browse &lt;a href="https://cursor.directory/" rel="noopener noreferrer"&gt;https://cursor.directory/&lt;/a&gt; for sample rules. The community has created numerous rule templates for different technologies and use cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cursor Rules transform the AI from a generic code assistant into a personalized development partner that understands your project's unique requirements. By defining coding standards and best practices in your rules files, you can ensure more relevant and accurate code suggestions.&lt;/p&gt;

&lt;p&gt;Start with basic global rules, then gradually implement project-specific guidelines. Remember, effective rules are specific, actionable, and regularly updated to reflect your evolving development practices.&lt;/p&gt;

&lt;p&gt;The investment in setting up comprehensive rules pays dividends in code quality, team consistency, and development velocity. Your AI assistant becomes not just a tool, but a knowledgeable team member who understands your standards and helps maintain them across your entire codebase.&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Backend Deployment on Azure App Service with Bitbucket CI/CD Pipeline</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Wed, 19 Mar 2025 11:07:12 +0000</pubDate>
      <link>https://dev.to/anshul_02/backend-deployment-on-azure-app-service-with-bitbucket-cicd-pipeline-5dga</link>
      <guid>https://dev.to/anshul_02/backend-deployment-on-azure-app-service-with-bitbucket-cicd-pipeline-5dga</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;After successfully setting up CI/CD for our frontend application, our next challenge was deploying the backend API to Azure App Service. This blog post details our approach to automating the backend deployment process using Bitbucket Pipelines and shares important Azure configuration steps that are often overlooked in tutorials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To follow along with this guide, you'll need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Node.js backend application in a Bitbucket repository&lt;/li&gt;
&lt;li&gt;An Azure subscription&lt;/li&gt;
&lt;li&gt;An Azure App Service plan and web app&lt;/li&gt;
&lt;li&gt;Service principal credentials for Azure deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Backend Pipeline Configuration
&lt;/h2&gt;

&lt;p&gt;Our backend deployment pipeline follows a similar structure to our frontend pipeline, with some key differences tailored to backend requirements. Here's our complete &lt;code&gt;bitbucket-pipelines.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:22&lt;/span&gt;

&lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install&lt;/span&gt;
          &lt;span class="na"&gt;caches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Installing dependencies..."&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --force&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;parallel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run build&lt;/span&gt;
              &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Building project..."&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --force&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y zip&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;zip -r app-$BITBUCKET_BUILD_NUMBER.zip build package.json package-lock.json&lt;/span&gt;
              &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.zip"&lt;/span&gt;

          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Security Scan&lt;/span&gt;
              &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Perform a security scan for sensitive data..."&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "See https://bitbucket.org/product/features/pipelines/integrations#security"&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y git-secrets&lt;/span&gt;
                &lt;span class="c1"&gt;# Example usage: git secrets --scan&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to Production&lt;/span&gt;
          &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;manual&lt;/span&gt;          
          &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pipe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;atlassian/azure-web-apps-deploy:1.0.1&lt;/span&gt;
              &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;AZURE_APP_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_ID&lt;/span&gt;
                &lt;span class="na"&gt;AZURE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_PASSWORD&lt;/span&gt;
                &lt;span class="na"&gt;AZURE_TENANT_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_TENANT_ID&lt;/span&gt;
                &lt;span class="na"&gt;AZURE_RESOURCE_GROUP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_RESOURCE_GROUP&lt;/span&gt;
                &lt;span class="na"&gt;AZURE_APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_NAME&lt;/span&gt;
                &lt;span class="na"&gt;ZIP_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-$BITBUCKET_BUILD_NUMBER.zip&lt;/span&gt;        
                &lt;span class="na"&gt;DEBUG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pipeline Breakdown
&lt;/h3&gt;

&lt;p&gt;Let's examine each section of the pipeline:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Installation
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install&lt;/span&gt;
    &lt;span class="na"&gt;caches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Installing dependencies..."&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step installs all dependencies required for the application. The &lt;code&gt;--force&lt;/code&gt; flag pushes past dependency conflicts rather than resolving them; it keeps the pipeline moving, but prefer fixing version mismatches when you can.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 2: Parallel Execution
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;parallel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run build&lt;/span&gt;
        &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Building project..."&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --force&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y zip&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;zip -r app-$BITBUCKET_BUILD_NUMBER.zip build package.json package-lock.json&lt;/span&gt;
        &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.zip"&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Security Scan&lt;/span&gt;
        &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "Perform a security scan for sensitive data..."&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;echo "See https://bitbucket.org/product/features/pipelines/integrations#security"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y git-secrets&lt;/span&gt;
          &lt;span class="c1"&gt;# Example usage: git secrets --scan&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This section runs two processes in parallel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt;: Compiles the Node.js application and packages it for deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Scan&lt;/strong&gt;: Checks for sensitive data in the codebase&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that in our backend build, we're including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;build&lt;/code&gt; directory with compiled code&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;package.json&lt;/code&gt; and &lt;code&gt;package-lock.json&lt;/code&gt; files for dependency information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is crucial for Azure App Service to correctly install and run the Node.js application.&lt;/p&gt;
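&lt;p&gt;For illustration, the deployed zip might carry a &lt;code&gt;package.json&lt;/code&gt; along these lines — the name and script paths are hypothetical, but the &lt;code&gt;start&lt;/code&gt; script is what App Service can fall back to when no explicit startup command is set:&lt;/p&gt;

```json
{
  "name": "backend-api",
  "main": "build/server.js",
  "scripts": {
    "build": "tsc",
    "start": "node build/server.js"
  }
}
```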

&lt;h4&gt;
  
  
  Step 3: Deployment
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to Production&lt;/span&gt;
    &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;manual&lt;/span&gt;          
    &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pipe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;atlassian/azure-web-apps-deploy:1.0.1&lt;/span&gt;
        &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_APP_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_ID&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_PASSWORD&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_TENANT_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_TENANT_ID&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_RESOURCE_GROUP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_RESOURCE_GROUP&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_NAME&lt;/span&gt;
          &lt;span class="na"&gt;ZIP_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-$BITBUCKET_BUILD_NUMBER.zip&lt;/span&gt;        
          &lt;span class="na"&gt;DEBUG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has a manual trigger for controlled deployments&lt;/li&gt;
&lt;li&gt;Uses the Atlassian Azure Web Apps Deploy pipe&lt;/li&gt;
&lt;li&gt;Passes necessary Azure credentials stored as repository variables&lt;/li&gt;
&lt;li&gt;Specifies the zip file created in the build step&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Azure App Service Configuration
&lt;/h2&gt;

&lt;p&gt;Unlike frontend deployments, backend deployments require additional configuration in Azure App Service. Here's a step-by-step guide to setting up your Azure environment for Node.js backend applications:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create an App Service
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the Azure Portal&lt;/li&gt;
&lt;li&gt;Click on "Create a resource" &amp;gt; "Web App"&lt;/li&gt;
&lt;li&gt;Fill in the required details:

&lt;ul&gt;
&lt;li&gt;Resource Group: Choose existing or create new&lt;/li&gt;
&lt;li&gt;Name: A unique name for your app&lt;/li&gt;
&lt;li&gt;Publish: Code&lt;/li&gt;
&lt;li&gt;Runtime stack: Node.js (version matching your application)&lt;/li&gt;
&lt;li&gt;Operating System: Linux (recommended for Node.js)&lt;/li&gt;
&lt;li&gt;Region: Choose appropriate region&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click "Review + create" and then "Create"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. Configure Startup Command
&lt;/h3&gt;

&lt;p&gt;For Node.js applications, you need to specify how Azure should start your application:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to your App Service in the Azure Portal&lt;/li&gt;
&lt;li&gt;Go to "Configuration" &amp;gt; "General settings"&lt;/li&gt;
&lt;li&gt;In the "Startup Command" field, enter your start command:

&lt;ul&gt;
&lt;li&gt;For apps using a build directory: &lt;code&gt;node build/server.js&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;For apps using an express server: &lt;code&gt;npm start&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click "Save"&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  3. Configure Environment Variables
&lt;/h3&gt;

&lt;p&gt;Backend applications typically require environment variables for database connections, API keys, and other sensitive information:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your App Service, go to "Configuration" &amp;gt; "Application settings"&lt;/li&gt;
&lt;li&gt;Click "New application setting" to add each environment variable&lt;/li&gt;
&lt;li&gt;Common variables to add:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NODE_ENV&lt;/code&gt;: Set to &lt;code&gt;production&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PORT&lt;/code&gt;: Usually set to &lt;code&gt;8080&lt;/code&gt; (Azure will map this to the public port)&lt;/li&gt;
&lt;li&gt;Database connection strings&lt;/li&gt;
&lt;li&gt;API keys and secrets&lt;/li&gt;
&lt;li&gt;Authentication parameters&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Click "Save" after adding all variables&lt;/li&gt;
&lt;/ol&gt;
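&lt;p&gt;At runtime the application reads these settings from &lt;code&gt;process.env&lt;/code&gt;. A small sketch of a config loader with sensible local fallbacks — the &lt;code&gt;DATABASE_URL&lt;/code&gt; name is an assumed example, not a setting the article defines:&lt;/p&gt;

```javascript
// Sketch of reading the Application settings described above.
// NODE_ENV and PORT match the variables listed; DATABASE_URL is
// an illustrative placeholder for your own connection string.
function loadConfig(env) {
  return {
    nodeEnv: env.NODE_ENV || 'development',
    port: Number(env.PORT) || 3000,
    databaseUrl: env.DATABASE_URL || ''
  };
}

// Usage: const config = loadConfig(process.env);
module.exports = { loadConfig };
```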

&lt;h3&gt;
  
  
  4. Configure CORS (if needed)
&lt;/h3&gt;

&lt;p&gt;If your backend API serves a frontend hosted elsewhere:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your App Service, go to "CORS"&lt;/li&gt;
&lt;li&gt;Add the domains that should be allowed to access your API&lt;/li&gt;
&lt;li&gt;Click "Save"&lt;/li&gt;
&lt;/ol&gt;
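&lt;p&gt;If you also enforce CORS in the application itself rather than only in App Service, the allowlist check can be sketched like this — the origins below are placeholders:&lt;/p&gt;

```javascript
// Sketch of an origin allowlist check, mirroring the App Service
// CORS configuration described above. Origins are placeholders.
const allowedOrigins = [
  'https://app.example.com',
  'https://admin.example.com'
];

function corsHeadersFor(origin) {
  // Returns the headers to attach, or null when the origin is not allowed.
  if (!allowedOrigins.includes(origin)) return null;
  return {
    'Access-Control-Allow-Origin': origin,
    'Access-Control-Allow-Methods': 'GET,POST,PUT,DELETE,OPTIONS',
    'Access-Control-Allow-Headers': 'Content-Type,Authorization'
  };
}

module.exports = { corsHeadersFor };
```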

&lt;h3&gt;
  
  
  5. Set up Deployment Credentials
&lt;/h3&gt;

&lt;p&gt;For the Bitbucket pipeline to deploy to Azure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In Azure Portal, go to Azure Active Directory (now Microsoft Entra ID)&lt;/li&gt;
&lt;li&gt;Register a new application&lt;/li&gt;
&lt;li&gt;Create a client secret&lt;/li&gt;
&lt;li&gt;Assign the application appropriate permissions to your resource group&lt;/li&gt;
&lt;li&gt;Note down:

&lt;ul&gt;
&lt;li&gt;Application (client) ID&lt;/li&gt;
&lt;li&gt;Directory (tenant) ID&lt;/li&gt;
&lt;li&gt;Client secret&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These values will be used as the &lt;code&gt;AZURE_APP_ID&lt;/code&gt;, &lt;code&gt;AZURE_TENANT_ID&lt;/code&gt;, and &lt;code&gt;AZURE_PASSWORD&lt;/code&gt; in your Bitbucket pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Configure Bitbucket Repository Variables
&lt;/h3&gt;

&lt;p&gt;In Bitbucket:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Repository settings &amp;gt; Repository variables&lt;/li&gt;
&lt;li&gt;Add the following variables:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AZURE_APP_ID&lt;/code&gt;: Your Azure service principal ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_PASSWORD&lt;/code&gt;: Your Azure service principal password&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_TENANT_ID&lt;/code&gt;: Your Azure tenant ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_RESOURCE_GROUP&lt;/code&gt;: The resource group containing your App Service&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_APP_NAME&lt;/code&gt;: The name of your App Service&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Handling Backend-Specific Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge 1: Node.js Version Compatibility
&lt;/h3&gt;

&lt;p&gt;Azure App Service supports specific Node.js versions. Ensure your application is compatible with the available versions in Azure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: In your Azure App Service settings, go to "Configuration" &amp;gt; "General settings" and select the appropriate Node.js version. &lt;/p&gt;
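&lt;p&gt;On a Linux App Service, the runtime can also be pinned from the CLI. A sketch with placeholder names; the exact version string depends on what your region currently offers:&lt;/p&gt;

```shell
# Pin the Node.js runtime on a Linux App Service (names are placeholders).
az webapp config set \
  --resource-group my-resource-group \
  --name my-backend-app \
  --linux-fx-version "NODE|20-lts"
```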

&lt;h3&gt;
  
  
  Challenge 2: Server Startup Detection
&lt;/h3&gt;

&lt;p&gt;Azure App Service needs to detect when your Node.js application has successfully started.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Ensure your application listens on the port provided by the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORT&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Server running on port &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenge 3: Application Logging
&lt;/h3&gt;

&lt;p&gt;Troubleshooting deployment issues requires proper logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Enable application logging in Azure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your App Service &amp;gt; "App Service logs"&lt;/li&gt;
&lt;li&gt;Enable "Application logging"&lt;/li&gt;
&lt;li&gt;Set the log level to "Information" or "Verbose"&lt;/li&gt;
&lt;li&gt;Enable "File System" logging&lt;/li&gt;
&lt;li&gt;Click "Save"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can then view logs in the "Log stream" section or download them for analysis.&lt;/p&gt;
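&lt;p&gt;The same logging setup and log streaming are available from the Azure CLI; a sketch with placeholder names:&lt;/p&gt;

```shell
# Enable filesystem application logging at the Information level,
# then stream logs live (names are placeholders).
az webapp log config \
  --resource-group my-resource-group \
  --name my-backend-app \
  --application-logging filesystem \
  --level information

az webapp log tail \
  --resource-group my-resource-group \
  --name my-backend-app
```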

&lt;h2&gt;
  
  
  Testing the Deployment
&lt;/h2&gt;

&lt;p&gt;After deploying, verify that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application is running (open the Azure App Service URL in a browser)&lt;/li&gt;
&lt;li&gt;API endpoints are accessible and returning expected responses&lt;/li&gt;
&lt;li&gt;Environment variables are correctly loaded&lt;/li&gt;
&lt;li&gt;Your application can connect to external services (databases, etc.)&lt;/li&gt;
&lt;li&gt;Logs show no errors or warnings&lt;/li&gt;
&lt;/ol&gt;
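&lt;p&gt;The first two checks lend themselves to a small smoke-test script. A sketch; the URL and the &lt;code&gt;/health&lt;/code&gt; endpoint are placeholders for whatever your API exposes:&lt;/p&gt;

```shell
# Post-deployment smoke test. The -f flag makes curl exit non-zero
# on HTTP error responses, which aborts the script via "set -e".
set -e
curl -fsS https://my-backend-app.azurewebsites.net/health
echo "Health check passed"
```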

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying a Node.js backend to Azure App Service using Bitbucket Pipelines provides a robust and automated deployment solution. By properly configuring both the pipeline and the Azure environment, you can ensure smooth and reliable deployments.&lt;/p&gt;

&lt;p&gt;The key takeaways are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Include all necessary files in your deployment package&lt;/li&gt;
&lt;li&gt;Configure the correct startup command in Azure&lt;/li&gt;
&lt;li&gt;Set up environment variables for your application&lt;/li&gt;
&lt;li&gt;Configure CORS if needed&lt;/li&gt;
&lt;li&gt;Enable proper logging for troubleshooting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With this setup, your backend deployments will be consistent, reliable, and easy to manage.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>bitbucket</category>
      <category>cicd</category>
    </item>
    <item>
      <title>Deploying a Frontend Application to Azure App Service with Bitbucket CI/CD Pipeline</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Wed, 19 Mar 2025 10:50:46 +0000</pubDate>
      <link>https://dev.to/anshul_02/deploying-a-frontend-application-to-azure-app-service-with-bitbucket-cicd-pipeline-7p8</link>
      <guid>https://dev.to/anshul_02/deploying-a-frontend-application-to-azure-app-service-with-bitbucket-cicd-pipeline-7p8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Deploying frontend applications to production environments can be challenging, especially for single-page applications (SPAs) like React. In this blog post, I'll walk through setting up a complete CI/CD pipeline for deploying a frontend application to Azure App Service using Bitbucket Pipelines, and show how to overcome common challenges like path mapping and URL rewriting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we begin, make sure you have:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Bitbucket repository with your frontend code&lt;/li&gt;
&lt;li&gt;An Azure subscription&lt;/li&gt;
&lt;li&gt;An Azure App Service plan and web app&lt;/li&gt;
&lt;li&gt;Service principal credentials for Azure deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Understanding the Pipeline Architecture
&lt;/h2&gt;

&lt;p&gt;Our deployment pipeline follows these key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install dependencies&lt;/strong&gt; - Prepare the environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel processes&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build the application&lt;/strong&gt; - Compile, bundle, and package the app&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security scan&lt;/strong&gt; - Check for sensitive data in code&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual deployment trigger&lt;/strong&gt; - Deploy to production with approval&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Bitbucket Pipeline Configuration
&lt;/h2&gt;

&lt;p&gt;Here's our complete &lt;code&gt;bitbucket-pipelines.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node:20&lt;/span&gt;

&lt;span class="na"&gt;pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
        &lt;span class="na"&gt;caches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
        &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rm -rf node_modules package-lock.json&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --legacy-peer-deps&lt;/span&gt; 

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;parallel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
          &lt;span class="na"&gt;caches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rm -rf node_modules package-lock.json&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --legacy-peer-deps&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;              
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mv web.config dist/&lt;/span&gt;  &lt;span class="c1"&gt;# Move web.config inside dist/&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apt update &amp;amp;&amp;amp; apt install zip&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;zip -r app-$BITBUCKET_BUILD_NUMBER.zip dist package.json -x *.git* bitbucket-pipelines.yml&lt;/span&gt;
          &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.zip"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Security Scan&lt;/span&gt;
          &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Run a security scan for sensitive data.&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pipe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;atlassian/git-secrets-scan:0.5.1&lt;/span&gt;            

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to Production&lt;/span&gt;
        &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;manual&lt;/span&gt;
        &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production&lt;/span&gt;
        &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pipe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;atlassian/azure-web-apps-deploy:1.2.3&lt;/span&gt;
            &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;AZURE_APP_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_ID&lt;/span&gt;
              &lt;span class="na"&gt;AZURE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_PASSWORD&lt;/span&gt;
              &lt;span class="na"&gt;AZURE_TENANT_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_TENANT_ID&lt;/span&gt;
              &lt;span class="na"&gt;AZURE_RESOURCE_GROUP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_RESOURCE_GROUP&lt;/span&gt;
              &lt;span class="na"&gt;AZURE_APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_NAME&lt;/span&gt;
              &lt;span class="na"&gt;ZIP_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app-$BITBUCKET_BUILD_NUMBER.zip'&lt;/span&gt;
              &lt;span class="na"&gt;DEBUG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down each section:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Installation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
    &lt;span class="na"&gt;caches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rm -rf node_modules package-lock.json&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --legacy-peer-deps&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step cleans any existing dependencies and installs fresh ones. The &lt;code&gt;--legacy-peer-deps&lt;/code&gt; flag helps avoid compatibility issues between packages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Parallel Execution
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;parallel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
      &lt;span class="na"&gt;caches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;node&lt;/span&gt;
      &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;rm -rf node_modules package-lock.json&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm install --legacy-peer-deps&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;              
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mv web.config dist/&lt;/span&gt;  &lt;span class="c1"&gt;# Move web.config inside dist/&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;apt update &amp;amp;&amp;amp; apt install zip&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;zip -r app-$BITBUCKET_BUILD_NUMBER.zip dist package.json -x *.git* bitbucket-pipelines.yml&lt;/span&gt;
      &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.zip"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Security Scan&lt;/span&gt;
      &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pipe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;atlassian/git-secrets-scan:0.5.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This section runs two processes in parallel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt;: Compiles the frontend application and packages it for deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Scan&lt;/strong&gt;: Checks the codebase for accidentally committed secrets or sensitive data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The build step is particularly important as it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Builds the application&lt;/li&gt;
&lt;li&gt;Moves the &lt;code&gt;web.config&lt;/code&gt; file into the distribution folder&lt;/li&gt;
&lt;li&gt;Creates a zip archive with a unique name based on the build number&lt;/li&gt;
&lt;li&gt;Defines the zip file as an artifact to be used in later steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Deployment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;step&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to Production&lt;/span&gt;
    &lt;span class="na"&gt;trigger&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;manual&lt;/span&gt;
    &lt;span class="na"&gt;deployment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Production&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pipe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;atlassian/azure-web-apps-deploy:1.2.3&lt;/span&gt;
        &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_APP_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_ID&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_PASSWORD&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_TENANT_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_TENANT_ID&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_RESOURCE_GROUP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_RESOURCE_GROUP&lt;/span&gt;
          &lt;span class="na"&gt;AZURE_APP_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$AZURE_APP_NAME&lt;/span&gt;
          &lt;span class="na"&gt;ZIP_FILE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;app-$BITBUCKET_BUILD_NUMBER.zip'&lt;/span&gt;
          &lt;span class="na"&gt;DEBUG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has a manual trigger for controlled deployments&lt;/li&gt;
&lt;li&gt;Uses the Atlassian Azure Web Apps Deploy pipe&lt;/li&gt;
&lt;li&gt;Passes necessary Azure credentials stored as repository variables&lt;/li&gt;
&lt;li&gt;Specifies the zip file created in the build step&lt;/li&gt;
&lt;/ul&gt;
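&lt;p&gt;For debugging, it can help to know the rough manual equivalent of what the pipe does. A sketch, assuming a local &lt;code&gt;app.zip&lt;/code&gt; and the same variables exported in your shell:&lt;/p&gt;

```shell
# Log in as the service principal, then push the zip package.
# Assumes AZURE_* variables are set in the environment and app.zip exists locally.
az login --service-principal \
  --username "$AZURE_APP_ID" \
  --password "$AZURE_PASSWORD" \
  --tenant "$AZURE_TENANT_ID"

az webapp deploy \
  --resource-group "$AZURE_RESOURCE_GROUP" \
  --name "$AZURE_APP_NAME" \
  --src-path app.zip \
  --type zip
```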

&lt;h2&gt;
  
  
  Configuring web.config for React SPA on Azure
&lt;/h2&gt;

&lt;p&gt;One of the most challenging aspects of deploying a React SPA to Azure App Service is configuring URL rewriting correctly. Since SPAs use client-side routing, all routes need to be redirected to the &lt;code&gt;index.html&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Here's the &lt;code&gt;web.config&lt;/code&gt; file we're using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?xml version="1.0" encoding="utf-8"?&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;configuration&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;system.webServer&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;rewrite&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;rules&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;rule&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"React SPA"&lt;/span&gt; &lt;span class="na"&gt;stopProcessing=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;match&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;".*"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;conditions&lt;/span&gt; &lt;span class="na"&gt;logicalGrouping=&lt;/span&gt;&lt;span class="s"&gt;"MatchAll"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;input=&lt;/span&gt;&lt;span class="s"&gt;"{REQUEST_FILENAME}"&lt;/span&gt; &lt;span class="na"&gt;matchType=&lt;/span&gt;&lt;span class="s"&gt;"IsFile"&lt;/span&gt; &lt;span class="na"&gt;negate=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
                        &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;input=&lt;/span&gt;&lt;span class="s"&gt;"{REQUEST_FILENAME}"&lt;/span&gt; &lt;span class="na"&gt;matchType=&lt;/span&gt;&lt;span class="s"&gt;"IsDirectory"&lt;/span&gt; &lt;span class="na"&gt;negate=&lt;/span&gt;&lt;span class="s"&gt;"true"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;/conditions&amp;gt;&lt;/span&gt;
                    &lt;span class="nt"&gt;&amp;lt;action&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"Rewrite"&lt;/span&gt; &lt;span class="na"&gt;url=&lt;/span&gt;&lt;span class="s"&gt;"/index.html"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/rules&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/rewrite&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;staticContent&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;mimeMap&lt;/span&gt; &lt;span class="na"&gt;fileExtension=&lt;/span&gt;&lt;span class="s"&gt;".json"&lt;/span&gt; &lt;span class="na"&gt;mimeType=&lt;/span&gt;&lt;span class="s"&gt;"application/json"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;mimeMap&lt;/span&gt; &lt;span class="na"&gt;fileExtension=&lt;/span&gt;&lt;span class="s"&gt;".js"&lt;/span&gt; &lt;span class="na"&gt;mimeType=&lt;/span&gt;&lt;span class="s"&gt;"application/javascript"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;mimeMap&lt;/span&gt; &lt;span class="na"&gt;fileExtension=&lt;/span&gt;&lt;span class="s"&gt;".css"&lt;/span&gt; &lt;span class="na"&gt;mimeType=&lt;/span&gt;&lt;span class="s"&gt;"text/css"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;mimeMap&lt;/span&gt; &lt;span class="na"&gt;fileExtension=&lt;/span&gt;&lt;span class="s"&gt;".ts"&lt;/span&gt; &lt;span class="na"&gt;mimeType=&lt;/span&gt;&lt;span class="s"&gt;"application/javascript"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;mimeMap&lt;/span&gt; &lt;span class="na"&gt;fileExtension=&lt;/span&gt;&lt;span class="s"&gt;".tsx"&lt;/span&gt; &lt;span class="na"&gt;mimeType=&lt;/span&gt;&lt;span class="s"&gt;"application/javascript"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/staticContent&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;httpProtocol&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;customHeaders&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"Access-Control-Allow-Origin"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"*"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
                &lt;span class="nt"&gt;&amp;lt;add&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"X-Content-Type-Options"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"nosniff"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
            &lt;span class="nt"&gt;&amp;lt;/customHeaders&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;/httpProtocol&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/system.webServer&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/configuration&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Understanding the web.config configuration
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;URL Rewriting&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The rule matches all URLs (&lt;code&gt;match url=".*"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;It applies only when the requested URL doesn't match an existing file or directory&lt;/li&gt;
&lt;li&gt;It rewrites all such requests to &lt;code&gt;/index.html&lt;/code&gt;, allowing the React router to handle them&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MIME Type Configuration&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures proper MIME types for various file extensions&lt;/li&gt;
&lt;li&gt;Critical for browsers to correctly interpret JavaScript, CSS, and JSON files&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;HTTP Headers&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Access-Control-Allow-Origin: *&lt;/code&gt; - Enables cross-origin resource sharing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;X-Content-Type-Options: nosniff&lt;/code&gt; - Prevents MIME type sniffing security vulnerabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Overcoming Common Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Challenge 1: Path Mapping Differences in Azure
&lt;/h3&gt;

&lt;p&gt;Azure App Service has a unique architecture that uses different virtual and physical path mappings, which can cause issues when deploying frontend applications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Azure's Path Structure&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Physical Path&lt;/strong&gt;: The actual location on the server where your files are stored (e.g., &lt;code&gt;D:\home\site\wwwroot&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Path&lt;/strong&gt;: The URL path that users access (e.g., &lt;code&gt;https://yourapp.azurewebsites.net/path&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common Issues&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Path Configuration&lt;/strong&gt;: By default, Azure deploys your application to the root of the App Service (&lt;code&gt;/&lt;/code&gt;). If your frontend expects to be served from a subdirectory, path conflicts will occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Virtual Directory Mapping&lt;/strong&gt;: Azure allows you to map different physical folders to different virtual paths through the Azure Portal or configuration files.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the Azure Portal, navigate to your App Service → Configuration → Path mappings&lt;/li&gt;
&lt;li&gt;Ensure your application files are correctly mapped to the appropriate virtual path&lt;/li&gt;
&lt;li&gt;For single-page applications, make sure the root directory (&lt;code&gt;/&lt;/code&gt;) is correctly mapped to your application's distribution folder&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Alternatively, you can define virtual applications and directories in your &lt;code&gt;web.config&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;system.webServer&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;virtualDirectory&lt;/span&gt; &lt;span class="na"&gt;path=&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt; &lt;span class="na"&gt;physicalPath=&lt;/span&gt;&lt;span class="s"&gt;"%SystemDrive%\home\site\wwwroot\dist"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- Other configurations --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/system.webServer&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that when a user accesses the root URL, they're served content from the &lt;code&gt;dist&lt;/code&gt; folder where your built frontend application resides.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge 2: Web.config Configuration
&lt;/h3&gt;

&lt;p&gt;The biggest challenge is often configuring the web.config file correctly for client-side routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Our web.config file above solves this by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using the URL Rewrite module to rewrite all non-file, non-directory requests to index.html&lt;/li&gt;
&lt;li&gt;Configuring proper MIME types for all static assets&lt;/li&gt;
&lt;li&gt;Setting appropriate HTTP headers for security and cross-origin requests&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting Up Azure Environment Variables
&lt;/h2&gt;

&lt;p&gt;For the deployment to work, you need to configure these repository variables in Bitbucket:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;AZURE_APP_ID&lt;/code&gt;: Your Azure service principal ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_PASSWORD&lt;/code&gt;: Your Azure service principal password&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_TENANT_ID&lt;/code&gt;: Your Azure tenant ID&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_RESOURCE_GROUP&lt;/code&gt;: The resource group containing your App Service&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AZURE_APP_NAME&lt;/code&gt;: The name of your App Service&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can set these variables in Bitbucket by going to:&lt;br&gt;
Repository settings &amp;gt; Repository variables&lt;/p&gt;
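&lt;p&gt;Variables can also be created through Bitbucket's REST API, which is handy when provisioning many repositories. A sketch; the workspace, repository slug, and access token are placeholders:&lt;/p&gt;

```shell
# Create a secured repository variable via the Bitbucket API.
# WORKSPACE, REPO_SLUG, ACCESS_TOKEN, and the value are placeholders.
curl -X POST \
  -H "Authorization: Bearer ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"key": "AZURE_APP_ID", "value": "your-sp-id", "secured": true}' \
  https://api.bitbucket.org/2.0/repositories/WORKSPACE/REPO_SLUG/pipelines_config/variables/
```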

&lt;h2&gt;
  
  
  Testing the Deployment
&lt;/h2&gt;

&lt;p&gt;After deploying, you should verify that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your application loads correctly&lt;/li&gt;
&lt;li&gt;Deep linking works (you can navigate directly to any route)&lt;/li&gt;
&lt;li&gt;Static assets (images, CSS, JavaScript) load properly&lt;/li&gt;
&lt;li&gt;API calls work as expected&lt;/li&gt;
&lt;/ol&gt;
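&lt;p&gt;The deep-linking check in particular is easy to script: a client-side route should come back with HTTP 200 (served &lt;code&gt;index.html&lt;/code&gt;) rather than 404. A sketch; the URL and route are placeholders:&lt;/p&gt;

```shell
# Print the status code for a direct request to a client-side route.
# A 404 here means the URL rewrite rule is not taking effect.
curl -s -o /dev/null -w "%{http_code}\n" \
  https://my-frontend-app.azurewebsites.net/some/client/route
```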

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying a frontend application to Azure App Service using Bitbucket Pipelines provides a robust and automated deployment workflow. The key challenges around path mapping and web.config configuration can be overcome with proper configuration.&lt;/p&gt;

&lt;p&gt;By following this guide, you'll have a reliable CI/CD pipeline that builds, tests, and deploys your frontend application to Azure App Service, with proper routing for single-page applications.&lt;/p&gt;

&lt;p&gt;Remember that the most critical parts are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The correct web.config configuration&lt;/li&gt;
&lt;li&gt;Proper understanding of Azure's virtual and physical path mapping&lt;/li&gt;
&lt;li&gt;Secure handling of Azure credentials&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these in place, your deployments should be smooth and reliable.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Migrations with Sequelize-Typescript</title>
      <dc:creator>Anshul Jangale</dc:creator>
      <pubDate>Mon, 17 Mar 2025 10:44:20 +0000</pubDate>
      <link>https://dev.to/anshul_02/migrations-with-sequelize-typescript-174j</link>
      <guid>https://dev.to/anshul_02/migrations-with-sequelize-typescript-174j</guid>
      <description>&lt;p&gt;Sequelize is a popular ORM for Node.js, but using it with TypeScript—especially for migrations—can be challenging. This guide will walk you through setting up Sequelize properly with TypeScript, focusing on database migrations.&lt;/p&gt;

&lt;p&gt;Sequelize officially supports TypeScript, but its &lt;code&gt;sequelize-cli&lt;/code&gt; tool generates JavaScript files for migrations. This creates several issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TypeScript errors in your project&lt;/li&gt;
&lt;li&gt;ESLint parsing errors&lt;/li&gt;
&lt;li&gt;Inconsistent codebase with mixed JS/TS files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution Overview&lt;/strong&gt;&lt;br&gt;
We'll implement a proper TypeScript setup for Sequelize migrations by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating a custom .sequelizerc configuration&lt;/li&gt;
&lt;li&gt;Setting up a workflow for TypeScript migrations&lt;/li&gt;
&lt;li&gt;Configuring project files correctly&lt;/li&gt;
&lt;li&gt;Creating a script to automate migration generation&lt;/li&gt;
&lt;/ol&gt;
&lt;h1&gt;
  
  
  Step 1: Project Setup
&lt;/h1&gt;

&lt;p&gt;First, ensure your tsconfig.json is properly configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "compilerOptions": {
    "target": "es2016",
    "experimentalDecorators": true,
    "emitDecoratorMetadata": true,
    "module": "commonjs",
    "rootDir": "./src",
    "baseUrl": "./",
    "outDir": "./build",
    "esModuleInterop": true,
    "forceConsistentCasingInFileNames": true,
    "strict": true,
    "skipLibCheck": true
  },
  "exclude": ["node_modules", "__tests__"],
  "include": ["src/**/*.ts"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Step 2: Create a Custom .sequelizerc File
&lt;/h1&gt;

&lt;p&gt;Create a .sequelizerc file in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { resolve } = require('path');

module.exports = {
    config: resolve('build/db/config.js'),
    'seeders-path': resolve('build/db/seeders'),
    'migrations-path': resolve('build/db/migrations'),
    'models-path': resolve('db/models')
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Step 3: TypeScript Migration Workflow
&lt;/h1&gt;

&lt;p&gt;To create and run migrations with TypeScript:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate migration files using the CLI:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;npx sequelize-cli migration:create --name add-some-table&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The migration file will be generated in &lt;code&gt;build/db/migrations&lt;/code&gt;. Move it into your application's TypeScript migrations folder (e.g. &lt;code&gt;src/db/migrations&lt;/code&gt;) and rename it with a &lt;code&gt;.ts&lt;/code&gt; extension.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replace the content with this TypeScript template:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { QueryInterface, DataTypes } from 'sequelize';

/** @type {import("sequelize-cli").Migration} */
module.exports = {
    up: (queryInterface: QueryInterface): Promise&amp;lt;void&amp;gt; =&amp;gt; queryInterface.sequelize.transaction(
        async (transaction) =&amp;gt; {
          // here go all migration changes
        }
    ),

    down: (queryInterface: QueryInterface): Promise&amp;lt;void&amp;gt; =&amp;gt; queryInterface.sequelize.transaction(
        async (transaction) =&amp;gt; {
          // here go all migration undo changes
        }
    )
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Add the necessary changes to the migration file&lt;/li&gt;
&lt;/ul&gt;
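&lt;p&gt;As an illustration, here is what the filled-in template might look like for a hypothetical &lt;code&gt;users&lt;/code&gt; table. The &lt;code&gt;DataTypes&lt;/code&gt; stand-in below only keeps the sketch self-contained; in a real migration you would use the &lt;code&gt;DataTypes&lt;/code&gt; imported from &lt;code&gt;sequelize&lt;/code&gt;:&lt;/p&gt;

```typescript
// Sketch of a filled-in migration for a hypothetical "users" table.
// Stand-in for sequelize's DataTypes so this sketch runs on its own;
// in a real project use: import { QueryInterface, DataTypes } from "sequelize"
const DataTypes = { INTEGER: "INTEGER", STRING: "STRING", DATE: "DATE" };

const migration = {
  // Create the table inside a transaction, as in the template above
  up: (queryInterface: any) =>
    queryInterface.sequelize.transaction(async (transaction: any) => {
      await queryInterface.createTable(
        "users",
        {
          id: { type: DataTypes.INTEGER, primaryKey: true, autoIncrement: true },
          email: { type: DataTypes.STRING, allowNull: false, unique: true },
          createdAt: { type: DataTypes.DATE, allowNull: false },
          updatedAt: { type: DataTypes.DATE, allowNull: false },
        },
        { transaction }
      );
    }),

  // Undo: drop the table
  down: (queryInterface: any) =>
    queryInterface.sequelize.transaction(async (transaction: any) => {
      await queryInterface.dropTable("users", { transaction });
    }),
};

module.exports = migration;
```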

&lt;p&gt;&lt;strong&gt;Alternative: automate with a script&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To automate this step, you can create a script that generates the migration file for you.&lt;/p&gt;

&lt;p&gt;Instead of running &lt;code&gt;sequelize-cli migration:create&lt;/code&gt;, the script below creates the timestamped &lt;code&gt;.ts&lt;/code&gt; migration file directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import fs from 'fs';
import path from 'path';

// Get migration name from command-line arguments
const migrationName = process.argv[2];

if (!migrationName) {
  console.error('❌ Please provide a migration name.');
  process.exit(1);
}

// Define paths
const migrationsDir = path.resolve(__dirname, 'db/migrations');
const timestamp = new Date().toISOString().replace(/\D/g, '').slice(0, 14); // YYYYMMDDHHMMSS
const fileName = `${timestamp}-${migrationName}.ts`;
const filePath = path.join(migrationsDir, fileName);

// Migration template
const migrationTemplate = `import { QueryInterface, DataTypes } from "sequelize";

/** @type {import("sequelize-cli").Migration} */
export default {
  up: async (queryInterface: QueryInterface): Promise&amp;lt;void&amp;gt; =&amp;gt; {
    await queryInterface.sequelize.transaction(async (transaction) =&amp;gt; {

    //your code here 

    });
  },

  down: async (queryInterface: QueryInterface): Promise&amp;lt;void&amp;gt; =&amp;gt; {
    await queryInterface.sequelize.transaction(async (transaction) =&amp;gt; {

      //your code here

    });
  },
};

`;

// Ensure migrations directory exists
if (!fs.existsSync(migrationsDir)) {
  fs.mkdirSync(migrationsDir, { recursive: true });
}

// Write migration file
fs.writeFileSync(filePath, migrationTemplate);

console.log(`✅ Migration created: ${filePath}`);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this script:&lt;br&gt;
&lt;code&gt;npx ts-node path/to/generateMigration.ts add-some-table&lt;/code&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Step 4: Running Migrations
&lt;/h1&gt;

&lt;p&gt;To run migrations, you need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compile your TypeScript files&lt;/li&gt;
&lt;li&gt;Run the migrations on the compiled JavaScript files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add these scripts to your package.json:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
"scripts": {
  "dev": "ts-node-dev --respawn --transpile-only src/index.ts",
  "build": "tsc -p .",
  "start": "node build/index.js",
  "prestart": "npm install --production",
  "migrate:generate": "npx ts-node src/generateMigration.ts",
  "migrate": "tsc -p . &amp;amp;&amp;amp; npx sequelize-cli db:migrate",
  "migrate:undo": "npx sequelize-cli db:migrate:undo",
  "migrate:undo:all": "npx sequelize-cli db:migrate:undo:all"
},
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;By following these steps, you've successfully set up Sequelize with TypeScript for migrations. This approach gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type safety in your migrations&lt;/li&gt;
&lt;li&gt;Consistent codebase with TypeScript throughout&lt;/li&gt;
&lt;li&gt;Properly working ESLint&lt;/li&gt;
&lt;li&gt;Streamlined migration workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your development process is now more robust, and you can enjoy the benefits of TypeScript while using Sequelize migrations.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
