<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sankalp</title>
    <description>The latest articles on DEV Community by Sankalp (@sankalp_fabric_data_architect).</description>
    <link>https://dev.to/sankalp_fabric_data_architect</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3712508%2F0ea72dca-cfd9-430e-b798-3093e7b1780a.png</url>
      <title>DEV Community: Sankalp</title>
      <link>https://dev.to/sankalp_fabric_data_architect</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sankalp_fabric_data_architect"/>
    <language>en</language>
    <item>
      <title>Shortcut &amp; Mirroring</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Sun, 26 Apr 2026 19:01:30 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/shortcut-mirroring-jma</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/shortcut-mirroring-jma</guid>
<description>&lt;p&gt;&lt;strong&gt;Shortcut:&lt;/strong&gt; Points to the original data source; it is only a virtual connection, and the data is not stored physically in the target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mirroring:&lt;/strong&gt; Data is physically replicated to and stored in the target location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros &amp;amp; cons&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7y79wex07okl8lxwil2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7y79wex07okl8lxwil2.png" alt=" " width="800" height="168"&gt;&lt;/a&gt;&lt;/p&gt;
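
&lt;p&gt;A minimal PySpark sketch (the shortcut name &lt;code&gt;ext_orders&lt;/code&gt; is hypothetical, and &lt;code&gt;spark&lt;/code&gt; is the notebook session): consuming a shortcut looks exactly like reading a local lakehouse table, because only the reference lives in the target.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# "Tables/ext_orders" is a shortcut pointing at data in another
# workspace or an external store; no bytes were copied here.
df = spark.read.format("delta").load("Tables/ext_orders")
df.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;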

</description>
      <category>beginners</category>
      <category>computerscience</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>MS Fabric Architect Interview Questions</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:48:01 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/ms-fabric-architect-interview-questions-50mm</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/ms-fabric-architect-interview-questions-50mm</guid>
      <description>&lt;p&gt;&lt;strong&gt;Q1. What is Microsoft Fabric?&lt;/strong&gt;&lt;br&gt;
Unified SaaS data platform combining Data Engineering, Data Science, Data Warehouse, Real-Time Analytics, and Power BI&lt;br&gt;
Built on OneLake (a single, unified data lake)&lt;br&gt;
Eliminates the need to run separate services such as ADF, Synapse, and Power BI&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q2. What is OneLake?&lt;/strong&gt;&lt;br&gt;
Central storage layer (like OneDrive for data)&lt;br&gt;
Uses Delta Lake format&lt;br&gt;
Supports shortcuts (no data duplication)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q3. Difference between Microsoft Fabric and Azure Data Factory?&lt;/strong&gt;&lt;br&gt;
ADF → orchestration + pipelines only&lt;br&gt;
Fabric → end-to-end platform (storage + compute + BI)&lt;br&gt;
Fabric pipelines ≈ ADF but tightly integrated with the lakehouse&lt;/p&gt;

&lt;p&gt;🔹 2. Architecture &amp;amp; Design&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q4. How do you design a scalable Fabric architecture for 1000+ customers?&lt;/strong&gt;&lt;br&gt;
Expected points:&lt;br&gt;
Workspace strategy (per customer vs domain-based)&lt;br&gt;
Capacity planning (F SKU sizing)&lt;br&gt;
Data isolation (schemas, folders, lakehouses)&lt;br&gt;
Use of shortcuts for shared datasets&lt;br&gt;
Governance via Purview&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5. What are Lakehouse and Warehouse in Fabric?&lt;/strong&gt;&lt;br&gt;
Lakehouse&lt;br&gt;
Files + tables (Delta format)&lt;br&gt;
Good for data engineering &amp;amp; ML&lt;br&gt;
Warehouse&lt;br&gt;
SQL-based analytics&lt;br&gt;
Optimized for BI queries&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q6. When would you use Lakehouse vs Warehouse?&lt;/strong&gt;&lt;br&gt;
Lakehouse → ingestion, transformation, ML&lt;br&gt;
Warehouse → reporting, star schema, Power BI&lt;/p&gt;

&lt;p&gt;🔹 3. Data Engineering &amp;amp; Pipelines&lt;br&gt;
&lt;strong&gt;Q7. How do Fabric Data Pipelines differ from ADF pipelines?&lt;/strong&gt;&lt;br&gt;
Similar UI and activities&lt;br&gt;
Fabric pipelines are tightly integrated with OneLake&lt;br&gt;
No need for a separate integration runtime (IR), mostly&lt;br&gt;
Better native support for Lakehouse&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q8. Explain incremental load strategies in Fabric.&lt;/strong&gt;&lt;br&gt;
Common strategies (expect follow-ups on each):&lt;br&gt;
Watermark (last run timestamp)&lt;br&gt;
CDC (Change Data Capture)&lt;br&gt;
Delta table merge&lt;br&gt;
Using Copy Activity with filters&lt;/p&gt;
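
&lt;p&gt;A minimal PySpark sketch of the watermark approach (table and column names are hypothetical, and &lt;code&gt;spark&lt;/code&gt; is the notebook session):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Read the last successful load time from a control table, then
# pull only the rows that changed after that watermark.
last_run = (spark.read.table("control.watermarks")
            .filter("table_name = 'orders'")
            .first()["last_run_ts"])

incremental = (spark.read.table("source.orders")
               .filter(f"modified_ts &amp;gt; '{last_run}'"))

incremental.write.format("delta").mode("append").saveAsTable("bronze.orders")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;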

&lt;p&gt;&lt;strong&gt;Q9. How do you implement pagination in Fabric pipelines?&lt;/strong&gt;&lt;br&gt;
(Often asked as a scenario, e.g., ingesting from a paginated REST API such as an ESRI service)&lt;br&gt;
Use Until loop&lt;br&gt;
Maintain offset variable&lt;br&gt;
Call API using Copy/Web activity&lt;br&gt;
Append to Lakehouse table&lt;/p&gt;
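
&lt;p&gt;The same logic sketched in Python for clarity (hypothetical endpoint; ESRI-style parameter names); in a Fabric pipeline this maps to an Until loop that increments an offset variable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

base_url = "https://example.com/arcgis/rest/services/layer/query"
offset, page_size, rows = 0, 1000, []

# Keep fetching until a page comes back empty (the "Until loop").
while True:
    resp = requests.get(base_url, params={"resultOffset": offset,
                                          "resultRecordCount": page_size,
                                          "f": "json"})
    resp.raise_for_status()
    batch = resp.json().get("features", [])
    if not batch:
        break
    rows.extend(batch)      # in a pipeline: append to a Lakehouse table
    offset += page_size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;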

&lt;p&gt;🔹 4. Delta Lake &amp;amp; Data Modeling&lt;br&gt;
&lt;strong&gt;Q10. What is Delta Lake and why is it important in Fabric?&lt;/strong&gt;&lt;br&gt;
ACID transactions&lt;br&gt;
Time travel&lt;br&gt;
Schema evolution&lt;br&gt;
Supports incremental loads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q11. How do you handle slowly changing dimensions (SCD) in Fabric?&lt;/strong&gt;&lt;br&gt;
Use MERGE in Delta tables&lt;br&gt;
dbt snapshots (if using dbt)&lt;br&gt;
Maintain valid_from, valid_to&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q12. Bronze, Silver, Gold architecture in Fabric?&lt;/strong&gt;&lt;br&gt;
Bronze → raw ingestion&lt;br&gt;
Silver → cleaned/transformed&lt;br&gt;
Gold → business-ready&lt;/p&gt;

&lt;p&gt;🔹 5. Performance Optimization&lt;br&gt;
&lt;strong&gt;Q13. How do you optimize performance in Fabric Lakehouse?&lt;/strong&gt;&lt;br&gt;
Partitioning (date/customer)&lt;br&gt;
Z-ordering&lt;br&gt;
File size optimization (avoid small files)&lt;br&gt;
Caching&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q14. What is a shortcut in OneLake, and when should you use it?&lt;/strong&gt;&lt;br&gt;
Reference external data without copying&lt;br&gt;
Useful for multi-workspace sharing&lt;/p&gt;

&lt;p&gt;🔹 6. Security &amp;amp; Governance&lt;br&gt;
&lt;strong&gt;Q15. How do you secure data in Fabric?&lt;/strong&gt;&lt;br&gt;
Workspace-level access&lt;br&gt;
Row-level security (Power BI)&lt;br&gt;
Object-level security&lt;br&gt;
Integration with Purview&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q16. How do you manage multi-tenant data securely?&lt;/strong&gt;&lt;br&gt;
Separate workspaces OR schemas&lt;br&gt;
Use RBAC&lt;br&gt;
Data masking&lt;/p&gt;

&lt;p&gt;🔹 7. Real-Time &amp;amp; Advanced Topics&lt;br&gt;
&lt;strong&gt;Q17. What is Real-Time Analytics in Fabric?&lt;/strong&gt;&lt;br&gt;
Event streams + KQL database&lt;br&gt;
Used for IoT/log analytics&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q18. How would you design IoT data ingestion in Fabric?&lt;/strong&gt;&lt;br&gt;
Event streaming → KQL DB&lt;br&gt;
Store raw in Lakehouse&lt;br&gt;
Transform to Delta tables&lt;br&gt;
Serve via Power BI&lt;/p&gt;

&lt;p&gt;🔹 8. Scenario-Based Questions (VERY IMPORTANT)&lt;br&gt;
&lt;strong&gt;Q19. A client has 700+ customers and 1000+ workspaces. How would you optimize?&lt;/strong&gt;&lt;br&gt;
Consolidate workspaces (domain-based)&lt;br&gt;
Use shortcuts instead of duplication&lt;br&gt;
Central governance&lt;br&gt;
Capacity optimization&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q20. API data ingestion with pagination and failure handling?&lt;/strong&gt;&lt;br&gt;
Until loop&lt;br&gt;
Retry logic&lt;br&gt;
Logging table&lt;br&gt;
Idempotent loads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q21. How do you handle data quality in Fabric?&lt;/strong&gt;&lt;br&gt;
A data-quality (DQ) rules table&lt;br&gt;
PySpark validation&lt;br&gt;
Separate failed records&lt;br&gt;
Monitoring dashboards&lt;/p&gt;

&lt;p&gt;🔹 9. Integration with Other Tools&lt;br&gt;
&lt;strong&gt;Q22. How does Fabric integrate with dbt?&lt;/strong&gt;&lt;br&gt;
Use dbt with Lakehouse/Warehouse&lt;br&gt;
dbt models for transformation&lt;br&gt;
dbt snapshots for SCD&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q23. Can Fabric replace Snowflake?&lt;/strong&gt;&lt;br&gt;
Depends:&lt;br&gt;
Fabric → unified + cheaper (in some cases)&lt;br&gt;
Snowflake → mature + strong performance&lt;br&gt;
Many orgs use hybrid&lt;/p&gt;

&lt;p&gt;🔹 10. Trick / Deep Questions&lt;br&gt;
&lt;strong&gt;Q24. What are limitations of Fabric?&lt;/strong&gt;&lt;br&gt;
Still evolving&lt;br&gt;
Some enterprise features missing vs Synapse/Snowflake&lt;br&gt;
Capacity-based pricing challenges&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q25. How does compute work in Fabric?&lt;/strong&gt;&lt;br&gt;
Capacity-based (F SKUs)&lt;br&gt;
Shared compute across workloads&lt;/p&gt;

&lt;p&gt;🔥 How to Prepare Smartly&lt;br&gt;
If you have already worked on pagination pipelines, incremental loads, and dbt + Snowflake, focus on:&lt;br&gt;
Mapping that experience to Fabric concepts&lt;br&gt;
Scenario-based answers (interviewers love these)&lt;br&gt;
Architecture decisions, not just feature lists&lt;/p&gt;

</description>
    </item>
    <item>
      <title>ACID with delta table</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Mon, 19 Jan 2026 09:25:51 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/acid-with-delta-table-47li</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/acid-with-delta-table-47li</guid>
      <description>&lt;p&gt;************************ ACID property *******************************&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Atomicity:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Either all or nothing.&lt;/li&gt;
&lt;li&gt;A transaction must complete all of its operations successfully before committing; otherwise the entire transaction is rolled back.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Consistency:&lt;/strong&gt;    &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;E.g., the total balance of accounts A &amp;amp; B must be the same before and after the transaction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Isolation:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent transactions behave as if they were executed in some serial schedule, which keeps the data consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Durability:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changes made by a committed transaction are permanent, even after a system failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;********************** Parquet Vs Delta file format *******************&lt;/p&gt;

&lt;p&gt;Parquet:&lt;br&gt;&lt;br&gt;
Type        :   Columnar storage format.&lt;br&gt;
Optimized for   :   Efficient read performance, especially for data analytics.&lt;/p&gt;

&lt;p&gt;Key features    :   1.) Column-wise compression --&amp;gt; reduces file size.&lt;br&gt;
                2.) Splittable files        --&amp;gt; parallel processing.&lt;br&gt;
                3.) Works well with Hive, Spark, BigQuery, etc.&lt;/p&gt;

&lt;p&gt;Delta:&lt;br&gt;&lt;br&gt;
    Built on    : Parquet format + a transactional layer (_delta_log).&lt;br&gt;
    Optimized for   : Reliable, scalable data lakes with ACID transaction support. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key features    :   1.) Acid transactions (safe read/write)

            2.) Schema Enforcement        :     Delta Lake ensures that the data written to a table matches the table’s schema. 
                                    This prevents issues like inserting a string into a column that expects an integer.

            3.) Schema Evolution          :     When enabled, Delta can automatically adapt to changes in the schema (e.g., adding new                                      columns) during write operations. This is useful for agile data pipelines where the                                         schema may evolve over time.

            4.) Time Travel (Query Past Versions) : Delta Lake maintains a transaction log (_delta_log) that records every change to the                                        data. You can query a table as it existed at a specific point in time or version using.                                         This is useful for debugging, auditing.                                                             
                                    e.g. 
                                    SELECT * FROM table_name VERSION AS OF 5;
                                    -- or
                                    SELECT * FROM table_name TIMESTAMP AS OF '2025-06-10T12:00:00';

            5.)     Ideal for Streaming + Batch (Unified Workflows):
                                    &amp;gt;   Delta Lake supports both streaming and batch reads/writes on the same table.
                                    &amp;gt;   This unification simplifies architecture: you don’t need separate pipelines or                                            storage for real-time and historical data.

                                    e.g. 
                                    You can ingest real-time data using Spark Structured Streaming and run batch analytics                                      on the same Delta table.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Conclusion:&lt;br&gt;
        Use Parquet when you need fast, storage-efficient analytics on append-only data.&lt;br&gt;
        Use Delta when you need reliability, schema control, time travel, and transactional operations on top of Parquet.&lt;/p&gt;
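
&lt;p&gt;A minimal PySpark sketch of the difference (paths are hypothetical): the same DataFrame written both ways, with time travel only possible on the Delta side.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Plain Parquet: fast columnar files, but no transaction log.
df.write.mode("overwrite").parquet("Files/demo_parquet")

# Delta: the same Parquet files plus a _delta_log directory,
# which enables ACID commits, schema enforcement and time travel.
df.write.format("delta").mode("overwrite").save("Tables/demo_delta")

# Read the table as it existed at version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("Tables/demo_delta")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;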

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        ************************ SCD Types *******************************
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;SCD Type 0  :   Fixed, no changes allowed (e.g., Date of Birth).&lt;/p&gt;

&lt;p&gt;SCD Type 1  :   Overwrite, Old data is overwritten. No history is kept. Simple but loses historical data.&lt;/p&gt;

&lt;p&gt;SCD Type 2  :   Add Row, New row for every change with versioning or effective dates. Full history preserved.&lt;/p&gt;

&lt;p&gt;SCD Type 3  :   Add Column, Adds a new column to track previous value. Limited history (usually just 1 change).&lt;/p&gt;

&lt;p&gt;SCD Type 4  :   History Table, Separate historical table stores changes; main table holds current data. Good for large history storage.&lt;/p&gt;

&lt;p&gt;SCD Type 6  :   Hybrid, Combination of Types 1, 2, and 3. Tracks history with current data easily accessible.&lt;/p&gt;

</description>
    </item>
    <item>
<title>when multiple jobs write the same delta table</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 11:33:06 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/when-multiple-job-write-same-delta-table-1lo8</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/when-multiple-job-write-same-delta-table-1lo8</guid>
<description>&lt;p&gt;Delta tables solve this using their ACID properties.&lt;br&gt;
Each write operation commits to the transaction log; when Delta detects that multiple concurrent jobs conflict, it safely fails one job instead of silently corrupting data.&lt;/p&gt;
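
&lt;p&gt;A minimal PySpark sketch (hypothetical table path; requires the delta-spark package) of catching the conflict and retrying the losing job:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from delta.exceptions import ConcurrentAppendException
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "Tables/events")
updates = spark.createDataFrame([(1, "a")], ["id", "value"])

def upsert(df):
    (target.alias("t")
        .merge(df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

try:
    upsert(updates)
except ConcurrentAppendException:
    # Delta aborted this commit instead of corrupting the table;
    # the conflicting writer can simply retry against the new version.
    upsert(updates)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;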

</description>
    </item>
    <item>
      <title>sort merge join, hash join, nested loop join</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 11:24:41 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/sort-merge-join-hash-join-nested-loop-join-f0h</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/sort-merge-join-hash-join-nested-loop-join-f0h</guid>
<description>&lt;p&gt;&lt;strong&gt;nested loop join:&lt;/strong&gt;&lt;br&gt;
Usable for small data sets.&lt;br&gt;
Easy to implement, but it scans every row to find matching records, so performance degrades when the tables are big.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;sort merge join:&lt;/strong&gt;&lt;br&gt;
Works well for large tables, but the data must be sorted (or bucketed), and sorting is an expensive operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;hash join:&lt;/strong&gt;&lt;br&gt;
Works well for large tables, but needs extra memory for the hash table.&lt;br&gt;
The hash table groups records by join key, so matching rows can be looked up directly by hash.&lt;/p&gt;
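
&lt;p&gt;A minimal PySpark sketch (toy DataFrames) of steering Spark toward each strategy with join hints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql.functions import broadcast

small = spark.range(100).withColumnRenamed("id", "k")
large = spark.range(1_000_000).withColumnRenamed("id", "k")

# Broadcast hash join: ship the small side to every executor,
# avoiding a shuffle of the large side.
bhj = large.join(broadcast(small), "k")

# Sort-merge join: both sides are shuffled and sorted on the key.
smj = large.join(small.hint("merge"), "k")

# Shuffle hash join: builds an in-memory hash table per partition.
shj = large.join(small.hint("shuffle_hash"), "k")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;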

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=-htbah3eCYg" rel="noopener noreferrer"&gt;&lt;/a&gt; &lt;/p&gt;

</description>
    </item>
    <item>
      <title>starter pool and custom pool/ spark pool</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 09:23:25 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/starter-pool-and-custom-pool-spark-pool-55i3</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/starter-pool-and-custom-pool-spark-pool-55i3</guid>
      <description>&lt;p&gt;&lt;strong&gt;starter pool:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good for testing notebooks: there are always-active, medium-size nodes with default libraries, dynamic allocation, and autoscale (1 to 10 nodes).&lt;br&gt;
This helps initialize a session in 5 to 10 seconds.&lt;br&gt;
Charges apply only while a session is active.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;spark pool/ custom pool:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users can configure node size, autoscaling, and libraries to match the demands of the workload.&lt;/p&gt;
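
&lt;p&gt;A minimal sketch, assuming Fabric supports the Livy-style &lt;code&gt;%%configure&lt;/code&gt; notebook magic (all values are illustrative), of shaping a session to the workload:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 2
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;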

</description>
    </item>
    <item>
      <title>Microsoft purview - short detail</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:52:35 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/microsoft-purview-short-detail-3kcg</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/microsoft-purview-short-detail-3kcg</guid>
<description>&lt;p&gt;&lt;strong&gt;Purview&lt;/strong&gt; is Microsoft's family of data governance, risk, and compliance solutions, integrated with Fabric.&lt;br&gt;
It helps govern, protect, and manage your entire data estate across cloud, on-premises, and SaaS sources.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;provides lineage across the entire data estate for data governance&lt;/li&gt;
&lt;li&gt;automatically surfaces the metadata of all Fabric items in the Purview unified catalog&lt;/li&gt;
&lt;li&gt;allows viewing and managing data from the Purview catalog&lt;/li&gt;
&lt;li&gt;protects data using sensitivity labels (labels can be defined in the catalog)&lt;/li&gt;
&lt;li&gt;DLP (Data Loss Prevention) policies can be applied to Power BI semantic models (e.g., detecting credit card numbers and generating alerts)&lt;/li&gt;
&lt;li&gt;all Fabric user activity is logged in the Purview audit log&lt;/li&gt;
&lt;li&gt;the Purview hub, accessible to Fabric admins, offers dashboards for insights&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Z-Ordering optimization</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:30:57 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/z-ordering-optimization-414a</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/z-ordering-optimization-414a</guid>
<description>&lt;p&gt;Z-Ordering is a technique that co-locates related information in the same set of files.&lt;br&gt;
This feature improves read performance dramatically because related data can be read from a small set of files. &lt;/p&gt;

&lt;p&gt;E.g.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;&lt;strong&gt;BEFORE Z-Ordering -&lt;/strong&gt;&lt;/u&gt;&lt;/p&gt;

&lt;p&gt;Data files are not organized by customer_id or order_date&lt;br&gt;
Spark has no idea where the relevant rows live&lt;/p&gt;

&lt;p&gt;So it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scans ~2,500 out of 2,700 files&lt;/li&gt;
&lt;li&gt;Reads a huge amount of data&lt;/li&gt;
&lt;li&gt;Causes high disk I/O&lt;/li&gt;
&lt;li&gt;Takes a long time&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Result:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large scan&lt;/li&gt;
&lt;li&gt;Slow query&lt;/li&gt;
&lt;li&gt;Wasted resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;AFTER Z-Ordering -&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happens now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark knows which files are likely to contain customer_id = 101&lt;/li&gt;
&lt;li&gt;It skips irrelevant files (data skipping)&lt;/li&gt;
&lt;li&gt;It reads only ~120 files instead of ~2,500&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Result:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small scan&lt;/li&gt;
&lt;li&gt;Low I/O&lt;/li&gt;
&lt;li&gt;Much faster query&lt;/li&gt;
&lt;/ul&gt;
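
&lt;p&gt;A minimal PySpark sketch (table and column names are hypothetical) of applying it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rewrite the table's files so rows with nearby customer_id values
# are co-located, enabling the data skipping described above.
spark.sql("OPTIMIZE orders ZORDER BY (customer_id)")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;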

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnth32dq1j36r5tk1o9xn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnth32dq1j36r5tk1o9xn.png" alt=" " width="517" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Refer to the Delta Lake documentation for configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.delta.io/optimizations-oss/#z-ordering-multi-dimensional-clustering" rel="noopener noreferrer"&gt;https://docs.delta.io/optimizations-oss/#z-ordering-multi-dimensional-clustering&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>OPTIMIZE property configuration in fabric</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 07:22:10 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/optimize-property-configuration-in-fabric-4hf6</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/optimize-property-configuration-in-fabric-4hf6</guid>
<description>&lt;p&gt;Spark performs very well on large, standard-size files, but problems start when it has to deal with many small files at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OPTIMIZE coalesces many small files into larger ones to maintain a balanced, standard file size.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It dynamically optimizes the partitions by generating files of 128 MB by default (the default size can be changed as per requirement). &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;maintains the benefits of V-Order and Z-Order&lt;/li&gt;
&lt;li&gt;coalesces small files into large, balanced files (no matter how many tuples per file)&lt;/li&gt;
&lt;li&gt;auto compaction of Delta tables and files&lt;/li&gt;
&lt;li&gt;no impact on reading the Delta table before and after OPTIMIZE&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
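
&lt;p&gt;A minimal PySpark sketch (the table name is hypothetical) of compacting a table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Coalesce many small files into ~128 MB files; readers see the
# same rows before and after the rewrite.
spark.sql("OPTIMIZE sales")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;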

&lt;p&gt;Refer to the Delta Lake documentation for configuration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.delta.io/optimizations-oss/" rel="noopener noreferrer"&gt;https://docs.delta.io/optimizations-oss/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
<title>as per workload - resource profile</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 06:44:08 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/as-per-work-load-resource-profile-4i4d</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/as-per-work-load-resource-profile-4i4d</guid>
<description>&lt;p&gt;Resource profiles can be configured to match the workload, as listed below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;readHeavyForSpark&lt;/li&gt;
&lt;li&gt;readHeavyForPBI&lt;/li&gt;
&lt;li&gt;writeHeavy&lt;/li&gt;
&lt;li&gt;custom&lt;/li&gt;
&lt;/ul&gt;
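
&lt;p&gt;A minimal sketch, assuming the &lt;code&gt;spark.fabric.resourceProfile&lt;/code&gt; property name from the documentation linked below, of switching the profile for the current session:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Assumed property name (see the Microsoft doc linked below):
# tune the session for write-heavy ingestion workloads.
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;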

&lt;p&gt;Refer to the official Microsoft documentation for configuration in MS Fabric:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/fabric/data-engineering/configure-resource-profile-configurations" rel="noopener noreferrer"&gt;https://learn.microsoft.com/en-us/fabric/data-engineering/configure-resource-profile-configurations&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>V-Order optimization</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 06:17:23 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/v-order-optimization-388g</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/v-order-optimization-388g</guid>
<description>&lt;p&gt;V-Order optimizes parquet files through sorting, row-group distribution, encoding, and compression.&lt;/p&gt;

&lt;p&gt;The disadvantage of V-Order is that it can increase write times by up to 15%. On the positive side, it boosts data compression by about 50% and improves read times by around 10%, and in some cases by up to 50%.&lt;/p&gt;

&lt;p&gt;Any parquet engine can read it as a regular parquet file.&lt;br&gt;
There is no impact on other Delta table features, like Z-Order, vacuum, time travel, compaction, etc.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;V-Order is disabled by default&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03h7zti8x77q4bk0h0s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03h7zti8x77q4bk0h0s3.png" alt=" " width="634" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In Fabric runtime 1.3 and higher versions, the spark.sql.parquet.vorder.enable setting is removed. As V-Order is applied automatically during Delta optimization using OPTIMIZE statements, there's no need to manually enable this setting in newer runtime versions. If you're migrating code from an earlier runtime version, you can remove this setting, as the engine now handles it automatically.&lt;/p&gt;
&lt;/blockquote&gt;
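
&lt;p&gt;A minimal PySpark sketch (the table name is hypothetical), assuming the &lt;code&gt;VORDER&lt;/code&gt; clause shown in the linked documentation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Compact the table and apply V-Order to the rewritten files.
spark.sql("OPTIMIZE sales VORDER")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;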

&lt;p&gt;&lt;a href="https://learn.microsoft.com/en-us/fabric/data-engineering/delta-optimization-and-v-order?tabs=sparksql" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>partition pruning</title>
      <dc:creator>Sankalp</dc:creator>
      <pubDate>Fri, 16 Jan 2026 05:59:52 +0000</pubDate>
      <link>https://dev.to/sankalp_fabric_data_architect/partition-pruning-2fdp</link>
      <guid>https://dev.to/sankalp_fabric_data_architect/partition-pruning-2fdp</guid>
      <description>&lt;p&gt;Working with Big Data often presents the challenge of slow query results due to the overhead of scanning massive datasets. Optimization involves more than just how you read or aggregate data; for high-performance scanning, data must be organized in a way that the Spark engine can consume efficiently. &lt;br&gt;
This is where partition pruning becomes essential. If data is well-partitioned within storage systems like HDFS, S3, or ADLS, Spark queries will only scan the specific partition folders required, significantly reducing processing time. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As you can see in the code below, instead of one giant file, Spark creates a directory hierarchy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;df.write.partitionBy("Year", "Month").parquet("/data/consumer")&lt;/code&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Physical Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;/data/consumer/Year=2023/Month=01/&lt;/li&gt;
&lt;li&gt;/data/consumer/Year=2023/Month=02/&lt;/li&gt;
&lt;li&gt;/data/consumer/Year=2024/Month=01/&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Partition Pruning happens automatically when you query the data. &lt;br&gt;
If you run: &lt;br&gt;
&lt;code&gt;from pyspark.sql.functions import col&lt;br&gt;df = spark.read.parquet("/data/consumer").filter(col("Year") == 2024)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Spark looks at the query, looks at the folder structure, and says, "Okay, the user only wants 2024. I am going to completely ignore the 'Year=2023' folder. I won't even list the files inside it."&lt;/p&gt;

&lt;p&gt;This can turn a 10TB scan into a 100GB scan instantly.&lt;/p&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
