Z-Ordering optimization

Z-Ordering is a technique for co-locating related information in the same set of files.
It can speed up reads considerably, because a query that filters on the Z-ordered columns finds the rows it needs in a small set of files instead of spread across the whole table.
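To make "co-locating" concrete, here is a minimal sketch of the idea behind a Z-order curve: the bits of two column values are interleaved into a single sort key, so rows that are close in both columns end up next to each other. This is an illustration only, not the engine's actual implementation; the function name `interleave_bits` and the small integer grid standing in for (customer_id, order day) pairs are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return z


# Hypothetical rows: (customer_id, order_day) pairs on a tiny 4 x 4 grid.
rows = [(x, y) for x in range(4) for y in range(4)]

# Sorting by the interleaved key keeps neighbours in BOTH columns together.
rows_z = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

print(rows_z[:4])  # the first four rows form one 2 x 2 block of the grid
```

Sorting on a single column would cluster only that column. The interleaved key trades a little clustering on each column for reasonable clustering on both, which is why Z-Ordering suits queries that filter on more than one column.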

E.g. a query that filters on customer_id = 101:

BEFORE Z-Ordering -

Data files are not organized by customer_id or order_date
Spark has no idea where the relevant rows live

So it:

Scans ~2,500 out of 2,700 files
Reads a huge amount of data
Causes high disk I/O
Takes a long time

Result:

Large scan
Slow query
Wasted resources

AFTER Z-Ordering -

What happens now:

Spark can tell from the per-file min/max statistics which files are likely to contain customer_id = 101
It skips irrelevant files (data skipping)
It reads only ~120 files instead of ~2,500

Result:

Small scan
Low I/O
Much faster query
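The before/after contrast can be reproduced with a small toy model. The sketch below splits the same rows into "files" twice, once in random order and once sorted by a Z-order key, records min/max statistics per file, and counts how many files a filter on customer_id = 101 must open. Everything here is an assumption made for illustration (the 256 x 256 grid, 256 rows per file, the helper names); the counts it produces belong to this toy model and are not the ~2,500 and ~120 figures quoted above.

```python
import random


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows, rows_per_file):
    """Cut rows into fixed-size 'files', keeping only min/max customer_id per file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "min_customer": min(r[0] for r in chunk),
            "max_customer": max(r[0] for r in chunk),
        })
    return files


def files_to_scan(files, customer_id):
    """Data skipping: a file is read only if its min/max range can contain the value."""
    return sum(1 for f in files
               if f["min_customer"] <= customer_id <= f["max_customer"])


random.seed(0)
rows = [(c, d) for c in range(256) for d in range(256)]  # (customer_id, order_day)

random.shuffle(rows)                       # BEFORE: no useful ordering
before = files_to_scan(write_files(rows, 256), 101)

rows.sort(key=lambda r: interleave_bits(r[0], r[1]))  # AFTER: Z-ordered layout
after = files_to_scan(write_files(rows, 256), 101)

print(f"files scanned before: {before}, after: {after}")
```

In the shuffled layout nearly every file's min/max range covers 101, so almost nothing can be skipped. In the Z-ordered layout each file covers a narrow band of customer_id values, and most files are ruled out from their statistics alone.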

Refer to the Microsoft documentation for configuration details.
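As a rough sketch of what applying it looks like, assuming the table is stored as a Delta Lake table: Z-Ordering is applied with the `OPTIMIZE ... ZORDER BY` SQL command. The helper below only builds that statement as a string; the helper name `build_zorder_statement` and the table name `sales.orders` are hypothetical, and the columns are the ones used in the example above.

```python
def build_zorder_statement(table: str, columns: list[str]) -> str:
    """Build a Delta Lake OPTIMIZE statement that Z-orders `table` by `columns`."""
    if not columns:
        raise ValueError("at least one Z-order column is required")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# Hypothetical table name; columns taken from the example in this post.
statement = build_zorder_statement("sales.orders", ["customer_id", "order_date"])
print(statement)

# In a Spark session with Delta Lake available, the statement would be run as:
#   spark.sql(statement)
```

Z-order by the columns that appear most often in query filters. Each extra column weakens the clustering on the others, so a short list usually works better than a long one.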
