DEV Community

Sankalp

Z-Ordering optimization

Z-Ordering is a technique for co-locating related information in the same set of files.
This can improve read performance considerably, because a query that filters on the Z-Ordered columns finds the rows it needs in a small number of files instead of scanning most of the table.
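
To make "co-locate" concrete: the name comes from the Z-order (Morton) curve, which interleaves the bits of several column values into a single sort key, so rows that are close in every column end up close together in the sorted order. The snippet below is a minimal illustrative sketch in plain Python, not Delta Lake's actual implementation; the function name `z_value` and the 16-bit width are assumptions made for the example.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Hypothetical (customer_bucket, date_bucket) pairs, small integers for illustration.
points = [(3, 3), (0, 1), (2, 3), (1, 0), (0, 0), (1, 1)]

# Sorting by the interleaved key groups points that are close in BOTH dimensions.
ordered = sorted(points, key=lambda p: z_value(p[0], p[1]))
print(ordered)
```

Sorting rows by such a key and writing them out in chunks is what lets each file cover a narrow range of both columns at once.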

For example, take a query that filters on customer_id = 101.

BEFORE Z-Ordering:

Data files are not organized by customer_id or order_date.
Spark has no way to tell which files hold the relevant rows.

So it:

  1. Scans ~2,500 out of 2,700 files
  2. Reads a huge amount of data
  3. Causes high disk I/O
  4. Takes a long time

Result:

  • Large scan
  • Slow query
  • Wasted resources

AFTER Z-Ordering:

What happens now:

  1. Spark can tell from per-file min/max column statistics which files are likely to contain customer_id = 101
  2. It skips the irrelevant files (data skipping)
  3. It reads only ~120 files instead of ~2,500

Result:

  • Small scan
  • Low I/O
  • Much faster query
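
The before/after difference can be sketched with a toy model of data skipping. This is a simplified illustration, not Spark's or Delta Lake's real code: the `FileStats` class, the file names, and the value ranges are all hypothetical, and only a single column's min/max statistics are modeled.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file statistics: the min and max customer_id in the file."""
    path: str
    min_customer_id: int
    max_customer_id: int


def files_to_scan(files, customer_id):
    """Keep only files whose [min, max] range could contain customer_id."""
    return [
        f.path
        for f in files
        if f.min_customer_id <= customer_id <= f.max_customer_id
    ]


# Before: every file holds a mix of customers, so every range covers 101.
unclustered = [FileStats(f"part-{i}.parquet", 1, 1000) for i in range(4)]

# After: each file holds a narrow, non-overlapping range of customers.
clustered = [
    FileStats(f"part-{i}.parquet", i * 250 + 1, (i + 1) * 250) for i in range(4)
]

print(len(files_to_scan(unclustered, 101)))  # all 4 files must be read
print(files_to_scan(clustered, 101))         # only the file covering 1..250
```

With wide, overlapping ranges nothing can be skipped; with narrow ranges most files are ruled out from their statistics alone, which is the effect behind the drop from ~2,500 to ~120 files above.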

Refer to the Microsoft (MS) URL below for configuration.
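
For orientation, Z-Ordering on a Delta table is applied with the `OPTIMIZE ... ZORDER BY` command. The sketch below only builds that statement as a string; the table name `orders` and the helper `zorder_statement` are hypothetical, and actually running the statement assumes an active Spark session on a Delta Lake table.

```python
def zorder_statement(table: str, columns: list[str]) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement for the given table and columns."""
    if not columns:
        raise ValueError("ZORDER BY needs at least one column")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})"


# 'orders' is a hypothetical table name; the columns are the ones from the example above.
stmt = zorder_statement("orders", ["customer_id", "order_date"])
print(stmt)

# With a SparkSession named `spark` (assumed, not created here), this would run it:
# spark.sql(stmt)
```

Choose columns that appear often in query filters; each extra Z-Ordered column generally weakens the clustering on the others.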
