Eric Kahindi

Why I’m Switching to Parquet for Data Storage

The first time I came across Parquet files was during my fourth-year project. I kept seeing Hugging Face recommend them whenever I uploaded a custom dataset, and I wondered: why are they so obsessed with this file format?

Fast forward to today, as I dive deeper into object storage and data lakes, Parquet shows up everywhere again. After some research and hands-on work, I finally get it: this format is not just hype. It’s genuinely better for large-scale data.

First, let's get the benefits out of the way

  • They’re simple – easy to read/write with common libraries.
  • They’re fast – optimized for analytical queries and data retrieval.
  • They’re compact – the same data takes up less space.
  • They’re schema-aware – built-in metadata and structure make them perfect for data lakes.

When I saw all this, my first thought was: this can’t be right, right?

So let's unpack the benefits

Simple to Use

Converting a CSV to Parquet takes just a few lines of Python:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Example dataframe
df = pd.DataFrame({
    "timestamp": ["2025-09-10", "2025-09-11"],
    "symbol": ["BTC", "BTC"],
    "close": [57800, 58800]
})

# Convert the dataframe to an Arrow table and write it out as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, "crypto.parquet")


That’s it — you now have a Parquet file.
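If you're already working in pandas, there's an even shorter path: pandas can write Parquet directly, using pyarrow under the hood when it's installed.

# Equivalent one-liner (assumes pyarrow is installed as the engine)
df.to_parquet("crypto.parquet", index=False)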

Faster Queries

Unlike CSVs, Parquet is a columnar storage format. Instead of organizing data row by row, it stores values by column, just like a pandas dataframe.
Think of it like this: CSV is like a text document; Parquet is like a database table optimized for analytics.

  • Need just one column? Parquet can read only that column instead of scanning the whole file.
  • Query engines (Spark, DuckDB, etc.) can skip irrelevant chunks entirely. This makes querying large datasets significantly faster.
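Here's a minimal sketch of that column pruning, reading just the close column from the crypto.parquet file written earlier:

import pyarrow.parquet as pq

# Only the requested column is read from disk; the rest is skipped
table = pq.read_table("crypto.parquet", columns=["close"])
print(table.to_pandas())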

More Compact

Parquet files hold the same data in less space than a CSV, and here's how they achieve it: the format is highly compressed by design. A few of the tricks it uses (with a quick size comparison after this list):

  • Integers - binary encoding (fewer bytes than text).
  • Strings - dictionary encoding (e.g., "BTC" stored once, then referenced by index).
  • Repeated values - run-length encoding (RLE).
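You can see the effect with a rough comparison: write the same dataframe as both CSV and Parquet and check the file sizes. Exact numbers will vary with your data, pyarrow version, and compression settings.

import os
import pandas as pd

# A deliberately repetitive dataframe, where dictionary and
# run-length encoding pay off
df = pd.DataFrame({
    "symbol": ["BTC"] * 1_000_000,
    "close": range(1_000_000),
})

df.to_csv("prices.csv", index=False)
df.to_parquet("prices.parquet", index=False)  # snappy compression by default

print(os.path.getsize("prices.csv"), os.path.getsize("prices.parquet"))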

Schema and Metadata

This is where data lakes come in.
Parquet files are schema-aware: every file carries its own schema, which is basically the blueprint of the data inside it.
It describes:

  • Column names
  • Data types (int, float, string, timestamp, boolean, etc.)
  • Nullable or not
  • Nested structures (if any)

You can even define your own schema:
import pyarrow as pa
import pyarrow.parquet as pq

# The schema is the blueprint: column names plus their types
schema = pa.schema([
    ("timestamp", pa.timestamp("s")),
    ("symbol", pa.string()),
    ("price", pa.float64())
])

# Column-oriented data, built to match the schema above
columns = [
    pa.array(["2025-09-10 00:00:00", "2025-09-11 00:00:00"], type=pa.timestamp("s")),
    pa.array(["BTC", "ETH"], type=pa.string()),
    pa.array([57800.0, 1800.5], type=pa.float64()),
]

table = pa.Table.from_arrays(columns, schema=schema)
pq.write_table(table, "crypto_with_schema.parquet")

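To confirm the schema really travels with the file, you can read it back without loading any of the data:

import pyarrow.parquet as pq

# Reads only the file footer, not the data pages
print(pq.read_schema("crypto_with_schema.parquet"))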

Parquet also supports rich metadata, which is essential in data lakes. Without metadata, a data lake quickly turns into a “data swamp.”
Every Parquet file already stores basic metadata:

  • Row count
  • Column count
  • Data types
  • Compression info
  • Column statistics (min/max values, null counts)
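All of that lives in the file footer, so you can inspect it without reading the data itself. A small sketch against the crypto.parquet file from earlier:

import pyarrow.parquet as pq

meta = pq.read_metadata("crypto.parquet")
print(meta.num_rows, meta.num_columns)

# Column statistics are stored per row group
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)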

But you can also add custom metadata:

# write_table itself doesn't take custom metadata, so attach the
# key/value pairs to the table's schema first, then write it out
table = table.replace_schema_metadata({
    b"source": b"CoinGecko",
    b"pipeline": b"airflow-crypto-etl",
    b"tokens": b"BTC,ETH,SOL,HYPE,BNB"
})

pq.write_table(table, "btc_prices.parquet")

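Reading it back is just as simple; the custom keys come back as bytes in the schema metadata:

import pyarrow.parquet as pq

schema = pq.read_schema("btc_prices.parquet")
print(schema.metadata[b"source"])  # b'CoinGecko'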

Querying in a Data Lake

Once your Parquet files are in object storage (e.g., S3), you can query them directly with modern engines:

SELECT timestamp, close
FROM 's3://crypto-data/*.parquet'
WHERE symbol = 'BTC' AND timestamp >= '2025-09-01'

This makes Parquet a natural fit for tools like Spark, DuckDB, and Presto.
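The SQL above is DuckDB-flavoured, and the duckdb Python package can run it in a couple of lines (the s3://crypto-data bucket is just a placeholder, and S3 access needs the httpfs extension plus credentials configured):

import duckdb

# DuckDB prunes columns and uses the Parquet min/max statistics
# to skip row groups that can't match the filter
result = duckdb.sql("""
    SELECT timestamp, close
    FROM 's3://crypto-data/*.parquet'
    WHERE symbol = 'BTC' AND timestamp >= '2025-09-01'
""").df()
print(result.head())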

Bottom Line

Parquet isn’t just another file format. It’s a compact, fast, and schema-aware way of storing data that plays perfectly with modern data lakes.
If you care about performance, storage efficiency, and long-term scalability, Parquet is a no-brainer.
