DEV Community

James Miller

Process 100M Rows in 1 Second: 8 Python Libraries That Make Excel Obsolete

When processing data, I often find Excel frustrating. It's not that its features aren't useful, but it falls behind from the start on invisible type conversions and non-reproducibility. Date format errors, crashing on large files, and untraceable logic: these issues are fatal in engineering projects.

So, I’ve curated a Python toolstack. No gimmicks, just libraries that solve specific problems.

1. DuckDB: Handling Billions of Rows with Ease

If you are processing hundreds of MBs to several GBs of data on a single machine, Pandas' memory usage can be maddening. DuckDB fills the gap between SQLite and distributed databases. It is an in-process OLAP database characterized by zero configuration and vectorized computation.

DuckDB can execute SQL queries directly on CSV, Parquet, or JSON files without loading all the data into memory.

import duckdb

# Execute SQL directly on a Parquet file; no need to create a DB or table
# .df() converts the query result into a Pandas DataFrame
result = duckdb.sql("""
    SELECT 
        department, 
        AVG(salary) as avg_salary 
    FROM 'employees.parquet' 
    WHERE join_date > '2022-01-01'
    GROUP BY department
    ORDER BY avg_salary DESC
""").df()

print(result)

2. Ibis: Write Once, Run Anywhere

Ibis and DuckDB have become a popular pairing in modern data engineering.

The core of Ibis is separating business logic from the execution engine. You can write query logic using a Pandas-like Python API, but the backend can seamlessly switch between DuckDB, ClickHouse, BigQuery, or even PySpark.

When using Ibis to drive DuckDB, you enjoy the elegance and type checking of Python code while leveraging DuckDB's blazing-fast execution engine.

import ibis

# Connect to DuckDB backend (can also be SQLite, Postgres, etc.)
con = ibis.duckdb.connect()

# Lazy loading of data; nothing is loaded into memory yet
table = con.read_csv("sales_data.csv")

# Build the query expression
expr = (
    table.filter(table["status"] == "completed")
         .group_by("region")
         .aggregate(total_revenue=table["amount"].sum())
         .order_by(ibis.desc("total_revenue"))
)

# SQL is generated and executed only when execute() is called
print(expr.execute())

3. Polars: The Multithreaded DataFrame

Pandas is largely single-threaded, whereas Polars is written in Rust and parallelizes work across CPU cores natively. When handling large datasets, Polars is often several times faster than Pandas.

Its design philosophy adopts "Lazy Evaluation": it builds a query plan first, optimizes it via an optimizer, and then executes it. This drastically reduces memory overhead.

import polars as pl

# Scan the file instead of reading it, enabling Lazy mode
q = (
    pl.scan_csv("large_dataset.csv")
    .filter(pl.col("age") > 30)
    .select(["name", "salary", "department"])
    .group_by("department")
    .agg(pl.col("salary").mean().alias("avg_salary"))
)

# collect() triggers the actual computation
df = q.collect()
print(df)

4. PyArrow Compute: The Calculation Cornerstone

PyArrow is the Python implementation of the Apache Arrow columnar format, but it is more than a format library: its compute module provides a set of high-performance vectorized calculation functions. Many modern data tools (including Pandas 2.0+) use it under the hood.

If you need extremely fast mathematical operations or string processing on arrays and don't want the overhead of a DataFrame, consider PyArrow Compute.

import pyarrow as pa
import pyarrow.compute as pc

# Create Arrow arrays
arr_a = pa.array([10, 20, 30, 40, None])
arr_b = pa.array([2, 4, 5, 8, 1])

# Use the compute kernel for vectorized multiplication, handling None automatically
result = pc.multiply(arr_a, arr_b)

# Filter data
mask = pc.greater(result, 50)
filtered = pc.filter(result, mask)

print(filtered)

5. TinyDB: Lightweight Document Storage

Not every project needs PostgreSQL. For configuration management, storing small crawler data, or local CLI tools, TinyDB is perfect. It is a document database written in pure Python. Data is stored in JSON files, and the API feels as natural as manipulating Python lists.

from tinydb import TinyDB, Query

db = TinyDB('local_storage.json')
User = Query()

# Insert data
db.insert({'name': 'Alice', 'role': 'admin', 'points': 85})
db.insert({'name': 'Bob', 'role': 'user', 'points': 60})

# Query data
results = db.search(User.role == 'admin')
print(results)

6. Rill: BI Tool for Developers

While Rill is more of a standalone tool than a traditional Python library, it holds a unique place in the modern data stack. Built on DuckDB, Rill quickly reads local data (CSV, Parquet) and instantly generates interactive BI dashboards. It solves the pain point of having to set up Superset just to look at a chart, and it is great for exploring data distributions.

7. Numba: Python at C Speed

When your code contains many native for loops (e.g., scientific computing, complex algorithms), the Python interpreter often becomes the bottleneck. Numba is a JIT (Just-In-Time) compiler. By adding a single decorator, it compiles Python functions into machine code.

from numba import jit
import time

# Without @jit, Python loops are slow
# nopython=True forces generation of machine code
@jit(nopython=True)
def heavy_computation(n):
    total = 0
    for i in range(n):
        total += i * 2
    return total

# Note: the first call includes one-time JIT compilation;
# subsequent calls run at compiled speed
start = time.time()
print(heavy_computation(100_000_000))
print(f"Time: {time.time() - start:.4f} seconds")

8. Bonobo: Lightweight ETL Framework

For data migration tasks that don't require heavy scheduling systems like Airflow, Bonobo offers a lightweight, graph-based ETL solution. It uses pure Python code to define data flow logic, making it very clear and suitable for small to medium-sized data cleaning and transformation tasks.

import bonobo

def extract():
    yield {'id': 1, 'name': ' Item A '}
    yield {'id': 2, 'name': ' Item B '}

def transform(row):
    return {
        'id': row['id'], 
        'name_clean': row['name'].strip().upper()
    }

def load(row):
    print(f"Loading: {row}")

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(extract, transform, load)
    return graph

if __name__ == '__main__':
    # Run using 'bonobo run' in CLI
    parser = bonobo.get_argument_parser()
    with bonobo.parse_args(parser) as options:
        bonobo.run(get_graph(**options))

The Foundation: Managing Your Environment

The libraries above cover everything from data storage and calculation engines to ETL processes. Since these are Python libraries, you obviously need a Python environment.

But let's be real—developers never have just one project. You likely have one project requiring the cutting-edge Python 3.14, while another must run on Python 3.8, or even maintain a legacy system on ancient Python 2.7. Once your system's built-in Python environment gets messed up, fixing it is a nightmare.

This is where ServBay comes in.

ServBay provides developers with an isolated and clean Python environment. Its biggest advantages are one-click deployment and multi-version coexistence.

  • Instant Install: Stop wrestling with the command line. ServBay installs a dev environment containing common components with just a click.
  • Full Version Support: It supports everything from the latest Python 3.x down to early Python 2.x versions.
  • Environment Isolation: ServBay's environment is independent of the system, so it won't pollute your OS's built-in Python, ensuring system stability.

For efficiency-minded developers, delegating environment management to ServBay frees you to focus your energy on code and data logic.
