Wayne

Posted on • Originally published at wheynelau.dev

Learnings of the Poor

Necessity is the mother of invention

I was already GPU poor, but a recent job change combined with rising component prices has also made me RAM and NVMe poor.

While I am nowhere close to the optimisation experts of the 90s or early 2000s, I took this time to brush up on some fundamentals and key concepts in Python. As the saying goes:

"Premature optimisation is the root of all evil"

We are not looking for very deep optimisations; these changes aim to follow the Pareto Principle, where 80% of the outcome comes from 20% of the effort. The changes below may or may not be exactly 20% effort, but I would consider them low-effort.

As such, there won't be any discussion of performance profiling, such as identifying hot loops, cache misses, or memory reallocations.

Iterators

Frankly, I think this is an important concept that carries over regardless of language. Understanding iterators also helps when you need to think about channels, which are central to Go.

The typical approach collects results at every stage into lists:

data = read_file("data.jsonl")
data = first_filter(data)
data = second_processing(data)
write_processed_data(data, "output.jsonl")

The issue: if data.jsonl is bigger than your RAM, you run out of memory very quickly. Using yield instead keeps memory usage low:

import json
from collections.abc import Iterator

def read_file(file: str) -> Iterator[dict]:
    with open(file, "r") as f:
        for line in f:
            # one parsed record at a time, never the whole file in memory
            yield json.loads(line)

def first_filter(input_data: Iterator[dict]) -> Iterator[dict]:
    for record in input_data:
        if is_good(record):
            yield record

Each function in the pipeline takes an Iterator[dict] and yields records one at a time. Memory usage drops significantly.
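Chaining the stages looks just like the list version, but nothing runs until something consumes the final iterator. A sketch, assuming second_processing and write_processed_data are rewritten in the same consume-and-yield style:

records = read_file("data.jsonl")
records = first_filter(records)
records = second_processing(records)   # still nothing has been read yet
write_processed_data(records, "output.jsonl")  # data flows only as this consumes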

Caveats:

  • Files are held open throughout the pipeline, so unintentional edits or moves will break it.
  • json.dumps does not add a trailing newline, so f.write(json.dumps(record) + '\n') is intentional when writing JSONL (a writer sketch follows this list).
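A sketch of that writer, reusing the json and Iterator imports from above (write_processed_data is just the name used earlier in the pipeline, not a library function):

def write_processed_data(records: Iterator[dict], file: str) -> None:
    with open(file, "w") as f:
        for record in records:
            # json.dumps has no trailing newline, so add one to keep one record per line
            f.write(json.dumps(record) + "\n")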

Learning points

I find that iterators are a stepping stone to understanding pipelines, channels, and pub/sub patterns. Once you understand iterators, you understand the bottlenecks of your code: those patterns are all, fundamentally, iterators that consume and yield.

If process_data is slow (1 line per second) while reading and filtering are fast (4 lines per second), the pipeline is bounded at 1 line per second. The solution is more processing workers bridged through queues or channels:

Read-worker-1 -> Filter-worker-1 -> Process-worker-{1..4} -> Write-worker-1
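To make that concrete, here is a minimal sketch of that layout using threads and queue.Queue. The functions is_good and process_record are placeholders rather than anything from the earlier code, and in CPython threads only help if the slow stage is I/O-bound or releases the GIL; otherwise multiprocessing with the same queue shape applies.

import json
import threading
from queue import Queue

SENTINEL = None  # tells downstream workers there is no more work

def read_and_filter(path: str, out_q: Queue, n_workers: int) -> None:
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if is_good(record):           # fast stage
                out_q.put(record)
    for _ in range(n_workers):            # one sentinel per processing worker
        out_q.put(SENTINEL)

def process(in_q: Queue, out_q: Queue) -> None:
    while (record := in_q.get()) is not SENTINEL:
        out_q.put(process_record(record))  # slow stage, shared by several workers
    out_q.put(SENTINEL)

def write(in_q: Queue, path: str, n_workers: int) -> None:
    finished = 0
    with open(path, "w") as f:
        while finished < n_workers:
            record = in_q.get()
            if record is SENTINEL:
                finished += 1
            else:
                f.write(json.dumps(record) + "\n")

work_q: Queue = Queue(maxsize=1000)    # bounded queues apply back-pressure
result_q: Queue = Queue(maxsize=1000)
workers = [threading.Thread(target=process, args=(work_q, result_q)) for _ in range(4)]
for w in workers:
    w.start()
writer = threading.Thread(target=write, args=(result_q, "output.jsonl", len(workers)))
writer.start()
read_and_filter("data.jsonl", work_q, len(workers))
for w in workers:
    w.join()
writer.join()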

Compression

In my Compression post, I mentioned that you should benchmark to know whether your use case benefits from compression. For write once, read many scenarios, higher compression levels may help.

Here is a measurement from an I/O-constrained scenario (reading a JSONL file from a NAS):

ZST: 100000it [00:05, 17220.01it/s]  (9.47 MB/s)
Raw: 100000it [00:40, 2492.39it/s]  (11.15 MB/s)

Because the data is compressed, each buffer you read holds more records: a MB of compressed JSONL contains far more lines than a MB of the raw form. That is why the two runs above move similar MB/s over the network yet differ by nearly 7x in lines per second.
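A compressed file also fits the iterator pattern from earlier. A minimal sketch using the third-party zstandard package (assumed here; any streaming decompressor with a file-like interface works the same way):

import io
import json
import zstandard  # pip install zstandard
from collections.abc import Iterator

def read_zst_jsonl(path: str) -> Iterator[dict]:
    with open(path, "rb") as raw:
        # stream_reader decompresses on the fly, so only a small buffer lives in memory
        with zstandard.ZstdDecompressor().stream_reader(raw) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)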

Less is more

Less work means more efficient processing. It's about eliminating wasted work, not just adding a cache everywhere.

If filtering takes 1s per line and processing takes 5s per line:

  • Process then filter on 10000 lines: 10000 * 6s = 60000s
  • Filter then process on 10000 lines (50% bad): 10000 * 1s + 5000 * 5s = 35000s

No complex code, no need for compiled languages. Algorithmic complexity matters too. Choosing the right data structure — a set for membership checks instead of a list, a deque instead of a list for queue operations — can eliminate entire classes of wasted work regardless of language.
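As a toy illustration of the data-structure point: membership checks against a list scan every element, while a set does a hash lookup. The timings depend on your machine, so treat this as a sketch rather than a benchmark:

from collections import deque
from timeit import timeit

items = list(range(100_000))
as_list = items        # `x in list` scans every element: O(n)
as_set = set(items)    # `x in set` is a hash lookup: O(1) on average

print(timeit(lambda: 99_999 in as_list, number=1_000))  # seconds for the list
print(timeit(lambda: 99_999 in as_set, number=1_000))   # seconds for the set

queue = deque(items)
queue.popleft()        # O(1); list.pop(0) shifts every remaining element: O(n)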

The full version with code examples and benchmarks is on my blog.
