A hands-on, copy–paste guide to measure, understand, and fix performance problems in Python. We’ll go from “it feels slow” to profiling → diffing → fixing → verifying—with runnable snippets and checklists you can reuse in every project.
Table of contents
- 1. What do we mean by “performance”?
- 2. Benchmarking correctly (before you “optimize”)
- 3. CPU profiling (find the real hotspots)
- 4. Memory profiling & leaks
- 5. Algorithmic wins (Big-O beats micro-tweaks)
- 6. Data structures that actually matter
- 7. Kill slow Python loops (builtins, vectorization, batching)
- 8. Concurrency: I/O-bound vs CPU-bound (GIL-aware)
- 9. Native acceleration: Cython, Numba, C-extensions, PyPy
- 10. Memory optimization patterns
- 11. Caching & memoization (and when not to)
- 12. Startup, packaging, and deployment
- 13. A repeatable performance workflow (checklist)
- Appendix A. Micro-optimizations (use sparingly)
- Appendix B. Reusable snippets & templates
1. What do we mean by “performance”?
“Performance” is not one number. Pick the right target:
- Latency: time per operation (p50, p95, p99).
- Throughput: ops/sec under load.
- Memory: peak RSS, steady-state, object churn (allocs/sec).
- Startup: import time, cold starts.
- Energy/Cost: CPU seconds, cloud bill, core-hours.
Rule #1: Measure first. Make a baseline you can repeat and compare.
2. Benchmarking correctly (before you “optimize”)
2.1 Minimal harness (repeatable & honest)
# bench.py
import time, statistics as stats
from target import work # the function you want to benchmark
def bench(n=30, warmup=3):
    # Warmup (JITs/caches/CPU freq settling, disk cache, etc.)
    for _ in range(warmup):
        work()
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        work()
        runs.append(time.perf_counter() - t0)
    print({
        "n": n,
        "min": min(runs),
        "median": stats.median(runs),
        "mean": stats.mean(runs),
        "p95": stats.quantiles(runs, n=100)[94],
        "max": max(runs),
    })

if __name__ == "__main__":
    bench()
- Use time.perf_counter() (highest-resolution wall clock).
- Run many iterations; use median/p95; beware outliers and noisy neighbors.
2.2 timeit done right
python -m timeit -s "import target" -n 5 -r 10 "target.work()"
- -n = loops per measurement, -r = repeats; don't rely on the default single shot.
2.3 Use a benchmarking tool when possible
- pyperf (stabilizes runs, handles CPU affinity).
- pytest-benchmark (CI-friendly; regression gates; example below).
Store benchmarks in the repo. Performance is a contract, not a vibe.
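For example, a minimal pytest-benchmark test might look like this (a sketch assuming the plugin is installed and target.work is importable); the benchmark fixture calibrates, repeats, and records statistics for you:

# test_perf.py  (run with: pytest)
from target import work

def test_work_speed(benchmark):
    benchmark(work)  # the fixture handles warmup, repetition, and reporting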
3. CPU profiling (find the real hotspots)
3.1 Deterministic profiler (cProfile + pstats)
# profile_cpu.py
import cProfile as profile, pstats
from target import work
with profile.Profile() as pr:
    work()

ps = pstats.Stats(pr).sort_stats("cumtime")  # sort by cumulative time per function
ps.print_stats(20)
- cProfile measures every function call; overhead is acceptable for most app code.
- Sort by cumtime to find where time is actually spent.
3.2 Human-friendly views (snakeviz / gprof2dot)
python -m cProfile -o out.prof -m target
snakeviz out.prof # interactive flame graph in browser
# or:
gprof2dot -f pstats out.prof | dot -Tpng -o callgraph.png
3.3 Sampling profilers (low overhead)
- pyinstrument (beautiful call trees; ~1% overhead).
pyinstrument -r html -o prof.html -m target
- Scalene (separates Python vs native time; also memory profiling).
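If Scalene is installed, the simplest invocation profiles a whole script (the script name below is a placeholder):
scalene your_script.py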
Pick one profiler, gather evidence, then optimize the top 1–3 hotspots only.
4. Memory profiling & leaks
4.1 tracemalloc (built-in)
# profile_mem.py
import tracemalloc, target
tracemalloc.start()
target.work()
current, peak = tracemalloc.get_traced_memory()
print("current=%dB peak=%dB" % (current, peak))
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)
- Shows allocations by file/line, and peak usage.
4.2 Process-level memory
- psutil.Process(os.getpid()).memory_info().rss reports the resident set size (snippet below).
- mprof / memray for timeline views of memory growth.
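As a quick illustration (assuming psutil is installed; the workload function is hypothetical):

import os, psutil

proc = psutil.Process(os.getpid())
print("RSS before: %.1f MB" % (proc.memory_info().rss / 1e6))
work_that_allocates()  # hypothetical workload
print("RSS after:  %.1f MB" % (proc.memory_info().rss / 1e6))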
4.3 Common leak patterns
- Global caches growing unbounded (sketch below).
- Queues/pipelines that never drain on failure.
- Cycles with __del__ delaying GC (rare, but tricky).
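A minimal sketch of the first pattern and one way to cap it (expensive_compute is a hypothetical stand-in):

# leaky: a module-level cache that only ever grows
_cache = {}

def lookup(key):
    if key not in _cache:
        _cache[key] = expensive_compute(key)  # hypothetical expensive call
    return _cache[key]

# safer: bound the cache so steady-state memory stays flat
from functools import lru_cache

@lru_cache(maxsize=10_000)
def lookup_capped(key):
    return expensive_compute(key)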
5. Algorithmic wins (Big-O beats micro-tweaks)
5.1 Replace N² scans with hash lookups
# bad: O(n^2)
def dupe_any(xs):
    for i, a in enumerate(xs):
        for b in xs[i+1:]:
            if a == b:
                return True
    return False

# good: O(n)
def dupe_any_fast(xs):
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False
5.2 Use the right tool
- set / dict for membership tests and joins.
- heapq for top-k selection and streaming medians (see the sketch after this list).
- bisect for searching and inserting into sorted arrays (avoid repeated full sorts).
- deque for FIFO/LRU.
- array / memoryview for dense numeric buffers.
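A short sketch of the heapq and bisect bullets above (values are arbitrary):

import heapq, bisect

scores = [17, 3, 42, 8, 23, 5]

# top-k without sorting everything: O(n log k)
top3 = heapq.nlargest(3, scores)       # [42, 23, 17]

# keep a list sorted as items arrive: O(log n) search + O(n) insert
sorted_scores = []
for s in scores:
    bisect.insort(sorted_scores, s)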
6. Data structures that actually matter
- list vs tuple: tuples are smaller and immutable; lists are flexible with amortized O(1) append.
- dict / set: O(1) average membership; pre-size if you know the count (dict.fromkeys trick).
- namedtuple / dataclasses: add slots=True to save memory and speed up attribute access.
- array('d'), memoryview: compact numeric storage without NumPy (buffer sketch below).
- bytes / bytearray for binary data; prefer bytes.join for building large byte strings.
from dataclasses import dataclass
@dataclass(slots=True)
class Point:
    x: float
    y: float
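And a quick illustration of the array('d') / memoryview and bytes.join bullets (values are arbitrary):

from array import array

a = array("d", (float(i) for i in range(1_000_000)))   # ~8 bytes per value, no per-item objects
view = memoryview(a)[1000:2000]                        # zero-copy slice of the underlying buffer

parts = [b"header", b"payload", b"footer"]
blob = b"".join(parts)                                 # one allocation instead of repeated +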
7. Kill slow Python loops (builtins, vectorization, batching)
7.1 Prefer builtins (they run in C)
# bad
total = 0
for x in xs: total += x
# good
total = sum(xs)
# bad
out = []
for x in xs: out.append(x*x)
# good
out = [x*x for x in xs] # still Python loop but optimized bytecode
7.2 Vectorize numeric workloads
If you’re doing heavy math, NumPy moves loops to C:
import numpy as np
a = np.arange(10_000_000, dtype=np.float64)
b = a * 0.1 + 2.0
- Keep arrays contiguous and dtypes consistent; avoid many tiny allocations.
7.3 Batch I/O and syscalls
- Read/write in chunks (e.g., readinto with a reusable buffer; sketch below).
- Group network/database calls (connection pooling, bulk operations).
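A minimal sketch of chunked reads with a reusable buffer (path and chunk size are placeholders; the per-chunk "work" is a toy):

def sum_bytes(path, chunk_size=1 << 20):
    buf = bytearray(chunk_size)            # one reusable buffer
    view = memoryview(buf)
    total = 0
    with open(path, "rb") as f:
        while True:
            n = f.readinto(buf)            # fills buf in place, no new bytes object per read
            if not n:
                break
            total += sum(view[:n])         # toy per-chunk work
    return total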
8. Concurrency: I/O-bound vs CPU-bound (GIL-aware)
- I/O-bound → asyncio or threads (concurrent.futures.ThreadPoolExecutor).
- CPU-bound → multiprocessing or a native accelerator (Numba/Cython/C extensions).
8.1 I/O with asyncio (concurrent HTTP)
import asyncio, aiohttp
async def fetch(session, url):
    async with session.get(url) as r:
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as s:
        return await asyncio.gather(*(fetch(s, u) for u in urls))

# asyncio.run(main(URLS))
8.2 CPU with processes
from concurrent.futures import ProcessPoolExecutor
def heavy(x):  # pure Python compute
    return sum(i*i for i in range(x))

if __name__ == "__main__":  # required on spawn-based platforms (Windows, macOS)
    with ProcessPoolExecutor() as ex:
        results = list(ex.map(heavy, [10_000_000] * 8))
Threads don’t bypass the GIL for Python bytecode. Use processes or native code for true CPU parallelism.
9. Native acceleration: Cython, Numba, C-extensions, PyPy
- Cython: static types + C compilation for hot loops.
- Numba: JIT for numeric Python (nopython mode, vectorize).
- C extensions / CFFI / ctypes: hand-written native code for tight kernels.
- PyPy: a JIT interpreter, great for long-running, loop-heavy pure Python.
9.1 Numba example
import numpy as np
from numba import njit
@njit(fastmath=True)
def pairwise_sum(a, b):
    n = a.shape[0]
    out = np.empty(n, np.float64)
    for i in range(n):
        out[i] = a[i] + b[i]
    return out
9.2 Cython sketch (setup-free with pyximport)
# sumsq.pyx
def sumsq(lst):
    cdef Py_ssize_t i, n = len(lst)
    cdef double s = 0.0
    for i in range(n):
        s += lst[i] * lst[i]
    return s
# driver.py
import pyximport; pyximport.install(language_level=3)
from sumsq import sumsq
print(sumsq([1.0]*1_000_000))
10. Memory optimization patterns
- Prefer generators over lists when you only need to stream once.
def read_lines(path):
    with open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")
- Use __slots__ / @dataclass(slots=True) to avoid per-instance dicts.
- Reuse buffers for binary I/O (bytearray, memoryview).
- Avoid building huge strings one + at a time → use ''.join(parts).
10.1 Track allocations with tracemalloc diffs
import tracemalloc, target
tracemalloc.start()
target.work()
snap1 = tracemalloc.take_snapshot()
target.work()
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, "lineno")[:5]:
    print(stat)
11. Caching & memoization (and when not to)
- Use functools.lru_cache for pure functions called repeatedly with the same arguments.
from functools import lru_cache
@lru_cache(maxsize=4096)
def slow_parse(key: str) -> dict:
    ...
- Don’t cache unbounded data (memory blowups). Evict or cap size.
- Precompile regexes (rx = re.compile(...)) instead of recompiling them in loops (example below).
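For instance (the pattern and data are illustrative):

import re

WORD_RX = re.compile(r"\w+")   # compiled once, at import time

def count_words(lines):
    return sum(len(WORD_RX.findall(line)) for line in lines)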
12. Startup, packaging, and deployment
- Avoid expensive work at import time. Put it under if __name__ == "__main__":.
- Lazy-import large libraries in cold paths (sketch below).
- Use newer CPython versions—they routinely ship speedups.
- Build wheels for native deps in CI to avoid slow cold starts on deploy.
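A sketch of the lazy-import idea; pandas here stands in for any heavy dependency most code paths never touch:

def export_report(rows, path):
    import pandas as pd        # imported only when this cold path actually runs
    pd.DataFrame(rows).to_csv(path, index=False)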
13. A repeatable performance workflow (checklist)
- Define the target: which metric (latency p95, throughput, peak RSS, startup)?
- Write a reproducible benchmark (timeit/pyperf/pytest-benchmark).
- Profile (CPU first, then memory if needed).
- Fix the top hotspots (algorithm → data structure → builtins/vectorization → concurrency → native).
- Verify with the benchmark; record before/after numbers.
- Guard with CI benchmarks or perf budgets.
- Document what you changed and why.
Appendix A. Micro-optimizations (use sparingly)
- Localize attribute lookups in hot loops:
out = []
append = out.append  # bind the method once, outside the hot loop
for x in xs:
    append(x * x)
- Avoid tiny function calls inside critical loops (inline small helpers).
- Prefer in on sets/dicts; avoid repeated .index() on lists.
- Use str.join for concatenating many pieces.
- Pre-size lists when possible: [None]*n, then fill.
Always prove with a micro-benchmark. Micro-wins can lose if they hurt readability or cache locality.
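For example, a quick timeit check of the str.join claim (sizes are arbitrary; absolute numbers will vary by machine):

import timeit

parts = ["x"] * 10_000

def concat_plus():
    s = ""
    for p in parts:
        s += p
    return s

def concat_join():
    return "".join(parts)

print("+=  :", min(timeit.repeat(concat_plus, number=200, repeat=5)))
print("join:", min(timeit.repeat(concat_join, number=200, repeat=5)))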
Appendix B. Reusable snippets & templates
Minimal perf test harness
import time, statistics as stats
def bench(fn, *args, n=30, warmup=3, **kwargs):
    # args/kwargs are forwarded to fn; n and warmup are keyword-only
    for _ in range(warmup):
        fn(*args, **kwargs)
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        runs.append(time.perf_counter() - t0)
    return {
        "n": n, "min": min(runs), "median": stats.median(runs),
        "mean": stats.mean(runs), "p95": stats.quantiles(runs, n=100)[94],
        "max": max(runs),
    }
cProfile helper (context manager)
import cProfile as profile, pstats, io
from contextlib import contextmanager
@contextmanager
def cprofile(sort="cumtime", limit=25):
    pr = profile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        s = io.StringIO()
        pstats.Stats(pr, stream=s).sort_stats(sort).print_stats(limit)
        print(s.getvalue())
tracemalloc helper
import tracemalloc
class MemSnap:
    def __init__(self): tracemalloc.start()
    def take(self): return tracemalloc.take_snapshot()
    @staticmethod
    def diff(a, b, key="lineno", n=10):
        for stat in b.compare_to(a, key)[:n]:
            print(stat)
asyncio concurrency pattern
import asyncio, aiohttp

async def _fetch(session, url):
    async with session.get(url) as r:
        return await r.text()   # read the body while the response is still open

async def gather_texts(urls):
    async with aiohttp.ClientSession() as s:
        return await asyncio.gather(*(_fetch(s, u) for u in urls))
ProcessPool pattern for CPU
from concurrent.futures import ProcessPoolExecutor
def crunch(chunk): ...
def crunch_all(chunks):
    with ProcessPoolExecutor() as ex:
        return list(ex.map(crunch, chunks))