Python Performance Optimization: A Practical, Detailed Guide

A hands-on, copy–paste guide to measure, understand, and fix performance problems in Python. We’ll go from “it feels slow” to profiling → diffing → fixing → verifying—with runnable snippets and checklists you can reuse in every project.

1. What do we mean by “performance”?

“Performance” is not one number. Pick the right target:

  • Latency: time per operation (p50, p95, p99).
  • Throughput: ops/sec under load.
  • Memory: peak RSS, steady-state, object churn (allocs/sec).
  • Startup: import time, cold starts.
  • Energy/Cost: CPU seconds, cloud bill, core-hours.

Rule #1: Measure first. Make a baseline you can repeat and compare.


2. Benchmarking correctly (before you “optimize”)

2.1 Minimal harness (repeatable & honest)

# bench.py
import time, statistics as stats
from target import work  # the function you want to benchmark

def bench(n=30, warmup=3):
    # Warmup (JITs/caches/CPU freq settling, disk cache, etc.)
    for _ in range(warmup):
        work()
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        work()
        runs.append(time.perf_counter() - t0)
    print({
        "n": n,
        "min": min(runs),
        "median": stats.median(runs),
        "mean": stats.mean(runs),
        "p95": stats.quantiles(runs, n=100)[94],
        "max": max(runs),
    })

if __name__ == "__main__":
    bench()
  • Use time.perf_counter() (highest resolution wall-clock).
  • Run many iterations; use median/p95; beware outliers and noisy neighbors.

2.2 timeit done right

python -m timeit -n 5 -r 10 "import target; target.work()"
  • -n = loops per measurement, -r = repeats; don’t rely on a single noisy run.

2.3 Use a benchmarking tool when possible

  • pyperf (stabilizes runs, CPU affinity).
  • pytest-benchmark (CI-friendly; regression gates).

Store benchmarks in the repo. Performance is a contract, not a vibe.
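
For example, a minimal pytest-benchmark test (a sketch; it assumes pytest-benchmark is installed and that target.work is importable):

# test_perf.py (run with: pytest)
from target import work

def test_work_speed(benchmark):
    benchmark(work)  # the fixture times repeated calls and reports min/mean/stddev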


3. CPU profiling (find the real hotspots)

3.1 Deterministic profiler (cProfile + pstats)

# profile_cpu.py
import cProfile as profile, pstats
from target import work

with profile.Profile() as pr:
    work()

ps = pstats.Stats(pr).sort_stats("cumtime")  # cumulative time by function
ps.print_stats(20)
  • cProfile measures every function call; overhead is acceptable for most app code.
  • Sort by cumtime to find where time is actually spent.

3.2 Human-friendly views (snakeviz / gprof2dot)

python -m cProfile -o out.prof -m target
snakeviz out.prof            # interactive flame graph in browser
# or:
gprof2dot -f pstats out.prof | dot -Tpng -o callgraph.png

3.3 Sampling profilers (low overhead)

  • pyinstrument (beautiful call trees; ~1% overhead).
pyinstrument -r html -o prof.html -m target
  • Scalene (separates Python vs native time; also memory profiling).
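A typical Scalene invocation looks like this (a sketch; assumes target.py is a runnable script):

scalene target.py        # line-level report, splitting Python vs native time and tracking memory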

Pick one profiler, gather evidence, then optimize the top 1–3 hotspots only.


4. Memory profiling & leaks

4.1 tracemalloc (built-in)

# profile_mem.py
import tracemalloc, target
tracemalloc.start()
target.work()
current, peak = tracemalloc.get_traced_memory()
print("current=%dB peak=%dB" % (current, peak))
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)
  • Shows allocations by file/line, and peak usage.

4.2 Process-level memory

  • psutil.Process(os.getpid()).memory_info().rss (resident set size).
  • mprof / memray for timeline views of memory growth.
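
A minimal sketch of reading RSS with psutil (assuming psutil is installed):

import os, psutil

rss = psutil.Process(os.getpid()).memory_info().rss  # resident set size, in bytes
print(f"RSS: {rss / 1_048_576:.1f} MiB")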

4.3 Common leak patterns

  • Global caches growing unbounded.
  • Queues/pipelines that never drain on failure.
  • Cycles with __del__ delaying GC (rare, but tricky).

5. Algorithmic wins (Big-O beats micro-tweaks)

5.1 Replace N² scans with hash lookups

# bad: O(n^2)
def dupe_any(xs):
    for i, a in enumerate(xs):
        for b in xs[i+1:]:
            if a == b: return True
    return False

# good: O(n)
def dupe_any_fast(xs):
    seen = set()
    for x in xs:
        if x in seen: return True
        seen.add(x)
    return False

5.2 Use the right tool

  • set/dict for membership & joins.
  • heapq for top-k / streaming medians.
  • bisect for sorted arrays (avoid full sorts).
  • deque for FIFO/LRU; array/memoryview for dense numeric buffers.
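
Two quick illustrations of these tools (the data is made up; only the calls matter):

import heapq, bisect

scores = [17, 42, 8, 99, 23, 65]
top3 = heapq.nlargest(3, scores)          # top-k without sorting the whole list

sorted_ids = [3, 8, 15, 22, 40]
pos = bisect.bisect_left(sorted_ids, 15)  # O(log n) position lookup in a sorted list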

6. Data structures that actually matter

  • list vs tuple: tuples are smaller/immutable; lists are flexible and amortized O(1) append.
  • dict/set: O(1) average membership; pre-size if you know the count (dict.fromkeys trick).
  • namedtuple / dataclasses: on dataclasses, add slots=True (Python 3.10+) to save memory & speed attribute access.
  • array('d'), memoryview: compact numeric storage w/o NumPy.
  • bytes/bytearray for binary; prefer bytes.join for building large strings.
from dataclasses import dataclass

@dataclass(slots=True)
class Point:
    x: float
    y: float
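
Likewise, a small sketch of compact numeric storage without NumPy, using array and memoryview (the values are illustrative):

from array import array

samples = array("d", (0.1 * i for i in range(1_000_000)))  # packed C doubles
view = memoryview(samples)   # zero-copy view over the same buffer
chunk = view[:1000]          # slicing the view doesn't materialize Python floats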

7. Kill slow Python loops (builtins, vectorization, batching)

7.1 Prefer builtins (they run in C)

# bad
total = 0
for x in xs: total += x

# good
total = sum(xs)

# bad
out = []
for x in xs: out.append(x*x)

# good
out = [x*x for x in xs]  # still Python loop but optimized bytecode

7.2 Vectorize numeric workloads

If you’re doing heavy math, NumPy moves loops to C:

import numpy as np
a = np.arange(10_000_000, dtype=np.float64)
b = a * 0.1 + 2.0
  • Keep arrays contiguous and dtypes consistent; avoid many tiny allocations.

7.3 Batch I/O and syscalls

  • Read/write in chunks (e.g., readinto with a reusable buffer).
  • Group network/database calls (pooling, bulk operations).
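
A sketch of chunked reads with a reusable buffer (the path, chunk size, and checksum are illustrative):

def checksum(path, chunk_size=1 << 16):
    buf = bytearray(chunk_size)            # one buffer, reused for every read
    view, total = memoryview(buf), 0
    with open(path, "rb") as f:
        while (n := f.readinto(buf)):      # fills buf in place; returns bytes read (0 at EOF)
            total = (total + sum(view[:n])) % (1 << 32)
    return total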

8. Concurrency: I/O-bound vs CPU-bound (GIL-aware)

  • I/O-bound → asyncio or threads (concurrent.futures.ThreadPoolExecutor).
  • CPU-bound → multiprocessing or a native accelerator (Numba/Cython/C-extensions).

8.1 I/O with asyncio (concurrent HTTP)

import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as r:
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as s:
        return await asyncio.gather(*(fetch(s, u) for u in urls))

# asyncio.run(main(URLS))

8.2 CPU with processes

from concurrent.futures import ProcessPoolExecutor

def heavy(x):  # pure Python compute
    return sum(i*i for i in range(x))

if __name__ == "__main__":  # guard required with the spawn start method (Windows/macOS)
    with ProcessPoolExecutor() as ex:
        results = list(ex.map(heavy, [10_000_000]*8))

Threads don’t bypass the GIL for Python bytecode. Use processes or native code for true CPU parallelism.


9. Native acceleration: Cython, Numba, C-extensions, PyPy

  • Cython: static types + C compilation for hot loops.
  • Numba: JIT for numeric Python (nopython mode, vectorize).
  • C-extensions / CFFI / ctypes: hand-written native code for tight kernels.
  • PyPy: JIT interpreter—great for long-running, loop-heavy pure Python.

9.1 Numba example

import numpy as np
from numba import njit

@njit(fastmath=True)
def pairwise_sum(a, b):
    n = a.shape[0]
    out = np.empty(n, np.float64)
    for i in range(n):
        out[i] = a[i] + b[i]
    return out

9.2 Cython sketch (setup-free with pyximport)

# sumsq.pyx
def sumsq(lst):
    cdef Py_ssize_t i, n=len(lst)
    cdef double s=0.0
    for i in range(n):
        s += lst[i]*lst[i]
    return s

# driver.py
import pyximport; pyximport.install(language_level=3)
from sumsq import sumsq
print(sumsq([1.0]*1_000_000))

10. Memory optimization patterns

  • Prefer generators over lists when you only need to stream once.
def read_lines(path):
    with open(path, "rt") as f:
        for line in f:
            yield line.rstrip("
")
  • Use __slots__ / @dataclass(slots=True) to avoid per-instance dicts.
  • Reuse buffers for binary I/O (bytearray, memoryview).
  • Avoid building huge strings one + at a time → use ''.join(parts).
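
For example:

parts = [str(i) for i in range(100_000)]
s = "".join(parts)                 # single allocation for the final string
# versus: s = ""; for p in parts: s += p   -> repeated reallocation and copying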

10.1 Track allocations with tracemalloc diffs

import tracemalloc, target

tracemalloc.start()
target.work()
snap1 = tracemalloc.take_snapshot()
target.work()
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, "lineno")[:5]:
    print(stat)

11. Caching & memoization (and when not to)

  • Use functools.lru_cache for pure functions with repeated args.
from functools import lru_cache

@lru_cache(maxsize=4096)
def slow_parse(key: str) -> dict:
    ...
  • Don’t cache unbounded data (memory blowups). Evict or cap size.
  • Precompile regexes (rx = re.compile(...)) instead of recompiling in loops.
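
For example (the pattern is hypothetical):

import re

WORD_RX = re.compile(r"\w+")       # compiled once, at import time

def count_words(lines):
    return sum(len(WORD_RX.findall(line)) for line in lines)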

12. Startup, packaging, and deployment

  • Avoid expensive work at import time. Put it under if __name__ == "__main__":.
  • Lazy-import large libs in cold paths.
  • Use newer CPython versions—they routinely ship speedups.
  • Build wheels for native deps in CI to avoid slow cold starts on deploy.
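
A sketch of a lazy import on a cold path (pandas stands in for any heavy dependency):

def export_report(rows, path):
    import pandas as pd            # imported only when this feature is actually used
    pd.DataFrame(rows).to_csv(path, index=False)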

13. A repeatable performance workflow (checklist)

  1. Define the target: which metric (latency p95, throughput, peak RSS, startup)?
  2. Write a reproducible benchmark (timeit/pyperf/pytest-benchmark).
  3. Profile (CPU first, then memory if needed).
  4. Fix the top hotspots (algorithm → data structure → builtins/vectorization → concurrency → native).
  5. Verify with the benchmark; record before/after numbers.
  6. Guard with CI benchmarks or perf budgets.
  7. Document what you changed and why.

Appendix A. Micro-optimizations (use sparingly)

  • Localize attribute lookups in hot loops:
append = out.append
for x in xs: append(x*x)
  • Avoid tiny function calls inside critical loops (inline small helpers).
  • Prefer in on sets/dicts, avoid repeated .index() on lists.
  • Use str.join for concatenation of many pieces.
  • Pre-size lists when possible: [None]*n then fill.

Always prove with a micro-benchmark. Micro-wins can lose if they hurt readability or cache locality.
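
For instance, compare two candidate forms with timeit before committing to the "faster" one:

python -m timeit -s "xs=list(range(10_000))" "[x*x for x in xs]"
python -m timeit -s "xs=list(range(10_000))" "list(map(lambda x: x*x, xs))"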


Appendix B. Reusable snippets & templates

Minimal perf test harness

import time, statistics as stats

def bench(fn, *args, n=30, warmup=3, **kwargs):
    for _ in range(warmup): fn(*args, **kwargs)
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        runs.append(time.perf_counter() - t0)
    return {
        "n": n, "min": min(runs), "median": stats.median(runs),
        "mean": stats.mean(runs), "p95": stats.quantiles(runs, n=100)[94],
        "max": max(runs),
    }

cProfile helper (context manager)

import cProfile as profile, pstats, io
from contextlib import contextmanager

@contextmanager
def cprofile(sort="cumtime", limit=25):
    pr = profile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        s = io.StringIO()
        pstats.Stats(pr, stream=s).sort_stats(sort).print_stats(limit)
        print(s.getvalue())

tracemalloc helper

import tracemalloc

class MemSnap:
    def __init__(self): tracemalloc.start()
    def take(self): return tracemalloc.take_snapshot()
    @staticmethod
    def diff(a, b, key="lineno", n=10):
        for stat in b.compare_to(a, key)[:n]:
            print(stat)

asyncio concurrency pattern

import asyncio, aiohttp

async def fetch_text(session, url):
    async with session.get(url) as r:
        return await r.text()   # read the body while the session is still open

async def gather_texts(urls):
    async with aiohttp.ClientSession() as s:
        return await asyncio.gather(*(fetch_text(s, u) for u in urls))

ProcessPool pattern for CPU

from concurrent.futures import ProcessPoolExecutor

def crunch(chunk): ...
def crunch_all(chunks):
    with ProcessPoolExecutor() as ex:
        return list(ex.map(crunch, chunks))
