A hands-on, copy–paste guide to measure, understand, and fix performance problems in Python. We’ll go from “it feels slow” to profiling → diffing → fixing → verifying—with runnable snippets and checklists you can reuse in every project.
Table of contents
- 1. What do we mean by “performance”?
- 2. Benchmarking correctly (before you “optimize”)
- 3. CPU profiling (find the real hotspots)
- 4. Memory profiling & leaks
- 5. Algorithmic wins (Big-O beats micro-tweaks)
- 6. Data structures that actually matter
- 7. Kill slow Python loops (builtins, vectorization, batching)
- 8. Concurrency: I/O-bound vs CPU-bound (GIL-aware)
- 9. Native acceleration: Cython, Numba, C-extensions, PyPy
- 10. Memory optimization patterns
- 11. Caching & memoization (and when not to)
- 12. Startup, packaging, and deployment
- 13. A repeatable performance workflow (checklist)
- Appendix A. Micro-optimizations (use sparingly)
- Appendix B. Reusable snippets & templates
1. What do we mean by “performance”?
“Performance” is not one number. Pick the right target:
- Latency: time per operation (p50, p95, p99).
- Throughput: ops/sec under load.
- Memory: peak RSS, steady-state, object churn (allocs/sec).
- Startup: import time, cold starts.
- Energy/Cost: CPU seconds, cloud bill, core-hours.
Rule #1: Measure first. Make a baseline you can repeat and compare.
2. Benchmarking correctly (before you “optimize”)
2.1 Minimal harness (repeatable & honest)
# bench.py
import time, statistics as stats
from target import work # the function you want to benchmark
def bench(n=30, warmup=3):
    # Warmup (JITs/caches/CPU freq settling, disk cache, etc.)
    for _ in range(warmup):
        work()
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        work()
        runs.append(time.perf_counter() - t0)
    print({
        "n": n,
        "min": min(runs),
        "median": stats.median(runs),
        "mean": stats.mean(runs),
        "p95": stats.quantiles(runs, n=100)[94],
        "max": max(runs),
    })

if __name__ == "__main__":
    bench()
- Use time.perf_counter() (highest-resolution wall clock).
- Run many iterations; use median/p95; beware outliers and noisy neighbors.
2.2 timeit done right
python -m timeit -s "import target" -n 5 -r 10 "target.work()"
- -n = loops per measurement, -r = repeats; don't rely on the default single shot.
2.3 Use a benchmarking tool when possible
- pyperf (stabilizes runs, handles CPU affinity).
- pytest-benchmark (CI-friendly; regression gates; example below).
Store benchmarks in the repo. Performance is a contract, not a vibe.
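For example, a minimal pytest-benchmark test might look like this (a sketch assuming the plugin is installed and target.work is importable); the benchmark fixture calibrates, repeats, and records statistics for you:

# test_perf.py  (run with: pytest)
from target import work

def test_work_speed(benchmark):
    benchmark(work)  # the fixture handles warmup, repetition, and reporting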
3. CPU profiling (find the real hotspots)
3.1 Deterministic profiler (cProfile + pstats)
# profile_cpu.py
import cProfile as profile, pstats
from target import work
with profile.Profile() as pr:
    work()

ps = pstats.Stats(pr).sort_stats("cumtime")  # sort by cumulative time per function
ps.print_stats(20)
- cProfile measures every function call; overhead is acceptable for most app code.
- Sort by cumtime to find where time is actually spent.
3.2 Human-friendly views (snakeviz / gprof2dot)
python -m cProfile -o out.prof -m target
snakeviz out.prof # interactive flame graph in browser
# or:
gprof2dot -f pstats out.prof | dot -Tpng -o callgraph.png
3.3 Sampling profilers (low overhead)
- pyinstrument (beautiful call trees; ~1% overhead).
pyinstrument -r html -o prof.html -m target
- Scalene (separates Python vs native time; also memory profiling).
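If Scalene is installed, the simplest invocation profiles a whole script (the script name below is a placeholder):
scalene your_script.py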
Pick one profiler, gather evidence, then optimize the top 1–3 hotspots only.
4. Memory profiling & leaks
4.1 tracemalloc (built-in)
# profile_mem.py
import tracemalloc, target
tracemalloc.start()
target.work()
current, peak = tracemalloc.get_traced_memory()
print("current=%dB peak=%dB" % (current, peak))
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)
- Shows allocations by file/line, and peak usage.
4.2 Process-level memory
- psutil.Process(os.getpid()).memory_info().rss reports the resident set size (snippet below).
- mprof / memray for timeline views of memory growth.
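As a quick illustration (assuming psutil is installed; the workload function is hypothetical):

import os, psutil

proc = psutil.Process(os.getpid())
print("RSS before: %.1f MB" % (proc.memory_info().rss / 1e6))
work_that_allocates()  # hypothetical workload
print("RSS after:  %.1f MB" % (proc.memory_info().rss / 1e6))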
4.3 Common leak patterns
- Global caches growing unbounded (sketch below).
- Queues/pipelines that never drain on failure.
- Cycles with __del__ delaying GC (rare, but tricky).
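A minimal sketch of the first pattern and one way to cap it (expensive_compute is a hypothetical stand-in):

# leaky: a module-level cache that only ever grows
_cache = {}

def lookup(key):
    if key not in _cache:
        _cache[key] = expensive_compute(key)  # hypothetical expensive call
    return _cache[key]

# safer: bound the cache so steady-state memory stays flat
from functools import lru_cache

@lru_cache(maxsize=10_000)
def lookup_capped(key):
    return expensive_compute(key)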
5. Algorithmic wins (Big-O beats micro-tweaks)
5.1 Replace N² scans with hash lookups
# bad: O(n^2)
def dupe_any(xs):
    for i, a in enumerate(xs):
        for b in xs[i+1:]:
            if a == b:
                return True
    return False

# good: O(n)
def dupe_any_fast(xs):
    seen = set()
    for x in xs:
        if x in seen:
            return True
        seen.add(x)
    return False
5.2 Use the right tool
- set / dict for membership tests and joins.
- heapq for top-k selection and streaming medians (see the sketch after this list).
- bisect for searching and inserting into sorted arrays (avoid repeated full sorts).
- deque for FIFO/LRU.
- array / memoryview for dense numeric buffers.
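A short sketch of the heapq and bisect bullets above (values are arbitrary):

import heapq, bisect

scores = [17, 3, 42, 8, 23, 5]

# top-k without sorting everything: O(n log k)
top3 = heapq.nlargest(3, scores)       # [42, 23, 17]

# keep a list sorted as items arrive: O(log n) search + O(n) insert
sorted_scores = []
for s in scores:
    bisect.insort(sorted_scores, s)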
6. Data structures that actually matter
- list vs tuple: tuples are smaller and immutable; lists are flexible with amortized O(1) append.
- dict / set: O(1) average membership; pre-size if you know the count (dict.fromkeys trick).
- namedtuple / dataclasses: add slots=True to save memory and speed up attribute access.
- array('d'), memoryview: compact numeric storage without NumPy (buffer sketch below).
- bytes / bytearray for binary data; prefer bytes.join for building large byte strings.
from dataclasses import dataclass
@dataclass(slots=True)
class Point:
    x: float
    y: float
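And a quick illustration of the array('d') / memoryview and bytes.join bullets (values are arbitrary):

from array import array

a = array("d", (float(i) for i in range(1_000_000)))   # ~8 bytes per value, no per-item objects
view = memoryview(a)[1000:2000]                        # zero-copy slice of the underlying buffer

parts = [b"header", b"payload", b"footer"]
blob = b"".join(parts)                                 # one allocation instead of repeated +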
7. Kill slow Python loops (builtins, vectorization, batching)
7.1 Prefer builtins (they run in C)
# bad
total = 0
for x in xs: total += x
# good
total = sum(xs)
# bad
out = []
for x in xs: out.append(x*x)
# good
out = [x*x for x in xs] # still Python loop but optimized bytecode
7.2 Vectorize numeric workloads
If you’re doing heavy math, NumPy moves loops to C:
import numpy as np
a = np.arange(10_000_000, dtype=np.float64)
b = a * 0.1 + 2.0
- Keep arrays contiguous and dtypes consistent; avoid many tiny allocations.
7.3 Batch I/O and syscalls
- Read/write in chunks (e.g., readinto with a reusable buffer; sketch below).
- Group network/database calls (connection pooling, bulk operations).
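A minimal sketch of chunked reads with a reusable buffer (path and chunk size are placeholders; the per-chunk "work" is a toy):

def sum_bytes(path, chunk_size=1 << 20):
    buf = bytearray(chunk_size)            # one reusable buffer
    view = memoryview(buf)
    total = 0
    with open(path, "rb") as f:
        while True:
            n = f.readinto(buf)            # fills buf in place, no new bytes object per read
            if not n:
                break
            total += sum(view[:n])         # toy per-chunk work
    return total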
8. Concurrency: I/O-bound vs CPU-bound (GIL-aware)
- I/O-bound → asyncio or threads (concurrent.futures.ThreadPoolExecutor).
- CPU-bound → multiprocessing or a native accelerator (Numba/Cython/C extensions).
8.1 I/O with asyncio (concurrent HTTP)
import asyncio, aiohttp
async def fetch(session, url):
    async with session.get(url) as r:
        return await r.text()

async def main(urls):
    async with aiohttp.ClientSession() as s:
        return await asyncio.gather(*(fetch(s, u) for u in urls))

# asyncio.run(main(URLS))
8.2 CPU with processes
from concurrent.futures import ProcessPoolExecutor
def heavy(x):  # pure Python compute
    return sum(i*i for i in range(x))

if __name__ == "__main__":  # required on spawn-based platforms (Windows, macOS)
    with ProcessPoolExecutor() as ex:
        results = list(ex.map(heavy, [10_000_000] * 8))
Threads don’t bypass the GIL for Python bytecode. Use processes or native code for true CPU parallelism.
9. Native acceleration: Cython, Numba, C-extensions, PyPy
- Cython: static types + C compilation for hot loops.
- Numba: JIT for numeric Python (nopython mode, vectorize).
- C extensions / CFFI / ctypes: hand-written native code for tight kernels.
- PyPy: a JIT interpreter, great for long-running, loop-heavy pure Python.
9.1 Numba example
import numpy as np
from numba import njit
@njit(fastmath=True)
def pairwise_sum(a, b):
    n = a.shape[0]
    out = np.empty(n, np.float64)
    for i in range(n):
        out[i] = a[i] + b[i]
    return out
9.2 Cython sketch (setup-free with pyximport)
# sumsq.pyx
def sumsq(lst):
    cdef Py_ssize_t i, n = len(lst)
    cdef double s = 0.0
    for i in range(n):
        s += lst[i] * lst[i]
    return s
# driver.py
import pyximport; pyximport.install(language_level=3)
from sumsq import sumsq
print(sumsq([1.0]*1_000_000))
10. Memory optimization patterns
- Prefer generators over lists when you only need to stream once.
def read_lines(path):
    with open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")
- Use __slots__ / @dataclass(slots=True) to avoid per-instance dicts.
- Reuse buffers for binary I/O (bytearray, memoryview).
- Avoid building huge strings one + at a time → use ''.join(parts).
10.1 Track allocations with tracemalloc diffs
import tracemalloc, target
tracemalloc.start()
target.work()
snap1 = tracemalloc.take_snapshot()
target.work()
snap2 = tracemalloc.take_snapshot()
for stat in snap2.compare_to(snap1, "lineno")[:5]:
    print(stat)
11. Caching & memoization (and when not to)
- Use functools.lru_cache for pure functions called repeatedly with the same arguments.
from functools import lru_cache
@lru_cache(maxsize=4096)
def slow_parse(key: str) -> dict:
    ...
- Don’t cache unbounded data (memory blowups). Evict or cap size.
- Precompile regexes (rx = re.compile(...)) instead of recompiling them in loops (example below).
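For instance (the pattern and data are illustrative):

import re

WORD_RX = re.compile(r"\w+")   # compiled once, at import time

def count_words(lines):
    return sum(len(WORD_RX.findall(line)) for line in lines)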
12. Startup, packaging, and deployment
- Avoid expensive work at import time. Put it under if __name__ == "__main__":.
- Lazy-import large libraries in cold paths (sketch below).
- Use newer CPython versions—they routinely ship speedups.
- Build wheels for native deps in CI to avoid slow cold starts on deploy.
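A sketch of the lazy-import idea; pandas here stands in for any heavy dependency most code paths never touch:

def export_report(rows, path):
    import pandas as pd        # imported only when this cold path actually runs
    pd.DataFrame(rows).to_csv(path, index=False)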
13. A repeatable performance workflow (checklist)
- Define the target: which metric (latency p95, throughput, peak RSS, startup)?
- Write a reproducible benchmark (timeit/pyperf/pytest-benchmark).
- Profile (CPU first, then memory if needed).
- Fix the top hotspots (algorithm → data structure → builtins/vectorization → concurrency → native).
- Verify with the benchmark; record before/after numbers.
- Guard with CI benchmarks or perf budgets.
- Document what you changed and why.
Appendix A. Micro-optimizations (use sparingly)
- Localize attribute lookups in hot loops:
out = []
append = out.append  # bind the method once, outside the hot loop
for x in xs:
    append(x * x)
- Avoid tiny function calls inside critical loops (inline small helpers).
- Prefer in on sets/dicts; avoid repeated .index() on lists.
- Use str.join for concatenating many pieces.
- Pre-size lists when possible: [None]*n, then fill.
Always prove with a micro-benchmark. Micro-wins can lose if they hurt readability or cache locality.
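For example, a quick timeit check of the str.join claim (sizes are arbitrary; absolute numbers will vary by machine):

import timeit

parts = ["x"] * 10_000

def concat_plus():
    s = ""
    for p in parts:
        s += p
    return s

def concat_join():
    return "".join(parts)

print("+=  :", min(timeit.repeat(concat_plus, number=200, repeat=5)))
print("join:", min(timeit.repeat(concat_join, number=200, repeat=5)))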
Appendix B. Reusable snippets & templates
Minimal perf test harness
import time, statistics as stats
def bench(fn, *args, n=30, warmup=3, **kwargs):
    # args/kwargs are forwarded to fn; n and warmup are keyword-only
    for _ in range(warmup):
        fn(*args, **kwargs)
    runs = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        runs.append(time.perf_counter() - t0)
    return {
        "n": n, "min": min(runs), "median": stats.median(runs),
        "mean": stats.mean(runs), "p95": stats.quantiles(runs, n=100)[94],
        "max": max(runs),
    }
cProfile helper (context manager)
import cProfile as profile, pstats, io
from contextlib import contextmanager
@contextmanager
def cprofile(sort="cumtime", limit=25):
    pr = profile.Profile()
    pr.enable()
    try:
        yield
    finally:
        pr.disable()
        s = io.StringIO()
        pstats.Stats(pr, stream=s).sort_stats(sort).print_stats(limit)
        print(s.getvalue())
tracemalloc helper
import tracemalloc
class MemSnap:
    def __init__(self): tracemalloc.start()
    def take(self): return tracemalloc.take_snapshot()
    @staticmethod
    def diff(a, b, key="lineno", n=10):
        for stat in b.compare_to(a, key)[:n]:
            print(stat)
asyncio concurrency pattern
import asyncio, aiohttp

async def _fetch(session, url):
    async with session.get(url) as r:
        return await r.text()   # read the body while the response is still open

async def gather_texts(urls):
    async with aiohttp.ClientSession() as s:
        return await asyncio.gather(*(_fetch(s, u) for u in urls))
ProcessPool pattern for CPU
from concurrent.futures import ProcessPoolExecutor
def crunch(chunk): ...
def crunch_all(chunks):
    with ProcessPoolExecutor() as ex:
        return list(ex.map(crunch, chunks))