Akhilesh
NumPy Arrays: Why Not Just Use a Python List?

You have been using NumPy arrays since post 17.

np.array([1, 2, 3]). np.zeros((3, 4)). np.random.randn(100). You have typed these dozens of times without stopping to ask why.

Why not just use a Python list? Lists hold numbers. Lists can be looped over. Lists support indexing. What does NumPy actually add?

The answer matters more than you might think. When you understand why NumPy arrays are different, you stop fighting the library and start using it the way it was designed to be used.


The Speed Difference Is Not Small

import numpy as np
import time

size = 5_000_000

python_list = list(range(size))
numpy_array = np.arange(size, dtype=np.float64)

start = time.perf_counter()
result_list = [x * 2.5 for x in python_list]
list_time = time.perf_counter() - start

start = time.perf_counter()
result_numpy = numpy_array * 2.5
numpy_time = time.perf_counter() - start

print(f"Python list: {list_time:.4f} seconds")
print(f"NumPy array: {numpy_time:.4f} seconds")
print(f"NumPy is {list_time / numpy_time:.0f}x faster")

Output:

Python list: 0.8341 seconds
NumPy array: 0.0089 seconds
NumPy is 94x faster

94 times faster on 5 million numbers. The exact ratio depends on your machine, but the order of magnitude holds, and the gap scales. When you are processing millions of images or training on millions of records, it is the difference between waiting 2 minutes and waiting 3 hours.

The reason is how memory works.

A Python list stores references to objects scattered across memory. Each number is a full Python object carrying its own overhead: a header, type information, a reference count. To multiply a list by 2.5, Python visits each object individually, one at a time.

A NumPy array stores raw numbers packed tightly into one continuous block of memory. All the same type. No overhead. NumPy passes that block to optimized C code that processes everything in parallel, using CPU vectorization instructions designed exactly for this.

One scatters. One packs. Packing wins.
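You can see the overhead directly. Here is a small sketch (exact byte counts vary by Python version and platform) comparing the memory footprint of a million ints in a list against the same numbers in an array:

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The list stores n pointers, and each int is a separate object
# with its own header (roughly 28 bytes per int on CPython).
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

print(f"list : ~{list_bytes / 1e6:.0f} MB")    # tens of MB
print(f"array: {np_arr.nbytes / 1e6:.0f} MB")  # exactly 8 MB: 8 bytes per int64
```

The array's nbytes is exactly 8 bytes per element and nothing else. The list pays for pointers plus a full object per number.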


dtypes: Every Array Has a Type

Every element in a NumPy array must be the same type. That is the trade. Less flexibility, enormous speed.

int_array   = np.array([1, 2, 3, 4])
float_array = np.array([1.0, 2.0, 3.0])
bool_array  = np.array([True, False, True])

print(int_array.dtype)    # int64
print(float_array.dtype)  # float64
print(bool_array.dtype)   # bool

Output:

int64
float64
bool

int64 means 64-bit integer. Can store numbers from roughly -9 quintillion to +9 quintillion. float64 means 64-bit floating point. Standard decimal precision.
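A consequence worth knowing: when you mix types, NumPy silently upcasts everything to one common type instead of raising an error. A quick sketch:

```python
import numpy as np

mixed = np.array([1, 2.5, 3])
print(mixed.dtype)    # float64: the ints were upcast to floats
print(mixed)          # [1.  2.5 3. ]

stringy = np.array([1, "two", 3.0])
print(stringy.dtype)  # a Unicode string dtype: every element became text
```

If your numeric array unexpectedly has a string dtype, a stray string snuck into the data.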

You can specify the dtype explicitly.

small_ints  = np.array([1, 2, 3], dtype=np.int8)
half_float  = np.array([1.0, 2.0, 3.0], dtype=np.float32)

print(f"int8 range: {np.iinfo(np.int8).min} to {np.iinfo(np.int8).max}")
print(f"Memory: {small_ints.nbytes} bytes vs {np.array([1,2,3], dtype=np.int64).nbytes} bytes")

Output:

int8 range: -128 to 127
Memory: 3 bytes vs 24 bytes

int8 uses 3 bytes for three numbers; int64 uses 24. One eighth the memory. When you are loading image pixel data (values 0-255), using uint8 instead of float64 cuts your memory usage by 8x. On a dataset of 100,000 images, that is the difference between fitting in RAM and not.
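The trade-off with small dtypes is silent overflow: integer arrays wrap around instead of raising an error, so make sure your values actually fit. A minimal sketch:

```python
import numpy as np

pixels = np.array([250, 251, 252], dtype=np.uint8)
brightened = pixels + np.uint8(10)   # 250 + 10 = 260 does not fit in uint8
print(brightened)                    # wraps around to [4 5 6]
```

Upcast to a wider dtype before arithmetic that might exceed the range.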

This matters in deep learning. GPU memory is limited and expensive. Using float32 instead of float64 halves your memory usage with minimal precision loss. Most neural network training uses float32 for exactly this reason.
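A sketch of that halving, using astype to downcast (the matrix size here is invented for illustration):

```python
import numpy as np

weights = np.random.randn(1024, 1024)        # float64 by default
weights32 = weights.astype(np.float32)

print(f"float64: {weights.nbytes / 1e6:.1f} MB")    # 8.4 MB
print(f"float32: {weights32.nbytes / 1e6:.1f} MB")  # 4.2 MB
print(f"max difference: {np.abs(weights - weights32).max():.2e}")  # tiny
```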


Array Creation: All the Ways You Actually Need

print(np.zeros((3, 4)))        # all zeros
print(np.ones((2, 3)))         # all ones
print(np.full((2, 3), 7))      # filled with a value
print(np.eye(4))               # identity matrix
print(np.arange(0, 10, 2))     # like range(), returns array
print(np.linspace(0, 1, 5))    # 5 evenly spaced from 0 to 1

Output:

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

[[1. 1. 1.]
 [1. 1. 1.]]

[[7 7 7]
 [7 7 7]]

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

[0 2 4 6 8]

[0.   0.25 0.5  0.75 1.  ]

np.linspace is the one most beginners miss. It gives you n evenly spaced numbers between a start and end value, inclusive. When you are plotting a function or creating a range of learning rates to test, this is what you reach for.
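The key difference from np.arange: linspace includes the stop value, arange excludes it. A quick comparison:

```python
import numpy as np

print(np.arange(0, 1, 0.25))   # [0.   0.25 0.5  0.75]       stop excluded
print(np.linspace(0, 1, 5))    # [0.   0.25 0.5  0.75 1.  ]  stop included
```

With non-integer steps, arange can also surprise you through floating-point rounding, so prefer linspace when you know how many points you want.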


Indexing and Slicing

Everything from Python lists, extended to multiple dimensions.

data = np.array([
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120]
])

print(data[1, 2])         # single element: 70
print(data[0, :])         # first row: [10 20 30 40]
print(data[:, 1])         # second column: [20 60 100]
print(data[1:, 2:])       # bottom right 2x2 block
print(data[[0, 2], :])    # rows 0 and 2

Output:

70
[10 20 30 40]
[ 20  60 100]
[[ 70  80]
 [110 120]]
[[ 10  20  30  40]
 [ 90 100 110 120]]

That last one, data[[0, 2], :], is fancy indexing. Pass a list of indices and you get those specific rows back. This is how you select a subset of training samples without a loop.
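For instance, here is a sketch of drawing a random mini-batch with fancy indexing (the array names and sizes are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(20).reshape(10, 2)     # 10 samples, 2 features each

batch_idx = rng.choice(len(X), size=4, replace=False)  # 4 distinct row indices
batch = X[batch_idx]                 # fancy indexing pulls all 4 rows at once

print(batch.shape)                   # (4, 2)
```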


Boolean Indexing: Filter Without a Loop

This is one of the most useful NumPy features and most beginners do not use it enough.

scores = np.array([72, 88, 45, 91, 63, 54, 79, 96, 38, 82])

passing = scores[scores >= 60]
print(f"All scores:   {scores}")
print(f"Passing only: {passing}")

mask = scores >= 60
print(f"Mask: {mask}")

Output:

All scores:   [72 88 45 91 63 54 79 96 38 82]
Passing only: [72 88 91 63 79 96 82]
Mask: [ True  True False  True  True False  True  True False  True]

scores >= 60 creates a boolean array. Using that boolean array as an index filters the original array. No loop. No list comprehension. One line.

This is how data filtering works in NumPy at scale.

students = np.array([
    [72, 23, 1],
    [88, 25, 0],
    [45, 19, 1],
    [91, 31, 1],
    [54, 22, 0]
])

high_scorers_over_21 = students[(students[:, 0] >= 70) & (students[:, 1] > 20)]
print(high_scorers_over_21)

Output:

[[72 23  1]
 [88 25  0]
 [91 31  1]]

Score above 70 AND age above 20. Two conditions, one line, no loop.


Broadcasting: The Rule That Confuses Everyone

Broadcasting is NumPy's way of doing operations between arrays of different shapes.

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

row = np.array([10, 20, 30])

result = matrix + row
print(result)

Output:

[[11 22 33]
 [14 25 36]
 [17 28 39]]

matrix is (3, 3). row is (3,). They are different shapes. NumPy automatically expanded row across all three rows of matrix before adding.

Broadcasting rule: shapes are compared dimension by dimension from the right. Two dimensions are compatible when they are equal or when one of them is 1; a dimension of 1 gets stretched to match, and a missing leading dimension is treated as 1.
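Applying the rule to a (3, 1) column and a (3,) row: compared from the right, 1 stretches against 3, and the row's missing leading dimension counts as 1, so both stretch out to (3, 3). A sketch:

```python
import numpy as np

col = np.array([[0], [1], [2]])   # shape (3, 1)
row = np.array([0, 10, 20])       # shape (3,)

grid = col + row                  # both stretch to (3, 3)
print(grid.shape)                 # (3, 3)
print(grid)
# [[ 0 10 20]
#  [ 1 11 21]
#  [ 2 12 22]]
```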

The most common broadcasting you will do:

data = np.random.randn(1000, 8)
col_means = data.mean(axis=0)    # shape (8,)
col_stds  = data.std(axis=0)     # shape (8,)

normalized = (data - col_means) / col_stds   # (1000,8) - (8,) works via broadcasting
print(f"Normalized shape: {normalized.shape}")
print(f"Column means after: {normalized.mean(axis=0).round(4)}")

Output:

Normalized shape: (1000, 8)
Column means after: [-0.  0.  0. -0.  0.  0.  0. -0.]

Subtract the mean vector from every row, divide every row by the std vector. Zero loops. One line. Broadcasting handles the shape mismatch automatically.
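Normalizing along axis 1 (per row) is trickier: a (1000, 8) array minus a (1000,) vector fails the right-to-left comparison (8 vs 1000). Passing keepdims=True keeps the reduced axis as size 1 so broadcasting works. A sketch with a smaller invented shape:

```python
import numpy as np

data = np.random.randn(6, 4)

row_means = data.mean(axis=1, keepdims=True)  # shape (6, 1), not (6,)
centered = data - row_means                   # (6, 4) - (6, 1) broadcasts

print(centered.mean(axis=1).round(6))         # every row mean is now ~0
```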


Useful Operations Grouped Together

arr = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])

print(arr.sum())            # 39
print(arr.min())            # 1
print(arr.max())            # 9
print(arr.mean())           # 3.9
print(arr.std().round(2))   # 2.34
print(np.sort(arr))         # sorted copy
print(np.argsort(arr))      # indices that would sort it
print(np.unique(arr))       # unique values
print(np.cumsum(arr))       # running total

Output:

39
1
9
3.9
2.34
[1 1 2 3 3 4 5 5 6 9]
[1 3 6 0 9 2 8 4 7 5]
[1 2 3 4 5 6 9]
[ 3  4  8  9 14 23 25 31 36 39]

np.argsort returns the indices that would sort the array. So index 1 comes first (value 1), then index 3 (also value 1), and so on. Useful when you want to know the ranking of items, not just their sorted values. For example, ranking recommendations or finding top predictions.
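For example, a sketch of picking the top 3 predictions by score (the values are invented):

```python
import numpy as np

preds = np.array([0.10, 0.70, 0.05, 0.90, 0.40])

top3 = np.argsort(preds)[::-1][:3]   # reverse ascending order, take first 3
print(top3)          # [3 1 4]
print(preds[top3])   # [0.9 0.7 0.4]
```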


Reshaping and Stacking

flat = np.arange(24)
grid = flat.reshape(4, 6)
print(f"Flat: {flat.shape}  Grid: {grid.shape}")

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical   = np.vstack([a, b])
horizontal = np.hstack([a, b])

print(f"\nvstack: {vertical.shape}")
print(vertical)

print(f"\nhstack: {horizontal.shape}")
print(horizontal)

Output:

Flat: (24,)  Grid: (4, 6)

vstack: (4, 2)
[[1 2]
 [3 4]
 [5 6]
 [7 8]]

hstack: (2, 4)
[[1 2 5 6]
 [3 4 7 8]]

vstack stacks vertically (more rows). hstack stacks horizontally (more columns). You will use these when combining datasets, concatenating batches, or building feature matrices from multiple sources.
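One related trick you will see everywhere: passing -1 to reshape tells NumPy to infer that dimension from the others. A common sketch, flattening a batch of images into a feature matrix (the sizes are invented):

```python
import numpy as np

images = np.zeros((32, 28, 28))   # batch of 32 images, 28x28 pixels each

flat = images.reshape(32, -1)     # -1: NumPy infers 28 * 28 = 784
print(flat.shape)                 # (32, 784)

back = flat.reshape(-1, 28, 28)   # and back again
print(back.shape)                 # (32, 28, 28)
```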


Try This

Create numpy_practice.py.

You have exam scores for 200 students across 5 subjects:

np.random.seed(99)
scores = np.random.randint(30, 101, size=(200, 5))
subject_names = ["Math", "Science", "English", "History", "CS"]

Do all of the following without loops:

1. Calculate each student's average score (mean across axis 1). Find the top 10 students by average score using np.argsort.

2. Calculate each subject's mean and standard deviation (across axis 0). Which subject has the highest average? Which has the most variance?

3. Find all students who failed (below 40) in at least one subject. How many are there? Hint: use boolean indexing and np.any.

4. Normalize the entire scores matrix so each subject has mean 0 and std 1. Verify by printing the column means and stds after normalization.

5. Build a new matrix containing only the rows of students who passed all five subjects (all scores 40 or above). What is its shape?


What's Next

You know NumPy deeply now. The next tool is Pandas. If NumPy is for raw numerical computation, Pandas is for data that has labels, column names, mixed types, and the messy structure of real-world datasets. It is where you spend most of your time before models even enter the picture.
