DEV Community

Dipti Moryani

A Quick Overview of NumPy and SciPy

Statistical analysis remains one of the foundational building blocks of data science, machine learning, and scientific computing. Whether you're exploring a dataset, preparing features, or validating results, two Python libraries—NumPy and SciPy—form the backbone of almost every analytical workflow.
This article provides a practical walkthrough of statistical analysis with NumPy and SciPy, from array creation to descriptive statistics and the shape of a distribution.

  1. A Quick Overview of NumPy and SciPy
    NumPy
    NumPy (short for Numerical Python) provides:
    Multidimensional array objects (ndarray)
    Efficient vectorized operations
    Broadcasting
    Linear algebra operations
    Random sampling utilities
    The biggest advantage of NumPy is speed and memory efficiency. NumPy arrays store elements of a single type in a contiguous block of memory and run their operations in compiled C code, which makes arithmetic on large arrays much faster than equivalent loops over Python lists.
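As a rough illustration of the features listed above, the sketch below shows a vectorized operation, broadcasting, and the compact storage of an ndarray. The values and variable names are made up for this example.

```python
import numpy as np

# Vectorized operation: one expression applies to every element.
prices = np.array([10.0, 20.0, 30.0])   # illustrative values
halved = prices * 0.5

# Broadcasting: a (2, 1) column combines with a (3,) row into a (2, 3) result.
column = np.array([[10], [20]])
row = np.arange(3)
grid = column + row

# Compact layout: 1,000 int64 values take exactly 8,000 bytes of data.
numbers = np.arange(1000, dtype=np.int64)

print(halved)
print(grid)
print(numbers.nbytes)   # 8000
```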
    SciPy
    SciPy builds on NumPy and provides higher-level scientific routines, such as:
    Statistical tests and distributions
    Optimization
    Integration
    Signal processing
    Linear algebra extensions
    For statistics specifically, SciPy’s scipy.stats module includes:
    Probability distributions
    Hypothesis tests
    Skewness, kurtosis
    Confidence intervals
    Advanced statistical functions
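A minimal sketch of a few scipy.stats features from the list above. It uses the standard normal distribution and a small made-up sample; the sample values and the hypothesized mean of 5.0 are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Probability distribution: the standard normal (mean 0, standard deviation 1).
p_below_zero = stats.norm.cdf(0.0)   # probability of a value below 0
z_975 = stats.norm.ppf(0.975)        # 97.5th percentile, about 1.96

# Hypothesis test: does the mean of this made-up sample differ from 5.0?
sample = np.array([4.8, 5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 4.7])
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# Shape statistics for the same sample.
sample_skew = stats.skew(sample)
sample_kurtosis = stats.kurtosis(sample)   # excess kurtosis (normal = 0)

print(p_below_zero, round(z_975, 2))
print(t_stat, p_value)
print(sample_skew, sample_kurtosis)
```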

  2. Getting Started: Installing and Importing NumPy and SciPy
    You can install both libraries in two ways:
    Using pip
    pip install numpy scipy

Using Anaconda
NumPy and SciPy come preinstalled.
Once installed:
import numpy as np
from scipy import stats

  3. Creating Arrays in NumPy
Let’s create a simple 5×5 matrix.
a = np.arange(25).reshape(5, 5)

np.arange(25) creates a sequence from 0 to 24, and .reshape(5,5) forms it into a 5×5 matrix.
Checking the Data Type
a.dtype

The default integer type depends on the platform: it is int64 on most 64-bit systems, while on Windows NumPy versions before 2.0 defaulted to int32.
Number of Elements
a.size
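The attributes above can be inspected together. The sketch below also requests a dtype explicitly, which is one way to get the same result on every platform; the name a64 is used only for this example.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)
print(a.dtype)   # platform-dependent default integer type
print(a.size)    # 25
print(a.shape)   # (5, 5)

# Requesting the dtype explicitly removes the platform dependence.
a64 = np.arange(25, dtype=np.int64).reshape(5, 5)
print(a64.dtype)      # int64
print(a64.itemsize)   # 8 (bytes per element)
```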

1-D Array
arr = np.array([1, 2, 3, 4])

2-D Array
mat = np.array([[1, 2], [3, 4]])
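To see how the two arrays differ, a short sketch that compares their number of dimensions and shape, and shows that reshape can turn one into the other:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
mat = np.array([[1, 2], [3, 4]])

print(arr.ndim, arr.shape)   # 1 (4,)
print(mat.ndim, mat.shape)   # 2 (2, 2)

# reshape rearranges the same four elements into two rows of two.
reshaped = arr.reshape(2, 2)
print(np.array_equal(reshaped, mat))   # True
```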

  4. Basic Operations in NumPy
NumPy performs operations element-wise:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

a - b
a * b
a ** 2
a > 2
a < 2

Vectorization replaces explicit Python loops with operations that run in compiled code, which speeds up calculations considerably on large arrays.
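For reference, here are the results of the element-wise operations above, followed by the same product written as an explicit Python loop for comparison. The name looped is used only for this sketch.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(a - b)    # [-3 -3 -3]
print(a * b)    # [ 4 10 18]
print(a ** 2)   # [1 4 9]
print(a > 2)    # [False False  True]

# The same product as an explicit loop over plain Python lists.
looped = [x * y for x, y in zip(a.tolist(), b.tolist())]
print(looped)   # [4, 10, 18]
```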

  5. Indexing and Slicing
Using our earlier 5×5 matrix:
a = np.arange(25).reshape(5, 5)

First Row
a[0, :]

First Column
a[:, 0]

Element at 2nd Row, 3rd Column
a[1, 2]

Remember: indexing starts at 0, so a[1, 2] selects the second row and third column.
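A short sketch of what these expressions return for the 5×5 matrix. The sub-block slice and the negative index are extra examples not covered above.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)

first_row = a[0, :]
first_col = a[:, 0]
element = a[1, 2]

print(first_row)   # [0 1 2 3 4]
print(first_col)   # [ 0  5 10 15 20]
print(element)     # 7

# A slice selects a sub-block: rows 1-2 and columns 2-3 (end index excluded).
block = a[1:3, 2:4]
print(block)

# Negative indices count from the end.
last = a[-1, -1]
print(last)        # 24
```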

  6. Stacking Arrays
You can join arrays vertically or horizontally.
a = np.full((5, 5), 1)
b = np.full((5, 5), 2)

Vertical Stack
np.vstack([a, b])

Horizontal Stack
np.hstack([a, b])
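The difference between the two is easiest to see in the resulting shapes. The sketch below also shows np.concatenate, which gives the same results for 2-D arrays when the axis is stated explicitly.

```python
import numpy as np

a = np.full((5, 5), 1)
b = np.full((5, 5), 2)

v = np.vstack([a, b])   # b goes below a: rows are added
h = np.hstack([a, b])   # b goes to the right of a: columns are added

print(v.shape)   # (10, 5)
print(h.shape)   # (5, 10)

# For 2-D arrays, np.concatenate with an explicit axis is equivalent.
same_v = np.array_equal(v, np.concatenate([a, b], axis=0))
same_h = np.array_equal(h, np.concatenate([a, b], axis=1))
print(same_v, same_h)   # True True
```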

  7. Descriptive Statistics Using NumPy and SciPy
Descriptive statistics summarize data using:
Measures of central tendency (mean, median, mode)
Measures of spread (range, variance, standard deviation, IQR)
Shape of distribution (skewness)
Let’s break these down.

7.1 Mean
The mean is the average of all numbers.
data = np.arange(28).reshape(7, 4)
np.mean(data)

Mean by Columns
np.mean(data, axis=0)

Mean by Rows
np.mean(data, axis=1)
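For the 7×4 array above, the three calls give the following results. axis=0 collapses the rows and leaves one mean per column; axis=1 collapses the columns and leaves one mean per row.

```python
import numpy as np

data = np.arange(28).reshape(7, 4)

overall = np.mean(data)           # mean of all 28 values
by_column = np.mean(data, axis=0) # one value per column (4 values)
by_row = np.mean(data, axis=1)    # one value per row (7 values)

print(overall)     # 13.5
print(by_column)   # [12. 13. 14. 15.]
print(by_row)
```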

7.2 Median
Median is the middle value after sorting (or the average of the two middle values when the number of values is even).
np.median(data)
np.median(data, axis=0)

Median is more robust than mean when the data contains outliers.
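A small made-up example shows the effect of a single outlier on each measure; the values are chosen only for illustration.

```python
import numpy as np

clean = np.array([1, 2, 3, 4, 5])
with_outlier = np.array([1, 2, 3, 4, 100])   # last value replaced by an outlier

print(np.mean(clean), np.median(clean))                 # 3.0 3.0
print(np.mean(with_outlier), np.median(with_outlier))   # 22.0 3.0
```

The outlier moves the mean from 3 to 22, while the median stays at 3.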

7.3 Mode
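NumPy doesn’t have a built-in mode function, but SciPy does (the keepdims argument requires SciPy 1.9 or later):
stats.mode(data, axis=0, keepdims=True)

This returns:
Mode values
Their count of occurrences
Every value in this data array occurs exactly once, so each column reports its smallest value with a count of 1.
Mode is most useful for discrete or categorical data. Recent SciPy versions accept only numeric arrays, so categories need to be encoded as numbers first.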
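Because the example array has no repeated values, a sketch with a made-up array that does have a clear mode may be more informative. The np.unique cross-check is an alternative added here for comparison, not something the article prescribes.

```python
import numpy as np
from scipy import stats

ratings = np.array([1, 2, 2, 3, 3, 3])   # made-up values; 3 occurs most often

result = stats.mode(ratings, keepdims=True)   # keepdims needs SciPy 1.9+
print(result.mode[0], result.count[0])        # 3 3

# NumPy-only cross-check: count each distinct value and pick the largest count.
values, counts = np.unique(ratings, return_counts=True)
most_common = values[np.argmax(counts)]
print(most_common)   # 3
```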

7.4 Range
Range = max − min.
np.ptp(data, axis=0)

Limitations:
Sensitive to outliers
Doesn’t describe the distribution between extremes
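The sketch below computes the range for the 7×4 array and then uses a made-up set of readings to show how one outlier changes the result.

```python
import numpy as np

data = np.arange(28).reshape(7, 4)

col_range = np.ptp(data, axis=0)                   # "peak to peak" per column
manual = data.max(axis=0) - data.min(axis=0)       # same thing, written out
print(col_range)   # [24 24 24 24]

# Sensitivity to outliers: one extreme value dominates the range.
readings = np.array([10, 11, 12, 13, 14])   # illustrative values
with_outlier = np.append(readings, 100)
print(np.ptp(readings))       # 4
print(np.ptp(with_outlier))   # 90
```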

7.5 Variance
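Variance measures how far values spread from the mean: it is the average squared deviation from the mean. By default np.var divides by N (population variance); pass ddof=1 to divide by N − 1 and get the sample variance.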
np.var(data)

High variance = data points are far from the mean.
Low variance = data points are tightly packed.
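A worked example on a small made-up array, comparing the default population variance with the sample variance obtained through the ddof parameter of np.var:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # illustrative values, mean = 5

# Variance by hand: the mean of the squared deviations from the mean.
manual = np.mean((values - values.mean()) ** 2)

population_var = np.var(values)        # divides by N (ddof=0, the default)
sample_var = np.var(values, ddof=1)    # divides by N - 1

print(manual, population_var)   # 4.0 4.0
print(sample_var)               # 32 / 7, about 4.571
```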

7.6 Standard Deviation
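Standard deviation is the square root of variance, so it is expressed in the same units as the data. np.std also defaults to ddof=0 (population standard deviation).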
np.std(data)

Useful for:
Normal distribution analysis
Understanding spread
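A short sketch relating the standard deviation to the variance on a made-up array. As an example of its role in normal distribution analysis, it also computes the share of a normal distribution that lies within one standard deviation of the mean, using scipy.stats.norm.

```python
import numpy as np
from scipy import stats

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # illustrative values

std = np.std(values)                   # population standard deviation
root_of_var = np.sqrt(np.var(values))  # same value, computed from the variance
print(std, root_of_var)                # 2.0 2.0

# For a normal distribution, about 68% of values lie within one
# standard deviation of the mean.
within_one_std = stats.norm.cdf(1) - stats.norm.cdf(-1)
print(round(within_one_std, 4))        # 0.6827
```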

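7.7 Interquartile Range (IQR)
IQR is the difference between the 75th and 25th percentiles, so it covers the middle 50% of the data.
stats.iqr(data, axis=0)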

IQR is useful for:
Detecting outliers
Understanding central spread
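One common convention for outlier detection, assumed here rather than taken from the article, flags values more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch on made-up data:

```python
import numpy as np
from scipy import stats

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])   # illustrative values

q1, q3 = np.percentile(values, [25, 75])
iqr = stats.iqr(values)            # equals q3 - q1

# Fences from the 1.5 * IQR convention (an assumption of this sketch).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)                 # 3.25 7.75 4.5
print(lower_fence, upper_fence)    # -3.5 14.5
print(outliers)                    # [100]
```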

7.8 Skewness
Skewness measures the asymmetry of a distribution.
stats.skew(data, axis=0)

Positive skew: Long tail on the right
Negative skew: Long tail on the left
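Skewness helps diagnose whether mean or median is more reliable: in a strongly skewed distribution the mean is pulled toward the long tail, so the median is usually the better summary of a typical value.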
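The sign of the skewness can be checked on three small made-up arrays: one symmetric, one with a long right tail, and its mirror image with a long left tail.

```python
import numpy as np
from scipy import stats

symmetric = np.array([1, 2, 3, 4, 5])
right_tailed = np.array([1, 2, 3, 4, 100])   # one large value stretches the right tail
left_tailed = -right_tailed                  # mirror image: long tail on the left

skew_sym = stats.skew(symmetric)
skew_right = stats.skew(right_tailed)
skew_left = stats.skew(left_tailed)

print(skew_sym)     # 0.0 (symmetric data)
print(skew_right)   # positive
print(skew_left)    # negative

# With a long right tail, the mean sits above the median.
print(np.mean(right_tailed), np.median(right_tailed))   # 22.0 3.0
```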

  8. Why Descriptive Statistics Matter
Descriptive statistics:
Summarize large datasets quickly
Provide insights into shape, spread, and central value
Are foundational for machine learning feature preprocessing
Guide decisions before applying inferential statistics
However, descriptive stats cannot generalize beyond the dataset. To go further (test hypotheses, understand significance, build predictions) you need inferential statistics, which SciPy also supports.

Conclusion
NumPy and SciPy together provide a powerful, efficient, and flexible toolkit for statistical analysis. From array manipulation to descriptive statistics, these libraries underpin much of the Python data science ecosystem.
By mastering these basics, you set the stage for more advanced work in:
Machine learning
Data modeling
Signal processing
Scientific computation