Emil Ossola

Posted on Jun 21, 2023 • Edited on Jun 25, 2023

Learning Data Science with Python's NumPy Library

Data science, as we know it today, has its roots in fields such as statistics, computer science, and machine learning. The statistician William S. Cleveland proposed "data science" as a discipline in its own right in 2001, and the job title "data scientist" is generally credited to DJ Patil and Jeff Hammerbacher around 2008. The underlying practices, however, have existed for several decades.

Analysts were already extracting useful information from large datasets, a practice later known as "data mining", in the 1960s. From the 1980s onward, the spread of personal computers and, later, the internet led to an explosion of data, which in turn drove the development of new techniques and tools for managing and analyzing large datasets.

With the amount of data generated every day still growing, data science is more important than ever. It has become essential in industries such as healthcare, finance, retail, and transportation, where it is used to predict and prevent diseases, provide personalized recommendations, and optimize transportation systems.

Python's role in data science

Python has become a popular programming language in recent years, especially in the field of data science. It has various libraries and frameworks that make it powerful for data analysis, visualization, and manipulation. Python's simplicity, flexibility, and ease of learning have made it popular among data scientists.

Some of the most widely used libraries in Python for data science are NumPy, Pandas, Matplotlib, and Scikit-learn. NumPy is a fundamental library for scientific computing and data analysis in Python. It provides support for large, multi-dimensional arrays and matrices, along with a large collection of mathematical functions to operate on these arrays.

What is NumPy?

NumPy stands for Numerical Python. It provides a high-performance multi-dimensional array object, together with mathematical functions and other tools for working with these arrays.

NumPy is widely used in scientific computing, data science, and machine learning, and it is a core library of the scientific Python ecosystem, which includes SciPy, Matplotlib, Pandas, and scikit-learn. NumPy can be easily installed using the Python package manager pip.

Some of the advantages of NumPy are:

For numerical work, NumPy arrays are faster and more memory-efficient than Python's built-in lists, because they store elements of a single type in a contiguous block of memory.
NumPy arrays allow for mathematical operations to be performed on entire arrays rather than on individual elements in a Python loop, which is significantly faster.
NumPy arrays can be broadcast, so operations work on arrays with different but compatible shapes.
NumPy includes a wide range of statistical functions for working with arrays, such as mean, median, standard deviation, and more.
NumPy integrates well with other libraries in the Python data science ecosystem, such as pandas, matplotlib, and scikit-learn, making it an essential tool for data analysis and machine learning.
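
As a rough illustration of the first two points, the sketch below applies the same calculation once with a plain Python list and once with a NumPy array. The variable names and sample values are made up for the example.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]          # hypothetical sample values
prices_arr = np.array(prices)

# Plain Python: visit each element explicitly
with_tax_list = [p * 1.25 for p in prices]

# NumPy: one expression applied to the whole array
with_tax_arr = prices_arr * 1.25

print(with_tax_list)   # [12.5, 25.0, 37.5]
print(with_tax_arr)    # [12.5 25.  37.5]

# Statistical functions work on the whole array as well
print(prices_arr.mean())   # 20.0
```

Both versions produce the same numbers; the difference is that the NumPy version runs the loop in compiled code, which matters more as the array grows.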

NumPy is not the only data science library available in Python. There are other libraries that can be used for data manipulation and analysis, such as Pandas, Scikit-learn, and TensorFlow. Pandas is particularly useful for working with structured data, while Scikit-learn is mainly used for machine learning tasks. TensorFlow, on the other hand, is used for building and training deep learning models. While these libraries are useful in their own right, NumPy provides the foundation for many of them and is an essential tool for any data science project in Python. Its ability to handle large arrays and perform calculations efficiently makes it the go-to library for numerical computing.

How to Install NumPy?

Before diving into the NumPy library, it helps to know what its installation requires. NumPy is written primarily in Python and C, and it relies on BLAS and LAPACK libraries for fast linear algebra. The prebuilt packages that pip installs typically bundle an optimized implementation of these libraries, so for a typical installation you only need:

A Python 3 release supported by the NumPy version you want to install (the NumPy release notes list the supported versions)
pip, the Python package manager

To install and use NumPy in Python, first make sure you have Python 3 or above installed on your system. You can check your Python version by running the command python --version in your terminal or command prompt. Once you have Python installed, you can install NumPy using pip, the Python package manager. To install NumPy, run the command pip install numpy in your terminal or command prompt. NumPy should now be installed and ready to use in your Python environment.
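
To confirm that the installation worked, a quick check like the following can be run in a Python session. This is only a minimal sketch; the version string printed depends on what pip installed.

```python
import numpy as np

# Print the installed version to confirm the import works
print(np.__version__)

# A tiny sanity check: build an array and sum it
check = np.array([1, 2, 3]).sum()
print(check)  # 6
```

If the import fails with ModuleNotFoundError, pip most likely installed NumPy into a different Python environment than the one running the script.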

Alternatively, you can also use the Python online compiler provided by Lightly IDE to learn through this tutorial right in your web browser.

If you're using Lightly IDE, the setup process is rather simple. You can simply create an account or log in to your existing account, and create a Python project. Then, follow the above instructions to install NumPy using the pip package manager.

Python NumPy Fundamentals

NumPy (Numerical Python) is a powerful Python library widely used for scientific computing and data analysis. One of the key features of NumPy is its ability to efficiently handle large arrays and perform various operations on them. In this article, we will explore the fundamentals of NumPy, focusing on array creation, indexing and slicing, reshaping arrays, and broadcasting.

Creation of Arrays in Python NumPy

In NumPy, arrays can be created in several ways. The most straightforward way to create an array is to use the np.array() function. This function takes a sequence-like object, such as a list or a tuple, and creates an array from it.

import numpy as np

my_list = [1, 2, 3, 4, 5]
my_array = np.array(my_list)
print(my_array)

Output:

[1 2 3 4 5]

NumPy also provides several convenience functions for creating arrays of a certain shape or size, such as np.zeros() and np.ones(), which create arrays of all zeros and all ones, respectively.

import numpy as np

zeros_array = np.zeros((3, 4))  # creates a 3x4 array of zeros
print(zeros_array)

Output:

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
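
np.ones() works the same way as np.zeros(). The short sketch below also shows np.arange() and np.linspace(), two further creation functions that are not covered above but are often used alongside them.

```python
import numpy as np

ones_array = np.ones((2, 3), dtype=int)   # 2x3 array of integer ones
range_array = np.arange(0, 10, 2)         # start, stop (exclusive), step
line_array = np.linspace(0.0, 1.0, 5)     # 5 evenly spaced points, endpoints included

print(ones_array)
print(range_array)   # [0 2 4 6 8]
print(line_array)    # [0.   0.25 0.5  0.75 1.  ]
```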

Indexing and Slicing Arrays in Python NumPy

In NumPy, arrays can be indexed and sliced with the same syntax as Python lists. Indexing starts at 0, and negative indices access elements from the end of the array.

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
print(my_array[0])   # accessing the first element
print(my_array[-1])  # accessing the last element

Output:

1
5

Slicing allows us to extract a subset of an array using the colon (:) operator. We can specify the start and end indices of the slice, as well as the step size. Slicing can be useful when we need to work with a subset of a larger array, or when we want to extract specific rows or columns from a two-dimensional array.

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
print(my_array[1:4])    # slicing elements from index 1 to 3
print(my_array[:3])     # slicing elements from the beginning to index 2
print(my_array[3:])     # slicing elements from index 3 to the end
print(my_array[::2])    # slicing every second element

Output:

[2 3 4]
[1 2 3]
[4 5]
[1 3 5]
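
For two-dimensional arrays, a slice can be given for each axis, separated by a comma. The sketch below uses a made-up 3x3 array to pick out rows and columns. It also shows one difference from Python lists: a NumPy slice is a view onto the original data rather than a copy.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

print(grid[0])        # first row: [1 2 3]
print(grid[:, 1])     # second column: [2 5 8]
print(grid[:2, 1:])   # rows 0-1, columns 1-2

# A slice is a view: writing through it changes the original array
row_view = grid[0]
row_view[0] = 100
print(grid[0, 0])     # 100

# Use .copy() when an independent array is needed
row_copy = grid[1].copy()
row_copy[0] = -1
print(grid[1, 0])     # still 4
```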

Reshaping Arrays in Python NumPy

Reshaping arrays allows you to change their dimensions without modifying the underlying data. For instance, we can convert a one-dimensional array into a two-dimensional array by specifying the number of rows and columns.

Similarly, we can convert a two-dimensional array into a three-dimensional array or vice versa. When reshaping an array, it is essential to ensure that the new shape contains the same number of elements as the original array. Otherwise, we will get a ValueError indicating an incompatible shape.

import numpy as np

my_array = np.array([1, 2, 3, 4, 5, 6])
reshaped_array = my_array.reshape((2, 3))  # reshaping to a 2x3 array
print(reshaped_array)

Output:

[[1 2 3]
 [4 5 6]]
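
The following sketch illustrates the other cases described above: letting NumPy infer one dimension with -1, going from two to three dimensions, and the ValueError raised when the element counts do not match. The array contents are arbitrary.

```python
import numpy as np

my_array = np.arange(1, 13)          # 12 elements: 1..12

# -1 asks NumPy to work out that dimension: 12 / 4 = 3 rows
inferred = my_array.reshape((-1, 4))
print(inferred.shape)                # (3, 4)

# 2D -> 3D, still 12 elements
cube = inferred.reshape((2, 3, 2))
print(cube.shape)                    # (2, 3, 2)

# 12 elements cannot fill a 5x3 array
try:
    my_array.reshape((5, 3))
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)
```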

Broadcasting in Python NumPy

Broadcasting is a powerful feature in NumPy that allows you to perform operations on arrays of different shapes. When operating on arrays with different dimensions, NumPy automatically broadcasts the smaller array to match the shape of the larger array, provided the two shapes are compatible. This eliminates the need for explicit looping and greatly improves computational efficiency. Let's see an example:

import numpy as np

a = np.array([1, 2, 3])
b = 2

result = a * b  # broadcasting b to match the shape of a
print(result)

Output:

[2 4 6]

In the above example, the scalar value 2 is broadcast to match the shape of the array a, and the multiplication is performed element-wise.
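
Broadcasting is not limited to scalars. As a small sketch with made-up arrays, the example below adds a one-dimensional row and a column to a two-dimensional array, and shows the error raised when the shapes cannot be matched.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

# The row is stretched across both rows of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is stretched across all three columns
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]

# Shapes (2, 3) and (2,) do not line up on the last dimension
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print("ValueError:", err)
```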

NumPy provides numerous powerful functions and methods for manipulating arrays, performing mathematical operations, and conducting advanced computations. Understanding the fundamental concepts of array creation, indexing, slicing, reshaping, and broadcasting sets the foundation for utilizing NumPy's capabilities effectively.

Python NumPy Operations

NumPy provides a wide range of mathematical functions to perform operations on arrays, making the task of data manipulation easier for data scientists.

A few of the basic mathematical operations that can be performed on NumPy arrays include addition, subtraction, multiplication, division, and exponentiation. These operations can be easily performed on arrays using the respective operators, and the results are returned in the form of a new array.

For instance, the + operator is used for addition, and the - operator is used for subtraction. Similarly, the * operator is used for multiplication, the / operator is used for division, and the ** operator is used for exponentiation.
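
A short sketch of these operators on two small example arrays:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print(a + b)    # [11 22 33 44]
print(b - a)    # [ 9 18 27 36]
print(a * b)    # [ 10  40  90 160]
print(b / a)    # [10. 10. 10. 10.]
print(a ** 2)   # [ 1  4  9 16]

# The operands are left unchanged; each expression built a new array
print(a)        # [1 2 3 4]
```

Note that / always produces floating-point results, even when both arrays hold integers.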

Statistical operations in Python NumPy

Other than mathematical operations, NumPy can also be used to perform various statistical operations. Some of the commonly used functions in NumPy for statistical calculations are mean(), median(), min(), max(), var(), std(), percentile(), corrcoef(), cov(), histogram(), etc. These functions can be applied to an entire array or a specific axis of the array.

Here are examples of how to use some common NumPy functions:

mean():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
mean_value = np.mean(my_array)
print(mean_value)

Output:

3.0

median():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
median_value = np.median(my_array)
print(median_value)

Output:

3.0

min():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
min_value = np.min(my_array)
print(min_value)

Output:

1

max():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
max_value = np.max(my_array)
print(max_value)

Output:

5

var():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
variance = np.var(my_array)
print(variance)

Output:

2.0

std():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
standard_deviation = np.std(my_array)
print(standard_deviation)

Output:

1.4142135623730951

percentile():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
percentile_value = np.percentile(my_array, 75)
print(percentile_value)

Output:

4.0

corrcoef():

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
correlation_coefficient = np.corrcoef(x, y)
print(correlation_coefficient)

Output:

[[ 1. -1.]
 [-1.  1.]]

cov():

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])
covariance = np.cov(x, y)
print(covariance)

Output:

[[ 2.5 -2.5]
 [-2.5  2.5]]

histogram():

import numpy as np

my_array = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
hist, bins = np.histogram(my_array, bins=[1, 3, 5])
print(hist)
print(bins)

Output:

[3 6]
[1 3 5]

These are just a few examples of the many functions available in NumPy for statistical analysis and data manipulation.
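
Most of these functions accept an axis argument, as mentioned at the start of this section. The sketch below uses a small made-up 2x3 array to show the difference. It also shows the ddof argument, which explains why np.var() returned 2.0 above while np.cov() reported 2.5 for the same values.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(np.mean(scores))           # 3.5, over all six values
print(np.mean(scores, axis=0))   # [2.5 3.5 4.5], one mean per column
print(np.mean(scores, axis=1))   # [2. 5.], one mean per row

# np.var() divides by N by default; ddof=1 divides by N - 1 instead,
# which is what np.cov() does by default
values = np.array([1, 2, 3, 4, 5])
print(np.var(values))            # 2.0
print(np.var(values, ddof=1))    # 2.5
```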

Sorting and searching arrays in Python NumPy

NumPy provides several functions to sort and search arrays. Sorting an array arranges its elements in ascending or descending order. The function np.sort() returns a sorted copy of an array, while the method arr.sort() sorts an array in place. You can also specify the axis along which to sort multidimensional arrays.

Here are examples of sorting and searching arrays using NumPy:

Sorting in ascending order:

import numpy as np

my_array = np.array([3, 1, 4, 2, 5])
sorted_array = np.sort(my_array)
print(sorted_array)

Output:

[1 2 3 4 5]

Sorting in descending order:

import numpy as np

my_array = np.array([3, 1, 4, 2, 5])
sorted_array = np.sort(my_array)[::-1]
print(sorted_array)

Output:

[5 4 3 2 1]
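
The in-place method and the axis argument mentioned above can be sketched as follows, using arbitrary example values:

```python
import numpy as np

in_place = np.array([3, 1, 4, 2, 5])
in_place.sort()                  # sorts the array itself and returns None
print(in_place)                  # [1 2 3 4 5]

table = np.array([[9, 1, 8],
                  [3, 7, 2]])

by_row = np.sort(table, axis=1)  # sort within each row
print(by_row)
# [[1 8 9]
#  [2 3 7]]

by_col = np.sort(table, axis=0)  # sort within each column
print(by_col)
# [[3 1 2]
#  [9 7 8]]
```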

Searching an array returns the indices of elements that meet a specified condition. The function np.where() returns the indices where a condition is true, while the method arr.nonzero() returns the indices of nonzero elements in an array.

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
indices = np.where(my_array == 3)
print(indices)

Output:

(array([2]),)

In the above example, np.where() returns the indices where the value 3 is found in the array.
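
The nonzero() method mentioned above works in a similar way. The sketch below also shows two related idioms that are not covered in the text: selecting values with a boolean mask, and the three-argument form of np.where(). The sample values are made up.

```python
import numpy as np

my_array = np.array([0, 7, 0, 3, 9])

# Indices of the nonzero elements
nonzero_indices = my_array.nonzero()
print(nonzero_indices)          # (array([1, 3, 4]),)

# A boolean mask selects the matching values instead of their indices
large_values = my_array[my_array > 5]
print(large_values)             # [7 9]

# Three-argument form: take 5 where the condition holds, else the original value
capped = np.where(my_array > 5, 5, my_array)
print(capped)                   # [0 5 0 3 5]
```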

You can also use the np.argmin() and np.argmax() functions to find the indices of the minimum and maximum values in an array.

import numpy as np

my_array = np.array([1, 5, 3, 2, 4])
min_index = np.argmin(my_array)
max_index = np.argmax(my_array)
print(min_index)
print(max_index)

Output:

0
1

In the above example, np.argmin() returns the index of the minimum value in the array, and np.argmax() returns the index of the maximum value.

Checking if a value or condition exists using np.any() and np.all():

import numpy as np

my_array = np.array([1, 2, 3, 4, 5])
contains_three = np.any(my_array == 3)
all_positive = np.all(my_array > 0)
print(contains_three)
print(all_positive)

Output:

True
True

In the above example, np.any() checks if the array contains the value 3, while np.all() checks if all elements in the array are greater than 0.

Using NumPy for Data Science

NumPy is a useful Python library for data analysis, which provides many functions to work with arrays. One of its core functionalities is the ability to load data from various file formats, including .txt, .csv, and .npy.

To load data from text files, we typically use the loadtxt() function, which can be configured with options such as the delimiter. Another option is the genfromtxt() function, which is more flexible and can handle different data types, separators, and missing values. Once the data is loaded, NaN values can be replaced with a specified value or removed altogether.

Here are examples of loading data from different sources:

Loading data from a text file using np.loadtxt():

import numpy as np

data = np.loadtxt('data.txt')
print(data)

In the above example, the np.loadtxt() function is used to load data from a text file named 'data.txt' into a NumPy array. The resulting array is then printed.

Loading data from a CSV file using np.genfromtxt():

import numpy as np

data = np.genfromtxt('data.csv', delimiter=',')
print(data)

Both functions can be used to load data into NumPy arrays, which can then be manipulated and analyzed using NumPy's extensive capabilities.
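
The examples above assume that data.txt and data.csv already exist. To try the functions without any files, the following sketch uses io.StringIO and io.BytesIO as stand-ins for files on disk; the file contents and column names are invented for the example.

```python
import io
import numpy as np

# Whitespace-separated numbers, as np.loadtxt() expects by default
txt_data = np.loadtxt(io.StringIO("1 2 3\n4 5 6\n"))
print(txt_data.shape)        # (2, 3)

# CSV with a header row and one missing field in the second data row
csv_text = "height,weight\n1.70,65\n1.82,\n1.65,58\n"
data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,           # skip the column-name row
)
print(data)                  # the missing field is read as nan

# Round trip through NumPy's binary .npy format
buffer = io.BytesIO()
np.save(buffer, data)
buffer.seek(0)
restored = np.load(buffer)
print(restored.shape)        # (3, 2)
```

With real files, the StringIO and BytesIO objects would simply be replaced by file paths.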

Cleaning data with NumPy

Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in datasets. It involves transforming raw, messy data into a clean and reliable format suitable for analysis, modeling, and decision-making.

NumPy provides several tools for handling missing values in data, such as the constant np.nan and the functions np.isnan() and np.isfinite(). NaN ("not a number") is a floating-point value that represents missing or undefined data.

Let's take a simple example using NumPy and Pandas together to illustrate data cleaning steps:

import numpy as np
import pandas as pd

# Creating a sample dataset with missing values and outliers
data = np.array([[1, 2, 3, np.nan],
                 [4, np.nan, 6, 7],
                 [8, 9, 10, 11],
                 [12, 13, 14, 15],
                 [16, 17, np.nan, 19]])

# Converting NumPy array to Pandas DataFrame
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

# Data cleaning steps
# 1. Handling missing values
df.dropna(inplace=True)  # Remove rows with any missing values

# 2. Handling outliers (e.g., removing values above a threshold)
threshold = 10
df = df[df['A'] <= threshold].copy()  # .copy() avoids a possible SettingWithCopyWarning below

# 3. Data formatting and standardization (if applicable)
df['A'] = df['A'].astype(int)  # Convert column to integer data type

# 4. Duplicates (if applicable)
df.drop_duplicates(inplace=True)

# Print the cleaned DataFrame
print(df)

In the above example, we create a sample dataset using a NumPy array with missing values (np.nan) and outliers. We then convert the array into a Pandas DataFrame. The subsequent data cleaning steps using Pandas include:

Handling missing values: We use the dropna() function to remove rows containing any missing values.
Handling outliers: We filter the DataFrame to keep only the rows whose value in column 'A' does not exceed the threshold value (threshold).
Data formatting and standardization: We convert the 'A' column to integer data type using the astype() function.
Duplicates: We remove any duplicate rows using the drop_duplicates() function.

Finally, we print the cleaned DataFrame to observe the results.
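
For comparison, the missing-value step can also be done with NumPy alone. The sketch below is one possible approach rather than the only one: it uses np.isnan() to locate the gaps, then either drops the affected rows or fills each gap with its column mean via np.nanmean(), a function not mentioned above.

```python
import numpy as np

data = np.array([[1, 2, 3, np.nan],
                 [4, np.nan, 6, 7],
                 [8, 9, 10, 11]])

# Boolean mask marking the missing entries
missing = np.isnan(data)
print(missing.sum())                      # 2

# Option 1: keep only the rows without any NaN
complete_rows = data[~missing.any(axis=1)]
print(complete_rows)                      # [[ 8.  9. 10. 11.]]

# Option 2: replace each NaN with the mean of its column,
# computed while ignoring the NaN values
column_means = np.nanmean(data, axis=0)
filled = np.where(missing, column_means, data)
print(filled)
```

Which option is appropriate depends on the dataset: dropping rows loses information, while filling with a mean can hide real gaps.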

Visualizing Data with NumPy

NumPy does not draw plots itself, but it prepares the data that plotting libraries work with. For example, np.histogram() computes the bin counts behind a histogram, and np.linspace() generates evenly spaced values for the x-axis of a line plot. Once plotted, these arrays help reveal the underlying patterns and distributions of data.

Additionally, NumPy arrays can be used to create visualizations with Python in conjunction with other libraries such as Matplotlib and Seaborn. These libraries are particularly useful for creating complex and informative graphs and charts that can be used to communicate important insights from data analyses.
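
As a minimal sketch of the NumPy side of this workflow, the example below prepares the arrays for a line plot and a histogram. The sample values are made up, and the plotting itself is left to a library such as Matplotlib.

```python
import numpy as np

# x values for a line plot: 5 evenly spaced points between 0 and 2
x = np.linspace(0, 2, 5)
y = x ** 2
print(x)   # [0.  0.5 1.  1.5 2. ]
print(y)   # [0.   0.25 1.   2.25 4.  ]

# Bin counts and bin edges for a histogram of some sample values
samples = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])
counts, edges = np.histogram(samples, bins=4)
print(counts)   # [1 2 3 3]
print(edges)    # [1. 2. 3. 4. 5.]
```

With Matplotlib installed, x and y could then be passed to plt.plot(x, y) to draw the curve.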

Future of data science with Python NumPy

NumPy has been a fundamental tool for data science and machine learning. It has a wide array of mathematical and scientific functions that make it easier for developers to perform data analysis, data manipulation, and data visualization.

With the growth of Big Data and Machine Learning, NumPy will continue to play a vital role in data science. Its integration with other libraries like Pandas, SciPy, and Matplotlib will continue to make data analysis, data visualization, and data manipulation in Python more accessible to data scientists.

Learning Python with a Python online compiler

Learning a new programming language might be intimidating if you're just starting out. Lightly IDE, however, makes learning Python simple and convenient for everybody. Lightly IDE was made so that even complete novices may get started writing code.

Lightly IDE's intuitive design is one of its many strong points. If you've never written any code before, don't worry; the interface is straightforward. You can get started with Python programming in our Python online compiler with only a few clicks.

The best part of Lightly IDE is that it is cloud-based, so your code and projects are always accessible from any device with an internet connection. You can keep studying and coding regardless of where you are at any given moment.

Lightly IDE is a great place to start if you're interested in learning Python. Learn and collaborate with other learners and developers on your projects and receive comments on your code now.