Shrutik


Pandas & NumPy: Building Blocks of Data Science

Pandas and NumPy are essential tools in data science for data manipulation and numerical computations. This article offers a brief overview of their core functionalities and use cases.

Pandas
Overview:

Pandas is a powerful, open-source Python library for data manipulation and analysis. It provides data structures and functions for performing efficient operations on data.
It is widely used for data wrangling, cleaning, preparation, and analysis, making it a fundamental tool in data science.

Key Features:

  1. Data Structures:
    Series: One-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
    DataFrame: Two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

  2. Data Manipulation:
    Indexing and Selecting Data: Label-based and integer-based indexing for accessing data.
    Merging and Joining: Functions like merge(), join(), and concat() to combine data from multiple DataFrames.
    Group By: Splitting data into groups based on some criteria and then applying a function to each group independently.

  3. Data Cleaning:
    Handling missing data by filling, dropping, or interpolating values.
    Removing duplicates and filtering unwanted data.

  4. Input and Output Tools:
    Reading from and writing to various file formats, such as CSV, Excel, SQL databases, and JSON.

  5. Data Aggregation and Transformation:
    Aggregation functions like sum(), mean(), min(), max(), and custom aggregations.
    Transformation functions like apply(), map(), and vectorized string operations.

  6. Time Series Analysis:
    Date and time functionality, including resampling, frequency conversion, and time zone handling.
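
To make a few of these features concrete, here is a minimal sketch that touches Series, DataFrame, label-based and integer-based selection, cleaning, merging, and group by. The names and values (people, cities, city_id) are made up for illustration and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Series: one-dimensional values plus an index of labels
ages = pd.Series([24, 27, 22], index=["alice", "bob", "charlie"], name="age")

# DataFrame: two-dimensional table with labeled rows and columns
# (hypothetical data with one missing age and one duplicated row)
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Charlie", "Charlie"],
        "age": [24.0, np.nan, 22.0, 22.0],
        "city_id": [1, 2, 1, 1],
    }
)
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Chicago", "Houston"]})

# Label-based (.loc) and integer-based (.iloc) selection
bob_age = ages.loc["bob"]    # select by label
first_age = ages.iloc[0]     # select by position

# Cleaning: drop exact duplicate rows, then fill the missing age with the mean
clean = people.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].mean())

# Merging: attach the city name through the shared city_id column
merged = clean.merge(cities, on="city_id", how="left")

# Group By: average age per city
mean_age_by_city = merged.groupby("city")["age"].mean()

print(merged)
print(mean_age_by_city)
```

Filling with the mean is only one possible choice here; dropping the row with dropna() or interpolating may suit other datasets better.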

Uses:

  1. Data Manipulation and Cleaning:
    Essential for handling and cleaning structured data.
    Provides functions to handle missing data and to filter and reformat datasets.

  2. Data Analysis:
    Used for exploratory data analysis (EDA) and descriptive statistics.
    Functions like groupby, merge, pivot, and melt are fundamental for data aggregation and transformation.

  3. Time Series Analysis:
    Robust support for time series data, including date-time functionality, resampling, and rolling windows.

  4. Data Wrangling:
    Used for preparing data for machine learning models and other analytical tasks.

  5. Integration with Databases and File Systems:
    Capable of reading from and writing to various file formats (CSV, Excel, SQL databases, JSON).
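
As a rough illustration of the analysis and time series uses listed above, the sketch below runs describe(), pivot(), melt(), resample(), and rolling() on a small, invented sales table. The column names and numbers are assumptions chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical daily sales for two stores over six days
sales = pd.DataFrame(
    {
        "date": pd.date_range("2024-01-01", periods=6, freq="D").repeat(2),
        "store": ["A", "B"] * 6,
        "units": [3, 5, 4, 6, 2, 7, 5, 5, 6, 4, 1, 8],
    }
)

# Descriptive statistics for a quick first look (EDA)
summary = sales["units"].describe()

# pivot: one row per date, one column per store
wide = sales.pivot(index="date", columns="store", values="units")

# melt: back from wide to long format
long_format = wide.reset_index().melt(
    id_vars="date", var_name="store", value_name="units"
)

# Time series: totals per 2-day bin, and a 3-day rolling mean for store A
two_day_totals = wide.resample("2D").sum()
rolling_a = wide["A"].rolling(window=3).mean()

print(summary)
print(two_day_totals)
print(rolling_a)
```

The first two values of the rolling mean are NaN because a 3-day window is not yet full at that point.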

Fields:
Data Science
Business Analytics
Financial Analysis
Social Sciences
Healthcare
Marketing

Technical Details:
Language: Written in Python, but uses C and Cython for performance-critical parts.
Data Structures: Series (1D) and DataFrame (2D).
Integration: Built on top of NumPy, enabling seamless interaction. Can be integrated with visualization libraries like Matplotlib and Seaborn.

Example Code:

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [24, 27, 22, 32],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}
df = pd.DataFrame(data)

# Data manipulation
df['Age'] = df['Age'] + 1

# Filtering data
df_filtered = df[df['Age'] > 25]

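# Grouping and aggregation (select the numeric column first: averaging
# the string column 'Name' raises a TypeError in recent pandas versions)
grouped = df.groupby('City')['Age'].mean()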

print(df)
print(df_filtered)
print(grouped)

Installing Pandas
Before working with Pandas, check whether it is already installed on your system. If it is not, install it with pip.

Follow these steps to install Pandas on Windows:
Step 1: Type 'cmd' in the search box and open the Command Prompt.
Step 2: If pip is not on your PATH, use the cd command to move to the folder where pip is installed.
Step 3: Run the command:

pip install pandas

Importing Pandas
Once Pandas is installed, import the library before using it.
It is generally imported as follows:

import pandas as pd

Note: Here, pd is an alias for Pandas. The alias is not required to import the library; it simply saves typing each time a method or property is called.
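
A quick way to confirm that the installation and import worked is to print the installed version. This is a small sketch; the version string you see depends on your environment.

```python
import pandas
import pandas as pd

# Print the installed version to confirm the import works
print(pd.__version__)

# The alias is only a convenience: both names refer to the same module
print(pd is pandas)
```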

NumPy
Overview:

NumPy is the fundamental package for scientific computing with Python. It provides support for arrays, matrices, and many mathematical functions to operate on these data structures efficiently.

Key Features:

  1. N-dimensional Array Object (ndarray):
    Efficient multi-dimensional array operations.
    Supports a variety of data types including integers, floats, and complex numbers.

  2. Mathematical Functions:
    Element-wise operations, such as addition, subtraction, multiplication, and division.
    Mathematical operations like exponential, logarithmic, trigonometric, and statistical functions.

  3. Linear Algebra:
    Functions for linear algebra operations like dot product, matrix multiplication, determinants, and singular value decomposition.

  4. Random Number Generation:
    Tools for generating random numbers and creating random samples from various probability distributions.

  5. Broadcasting:
    Allows arithmetic operations on arrays of different shapes without needing explicit looping.

  6. Integration with C/C++ and Fortran:
    Tools to interface with code written in these languages, enhancing performance for computationally heavy tasks.
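
The sketch below walks through several of these features in one place: an ndarray with an explicit dtype, element-wise math, broadcasting, two linear algebra routines, and seeded random sampling. The arrays and the seed value are arbitrary choices for illustration.

```python
import numpy as np

# ndarray with an explicit data type
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)   # shape (2, 3)

# Element-wise arithmetic and mathematical functions
doubled = a * 2
roots = np.sqrt(a)

# Broadcasting: the shape (3,) row is stretched across both rows of a
row = np.array([10, 20, 30])
shifted = a + row

# Linear algebra: determinant and singular value decomposition
m = np.array([[1.0, 2.0], [3.0, 4.0]])
det = np.linalg.det(m)          # mathematically 1*4 - 2*3 = -2
u, s, vt = np.linalg.svd(m)     # m is approximately u @ np.diag(s) @ vt

# Random numbers from a seeded generator, so runs are reproducible
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(shifted)
print(det)
print(samples)
```

Because of floating-point rounding, the determinant prints as a value very close to -2 rather than exactly -2.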

Uses:

  1. Numerical Computations:
    Fundamental for numerical operations on large datasets.
    Used in mathematical calculations, including linear algebra, statistics, and Fourier transforms.

  2. Scientific Computing:
    Widely used in fields like physics, chemistry, engineering, and finance for simulations, modeling, and algorithm development.

  3. Machine Learning:
    Provides efficient data structures for handling input data and performing matrix operations in machine learning algorithms.

  4. Data Analysis:
    Supports vectorized operations, which are essential for performance in data analysis tasks.
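
To show what "vectorized" means in practice, this small sketch computes the same squares with an explicit Python loop and with a single array expression, then applies a statistic and a Fourier transform. The input array is an arbitrary example.

```python
import numpy as np

values = np.arange(1, 6)   # array([1, 2, 3, 4, 5])

# Explicit Python loop: one element at a time
squares_loop = np.array([v * v for v in values])

# Vectorized equivalent: one expression over the whole array
squares_vec = values ** 2

# Statistics and a discrete Fourier transform on the same data
mean = values.mean()
spectrum = np.fft.fft(values)

print(squares_vec)
print(mean)
```

Both approaches give the same result; on large arrays the vectorized form is usually much faster because the loop runs in compiled code rather than in the Python interpreter.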

Fields:
Data Science
Machine Learning
Artificial Intelligence
Physics
Chemistry
Engineering
Finance

Technical Details:
Language: Written in Python and C; the performance-critical core is implemented in C and used from Python.
Array Object: ndarray, which allows for efficient manipulation and computation of multi-dimensional arrays.
Broadcasting: Mechanism that allows arithmetic operations on arrays of different shapes.
Integration: Can be integrated with other scientific computing libraries like SciPy, Matplotlib, and Pandas.

Example Code:

import numpy as np

# Creating an array
array = np.array([1, 2, 3, 4, 5])

# Array operations
array = array * 2

# Reshaping an array
reshaped_array = array.reshape((1, 5))

# Linear algebra example
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
matrix_product = np.dot(matrix_a, matrix_b)

print(array)
print(reshaped_array)
print(matrix_product)


Install Python NumPy:
NumPy can be installed with the following pip command, which works on Windows, macOS, and Linux wherever Python and pip are available:

pip install numpy

On Windows, run the command from the Command Prompt or PowerShell.

Comparison and Integration
Pandas vs. NumPy:

  1. Use Case:
    Pandas: Best for data manipulation, analysis, and handling tabular data.
    NumPy: Best for numerical computations, working with multi-dimensional arrays, and performing vectorized operations.

  2. Data Structures:
    Pandas: Series and DataFrame.
    NumPy: ndarray.

  3. Functionality:
    Pandas: Rich functionality for data wrangling, such as merging, joining, and group-by operations.
    NumPy: Extensive support for mathematical operations and efficient handling of numerical data.

  4. Integration:
    Seamless Interoperability: Pandas is built on top of NumPy, meaning Pandas data structures utilize NumPy arrays internally. This allows for efficient computation and easy conversion between Pandas and NumPy.

Example:
Convert a Pandas DataFrame to a NumPy array using the .to_numpy() method (the older .values attribute also works).
Convert a NumPy array to a Pandas DataFrame using pd.DataFrame().
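
Here is a minimal sketch of the round trip in both directions. The column names x and y are placeholders; the example uses both the .values attribute and the .to_numpy() method, which the Pandas documentation recommends for new code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# DataFrame -> NumPy array
arr = df.to_numpy()      # mixed int/float columns are upcast to float64
arr_values = df.values   # older attribute, same result here

# NumPy array -> DataFrame, reusing the original column labels
back = pd.DataFrame(arr, columns=df.columns)

print(type(arr), arr.dtype, arr.shape)
print(back)
```

Note that the round trip does not preserve the original integer dtype of column x, because a single NumPy array holds one dtype for all its elements.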

Conclusion:
In this article, we've covered the basics of Pandas and NumPy, including their core functionalities and uses in data science.
Detailed explanations, code examples, and advanced operations will be covered in separate posts dedicated to each topic.
Stay tuned for a deeper dive into the technical aspects and practical applications of these powerful tools.

I hope this post was informative and helpful.
If you have any questions, please feel free to leave a comment below.

Happy Coding 👍🏻!
Thank You

Cover Image Credits: Caitlin Muncy | Dribbble
