
Saint

A Beginner's Guide to Data Analysis with Python: Using Pandas and NumPy

Data is everywhere, and Python is the go-to language for making sense of it all. Its simple syntax and powerful libraries have made it a favourite among data scientists. If you're looking to get started in data analysis, you've come to the right place.

This guide will walk you through a complete beginner's project using Python's two most essential data analysis libraries: NumPy and Pandas.

The Power Duo: NumPy and Pandas

Think of NumPy and Pandas as the foundational tools for data analysis in Python.

  • NumPy (Numerical Python): This is the engine for numerical computing. Its core feature is the n-dimensional array (ndarray), which supports fast mathematical operations on large datasets. The speed comes from vectorisation: an operation is written once for the whole array and executed in optimised, compiled code rather than in a Python loop over individual elements.

  • Pandas: Built on top of NumPy, Pandas is your tool for data manipulation and analysis. It introduces the DataFrame, a two-dimensional table similar to a spreadsheet, which makes working with structured data intuitive and robust.
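
To make the two ideas above concrete, here is a minimal sketch of vectorisation and of a DataFrame wrapping NumPy arrays. The values and the names lengths_cm, lengths_mm and df are made up for illustration and are not part of the Iris project below.

```python
import numpy as np
import pandas as pd

# Hypothetical petal lengths in centimetres (illustrative values only)
lengths_cm = np.array([1.4, 4.7, 6.0, 5.1])

# Loop version: convert each element one at a time in Python
loop_result = np.array([value * 10 for value in lengths_cm])

# Vectorised version: one expression operates on the whole array
lengths_mm = lengths_cm * 10

# Both approaches produce the same numbers
print(np.allclose(loop_result, lengths_mm))

# A DataFrame stores arrays like these as labelled columns
df = pd.DataFrame({'length_cm': lengths_cm, 'length_mm': lengths_mm})
print(df)

# Each column can be handed back to NumPy as an ndarray
print(type(df['length_cm'].to_numpy()))
```

On four values the two versions are equally fast; the benefit of the vectorised form tends to show up on arrays with many thousands of elements.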

Let's Get Analysing the Iris Dataset

The best way to learn is through hands-on experience. We'll use the classic Iris flower dataset, which is ideal for beginners due to its small size, cleanliness, and ease of understanding. It contains measurements for 150 iris flowers from three different species.

Step 1: Set Up and Load the Data

First, let's import our libraries and load the dataset directly from its URL into a Pandas DataFrame. The read_csv() function is powerful enough to handle this in one step.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# URL for the Iris dataset
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# Use column names because the file doesn't have a header
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Load the data
iris_df = pd.read_csv(csv_url, names=col_names)
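
To see why the names argument matters for a headerless file, the sketch below reads a few illustrative rows from an in-memory string instead of the URL, so it runs offline. The variable raw_text and its three rows are assumptions made for this example.

```python
import io
import pandas as pd

# Three illustrative rows in the same headerless, comma-separated layout
raw_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Without names=, read_csv treats the first data row as the header
no_names = pd.read_csv(io.StringIO(raw_text))
print(no_names.shape)    # (2, 5): one row was consumed as column labels

# With names=, every line is kept as data and the columns get our labels
with_names = pd.read_csv(io.StringIO(raw_text), names=col_names)
print(with_names.shape)  # (3, 5)
print(list(with_names.columns))
```

Passing header=None on its own would also keep every row, but the columns would then be labelled 0 to 4 rather than by name.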

Step 2: Get a First Look at Your Data

Now that the data is loaded, let's perform a quick inspection to understand its structure. Pandas has some handy functions for this.

  • .head(): Shows the first five rows.

  • .shape: Shows the number of rows and columns.

  • .info(): Gives a summary of columns, data types, and non-null values.

  • .describe(): Provides descriptive statistics for numerical columns.

# See the first 5 rows
print("--- First 5 Rows ---")
print(iris_df.head())

# Get the dimensions (rows, columns)
print("\n--- DataFrame Shape ---")
print(iris_df.shape)

# Get a concise summary
print("\n--- DataFrame Info ---")
iris_df.info()

# Get descriptive statistics
print("\n--- Descriptive Stats ---")
print(iris_df.describe())

From this, we can confirm we have 150 rows and five columns, with no missing data!
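
If you would rather check for missing values explicitly than read them off the .info() summary, a sketch along these lines may help. It uses a small made-up frame, sample_df, with one deliberately missing value so the counts are easy to follow.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one deliberately missing measurement
sample_df = pd.DataFrame({
    'sepal_length': [5.1, 7.0, np.nan, 6.3],
    'species': ['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica'],
})

# isna() marks missing cells as True; sum() counts them per column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# value_counts() shows how many rows fall into each category
species_counts = sample_df['species'].value_counts()
print(species_counts)
```

Run against iris_df, the same two calls should report zero missing values in every column, consistent with the .info() output, along with the number of flowers per species.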

Step 3: Ask Questions and Analyse

This is where we start "slicing and dicing" the data to find insights.

Question 1: "Show me only the 'Iris-setosa' flowers."

We can use boolean indexing to filter the DataFrame based on a condition.

# Filter for rows where the species is 'Iris-setosa'
setosa_df = iris_df[iris_df['species'] == 'Iris-setosa']
print(setosa_df.head())
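
Boolean indexing works in two stages: the comparison produces a Series of True/False values (a mask), and the DataFrame keeps only the rows where the mask is True. The sketch below shows the mask on its own and then combines two conditions. The frame sample_df and the 1.5 cm threshold are illustrative assumptions.

```python
import pandas as pd

# Small illustrative frame (values are made up for the example)
sample_df = pd.DataFrame({
    'petal_length': [1.4, 1.7, 4.7, 6.0],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# The comparison alone yields a boolean Series, one value per row
is_setosa = sample_df['species'] == 'Iris-setosa'
print(is_setosa.tolist())

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses
long_setosa = sample_df[is_setosa & (sample_df['petal_length'] > 1.5)]
print(long_setosa)

# isin() matches any value from a list
not_setosa = sample_df[sample_df['species'].isin(['Iris-versicolor', 'Iris-virginica'])]
print(not_setosa)
```

Python's plain and/or keywords do not work element by element on a Series, which is why the & and | operators are used here.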

Question 2: "What is the average sepal length for each species?"

The .groupby() method is perfect for this. It splits the data by category (such as 'species'), applies a function (like .mean()), and combines the results.

# Group by species and calculate the mean for each column
avg_by_species = iris_df.groupby('species').mean()
print(avg_by_species)

This single command produces a compact summary table and makes the differences in average measurements between the species easy to see.
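
The split, apply and combine stages can also be written out by hand, which may make it clearer what .groupby() is doing. The sketch below does this on a small made-up frame, sample_df, and then uses .agg() to compute several statistics at once.

```python
import pandas as pd

# Small illustrative frame (values are made up for the example)
sample_df = pd.DataFrame({
    'sepal_length': [5.0, 5.2, 6.0, 6.4],
    'species': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

# Split: iterate over one sub-frame per species
# Apply: take the mean of each sub-frame
# Combine: collect the results into a single Series
manual = {}
for name, group in sample_df.groupby('species'):
    manual[name] = group['sepal_length'].mean()
manual_means = pd.Series(manual)
print(manual_means)

# The same result in one line
grouped_means = sample_df.groupby('species')['sepal_length'].mean()
print(grouped_means)

# agg() applies several functions per group in one pass
summary = sample_df.groupby('species')['sepal_length'].agg(['mean', 'min', 'max'])
print(summary)
```

The one-line form is what you would normally write; the loop is only there to show the three stages.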

Visualise Your Findings

A chart is often better than a table. Let's create a bar chart to compare the average petal width for each species. Pandas can plot directly from our grouped data.

# Calculate the average petal width for each species
avg_petal_width = iris_df.groupby('species')['petal_width'].mean()

# Create the bar plot
avg_petal_width.plot(kind='bar', title='Average Petal Width by Species')

# Add labels for clarity
plt.ylabel('Average Petal Width (cm)')
plt.xticks(rotation=0) # Keep species names horizontal

# Display the plot
plt.show()

This visualisation instantly tells a story: Iris-virginica has, on average, a much wider petal than Iris-setosa.
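
The same approach extends to more than one measurement. As a sketch, plotting a grouped DataFrame instead of a single column draws one bar per measurement for each species. The frame sample_df and its values are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small illustrative frame (values are made up for the example)
sample_df = pd.DataFrame({
    'petal_length': [1.4, 1.6, 4.5, 4.7, 5.8, 6.0],
    'petal_width': [0.2, 0.2, 1.4, 1.5, 2.2, 2.5],
    'species': ['Iris-setosa', 'Iris-setosa',
                'Iris-versicolor', 'Iris-versicolor',
                'Iris-virginica', 'Iris-virginica'],
})

# One row per species, one column per measurement
means = sample_df.groupby('species').mean()

# Each column becomes its own bar colour; rot=0 keeps the labels horizontal
ax = means.plot(kind='bar', rot=0, title='Average Petal Measurements by Species')
ax.set_ylabel('Mean (cm)')
```

Call plt.show() afterwards to display the chart, as in the example above.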

What's Next?

Congratulations! You've just completed an end-to-end data analysis project. You loaded, analysed, and visualised a dataset to uncover insights.

This is just the beginning. The best way to get better is to practise. Find another dataset and try applying these same steps. What questions can you ask? What stories can you tell with the data?
