DEV Community: justkmike

Data Cleaning with Pandas

justkmike — Fri, 20 Oct 2023 11:28:27 +0000

In this guide, we'll explore various data-cleaning techniques using Python and the Pandas library. We'll also cover methods and attributes like head(), tail(), info(), describe(), shape, and size, and demonstrate how to remove empty cells, deal with wrong data formats, access data, and remove duplicates.

DataFrame Basics

`head()` and `tail()`

These methods display the first and last n rows of a DataFrame, respectively. If you don't pass n, it defaults to 5.

# Display the first 5 rows
df.head()

# Display the last 5 rows
df.tail()

`info()`

info() provides essential information about the DataFrame, including column data types, non-null counts, and memory usage.

df.info()

`describe()`

describe() offers statistical summaries of the numeric columns in the DataFrame, such as count, mean, standard deviation, minimum, maximum, and quartiles (the 50% quartile is the median).

df.describe()

`shape`

shape returns the dimensions of the DataFrame as a tuple (number of rows, number of columns).

df.shape

`size`

size returns the total number of elements in the DataFrame.

df.size
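To see these basics together, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and only serve to illustrate the calls.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy", "Dee", "Eli", "Flo"],
    "age": [23, 35, 41, 29, 52, 38],
})

first_two = df.head(2)      # first 2 rows instead of the default 5
last_two = df.tail(2)       # last 2 rows
summary = df.describe()     # statistics for the numeric column "age"
n_rows, n_cols = df.shape   # shape is an attribute, not a method
n_cells = df.size           # rows * columns

print(first_two)
print(last_two)
print(summary)
print(n_rows, n_cols, n_cells)
```

Note that shape and size are attributes, so they are written without parentheses.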

Data Cleaning

Removing Empty Cells

`dropna()`

dropna() removes rows that contain empty cells (missing values). By default it returns a new DataFrame and leaves the original unchanged. If you want to modify the existing DataFrame instead, use the inplace=True parameter.

# Create a new DataFrame with empty cells removed
new_df = df.dropna()

# Modify the existing DataFrame in-place
df.dropna(inplace=True)
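Dropping every row with any empty cell can throw away more data than you intend. The sketch below, on made-up data, shows two narrower options that dropna() supports: subset, which only checks the listed columns, and how="all", which only drops rows where every cell is empty.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
df = pd.DataFrame({
    "name": ["Ann", "Ben", None, "Dee"],
    "age": [23, np.nan, np.nan, 29],
})

any_missing_dropped = df.dropna()            # drop rows with at least one empty cell
name_required = df.dropna(subset=["name"])   # only require a value in "name"
all_missing_dropped = df.dropna(how="all")   # drop rows where every cell is empty

print(len(df), len(any_missing_dropped), len(name_required), len(all_missing_dropped))
```

Because none of these calls use inplace=True, the original df is left untouched.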

`[fillna](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna)()`

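fillna("Value to replace with") replaces empty cells with a specified value. It also supports additional parameters such as axis and limit. The method parameter is deprecated in recent pandas versions in favor of the separate ffill() and bfill() methods.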

# Replace empty cells with a specific value
df.fillna("Replacement Value", inplace=True)
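A single replacement value rarely suits every column. As a sketch, fillna() also accepts a dictionary that maps column names to replacement values. The data and the choice of the column mean as a filler are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "city": ["Nairobi", None, "Kisumu"],
    "temp": [24.0, np.nan, 30.0],
})

# Fill each column with a value that makes sense for it
filled = df.fillna({"city": "Unknown", "temp": df["temp"].mean()})

print(filled)
```

Whether the mean is a sensible filler depends on your data, so treat it as one option rather than a rule.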

Handling Wrong Data Formats

For example, to convert a column named "date" to datetime format:

df["date"] = pd.to_datetime(df["date"])
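Real columns often contain a few values that are not valid dates. A minimal sketch, assuming the dates are written as year-month-day: passing errors="coerce" turns anything that cannot be parsed into NaT (pandas' missing value for dates) instead of raising an error.

```python
import pandas as pd

# Hypothetical column with one value that is not a date
df = pd.DataFrame({"date": ["2023-10-20", "2023-10-21", "not a date"]})

# format describes the expected layout; unparseable values become NaT
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

print(df)
print(df["date"].dtype)
```

The NaT rows can then be removed with dropna() or handled some other way.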

Removing Duplicates

To identify and remove duplicate rows:

`duplicated()`

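duplicated() returns a Boolean Series, indicating whether each row is a duplicate (True) or not (False). By default, the first occurrence of a row is not marked as a duplicate; only its later repeats are.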

duplicate_rows = df.duplicated()

`drop_duplicates()`

drop_duplicates() removes duplicate rows. Use the inplace=True parameter to modify the existing DataFrame.

df.drop_duplicates(inplace=True)
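Here is a small sketch of both methods on made-up data. It also shows the subset and keep parameters of drop_duplicates(), which let you decide which columns count when comparing rows and which copy to keep.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Ann", "Ann"],
    "city": ["Nairobi", "Mombasa", "Nairobi", "Kisumu"],
})

mask = df.duplicated()               # True only for the repeated row
unique_rows = df.drop_duplicates()   # compares all columns

# Compare on "name" only and keep the last occurrence of each name
unique_names = df.drop_duplicates(subset=["name"], keep="last")

print(mask.tolist())
print(unique_rows)
print(unique_names)
```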

Accessing Data in a DataFrame

`at` and `iat`

`at`

at is used to get or set a specific element by row and column labels.

# Get the value at row 2, column "name"
value = df.at[2, "name"]

# Assign a new value to the selected element
df.at[2, "name"] = "Justkmike"

`iat`

iat is used to access elements by integer row and column position (both zero-based).

# Get the value at row position 1, column position 2
value = df.iat[1, 2]

# Update data at a specific index
df.iat[1, 2] = 10
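The difference between the two is easiest to see side by side. The sketch below uses a made-up DataFrame with the default 0 to n-1 index, so the row label and the row position happen to be the same number.

```python
import pandas as pd

# Hypothetical data with the default integer index
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cy"],
    "age": [23, 35, 41],
    "score": [7, 8, 9],
})

by_label = df.at[2, "name"]   # row label 2, column label "name"
by_position = df.iat[1, 2]    # second row, third column ("score")

df.at[2, "name"] = "Justkmike"   # set by label
df.iat[1, 2] = 10                # set by position

print(by_label, by_position)
print(df)
```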

`[loc](https://www.statology.org/pandas-loc-vs-iloc/)` and `iloc`

`loc`

loc selects rows (and, optionally, columns) using index labels.

# Select a row with the index label "12-23-23"
selected_row = df.loc["12-23-23"]

`iloc`

iloc selects rows and columns using integer positions.

# Select the first two rows and the first two columns
selected_data = df.iloc[0:2, 0:2]
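One detail that is easy to miss: label slices with loc include the end label, while position slices with iloc stop before the end position. A short sketch on made-up data with date-like strings as index labels:

```python
import pandas as pd

# Hypothetical data indexed by date-like string labels
df = pd.DataFrame(
    {
        "sales": [100, 150, 90],
        "returns": [5, 3, 8],
        "region": ["east", "west", "north"],
    },
    index=["12-21-23", "12-22-23", "12-23-23"],
)

selected_row = df.loc["12-23-23"]   # one row, returned as a Series

# loc: both ends of the label slice are included
label_slice = df.loc["12-21-23":"12-22-23", ["sales", "returns"]]

# iloc: the end position is excluded, as in regular Python slicing
selected_data = df.iloc[0:2, 0:2]

print(selected_row)
print(label_slice)
print(selected_data)
```

Both slices return the same two rows and two columns here.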

Remember that this is just the tip of the iceberg regarding Pandas. There are many more operations and functions available for data manipulation. If you encounter any issues or need further assistance, feel free to contact mwkariuki2e@gmail.com. Stay tuned for our next guide on data visualization. See you! 😊

Python Pandas: Introduction to pandas (part 2)

justkmike — Wed, 18 Oct 2023 10:32:08 +0000

In the previous article, we looked at what DataFrames and Series are in pandas. We also explored how to create a pandas DataFrame (referred to as df here) from a list and a dictionary. I hope you've also researched the key differences between a DataFrame and a Series in pandas. In this article, we will build on that knowledge.

In data science, data can come from many sources, and each source may store it in a different format, such as comma-separated values (CSV), Excel sheets, SQL files, and more. It's up to you to know which tools to use. Now, let's dive into the functions pandas provides.

Reading and Writing CSV Files

Pandas provides read_csv() and to_csv() functions to work with CSV files.

Reading CSV Files

read_csv() is used to read data from CSV files. For example:

import pandas as pd

csvdf = pd.read_csv("file_path.csv")

By default, read_csv() assumes the first row in your file is the header row. If this is not the case, you can use the header parameter to specify the row number (counting from 0) to use as the header:

csvdf = pd.read_csv("file_path.csv", header=3)

Note that all rows above the specified header will be ignored when creating the DataFrame.
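You can try this without creating a file by reading from an in-memory string. In this sketch, the two metadata lines and the column names are made up, and io.StringIO stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical file content: two metadata lines, then the real header
raw = (
    "source,HR export\n"
    "exported,2023-10-18\n"
    "name,age\n"
    "Ann,23\n"
    "Ben,35\n"
)

# Row numbers start at 0, so the header is row 2
csvdf = pd.read_csv(io.StringIO(raw), header=2)

print(csvdf)
```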

Reading CSV Files with Multiple Headers

If your CSV file has multiple header rows, you can specify them using a list with the header parameter:

csvdf = pd.read_csv("file_path.csv", header=[0, 2])
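With a list of header rows, each column ends up with a label made of several parts. A minimal sketch, assuming a made-up file whose first two rows are both headers (so header=[0, 1] here):

```python
import io

import pandas as pd

# Hypothetical file content with two header rows
raw = (
    "score,score,info\n"
    "math,art,name\n"
    "90,80,Ann\n"
    "70,60,Ben\n"
)

csvdf = pd.read_csv(io.StringIO(raw), header=[0, 1])

print(csvdf.columns.tolist())

# A column is selected with a tuple, one entry per header row
math_scores = csvdf[("score", "math")]
print(math_scores.tolist())
```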

Reading CSV Files Without Headers

If your CSV file doesn't include headers, but you have the column names separately, you can use the names parameter to provide a list of column names:

csvdf = pd.read_csv("file_path.csv", names=["first_name", "last_name", "gender"])

You can check the set headers using csvdf.columns.
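As a quick sketch with made-up content read from an in-memory string: when names is given and header is left out, read_csv() treats the first line as data rather than as a header.

```python
import io

import pandas as pd

# Hypothetical file content with no header row
raw = "Ann,Otieno,F\nBen,Kamau,M\n"

csvdf = pd.read_csv(
    io.StringIO(raw),
    names=["first_name", "last_name", "gender"],
)

print(csvdf.columns.tolist())
print(csvdf)
```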

Adding a Prefix to Column Names

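If you have many columns and labeling them is cumbersome, you can give them generated names with a common prefix. Older pandas versions accepted a prefix parameter in read_csv() for this, but it was deprecated in pandas 1.4 and removed in 2.0. In current versions, read the file with header=None and then call add_prefix() on the result:

csvdf = pd.read_csv("file_path.csv", header=None).add_prefix("Col_")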

Setting the Index Column

By default, DataFrames are indexed from 0 to n-1. If you want to use a specific column as the index, you can do so with the index_col parameter:

csvdf = pd.read_csv("file_path.csv", index_col='company')
# Alternate approach using column index
csvdf = pd.read_csv("file_path.csv", index_col=0)

If it's a multi-index DataFrame, use a list to indicate the index, similar to how we did for the headers.
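The sketch below shows the three variants on made-up content: an index chosen by column name, by column position, and a two-level index built from a list of columns.

```python
import io

import pandas as pd

# Hypothetical file content
raw = (
    "company,city,employees\n"
    "Acme,Nairobi,120\n"
    "Beta,Mombasa,45\n"
)

by_name = pd.read_csv(io.StringIO(raw), index_col="company")
by_position = pd.read_csv(io.StringIO(raw), index_col=0)

# A list of columns produces a multi-level index
multi = pd.read_csv(io.StringIO(raw), index_col=["company", "city"])

print(by_name.loc["Acme", "employees"])
print(multi.loc[("Beta", "Mombasa"), "employees"])
```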

Reading CSV Files with Defined Columns and Rows

If you don't need all the columns and rows in the provided file, you can filter them using the usecols and nrows parameters. For example:

csvdf = pd.read_csv("file_path.csv", usecols=['names', 'gender'], nrows=10)

This means you'll get a DataFrame with only the 'names' and 'gender' columns and only 10 rows of the available rows.
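A runnable sketch of the same idea, using a made-up in-memory file with three columns and four data rows. The third column is left out and only the first three rows are read.

```python
import io

import pandas as pd

# Hypothetical file content
raw = (
    "names,gender,age\n"
    "Ann,F,23\n"
    "Ben,M,35\n"
    "Cy,M,41\n"
    "Dee,F,29\n"
)

csvdf = pd.read_csv(io.StringIO(raw), usecols=["names", "gender"], nrows=3)

print(csvdf)
```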

Skipping Rows

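If you need to skip specific rows, like those with even row numbers, you can pass a function to the skiprows parameter. The function receives each row number of the file, counting from 0, and row 0 is normally the header, so it has to be excluded from the test:

csvdf = pd.read_csv("file_path.csv", index_col=0, skiprows=lambda x: x > 0 and x % 2 == 0)

This uses a lambda function to skip every even-numbered row while keeping the header row.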
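When a function is passed to skiprows, it is called with each row number of the file, counting from 0, and the header line is row 0. The sketch below, on made-up content, keeps the header and skips the even-numbered rows after it.

```python
import io

import pandas as pd

# Hypothetical file content: row 0 is the header, rows 1 to 4 are data
raw = "id,name\n1,Ann\n2,Ben\n3,Cy\n4,Dee\n"

# Keep row 0 (the header), skip rows 2 and 4
csvdf = pd.read_csv(
    io.StringIO(raw),
    skiprows=lambda x: x > 0 and x % 2 == 0,
)

print(csvdf)
```

Without the x > 0 check, the header line itself would be skipped and the first remaining row would be used as the header.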

Writing to CSV

To save a DataFrame to a CSV file, use the to_csv() method. By default, it also includes the index in the file. You can disable this by setting the index parameter to False:

csvdf.to_csv("file_name.csv", index=False)
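To see what the index parameter changes, you can call to_csv() without a file name, in which case it returns the CSV text as a string. A small sketch on made-up data:

```python
import io

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["Ann", "Ben"], "age": [23, 35]})

with_index = df.to_csv()                # first column holds the index
without_index = df.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)

# Reading the index-free text back gives the original columns
roundtrip = pd.read_csv(io.StringIO(without_index))
print(roundtrip)
```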

Most of these options work similarly when writing to CSV as when reading from CSV, and related functions such as read_json() and read_excel() follow the same pattern (I will not be covering them, at least for now). Note that this is not an exhaustive guide on read_csv(), so I encourage you to explore further and practice to understand it better. You can also find free courses on platforms like Simplilearn to boost your skills and resume.

Introduction to Pandas: Python Pandas library for data science (Part 1)

justkmike — Fri, 06 Oct 2023 07:23:30 +0000

What is Pandas?
Pandas is a Python library designed for data manipulation and analysis. It simplifies various data-related tasks, making them more efficient and accessible. Whether you're working with datasets, performing data cleaning, exploration, or statistical analysis, Pandas provides the tools to help you achieve your goals.

Why Use Pandas?
Pandas offers numerous advantages for data scientists and analysts:

Data Analysis: Pandas simplifies data analysis by providing powerful data structures and functions.
Data Cleaning: It offers tools for cleaning and preprocessing data, such as handling missing values and outliers.
Data Manipulation: Pandas allows you to reshape and transform data, making it suitable for your specific analysis needs.
Readability: It enhances data readability through structured data frames and series.
Simplified Workflow: Pandas streamlines data-related tasks, saving time and effort in data projects.

For more in-depth information, you can explore resources like W3Schools and GeeksforGeeks.

How to Install Pandas:
You can easily install Pandas using the Python package manager, pip. Open your command prompt or terminal and run the following command:

pip install pandas

Usage:
Once Pandas is installed, you can import it into your Python script or notebook using the alias 'pd':

import pandas as pd

You can check the installed Pandas version with:

print(pd.__version__)

Pandas Series:
A Pandas Series is a one-dimensional data structure, comparable to a single column in a DataFrame. It is homogeneous, meaning it contains elements of the same data type, and each element has a label (index). Here's an example:

import pandas as pd

my_list = [30, 20, 23, 34]
my_series = pd.Series(my_list)
print(my_series)

Pandas Labels:
By default, Pandas assigns labels indexed from 0 to n-1, where n is the length of the series. However, you can customize the index as you prefer. Here's an example:

# The index needs one label per element in my_list (4 here)
custom_index = ['a', 'b', 'c', 'd']
my_new_series = pd.Series(my_list, index=custom_index)
print(my_new_series)

You can access a series item using its label, like this:

print(my_new_series['a'])

Key-Value Objects in Pandas Series:
If you have a dictionary with key-value pairs, you can transform it into a Pandas Series. The keys will become the labels for the series.
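A minimal sketch of this, with a made-up dictionary of names and ages. It also shows that passing index together with a dictionary keeps only the listed keys.

```python
import pandas as pd

# Hypothetical key-value data
ages = {"Mike": 12, "John": 23, "Mary": 31}

age_series = pd.Series(ages)   # the keys become the labels
print(age_series)
print(age_series["John"])      # access a value by its label

# Keep only some of the keys by listing them in index
some_ages = pd.Series(ages, index=["Mike", "Mary"])
print(some_ages)
```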

DataFrames:
Pandas DataFrames are two-dimensional tables with rows and columns. They can be thought of as collections of Pandas Series, and they are commonly used for structured data. Here's an example of creating a DataFrame from a dictionary:

my_dict = {
    "name": ["Mike", "John"],
    "age": [12, 23]
}
new_df = pd.DataFrame(my_dict)

You can access a specific row by its index label using .loc[]. With the default index, the labels are the integers 0 to n-1. For example:

new_df.loc[0]

To access multiple rows, you can pass a list of indices:

new_df.loc[[0, 1]]

You can also specify named indexes when creating the DataFrame by providing a list of indexes to the index argument.
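As a sketch, here is the same dictionary with named row labels. The labels student_1 and student_2 are made up for illustration.

```python
import pandas as pd

my_dict = {
    "name": ["Mike", "John"],
    "age": [12, 23],
}

# Hypothetical row labels, one per row
new_df = pd.DataFrame(my_dict, index=["student_1", "student_2"])

one_row = new_df.loc["student_1"]                  # a single row, as a Series
two_rows = new_df.loc[["student_1", "student_2"]]  # several rows, as a DataFrame

print(one_row)
print(two_rows)
```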

When you need to load data from sources like CSV files, Excel files, or JSON files into a DataFrame, Pandas provides built-in functions like pd.read_csv(), pd.read_excel(), and pd.read_json() to simplify the process.

I will show you how to use these functions in the next article. Feel free to go ahead and do some research on your own. See you in the next one.

Data Science! Data Science

justkmike — Sat, 30 Sep 2023 09:41:22 +0000

Before we dive into data science, what is data?
Data is a collection of raw facts and figures collected for a specific task, e.g. census data, the number of cars that use a road per hour, daily weather updates, etc. Once this data is collected, it is not useful to anyone until some analysis is done on it, and this brings us to data science.

Data Science Definition
Data science is the study of data to extract meaningful insights for business and to solve real-life problems. Easy, right? :)
Data science is a multidisciplinary field; that is, it combines different disciplines including mathematics, statistics, probability, programming, and more.

Key Components of Data Science

Data collection
Data cleaning
Data Exploration
Modeling
Interpretation

Data Collection
Data collection is the first step in data analysis. This involves gathering data from existing databases or collecting it directly from locals. For example, if I need to analyze how a school is performing, I can get the data from KNEC or visit various schools, get the data from them, and do my "thing".

Data Cleaning
You have the data now, but many times it is messy: it has errors, outliers, and duplicates, which basically means it cannot be used in its raw form. Data cleaning involves removing all the irregularities that may interfere with correct analysis and insights.

Data Exploration

It's time to work with the data now that you've cleaned it. Data exploration is summarizing and visualizing data in order to better comprehend its properties. Finding patterns and relationships in the data is made easier by methods like data visualization, descriptive statistics, and exploratory data analysis (EDA).

Modeling
Modeling entails creating mathematical and statistical models to predict the future or unearth undiscovered information. Regression analysis, clustering, neural networks, and machine learning algorithms are examples of common modeling techniques.

Interpretation

Data scientists interpret the outcomes of model training and prediction to obtain actionable insights. To fully understand the impact of the findings and make informed choices, this phase requires expertise in the relevant domain, e.g. transport, medicine, or climate.

Data Science Applications

Health Care
Transport
Sports, etc.

Conclusion
Data science is the process of extracting meaningful insights that help businesses make informed decisions as well as solve real-life problems.