Pandas is a powerful Python library used for data manipulation and analysis.
Created by Wes McKinney in 2008, it provides data structures and functions for working with structured data efficiently. Pandas lets users analyze large tabular datasets, clean messy data, and derive meaningful insights.
Key Features of Pandas
- Data Structures: Pandas introduces two primary data structures:
  - Series: A one-dimensional labeled array
  - DataFrame: A two-dimensional labeled data structure with columns of potentially different types
- Data Manipulation: Pandas offers functions for analyzing, cleaning, exploring, and manipulating data.
- Data Analysis: It enables users to perform complex operations like correlation analysis, grouping, and statistical calculations.
- Data Visualization: Pandas integrates well with other libraries to create insightful visualizations.
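As a rough illustration of the two data structures listed above, here is a minimal sketch. The fruit names, column names, and values are made up for the example and are not from any real dataset.

```python
import pandas as pd

# A Series: one-dimensional, every value carries a label from the index
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"], name="price")

# A DataFrame: two-dimensional, and each column can have its own dtype
inventory = pd.DataFrame(
    {
        "price": [1.5, 2.0, 3.25],       # float column
        "stock": [10, 0, 7],             # integer column
        "organic": [True, False, True],  # boolean column
    },
    index=["apple", "banana", "cherry"],
)

# Look up a Series value by its label
print(prices["banana"])

# Inspect the per-column dtypes of the DataFrame
print(inventory.dtypes)

# Selecting a single column of a DataFrame gives back a Series
stock = inventory["stock"]
print(type(stock))

# A simple statistical calculation on one column
total_stock = stock.sum()
print(total_stock)
```

Selecting one column returns a Series, which is why most Series methods also work on DataFrame columns.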
Practical Examples
Indexing with loc
The loc indexer enables label-based indexing in DataFrames, allowing precise data selection:
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=['x', 'y', 'z'])
# Select rows with label 'y' and 'z', and columns 'A' and 'C'
result = df.loc[['y', 'z'], ['A', 'C']]
print(result)
Indexing with iloc
The iloc indexer provides integer position-based indexing for DataFrame selection:
import pandas as pd
# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# Select rows 0 and 2, and columns 1 and 2
result = df.iloc[[0, 2], [1, 2]]
print(result)
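One difference between the two indexers that is easy to miss is how slices behave: label slices with loc include both endpoints, while position slices with iloc exclude the stop, as in regular Python slicing. The sketch below reuses the sample DataFrame from the loc example; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=['x', 'y', 'z'])

# Label slice: both 'x' and 'y' (and both 'A' and 'B') are included
by_label = df.loc['x':'y', 'A':'B']

# Position slice: the stop position 2 is excluded, so rows 0-1 and columns 0-1
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position)

# loc also accepts a boolean mask for the rows
mask_rows = df.loc[df['A'] > 1, ['A', 'C']]
print(mask_rows)
```

Here both slices select the same two rows and two columns, even though the stop values look different.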
Date Conversion with to_datetime
The to_datetime function transforms various date formats into standardized datetime objects:
import pandas as pd
# Convert string to datetime
date_string = "2023-09-17 14:30:00"
dt_object = pd.to_datetime(date_string)
print(dt_object)
# Convert multiple date strings
date_series = pd.Series(['20200101', '20200201', '20200301'])
dt_series = pd.to_datetime(date_series, format='%Y%m%d')
print(dt_series)
Output:
2023-09-17 14:30:00
0 2020-01-01
1 2020-02-01
2 2020-03-01
dtype: datetime64[ns]
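Real data often contains values that cannot be parsed as dates. A minimal sketch of one way to handle this, assuming you prefer missing values over an exception, is the errors='coerce' option of to_datetime; the sample strings are invented for the example.

```python
import pandas as pd

# The second entry is deliberately not a valid date
raw = pd.Series(['20200101', 'not a date', '20200301'])

# errors='coerce' turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')
print(parsed)

# Count how many entries failed to parse
n_failed = parsed.isna().sum()
print(n_failed)
```

Counting the NaT values afterwards is a quick way to see how much of the column failed to convert.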
Pandas simplifies data manipulation tasks, making it an essential tool for data scientists and analysts. Tools like loc, iloc, and to_datetime provide powerful ways to select and transform data, enabling efficient data processing and analysis in Python.
Something to consider while using loc or iloc
Let's convert the object column date to datetime using loc:
import pandas as pd
df = pd.DataFrame({'date': ['2023-01-01', '2023-02-15', '2023-03-31']})
df.loc[:, 'date'] = pd.to_datetime(df.loc[:, 'date'])
print(df)
print(df.dtypes)
Output:
date
0 2023-01-01 00:00:00
1 2023-02-15 00:00:00
2 2023-03-31 00:00:00
date object
dtype: object
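Notice that the dtype is still object, not datetime64[ns]: the conversion did not change the column's dtype. As a result, the .dt accessor is not available on the column, and df['date'].dt.date raises an AttributeError. The traceback below comes from calling .dt on the DataFrame itself, which fails as well: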
Traceback (most recent call last):
  File "/HelloWorld.py", line 11, in <module>
    print(df.dt.date)
          ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pandas/core/generic.py", line 6299, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'dt'. Did you mean: 'at'?
The reason is a behavior change introduced in Pandas 2.0.0.
From What's new in 2.0.0 (April 3, 2023):
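Changed behavior in setting values with df.loc[:, foo] = bar or df.iloc[:, foo] = bar, these now always attempt to set values inplace before falling back to casting (GH 45333)
Because an object column can hold Timestamp values as they are, the in-place assignment succeeds and the column keeps its object dtype.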
How to overcome:
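The simplest fix is to avoid loc or iloc when replacing a whole column and, as the Pandas documentation suggests, use DataFrame.__setitem__() instead, that is, plain df['date'] = ... assignment, which replaces the column together with its dtype: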
import pandas as pd
df = pd.DataFrame({'date': ['2023-01-01', '2023-02-15', '2023-03-31']})
# Plain column assignment replaces the column, so the datetime dtype is kept
df['date'] = pd.to_datetime(df['date'])
print(df)
print(df.dtypes)
print(df['date'].dt.date)
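Column assignment is not the only way to replace a column along with its dtype. The sketch below compares three options and checks the resulting dtype each time: plain assignment, DataFrame.assign, and DataFrame.isetitem (the positional counterpart, available since Pandas 1.5). The variable names df1, df2, and df3 are only illustrative.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

raw = {'date': ['2023-01-01', '2023-02-15', '2023-03-31']}

# Option 1: plain column assignment (DataFrame.__setitem__)
df1 = pd.DataFrame(raw)
df1['date'] = pd.to_datetime(df1['date'])

# Option 2: assign returns a new DataFrame with the column replaced
df2 = pd.DataFrame(raw).assign(date=lambda d: pd.to_datetime(d['date']))

# Option 3: isetitem replaces the column at integer position 0
df3 = pd.DataFrame(raw)
df3.isetitem(0, pd.to_datetime(df3.iloc[:, 0]))

# Verify the dtype instead of assuming the conversion worked
for name, frame in [('df1', df1), ('df2', df2), ('df3', df3)]:
    print(name, is_datetime64_any_dtype(frame['date']))

# The .dt accessor is now available
dates = df1['date'].dt.date
print(dates)
```

Checking the dtype right after a conversion, as in the loop above, catches this kind of silent failure before a later .dt call does.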