nicodemus

Using Python for Data Analysis

Data analysis is the process of examining, cleaning, and interpreting data to find useful patterns, trends, or insights.

Why Choose Python for Data Analytics?

  • Ease of Use and Readability: Python has an intuitive syntax, making it easy for beginners and professionals to learn and use.

  • Extensive Library Support: A rich ecosystem of libraries such as Pandas, NumPy, Matplotlib, and Seaborn makes data processing smooth and efficient. These libraries provide built-in functions for cleaning, transforming, visualizing, and modeling data.

  • Scalability and Performance: Python can handle small datasets for academic projects and large-scale enterprise datasets efficiently. It integrates well with big data frameworks like Apache Spark to scale processing power further.

Need for a Data Analysis Workflow

A data analysis workflow is a defined sequence of steps for your analysis team to follow when analyzing data.
By following the defined steps, your analysis becomes systematic, which minimizes the possibility that you’ll make mistakes or miss something. Furthermore, when you carefully document your work, you can reapply your steps to future data as it becomes available.
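As a minimal sketch, such a workflow can be written as a small pipeline of functions (the function names and toy data here are illustrative, not part of the article) that you can rerun whenever new data arrives:

```python
from io import StringIO

import pandas as pd

def acquire(csv_text):
    """Acquire: load raw data (here from an in-memory CSV string)."""
    return pd.read_csv(StringIO(csv_text))

def clean(df):
    """Clean: drop duplicates and fill missing numeric values with the mean."""
    df = df.drop_duplicates()
    return df.fillna(df.mean(numeric_only=True))

def analyze(df):
    """Analyze: summarize sales per category."""
    return df.groupby('category')['sales'].sum()

raw = "category,sales\nbooks,10\nbooks,10\ntoys,\ntoys,30\n"
result = analyze(clean(acquire(raw)))
print(result)
```

Because each step is a named function, the whole analysis can be repeated on future data with a single call.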

Acquiring Your Data

Step 1: Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine
import numpy as np

Step 2: Reading Data From CSV Files

You can obtain your data in a variety of file formats. One of the most common is the comma-separated values (CSV) file. This is a text file that separates each piece of data with commas. The first row is usually a header row that defines the file’s content, with the subsequent rows containing the actual data.

# Load the CSV file into a DataFrame
df = pd.read_csv('data.csv')

Cleaning Your Data With Python

The data cleaning stage of the data analysis workflow is often the stage that takes the longest, particularly when there’s a large volume of data to be analyzed. It’s at this stage that you must check over your data to make sure that it’s free from poorly formatted, incorrect, duplicated, or incomplete data.

Steps in cleaning your data:

1. Dealing With Missing Data

# Check missing values
print(df.isnull().sum())

# Fill missing numeric values with the column average
df = df.fillna(df.mean(numeric_only=True))

Importance: Filling missing values prevents data bias and ensures accuracy in your analysis.
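A self-contained sketch of this step on a toy DataFrame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with one missing numeric value
df = pd.DataFrame({'age': [25.0, np.nan, 35.0], 'name': ['Ann', 'Ben', 'Cy']})

print(df.isnull().sum())  # 'age' has 1 missing value

# Fill missing numeric values with the column average: (25 + 35) / 2 = 30
df = df.fillna(df.mean(numeric_only=True))
```

`numeric_only=True` restricts the mean to numeric columns, so text columns like `name` don’t raise an error.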

2. Correcting Invalid Data Types

# Convert 'age' column to integer
df['age'] = df['age'].astype(int)

# Convert 'date' column to proper date format
df['date'] = pd.to_datetime(df['date'])

Importance: Fixing data types ensures that operations like calculations or sorting work correctly.
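For example, numbers and dates loaded as strings can’t be averaged or sorted chronologically until they are converted (toy values below, for illustration only):

```python
import pandas as pd

# Toy data with ages and dates stored as strings
df = pd.DataFrame({'age': ['25', '40'], 'date': ['2024-01-05', '2024-02-10']})

df['age'] = df['age'].astype(int)        # string -> integer
df['date'] = pd.to_datetime(df['date'])  # string -> datetime64

# Numeric operations now work as expected
print(df['age'].mean())
```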
3. Fixing Inconsistencies in Data

# Convert all text to lowercase for consistency
df['category'] = df['category'].str.lower()

# Replace variations with a standardized term
df['category'] = df['category'].replace({'elec': 'electronics', 'electro': 'electronics'})
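Run on a toy column with inconsistent spellings, the two lines above collapse every variant into one standard label:

```python
import pandas as pd

# Toy column: three spellings of the same category
df = pd.DataFrame({'category': ['Elec', 'electro', 'ELECTRONICS']})

df['category'] = df['category'].str.lower()
df['category'] = df['category'].replace({'elec': 'electronics', 'electro': 'electronics'})

print(df['category'].unique())
```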

4. Removing Duplicate Data

# Check for duplicates
print(df.duplicated().sum())

# Remove duplicate entries
df.drop_duplicates(inplace=True)

Importance: Removing duplicates keeps each record counted only once, which improves data accuracy.
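A quick sketch with made-up rows showing the before-and-after:

```python
import pandas as pd

# Toy DataFrame containing one exact duplicate row
df = pd.DataFrame({'id': [1, 2, 2], 'sales': [100, 200, 200]})

print(df.duplicated().sum())  # 1 duplicate row found

df.drop_duplicates(inplace=True)  # keeps the first occurrence of each row
```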
5. Storing Your Cleansed Data

# Save cleaned data
df.to_csv('cleaned_data.csv', index=False)
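To keep the sketch below self-contained it writes to an in-memory buffer instead of a file on disk; the same `to_csv` call writes a real file when given a path like `'cleaned_data.csv'`. Reading the result back confirms the round trip preserves the data:

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'category': ['books', 'toys'], 'sales': [10, 30]})

# Save the cleaned data as CSV (buffer stands in for 'cleaned_data.csv')
buffer = StringIO()
df.to_csv(buffer, index=False)

# Read it back to verify nothing was lost
buffer.seek(0)
restored = pd.read_csv(buffer)
```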

Connecting to a Database

How to Connect DBeaver to PostgreSQL with SQLAlchemy

Step 1: Install SQLAlchemy & PostgreSQL Driver

pip install sqlalchemy psycopg2

Step 2: Set Up Your Connection in DBeaver

1. Open DBeaver.
2. Click Database > New Connection → Select PostgreSQL.
3. Enter your credentials:
   • Host: localhost
   • Port: 5432
   • Database Name: your database name
   • Username/Password: your PostgreSQL credentials
4. Test Connection → If it’s successful, you're good to go.

Step 3: Connect SQLAlchemy to Your Database

DATABASE_URL = "postgresql://username:password@localhost:5432/dbname"

engine = create_engine(DATABASE_URL)

with engine.connect() as connection:
    print("Connected successfully!")
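The same engine pattern works with any database SQLAlchemy supports. As a self-contained sketch, the example below swaps in an in-memory SQLite database (so it runs without a PostgreSQL server) and shows how pandas can write and query tables through the engine:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the PostgreSQL URL above
engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({'category': ['books', 'toys'], 'sales': [10, 30]})

# Write the DataFrame to a table, then read it back with a SQL query
df.to_sql('sales', engine, index=False, if_exists='replace')
result = pd.read_sql('SELECT category, sales FROM sales', engine)
```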

After a successful connection, we can build visuals in Python.

# Group by category and sum sales
category_sales = df.groupby('Category')['Sales ($)'].sum()

# Create bar graph
plt.bar(category_sales.index, category_sales.values, color=['blue', 'green', 'red'])

# Labels and title
plt.xlabel('Product Category')
plt.ylabel('Total Sales ($)')
plt.title('Sales by Category')

# Show graph
plt.show()
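A runnable version of the bar chart with made-up sales figures; the Agg backend renders off-screen, so the sketch works even without a display:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales records
df = pd.DataFrame({'Category': ['Books', 'Toys', 'Games', 'Books'],
                   'Sales ($)': [100, 200, 150, 50]})

# Group by category and sum sales: Books 150, Games 150, Toys 200
category_sales = df.groupby('Category')['Sales ($)'].sum()

fig, ax = plt.subplots()
ax.bar(category_sales.index, category_sales.values)
ax.set_xlabel('Product Category')
ax.set_ylabel('Total Sales ($)')
ax.set_title('Sales by Category')
```

In an interactive session, `plt.show()` would then display the figure.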

# Create line graph
plt.plot(df['Date'], df['Sales ($)'], marker='o', linestyle='-', color='blue')

# Labels and title
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.title('Sales Trend Over Time')

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

# Show graph
plt.show()

Creating a pie chart:

labels = list(category_sales.index)
sizes = list(category_sales.values)

plt.figure(figsize=(6,6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=['blue', 'green', 'red'])
plt.title('Sales Distribution by Category')

# Show chart
plt.show()

Conclusion

Python, together with libraries like Pandas, Matplotlib, and SQLAlchemy, supports every stage of the data analysis workflow: acquiring data, cleaning it, storing it, and visualizing the results. Following a defined, documented workflow keeps your analysis systematic and lets you reapply the same steps as new data arrives.
