joseph mwangi

Introduction to Python for Data Analysis: A Beginner’s Guide

Introduction

For a long time, I viewed programming as something reserved for software engineers and computer scientists. As someone with a background in scientific research and a growing interest in data analytics, I assumed tools like Excel, SQL, and Power BI were enough to answer most questions hidden in data.

Then I started learning Python, and what first looked like a programming language full of strange syntax quickly revealed itself as one of the most powerful tools a data analyst or data scientist can have. Python is not just about writing code; it is about automating repetitive tasks, cleaning messy datasets, analysing millions of records, and creating reproducible workflows that can be shared with anyone.

In this article, I share my beginner-friendly understanding of Python and how it is used in the data analytics space. If you are just starting your journey in data analysis, this guide will give you a practical overview of what Python is, why it matters, and the core concepts you need to know.

What Is Python?

Python is a high-level, general-purpose programming language known for its readability and simplicity.

It was created by Guido van Rossum and first released in 1991. The guiding idea behind Python is that code should be easy to read and easy to write.

Unlike many programming languages that require complex syntax, Python uses clear and concise statements that often resemble plain English. For example, the single statement print("Hello, data world!") is a complete program that you can run to see its output.

That single line displays text on the screen and demonstrates how approachable Python can be.

Why Python Is Important in Data Analysis

Python has become one of the most widely used languages in data analytics, data science, machine learning, and artificial intelligence. Its strength lies in its versatility.

1. Automating Repetitive Tasks

Data analysts often perform the same operations repeatedly:

  • Renaming hundreds of files
  • Cleaning dozens of spreadsheets
  • Downloading reports from APIs
  • Merging datasets

Python can automate these tasks. Here is a real-world scenario: imagine receiving 200 CSV files from different branches every month. Opening and cleaning each file manually in Excel would take hours. With Python, a short script can process all of them in seconds.

Python script for automating file processing
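
The original screenshot is not reproduced here, so what follows is only a minimal sketch of the idea, not the author's script. It uses the standard library (pathlib and csv) rather than pandas, and the function name combine_branch_files is hypothetical. It assumes every file is UTF-8 encoded, has a header row, and has the same number of columns in every row.

```python
import csv
from pathlib import Path


def combine_branch_files(folder):
    """Read every CSV file in `folder` and return the cleaned rows as one list."""
    combined = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Strip stray whitespace from every column name and value
                cleaned = {key.strip(): value.strip() for key, value in row.items()}
                # Record which file the row came from
                cleaned["source_file"] = path.name
                combined.append(cleaned)
    return combined
```

Calling combine_branch_files("monthly_reports") (a made-up folder name) would return one list of dictionaries covering every CSV file in that folder, ready to be written out or loaded into pandas.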

2. Handling Large and Complex Data

Excel slows down as datasets grow to hundreds of thousands of rows, and a single worksheet cannot hold more than 1,048,576 rows.

Python, especially with the pandas library, can efficiently process large datasets and perform advanced transformations.

Real-World Scenario

Analysing e-commerce transactions from Jumia or Amazon with millions of records is practical in Python but cumbersome in spreadsheets.

3. Advanced Data Cleaning

Real-world data is rarely perfect.

You may encounter:

  • Missing values
  • Duplicate records
  • Inconsistent text formats
  • Incorrect dates

Python provides tools to clean and standardize data systematically.

Real-World Scenario

Converting NAIROBI, Nairobi, and nairobi into one consistent value is a simple operation in Python.

Harmonizing column names
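
As a rough illustration of the NAIROBI / Nairobi / nairobi case, the sketch below uses only built-in string methods. The raw_cities list is made-up sample data, not taken from a real dataset.

```python
# Made-up sample values with inconsistent casing and stray spaces
raw_cities = ["NAIROBI", "Nairobi", " nairobi ", "MOMBASA", "mombasa"]

# strip() removes surrounding whitespace; title() capitalises the first
# letter of each word and lowercases the rest
clean_cities = [city.strip().title() for city in raw_cities]

print(clean_cities)
# ['Nairobi', 'Nairobi', 'Nairobi', 'Mombasa', 'Mombasa']

# A set keeps one copy of each distinct value
print(sorted(set(clean_cities)))
# ['Mombasa', 'Nairobi']
```

With pandas, the same idea can be applied to a whole column at once, for example df["city"].str.strip().str.title(), where "city" is a hypothetical column name.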

4. Reproducibility

Every step of your analysis is stored in code.

This means:

  • Your work can be repeated
  • Errors can be traced
  • Colleagues can reproduce your results.

Python Basics Every Data Analyst Should Know

1. Variables

Variables store data values.

name = "Joseph"
age = 28

Here, name is a variable that stores the text "Joseph", and age is a variable that stores the number 28.
Think of variables as labeled containers.

2. Data Types

Python supports several built-in data types.

Data type Example
String (str) "Joseph"
Integer (int) 28
Float (float) 23.43
Boolean (bool) True / False

Demonstrating Python data types
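
A short sketch of the four types in the table, using the built-in type() function to confirm what Python sees. The variable names price and is_analyst are illustrative only.

```python
name = "Joseph"      # string
age = 28             # integer
price = 23.43        # float
is_analyst = True    # Boolean

# type(value).__name__ gives the short name of each value's type
for value in (name, age, price, is_analyst):
    print(value, type(value).__name__)
# Joseph str
# 28 int
# 23.43 float
# True bool
```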

3. Operators

Operators allow you to perform calculations and comparisons.

Arithmetic Operators
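
The original illustration is missing here, so the following is a small stand-in showing Python's arithmetic operators. The values 17 and 5 are arbitrary.

```python
x = 17
y = 5

print(x + y)   # 22   addition
print(x - y)   # 12   subtraction
print(x * y)   # 85   multiplication
print(x / y)   # 3.4  division (always returns a float)
print(x // y)  # 3    floor division (drops the fractional part)
print(x % y)   # 2    modulus (the remainder)
print(x ** 2)  # 289  exponentiation
```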

Comparison Operators

Comparison operators are used to compare two values:

Operator Name Example
== Equal x == y
!= Not equal x != y
> Greater than x > y
< Less than x < y
>= Greater than or equal to x >= y
<= Less than or equal to x <= y
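
A brief sketch of comparison operators in use. Each comparison evaluates to True or False, which makes them handy for filtering. The sales figures are made up for the example.

```python
sales = 1200
target = 1000

print(sales == target)  # False
print(sales != target)  # True
print(sales > target)   # True
print(sales <= target)  # False

# Comparisons are often used to keep only the values that meet a condition
daily_sales = [800, 1200, 950, 1500]
above_target = [amount for amount in daily_sales if amount > target]
print(above_target)     # [1200, 1500]
```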

Logical Operators

Logical operators are used to combine conditional statements:

Operator Description Example
and Returns True if both conditions are true x = 5; print(x <= 5 and x < 10) (output: True)
or Returns True if at least one of the conditions is true x = 5; print(x < 4 or x < 10) (output: True)
not Reverses the result; returns False if the result is True x = 5; print(not (x > 3 and x < 10)) (output: False)
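
The three logical operators can be tried together in a few lines. This sketch stores each result in a variable (the names both, either, and negated are illustrative) so the outcome is easy to inspect.

```python
x = 5

both = x <= 5 and x < 10          # both conditions hold
either = x < 4 or x < 10          # only the second condition holds
negated = not (x > 3 and x < 10)  # the inner result is True, so not gives False

print(both, either, negated)  # True True False
```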

4. Data Structures

Lists
Lists are used to store multiple items in a single variable.
List items are ordered, changeable, and allow duplicate values.
List items are indexed; the first item has index [0], the second item has index [1], etc.
Lists use square brackets []

fruits = ["apple", "banana", "mango"]
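
To make the "ordered, changeable, allows duplicates" description concrete, here is a small sketch that builds on the same fruits list.

```python
fruits = ["apple", "banana", "mango"]

print(fruits[0])    # apple  (first item)
print(fruits[-1])   # mango  (negative indexes count from the end)

fruits[1] = "orange"     # lists are changeable
fruits.append("apple")   # duplicates are allowed

print(fruits)       # ['apple', 'orange', 'mango', 'apple']
print(len(fruits))  # 4
```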

Tuples
Tuples are used to store multiple items in a single variable.
Tuple items are ordered, unchangeable, and allow duplicate values.
Tuples use parentheses ()

coordinates = (1.2, 3.4)
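
A short sketch of what "unchangeable" means in practice: a tuple can be read and unpacked, but assigning to one of its items raises a TypeError. The variable names x and y are illustrative.

```python
coordinates = (1.2, 3.4)

# Unpack the tuple into two variables
x, y = coordinates
print(x, y)  # 1.2 3.4

# Trying to change an item raises a TypeError
try:
    coordinates[0] = 9.9
except TypeError:
    print("Tuples cannot be modified after creation")
```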

Dictionaries
Dictionaries are used to store data values in (key: value) pairs.
A dictionary is a collection that is changeable and does not allow duplicate keys. Since Python 3.7, it also keeps its items in insertion order.
Dictionary uses curly brackets {}

student = {"name": "Amina", "score": 90}
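
The sketch below shows the most common dictionary operations on the same student record: reading a value by key, updating it, and adding a new key. The "city" and "age" keys are made up for illustration.

```python
student = {"name": "Amina", "score": 90}

print(student["name"])        # Amina

student["score"] = 95         # change an existing value
student["city"] = "Nairobi"   # add a new key: value pair

# get() returns None instead of raising an error when a key is missing
print(student.get("age"))     # None

print(student)
# {'name': 'Amina', 'score': 95, 'city': 'Nairobi'}
```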

Sets
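A set is a collection that is unordered and unindexed. Its items cannot be changed in place, but you can add and remove items.
Sets are used to store multiple items in a single variable.
Sets cannot have two items with the same value.
Sets use curly brackets {} (an empty set is written set(), because {} creates an empty dictionary)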

cities = {"Nairobi", "Mombasa", "Kisumu"}
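
Because a set keeps only one copy of each value, it is a convenient way to remove duplicates. In this sketch, items are added to the set itself, even though an individual item cannot be edited in place. The branches list is made-up sample data.

```python
cities = {"Nairobi", "Mombasa", "Kisumu"}

cities.add("Nairobi")   # already present, so nothing changes
cities.add("Nakuru")    # a new value is added

print(len(cities))          # 4
print("Kisumu" in cities)   # True

# Converting a list to a set removes duplicate values
branches = ["Nairobi", "Kisumu", "Nairobi"]
unique_branches = set(branches)
print(sorted(unique_branches))  # ['Kisumu', 'Nairobi']
```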

These structures help organise and manipulate data efficiently.

Conditional Statements

marks = 75

if marks >= 70:
    print("Pass")
else:
    print("Fail")

For Loops

Loops repeat tasks automatically.

for number in range(1, 6):
    print(number)

Real-World Scenario

Processing each row in a dataset or iterating through multiple files.
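
As a small illustration of that scenario, the sketch below loops over a handful of rows and keeps a running total. The sales_rows data is made up; in practice the rows would come from a file or an API.

```python
# Each dictionary stands in for one row of a dataset
sales_rows = [
    {"branch": "Nairobi", "amount": 1200},
    {"branch": "Mombasa", "amount": 800},
    {"branch": "Kisumu", "amount": 1500},
]

total = 0
for row in sales_rows:
    # Add this row's amount to the running total
    total += row["amount"]

print(total)  # 3500
```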

Functions

Functions package reusable logic.

def greet(name):
    return f"Hello, {name}!"

Functions make code cleaner and easier to maintain.
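
A quick sketch of the reuse this refers to: the function is defined once and then called for each item in a list. The names list is sample data.

```python
def greet(name):
    return f"Hello, {name}!"


names = ["Joseph", "Amina"]

# Call the same function once per name
greetings = [greet(name) for name in names]
print(greetings)  # ['Hello, Joseph!', 'Hello, Amina!']
```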

Python Libraries for Data Analysis

One of Python's greatest strengths is its ecosystem of libraries.

Requests

requests is used to interact with web APIs.

import requests

response = requests.get("https://dummyjson.com/products")
data = response.json()

This is useful for collecting real-time data from online sources.

Pandas

pandas is the most widely used library for data manipulation.

import pandas as pd

# Load a CSV file into a DataFrame
df = pd.read_csv("sales.csv")
df.head()

The JSON data fetched with requests in the previous section can also be turned into a DataFrame:

import pandas as pd

# `data` is the dictionary returned by response.json() in the Requests example.
# The dummyjson products endpoint keeps its records under the "products" key.
products = data["products"]

# Build a DataFrame from up to the first 100 records
df = pd.DataFrame(products[:100])
df

With pandas, you can:

  • Load data
  • Filter rows
  • Handle missing values
  • Group and summarize
  • Merge datasets

For more information about pandas, refer to this video.
youtube link

Getting data URLs and loading a dataset with requests and pandas

Python Enhancement Proposals (PEP 8)

Like people, Python has its own likes and dislikes, its own "pet peeves". It likes clean indentation, meaningful variable names, and consistent formatting, and it dislikes messy spacing, unclear names, and poorly organized code. To help programmers understand what Python "prefers" and what it "dislikes", the Python community maintains Python Enhancement Proposals (PEPs). Among them, PEP 8 is the style guide that provides the most widely used guidelines for writing readable and consistent code. Apart from indentation, which the interpreter enforces, these guidelines are conventions rather than rules that stop code from running.

Indentation

Python uses indentation (typically 4 spaces) to define code blocks.

if True:
    print("Indented correctly")

Line Length

Recommended maximum line length is 79 characters.
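
One common way to stay within the limit is to wrap a long expression in parentheses so it can continue across lines, with each operator starting a new line. The sketch below shows that pattern with made-up monthly figures.

```python
# Illustrative monthly figures (made up for this example)
january_sales = 12500
february_sales = 9800
march_sales = 14300

# Parentheses let one expression span several short lines
first_quarter_sales = (
    january_sales
    + february_sales
    + march_sales
)

print(first_quarter_sales)  # 36600
```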

Naming Conventions

Variables and Functions: snake_case

total_sales = 500

def calculate_average():
    pass

Classes: PascalCase

class StudentRecord:
    pass

Constants: UPPER_CASE

PI = 3.14159

Docstrings

Docstrings describe what a function does.

def add_numbers(a, b):
    """Return the sum of two numbers."""
    return a + b

Docstrings are essential for writing maintainable code.
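
Part of what makes docstrings useful is that Python stores them on the function itself, so tools and colleagues can read them without opening the source. A minimal sketch:

```python
def add_numbers(a, b):
    """Return the sum of two numbers."""
    return a + b


# The docstring is available through the __doc__ attribute
print(add_numbers.__doc__)  # Return the sum of two numbers.
```

In an interactive session, help(add_numbers) displays the same text along with the function signature.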

Final Thoughts

Python has shown me that data analysis is not just about creating charts or writing queries; it is about building repeatable processes that turn raw data into reliable insights.

Although I am still at the beginning of my learning journey, I can already see why Python has become such an essential tool for analysts and scientists. If you are starting out, focus on the fundamentals, practice consistently, and trust that each small script you write is another step toward becoming a more effective data professional.

A few weeks into this journey, I already understand why Python is considered the backbone of data science.

And this is only the beginning!