Python started as a hobby project. Guido van Rossum spent Christmas break in 1989 writing an interpreter he described as "a little scripting language," and published it in 1991. That modest origin makes the trajectory striking: Python has grown into one of the most popular programming languages in the world. In data analytics specifically, Python overtook R, SAS, Excel, and other proprietary tools not because it was marketed well, but because analysts kept choosing it for each new problem they encountered.
What Python actually is
Python is a general-purpose, interpreted programming language. "Interpreted" means you run code directly without a separate compilation step, which makes experimentation fast, even though the programs themselves execute more slowly than compiled code. "General-purpose" means the same language handles web servers, automation, scientific computing, and data analysis.
The syntax is minimal and indentation defines code blocks instead of brackets. You can often read a Python function and understand what it does even if you have never written Python before. Python is also free, open source, and runs on Windows, Mac, and Linux without modification. That combination was actually decisive in getting Python adopted inside companies where budget and IT policy often block paid software.
Why data analysts picked Python over the alternatives
R was excellent for statistics but hard to learn for people without a formal statistics background. SAS was comprehensive but expensive and proprietary. Excel worked for small datasets but fell apart past a few hundred thousand rows. Python was cheap, readable, and already had libraries capable of serious data work.
In 2008, Wes McKinney was a quantitative analyst at AQR Capital Management running financial calculations that he found painful to do in existing tools. He built pandas, a library that gave Python a spreadsheet-like structure for data manipulation. It became public in 2009. By 2012, analysts across finance, tech, and academia had adopted it.
Pandas gave Python something it was missing: a clean way to load a CSV, examine the data, filter rows, group by category, and aggregate columns in a few lines. Before pandas, doing any of that required more setup than most analysts would tolerate.
Once pandas existed, the rest of Python's adoption in data work moved quickly. More people used Python for data work, which meant more contributors improving the libraries, which attracted more people. That cycle is still running.
The libraries worth knowing
Python itself does not do statistics or draw charts. The libraries do. Four of them appear in almost every data analytics project.
NumPy handles numerical computation. It provides arrays that behave like mathematical vectors and matrices, and operations on those arrays run fast because NumPy's internals are written in C. Most other data libraries sit on top of NumPy.
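A minimal sketch of what "arrays that behave like vectors" means in practice (the product names and numbers here are made up for illustration):

```python
import numpy as np

# Arrays behave like mathematical vectors: arithmetic applies element-wise,
# with no explicit loop, and runs in compiled C under the hood
prices = np.array([19.99, 4.50, 12.00])
quantities = np.array([3, 10, 2])

revenue = prices * quantities   # element-wise multiply
print(revenue.sum())            # total revenue across all products
print(revenue.mean())           # average revenue per product
```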
pandas provides the DataFrame, which is the structure data analysts actually work with. Think of it as a table with column names and row labels. You load a file, you get a DataFrame, you clean the data in that DataFrame, and you pass it to a model or a chart.
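A quick sketch of that workflow with a DataFrame built by hand (in practice you would load a file; the store and revenue values here are invented):

```python
import pandas as pd

# A small table: column names, row labels, mixed types
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "region": ["East", "East", "West", "West"],
    "revenue": [1200, 950, 800, 1100],
})

# Filter rows with a boolean condition
east = df[df["region"] == "East"]

# Group and aggregate: average revenue per region
by_region = df.groupby("region")["revenue"].mean()
print(by_region)
```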
Matplotlib handles visualization. John Hunter created it in 2003 to generate figures in Python, drawing some inspiration from MATLAB. It is verbose by modern standards but configurable down to individual pixels when you need that level of control.
Seaborn wraps around Matplotlib and produces statistical charts with less code. If you want a box plot, a correlation heatmap, or a regression line on a scatter plot, Seaborn gets you there faster than raw Matplotlib would.
Cleaning data: where most of the work goes
Most data analysts spend the bulk of their time not on analysis itself but on cleaning: handling data that is missing, inconsistent, or broken in some way.
Missing values are one of the most common problems. A column might have null entries because a form field was optional or because records were merged from different systems. You cannot simply ignore them; they cause errors in calculations and bias in results.
```python
import pandas as pd

df = pd.read_csv("sales_data.csv")

# Check which columns have missing values and how many
print(df.isnull().sum())

# Drop rows where any value is missing
df_clean = df.dropna()

# Or fill missing values with the column mean
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())
```
Beyond missing values, you deal with duplicates, incorrect data types (a column that should be numbers stored as text), inconsistent category labels ("New York," "new york," "NY"), and dates stored in five different formats within the same column.
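Hedged sketches of fixes for each of those problems, on an invented example table (the `format="mixed"` date option assumes pandas 2.0 or newer):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "new york", "NY", "NY"],          # inconsistent labels
    "units": ["3", "5", "2", "2"],                          # numbers stored as text
    "order_date": ["2024-01-05", "01/06/2024", "2024-01-07", "2024-01-07"],
})

# Drop exact duplicate rows (the last two rows are identical)
df = df.drop_duplicates()

# Convert text to numbers; errors="coerce" turns unparseable values into NaN
df["units"] = pd.to_numeric(df["units"], errors="coerce")

# Normalize category labels: lowercase, then map known aliases
df["city"] = df["city"].str.lower().replace({"ny": "new york"})

# Parse mixed date formats; always spot-check the result afterward
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")
```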
These problems are not glamorous. But fixing them before analysis is the difference between results you can trust and results that only look reasonable.
Analyzing data: from raw numbers to answers
Once the data is clean, Python makes it straightforward to answer questions about it. Say you have sales data for a retail chain across 200 stores and three years. You want to know which regions are growing and which are not.
```python
# Group sales by region and year, then sum revenue
regional_sales = df.groupby(["region", "year"])["revenue"].sum().reset_index()

# Compute year-over-year growth rate
regional_sales["growth"] = regional_sales.groupby("region")["revenue"].pct_change()

print(regional_sales.sort_values("growth", ascending=False).head(10))
```
Running this produces a sorted table showing the ten fastest-growing region-year combinations in the dataset.
Beyond grouping and aggregating, Python handles statistical tests, time series analysis, and correlation matrices. SciPy adds t-tests, chi-squared tests, and regression analysis for more rigorous statistical questions. Scikit-learn adds the full range of machine learning models when the goal shifts from describing what happened to predicting what will happen next.
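As one illustration, here is a two-sample t-test with SciPy on simulated revenue figures (the group means and sample sizes are invented for the example):

```python
import numpy as np
from scipy import stats

# Simulated daily revenue for two hypothetical store groups
rng = np.random.default_rng(42)
group_a = rng.normal(loc=1000, scale=50, size=30)
group_b = rng.normal(loc=1030, scale=50, size=30)

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference is unlikely to be due to sampling noise alone, assuming the test's conditions hold.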
Visualizing data: making the analysis usable
Numbers in a table and numbers in a chart communicate differently. A chart of monthly revenue over three years shows trends, seasonality, and anomalies in ways that scrolling through a spreadsheet does not.
```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 5))
plt.plot(df["month"], df["revenue"])
plt.title("Monthly revenue, 2022-2024")
plt.xlabel("Month")
plt.ylabel("Revenue (USD)")
plt.tight_layout()
plt.savefig("revenue_trend.png")
```
Seaborn makes it easier to produce charts that carry statistical meaning. A box plot shows the full distribution of values including outliers. A heatmap shows correlations between many variables at once. A pair plot generates scatter plots for every combination of numerical columns in your dataset, which is often the fastest way to spot patterns worth investigating before you commit to a specific analysis.
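A short sketch of the box plot and heatmap mentioned above, on an invented dataset (the region, revenue, and visit numbers are made up; the `Agg` backend keeps it runnable on machines without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe on headless machines
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical dataset: revenue and foot traffic for two regions
df = pd.DataFrame({
    "region": ["East", "West"] * 50,
    "revenue": [1000 + i * 3 for i in range(100)],
    "visits": [200 + i for i in range(100)],
})

# Box plot: one line shows the full distribution per region, outliers included
sns.boxplot(data=df, x="region", y="revenue")
plt.savefig("revenue_by_region.png")
plt.close()

# Heatmap of correlations between all numeric columns
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.savefig("correlations.png")
```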
Why beginners should start with Python
The first reason is practical: Python is what the job market wants. Data analyst job postings ask for Python more than any other programming language, by a wide margin in most tech-adjacent industries. The 2023 Kaggle Data Science Survey found that over 87% of respondents use Python regularly in their work.
The second reason is the feedback loop. Python is interpreted, so you see the result of each line immediately. You write a line, run it, see what happens, and adjust. That makes learning faster and less frustrating than compiled languages where you chase errors through a build process before seeing any output.
The third reason is the tooling. There are plenty of freely available, browser-based Python environments with no local setup required, which means you can practice real data analysis without installing anything. Jupyter Notebooks let you mix code, output, and notes in a single document, which is how most analysts actually share their work.
That said, Python will not teach you statistics. You can use it to run a regression without understanding what a regression assumes about your data. The code might run and the output might look authoritative, yet the result can still be wrong because the data violated an assumption the analyst did not know to check for. Learning Python alongside statistics is the right approach.
Finally, it is worth noting that when you run into a problem in Python, someone has usually hit it before and written about it. Stack Overflow alone has over 2.2 million Python questions and answers as of 2024. Reading other people's code is also a good way to learn and to develop your own problem-solving skills.
Start with a real dataset you care about. Load it with pandas. See what is broken. Fix it. Ask a question. Answer it. That is the whole Python loop.
How far are you in your Python learning journey? Let me know in the comments.