What is Data Science? The Simple Truth
Python for data science means three things:
- Read data (CSV, databases, websites)
- Clean data (remove wrong values, fix missing data)
- Find patterns (use math to understand)
That is all data science is. Not complicated.
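Here is a preview of those three steps in a few lines of Python using the pandas library. This is only a rough sketch: the file name sales.csv and the column names region and revenue are placeholders, not real data.
import pandas as pd
# 1. Read data (file name is a placeholder)
df = pd.read_csv("sales.csv")
# 2. Clean data: drop rows with missing values
df = df.dropna()
# 3. Find patterns: average revenue per region (column names are placeholders)
print(df.groupby("region")["revenue"].mean())
Do not worry if this looks unfamiliar yet. The rest of this guide builds up to it step by step.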
Why Python? Five Reasons
Reason 1: Easy to Read
Python code looks like English. Compare:
Java:
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}
Python:
print("Hello")
Same result. Python needs 1 line. Java needs 5 lines.
Reason 2: Huge Community
When you have a problem, millions of people have already solved it. You search Google and find an answer in minutes.
Reason 3: Free Everything
No cost. Download Python = $0. All libraries = $0. All tools = $0.
Reason 4: Industry Standard
Google, Netflix, Amazon, Facebook use Python. When you learn Python, you learn what real companies use.
Reason 5: Quick Income Growth
Junior data scientist: $85,000 - $120,000 per year
Senior data scientist: $180,000 - $280,000 per year
That is 2-3x income growth in 5-7 years. Real money.
Getting Started: Week by Week
Week 1: The Foundation
Open Google Colab (colab.research.google.com). Free. No installation.
Learn four things:
1. Variables (Containers)
name = "Alex"
age = 28
salary = 85000
print(name) # Output: Alex
print(age) # Output: 28
print(salary) # Output: 85000
Variables hold information. Think of them as labeled boxes.
2. Lists (Groups of Items)
fruits = ["apple", "banana", "orange"]
numbers = [1, 2, 3, 4, 5]
print(fruits[0]) # Output: apple (first item)
print(fruits[1]) # Output: banana (second item)
print(numbers[-1]) # Output: 5 (last item)
fruits.append("mango") # Add new item
print(fruits) # Output: ['apple', 'banana', 'orange', 'mango']
Lists hold multiple items in order.
3. Loops (Repeat Actions)
numbers = [1, 2, 3, 4, 5]
for number in numbers:
    print(number)
# Output:
# 1
# 2
# 3
# 4
# 5
Loops repeat code. Useful for processing many items.
4. Conditions (If-Then Logic)
age = 28
if age >= 18:
    print("Adult")
else:
    print("Not adult")
# Output: Adult
Conditions make decisions: test something, then do different things based on the result.
Practice this week: Write 10 small programs using these concepts.
Week 2: Numbers and Math (NumPy)
NumPy is a library for fast math. When you have millions of numbers, NumPy operations are often 10-100x faster than plain Python loops.
Basic NumPy
import numpy as np
# Create array
numbers = np.array([1, 2, 3, 4, 5])
# Math operations
doubled = numbers * 2 # [2, 4, 6, 8, 10]
squared = numbers ** 2 # [1, 4, 9, 16, 25]
sum_all = np.sum(numbers) # 15
average = np.mean(numbers) # 3.0
maximum = np.max(numbers) # 5
minimum = np.min(numbers) # 1
NumPy applies the math to every value in the array at once, with no loop needed.
Using NumPy on Data
import numpy as np
# Salaries of 5 employees
salaries = np.array([50000, 60000, 75000, 80000, 90000])
# Find average
avg_salary = np.mean(salaries)
print(f"Average: ${avg_salary}") # Output: Average: $71000
# Find highest
max_salary = np.max(salaries)
print(f"Highest: ${max_salary}") # Output: Highest: $90000
# Give 10% raise to everyone
new_salaries = salaries * 1.10
print(new_salaries) # approximately [55000. 66000. 82500. 88000. 99000.]
Simple. Fast. Powerful.
Practice this week: Do 5 math problems using NumPy. Get comfortable with arrays.
Week 3: Data Tables (Pandas)
Pandas is for reading and organizing data. Think Excel, but in Python and more powerful.
Basic Pandas
import pandas as pd
# Create simple data
data = {
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 75000]
}
# Convert to table
df = pd.DataFrame(data)
# View table
print(df)
# Output:
#       name  age  salary
# 0    Alice   25   50000
# 1      Bob   30   60000
# 2  Charlie   35   75000
This creates a table (like Excel).
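Once the data is in a DataFrame, you can select columns and filter rows directly. A quick sketch that continues from the df defined above:
print(df["salary"])        # one column
print(df[df["age"] > 28])  # only rows where age is greater than 28
This is the kind of filtering that takes several clicks in Excel but one line in Pandas.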
Reading Data from File
import pandas as pd
# Read CSV file
df = pd.read_csv("employees.csv")
# See first 5 rows
print(df.head())
# See information about data
print(df.describe())
# See column names
print(df.columns)
# See data shape (rows, columns)
print(df.shape) # e.g. (614, 13): 614 rows, 13 columns
Most of your work starts here: read data, understand it, clean it.
Basic Data Manipulation
# Find average salary
avg = df["salary"].mean()
print(f"Average salary: ${avg}")
# Find maximum age
max_age = df["age"].max()
print(f"Oldest employee: {max_age} years old")
# Count employees in each department
department_count = df.groupby("department").size()
print(department_count)
# Find average salary by department
dept_salary = df.groupby("department")["salary"].mean()
print(dept_salary)
These operations do what Excel does. But in seconds, not minutes.
Important: Never Use Loops for Data
SLOW (bad):
for i in range(len(df)):
    df.loc[i, "salary"] = df.loc[i, "salary"] * 1.10
FAST (good):
df["salary"] = df["salary"] * 1.10
Same result. The vectorized version is often 100x faster or more.
This is critical: use built-in Pandas operations instead of row-by-row loops.
Practice this week: Load a CSV file. Explore it. Calculate averages. Group by categories.
Week 4: Machine Learning (Your First Model)
This is where things get exciting. You teach a computer to recognize patterns.
The Concept
Machine learning works like this:
- Show computer examples with answers
- Computer learns pattern
- Show computer new examples without answers
- Computer predicts answers
Simple Example: Predict Loan Approval
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Read data
df = pd.read_csv("loan_data.csv")
# Get features (input) and target (output)
features = df[["credit_score", "income", "loan_amount"]]
target = df["approved"] # 1 = yes, 0 = no
# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2
)
# Create model
model = RandomForestClassifier(n_estimators=100)
# Train model (show it examples)
model.fit(X_train, y_train)
# Test model (check if it learned)
accuracy = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy*100:.1f}%") # Output: Model accuracy: 78.5%
# Make prediction on new data
new_person = [[700, 50000, 200000]] # credit score, income, loan amount
prediction = model.predict(new_person)
print(f"Loan approved: {prediction[0] == 1}") # Output: Loan approved: True
This short program:
- Reads data
- Splits data
- Creates model
- Trains model
- Tests model
- Makes prediction
This is professional data science.
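Once a model like this works, you usually save it to a file so other code (for example, a web API) can load it later without retraining. A minimal sketch using joblib; the file name is a placeholder:
import joblib
# Save the trained model to a file (file name is a placeholder)
joblib.dump(model, "loan_model.joblib")
# Later: load it back and predict without retraining
loaded_model = joblib.load("loan_model.joblib")
print(loaded_model.predict([[700, 50000, 200000]]))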
The 2026 Additions: New Tools
Polars (Fast Pandas for Big Data)
When your data is huge (tens of gigabytes), Polars is often many times faster than Pandas.
import polars as pl
# Read a large CSV file (for files bigger than memory, use pl.scan_csv)
df = pl.read_csv("huge_file.csv")
# Filter and aggregate
result = df.filter(
    pl.col("age") > 25
).group_by("city").agg(
    pl.col("salary").mean()
)
print(result)
Polars syntax is similar to Pandas. Easy to learn if you know Pandas.
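To see how similar they feel, here is the same filter-and-group written both ways. A small self-contained sketch: the sample data is made up, and group_by is the method name in recent Polars versions.
import pandas as pd
import polars as pl
data = {"age": [22, 30, 41], "city": ["NY", "NY", "LA"], "salary": [50000, 70000, 90000]}
# Pandas version
df_pd = pd.DataFrame(data)
print(df_pd[df_pd["age"] > 25].groupby("city")["salary"].mean())
# Polars version: same logic, slightly different syntax
df_pl = pl.DataFrame(data)
print(df_pl.filter(pl.col("age") > 25).group_by("city").agg(pl.col("salary").mean()))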
LLM Integration (AI Helps You Code)
ChatGPT can now write Python code for you.
You ask: "Write Python code to read a CSV and find average salary by department."
ChatGPT writes:
import pandas as pd
df = pd.read_csv("employees.csv")
avg_salary = df.groupby("department")["salary"].mean()
print(avg_salary)
This saves time. Instead of writing every line yourself, you direct the AI and review what it writes.
FastAPI (Put Your Model Online)
After you build a model, people want to use it.
from fastapi import FastAPI
app = FastAPI()
# Your model predicts loan approval (model is the trained classifier, loaded elsewhere)
@app.post("/predict")
def predict_loan(income: int, credit_score: int):
    prediction = model.predict([[credit_score, income]])
    return {"approved": bool(prediction[0] == 1)}
Now people can use your model on a website. This is real professional work.
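Here is how a client could call that endpoint once the API is running (for example with uvicorn). This is a sketch: the host, port, and input values are assumptions, and the two arguments go in as query parameters because that is how FastAPI treats plain int parameters by default.
import requests
# Assumes the API is running locally on port 8000
response = requests.post(
    "http://127.0.0.1:8000/predict",
    params={"income": 50000, "credit_score": 700},
)
print(response.json())  # e.g. {'approved': True}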
Critical Mistakes to Avoid
Mistake 1: Using Loops on Data
Bad:
total = 0
for salary in salaries:
    total = total + salary
average = total / len(salaries)
Good:
average = np.mean(salaries)
Loops are slow. Vectorization (all at once) is fast.
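You can verify this claim yourself by timing both approaches. A minimal sketch with the time module; the array size is arbitrary and the exact speedup depends on your machine.
import time
import numpy as np
salaries = np.random.randint(30000, 200000, size=1_000_000)
# Slow: loop over every value
start = time.time()
total = 0
for s in salaries:
    total = total + s
loop_average = total / len(salaries)
loop_time = time.time() - start
# Fast: one vectorized call
start = time.time()
numpy_average = np.mean(salaries)
numpy_time = time.time() - start
print(f"Loop: {loop_time:.3f}s, NumPy: {numpy_time:.5f}s")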
Mistake 2: Not Cleaning Data
Real data is dirty. Missing values. Wrong values. Duplicates.
# Check for problems
print(df.isnull().sum()) # How many empty cells?
print(df.duplicated().sum()) # How many duplicate rows?
# Fix problems
df = df.dropna() # Remove rows with empty cells
df = df.drop_duplicates() # Remove exact duplicates
df = df[df["age"] > 0] # Remove impossible values
Expect to spend about half your time here. Good data = good model.
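Dropping rows is not the only option. Often it is better to fill missing values instead of throwing data away. A small self-contained sketch; the column names and sample values are made up:
import pandas as pd
import numpy as np
df = pd.DataFrame({"age": [25, np.nan, 35], "department": ["IT", "HR", None]})
# Fill missing ages with the median age instead of dropping those rows
df["age"] = df["age"].fillna(df["age"].median())
# Fill missing departments with an explicit label
df["department"] = df["department"].fillna("Unknown")
print(df)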
Mistake 3: Testing on Training Data
If you train and test on same data, model looks perfect (99% accurate). But it is lying.
Always split:
- 80% training data (teach model)
- 20% test data (check if it learned)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)
# Train on training data
model.fit(X_train, y_train)
# Test on test data only
accuracy = model.score(X_test, y_test)
Now accuracy is real (maybe 75-80%, not 99%).
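An easy way to see the gap is to score the same model on both sets. A brief sketch that continues from the code above; the exact numbers will vary.
# Accuracy on the data the model was trained on (usually looks too good)
train_accuracy = model.score(X_train, y_train)
# Accuracy on data the model has never seen (the honest number)
test_accuracy = model.score(X_test, y_test)
print(f"Train: {train_accuracy:.2f}, Test: {test_accuracy:.2f}")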
Mistake 4: Not Using Virtual Environments
Different projects need different library versions. Without virtual environments, they conflict.
# Create isolated environment
conda create -n datasci python=3.11
# Activate it
conda activate datasci
# Install libraries
pip install pandas numpy scikit-learn
Each project stays clean and separate.
Your 12-Week Learning Path
| Week | Focus | What You Learn |
|---|---|---|
| 1-2 | Python Basics | Variables, lists, loops, conditions |
| 3 | NumPy | Arrays, math operations |
| 4 | Pandas | Data tables, reading files, manipulation |
| 5-6 | Machine Learning | Classification, regression, models |
| 7 | Data Cleaning | Handle missing values, fix problems |
| 8 | Model Evaluation | Accuracy, precision, recall |
| 9 | Feature Engineering | Create useful input variables |
| 10 | Advanced Models | Gradient Boosting, Neural Networks |
| 11 | Deployment | FastAPI, Docker, production |
| 12 | Real Project | Build complete project, deploy it |
After 12 weeks: You are ready to apply for junior data scientist jobs ($85K+).
Salaries: What You Will Earn
| Stage | Years | Salary | What Changes |
|---|---|---|---|
| Start | 0 | $55K-$85K | Learn Python |
| Junior | 2 | $85K-$120K | Build models |
| Mid-level | 5 | $120K-$180K | Understand production |
| Senior | 8 | $180K-$280K | Use AI, lead projects |
| Expert | 10+ | $250K-$400K+ | Design systems |
The jump from $120K to $280K happens when you understand:
- Production deployment
- Model monitoring
- AI/LLM integration
- System design
Not just time. Skills.
Start Right Now
- Go to colab.research.google.com
- Create new notebook
- Copy this code:
print("I am starting my data science journey")
print("I will learn Python")
print("I will build real projects")
print("I will earn $100K+")
- Run it. See that it works.
You have started. You are a programmer.
Everything else is practice.
In 12 weeks: You will be ready for a real job.
In 2 years: You will earn $85K-$120K.
In 5 years: You will earn $120K-$180K.
Start today. Not tomorrow. Today.