Susan Cook

Python Data Science: The Beginners Guide 2026

What is Data Science? The Simple Truth

Python for data science means three things:

  1. Read data (CSV, databases, websites)
  2. Clean data (remove wrong values, fix missing data)
  3. Find patterns (use math to understand)

That is all data science is. Not complicated.
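
Here is a minimal sketch of those three steps, using only Python's standard library. The CSV text, the column names, and the cleaning rule (drop empty or non-positive salaries) are made up for illustration.

```python
import csv
import io
from statistics import mean

# Hypothetical CSV content; in practice this would come from a file.
raw = """name,department,salary
Alice,Engineering,50000
Bob,Engineering,
Charlie,Sales,75000
Dana,Sales,-1
"""

# 1. Read: parse the CSV text into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Clean: drop rows whose salary is missing or not a positive number.
clean = []
for row in rows:
    text = row["salary"].strip()
    if text and float(text) > 0:
        clean.append({**row, "salary": float(text)})

# 3. Find patterns: average salary per department.
by_department = {}
for row in clean:
    by_department.setdefault(row["department"], []).append(row["salary"])

averages = {dept: mean(values) for dept, values in by_department.items()}
print(averages)  # {'Engineering': 50000.0, 'Sales': 75000.0}
```

The libraries covered later (NumPy, Pandas) do the same three steps with much less code.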

Why Python? Five Reasons

Reason 1: Easy to Read

Python code looks like English. Compare:

Java:

public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}

Python:

print("Hello")

Same result. Python needs 1 line. Java needs 5 lines.

Reason 2: Huge Community

When you hit a problem, someone has almost certainly solved it already. Search for the error message and you will usually find an answer within minutes.

Reason 3: Free Everything

No cost. Download Python = $0. All libraries = $0. All tools = $0.

Reason 4: Industry Standard

Google, Netflix, Amazon, Facebook use Python. When you learn Python, you learn what real companies use.

Reason 5: Quick Income Growth

Junior data scientist: $85,000 - $120,000 per year
Senior data scientist: $180,000 - $280,000 per year

That is 2-3x income growth in 5-7 years. Real money.


Getting Started: Week by Week

Week 1: The Foundation

Open Google Colab (colab.research.google.com). Free. No installation.

Learn four things:

1. Variables (Containers)

name = "Alex"
age = 28
salary = 85000

print(name)      # Output: Alex
print(age)       # Output: 28
print(salary)    # Output: 85000

Variables hold information. Think of them as labeled boxes.

2. Lists (Groups of Items)

fruits = ["apple", "banana", "orange"]
numbers = [1, 2, 3, 4, 5]

print(fruits[0])     # Output: apple (first item)
print(fruits[1])     # Output: banana (second item)
print(numbers[-1])   # Output: 5 (last item)

fruits.append("mango")  # Add new item
print(fruits)        # Output: ['apple', 'banana', 'orange', 'mango']

Lists hold multiple items in order.

3. Loops (Repeat Actions)

numbers = [1, 2, 3, 4, 5]

for number in numbers:
    print(number)

# Output:
# 1
# 2
# 3
# 4
# 5

Loops repeat code. Useful for processing many items.

4. Conditions (If-Then Logic)

age = 28

if age >= 18:
    print("Adult")
else:
    print("Not adult")

# Output: Adult

Conditions make decisions. Test something. Do different things based on result.
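
The four ideas work best together. A small sketch, with made-up ages, that uses a variable, a list, a loop, and a condition in one program:

```python
# Hypothetical ages, used only for illustration.
ages = [15, 22, 17, 34, 41]

adults = []          # list that will collect every age of 18 or more
for age in ages:     # loop: visit each item in turn
    if age >= 18:    # condition: keep only the adults
        adults.append(age)

print(adults)                 # Output: [22, 34, 41]
print(len(adults), "adults")  # Output: 3 adults
```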

Practice this week: Write 10 small programs using these concepts.


Week 2: Numbers and Math (NumPy)

NumPy is a library for fast math. When you have millions of numbers, NumPy is often 10-100x faster than a plain Python loop, because the work runs in compiled code.

Basic NumPy

import numpy as np

# Create array
numbers = np.array([1, 2, 3, 4, 5])

# Math operations
doubled = numbers * 2         # [2, 4, 6, 8, 10]
squared = numbers ** 2        # [1, 4, 9, 16, 25]
sum_all = np.sum(numbers)     # 15
average = np.mean(numbers)    # 3.0
maximum = np.max(numbers)     # 5
minimum = np.min(numbers)     # 1

NumPy applies each operation to the whole array at once. No loop to write.
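
You can measure the speed difference yourself. This sketch times a plain Python loop against NumPy on one million numbers. The function names are made up for this example, and the exact timings depend on your machine, so no numbers are promised here.

```python
import timeit

import numpy as np

# One million numbers, once as a Python list and once as a NumPy array.
values_list = list(range(1_000_000))
values_array = np.arange(1_000_000, dtype=np.int64)

def loop_sum():
    """Add the numbers one by one in plain Python."""
    total = 0
    for value in values_list:
        total += value
    return total

def numpy_sum():
    """Add the numbers with a single NumPy call."""
    return int(values_array.sum())

# Both approaches must agree before the timing means anything.
assert loop_sum() == numpy_sum()

loop_seconds = timeit.timeit(loop_sum, number=5)
numpy_seconds = timeit.timeit(numpy_sum, number=5)
print(f"Python loop: {loop_seconds:.4f} s")
print(f"NumPy sum:   {numpy_seconds:.4f} s")
```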

Using NumPy on Data

import numpy as np

# Salaries of 5 employees
salaries = np.array([50000, 60000, 75000, 80000, 90000])

# Find average
avg_salary = np.mean(salaries)
print(f"Average: ${avg_salary:.0f}")  # Output: Average: $71000

# Find highest
max_salary = np.max(salaries)
print(f"Highest: ${max_salary}")  # Output: Highest: $90000

# Give 10% raise to everyone
new_salaries = salaries * 1.10
print(new_salaries)  # Output: [55000. 66000. 82500. 88000. 99000.]

Simple. Fast. Powerful.

Practice this week: Do 5 math problems using NumPy. Get comfortable with arrays.


Week 3: Data Tables (Pandas)

Pandas is for reading and organizing data. Think Excel, but in Python and more powerful.

Basic Pandas

import pandas as pd

# Create simple data
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "salary": [50000, 60000, 75000]
}

# Convert to table
df = pd.DataFrame(data)

# View table
print(df)

# Output:
#       name  age  salary
# 0    Alice   25   50000
# 1      Bob   30   60000
# 2  Charlie   35   75000

This creates a table (like Excel).

Reading Data from File

import pandas as pd

# Read CSV file
df = pd.read_csv("employees.csv")

# See first 5 rows
print(df.head())

# See summary statistics (count, mean, min, max) for numeric columns
print(df.describe())

# See column names
print(df.columns)

# See data shape (rows, columns)
print(df.shape)  # Output: (614, 13) means 614 rows, 13 columns

Most of your work starts here: read data, understand it, clean it.
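
If you do not have a CSV file yet, you can still practice. This sketch feeds Pandas a small made-up table from a string instead of a file. The column names and values are assumptions, a stand-in for employees.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for employees.csv, so the example runs without a file.
csv_text = """name,department,age,salary
Alice,Engineering,25,50000
Bob,Engineering,30,60000
Charlie,Sales,35,75000
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.dtypes)   # data type of each column
print(df.shape)    # Output: (3, 4) means 3 rows, 4 columns
```

Once this works, swap the string for a real file path and the rest of the code stays the same.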

Basic Data Manipulation

# Find average salary
avg = df["salary"].mean()
print(f"Average salary: ${avg:,.2f}")

# Find maximum age
max_age = df["age"].max()
print(f"Oldest employee: {max_age} years old")

# Count employees in each department
department_count = df.groupby("department").size()
print(department_count)

# Find average salary by department
dept_salary = df.groupby("department")["salary"].mean()
print(dept_salary)

These operations do what Excel does. But in seconds, not minutes.

Important: Avoid Row-by-Row Loops

SLOW (bad):

for i in range(len(df)):
    df.loc[i, "salary"] = df.loc[i, "salary"] * 1.10

FAST (good):

df["salary"] = df["salary"] * 1.10

Same result. On large tables the fast way is typically orders of magnitude faster, because the work runs in optimized compiled code instead of the Python interpreter.

This is critical: reach for a Pandas column operation first. Fall back to a loop only when no column operation can do the job.
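
People often reach for a loop when only some rows should change. A boolean mask handles that without one. In this sketch the table and the flat 5000 bonus for one department are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "department": ["Engineering", "Sales", "Engineering"],
    "salary": [50000.0, 60000.0, 80000.0],
})

# Boolean mask: True for the rows that should change.
is_engineering = df["department"] == "Engineering"

# Update only the matching rows in one step, without a Python loop.
df.loc[is_engineering, "salary"] = df.loc[is_engineering, "salary"] + 5000

print(df["salary"].tolist())  # Output: [55000.0, 60000.0, 85000.0]
```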

Practice this week: Load a CSV file. Explore it. Calculate averages. Group by categories.


Week 4: Machine Learning (Your First Model)

This is where things get exciting. You teach a computer to recognize patterns.

The Concept

Machine learning works like this:

  1. Show computer examples with answers
  2. Computer learns pattern
  3. Show computer new examples without answers
  4. Computer predicts answers
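
The four steps above can be sketched in plain Python. This is a toy only: a "nearest neighbor" rule that copies the answer of the most similar known example. It is not the method used in the rest of this guide, and every number in it is made up.

```python
# Step 1: examples with answers.
# Each entry is ((credit_score, income in thousands), approved) with 1 = yes, 0 = no.
examples = [
    ((720, 90), 1),
    ((680, 75), 1),
    ((540, 30), 0),
    ((580, 42), 0),
]

# Step 2: "learning" here just means remembering the examples.

def predict(features):
    """Steps 3 and 4: return the answer of the most similar known example."""
    def distance(example):
        known_features, _ = example
        return sum((a - b) ** 2 for a, b in zip(known_features, features))

    _, answer = min(examples, key=distance)
    return answer

print(predict((700, 80)))  # Output: 1
print(predict((560, 35)))  # Output: 0
```

Real models find more general patterns than this, but the workflow is the same: examples in, predictions out.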

Simple Example: Predict Loan Approval

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read data
df = pd.read_csv("loan_data.csv")

# Get features (input) and target (output)
features = df[["credit_score", "income", "loan_amount"]]
target = df["approved"]  # 1 = yes, 0 = no

# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42  # fixed seed: same split every run
)

# Create model
model = RandomForestClassifier(n_estimators=100)

# Train model (show it examples)
model.fit(X_train, y_train)

# Test model (check if it learned)
accuracy = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy*100:.1f}%")  # Example output: Model accuracy: 78.5%

# Make prediction on new data (same column names as the training features)
new_person = pd.DataFrame(
    [[700, 50000, 200000]],
    columns=["credit_score", "income", "loan_amount"],
)
prediction = model.predict(new_person)
print(f"Loan approved: {prediction[0] == 1}")  # Example output: Loan approved: True

This 30-line program:

  • Reads data
  • Splits data
  • Creates model
  • Trains model
  • Tests model
  • Makes prediction

This is professional data science.


The 2026 Additions: New Tools

Polars (Fast Pandas for Big Data)

When your data is huge (50GB, 100GB), Polars is often several times faster than Pandas. Its lazy mode reads only the rows and columns a query needs.

import polars as pl

# Scan the file lazily: nothing is loaded until collect() is called
lazy_df = pl.scan_csv("huge_file.csv")

# Build the query, then run it
result = lazy_df.filter(
    pl.col("age") > 25
).group_by("city").agg(
    pl.col("salary").mean()
).collect()

print(result)

Polars syntax is similar to Pandas. Easy to learn if you know Pandas.

LLM Integration (AI Helps You Code)

ChatGPT can now write Python code for you.

You ask: "Write Python code to read a CSV and find average salary by department."

ChatGPT writes:

import pandas as pd

df = pd.read_csv("employees.csv")
avg_salary = df.groupby("department")["salary"].mean()
print(avg_salary)

This saves time. Instead of writing every line yourself, you tell the AI what you need, then read and test the code it returns.

FastAPI (Put Your Model Online)

After you build a model, people want to use it.

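import joblib
from fastapi import FastAPI

app = FastAPI()

# Assumption: the trained model was saved earlier with
# joblib.dump(model, "loan_model.joblib"); the file name is hypothetical
model = joblib.load("loan_model.joblib")

# Your model predicts loan approval
@app.post("/predict")
def predict_loan(credit_score: int, income: int, loan_amount: int):
    # Same feature order as in training: credit_score, income, loan_amount
    prediction = model.predict([[credit_score, income, loan_amount]])
    return {"approved": bool(prediction[0] == 1)}  # plain bool so it converts to JSON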

Now other programs, including a website, can send requests to your model. This is real professional work.


Critical Mistakes to Avoid

Mistake 1: Using Loops on Data

Bad:

total = 0
for salary in salaries:
    total = total + salary
average = total / len(salaries)

Good:

average = np.mean(salaries)

Loops are slow. Vectorization (all at once) is fast.

Mistake 2: Not Cleaning Data

Real data is dirty. Missing values. Wrong values. Duplicates.

# Check for problems
print(df.isnull().sum())      # How many empty cells?
print(df.duplicated().sum())  # How many duplicate rows?

# Fix problems
df = df.dropna()              # Remove rows with empty cells
df = df.drop_duplicates()     # Remove exact duplicates
df = df[df["age"] > 0]        # Remove impossible values

Spend 50% of time here. Good data = good model.
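
Dropping rows is not the only fix. When data is scarce, a common alternative is to fill each gap with the column median. This sketch compares the two on a made-up table. Which one suits your data is a judgment call.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing salary.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "salary": [50000.0, 60000.0, np.nan, 90000.0],
})

print(len(df.dropna()))  # Output: 2 (dropping loses half of the rows)

# Alternative: fill each gap with that column's median and keep every row.
filled = df.fillna(df.median())

print(filled["age"].tolist())     # Output: [25.0, 35.0, 35.0, 45.0]
print(filled["salary"].tolist())  # Output: [50000.0, 60000.0, 60000.0, 90000.0]
```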

Mistake 3: Testing on Training Data

If you train and test on the same data, the model looks nearly perfect (99% accurate). But that number is misleading.

Always split:

  • 80% training data (teach model)
  • 20% test data (check if it learned)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)

# Train on training data
model.fit(X_train, y_train)

# Test on test data only
accuracy = model.score(X_test, y_test)

Now accuracy is real (maybe 75-80%, not 99%).
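
You can see the gap for yourself. This sketch builds synthetic data (random numbers with deliberately noisy labels, all made up), trains a model, and scores it on both sets. The exact scores depend on the data, so none are shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (made up for this sketch): one useful feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)  # data the model has already seen
test_accuracy = model.score(X_test, y_test)     # data the model has never seen

print(f"Accuracy on training data: {train_accuracy:.2f}")
print(f"Accuracy on test data:     {test_accuracy:.2f}")
```

The training score comes out higher because the model has memorized those rows. Only the test score tells you how it will do on new data.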

Mistake 4: Not Using Virtual Environments

Different projects need different library versions. Without virtual environments, they conflict.

# Create isolated environment
conda create -n datasci python=3.11

# Activate it
conda activate datasci

# Install libraries
pip install pandas numpy scikit-learn

Each project stays clean and separate.


Your 12-Week Learning Path

Week   Focus                 What You Learn
1-2    Python Basics         Variables, lists, loops, conditions
3      NumPy                 Arrays, math operations
4      Pandas                Data tables, reading files, manipulation
5-6    Machine Learning      Classification, regression, models
7      Data Cleaning         Handle missing values, fix problems
8      Model Evaluation      Accuracy, precision, recall
9      Feature Engineering   Create useful input variables
10     Advanced Models       Gradient Boosting, Neural Networks
11     Deployment            FastAPI, Docker, production
12     Real Project          Build complete project, deploy it

After 12 weeks: You are ready to apply for a junior data scientist job ($85K+).
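
Week 8 in the table lists accuracy, precision, and recall. A small sketch of how the three differ, computed by hand on made-up predictions (1 = approved, 0 = not approved):

```python
# Hypothetical true answers and model predictions.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))
true_positives = sum(1 for a, p in pairs if a == 1 and p == 1)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)
correct = sum(1 for a, p in pairs if a == p)

# Accuracy: share of all predictions that were right.
accuracy = correct / len(pairs)
# Precision: of the cases predicted "yes", how many really were "yes".
precision = true_positives / (true_positives + false_positives)
# Recall: of the real "yes" cases, how many the model found.
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy:  {accuracy:.3f}")   # Output: Accuracy:  0.625
print(f"Precision: {precision:.3f}")  # Output: Precision: 0.667
print(f"Recall:    {recall:.3f}")     # Output: Recall:    0.500
```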


Salaries: What You Will Earn

Stage       Years   Salary         What Changes
Start       0       $55K-$85K      Learn Python
Junior      2       $85K-$120K     Build models
Mid-level   5       $120K-$180K    Understand production
Senior      8       $180K-$280K    Use AI, lead projects
Expert      10+     $250K-$400K+   Design systems

The jump from $120K to $280K happens when you understand:

  • Production deployment
  • Model monitoring
  • AI/LLM integration
  • System design

Not just time. Skills.


Start Right Now

  1. Go to colab.research.google.com
  2. Create new notebook
  3. Copy this code:

print("I am starting my data science journey")
print("I will learn Python")
print("I will build real projects")
print("I will earn $100K+")

  4. Run it. See it works.

You have started. You are a programmer.

Everything else is practice.

In 12 weeks: You will be ready for a real job.
In 2 years: You will earn $85K-$120K.
In 5 years: You will earn $120K-$180K.

Start today. Not tomorrow. Today.
