What is Data Science? The Simple Truth
Python for data science means three things:
- Read data (CSV, databases, websites)
- Clean data (remove wrong values, fix missing data)
- Find patterns (use math to understand)
That is all data science is. Not complicated.
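Here is a preview of those three steps in a few lines of Python using the pandas library. This is only a rough sketch: the file name sales.csv and the column names region and revenue are placeholders, not real data.
import pandas as pd
# 1. Read data (file name is a placeholder)
df = pd.read_csv("sales.csv")
# 2. Clean data: drop rows with missing values
df = df.dropna()
# 3. Find patterns: average revenue per region (column names are placeholders)
print(df.groupby("region")["revenue"].mean())
Do not worry if this looks unfamiliar yet. The rest of this guide builds up to it step by step.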
Why Python? Five Reasons
Reason 1: Easy to Read
Python code looks like English. Compare:
Java:
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello");
    }
}
Python:
print("Hello")
Same result. Python needs 1 line. Java needs 5 lines.
Reason 2: Huge Community
When you have a problem, millions of people have already solved it. You search Google and find an answer in minutes.
Reason 3: Free Everything
No cost. Download Python = $0. All libraries = $0. All tools = $0.
Reason 4: Industry Standard
Google, Netflix, Amazon, Facebook use Python. When you learn Python, you learn what real companies use.
Reason 5: Quick Income Growth
Junior data scientist: $85,000 - $120,000 per year
Senior data scientist: $180,000 - $280,000 per year
That is 2-3x income growth in 5-7 years. Real money.
Getting Started: Week by Week
Week 1: The Foundation
Open Google Colab (colab.research.google.com). Free. No installation.
Learn four things:
1. Variables (Containers)
name = "Alex"
age = 28
salary = 85000
print(name) # Output: Alex
print(age) # Output: 28
print(salary) # Output: 85000
Variables hold information. Think of them as labeled boxes.
2. Lists (Groups of Items)
fruits = ["apple", "banana", "orange"]
numbers = [1, 2, 3, 4, 5]
print(fruits[0]) # Output: apple (first item)
print(fruits[1]) # Output: banana (second item)
print(numbers[-1]) # Output: 5 (last item)
fruits.append("mango") # Add new item
print(fruits) # Output: ['apple', 'banana', 'orange', 'mango']
Lists hold multiple items in order.
3. Loops (Repeat Actions)
numbers = [1, 2, 3, 4, 5]
for number in numbers:
    print(number)
# Output:
# 1
# 2
# 3
# 4
# 5
Loops repeat code. Useful for processing many items.
4. Conditions (If-Then Logic)
age = 28
if age >= 18:
    print("Adult")
else:
    print("Not adult")
# Output: Adult
Conditions make decisions: test something, then do different things based on the result.
Practice this week: Write 10 small programs using these concepts.
Week 2: Numbers and Math (NumPy)
NumPy is a library for fast math. When you have millions of numbers, NumPy operations are often 10-100x faster than plain Python loops.
Basic NumPy
import numpy as np
# Create array
numbers = np.array([1, 2, 3, 4, 5])
# Math operations
doubled = numbers * 2 # [2, 4, 6, 8, 10]
squared = numbers ** 2 # [1, 4, 9, 16, 25]
sum_all = np.sum(numbers) # 15
average = np.mean(numbers) # 3.0
maximum = np.max(numbers) # 5
minimum = np.min(numbers) # 1
NumPy applies the math to every value in the array at once, with no loop needed.
Using NumPy on Data
import numpy as np
# Salaries of 5 employees
salaries = np.array([50000, 60000, 75000, 80000, 90000])
# Find average
avg_salary = np.mean(salaries)
print(f"Average: ${avg_salary}") # Output: Average: $71000
# Find highest
max_salary = np.max(salaries)
print(f"Highest: ${max_salary}") # Output: Highest: $90000
# Give 10% raise to everyone
new_salaries = salaries * 1.10
print(new_salaries) # approximately [55000. 66000. 82500. 88000. 99000.]
Simple. Fast. Powerful.
Practice this week: Do 5 math problems using NumPy. Get comfortable with arrays.
Week 3: Data Tables (Pandas)
Pandas is for reading and organizing data. Think Excel, but in Python and more powerful.
Basic Pandas
import pandas as pd
# Create simple data
data = {
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"salary": [50000, 60000, 75000]
}
# Convert to table
df = pd.DataFrame(data)
# View table
print(df)
# Output:
#       name  age  salary
# 0    Alice   25   50000
# 1      Bob   30   60000
# 2  Charlie   35   75000
This creates a table (like Excel).
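Once the data is in a DataFrame, you can select columns and filter rows directly. A quick sketch that continues from the df defined above:
print(df["salary"])        # one column
print(df[df["age"] > 28])  # only rows where age is greater than 28
This is the kind of filtering that takes several clicks in Excel but one line in Pandas.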
Reading Data from File
import pandas as pd
# Read CSV file
df = pd.read_csv("employees.csv")
# See first 5 rows
print(df.head())
# See information about data
print(df.describe())
# See column names
print(df.columns)
# See data shape (rows, columns)
print(df.shape) # e.g. (614, 13): 614 rows, 13 columns
Most of your work starts here: read data, understand it, clean it.
Basic Data Manipulation
# Find average salary
avg = df["salary"].mean()
print(f"Average salary: ${avg}")
# Find maximum age
max_age = df["age"].max()
print(f"Oldest employee: {max_age} years old")
# Count employees in each department
department_count = df.groupby("department").size()
print(department_count)
# Find average salary by department
dept_salary = df.groupby("department")["salary"].mean()
print(dept_salary)
These operations do what Excel does. But in seconds, not minutes.
Important: Never Use Loops for Data
SLOW (bad):
for i in range(len(df)):
    df.loc[i, "salary"] = df.loc[i, "salary"] * 1.10
FAST (good):
df["salary"] = df["salary"] * 1.10
Same result. The vectorized version is often 100x faster or more.
This is critical: use built-in Pandas operations instead of row-by-row loops.
Practice this week: Load a CSV file. Explore it. Calculate averages. Group by categories.
Week 4: Machine Learning (Your First Model)
This is where things get exciting. You teach a computer to recognize patterns.
The Concept
Machine learning works like this:
- Show computer examples with answers
- Computer learns pattern
- Show computer new examples without answers
- Computer predicts answers
Simple Example: Predict Loan Approval
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Read data
df = pd.read_csv("loan_data.csv")
# Get features (input) and target (output)
features = df[["credit_score", "income", "loan_amount"]]
target = df["approved"] # 1 = yes, 0 = no
# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2
)
# Create model
model = RandomForestClassifier(n_estimators=100)
# Train model (show it examples)
model.fit(X_train, y_train)
# Test model (check if it learned)
accuracy = model.score(X_test, y_test)
print(f"Model accuracy: {accuracy*100:.1f}%") # Output: Model accuracy: 78.5%
# Make prediction on new data
new_person = [[700, 50000, 200000]] # credit score, income, loan amount
prediction = model.predict(new_person)
print(f"Loan approved: {prediction[0] == 1}") # Output: Loan approved: True
This short program:
- Reads data
- Splits data
- Creates model
- Trains model
- Tests model
- Makes prediction
This is professional data science.
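Once a model like this works, you usually save it to a file so other code (for example, a web API) can load it later without retraining. A minimal sketch using joblib; the file name is a placeholder:
import joblib
# Save the trained model to a file (file name is a placeholder)
joblib.dump(model, "loan_model.joblib")
# Later: load it back and predict without retraining
loaded_model = joblib.load("loan_model.joblib")
print(loaded_model.predict([[700, 50000, 200000]]))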
The 2026 Additions: New Tools
Polars (Fast Pandas for Big Data)
When your data is huge (tens of gigabytes), Polars is often many times faster than Pandas.
import polars as pl
# Read a large CSV file (for files bigger than memory, use pl.scan_csv)
df = pl.read_csv("huge_file.csv")
# Filter and aggregate
result = df.filter(
    pl.col("age") > 25
).group_by("city").agg(
    pl.col("salary").mean()
)
print(result)
Polars syntax is similar to Pandas. Easy to learn if you know Pandas.
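To see how similar they feel, here is the same filter-and-group written both ways. A small self-contained sketch: the sample data is made up, and group_by is the method name in recent Polars versions.
import pandas as pd
import polars as pl
data = {"age": [22, 30, 41], "city": ["NY", "NY", "LA"], "salary": [50000, 70000, 90000]}
# Pandas version
df_pd = pd.DataFrame(data)
print(df_pd[df_pd["age"] > 25].groupby("city")["salary"].mean())
# Polars version: same logic, slightly different syntax
df_pl = pl.DataFrame(data)
print(df_pl.filter(pl.col("age") > 25).group_by("city").agg(pl.col("salary").mean()))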
LLM Integration (AI Helps You Code)
ChatGPT can now write Python code for you.
You ask: "Write Python code to read a CSV and find average salary by department."
ChatGPT writes:
import pandas as pd
df = pd.read_csv("employees.csv")
avg_salary = df.groupby("department")["salary"].mean()
print(avg_salary)
This saves time. Instead of writing every line yourself, you direct the AI and review what it writes.
FastAPI (Put Your Model Online)
After you build a model, people want to use it.
from fastapi import FastAPI
app = FastAPI()
# Your model predicts loan approval (model is the trained classifier, loaded elsewhere)
@app.post("/predict")
def predict_loan(income: int, credit_score: int):
    prediction = model.predict([[credit_score, income]])
    return {"approved": bool(prediction[0] == 1)}
Now people can use your model on a website. This is real professional work.
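Here is how a client could call that endpoint once the API is running (for example with uvicorn). This is a sketch: the host, port, and input values are assumptions, and the two arguments go in as query parameters because that is how FastAPI treats plain int parameters by default.
import requests
# Assumes the API is running locally on port 8000
response = requests.post(
    "http://127.0.0.1:8000/predict",
    params={"income": 50000, "credit_score": 700},
)
print(response.json())  # e.g. {'approved': True}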
Critical Mistakes to Avoid
Mistake 1: Using Loops on Data
Bad:
total = 0
for salary in salaries:
    total = total + salary
average = total / len(salaries)
Good:
average = np.mean(salaries)
Loops are slow. Vectorization (all at once) is fast.
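You can verify this claim yourself by timing both approaches. A minimal sketch with the time module; the array size is arbitrary and the exact speedup depends on your machine.
import time
import numpy as np
salaries = np.random.randint(30000, 200000, size=1_000_000)
# Slow: loop over every value
start = time.time()
total = 0
for s in salaries:
    total = total + s
loop_average = total / len(salaries)
loop_time = time.time() - start
# Fast: one vectorized call
start = time.time()
numpy_average = np.mean(salaries)
numpy_time = time.time() - start
print(f"Loop: {loop_time:.3f}s, NumPy: {numpy_time:.5f}s")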
Mistake 2: Not Cleaning Data
Real data is dirty. Missing values. Wrong values. Duplicates.
# Check for problems
print(df.isnull().sum()) # How many empty cells?
print(df.duplicated().sum()) # How many duplicate rows?
# Fix problems
df = df.dropna() # Remove rows with empty cells
df = df.drop_duplicates() # Remove exact duplicates
df = df[df["age"] > 0] # Remove impossible values
Expect to spend about half your time here. Good data = good model.
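Dropping rows is not the only option. Often it is better to fill missing values instead of throwing data away. A small self-contained sketch; the column names and sample values are made up:
import pandas as pd
import numpy as np
df = pd.DataFrame({"age": [25, np.nan, 35], "department": ["IT", "HR", None]})
# Fill missing ages with the median age instead of dropping those rows
df["age"] = df["age"].fillna(df["age"].median())
# Fill missing departments with an explicit label
df["department"] = df["department"].fillna("Unknown")
print(df)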
Mistake 3: Testing on Training Data
If you train and test on same data, model looks perfect (99% accurate). But it is lying.
Always split:
- 80% training data (teach model)
- 20% test data (check if it learned)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2
)
# Train on training data
model.fit(X_train, y_train)
# Test on test data only
accuracy = model.score(X_test, y_test)
Now accuracy is real (maybe 75-80%, not 99%).
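An easy way to see the gap is to score the same model on both sets. A brief sketch that continues from the code above; the exact numbers will vary.
# Accuracy on the data the model was trained on (usually looks too good)
train_accuracy = model.score(X_train, y_train)
# Accuracy on data the model has never seen (the honest number)
test_accuracy = model.score(X_test, y_test)
print(f"Train: {train_accuracy:.2f}, Test: {test_accuracy:.2f}")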
Mistake 4: Not Using Virtual Environments
Different projects need different library versions. Without virtual environments, they conflict.
# Create isolated environment
conda create -n datasci python=3.11
# Activate it
conda activate datasci
# Install libraries
pip install pandas numpy scikit-learn
Each project stays clean and separate.
Your 12-Week Learning Path
| Week | Focus | What You Learn |
|---|---|---|
| 1-2 | Python Basics | Variables, lists, loops, conditions |
| 3 | NumPy | Arrays, math operations |
| 4 | Pandas | Data tables, reading files, manipulation |
| 5-6 | Machine Learning | Classification, regression, models |
| 7 | Data Cleaning | Handle missing values, fix problems |
| 8 | Model Evaluation | Accuracy, precision, recall |
| 9 | Feature Engineering | Create useful input variables |
| 10 | Advanced Models | Gradient Boosting, Neural Networks |
| 11 | Deployment | FastAPI, Docker, production |
| 12 | Real Project | Build complete project, deploy it |
After 12 weeks: You are ready to apply for junior data scientist jobs ($85K+).
Salaries: What You Will Earn
| Stage | Years | Salary | What Changes |
|---|---|---|---|
| Start | 0 | $55K-$85K | Learn Python |
| Junior | 2 | $85K-$120K | Build models |
| Mid-level | 5 | $120K-$180K | Understand production |
| Senior | 8 | $180K-$280K | Use AI, lead projects |
| Expert | 10+ | $250K-$400K+ | Design systems |
The jump from $120K to $280K happens when you understand:
- Production deployment
- Model monitoring
- AI/LLM integration
- System design
Not just time. Skills.
Start Right Now
- Go to colab.research.google.com
- Create new notebook
- Copy this code:
print("I am starting my data science journey")
print("I will learn Python")
print("I will build real projects")
print("I will earn $100K+")
- Run it. See that it works.
You have started. You are a programmer.
Everything else is practice.
In 12 weeks: You will be ready for a real job.
In 2 years: You will earn $85K-$120K.
In 5 years: You will earn $120K-$180K.
Start today. Not tomorrow. Today.