I'm learning machine learning, and I want to share this journey with you. Not as an expert—I literally started Kaggle's "Intro to Machine Learning" course last week—but as someone who just figured out how to build their first predictive model and wants to help others do the same.
If you've been curious about AI and machine learning but thought it was too complicated, or if terms like "neural networks" and "algorithms" sound intimidating, this post is for you. Let me show you that it's actually way more approachable than you think!
Why I'm Learning Machine Learning
I've been fascinated by AI for a while now. Every time I see AI-powered recommendations on Netflix, autocomplete on my phone, or ChatGPT writing code, I wonder: "How does this actually work?"
I wanted to go beyond just using AI tools—I wanted to understand the fundamentals. That's when I discovered Kaggle's free "Intro to Machine Learning" course, and honestly? It's been one of the best decisions I've made this year.
My goal: Understand how machines learn from data and build my own models (even simple ones!)
Why I'm sharing publicly: Learning in public keeps me accountable, helps me remember concepts better by teaching them, and hopefully helps someone else who's just starting out.
What I Learned This Week
Here are the key concepts I wrapped my head around:
- How to load and explore data with Pandas
- The difference between features (X) and targets (y)
- Building a decision tree model
- Why you can't just test on training data (this was a big "aha!" moment)
- What overfitting and underfitting actually mean
- How Random Forests make better predictions
Now let me teach you what I learned!
Tutorial: Build Your First Machine Learning Model (Seriously, You Can Do This!)
Let me walk you through building a house price predictor using the Melbourne Housing dataset—the same project I just completed. We'll go step by step, and I'll explain everything in plain English.
What You'll Need
- A Kaggle account (it's free!)
- Basic Python knowledge (if you know variables and functions, you're good)
- The Melbourne Housing dataset from Kaggle (it's already available when you start!)
Pro tip: I'm doing this directly in a Kaggle notebook - no setup required! Just click "New Notebook" on Kaggle and you're ready to code.
Step 1: Understanding the Problem
Goal: Predict how much a house in Melbourne will cost based on its features (size, number of rooms, land size, etc.)
**Think of it like this:** If I told you a Melbourne house has 4 bedrooms, 2 bathrooms, 500 sqm land, and 150 sqm building area, could you guess roughly what it costs? You'd probably compare it to other houses you know. That's exactly what we're teaching the computer to do!
Step 2: Setting Up - Import Libraries
First, we need to bring in our tools:
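Here's a minimal version of what I ran (your exact imports may vary, but these cover every step below):

```python
import pandas as pd  # data tables (like Excel for Python)

from sklearn.tree import DecisionTreeRegressor        # our first model type (Step 7)
from sklearn.ensemble import RandomForestRegressor    # the improved model (Step 9)
from sklearn.model_selection import train_test_split  # for splitting data (Step 6)
from sklearn.metrics import mean_absolute_error       # for measuring accuracy (Step 8)
```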

What just happened?
- `pandas` is like Excel for Python—it handles data tables
- `sklearn` (scikit-learn) contains all our machine learning tools
- We're importing specific tools we'll need for building and testing models
Think of it like: Opening your toolbox before starting a project - we're grabbing the hammer, screwdriver, and wrench we'll need!
Step 3: Loading the Data
Important note for Kaggle users: When you attach a dataset in Kaggle, it's stored in /kaggle/input/[dataset-name]/. That's why we use that special path!
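Loading is one line with pandas. The folder name below is the usual slug for the Melbourne Housing Snapshot dataset; double-check the exact path in your notebook's file browser:

```python
# Read the CSV into a pandas DataFrame
melbourne_file_path = '/kaggle/input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
```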
Step 4: Exploring the Data
Before building any model, you need to understand your data:
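Here's a minimal sketch of the exploration I did (two commands get you surprisingly far):

```python
# Summary statistics for every numeric column:
# count, mean, std, min, quartiles (25% / 50% / 75%), max
print(melbourne_data.describe())

# Column names, data types, and how many non-missing values each column has
melbourne_data.info()
```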

What this tells me:
- count: How many houses have this information
- mean: The average value (e.g., average price is around $1M!)
- min/max: The range of values
- 50%: The median (middle value) - helps spot outliers
- Missing values: Some houses don't have all information filled in
Why this matters: The Melbourne dataset has 13,580 houses, but I noticed that some columns like BuildingArea only have 7,130 values. That means almost half the houses are missing this info! We need to handle this.
Step 5: Choosing and Cleaning Our Data
Here's where we decide what to use and clean up the missing values:
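Here's a sketch of this step. One thing to note: calling `.dropna()` on the whole table drops every row that's missing *any* value, and that's what takes us from 13,580 houses down to about 6,196:

```python
# Drop every row with any missing value, then pick out our columns
clean_data = melbourne_data.dropna(axis=0)
print(len(melbourne_data), '->', len(clean_data))  # 13580 -> 6196

feature_names = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea']
X = clean_data[feature_names]  # features: what the model sees
y = clean_data['Price']        # target: what we want it to predict
```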

Breaking this down:
Choosing features - I picked characteristics that logically affect price:
- `Rooms`: More rooms = usually more expensive
- `Bathroom`: More bathrooms = usually more expensive
- `Landsize`: Bigger lot = usually more expensive
- `BuildingArea`: Bigger house = usually more expensive
`.dropna()` - This removes every house that's missing any data:
- Started with 13,580 houses
- After cleaning: about 6,196 complete houses
- We lose some data, but the remaining data is reliable!
Why X and y?
- In math, **X** represents input variables and **y** is what we're solving for
- X (features) → goes into the model → y (prediction) comes out
Real-world analogy: It's like doing a survey - you can only use complete responses, so you filter out any surveys with missing answers.
Step 6: The Critical Step - Split Your Data!
This is where I made my first mistake, so pay close attention!
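The split itself is one line (a sketch; `train_test_split` holds out 25% for validation by default):

```python
# Split features and target into training (75%) and validation (25%) sets.
# random_state=1 makes the shuffle reproducible.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
print(len(train_X), 'training rows,', len(val_X), 'validation rows')
```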

Why split the data? Here's the critical lesson:
Imagine you're studying for a test. You memorize 10 practice questions. Then the test has those EXACT 10 questions. You ace it! But did you actually learn the material, or did you just memorize?
Same with ML models:
- **Training data** (75%): The model learns patterns from these houses
- **Validation data** (25%): We test on houses the model has NEVER seen
What's `random_state=1`?
- Ensures we get the same random split every time
- Makes results reproducible (crucial for debugging!)
- You can use any number (1, 42, 123, etc.)

This split prevents *overfitting* (memorization) and ensures our model actually learned patterns, not just memorized answers!
Step 7: Building Your First Model - Decision Tree
Here's where the magic happens:
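A minimal sketch of the build-and-train step (the variable names are just my choices):

```python
# Define the model, then train ("fit") it on the training data only
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(train_X, train_y)
```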

Breaking it down:
`DecisionTreeRegressor` - This is our model type. Think of it as a flowchart that asks questions:
- "Does the house have more than 3 rooms?"
- "Is the land size larger than 200 sqm?"
- "Is the building area larger than 100 sqm?"
- Based on the answers, it navigates to a prediction
`random_state=1` - Ensures consistent results every time
`.fit(train_X, train_y)` - This is the training! The model studies the training houses and learns patterns
Step 8: Measuring Accuracy - Mean Absolute Error (MAE)
Now let's see how good our model is:
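MAE is simply the average of how many dollars each prediction is off by, so lower is better. A sketch:

```python
# Predict prices for validation houses the model has never seen,
# then average the absolute errors: mean(|actual - predicted|)
val_predictions = melbourne_model.predict(val_X)
tree_mae = mean_absolute_error(val_y, val_predictions)
print(f"Decision tree MAE: ${tree_mae:,.0f}")
```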

Step 9: Making It Better with Random Forest
Single decision trees are okay, but Random Forests are MUCH better. Here's the concept:
Analogy:
- One decision tree = asking one real estate agent's opinion
- Random Forest = asking 100 agents and averaging their opinions
Which would you trust more? The crowd wisdom!
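The code barely changes; we just swap in a different model class (a sketch; 100 trees is also scikit-learn's default for `n_estimators`):

```python
# Same workflow, different model: 100 trees instead of 1
forest_model = RandomForestRegressor(n_estimators=100, random_state=1)
forest_model.fit(train_X, train_y)

forest_predictions = forest_model.predict(val_X)
forest_mae = mean_absolute_error(val_y, forest_predictions)
print(f"Random forest MAE: ${forest_mae:,.0f}")
```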

Why is Random Forest better?
- Creates many different decision trees (typically 100 trees)
- Each tree is trained on a slightly different subset of data
- Each tree makes its own prediction
- Final prediction = average of all trees
- Result: More accurate, more stable, less prone to overfitting!
The comparison: just by switching algorithms, we improved by over $50,000 in prediction accuracy!
Step 10: Comparing Models Side by Side
Let's see the comparison clearly:
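Here's a quick way to print both scores side by side, reusing the MAE values from Steps 8 and 9 (a sketch):

```python
# A small table comparing the two models' validation errors
comparison = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest'],
    'Validation MAE ($)': [round(tree_mae), round(forest_mae)],
})
print(comparison.to_string(index=False))
```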

Random Forest is clearly the winner! 🏆
Step 11: Seeing Real Predictions
Let's look at how our best model (Random Forest) actually predicts on specific houses:
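A simple way to spot-check: grab a few validation houses, predict, and compare with the real prices (a sketch):

```python
# Compare predicted vs actual prices for the first 5 validation houses
sample_X = val_X.head()
sample_predictions = forest_model.predict(sample_X)
sample_actual = val_y.head().to_numpy()

for predicted, actual in zip(sample_predictions, sample_actual):
    print(f"predicted ${predicted:,.0f}  |  actual ${actual:,.0f}")
```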

Step 12: Train Final Model on ALL Data
Once you're satisfied with your model's performance, train it on ALL available data:
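The refit is the same two lines as before, just pointed at all of X and y (a sketch):

```python
# Retrain on every cleaned row, now that validation told us the model works
final_model = RandomForestRegressor(n_estimators=100, random_state=1)
final_model.fit(X, y)
```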
Why do this?
- You already validated that Random Forest works well
- We were holding back 25% of data for validation
- Now we use ALL 6,196 houses to train
- This makes the model even more accurate for real predictions
Think of it like: You practiced with 75% of your study materials and tested yourself on 25%. Now that you know you understand the material, you study ALL of it before the real exam.
Step 13: Predict New House Prices!
Now comes the fun part - predicting prices for houses not in our dataset!
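The house below is made up (it's the example from Step 1). The column names and order must match what the model was trained on:

```python
# A hypothetical house: 4 rooms, 2 bathrooms, 500 sqm land, 150 sqm building
new_house = pd.DataFrame(
    [[4, 2, 500, 150]],
    columns=['Rooms', 'Bathroom', 'Landsize', 'BuildingArea'],
)
predicted_price = final_model.predict(new_house)[0]
print(f"Predicted price: ${predicted_price:,.0f}")
```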

🎉 You just built a machine learning model that can predict Melbourne house prices!
Bonus: Finding the Optimal Model Complexity
Want to see how different tree sizes affect accuracy? Here's a bonus experiment:
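Here's a sketch of the experiment: cap the tree at different numbers of leaf nodes and compare validation MAE (this mirrors the `get_mae` helper pattern from the Kaggle course):

```python
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Train a tree capped at max_leaf_nodes leaves, return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    predictions = model.predict(val_X)
    return mean_absolute_error(val_y, predictions)

for max_leaf_nodes in [5, 50, 100, 250, 500]:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes:4d}  ->  Validation MAE: ${mae:,.0f}")
```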

What this shows:
- Too few nodes (5) = too simple, misses patterns
- Too many nodes (500) = memorizes training data
- Sweet spot around 100-250 nodes
This is the overfitting vs underfitting tradeoff in action!
Resources That Helped Me
Here's what I found most useful:
Free Courses
Kaggle's Intro to Machine Learning - Where I started (highly recommend!)
Documentation
- Scikit-learn docs - Super clear with tons of examples
- Pandas docs - Essential for data manipulation
Tips for Absolute Beginners
If you're just starting out like me, here's my advice:
1. You Don't Need a Math PhD
I'm not a math genius. I haven't taken calculus in years. You can still learn ML! Start with the practical stuff; the math will make more sense later.
2. Code Along, Don't Just Watch
I learn by doing. Watch a tutorial, then code it yourself. Change things. Break stuff. See what happens.
3. Start with Kaggle
Kaggle gives you:
- Free courses with interactive coding
- Datasets ready to use
- A community of learners
- Real competitions to test your skills
4. Don't Get Stuck on Theory
I spent 2 days trying to understand decision tree math. Then I just built one and it clicked. Sometimes you need to do it to get it.
5. Learn in Public
Sharing my learning journey:
- Keeps me accountable
- Helps me remember by teaching others
- Connects me with other learners
- Creates a portfolio of my progress
6. It's Okay to Be Confused
I was confused for 90% of this week. That's normal! Push through it. Things will click.
The Honest Truth About Learning ML
Let me be real with you:
It's not as hard as you think. The basics of ML are surprisingly accessible. You don't need to understand complex math to build your first models.
It's harder than it looks. There's a lot of trial and error. Your first models will probably be bad. That's okay!
It's incredibly rewarding. When your model makes its first decent prediction, it feels like magic (even though you know it's not).
It's an ongoing journey. A week in, I've barely scratched the surface. There's so much more to learn, and that's exciting!
Final Thoughts
A week ago, I thought machine learning was this impossibly complex field. Today, I built a model that can predict house prices with reasonable accuracy.
Is it perfect? No.
Am I an expert? Definitely not.
Did I learn a ton and have fun doing it? Absolutely!
If you've been curious about ML but haven't taken the first step—this is your sign. Start today.
You don't need to be ready, you don't need to know everything, you just need to begin.
The best time to start learning ML was yesterday. The second best time is right now.
Let's do this! 🚀
Thanks for reading! Now go build something cool! 💻✨