Temiloluwa Valentine

Predict House Prices with Python: A Beginner’s Machine Learning Guide


In the last article, “Getting Started with AI,” we covered the fundamentals—what machine learning is, the types of problems it solves, and the tools you need.

Theory is important. But it only matters when you build something real.

So let’s build.

In this article, you’ll learn how to predict house prices using machine learning.

Not a toy example. A real regression problem that real estate companies, investors, and data scientists solve every day.

You’ll understand:

  • How to structure data for a model
  • How to train a machine learning system
  • How to test if it actually works
  • How to make predictions on new data

By the end, you’ll have built your first machine learning model. And more importantly, you’ll understand the process, because this same process works for predicting stock prices, weather, customer churn, or anything else.

Let’s go.

Step 1: Get Your Data

Machine learning starts with data. You need examples to learn from.

For this project, we’re using the Housing Prices dataset from Kaggle, a free dataset with real house data: size, number of bedrooms, bathrooms, parking, and so on, and, most importantly, the price.

This is your training material. The model will learn the relationship between house features (size, bedrooms) and price.

How to get the data:

  1. Go to Kaggle and search “Housing Prices Dataset”, or just use this link (https://www.kaggle.com/datasets/yasserh/housing-prices-dataset)
  2. Download the CSV file
  3. Upload it to Google Colab

Load the data:

```python
import pandas as pd

# Load the dataset (assumes Housing.csv was uploaded to your Colab session)
df = pd.read_csv("Housing.csv")

# Look at the first few rows
print(df.head())
```

You now have your data loaded. Next step: prepare it for the model.
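Before moving on, it’s worth a quick sanity check. Here’s a minimal sketch using a tiny hand-made stand-in for the Kaggle file (hypothetical values, so it runs anywhere) showing the two things to look for: the shape of the data and any missing values.

```python
import pandas as pd

# A tiny stand-in for the Kaggle data (hypothetical values)
df = pd.DataFrame({
    "price": [13300000, 12250000, 9870000],
    "area": [7420, 8960, 9960],
    "bedrooms": [4, 4, 3],
    "mainroad": ["yes", "yes", "no"],
})

# Shape tells you rows x columns (the real Housing.csv has a few hundred rows)
print(df.shape)

# Check for missing values before training -- the model can't handle NaNs
print(df.isnull().sum())
```

On the real dataset, the same two lines tell you how many examples you have and whether any column needs cleaning first.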

Step 2: Prepare Your Data

Raw data isn’t ready for machine learning. You need to organize it.

Your dataset has features (inputs) and a target (output). Features are what you know: size, bedrooms, bathrooms, and so on. The target is what you want to predict: price.

The model learns the relationship between features and target. So you need to separate them.

Separate features and target:

```python
# Target (what we want to predict)
y = df['price']

# Features (everything except the price column)
X = df.drop('price', axis=1)

# Convert yes/no columns to 1/0.
# Machine learning only understands numbers, not text,
# so we map "yes" to 1 and "no" to 0.
binary_columns = [
    'mainroad', 'guestroom', 'basement',
    'hotwaterheating', 'airconditioning', 'prefarea'
]
for col in binary_columns:
    X[col] = X[col].map({'yes': 1, 'no': 0})

# furnishingstatus needs special handling because it has three values
# (furnished, semi-furnished, unfurnished), so we one-hot encode it
X = pd.get_dummies(X, columns=['furnishingstatus'], drop_first=True)
```
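A quick way to confirm the encoding worked: after the mapping and `get_dummies`, no text columns should remain. Here’s a minimal sketch on a tiny hypothetical sample that reuses the same column names:

```python
import pandas as pd

# Tiny sample mimicking the Housing columns (hypothetical values)
X = pd.DataFrame({
    "area": [7420, 8960],
    "mainroad": ["yes", "no"],
    "furnishingstatus": ["furnished", "unfurnished"],
})

# Same two transformations as above
X["mainroad"] = X["mainroad"].map({"yes": 1, "no": 0})
X = pd.get_dummies(X, columns=["furnishingstatus"], drop_first=True)

# Every column should now be numeric or a 0/1 dummy -- no text left
print(X.dtypes)
print(X.head())
```

If `X.dtypes` still shows an `object` column on your real data, a text column slipped through and training will fail.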

Split into training and testing:

Here’s the critical part: you can’t test on the same data you trained on. The model will memorize the answers instead of learning.

So split your data:

  • 80% for training (the model learns)
  • 20% for testing (we check if it actually works)

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
What random_state means:

Scikit-learn randomly selects 80% of the data for training and 20% for testing. If you don’t set random_state, that random split is different every time you run the code: your training data changes, your test data changes, and your accuracy changes. That’s bad for debugging and for comparing models. Fixing random_state (any number works; 42 is just a convention) makes the split reproducible.

Why this matters:

Training data teaches the model. Test data proves it works on new data it’s never seen.

Without this split, you’ll think your model is perfect. But it will fail when it meets real, unseen data.
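To see this failure mode concretely, here’s a sketch on synthetic data (hypothetical prices generated from area plus noise, not the Kaggle file). An unconstrained decision tree memorizes its training set, so its training score looks perfect while the test score tells the truth:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic "houses": price depends on area plus random noise
rng = np.random.default_rng(42)
area = rng.uniform(1000, 10000, size=200)
price = 500 * area + rng.normal(0, 500000, size=200)
X = area.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=42
)

# An unconstrained decision tree memorizes its training examples
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_r2 = r2_score(y_train, tree.predict(X_train))
test_r2 = r2_score(y_test, tree.predict(X_test))
print("R² on training data:", train_r2)  # near-perfect: the model memorized
print("R² on test data:", test_r2)       # lower: the honest number
```

The gap between the two scores is exactly what the train/test split exists to expose.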

Now your data is ready. Time to train.

Step 3: Train the Model

Now comes the magic. You’re going to teach a machine to predict house prices.

Create the model:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
```

That’s it. You’ve created an empty machine learning model. It knows nothing yet.

Train it:

```python
# Train on your data
model.fit(X_train, y_train)
```

This is where learning happens. The model analyzes your training data to identify the mathematical relationship between features (size, bedrooms, and so on) and price.

It’s asking, “What pattern connects these house features to their prices?”

What’s happening behind the scenes:

The model is drawing a line (or curve) through your data. It’s trying to find the best line that fits all the houses, where features predict price most accurately.

This process is called “fitting” or “training.”

In a few seconds, your model learned from hundreds of house examples. That’s machine learning.
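If you’re curious what the model actually learned, linear regression’s “pattern” is just one weight per feature plus an intercept, and you can read them off directly. A minimal sketch on hypothetical, noise-free data where price is exactly 300 per unit of area plus 50,000 per bedroom:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical mini-dataset: price = 300*area + 50000*bedrooms, no noise
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 5]])
y = 300 * X[:, 0] + 50000 * X[:, 1]

model = LinearRegression()
model.fit(X, y)

# The "learned pattern" is one weight per feature plus an intercept
print("Coefficients:", model.coef_)    # approximately [300, 50000]
print("Intercept:", model.intercept_)  # approximately 0
```

On the real housing data the coefficients won’t be this clean, but `model.coef_` still tells you how much each feature pushes the predicted price up or down.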

Step 4: Test the Model

Your model is trained. But does it work?

Time to test it on data it’s never seen before.

Make predictions:

```python
# Predict on test data
predictions = model.predict(X_test)
print(predictions[:5])
```

The model now looks at houses in the test set and predicts their prices. It’s guessing based on what it learned.

Compare the actual price with the predicted price:

```python
comparison = pd.DataFrame({
    "Actual Price": y_test.values[:5],
    "Predicted Price": predictions[:5]
})
print(comparison)
```
Check how accurate it is:

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Calculate the error metrics
print("R²:", r2_score(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))
```

After training and testing our linear regression model, we can see how well it predicts house prices:

  • R² Score: 0.65
  • Mean Absolute Error (MAE): ₦970,000

What this means:

R² Score (0–1): Measures how much of the variation in house prices the model can explain.

0.65 means our model explains about 65% of the differences in house prices.

The closer to 1, the better the model is at capturing patterns.

Mean Absolute Error (MAE): Shows the average amount our predictions are off.

₦970,000 means, on average, the predicted price is roughly ₦970k higher or lower than the actual price.

Lower is better, but for a first beginner model, this is acceptable.

Even though the model isn’t perfect, it successfully learns patterns from the data. This is exactly what beginners need to understand: how to go from raw data to predictions using machine learning.
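To close the loop on going from raw data to predictions, here’s a hedged sketch (hypothetical numbers, only two features for brevity) of the final payoff: predicting the price of a house the model has never seen. The key rule is that the new house must be passed in with the same columns, in the same order, as the training data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data with the same features used for the new house
X_train = pd.DataFrame({"area": [3000, 4000, 5000, 6000],
                        "bedrooms": [2, 3, 3, 4]})
y_train = pd.Series([4_500_000, 6_000_000, 7_000_000, 8_800_000])

model = LinearRegression()
model.fit(X_train, y_train)

# A brand-new house -- columns must match the training columns exactly
new_house = pd.DataFrame({"area": [4500], "bedrooms": [3]})
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: {predicted_price:,.0f}")
```

On the real dataset you would build the new house’s row with all the encoded columns (the 1/0 flags and the furnishing dummies), then call `model.predict` the same way.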

With this foundation, you can now experiment with more features, larger datasets, or advanced models in the future.

Conclusion: You’re Now a Machine Learning Engineer

You just built a real machine learning system.

Not in theory. In practice. With code. With data. With real predictions.

What you learned:

  1. Data is everything—garbage in, garbage out
  2. Splitting data prevents lying to yourself
  3. Training finds patterns automatically
  4. Testing proves it actually works
  5. Predictions are just applying what you learned

Why this matters:

This exact process solves real problems:

  • Predicting stock prices
  • Detecting diseases in medical images
  • Recommending products
  • Forecasting demand
  • Detecting fraud

Every machine learning project follows this same pipeline. Master it, and you can build anything.

What’s next:

Now that you understand the process, you can:

  • Try different algorithms (Random Forest, SVM, Neural Networks)
  • Use bigger datasets
  • Add more features
  • Build on real problems in your own life
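Swapping in one of those algorithms is often a tiny change, because scikit-learn models share the same fit/predict interface. A sketch using RandomForestRegressor on synthetic stand-in data (hypothetical values, not the Kaggle file):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for housing data: two numeric features plus noise
rng = np.random.default_rng(0)
X = rng.uniform(1000, 10000, size=(300, 2))
y = 400 * X[:, 0] + 100 * X[:, 1] + rng.normal(0, 100000, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same fit/predict interface as LinearRegression -- only the class changes
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

test_r2 = r2_score(y_test, model.predict(X_test))
print("Test R²:", test_r2)
```

Because the interface is shared, you can try several algorithms on the housing data and keep whichever scores best on the test set.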

The tools are free. The knowledge is available. The only limit is how much you’re willing to build.

Keep building.

Use this process on a problem you care about.

That’s where real learning happens.


#AI #MachineLearning #BuildingInPublic
