David Bean

Building My First ML Data Pipeline

Three Days, One Deployed Dashboard, and a Lesson About Letting Data Drive Business Questions

I just finished my first complete machine learning project—a renewable energy investment analysis dashboard that's now live on Streamlit Cloud. Three days of work. 181,915 rows of data. And one really important lesson: your initial business problem is probably wrong.

I'm a software engineer learning ML with Claude designing my course. This project clarified a lot about how data science work actually happens.

Day 1: When Your Business Problem Meets Reality

I started with a plan: build a tool to help optimize fossil fuel plant modernization schedules based on renewable production patterns. Sounded reasonable. Turned out to be impossible with my data.

I had a renewable energy dataset covering 52 countries from 2010-2022. Six energy types. Good coverage. But after loading it into the interactive EDA dashboard I'd built the previous week, reality hit:

  • Dataset showed production, not capacity or demand
  • Renewables depend on weather—you can't schedule them
  • No grid data, no regional breakdowns
  • Historical trends can't predict modernization timing

My business problem didn't match what the data could actually answer.

The pivot: I asked a different question. Instead of "when should plants modernize," I asked "which countries represent the best opportunities for battery storage investments based on renewable penetration, growth rates, and energy mix diversity?"

That question? The data could answer it perfectly.

What I Learned: Validate Before You Commit

The EDA dashboard from Week 2 was useful here. Twenty minutes of exploration showed me:

  • Scale mismatches (totals mixed with individual sources)
  • Missing data patterns (expected in first-year entries)
  • Distribution issues (couldn't fix with log transforms)
  • Time coverage worked for trend analysis
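
Checks along these lines surface all four issues; a rough sketch, with the file name and column names as placeholders:

import pandas as pd

df = pd.read_csv('renewable_energy.csv')  # placeholder filename

# Scale check: are aggregate totals mixed in with individual sources?
print(df.groupby('Energy_Type')['Production_GWh'].describe())

# Missing data: which columns have gaps, and do they cluster in early years?
print(df.isna().sum())
print(df.groupby('Year')['Production_GWh'].count())

# Time coverage: is the 2010-2022 span complete for each country?
print(df.groupby('Country')['Year'].agg(['min', 'max', 'nunique']))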

Claude pointed out the business problem didn't match the data. You deal with the situation you're in, so we pivoted to a question the data could actually answer.

Day 1 Continued: The Preprocessing Pipeline

Coming from C++ where I think about data flow and single responsibilities, I built a five-function pipeline:

load_and_clean → filter → aggregate → calculate_metrics → rank

Each function takes a DataFrame, returns a DataFrame, has one clear job, prints progress, and handles edge cases.
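
A sketch of the shape, with column names and metric details as placeholders (the real functions do more, but the DataFrame-in, DataFrame-out contract is the point):

import pandas as pd

def load_and_clean(path: str) -> pd.DataFrame:
    """Stage 1: load the raw CSV, drop duplicates, report progress."""
    df = pd.read_csv(path).drop_duplicates()
    print(f"load_and_clean: {len(df):,} rows")
    return df

def calculate_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 4: add the columns the ranking step scores on."""
    df = df.copy()
    df['renewable_share'] = df['renewable_gwh'] / df['total_gwh'] * 100  # placeholder columns
    print(f"calculate_metrics: {len(df):,} rows")
    return df

# filter, aggregate, and rank follow the same shape, so the pipeline is
# just function composition:
#   rank(calculate_metrics(aggregate(filter(load_and_clean(path)))))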

The Scale Problem I Almost Got Wrong

Early on, my visualizations looked terrible. Some categories showed values 100x larger than others. My first instinct: log transformation.

Wrong.

The real issue: my data mixed individual renewable sources (Hydro = 1,000 GWh) with aggregate totals (Total Electricity = 200,000 GWh). These shouldn't be on the same chart at all.

Solution: Filter out aggregates entirely. Keep only the discrete renewable sources.
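
In pandas that filter is a couple of lines. A sketch, assuming an Energy_Type column and an explicit list of aggregate labels (the real labels may differ):

# Aggregate rows that don't belong on the same chart as individual sources
AGGREGATE_LABELS = ['Total Electricity', 'Total Renewable Energy']  # placeholder labels

sources_only = df[~df['Energy_Type'].isin(AGGREGATE_LABELS)].copy()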

This wasn't a math problem—it was a data structure problem. No transformation fixes a fundamental category mismatch.

Day 2: When Your Model Is "Wrong" (But Actually Right)

I trained a Random Forest model to predict storage infrastructure scores:

  • Input: Percentages of Hydro, Wind, Solar, Geothermal, Other
  • Output: Storage need score (0-100)
  • Performance: R² = 0.948
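
For context, a training setup along these lines reproduces that structure in scikit-learn; the feature column names, split, and hyperparameters are illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

FEATURES = ['hydro_pct', 'wind_pct', 'solar_pct', 'geo_pct', 'other_pct']  # placeholder names
X = metrics_df[FEATURES]          # metrics_df = output of the Day 1 pipeline
y = metrics_df['storage_score']   # the 0-100 score from the Day 1 formula

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
print("R²:", r2_score(y_test, model.predict(X_test_scaled)))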

Model worked. Then I tested extreme cases:

100% Hydro: Score 56.21

100% Wind: Score 31.37

Wait. Wind is intermittent—shouldn't it need MORE storage than stable hydro? Why was my model backwards?

I debugged for 15 minutes before realizing: the model wasn't wrong. My assumption was.

My Day 1 scoring formula:

storage_score = 0.4 × renewable_share + 0.4 × growth_rate + 0.2 × diversity

This measured investment opportunity, not technical storage need. Countries with high hydro (Norway, Iceland) scored high because:

  • Very high renewable penetration (27-30%)
  • Mature markets ready for more storage
  • High penetration signals strong renewable commitment

The model learned exactly what I trained it on. I just forgot what I'd actually built versus what I thought I was building.

Lesson: Models optimize for your training signal, not your intentions. When behavior seems wrong, check what you actually trained it on.

Day 3: Production Deployment Teaches Fast

I built a four-tab Streamlit dashboard:

  1. Overview: Top 10 investment opportunities
  2. Country Analysis: Interactive comparisons
  3. Predictions: ML model with input sliders
  4. Technical Details: Full methodology
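
The tab scaffolding itself is only a few lines of Streamlit. A sketch with placeholder content:

import streamlit as st

st.title("Renewable Energy Storage Investment Analysis")

overview, country, predictions, details = st.tabs(
    ["Overview", "Country Analysis", "Predictions", "Technical Details"]
)

with overview:
    st.subheader("Top 10 Investment Opportunities")
    st.dataframe(top10_df)   # placeholder: the ranked pipeline output

with predictions:
    hydro = st.slider("Hydro %", 0.0, 100.0, 30.0)
    # ...remaining sliders, then scale the inputs and call model.predict()
    # (see Problem 3 below for why scaling comes first)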

Building for production exposed design flaws I'd never catch in a Jupyter notebook.

Problem 1: Path Management

Local: model = joblib.load('storage_model.pkl') worked fine

Streamlit Cloud: FileNotFoundError

Why? My dashboard lived in a src/ subfolder, models in the parent directory. Relative paths resolved from where the code runs, not where the file lives.

Fix:

import os
import joblib

# Resolve paths relative to this file, not the current working directory
current_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_dir)   # models live one level above src/
model = joblib.load(os.path.join(parent_dir, 'storage_model.pkl'))

Problem 2: Requirements File Location

Streamlit Cloud looks for requirements.txt at repository root, not in subdirectories. Took two deployment failures to figure this out.
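
The layout that works, roughly (requirements.txt at the root; dashboard.py and scaler.pkl are illustrative names):

repo-root/
├── requirements.txt      ← Streamlit Cloud only picks it up here
├── storage_model.pkl
├── scaler.pkl
└── src/
    └── dashboard.py      ← app entry point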

Problem 3: Feature Scaling

Almost made a critical mistake: feeding raw percentages directly to the model.

Wrong:

input_data = np.array([[hydro, wind, solar, geo, other]])
prediction = model.predict(input_data)  # Wrong!

Right:

input_data = np.array([[hydro, wind, solar, geo, other]])
input_scaled = scaler.transform(input_data)  # Scale first!
prediction = model.predict(input_scaled)

Models trained on scaled features expect scaled inputs. Skip this step and the model still runs, but the predictions are meaningless.
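
The way to keep the two in sync is to persist the fitted scaler right next to the model at training time and load both in the dashboard. A sketch, with scaler.pkl as an assumed filename:

import joblib

# Training script: save the scaler that was fit on the training data so the
# dashboard can apply the exact same transform at prediction time
joblib.dump(model, 'storage_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Dashboard: load both (see the path fix above), transform, then predict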

Lesson: Development and production environments have different problems. Same issues I deal with in systems work—environment differences, dependencies, synchronization—show up in ML deployments.

What Three Days Produced

  • Live dashboard with public URL
  • GitHub repo with professional README
  • Trained ML model (three deployment patterns: batch/API/edge)
  • Complete data pipeline with reproducible preprocessing
  • Documentation with screenshots

Top investment opportunities identified:

  1. Netherlands (63.08) - 838% growth rate
  2. Iceland (62.05) - 29.5% renewable penetration
  3. Norway (59.47) - Strong baseline, steady growth
  4. Hungary (52.82) - 658% growth, emerging market
  5. UK (48.90) - Large market, 504% growth

Technical stats:

  • 181,915 data points processed
  • 52 countries analyzed
  • 156 months of time series
  • 8,033 predictions/second (batch)
  • 89.4 KB model (ONNX edge deployment)
  • R² = 0.948

What Actually Surprised Me

1. Preprocessing Takes Most of the Time

In C++, optimization takes most of the time. In ML, data cleaning and feature engineering dominated. Good preprocessing makes modeling straightforward. Bad preprocessing makes it impossible.

2. Production Deployment Shows Problems Fast

Jupyter notebooks hide issues:

  • Path dependencies
  • Environment differences
  • Feature scaling synchronization
  • Input validation

Deploying early forced me to deal with these.

3. The README Matters

I spent 30 minutes writing a professional README:

  • Business problem clearly stated
  • Technical approach explained
  • Setup instructions
  • Screenshots
  • Live demo URL

Project looks more complete with good documentation.

4. End-to-End Matters More Than Depth

I could've spent three days optimizing model accuracy from 0.948 to 0.952. Instead I built a complete pipeline: data → model → deployment → documentation.

For real job hunting, I hope this matters more.

Real Bugs I Hit

Bug 1: Streamlit Cloud couldn't find plotly module

Cause: requirements.txt in wrong directory

Fix: Moved to repo root, specified plotly>=5.0.0

Bug 2: Model files not loading

Cause: Relative paths broken in cloud environment

Fix: Used os.path.dirname(__file__) for portable paths

Bug 3: "Random Forest" truncated in UI columns

Cause: Text too long for column width

Fix: Rendered it as a subheader instead of a metric inside the column

Bug 4: Predictions looked weird

Cause: Forgot to scale input features

Fix: Applied scaler before model.predict()

Claude caught most of these during code review. I understand the patterns now—scoping issues, path management, feature preprocessing flow. I'm delegating implementation details and focusing on understanding architecture.

What's Next

This was Portfolio Project 1 of 6. Each project adds new capabilities:

  • Project 1 (Done): Data analysis dashboard, traditional ML
  • Project 2: Traditional ML pipeline with feature engineering
  • Project 3: Deep learning computer vision
  • Project 4: Generative AI with LLMs
  • Project 5: MLOps with CI/CD
  • Project 6: ML systems engineering specialization

Goal isn't just learning ML—it's building a portfolio proving I can deliver production ML systems.

Tools That Helped

  • Streamlit (dashboard framework)
  • Plotly (interactive viz)
  • scikit-learn (Random Forest, preprocessing)
  • Pandas (data manipulation)
  • Streamlit Cloud (deployment)
  • Claude (course design, code review, debugging partner)

Live Demo & Code

🔗 Live Dashboard: https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/

💻 GitHub Repository: https://github.com/bean2778/ai_learning_2025

📊 Dataset: Global Renewable Energy Production (2010-2022)


About this series: I'm a software engineer learning machine learning with Claude designing my curriculum. Week 3 done: EDA, problem formulation, first portfolio project deployed. More posts coming on traditional ML, deep learning, and production systems.


Next: Traditional ML fundamentals—supervised learning, evaluation metrics, bias-variance tradeoff.


Time: 3 days (Days 19-21 of 270-day roadmap)

Status: Portfolio Project 1 complete ✅

Coffee consumed: Enough
