David Bean

Building My First ML Data Pipeline

Three Days, One Deployed Dashboard, and a Lesson About Letting Data Drive Business Questions

I just finished my first complete machine learning project—a renewable energy investment analysis dashboard that's now live on Streamlit Cloud. Three days of work. 181,915 rows of data. And one really important lesson: your initial business problem is probably wrong.

I'm a software engineer learning ML with Claude designing my course. This project clarified a lot about how data science work actually happens.

Day 1: When Your Business Problem Meets Reality

I started with a plan: build a tool to help optimize fossil fuel plant modernization schedules based on renewable production patterns. Sounded reasonable. Turned out to be impossible with my data.

I had a renewable energy dataset covering 52 countries from 2010-2022. Six energy types. Good coverage. But after loading it into the interactive EDA dashboard I'd built the previous week, reality hit:

  • Dataset showed production, not capacity or demand
  • Renewables depend on weather—you can't schedule them
  • No grid data, no regional breakdowns
  • Historical trends can't predict modernization timing

My business problem didn't match what the data could actually answer.

The pivot: I asked a different question. Instead of "when should plants modernize," I asked "which countries represent the best opportunities for battery storage investments based on renewable penetration, growth rates, and energy mix diversity?"

That question? The data could answer it perfectly.

What I Learned: Validate Before You Commit

The EDA dashboard from Week 2 was useful here. Twenty minutes of exploration showed me:

  • Scale mismatches (totals mixed with individual sources)
  • Missing data patterns (expected in first-year entries)
  • Distribution issues (couldn't fix with log transforms)
  • Time coverage worked for trend analysis
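
Checks along these lines surface all four issues; a rough sketch, with the file name and column names as placeholders:

import pandas as pd

df = pd.read_csv('renewable_energy.csv')  # placeholder filename

# Scale check: are aggregate totals mixed in with individual sources?
print(df.groupby('Energy_Type')['Production_GWh'].describe())

# Missing data: which columns have gaps, and do they cluster in early years?
print(df.isna().sum())
print(df.groupby('Year')['Production_GWh'].count())

# Time coverage: is the 2010-2022 span complete for each country?
print(df.groupby('Country')['Year'].agg(['min', 'max', 'nunique']))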

Claude pointed out the business problem didn't match the data. You deal with the situation you're in, so we pivoted to a question the data could actually answer.

Day 1 Continued: The Preprocessing Pipeline

Coming from C++ where I think about data flow and single responsibilities, I built a five-function pipeline:

load_and_clean → filter → aggregate → calculate_metrics → rank

Each function takes a DataFrame, returns a DataFrame, has one clear job, prints progress, and handles edge cases.
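
A sketch of the shape, with column names and metric details as placeholders (the real functions do more, but the DataFrame-in, DataFrame-out contract is the point):

import pandas as pd

def load_and_clean(path: str) -> pd.DataFrame:
    """Stage 1: load the raw CSV, drop duplicates, report progress."""
    df = pd.read_csv(path).drop_duplicates()
    print(f"load_and_clean: {len(df):,} rows")
    return df

def calculate_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 4: add the columns the ranking step scores on."""
    df = df.copy()
    df['renewable_share'] = df['renewable_gwh'] / df['total_gwh'] * 100  # placeholder columns
    print(f"calculate_metrics: {len(df):,} rows")
    return df

# filter, aggregate, and rank follow the same shape, so the pipeline is
# just function composition:
#   rank(calculate_metrics(aggregate(filter(load_and_clean(path)))))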

The Scale Problem I Almost Got Wrong

Early on, my visualizations looked terrible. Some categories showed values 100x larger than others. My first instinct: log transformation.

Wrong.

The real issue: my data mixed individual renewable sources (Hydro = 1,000 GWh) with aggregate totals (Total Electricity = 200,000 GWh). These shouldn't be on the same chart at all.

Solution: Filter out aggregates entirely. Keep only the discrete renewable sources.
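
In pandas that filter is a couple of lines. A sketch, assuming an Energy_Type column and an explicit list of aggregate labels (the real labels may differ):

# Aggregate rows that don't belong on the same chart as individual sources
AGGREGATE_LABELS = ['Total Electricity', 'Total Renewable Energy']  # placeholder labels

sources_only = df[~df['Energy_Type'].isin(AGGREGATE_LABELS)].copy()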

This wasn't a math problem—it was a data structure problem. No transformation fixes a fundamental category mismatch.

Day 2: When Your Model Is "Wrong" (But Actually Right)

I trained a Random Forest model to predict storage infrastructure scores:

  • Input: Percentages of Hydro, Wind, Solar, Geothermal, Other
  • Output: Storage need score (0-100)
  • Performance: R² = 0.948
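
For context, a training setup along these lines reproduces that structure in scikit-learn; the feature column names, split, and hyperparameters are illustrative:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

FEATURES = ['hydro_pct', 'wind_pct', 'solar_pct', 'geo_pct', 'other_pct']  # placeholder names
X = metrics_df[FEATURES]          # metrics_df = output of the Day 1 pipeline
y = metrics_df['storage_score']   # the 0-100 score from the Day 1 formula

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
print("R²:", r2_score(y_test, model.predict(X_test_scaled)))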

Model worked. Then I tested extreme cases:

100% Hydro: Score 56.21

100% Wind: Score 31.37

Wait. Wind is intermittent—shouldn't it need MORE storage than stable hydro? Why was my model backwards?

I debugged for 15 minutes before realizing: the model wasn't wrong. My assumption was.

My Day 1 scoring formula:

storage_score = 0.4 × renewable_share + 0.4 × growth_rate + 0.2 × diversity

This measured investment opportunity, not technical storage need. Countries with high hydro (Norway, Iceland) scored high because:

  • Very high renewable penetration (27-30%)
  • Mature markets ready for more storage
  • High penetration signals strong renewable commitment

The model learned exactly what I trained it on. I just forgot what I'd actually built versus what I thought I was building.

Lesson: Models optimize for your training signal, not your intentions. When behavior seems wrong, check what you actually trained it on.

Day 3: Production Deployment Teaches Fast

I built a four-tab Streamlit dashboard:

  1. Overview: Top 10 investment opportunities
  2. Country Analysis: Interactive comparisons
  3. Predictions: ML model with input sliders
  4. Technical Details: Full methodology
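
The tab scaffolding itself is only a few lines of Streamlit. A sketch with placeholder content:

import streamlit as st

st.title("Renewable Energy Storage Investment Analysis")

overview, country, predictions, details = st.tabs(
    ["Overview", "Country Analysis", "Predictions", "Technical Details"]
)

with overview:
    st.subheader("Top 10 Investment Opportunities")
    st.dataframe(top10_df)   # placeholder: the ranked pipeline output

with predictions:
    hydro = st.slider("Hydro %", 0.0, 100.0, 30.0)
    # ...remaining sliders, then scale the inputs and call model.predict()
    # (see Problem 3 below for why scaling comes first)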

Building for production exposed design flaws I'd never catch in a Jupyter notebook.

Problem 1: Path Management

Local: model = joblib.load('storage_model.pkl') worked fine

Streamlit Cloud: FileNotFoundError

Why? My dashboard lived in a src/ subfolder, models in the parent directory. Relative paths resolved from where the code runs, not where the file lives.

Fix:

import os
import joblib

# Resolve paths relative to this file, not the current working directory
current_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_dir)   # models live one level above src/
model = joblib.load(os.path.join(parent_dir, 'storage_model.pkl'))

Problem 2: Requirements File Location

Streamlit Cloud looks for requirements.txt at repository root, not in subdirectories. Took two deployment failures to figure this out.
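
The layout that works, roughly (requirements.txt at the root; dashboard.py and scaler.pkl are illustrative names):

repo-root/
├── requirements.txt      ← Streamlit Cloud only picks it up here
├── storage_model.pkl
├── scaler.pkl
└── src/
    └── dashboard.py      ← app entry point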

Problem 3: Feature Scaling

Almost made a critical mistake: feeding raw percentages directly to the model.

Wrong:

input_data = np.array([[hydro, wind, solar, geo, other]])
prediction = model.predict(input_data)  # Wrong!

Right:

input_data = np.array([[hydro, wind, solar, geo, other]])
input_scaled = scaler.transform(input_data)  # Scale first!
prediction = model.predict(input_scaled)

Models trained on scaled features expect scaled inputs. Skip this step and the model still runs, but the predictions are meaningless.
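
The way to keep the two in sync is to persist the fitted scaler right next to the model at training time and load both in the dashboard. A sketch, with scaler.pkl as an assumed filename:

import joblib

# Training script: save the scaler that was fit on the training data so the
# dashboard can apply the exact same transform at prediction time
joblib.dump(model, 'storage_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Dashboard: load both (see the path fix above), transform, then predict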

Lesson: Development and production environments have different problems. Same issues I deal with in systems work—environment differences, dependencies, synchronization—show up in ML deployments.

What Three Days Produced

  • Live dashboard with public URL
  • GitHub repo with professional README
  • Trained ML model (three deployment patterns: batch/API/edge)
  • Complete data pipeline with reproducible preprocessing
  • Documentation with screenshots

Top investment opportunities identified:

  1. Netherlands (63.08) - 838% growth rate
  2. Iceland (62.05) - 29.5% renewable penetration
  3. Norway (59.47) - Strong baseline, steady growth
  4. Hungary (52.82) - 658% growth, emerging market
  5. UK (48.90) - Large market, 504% growth

Technical stats:

  • 181,915 data points processed
  • 52 countries analyzed
  • 156 months of time series
  • 8,033 predictions/second (batch)
  • 89.4 KB model (ONNX edge deployment)
  • R² = 0.948

What Actually Surprised Me

1. Preprocessing Takes Most of the Time

In C++, optimization takes most of the time. In ML, data cleaning and feature engineering dominated. Good preprocessing makes modeling straightforward. Bad preprocessing makes it impossible.

2. Production Deployment Shows Problems Fast

Jupyter notebooks hide issues:

  • Path dependencies
  • Environment differences
  • Feature scaling synchronization
  • Input validation

Deploying early forced me to deal with these.

3. The README Matters

I spent 30 minutes writing a professional README:

  • Business problem clearly stated
  • Technical approach explained
  • Setup instructions
  • Screenshots
  • Live demo URL

Project looks more complete with good documentation.

4. End-to-End Matters More Than Depth

I could've spent three days optimizing model accuracy from 0.948 to 0.952. Instead I built a complete pipeline: data → model → deployment → documentation.

For real job hunting, I hope this matters more.

Real Bugs I Hit

Bug 1: Streamlit Cloud couldn't find plotly module

Cause: requirements.txt in wrong directory

Fix: Moved to repo root, specified plotly>=5.0.0

Bug 2: Model files not loading

Cause: Relative paths broken in cloud environment

Fix: Used os.path.dirname(__file__) for portable paths

Bug 3: "Random Forest" truncated in UI columns

Cause: Text too long for column width

Fix: Rendered it as a subheader instead of a metric inside the column

Bug 4: Predictions looked weird

Cause: Forgot to scale input features

Fix: Applied scaler before model.predict()

Claude caught most of these during code review. I understand the patterns now—scoping issues, path management, feature preprocessing flow. I'm delegating implementation details and focusing on understanding architecture.

What's Next

This was Portfolio Project 1 of 6. Each project adds new capabilities:

  • Project 1 (Done): Data analysis dashboard, traditional ML
  • Project 2: Traditional ML pipeline with feature engineering
  • Project 3: Deep learning computer vision
  • Project 4: Generative AI with LLMs
  • Project 5: MLOps with CI/CD
  • Project 6: ML systems engineering specialization

Goal isn't just learning ML—it's building a portfolio proving I can deliver production ML systems.

Tools That Helped

  • Streamlit (dashboard framework)
  • Plotly (interactive viz)
  • scikit-learn (Random Forest, preprocessing)
  • Pandas (data manipulation)
  • Streamlit Cloud (deployment)
  • Claude (course design, code review, debugging partner)

Live Demo & Code

🔗 Live Dashboard: https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/

💻 GitHub Repository: https://github.com/bean2778/ai_learning_2025

📊 Dataset: Global Renewable Energy Production (2010-2022)


About this series: I'm a software engineer learning machine learning with Claude designing my curriculum. Week 3 done: EDA, problem formulation, first portfolio project deployed. More posts coming on traditional ML, deep learning, and production systems.


Next: Traditional ML fundamentals—supervised learning, evaluation metrics, bias-variance tradeoff.


Time: 3 days (Days 19-21 of 270-day roadmap)

Status: Portfolio Project 1 complete ✅

Coffee consumed: Enough
