Three Days, One Deployed Dashboard, and a Lesson About Letting Data Drive Business Questions
I just finished my first complete machine learning project—a renewable energy investment analysis dashboard that's now live on Streamlit Cloud. Three days of work. 181,915 rows of data. And one really important lesson: your initial business problem is probably wrong.
I'm a software engineer learning ML with Claude designing my course. This project clarified a lot about how data science work actually happens.
Day 1: When Your Business Problem Meets Reality
I started with a plan: build a tool to help optimize fossil fuel plant modernization schedules based on renewable production patterns. Sounded reasonable. Turned out to be impossible with my data.
I had a renewable energy dataset covering 52 countries from 2010-2022. Six energy types. Good coverage. But after loading it into the interactive EDA dashboard I'd built the previous week, reality hit:
- Dataset showed production, not capacity or demand
- Renewables depend on weather—you can't schedule them
- No grid data, no regional breakdowns
- Historical trends can't predict modernization timing
My business problem didn't match what the data could actually answer.
The pivot: I asked a different question. Instead of "when should plants modernize," I asked "which countries represent the best opportunities for battery storage investments based on renewable penetration, growth rates, and energy mix diversity?"
That question? The data could answer it perfectly.
What I Learned: Validate Before You Commit
The EDA dashboard from Week 2 was useful here. Twenty minutes of exploration showed me:
- Scale mismatches (totals mixed with individual sources)
- Missing data patterns (expected in first-year entries)
- Distribution issues (couldn't fix with log transforms)
- Time coverage worked for trend analysis
Claude pointed out the business problem didn't match the data. You deal with the situation you're in, so we pivoted to a question the data could actually answer.
Day 1 Continued: The Preprocessing Pipeline
Coming from C++ where I think about data flow and single responsibilities, I built a five-function pipeline:
load_and_clean → filter → aggregate → calculate_metrics → rank
Each function takes a DataFrame, returns a DataFrame, has one clear job, prints progress, and handles edge cases.
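A rough sketch of the shape (stage names follow the pipeline above; the CSV path, column names, and aggregation details are placeholders, not the exact dataset schema):

import pandas as pd

def load_and_clean(path: str) -> pd.DataFrame:
    # Load the raw CSV, drop duplicates, report progress
    df = pd.read_csv(path).drop_duplicates()
    print(f"load_and_clean: {len(df):,} rows")
    return df

def filter_sources(df: pd.DataFrame) -> pd.DataFrame:
    # Keep discrete renewable sources; aggregate totals get dropped (more on this below)
    keep = ["Hydro", "Wind", "Solar", "Geothermal", "Other"]
    df = df[df["Energy_Type"].isin(keep)].copy()
    print(f"filter: {len(df):,} rows")
    return df

def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    # One row per country / year / source
    return df.groupby(["Country", "Year", "Energy_Type"], as_index=False)["Production_GWh"].sum()

def calculate_metrics(df: pd.DataFrame) -> pd.DataFrame:
    # Per-country features: renewable share, growth rate, mix diversity (details omitted here)
    return df.groupby("Country", as_index=False)["Production_GWh"].sum()

def rank(df: pd.DataFrame, by: str = "Production_GWh") -> pd.DataFrame:
    # Sort by the chosen score, highest first
    return df.sort_values(by, ascending=False).reset_index(drop=True)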
The Scale Problem I Almost Got Wrong
Early on, my visualizations looked terrible. Some categories showed values 100x larger than others. My first instinct: log transformation.
Wrong.
The real issue: my data mixed individual renewable sources (Hydro = 1,000 GWh) with aggregate totals (Total Electricity = 200,000 GWh). These shouldn't be on the same chart at all.
Solution: Filter out aggregates entirely. Keep only the discrete renewable sources.
This wasn't a math problem—it was a data structure problem. No transformation fixes a fundamental category mismatch.
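In code the fix was a filter, not a transform. Assuming the source column and aggregate labels look something like this (placeholders for whatever the real dataset uses):

# Drop aggregate rows so totals and individual sources never share a chart
AGGREGATE_LABELS = ["Total Electricity", "Total Renewable Energy"]
df = df[~df["Energy_Type"].isin(AGGREGATE_LABELS)].copy()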
Day 2: When Your Model Is "Wrong" (But Actually Right)
I trained a Random Forest model to predict storage infrastructure scores (training sketch below):
- Input: Percentages of Hydro, Wind, Solar, Geothermal, Other
- Output: Storage need score (0-100)
- Performance: R² = 0.948
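A compressed version of the training step. The feature column names and the 80/20 split here are illustrative rather than the exact script, and country_df stands in for the per-country table coming out of the pipeline:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

features = ["hydro_pct", "wind_pct", "solar_pct", "geo_pct", "other_pct"]
X = country_df[features]
y = country_df["storage_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on training data only, then reuse it everywhere else
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(scaler.transform(X_train), y_train)

print("R²:", r2_score(y_test, model.predict(scaler.transform(X_test))))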
Model worked. Then I tested extreme cases:
100% Hydro: Score 56.21
100% Wind: Score 31.37
Wait. Wind is intermittent—shouldn't it need MORE storage than stable hydro? Why was my model backwards?
I debugged for 15 minutes before realizing: the model wasn't wrong. My assumption was.
My Day 1 scoring formula:
storage_score = 0.4 × renewable_share + 0.4 × growth_rate + 0.2 × diversity
This measured investment opportunity, not technical storage need. Countries with high hydro (Norway, Iceland) scored high because:
- Very high renewable penetration (27-30%)
- Mature markets ready for more storage
- High penetration signals strong renewable commitment
The model learned exactly what I trained it on. I just forgot what I'd actually built versus what I thought I was building.
Lesson: Models optimize for your training signal, not your intentions. When behavior seems wrong, check what you actually trained it on.
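To make that concrete, here's the Day 1 formula on two made-up inputs (all three components normalized to 0-100; the numbers are purely illustrative):

def storage_score(renewable_share, growth_rate, diversity):
    # Day 1 formula: measures investment opportunity, not technical storage need
    return 0.4 * renewable_share + 0.4 * growth_rate + 0.2 * diversity

# Hydro-heavy, mature market: high penetration, modest growth
print(storage_score(renewable_share=90, growth_rate=20, diversity=30))  # 50.0

# Wind-heavy, younger market: lower penetration, faster growth
print(storage_score(renewable_share=30, growth_rate=40, diversity=30))  # 34.0

# High penetration dominates the weighting, so the hydro-heavy country wins
# even though wind is the more intermittent source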
Day 3: Production Deployment Teaches Fast
I built a four-tab Streamlit dashboard:
- Overview: Top 10 investment opportunities
- Country Analysis: Interactive comparisons
- Predictions: ML model with input sliders
- Technical Details: Full methodology
Building for production exposed design flaws I'd never catch in a Jupyter notebook.
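For context, the four tabs map onto a Streamlit skeleton roughly like this (a minimal sketch, not the full app):

import streamlit as st

st.title("Renewable Energy Storage Investment Dashboard")

overview, country, predictions, details = st.tabs(
    ["Overview", "Country Analysis", "Predictions", "Technical Details"]
)

with overview:
    st.subheader("Top 10 Investment Opportunities")
    # st.dataframe(top_10)  # ranked table from the pipeline

with predictions:
    # Sliders feed the trained model (scaling covered in Problem 3 below)
    hydro = st.slider("Hydro %", 0, 100, 30)
    wind = st.slider("Wind %", 0, 100, 20)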
Problem 1: Path Management
Local: model = joblib.load('storage_model.pkl') worked fine
Streamlit Cloud: FileNotFoundError
Why? My dashboard lived in a src/ subfolder, models in the parent directory. Relative paths resolved from where the code runs, not where the file lives.
Fix:
import os
import joblib

# Resolve paths relative to this file, not the current working directory
current_dir = os.path.dirname(os.path.abspath(__file__))
parent_dir = os.path.dirname(current_dir)
model = joblib.load(os.path.join(parent_dir, 'storage_model.pkl'))
Problem 2: Requirements File Location
Streamlit Cloud looks for requirements.txt at repository root, not in subdirectories. Took two deployment failures to figure this out.
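The layout that finally deployed looked roughly like this (file names approximate):

ai_learning_2025/
├── requirements.txt        # must sit at the repo root for Streamlit Cloud
├── storage_model.pkl       # model lives in the parent directory
└── src/
    └── dashboard.py        # Streamlit entry point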
Problem 3: Feature Scaling
Almost made a critical mistake: feeding raw percentages directly to the model.
Wrong:
input_data = np.array([[hydro, wind, solar, geo, other]])
prediction = model.predict(input_data) # Wrong!
Right:
input_data = np.array([[hydro, wind, solar, geo, other]])
input_scaled = scaler.transform(input_data) # Scale first!
prediction = model.predict(input_scaled)
Models trained on scaled features expect scaled inputs. Skip this and predictions don't work.
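The practical consequence: the scaler has to ship with the model. A minimal sketch of what I mean, continuing the snippets above (the scaler filename is just a placeholder):

import joblib

# Training side: persist the fitted scaler right next to the model
joblib.dump(model, "storage_model.pkl")
joblib.dump(scaler, "storage_scaler.pkl")

# Serving side: load both, and always transform before predicting
model = joblib.load("storage_model.pkl")
scaler = joblib.load("storage_scaler.pkl")
prediction = model.predict(scaler.transform(input_data))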
Lesson: Development and production environments have different problems. Same issues I deal with in systems work—environment differences, dependencies, synchronization—show up in ML deployments.
What Three Days Produced
- Live dashboard with public URL
- GitHub repo with professional README
- Trained ML model (three deployment patterns: batch/API/edge)
- Complete data pipeline with reproducible preprocessing
- Documentation with screenshots
Top investment opportunities identified:
- Netherlands (63.08) - 838% growth rate
- Iceland (62.05) - 29.5% renewable penetration
- Norway (59.47) - Strong baseline, steady growth
- Hungary (52.82) - 658% growth, emerging market
- UK (48.90) - Large market, 504% growth
Technical stats:
- 181,915 data points processed
- 52 countries analyzed
- 156 months of time series
- 8,033 predictions/second (batch)
- 89.4 KB model (ONNX edge deployment)
- R² = 0.948
What Actually Surprised Me
1. Preprocessing Takes Most of the Time
In C++, optimization takes most of the time. In ML, data cleaning and feature engineering dominated. Good preprocessing makes modeling straightforward. Bad preprocessing makes it impossible.
2. Production Deployment Shows Problems Fast
Jupyter notebooks hide issues:
- Path dependencies
- Environment differences
- Feature scaling synchronization
- Input validation
Deploying early forced me to deal with these.
3. The README Matters
I spent 30 minutes writing a professional README:
- Business problem clearly stated
- Technical approach explained
- Setup instructions
- Screenshots
- Live demo URL
Project looks more complete with good documentation.
4. End-to-End Matters More Than Depth
I could've spent three days optimizing model accuracy from 0.948 to 0.952. Instead I built a complete pipeline: data → model → deployment → documentation.
For job hunting, I'm hoping the complete pipeline matters more than the extra accuracy.
Real Bugs I Hit
Bug 1: Streamlit Cloud couldn't find plotly module
Cause: requirements.txt in wrong directory
Fix: Moved to repo root, specified plotly>=5.0.0
Bug 2: Model files not loading
Cause: Relative paths broken in cloud environment
Fix: Used os.path.dirname(__file__) for portable paths
Bug 3: "Random Forest" truncated in UI columns
Cause: Text too long for column width
Fix: Made it a subheader instead of a metric inside the column
Bug 4: Predictions looked weird
Cause: Forgot to scale input features
Fix: Applied scaler before model.predict()
Claude caught most of these during code review. I understand the patterns now—scoping issues, path management, feature preprocessing flow. I'm delegating implementation details and focusing on understanding architecture.
What's Next
This was Portfolio Project 1 of 6. Each project adds new capabilities:
- Project 1 (Done): Data analysis dashboard, traditional ML
- Project 2: Traditional ML pipeline with feature engineering
- Project 3: Deep learning computer vision
- Project 4: Generative AI with LLMs
- Project 5: MLOps with CI/CD
- Project 6: ML systems engineering specialization
Goal isn't just learning ML—it's building a portfolio proving I can deliver production ML systems.
Tools That Helped
- Streamlit (dashboard framework)
- Plotly (interactive viz)
- scikit-learn (Random Forest, preprocessing)
- Pandas (data manipulation)
- Streamlit Cloud (deployment)
- Claude (course design, code review, debugging partner)
Live Demo & Code
🔗 Live Dashboard: https://portfolio1-bixsugdscx8hs5w8ybdasd.streamlit.app/
💻 GitHub Repository: https://github.com/bean2778/ai_learning_2025
📊 Dataset: Global Renewable Energy Production (2010-2022)
About this series: I'm a software engineer learning machine learning with Claude designing my curriculum. Week 3 done: EDA, problem formulation, first portfolio project deployed. More posts coming on traditional ML, deep learning, and production systems.
Connect:
- LinkedIn: www.linkedin.com/in/bean2778
- GitHub: https://github.com/bean2778/ai_learning_2025
- Previous: blog 2
Next: Traditional ML fundamentals—supervised learning, evaluation metrics, bias-variance tradeoff.
Time: 3 days (Days 19-21 of 270-day roadmap)
Status: Portfolio Project 1 complete ✅
Coffee consumed: Enough