Michael Neang

Posted on Jun 7

Finishing a Tokyo Metro Data Pipeline: The Join I Refused to Fake

#devchallenge #githubchallenge #dataengineering #sql

GitHub “Finish-Up-A-Thon” Challenge Submission

This is a submission for the GitHub Finish-Up-A-Thon Challenge.

The most important part of my Tokyo Metro dashboard was not a chart.

It was the join I refused to create.

When I came back to this project, I already had useful pieces: station data, passenger averages, line names, revenue data, Japanese and English labels, and Python scripts that could clean some of it. But the project was stuck in the uncomfortable middle. The CSVs existed. The idea still made sense. Some scripts worked.

But I would not have confidently handed the repo to another person.

Not because the idea is bad, but because finishing the last 20% is not glamorous. It is field names, validation, weird edge cases, documentation, screenshots, and deciding what the data should not claim.

The GitHub Finish-Up-A-Thon gave me a reason to stop leaving it half-finished. GitHub Copilot became useful once I treated it less like a generator and more like a reviewer for the parts I was most likely to rush.

I revived the project into a small transit analytics workflow: Python cleaning scripts, cleaned CSV outputs, SQLite validation, SQL business queries, a documented data model, and a Japanese-first bilingual Tableau dashboard called:

東京メトロ需要・収益分析 / Tokyo Metro Demand & Revenue Analytics

Project links:

Final dashboard: a Japanese-first bilingual view of revenue trends, passenger demand by line, and major station hubs.

I was not trying to present this as an official Tokyo Metro product. I wanted it to become a finished portfolio project that could survive basic questions:

Where did the data go?

Which joins are valid?

What does the dashboard answer?

What are the limitations?

My background is in business systems, reporting, data quality, and operational workflows, so I approached this less like a “make a cool chart” project and more like a small internal analytics deliverable. In that kind of work, a clean visual is not enough. The numbers need a path behind them.

What I Built

The finished version has a simple pipeline:

Processed data

→ cleaned CSVs

→ SQLite database validation

→ SQL business queries

→ Tableau dashboard

→ README and visual documentation

The major outputs are:

cleaned station, line, passenger, and revenue datasets
a SQLite rebuild with row-count validation
SQL queries for passenger demand, station hubs, and revenue trends
an ERD and standalone revenue caveat
a published Japanese-first bilingual Tableau dashboard
a README that explains what the project does and what the data cannot claim

The final dashboard focuses on three questions:

How is system-wide revenue trending by fiscal year and month?
Which lines carry the highest passenger demand?
Which stations stand out as major hubs?

I kept the interface Japanese-first because the subject is Tokyo Metro. English is included for accessibility, but Japanese leads the dashboard.

I have visited Tokyo three times, and that made me more careful about the interface. I did not want the Japanese labels to feel decorative. I wanted them to lead the dashboard because that is the natural context of the data.

I also kept the filters limited:

会計年度 / Fiscal Year for revenue views
路線 / Line for passenger views

I intentionally avoided adding every possible slicer. A dashboard does not become better just because it has more controls.

Demo

The final dashboard is a compact view of demand and revenue.

It is designed around a smaller set of practical questions:

Are revenue trends recovering or weakening?
Which lines carry the most passenger demand?
Which stations act as major hubs?
What should the viewer not assume from this data?

The dashboard includes a note that revenue is system-wide, while passenger data is station/line-based. That prevents the viewer from assuming the revenue data can be filtered by line or station.

I skipped a map on purpose. The cleaned station data did not include reliable latitude and longitude fields. A map would have looked nice, but it would not have been honest.

The Comeback Story

Before the finish-up work, this project had potential but no real finish line.

The old version had a minimal README, cleaning scripts that needed polish, inconsistent field naming around revenue and fiscal month fields, no strong SQLite validation story, no clear data model explanation, and no published dashboard.

Before: the project had the idea, but not enough documentation or finish.

The biggest fix was not visual.

The biggest fix was deciding what had to be true before the dashboard deserved to exist.

For this project, I treated “finished” as a checklist:

Can the cleaned files be regenerated?
Can the database be rebuilt?
Can the row counts be validated?
Do the SQL queries answer real questions?
Are the data relationships honest?
Can someone open the dashboard without me explaining it live?

Instead of jumping straight back into Tableau, I went through the cleaning scripts, the field names, the SQLite import, and the data model.

One issue that forced the project to grow up was the station data. The Marunouchi Branch Line created a duplicate Station_ID problem during SQLite loading. That was annoying at the time, but it was also useful. SQLite caught something Tableau might have let me hide visually.

So I fixed the cleaning logic instead of working around the problem.

After that cleanup, the database rebuild became a real proof point. The final import step loads:

14 rows into Lines
291 rows into Stations
119 rows into Passengers
48 rows into Revenue

That terminal output became one of the most important screenshots in the project. It showed that the pipeline could rebuild from cleaned files instead of depending on a manual Tableau-only workflow.

The SQLite rebuild became the project’s validation checkpoint.

The Data Modeling Decision That Mattered

The most important modeling decision was deciding what not to join.

The passenger side of the project has a natural path:

Lines → Stations → Passengers

That lets me analyze passenger demand by line and by station.

Revenue was different.

The revenue table had fiscal year, fiscal month, commuter revenue, non-commuter revenue, total revenue, and YoY percentage fields. It did not have Line_ID or Station_ID.

So I kept revenue standalone.

That is not as exciting as adding another relationship in a diagram, but it is more honest. If I forced revenue into the station or line model, the dashboard would look more connected while becoming less trustworthy.

I have seen enough reporting work to know that the dangerous chart is not always the broken one. Sometimes it is the clean chart built on a relationship that should not exist.

That became the core lesson of this project:

Do not join data just because the dashboard would look more impressive.

Revenue remains standalone because the source does not include line or station keys.

My Experience with GitHub Copilot

The most useful Copilot work happened after I already had something running.

I did not ask it to build the whole project. I used it more like a second reviewer for the parts that are easy to rush when you are trying to finish:

Are the backend scripts readable?
Are there obvious validation gaps?
Are the comments and print messages clear?
Are the SQL queries understandable to someone reviewing the project?
Am I about to make a risky change for no real benefit?

The best prompts were specific. For example, I asked Copilot to review import_data_to_sqlite.py, clean_station_data.py, and create_line_data.py for clarity, validation, and maintainability while preserving the output schema.

That last part mattered. When a project is close to finished, a large refactor can be a trap. I needed safe improvements, not a new project.

Here is the kind of Copilot help that was actually useful: not “build the project for me,” but “review these scripts and point out what would make them safer to rerun.”

Copilot reviewing the backend scripts for validation and maintainability. I used it as a reviewer, not as an autopilot.

One concrete example was the SQLite import workflow. I had a script that could load the cleaned files, but Copilot’s review pushed me to think more like a maintainer: check expected inputs, preserve the schema, keep the output predictable, and make the final row counts visible.

The most useful outcome was not a giant rewrite. It was a tighter import workflow: expected files checked, table schemas respected, and row counts printed at the end so the pipeline had a visible pass/fail moment.

I still had to test the result myself. Copilot did not replace that. But the review helped turn the script from “works on my machine” into something I could actually show in the submission.

I also did not accept every suggestion. The useful ones were the ones that made the pipeline easier to verify without changing the data model. Suggestions that felt like unnecessary refactoring or scope creep stayed out.

The pattern that worked best was boring, but useful:

Ask for review
→ keep the safe suggestions
→ reject the risky ones
→ rerun the pipeline
→ verify the output

That is probably how I will use Copilot more often: not as a replacement for judgment, but as pressure against sloppy finish-up work.

What the SQL Layer Added

Once the database loaded, I used SQL as a sanity check before going back to Tableau.

The queries were not meant to be complicated. They were meant to answer questions a transit analyst or business reviewer would understand:

Which lines carry the most passenger demand?
Which stations act as major hubs?
How has revenue changed by fiscal year?
How do commuter and non-commuter revenue streams compare?

That helped me avoid treating Tableau as the only proof. The dashboard became the final interface, not the only place where the logic existed.

This mattered because a dashboard can hide a lot. SQL forced me to look at the grain of the data before designing around it.

SQL checks helped validate business questions before building the Tableau view.

Before and After

Here is the short version of the comeback:

Area	Before	After
Documentation	Minimal README	Project page with links, screenshots, and caveats
Data cleaning	Scripts existed but needed cleanup	Normalized cleaned datasets
Database	No strong validation checkpoint	SQLite rebuild with row-count validation
SQL	Exploratory or missing	Business-focused queries
Data model	Relationships not clearly explained	ERD plus standalone revenue note
Dashboard	Not finished	Published bilingual Tableau dashboard
Copilot	Not part of the workflow	Used for review, validation, and polish

That before-and-after is what led me to a checklist I will reuse.

My Finish-Up Checklist for Data Projects

Before calling a data project done, I want to be able to answer:

Can I regenerate the cleaned data?
Can I rebuild the database or analysis layer?
Can I validate row counts?
Can I explain every relationship in the model?
Can I explain what is intentionally not related?
Can SQL answer the core business questions?
Can the dashboard be used without me narrating it?
Are the caveats visible enough that someone will not overread the data?

What I Would Improve Next

If I had more time, I would add:

automated tests for the cleaning scripts
a GitHub Actions check that rebuilds the SQLite database
clearer source-data documentation
accurate latitude and longitude data before attempting a map

I skipped a map on purpose. Without reliable latitude and longitude data, it would have made the project look better while making it less accurate.

Final Takeaway

For this repo, finishing meant cleaning the data, validating the load, documenting the model, refusing a bad join, and building a dashboard someone else could actually use.

Copilot helped with the review and polish, but the main value came from deciding what “done” should mean.

That is what changed this from an abandoned side project into a shareable analytics workflow.

DEV Community