I've been in software development since 2017. I started as a .NET developer, grew into an SDE-III role, and for the past four years I've been working as a data engineer. Eight years in tech means I'm not new to building things, but data engineering was a different world for me: mostly SQL-centric work, yet with its own kind of complexity.
I knew dbt existed. I'd seen it in job descriptions. But I had no real idea what it actually did or why companies were treating it like a big deal. My mental model was basically: "probably just another tool, like the dozens of others on the market."
Spoiler: I was wrong. Here's what I built and what changed my mind.
The Problem I Was Trying to Solve
I wanted a real data engineering project I could point to. Not a tutorial I followed on YouTube. Not a dataset from Kaggle. Something with actual data, actual decisions, and actual engineering challenges.
I'm based in Pakistan, and the Pakistan Stock Exchange (PSX) had no decent open data tooling: 285 shariah-compliant listed stocks, updated daily, and nothing clean to work with. That felt like a real problem worth solving.
So I built a pipeline from scratch.
The Stack
- Python — scraping daily snapshots from PSX
- PostgreSQL on Neon — cloud database, free tier
- dbt-core — transformations
- Windows Task Scheduler — automation
Nothing exotic. Deliberately. I wanted to prove the architecture matters more than the tools.
The Architecture: Medallion
If you've read anything about modern data engineering you've seen this pattern:
Bronze → Silver → Gold
Raw data → Cleaned data → Analytics-ready data
Here's what that looks like in practice for this project:
Bronze — PsxAllShr table in PostgreSQL. Raw scraper output. Every field is TEXT. No constraints except a primary key. Nothing gets rejected here. This is the immutable source of truth.
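The table itself is deliberately boring. A simplified, illustrative sketch (not the exact schema; the column names here are stand-ins):

```sql
-- Bronze: land the scraper output as-is. Every scraped field stays TEXT,
-- and the only constraint is the primary key. Column names are illustrative.
CREATE TABLE IF NOT EXISTS raw."PsxAllShr" (
    id            BIGSERIAL PRIMARY KEY,
    symbol        TEXT,
    company_name  TEXT,
    price         TEXT,   -- e.g. '85.87'
    change_pct    TEXT,   -- e.g. '+10.01%'
    volume        TEXT,   -- e.g. '1,842,353'
    snapshot_date TEXT
);
```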
Silver — stg_psx_daily_snapshot dbt view. This is where the actual work happens: strip commas from numbers like 15,616, strip % signs from change percentages, cast everything to proper numeric types, deduplicate in case the scraper runs twice.
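In dbt terms the staging view boils down to something like this (a trimmed-down sketch, assuming the bronze table is declared as a dbt source and reusing the illustrative column names from above; the real model may differ):

```sql
-- models/staging/stg_psx_daily_snapshot.sql (illustrative sketch)
with ranked as (

    select
        *,
        -- keep one row per symbol per day in case the scraper ran twice
        row_number() over (
            partition by symbol, snapshot_date
            order by id desc
        ) as rn
    from {{ source('raw', 'PsxAllShr') }}

)

select
    id,
    symbol,
    company_name,
    snapshot_date::date                                     as trade_date,
    replace(price, ',', '')::numeric                        as price,
    replace(replace(change_pct, '%', ''), '+', '')::numeric as change_pct,
    replace(volume, ',', '')::numeric                       as volume
from ranked
where rn = 1
```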
Gold — Two mart tables:
- mart_top_movers — all 285 stocks ranked by daily change %
- mart_sector_summary — market cap, volume, and average change grouped by sector, per day (a sketch of this model follows below)
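For a taste of the gold layer, here's roughly the shape of the sector summary. Again a sketch: in this version it joins the staging view to a small symbol-to-sector mapping, and the market cap aggregation is left out to keep it short:

```sql
-- models/marts/mart_sector_summary.sql (illustrative sketch)
select
    s.trade_date,
    m.sector,
    sum(s.volume)     as total_volume,
    avg(s.change_pct) as avg_change_pct,
    count(*)          as stock_count
    -- the real model also covers market cap (per the description above); omitted here
from {{ ref('stg_psx_daily_snapshot') }} s
join {{ ref('psx_sector_mapping') }} m
    on s.symbol = m.symbol
group by 1, 2
```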
Where dbt Changed My Thinking
My background is SSIS. In SSIS, if you have 10 packages and one depends on another, you wire them together manually: sequence containers, precedence constraints, everything done by hand in a GUI.
dbt replaces all of that with one function call:
from {{ ref('stg_psx_daily_snapshot') }}
That single ref() tells dbt: this model depends on stg_psx_daily_snapshot. Run that first. If it fails, don't run this. Draw the lineage graph. Document the dependency. Generate the docs.
All of that from one line.
Coming from SSIS, where I'd spend hours managing package execution order, this felt almost too simple. But that's the point. dbt isn't doing anything you couldn't do manually; it just makes it impossible to skip the discipline.
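To see that in context, here's roughly what the top-movers mart looks like with the ref() in place (an illustrative sketch, not the exact model):

```sql
-- models/marts/mart_top_movers.sql (illustrative sketch)
select
    trade_date,
    symbol,
    company_name,
    price,
    change_pct,
    volume,
    rank() over (
        partition by trade_date
        order by change_pct desc
    ) as daily_rank
from {{ ref('stg_psx_daily_snapshot') }}
```

dbt sees that ref(), knows this model sits downstream of the staging view, and orders the run accordingly.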
The Part Nobody Talks About: Data Quality
Before this project my answer to "how do you ensure data quality" would have been: "I check the data manually after the run."
That's not good enough.
dbt has built-in tests. A few lines of YAML:
columns:
  - name: symbol
    tests:
      - not_null
  - name: id
    tests:
      - unique
      - not_null
Now every time the pipeline runs, dbt automatically checks: are there nulls in symbol? Are IDs unique? If any test fails, the run fails. Bad data never reaches the gold layer.
When I ran dbt test and saw this:
5 of 5 PASS .................. [PASS]
Completed successfully
That was the moment it clicked. This isn't "just SQL." This is SQL with the same discipline software engineers apply to code — testing, documentation, version control, dependency management.
The Lineage Graph
One command generates a full documentation site:
dbt docs generate
dbt docs serve
This is what you get:
raw.PsxAllShr → stg_psx_daily_snapshot → mart_top_movers
psx_sector_mapping → mart_sector_summary
Anyone who clones this repo can run dbt docs serve and immediately understand the entire pipeline without reading a single line of code. In SSIS you'd write a Word document for this and it would be outdated within a week.
What the Pipeline Actually Produces
Top movers on any given day:
| Symbol | Name | Price | Change % | Volume |
|---|---|---|---|---|
| TRSM | Trust Modaraba | 17.24 | +10.02% | 2,082,863 |
| MSCL | Metropolitan Steel | 26.25 | +10.02% | 2,021,553 |
| FCEPL | Frieslandcampina Engro | 85.87 | +10.01% | 1,842,353 |
Multiple stocks capped at roughly +10% on the same day? That's PSX's upper circuit breaker triggering — a signal you'd miss if you were just eyeballing a spreadsheet.
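Once the mart exists, that table is one query away. Something like this, using the illustrative column names from the sketches above:

```sql
-- Top 10 gainers for the most recent trading day in the mart
select symbol, company_name, price, change_pct, volume
from mart_top_movers
where trade_date = (select max(trade_date) from mart_top_movers)
order by change_pct desc
limit 10;
```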
What's Next
The pipeline is running daily. Data is accumulating. Once I hit 30 days of history I'll add:
- 7-day and 30-day moving averages (see the sketch after this list)
- Relative strength by sector
- A live dashboard (Metabase or Superset)
- Prefect for proper orchestration
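For that first item, one likely shape is a single window function over the staging model. A rough, untested sketch of the 7-day version (the 30-day one is the same idea with a wider frame):

```sql
-- Illustrative only: 7-day simple moving average of price per symbol
select
    symbol,
    trade_date,
    price,
    avg(price) over (
        partition by symbol
        order by trade_date
        rows between 6 preceding and current row
    ) as price_ma_7d
from {{ ref('stg_psx_daily_snapshot') }}
```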
The GitHub repo is open if you want to look at the code or build something similar:
🔗 github.com/muzzamilanis/psx-data-pipeline
One Thing I'd Tell Anyone Transitioning into Data Engineering
You don't need a certification to call yourself a data engineer. You need a project you built yourself, that solves a real problem, that you can explain and defend in an interview.
Build something. Break it. Fix it. Document it. Ship it.
That's the portfolio.
