I've been in software development since 2017. I started as a .NET developer, grew into an SDE-III role, and for the past four years I've been working as a data engineer. Eight years in tech means I'm not new to building things, but data engineering was a different world for me: mostly SQL-centric work, yet with its own kind of complexity.
I knew dbt existed. I'd seen it in job descriptions. But I had no real idea what it actually did or why companies were treating it like a big deal. My mental model was basically: "probably just another tool, like the dozens of others on the market."
Spoiler: I was wrong. Here's what I built and what changed my mind.
The Problem I Was Trying to Solve
I wanted a real data engineering project I could point to. Not a tutorial I followed on YouTube. Not a dataset from Kaggle. Something with actual data, actual decisions, and actual engineering challenges.
I'm based in Pakistan, and the Pakistan Stock Exchange (PSX) had no decent open data tooling: 285 shariah-compliant listed stocks, updated daily, and nothing clean to work with. That felt like a real problem worth solving.
So I built a pipeline from scratch.
The Stack
- Python — scraping daily snapshots from PSX
- PostgreSQL on Neon — cloud database, free tier
- dbt-core — transformations
- Windows Task Scheduler — automation
Nothing exotic. Deliberately. I wanted to prove the architecture matters more than the tools.
The Architecture: Medallion
If you've read anything about modern data engineering you've seen this pattern:
Bronze → Silver → Gold
Raw data → Cleaned data → Analytics-ready data
Here's what that looks like in practice for this project:
Bronze — PsxAllShr table in PostgreSQL. Raw scraper output. Every field is TEXT. No constraints except a primary key. Nothing gets rejected here. This is the immutable source of truth.
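The table itself is deliberately boring. A simplified, illustrative sketch (not the exact schema; the column names here are stand-ins):

```sql
-- Bronze: land the scraper output as-is. Every scraped field stays TEXT,
-- and the only constraint is the primary key. Column names are illustrative.
CREATE TABLE IF NOT EXISTS raw."PsxAllShr" (
    id            BIGSERIAL PRIMARY KEY,
    symbol        TEXT,
    company_name  TEXT,
    price         TEXT,   -- e.g. '85.87'
    change_pct    TEXT,   -- e.g. '+10.01%'
    volume        TEXT,   -- e.g. '1,842,353'
    snapshot_date TEXT
);
```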
Silver — stg_psx_daily_snapshot dbt view. This is where the actual work happens: strip commas from numbers like 15,616, strip % signs from change percentages, cast everything to proper numeric types, deduplicate in case the scraper runs twice.
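In dbt terms the staging view boils down to something like this (a trimmed-down sketch, assuming the bronze table is declared as a dbt source and reusing the illustrative column names from above; the real model may differ):

```sql
-- models/staging/stg_psx_daily_snapshot.sql (illustrative sketch)
with ranked as (

    select
        *,
        -- keep one row per symbol per day in case the scraper ran twice
        row_number() over (
            partition by symbol, snapshot_date
            order by id desc
        ) as rn
    from {{ source('raw', 'PsxAllShr') }}

)

select
    id,
    symbol,
    company_name,
    snapshot_date::date                                     as trade_date,
    replace(price, ',', '')::numeric                        as price,
    replace(replace(change_pct, '%', ''), '+', '')::numeric as change_pct,
    replace(volume, ',', '')::numeric                       as volume
from ranked
where rn = 1
```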
Gold — Two mart tables:
- mart_top_movers — all 285 stocks ranked by daily change %
- mart_sector_summary — market cap, volume, and average change grouped by sector, per day (a sketch of this model follows below)
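For a taste of the gold layer, here's roughly the shape of the sector summary. Again a sketch: in this version it joins the staging view to a small symbol-to-sector mapping, and the market cap aggregation is left out to keep it short:

```sql
-- models/marts/mart_sector_summary.sql (illustrative sketch)
select
    s.trade_date,
    m.sector,
    sum(s.volume)     as total_volume,
    avg(s.change_pct) as avg_change_pct,
    count(*)          as stock_count
    -- the real model also covers market cap (per the description above); omitted here
from {{ ref('stg_psx_daily_snapshot') }} s
join {{ ref('psx_sector_mapping') }} m
    on s.symbol = m.symbol
group by 1, 2
```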
Where dbt Changed My Thinking
My background is SSIS. In SSIS, if you have 10 packages and one depends on another, you wire them together manually: sequence containers, precedence constraints, everything done by hand in a GUI.
dbt replaces all of that with one function call:
from {{ ref('stg_psx_daily_snapshot') }}
That single ref() tells dbt: this model depends on stg_psx_daily_snapshot. Run that first. If it fails, don't run this. Draw the lineage graph. Document the dependency. Generate the docs.
All of that from one line.
Coming from SSIS, where I'd spend hours managing package execution order, this felt almost too simple. But that's the point. dbt isn't doing anything you couldn't do manually; it just makes it impossible to skip the discipline.
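To see that in context, here's roughly what the top-movers mart looks like with the ref() in place (an illustrative sketch, not the exact model):

```sql
-- models/marts/mart_top_movers.sql (illustrative sketch)
select
    trade_date,
    symbol,
    company_name,
    price,
    change_pct,
    volume,
    rank() over (
        partition by trade_date
        order by change_pct desc
    ) as daily_rank
from {{ ref('stg_psx_daily_snapshot') }}
```

dbt sees that ref(), knows this model sits downstream of the staging view, and orders the run accordingly.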
The Part Nobody Talks About: Data Quality
Before this project my answer to "how do you ensure data quality" would have been: "I check the data manually after the run."
That's not good enough.
dbt has built-in tests. A few lines of YAML:
columns:
  - name: symbol
    tests:
      - not_null
  - name: id
    tests:
      - unique
      - not_null
Now every time the pipeline runs, dbt automatically checks: are there nulls in symbol? Are IDs unique? If any test fails, the run fails. Bad data never reaches the gold layer.
When I ran dbt test and saw this:
5 of 5 PASS .................. [PASS]
Completed successfully
That was the moment it clicked. This isn't "just SQL." This is SQL with the same discipline software engineers apply to code — testing, documentation, version control, dependency management.
The Lineage Graph
One command generates a full documentation site:
dbt docs generate
dbt docs serve
This is what you get:
raw.PsxAllShr → stg_psx_daily_snapshot → mart_top_movers
psx_sector_mapping → mart_sector_summary
Anyone who clones this repo can run dbt docs serve and immediately understand the entire pipeline without reading a single line of code. In SSIS you'd write a Word document for this and it would be outdated within a week.
What the Pipeline Actually Produces
Top movers on any given day:
| Symbol | Name | Price | Change % | Volume |
|---|---|---|---|---|
| TRSM | Trust Modaraba | 17.24 | +10.02% | 2,082,863 |
| MSCL | Metropolitan Steel | 26.25 | +10.02% | 2,021,553 |
| FCEPL | Frieslandcampina Engro | 85.87 | +10.01% | 1,842,353 |
Multiple stocks capped at roughly +10% on the same day? That's PSX's upper circuit breaker triggering — a signal you'd miss if you were just eyeballing a spreadsheet.
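Once the mart exists, that table is one query away. Something like this, using the illustrative column names from the sketches above:

```sql
-- Top 10 gainers for the most recent trading day in the mart
select symbol, company_name, price, change_pct, volume
from mart_top_movers
where trade_date = (select max(trade_date) from mart_top_movers)
order by change_pct desc
limit 10;
```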
What's Next
The pipeline is running daily. Data is accumulating. Once I hit 30 days of history I'll add:
- 7-day and 30-day moving averages (see the sketch after this list)
- Relative strength by sector
- A live dashboard (Metabase or Superset)
- Prefect for proper orchestration
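For that first item, one likely shape is a single window function over the staging model. A rough, untested sketch of the 7-day version (the 30-day one is the same idea with a wider frame):

```sql
-- Illustrative only: 7-day simple moving average of price per symbol
select
    symbol,
    trade_date,
    price,
    avg(price) over (
        partition by symbol
        order by trade_date
        rows between 6 preceding and current row
    ) as price_ma_7d
from {{ ref('stg_psx_daily_snapshot') }}
```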
The GitHub repo is open if you want to look at the code or build something similar:
🔗 github.com/muzzamilanis/psx-data-pipeline
One Thing I'd Tell Anyone Transitioning into Data Engineering
You don't need a certification to call yourself a data engineer. You need a project you built yourself, that solves a real problem, that you can explain and defend in an interview.
Build something. Break it. Fix it. Document it. Ship it.
That's the portfolio.
