DEV Community: Mungai M.

How WhatsApp Accounts Really Get "Hacked" (and How to Lock Yours Down)

Mungai M. — Wed, 24 Jun 2026 02:12:28 +0000

Quick question before you read on. If someone took over your WhatsApp right now, who would they hit first? Your family group? Your clients? The estate WhatsApp where everyone trusts your name? Sit with that for a second, because that is the whole game. Losing the app is annoying. Losing the trust attached to your name is the part that costs people money.

In Kenya this usually looks like a panicked "nisaidie na fare" message, a fake M-PESA reversal, a too-good job link, or a sudden text asking you to read back a six-digit code. The wrapping changes from country to country, but the trick underneath is the same everywhere: someone wants your number, your account, or your contacts' trust.

Here is the part most people get wrong. They picture a hacker in a hoodie tearing through encryption. That is almost never what happens. WhatsApp's end-to-end encryption is genuinely hard to break, and the people coming after your account are not trying to break it. They are trying to get you to hand over the keys.

What "hacking" actually means here

The word people are reaching for is account takeover. Someone registers your phone number on their own phone, and from that moment they are you, as far as your contacts can tell.

Think of it like a thief who never touches the padlock on your gate. He just shows up in a delivery uniform, looks like he belongs, and you open the gate yourself. Once he is in, he can message your contacts for money, drop scam links in your groups, sit in your private chats pretending to be you, and use your name to go find the next victim.

So the honest framing is not "my app got hacked." It is "my access got stolen, and usually I helped without realising it."

The ways they get in

Method	What it looks like	What they're after
OTP / code theft	A six-digit code arrives, then someone asks you to share it	To register your WhatsApp on their phone
Phishing links	A link about free data, a refund, a job, a prize, a "blocked account"	Your passwords, your details, or a malware install
Malware	An unofficial app or a permission you should not have granted	Quietly reading your SMS codes and watching your activity
SIM swap	Your line suddenly drops to "Emergency calls only" for hours	Your number itself, so every code and alert comes to them
Voicemail abuse	A login attempt overnight while you sleep	Your code, left as a voicemail behind a default PIN
Linked-device abuse	A strange computer shows up under Linked Devices	A few seconds with your unlocked phone to mirror your chats

How these play out in real life

"I sent you a code by mistake"

This one works on good people, which is exactly why it spreads. You get a six-digit WhatsApp code you never asked for. Seconds later a "friend" (whose account was already stolen) or someone claiming to be Customer Care messages you: so sorry, I sent my code to your number by accident, can you read it back?

Share it and you have just handed over your account.

It works because they manufacture urgency. You are trying to be helpful, or you do not want to keep an authority figure waiting, and they are counting on you reacting before you think. The rule that beats it is short: if you did not request a code, you never share it. Not with a friend, not with a relative, not with anyone wearing a Customer Care voice.

The link that promises something

Scam links lead with things people want or fear: free bundles, a KRA refund, a job opening, a delivery problem, a prize, an account "about to be suspended." The page is built to look like the real brand, sometimes pixel for pixel. The goal is always one of three things: get you to type a password, hand over personal details, or install something nasty.

The tell is the rush. A real bank or government service is not going to threaten you into clicking inside the next two minutes.

The sudden network blackout (SIM swap)

A SIM swap is when someone talks a mobile provider into moving your number onto a SIM card they control. The first sign on your end is your phone going dead quiet. Full signal one minute, "Emergency calls only" the next, in a place where coverage is normally fine. Meanwhile every call, code, and banking alert meant for you is landing on their phone.

This is the scary one, because your number sits at the centre of everything: WhatsApp, M-PESA, mobile banking, account recovery. They usually gather your ID details first (sometimes from social media, sometimes by calling and pretending to "update your records") and then convince an agent the SIM was lost.

The midnight code grab (voicemail)

This sounds old-school, and it still works. The attacker tries to log in late at night. WhatsApp sends the SMS code, you are asleep, nothing happens. So they tap "Call me instead." Your phone rings, goes unanswered, and the automated voice reads the code straight into your voicemail. If your voicemail PIN is still the factory default, they simply dial in remotely and listen to it.

Plenty of people lock their screen and never once think about their voicemail. Attackers know that.

The office snooper (linked devices)

Sometimes there is no remote trickery at all. You leave your phone unlocked on the desk while you grab lunch. It takes about five seconds for someone to open WhatsApp, hit Linked Devices, and scan a QR code onto their own laptop. Now your chats mirror to their screen and they can message as you. The giveaway is a device you do not recognise sitting in your Linked Devices list.

Warning signs worth stopping for

Most of these attacks announce themselves if you are paying attention:

A verification code shows up when you were not logging in anywhere.
Someone is suddenly very keen for you to read back a six-digit number.
Your phone loses signal for an unusually long stretch with no explanation.
There is a browser or computer in Linked Devices that you never set up.
A friend asks why you sent them a weird message you never sent.
Your battery is draining fast, the phone runs hot, or apps you do not remember installing have appeared.

One of these is enough reason to stop and check. You do not need a full set.

How to actually protect yourself

None of this requires you to be technical. It is a handful of habits.

Turn on Two-Step Verification

If you do one thing from this whole article, do this. It adds a custom PIN that WhatsApp asks for whenever your number is registered on a new device. So even if a scammer cons you out of the SMS code, that PIN still stands between them and your account.

WhatsApp > Settings > Account > Two-step verification > Turn on

Pick a PIN you will remember, and add a recovery email you actually check. The email is your way back in if you ever forget the PIN.

Treat the code as private, full stop

No real support team, telco, friend, or relative needs your six-digit code, your PIN, or your password. Not Safaricom, not WhatsApp, not your cousin. If a request for a code reaches you, the answer is no, regardless of who it appears to come from.

Check your linked devices, and let WhatsApp warn you

WhatsApp > Settings > Linked Devices

Anything you do not recognise, a browser, a computer, an open session, tap it and log it out. Make this a once-in-a-while habit, not a one-time thing.

Better still, let WhatsApp do the watching for you. Keep push notifications turned on for the app, and WhatsApp will alert you the moment a new device gets linked to your account. If that notification ever pops up for something you did not set up, tap it, review the device, and remove it right there. That turns a slow discovery (noticing a strange laptop days later) into a real-time warning while the attacker is still trying to get comfortable. WhatsApp also auto-disconnects linked devices after 30 days of inactivity, but 30 days is plenty of time for damage, so do not lean on that. The alert is your early signal.

One more thing that matters here: only ever link through official WhatsApp surfaces, which means WhatsApp Web, the Windows or Mac app, an Android tablet, or a companion phone. Linking through some unofficial "WhatsApp viewer" app or website is exactly how accounts get compromised, and it can get your account banned on top of that.

Slow down before you tap a link

Scammers win on speed. Before you click, run four quick checks:

Was I actually expecting this message?
Does this sender make sense?
Does the link look right, or is it a near-miss spelling of a real site?
Am I being pushed to act right now?

A few seconds of doubt kills most of these.

Lock down the phone itself

Your WhatsApp is only as safe as the phone behind it. Use a strong screen lock, keep the software updated, skip unofficial app stores, be stingy with permissions (an SMS app does not need to read your messages), and turn on a biometric app lock if your phone supports it.

For Kenyan readers: protect the line, not just the app

Your phone number is the master key here, so it deserves its own protection. These steps are Safaricom-heavy because of market share, but Airtel and Telkom have equivalents.

Lock your SIM against swaps. On Safaricom, dial *100*100# to whitelist your line. Once it is on, your SIM can only be replaced in person at a Safaricom shop with your ID, not by some agent in a back room. If anyone then tries to register a new line on your details, you get an SMS asking you to confirm or reject it.

Audit what is registered to your ID. Dial *106# (this one works on Safaricom, Airtel, and Telkom). It lists every number tied to your national ID. If a ghost number you do not recognise shows up, report it from that same menu, or visit a shop to deregister it. Criminals love registering lines on stolen IDs, and this is how you catch it early.

Deal with your voicemail. If you do not use it, ask your provider to switch it off. If you do, change the PIN away from the default. That closes the midnight-code-grab route entirely.

Know who really calls you. Safaricom's customer care line is 0722 000 000. If a call comes from some random personal number claiming to be "Safaricom Care," asking you to read a code or press a sudden USSD prompt, hang up. No legitimate agent does that.

Report the fakes. Forward scam SMS to 333 so Safaricom can act on it. For anything more serious, the DCI runs an anonymous WhatsApp line at 0709 570 000 and a toll-free hotline at 0800 722 203.

Verify M-PESA emergencies independently. The "I sent you money by mistake, please send it back" routine is a classic. Scammers fake an M-PESA receipt SMS that looks completely real. Before you move a shilling, check your actual balance with *334# or the M-PESA app, and tell the other person to request an official reversal through Safaricom.

If you run groups, you're a target

Admins of big family, estate, church, school, SACCO, or chama groups get hit more than they expect, because a busy group full of trusting people is a goldmine. Tighten a few settings and you cut off most of it:

Approve new members. Turn on participant approval so nobody slides in unnoticed.
Restrict edits to admins. Scammers love renaming a group to something like "Bitcoin Investments KE" to lend a scam credibility. Lock the name, photo, and description to admins.
Use admin review. Where available, this lets members flag a suspicious message to you, so you can delete it for everyone and remove the poster.
Control who can add you. Settings > Privacy > Groups, switch from "Everyone" to "My contacts." That stops strangers from dragging you into random crypto-scam groups.

If it has already happened

Move fast. Speed is most of the battle.

Open WhatsApp and register your number again with a fresh SMS code. The moment you enter that new code, the attacker's session is kicked out.
Go to Linked Devices and log out anything you do not recognise.
Tell your close contacts, by normal call or SMS, that your account was compromised and to ignore any money requests. This is what stops the scam from spreading to people who trust you.
Check your voicemail PIN and your SIM status while you are at it.

If the attacker managed to set their own Two-Step Verification PIN, full recovery can take up to seven days. Annoying, yes, but re-registering still locks them out of the live session immediately, so do it anyway.

A few questions people always ask

Can I get hacked just by opening a message or answering a call?
No. Reading a text or picking up a call does not, by itself, compromise your phone. The scam needs you to do something: read out a code, tap a malicious link, or enter your PIN somewhere. That is also the good news, because it means you are the deciding factor.

I clicked a dodgy link. Now what?
Get off the internet (toggle airplane mode), and do not type any password into whatever page opened. Run a security scan, then keep an eye on your email and social accounts for strange login attempts. If you entered anything sensitive, change that password right away from a device you trust.

Someone says they sent me money by mistake and is begging me to send it back.
This is the fake-reversal scam almost every time. Do not send anything. Check your real M-PESA balance with *334# or the app, and tell them to request an official reversal through Safaricom. A genuine misdirected payment gets fixed through the network, not through you.

Does changing my number make me safer?
Not on its own. Your security lives in your habits, not your digits. If you do change numbers, use the in-app Change Number feature so the account migrates cleanly, and set up a fresh Two-Step Verification PIN on the new one.

Do this today, not tomorrow

Pick up your phone and knock these out now:

Turn on Two-Step Verification and add a recovery email.
Open Linked Devices and log out anything unfamiliar.
Decide, right now, that you will never share a six-digit code with anyone.
Dial *106# and check for ghost numbers on your ID.
On Safaricom, dial *100*100# to lock your SIM against swaps.
Change or kill your voicemail PIN.
Tell the people most likely to get caught (older relatives, younger users) that urgent WhatsApp money requests always get verified by a phone call first.

Staying safe on WhatsApp was never really a technical skill. It comes down to a short list: protect the number, guard the code, secure the phone, and slow down when something feels urgent. Attackers go for the easiest target in the room, not the smartest one. Twenty minutes of setup is usually all it takes to stop being that target.

Stay sharp, and keep your digital gate locked.

A few questions people always ask

Do this today, not tomorrow

Pick up your phone and knock these out now:

Turn on Two-Step Verification and add a recovery email.
Open Linked Devices and log out anything unfamiliar.
Decide, right now, that you will never share a six-digit code with anyone.
Dial *106# and check for ghost numbers on your ID.
On Safaricom, dial *100*100# to lock your SIM against swaps.
Change or kill your voicemail PIN.
Tell the people most likely to get caught (older relatives, younger users) that urgent WhatsApp money requests always get verified by a phone call first.

Stay sharp, and keep your digital gate locked.

How Python Powers Real-World Data Analytics

Mungai M. — Sun, 10 May 2026 18:27:29 +0000

Python Isn't a Trend. It's the Standard.

If you work with data in any capacity, whether wrangling CSVs for a monthly report or building production dashboards for executive stakeholders, you will eventually encounter Python. Not as an option. As the default.

A decade ago, data analysis meant Excel, maybe some VBA macros if you were ambitious. Organisations that needed serious number-crunching turned to proprietary tools like SAS, SPSS, or MATLAB, typically running on expensive enterprise licences. Those tools served their purpose, but they buckled under the demands of modern data volumes and the pace at which teams now need to iterate. As companies scaled their data operations, they gravitated toward Python, drawn by its readability, its zero licence cost, and an open-source ecosystem that covers everything from flat-file parsing to deep learning.

The entire modern analytics stack leans on Python. pandas for tabular manipulation, NumPy for numerical computation, scikit-learn for machine learning, Airflow for pipeline orchestration. To work effectively with data today, you need Python proficiency the same way a network engineer needs the command line. It's not a nice-to-have. It's table stakes.

Why Python Dominates the Analytics Landscape

Python's dominance didn't happen by accident. Several structural advantages compound to make it the language of choice for data practitioners at every level.

Readability as a Design Principle

Python was designed from the ground up for human readability. Where languages like Java or C++ demand boilerplate (class declarations, type annotations, semicolons, curly braces), Python uses whitespace indentation and minimal syntax. A loop that iterates over a dataset's columns looks almost like pseudocode:

for column in df.columns:
    print(column, df[column].dtype)

This matters enormously in analytics work. Your primary goal is understanding the data, not fighting the language. When a finance analyst needs to prototype a revenue calculation, the cognitive overhead of the language itself should approach zero.

Ecosystem Depth

The Python Package Index (PyPI) hosts over 500,000 packages. For data work specifically, the ecosystem is unmatched. The table below maps common analytics tasks to their standard Python libraries:

Task	Primary Library	What It Does
Tabular data manipulation	`pandas`	DataFrames: read, filter, group, merge, pivot, export
Numerical computation	`numpy`	N-dimensional arrays with C-optimized math operations
Static visualization	`matplotlib`	Full-control charting: line, bar, scatter, histogram
Statistical visualization	`seaborn`	Publication-quality plots with intelligent defaults
Interactive dashboards	`plotly`	Browser-rendered charts with hover, zoom, toggle
Machine learning	`scikit-learn`	Classification, regression, clustering, model evaluation
Statistical modelling	`statsmodels`	OLS regression, hypothesis testing, time-series analysis
HTTP requests	`requests`	Fetch data from REST APIs and web endpoints
Database connectivity	`sqlalchemy`	Unified interface to PostgreSQL, MySQL, SQLite, and others
Excel I/O	`openpyxl`	Read and write `.xlsx` files programmatically

These aren't isolated tools. They interoperate. You fetch JSON from an API with requests, parse it into a pandas DataFrame, run a regression with statsmodels, and plot the residuals with matplotlib, all in the same script. That composability is Python's real competitive advantage.

Interoperability Across the Stack

Python connects to practically everything. Relational databases (PostgreSQL, MySQL, SQLite) via SQLAlchemy or native drivers. Cloud platforms (AWS, GCP, Azure) via their respective SDKs. Business intelligence tools like Power BI and Tableau can execute Python scripts directly within their transformation pipelines. File format support spans CSV, JSON, Parquet, Avro, Excel, and HDF5.

Whatever system your data lives in, Python almost certainly has a mature, actively maintained connector for it.

Community and Market Demand

Stack Overflow's annual developer surveys consistently rank Python among the top three most-used programming languages globally. Employers across finance, healthcare, e-commerce, telecommunications, and government list Python proficiency as a core requirement for analyst and data science roles. For anyone building a career in data, Python skills are not just transferable; they are expected.

The Analytics Workflow: Ingest, Clean, Analyse, Visualise

A typical Python analytics workflow follows four stages. Each stage maps to specific library capabilities, and understanding the full pipeline is what separates someone who can write a script from someone who can deliver reliable, repeatable analysis.

Ingestion: Getting Data In

Data arrives from multiple sources: flat files on disk, REST APIs over HTTP, database queries, cloud storage buckets. Python handles all of them through a consistent pattern. For remote APIs, the requests library is the standard:

import requests
response = requests.get("https://api.example.com/data")
data = response.json()

For structured files, pandas provides a family of read_* functions that load data directly into DataFrames:

import pandas as pd
df = pd.read_csv("transactions.csv")

pd.read_json(), pd.read_excel(), pd.read_parquet(), and pd.read_sql() cover the remaining common formats with minimal configuration.

Cleaning: The Work Nobody Sees

Real-world data is messy. Dates arrive in inconsistent formats ("12/03/2024" vs "2024-03-12"), numeric columns contain stray text, and cells are left blank. This is not the exception; it is the norm. Analysts routinely spend more time cleaning data than analysing it.

pandas makes short work of these problems. Standardising a date column, coercing a text-contaminated numeric column, and imputing missing values each take a single line:

df["signup_date"] = pd.to_datetime(df["signup_date"])
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].median())

The errors="coerce" parameter is critical. Rather than crashing when it encounters a non-numeric string like "N/A" or "unknown", pandas converts that entry to NaN, which you can then handle with an explicit imputation strategy. Median is the standard choice for numeric fields because it is robust to outliers. Mean would be skewed by extreme values, and dropping rows entirely risks introducing selection bias.

Analysis: Asking Questions with Code

Once the data is clean, pandas lets you interrogate it with operations that map directly to SQL concepts. Grouping, aggregation, filtering, joining, and pivoting are all first-class operations:

by_country = df.groupby("country")["revenue"].mean()
top_5 = by_country.sort_values(ascending=False).head(5)

That two-line pipeline groups revenue by country, computes the mean for each group, sorts in descending order, and returns the top five. The equivalent SQL query would be longer and would require a database connection. Python lets you perform the same operations in memory, on data from any source, with no server infrastructure.

For more complex analysis, scikit-learn provides a consistent API across dozens of machine learning algorithms. Customer segmentation using K-Means clustering, for example, takes fewer than ten lines from data preparation to fitted model:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = df[["recency", "frequency", "monetary"]]
scaled = StandardScaler().fit_transform(features)
model = KMeans(n_clusters=4, random_state=42)
df["segment"] = model.fit_predict(scaled)

Visualisation: Making the Data Speak

A chart communicates in seconds what a table cannot convey in minutes. After computing aggregates, you can render them directly:

top_5.plot(kind="barh", title="Top 5 Countries by Avg Revenue")

For static, publication-quality output, seaborn provides statistically-oriented defaults with attractive colour palettes:

import seaborn as sns
sns.boxplot(data=df, x="segment", y="monetary")

For interactive exploration, plotly renders charts in the browser where stakeholders can hover over data points, zoom into regions, and toggle series on and off, all without installing software. This is particularly valuable for dashboards shared with non-technical decision-makers who need to explore the data themselves.

Python in Production: Three Scenarios

Theory is useful. Seeing how Python solves real problems is better.

Retail Inventory Optimisation

A retail chain operating across 30 stores in East Africa collected point-of-sale data independently at each location but had no unified view. An analyst used pandas to merge CSVs from every store, standardise product codes (which varied across branches), and compute reorder points based on rolling 30-day sales averages. The result was an automated weekly report that reduced stock-outs by roughly 15 percent in the first quarter after deployment. The entire pipeline ran as a scheduled Python script on a cloud VM, with no GUI, no manual intervention, and no expensive BI licence.

Public Health Survey Analysis

A public health NGO running a maternal health programme needed to identify regions with the lowest vaccination coverage. The team used requests to ingest survey responses from a JSON API, pandas to clean inconsistent region names (different enumerators had spelled the same districts differently), and plotly to produce a choropleth map coloured by coverage rate. Decision-makers could hover over each region to see exact percentages, which directly informed resource allocation. The alternative would have been weeks of manual spreadsheet consolidation.

Customer Segmentation for Targeted Marketing

An online marketplace applied scikit-learn's K-Means clustering to purchase history data (recency, frequency, monetary value) and segmented customers into four tiers. Marketing then tailored email campaigns to each tier: re-engagement offers for dormant customers, loyalty rewards for top spenders. Open rates increased, and the high-value segment received priority customer service, directly improving retention metrics.

The Python Data Toolkit at a Glance

The table below summarises the core libraries, their roles, and the installation commands. All are available via pip and compatible with the latest stable Python release.

Library	Role	Install
`pandas`	Tabular data manipulation and I/O	`pip install pandas`
`numpy`	Numerical arrays and linear algebra	`pip install numpy`
`matplotlib`	Static plotting and chart generation	`pip install matplotlib`
`seaborn`	Statistical visualization	`pip install seaborn`
`plotly`	Interactive browser-based charts	`pip install plotly`
`scikit-learn`	Machine learning algorithms	`pip install scikit-learn`
`statsmodels`	Statistical modelling and hypothesis testing	`pip install statsmodels`
`requests`	HTTP requests for API data ingestion	`pip install requests`
`sqlalchemy`	Database connectivity	`pip install sqlalchemy`
`openpyxl`	Excel file read/write	`pip install openpyxl`
`jupyter`	Interactive notebook environment	`pip install jupyter`

Getting Started: A Practical Roadmap

If you are approaching Python for the first time, the path from zero to productive analyst is shorter than you might expect. Here is a concrete sequence.

Install Python and set up your environment. Download Python 3.10 or newer from python.org or install the Anaconda distribution, which bundles Python with pandas, NumPy, matplotlib, and Jupyter Notebook out of the box. Create a virtual environment for every project to isolate dependencies cleanly.

Learn the fundamentals in context. The official Python tutorial is thorough and free. For data-specific learning, Kaggle's "Intro to Python" and "Pandas" micro-courses are structured around real datasets, which keeps the motivation loop tight.

Practice with real data. Kaggle Datasets, the UCI Machine Learning Repository, and government open-data portals (Kenya's Open Data initiative, for example) offer thousands of free datasets to experiment with. Download something that interests you and try to answer a specific question: "Which month had the highest rainfall?" or "Which product category generates the most revenue?" Concrete questions drive concrete learning.

Build in public. Share your notebooks on GitHub. Write short tutorials on platforms like Dev.to. The act of explaining your analysis to an audience forces a level of clarity that private practice does not. It also builds a portfolio that employers notice.

Connect to the community. Python user groups, data science meetups (many run virtually), and the pandas tag on Stack Overflow are environments where you can ask questions and absorb patterns from practitioners who have solved the problems you are about to encounter.

Where This Leads

Python's position in data analytics is not the result of hype or marketing. It earned that position through genuine utility: readable syntax, zero licence cost, an ecosystem of libraries that covers every stage of the analytics lifecycle, and a community that continuously pushes the tooling forward.

For anyone building a career in data, whether you are cleaning CSVs for a monthly report or architecting a production ML pipeline, Python proficiency is the foundation everything else sits on. Not optional. Foundational.

Open a notebook, load a dataset, and write your first import pandas as pd. The data is waiting.

This article was written to help aspiring data analysts and early-career engineers build a practical understanding of Python's role in the modern analytics stack. The article was submitted in fulfilment of a LuxDevHQ Cohort 7 Data Engineering assignment ©adev3loper

Level Up Your SQL: Subqueries & CTEs in the Real World

Mungai M. — Sun, 19 Apr 2026 16:01:04 +0000

In the first SQL article, you met the basics: tables, data types, simple SELECTs, and filters. Now you’re ready for the point where “one query” quietly turns into “three nested queries and a headache.”

This is where subqueries and Common Table Expressions (CTEs) earn their keep. They let you think in layers, express complex logic clearly, and avoid turning every report into a wall of SQL noise.

In this follow‑up, we’ll stay practical and Postgres‑friendly. You’ll see where subqueries and CTEs fit, how they differ, and how to refactor queries when they start to sag under their own weight. The goal is that, as you read each example, you can mentally picture what rows are being produced at each step.

Subqueries: Queries Inside Queries

A subquery is just a query inside another query. The outer query asks a question, and the inner query helps answer part of it.

You’ll see subqueries most often in:

SELECT – to compute a single value per row.
FROM – as an inline view (a “temporary table” for that query).
WHERE / HAVING – to filter based on another result set.
JOIN conditions – for advanced patterns or cleaner predicates.

Conceptually, the database plans everything together, but you can think of the subquery as producing a value or set of values that the outer query then uses.

Common subquery shapes

Let’s walk through the main types. For each one, ask yourself two questions:

What does the inner query return (shape: one row, many rows, one column, many columns)?
How does the outer query use those rows?

Scalar subquery

A scalar subquery returns exactly one row and one column. It behaves like a single value you can plug into an expression.

SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  (
    SELECT AVG(total_amount)
    FROM assignment.sales
  ) AS avg_order_value
FROM assignment.customers c;

Here’s how to read this:

Inner query: SELECT AVG(total_amount) FROM assignment.sales scans the sales table and returns one number: the average order value across all customers.
Outer query: for each row in customers, we just attach that same single value as avg_order_value.

So the result is: one row per customer, plus a column showing the global average order value for comparison.

Single‑row subquery

A single‑row subquery returns one row (maybe multiple columns), which you can compare as a unit.

SELECT *
FROM assignment.products p
WHERE (p.category, p.price) = (
  SELECT category, MAX(price)
  FROM assignment.products
  GROUP BY category
  ORDER BY MAX(price) DESC
  LIMIT 1
);

Step by step:

The inner query groups products by category, computes MAX(price) per category, orders by that max price descending, and keeps only the first row with LIMIT 1.
That gives you something like (category = 'Electronics', price = 1500.00).
The outer query then finds any product row whose (category, price) pair exactly matches that tuple. In practice, you’ll get the single priciest product in the dataset.

Multi‑row subquery with `IN`

Here the subquery returns many rows, and the outer query checks if a value is in that set.

-- Customers who have ever placed an order
SELECT *
FROM assignment.customers c
WHERE c.customer_id IN (
  SELECT DISTINCT customer_id
  FROM assignment.sales
);

Think of the inner query as producing a list like [1, 2, 4, 7, 9, ...] of customer_ids that appear in the sales table. The outer query then says: “give me all customers whose customer_id is in that list.”

Correlated subquery

A correlated subquery depends on columns from the outer query, so it is logically evaluated per row of the outer query.

-- Per product, compute total quantity via a correlated subquery
SELECT
  p.product_id,
  p.product_name,
  (
    SELECT SUM(s.quantity_sold)
    FROM assignment.sales s
    WHERE s.product_id = p.product_id
  ) AS total_quantity_sold
FROM assignment.products p;

Here:

For each row in products (outer query), Postgres plugs that row’s product_id into the inner query.
The inner query filters sales down to that product and sums its quantity_sold.
The result is one row per product with a per‑product total quantity.

On small tables this is fine. On big ones, “run this inner aggregate for each product” can become expensive, which is why we sometimes refactor to joins or CTEs.

Inline view (subquery in `FROM`)

An inline view is just a subquery in the FROM clause. You can treat its output like a temporary table with a name.

SELECT
  dep.category,
  dep.total_revenue
FROM (
  SELECT
    p.category,
    SUM(s.total_amount) AS total_revenue
  FROM assignment.products p
  JOIN assignment.sales s
    ON s.product_id = p.product_id
  GROUP BY p.category
) AS dep
WHERE dep.total_revenue > 10000;

Mentally:

The inner query computes one row per category with total_revenue.
We give that result the alias dep.
The outer query simply filters that temporary table to categories over a threshold.

This is the same idea you’ll see later with CTEs, just without the WITH keyword.

When Subqueries Shine (and When They Hurt)

Subqueries are great tools when:

You want existence checks: WHERE EXISTS (SELECT 1 FROM ... ) to answer “does this related row exist?”
You need single‑value lookups: comparing against global MAX(), AVG(), or a config row.
You prefer contained filtering logic: WHERE x IN (SELECT ...) instead of more joins.

For example, a simple existence check:

SELECT c.customer_id, c.first_name
FROM assignment.customers c
WHERE EXISTS (
  SELECT 1
  FROM assignment.sales s
  WHERE s.customer_id = c.customer_id
);

The inner query returns at least one row if the customer has any sales. EXISTS doesn’t care what the row contains, just whether it exists.

But there are downsides:

Correlated subqueries can be slow, because they are logically evaluated per outer row. For a million products, that can mean a million inner aggregations.
Deep nesting makes queries hard to read and debug.
Refactoring is painful when logic is split across many layers of parentheses.

As a heuristic: use subqueries for small, sharp problems. When logic grows or the data set is large, move toward CTEs or joins.

CTEs: Naming Your Thought Process

A CTE (Common Table Expression) is a named subquery defined at the top of your statement using WITH. It only exists for that one query, but it gives you a clear label for each step.

Basic pattern:

WITH customer_totals AS (
  SELECT
    customer_id,
    SUM(total_amount) AS total_spent
  FROM assignment.sales
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.first_name,
  ct.total_spent
FROM assignment.customers c
JOIN customer_totals ct
  ON ct.customer_id = c.customer_id
WHERE ct.total_spent > 1000;

Read it like this:

customer_totals CTE: “Build a small table that has one row per customer with their total_spent.”
Main query: “Join that table to customers and filter on total_spent.”

Instead of one big tangled query, you now have a clear two‑step story.

Chained / multiple CTEs

You can define several CTEs, where later ones depend on earlier ones. This feels a lot like building a small data pipeline.

WITH base_sales AS (
  SELECT *
  FROM assignment.sales
  WHERE sale_date >= DATE '2023-01-01'
),
product_totals AS (
  SELECT
    product_id,
    SUM(total_amount) AS total_revenue
  FROM base_sales
  GROUP BY product_id
)
SELECT *
FROM product_totals
WHERE total_revenue > 2000;

Here:

base_sales isolates “sales from 2023 onward.”
product_totals aggregates only that subset.
The final query filters heavy hitters.

Each step is small and named, so it’s easy to tweak just one without touching the others.

Recursive CTE (hierarchies)

Recursive CTEs are designed for hierarchical data: org charts, category trees, folder structures.

-- employees(emp_id, manager_id, full_name)
WITH RECURSIVE manager_chain AS (
  -- Anchor: start from the employee we care about
  SELECT
    e.emp_id,
    e.manager_id,
    e.full_name,
    0 AS depth
  FROM employees e
  WHERE e.emp_id = 42

  UNION ALL

  -- Recursive step: move one level up each time
  SELECT
    m.emp_id,
    m.manager_id,
    m.full_name,
    mc.depth + 1 AS depth
  FROM employees m
  JOIN manager_chain mc
    ON m.emp_id = mc.manager_id
)
SELECT *
FROM manager_chain
ORDER BY depth;

Walkthrough:

The anchor query grabs the starting employee.
The recursive query says: “Given everyone we already know about in manager_chain, join to their managers and add them to the chain.”
This continues until there’s no higher manager.

You end up with a neat ordered list from “employee” up to “CEO.”

MATERIALIZED vs NOT MATERIALIZED (Postgres)

Postgres can treat CTEs in two ways:

Materialized: run the CTE once, store the result (in memory or on disk), then reuse it.
Not materialized / inlined: conceptually splice the CTE into the main query so the planner can push filters and joins through it.

You can nudge Postgres:

WITH cte AS MATERIALIZED (
  SELECT * FROM big_expensive_view
),
filtered AS (
  SELECT * FROM cte WHERE important_flag = true
)
SELECT * FROM filtered;

Use MATERIALIZED when:

The CTE is expensive but reused multiple times.
You want an optimization fence (no predicate pushdown) for predictable behavior.

Otherwise, let Postgres inline or explicitly use NOT MATERIALIZED and let the planner be clever.

Subqueries vs CTEs: How to Choose

You can often write the same query with either subqueries or CTEs. Choosing is about clarity and performance.

Readability & maintainability

Subqueries are great for tiny, local bits of logic.
CTEs shine when you can describe your query as a sequence of steps.

Performance & planner behavior

Non‑correlated subqueries often optimize similarly to joins.
Correlated subqueries can explode in cost on large tables.
CTEs may be inlined or materialized depending on hints and Postgres version; always check the plan.

Reuse & provenance

If you copy‑paste the same subquery twice, that’s a good signal it wants to be a CTE.
CTEs give you one place to change logic while keeping intent obvious: “these rows came from customer_totals.”

When to reach for JOINs / LATERAL

If your subquery is really “find related rows,” a JOIN is usually clearer. JOIN LATERAL is ideal when you want “top N per row” or “latest row per group.”

Concrete Refactors (Postgres)

Let’s refactor some common patterns so you can see where subqueries, CTEs, and LATERAL each feel natural.

1. Correlated COUNT → CTE + JOIN

Correlated version:

SELECT
  c.customer_id,
  c.first_name,
  (
    SELECT COUNT(*)
    FROM assignment.sales s
    WHERE s.customer_id = c.customer_id
  ) AS order_count
FROM assignment.customers c;

For each customer row, the inner query counts matching sales rows. On a small dataset, no problem. On a large one, this is “run COUNT again and again.”

CTE + JOIN version:

WITH order_counts AS (
  SELECT
    customer_id,
    COUNT(*) AS order_count
  FROM assignment.sales
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.first_name,
  COALESCE(oc.order_count, 0) AS order_count
FROM assignment.customers c
LEFT JOIN order_counts oc
  ON oc.customer_id = c.customer_id;

Now:

order_counts computes every customer’s count once.
The final query simply joins that small aggregated table to customers.
Customers with no sales come out as NULL, which we turn into 0 with COALESCE.

2. “Latest row per group” with LATERAL

Correlated style:

SELECT
  c.customer_id,
  c.first_name,
  (
    SELECT s.sale_date
    FROM assignment.sales s
    WHERE s.customer_id = c.customer_id
    ORDER BY s.sale_date DESC
    LIMIT 1
  ) AS last_order_date
FROM assignment.customers c;

Again, one inner query per customer. Let’s make the intent more explicit with LATERAL.

LATERAL refactor:

SELECT
  c.customer_id,
  c.first_name,
  s_last.sale_date AS last_order_date
FROM assignment.customers c
LEFT JOIN LATERAL (
  SELECT sale_date
  FROM assignment.sales s
  WHERE s.customer_id = c.customer_id
  ORDER BY sale_date DESC
  LIMIT 1
) AS s_last ON TRUE;

LATERAL means: “for each row of customers, run this subquery with access to that row’s columns.”

The inner block still does ORDER BY ... LIMIT 1, but it’s clearer that we’re pairing one latest sale row per customer.
We can easily extend it to return more columns (amount, product, etc.) from the latest sale.

3. Recursive CTE for an org chart

Revisiting our recursive example, now you can see how the pieces fit:

Anchor: define the starting point.
Recursive step: define how to move to the “next level.”
Final SELECT: decide what to display.

WITH RECURSIVE manager_chain AS (
  -- Start from a specific employee
  SELECT
    e.emp_id,
    e.manager_id,
    e.full_name,
    0 AS depth
  FROM employees e
  WHERE e.emp_id = 42

  UNION ALL

  -- Then move up the chain to each manager
  SELECT
    m.emp_id,
    m.manager_id,
    m.full_name,
    mc.depth + 1 AS depth
  FROM employees m
  JOIN manager_chain mc
    ON m.emp_id = mc.manager_id
)
SELECT *
FROM manager_chain
ORDER BY depth;

If you print the results, you’ll literally see: depth = 0 for the employee, 1 for their manager, 2 for the manager’s manager, and so on. That step‑by‑step mental picture is what makes recursive CTEs click.

Practical Checklist Before You Ship That Query

Can I name this step? If you can describe a piece of logic (“active 2023 sales”), that’s a good candidate for a CTE.
Is this subquery correlated? If it references outer columns, run EXPLAIN ANALYZE and see if it’s doing repeated work.
Are join/filter columns indexed? Especially customer_id, product_id, and date columns used in subqueries or CTEs.
Am I repeating logic? Repeated subqueries usually want to be a CTE or an inline view.
Do I really need MATERIALIZED? Force it only when you want “compute once, reuse many times” and you accept the temp cost.
Can a JOIN / LATERAL express this more clearly? If yes, it often improves both readability and performance.

Wrapping Up

In the first article, getting a SELECT to run was the win. As your queries grow, the real win is writing SQL that your future self can understand at a glance and your database can execute efficiently.

Subqueries and CTEs are how you get there: they let you structure your thinking, name your steps, and refactor as your logic evolves. The best way to internalize them now is to pick one of your existing “medium‑ugly” queries, rewrite it with a couple of CTEs, and see how much easier it is to read, explain, and tune.

This article was written to help data engineers, from early-career to mid-level, master how and when to use subqueries, CTEs or other alternatives. As always, practice makes perfect, so go ahead: break some eggs! The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper

Welcome to the World of SQL

Mungai M. — Mon, 13 Apr 2026 07:45:12 +0000

Your First Step into Data Analysis

You might have wondered how your favorite streaming app instantly recommends the perfect movie, or how your bank retrieves your transaction history the second you log in. Behind the scenes, these platforms are almost certainly using SQL.

SQL, which stands for Structured Query Language, is the standard language we use to communicate with databases, asking them to find, organize, or update information. Think of it as the ultimate, super‑powered search bar for raw data. It allows you to ask complex questions, organize massive amounts of information, and update records in the blink of an eye. Whether you want to know how many pairs of shoes a store sold last Tuesday or which customers signed up for a newsletter in the past hour, SQL is the tool that makes finding those exact answers possible.

The Digital Filing Cabinet

To really understand SQL, it helps to first understand what a database actually is. If data is the raw information; like individual receipts, customer names, or the prices of coffee; a database is the secure container that holds it all.

Imagine a massive, highly organized digital filing cabinet. Inside this cabinet, you have different drawers, which we call tables. Instead of a chaotic pile of loose papers, each table is structured much like a spreadsheet with a neat grid of rows and columns. Every row represents a single file or record (like one specific customer), and every column represents a specific attribute (like that customer’s email address).

When building these digital filing cabinets, developers typically choose between two main database types: relational and NoSQL. You would pick a relational database (traditionally accessed using SQL) when your data is highly structured and requires absolute accuracy, like financial ledgers or inventory systems. You might pick a NoSQL database when you are dealing with flexible, unstructured, or rapidly changing data like social media posts.

Speaking the Language of Data

Even though SQL is considered a single language, it actually acts like a Swiss Army knife with several core sublanguages, each assigned to a specific job:

DDL (Data Definition Language): This is the builder; it sets up and alters the actual structure of your tables and database.
DML (Data Manipulation Language): This handles the day‑to‑day operations, allowing you to insert, update, or delete the actual rows of data.
DQL (Data Query Language): This is how you ask questions and fetch the exact information you want to see.
DCL (Data Control Language): This acts as the security guard, managing permissions and deciding who gets access to the database.
TCL (Transaction Control Language): This is your safety net, allowing you to permanently save (commit) your changes or hit undo (rollback) if something goes wrong.

To keep everything running smoothly, databases use something called data types. Data types matter immensely because they tell the database exactly what kind of information is allowed in each column. This prevents messy errors (like accidentally saving a name in a phone number field) and makes searching the database incredibly fast and efficient.

Here are some of the most common types you will encounter:

Integer: Whole numbers without decimals, like 42 or 100.
Text/Varchar: Words, names, or alphanumeric characters of varying lengths, like "Coffee Mug".
Numeric/Decimal: Exact numbers with decimal points, perfectly suited for a price like 19.99.
Timestamp: A specific calendar date and exact time, like 2026-04-13 09:41:00.
Boolean: Simple true or false values, used for things like checking if an item is currently in stock (TRUE).
JSON: A flexible format used to store nested or unstructured data, like a customer’s specific color and size preferences all in one spot.

Rules of the Game: Schemas and Constraints

To prevent your beautiful filing cabinet from turning into chaos, databases use a schema (the blueprint) and constraints (the strict rules).

A primary key acts as a unique identifier for every single row, ensuring no two records are ever confused. A foreign key acts as a bridge, linking information across different tables so they can talk to each other without duplicating data.

Other constraints keep the data perfectly clean:

NOT NULL enforces a strict rule that a field can never be left completely empty.
UNIQUE guarantees that no two entries in a column are identical, which is perfect for user email addresses.
A CHECK constraint enforces a specific logical rule that the data must pass. For example, you can ensure a product’s price is always greater than zero with a tiny snippet of SQL:

CHECK (price > 0)

Getting Hands-On: A Mini Example

Let’s bring this all together with a concrete example. Imagine you manage a tiny fictional table called products for a local coffee shop.

products
───────────────
product_id | name   | category | price | in_stock
----------+--------+----------+-------+---------
1         | Coffee | Beverage |  3.50 | TRUE
2         | Tea    | Beverage |  2.50 | TRUE
3         | Mug    | Merch    | 12.00 | FALSE
4         | T-Shirt| Merch    | 20.00 | TRUE

If you wanted to filter this table to only show the names and prices of your beverages, you would use the SELECT and WHERE keywords:

SELECT name, price
FROM products
WHERE category = 'Beverage';

What if you wanted to count exactly how many items you have in each category? You can use COUNT to tally them up and GROUP BY to organize the results into neat buckets:

SELECT category, COUNT(product_id)
FROM products
GROUP BY category;

Finally, if you want to create custom labels on the fly; like categorizing your items into pricing tiers; you can use the CASE WHEN expression. It evaluates your rules row by row and outputs a new category:

SELECT
  name,
  price,
  CASE
    WHEN price > 10 THEN 'Premium'
    ELSE 'Standard'
  END AS price_tier
FROM products;

5 Practical Tips for Beginners

As you start your journey into data analysis, keep these five actionable tips in mind:

Choose the right data types. Always pick the smallest, most specific data type for your columns to keep your database running fast and save storage space.
Avoid the SELECT * trap. Instead of grabbing every single column with an asterisk, only select the exact columns you actually need so you do not slow down the system or fetch unnecessary data.
Test queries safely. Never run a DELETE or UPDATE command without double‑checking your WHERE clause, or you might accidentally overwrite or delete every single row in your table.
Use LIMIT for exploration and pagination. When exploring a massive table for the first time, add a LIMIT (or your database’s equivalent) to the end of your query to prevent overwhelming your screen with millions of rows.
Explore learning resources. Practice makes perfect, so try out free interactive websites like SQLZoo, SQLBolt, or DataLemur to get comfortable writing real queries right in your browser.

Now that you know the basics of how databases work, jump into a free sample dataset and try writing your very first query today!

This article was written to help new data engineers, build an appetite for the Structured Query Language. The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper

How to Publish a Power BI Report and Embed It on a Website

Mungai M. — Sun, 05 Apr 2026 19:58:07 +0000

You have built a Power BI report. The charts look sharp, the DAX measures are doing their job, and the data model is clean. Now what? The report is sitting on your local machine in a .pbix file that nobody else can see or interact with.

This article walks you through the final stretch: publishing that report to the Power BI Service and embedding it on a website. We cover two approaches. The first is Publish to web, which makes your report publicly accessible to anyone with the link. The second is the Website or portal method, which requires viewers to sign in and respects your data permissions. Both produce an interactive iframe you drop into your HTML. We will also cover workspace creation, publishing from Desktop, responsive design, URL filtering, and troubleshooting.

What you need before you start

Power BI has a few moving parts, so let us get the prerequisites out of the way:

Power BI Desktop installed with a finished .pbix report ready to go.
A Power BI account with at least a Pro or Premium Per User (PPU) license. A free license lets you publish to "My Workspace" but does not allow sharing or embedding.
A work or school Microsoft account. Personal Gmail or Yahoo accounts will not work for Power BI sign-in.
Publish to web enabled by your tenant admin. If you are on a personal Pro subscription, this is usually on by default. In corporate environments, your admin may need to flip the switch in the Admin Portal under Tenant Settings.

With those in place, let us get into it.

Step 1: Create a Workspace in the Power BI Service

A workspace is a container in the Power BI cloud where your reports, datasets, and dashboards live. Think of it as a shared folder with permissions. You could publish directly to "My Workspace" (your personal area), but creating a dedicated workspace is better practice because it lets you control who has access and keeps related content organized.

Here is how to create one:

Open your browser and go to app.powerbi.com. Sign in with your organizational account.
In the left sidebar, click Workspaces.
Click the + New workspace button.

A panel slides out from the right. Fill it in:

Workspace name: Give it something descriptive. For this example, "Electronics Sales Reports" works.
Description (optional): A short note about what this workspace contains. Useful when your organization has dozens of workspaces and someone needs to find the right one.
Under Advanced, confirm the License mode is set to Pro (or Premium Per User if your organization uses PPU).
Click Apply.

Your workspace is now live and empty, ready to receive reports. You will see it listed under the Workspaces section in the left sidebar from now on.

A note on workspace roles: When you create a workspace, you are automatically the Admin. You can add other users as Members, Contributors, or Viewers from the workspace settings. For embedding purposes, you only need to be an Admin or Member yourself.

Step 2: Publish Your Report from Power BI Desktop

Publishing sends your .pbix file (the report, its data model, and dataset) from your local machine up to the workspace you just created. The process is straightforward.

Open your finished report in Power BI Desktop.
Make sure you are signed in to your Power BI account. Check the top-right corner of the Desktop window. If it says "Sign in," click it and authenticate with your work account.
Click the Publish button on the Home tab of the ribbon. It is on the far right side of the toolbar.

A dialog appears asking you to select a destination. You will see "My Workspace" and any workspaces you have access to. Select the workspace you just created ("Electronics Sales Reports") and click Select.

Power BI Desktop uploads the report and dataset to the cloud. This takes a few seconds to a minute depending on your dataset size. When it finishes, you see a "Success!" message with a link to open the report in Power BI Service.

Click that link to verify everything looks right. Your report is now live in the cloud and accessible to anyone with workspace permissions.

What if you need to update? Just make changes in Power BI Desktop and click Publish again. It will ask if you want to overwrite the existing report and dataset. Confirm, and the cloud version updates immediately.

Step 3: Generate the Public Embed Code (Publish to Web)

With the report living in the Power BI Service, you can now generate a public embed code. This creates a shareable link and an HTML iframe snippet that you can drop into any website.

Important security note: "Publish to web" makes your report publicly accessible. Anyone with the link or embed code can view it without signing in. Do not use this for confidential or sensitive data. For internal portals where authentication is required, use the "Website or portal" embed option instead, which enforces Power BI sign-in.

Here is how to generate the embed code:

Open the report in the Power BI Service at app.powerbi.com. Navigate to your workspace and click on the report.
Click File in the top menu bar.
Hover over Embed report in the dropdown.
Click Publish to web (public).

A dialog appears warning you that the report will be publicly visible. Review the warning, then click Create embed code.
A second confirmation asks you to acknowledge that the data will be publicly accessible. Click Publish.
The Success dialog appears with two pieces of output:
- A shareable link you can send via email or message.
- An HTML iframe snippet you can paste directly into a website.

Choose a Size from the dropdown (the recommended sizes are 800x600 for medium or 960x596 for a widescreen 16:9 fit). You can always adjust the width and height manually later.
Click Copy to grab the iframe code.

The embed code looks something like this. The src value is the unique embed URL that Power BI generates for your report (it starts with the Power BI domain followed by /view?r= and an encoded token):

<iframe
  title="Electronics Sales Report"
  width="800"
  height="600"
  src="<your-embed-url>"
  frameborder="0"
  allowFullScreen="true">
</iframe>

That src URL is the magic. It points to a read-only, publicly accessible render of your report that Power BI hosts for you.

Step 4: Embed the Report on Your Website

This is the simplest step. You have an iframe. Any website that supports HTML can host it.

Basic HTML page

If you are building a standalone page, drop the iframe into your HTML:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Sales Dashboard</title>
  <style>
    body {
      font-family: 'Segoe UI', sans-serif;
      margin: 0;
      padding: 2rem;
      background: #f5f6fa;
    }
    h1 {
      color: #1a1a2e;
      margin-bottom: 1rem;
    }
    .report-container {
      max-width: 1000px;
      margin: 0 auto;
    }
    iframe {
      width: 100%;
      border: none;
      border-radius: 8px;
      box-shadow: 0 2px 8px rgba(0,0,0,0.1);
    }
  </style>
</head>
<body>
  <div class="report-container">
    <h1>Electronics Sales Dashboard</h1>
    <iframe
      title="Electronics Sales Report"
      width="960"
      height="596"
      src="<your-embed-url>"
      allowFullScreen="true">
    </iframe>
  </div>
</body>
</html>

Making it responsive

The default iframe has fixed width and height. To make it fluid on mobile and desktop, wrap it in a container with a percentage-based aspect ratio:

<style>
  .pbi-wrapper {
    position: relative;
    width: 100%;
    padding-bottom: 56.25%; /* 16:9 aspect ratio */
    height: 0;
    overflow: hidden;
  }
  .pbi-wrapper iframe {
    position: absolute;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
    border: none;
  }
</style>

<div class="pbi-wrapper">
  <iframe
    title="Electronics Sales Report"
    src="<your-embed-url>"
    allowFullScreen="true">
  </iframe>
</div>

This scales the report proportionally on any screen size.

WordPress, Wix, and other CMS platforms

Most website builders have a "Custom HTML" or "Embed" block. In WordPress, add a Custom HTML block in the editor and paste the iframe code. In Wix, use the Embed a Widget option under Add > Embed. The process is similar on Squarespace, Webflow, or any other platform that supports custom HTML.

Managing Your Embed Codes

Once your report is embedded and live, you might need to update or revoke access later. Power BI gives you a management interface for this.

Go to your workspace in the Power BI Service, click the Settings gear icon, and select Manage embed codes. From here you can:

Retrieve the embed code again if you lost it.
Delete the code, which immediately disables the public link and breaks any embedded iframes that use it.

You can only create one embed code per report. If you delete it and create a new one, the URL changes and you will need to update your website.

Data refresh behavior

The embedded report automatically reflects data refreshes. When you refresh the dataset in the Power BI Service (either manually or on a schedule), the cached data updates within about an hour. For reports that need near-real-time updates, keep in mind that the public embed cache refreshes periodically, not instantly.

Alternative: Embed in a Website or Portal (Secure Method)

The "Publish to web" approach above is perfect for public-facing content, but what if your report contains sensitive business data that should only be visible to authenticated users within your organization? That is where the Website or portal embed option comes in.

This method generates an iframe that requires viewers to sign in with their Power BI account before they can see the report. It respects all workspace permissions and Row-Level Security (RLS) rules, making it the right choice for internal dashboards, company intranets, and employee portals.

Prerequisites for secure embedding

Viewers need a Power BI Pro or Premium Per User (PPU) license, or the workspace must be assigned to a Premium capacity (so free-license users with Viewer role can access it).
You must have at least a Contributor role in the workspace where the report lives.
The report must be published to the Power BI Service (Steps 1-2 above still apply).
Your portal or website must support HTTPS. Secure embeds will not work on HTTP pages.

Generating the secure embed code

Open the report in the Power BI Service at app.powerbi.com.
Click File in the top menu bar.
Hover over Embed report.
This time, click Website or portal (instead of "Publish to web").

The Secure embed code dialog appears. It looks similar to the public embed dialog, but with one critical difference: the URL includes an autoAuth=true parameter that triggers automatic authentication.
You get the same two outputs:
- A shareable link for direct access.
- An HTML iframe snippet for embedding.
Click Copy on whichever you need.

The secure iframe code looks like this:

<iframe
  title="Electronics Sales Report"
  width="1080"
  height="760"
  src="<your-secure-embed-url>"
  frameborder="0"
  allowFullScreen="true">
</iframe>

Notice the differences from the public embed. The secure URL uses /reportEmbed instead of /view, includes a reportId parameter with your report's unique GUID, and appends autoAuth=true to handle the sign-in flow. The full pattern looks like this: <power-bi-domain>/reportEmbed?reportId=<your-report-id>&autoAuth=true.

What the viewer experience looks like

When someone visits your website and hits the embedded report for the first time in their browser session, they see a "Sign in to view this report" prompt. After they authenticate with their organizational account, the report loads with full interactivity. Once signed in, any other embedded Power BI reports on the same site load automatically without a second prompt.

If a user does not have permission to view the report, they see an access-denied message instead. This is the security enforcement in action.

Granting access to viewers

The secure embed does not automatically give anyone access. You need to explicitly share the report or grant workspace access:

Workspace roles: Add users as Viewers in the workspace settings. This gives them access to everything in the workspace.
Direct sharing: Click the Share button on the specific report and enter user emails. This grants access to just that report.
Microsoft 365 Groups: If you manage access through M365 groups, add the group to the workspace membership.

Customizing the secure embed with URL parameters

One advantage of the secure embed is that you can control which page opens and pre-filter the data using URL parameters. This is useful when you have a single report but want different portal pages to show different views.

Opening a specific page:

Append &pageName=ReportSection2 to the iframe src URL, where ReportSection2 is the page identifier you can find at the end of the report's URL in the Power BI Service.

<your-secure-embed-url>&pageName=ReportSection2

Pre-filtering data:

Append a $filter parameter to show only specific data. For example, to show only the "Computers" category:

<your-secure-embed-url>&$filter=DimProduct/Category eq 'Computers'

You can combine page navigation and filters to build a lightweight portal experience without any custom code beyond basic HTML links or buttons.

A word of caution: URL filters are a convenience feature, not a security mechanism. Users can modify the URL in their browser to remove or change filters. If you need to enforce data visibility rules, use Row-Level Security in your data model. That way, even if someone strips the filter parameters, they still only see the rows they are authorized to access.

Enabling Copilot in secure embeds

If your organization has Copilot enabled and the workspace is on Premium or Fabric capacity, you can check the Enable Copilot box in the secure embed dialog. This lets users interact with Copilot directly inside the embedded report, asking natural language questions about the data.

Choosing the Right Embed Method

Now that we have covered both approaches in detail, here is how they compare side by side:

Method	Access	Authentication	RLS Support	Best for
Publish to web	Anyone with the link	None required	No	Public dashboards, blog posts, marketing pages
Website or portal	Organizational users only	Power BI sign-in required	Yes	Internal portals, intranets, employee dashboards
Power BI Embedded (Azure)	Custom application users	App-managed tokens	Yes	ISV products, customer-facing SaaS applications

The first two methods are covered in this article and require no coding beyond pasting an iframe. The third method (Power BI Embedded) is a developer-oriented approach that uses the Power BI JavaScript SDK and Azure Active Directory tokens for full programmatic control. It is the right choice for software vendors building analytics into their own products.

When to use which:

Your company blog needs a public sales trends chart? Publish to web.
Your HR team needs an internal headcount dashboard on the company intranet? Website or portal.
You are building a SaaS product and want to offer embedded analytics to your customers? Power BI Embedded.

Common Issues and Fixes

Publish to web issues

"I do not see the Publish to web option."
Your Power BI admin has likely disabled the Publish to web tenant setting. Contact your admin and ask them to enable it under Admin Portal > Tenant Settings > Export and Sharing Settings > Publish to web.

"The embed code shows a blank page."
Check that your report does not use Row-Level Security (RLS). Publish to web does not support RLS. Also confirm the embed code status is "Active" in the Manage embed codes page.

"The report looks cut off in the iframe."
Adjust the width and height values in the iframe tag. Power BI recommends adding 56 pixels to the height to accommodate the bottom toolbar. For a 16:9 layout, try 960x596 or 800x506.

"Changes I made to the report are not showing."
The public embed caches data for up to one hour. After making changes, wait for the cache to refresh, or manually refresh the dataset in the Power BI Service to force an update.

Secure embed issues

"Users are prompted to sign in repeatedly."
Due to Chromium security updates, the autoAuth mechanism may require re-authentication more often than expected. Ensure your portal uses HTTPS and that users have pop-up windows enabled. If the problem persists, consider using the Power BI Embedded SDK with the "user-owns-data" method for a smoother single sign-on experience.

"The embedded report says 'You do not have access'."
The viewer has not been granted permission to the report. Share the report with them directly (click Share on the report in the Power BI Service) or add them to the workspace as a Viewer.

"Secure embed does not show on my portal served over HTTP."
The secure embed requires HTTPS on the hosting page. Power BI will not render authenticated content inside an insecure frame. Set up an SSL certificate on your web server.

"The embed works but Copilot does not appear."
Copilot requires three things: the Enable Copilot checkbox selected in the embed dialog, an active Copilot tenant switch in admin settings, and the workspace assigned to Premium or paid Fabric capacity. If any of these are missing, Copilot will not load.

Key Takeaways

Publishing and embedding a Power BI report follows a consistent pattern: create a workspace, publish from Desktop, generate the embed code, and paste the iframe into your site. The decision point is which embed method fits your scenario.

A few things to keep in mind:

Pick the right method for your audience. Publish to web is genuinely public, with zero authentication. Website or portal requires sign-in and enforces permissions. Do not use the public option for sensitive data just because it is easier.
One embed code per report (public). Deleting it breaks all existing embeds. The secure method uses a stable report URL, so it is less fragile.
Responsive design matters. Wrap the iframe in a percentage-based container so it scales on mobile. The fixed-width default looks terrible on small screens.
Refresh behavior differs. Public embeds cache for about an hour. Secure embeds reflect data refreshes more directly since they query the live service.
RLS only works with secure embed. If your data model uses Row-Level Security to restrict what different users see, Publish to web will strip all RLS rules. Use the Website or portal method to keep row-level filters intact.
URL parameters are your friend. The secure embed supports pageName and $filter parameters, letting you build lightweight multi-view portals without custom code.

Both methods produce a fully interactive Power BI dashboard inside your website, complete with filters, slicers, and drill-through. The difference is whether your audience walks in through an open door or shows an ID at the gate.

This article is part of a Power BI learning series covering data cleaning, DAX, star schema modeling, and report publishing. The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper*

Data Modeling in Power BI: Joins, Relationships, and Schemas Explained

Mungai M. — Sat, 28 Mar 2026 08:47:48 +0000

Data modeling is where raw data becomes usable intelligence. In Power BI, it's not a preliminary step you rush through. It's the architectural foundation that determines whether your reports are fast or sluggish, your DAX is clean or convoluted, and your numbers are right or wrong.

Under the hood, Power BI runs the Analysis Services VertiPaq engine, an in-memory columnar database that relies on structured relationships and compressed tables to aggregate millions of rows quickly. A well-built model means near-instant visualizations and precise DAX calculations. A poorly built one means slow performance, memory exhaustion, circular dependencies, and incorrect results.

This article covers the full landscape: Fact vs. Dimension tables, Star/Snowflake/Flat Table schemas, all six Power Query join types with practical scenarios, Power BI relationship configuration (cardinality, cross-filter direction, active/inactive states), role-playing dimensions, and common modeling pitfalls like ambiguous paths and circular dependencies.

Fact Tables vs. Dimension Tables

Every optimized analytical model starts by separating data into two types: Fact tables (the numbers) and Dimension tables (the context). This separation is the cornerstone of dimensional modeling, and VertiPaq is specifically designed to leverage it. Mixing quantitative metrics with descriptive text in a single table compromises compression efficiency and query speed.

Fact Tables

Fact tables hold the quantitative metrics, measurements, and transactional events generated by a business process. They represent the numerical reality of what happened: how much was sold, how many units shipped, what discount was applied.

Structurally, Fact tables have a massive number of rows (potentially hundreds of millions) but a narrow column footprint. They contain two types of columns: Foreign Keys (integer-based IDs like EmployeeID, StoreID, DateKey that link back to Dimension tables) and Numeric Measures (the actual values being aggregated: Transaction_Amount, Units_Sold, Discount_Applied).

One critical principle: every Fact table must have a consistent grain. The grain defines what a single row represents. "One row per product sold per receipt per store," for example. Mixing grains (daily transactions alongside monthly aggregates in the same table) causes double-counting and forces convoluted DAX to resolve.

Dimension Tables

Dimension tables provide the qualitative context that makes the numbers meaningful. They answer "who," "where," "what," and "why." Customers, Products, Sales Representatives, Geographic Regions.

Structurally, they're the inverse of Fact tables: relatively few rows but many columns. A Customer dimension might have 100,000 rows but 80 columns capturing everything from First_Name to Lifetime_Value_Tier to Acquisition_Channel. Every Dimension table needs a Primary Key (a column with strictly unique values) that matches the Foreign Key in the Fact table.

When an analyst drags Region_Name onto a chart axis, they're using a Dimension table attribute to slice the raw numeric data in the Fact table. That's the entire relationship in action.

Schemas: Star, Snowflake, and Flat Table

The spatial arrangement and normalization level connecting Facts and Dimensions defines your schema. Your choice directly impacts VertiPaq performance, model memory footprint, and DAX complexity.

Star Schema (the Gold Standard)

The Star Schema is universally recommended for Power BI. It features a single, compressed central Fact table surrounded by multiple Dimension tables, each joined directly via a simple one-to-many relationship. No intermediate lookup tables, no secondary dimension branches, no complex relationship chains.

To achieve this, Dimension tables are deliberately denormalized during data preparation. Instead of separate tables for Product, Product_Subcategory, and Product_Category, everything collapses into a single Product dimension.

Why this works so well in Power BI:

Performance. Only one relationship "hop" from any dimension attribute to the fact data. VertiPaq is engineered to traverse these single-tier relationships with maximum efficiency.

Simple DAX. Filter context flows cleanly from dimension slicer to Fact table. No need for complex filter-modification functions.

Intuitive for users. One business entity equals one table. Self-service report authors don't get lost.

The trade-off is data redundancy: a long string like "Industrial Manufacturing Equipment" might repeat across thousands of product rows. But VertiPaq handles this through dictionary encoding, storing the string once and using a tiny integer reference everywhere else. The theoretical storage penalty is virtually eliminated in memory.

Snowflake Schema

A Snowflake Schema normalizes one or more Dimension tables into hierarchical sub-tables. Instead of one Product table, you get Product joined to Product_Subcategory joined to Product_Category, branching outward.

The advantage is storage efficiency and strict data conformity. The disadvantage in Power BI is severe: multi-hop traversal degrades reporting performance, and DAX authoring gets significantly more complex. Filters must propagate through intermediate tables, leading to unexpected behaviors and potential ambiguous pathway errors.

The universal recommendation: use Power Query to merge and denormalize Snowflake structures into a Star Schema before loading into the model.

Flat Table (DLAT / "One Big Table")

The Flat Table abandons Fact/Dimension separation entirely, joining everything into a single massive table with potentially hundreds of columns.

In Power BI Import mode, this is a severe anti-pattern. Appending text-heavy dimensional attributes alongside millions of transaction rows causes catastrophic data duplication, bloats the in-memory cache, slows refreshes, and complicates DAX. Overriding filter context on a single attribute in a Star Schema is trivial (CALCULATE([Measure], ALL('Product'))). In a Flat Table, you must list every column individually.

There is one legitimate exception: DirectQuery mode. When Power BI passes DAX as SQL queries to a backend warehouse (Snowflake, BigQuery, Databricks), a pre-joined materialized Flat Table eliminates runtime SQL JOINs, which can be computationally expensive. In this specific scenario, a DLAT can yield faster visual rendering. For Import mode (the vast majority of implementations), Star Schema remains the imperative.

Power Query Joins: Combining Data at the Source Layer

Before data enters VertiPaq, it's extracted, cleaned, and transformed in Power Query. Joins (merges) in Power Query are physical operations: they permanently combine columns from two tables based on matching keys during ETL. This is fundamentally different from Power BI Relationships, which are virtual, dynamic filter mechanisms applied in memory during user interaction.

Power Query supports six join types, all derived from standard SQL relational algebra.

1. Inner Join

Returns only rows with matching keys in both tables. Unmatched rows from either side are discarded.

Scenario: Sales analysis limited to currently active employees. Inner Join on EmployeeID drops sales records tied to terminated employees and active employees with no sales.

2. Left Outer Join

The most commonly used join for data modeling. Preserves all rows from the left table. Matching rows from the right table are appended; unmatched left rows get null values in the right-table columns.

Scenario: A Customer master list enriched with campaign responses. Customers who didn't respond still appear with null in the feedback columns.

3. Right Outer Join

The inverse: preserves all rows from the right table, appending only matching rows from the left.

Scenario: Ensuring all new products from a supplier catalog appear in your model, even if no sales exist yet.

4. Full Outer Join

Preserves all rows from both tables. Matched rows are combined; unmatched rows from either side are retained with null values for the missing columns.

Scenario: Reconciling employee records across two separate HR systems. Every employee from both systems appears, with gaps showing where records don't align.

5. Left Anti Join ("Rows only in first")

Returns strictly the rows from the left table that have no match in the right table. Every matched row is discarded.

Scenario: Generating a list of campaign targets who haven't been contacted yet. Left Anti Join subtracts contacted customers from the target list.

6. Right Anti Join ("Rows only in second")

Returns strictly the rows from the right table that have no match in the left table.

Scenario: Comparing a digital inventory system against a physical warehouse audit. Right Anti Join reveals items found on the warehouse floor that don't exist in the system, flagging undocumented overstock or data entry failures.

Step-by-Step: Merging in Power Query

From the Home ribbon in Power BI Desktop, click Transform data to open Power Query Editor.
In the Queries pane, select the table that will act as the Left (primary) table.
On the Home ribbon, in the Combine group, click Merge Queries (or "Merge Queries as New" to preserve originals).
In the Merge dialog, click the matching column header(s) in the Left table preview. Select the Right table from the dropdown, then click its matching column header(s).
Select your Join Kind from the dropdown at the bottom.
Power Query shows an estimated match count. Validate, then click OK.
The merge adds a column of nested Table objects. Click the expand icon (divergent arrows) in the column header, select which columns to flatten, and click OK.
Click Close & Apply to load the result into VertiPaq.

Power BI Relationships: The Semantic Layer

While Power Query joins weld data together during ETL, Relationships are virtual, logical connections established post-load. They propagate filter context between tables. Selecting "2024" in a Date slicer generates a filter that travels down the relationship pathway to isolate matching rows in the Fact table.

Important: Power BI relationships do not enforce data integrity (no prevention of orphan records, no cascading deletes like SQL). They define filter propagation rules only.

Cardinality

One-to-Many (1:) / Many-to-One (:1): The same relationship viewed from opposite sides. The "one" side is the Primary Key (unique values in the Dimension); the "many" side is the Foreign Key (duplicates in the Fact). This is the structural glue of the Star Schema and the optimal relationship type.

One-to-One (1:1): Both columns contain unique values. Rare, and often indicates the tables should be merged into one. Legitimate exceptions: isolating columns for row-level security or separating rarely-queried wide text columns to save memory.

Many-to-Many (:): Both columns contain duplicates. Common in scenarios like students enrolled in multiple courses. Connecting two many-to-many dimensions directly causes extreme ambiguity and incorrect aggregations. The solution is a Bridge Table (junction table) capturing every unique combination, transforming the relationship into two predictable one-to-many connections.

Cross-Filter Direction

Single Direction (default and recommended). Filters flow from the Dimension ("one" side) down to the Fact ("many" side). A single arrowhead on the relationship line, pointing toward the Fact.

Both Directions (bi-directional). Filters flow both ways. Denoted by a double arrowhead. Occasionally necessary (dynamically shrinking a slicer list, propagating across Bridge tables), but deploy with extreme caution. Indiscriminate bi-directional filtering forces massive cross-table permutations and is the leading cause of ambiguous path errors.

Active vs. Inactive Relationships

Power BI allows multiple relationships between the same two tables but enforces that only one can be Active at a time.

Active (solid line): the default filter path. Standard DAX measures use this automatically.
Inactive (dashed line): dormant until explicitly invoked via USERELATIONSHIP() in a DAX measure.

Step-by-Step: Creating Relationships

Method 1: Model View. Click the network icon on the left nav to open the Model View canvas. Click and drag a Primary Key column from the Dimension table to the Foreign Key column in the Fact table. Power BI auto-detects cardinality and cross-filter direction. Double-click the line to edit.

Method 2: Manage Relationships Dialog. From the Modeling tab, click Manage relationships > New. Select tables and columns from dropdowns, review the auto-detected settings, confirm the "Make this relationship active" checkbox, and click OK.

Joins vs. Relationships: When to Use Which

Feature	Power Query Join (Physical)	Power BI Relationship (Logical)
What it does	Physically combines columns into one table	Virtual connection for dynamic filter propagation
When it runs	During ETL/data refresh	In-memory at query time during user interaction
Memory impact	Can inflate row counts and duplicate text strings	Maintains compressed, narrow tables
Flexibility	Static until next refresh	Dynamic; can toggle via DAX (`USERELATIONSHIP`, `CROSSFILTER`)
Best use case	Denormalizing Snowflake to Star; appending columns from tiny lookups	Building Star Schemas; connecting Fact to Dimension tables

The rule: for Import mode, rely on Relationships to build a Star Schema. Use Power Query joins to flatten hyper-normalized data or append a few columns from minor lookup tables. Don't join everything into a Flat Table in Import mode.

Role-Playing Dimensions

A classic challenge: a single Dimension table needs multiple roles. A Date table relating to a Sales fact might connect on OrderDate, ShipDate, and DeliveryDate. Power BI only allows one active relationship between any two tables, so you get one solid line and two dashed lines.

Option 1: Duplicate the Dimension

Use Power Query to reference and duplicate the Date table into independent Order_Date, Ship_Date, and Delivery_Date tables. Each gets its own active relationship.

Pros: Intuitive for self-service users. Easy to visualize two roles simultaneously on one chart.
Cons: Inflates the model. Duplicating a small Date table (3,650 rows) is negligible. Duplicating a multi-million row Customer table (acting as both BillTo and ShipTo) is costly.

Option 2: USERELATIONSHIP() in DAX

Keep one Dimension table with one active relationship. Author DAX measures that temporarily activate the inactive paths:

Sales_by_ShipDate = 
CALCULATE(
    SUM(Sales[Amount]), 
    USERELATIONSHIP('Date'[Date], Sales[ShipDate])
)

Pros: Minimal model size. Single source of truth.
Cons: Every metric for a secondary role needs its own measure. Analyzing two roles in the same visual requires advanced DAX.

General guideline: duplicate small lookup tables; use USERELATIONSHIP() for large dimensions.

Common Modeling Pitfalls

Ambiguous Paths

These errors occur when VertiPaq detects multiple possible routes for a filter to travel between two tables. The engine can't guess which path you intended, so it throws an error or disables relationships.

The most common cause: reckless bi-directional filtering across multiple tables, creating loops that interact with existing single-direction paths. Another cause: a shared parent dimension (like Location) filtering both Customer and Store, which both filter the same Sales Fact, creating competing parallel paths.

Fix: return to strict Star Schema architecture. Use single-direction, one-to-many relationships exclusively. If Bridge tables are required, enable bi-directional filtering on only one side. Better yet, disable bi-directional filtering globally and manage it via the CROSSFILTER DAX function only where explicitly needed.

Circular Dependencies

A circular dependency is an infinite computational loop: Object A requires Object B, but Object B requires Object A. Power BI detects this and blocks the operation.

These rarely come from obvious formulas. They typically emerge from context transition in Calculated Columns. When a Calculated Column uses CALCULATE(), DAX transforms the current row context into a filter context, making the column depend on all other columns in the table. A second Calculated Column using CALCULATE() in the same table creates a mutual dependency lock.

Fixes:

Switch to a Measure. Measures evaluate dynamically at query time, bypassing the row-level context transition issue entirely.
Exclude conflicting columns. Use ALLEXCEPT() or REMOVEFILTERS() to strip the dependency.
Move the logic upstream. Perform complex row-level arithmetic in Power Query or the source database before VertiPaq ever sees it.

Power BI data modeling rewards discipline. Star Schema, clean cardinality, single-direction filtering, and deliberate separation of physical joins from logical relationships. Get those fundamentals right and everything downstream (DAX, performance, user adoption) gets dramatically easier. The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper

How Linux Powers Real-World Data Engineering

Mungai M. — Thu, 26 Mar 2026 13:18:28 +0000

Linux Isn't Optional. It's the Foundation.

If you work in data engineering, you might spend most of your day inside managed cloud consoles and PaaS dashboards. It's easy to forget what's running underneath. But peel back those abstractions and you'll find Linux everywhere. AWS, GCP, and Azure all run on Linux distributions to provision compute instances, manage virtualization, and orchestrate containers. For data engineers building and maintaining resilient pipelines, Linux proficiency isn't a nice-to-have. It's table stakes.

Not long ago, enterprise data integration meant dragging and dropping in GUI-based ETL tools like SQL Server Integration Services (SSIS), typically on Windows servers. Those tools worked fine for basic pipelines, but they buckled under the scalability and automation demands of big data. As organizations scaled, they gravitated toward open-source Linux distributions (Red Hat Enterprise Linux, CentOS, Ubuntu), drawn by their stability, security, and resource efficiency.

The entire modern distributed processing stack was born on Linux. Hadoop, Spark, Kafka, Airflow. All of them depend on Linux kernel features for memory management, disk I/O, and concurrent processing across clusters. To administer these tools effectively, you need the command line. You SSH into servers, edit DAGs, inspect execution logs, manage background scheduling, and debug production failures in real time. An engineer with strong Linux skills immediately signals that they can manage the full data lifecycle at the system level.

The Terminal: Your Primary Data Interface

The Linux terminal, whether you call it the console, the shell, or the command prompt, is the direct line between you and the operating system kernel. Unlike graphical interfaces that abstract system calls away, the terminal gives you unfiltered access to the filesystem, network interfaces, and process schedulers.

In data engineering, where batch tasks and high-volume data manipulation are constant, this matters. You can kick off a multi-terabyte download in one terminal window while running log analysis in another, with the kernel handling the multitasking without breaking a sweat. And when a GUI crashes due to memory exhaustion or a bad config, the CLI is often the only way back in.

Navigating the Filesystem

The filesystem is your initial staging area: the place where raw data lands before it gets loaded into a database or distributed file system. You'll navigate it with the basics: pwd to check where you are, cd to move around, ls to see what's there. In practice, you'll lean on flags constantly. Running ls -alF (often aliased to ll) gives you a comprehensive view: hidden files, byte-level sizes, ownership, and permissions all at a glance.

For exploring complex directory structures, tree visualizes the hierarchy of your data partitions. The pushd/popd stack commands let you dive deep into nested log directories and snap right back to where you started.

Once you're in the right directory, file manipulation takes over. cp duplicates raw data for backup before transformation. mv renames files or shifts them across partitions after processing. mkdir creates new directories on the fly for daily partitioned extracts. touch is subtly versatile: its primary job is updating timestamps, but pipelines frequently use it to create empty marker files (like _SUCCESS flags) that signal to downstream orchestration sensors that an upstream job completed. And rm permanently deletes files, an operation that demands caution, especially when you're automating the purging of old staging data.

Access Control, Security, and System Management

Production environments are multi-tenant. Strict access control isn't optional; it's required for data governance and security compliance. You need to manage who can view sensitive datasets, modify transformation scripts, or execute pipeline triggers.

chmod is the primary tool for managing file permissions. Before a freshly written ETL shell script can run, you must explicitly grant execution rights: chmod a+x etl_pipeline.sh or chmod 755 script.sh. File ownership is managed through chown (change owner) and chgrp (change group), ensuring only authorized service accounts (like the airflow or spark user) can access specific data partitions. In complex enterprise setups, setfacl creates Access Control Lists (ACLs) that go beyond the standard owner/group/others model.

When you need admin privileges (installing dependencies, restarting daemons), sudo temporarily elevates your permissions to root. The su command lets you switch your entire shell session to another user, which is invaluable when testing whether a service account has the right permissions to run a pipeline.

A few other tools deserve mention here. history is essential for auditing previously executed commands when something breaks. who shows logged-in users, letting you verify no unauthorized connections exist on a sensitive database server. For finding files across sprawling filesystems, find does deep real-time traversal, while locate (paired with updatedb) offers near-instant searches against a pre-built index. And when you're editing config files on a remote server with no GUI, nano handles straightforward edits, while vim, steep learning curve and all, enables lightning-fast text manipulation once you've internalized the keybindings.

The Command Line as a High-Throughput ETL Engine

Before data reaches your data warehouse or processing framework, it usually needs inspection, cleansing, and formatting. Python is the standard for complex transformations, but the Linux shell provides a suite of text-processing utilities that act as a remarkably fast, memory-efficient ETL engine. These tools are written in C and process data as continuous streams rather than loading entire files into memory, so they often outperform scripted solutions when filtering or aggregating gigabyte-scale flat files.

Inspecting and Aggregating Data

Understanding your data starts with looking at it. cat prints an entire file to stdout, but on massive datasets it'll overwhelm your terminal. Instead, use head to sample the first few rows (checking headers and schema alignment) and tail to inspect the end of a file. The tail -f flag is indispensable for monitoring real-time application logs. For full exploration, less gives you paginated viewing, letting you scroll forward and backward through massive files without the memory cost of loading the whole thing.

To validate data completeness after a network transfer, wc -l counts lines instantly, letting you confirm that extracted row counts match expectations. The file command analyzes a file's magic numbers to determine its actual type and encoding, which is essential for catching a mislabeled binary masquerading as a CSV.

Stream Editing and Relational Operations

The real power of the Linux shell comes from standard streams (stdin, stdout, stderr) and the pipe operator (|). Piping lets you chain the output of one utility directly into the next, building multi-stage data processing workflows entirely in the terminal.

Here's how common Linux text utilities map to SQL operations:

Linux Command	What It Does	SQL Equivalent
`grep`	Filters rows matching string patterns or regex	`WHERE`
`cut`	Extracts specific columns by delimiter	`SELECT`
`awk`	Line-by-line processing with conditionals and arithmetic	`SELECT` with calculated columns
`sed`	Stream editing: find and replace with regex	`REPLACE()` / string functions
`sort`	Orders lines alphabetically or numerically	`ORDER BY`
`uniq`	Removes adjacent duplicates; `-c` adds frequency counts	`DISTINCT` / `GROUP BY`
`paste`	Merges lines from multiple files side by side	Horizontal `JOIN`

These utilities chain together beautifully. Say you need to find the most frequent IP addresses generating 500 errors in a web server log. Instead of writing a Python script:

cat error.log | grep '500 Internal Server Error' | awk '{print $1}' | sort | uniq -c | sort -nr

That single line filters for 500 errors, extracts the IP address column, sorts to group duplicates, counts occurrences, and sorts the results in descending order. Append tee to display results on screen while simultaneously writing to an audit file.

Parallel Processing: Xargs and GNU Parallel

Sequential pipe chains are elegant and memory-efficient, but they're single-threaded. When you're processing thousands of log files or migrating large repositories, you need parallelism.

Xargs

xargs reads items from stdin, parses them into arguments, and feeds them to another command. This solves a fundamental Unix constraint: many commands don't accept stdin natively, and passing extremely long argument lists via wildcard expansion triggers the dreaded Argument list too long (ARG_MAX) error.

If you need to delete millions of temporary staging files, rm -f *.tmp will fail when the expansion exceeds the kernel's argument limit. The fix:

find . -name "*.tmp" -print0 | xargs -0 rm -f

find locates files and separates names with null bytes (-print0), and xargs reads those null-terminated strings (-0) to batch filenames into safe chunks, handling filenames with spaces correctly in the process.

For parallel execution, add the -P flag. To hash thousands of files for integrity verification across 8 concurrent processes:

find . -type f | xargs -P 8 md5sum

GNU Parallel

xargs is ubiquitous, but GNU Parallel is purpose-built for complex parallel workloads. Its key advantage: when running concurrent jobs via xargs, output from different processes can interleave chaotically. GNU Parallel buffers each job's output until completion, keeping results contiguous and readable.

GNU Parallel also natively supports distributing jobs across multiple remote servers via SSH, turning a single workstation into a master node for an ad-hoc distributed processing cluster. As CI/CD and automation become critical metrics for engineering teams, these parallel processing tools represent a serious operational advantage.

Automating Pipelines via Shell Scripting

Individual commands become powerful when you stitch them into automated pipelines through shell scripting. A shell script is a text file containing commands, control flow, and variables, letting you dictate how data moves from point A to point B without manual intervention.

Building Resilient ETL Scripts

A Bash script starts with a shebang line (#!/bin/bash), telling the OS which interpreter to use. Within these scripts, you build complete ETL routines.

A practical example: launch a Linux VM on a cloud platform and build a pipeline that extracts financial metrics from an external API, performs local aggregations, and loads the output into a relational database. The script uses curl or wget to pull raw JSON or CSV data, then employs awk and sed to filter malformed records and calculate daily aggregates. Finally, it pipes transformed data directly into the database:

echo "COPY access_log FROM '/tmp/transformed_data.csv' DELIMITER ',' CSV;" \
  | psql --username=postgres --host=localhost

Because data pipelines often take hours to complete, you can't risk tying them to an SSH session that might drop. Running nohup ./etl_pipeline.sh & detaches the process from your terminal entirely. It'll keep running even if your connection dies, with output redirected to a log file.

Scheduling with Cron and At

Historically, scheduling meant cron. The cron daemon monitors system directories for timing instructions. Running crontab -e opens your scheduling table, where you define intervals using five fields (minute, hour, day of month, month, day of week) followed by the command:

00 21 * * * /path/to/script.sh > /path/to/output.log

That fires the ETL pipeline at 9:00 PM every day. For one-off jobs (a database backup in two hours, a temporary server reboot), the at command queues an operation without cluttering your crontab.

But cron has real limitations. It's purely time-based: no awareness of upstream dependencies, no retry mechanisms for failed tasks, no centralized monitoring or alerting. While you can theoretically build a rudimentary DAG in Bash by chaining scripts with && and custom alerting, enterprise-scale platforms need dedicated orchestration.

Advanced Orchestration: Airflow and Prefect

The industry consensus is clear: production data platforms should migrate beyond cron toward orchestration frameworks that handle complex dependencies and dynamic resource allocation. Apache Airflow and Prefect are two of the most prominent, and both require deep Linux integration for production deployment.

Apache Airflow in Production

Deploying Airflow beyond local development is a serious architectural undertaking. You need a Web Server for the UI, a Scheduler to monitor and trigger DAGs, a DAG Processor to parse workflow definitions, an Executor for routing logic, a Metadata Database (typically PostgreSQL or MySQL) for state history, and horizontally scalable Workers for the actual computation.

Configuration on Linux leans heavily on environment variables. While Airflow defaults to airflow.cfg, best practice is to override dynamically. Airflow recognizes variables structured as AIRFLOW__{SECTION}__{KEY}. You can inject secure credentials without hardcoding:

export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql://user:password@host/db

For additional flexibility, appending _cmd to a config key lets Airflow derive values from shell command execution.

Managing a high-volume cluster requires careful resource tuning. When DAG counts climb into the hundreds, you must allocate more CPU and memory to the Scheduler. Workers continuously polling the metadata database exhaust connection limits, so you'll need connection pooling. PgBouncer is the standard choice. Logs from highly parallel Celery workers will eventually consume all local disk space, so production setups route execution logs to remote object storage (S3, GCS).

Security integrates tightly with Linux systems. On platforms like Google Cloud, server access and user permissions for Airflow nodes are governed by OS Login and Pluggable Authentication Modules (PAM). For Hadoop cluster authentication, the airflow kerberos command continuously refreshes security tokens from a Kerberos Keytab, typically isolated in a separate container that writes temporary tokens to a shared volume.

Prefect: A Modern Alternative

Airflow's steep learning curve, complex DAG abstraction, and heavy infrastructure requirements are well-documented pain points. Local development alone demands at least 4 GB of RAM and multiple background services, which creates significant friction for rapid iteration.

Prefect was designed for data engineering and MLOps teams who want a frictionless developer experience. Instead of learning Airflow's operational syntax, you write pure Python with @flow and @task decorators. This supports dynamic, runtime workflows using native Python loops and branching, something Airflow's static DAG model struggles with.

Deploying Prefect on RHEL demonstrates the simpler architecture: a Server and Worker model with PostgreSQL as the backend, replacing Airflow's complex web of schedulers and executors. To ensure these processes survive reboots and restart on failure, you create custom systemd service files that embed Prefect into the Linux initialization sequence.

Network Diagnostics in Distributed Data Architectures

Modern data platforms are inherently distributed, comprising separate database servers, cloud storage buckets, API endpoints, and worker nodes. When something fails, it's often a network issue, not a bug in your Python or SQL. The ability to troubleshoot across the TCP/IP stack is a critical skill.

Layer 3 and Layer 4: Connectivity and Ports

Troubleshooting starts at Layer 3 (Network) to verify basic reachability. ping sends ICMP echo requests to check if a remote server is alive. If the host is unreachable or latency is spiking, traceroute (or the real-time mtr) maps the exact path packets take, isolating where connections drop or congest. The ip and route commands let you view and modify local routing tables.

But a live host doesn't mean a specific service is reachable. At Layer 4 (Transport), you validate port availability. If an Airflow worker can't reach a Redis broker on port 6379, use nc (Netcat) to test socket connectivity:

nc -zv 192.168.1.1 6379

This tells you immediately whether the port is open or a firewall rule is blocking traffic.

On the server side, verify that applications have bound to the correct network interface with ss (which has largely replaced netstat). Running ss -tuln gives a clean list of all listening TCP and UDP ports. If a port is unexpectedly occupied, lsof identifies which process holds the lock.

Layer 7: Application Diagnostics and Packet Capture

At the Application layer, failures often stem from DNS resolution issues, especially in cloud environments where IPs change frequently. nslookup and dig query DNS servers for A records, CNAME aliases, and MX records, confirming that endpoint URLs resolve correctly.

For API integration, curl is the industry standard debugging tool. Use it to simulate POST requests, inject authentication headers, and inspect response codes:

curl -I https://api.example.com

wget excels at reliably downloading large datasets over flaky connections, with built-in resume capabilities.

For the most stubborn issues (intermittent packet drops, malformed TCP handshakes, unencrypted data leaks), tcpdump captures raw network traffic in real time, letting you analyze exact byte structures on the wire.

Tool	TCP/IP Layer	Primary Use in Data Engineering
`ping`	Layer 3 (Network)	Server availability and round-trip latency
`traceroute` / `mtr`	Layer 3 (Network)	Mapping network hops and routing bottlenecks
`nc` / `telnet`	Layer 4 (Transport)	Testing whether specific database ports are reachable
`ss` / `netstat`	Layer 4 (Transport)	Confirming services (e.g., Kafka) are listening
`dig` / `nslookup`	Layer 7 (Application)	Diagnosing DNS resolution failures
`curl`	Layer 7 (Application)	Testing REST API endpoints and inspecting headers

Containerization and Immutable Execution Environments

Ensuring a pipeline runs identically on a developer's laptop, a staging server, and a production cluster is paramount. Containerization, predominantly through Docker, achieves this reproducibility by leveraging Linux kernel features: control groups (cgroups) for resource limits and namespaces for process isolation.

Docker Fundamentals

Docker containers are stateless and ephemeral. Any data written to the container's internal filesystem vanishes when it terminates. Databases like PostgreSQL need persistent storage, so you use Docker volumes to map internal directories to the host's filesystem. Exposing a containerized database to external networks requires explicit port mapping (binding the container's internal port to a host port).

Inside containers, managing isolated Python environments prevents dependency conflicts between libraries. Modern workflows use tools like uv to build reproducible Python environments directly within the container definition.

The Entrypoint Script

The Dockerfile's ENTRYPOINT and CMD directives control what happens when a container launches. In practice, containers rarely execute a single command cleanly on startup. They need initialization: waiting for database connections, running migrations, exporting environment variables.

This is handled by an entrypoint.sh script. The Dockerfile copies it in with COPY and grants execution permissions via RUN chmod +x /entrypoint.sh. Security best practice: keep this script immutable (no write permissions) to prevent runtime modification.

The script typically ends with exec python app.py "$@". The exec command is crucial: it replaces the current bash process with the target application, so the Python app becomes PID 1. This matters because PID 1 receives system signals (like SIGTERM) from the container orchestrator, enabling graceful shutdowns that prevent data corruption. The "$@" variable passes through any arguments from CMD or docker run.

One gotcha: when environment variables need to configure binary paths (like mssql-tools), declare them with ENV in the Dockerfile, not in .bashrc, which Docker doesn't source during automated execution.

Knowledge Dissemination: The Technical Publishing Ecosystem

A data engineering project isn't done when the pipeline runs. Documentation and knowledge sharing are part of the lifecycle. Because the field depends heavily on fast-evolving open-source tools, the community relies on technical blogging to document integration edge cases, architectural patterns, and debugging methodologies.

The primary platforms for developer-focused publishing are Hashnode, Dev.to, and Towards Data Science (TDS). Contributing to these builds your professional reputation while enriching the community's collective knowledge base.

Markdown, YAML Front Matter, and Structure

Developer platforms overwhelmingly use Markdown, a lightweight markup language created by John Gruber and Aaron Swartz that's readable in raw form and compiles cleanly to HTML. It handles formatting (bold, italics, lists, blockquotes, links) without cluttering your writing with HTML tags. Crucially, it supports syntax-highlighted code blocks, which are non-negotiable when demonstrating SQL queries or Python scripts.

Article metadata lives in YAML front matter, a block of key-value pairs at the top of the file enclosed by triple dashes (---):

Front Matter Key	Purpose
`title`	The H1 heading and HTML title tag (mandatory)
`tags`	Keywords for indexing (e.g., `linux`, `dataengineering`, `python`)
`canonical_url`	Tells search engines which URL is the original source (critical for cross-posting)
`cover_image`	Header image URL for social media previews
`published`	Boolean controlling whether the post is live or draft

Structure matters for accessibility. Follow a logical narrative: problem statement, technical solution with code examples, real-world conclusion. Headings must follow semantic hierarchy, so don't skip from H2 to H6 or screen readers will choke. Provide alt-text for diagrams, go easy on emojis, and never use Unicode characters to create "fancy fonts."

GitOps for Technical Blogs

A growing trend is treating documentation as code. Write articles locally in your IDE, commit the Markdown to GitHub, and use CI/CD pipelines to automate publishing.

Hashnode offers native GitHub integration. Install the Hashnode app on your repository, and when you push a Markdown file to the designated branch, Hashnode parses the front matter and publishes or updates accordingly, matching posts by slug.

For cross-posting to multiple platforms simultaneously (Dev.to, Hashnode, Medium), engineers build custom GitHub Actions. These scripts trigger on push, extract front matter metadata, and submit the article via each platform's REST API using keys stored in GitHub Secrets. This eliminates the overhead of manual cross-posting entirely.

TDS Editorial Standards

While automated syndication maximizes reach, Towards Data Science enforces strict editorial guidelines. Authors submit drafts through a contributor form with a note on the topic's timeliness. The editorial board reviews every submission for technical accuracy, logical progression, and clarity.

TDS rejects superficial listicles, basic tutorials without novel perspectives, and clickbait titles. Authors must demonstrate that a technical gap exists and that their solution is superior to existing approaches. Media usage is scrutinized: custom graphs (Python, R, D3.js) are preferred, external imagery must be properly attributed, and AI-generated images require verified commercial rights. Code must appear in proper code blocks, never screenshots. Only authors with verified, non-anonymous profiles are permitted to contribute.

This article was written to help data engineers, from early-career to mid-level, build a deeper appreciation for the Linux skills that underpin every modern data platform. These fundamentals are what separate operators from architects. The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper

DEV Community: Mungai M.

How WhatsApp Accounts Really Get "Hacked" (and How to Lock Yours Down)

What "hacking" actually means here

The ways they get in

How these play out in real life

"I sent you a code by mistake"

The link that promises something

The sudden network blackout (SIM swap)

The midnight code grab (voicemail)

The office snooper (linked devices)

Warning signs worth stopping for

How to actually protect yourself

Turn on Two-Step Verification

Treat the code as private, full stop

Check your linked devices, and let WhatsApp warn you

Slow down before you tap a link

Lock down the phone itself

For Kenyan readers: protect the line, not just the app

If you run groups, you're a target

If it has already happened

A few questions people always ask

Do this today, not tomorrow

A few questions people always ask

Do this today, not tomorrow

How Python Powers Real-World Data Analytics

Python Isn't a Trend. It's the Standard.

Why Python Dominates the Analytics Landscape

Readability as a Design Principle

Ecosystem Depth

Interoperability Across the Stack

Community and Market Demand

The Analytics Workflow: Ingest, Clean, Analyse, Visualise

Ingestion: Getting Data In

Cleaning: The Work Nobody Sees

Analysis: Asking Questions with Code

Visualisation: Making the Data Speak

Python in Production: Three Scenarios

Retail Inventory Optimisation

Public Health Survey Analysis

Customer Segmentation for Targeted Marketing

The Python Data Toolkit at a Glance

Getting Started: A Practical Roadmap

Where This Leads

Level Up Your SQL: Subqueries & CTEs in the Real World

Subqueries: Queries Inside Queries

Common subquery shapes

Scalar subquery

Single‑row subquery

Multi‑row subquery with IN

Correlated subquery

Inline view (subquery in FROM)

When Subqueries Shine (and When They Hurt)

CTEs: Naming Your Thought Process

Chained / multiple CTEs

Recursive CTE (hierarchies)

MATERIALIZED vs NOT MATERIALIZED (Postgres)

Subqueries vs CTEs: How to Choose

Concrete Refactors (Postgres)

1. Correlated COUNT → CTE + JOIN

2. “Latest row per group” with LATERAL

3. Recursive CTE for an org chart

Practical Checklist Before You Ship That Query

Wrapping Up

Welcome to the World of SQL

Your First Step into Data Analysis

The Digital Filing Cabinet

Speaking the Language of Data

Rules of the Game: Schemas and Constraints

Getting Hands-On: A Mini Example

5 Practical Tips for Beginners

How to Publish a Power BI Report and Embed It on a Website

What you need before you start

Step 1: Create a Workspace in the Power BI Service

Step 2: Publish Your Report from Power BI Desktop

Step 3: Generate the Public Embed Code (Publish to Web)

Step 4: Embed the Report on Your Website

Basic HTML page

Making it responsive

WordPress, Wix, and other CMS platforms

Managing Your Embed Codes

Multi‑row subquery with `IN`

Inline view (subquery in `FROM`)