DEV Community: Braeson Nyahera

Using symbolic links (symlinks) in Airflow to manage DAGs across multiple folders

Braeson Nyahera — Tue, 23 Jun 2026 11:40:54 +0000

If you've ever run a DAG file in Airflow, you know that Airflow watches a single folder for DAG files, the configured dags_folder. That raises the question: how can it track DAGs that live in other folders without copying each file into the dags_folder?
When I faced this situation, symbolic links (symlinks) turned out to be the simplest solution. In this post I will walk through what symlinks are and the problem they solve.

Symbolic Links(Symlinks)

Symlinks are pointers in the filesystem; they say, "The real content lies somewhere else." When a program reads a symlink, the OS silently redirects it to the target.
That property is exactly what is needed here. We can make Airflow's dags_folder appear to contain other directories without copying the files.
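To see that redirection in isolation, here is a minimal sketch in Python. It assumes a Linux or macOS machine, and the directory and file names (pipeline_dags, airflow_dags, my_dag.py) are made up for the demo inside a throwaway temp directory.

```python
import os
import tempfile

# Everything happens inside a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    real_dir = os.path.join(tmp, "pipeline_dags")   # hypothetical secondary folder
    dags_dir = os.path.join(tmp, "airflow_dags")    # hypothetical dags folder
    os.makedirs(real_dir)
    os.makedirs(dags_dir)

    # The real file lives only in the secondary folder.
    with open(os.path.join(real_dir, "my_dag.py"), "w") as f:
        f.write("# a DAG file\n")

    # Create a symlink inside dags_dir that points at real_dir.
    link_path = os.path.join(dags_dir, "linked")
    os.symlink(real_dir, link_path)

    is_link = os.path.islink(link_path)       # the entry is a link, not a copy
    link_target = os.readlink(link_path)      # where the link points
    # Opening a path that goes through the link reads the real file.
    with open(os.path.join(link_path, "my_dag.py")) as f:
        content_via_link = f.read()

print(is_link, link_target == real_dir, content_via_link.strip())
```

Nothing was copied: the file exists once, and the link only tells the OS where to look.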

Setting it up

1. Confirming the original dag folder

After making sure apache-airflow is installed, you can check the dags_folder location:

airflow config get-value core dags_folder

A response like this is expected:

/home/bryson/airflow/dags

[Screenshot: contents of the dags folder]

2. Defining our secondary directory

Our secondary directory will be located at /home/bryson/dev/pipeline/dags

3. Current airflow api-server

Only the DAG files inside airflow/dags are currently displayed in the api-server dashboard.

4. Creating symlink for the secondary folder

ln -s source_path link_path

ln -s /home/bryson/dev/pipeline/dags /home/bryson/airflow/dags/fx_prices

This creates a symlink named fx_prices inside /home/bryson/airflow/dags that points to /home/bryson/dev/pipeline/dags.
[Screenshot: the symlink created]

5. Api-server after creating a symlink

As you can see, the api-server now also detects the DAG files inside our secondary directory.
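If you want to check the effect without opening the dashboard, the sketch below lists the Python files a scanner would find under a dags folder before and after the link is created. It is only an illustration: find_python_files is a hypothetical helper, the paths are recreated inside a temp directory, and Airflow's real DAG discovery does more than this.

```python
import os
import tempfile

def find_python_files(dags_folder):
    """Return .py paths under dags_folder (relative to it), following symlinks."""
    found = []
    # followlinks=True makes os.walk descend into symlinked directories,
    # which it does not do by default.
    for root, _dirs, files in os.walk(dags_folder, followlinks=True):
        for name in files:
            if name.endswith(".py"):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, dags_folder))
    return sorted(found)

with tempfile.TemporaryDirectory() as tmp:
    dags_folder = os.path.join(tmp, "airflow", "dags")
    secondary = os.path.join(tmp, "dev", "pipeline", "dags")
    os.makedirs(dags_folder)
    os.makedirs(secondary)

    # One empty file in each folder (file names are made up for the demo).
    open(os.path.join(dags_folder, "local_dag.py"), "w").close()
    open(os.path.join(secondary, "fx_dag.py"), "w").close()

    before = find_python_files(dags_folder)
    os.symlink(secondary, os.path.join(dags_folder, "fx_prices"))
    after = find_python_files(dags_folder)

print(before)
print(after)
```

The file from the secondary folder shows up under fx_prices only after the link exists, which mirrors what the dashboard showed.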

Why symlinks

Symlinking is great for local development, as it is fast to set up and keeps the directories independent.

Points to note

Not every file in a folder is a DAG. To make sure only DAGs are detected, list the names or patterns of the files to be ignored in an .airflowignore file.
Symlinks only help you manage folders of DAG files from the same api-server dashboard; the files themselves still live in their original directories.

[Boost]

Braeson Nyahera — Mon, 22 Jun 2026 10:32:21 +0000

joseph mwangi

Jun 14

Pandas for Data Cleaning: A Practical Guide for Beginners

Incremental vs Delta Extraction and Incremental Load vs Upsert in ETL pipelines

Braeson Nyahera — Tue, 16 Jun 2026 14:26:03 +0000

Introduction

Modern systems generate vast amounts of data daily, if not hourly or every second, whether by creating entirely new data or by updating existing data. Engineers handle this with frameworks that define how each of these cases is treated.
Terms such as incremental extraction, delta extraction, incremental load and upsert are frequently used in ELT/ETL processes.
Although these concepts are related, misunderstanding them hurts the efficiency of the pipeline, so it is necessary to understand each one on its own.

Understanding where they fit

An ETL pipeline consists of three main steps:

Extract
Transform
Load

Incremental extraction and delta extraction belong to the extract step, whereas incremental load and upsert belong to the load step.

Incremental Extraction vs Delta Extraction

Incremental Extraction

Incremental extraction refers to the process of extracting only the data that has changed since a certain point in time. This ensures that not all the data is extracted on every run.
This strategy depends heavily on timestamps to identify which data is new and needs to be extracted.
For example: in a table containing 1,000,000 records, if only 500 records are added, then those 500 are the only ones that would be extracted.

Sample code:

SELECT * FROM orders 
WHERE last_modified > '2026-06-14 00:00:00';

Key Limitation

Incremental extraction cannot detect hard deletes. If a record is removed from the source, it leaves no updated timestamp behind, so the deletion is invisible to this strategy. Deleted rows will persist in the destination as stale data.
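The limitation is easy to reproduce. The sketch below uses Python's built-in sqlite3 module with a tiny in-memory orders table; the rows, the extract_since helper and the watermark value are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2026-06-13 09:00:00"),
        (2, 20.0, "2026-06-13 10:00:00"),
        (3, 30.0, "2026-06-14 08:30:00"),
    ],
)

def extract_since(connection, watermark):
    """Return the ids of rows modified after the watermark timestamp."""
    rows = connection.execute(
        "SELECT id FROM orders WHERE last_modified > ? ORDER BY id",
        (watermark,),
    ).fetchall()
    return [row[0] for row in rows]

watermark = "2026-06-14 00:00:00"
first_run = extract_since(conn, watermark)

# A hard delete removes row 2 and leaves no timestamp behind.
conn.execute("DELETE FROM orders WHERE id = 2")
second_run = extract_since(conn, watermark)

print(first_run, second_run)
```

Both runs return the same ids, so nothing in the extract tells the destination that row 2 is gone.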

Delta Extraction

Delta extraction works by capturing the exact changes that have occurred since the last extraction. This includes inserts, updates and deletions.
The term "delta" represents the difference between two states of data. Instead of relying on timestamps, it depends on CDC (Change Data Capture) mechanisms such as database transaction logs, triggers, or difference tables to capture every insert, update, and delete.
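One simple way to picture a delta is to compare two snapshots of the same table keyed by id. The sketch below does exactly that in plain Python; compute_delta and the sample rows are hypothetical, and log-based CDC tools work differently by reading the database's change log rather than comparing snapshots.

```python
def compute_delta(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and return the changes."""
    inserts = {key: current[key] for key in current.keys() - previous.keys()}
    deletes = sorted(previous.keys() - current.keys())
    updates = {
        key: current[key]
        for key in current.keys() & previous.keys()
        if current[key] != previous[key]
    }
    return {"inserts": inserts, "updates": updates, "deletes": deletes}

# Snapshot at the last extraction.
previous = {1: ("Jane", "Nairobi"), 2: ("Ali", "Mombasa"), 3: ("Otieno", "Kisumu")}
# Snapshot now: row 1 changed city, row 2 was deleted, row 4 is new.
current = {1: ("Jane", "Nakuru"), 3: ("Otieno", "Kisumu"), 4: ("Wanjiru", "Eldoret")}

delta = compute_delta(previous, current)
print(delta)
```

Unlike the timestamp approach, the deleted row appears explicitly in the result.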

Incremental Load vs Upsert

Incremental load

Incremental load is the process of loading only new or modified records. This avoids reloading the whole dataset on every run, which makes the process more efficient.
If only 10 out of 10,000 records were changed or added, then only those 10 are loaded.
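As a rough sketch of the idea, the Python snippet below appends only the rows that the destination has not seen yet, using the highest id already loaded as a high-water mark. The warehouse_orders table, the incremental_load helper and the assumption that ids only ever increase are mine, chosen to keep the example small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO warehouse_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.commit()

def incremental_load(connection, extracted_rows):
    """Append only rows whose id is above the highest id already loaded."""
    high_water = connection.execute(
        "SELECT COALESCE(MAX(id), 0) FROM warehouse_orders"
    ).fetchone()[0]
    new_rows = [row for row in extracted_rows if row[0] > high_water]
    connection.executemany("INSERT INTO warehouse_orders VALUES (?, ?)", new_rows)
    connection.commit()
    return len(new_rows)

# The batch overlaps with what is already loaded (id 2) and adds two new rows.
batch = [(2, 20.0), (3, 30.0), (4, 40.0)]
loaded = incremental_load(conn, batch)
total = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]

print(loaded, total)
```

This append-only version ignores changes to rows that were already loaded, which is the gap upsert fills.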

Upsert

Upsert is a combination of two operations, insert and update. It comes in when only appending data is not enough. The process first checks whether the record exists: if it does, the record is updated; if it does not, the record is inserted.
For upsert to work you need a unique identifier and logic to check whether the record exists.

Sample code:

INSERT INTO customers (customer_id, name, city)
VALUES (101, 'Jane', 'Nairobi')
ON CONFLICT (customer_id)
DO UPDATE
SET
 name = EXCLUDED.name,
 city = EXCLUDED.city;
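To watch the statement behave as both an insert and an update, here is a small Python sketch using the built-in sqlite3 module instead of PostgreSQL. It assumes the bundled SQLite is version 3.24 or newer, which is when ON CONFLICT ... DO UPDATE was added; the second and third customers are made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Same shape as the statement above, with placeholders for the values.
UPSERT = """
INSERT INTO customers (customer_id, name, city)
VALUES (?, ?, ?)
ON CONFLICT (customer_id)
DO UPDATE SET name = excluded.name, city = excluded.city
"""

conn.execute(UPSERT, (101, "Jane", "Nairobi"))  # id 101 is new: inserted
conn.execute(UPSERT, (101, "Jane", "Mombasa"))  # id 101 exists: updated
conn.execute(UPSERT, (102, "Ali", "Kisumu"))    # id 102 is new: inserted
conn.commit()

rows = conn.execute(
    "SELECT customer_id, name, city FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```

Three statements leave two rows behind, and customer 101 carries the latest city.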

Wrapping Up

These four concepts, incremental extraction, delta extraction, incremental load, and upsert, each solve a specific part of the same problem: keeping data systems in sync efficiently. Understanding them individually is what allows an engineer to design pipelines that are correct, performant, and maintainable.

The choice between them is rarely about which is "better"; rather, it is about what the data requires. Append-only data with no deletes calls for a simple incremental extract and load. Frequently updated records with hard deletes call for delta extraction and precise upsert logic. Matching the strategy to the data is the mark of a well-designed pipeline.

Understanding Logic, Reusability and Integrity in SQL: Procedures, Functions and Transactions

Braeson Nyahera — Sun, 03 May 2026 20:13:52 +0000

SQL is widely known for data querying and manipulation, but systems grow: data becomes larger, processes become repetitive and operations become sensitive. SQL has features that take it beyond a query language and close to a fully fledged programming language. The features I discuss in this article are procedures, functions and transactions. Each of these concepts serves a distinct purpose.
Procedures execute operations, functions return values, and transactions ensure those operations are safe.

Stored Procedures

These are sets of SQL statements stored in the database and executed as a unit. They are used to perform tasks such as UPDATE, INSERT and DELETE.
They are triggered by calling them and passing the parameters the procedure expects.
Here is an example of a procedure:

CREATE OR REPLACE PROCEDURE increase_salary(p_dept TEXT, p_percent NUMERIC)
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE employees
    SET salary = salary + (salary * p_percent / 100)
    WHERE department = p_dept;
END;
$$;

Here is how a procedure is called:

CALL increase_salary('IT', 10);
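To check the arithmetic the procedure performs, the sketch below runs the same UPDATE from Python against an in-memory SQLite table. SQLite has no stored procedures, so a plain Python function stands in for one here; the employee rows are made up.

```python
import sqlite3

def increase_salary(connection, dept, percent):
    """Stand-in for the stored procedure: raise salaries in one department."""
    with connection:  # commit if the update succeeds, roll back if it raises
        cursor = connection.execute(
            "UPDATE employees "
            "SET salary = salary + (salary * ? / 100) "
            "WHERE department = ?",
            (percent, dept),
        )
    return cursor.rowcount  # number of rows that were changed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Amina", "IT", 1000.0), ("Brian", "IT", 2000.0), ("Carol", "HR", 1500.0)],
)
conn.commit()

updated = increase_salary(conn, "IT", 10)
salaries = dict(conn.execute("SELECT name, salary FROM employees"))
print(updated, salaries)
```

A 10 percent raise turns 1000 into 1100 and 2000 into 2200, while the HR salary is left alone.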

Functions

A function is a reusable block of logic that returns a value and can be used inside queries.
It can be used in clauses such as SELECT and WHERE, which makes it great for reusability.

CREATE OR REPLACE FUNCTION get_avg_salary(p_dept TEXT)
RETURNS NUMERIC
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN (
        SELECT AVG(salary)
        FROM employees
        WHERE department = p_dept
    );
END;
$$;

Here is an example of how the function can be used:

SELECT name, salary
FROM employees
WHERE salary > get_avg_salary(department);
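The same idea, a function called from inside SELECT and WHERE, can be tried out with Python's sqlite3 module. SQLite cannot define functions in SQL, so the sketch registers a Python callable instead; annual_salary is a hypothetical function chosen for the demo and is not the get_avg_salary function above.

```python
import sqlite3

def annual_salary(monthly):
    """Return a yearly figure from a monthly salary."""
    return monthly * 12

conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL under the name annual_salary.
conn.create_function("annual_salary", 1, annual_salary)

conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Amina", 1000), ("Brian", 2500), ("Carol", 4000)],
)

# The function appears both in the SELECT list and in the WHERE filter.
rows = conn.execute(
    "SELECT name, annual_salary(salary) "
    "FROM employees "
    "WHERE annual_salary(salary) > 20000 "
    "ORDER BY name"
).fetchall()
print(rows)
```

The logic is written once and reused in both places, which is the reusability benefit described above.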

Transactions

Transactions group a set of operations so that if any one of them fails, the whole group is aborted. This ensures that data is only changed when every operation in the group succeeds.
Transactions are best for data safety as they prevent partial updates.

BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 1;

UPDATE accounts
SET balance = balance + 100
WHERE id = 2;

COMMIT;

If it fails:

ROLLBACK;
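Here is a small sketch of that all-or-nothing behavior using Python's sqlite3 module. The accounts table, its CHECK constraint and the transfer helper are assumptions for the demo; the credit is applied first on purpose so that the rollback has something to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance REAL CHECK (balance >= 0))"  # a balance may never go negative
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 50.0), (2, 0.0)])
conn.commit()

def transfer(connection, from_id, to_id, amount):
    """Move money between accounts; return True on commit, False on rollback."""
    try:
        with connection:  # commits on success, rolls back on any exception
            connection.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, to_id),
            )
            connection.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, from_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok_small = transfer(conn, 1, 2, 30)   # both updates succeed
ok_large = transfer(conn, 1, 2, 100)  # the debit violates the CHECK constraint
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok_small, ok_large, balances)
```

In the failed transfer the credit to account 2 had already run, yet it is undone together with the debit, so no partial update survives.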

How They Work Together

The article would not be complete without an example of how these concepts can be used together.
Here is a simple scenario that uses all three concepts together.
Function:

CREATE OR REPLACE FUNCTION get_balance(acc_id INT)
RETURNS NUMERIC AS $$
BEGIN
    RETURN (SELECT balance FROM accounts WHERE id = acc_id);
END;
$$ LANGUAGE plpgsql;

Procedure:

CREATE OR REPLACE PROCEDURE transfer_money(from_id INT, to_id INT, amount NUMERIC)
LANGUAGE plpgsql
AS $$
BEGIN
    IF get_balance(from_id) < amount THEN
        RAISE EXCEPTION 'Insufficient funds';
    END IF;

    UPDATE accounts SET balance = balance - amount WHERE id = from_id;
    UPDATE accounts SET balance = balance + amount WHERE id = to_id;
END;
$$;

Transaction:

BEGIN;
CALL transfer_money(1, 2, 100);
COMMIT;

Conclusion

SQL is capable of far more than querying. Features such as procedures, functions, transactions and many others make it an efficient tool for working with data in whichever way is needed. Mastering it goes a long way in helping you analyze data directly in the database.

Sub-queries vs Window Functions vs Common Table Expressions: Which Should You Use in SQL?

Braeson Nyahera — Sun, 19 Apr 2026 08:44:49 +0000

SQL is one of the most widely used tools for data manipulation, but what happens when several concepts seem to lead to the same output? As queries become more complex, developers often encounter different techniques that appear to produce similar results, namely sub-queries, common table expressions (CTEs), and window functions.
At first glance, these concepts can seem interchangeable. You might solve the same problem using any of them, which raises a natural question: which one is better?
The goal is not to choose one, but to understand when each is the right tool.
In this article I will discuss sub-queries, window functions and common table expressions.

Sub-Queries
These are queries nested inside other queries and are executed either before or alongside the outer query. They can return a scalar value, a row or a table.

SELECT *
FROM employees e
WHERE salary = (
    SELECT MAX(salary)
    FROM employees
    WHERE department = e.department
);

Window functions
These are used when you want to compute values across related rows while preserving every row in the output. They are used together with clauses such as OVER(), PARTITION BY and ORDER BY, and with common functions such as SUM(), AVG(), RANK(), LAG() and LEAD().

SELECT employee_id, first_name, last_name, salary, job_title, department
FROM (
  SELECT *,
    RANK() OVER(PARTITION BY department ORDER BY salary DESC) AS rnk
  FROM employees
) t
WHERE rnk = 1;

Common Table Expressions (CTEs)
These are temporary result sets defined using WITH. They are often used to improve the readability and modularity of scripts.

WITH max_salary AS (
    SELECT department, MAX(salary) AS max_sal
    FROM employees
    GROUP BY department
)
SELECT e.*
FROM employees e
JOIN max_salary m
    ON e.department = m.department
WHERE e.salary = m.max_sal;

All these features of SQL are used to perform advanced data querying and transformations. They can all be adapted to produce nearly the same results, but the right choice depends on what the final output is expected to be.
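To confirm that the three queries above really agree, the sketch below runs trimmed versions of them (selecting only the name) against the same four made-up employees using Python's sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Amina", "IT", 5000), ("Brian", "IT", 7000),
     ("Carol", "HR", 4000), ("David", "HR", 4500)],
)

SUBQUERY = """
SELECT name FROM employees e
WHERE salary = (
    SELECT MAX(salary) FROM employees WHERE department = e.department
)
ORDER BY name
"""

WINDOW = """
SELECT name FROM (
    SELECT name,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
) t
WHERE rnk = 1
ORDER BY name
"""

CTE = """
WITH max_salary AS (
    SELECT department, MAX(salary) AS max_sal
    FROM employees
    GROUP BY department
)
SELECT e.name FROM employees e
JOIN max_salary m ON e.department = m.department
WHERE e.salary = m.max_sal
ORDER BY e.name
"""

# Run each approach and collect the names of the top earners per department.
results = {
    label: [row[0] for row in conn.execute(sql)]
    for label, sql in (("subquery", SUBQUERY), ("window", WINDOW), ("cte", CTE))
}
print(results)
```

All three return the highest earner in each department, so on this data the choice comes down to readability and what else the query needs to do.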

When to use what
Each of the features has a situation in which it is more convenient than the others.

For simple filtering, such as keeping all those with above-average marks, a sub-query is the best option.
When the focus is on row-level analytics, window functions come in handy.
When the logic has multiple steps and reusability is a core factor, common table expressions are best suited.

Using in collaboration
The best thing about these features is that they can be used together. This helps in data transformations where different steps are each best served by a different feature.
Here is an example where all of them are used together:

WITH ranked AS (
    SELECT 
        *,
        AVG(salary) OVER (PARTITION BY department) AS avg_salary
    FROM employees
)
SELECT *
FROM ranked
WHERE 
    salary > avg_salary
    AND salary > (
        SELECT AVG(salary) FROM employees
    );
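To see what the combined query returns, here is a sketch that runs it through Python's sqlite3 module on four made-up employees (again assuming SQLite 3.25 or newer for the window function). The only change to the query is that it selects the name column and sorts the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Amina", "IT", 5000), ("Brian", "IT", 7000),
     ("Carol", "HR", 4000), ("David", "HR", 4500)],
)

QUERY = """
WITH ranked AS (
    SELECT *,
           AVG(salary) OVER (PARTITION BY department) AS avg_salary
    FROM employees
)
SELECT name
FROM ranked
WHERE salary > avg_salary
  AND salary > (SELECT AVG(salary) FROM employees)
ORDER BY name
"""

# Department averages: IT = 6000, HR = 4250. Company-wide average = 5125.
above_both = [row[0] for row in conn.execute(QUERY)]
print(above_both)
```

David beats the HR average of 4250 but not the company-wide average of 5125, so only Brian passes both filters.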

Best Practices
These are some of the recommended practices when working with these SQL features:

It is best to avoid deep nesting of queries by using CTEs, which help structure the script for easy readability.
When using CTEs, name them well so that their purpose can be identified easily without having to read all of the code.
Window functions are often more efficient than equivalent correlated sub-queries on large datasets, so where possible prefer window functions; on small datasets either can be used depending on preference.

Conclusion
These should be considered complementary tools rather than competing ones. In most advanced queries you will use them together or interchangeably to get to the results.
What makes a major difference is how you use them, as they can heavily affect performance. It is best to master all of them and understand the strengths and weaknesses of each, so that when it comes to picking one or more, you do so from an informed position.