<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rose1845</title>
    <description>The latest articles on DEV Community by Rose1845 (@rose1845).</description>
    <link>https://dev.to/rose1845</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F676917%2F8f9d26e9-9198-4877-9b33-4fef33b661c6.png</url>
      <title>DEV Community: Rose1845</title>
      <link>https://dev.to/rose1845</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rose1845"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Thu, 09 Apr 2026 07:20:59 +0000</pubDate>
      <link>https://dev.to/rose1845/etl-vs-elt-which-one-should-you-use-and-why-3435</link>
      <guid>https://dev.to/rose1845/etl-vs-elt-which-one-should-you-use-and-why-3435</guid>
      <description>&lt;h2&gt;
  
  
  What’s the Difference Between ETL and ELT?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is ETL?
&lt;/h3&gt;

&lt;p&gt;Extract, transform, and load (ETL) is the process of combining data from multiple sources into a large, central repository called a data warehouse. ETL uses a set of business rules to clean and organize raw data and prepare it for storage, data analytics, and machine learning (ML). &lt;/p&gt;

&lt;h3&gt;
  
  
  What is ELT?
&lt;/h3&gt;

&lt;p&gt;Extract, load, and transform (ELT) is an extension of extract, transform, and load (ETL) that reverses the order of operations. You can load data directly into the target system before processing it. The intermediate staging area is not required because the target data warehouse has data mapping capabilities within it. ELT has become more popular with the adoption of cloud infrastructure, which gives target databases the processing power they need for transformations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL process&lt;/strong&gt;&lt;br&gt;
ETL has three steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You extract raw data from various sources&lt;/li&gt;
&lt;li&gt;You use a secondary processing server to transform that data&lt;/li&gt;
&lt;li&gt;You load that data into a target database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The transformation stage ensures compliance with the target database’s structural requirements. You only move the data once it is transformed and ready.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n16y3zfr4cimd50ocq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9n16y3zfr4cimd50ocq8.png" alt=" " width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT process&lt;/strong&gt;&lt;br&gt;
These are the three steps of ELT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You extract raw data from various sources&lt;/li&gt;
&lt;li&gt;You load it in its natural state into a data warehouse or data lake&lt;/li&gt;
&lt;li&gt;You transform it as needed while in the target system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With ELT, all data cleansing, transformation, and enrichment occur within the data warehouse. You can interact with and transform the raw data as many times as needed.&lt;/p&gt;
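&lt;p&gt;The same data flow, reordered as ELT, can be sketched with sqlite3 again standing in for the warehouse (rows and names invented): the raw strings land first, and the casting and cleanup happen later inside the target using its own SQL engine.&lt;/p&gt;

```python
import sqlite3

# Load: in ELT the raw data lands in the warehouse first, untransformed.
warehouse = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.99", "ke"), ("2", "5.50", "US")],
)

# Transform: later, inside the warehouse, and as many times as needed,
# since the raw data is still there.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
""")
```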

&lt;h3&gt;
  
  
  Differences
&lt;/h3&gt;

&lt;p&gt;Extract, load, and transform (ELT) improves on extract, transform, and load (ETL) in several ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform and load location&lt;/strong&gt;&lt;br&gt;
Transformation and loading occur in different locations (for example, a database or an API) and use distinct processes. The ETL process transforms data on a secondary processing server.&lt;/p&gt;

&lt;p&gt;In contrast, the ELT process loads raw data directly into the target data warehouse. Once there, you can transform the data whenever you need it. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data compatibility&lt;/strong&gt;&lt;br&gt;
ETL is best suited for structured data that you can represent in tables with rows and columns. It transforms one set of structured data into another structured format and then loads it.&lt;/p&gt;

&lt;p&gt;In contrast, ELT handles all types of data, including unstructured data like images or documents that you can’t store in tabular format. With ELT, the process loads the various data formats into the target data warehouse. From there, you can transform it further into the format you require.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;br&gt;
ELT is typically faster than ETL. ETL adds an extra step before the data is loaded into the target, and that step is difficult to scale and slows the system down as the data size increases.&lt;/p&gt;

&lt;p&gt;In contrast, ELT loads data directly into the destination system and transforms it in parallel. It uses the processing power and parallelization that cloud data warehouses offer to deliver real-time or near-real-time data transformation for analytics. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Costs&lt;/strong&gt;&lt;br&gt;
The ETL process requires analytics involvement from the start. Analysts must plan the reports they want to generate and define data structures and formatting. The time required for setup increases, which adds to costs. Additional server infrastructure for transformations may also cost more.&lt;/p&gt;

&lt;p&gt;ELT has fewer systems than ETL, as all transformations occur within the target data warehouse. With fewer systems, there is less to maintain, leading to a simpler data stack and lower setup costs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;br&gt;
When you work with personal data, you must comply with data privacy regulations. Companies must protect personally identifiable information (PII) from unauthorized access.&lt;/p&gt;

&lt;p&gt;In ETL, developers have to build custom solutions, like masking PII to monitor and protect data.&lt;/p&gt;
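&lt;p&gt;As a minimal sketch of that kind of custom masking (the field names and masking rules here are hypothetical, not from any particular ETL tool), a pipeline step might mask PII before the data leaves staging:&lt;/p&gt;

```python
import re

def mask_email(email: str) -> str:
    """Keep the first character and the domain; mask the rest of the local part."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def mask_record(record: dict) -> dict:
    """Return a copy of the record with PII fields masked."""
    masked = dict(record)
    masked["email"] = mask_email(record["email"])
    # Redact anything that looks like a phone number in free-text fields.
    masked["notes"] = re.sub(r"\+?\d[\d\s-]{7,}\d", "[REDACTED]", record["notes"])
    return masked

row = {"email": "jane.doe@example.com", "notes": "Call +254 712 345 678 re: order"}
clean = mask_record(row)
```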

&lt;p&gt;On the other hand, ELT solutions provide many security features—like granular access control and multifactor authentication—directly within the data warehouse. You can invest more time in analytics and less time in meeting data regulation requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to use ETL vs. ELT
&lt;/h3&gt;

&lt;p&gt;Extract, load, and transform (ELT) is the standard choice for modern analytics. However, you might consider extract, transform, and load (ETL) in the following scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legacy databases&lt;/strong&gt;&lt;br&gt;
It is sometimes more beneficial to use ETL to integrate with legacy databases or third-party data sources with predetermined data formats. You only have to transform and load it once into your system. Once transformed, you can use it more efficiently for all future analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Experimentation&lt;/strong&gt;&lt;br&gt;
In large organizations, data engineers conduct experiments—things like discovering hidden data sources for analytics and trying out new ideas to answer business queries. ETL is useful in data experiments to understand the database and its usefulness in a particular scenario.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex analytics&lt;/strong&gt;&lt;br&gt;
ETL and ELT may both be used together for complex analytics that use multiple data formats from varied sources. Data scientists may set up ETL pipelines from some of the sources and use ELT with the rest. This improves analytics efficiency and increases application performance in some cases.&lt;/p&gt;

&lt;p&gt;For example, here are some common use cases for ETL at the edge:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want to receive data from different protocols and convert it into standard data formats for use in cloud workloads&lt;/li&gt;
&lt;li&gt;You want to filter high-frequency data, perform averaging functions on large datasets, and then load averaged or filtered values at a reduced rate&lt;/li&gt;
&lt;li&gt;You want to calculate values from disparate data sources on the local device and send filtered values to the cloud backend&lt;/li&gt;
&lt;li&gt;You want to cleanse, deduplicate, or fill missing time series data elements&lt;/li&gt;
&lt;/ul&gt;
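&lt;p&gt;The averaging use case, for instance, might look like this minimal sketch; the readings and window size are made up:&lt;/p&gt;

```python
def downsample(readings: list, window: int) -> list:
    """Average consecutive windows of high-frequency readings so only
    reduced-rate values need to be sent to the cloud backend."""
    return [
        sum(readings[i:i + window]) / len(readings[i:i + window])
        for i in range(0, len(readings), window)
    ]

raw = [20.1, 20.3, 20.2, 21.0, 25.9, 20.8]  # e.g. temperature readings at 1 Hz
reduced = downsample(raw, window=3)  # one averaged value per 3 readings
```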

&lt;h3&gt;
  
  
  Tools used in both approaches
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Glue: A serverless data integration service for event-driven ETL and no-code ETL jobs.&lt;/li&gt;
&lt;li&gt;Fivetran: An automated, cloud-based platform recognized for ELT, which also supports ETL and integrates with dbt.&lt;/li&gt;
&lt;li&gt;Airbyte: An open-source, flexible platform providing pre-built connectors for both approaches.&lt;/li&gt;
&lt;li&gt;Azure Data Factory: A cloud-based, serverless service designed for managing, moving, and transforming data.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>etl</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>Connect Power BI to a SQL Database</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Sun, 15 Mar 2026 16:45:41 +0000</pubDate>
      <link>https://dev.to/rose1845/connect-power-bi-to-a-sql-database-1g53</link>
      <guid>https://dev.to/rose1845/connect-power-bi-to-a-sql-database-1g53</guid>
<description>&lt;p&gt;Power BI - Data Visualization Tool - the process of turning raw data into visuals and charts to make it easy for humans to understand the data.&lt;br&gt;
Basically, it is a tool created by Microsoft to turn raw data into interactive insights.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>powerfuldevs</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
    <item>
      <title>Install Docker in Ubuntu v22.04</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Mon, 09 Mar 2026 16:04:11 +0000</pubDate>
      <link>https://dev.to/rose1845/install-docker-in-ubuntu-v2204-1mg1</link>
      <guid>https://dev.to/rose1845/install-docker-in-ubuntu-v2204-1mg1</guid>
      <description>&lt;p&gt;Type this command below in your terminal &lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo apt install docker.io docker-compose-v2&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then, after the installation finishes:&lt;/p&gt;

&lt;p&gt;Add your user to the docker group&lt;br&gt;
&lt;code&gt;sudo usermod -aG docker $USER&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then activate the new group membership without logging out&lt;br&gt;
&lt;code&gt;newgrp docker&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

</description>
      <category>cli</category>
      <category>docker</category>
      <category>linux</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>SQL Joins and Window Functions</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Mon, 02 Mar 2026 14:13:02 +0000</pubDate>
      <link>https://dev.to/rose1845/sql-joins-and-window-functions-4e7g</link>
      <guid>https://dev.to/rose1845/sql-joins-and-window-functions-4e7g</guid>
      <description>&lt;p&gt;Why SQL join?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recombine data - combine the tables into one big result &amp;gt; big picture&lt;/li&gt;
&lt;li&gt;Data enrichment - when you want to pull in extra data from another table&lt;/li&gt;
&lt;li&gt;Check existence - filter rows in one table based on whether matching data exists in another table &amp;gt; filtering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you want to combine two tables, let's say table A and table B, there are two ways to do it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Columns&lt;/strong&gt; - here, we combine the columns from the tables using SQL JOINs. We have 4 common types of joins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LEFT JOIN&lt;/li&gt;
&lt;li&gt;RIGHT JOIN&lt;/li&gt;
&lt;li&gt;INNER JOIN&lt;/li&gt;
&lt;li&gt;FULL OUTER JOIN&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Rows&lt;/strong&gt; - when you combine 2 or more tables through rows, we use the SET operators: UNION, UNION ALL, INTERSECT, EXCEPT. Here, the number of columns and their data types must match.&lt;/p&gt;
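&lt;p&gt;The row-wise combination can be sketched with Python's sqlite3. The two tables below are invented, but note that they have the same column count and data types, as required:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE online_sales (product TEXT);
    CREATE TABLE store_sales  (product TEXT);
    INSERT INTO online_sales VALUES ('pen'), ('book');
    INSERT INTO store_sales  VALUES ('book'), ('lamp');
""")

def q(sql):
    """Run a single-column query and return the values as a list."""
    return [row[0] for row in db.execute(sql).fetchall()]

union_rows     = q("SELECT product FROM online_sales UNION     SELECT product FROM store_sales")  # distinct rows
union_all_rows = q("SELECT product FROM online_sales UNION ALL SELECT product FROM store_sales")  # keeps duplicates
intersect_rows = q("SELECT product FROM online_sales INTERSECT SELECT product FROM store_sales")  # rows in both
except_rows    = q("SELECT product FROM online_sales EXCEPT    SELECT product FROM store_sales")  # only in the first
```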

&lt;p&gt;Today, we are going to talk about the 4 common types of JOINs.&lt;/p&gt;

&lt;p&gt;LEFT JOIN - returns all rows from the left table, together with matching rows from the right table; where there is no match, the right table's columns are NULL&lt;br&gt;
RIGHT JOIN - returns all rows from the right table, together with matching rows from the left table&lt;br&gt;
INNER JOIN - returns only the rows that match in both tables&lt;br&gt;
FULL OUTER JOIN - returns all rows from both tables, filling in NULLs on either side where there is no match&lt;/p&gt;
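&lt;p&gt;A quick way to see these behaviors is sqlite3 with two tiny invented tables. (FULL OUTER JOIN needs SQLite 3.39 or newer, so only INNER and LEFT are shown here.)&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders    (customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian'), (3, 'Carol');
    INSERT INTO orders    VALUES (1, 100.0), (1, 40.0), (3, 25.0);
""")

# INNER JOIN: only customers with at least one matching order.
inner = db.execute("""
    SELECT c.name, o.total FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
""").fetchall()

# LEFT JOIN: every customer; order columns are NULL (None) when there is no match.
left = db.execute("""
    SELECT c.name, o.total FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
""").fetchall()
```

&lt;p&gt;Brian has no orders, so he is absent from the INNER JOIN result but appears in the LEFT JOIN result with a NULL total.&lt;/p&gt;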

</description>
      <category>sql</category>
      <category>sqlserver</category>
      <category>postgres</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Schemas and Data Modelling in Power BI</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Sun, 15 Feb 2026 10:10:46 +0000</pubDate>
      <link>https://dev.to/rose1845/schemas-and-data-modelling-in-power-bi-4458</link>
      <guid>https://dev.to/rose1845/schemas-and-data-modelling-in-power-bi-4458</guid>
      <description>&lt;h2&gt;
  
  
  What is a Schema?
&lt;/h2&gt;

&lt;p&gt;Schema refers to the logical structure of a database or data model that defines how tables are organized and related. In Power BI, schemas are used to optimize data storage, retrieval, and reporting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Schema Matters in Power BI?
&lt;/h2&gt;

&lt;p&gt;A schema in Power BI is crucial because it defines how data is structured, stored, and connected in a data model. A well-designed schema:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enhances Performance — Optimized schemas improve query speed and report loading time.&lt;/li&gt;
&lt;li&gt;Ensures Data Accuracy — Proper relationships prevent incorrect aggregations or duplications.&lt;/li&gt;
&lt;li&gt;Simplifies Data Analysis — A clear schema makes it easier to create reports and dashboards.&lt;/li&gt;
&lt;li&gt;Improves Scalability — A structured schema allows for easy expansion as data grows.&lt;/li&gt;
&lt;li&gt;Optimizes DAX Calculations — Efficient schemas lead to better DAX performance and calculations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Types of Schema
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Star schema
Star schema is a mature modeling approach widely adopted by relational data warehouses. It requires modelers to classify their model tables as either dimension or fact.
It is a widely used data modeling approach in Power BI for optimizing performance and simplifying relationships. It consists of:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Fact Table&lt;/em&gt; (Central Table) — Stores transactional data (e.g., sales, revenue, quantity).&lt;br&gt;
It stores observations or events, and can be sales orders, stock balances, exchange rates, temperatures, and more. A fact table contains dimension key columns that relate to dimension tables, and numeric measure columns. The dimension key columns determine the dimensionality of a fact table, while the dimension key values determine the granularity of a fact table. &lt;br&gt;
&lt;em&gt;For example, consider a fact table designed to store sale targets that has two dimension key columns Date and ProductKey. It's easy to understand that the table has two dimensions. The granularity, however, can't be determined without considering the dimension key values. In this example, consider that the values stored in the Date column are the first day of each month. In this case, the granularity is at month-product level.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Dimension Tables&lt;/em&gt; (Surrounding Tables) — &lt;br&gt;
Describe business entities—the things you model. Entities can include products, people, places, and concepts including time itself. The most consistent table you'll find in a star schema is a date dimension table. A dimension table contains a key column (or columns) that acts as a unique identifier, and other columns. Other columns support filtering and grouping your data.&lt;br&gt;
Contain descriptive attributes (e.g., Date, Product, Customer).&lt;br&gt;
Fewer Joins — Uses one-to-many relationships, reducing complexity and improving query speed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1d8hcpifp8xoic0vbab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1d8hcpifp8xoic0vbab.png" alt=" " width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Normalization vs. denormalization
&lt;/h2&gt;

&lt;p&gt;To understand some star schema concepts described in this article, it's important to know two terms: normalization and denormalization.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Normalization&lt;/em&gt; is the term used to describe data that's stored in a way that reduces repetitious data.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdul8lsznmp1oi4l6mtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcdul8lsznmp1oi4l6mtq.png" alt=" " width="800" height="411"&gt;&lt;/a&gt;&lt;br&gt;
If, however, the sales table stores product details beyond the key, it's considered denormalized. In the following image, notice that the ProductKey and other product-related columns record the product.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ecn9cp8d5yiyw1sjp6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ecn9cp8d5yiyw1sjp6.png" alt=" " width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Measures&lt;/em&gt;&lt;br&gt;
In star schema design, a measure is a fact table column that stores values to be summarized. In a Power BI semantic model, a measure has a different—but similar—definition. A model supports both explicit and implicit measures.&lt;br&gt;
&lt;em&gt;Explicit measures&lt;/em&gt; are expressly created and they're based on a formula written in Data Analysis Expressions (DAX) that achieves summarization. Measure expressions often use DAX aggregation functions like SUM, MIN, MAX, AVERAGE, and others to produce a scalar value result at query time (values are never stored in the model). Measure expressions can range from simple column aggregations to more sophisticated formulas that override filter context and/or relationship propagation. For more information, read about DAX Basics in Power BI Desktop.&lt;br&gt;
&lt;em&gt;Implicit measures&lt;/em&gt; are columns that can be summarized by a report visual or Q&amp;amp;A. They offer a convenience for you as a model developer, as in many instances you don't need to create (explicit) measures. For example, the Adventure Works reseller sales Sales Amount column can be summarized in numerous ways (sum, count, average, median, min, max, and others), without the need to create a measure for each possible aggregation type.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakfc5rtuby6bbld0tm4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakfc5rtuby6bbld0tm4z.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;br&gt;
A &lt;em&gt;surrogate key&lt;/em&gt; is a unique identifier that you add to a table to support star schema modeling. By definition, it's not defined or stored in the source data. Commonly, surrogate keys are added to relational data warehouse dimension tables to provide a unique identifier for each dimension table row.&lt;/p&gt;

&lt;p&gt;Power BI semantic model relationships are based on a single unique column in one table, which propagates filters to a single column in a different table. When a dimension table in your semantic model doesn't include a single unique column, you must add a unique identifier to become the "one" side of a relationship. In Power BI Desktop, you can achieve this requirement by adding a Power Query index column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8wyulyc5rnlllwtszr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyx8wyulyc5rnlllwtszr.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;br&gt;
Advantages of Star Schema&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for Performance — Fewer joins mean faster queries and better report speed.&lt;/li&gt;
&lt;li&gt;Simplifies DAX Calculations — Flat structure makes it easier to create measures and aggregations.&lt;/li&gt;
&lt;li&gt;Enhances Data Visualization — Works seamlessly with Power BI’s data model and relationships.&lt;/li&gt;
&lt;li&gt;Reduces Complexity — Easier to design, manage, and scale compared to Snowflake or Galaxy schemas.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Snowflake schema
A snowflake schema is a set of normalized tables for a single business entity. For example, Adventure Works classifies products by category and subcategory. Products are assigned to subcategories, and subcategories are in turn assigned to categories. In the Adventure Works relational data warehouse, the product dimension is normalized and stored in three related tables: DimProductCategory, DimProductSubcategory, and DimProduct.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2u69krbgp3d6oa04ayy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2u69krbgp3d6oa04ayy.png" alt=" " width="756" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpnrjetzr0js1dw30i6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvpnrjetzr0js1dw30i6r.png" alt=" " width="756" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In Power BI Desktop, you can choose to mimic a snowflake dimension design (perhaps because your source data does) or combine the source tables to form a single, denormalized model table. Generally, the benefits of a single model table outweigh the benefits of multiple model tables. The most optimal decision can depend on the volumes of data and the usability requirements for the model.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Power BI loads more tables, which is less efficient from storage and performance perspectives. These tables must include columns to support model relationships, and it can result in a larger model size.&lt;br&gt;
Longer relationship filter propagation chains need to be traversed, which might be less efficient than filters applied to a single table.&lt;br&gt;
The Data pane presents more model tables to report authors, which can result in a less intuitive experience, especially when snowflake dimension tables contain only one or two columns.&lt;br&gt;
It's not possible to create a hierarchy that comprises columns from more than one table.&lt;br&gt;
When you choose to integrate into a single model table, you can also define a hierarchy that encompasses the highest and lowest grain of the dimension. Possibly, the storage of redundant denormalized data can result in increased model storage size, particularly for large dimension tables.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zesdr4frb2wsffjfdeb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6zesdr4frb2wsffjfdeb.png" alt=" " width="756" height="668"&gt;&lt;/a&gt;&lt;br&gt;
Advantages of Snowflake Schema&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less Data Redundancy — Normalized tables reduce duplication.&lt;/li&gt;
&lt;li&gt;Better Data Integrity — Structured data ensures consistency.&lt;/li&gt;
&lt;li&gt;Efficient for Large Datasets — Optimized for big data storage.&lt;/li&gt;
&lt;li&gt;Easier Maintenance — Updates are more manageable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Model relationships
&lt;/h2&gt;

&lt;p&gt;A model relationship propagates filters applied on the column of one model table to a different model table. Filters will propagate so long as there's a relationship path to follow, which can involve propagation to multiple tables.&lt;/p&gt;

&lt;p&gt;Relationship paths are deterministic, meaning that filters are always propagated in the same way and without random variation. Relationships can, however, be disabled, or have filter context modified by model calculations that use particular Data Analysis Expressions (DAX) functions&lt;/p&gt;

&lt;h2&gt;
  
  
  Data types of columns
&lt;/h2&gt;

&lt;p&gt;The data type for both the "from" and "to" column of the relationship should be the same. Working with relationships defined on DateTime columns might not behave as expected. The engine that stores Power BI data only uses DateTime data types; Date, Time, and Date/Time/Timezone data types are Power BI formatting constructs implemented on top. Any model-dependent objects will still appear as DateTime in the engine (such as relationships, groups, and so on). As such, if a user selects Date from the Modeling tab for such columns, they still don't register as being the same date, because the time portion of the data is still being considered by the engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cardinality
&lt;/h2&gt;

&lt;p&gt;Each model relationship is defined by a cardinality type. There are four cardinality type options, representing the data characteristics of the "from" and "to" related columns. The "one" side means the column contains unique values; the "many" side means the column can contain duplicate values.&lt;br&gt;
&lt;em&gt;If a data refresh operation attempts to load duplicate values into a "one" side column, the entire data refresh will fail.&lt;/em&gt;&lt;br&gt;
The four options, together with their shorthand notations, are described in the following list:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;One-to-many (1:*)&lt;/li&gt;
&lt;li&gt;Many-to-one (*:1)&lt;/li&gt;
&lt;li&gt;One-to-one (1:1)&lt;/li&gt;
&lt;li&gt;Many-to-many (*:*)&lt;/li&gt;
&lt;/ol&gt;
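&lt;p&gt;The note above about refreshes failing comes down to a uniqueness rule: the "one" side key column must contain no duplicate values. A sketch of that check, with hypothetical product keys:&lt;/p&gt;

```python
from collections import Counter

def one_side_violations(keys):
    """Return key values that appear more than once; a non-empty result
    means the column cannot serve as the "one" side of a relationship."""
    return [k for k, n in Counter(keys).items() if n > 1]

product_keys = [101, 102, 103, 102]  # dimension table key column (invented)
duplicates = one_side_violations(product_keys)  # a refresh with this data would fail
```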

&lt;p&gt;&lt;em&gt;One-to-many (and many-to-one) cardinality&lt;/em&gt;&lt;br&gt;
The one-to-many and many-to-one cardinality options are essentially the same, and they're also the most common cardinality types.&lt;/p&gt;

&lt;p&gt;When you configure a one-to-many or many-to-one relationship, choose the one that matches the order in which you related the columns. Consider how you would configure the relationship from the Product table to the Sales table by using the ProductID column found in each table. The cardinality type would be one-to-many, as the ProductID column in the Product table contains unique values. If you related the tables in the reverse direction, Sales to Product, then the cardinality would be many-to-one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One-to-one cardinality&lt;/em&gt;&lt;br&gt;
A one-to-one relationship means both columns contain unique values. This cardinality type isn't common, and it likely represents a suboptimal model design because of the storage of redundant data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Many-to-many cardinality&lt;/em&gt;&lt;br&gt;
A many-to-many relationship means both columns can contain duplicate values. This cardinality type is infrequently used. It's typically useful when designing complex model requirements. You can use it to relate many-to-many facts or to relate higher grain facts. For example, when sales target facts are stored at product category level and the product dimension table is stored at product level.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cross filter direction&lt;/em&gt;&lt;br&gt;
Single cross filter direction means "single direction", and Both means "both directions". A relationship that filters in both directions is commonly described as bi-directional.&lt;/p&gt;

&lt;p&gt;For one-to-many relationships, the cross filter direction is always from the "one" side, and optionally from the "many" side (bi-directional). For one-to-one relationships, the cross filter direction is always from both tables. Lastly, for many-to-many relationships, cross filter direction can be from either one of the tables, or from both tables. Notice that when the cardinality type includes a "one" side, that filters will always propagate from that side.&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>database</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Linux for Data Engineers: A Beginner-Friendly Guide</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Sun, 25 Jan 2026 17:15:06 +0000</pubDate>
      <link>https://dev.to/rose1845/linux-for-data-engineers-a-beginner-friendly-guide-4bgp</link>
      <guid>https://dev.to/rose1845/linux-for-data-engineers-a-beginner-friendly-guide-4bgp</guid>
<description>&lt;p&gt;If you’re getting into data engineering, Linux is not optional: it’s a core skill.&lt;br&gt;
Most data systems in the real world run on Linux, and knowing your way around the terminal makes your work faster, cleaner, and more powerful.&lt;/p&gt;

&lt;p&gt;This article explains why Linux matters for data engineers, introduces essential Linux commands, and shows how to create and edit files using Vi and Nano, all in plain language.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Linux Is Important for Data Engineers
&lt;/h2&gt;

&lt;p&gt;As a data engineer, you will work with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data pipelines (ETL / ELT)&lt;/li&gt;
&lt;li&gt;Servers and cloud machines (AWS, GCP, Azure)&lt;/li&gt;
&lt;li&gt;Databases (Postgres, MySQL)&lt;/li&gt;
&lt;li&gt;Big data tools (Spark, Kafka, Airflow)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Almost all of these run on Linux servers.&lt;/p&gt;

&lt;p&gt;Linux helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work directly on production servers&lt;/li&gt;
&lt;li&gt;Automate tasks using scripts&lt;/li&gt;
&lt;li&gt;Debug issues quickly&lt;/li&gt;
&lt;li&gt;Handle large files efficiently&lt;/li&gt;
&lt;li&gt;Understand how data flows at the system level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can use Linux confidently, you immediately stand out as “production-ready”.&lt;/p&gt;
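&lt;p&gt;As a quick taste, here is how a few everyday commands help with large files (the file names below are just illustrative):&lt;/p&gt;

```shell
wc -l big_dataset.csv          # count rows without opening the file
head -n 5 big_dataset.csv      # peek at the first five rows
grep -c "ERROR" pipeline.log   # count error lines in a log
```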
&lt;h2&gt;
  
  
  Understanding the Linux Terminal
&lt;/h2&gt;

&lt;p&gt;The terminal is just a way to talk to your computer using commands instead of clicking buttons. For example, &lt;code&gt;ls&lt;/code&gt; shows what files are in the current directory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Essential Linux Commands for Data Engineers
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pwd&lt;/code&gt; – Where am I?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pwd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output: &lt;code&gt;/home/rose&lt;/code&gt;. This shows your current directory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls&lt;/code&gt; – List files&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output: &lt;code&gt;data  scripts  README.md&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Common options:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ls -l   # detailed view
ls -a   # include hidden files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;cd&lt;/code&gt; – Change directory&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd dev   # move into the dev folder
cd ..    # go back one level
cd ~     # go to your home directory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;mkdir&lt;/code&gt; – Create folders&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir dataengineering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is very common when organizing ETL jobs.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;touch&lt;/code&gt; – Create files&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;touch extract_data.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Creates an empty file, perfect for scripts.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cat&lt;/code&gt; – View file content&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For large files, page through them with &lt;code&gt;less&lt;/code&gt; instead of &lt;code&gt;cat&lt;/code&gt;. Inside the pager:&lt;br&gt;
&lt;code&gt;q&lt;/code&gt; → quit&lt;br&gt;
&lt;code&gt;/error&lt;/code&gt; → search for “error”&lt;br&gt;
This is extremely useful for debugging pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Editing Files with Nano
&lt;/h2&gt;

&lt;p&gt;Nano is simple and safe for beginners. Open a file with Nano:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano extract_data.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Write:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Extracting data...")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Nano shortcuts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CTRL + O → Save
Enter → Confirm
CTRL + X → Exit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nano shows these shortcuts at the bottom of the screen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Editing Files with Vi
&lt;/h2&gt;

&lt;p&gt;Vi (or Vim) is everywhere on Linux servers. Open a file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vi transform.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Vi has modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal mode – navigation&lt;/li&gt;
&lt;li&gt;Insert mode – typing&lt;/li&gt;
&lt;li&gt;Command mode – saving &amp;amp; quitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To start typing, press &lt;code&gt;i&lt;/code&gt;, then type:&lt;/p&gt;

&lt;p&gt;SELECT * FROM users;&lt;/p&gt;

&lt;p&gt;Save and exit: press &lt;code&gt;ESC&lt;/code&gt;, then type &lt;code&gt;:wq&lt;/code&gt; and press Enter.&lt;br&gt;
Exit without saving: &lt;code&gt;:q!&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Example: Creating a Data Script
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir etl
cd etl
touch extract.sh
nano extract.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
echo "Starting data extraction..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make it executable:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;chmod +x extract.sh&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Run it:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;./extract.sh&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Starting data extraction...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Permissions
&lt;/h2&gt;

&lt;p&gt;Linux controls who can read, write, or execute files.&lt;br&gt;
Check permissions:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ls -l&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;-rwxr-xr-- extract.sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Owner can read/write/execute&lt;/li&gt;
&lt;li&gt;Group can read/execute&lt;/li&gt;
&lt;li&gt;Others can read&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters a lot on shared servers.&lt;/p&gt;
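&lt;p&gt;A small sketch of changing permissions with &lt;code&gt;chmod&lt;/code&gt; (the file name here is illustrative):&lt;/p&gt;

```shell
touch report.sh        # new file, not yet executable
chmod 754 report.sh    # owner: rwx, group: r-x, others: r--
ls -l report.sh        # the first column now reads -rwxr-xr--
```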

&lt;h2&gt;
  
  
  Where You’ll Use These Skills as a Data Engineer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;SSH into cloud servers&lt;/li&gt;
&lt;li&gt;Edit Airflow DAGs&lt;/li&gt;
&lt;li&gt;Inspect Spark logs&lt;/li&gt;
&lt;li&gt;Manage cron jobs&lt;/li&gt;
&lt;li&gt;Automate daily pipelines&lt;/li&gt;
&lt;li&gt;Debug production failures&lt;/li&gt;
&lt;/ul&gt;
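&lt;p&gt;For instance, a daily pipeline is often just a cron entry (the paths here are hypothetical; edit your own crontab with &lt;code&gt;crontab -e&lt;/code&gt;):&lt;/p&gt;

```
# minute hour day month weekday  command
0 2 * * * /home/rose/etl/extract.sh &gt;&gt; /home/rose/etl/extract.log 2&gt;&amp;1
```

This runs &lt;code&gt;extract.sh&lt;/code&gt; every day at 02:00 and appends its output to a log file.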

&lt;p&gt;Linux is the operating system of data infrastructure.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>data</category>
      <category>programming</category>
    </item>
    <item>
      <title>How to Set Up GPG Keys for an Existing GitHub Account (Step-by-Step)</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Sun, 18 Jan 2026 11:40:35 +0000</pubDate>
      <link>https://dev.to/rose1845/how-to-set-up-gpg-keys-for-an-existing-github-account-step-by-step-2fj7</link>
      <guid>https://dev.to/rose1845/how-to-set-up-gpg-keys-for-an-existing-github-account-step-by-step-2fj7</guid>
      <description>&lt;p&gt;When working with Git and GitHub, you may notice a “Verified” badge on some commits. This badge means the commit was cryptographically signed, proving it truly came from the author and wasn’t tampered with.&lt;br&gt;
In this article, you’ll learn how to set up GPG keys for an existing GitHub account and start signing your commits.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Is a GPG Key and Why Does It Matter?
&lt;/h2&gt;

&lt;p&gt;GPG (GNU Privacy Guard) is a tool used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Digitally sign commits and tags&lt;/li&gt;
&lt;li&gt;Prove authorship and integrity&lt;/li&gt;
&lt;li&gt;Improve security and trust in collaborative projects&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Benefits of signing commits:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Your commits show as Verified on GitHub&lt;/li&gt;
&lt;li&gt;Protects against commit spoofing&lt;/li&gt;
&lt;li&gt;Builds credibility as a developer&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you begin, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GitHub account&lt;/li&gt;
&lt;li&gt;Git installed&lt;/li&gt;
&lt;li&gt;GPG installed on your system&lt;/li&gt;
&lt;li&gt;Terminal access&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Step 1: Check If GPG Is Installed
&lt;/h2&gt;

&lt;p&gt;Run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpg &lt;span class="nt"&gt;--version&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If GPG is not installed:&lt;br&gt;
Ubuntu / Debian&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;gnupg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;macOS (Homebrew)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;gnupg

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows&lt;/p&gt;

&lt;p&gt;Install Gpg4win from the official site.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Generate a New GPG Key
&lt;/h2&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpg &lt;span class="nt"&gt;--full-generate-key&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When prompted:&lt;br&gt;
Key type: RSA and RSA&lt;/p&gt;

&lt;p&gt;Key size: 4096&lt;/p&gt;

&lt;p&gt;Expiration: Choose what works for you (e.g., 1y or 0 for no expiry)&lt;/p&gt;

&lt;p&gt;Name &amp;amp; Email:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use the same email address as your GitHub account&lt;br&gt;
Passphrase: Use a strong one (don’t forget it)&lt;br&gt;
After completion, your GPG key is created &lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  Step 3: List Your GPG Keys and Copy the Key ID
&lt;/h2&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpg &lt;span class="nt"&gt;--list-secret-keys&lt;/span&gt; &lt;span class="nt"&gt;--keyid-format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;long

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/home/nyaugenya/.gnupg/pubring.kbx
----------------------------------
sec   rsa3072/CBC3C9CAC3450592 2025-12-17 [SC] [expires: 2027-12-17]
      DD88627124BA164FD7D531C8CBC3C9CAC3450592
uid                 [ultimate] nyaugenya (go!!!) &amp;lt;test@gmail.com&amp;gt;
ssb   rsa3072/4DB25F105F5D7F76 2025-12-17 [E] [expires: 2027-12-17]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy the long key ID shown after the algorithm on the &lt;code&gt;sec&lt;/code&gt; line (here &lt;code&gt;CBC3C9CAC3450592&lt;/code&gt; after &lt;code&gt;rsa3072/&lt;/code&gt;), or use the full fingerprint on the line below it. Both work with Git.&lt;br&gt;
Example: DD88627124BA164FD7D531C8CBC3C9CAC3450592&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Export the GPG Public Key
&lt;/h2&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpg &lt;span class="nt"&gt;--armor&lt;/span&gt; &lt;span class="nt"&gt;--export&lt;/span&gt; DD88627124BA164FD7D531C8CBC3C9CAC3450592

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copy everything, including:&lt;br&gt;
-----BEGIN PGP PUBLIC KEY BLOCK-----&lt;br&gt;
...&lt;br&gt;
-----END PGP PUBLIC KEY BLOCK-----&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 5: Add the GPG Key to GitHub
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to GitHub → Settings&lt;/li&gt;
&lt;li&gt;Click SSH and GPG keys&lt;/li&gt;
&lt;li&gt;Under GPG keys, click New GPG key&lt;/li&gt;
&lt;li&gt;Paste the copied key&lt;/li&gt;
&lt;li&gt;Click Add GPG key&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;GitHub now knows your signing key.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 6: Tell Git to Use Your GPG Key
&lt;/h2&gt;

&lt;p&gt;Configure Git with your key ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.signingkey DD88627124BA164FD7D531C8CBC3C9CAC3450592
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable commit signing by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git config &lt;span class="nt"&gt;--global&lt;/span&gt; commit.gpgsign &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure your Git email matches GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git config &lt;span class="nt"&gt;--global&lt;/span&gt; user.email &lt;span class="s2"&gt;"test@gmail.com"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tell Git to automatically GPG-sign all tags you create:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git config &lt;span class="nt"&gt;--global&lt;/span&gt; tag.gpgSign &lt;span class="nb"&gt;true&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: (Linux) Fix “GPG Failed to Sign the Data” Error
&lt;/h2&gt;

&lt;p&gt;If you see this error, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GPG_TTY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;tty&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make it permanent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export GPG_TTY=$(tty)'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 8: Make a Signed Commit
&lt;/h2&gt;

&lt;p&gt;Create a commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"My first signed commit"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or explicitly sign:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git commit &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Signed commit"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push your changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
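&lt;p&gt;To double-check locally that your latest commit carries a signature, standard Git can show it:&lt;/p&gt;

```shell
git log --show-signature -1   # prints GPG signature details for the most recent commit
```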



</description>
      <category>git</category>
      <category>github</category>
      <category>dataengineering</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Git for Beginners: What It Is, Why It Matters, and How to Use It with GitHub</title>
      <dc:creator>Rose1845</dc:creator>
      <pubDate>Sat, 17 Jan 2026 20:32:59 +0000</pubDate>
      <link>https://dev.to/rose1845/git-for-beginners-what-it-is-why-it-matters-and-how-to-use-it-with-github-5h6b</link>
      <guid>https://dev.to/rose1845/git-for-beginners-what-it-is-why-it-matters-and-how-to-use-it-with-github-5h6b</guid>
      <description>&lt;p&gt;All of this revolves around Git and version control.&lt;br&gt;
This article will walk you through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What Git is and why version control is important&lt;/li&gt;
&lt;li&gt;How to push code to GitHub&lt;/li&gt;
&lt;li&gt;How to pull code from GitHub&lt;/li&gt;
&lt;li&gt;How to track changes using Git&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Is Git?
&lt;/h2&gt;

&lt;p&gt;Git is a version control system. In simple terms, Git helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep track of changes in your code&lt;/li&gt;
&lt;li&gt;Go back to previous versions if something breaks&lt;/li&gt;
&lt;li&gt;Work with other developers without overwriting each other’s work&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Is Version Control Important?
&lt;/h2&gt;

&lt;p&gt;Without version control, it’s easy to lose work, overwrite a teammate’s changes, or drown in copies like &lt;code&gt;final_v2_REAL.py&lt;/code&gt;. Version control solves these problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of Git &amp;amp; Version Control
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;History tracking – See who changed what and when&lt;/li&gt;
&lt;li&gt;Backup – Your code is safely stored remotely&lt;/li&gt;
&lt;li&gt;Collaboration – Multiple people can work on the same project&lt;/li&gt;
&lt;li&gt;Undo mistakes – Easily revert to a previous version&lt;/li&gt;
&lt;li&gt;Branching – Work on new features without breaking the main code&lt;/li&gt;
&lt;/ul&gt;
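&lt;p&gt;As a quick taste of branching, these are standard Git commands (the branch name is made up):&lt;/p&gt;

```shell
git checkout -b new-feature   # create and switch to a new branch
# ...edit files and commit as usual...
git checkout main             # switch back; main is untouched
```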

&lt;h2&gt;
  
  
  What Is GitHub?
&lt;/h2&gt;

&lt;p&gt;GitHub is a platform that hosts Git repositories online.&lt;br&gt;
Git is the tool (installed on your computer).&lt;br&gt;
GitHub is the service that stores your Git projects on the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Git
&lt;/h2&gt;

&lt;p&gt;Before using Git, install it:&lt;/p&gt;

&lt;p&gt;Windows / macOS / Linux:&lt;br&gt;
(&lt;a href="https://git-scm.com/install/" rel="noopener noreferrer"&gt;https://git-scm.com/install/&lt;/a&gt;)&lt;br&gt;
Verify installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Basic Git Setup (One-Time)
&lt;/h2&gt;

&lt;p&gt;Tell Git who you are:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git config --global user.name "Rose1845"                     # replace with your GitHub username
git config --global user.email "odhiamborose466@gmail.com"   # replace with your own email
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  How to Push Code to GitHub
&lt;/h2&gt;

&lt;p&gt;Step 1: Create a Repository on GitHub&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com" rel="noopener noreferrer"&gt;Go to https://github.com&lt;br&gt;
&lt;/a&gt;&lt;br&gt;
Click New Repository&lt;/p&gt;

&lt;p&gt;Give it a name&lt;/p&gt;

&lt;p&gt;Click Create repository&lt;/p&gt;

&lt;p&gt;Step 2: Initialize Git Locally&lt;/p&gt;

&lt;p&gt;Inside your project folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git init

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a hidden .git folder that Git uses to track changes.&lt;/p&gt;
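&lt;p&gt;You can see the hidden folder yourself (the project name below is just an example):&lt;/p&gt;

```shell
git init demo
ls -a demo   # .git appears alongside your files
```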

&lt;p&gt;Step 3: Track Files&lt;/p&gt;

&lt;p&gt;Check file status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add files to Git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 4: Commit Changes&lt;/p&gt;

&lt;p&gt;A commit is a snapshot of your code at a specific point in time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git commit -m "Initial commit"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 5: Connect to GitHub&lt;br&gt;
Copy the repository URL from GitHub, then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git remote add origin https://github.com/username/repository-name.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 6: Push to GitHub&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git branch -M main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push -u origin main

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your code is now on GitHub!&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Pull Code from GitHub
&lt;/h2&gt;

&lt;p&gt;Pulling means downloading the latest changes from GitHub.&lt;/p&gt;

&lt;p&gt;Clone a Repository (First Time):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/username/repository-name.git

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a local copy on your machine.&lt;/p&gt;

&lt;p&gt;Pull Latest Changes&lt;br&gt;
If you already cloned the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git pull origin main(name of your branch in this case it's main)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetches new changes&lt;/li&gt;
&lt;li&gt;Merges them into your local code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Track Changes Using Git
&lt;/h2&gt;

&lt;p&gt;Git gives you powerful tools to see what’s happening in your project.&lt;/p&gt;

&lt;p&gt;Check file status:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modified files&lt;/li&gt;
&lt;li&gt;Staged files&lt;/li&gt;
&lt;li&gt;Untracked files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;See Changes in a File&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shows what changed before committing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;View Commit History&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This displays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit IDs&lt;/li&gt;
&lt;li&gt;Authors&lt;/li&gt;
&lt;li&gt;Dates&lt;/li&gt;
&lt;li&gt;Messages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Short Commit History&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git log --oneline

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Great for a quick overview.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A Typical Git Workflow
&lt;/h2&gt;

&lt;p&gt;Most projects follow this cycle.&lt;/p&gt;

&lt;p&gt;Make changes to code, then check status:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git status

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Add changes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Commit changes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git commit -m "Describe what you changed"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Push to GitHub&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>softwareengineering</category>
      <category>dataengineering</category>
      <category>git</category>
      <category>github</category>
    </item>
  </channel>
</rss>
