Understanding Git and GitHub for beginners

Caleb Kilemba — Thu, 29 Jan 2026 14:01:49 +0000

Before diving into modern software development, it’s important to understand the tools that make collaboration, version control, and code management possible. Whether you are just starting your programming journey or looking to understand how developers work together on real-world projects, Git and GitHub are foundational skills you cannot ignore. This article breaks down these concepts in a simple, practical, and beginner-friendly way, helping you build a strong base before moving into hands-on usage.

What is Git

Git is a version control system that runs on your computer and tracks every change made to a project. It is used mostly by software developers, because it makes it easy to trace when a change or an error was introduced. It also lets multiple people work on the same project at the same time by using branches, which keeps their changes from overwriting each other. Because the full history is stored locally, Git can also serve as a local record of earlier versions of a project.

What is GitHub?

GitHub is a web-based platform that hosts Git repositories and helps developers collaborate and manage projects with ease. It stores your code online, serves as a portfolio for your coding projects, and allows developers from all around the world to contribute to your project.

Hope we are still together. Let's continue with the learning.

GitHub	Git
Cloud-based hosting platform for Git repositories	Version control system
Requires an internet connection to access repositories	Operates locally on your machine
Provides collaboration and project management tools	Tracks changes in code

Prerequisites for Using Git

Before you can start using Git on your machine, you need to ensure it is properly installed based on your operating system:

Windows: Download and install Git for Windows, which includes Git Bash. Git Bash provides a Unix-style command line, which is what most Git tutorials assume.
macOS / Linux: Open your Terminal. Git is often pre-installed, but if it isn't, you can install it using your system's package manager (e.g., brew install git on macOS with Homebrew, or sudo apt install git on Debian or Ubuntu).

To confirm that Git is installed, run:

git --version

[Wanna know about kenya? Here is the description of Kenya](https://www.google.com/search?gs_ssp=eJzj4tTP1TcwtCxKNzVg9GLNTs2rTAQALWAFHw&q=kenya&rlz=1C1GCEA_enRO1099KE1196&oq=kenya&gs_lcrp=EgZjaHJvbWUqCggBEC4YsQMYgAQyBggAEEUYOTIKCAEQLhixAxiABDIMCAIQIxgnGIAEGIoFMhAIAxAuGIMBGLEDGIAEGIoFMg0IBBAAGIMBGLEDGIAEMg0IBRAAGIMBGLEDGIAEMhAIBhAuGMcBGLEDGNEDGIAEMg0IBxAAGIMBGLEDGIAEMhMICBAuGIMBGMcBGLEDGNEDGIAEMhAICRAuGMcBGLEDGNEDGIAE0gEMMTU2MjE5NmowajE1qAIIsAIB8QUzIP-8ONwPJQ&sourceid=chrome&ie=UTF-8

trnteurewywm

Caleb Kilemba — Thu, 15 Jan 2026 19:24:00 +0000

eg4yjrum

My First Article in the Data Engineering Series

Caleb Kilemba — Thu, 15 Jan 2026 19:22:55 +0000

title: Part 2 – Designing Data Pipelines
published: true
series: Data Engineering from Zero to Production

tags: dataengineering, etl, pipelines

Heading 1

heading 2

This is the first Markdown class that we have done today.
k = 5

jskdjdld
hdid

kenya
Uganda
Nigeria

Dev.to

kenya
kenya

15 foundational concepts on Data Engineering

Caleb Kilemba — Tue, 12 Aug 2025 23:14:39 +0000

Introduction

Data engineering is the backbone of modern analytics, AI, and business intelligence. It involves designing, building, and maintaining the systems that store, process, and make data accessible for analysis. In this article, I will explain the core foundational concepts every aspiring or practicing data engineer should master.

Data Modeling

Data modeling is the process of designing how data is structured and related. It provides a blueprint for databases, ensuring that data is stored logically and efficiently. A well-designed model reduces redundancy, improves query performance, and ensures data integrity.
Data modeling has three core levels: the conceptual model, which covers entities and relationships; the logical model, which covers tables, columns, and data types; and the physical model, which covers implementation details such as indexes and partitions.

Example: an ER (Entity-Relationship) diagram showing customers, orders, and products.
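
The customers, orders, and products example can be sketched as a small schema. This is only an illustration using Python's built-in sqlite3 module; the table names, column names, and sample rows are assumptions made for the example, not a recommended design.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Logical model: tables, columns, and data types.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    price      REAL NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    quantity    INTEGER NOT NULL
);
-- Physical model detail: an index to speed up lookups by customer.
CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

# Hypothetical sample rows.
conn.execute("INSERT INTO customers VALUES (1, 'Amina')")
conn.execute("INSERT INTO products VALUES (10, 'Notebook', 2.5)")
conn.execute("INSERT INTO orders VALUES (100, 1, 10, 4)")

# The relationships let one query join all three entities.
row = conn.execute("""
    SELECT c.name, p.name, o.quantity * p.price
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products p ON p.product_id = o.product_id
""").fetchone()
print(row)
```

Storing the customer and product once and referring to them by key is what reduces redundancy, and the foreign keys are what protect integrity: an order that points at a customer who does not exist is rejected.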

Data Warehousing

A data warehouse is a repository that stores integrated data from multiple sources for analysis and reporting. It plays a vital role in business intelligence and decision-making processes.
Characteristics of a data warehouse:
Subject-oriented --> it is organized around key business subjects.
Integrated --> it combines data from different sources with consistent naming and formatting.
Non-volatile --> once data is entered, it is read-only and is not changed.
Time-variant --> it maintains historical data for trend analysis.

--> Data sources for a data warehouse include operational systems, flat files, and external data.
--> The ETL process is the architectural component of a data warehouse that prepares data before it is stored.
There are three types of data warehouses:

Enterprise Data Warehouse --> this is comprehensive and organization-wide.
Data Mart --> this is a smaller, department-specific subset.
Operational Data Store --> this holds near-real-time data used for operational reporting.

ETL (Extract, Transform, Load)

ETL is the process of extracting data from sources, transforming it into a usable format, and loading it into storage.
The ETL process is foundational in data engineering because it ensures clean and reliable data for analytics. In cloud warehouses, ELT (Extract, Load, Transform) is common: raw data is loaded first and transformed inside the warehouse. There are also streaming variations of ETL for real-time pipelines.
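
The three ETL steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the raw CSV text, the column names, and the cleaning rules are all assumptions made for the example, and SQLite stands in for the destination storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,carol,7.25
"""

def extract(text):
    """Extract: read the raw rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, tidy names, cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete rows
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().title(),
            float(row["amount"]),
        ))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

In an ELT variant, the raw rows would be loaded as they are and the cleaning would run afterwards as SQL inside the warehouse.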

Data Pipelines

A data pipeline is a system that automates the movement, transformation, and processing of data from various sources to a destination such as a data warehouse. Data pipelines ensure that data flows efficiently and reliably through different stages, enabling analytics and machine learning.
Types of data pipelines:
Batch pipelines --> these process data in scheduled chunks, e.g. daily updates. A good example is loading sales data into a warehouse every hour.
Streaming pipelines --> these process data in real time as it arrives, e.g. transaction data.
ETL/ELT pipelines --> these transform data and load it into a destination (ETL), or load it first and transform it there (ELT).
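
The difference between batch and streaming processing can be shown with plain Python. This is a simplified sketch: the events list is made-up data, and a real streaming pipeline would read from a message queue or log rather than an in-memory list.

```python
from itertools import islice

# Hypothetical sales events; a real source would be a database, queue, or log.
events = [
    {"store": "A", "amount": 5},
    {"store": "B", "amount": 7},
    {"store": "A", "amount": 3},
    {"store": "B", "amount": 1},
    {"store": "A", "amount": 4},
]

def batch_pipeline(source, batch_size):
    """Batch: collect a fixed-size chunk, then process the whole chunk at once."""
    iterator = iter(source)
    totals = []
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            break
        totals.append(sum(event["amount"] for event in chunk))
    return totals

def streaming_pipeline(source):
    """Streaming: update the result as each event arrives."""
    running_total = 0
    for event in source:
        running_total += event["amount"]
        yield running_total

batch_totals = batch_pipeline(events, batch_size=2)
stream_totals = list(streaming_pipeline(events))
print(batch_totals)
print(stream_totals)
```

The batch version produces one result per chunk, while the streaming version has an up-to-date result after every single event.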

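Directed Acyclic Graph (DAG)

A DAG is a graph of tasks in which each edge points from one task to a task that depends on it, and no path ever loops back to an earlier task. Pipeline tools use DAGs to decide the order in which steps run.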

Data Formats and Serialization

Data doesn’t just exist in thin air — it’s stored and transmitted in specific formats, and the choice of format has big consequences.
Common formats:
CSV (Comma-Separated Values) – A flat text file where each line represents a row and commas separate values. It’s easy for humans to read and for most systems to process, but lacks advanced features like data types or compression. Best for simple datasets and compatibility across tools.
JSON (JavaScript Object Notation) – Stores data in key-value pairs with a hierarchical structure. Flexible and ideal for web applications or APIs, but can be verbose, leading to larger file sizes.
Parquet / ORC – Columnar storage formats optimized for analytics. Instead of storing data row-by-row, they store it column-by-column, enabling efficient compression and faster queries for analytical workloads.
Avro / Protobuf – Schema-based formats that are compact and designed for efficient serialization (turning data into bytes for transmission). They enforce structure and are ideal for streaming pipelines or cross-language communication.
Why it matters:
Choosing the right format affects:
Performance – Columnar formats can make analytical queries much faster.
Storage cost – Compression in Parquet/ORC can significantly reduce storage usage.
Interoperability – Some formats work better for system integration (JSON) while others are better for internal analytics (Parquet).
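
The trade-off between CSV and JSON can be seen with Python's standard library alone. This is a small sketch with made-up records; Parquet, ORC, Avro, and Protobuf are left out because they need third-party libraries.

```python
import csv
import io
import json

# Hypothetical records with an integer, a string, and a boolean field.
records = [
    {"id": 1, "name": "Amina", "active": True},
    {"id": 2, "name": "Brian", "active": False},
]

# Serialize to CSV, then read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "active"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Serialize to JSON, then read it back.
json_text = json.dumps(records)
json_rows = json.loads(json_text)

# CSV has no data types: every value comes back as a string.
print(csv_rows[0])
# JSON keeps numbers and booleans, at the cost of repeating every key.
print(json_rows[0])
print(len(csv_text), len(json_text))
```

For these records the CSV text is shorter, but the reader has to know that "1" is a number and "True" is a boolean. JSON carries that information itself, which is part of why it is more verbose.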

Data Quality Management

Data quality is about ensuring that the data you’re using is fit for purpose. Bad data = bad decisions.

Key dimensions:

Completeness – No missing required values.
Consistency – The same data is represented in the same way across datasets.
Accuracy – Data reflects the real-world truth it represents.
Timeliness – Data is up-to-date when needed.

Why it matters:

If your analytics are based on incomplete, inconsistent, or outdated data, the resulting insights could mislead business decisions, waste resources, or even cause compliance issues.
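
Some of these dimensions can be checked automatically. The sketch below is only an illustration: the rows, the list of valid country codes, and the 30-day freshness limit are assumptions made for the example. Accuracy is not checked, because that needs a trusted source to compare against.

```python
from datetime import date

# Hypothetical rows to validate.
rows = [
    {"id": 1, "country": "KE", "updated": date(2025, 8, 12)},
    {"id": 2, "country": None, "updated": date(2025, 8, 12)},
    {"id": 3, "country": "Kenya", "updated": date(2025, 1, 1)},
]

def check_quality(rows, today, max_age_days=30, valid_countries=("KE", "UG", "NG")):
    """Return a list of (row id, failed dimension) pairs."""
    issues = []
    for row in rows:
        if row["country"] is None:
            # Completeness: a required value is missing.
            issues.append((row["id"], "completeness"))
        elif row["country"] not in valid_countries:
            # Consistency: the value does not follow the agreed representation.
            issues.append((row["id"], "consistency"))
        if (today - row["updated"]).days > max_age_days:
            # Timeliness: the row has not been refreshed recently enough.
            issues.append((row["id"], "timeliness"))
    return issues

issues = check_quality(rows, today=date(2025, 8, 13))
print(issues)
```

Checks like these usually run inside the pipeline, so that bad rows are flagged before they reach a report.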

Data Governance

Think of this as the rulebook for data. It defines who can access what, how data is documented, and how it complies with laws.

Key elements:
Metadata management – Keeping a record of what each dataset is, where it came from, and what it contains.
Access control – Using role-based or attribute-based permissions to control who sees what.
Regulatory compliance – Ensuring data handling follows laws like GDPR (privacy) or HIPAA (healthcare).
Why it matters:
Good governance builds trust in data, avoids legal trouble, and makes it easier for teams to collaborate without stepping on each other’s toes.
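
Role-based access control can be sketched as a simple lookup. The roles, dataset names, and actions below are hypothetical; real platforms store these rules in their own permission systems rather than in a Python dictionary.

```python
# Hypothetical mapping: role -> dataset -> allowed actions.
ROLE_PERMISSIONS = {
    "analyst": {
        "sales_summary": {"read"},
    },
    "engineer": {
        "sales_summary": {"read", "write"},
        "raw_events": {"read", "write"},
    },
}

def is_allowed(role, dataset, action):
    """Return True only if the role was explicitly granted the action."""
    return action in ROLE_PERMISSIONS.get(role, {}).get(dataset, set())

print(is_allowed("analyst", "sales_summary", "read"))
print(is_allowed("analyst", "raw_events", "read"))
```

Unknown roles and datasets fall through to an empty set, so access is denied by default unless a rule grants it.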

Scalability and Performance Optimization

When your dataset grows from gigabytes to terabytes, your systems need to keep up without slowing down.
Techniques:
Sharding and partitioning – Splitting data across multiple databases or files to reduce load on any single resource.
Caching – Storing frequent query results in fast-access memory instead of recalculating them.
Parallel processing – Breaking tasks into smaller chunks to be processed simultaneously (e.g., Spark, Dask).
Why it matters:
Without optimization, systems become bottlenecks, leading to delays, timeouts, and higher costs.
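
The three techniques can each be shown in miniature with the standard library. This is a rough sketch under simple assumptions: the key format, the partition count, and the toy workload are made up, and threads stand in for the cluster-level parallelism that tools like Spark or Dask provide.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

def partition_for(key, num_partitions):
    """Partitioning: map a key to a partition with a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Counter used only to show how often the expensive function really runs.
calls = {"count": 0}

@lru_cache(maxsize=128)
def customer_total(customer_id):
    """Caching: repeat calls with the same argument are served from memory."""
    calls["count"] += 1
    return sum(range(customer_id * 1000))  # stand-in for a slow query

customer_total(3)
customer_total(3)  # second call does not recompute

# Parallel processing: split the work into chunks and process them concurrently.
chunks = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor(max_workers=3) as pool:
    chunk_totals = list(pool.map(sum, chunks))

print(partition_for("customer-42", 4), calls["count"], chunk_totals)
```

In Python, threads help mostly with I/O-bound work such as network calls. CPU-heavy jobs usually need separate processes or a distributed engine.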

Cloud Data Platforms

Cloud providers now offer fully managed data warehouses that handle scaling, backups, and performance tuning for you.

Examples:
Amazon Redshift – Great for heavy analytics workloads on AWS.
Google BigQuery – Serverless, pay-per-query, and fast.
Snowflake – Popular for its separation of storage and compute, allowing elastic scaling.
Azure Synapse Analytics – Integrates tightly with Microsoft’s ecosystem.
Why it matters:
They remove much of the operational burden, allowing teams to focus on data and analytics rather than infrastructure.

Data Security

Protecting data is non-negotiable — both for legal reasons and to maintain trust.

Practices:
Encryption at rest – Protects stored data.
Encryption in transit – Protects data while it’s moving across networks.
Access control – Restricts data access based on user roles.
Audit logging – Keeps a record of who accessed or modified data.

Why it matters:
A breach can cost millions in fines, damage a company’s reputation, and violate customer trust.
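
Audit logging, for example, can be sketched as a wrapper around any function that touches data. This is a minimal illustration: the function name, the user name, and the in-memory list are assumptions, and a real system would write to tamper-resistant storage instead.

```python
import functools
from datetime import datetime, timezone

# In-memory stand-in for an audit log store.
audit_log = []

def audited(action):
    """Record who performed which action, and when, before running the function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            audit_log.append({
                "user": user,
                "action": action,
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return func(user, *args, **kwargs)
        return wrapper
    return decorator

@audited("read_customer")
def read_customer(user, customer_id):
    # Hypothetical data access; a real version would query a database.
    return {"customer_id": customer_id}

result = read_customer("caleb", 7)
print(audit_log[0]["user"], audit_log[0]["action"])
```

Because the entry is written before the function runs, even a failed access attempt leaves a record.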

Workflow Orchestration

Data pipelines have many moving parts — they must run in the right order, handle failures, and restart if needed.

Tools:
Apache Airflow – One of the most widely used orchestrators, with rich scheduling and monitoring features.
Prefect – More Python-friendly and developer-centric.
Luigi – Lightweight but effective for smaller pipelines.
Why it matters:
Without orchestration, pipelines may break silently, run in the wrong order, or fail without alerting anyone.
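
Running steps in the right order is a graph problem, which is where the Directed Acyclic Graph comes in. The sketch below uses graphlib from the Python standard library (Python 3.9 and later) to work out a valid order. The task names are hypothetical, and this is not how Airflow, Prefect, or Luigi are implemented internally.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A topological sort gives an order that respects every dependency.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# A cycle means no valid order exists, so the graph is rejected.
circular = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(circular).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

This is why the graph must be acyclic: if two tasks wait on each other, neither can ever start.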

Monitoring and Observability

You can’t improve what you can’t measure. Monitoring ensures data systems are healthy and issues are detected early.

Metrics to track:
Data freshness – How recently the data was updated.
Throughput – Amount of data processed over time.
Failure rates – Percentage of failed jobs or queries.
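
These three metrics can be computed from a simple record of job runs. The run history and the current time below are made-up values for illustration; in practice a monitoring tool would collect them.

```python
from datetime import datetime

# Hypothetical job history: when each run finished, rows processed, and outcome.
job_runs = [
    {"finished": datetime(2025, 8, 12, 10, 0), "rows": 1000, "ok": True},
    {"finished": datetime(2025, 8, 12, 11, 0), "rows": 0, "ok": False},
    {"finished": datetime(2025, 8, 12, 12, 0), "rows": 3000, "ok": True},
    {"finished": datetime(2025, 8, 12, 13, 0), "rows": 2000, "ok": True},
]
now = datetime(2025, 8, 12, 14, 30)  # fixed so the example is repeatable

# Data freshness: time since the last successful run.
last_success = max(run["finished"] for run in job_runs if run["ok"])
freshness = now - last_success

# Throughput: rows processed per hour over the observed window.
window_hours = (job_runs[-1]["finished"] - job_runs[0]["finished"]).total_seconds() / 3600
throughput = sum(run["rows"] for run in job_runs) / window_hours

# Failure rate: share of runs that failed.
failure_rate = sum(1 for run in job_runs if not run["ok"]) / len(job_runs)

print(freshness, throughput, failure_rate)
```

An alert is then just a threshold on one of these numbers, for example freshness above two hours.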

Tools:
Prometheus – Open-source metrics collection.
Grafana – Visualization and alerting.
Datadog – Commercial, all-in-one monitoring.

Data Lineage

This is the “data family tree” — where it came from, how it changed, and where it ended up.

Why it matters:
Debugging – If a report looks wrong, you can trace back to the source.
Compliance – Regulations may require knowing exactly where data originated.
Trust – Users can see the full journey from source to dashboard.
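
Tracing a report back to its sources can be sketched as a walk through a lineage map. The dataset names are hypothetical, and the sketch assumes the lineage has no cycles; dedicated lineage tools record far more detail, such as the transformations applied at each step.

```python
# Hypothetical lineage: each dataset maps to the datasets it was built from.
lineage = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}

def upstream_sources(dataset, lineage):
    """Return the original sources, meaning datasets with no recorded parents."""
    parents = lineage.get(dataset, [])
    if not parents:
        return {dataset}
    sources = set()
    for parent in parents:
        sources |= upstream_sources(parent, lineage)
    return sources

sources = upstream_sources("sales_dashboard", lineage)
print(sorted(sources))
```

If the dashboard looks wrong, this walk narrows the search to the raw datasets that actually feed it.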

Conclusion

Mastering these 15 foundational concepts gives a solid grounding in data engineering. Tools may change, but these principles guide the design of efficient, scalable, and trustworthy data systems.

DEV Community: Caleb Kilemba