DEV Community: Bradley Kipkoech

DATA ANALYSIS

Bradley Kipkoech — Wed, 20 May 2026 11:58:54 +0000

What is data analysis?
Data analysis covers data cleaning, analysis, reporting and dashboard development.
It revolves around data quality, data warehouses, dashboarding, documentation and using data to improve decision making.

What is a data warehouse?
A data warehouse is a centralized system where data from different sources is collected, cleaned, organized and stored for reporting, analysis, dashboards and decision making.

What does data quality mean?
It means ensuring data is fit for use. Good quality data should be complete, accurate, consistent, timely, valid and unique. These are the data quality dimensions.
Completeness - is the required data available?
Accuracy - is the data correct?
Consistency - does the data match across systems?
Timeliness - was the data submitted on time?
Validity - is it in the right format?
Uniqueness - are there any duplicates?
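
As a rough illustration, a few of these dimensions can be measured with plain Python. The records and field names (id, email, score) are made up for this sketch, and the email pattern is deliberately simple.

```python
import re

# Hypothetical records; the field names are assumptions for this sketch.
records = [
    {"id": 1, "email": "a@example.com", "score": 72},
    {"id": 2, "email": None, "score": 85},
    {"id": 2, "email": "b@example.com", "score": 85},
    {"id": 3, "email": "not-an-email", "score": 40},
]

# Completeness: share of records where the required field is present.
complete = sum(r["email"] is not None for r in records) / len(records)

# Uniqueness: ids that appear more than once.
seen, duplicates = set(), set()
for r in records:
    if r["id"] in seen:
        duplicates.add(r["id"])
    seen.add(r["id"])

# Validity: emails that are present but do not match a simple pattern.
pattern = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
invalid = [r["email"] for r in records if r["email"] and not pattern.match(r["email"])]

print(complete, duplicates, invalid)
```

Accuracy, consistency and timeliness usually need a second source or a deadline to compare against, so they are left out here.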

How to clean a dataset
First you need to understand the structure of the dataset and the expected fields. Then check for missing values, duplicates, incorrect formats, inconsistent names, outliers, and invalid values (these will be covered in the Excel and Python article).
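
Here is a minimal sketch of a few of those checks in plain Python. The rows and column names are hypothetical, and dropping rows with a missing or badly formatted amount is only one possible policy; in practice you might fill or flag them instead.

```python
# Hypothetical raw rows; names and columns are assumptions for this sketch.
raw = [
    {"name": " Nairobi ", "amount": "100"},
    {"name": "nairobi", "amount": "100"},
    {"name": "Mombasa", "amount": ""},
    {"name": "Kisumu", "amount": "abc"},
]

def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        name = row["name"].strip().title()  # fix inconsistent names
        text = row["amount"].strip()
        if not text.isdigit():              # missing or incorrectly formatted value
            continue
        key = (name, int(text))
        if key in seen:                     # drop duplicates
            continue
        seen.add(key)
        cleaned.append({"name": name, "amount": int(text)})
    return cleaned

cleaned_rows = clean(raw)
print(cleaned_rows)
```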

What makes a good dashboard?
A good dashboard should be simple, accurate, interactive, and action-oriented. It should show the most important indicators clearly, allow users to filter by relevant categories, and help them identify areas that need attention.

How do you develop a dashboard?
Start by understanding the users' information needs and key indicators they want to monitor. Then prepare and clean the data, model relationships between tables, create measures where needed, and design visuals that clearly communicate performance. After building the dashboard, validate the numbers against the source data and collect feedback from the users.

What is a data dictionary?
It is documentation that explains the fields in a dataset. It usually includes the column name, description, data type, allowed values, source, and business rules. It helps users understand and interpret data consistently.
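
A data dictionary can also be kept in a machine-readable form so that rows can be checked against it. This is only a sketch: the "students" fields and the check_row helper are hypothetical, and source and business rules are left out to keep it short.

```python
# A hypothetical data dictionary for a small "students" dataset.
data_dictionary = {
    "student_id": {"description": "Unique student identifier", "type": int, "allowed": None},
    "gender": {"description": "Reported gender", "type": str, "allowed": {"F", "M"}},
    "score": {"description": "Exam score out of 100", "type": int, "allowed": range(0, 101)},
}

def check_row(row):
    """Return a list of problems found in one row, based on the dictionary."""
    problems = []
    for field, rule in data_dictionary.items():
        value = row.get(field)
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
        elif rule["allowed"] is not None and value not in rule["allowed"]:
            problems.append(f"{field}: value {value!r} not allowed")
    return problems

print(check_row({"student_id": 1, "gender": "F", "score": 90}))
```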

SQL FOR DATA ANALYSIS

SQL is a language used to communicate with a database.
GOAL
To be able to explain and use SQL to extract, summarize, join, and validate data from a database.

SELECT
A clause used to select either every column or specific columns from a table.
e.g. SELECT *
FROM table_name; -- this selects everything
SELECT
column1, column2
FROM table_name; -- this selects specific columns

WHERE
This filters rows so that only those matching a condition are returned.
e.g. SELECT *
FROM table_name
WHERE column1<80;

ORDER BY
It sorts rows in either ascending or descending order.
e.g. SELECT
column1, column2
FROM table_name
ORDER BY column1 ASC; -- use DESC for descending order

GROUP BY
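It groups rows that share the same value in a column, usually so that an aggregate can be computed per group. Every selected column should either appear in GROUP BY or be wrapped in an aggregate function.
e.g. SELECT column1, COUNT(*)
FROM table_name
GROUP BY column1;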

HAVING
It filters groups after grouping, usually on an aggregate value (WHERE filters rows before grouping).
e.g. SELECT column1, SUM(column2)
FROM table_name
GROUP BY column1
HAVING SUM(column2) < 80;

AGGREGATE FUNCTIONS
SUM - adds up a selected column.
COUNT(*) - counts the number of rows.
AVG - finds the average of a selected column.
GROUP BY - groups rows so that an aggregate is computed per group (a clause used with these functions, not a function itself).
HAVING - filters groups after grouping (also a clause).
MIN - finds the minimum value in a column.
MAX - finds the maximum value in a column.
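
To see the clauses and aggregate functions above working together, here is a small sketch that runs one query against an in-memory SQLite database through Python's sqlite3 module. The scores table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("Amina", "math", 90), ("Amina", "art", 70),
     ("Brian", "math", 60), ("Brian", "art", 50),
     ("Carol", "math", 85)],
)

query = """
    SELECT student, COUNT(*) AS subjects, AVG(score) AS avg_score
    FROM scores
    WHERE score >= 55            -- filter rows before grouping
    GROUP BY student             -- one result row per student
    HAVING AVG(score) >= 70      -- filter groups after aggregation
    ORDER BY avg_score DESC;
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Brian's art score is removed by WHERE, and his remaining average of 60 is then removed by HAVING, so only Carol and Amina are returned.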

JOINS
They help us connect tables.
For instance, you might have three related tables and need columns from more than one of them.
Types of joins
Inner join
Returns only rows that have matching values in both tables.
e.g. SELECT table_name1.column1
FROM table_name1
INNER JOIN table_name2
ON table_name1.column1 = table_name2.column1;

Left join
It returns all the records from the left table and attaches matching records from the right table. Where there is no match, the right table's columns are NULL.
e.g. SELECT column1, column2, column3
FROM table_name1
LEFT JOIN table_name2
ON table_name1.id = table_name2.id;

Right join
Returns all the rows from the right table and only matching rows from the left table.
e.g. SELECT column1
FROM table_name1
RIGHT JOIN table_name2
ON table_name1.column = table_name2.column;

Full join
Returns all rows from both tables. Matching rows are combined, and rows without a match on the other side are still returned, with NULL in the missing columns.
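
A small sketch of the difference between an inner and a left join, again using an in-memory SQLite database from Python. The customers and orders tables are made up. RIGHT and FULL joins are left out because SQLite only supports them from version 3.39.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian');
    INSERT INTO orders VALUES (10, 1, 250), (11, 3, 400);
""")

# Inner join: only customers that have a matching order.
inner_rows = conn.execute("""
    SELECT customers.name, orders.total
    FROM customers
    INNER JOIN orders ON customers.id = orders.customer_id
    ORDER BY customers.id;
""").fetchall()

# Left join: every customer, with NULL (None in Python) when there is no order.
left_rows = conn.execute("""
    SELECT customers.name, orders.total
    FROM customers
    LEFT JOIN orders ON customers.id = orders.customer_id
    ORDER BY customers.id;
""").fetchall()

print(inner_rows)
print(left_rows)
```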

DOCKER

Bradley Kipkoech — Tue, 26 Aug 2025 05:56:26 +0000

DOCKER
We first need to understand what Docker is.
Docker is an open platform that enables developers to build, ship and run applications within isolated, lightweight environments called containers. It simplifies deploying and managing applications by packaging an application and all its dependencies (libraries, system tools, code, runtime) into a single portable unit.

Key concepts
Containers
These are standalone, executable packages that contain everything an application needs to run. They share the host operating system's kernel, which makes them more lightweight and efficient than virtual machines.

Images
These are read-only templates used to create containers. They act as blueprints, defining the application's environment and dependencies.

Dockerfile
It is a text file containing instructions for building a Docker image. It specifies the base image, dependencies and commands needed to set up the application's environment.

Docker engine
This is the core component of Docker, responsible for building and running containers. It includes a server (daemon), APIs and a command-line interface.

Benefits of Docker
portability
efficiency
isolation
scalability
simplified deployment

DE CONCEPTS

Bradley Kipkoech — Tue, 26 Aug 2025 05:44:21 +0000

Columnar vs Row-based storage
Row-based storage stores entire records together, making it efficient for transactional workloads that frequently access complete rows. It is optimal for OLTP, where you typically need all columns of specific records.
Columnar storage groups data by columns, enabling efficient compression and fast analytical queries that only access specific columns. It is ideal for OLAP workloads, data warehousing, and scenarios with selective column access patterns.
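
The two layouts can be pictured with ordinary Python structures. This is a toy sketch with invented records, not how a real storage engine lays out bytes on disk.

```python
# The same three hypothetical records in two layouts.
row_store = [
    {"id": 1, "city": "Nairobi", "amount": 100},
    {"id": 2, "city": "Mombasa", "amount": 250},
    {"id": 3, "city": "Kisumu", "amount": 75},
]
column_store = {
    "id": [1, 2, 3],
    "city": ["Nairobi", "Mombasa", "Kisumu"],
    "amount": [100, 250, 75],
}

# OLTP-style access: fetch one complete record (natural in the row layout).
record = row_store[1]

# OLAP-style access: aggregate one column (only that column is touched).
total = sum(column_store["amount"])

print(record, total)
```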

Partitioning
Partitioning divides large datasets into smaller, manageable segments based on specific criteria like data ranges, geographic regions or hash values. This improves query performance, enables parallel processing and simplifies data management tasks like archiving and backup.
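
Hash partitioning, for instance, can be sketched in a few lines. The function names are hypothetical, and CRC32 is used here only because it gives the same result on every run, unlike Python's built-in hash() for strings.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_rows(rows, key_field, num_partitions):
    """Group rows into partitions based on the hash of one field."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(row[key_field], num_partitions)].append(row)
    return dict(partitions)

rows = [{"region": "east"}, {"region": "west"}, {"region": "east"}]
parts = partition_rows(rows, "region", num_partitions=4)
print(parts)
```

Rows with the same key always land in the same partition, which is what lets each partition be processed or archived independently.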

CAP theorem
It states that a distributed system can guarantee at most two of three properties: consistency (all nodes see the same data), availability (the system remains operational) and partition tolerance (the system continues to work despite network failures). Modern systems often provide tunable consistency levels, allowing different guarantees for different use cases within the same system.

windowing in streaming
Windowing divides continuous data streams into finite chunks for processing. Tumbling windows are fixed-size, non-overlapping time intervals. Sliding windows overlap and move continuously. Session windows group events based on activity periods, with gaps indicating session boundaries.
Windowing also includes handling late-arriving data, watermarks for determining window completeness, and trigger conditions for window evaluation. Systems like Apache Flink and Kafka Streams provide sophisticated windowing capabilities with configurable lateness and result-updating strategies. Proper windowing enables meaningful aggregations and analysis over unbounded data streams while managing memory usage and computational complexity.
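
A tumbling window is the easiest one to sketch. The version below works on a finite list of (timestamp, payload) pairs, and it ignores watermarks and late data entirely, so treat it as an illustration of the bucketing idea rather than of what Flink or Kafka Streams do internally.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping window.

    events: iterable of (timestamp_seconds, payload) pairs.
    Returns {window_start: count}.
    """
    counts = defaultdict(int)
    for timestamp, _payload in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (10, "c"), (12, "d"), (25, "e")]
window_counts = tumbling_window_counts(events, window_seconds=10)
print(window_counts)
```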

Retry logic & Dead letter queues
Retry logic automatically reattempts failed operations with strategies like exponential backoff, fixed delays, or linear backoff. It handles transient failures and must be implemented carefully to avoid overwhelming systems or creating infinite loops.
Dead letter queues (DLQs) capture messages that cannot be processed after exhausting retry attempts. They prevent message loss, enable failure analysis and allow for manual intervention or alternative processing. Good practices include categorizing errors (transient vs permanent), implementing circuit breakers, adding jitter to prevent thundering herd problems, and monitoring retry patterns to identify systemic issues requiring architectural changes.
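
Putting the two ideas together, a minimal sketch might look like this. The function name, its parameters and the plain list used as the dead letter queue are all assumptions for illustration; a real system would use a proper queue and would treat permanent errors differently from transient ones.

```python
import random
import time

def process_with_retry(message, handler, dead_letters, max_attempts=3,
                       base_delay=0.01, sleep=time.sleep):
    """Try handler(message), backing off exponentially with jitter between attempts.

    After max_attempts failures the message is appended to dead_letters.
    """
    for attempt in range(max_attempts):
        try:
            return handler(message)
        except Exception as error:  # a real system would separate transient and permanent errors
            last_error = error
            if attempt < max_attempts - 1:
                delay = base_delay * (2 ** attempt)       # exponential backoff
                sleep(delay + random.uniform(0, delay))   # jitter spreads out retries
    dead_letters.append({"message": message, "error": repr(last_error)})
    return None
```

The bounded loop is what prevents infinite retries, and the sleep parameter is only there so the delay can be replaced in tests.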

Backfilling and reprocessing
Backfilling processes historical data to populate new datasets or fill gaps in existing ones. It is common when introducing new features, fixing data quality issues or migrating systems. Backfill jobs often process data in reverse chronological order to provide recent data first.

Reprocessing reruns data pipelines on existing data, typically after fixing bugs, updating business logic, or recovering from failures. It requires careful consideration of downstream impacts and often involves versioning strategies to manage different data generations.
Challenges include managing computational resources, ensuring data consistency during the process, handling schema evolution, and coordinating with downstream consumers to prevent conflicts or inconsistencies.

Time travel & data versioning
Time travel allows querying historical versions of data, enabling analysis of changes over time and recovery from accidental modifications. Systems like Snowflake, BigQuery and Delta Lake provide built-in time travel capabilities with configurable retention periods.
Data versioning tracks changes to datasets and schemas, similar to version control for code. It enables reproducible analytics, A/B testing of data transformations and rollback capabilities when issues are discovered.

Implementation approaches include snapshot-based versioning, log-based change tracking, and copy-on-write mechanisms. These features are crucial for data debugging, compliance auditing, and maintaining data science experiment reproducibility.
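
Snapshot-based versioning is the simplest of these to sketch. The VersionedTable class below is hypothetical and keeps a full copy of the rows per commit, which is easy to reason about but wasteful for large data; the log-based and copy-on-write approaches exist to avoid that cost.

```python
import copy

class VersionedTable:
    """Keeps a full snapshot of the rows for every commit."""

    def __init__(self):
        self._snapshots = [[]]  # version 0 is the empty table

    def commit(self, rows):
        """Store a new version and return its version number."""
        self._snapshots.append(copy.deepcopy(rows))
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest version, or an older one ("time travel")."""
        index = len(self._snapshots) - 1 if version is None else version
        return copy.deepcopy(self._snapshots[index])

table = VersionedTable()
v1 = table.commit([{"id": 1, "status": "new"}])
v2 = table.commit([{"id": 1, "status": "shipped"}])
print(table.read(), table.read(v1))
```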

Distributed processing concepts
Distributed processing enables handling large-scale data by spreading computation across multiple machines. Key concepts include data locality (processing data where it's stored), fault tolerance through replication and checkpointing, and coordination mechanisms for task distribution.
Frameworks like Apache Spark use concepts such as resilient distributed datasets (RDDs), lazy evaluation for optimization, and automatic task scheduling. Map-reduce paradigms break complex operations into parallelizable steps, while more modern frameworks support iterative algorithms and real-time processing.
Challenges include managing data shuffling costs, handling stragglers (slow tasks), ensuring fault tolerance, and optimizing resource utilization across the cluster while maintaining data consistency and system reliability.
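
The map-reduce idea can be shown on a single machine with a word count. In this sketch the chunks stand in for data held on different machines; nothing here is Spark's API, it only mirrors the shape of the computation.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: count words in one chunk, independently of other chunks."""
    return Counter(word for line in lines for word in line.lower().split())

def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right

chunks = [["spark runs tasks"], ["tasks run in parallel", "spark schedules tasks"]]
partials = [map_chunk(chunk) for chunk in chunks]  # could run on separate machines
word_counts = reduce(reduce_counts, partials, Counter())
print(word_counts["tasks"], word_counts["spark"])
```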

DATA WAREHOUSES AND LAKE HOUSES

Bradley Kipkoech — Mon, 11 Aug 2025 23:01:11 +0000

Data warehouses

A data warehouse is a centralized repository designed to store, manage and analyze large volumes of current and historical data from various departments of an organization. Data warehouses are optimized for analytical processing and business intelligence.

characteristics of a warehouse

they can handle massive amounts of data
they not only deal with current data but also historical data
the system architecture prioritizes query performance and data retrieval speed over transactional processing, making complex analytical queries execute efficiently
they consolidate information from multiple data sources
they support business intelligence by making the data available for analysis and visualization
the structure allows the data in it to be easily accessible.

benefits

improved decision making
enhanced performance
data quality and consistency

OLTP VS OLAP

Online transaction processing (OLTP) systems are designed to handle real-time transactional operations that occur in day-to-day business activities, e.g. banking systems.

Online analytical processing (OLAP) systems are optimized for complex analysis, reporting and business intelligence activities.

Key Differences and Implications

Transaction vs. Analysis: OLTP systems excel at processing individual transactions quickly and accurately, while OLAP systems specialize in analyzing patterns across large datasets.

Data Freshness: OLTP systems work with real-time data, whereas OLAP systems typically work with data that may be hours or days old, depending on the ETL schedule.

Concurrency Requirements: OLTP systems must handle many simultaneous users performing transactions, while OLAP systems typically serve fewer concurrent users running complex queries.

Failure Impact: OLTP system downtime directly affects business operations, while OLAP system unavailability impacts reporting and analysis capabilities.

data modelling

Data modelling is the systematic process of creating an abstract representation of data structures, relationships and constraints to support specific business requirements and analytical needs.
Data modelling is like creating an architectural blueprint before constructing a building.

types of models

Conceptual model - provides a high-level, business-oriented view of data requirements without technical implementation details.
Example Elements:

Customer entity related to Order entity
Product entity connected to Category entity
Employee entity associated with Department entity

Logical model - this adds technical detail while remaining independent of any specific database management system.
Additional Elements:

Customer_ID (Integer, Primary Key)
Customer_Name (VARCHAR(100), NOT NULL)
Order_Date (DATE, NOT NULL)

Physical Model - this specifies how data will be stored in a particular database system

dimensional modelling for data warehousing

It is a preferred approach to designing a data warehouse structure because it optimizes for analytical query performance while staying understandable to business users.

Fact tables - serve as the central repository for measurable business metrics and form the core of the dimensional model.

Dimension tables - provide the descriptive context that makes fact table measurements meaningful and analyzable.
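
A tiny star schema makes the split concrete. The fact_sales and dim_product tables below are invented for this sketch and run in an in-memory SQLite database: the fact table holds the numbers, and the dimension table supplies the category used to group them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, quantity INTEGER, amount INTEGER);
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales VALUES (1, 2, 40), (1, 1, 20), (2, 3, 150);
""")

# Measures come from the fact table, descriptive context from the dimension.
sales_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category;
""").fetchall()
print(sales_by_category)
```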

CHANGE DATA CAPTURE

Bradley Kipkoech — Mon, 11 Aug 2025 22:15:32 +0000

Change data capture (CDC) identifies and captures changes made to data in a source system. It enables incremental data synchronization by tracking insert, update and delete operations. Common approaches involve reading database transaction logs, using database triggers, or tracking modification timestamps. It is crucial for maintaining data warehouses and enabling real-time replication.
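
Of the three approaches, timestamp tracking is the easiest to sketch. The capture_changes function and the updated_at field are assumptions for illustration. Note the limitation in the docstring: rows deleted from the source simply disappear, which is one reason log-based CDC is often preferred.

```python
def capture_changes(source_rows, last_sync):
    """Return rows modified after last_sync, plus the new high-water mark.

    Timestamp-based CDC: it sees inserts and updates, but not hard deletes.
    """
    changed = [row for row in source_rows if row["updated_at"] > last_sync]
    new_mark = max((row["updated_at"] for row in changed), default=last_sync)
    return changed, new_mark

source = [
    {"id": 1, "name": "Amina", "updated_at": 100},
    {"id": 2, "name": "Brian", "updated_at": 205},
    {"id": 3, "name": "Carol", "updated_at": 310},
]
changes, mark = capture_changes(source, last_sync=200)
print([row["id"] for row in changes], mark)
```

Storing the returned mark and passing it as last_sync on the next run is what makes the synchronization incremental.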

Idempotency
This ensures that performing the same operation multiple times produces identical results. It prevents duplicate records and maintains data consistency when duplicate messages are processed.
Implementations include using unique keys, upsert operations, deduplication windows in streaming systems and maintaining processing state, e.g. using a combination of timestamp and transaction ID as a composite key ensures that reprocessing the same event doesn't create duplicates.
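
The composite-key idea can be sketched with a dictionary standing in for the target table. The event fields and the apply_event helper are hypothetical.

```python
def apply_event(store, event):
    """Upsert an event keyed by (timestamp, transaction_id)."""
    key = (event["timestamp"], event["transaction_id"])
    store[key] = event["amount"]  # writing the same key again overwrites, never duplicates

store = {}
event = {"timestamp": "2025-08-11T22:15:32", "transaction_id": "tx-1", "amount": 500}
apply_event(store, event)
apply_event(store, event)  # a redelivered duplicate changes nothing
print(len(store))
```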

BATCH VS STREAM INGESTION

Bradley Kipkoech — Mon, 11 Aug 2025 22:02:30 +0000

Batch ingestion processes data in discrete chunks at scheduled intervals. It mostly handles historical data in large volumes and is suitable where latency isn't critical. It typically involves ETL jobs processing transaction logs.
Streaming ingestion processes data as it arrives, enabling real-time analytics and immediate response. It is essential for fraud detection, monitoring systems and live dashboards.

Data Governance Framework and Data Security.

Bradley Kipkoech — Mon, 11 Aug 2025 21:54:47 +0000

Data Governance
It is a framework that defines how an organization manages, protects and derives value from its data assets.

Objectives
Data quality management
Data security and privacy
Data stewardship
Regulatory compliance

Business value of data governance
Risk mitigation
Improved decision making
operational efficiency
Competitive advantage
Cost reduction

ETL pipeline and airflow

Bradley Kipkoech — Wed, 23 Jul 2025 12:13:03 +0000

Extract, transform, load (ETL) is the process of extracting data from various data sources, transforming it into a clean format, and loading it into a database.

Extraction & transformation
In extraction we get data from various data sources, which could include databases, APIs, images, videos and files such as JSON and XML.
I use Python to extract the data from the data sources. I recently worked with an API: I assigned the URL to a variable and used the requests library to get the data from the API. Then I converted the response to JSON and used the pandas library to create a DataFrame from it, which covers the transformation part.

load
I loaded the data into the database through a database engine object, which needed my database connection string. The string includes the database name, the host and port, and the password.
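
The three steps can be sketched end to end with the standard library only. To keep the example runnable anywhere, it swaps in a hard-coded JSON string for the API response, a list comprehension for pandas, and an in-memory SQLite database for the real one, so it shows the shape of the pipeline rather than the exact tools described above. The weather fields are made up.

```python
import json
import sqlite3

# Stand-in for the body of an API response; the fields are assumptions.
payload = '[{"city": "Nairobi", "temp_c": 22.5}, {"city": "Mombasa", "temp_c": null}]'

def extract(text):
    """Parse the raw response into Python objects."""
    return json.loads(text)

def transform(records):
    """Keep only complete records and round the temperature."""
    return [(r["city"], round(r["temp_c"], 1)) for r in records if r["temp_c"] is not None]

def load(rows, conn):
    """Insert the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(payload)), conn)
loaded = conn.execute("SELECT city, temp_c FROM weather").fetchall()
print(loaded)
```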

Automation
Say one is trying to get weather data for a particular place. The weather changes every moment, so one needs to run the ETL process regularly in order to get the latest data. It would be hectic to keep running the file manually, and that is where Airflow comes in. Apache Airflow is an open-source tool that schedules and automates workflows such as an ETL process. A DAG (directed acyclic graph) is a mathematical structure consisting of nodes and edges. In Airflow, a DAG represents a data pipeline or workflow with a start and an end. The mathematical properties of DAGs make them useful for building data pipelines: directed means there is a clear direction of flow between tasks, and acyclic means there are no loops, so a task never ends up depending on itself.
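
The DAG idea itself can be tried without Airflow, using Python's graphlib module (Python 3.9 or newer). This is not Airflow code and the task names are hypothetical; it only shows why "directed" gives a run order and why a cycle makes a pipeline impossible to schedule.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
run_order = list(TopologicalSorter(pipeline).static_order())

# Adding a loop breaks the "acyclic" property, so no valid order exists.
broken = {"extract": {"load"}, "transform": {"extract"}, "load": {"transform"}}
try:
    list(TopologicalSorter(broken).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True

print(run_order, has_cycle)
```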
All you need to know about Airflow
First you need to create a virtual environment. In Python we use the following command: python -m venv venv (the second venv is the name of your environment). Then you need to activate it with the following command: source venv/bin/activate.
You then need to install Airflow using the following command: pip install apache-airflow==2.8.0 (this pins the version). Inside an activated virtual environment sudo should not be needed; if you install outside one on Linux or WSL you might need it. After installation is done, you need to adjust the configuration to suit your setup. Change into your Airflow folder, i.e. cd airflow, then open the configuration file, which is normally saved as airflow.cfg, using the following command: nano airflow.cfg. The first thing to edit in this file is the executor: we change it to LocalExecutor because we want our tasks to be executed locally. This works only when you are not using SQLite as your database. We also set load_examples to False so the Airflow user interface won't show the example DAGs.
Next is the database config, where we need to add our connection string. I used a Postgres database, and this is the syntax (the setting is named sql_alchemy_conn in airflow.cfg): sql_alchemy_conn = postgresql://postgres:your_password@localhost:5432/db_name.
Save the file and exit back to the folder, where we can run airflow db migrate so that Airflow moves from the default SQLite to your preferred database. We then run airflow db init to initialize our database (in recent Airflow versions airflow db migrate also covers this step) and airflow webserver, then open the localhost address it serves on (port 8080 by default) to view the UI.