<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elvis Mwangi</title>
    <description>The latest articles on DEV Community by Elvis Mwangi (@elvis_mwangi_).</description>
    <link>https://dev.to/elvis_mwangi_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3376392%2Fb32e63af-a2f1-4f53-9839-dcb4ae29b7c6.jpg</url>
      <title>DEV Community: Elvis Mwangi</title>
      <link>https://dev.to/elvis_mwangi_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/elvis_mwangi_"/>
    <language>en</language>
    <item>
<title>Change Data Capture (CDC) in Data Engineering: Concepts, Tools and Real-World Implementation Strategies</title>
      <dc:creator>Elvis Mwangi</dc:creator>
      <pubDate>Mon, 15 Sep 2025 12:15:37 +0000</pubDate>
      <link>https://dev.to/elvis_mwangi_/change-data-capturecdc-in-data-engineering-concepts-tools-and-real-world-implementation-3pg2</link>
      <guid>https://dev.to/elvis_mwangi_/change-data-capturecdc-in-data-engineering-concepts-tools-and-real-world-implementation-3pg2</guid>
      <description>&lt;p&gt;Building and maintaining data pipelines can be an uphill task for individuals relying on limited strategies and tools. Individuals have to choose between reliable sources of data and available data storage areas, making it tedious to make changes in between. Change Data Capture (CDC) is an approach that involves making data changes-insert, delete and update real-time data being delivered to a data warehouse, data lake or an analytical store. Not only does it maximise efficiency, but it also improves and facilitates real-time data replication. The approach is reliant on using database triggers, tracking changes via transaction logs (Khalid Abdelaty, 2025) and making changes for downstream applications. Data engineers processing changed data feeds (slowly changing dimensions - SCD processes ) are split between 2 decisions: &lt;br&gt;
&lt;em&gt;Type 1&lt;/em&gt; - Retain only the latest data by overwriting existing data. Whenever a change is made, old data is replaced by new data, meaning no record of the old data is retained. The approach is used where old data is no longer needed, such as updating customer information in retail stores. &lt;br&gt;
&lt;em&gt;Type 2&lt;/em&gt; - Keep a history of changes to the data. New versions of the changed data are created alongside old data (What Is Change Data Capture (CDC)? | Databricks on AWS, 2025). The data is also timestamped to help a user track the log history on changes made in a record. In real-life applications, the approach is used in areas such as studying customer segmentation data on fields such as their addresses when carrying out analysis. &lt;br&gt;
To better understand CDC, the following concepts have been expounded: &lt;br&gt;
&lt;strong&gt;Change Detection&lt;/strong&gt;&lt;br&gt;
A CDC system pinpoints changes in the source system as they occur. This simply means logs on updates, inserts and removal of records are stored and may be used for analytical purposes at a future date. &lt;br&gt;
&lt;strong&gt;Real-time&lt;/strong&gt;&lt;br&gt;
CDC allows data pipelines to carry real-time information, so the system always works from current and recent data when changes are made. &lt;br&gt;
&lt;em&gt;Delta-driven data&lt;/em&gt;&lt;br&gt;
Describes tracking changes made to data at the row level. Mainly employed by Databricks Delta tools, delta tables rely on Change Data Feed (CDF) to track and process changes. It acts as a stream of modifications, enabling event-driven data pipelines that support real-time analytics by avoiding the need to reprocess entire datasets (What Is Change Data Capture (CDC)? | Databricks on AWS, 2025). Databricks recommends Delta Live Tables (DLT), which simplify the implementation of CDC pipelines, handle out-of-order and faulty change records, and provide data quality and activity monitoring. CDF activities are guided by:&lt;br&gt;
&lt;em&gt;Authorising the Change Data Feed (CDF)&lt;/em&gt;&lt;br&gt;
When CDF is enabled on a delta table, the Delta Lake runtime automatically records change events (update, insert, delete) for every data write.&lt;br&gt;
&lt;em&gt;CDF generation&lt;/em&gt;&lt;br&gt;
The CDF is a forward-looking log capturing all row-level changes made in the Delta table. &lt;br&gt;
&lt;em&gt;Processing Changes.&lt;/em&gt;&lt;br&gt;
As a byproduct of Delta Live Tables, the feed is consumed downstream by events relying on tools such as Databricks Lakeflow Declarative Pipelines to process changes in real-time or near real-time. &lt;br&gt;
&lt;em&gt;Applying changes to target tables&lt;/em&gt;&lt;br&gt;
To keep data consistent, the captured changes are applied to the target tables.&lt;/p&gt;
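&lt;p&gt;As a rough illustration of the two SCD approaches above, here is a minimal Python sketch with hypothetical customer records, not tied to any particular warehouse:&lt;/p&gt;

```python
from datetime import datetime, timezone

def apply_scd_type1(table, key, new_row):
    """Type 1: overwrite in place -- no history of the old values is kept."""
    table[key] = new_row
    return table

def apply_scd_type2(history, key, new_row):
    """Type 2: expire the current version and append a new, timestamped one."""
    now = datetime.now(timezone.utc).isoformat()
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = now          # close out the old version
    history.append({"key": key, **new_row, "valid_from": now, "valid_to": None})
    return history

# Type 1: the customer's old address is gone after the update.
customers = {"c1": {"address": "Nairobi"}}
apply_scd_type1(customers, "c1", {"address": "Mombasa"})

# Type 2: both versions survive, so analysts can study address changes.
history = [{"key": "c1", "address": "Nairobi",
            "valid_from": "2024-01-01", "valid_to": None}]
apply_scd_type2(history, "c1", {"address": "Mombasa"})
```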

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Metadata&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Each captured change (insert, update or delete) includes metadata and a sequential number recording the order in which the change was made. &lt;br&gt;
&lt;em&gt;Idempotent processing&lt;/em&gt;&lt;br&gt;
Ensures that re-applying a CDC event does not alter the final state of the system. Consumers track each change by a unique identifier (a transaction or message ID from the CDC stream) and check whether an event has been processed by querying a database: if the ID exists, the event has already been handled and the message is disregarded; if not, the event is processed and a log entry is written to avoid reprocessing. Idempotent CDC processing empowers a consumer system to handle potential duplicates in the change stream, ensuring that applying the same change event multiple times does not cause incorrect state or errors. Idempotency matters to CDC because it provides data reliability (system issues can cause CDC changes to be re-delivered), consistency (each change is applied exactly once, eliminating duplicates) and resilience (after a failure, the system can restart and re-process events without negative effects). &lt;br&gt;
&lt;em&gt;Implementing idempotent CDC processing&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Assigning unique IDs
Messages or change events are assigned unique, non-sequential IDs by the producing service, typically carried as a Kafka header and included in the message payload.&lt;/li&gt;
&lt;li&gt;Checking for Duplicates
Before processing a change, the consumer service checks whether the message ID already exists in a dedicated database table.&lt;/li&gt;
&lt;li&gt;Process or Discard
If the message ID is found, the consumer marks the event as a duplicate and discards it.
If it is not found, the consumer starts a database transaction, processes the message, executes the required business logic, and commits the transaction together with the message ID to prevent any future duplication. &lt;/li&gt;
&lt;li&gt;Record Processing
Upon the completion of the above task, the message ID is inserted into the database, marking it as processed. &lt;/li&gt;
&lt;/ol&gt;
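&lt;p&gt;The four steps above can be sketched in Python; a minimal example using an in-memory SQLite table as the dedicated deduplication store (the message IDs and business logic are hypothetical):&lt;/p&gt;

```python
import sqlite3

# A tiny "processed messages" table stands in for the dedicated
# deduplication store described in step 2.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")

def handle_change_event(message_id, event, apply_change):
    """Process a CDC event at most once, keyed on its unique message ID."""
    cur = db.execute("SELECT 1 FROM processed WHERE message_id = ?", (message_id,))
    if cur.fetchone():
        return "duplicate-discarded"        # step 3: already seen, discard
    apply_change(event)                     # step 3: execute business logic
    db.execute("INSERT INTO processed (message_id) VALUES (?)", (message_id,))
    db.commit()                             # step 4: record as processed
    return "processed"

state = {}
apply = lambda ev: state.update({ev["key"]: ev["value"]})

first = handle_change_event("msg-001", {"key": "order-7", "value": "shipped"}, apply)
# Re-delivery of the same event is detected and discarded.
second = handle_change_event("msg-001", {"key": "order-7", "value": "shipped"}, apply)
```

In production the business logic and the message-ID insert would share one database transaction so a crash between them cannot cause a double-apply.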

&lt;p&gt;The four commonly recognised CDC methods are:&lt;br&gt;
&lt;em&gt;Log-based CDC&lt;/em&gt;&lt;br&gt;
Reads database transaction logs, such as PostgreSQL's write-ahead log (WAL), to identify changes instantly as they occur. Its competitive advantage lies in operating at a low level, capturing changes with minimal disruption to the source system. An example of log-based CDC using PostgreSQL logical replication:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Enable logical replication
ALTER SYSTEM SET wal_level = logical;

-- Create a logical replication slot to capture changes
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');

-- Fetch recent changes from the WAL
SELECT * FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Trigger-based CDC&lt;/em&gt;&lt;br&gt;
Uses triggers attached to source-table events (updates, inserts and deletes) to record changes. A downside of the approach is that, if not carefully managed, the triggers add extra load to the source database. It is more suitable for moderate transaction volumes.&lt;br&gt;&lt;br&gt;
&lt;em&gt;Polling-based CDC&lt;/em&gt;&lt;br&gt;
The system checks for changes using a timestamp or version column. However, the technique exposes the system to latency challenges as changes are only detected at fixed intervals. The approach is highly recommended in places where real-time access to logs is unavailable, though a slight delay in detecting changes may be experienced. &lt;br&gt;
&lt;em&gt;Time-based CDC&lt;/em&gt;&lt;br&gt;
Reliant on a column recording the last time a row was changed. The system compares these timestamps against the last extraction time to find changed rows. It is similar to polling-based CDC but requires a more robust mechanism to track changes, as it depends on the timestamps being consistently maintained. &lt;/p&gt;
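&lt;p&gt;Trigger-based CDC can be illustrated with SQLite, whose triggers behave much like those of larger databases; a minimal sketch, with a hypothetical customers table feeding a change table:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
-- Change table that the triggers write into
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, row_id INTEGER, changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER trg_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, row_id) VALUES ('INSERT', NEW.id);
END;
CREATE TRIGGER trg_upd AFTER UPDATE ON customers BEGIN
    INSERT INTO customers_changes (op, row_id) VALUES ('UPDATE', NEW.id);
END;
CREATE TRIGGER trg_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, row_id) VALUES ('DELETE', OLD.id);
END;
""")

# Every write on the source table is recorded by the triggers.
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'Nairobi')")
conn.execute("UPDATE customers SET address = 'Mombasa' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")
ops = [row[0] for row in conn.execute(
    "SELECT op FROM customers_changes ORDER BY change_id")]
```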

&lt;p&gt;&lt;strong&gt;CDC tools&lt;/strong&gt;&lt;br&gt;
Which of the popular CDC tools to use depends on the use case.&lt;br&gt;
&lt;em&gt;Debezium&lt;/em&gt; &lt;br&gt;
An open-source system that captures database changes and streams them into systems such as Apache Kafka. It is helpful when streaming from multiple data sources that depend on real-time updates. Use Case: Event-driven architecture and real-time data streaming. &lt;/p&gt;
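&lt;p&gt;For illustration, a Debezium MySQL source connector is typically registered with Kafka Connect using a JSON configuration along these lines (hostnames, credentials and table names below are placeholders; check the Debezium documentation for the exact property names of your version):&lt;/p&gt;

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.example.internal",
    "database.port": "3306",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.server.id": "184054",
    "topic.prefix": "inventory",
    "table.include.list": "inventory.orders",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```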

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvdiek4f33wfybc02bcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvdiek4f33wfybc02bcl.png" alt="Debezium" width="325" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AWS Data Migration Service (DMS)&lt;/em&gt;&lt;br&gt;
Relies on the CDC to continuously replicate data on the AWS system with minimal downtime. It is an excellent choice if one wants to move data to the cloud with ease. Use Case: Cloud migrations and AWS-based architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjujsddaszd8wwt69l7pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjujsddaszd8wwt69l7pk.png" alt="AWS Data migration service, source AWS" width="643" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Apache Kafka&lt;/em&gt;&lt;br&gt;
When paired with tools such as Debezium, it serves as the backbone for processing CDC events. It enables a system to synchronise data across multiple consumers, host reliable data pipelines and support real-time data analytics. Use Case: Streaming CDC data into a real-time data-driven architecture. &lt;br&gt;
&lt;em&gt;Talend and Informatica&lt;/em&gt;&lt;br&gt;
These platforms automate ETL pipelines with built-in CDC, eliminating manual configuration. They are advantageous in complex transaction scenarios where an integrated solution simplifies operations. Use Case: Enterprise-grade ETL solutions with built-in CDC.&lt;br&gt;
&lt;em&gt;Database CDC Native Solutions&lt;/em&gt;&lt;br&gt;
Some relational database tools offer native CDC features, reducing the need for external tools.&lt;br&gt;
i. PostgreSQL logical replication&lt;br&gt;
ii. SQL Server CDC&lt;br&gt;
iii. MySQL binary log (binlog) replication.&lt;/p&gt;

&lt;p&gt;They minimise dependencies on external CDC tools.&lt;/p&gt;
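&lt;p&gt;As a sketch of the native options, SQL Server's CDC is enabled per database and then per table, and MySQL's binary logging can be confirmed with a variable query (schema and table names here are illustrative):&lt;/p&gt;

```sql
-- SQL Server: enable CDC on the database, then on a specific table
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',
    @role_name     = NULL;

-- MySQL: confirm binary logging (the basis of binlog replication) is on
SHOW VARIABLES LIKE 'log_bin';
```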

&lt;p&gt;&lt;strong&gt;Real-World Applications.&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Cloud Migration&lt;/em&gt;&lt;br&gt;
With data volumes rapidly growing, building pipelines into an ever-growing database presents experts with storage challenges. Cloud services from providers such as AWS and Microsoft Azure let companies pay for storage in proportion to their usage. CDC helps cloud migration by providing an architecture that synchronises real-time data streams with cloud databases with minimal downtime. &lt;br&gt;
&lt;em&gt;Data Integration&lt;/em&gt;&lt;br&gt;
Due to its ability to be used among a wide range of external data services, CDC is an important tool for companies seeking to move data. &lt;br&gt;
&lt;em&gt;Data replication and synchronisation&lt;/em&gt;&lt;br&gt;
In data activities that involve multiple consumers and multiple data sources, CDC ensures data is synchronised, facilitating across-the-board changes in real-time. In fields such as inventory management, where customer information is constantly changing, CDC ensures changes do not affect the integrity of the system. Data replication is upheld across the target database, enabling users to monitor changes in real-time.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                   **References**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Khalid Abdelaty. (2025, February 25). What is Change Data Capture (CDC)? A Beginner’s Guide. Datacamp.com; DataCamp. &lt;a href="https://www.datacamp.com/blog/change-data-capture" rel="noopener noreferrer"&gt;https://www.datacamp.com/blog/change-data-capture&lt;/a&gt;&lt;br&gt;
What is change data capture (CDC)? | Databricks on AWS. (2025, August 4). Databricks.com. &lt;a href="https://docs.databricks.com/aws/en/dlt/what-is-change-data-capture" rel="noopener noreferrer"&gt;https://docs.databricks.com/aws/en/dlt/what-is-change-data-capture&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Apache Kafka Deep Dive</title>
      <dc:creator>Elvis Mwangi</dc:creator>
      <pubDate>Mon, 08 Sep 2025 13:38:00 +0000</pubDate>
      <link>https://dev.to/elvis_mwangi_/apache-kafka-deep-dive-384c</link>
      <guid>https://dev.to/elvis_mwangi_/apache-kafka-deep-dive-384c</guid>
      <description>&lt;p&gt;Revolving around key activities such as extracting, transforming and loading data, data pipelines play an important role in determining the integrity of the information collected. Depending on the volume, data sources (Foidl et al., 2023) and ingestion rate, most data pipeline processes have been automated so that activities such as data capture and extraction can take place simultaneously. The system also ensures that professionals such as data analysts can access previously stored historical information. Data pipeline tools split into cloud-based systems, such as Google Dataflow and AWS Data Pipeline, and open-source frameworks, such as Apache Airflow and Apache Kafka. To understand which of these tools to employ, one needs to classify the workload under the following three types of data pipelines:&lt;br&gt;
&lt;strong&gt;Batch-Processing&lt;/strong&gt;&lt;br&gt;
Under batch processing, data pipelines are built to handle large volumes of data processed at scheduled time intervals. It is the ideal technique where real-time data processing is not essential; tasks such as fetching historical data for reporting rely on already stored data.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Stream Processing Pipelines&lt;/strong&gt; &lt;br&gt;
In fields that rely on multiple data sources, this technique handles continuous, large volumes of data as they arrive rather than at scheduled intervals. Value-adding transformations under stream processing include filtering, aggregation, applying business logic and data enrichment. Real-life applications include machine learning models that rely on real-time data in the financial world and online fraud-detection programmes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hybrid Processing Pipelines&lt;/em&gt; &lt;br&gt;
For projects requiring real-time data processing, creating a unified data processing pipeline ensures scalability and data availability. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;APACHE KAFKA - CORE CONCEPTS&lt;/strong&gt;&lt;br&gt;
Open-source frameworks such as Apache Kafka enable individuals to create and access publicly available data and tools while learning the concepts. Apache Kafka comes off as a hybrid processing technique, borrowing concepts from both batch and stream processing. It is basically a distributed commit log (Das, 2021): an append-only, ordered sequence of records known as messages. Key concepts discussed below are: producers, consumers, consumer groups, brokers, topics, partitions, replication, ZooKeeper, Kafka Streams, Kafka Connect and the Kafka cluster. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producers&lt;/strong&gt;&lt;br&gt;
An application that sends messages using the Kafka Producer API. Based on the partition strategy, a producer decides where each event or message it ferries should land. The strategies that determine how a producer publishes messages to a Kafka topic fall under:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No key specified - The producer balances the events or messages by distributing them across partitions, spreading out the total number of events being sent.&lt;/li&gt;
&lt;li&gt;Key specified - The key is hashed, and messages containing the same key are sent to the same partition; consistent hashing ensures minimal redistribution of keys in a re-hashing scenario.&lt;/li&gt;
&lt;li&gt;Partition specified - Involves naming the particular partition that will receive the message.&lt;/li&gt;
&lt;li&gt;Custom partitioning logic - Rules are formulated to decide which partition to use.&lt;/li&gt;
&lt;/ol&gt;
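&lt;p&gt;The strategies above can be sketched in Python; a simplified stand-in for a producer's partitioner (Kafka's default partitioner hashes the key bytes with murmur2 - CRC32 is used here only for illustration):&lt;/p&gt;

```python
import zlib

def choose_partition(key, num_partitions, explicit_partition=None):
    """Pick a partition the way a producer's partition strategy would."""
    if explicit_partition is not None:   # partition specified
        return explicit_partition
    if key is None:                      # no key: left to round-robin/sticky balancing
        return None
    # key specified: same key always hashes to the same partition
    return zlib.crc32(key.encode()) % num_partitions

# Messages with the same key always land in the same partition.
p1 = choose_partition("customer-42", 6)
p2 = choose_partition("customer-42", 6)
# An explicitly named partition wins over hashing.
p3 = choose_partition("order-7", 6, explicit_partition=3)
```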

&lt;p&gt;&lt;strong&gt;Consumer&lt;/strong&gt;&lt;br&gt;
An application that subscribes to and receives messages. If messages 10, 13, 16 and 17 are inserted into a topic partition in that order, a consumer application reads them in the same order. An offset is committed (historically in ZooKeeper) every time a message is consumed, and resetting an offset position allows access to older messages, hedging the risk of data loss. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Group&lt;/strong&gt;&lt;br&gt;
Unlike a single consumer application (Apache Kafka Concepts, Fundamentals, and FAQs, 2024), a consumer group is a single logical consumer executed as multiple physical consumers. Should a single consumer need to scale up to the messages or events Kafka is providing, one can create additional instances of it. In practice, when group membership changes, partitions previously held by one consumer are reassigned to others, a process defined as rebalancing. Kafka keeps track of the members of a consumer group and allocates partitions to them. For full utilisation, there should be at least as many partitions as consumers: more consumers than partitions leads to idle consumers, while fewer means some consumers read more than one partition. A deeper dive into fan-out exchange and order guarantees offers more insight into Kafka's consumers. &lt;br&gt;
&lt;strong&gt;Kafka Topics&lt;/strong&gt;&lt;br&gt;
Acting as a named log, a topic organises and stores streams of events which producers write to and consumers read from. Messages are append-only, allowing users to access historical data. A single topic can be written by multiple producers and read by multiple consumers. To improve data availability, topics can be divided into multiple partitions.&lt;br&gt;
&lt;strong&gt;Kafka Broker&lt;/strong&gt;&lt;br&gt;
A Kafka broker is a single Kafka server, coordinated through ZooKeeper. Functions of a broker include receiving messages from producers, assigning offsets and committing them to the partition log. Brokers view event data only as opaque arrays of bytes, ensuring data availability and message integrity. A producer connects to a set of initial servers in the cluster, requests metadata about the cluster and, based on the partition strategy, determines which server to use. On the consumer side, the consumers in a group coordinate sharing the partitions they read from.&lt;/p&gt;
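&lt;p&gt;The partition-to-consumer relationship described above can be sketched with a simple round-robin assignment in Python (real Kafka assignors, such as range or cooperative-sticky, are more sophisticated):&lt;/p&gt;

```python
def assign_partitions(consumers, num_partitions):
    """Round-robin assignment of partitions to the consumers in one group."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 6 partitions, 3 consumers: each consumer reads 2 partitions.
balanced = assign_partitions(["c1", "c2", "c3"], 6)

# 6 partitions, 8 consumers: two consumers sit idle.
oversubscribed = assign_partitions([f"c{i}" for i in range(8)], 6)
idle = [c for c, parts in oversubscribed.items() if not parts]
```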

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndt9075rcf7itdwv97gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndt9075rcf7itdwv97gx.png" alt="Kafka and Zookeeper running in the background" width="800" height="189"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Kafka offsets&lt;/strong&gt;&lt;br&gt;
Individual messages are assigned a unique sequential ID, which consumers use to track their progress. &lt;br&gt;
&lt;strong&gt;Cluster&lt;/strong&gt;&lt;br&gt;
A group of broker nodes running together to provide scalability, availability and fault tolerance. One of the servers takes up the role of controller, assigning partitions to other servers, monitoring for broker failure and taking up administrative duties. When a partition is replicated across 6 brokers, one replica acts as the leader and the other 5 become followers. Data and messages are written to the leader and replicated by the other 5, which ensures we do not incur data loss should the leader go down, as one of the followers takes up the role. &lt;br&gt;
The following commands illustrate setting up a replicated topic on a 6-node Kafka cluster.&lt;br&gt;
The steps have been split into 3:&lt;br&gt;
Create a topic&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create \
  --topic my-topic \
  --bootstrap-server localhost:9092 \
  --partitions 6 \
  --replication-factor 3

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm topic details&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --describe \
  --topic my-topic \
  --bootstrap-server localhost:9092

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define the replica assignment for the 6 partitions (saved as replica-assignment.json)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "version": 1,
  "partitions": [
    {"topic": "my-topic", "partition": 0, "replicas": [1,2,3]},
    {"topic": "my-topic", "partition": 1, "replicas": [2,3,4]},
    {"topic": "my-topic", "partition": 2, "replicas": [3,4,5]},
    {"topic": "my-topic", "partition": 3, "replicas": [4,5,6]},
    {"topic": "my-topic", "partition": 4, "replicas": [5,6,1]},
    {"topic": "my-topic", "partition": 5, "replicas": [6,1,2]}
  ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the topic with the explicit assignment, replicating each of the 6 partitions across 3 nodes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic my-topic \
  --replica-assignment replica-assignment.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Zookeeper&lt;/strong&gt;&lt;br&gt;
Kafka servers traditionally run alongside ZooKeeper, which manages and tracks Kafka brokers, topics, partition assignments and leader elections. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xxy2s70cm73695ko55j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xxy2s70cm73695ko55j.png" alt="Zookeeper" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineering Applications&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Change Data Capture(CDC)&lt;/em&gt;&lt;br&gt;
Data activities are synchronised to track and record changes. CDC feeds and logs contain record changes (inserts, updates and deletes), preserving the integrity of a system without reloading entire datasets. Using Apache Kafka, the system leverages its streaming capabilities to propagate and process changes occurring in source databases.&lt;br&gt;&lt;br&gt;
Kafka Connect is a framework used to connect Kafka and other systems that use it to stream data in and out. Source connectors such as Debezium are equipped with capabilities designed to perform CDC from databases such as MySQL, PostgreSQL, MongoDB, Oracle and  SQL Server.&lt;br&gt;
&lt;em&gt;Streaming Analytics&lt;/em&gt;&lt;br&gt;
The process of analysing data continuously rather than in batches. Synchronising data systems with tools such as Apache Flink and Kafka Streams, data engineers not only perform real-time analysis for immediate insights but also implement tasks such as fraud detection. Streaming analytics also enables data transformation on data arriving from connected devices. &lt;br&gt;
&lt;em&gt;Real-time ETL Pipelines&lt;/em&gt;&lt;br&gt;
Involves designing and implementing scalable architectures that secure data quality through validation and cleansing. It also enables data experts to retain historical data for future data activities. Kafka uses its real-time data streaming capabilities to ensure data engineers can continuously extract, transform and load data into warehouses using tools such as Kafka Streams or Apache Flink. Kafka Connect is also used for simplified data integration.&lt;br&gt;
&lt;strong&gt;Real-World Production Practices&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Activity Tracking&lt;/em&gt;&lt;br&gt;
Multinational companies such as Netflix, LinkedIn and Uber use open source Kafka tools to track user activities. Companies such as Uber that have spread out globally have data engineers building systems that collect large volumes of data on customer behaviour, enabling the company to repopulate their websites with better offers. Activity tracking also enables the company’s machine learning engineers to collect timely data that can be used to train models that predict numerous variables under investigation. &lt;br&gt;
&lt;em&gt;Capacity Planning&lt;/em&gt;&lt;br&gt;
Because Kafka persists data to local disks on its brokers, projects require capacity planning. Replication factors may also be reconfigured after a team schedules the deployment of a project. Fields such as inventory handling and logistics planning may require additional storage once the business begins dealing with numerous customers and expanding order volumes. &lt;/p&gt;
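&lt;p&gt;The streaming-analytics idea of continuous processing can be sketched as a tumbling-window count in plain Python (the event timestamps below are hypothetical; Kafka Streams or Flink would maintain such windows over a live stream):&lt;/p&gt;

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window --
    the kind of continuous aggregation a streaming engine performs."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Hypothetical click events: (unix_timestamp, payload)
events = [(100, "click"), (103, "click"), (161, "click"),
          (175, "click"), (240, "click")]
counts = tumbling_window_counts(events, 60)
```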

&lt;p&gt;&lt;strong&gt;References&lt;/strong&gt;&lt;br&gt;
Apache Kafka Concepts, Fundamentals, and FAQs. (2024). Confluent. &lt;a href="https://developer.confluent.io/faq/apache-kafka/concepts/#:%7E:text=Kafka%20runs%20as%20a%20cluster,to%20process%20data%20in%20parallel" rel="noopener noreferrer"&gt;https://developer.confluent.io/faq/apache-kafka/concepts/#:~:text=Kafka%20runs%20as%20a%20cluster,to%20process%20data%20in%20parallel&lt;/a&gt;.&lt;br&gt;
Das, A. (2021, January 17). Kafka Basics and Core Concepts. Medium: inspiring brilliance. &lt;a href="https://medium.com/inspiredbrilliance/kafka-basics-and-core-concepts-5fd7a68c3193" rel="noopener noreferrer"&gt;https://medium.com/inspiredbrilliance/kafka-basics-and-core-concepts-5fd7a68c3193&lt;/a&gt;&lt;br&gt;
Foidl, H., Golendukhina, V., Ramler, R., &amp;amp; Felderer, M. (2023). Data pipeline quality: Influencing factors, root causes of data-related issues, and processing problem areas for developers. Journal of Systems and Software, 207, 111855. &lt;a href="https://doi.org/10.1016/j.jss.2023.111855" rel="noopener noreferrer"&gt;https://doi.org/10.1016/j.jss.2023.111855&lt;/a&gt; &lt;/p&gt;

</description>
    </item>
    <item>
      <title>A snippet on Docker vs Docker Compose</title>
      <dc:creator>Elvis Mwangi</dc:creator>
      <pubDate>Mon, 25 Aug 2025 12:24:07 +0000</pubDate>
      <link>https://dev.to/elvis_mwangi_/a-snipet-on-docker-vs-docker-compose-4lge</link>
      <guid>https://dev.to/elvis_mwangi_/a-snipet-on-docker-vs-docker-compose-4lge</guid>
      <description>&lt;p&gt;Getting Started with Docker and Docker Compose&lt;br&gt;
With the ever-growing demand for data pipelines, storage and transformation, software developed to improve a programmer's efficiency has been on the rise. Docker is one-stop software housing containers that bundle libraries, packages, system and runtime tools, and code. It enables data experts to run and deploy applications in any environment, and being open source makes it attractive to beginners and start-ups seeking to deploy and distribute applications.&lt;/p&gt;

&lt;p&gt;Features of Docker&lt;br&gt;
As an open-source software, Docker relies heavily on key features such as:&lt;br&gt;
Isolation&lt;br&gt;
Containers run independently, providing a secure and consistent environment for applications and eliminating conflicts between them. This also suits service-oriented tasks, simplifying distribution, calling and debugging. &lt;br&gt;
Portability&lt;br&gt;
Docker can run on any system that has the software installed, regardless of the underlying operating system.&lt;br&gt;
Open-Source Platform&lt;br&gt;
Lets users choose which technology (e.g. Amazon) to complete a task with. As such, its features are attractive to beginners and lone developers who depend on Docker toolchains.&lt;br&gt;
Efficient Life Cycle&lt;br&gt;
By eliminating time lost between writing, testing and deploying code, Docker not only saves resources but also frees individuals to take on more tasks.&lt;br&gt;
Scalability&lt;br&gt;
Provides developers flexibility: Docker containers are lightweight, making them attractive for handling dynamic workloads fluctuating between different deployed applications.&lt;br&gt;
Image Management&lt;br&gt;
Images serve as blueprints for developers and data experts when building applications or making changes to existing projects. A Dockerfile contains the set of instructions that tells Docker how to build an image, and a registry such as Docker Hub hosts the built images for future use.&lt;br&gt;
Volume Management&lt;br&gt;
Because numerous projects and large quantities of data are saved and processed in Docker, volumes ensure persistent data storage and availability of data and logs even when containers are stopped or removed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kwatmayu7ury6ku5r7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kwatmayu7ury6ku5r7y.png" alt="Docker Compose" width="293" height="172"&gt;&lt;/a&gt;&lt;br&gt;
Docker Compose&lt;br&gt;
It’s a tool used to run multi-container applications. It streamlines tasks and manages application stacks spread across numerous containers: Docker runs an application in a single container unit, while Docker Compose specialises in running applications that use multiple containers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Defining 15 Common Data Engineering Concepts</title>
      <dc:creator>Elvis Mwangi</dc:creator>
      <pubDate>Sun, 10 Aug 2025 20:46:32 +0000</pubDate>
      <link>https://dev.to/elvis_mwangi_/defining-15-common-data-engineering-concepts-1d34</link>
      <guid>https://dev.to/elvis_mwangi_/defining-15-common-data-engineering-concepts-1d34</guid>
      <description>&lt;p&gt;In an ever-evolving technological world, an estimated 90% of global data was generated in the last two years, and roughly 2.5 quintillion bytes of data are generated daily, necessitating reliable storage and data processing systems. An economic shift saw an increase in internet service providers, prompting cheaper options, driving up the number of individuals accessing the internet and leading to a surge in data collected. Data engineering as a discipline focuses on building data infrastructure whose purpose is to store, extract and transform data. This article focuses on distinct data engineering core concepts and, in some instances, compares similar data concepts applicable in the field.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Batch vs Streaming ingestion&lt;/strong&gt;&lt;br&gt;
Data pipelines built by data engineers store historical data and can be used to perform real-time data analysis. Both techniques fall under Extract, Transform and Load (ETL) processes. Batch processing is an automated ETL technique that processes large volumes of data in batches or chunks. Employing tools such as Apache Airflow, the technique is efficient where data does not require immediate action, as in data warehousing and periodic reporting. Stream processing systems handle data in real time: sources such as social media feeds and other live streams produce information that changes continuously. Employing frameworks such as AWS Kinesis, a streaming system's scalability enables it to handle high data velocity.&lt;/p&gt;
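&lt;p&gt;As a rough illustration (plain Python, with list operations standing in for Airflow or Kinesis), batch ingestion processes accumulated chunks while streaming ingestion handles each record as it arrives:&lt;/p&gt;

```python
# Minimal sketch: batch vs streaming ingestion (illustrative only).
events = [{"id": i, "value": i * 10} for i in range(7)]

# Batch ingestion: accumulate records, then process whole chunks.
def ingest_batches(records, batch_size):
    batches = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        batches.append(len(chunk))       # stand-in for "load chunk into warehouse"
    return batches

# Streaming ingestion: handle each record the moment it arrives.
def ingest_stream(records):
    processed = []
    for record in records:
        processed.append(record["id"])   # stand-in for real-time handling
    return processed

print(ingest_batches(events, 3))  # three chunks: sizes 3, 3, 1
print(ingest_stream(events))
```

&lt;p&gt;The batch version trades latency for throughput; the streaming version does the opposite, which is exactly the choice between periodic reporting and live analytics.&lt;/p&gt;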

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpnlzxamkphnuclhcena.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwpnlzxamkphnuclhcena.jpg" alt="Window" width="458" height="200"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Windowing in Streaming&lt;/strong&gt;&lt;br&gt;
Classified under streaming ingestion, windowing involves partitioning continuous data streams into smaller, manageable subsets for systematic processing. Types of windowing used in real-time analytics include: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Sliding windows – overlap and share information with other windows&lt;/li&gt;
&lt;li&gt; Tumbling windows – fixed-size, contiguous time intervals used to make definite data segments&lt;/li&gt;
&lt;li&gt; Session windows – their length depends on a user’s engagement&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt;&lt;br&gt;
A system used to track and document changes made to data: essentially a log that maintains consistency while capturing modifications such as inserts, deletes and updates. Principles governing CDC include: incremental updates – CDC centres on changed data, minimising network bandwidth; log-based tracking – keeps logs of data transactions, capturing extracted data changes; capture – focuses on data changes involving inserts, updates and deletes; idempotent processing – ensures duplicates do not affect data integrity. Fields relying on CDC include financial services, healthcare, logistics and supply chain, telecommunications and commerce.&lt;br&gt;
&lt;em&gt;Idempotency&lt;/em&gt;&lt;br&gt;
Factored as a CDC principle, it ensures APIs handling data requests produce the same result regardless of the number of repetitions. Idempotent HTTP methods include GET, OPTIONS, PUT, HEAD, TRACE and DELETE. Idempotency improves a system’s error handling, consistency of outcomes, debugging, concurrency management and fault tolerance. Implementing idempotency keys, which are unique identifiers, requires the following steps: generate unique keys, store and check keys, and implement key expiry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online Transaction Processing (OLTP) vs Online Analytical Processing (OLAP)&lt;/strong&gt;&lt;br&gt;
OLTP focuses on transactional processing and real-time data operations, while OLAP is designed for complex data analysis and reporting. Separating the two yields the following benefits: performance optimisation – efficient data processing; improved data quality – reduced risk of errors; enhanced decision-making – independent scaling.&lt;br&gt;
&lt;strong&gt;Columnar vs Row-Based Storage&lt;/strong&gt;&lt;br&gt;
In a columnar system data is stored and organised by column, while in a row-based system data is stored by row. Benefits of a columnar system include compressible data, versatility across a wide variety of big data applications, and speed and efficiency – data is easier to find and is self-indexing. Benefits of a row-based system include simpler data manipulation and efficiency for transactional workloads.&lt;br&gt;
&lt;em&gt;Partitioning&lt;/em&gt;&lt;br&gt;
For scalability, breaking data down not only helps process the databases but also improves the efficiency of the tools used during data manipulation. Types of data partitioning include: horizontal – data is split into rows housing the same set of columns; vertical – data is split by columns, using a partition key column present in all tables to maintain a logical relationship; range – data is partitioned by a range of values assigned to a specific table; hash – depends on a hash function applied to a partition key; composite – a blend of two partitioning techniques; list – a set of values determines the partition. Partitioning is applicable in machine learning pipelines, log management, OLAP operations and distributed databases.&lt;br&gt;
&lt;strong&gt;Extract Transform Load (ETL) vs Extract Load Transform (ELT)&lt;/strong&gt;&lt;br&gt;
ETL entails extracting data from distinct sources, transforming it into suitable readable formats and loading it into data storage systems, while in an ELT process data is loaded first and then transformed. Notable differences between the two include: under ELT, data storage revolves around data warehouses but more often data lakes holding unstructured data, while ETL makes data privacy compliance simpler because transformations are carried out before loading. Challenges faced when migrating from one architecture to the other include differences in logic and code, a change in data security parameters prompted by interchanging the loading and transformation steps, and reconfiguring the data infrastructure.&lt;/p&gt;
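&lt;p&gt;The ordering difference between ETL and ELT described above can be sketched in a few lines of plain Python – a toy `transform` step and in-memory lists standing in for real storage and warehouse tooling:&lt;/p&gt;

```python
# Illustrative sketch of ETL vs ELT ordering (not a real pipeline).
raw = ["  Alice ", "BOB", "  carol"]

def transform(records):
    # Normalise records to a suitable readable format.
    return [r.strip().title() for r in records]

# ETL: transform BEFORE loading into storage.
etl_storage = transform(raw)

# ELT: load raw data first (e.g. into a data lake), transform downstream.
elt_storage = list(raw)
elt_storage = transform(elt_storage)

print(etl_storage)  # ['Alice', 'Bob', 'Carol']
assert etl_storage == elt_storage  # same end state, different ordering
```

&lt;p&gt;Both paths end with the same cleaned data; what differs is where the raw, untransformed records live in the meantime – which is precisely why the two architectures have different privacy-compliance profiles.&lt;/p&gt;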

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxk7hl6k3guprd3kdz5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxk7hl6k3guprd3kdz5o.png" alt="ETL vs ELT" width="630" height="418"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;CAP Theorem&lt;/strong&gt;&lt;br&gt;
States that in a distributed data system it is impossible to simultaneously attain all three of the following properties: consistency – all data nodes share the same up-to-date view; availability – requests made to the system do not yield errors; and partition tolerance – the system remains operational despite failed node communication. An entity must choose only two. The possible trade-offs are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AP (Availability and Partition Tolerance)&lt;/li&gt;
&lt;li&gt;CP (Consistency and Partition Tolerance)&lt;/li&gt;
&lt;li&gt;CA (Consistency and Availability)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;DAGs and Workflow Orchestration&lt;/strong&gt;&lt;br&gt;
Often created using Apache Airflow or Dagster, DAGs execute tasks in the correct order and prevent cycles, warranting efficient workflows. Uses of DAGs in workflows include task scheduling, dependency management, monitoring and error handling. Advantages include: better visibility – a DAG paints a clear visual representation of the workflow; enhanced observability; and increased efficiency – automating pipelines and workflows lets a data engineer allocate time and resources to other objectives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retry Logic and Dead Letter Queues (DLQ)&lt;/strong&gt;&lt;br&gt;
Retry logic refers to strategies that ensure the reliability of software systems by automatically re-attempting failed operations. Retry logic encompasses maximum retries and backoff strategies, including constant backoff, exponential backoff and jittered backoff.&lt;br&gt;
A DLQ serves as a storage unit housing problematic messages, ensuring no message loss and allowing future re-processing. The two common reasons messages are sent to a DLQ are erroneous message content and changes in the receiver’s system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backfilling and Reprocessing&lt;/strong&gt;&lt;br&gt;
Backfilling describes the process of replacing old records with new ones when processing historical data. Quality incidents and the presence of anomalies in data force data engineers to employ backfilling techniques, and backfilling's impact is felt most when applied to an ever-growing dataset. Examples of data backfilling include fixing a mistake in the data, filling in missing values, working with unstructured data and recomputing derived values.&lt;br&gt;
Data reprocessing involves recalculating data based on existing information. It is triggered by manual initiation or driver change, and depends on the following factors:&lt;br&gt;
• Number of rules in the database&lt;br&gt;
• Number of vehicles in the database&lt;br&gt;
• Data range to be reprocessed&lt;/p&gt;
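&lt;p&gt;The retry strategies discussed above can be sketched in a few lines of framework-free Python. The `flaky_operation` below is hypothetical, standing in for any failure-prone call:&lt;/p&gt;

```python
import random
import time

def retry(operation, max_retries=4, base_delay=0.01):
    """Re-attempt a failed operation with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: a real system might route to a DLQ here
            delay = base_delay * (2 ** attempt)       # exponential backoff
            delay = delay * random.uniform(0.5, 1.5)  # jitter avoids thundering herds
            time.sleep(delay)

# Hypothetical operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky_operation():
    calls["n"] += 1
    if calls["n"] > 2:
        return "ok"
    raise RuntimeError("transient failure")

print(retry(flaky_operation))  # "ok" after two failed attempts
```

&lt;p&gt;Constant backoff would keep `delay` fixed; dropping the `random.uniform` factor gives plain exponential backoff.&lt;/p&gt;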

&lt;p&gt;&lt;strong&gt;Time Travel and Data Versioning&lt;/strong&gt;&lt;br&gt;
Time travel enables organisations to conduct data audits or inspect changes over time by querying tables as they existed at earlier points, across multiple warehouses or within the same workspace. Data versioning, by contrast, focuses on tracking and managing changes to datasets over time. Unlike backfilling, data versioning restores datasets to previous versions, saving time, and it complements CDC logs. Implementation approaches include valid_from/valid_to metadata, full duplication and first-class versioning.&lt;/p&gt;
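&lt;p&gt;The valid_from/valid_to approach mentioned above can be sketched with plain Python dictionaries – an illustrative toy with a made-up customer record, not a warehouse feature:&lt;/p&gt;

```python
import datetime as dt

# Each version of a record carries valid_from / valid_to metadata.
history = [
    {"customer": "acme", "tier": "bronze",
     "valid_from": dt.date(2024, 1, 1), "valid_to": dt.date(2024, 6, 30)},
    {"customer": "acme", "tier": "gold",
     "valid_from": dt.date(2024, 7, 1), "valid_to": None},  # current version
]

def as_of(rows, day):
    """Time travel: return the version that was valid on the given day."""
    for row in rows:
        ends = row["valid_to"] or dt.date.max  # open-ended current version
        if ends >= day >= row["valid_from"]:
            return row["tier"]
    return None

print(as_of(history, dt.date(2024, 3, 15)))  # bronze
print(as_of(history, dt.date(2025, 1, 1)))   # gold
```

&lt;p&gt;Because every version is retained with its validity interval, restoring or auditing an earlier state is a lookup rather than a backfill.&lt;/p&gt;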

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn6hy2rfsbgjmmn8iqyy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvn6hy2rfsbgjmmn8iqyy.jpg" alt="Framework" width="286" height="176"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Data Governance&lt;/strong&gt;&lt;br&gt;
This is a system of rules, policies and processes employed by an organisation in managing individual data assets. It focuses on data security, availability and quality. A data governance framework involves numerous distinct teams addressing issues such as &lt;br&gt;
• Data governance tools&lt;br&gt;
• Organisation goals, roles and duties&lt;br&gt;
• Data policies, processes and standards&lt;br&gt;
• Auditing procedures&lt;br&gt;
&lt;strong&gt;Distributed Processing Concepts&lt;/strong&gt;&lt;br&gt;
Distributed processing involves splitting computational tasks into smaller parts and analysing data across multiple interconnected devices or nodes. Benefits include scalability, fault tolerance, performance and efficient handling of large volumes of data. Disadvantages include maintaining data consistency, network latency, ensuring data security and system complexity. &lt;/p&gt;
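&lt;p&gt;The split-and-combine idea described above can be sketched with Python's standard thread pool standing in for worker nodes (a real system would use a framework such as Spark):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(chunk):
    """Work done by a single 'node': here, summing its share of the data."""
    return sum(chunk)

def distributed_sum(data, workers=3):
    # Split the task into smaller parts, one per worker node.
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_partition, chunks))
    return sum(partials)  # combine the partial results

print(distributed_sum(list(range(10))))  # 45
```

&lt;p&gt;Each worker only ever sees its own partition, which is what buys scalability – and also why consistency and coordination become the hard problems.&lt;/p&gt;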

</description>
    </item>
    <item>
      <title>Quick Dive-in on Data Warehouse Architecture.Targeting</title>
      <dc:creator>Elvis Mwangi</dc:creator>
      <pubDate>Sun, 27 Jul 2025 11:00:13 +0000</pubDate>
      <link>https://dev.to/elvis_mwangi_/quick-dive-in-on-data-warehouse-architecturetargeting-1egg</link>
      <guid>https://dev.to/elvis_mwangi_/quick-dive-in-on-data-warehouse-architecturetargeting-1egg</guid>
      <description>&lt;p&gt;Just like building a suspension bridge or a subway tunnel, a well-detailed blueprint goes a long way in easing the implementation of the project. Data architecture depends on components such as data sources and integration, Extract, Transform Load (ETL) processes, Data modelling, Data Storage, Data Access and Security and Data Governance. The components serve as pillars towards building and maintaining data warehouses across business intelligence environments.&lt;br&gt;&lt;br&gt;
Data sources pinpoint a digital location where numerous valid databases can be outsourced. Depending on the data formats, extracting quality and ensuring the system maintains consistent standards can be an issue. Reinforced data integration ensures that data warehouses accommodate different data types from diverse data sources. Standardisation and improved data accessibility generated from ETL processes promote consistency, ensuring a business’s objectives are met. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jsuimlqi0qaysj2xk06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1jsuimlqi0qaysj2xk06.png" alt="An image capturing big data" width="509" height="227"&gt;&lt;/a&gt;&lt;br&gt;
Optimising storage efficiency keeps a data warehouse functional, and analytical queries support the extraction and analysis of large volumes of historical data in a warehouse. Under modern data architecture, supported by tools such as Snowflake, dimension and fact tables offer a common structure for stored data. Data engineers use dimensional queries to filter and slice dimension tables housing descriptive attributes such as location and product name, while fact tables store measurement metrics and the foreign keys used to join tables during querying. Both dimension and fact tables can be arranged in a snowflake schema, providing a gateway to efficient analysis and reporting.&lt;br&gt;
Data warehouses are heavily relied upon in fields such as retail (inventory management and customer segmentation), manufacturing (quality control), healthcare (reducing operational risk), telecommunications (customer behavioural analysis) and commerce (forecasting and customer segmentation). &lt;/p&gt;
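&lt;p&gt;As a toy illustration of the fact/dimension split described above – hypothetical tables in plain Python dictionaries rather than Snowflake SQL:&lt;/p&gt;

```python
# Dimension table: descriptive attributes, keyed by product_id.
dim_product = {
    1: {"name": "Laptop", "location": "Nairobi"},
    2: {"name": "Phone", "location": "Mombasa"},
}

# Fact table: measurements plus foreign keys into the dimension.
fact_sales = [
    {"product_id": 1, "amount": 1200},
    {"product_id": 2, "amount": 300},
    {"product_id": 1, "amount": 800},
]

def sales_by_product(facts, dims):
    """Join facts to dimensions and aggregate a measure, as a query would."""
    totals = {}
    for row in facts:
        name = dims[row["product_id"]]["name"]
        totals[name] = totals.get(name, 0) + row["amount"]
    return totals

print(sales_by_product(fact_sales, dim_product))
```

&lt;p&gt;The fact table stays narrow and append-heavy while descriptive attributes live once in the dimension, which is the structural idea behind star and snowflake schemas.&lt;/p&gt;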

</description>
    </item>
  </channel>
</rss>
