
Kihoro Codes

What concepts should I master to be a Data Engineer?

As a new data engineering student, there are a number of concepts you need to grasp. These concepts will guide you in knowing exactly what to learn with respect to data engineering. Create a Notion page and gather all the resources available so you can track your progress as you learn.
i) Batch vs Streaming Ingestion
A Data Engineer implements the ETL (Extract, Transform, Load) process for their organization. Before extracting the data, they should have identified its sources and defined a procedure for collecting it. Data ingestion is the procedure a data engineer follows to collect data from all the different sources and organize it so that it can be processed for their specific organization.
Batch ingestion is when data is collected over a period of time, for example a minute, a week or a month; once it has all been gathered, it is processed together at the same time. It is suitable when dealing with very large datasets, for example collecting all the sales data of an e-commerce store at the end of each day.
Stream ingestion is when data is processed as soon as it arrives. It is highly recommended for critical data that requires immediate decision making; for example, a fraud detection system can use stream ingestion to flag fraud the moment it happens.
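
Below is a minimal sketch contrasting the two styles in Python; the sales records and the fraud threshold are made-up values for illustration.

```python
# A minimal sketch contrasting batch and streaming ingestion.
# The sales data and the fraud threshold are invented for illustration.

from datetime import datetime

# --- Batch ingestion: collect first, process everything together ---
sales_batch = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 80.5},
    {"order_id": 3, "amount": 42.0},
]

def process_batch(records):
    """Run once the whole window (e.g. a day) has been collected."""
    total = sum(r["amount"] for r in records)
    print(f"[{datetime.now():%H:%M:%S}] daily total: {total}")

process_batch(sales_batch)

# --- Stream ingestion: act on each record the moment it arrives ---
def on_transaction(txn):
    """Called per event; flags suspicious amounts immediately."""
    if txn["amount"] > 10_000:  # assumed threshold for illustration
        print(f"ALERT: possible fraud on order {txn['order_id']}")

for txn in [{"order_id": 4, "amount": 25_000.0}]:  # stands in for a live feed
    on_transaction(txn)
```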
ii) Change Data Capture (CDC)
This is a technique used to identify changes made to records in a source database and propagate them to downstream systems in real time. If and when a record changes, that change is captured and applied to the replicas, warehouses or caches that depend on it, keeping them in sync with low latency.
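
One simple way to implement CDC is query-based polling against an updated_at column, sketched below; the table and column names are assumptions for illustration, and production systems often read the database's transaction log instead (for example with a tool like Debezium).

```python
# A minimal sketch of query-based CDC: poll a source table for rows
# changed since the last sync, then apply them downstream. The table
# and columns are assumptions for illustration.

import sqlite3

src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
src.execute("INSERT INTO customers VALUES (1, 'Amina', '2024-01-02T10:00:00')")

last_synced = "2024-01-01T00:00:00"  # high-water mark from the previous run

changes = src.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_synced,),
).fetchall()

for row in changes:
    print("propagate to warehouse:", row)  # replica/warehouse apply goes here
```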

iii) Idempotency
This is the ability to ensure that if a process is repeated several times, the result does not change. A data engineer performs the ETL process, where the extracted data may come from CSV files. Implementing idempotency ensures that no matter how many times the same data is loaded, the output does not change and no duplicates are produced; instead, the pipeline recognizes that the data has already been loaded, avoiding data inconsistencies.
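Here is a minimal sketch of an idempotent load using Python and SQLite; the orders table and its rows are invented for illustration. Because each row carries a natural key, re-running the exact same load cannot create duplicates.

```python
# A minimal sketch of an idempotent load: rows carry a natural key, and
# the insert skips records that already exist, so re-running the same
# load produces no duplicates. The table is a stand-in for illustration.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

csv_rows = [(1, 120.0), (2, 80.5)]  # pretend these came from a CSV file

def load(rows):
    # INSERT OR IGNORE skips rows whose primary key was loaded before
    db.executemany("INSERT OR IGNORE INTO orders VALUES (?, ?)", rows)
    db.commit()

load(csv_rows)
load(csv_rows)  # accidental re-run: still exactly 2 rows
print(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 2
```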

iv) OLAP vs OLTP
OLAP (Online Analytical Processing) is a data processing approach that performs multidimensional analysis on large amounts of data at high speed. It is best used for analytical reporting such as financial analysis or forecasting future sales.
OLTP (Online Transactional Processing) powers most online transactions that are recorded in a database. These transactions are recorded in real time and can be performed by multiple users concurrently; examples include online bank transfers and flight bookings.
Most organizations use OLAP and OLTP together, as each contributes data that is necessary for the growth of the organization.
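
The sketch below contrasts the two query shapes on the same invented bookings table: small real-time writes (OLTP) versus a scan-and-aggregate query (OLAP). SQLite stands in for both systems purely for illustration.

```python
# A minimal sketch of the two query shapes on the same data: OLTP-style
# point writes versus an OLAP-style aggregation. The schema is invented
# for illustration.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, route TEXT, fare REAL)")

# OLTP: many small, concurrent, real-time transactions
db.execute("INSERT INTO bookings (route, fare) VALUES ('NBO-LON', 850.0)")
db.execute("INSERT INTO bookings (route, fare) VALUES ('NBO-LON', 910.0)")
db.execute("INSERT INTO bookings (route, fare) VALUES ('NBO-DXB', 430.0)")
db.commit()

# OLAP: scan and aggregate the whole table for analysis/forecasting
for route, total, avg in db.execute(
    "SELECT route, SUM(fare), AVG(fare) FROM bookings GROUP BY route"
):
    print(route, total, round(avg, 2))
```
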
v) Columnar vs Row-based Storage
Columnar databases organize data by field, which makes calculations and aggregations over the data easier. They allow for efficient data retrieval and analysis, since a query only reads the columns it needs.
Row-based databases organize data by rows, which makes reading and writing whole records easier. They are used mostly in transactional systems that perform frequent CRUD (Create, Read, Update and Delete) operations.
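
The following sketch mimics the two layouts with plain Python structures; the tiny product table is invented for illustration.

```python
# A minimal sketch of the two layouts: row-based keeps whole records
# together; columnar keeps each field together, so an aggregation
# touches only the column it needs. Data is invented for illustration.

rows = [  # row-based: one object per record (good for CRUD)
    {"id": 1, "product": "mug", "price": 9.5},
    {"id": 2, "product": "pen", "price": 1.2},
]

columns = {  # columnar: one array per field (good for analytics)
    "id": [1, 2],
    "product": ["mug", "pen"],
    "price": [9.5, 1.2],
}

# Row store: fetch a full record in one place
print(next(r for r in rows if r["id"] == 2))

# Column store: sum prices without reading id or product at all
print(sum(columns["price"]))
```
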
vi) Data Partitioning
This is the process of taking large volumes of data and subdividing it into smaller, more manageable datasets using a suitable criterion, often a column. The smaller datasets created are known as partitions, and they allow for efficient data filtering. Identifying the right column for partitioning is key; it is recommended to use one with distinct, evenly distributed values.
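
Below is a minimal sketch of partitioning by a column in Python; the sales records and the country partition key are assumptions for illustration.

```python
# A minimal sketch of partitioning records by a column with distinct,
# evenly spread values (here, country). Real pipelines typically write
# each partition to its own path, e.g. sales/country=KE/part-0.parquet.

from collections import defaultdict

sales = [  # invented records for illustration
    {"order_id": 1, "country": "KE", "amount": 120.0},
    {"order_id": 2, "country": "UG", "amount": 80.5},
    {"order_id": 3, "country": "KE", "amount": 42.0},
]

partitions = defaultdict(list)
for record in sales:
    partitions[record["country"]].append(record)  # partition key: country

# A query filtered on country now only scans one partition
for key, part in partitions.items():
    print(f"country={key}: {len(part)} record(s)")
```
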
vii) ETL vs ELT
ETL is the major job of a data engineer. It stands for Extract, Transform and Load. Extracting is the process of getting the data from different sources. Transforming involves cleaning and enriching the data, for example standardizing field names or deriving new fields. Loading is placing the transformed data into the tools used by the data analysts.
In ETL, the data is transformed before it is loaded, so only the cleaned data lands in the DBMS the analysts use. In ELT, the raw data is loaded into a data warehouse first and then transformed there, typically using SQL, into the shape the analysts need.
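
The sketch below walks through a tiny ETL pipeline in Python: extract rows from a CSV source, transform them before loading, then load them into the analysts' database. The CSV content and schema are invented for illustration; in ELT, the raw rows would be loaded first and the same transformation expressed in SQL inside the warehouse.

```python
# A minimal end-to-end ETL sketch: extract from a CSV source, transform
# in Python *before* loading, then load the clean rows into the target
# database. The CSV content and schema are invented for illustration.

import csv
import io
import sqlite3

raw = "order_id,amount\n1,120.0\n2,80.5\n"  # Extract: CSV source

def transform(row):
    """Clean one record: cast types and round the amount."""
    return (int(row["order_id"]), round(float(row["amount"]), 2))

clean = [transform(r) for r in csv.DictReader(io.StringIO(raw))]  # Transform

warehouse = sqlite3.connect(":memory:")  # Load target
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", clean)
print(warehouse.execute("SELECT * FROM orders").fetchall())
```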
