In the course of their work, data engineers come across pipelines built with different technologies, and they need to understand them. To carry out their duties effectively, data engineers must have a basic knowledge of data storage, analytics, and pipelines.
Databases and data warehouses
A database is made up of one or more tables of related data. Rapid growth in the business sector has driven the design of tools that bring different databases together for data analysis. To produce analytics reports spanning several databases, data from those databases is ingested into a central point. A data warehouse is a tool that allows the ingestion of structured data from different databases. Before entering a data warehouse, the data undergoes processes such as validation, preprocessing, and transformation. Warehouses, however, struggle to hold current-era business data, because businesses also need to handle unstructured and semistructured data.
Handling big and unstructured datasets
Unstructured and semistructured datasets come from digital platforms such as IoT sensors, social media, web and mobile applications, and video and audio platforms. These platforms generate data at far higher velocity and volume than structured data sources. The challenges of handling such datasets created the need for big data technology platforms. One such technology is the open-source Hadoop framework, which originated in the mid-2000s. Hadoop was designed to process large datasets on a cluster of computers, storing data in a distributed file system called the Hadoop Distributed File System (HDFS). Hadoop distributions are offered by providers such as IBM, MapR, and Cloudera, and they bundle distributed data processing frameworks such as Hive, Spark, and MapReduce.
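As a minimal sketch of what distributed processing on such a cluster can look like, the snippet below uses PySpark (one of the frameworks mentioned above) to aggregate a large semistructured JSON dataset; the file path and column names are assumptions made purely for illustration.

```python
# Minimal PySpark sketch: reading a large JSON dataset and aggregating it
# across the cluster. The HDFS path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("clickstream-aggregation")
    .getOrCreate()
)

# Spark splits the input files into partitions and processes them in
# parallel on the worker nodes of the cluster.
events = spark.read.json("hdfs:///data/raw/clickstream/*.json")

# Count events per source platform (e.g. web, mobile, IoT).
counts = events.groupBy("source").agg(F.count("*").alias("event_count"))
counts.show()

spark.stop()
```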
Benefits of public cloud infrastructure
• On-demand capacity
• Elastic, virtually limitless scaling
• A global footprint
• A cost model based on usage
• Freedom from hardware management
In 2013, AWS made Amazon Redshift generally available, providing a data warehouse as a cloud-native service.
Data marts
A repository containing well-structured, curated, and trusted datasets is termed an enterprise data warehouse (EDW). To measure business performance, business users analyze the data in the warehouse, which is organized around business subjects such as products, sales, and customers. A data warehouse has four main components:
• Enterprise data warehouse: hosts the data assets, such as current and historical datasets.
• Source systems: the data sources, such as ERP and CRM systems.
• ETL pipelines: load data into the warehouse.
• Data consumers: applications used to consume data from the warehouse.
Parallel processing
Amazon Redshift clusters contain several compute resources.
Each Redshift cluster has two types of nodes:
• One leader node, which interfaces client applications with the compute nodes.
• Multiple compute nodes, which store the warehouse data and run queries in parallel. Each compute node has its own memory and processors, separate from the others.
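As a rough illustration of how a client application interacts with this architecture, the sketch below submits a SQL statement to a Redshift cluster through the Redshift Data API (boto3): the leader node receives the statement and the compute nodes execute it in parallel. The cluster identifier, database, user, and table names are placeholders, not real resources.

```python
# Hypothetical sketch: submitting a query to an Amazon Redshift cluster
# through the Redshift Data API. All identifiers below are placeholders.
import boto3

client = boto3.client("redshift-data", region_name="af-south-1")

# The leader node receives the statement, builds a parallel execution plan,
# and distributes the work to the compute nodes.
response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",   # placeholder
    Database="analytics",                      # placeholder
    DbUser="analyst",                          # placeholder
    Sql="SELECT product_id, SUM(amount) AS total_sales "
        "FROM sales GROUP BY product_id;",
)

print("Statement submitted, id:", response["Id"])
```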
Dimensional models in data warehouses
In warehouses, data is stored in relational tables. The two common dimensional models in data warehouses are:
• Star
• Snowflake
Dimensional models make it easy to filter and retrieve relevant data.
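To make the star model concrete, here is a small illustrative sketch using pandas: a fact table of sales is joined to product and customer dimension tables so that records can be filtered by any dimension attribute. All tables, columns, and values are invented for the example.

```python
# Illustrative star schema: one fact table joined to two dimension tables.
# All data and column names are invented for this example.
import pandas as pd

# Dimension tables describe business entities.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "category": ["laptop", "phone"],
})
dim_customer = pd.DataFrame({
    "customer_id": [10, 11],
    "region": ["EMEA", "APAC"],
})

# The fact table holds measurements plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "customer_id": [10, 11, 11],
    "amount": [1200.0, 650.0, 1150.0],
})

# Joining along the "points" of the star lets us filter by any dimension.
sales = (
    fact_sales
    .merge(dim_product, on="product_id")
    .merge(dim_customer, on="customer_id")
)
print(sales[sales["category"] == "laptop"]["amount"].sum())
```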
Data marts are built around a single business subject, such as marketing, finance, or sales. Data marts are created using either a top-down or a bottom-up approach.
How data is fed into the warehouse
Organizations bring data from different sources into the warehouse through data pipelines. Data pipelines are designed to serve the following purposes:
• Extracting data from the source
• Transforming the data through validation, cleaning, and standardization
• Loading the transformed data into the enterprise warehouse
There are two types of pipelines:
• Extract, load, transform (ELT) pipelines
• Extract, transform, load (ETL) pipelines
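A deliberately simplified ETL sketch in Python is shown below: it extracts rows from a source database, transforms them through validation and standardization, and loads the result into a warehouse table. The connection strings, table names, and cleaning rules are assumptions for illustration only.

```python
# Minimal ETL sketch (extract -> transform -> load). Connection strings,
# table names, and cleaning rules are placeholders for illustration.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@source-db/crm")      # placeholder
warehouse = create_engine("postgresql://user:pass@warehouse/edw")   # placeholder

# Extract: pull raw records from the source system.
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: validate, clean, and standardize the data.
orders = orders.dropna(subset=["order_id", "amount"])     # drop invalid rows
orders["currency"] = orders["currency"].str.upper()       # standardize codes
orders["amount"] = orders["amount"].astype(float)

# Load: write the transformed data into the warehouse table.
orders.to_sql("fact_orders", warehouse, if_exists="append", index=False)
```

In an ELT pipeline, the load step would run before the transformation, with the transformation pushed down into the warehouse itself.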
Data lake
As stated earlier, warehouses are suitable for handling structured datasets. However, businesses also need insights from semistructured and unstructured datasets such as HTML, JSON, social media content, and images, for example to feed specialized machine learning tools. Data lakes can handle all kinds of data, whether structured, semistructured, or unstructured, and they handle far larger datasets than warehouses.
Data lake architecture
A data lake has five layers:
• Storage layer: This layer sits at the center of the data lake architecture and provides virtually unlimited, low-cost storage. It has three main zones, each with a specific purpose (a sketch of how these zones map to storage follows this list):
Landing zone: Also known as the raw zone. This is the zone where the ingestion layer writes data from the data sources. The landing zone permanently stores the raw data as received from the source.
Clean zone: Also called the transform zone. Data in the clean zone is stored in optimized formats.
Curated zone: Also called the enriched zone. Data in the curated zone is optimized and cataloged for the consumption layer.
• Catalog and search layer: Data lakes contain huge structured, semistructured, and unstructured datasets from sources both inside and outside an organization. Different departments use the datasets in the data lake for different purposes, so users need a way to search the available schemas. The catalog and search layer provides metadata about the hosted data.
• Ingestion layer: This layer connects to different data sources. Data from the ingestion layer is forwarded to the storage layer.
• Processing layer: Data from the storage layer is processed here to make it ready for consumption. Together, the components of the ingestion and processing layers form the lake's ELT pipelines.
• Consumption layer: The consumer utilizes the processed data through techniques such as interactive query processing and machine learning, among others.
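As a rough sketch of how these layers and zones can map onto object storage, the snippet below moves one dataset through the landing, clean, and curated zones of a hypothetical S3-based data lake; the bucket name, prefixes, file names, and formats are all assumptions for the example.

```python
# Hypothetical layout of a data lake on Amazon S3: the ingestion layer lands
# raw files in the landing zone, and the processing layer writes optimized
# copies to the clean and curated zones. Bucket and prefixes are placeholders.
import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket = "example-data-lake"   # placeholder

# Ingestion layer: land the raw file exactly as received from the source.
s3.upload_file("events.json", bucket, "landing/web/2024-05-01/events.json")

# Processing layer: clean the data and store it in an optimized format.
events = pd.read_json("events.json", lines=True)
events = events.dropna(subset=["user_id"])
events.to_parquet("events.parquet")
s3.upload_file("events.parquet", bucket, "clean/web/2024-05-01/events.parquet")

# Curated zone: aggregated, catalog-ready data for the consumption layer.
daily = events.groupby("page").size().reset_index(name="views")
daily.to_parquet("daily_views.parquet")
s3.upload_file("daily_views.parquet", bucket,
               "curated/web/2024-05-01/daily_views.parquet")
```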
In the end, I tried to create a simple pipeline in AWS. I named it erick254 and selected the Africa (Cape Town) Region as the location.