Kanin James Kearpimy


[Short-Note] Data Engineering 101

I have been participating in the Data Engineering Accelerated Program for 3 days, and this is the overview I took away from it. I skipped technical aspects like coding and pipelines; another article would be better for those :)

The program is still ongoing and the remaining 5 days are waiting for me!

History of data engineering

An organization wants to put its data to use, for example in a dashboard. To build the dashboard, an operator needs to pull data from the sources (on the left), clean it, and put it on the dashboard for the analytics users (on the right).

When the data sources grow bigger, a manual operator can no longer keep up. Data engineering comes in to automate the task.

Data engineering replaces the operator

Extract, Transform, Load (ETL)


"We really know how consumers use the data."

Pull data from the source, clean and transform it, then save it in storage.
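To make the three steps concrete, here is a minimal sketch in Python using only the standard library. The CSV content, the `orders` table, and the function names are all made up for illustration, and an in-memory SQLite database stands in for real storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (assumed for illustration).
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,
3,alice,4.50
"""

def extract(raw_text):
    """Extract: read rows from the source."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: drop rows with a missing amount and normalize names."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().lower(), float(amount))
        )
    return cleaned

def load(rows, conn):
    """Load: save only the cleaned rows in storage."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(raw_text, conn):
    load(transform(extract(raw_text)), conn)

conn = sqlite3.connect(":memory:")  # stand-in for a real database
run_etl(RAW_CSV, conn)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The point of the sketch is the order: cleaning happens before anything is stored, so storage only ever holds data shaped for the use we already know about.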

Extract, Load, Transform (ELT)

"Let the consumer choose."

Pull the data and store it in a database first, then let consumers choose which data they want to use, pull it from our storage, and transform it for their own needs.

It speeds up time to market in some use cases.
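As a rough illustration of the ELT order, the sketch below loads raw records into a table untouched and leaves the transformation to a SQL view defined later. The `raw_orders` table, the view name, and the sample rows are assumptions for this example, and in-memory SQLite stands in for a real database.

```python
import sqlite3

# Hypothetical raw records pulled from a source (assumed for illustration).
RAW_ROWS = [
    (1, " alice ", "10.50"),
    (2, "BOB", ""),
    (3, "alice", "4.50"),
]

conn = sqlite3.connect(":memory:")

# Extract + Load: store the data exactly as it arrived, with no cleaning.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount TEXT)"
)
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ROWS)
conn.commit()

# Transform (later, by the consumer): each consumer defines the view it needs.
conn.execute("""
    CREATE VIEW revenue_by_customer AS
    SELECT LOWER(TRIM(customer)) AS customer,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    WHERE amount <> ''
    GROUP BY LOWER(TRIM(customer))
""")

revenue = conn.execute(
    "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
).fetchall()

# A different consumer can ask a different question of the same raw data.
incomplete = conn.execute(
    "SELECT COUNT(*) FROM raw_orders WHERE amount = ''"
).fetchone()[0]

print(revenue, incomplete)
```

Because nothing was thrown away at load time, the second query (counting incomplete records) is still possible, which would not be the case if the rows had been cleaned before storing.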

The whole process of data engineering in an organization

Data Engineer with ETL/ELT.

Data engineers write scripts (Python, Scala) to automate:
1) pulling data from the sources,
2) cleaning it,
3) loading it into a database.

The dashboard then uses that data for its visualizations.

Data Warehouse and Data Lake.

A data warehouse is a type of database that is optimized to store, search, and query huge amounts of structured, semi-structured, and (sometimes) unstructured data.

A typical dashboard can only show fixed data to the user. To get deeper insights that help with decision making, we need data scientists. However, the current pipeline can't give data scientists suitable data to explore.

We need a place that stores raw data, without preprocessing, that data scientists can pull from. It is called a Data Lake.
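A tiny sketch of the idea, assuming a folder layout of `source/day/part-0000.jsonl` that I made up for illustration: records are written exactly as received, and whoever explores them later decides how to interpret them. A temporary directory stands in for real object storage, and the source name, dates, and records are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def land_raw(lake_root, source, day, records):
    """Write records untouched as JSON Lines under source/day (hypothetical layout)."""
    target_dir = Path(lake_root) / source / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return target

def read_raw(lake_root, source):
    """Read every raw record for one source, across all days."""
    records = []
    for path in sorted(Path(lake_root).glob(f"{source}/*/*.jsonl")):
        with path.open(encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records

lake = tempfile.mkdtemp()  # stands in for object storage in this sketch
land_raw(lake, "clickstream", "2024-01-01",
         [{"user": "a", "page": "/home", "extra": {"ref": "ad"}}])
land_raw(lake, "clickstream", "2024-01-02",
         [{"user": "b", "page": "/cart"}])

events = read_raw(lake, "clickstream")
print(len(events), events[0]["extra"])
```

Note that the two records do not share the same fields. Nothing forces a schema at write time, which is what lets a data scientist explore fields the dashboard pipeline never kept.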

OLTP & OLAP

Online Transactional Processing (OLTP)

OLTP focuses on transactions in a database. Data is read, written, and updated frequently, so it relies heavily on fast processing.

Use cases:
  • Banking
  • Shopping
  • Retail scanning

Online Analytical Processing (OLAP)

OLAP, on the other hand, focuses on large-volume, high-dimensional data from the data warehouse.

Use cases:
  • Data analytics
  • Machine learning

I think the two are mainly differentiated by database architecture.
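To contrast the two workloads, here is a small sketch with made-up `accounts` and `sales` tables. SQLite is used for both sides only to keep the example runnable; in practice OLTP and OLAP workloads usually run on different systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "book", 10.0), ("north", "pen", 2.0),
    ("south", "book", 12.0), ("south", "book", 8.0),
])
conn.commit()

def transfer(conn, src, dst, amount):
    """OLTP-style: a short transaction touching a few rows, all-or-nothing."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )

def sales_by_region_and_product(conn):
    """OLAP-style: scan many rows and aggregate along dimensions."""
    return conn.execute(
        "SELECT region, product, SUM(amount) FROM sales "
        "GROUP BY region, product ORDER BY region, product"
    ).fetchall()

transfer(conn, 1, 2, 25.0)
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
report = sales_by_region_and_product(conn)
print(balances)
print(report)
```

The banking-style `transfer` changes two rows and must finish quickly and safely, while the report reads every sales row and groups it by region and product, which is the kind of query a warehouse is built for.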

Culture and organization shift

Data democratization

Core concept: enable everyone in the organization to access data. It affects people's mindset, the organization's culture, and the tools everyone uses to access data.
