Data engineering is not a new concept or a new career path. It's an old one. The idea at the core of data engineering, and what it's all about, has been around for a long time. The only difference is that it is now being redesigned, redeveloped, and reworked.
Think of full-stack developers. A full-stack developer is just a back-end and a front-end developer merged together. Likewise, if you have ever worked with databases, or know people who are DBAs, SQL developers, or ETL developers, then you have seen data engineering before. You just saw it in a different light.
Data engineering, in simple terms, consists of moving data from one place to another. To dive a little deeper, the most important part is the reasoning behind why we need data engineering in the first place.
Let's start with ETL. ETL sounds confusing at first, but we are going to start with the basics.
E stands for Extract, meaning to pull or withdraw data from different sources. A source could be data from ESPN, Yahoo News, CNN, CBS, etc.
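As a rough illustration, here is what an extract step might look like in Python. The URL and the JSON response are made up for the example, not a real endpoint from any of the outlets above.

```python
import requests

def extract_headlines(feed_url):
    """Pull raw records out of a source system over HTTP."""
    # feed_url is a hypothetical JSON endpoint; a real source could just as
    # easily be a database, a CSV export, or a third-party news API.
    response = requests.get(feed_url, timeout=10)
    response.raise_for_status()
    return response.json()

# Example usage with a made-up URL:
# raw_articles = extract_headlines("https://example.com/api/headlines")
```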
T stands for Transform. When you first extract data from these different sources, it may be unstructured, slow to query, or spread across sources that need to be joined into one table. The transform step is where you clean it up and reshape it.
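For example, a minimal transform step using pandas might look like the sketch below. The column names and the join key are invented purely for illustration.

```python
import pandas as pd

def transform_articles(raw_articles, sources_df):
    """Clean up raw records and join them with a lookup table."""
    df = pd.DataFrame(raw_articles)

    # Basic cleanup: drop rows missing a title and normalize the text.
    df = df.dropna(subset=["title"])
    df["title"] = df["title"].str.strip()

    # Join the articles with a table of source metadata (hypothetical columns).
    return df.merge(sources_df, on="source_id", how="left")
```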
L stands for Load. This is the final part: storing the transformed data somewhere, such as a data lake or a data warehouse.
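To make the load step concrete, here is a minimal sketch that writes the transformed table into a local SQLite database. SQLite is only a stand-in so the example stays self-contained; in practice the destination would be a data lake or warehouse.

```python
import sqlite3
import pandas as pd

def load_articles(df: pd.DataFrame, db_path: str = "warehouse.db") -> None:
    """Store the transformed data in a database table."""
    # Replace the existing table so repeated runs stay idempotent.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("articles", conn, if_exists="replace", index=False)
```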
You can learn more about data lakes and data warehouses here.
You might store the data in Google Cloud's BigQuery, Azure SQL Database, or AWS Redshift. That's ETL. Congrats! You now know what ETL is, whether you always wanted to know or you want to teach someone else.
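If the destination were BigQuery, the load might look roughly like this. The project, dataset, and table names are placeholders, and this assumes the google-cloud-bigquery package is installed and credentials are configured.

```python
from google.cloud import bigquery

def load_to_bigquery(df, table_id="my-project.news.articles"):
    """Load a pandas DataFrame into a BigQuery table (placeholder table_id)."""
    client = bigquery.Client()
    job = client.load_table_from_dataframe(df, table_id)
    job.result()  # Wait for the load job to finish.
```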
There are also some additional tools around ETL that you will hear about, such as Apache Airflow, Luigi, AWS Glue, and Apache Spark. These are all tools that are either part of the ETL process itself or make ETL pipelines more efficient at doing their job.
For example, Apache Airflow is used for scheduling, essentially creating cron-like jobs that say something like "update this table of users in this particular database every day at 6am Pacific time". There is a lot more to data engineering, but these are the bare bones to get you started.
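As a sketch, a daily Airflow job like the one described above might look something like this. The DAG id, task name, and the update_users function are placeholders, and the exact imports can vary between Airflow versions.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def update_users():
    # Placeholder for the real work: refresh the users table in the target database.
    print("Updating the users table...")

with DAG(
    dag_id="update_users_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 6 * * *",  # Every day at 6:00 in the DAG's timezone
                                    # (set the timezone to US/Pacific if needed).
    catchup=False,
) as dag:
    PythonOperator(task_id="update_users", python_callable=update_users)
```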
I hope this tutorial was informative and gave you a general idea of what data engineering is. Stay tuned for my next data engineering article, where I am going to teach you how to write your own ETL pipeline in Python.