Introduction to Data Engineering

Kimeu — Sun, 04 Sep 2022 16:36:26 +0000

Data engineering is a discipline that entails collecting, translating and validating data for analysis. A good data engineer makes quality data available for analysis and data-driven decision making. There are four areas a data engineer should be well versed in:
Data. Data comes in different file formats, for example CSV, TSV and JSON.
Data stores and repositories. These include relational and non-relational databases, data lakes and data warehouses.
Data pipelines. These entail gathering data from different sources.
Analytics and data-driven decision making.
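To make the file formats from the list above concrete, here is a minimal sketch that reads the same two records from CSV, TSV and JSON text using only the Python standard library. The sample names and values are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in three formats.
csv_text = "name,age\nAmina,31\nBrian,27\n"
tsv_text = "name\tage\nAmina\t31\nBrian\t27\n"
json_text = '[{"name": "Amina", "age": "31"}, {"name": "Brian", "age": "27"}]'

# CSV: fields are separated by commas.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# TSV: the same reader works, with a tab as the delimiter.
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# JSON: parsed straight into Python lists and dictionaries.
json_rows = json.loads(json_text)

# All three formats carry the same records once parsed.
print(csv_rows == tsv_rows == json_rows)
```

Note that the CSV and TSV readers return every value as text, which is why the ages in the JSON sample are written as strings too.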
Python is a preferred programming language for data engineering because it has a wide variety of packages that are easy to import and that simplify data wrangling, ETL (Extract, Transform, Load) and feature engineering.
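As a rough illustration of what ETL means in practice, the sketch below extracts rows from CSV text, transforms them by dropping incomplete rows and converting types, and loads them into an in-memory SQLite table. The function names, the table name and the sample data are all assumptions made for this example, not part of any particular tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
raw_csv = "city,temp_c\nNairobi,24.5\nMombasa,\nKisumu,29.0\n"

def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing temperature and convert types."""
    return [(row["city"], float(row["temp_c"])) for row in rows if row["temp_c"]]

def load(records, connection):
    """Load: write the cleaned records into a SQLite table."""
    connection.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
    connection.executemany("INSERT INTO weather VALUES (?, ?)", records)
    connection.commit()

connection = sqlite3.connect(":memory:")  # in-memory database for the demo
load(transform(extract(raw_csv)), connection)

loaded = connection.execute("SELECT city, temp_c FROM weather ORDER BY city").fetchall()
print(loaded)
```

Keeping the three steps as separate functions makes it easy to swap the source or the destination later without touching the cleaning logic.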

Introduction to Python for Data Engineering

Kimeu — Wed, 31 Aug 2022 10:57:09 +0000

Python is one of the best programming languages for data analysis thanks to a variety of packages, e.g. pandas and NumPy, that make it efficient.
To become an expert in data engineering, one needs knowledge of both software development and data analysis.
Python works well for data analysis because Python code can be run interactively in a Jupyter notebook.
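For example, to change the data type of a column to integer, assign the converted column back, since astype returns a new column rather than changing the existing one:
df['colName'] = df['colName'].astype(int)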
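A slightly fuller, runnable version of the conversion above might look like the sketch below. It assumes pandas is installed, and the DataFrame contents are made up for illustration.

```python
import pandas as pd

# Hypothetical DataFrame whose numeric column was read in as text.
df = pd.DataFrame({"colName": ["1", "2", "3"]})

# astype returns a new Series, so assign it back to keep the change.
df["colName"] = df["colName"].astype(int)

# Arithmetic now works on numbers instead of concatenating text.
total = df["colName"].sum()
print(df["colName"].dtype, total)
```

The exact integer dtype printed can vary by platform, so it is safer to check it with pd.api.types.is_integer_dtype than to compare against a fixed name.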
Data analysis is made easier by Jupyter Notebook, an app in which you can import packages and easily perform operations on a collected dataset to extract meaning from it.
One has to understand how pandas data types, as displayed in a Jupyter notebook, differ from Python data types.
pandas has traditionally stored string columns with the object dtype, while plain Python stores them as str.
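The difference can be checked directly. The sketch below, which assumes pandas is installed and uses made-up sample values, compares the dtype pandas reports for a column of text with the type of a single value taken from it. The reported column dtype depends on the pandas version: older releases show object, while newer ones may show a dedicated string dtype.

```python
import pandas as pd

names = pd.Series(["Amina", "Brian"])  # hypothetical sample values

column_dtype = names.dtype   # how pandas labels the whole column
first_value = names.iloc[0]  # a single element taken from the column

print(column_dtype)          # version dependent, so no output is shown here
print(type(first_value))     # each individual value is still a Python str
```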
During data collection, it is advisable to get data through an API rather than by web scraping. With web scraping, the underlying HTML structure of a page can change, so one may not be able to reproduce the same results when collecting the dataset again.
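Here is a minimal sketch of the API approach using only the standard library. The helper names, the payload shape and the URL in the final comment are hypothetical; a real API will document its own endpoints and fields.

```python
import json
import urllib.request

def fetch_json(url):
    """Request a URL and decode the JSON body of the response."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))

def to_records(payload):
    """Keep only the fields needed for analysis from each item."""
    return [{"id": item["id"], "title": item["title"]} for item in payload["items"]]

# Hypothetical response body, shaped like what fetch_json might return.
sample_payload = {"items": [{"id": 1, "title": "First post", "extra": "ignored"}]}
records = to_records(sample_payload)
print(records)

# With a real endpoint (the URL below is a placeholder, not a real API):
# records = to_records(fetch_json("https://api.example.com/posts"))
```

Because the API returns named fields, the parsing step does not depend on page layout, which is what makes the results easier to reproduce than with scraped HTML.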
To install Python packages in any environment, use "pip install package-name". To install packages in a conda environment, use "conda install package-name".
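After installing, one way to confirm that a package is visible to the current environment is to ask Python whether it can find the module. This is a small sketch using the standard library; is_installed is a helper name invented for this example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))                          # part of the standard library
print(is_installed("a_module_that_does_not_exist"))  # made-up name for the demo
```

Keep in mind that the import name can differ from the name used when installing, so the module name to check is the one that appears in the import statement.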

DEV Community: Kimeu
