<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lorna Munanie</title>
    <description>The latest articles on DEV Community by Lorna Munanie (@lornam12).</description>
    <link>https://dev.to/lornam12</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1174038%2Fa43b6ea2-a366-4cea-bff3-09394bbec0ec.jpeg</url>
      <title>DEV Community: Lorna Munanie</title>
      <link>https://dev.to/lornam12</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lornam12"/>
    <language>en</language>
    <item>
      <title>Data Engineering for beginners. A step by step guide</title>
      <dc:creator>Lorna Munanie</dc:creator>
      <pubDate>Fri, 10 Nov 2023 05:39:48 +0000</pubDate>
      <link>https://dev.to/lornam12/data-engineering-for-beginners-a-step-by-step-guide-20ka</link>
      <guid>https://dev.to/lornam12/data-engineering-for-beginners-a-step-by-step-guide-20ka</guid>
      <description>&lt;p&gt;The rapid growth of big data has led to an increase in demand for real-time data processing and analytics. Data engineers play a huge role in designing and implementing data pipelines, the paths data travels from input to storage. A data engineer is a professional responsible for building storage solutions for huge amounts of data.&lt;br&gt;
Data engineering, on the other hand, is the process of designing and implementing systems that collect and analyze data so as to get insights and understand trends and patterns in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Roles of a data engineer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Extracting data from data sources - Data comes from different sources, e.g. databases and external APIs, among others. A data engineer therefore integrates data from these sources into a centralized data store.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Preparing data for analysis - Data engineers are responsible for processing the data by applying transformations, cleaning, and validation to make it ready for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designing data pipelines - A data pipeline is the path data travels from input to storage. Data engineers are responsible for designing and implementing data pipelines to extract, transform, and load (ETL) data from various sources into a centralized data repository.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
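
&lt;p&gt;The ETL flow described above can be sketched in plain Python; the source data and field names here are made up for illustration:&lt;/p&gt;

```python
import csv
import io

# Hypothetical raw export from one data source (illustrative only).
RAW = """name,amount
 alice ,10.5
 bob ,7.25
"""

def extract(text):
    """Extract: parse rows out of a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: strip whitespace and cast numeric fields."""
    return [{"name": r["name"].strip(), "amount": float(r["amount"])} for r in rows]

def load(rows, store):
    """Load: append cleaned rows into a central store (a list stands in for a warehouse)."""
    store.extend(rows)
    return store

warehouse = load(transform(extract(RAW)), [])
print(warehouse[0]["name"])  # alice
```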

&lt;p&gt;&lt;strong&gt;Step-by-step guide&lt;/strong&gt;&lt;br&gt;
Step 1: Master the basics&lt;/p&gt;

&lt;p&gt;Mastering the fundamentals of data engineering is the first step. As a data engineer it is advisable to have a strong foundation in a programming language such as Python and in databases such as MySQL/PostgreSQL, and to understand data modelling, which helps in structuring data in a logical manner.&lt;/p&gt;
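
&lt;p&gt;To get a feel for the database side, Python's built-in sqlite3 module lets you practice SQL without installing a MySQL/PostgreSQL server; the table and data below are illustrative:&lt;/p&gt;

```python
import sqlite3

# An in-memory SQLite database stands in for MySQL/PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("Lorna",))
row = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()
print(row[0])  # Lorna
```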

&lt;p&gt;Step 2: Data manipulation and transformation&lt;/p&gt;

&lt;p&gt;Data originates from different sources, so a data engineer is responsible for extracting, transforming, and loading (ETL) it, as well as cleaning it to make it ready for analysis.&lt;/p&gt;

&lt;p&gt;Step 3: Getting insights and patterns from data&lt;/p&gt;

&lt;p&gt;Data engineers should be familiar with various tools for visualizing data, such as Tableau and Power BI, so as to draw out patterns and insights from the given data.&lt;/p&gt;

&lt;p&gt;Step 4: Building data pipelines&lt;/p&gt;

&lt;p&gt;Next, you design and implement the data pipeline through which the data will travel from input to storage; a data pipeline acts as a highway for the data. Orchestration tools such as Apache Airflow help ensure a smooth flow of the data.&lt;/p&gt;
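
&lt;p&gt;A scheduler such as Apache Airflow runs pipeline tasks as a DAG, i.e. in dependency order. The sketch below is plain Python, not real Airflow code; it only shows the idea of tasks declaring what must run before them:&lt;/p&gt;

```python
# Each task names the tasks that must complete before it (a tiny DAG).
TASKS = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def run(tasks):
    """Run tasks in topological (dependency) order, as an orchestrator would."""
    done, order = set(), []
    while len(done) != len(tasks):
        for name, deps in tasks.items():
            if name not in done and deps.issubset(done):
                order.append(name)  # a real scheduler would execute the task here
                done.add(name)
    return order

order = run(TASKS)
print(order)  # ['extract', 'transform', 'load']
```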

&lt;p&gt;Step 5: Data warehousing and data modeling&lt;/p&gt;

&lt;p&gt;A data warehouse is a storage system for huge amounts of data, while data modeling involves organizing data in a logical manner, which helps ensure efficiency and consistency throughout the data lifecycle. This is commonly achieved with star and snowflake schemas.&lt;/p&gt;
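
&lt;p&gt;As a sketch of a star schema, here is a hypothetical sales model, one central fact table referencing two dimension tables, built with Python's built-in sqlite3:&lt;/p&gt;

```python
import sqlite3

# Illustrative star schema: fact_sales sits at the center and points at
# the surrounding dimension tables (all names here are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (
    sale_id    INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)  # ['dim_date', 'dim_product', 'fact_sales']
```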

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Data engineering is a critical field that empowers organizations to harness the full potential of their data. As a data engineer you need to be familiar with the basics such as programming and data manipulation (ETL), know how to use visualization tools such as Tableau or Power BI, build pipelines, and understand how to structure data in a logical manner.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Complete Guide to Time Series Forecasting</title>
      <dc:creator>Lorna Munanie</dc:creator>
      <pubDate>Fri, 03 Nov 2023 09:07:20 +0000</pubDate>
      <link>https://dev.to/lornam12/complete-guide-to-time-series-forecasting-1lcl</link>
      <guid>https://dev.to/lornam12/complete-guide-to-time-series-forecasting-1lcl</guid>
      <description>&lt;p&gt;Time series forecasting involves analyzing data that evolves over some period of time and then utilizing statistical models to make predictions about future patterns and trends in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Characteristics of time series data&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Temporal Ordering - Time series data is ordered chronologically, with each observation occurring after the previous one. This ordering is essential for analyzing trends and patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time Dependency - In a time series, each observation is influenced by the preceding observations, creating a sequential relationship where the value at a given time depends on the values that occurred before it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Irregular Sampling - Analyzing and forecasting time series data can be challenging when there are irregular or uneven time intervals between observations. Dealing with missing or irregularly spaced data points necessitates the use of suitable techniques.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Components of time series&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Trend - This represents the long-term direction or tendency of the data. It captures the overall upward or downward movement over time. Trends can be linear (constant increase or decrease) or nonlinear (curved or oscillating).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seasonality - Refers to patterns that repeat at fixed intervals within a time series. These patterns can be daily, weekly, monthly, or yearly. External factors such as weather conditions, holidays, or economic cycles often have an impact on seasonality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Noise(random fluctuations/ irregularities) - Represents the unpredictable and random variations in the data and includes factors that cannot be explained by trend or seasonality. Measurement errors, random events, or unidentified factors can contribute to the presence of noise in the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Commonly used time series models&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Moving Average (MA) Model - This model calculates the average of past observations with the aim of predicting future values. It is useful for capturing short-term fluctuations and random variations in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Autoregressive (AR) Model - This model predicts future values based on a linear combination of past observations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Autoregressive Moving Average (ARMA) Model - The ARMA model combines the AR and MA models to capture both short-term and long-term patterns in the data. It is effective for analyzing stationary time series data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Autoregressive Integrated Moving Average (ARIMA) Model - This model extends the ARMA model by incorporating differencing to handle non-stationary data. It is suitable for data with trends or seasonality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Seasonal ARIMA (SARIMA) Model - This model is an extension of the ARIMA model and includes seasonal components. It is useful for analyzing and forecasting data with recurring seasonal patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
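
&lt;p&gt;In the smoothing sense described above, a moving-average forecast is simply the mean of the last k observations. The toy sketch below shows that idea; libraries such as statsmodels fit the full AR/MA/ARIMA family properly:&lt;/p&gt;

```python
# Forecast the next value as the mean of the last k observations
# (the "moving average" in the smoothing sense; sales figures are made up).
def moving_average_forecast(series, k):
    window = series[-k:]
    return sum(window) / len(window)

sales = [10, 12, 11, 13, 12, 14]
forecast = moving_average_forecast(sales, 3)
print(forecast)  # 13.0
```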

&lt;p&gt;&lt;strong&gt;Evaluating the performance of time series models.&lt;/strong&gt;&lt;br&gt;
Some commonly used metrics include:&lt;/p&gt;

&lt;p&gt;Mean Absolute Error (MAE) - This metric measures the average absolute difference between the predicted and actual values. It provides a straightforward measure of the model’s accuracy.&lt;/p&gt;

&lt;p&gt;Root Mean Squared Error (RMSE) - RMSE calculates the square root of the average squared difference between the predicted and actual values. It penalizes larger errors more heavily than MAE.&lt;/p&gt;

&lt;p&gt;Mean Absolute Percentage Error (MAPE) - MAPE calculates the average percentage difference between the predicted and actual values. It provides a relative measure of the model’s accuracy.&lt;/p&gt;

&lt;p&gt;Forecast Bias - Forecast bias measures the tendency of the model to consistently overestimate or underestimate the actual values. A bias close to zero indicates a well-calibrated model.&lt;/p&gt;
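
&lt;p&gt;These metrics are short formulas and can be implemented directly; forecast bias is taken here as the mean signed error, so values near zero indicate a well-calibrated model:&lt;/p&gt;

```python
import math

def mae(actual, pred):
    # Mean Absolute Error: average absolute difference.
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    # Root Mean Squared Error: penalizes large errors more heavily.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mape(actual, pred):
    # Mean Absolute Percentage Error: relative accuracy (actuals must be nonzero).
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def bias(actual, pred):
    # Mean signed error: positive means the model overestimates on average.
    return sum(p - a for a, p in zip(actual, pred)) / len(actual)

actual = [100, 110, 120]
pred = [98, 112, 121]
print(round(mae(actual, pred), 2))  # 1.67
```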

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis (EDA) and Visualization Techniques</title>
      <dc:creator>Lorna Munanie</dc:creator>
      <pubDate>Sun, 08 Oct 2023 20:34:26 +0000</pubDate>
      <link>https://dev.to/lornam12/exploratory-data-analysis-edaand-visualization-techniques-5gdp</link>
      <guid>https://dev.to/lornam12/exploratory-data-analysis-edaand-visualization-techniques-5gdp</guid>
      <description>&lt;p&gt;EDA is a data analysis technique that mainly focuses on understanding the characteristics of a dataset. It involves using various statistical and visualization tools to explore data, identify patterns, and uncover insights and relationships.&lt;/p&gt;

&lt;p&gt;Exploratory data analysis is an important step in the data analysis process. It ensures that the data is really what it is claimed to be and that there are no obvious errors, e.g. missing values or outliers. EDA enhances the accuracy, efficiency, and reliability of the analysis.&lt;/p&gt;

&lt;p&gt;Data visualization on the other hand represents the various techniques used to represent data visually through charts, tables, maps, graphs and other visual elements. These techniques usually help to represent complex data in a more simplified and understandable format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common graphs used while performing EDA&lt;/strong&gt;&lt;br&gt;
Scatter Plot&lt;br&gt;
Pair plots&lt;br&gt;
Histogram&lt;br&gt;
Box plots&lt;br&gt;
Violin Plot&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performing EDA&lt;/strong&gt;&lt;br&gt;
We are going to use a sample dataset, the &lt;a href="https://github.com/LornaM12/EDA-and-Visualization/blob/main/haberman.csv"&gt;Haberman&lt;/a&gt; dataset, to perform EDA.&lt;/p&gt;

&lt;p&gt;We start by importing several Python libraries:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fBAwG7Wr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xv8ndbgiwvn6c54a2r6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fBAwG7Wr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xv8ndbgiwvn6c54a2r6n.png" alt="Image description" width="487" height="117"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table Headers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nu5zAV4F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ttjp0vdr31ey6zscowau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nu5zAV4F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ttjp0vdr31ey6zscowau.png" alt="Image description" width="388" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Age - Represents the age of the patients who underwent the surgery. It ranges from 30 to 83.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5c7g85Od--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/94etdh60a13s7ln1fxo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5c7g85Od--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/94etdh60a13s7ln1fxo9.png" alt="Image description" width="351" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Year - The year in which the patient had the operation. It ranges from 1958 to 1969.&lt;/p&gt;

&lt;p&gt;Nodes - The number of positive axillary lymph nodes detected. (A lymph node, or lymph gland, is a kidney-shaped organ of the lymphatic system and the adaptive immune system.)&lt;/p&gt;

&lt;p&gt;Status - Denoted by 1 and 2: 1 means the patient survived 5 years or longer, and 2 means the patient died within 5 years.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XG_SbHtW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rmn0cw0uyh9k4iywfyyp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XG_SbHtW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rmn0cw0uyh9k4iywfyyp.png" alt="Image description" width="373" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the output above, 225 patients survived 5 years or longer and 81 patients died within 5 years.&lt;/p&gt;
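
&lt;p&gt;The post loads haberman.csv with pandas; the same value-count step can be mimicked with the standard library on a tiny inline sample (column names as in the table above, values made up):&lt;/p&gt;

```python
import csv
import io

# A tiny inline stand-in for haberman.csv (age, year, nodes, status).
SAMPLE = """age,year,nodes,status
30,64,1,1
34,60,0,2
38,69,21,1
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Count how many patients fall into each survival status.
counts = {}
for r in rows:
    counts[r["status"]] = counts.get(r["status"], 0) + 1
print(counts)  # {'1': 2, '2': 1}
```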

&lt;p&gt;&lt;strong&gt;Data Visualization plots&lt;/strong&gt;&lt;br&gt;
Helps us understand the dataset much better in a visual way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Histograms&lt;/strong&gt;&lt;br&gt;
These are 2-D plots where the X axis is divided into time intervals or numerical bin ranges. Histograms help in identifying patterns such as skewness, central tendency, and outliers.&lt;/p&gt;

&lt;p&gt;From our example above:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1UgpsGQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mixk2l2ttt09ly5cf610.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1UgpsGQL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mixk2l2ttt09ly5cf610.png" alt="Image description" width="470" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yDNz1_Xs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/crqg0k8cpog70xbawdy7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yDNz1_Xs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/crqg0k8cpog70xbawdy7.png" alt="Image description" width="697" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bar Charts&lt;/strong&gt;&lt;br&gt;
Bar charts are suitable for visualizing categorical or discrete data. They help in understanding trends across categories.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GdHgRMzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4l1jmvvlzrlbmm49lhy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GdHgRMzc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4l1jmvvlzrlbmm49lhy.png" alt="Image description" width="466" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scatter Plots&lt;/strong&gt;&lt;br&gt;
It is a plot in which individual observations appear as scattered points, usually for two features. Here we will plot nodes vs. age and see if there is any linearity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CugqbZby--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1zlak8y2djatqqy8k3b5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CugqbZby--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1zlak8y2djatqqy8k3b5.png" alt="Image description" width="512" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the blue and orange dots represent the survival status of the patients: blue means the patient survived 5 years or longer, and orange means the patient died within 5 years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pair Plots&lt;/strong&gt;&lt;br&gt;
They display scatter plots for all possible pairs of continuous variables in a dataset. They provide a comprehensive view of the relationships between variables and are especially useful when exploring multiple variables simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ0zZ-cl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sktpm3zyxuudelssgnt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dZ0zZ-cl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sktpm3zyxuudelssgnt5.png" alt="Image description" width="503" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the above plot we can get some interesting facts. Plot 6 (Year vs Nodes) is more readable than the other two, but we certainly cannot make any concrete observations based on this graph alone. Plots 4, 7 and 8 are the inverted versions of plots 2, 3 and 6 respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Box Plots&lt;/strong&gt;&lt;br&gt;
Box plots show us percentile information that other plots cannot convey as easily. They also help in the detection of outliers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JyJl22A2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwgbhj4818sdodvejabl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JyJl22A2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwgbhj4818sdodvejabl.png" alt="Image description" width="510" height="447"&gt;&lt;/a&gt;&lt;/p&gt;
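
&lt;p&gt;The quantities a box plot draws, the quartiles plus the usual 1.5 * IQR outlier fences, can be computed directly with the standard library (the node counts below are made up):&lt;/p&gt;

```python
from statistics import quantiles

nodes = [0, 0, 1, 1, 2, 3, 4, 7, 8, 23]
q1, q2, q3 = quantiles(nodes, n=4)  # the three quartile cut points
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the whisker "fences"
# Clamping a value into [lo, hi] leaves it unchanged only when it lies inside
# the fences, so anything the clamp changes is an outlier.
outliers = [x for x in nodes if x != max(min(x, hi), lo)]
print(outliers)  # [23]
```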

&lt;p&gt;In conclusion, these are some basic plots used in EDA. It is always important to read and understand what each plot is saying. It is never good to skip EDA in a machine learning project.&lt;/p&gt;

</description>
      <category>x</category>
    </item>
    <item>
      <title>Data Science for Beginners :2023 - 2024 Complete Roadmap</title>
      <dc:creator>Lorna Munanie</dc:creator>
      <pubDate>Sun, 01 Oct 2023 10:52:54 +0000</pubDate>
      <link>https://dev.to/lornam12/data-science-for-beginners-2023-2024-complete-roadmap-57bb</link>
      <guid>https://dev.to/lornam12/data-science-for-beginners-2023-2024-complete-roadmap-57bb</guid>
      <description>&lt;p&gt;Data science is the study of data in order to extract meaningful insights from it. It extracts insights by combining various subjects such as  math and statistics, specialized programming, advanced analytics, artificial intelligence (AI) and machine learning. These insights are then used by organizations in decision making and strategic planning.&lt;/p&gt;

&lt;p&gt;A data science roadmap is a visual representation of a strategic plan designed to help one learn about and succeed in the field of data science.&lt;/p&gt;

&lt;p&gt;As a wide field in technology, data science has several career paths one can follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data Analyst - Collects, cleans and analyzes data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Scientist - Builds predictive models and creates data driven solutions. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Engineer - Builds infrastructure for generation, storage and retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;BI Analyst - Creates reports, dashboards and visualizations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Machine Learning Engineer - Implements ML algorithms and models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NLP Engineer - Focuses on understanding and interpreting natural language.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Data Science skills for beginners&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mathematical and Statistical Skills&lt;/li&gt;
&lt;li&gt;Programming Skills&lt;/li&gt;
&lt;li&gt;Communication Skills&lt;/li&gt;
&lt;li&gt;Curiosity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Mathematical and Statistical skills&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Statistics&lt;/strong&gt; - This is a branch of mathematics that teaches us how to collect and analyze  data so that we can find answers to questions. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Descriptive statistics - Summarizes and describes the dataset you actually have.&lt;/li&gt;
&lt;li&gt;Inferential statistics - Uses a smaller sample to draw conclusions that apply to the entire dataset or population.&lt;/li&gt;
&lt;/ul&gt;
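
&lt;p&gt;The distinction can be seen in code: a descriptive statistic uses every value you hold, while an inferential estimate works from a sample. The data below is simulated for illustration:&lt;/p&gt;

```python
import random
import statistics

random.seed(42)

# Simulated "population" of 10,000 measurements.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Descriptive: summarize the entire dataset we hold.
population_mean = statistics.mean(population)

# Inferential: estimate that same quantity from a sample of 100 values.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

print(round(population_mean, 1), round(sample_mean, 1))
```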

&lt;p&gt;&lt;strong&gt;Probability&lt;/strong&gt; - A numerical representation of the likelihood of an event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculus&lt;/strong&gt;- Calculus is a branch of mathematics that deals with the study of rates of change and the accumulation of quantities. It has two main branches: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Differential Calculus - Differential calculus helps us understand how things change; it describes how a function behaves at a single point.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integral Calculus - Integral calculus helps us find areas and accumulate quantities. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Linear Algebra&lt;/strong&gt;&lt;br&gt;
This is a branch of mathematics that deals with vectors and matrices. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;PROGRAMMING SKILLS&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL (Structured Query Language)&lt;/strong&gt; - This is the standard language for storing, querying, and manipulating data in relational databases, and it handles large datasets well.&lt;br&gt;
&lt;strong&gt;Python programming&lt;/strong&gt; - Python offers built-in data structures and libraries that store and manipulate data efficiently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lists&lt;/li&gt;
&lt;li&gt;Tuples&lt;/li&gt;
&lt;li&gt;Dictionaries&lt;/li&gt;
&lt;li&gt;Sets&lt;/li&gt;
&lt;li&gt;Strings&lt;/li&gt;
&lt;/ul&gt;
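
&lt;p&gt;A one-line example of each built-in structure from the list above:&lt;/p&gt;

```python
scores = [88, 92, 75]              # list: ordered and mutable
point = (3, 4)                     # tuple: ordered and immutable
ages = {"alice": 30, "bob": 27}    # dictionary: key/value lookup
tags = {"data", "python", "data"}  # set: duplicates collapse away
name = "Lorna"                     # string: immutable text
print(len(tags))  # 2, because "data" is stored only once
```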

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Data Analysis and Visualization&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Being a data scientist requires you to work on data visualization, presenting data in pictorial forms such as charts and graphs that are easy to understand. There are plenty of tools in use, and some of the popular ones are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;Power BI&lt;/li&gt;
&lt;li&gt;Looker Studio&lt;/li&gt;
&lt;li&gt;Python libraries, e.g. Matplotlib and Plotly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Communication Skills&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ability to communicate and champion ideas in a way that is easy to understand and can be used in decision making.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
    </item>
  </channel>
</rss>
