<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muinde Esther Ndunge </title>
    <description>The latest articles on DEV Community by Muinde Esther Ndunge  (@muinde_esther).</description>
    <link>https://dev.to/muinde_esther</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1172920%2F09f99d58-e5f5-4d3a-ab35-b5dce2af98c0.jpg</url>
      <title>DEV Community: Muinde Esther Ndunge </title>
      <link>https://dev.to/muinde_esther</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/muinde_esther"/>
    <language>en</language>
    <item>
      <title>Data Engineering Roadmap 2023</title>
      <dc:creator>Muinde Esther Ndunge </dc:creator>
      <pubDate>Thu, 02 Nov 2023 16:04:47 +0000</pubDate>
      <link>https://dev.to/muinde_esther/data-engineering-roadmap-2023-1a0i</link>
      <guid>https://dev.to/muinde_esther/data-engineering-roadmap-2023-1a0i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Data engineering is a crucial field within the broader realm of data science and analytics. It involves the collection, transformation, and storage of data to make it accessible and useful for analysis. As a beginner in data engineering, you may feel daunted and wonder how to get started and build a successful career in this dynamic and in-demand field. This roadmap will guide you through the essential steps and concepts you need to master as you embark on your data engineering journey.&lt;/p&gt;

&lt;p&gt;Data engineers use tools such as Java to build APIs, Python to write ETL pipelines and dashboards, and SQL to access data in source systems and move it to target locations.&lt;br&gt;
This roadmap is broken down into monthly deliverables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 1: Basics of Programming
&lt;/h2&gt;

&lt;p&gt;The first thing to master as a data engineer is a programming language. Python is the most common choice and will let you kickstart your data engineering journey.&lt;/p&gt;

&lt;p&gt;Python is a versatile language: it is easy to use, has a rich ecosystem of supporting libraries, and appears in almost every part of the data engineering workflow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand Python basics: operators, variables, and data types&lt;/li&gt;
&lt;li&gt;Learn to work with data files, including Python libraries like pandas, which are widely used for reading and manipulating data&lt;/li&gt;
&lt;li&gt;Learn the basics of relational databases

&lt;ul&gt;
&lt;li&gt;SQL Server/MySQL/PostgreSQL&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
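&lt;p&gt;To make these basics concrete, here is a minimal sketch, using only Python's standard library and made-up scores, that touches variables, data types, operators, and reading a data file:&lt;/p&gt;

```python
import csv
import io

# A toy CSV "file" standing in for a real data file (values are made up)
raw = "name,score\nada,90\ngrace,85\n"

# Read it with the standard library's csv module
rows = list(csv.DictReader(io.StringIO(raw)))

# Type conversion: CSV fields arrive as strings
scores = [int(r["score"]) for r in rows]

# Operators and built-ins at work
average = sum(scores) / len(scores)
print(average)
```

&lt;p&gt;In practice you would point pandas' read_csv at a real file, but the ideas are the same.&lt;/p&gt;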

&lt;p&gt;&lt;strong&gt;Learn the fundamentals of computing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Master version control with Git and GitHub&lt;/li&gt;
&lt;li&gt;Focus on shell scripting in Linux; you'll use it for cron jobs, setting up environments, and similar tasks&lt;/li&gt;
&lt;li&gt;Learn web scraping, which is part and parcel of a data engineer's job; we often need to extract data from websites that lack a straightforward, helpful API&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Month 2: Databases
&lt;/h2&gt;

&lt;p&gt;Relational databases are among the most common core storage components in data systems. You need a good understanding of them to work with large amounts of data.&lt;br&gt;
Master the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keys in SQL&lt;/li&gt;
&lt;li&gt;Joins in SQL&lt;/li&gt;
&lt;li&gt;Rank Window Functions&lt;/li&gt;
&lt;li&gt;Normalization&lt;/li&gt;
&lt;li&gt;Aggregations&lt;/li&gt;
&lt;li&gt;Data wrangling and analysis&lt;/li&gt;
&lt;li&gt;Data modeling for warehouses&lt;/li&gt;
&lt;/ul&gt;
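&lt;p&gt;You can practice these concepts without installing a database server. The sketch below uses Python's built-in sqlite3 module with invented customer and order tables, covering a join, an aggregation, and a RANK window function (window functions need SQLite 3.25 or newer):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ben")])
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 250.0)])

# JOIN plus aggregation: total spend per customer
cur.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""")
totals = cur.fetchall()
print(totals)

# RANK window function over the aggregated totals
cur.execute("""
    SELECT name, total, RANK() OVER (ORDER BY total DESC) AS rnk
    FROM (SELECT c.name, SUM(o.amount) AS total
          FROM customers c
          JOIN orders o ON o.customer_id = c.id
          GROUP BY c.name)
    ORDER BY rnk
""")
ranked = cur.fetchall()
print(ranked)
```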

&lt;h2&gt;
  
  
  Month 3: Cloud Computing
&lt;/h2&gt;

&lt;p&gt;Learn about cloud platforms that deliver computing services over the internet.&lt;br&gt;
The three main choices available are&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Web Services(AWS)&lt;/li&gt;
&lt;li&gt;Microsoft Azure&lt;/li&gt;
&lt;li&gt;Google Cloud Platform(GCP)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can pick any one cloud platform; as you learn it, mastering the others becomes easier. The fundamental concepts are similar, with slight differences in user interface, cost, and other factors.&lt;br&gt;
At this point you understand the basics of programming, SQL, web scraping, and APIs. This is enough to work on your first project, which could bring in data from a website, transform it using Python, and store it in a relational database. You can then move the data to the cloud platform you have chosen to work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 4: Data Processing
&lt;/h2&gt;

&lt;p&gt;Learn how to process big data. Big data has two aspects: batch data and streaming data. We need specialized tools to handle such data-intensive workloads, and one of the most popular is Apache Spark. Focus on the following when learning Apache Spark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark architecture&lt;/li&gt;
&lt;li&gt;RDDs in Spark&lt;/li&gt;
&lt;li&gt;Working with Spark Dataframes&lt;/li&gt;
&lt;li&gt;Understand Spark Execution&lt;/li&gt;
&lt;li&gt;Broadcast and Accumulators&lt;/li&gt;
&lt;li&gt;Spark SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Learn to build ETL pipelines using Python and Spark, along with data preprocessing libraries such as NumPy and pandas.&lt;/p&gt;
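&lt;p&gt;To make the extract-transform-load idea concrete before reaching for Spark, here is a standard-library-only sketch; the CSV source and the temperature values are invented, and in a real pipeline pandas or Spark would do the heavy lifting:&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Extract: a toy CSV source (hypothetical data)
source = "city,temp_f\nNairobi,77\nKisumu,86\n"
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: convert Fahrenheit to Celsius
for r in rows:
    r["temp_c"] = round((int(r["temp_f"]) - 32) * 5 / 9, 1)

# Load: write the transformed rows into a relational target
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?)",
                 [(r["city"], r["temp_c"]) for r in rows])
loaded = conn.execute("SELECT city, temp_c FROM weather").fetchall()
print(loaded)
```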

&lt;h2&gt;
  
  
  Month 5: Big Data Engineering
&lt;/h2&gt;

&lt;p&gt;Here we build on what we did during the previous month. Learn big data engineering with Spark, optimization in Spark, and workflow scheduling.&lt;br&gt;
The ETL pipelines you build to load data into databases and data warehouses must be managed separately, so we need a workflow scheduling tool to manage pipelines and handle errors.&lt;/p&gt;

&lt;p&gt;Learn the following concepts in Apache Airflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAGs&lt;/li&gt;
&lt;li&gt;Task dependencies&lt;/li&gt;
&lt;li&gt;Operators&lt;/li&gt;
&lt;li&gt;Scheduling&lt;/li&gt;
&lt;li&gt;Branching&lt;/li&gt;
&lt;/ul&gt;
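&lt;p&gt;Airflow itself needs a running scheduler, but the core idea behind DAGs and task dependencies can be shown in plain Python: tasks must run in an order that respects their upstream dependencies. The sketch below, with invented task names, uses Kahn's topological sort to compute such an order:&lt;/p&gt;

```python
from collections import deque

# Hypothetical pipeline: each task maps to its downstream tasks
dag = {
    "extract": ["transform"],
    "transform": ["load"],
    "load": ["notify"],
    "notify": [],
}

def run_order(dag):
    """Return an execution order that respects dependencies (Kahn's algorithm)."""
    indegree = {task: 0 for task in dag}
    for downstreams in dag.values():
        for task in downstreams:
            indegree[task] += 1
    ready = deque(task for task in dag if indegree[task] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for downstream in dag[task]:
            indegree[downstream] -= 1
            if indegree[downstream] == 0:
                ready.append(downstream)
    return order

print(run_order(dag))
```

&lt;p&gt;Airflow's scheduler does essentially this, plus retries, backfills, and error handling.&lt;/p&gt;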

&lt;h2&gt;
  
  
  Month 6: Data warehousing
&lt;/h2&gt;

&lt;p&gt;Getting data into databases is one thing; the challenge is aggregating and storing data in a central repository. You will first need to understand the differences between a database, a data warehouse, and a data lake, as well as between OLTP and OLAP.&lt;br&gt;
There are several data warehousing tools available:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redshift&lt;/li&gt;
&lt;li&gt;Databricks&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Month 7: Handling data streaming
&lt;/h2&gt;

&lt;p&gt;Data streaming is the continuous flow of data as it is generated, enabling real-time processing and analysis for immediate insights.&lt;br&gt;
To ensure that data is ingested reliably while it is being generated, we use Apache Kafka.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn Kafka architecture&lt;/li&gt;
&lt;li&gt;Learn about Producers and Consumers&lt;/li&gt;
&lt;li&gt;Create topics in Kafka&lt;/li&gt;
&lt;/ul&gt;
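&lt;p&gt;A real Kafka deployment needs a broker, but the producer/consumer pattern it implements can be sketched with the standard library: one thread publishes events to a queue standing in for a topic, and another consumes them. The event names are invented:&lt;/p&gt;

```python
import queue
import threading

topic = queue.Queue()   # stands in for a Kafka topic
received = []

def producer():
    for event in ["signup", "click", "purchase"]:
        topic.put(event)    # publish an event
    topic.put(None)         # sentinel: no more events

def consumer():
    while True:
        event = topic.get()
        if event is None:
            break
        received.append(event)  # process the event

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(received)
```

&lt;p&gt;Kafka adds persistence, partitioning, and replication on top of this basic pattern.&lt;/p&gt;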

&lt;p&gt;There are other tools for streaming data, such as AWS Kinesis; again, you are not limited in which tool you use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Month 8: Processing streaming data
&lt;/h2&gt;

&lt;p&gt;After learning how to ingest streaming data, learn how to process it in real time. You can do this with Kafka, but it is not as flexible for ETL purposes as Spark Streaming.&lt;br&gt;
Focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DStreams&lt;/li&gt;
&lt;li&gt;Stateless vs. Stateful transformation&lt;/li&gt;
&lt;li&gt;Checkpointing&lt;/li&gt;
&lt;li&gt;Structured Streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Month 9: Data transformation
&lt;/h2&gt;

&lt;p&gt;Every data engineer has to transform data into a form that other members of the organization can use. Data transformation tools make it easy for data engineers to do so.&lt;br&gt;
Focus on dbt, as many companies are using it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn how to use its compiler and runner components&lt;/li&gt;
&lt;li&gt;Model data transformation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Month 10: Reporting and Dashboards
&lt;/h2&gt;

&lt;p&gt;This is mostly the end product of the data work: the data has already been transformed, insights have been derived from it, and it is ready to be presented to stakeholders. You can use any of several tools to visualize data and create dashboards, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Power BI&lt;/li&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;Looker&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Month 11: NoSQL
&lt;/h2&gt;

&lt;p&gt;When working with relational databases, the data must always be structured, and querying large volumes is not that fast; hence we have NoSQL. These databases handle both structured and unstructured data.&lt;br&gt;
You can focus on learning one NoSQL database, such as MongoDB, since it is widely used in the industry and easy to learn.&lt;br&gt;
Focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CAP theorem&lt;/li&gt;
&lt;li&gt;CRUD operations&lt;/li&gt;
&lt;li&gt;Documents and Collections&lt;/li&gt;
&lt;li&gt;Working with different types of operators&lt;/li&gt;
&lt;li&gt;Aggregation Pipeline&lt;/li&gt;
&lt;li&gt;Sharding and Replication in MongoDB&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Month 12: Building projects
&lt;/h2&gt;

&lt;p&gt;Although you will have built projects at each step, by now you have an understanding of the essential tools in data engineering. To showcase your skills, build a capstone project and keep learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This breakdown allows you to progressively build your data engineering skills over the year. You can adjust the pace of your learning based on your personal preferences and the time you have available. Consistent practice and hands-on experience will be crucial in mastering the field of data engineering.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>data</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Complete Guide to Time Series Models</title>
      <dc:creator>Muinde Esther Ndunge </dc:creator>
      <pubDate>Thu, 26 Oct 2023 12:55:56 +0000</pubDate>
      <link>https://dev.to/muinde_esther/the-complete-guide-to-time-series-models-2alc</link>
      <guid>https://dev.to/muinde_esther/the-complete-guide-to-time-series-models-2alc</guid>
      <description>&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Understanding Time Series Data&lt;/li&gt;
&lt;li&gt;Components of Time Series&lt;/li&gt;
&lt;li&gt;Methods to Check Stationarity&lt;/li&gt;
&lt;li&gt;Converting Non-Stationary Into Stationary&lt;/li&gt;
&lt;li&gt;
Time Series Models

&lt;ul&gt;
&lt;li&gt;Moving Average(MA) Model&lt;/li&gt;
&lt;li&gt;Auto-Regressive(AR) Model&lt;/li&gt;
&lt;li&gt;Autoregressive Moving Average (ARMA) and ARIMA Models&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Python Libraries for Time Series Analysis&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;A time series is a collection of observations made sequentially in time: an arrangement of statistical data in accordance with its occurrence in time. Time series models are statistical models used to analyze and forecast such data. They are widely employed in various domains, including finance, economics, climate science, and more. This guide provides an overview of time series modeling and its various components.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Understanding Time Series Data
&lt;/h2&gt;

&lt;p&gt;Time series data is a sequence of observations collected at regular time intervals. It can be univariate (one variable) or multivariate (multiple variables). The central assumption in time series analysis (TSA) is stationarity: the statistical properties of the process do not depend on the point in time at which it is observed. Understanding the characteristics of time series data is crucial for model selection.&lt;br&gt;
&lt;strong&gt;Stationary&lt;/strong&gt; data should have no trend, seasonality, cyclical, or irregular components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The mean should be constant over time&lt;/li&gt;
&lt;li&gt;The variance should be constant over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data can also be &lt;strong&gt;Non-Stationary&lt;/strong&gt;, meaning the mean, variance, or covariance changes with respect to time.&lt;/p&gt;
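&lt;p&gt;A quick, informal way to check these conditions is to split a series in half and compare the mean and variance of each half. The toy series below is invented and clearly shifts in level, so it fails the constant-mean condition:&lt;/p&gt;

```python
from statistics import mean, pvariance

# Invented series with an obvious shift in level halfway through
series = [10, 12, 11, 13, 12, 30, 35, 33, 38, 36]

half = len(series) // 2
first, second = series[:half], series[half:]

print("means:", mean(first), mean(second))            # very different: non-stationary
print("variances:", pvariance(first), pvariance(second))
```

&lt;p&gt;Formal statistical tests, covered in the next section, make the same idea rigorous.&lt;/p&gt;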
&lt;h2&gt;
  
  
  3. Components of Time Series
&lt;/h2&gt;

&lt;p&gt;Time series data consists of the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trend:&lt;/strong&gt;&lt;br&gt;
This is the general tendency of data to grow or decline over a long period of time, that is, the long-term upward or downward movement in the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Seasonality:&lt;/strong&gt;&lt;br&gt;
Seasonality is characterized by repetitive patterns or cycles at fixed intervals. It occurs due to rhythmic forces which occur in a regular &amp;amp; periodic manner.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cyclical Variations:&lt;/strong&gt;&lt;br&gt;
These are movements in a time series that are not attributed to a regular cycle; they have no fixed interval, and both their movement and pattern are uncertain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Irregular Variations:&lt;/strong&gt;&lt;br&gt;
These are unexpected situations/events/scenarios and spikes in a short time span.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  4. Methods to Check Stationarity
&lt;/h2&gt;

&lt;p&gt;When preparing data for a TSA model, it is important to assess whether the dataset is stationary. This is done using statistical tests, which include:&lt;br&gt;
&lt;strong&gt;Augmented Dickey-Fuller (ADF) Test:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is done with the following assumptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;H0: Series is non-stationary&lt;/li&gt;
&lt;li&gt;HA: Series is stationary

&lt;ul&gt;
&lt;li&gt;p-value &amp;gt; 0.05 Fail to reject(H0)&lt;/li&gt;
&lt;li&gt;p-value &amp;lt;= 0.05 Reject (H0)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
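&lt;p&gt;In practice the p-value comes from a statistics package (statsmodels' adfuller, for example, returns it as the second element of its result). The decision rule itself can be sketched as a small helper; the 0.05 threshold is the conventional choice:&lt;/p&gt;

```python
import operator

def interpret_adf(p_value, alpha=0.05):
    """Apply the ADF decision rule: H0 says the series is non-stationary."""
    # Reject H0 when the p-value is at or below alpha
    if operator.le(p_value, alpha):
        return "reject H0: series is stationary"
    return "fail to reject H0: series is non-stationary"

print(interpret_adf(0.01))
print(interpret_adf(0.30))
```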

&lt;p&gt;&lt;strong&gt;Kwiatkowski-Philips-Schmidt-Shin(KPSS) Test:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It tests the null hypothesis that the time series is stationary around a deterministic trend, against the alternative of a unit root.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. Converting Non-Stationary Into Stationary
&lt;/h2&gt;

&lt;p&gt;There are three methods available for this conversion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detrending&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This involves removing the trend effects from the given data and showing only the differences in values from the trend, which makes cyclical patterns easier to identify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Differencing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This transforms the series into a new series, which we use to remove the series' dependence on time and stabilize its mean. Trend and seasonality are reduced during this transformation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Y't = Yt - Yt-1&lt;/li&gt;
&lt;li&gt;Yt = the value at time t&lt;/li&gt;
&lt;/ul&gt;
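&lt;p&gt;Applied to an invented series with a steady upward trend, first-order differencing produces a constant series, removing the trend entirely:&lt;/p&gt;

```python
def difference(series, lag=1):
    """Differencing: each output value is Yt minus Yt-lag."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

trend = [10, 13, 16, 19, 22]   # invented series rising by 3 each step
print(difference(trend))
```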

&lt;p&gt;&lt;strong&gt;Transformation&lt;/strong&gt;&lt;br&gt;
This includes three different methods: power transform, square root, and log transform. The most commonly used is the log transform.&lt;/p&gt;
&lt;h2&gt;
  
  
  6. Time Series Models
&lt;/h2&gt;

&lt;p&gt;There are several time series models available, each designed to capture different aspects of the data. Here are some common types:&lt;/p&gt;
&lt;h3&gt;
  
  
  Moving Average(MA) Model
&lt;/h3&gt;

&lt;p&gt;This is a commonly used time series model. It smooths out random short-term variations and is closely related to the components of a time series. It is represented as MA(q), where q is the order of the moving average.&lt;/p&gt;

&lt;p&gt;The moving average is calculated by averaging the time series over a window of k periods.&lt;br&gt;
There are three types of moving averages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple Moving Average (SMA)&lt;/li&gt;
&lt;li&gt;Cumulative Moving Average(CMA)&lt;/li&gt;
&lt;li&gt;Exponential Moving Average(EMA)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Simple Moving Average (SMA)&lt;/strong&gt;&lt;br&gt;
SMA calculates the unweighted mean of the previous N points. The size of the sliding window is chosen based on the desired amount of smoothing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Sample time series data
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
        'Value': [10, 15, 20, 18, 22]}

df = pd.DataFrame(data)

# Calculate SMA with a window size of 3
window_size = 3
df['SMA'] = df['Value'].rolling(window=window_size).mean()

# Plotting the time series data and SMA
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data', marker='o')
plt.plot(df['Date'], df['SMA'], label=f'SMA ({window_size}-period)', linestyle='--')

plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Simple Moving Average (SMA)')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cumulative Moving Average(CMA)&lt;/strong&gt;&lt;br&gt;
CMA considers all data points up to a certain period, calculating the average cumulatively&lt;br&gt;
Here's an example&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

# Sample time series data
data = {'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
        'Value': [10, 15, 20, 18, 22]}

df = pd.DataFrame(data)

# Calculate CMA
df['CMA'] = df['Value'].expanding().mean()

# Plotting the time series data and CMA
plt.figure(figsize=(10, 6))
plt.plot(df['Date'], df['Value'], label='Original Data', marker='o')
plt.plot(df['Date'], df['CMA'], label='CMA', linestyle='--')

plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Cumulative Moving Average (CMA)')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Exponential Moving Average&lt;/strong&gt;&lt;br&gt;
EMA gives more weight to recent data points. It is mainly used to identify trends and filter out noise; the weight of older elements decreases gradually over time.&lt;/p&gt;
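&lt;p&gt;SMA and CMA were illustrated with code above; for completeness, here is a matching sketch of the EMA recursion, using an invented series and an arbitrary smoothing factor alpha. (With pandas, Series.ewm(alpha=0.5, adjust=False).mean() gives the same result.)&lt;/p&gt;

```python
def ema(series, alpha=0.5):
    """Exponential moving average: recent points get geometrically more weight."""
    out = [float(series[0])]                        # seed with the first observation
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out

print(ema([10, 20, 20, 20]))
```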

&lt;p&gt;When dealing with TSA in data science and machine learning, we use models like the Autoregressive Moving Average (ARMA) family, specified with the orders [p, d, and q]:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p == autoregressive lags&lt;/li&gt;
&lt;li&gt;q == moving average lags&lt;/li&gt;
&lt;li&gt;d == order of differencing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we dive deeper into these models let's understand the terms below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-Correlation Function(ACF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ACF measures the linear relationship between a time series and its lagged values. It indicates how similar a value is to the values that precede it within a given time series.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Sample time series data
data = np.random.rand(100)

# Create a pandas DataFrame
df = pd.DataFrame({'Value': data})

# Calculate and plot ACF
plot_acf(df['Value'], lags=20)
plt.title('AutoCorrelation Function (ACF)')
plt.xlabel('Lag')
plt.ylabel('ACF')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Partial AutoCorrelation Function(PACF)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PACF measures the direct relationship between a time series and its lagged values while removing the influence of the intermediate lags.&lt;br&gt;
In essence, it shows the correlation of the series with itself at a given lag, keeping only the direct effect.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.graphics.tsaplots import plot_pacf

# Calculate and plot PACF
plot_pacf(df['Value'], lags=20)
plt.title('Partial AutoCorrelation Function (PACF)')
plt.xlabel('Lag')
plt.ylabel('PACF')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If the ACF plot declines gradually and the PACF drops off instantly, an Auto-Regressive model is the right choice&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the ACF plot drops off instantly and the PACF declines gradually, a Moving Average model is the right choice&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If both the ACF and PACF plots decline gradually, an ARMA model should be used&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If both drop off immediately, the series is likely white noise and none of these models applies&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auto-Regressive Model
&lt;/h3&gt;

&lt;p&gt;This is a simple model that uses linear regression to predict the value of a variable based on its past values. It is mainly used for forecasting when there is some correlation between values in a given time series.&lt;/p&gt;

&lt;p&gt;Mathematical Representation:&lt;br&gt;
The AR(1) model can be expressed as:&lt;/p&gt;

&lt;p&gt;Xt = ϕ1 ⋅ Xt−1 + ϵt&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Xt is the value at time t.&lt;/li&gt;
&lt;li&gt; ϕ1 is the autoregressive coefficient.&lt;/li&gt;
&lt;li&gt; Xt−1​ is the value at time t−1.&lt;/li&gt;
&lt;li&gt; ϵt is white noise or the error term.&lt;/li&gt;
&lt;/ul&gt;
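&lt;p&gt;The AR(1) recursion can be simulated directly, which is a good way to build intuition. In this sketch the coefficient and series length are chosen arbitrarily, and the noise is drawn from the standard library's random module:&lt;/p&gt;

```python
import random

random.seed(42)                     # reproducible noise

phi = 0.8                           # autoregressive coefficient (arbitrary choice)
n = 200
x = [0.0]
for _ in range(n - 1):
    eps = random.gauss(0, 1)        # white-noise error term
    x.append(phi * x[-1] + eps)     # Xt = phi * Xt-1 + eps

print(len(x), round(x[-1], 3))
```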

&lt;h3&gt;
  
  
  Autoregressive Moving Average (ARMA) and ARIMA Models
&lt;/h3&gt;

&lt;p&gt;ARMA is a combination of the Auto-Regressive and Moving Average models. It describes a weakly stationary stochastic process in terms of two polynomials, capturing both kinds of temporal pattern in time series data.&lt;br&gt;
ARMA is specified by two orders: p for the autoregressive lags and q for the moving average component.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AR(p) component captures the linear relationship with past values.&lt;/li&gt;
&lt;li&gt;The MA(q) component accounts for the influence of past white noise or error terms.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA  # the old ARMA class was removed; ARIMA(p, 0, q) is equivalent

# Sample time series data
data = np.random.randn(100)  # Random data for illustration

# Create a pandas DataFrame
df = pd.DataFrame({'Value': data})

# Fit an AR(2) model (ARIMA with order (2, 0, 0))
model_ar = ARIMA(df['Value'], order=(2, 0, 0))
results_ar = model_ar.fit()

# Fit an ARMA(2, 1) model (ARIMA with order (2, 0, 1))
model_arma = ARIMA(df['Value'], order=(2, 0, 1))
results_arma = model_arma.fit()

# Print model summaries
print("AR Model Summary:")
print(results_ar.summary())
print("\nARMA Model Summary:")
print(results_arma.summary())

# Plot the original data and model predictions
plt.figure(figsize=(10, 6))
plt.plot(df['Value'], label='Original Data')
plt.plot(results_ar.fittedvalues, label='AR(2) Predictions', linestyle='--')
plt.plot(results_arma.fittedvalues, label='ARMA(2,1) Predictions', linestyle='--')

plt.xlabel('Time')
plt.ylabel('Value')
plt.title('AR and ARMA Model Predictions')
plt.legend()
plt.grid(True)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ARMA works best for stationary series, so ARIMA was developed to support both stationary and non-stationary series.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AR ==&amp;gt; Uses past values to predict the future.&lt;/li&gt;
&lt;li&gt;MA ==&amp;gt; Uses past error terms in the given series to predict the future.&lt;/li&gt;
&lt;li&gt;I==&amp;gt; Uses the differencing of observation and makes the stationary data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Python Libraries for Time Series Analysis
&lt;/h2&gt;

&lt;p&gt;To implement time series models in Python, you can use libraries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.statsmodels.org/"&gt;Statsmodels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://facebook.github.io/prophet/"&gt;Prophet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.statsmodels.org/stable/index.html"&gt;ARIMA from Python's statsmodels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.statsmodels.org/stable/index.html"&gt;Exponential Smoothing with Holt-Winters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Time series models are powerful tools for analyzing and forecasting time-ordered data. Selecting the right model and understanding the components of the data are critical for accurate predictions. With the appropriate model and evaluation techniques, you can make informed decisions based on historical data trends and patterns.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>data</category>
    </item>
    <item>
      <title>Visualizing the Story within Data: A Guide to Exploratory Data Analysis with Data Visualization</title>
      <dc:creator>Muinde Esther Ndunge </dc:creator>
      <pubDate>Wed, 11 Oct 2023 10:23:48 +0000</pubDate>
      <link>https://dev.to/muinde_esther/visualizing-the-story-within-data-a-guide-to-exploratory-data-analysis-with-data-visualization-3nak</link>
      <guid>https://dev.to/muinde_esther/visualizing-the-story-within-data-a-guide-to-exploratory-data-analysis-with-data-visualization-3nak</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Data is often described as the new oil of the digital age, but like crude oil, it is only valuable once refined and preprocessed. Exploratory Data Analysis (EDA) is the key to unlocking the hidden gems within your data. In this article, we will delve into the world of EDA, exploring its key benefits and techniques, and then look at data visualization as one key technique, with a real-world example.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;Exploratory Data Analysis, or EDA, is the process of investigating a dataset and summarizing its main features. It is the process of visually and statistically summarizing, interpreting, and understanding datasets. Its primary goal is to uncover patterns, trends, relationships, and anomalies within the data. EDA is a crucial step before diving into more advanced analytics or building predictive models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spotting missing and incorrect data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Understanding the underlying structure of your data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Testing your hypothesis and checking assumptions. It helps you form educated guesses about what might be happening within your data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identifying the most informative variables by determining how they relate to each other and which independent variables affect the dependent variable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating the most efficient model by removing extraneous information, because additional data can either skew your results or obscure key insights with unnecessary noise.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Types of Exploratory Data Analysis
&lt;/h3&gt;

&lt;p&gt;Depending on the type of data we have and the columns we are analyzing, various strategies can be used:&lt;br&gt;
&lt;strong&gt;1. Univariate Analysis&lt;/strong&gt;&lt;br&gt;
This type of analysis looks at one variable at a time to understand its distribution and central tendencies.&lt;br&gt;
&lt;strong&gt;2. Bivariate Analysis&lt;/strong&gt;&lt;br&gt;
This looks at the distribution of two variables and explores the relationships, associations, correlations, and dependencies between them.&lt;br&gt;
&lt;strong&gt;3. Multivariate Analysis&lt;/strong&gt;&lt;br&gt;
This extends bivariate analysis to encompass more variables. It aims to understand the complex interactions and dependencies among multiple variables.&lt;br&gt;
&lt;strong&gt;4. Time Series Analysis&lt;/strong&gt;&lt;br&gt;
This is applied to datasets that have a temporal component. It entails inspecting and modeling patterns, trends, and seasonality over time.&lt;br&gt;
&lt;strong&gt;5. Data Visualization&lt;/strong&gt;&lt;br&gt;
This is the important aspect of EDA that we will focus on in this article. It entails creating visual representations of the data to facilitate understanding and exploration. Various visualization techniques, including bar charts, histograms, scatter plots, line plots, heat maps, and interactive dashboards, are used to represent different kinds of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploratory Data Analysis using Data Visualization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Visualization
&lt;/h3&gt;

&lt;p&gt;Data visualization is the graphical representation of data that allows us to see patterns, trends, and outliers more clearly. In EDA, data visualization serves several critical purposes:&lt;br&gt;
&lt;strong&gt;1. Pattern Recognition:&lt;/strong&gt; Visualizations help identify recurring patterns in the data, which can lead to deeper insights.&lt;br&gt;
&lt;strong&gt;2. Anomaly Detection:&lt;/strong&gt; Outliers and anomalies often stand out vividly in visualizations, making them easier to spot.&lt;br&gt;
&lt;strong&gt;3. Communication:&lt;/strong&gt; Visualizations are a universal language that can effectively convey complex information to both technical and non-technical stakeholders.&lt;/p&gt;

&lt;p&gt;To choose and design a data visualization, it is important to consider two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The question you want to answer (and how many variables that question involves)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The data that is available (is it quantitative or categorical?)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, we will explore different types of graphical representation using a customer churn dataset, examining different aspects of the data so that we can draw meaningful insights from it.&lt;br&gt;
We will start by importing the libraries we will use, along with the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pncju3gsovjjb2aiscx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pncju3gsovjjb2aiscx.png" alt="Image description" width="800" height="412"&gt;&lt;/a&gt;&lt;br&gt;
The libraries include some that we will use for machine learning; don't let them scare you.&lt;br&gt;
Let's take a look at a snippet of our dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel4oofia1gfqa0ijtbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flel4oofia1gfqa0ijtbc.png" alt="Image description" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This dataset contains 32 columns.&lt;br&gt;
I have already dealt with the missing values, so we will start with EDA. For this article, we will focus solely on the general churn rate, the geography of the customer, and the customer's lifetime in the service.&lt;/p&gt;

&lt;h3&gt;
  
  
  The General Churn Rate
&lt;/h3&gt;

&lt;p&gt;To get a glimpse of the general churn rate, we introduce a metric, the churn rate (the percentage of customers who churned), and look at it across the characteristics of our customers. We will use a pie chart for this.&lt;br&gt;
Pie charts make it possible to visualize the relationship between the parts and the whole of a variable.&lt;/p&gt;
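A minimal sketch of that pie chart, assuming Matplotlib and a 'Churn Label' column; the stand-in frame below reproduces the 26.5% split for illustration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt

# Stand-in for the churn dataset; the real frame has a 'Churn Label' column
df = pd.DataFrame({"Churn Label": ["Yes"] * 265 + ["No"] * 735})

counts = df["Churn Label"].value_counts()
fig, ax = plt.subplots()
ax.pie(counts, labels=counts.index, autopct="%1.1f%%", startangle=90)
ax.set_title("General churn rate")
fig.savefig("churn_pie.png")
```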

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl9wmgmmauma21mu2jt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl9wmgmmauma21mu2jt4.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the chart, we can see that 26.5% of customers have churned and stopped using the company's services.&lt;/p&gt;

&lt;h3&gt;
  
  
  The geography of the user
&lt;/h3&gt;

&lt;p&gt;We will look at the customers' geographic locations and determine whether geography has an impact on the churn rate.&lt;br&gt;
We will use a Mapbox scatter plot and then hexagonal binning to further understand this relationship.&lt;br&gt;
A scatter plot on a Mapbox map created with Plotly Express combines the geographical context of a map with the ability to display individual data points as markers.&lt;br&gt;
Plotly Express is a high-level data visualization library that allows users to create interactive plots and charts with minimal code.&lt;br&gt;
Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Geographical context&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interactive exploration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customizable markers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Marker clustering&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Color mapping&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Size mapping&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Animations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Customizable map layout&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farb7cxh1j9z72p2cxtz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Farb7cxh1j9z72p2cxtz0.png" alt="Image description" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the scatter plot, the largest number of customers is in the Los Angeles and San Francisco areas, which are large cities.&lt;br&gt;
Let's use a bar chart to get a count of customers per city.&lt;/p&gt;
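That bar chart can be sketched as follows, assuming a 'City' column (the stand-in counts are illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Stand-in 'City' column; the real frame has one row per customer
df = pd.DataFrame({"City": ["Los Angeles"] * 5 + ["San Francisco"] * 3 +
                           ["San Diego"] * 2})

city_counts = df["City"].value_counts()  # sorted, most customers first
ax = city_counts.plot(kind="bar", rot=45, title="Customers per city")
ax.set_ylabel("Number of customers")
plt.tight_layout()
plt.savefig("city_counts.png")
```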

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41qw54q2ce23zcvtnyse.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41qw54q2ce23zcvtnyse.png" alt="Image description" width="800" height="104"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furdfdu1xfjuo7twfwupx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furdfdu1xfjuo7twfwupx.png" alt="Image description" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's add hexagon-based visualizations.&lt;/p&gt;

&lt;p&gt;We want to see the number of customers and the percentage of churned customers by dividing the area into hexagons. This is convenient when we want to understand whether the value of the metric changes depending on the geographical location of the clients, and when entities such as a city or country are very large.&lt;br&gt;
Hexagonal cells are color-coded based on the number of data points they hold, which makes data patterns easy to grasp and helps you identify patterns or clusters in a larger point dataset.&lt;/p&gt;
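One way to sketch both hexagon views (count per cell and mean churn rate per cell) is Matplotlib's hexbin; the random stand-in coordinates and churn flags below are purely illustrative:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lon = rng.uniform(-122.5, -118.0, 500)   # stand-in longitudes
lat = rng.uniform(33.5, 38.0, 500)       # stand-in latitudes
churned = rng.random(500) < 0.265        # stand-in churn flags

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Left: customer count per hexagon
hb1 = ax1.hexbin(lon, lat, gridsize=15, cmap="Blues")
ax1.set_title("Customers per hexagon")
# Right: mean churn rate per hexagon (average of the 0/1 churn flags)
hb2 = ax2.hexbin(lon, lat, C=churned.astype(float),
                 reduce_C_function=np.mean, gridsize=15, cmap="Reds")
ax2.set_title("Churn rate per hexagon")
fig.colorbar(hb1, ax=ax1)
fig.colorbar(hb2, ax=ax2)
fig.savefig("hexbins.png")
```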

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkaktjhkydj5s6lc93bph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkaktjhkydj5s6lc93bph.png" alt="Image description" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folfktfyvzxyk7m41192s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folfktfyvzxyk7m41192s.png" alt="Image description" width="800" height="345"&gt;&lt;/a&gt;&lt;br&gt;
In general, there are few hexagons in the Los Angeles area with a high churn rate (50%+). In some hexagons we see 80-100 percent of customers in the outflow, but these are hexagons with &amp;lt;= 10 customers in total.&lt;/p&gt;

&lt;p&gt;Let's build a scatter plot where the x-axis is the number of customers in a hexagon and the y-axis is the churn rate.&lt;/p&gt;
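A minimal sketch of that plot, assuming the per-hexagon counts and churn rates have already been aggregated (the stand-in summary table below is illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Stand-in per-hexagon summary; in the article this comes from the hex grid
hexes = pd.DataFrame({
    "n_customers": [3, 8, 40, 120, 300],
    "churn_rate":  [1.00, 0.62, 0.30, 0.26, 0.25],
})

ax = hexes.plot.scatter(x="n_customers", y="churn_rate")
ax.axhline(0.265, linestyle="--", label="overall churn rate")
ax.set_xlabel("Customers in hexagon")
ax.set_ylabel("Churn rate")
ax.legend()
plt.savefig("count_vs_churn.png")
```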

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22bfuh9rf6mzrgfkltk4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22bfuh9rf6mzrgfkltk4.png" alt="Image description" width="800" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe59uzsg4v620c0h35aub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe59uzsg4v620c0h35aub.png" alt="Image description" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Churn rates far above the overall 25% are observed only in hexagons with a small number of customers. We do not see any customer geography where our metric behaves differently, so we can treat these hexagons with a small number of customers and a churn rate &amp;gt;= 50% as zones with abnormally high churn rates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer's lifetime in the service
&lt;/h3&gt;

&lt;p&gt;To determine how many months the clients who churned used our service, and whether there is a point at which the largest number of customers stop using the service, we will create a histogram.&lt;/p&gt;
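That histogram can be sketched as follows, assuming 'Churn Label' and 'Tenure Months' columns; the geometric stand-in data below simply mimics the "most churn happens early" shape and is not the real dataset:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Stand-in for df.loc[df["Churn Label"] == "Yes", "Tenure Months"]:
# a geometric distribution is skewed toward the early months
tenure_churned = rng.geometric(p=0.08, size=500)

fig, ax = plt.subplots()
ax.hist(tenure_churned, bins=24)
ax.set_xlabel("Tenure (months)")
ax.set_ylabel("Churned customers")
ax.set_title("Lifetime of churned customers")
fig.savefig("tenure_hist.png")
```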

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q06macm1fv9isitwww7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2q06macm1fv9isitwww7.png" alt="Image description" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will group the data by churn label and check the quantiles of tenure in months.&lt;/p&gt;
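The groupby-quantile step can be sketched like this; note the tiny stand-in frame means the numbers here differ from the article's table, which comes from the full dataset:

```python
import pandas as pd

# Tiny stand-in frame with the same two columns used in the article
df = pd.DataFrame({
    "Churn Label":   ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Tenure Months": [2, 10, 29, 60, 20, 38, 61, 72],
})

# Tenure quantiles per churn group, as in the article
q = df.groupby("Churn Label")["Tenure Months"].quantile([0.50, 0.75, 0.90, 0.95])
print(q)
```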

&lt;pre&gt;&lt;code&gt;Churn Label
No   0.50    38.0
     0.75    61.0
     0.90    71.0
     0.95    72.0
Yes  0.50    10.0
     0.75    29.0
     0.90    51.0
     0.95    60.0
Name: Tenure Months, dtype: float64
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;50% of the customers who left the service did so within their first 10 months. After about 5 months, the number of churning clients stops declining sharply.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;EDA is key to understanding and representing data in a better way, which helps you build a powerful and more generalized model. Data visualization makes EDA easy to perform, and it makes it easy for others to understand what we are doing.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>visualization</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Beginner Data Science Roadmap - 2023</title>
      <dc:creator>Muinde Esther Ndunge </dc:creator>
      <pubDate>Mon, 02 Oct 2023 10:21:12 +0000</pubDate>
      <link>https://dev.to/muinde_esther/beginner-data-science-roadmap-2023-59d8</link>
      <guid>https://dev.to/muinde_esther/beginner-data-science-roadmap-2023-59d8</guid>
      <description>&lt;h2&gt;
  
  
  Beginner's Journey in Data Science
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In the 21st century, data science has earned the title of the "sexiest job", according to the Harvard Business Review. But what exactly is data science?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Science&lt;/strong&gt; is a multidisciplinary field that relies on a cross-disciplinary set of skills. It involves the science of analyzing raw data using various techniques from mathematics, statistics, and machine learning to draw meaningful conclusions and insights. In this article, we will explore the learning curve for beginners in data science.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Tools and Skills Needed
&lt;/h2&gt;

&lt;p&gt;As a beginner, it's essential to acquaint yourself with the key tools and skills required in data science:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Programming Languages:&lt;/strong&gt; Python, R, and SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning Libraries:&lt;/strong&gt; TensorFlow, Keras, and Scikit-learn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Visualization Tools:&lt;/strong&gt; Tools like Tableau, Power BI, and Matplotlib.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Storage and Management Systems:&lt;/strong&gt; Databases like MySQL, MongoDB, and PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Computing Platforms:&lt;/strong&gt; AWS, Azure, and Google Cloud Platform.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Need for Data Science
&lt;/h2&gt;

&lt;p&gt;The demand for data science is on the rise due to the vast amount of data generated by businesses, organizations, and individuals. Data science provides the tools and techniques to extract valuable insights from this data, enabling informed decision-making for businesses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning the Fundamentals
&lt;/h2&gt;

&lt;p&gt;As a beginner in data science, you should build a solid foundation by learning the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At least one programming language such as Python, SQL, Scala, Java, or R.&lt;/li&gt;
&lt;li&gt;Basics of data structures, algorithms, logic, control flow, writing functions, and object-oriented programming.&lt;/li&gt;
&lt;li&gt;Familiarity with Git and GitHub.&lt;/li&gt;
&lt;li&gt;Basic skills in data visualization and manipulation.&lt;/li&gt;
&lt;li&gt;Mathematics skills, including linear algebra, multivariate calculus, and optimization techniques.&lt;/li&gt;
&lt;li&gt;Understanding of statistics and probability, which are essential for mastering machine learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Learn Data Exploration and Preprocessing
&lt;/h2&gt;

&lt;p&gt;Key aspects of data preparation and preprocessing include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exploratory Data Analysis.&lt;/li&gt;
&lt;li&gt;Feature Engineering.&lt;/li&gt;
&lt;li&gt;Data Cleaning.&lt;/li&gt;
&lt;li&gt;Handling Missing Data.&lt;/li&gt;
&lt;li&gt;Data Scaling and Normalization.&lt;/li&gt;
&lt;li&gt;Data collection from various sources, including APIs, databases, publicly available data repositories, and web scraping.&lt;/li&gt;
&lt;/ul&gt;
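Two of those steps, handling missing data and scaling, can be sketched in a few lines of pandas (the toy frame and column names here are illustrative only):

```python
import pandas as pd
import numpy as np

# Toy frame with missing values (columns are hypothetical)
df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "income": [30_000, 52_000, np.nan, 45_000]})

# Handle missing data: fill each column's gaps with its median
df_filled = df.fillna(df.median())

# Min-max scaling: rescale every column to the [0, 1] range
df_scaled = (df_filled - df_filled.min()) / (df_filled.max() - df_filled.min())
print(df_scaled)
```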

&lt;h2&gt;
  
  
  Machine Learning
&lt;/h2&gt;

&lt;p&gt;The next step in your journey is to learn machine learning, which can be divided into two major categories: Supervised and Unsupervised Learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervised Learning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regression:

&lt;ol&gt;
&lt;li&gt;Linear Regression.&lt;/li&gt;
&lt;li&gt;Polynomial Regression.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Classification:

&lt;ol&gt;
&lt;li&gt;Logistic Regression.&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors.&lt;/li&gt;
&lt;li&gt;Support Vector Machines.&lt;/li&gt;
&lt;li&gt;Decision Trees.&lt;/li&gt;
&lt;li&gt;Random Forest.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
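As a taste of supervised learning, here is a minimal scikit-learn sketch: logistic regression fit on a tiny labeled toy dataset (the data is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled data: one feature, classes separable around x = 0
X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the classifier on the labeled examples, then predict new points
clf = LogisticRegression().fit(X, y)
preds = clf.predict([[-2.5], [2.5]])
print(preds)  # → [0 1]
```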

&lt;p&gt;&lt;strong&gt;Unsupervised Learning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clustering:

&lt;ol&gt;
&lt;li&gt;K-means.&lt;/li&gt;
&lt;li&gt;DBSCAN.&lt;/li&gt;
&lt;li&gt;Hierarchical Clustering.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Dimensionality Reduction:

&lt;ol&gt;
&lt;li&gt;Principal Component Analysis (PCA).&lt;/li&gt;
&lt;li&gt;t-Distributed Stochastic Neighbor Embedding (t-SNE).&lt;/li&gt;
&lt;li&gt;Linear Discriminant Analysis (LDA).&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
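And a matching unsupervised sketch: K-means finding two clusters in unlabeled points (again a toy dataset, generated for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of unlabeled 2-D points
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# K-means recovers the two blobs without seeing any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(1))
```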

&lt;p&gt;Additionally, you can explore Reinforcement Learning, where algorithms maximize rewards to reach specific goals. Don't forget to familiarize yourself with machine learning libraries and frameworks like Scikit-learn, TensorFlow, Keras, and PyTorch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Learning
&lt;/h2&gt;

&lt;p&gt;Deep learning is a subset of machine learning that models artificial neural networks after the human brain. Here are some aspects to consider in your deep learning journey:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Neural Networks, including Perceptrons and Multi-Layer Perceptrons.&lt;/li&gt;
&lt;li&gt;Convolutional Neural Networks (CNNs) for tasks like image classification, object detection, and image segmentation.&lt;/li&gt;
&lt;li&gt;Recurrent Neural Networks (RNNs) for sequence-to-sequence models, text classification, and sentiment analysis.&lt;/li&gt;
&lt;li&gt;Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) for tasks like time series forecasting and language modeling.&lt;/li&gt;
&lt;li&gt;Generative Adversarial Networks (GANs) for image synthesis, style transfer, and data augmentation.&lt;/li&gt;
&lt;/ul&gt;
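The building block behind all of these architectures is a single artificial neuron; a minimal NumPy sketch (weights, bias, and input are illustrative values):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# A single neuron: output = sigmoid(w . x + b)
w = np.array([0.5, -0.4, 0.1])   # illustrative weights
b = 0.2                          # illustrative bias
x = np.array([1.0, 2.0, 3.0])    # one input example

activation = sigmoid(np.dot(w, x) + b)
print(round(float(activation), 3))
```

Deep networks stack many such neurons into layers and learn the weights and biases from data via backpropagation.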

&lt;h2&gt;
  
  
  Big Data Technologies
&lt;/h2&gt;

&lt;p&gt;To manage and analyze large datasets effectively, consider learning the following big data technologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hadoop (including HDFS and MapReduce).&lt;/li&gt;
&lt;li&gt;Apache Spark (including RDDs, DataFrames, and MLlib).&lt;/li&gt;
&lt;li&gt;NoSQL databases like MongoDB, Cassandra, HBase, and Couchbase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Visualization and Reporting
&lt;/h2&gt;

&lt;p&gt;Data visualization is a crucial step in data science, as it transforms data into easily understandable insights. Learn tools like Power BI, Tableau, and Python Dash for data visualization. Enhance your storytelling and communication skills to convey your findings effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Domain Knowledge and Soft Skills
&lt;/h2&gt;

&lt;p&gt;Understanding domain-specific knowledge is essential. It helps you grasp the intricacies of a field and focus on critical project aspects such as precision, accuracy, representativeness, and significance. Improve your problem-solving skills by working on projects involving small datasets. Develop effective time management and teamwork skills, as collaboration is common in data science projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Staying Updated and Continuous Learning
&lt;/h2&gt;

&lt;p&gt;Data science is a dynamic field with evolving trends. Stay updated by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enrolling in online courses.&lt;/li&gt;
&lt;li&gt;Reading books and research papers.&lt;/li&gt;
&lt;li&gt;Following data science blogs and podcasts.&lt;/li&gt;
&lt;li&gt;Attending conferences and workshops.&lt;/li&gt;
&lt;li&gt;Engaging with the data science community through networking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Continuous learning is key to mastering data science and staying relevant in this ever-changing field.&lt;/p&gt;

&lt;p&gt;In conclusion, the journey into data science begins with building a strong foundation in programming, mathematics, and statistics. As you progress, explore machine learning, deep learning, big data technologies, and hone your data visualization and soft skills. Embrace continuous learning to keep pace with the dynamic world of data science.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
      <category>data</category>
    </item>
  </channel>
</rss>
