<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nick Kimani</title>
    <description>The latest articles on DEV Community by Nick Kimani (@nickkimani).</description>
    <link>https://dev.to/nickkimani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1177181%2F033d42f7-f9ca-4cb2-8713-3603860d4f7e.jpeg</url>
      <title>DEV Community: Nick Kimani</title>
      <link>https://dev.to/nickkimani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nickkimani"/>
    <language>en</language>
    <item>
      <title>The Complete Guide To Time Series Models</title>
      <dc:creator>Nick Kimani</dc:creator>
      <pubDate>Thu, 02 Nov 2023 19:01:33 +0000</pubDate>
      <link>https://dev.to/nickkimani/the-complete-guide-to-time-series-models-5elk</link>
      <guid>https://dev.to/nickkimani/the-complete-guide-to-time-series-models-5elk</guid>
      <description>&lt;p&gt;You might be wondering, what are time series models? These are models built leveraging time series data. There are a variety of time series models to choose from, ranging from simple models like Autoregressive (AR) time series models to deep learning time series models like Recurrent Neural Networks (RNN). &lt;/p&gt;

&lt;p&gt;However, modelling is only one step of a larger process; the steps leading up to it, together with the modelling itself, make up Time Series Analysis.&lt;/p&gt;

&lt;p&gt;And what are the possible benefits of conducting time series analysis?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can identify trends and seasonality in data that may not otherwise be obvious. From these you can gain valuable business insights that aid decision making in areas such as marketing. &lt;/li&gt;
&lt;li&gt;Forecasting future events. From your analysis and models you can predict future values, for example predicting a nation's GDP from its previous GDP values. &lt;/li&gt;
&lt;li&gt;It can aid in identifying relationships between variables over time. For example, identifying the positive relationship between labor, gross enrollment ratio and development over time for a country like Kenya.&lt;/li&gt;
&lt;li&gt;By viewing trends and the distribution of variables over time, it is easier to spot anomalies in organizations. This is relevant in fraud detection or in detecting the failure of machines or amenities.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this article, I will take you through the steps involved in time series analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Where to source data?
&lt;/h2&gt;

&lt;p&gt;When it comes to time series analysis, the data used has to be indexed by a time dimension. This means the data has to be recorded at regular temporal intervals, whether every minute, hour or day. Because of this, the data is commonly referred to as a time series.&lt;/p&gt;

&lt;p&gt;An important point to note is that no missing values should be present in the data used for time series analysis. This ensures that the trend observed in the data is as accurate as possible. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. What should be the nature of the data?
&lt;/h2&gt;

&lt;p&gt;Data used in time series analysis should be stationary. And what does it mean for data to be stationary? A time series is stationary when its statistical properties, such as the mean and variance, do not change over time. This means that such data has no:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trend&lt;/li&gt;
&lt;li&gt;Drift&lt;/li&gt;
&lt;li&gt;Seasonality&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Cyclicality &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; As much as the data needs some 'predictability' removed, extreme randomness is still not suitable. This means that data following a random walk cannot be used for time series analysis. After all, the whole purpose of a time series model is to predict values, right? :&amp;gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because stationarity is a major consideration, it is important to diagnose the data early. Luckily, there are tests to ease our decision making. The most popular is the Augmented Dickey-Fuller (ADF) test. Its null hypothesis is that a unit root exists in the time series (i.e. the series is non-stationary), against the alternative that the time series is stationary.&lt;/p&gt;

&lt;p&gt;If the data is found to be stationary, then you can proceed with your analysis comfortably. However, if it is not stationary, there is no need to worry, there are ways to transform the time series to stationary. &lt;/p&gt;

&lt;h3&gt;
  
  
  1. Differencing the data points.
&lt;/h3&gt;

&lt;p&gt;This is done by taking the difference between n-lag points. For example: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a 1st lag differnce would be x(t) - x(t-1) 
a 2nd lad differnce would be x(t) - x(t-2)
................................................................
a nth lag difference would be x(t) - x(t-n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
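
&lt;p&gt;A minimal sketch of lag differencing in plain Python (the function name is illustrative):&lt;/p&gt;

```python
def lag_difference(series, lag=1):
    """Return the series of x(t) - x(t - lag) values."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

prices = [10, 12, 15, 19, 24]
print(lag_difference(prices, lag=1))  # [2, 3, 4, 5]
print(lag_difference(prices, lag=2))  # [5, 7, 9]
```

&lt;p&gt;Note that each differencing pass shortens the series by &lt;code&gt;lag&lt;/code&gt; points, since the first &lt;code&gt;lag&lt;/code&gt; observations have nothing to subtract from.&lt;/p&gt;
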
&lt;h3&gt;
  
  
  2. Log-transforming
&lt;/h3&gt;

&lt;p&gt;This is done by log-transforming all data points in the time series. &lt;br&gt;
    From x to ln(x)&lt;/p&gt;
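
&lt;p&gt;A quick sketch of the transformation (values must be positive for the natural log to be defined):&lt;/p&gt;

```python
import math

series = [1.0, 2.0, 4.0, 8.0]

# Replace each x with ln(x); exponential growth becomes a linear trend.
logged = [math.log(x) for x in series]
print([round(v, 3) for v in logged])  # [0.0, 0.693, 1.386, 2.079]
```
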
&lt;h3&gt;
  
  
  3. Seasonal decomposition
&lt;/h3&gt;

&lt;p&gt;This involves identifying the trend and seasonality in the data. These components can then be removed from the data to make it stationary.&lt;/p&gt;

&lt;p&gt;After applying one of these methods your time series should be stationary and ready for modelling.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Identifying the autoregressive (p) and moving average (q) lags for the models.
&lt;/h2&gt;

&lt;p&gt;As discussed earlier, there are a variety of time series models to choose from. Moving average (MA) and autoregressive (AR) models can be considered foundational in time series analysis. When building such models, we look for the best autoregressive and moving average lags to use, with the help of two functions: the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The 'p' in AR models is determined from the PACF, while the 'q' in MA models is obtained from the ACF. &lt;a href="https://www.baeldung.com/cs/acf-pacf-plots-arma-modeling"&gt;More resources on this&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the case where an ARIMA model is being used, a third parameter is introduced alongside 'p' and 'q'. It stands for 'integrated' and is the number of times the data is differenced to make it stationary: 1 for data differenced once, 2 for data differenced twice, and so on.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Building the model
&lt;/h2&gt;

&lt;p&gt;Now that all necessary parameters are ready, all that is left is to choose the model that suits your problem. The basic models are AR and MA models. To push performance further, you can apply the ARMA model, which combines the AR and MA models. ARIMA models are also a combination of AR and MA models, but with an 'integrated' parameter introduced, and the beauty of it is that the data used for ARIMA models can be non-stationary.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**NB:** The non-stationarity in the data for ARIMA models should only be due to unit root as the model cannot deal with other causes of non-stationarity.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I hope this has given you an idea of what time series modelling is and how to approach it.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis using Data Visualization Techniques</title>
      <dc:creator>Nick Kimani</dc:creator>
      <pubDate>Fri, 13 Oct 2023 11:24:16 +0000</pubDate>
      <link>https://dev.to/nickkimani/exploratory-data-analysis-using-data-visualization-techniques-263h</link>
      <guid>https://dev.to/nickkimani/exploratory-data-analysis-using-data-visualization-techniques-263h</guid>
      <description>&lt;p&gt;Data is everywhere in this new age. However, data cannot make sense of itself. This aspect is what leads us to Exploratory data analysis (EDA).&lt;/p&gt;

&lt;p&gt;EDA is the method by which we make sense of data. It involves analysing data to gain insights and to identify relationships and patterns. By performing EDA, an individual can detect outliers and anomalies, view distributions of data (e.g. sales per region), detect trends and come up with insights useful to stakeholders. &lt;/p&gt;

&lt;p&gt;EDA involves the use of certain visualization techniques implemented by various tools. &lt;/p&gt;

&lt;h2&gt;
  
  
  Visualization Tools for EDA
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; - Python is an object-oriented programming language popularly used in data analysis. Within Python, there are a number of libraries dedicated to data visualization.&lt;br&gt;
They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matplotlib&lt;/li&gt;
&lt;li&gt;Seaborn&lt;/li&gt;
&lt;li&gt;Plotly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how to import them to your script or notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FKGQ-nvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gi6eieksu0skl68fpk3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FKGQ-nvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gi6eieksu0skl68fpk3g.png" alt="Importing Libraries" width="626" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To note&lt;/em&gt;: Plotly creates interactive visualizations whereas the other 2 create static visualizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization Software&lt;/strong&gt; - These are tools fully dedicated to data visualization and analysis tasks.&lt;br&gt;
They include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tableau&lt;/li&gt;
&lt;li&gt;PowerBI&lt;/li&gt;
&lt;li&gt;Qlik&lt;/li&gt;
&lt;li&gt;Plotly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visualization Techniques for EDA
&lt;/h2&gt;

&lt;p&gt;In this section we look at the popular visualization techniques (Plots &amp;amp; Graphs) for performing EDA. All of the techniques discussed can be implemented in any of the tools mentioned above.&lt;/p&gt;

&lt;p&gt;Before that it is important to know the different types of data:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Categorical data&lt;/li&gt;
&lt;li&gt;Continuous data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Bar Graphs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is a way of visualizing categorical data usually with rectangular bars of different heights or lengths that depict the size of a category. It is used for scenarios such as comparing the sales per city or per product. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uOQ1GJl8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/thpdqbwox3atx24yvsty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uOQ1GJl8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/thpdqbwox3atx24yvsty.png" alt="Bar Graph" width="800" height="570"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pie Charts&lt;/strong&gt;&lt;br&gt;
Just as the bar graph, this is also used for comparing categorical data. It is a circle, to which proportions are assigned to categories based on their size in relation to the whole data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tZnF4-E8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ei5056earr437g6rcvec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tZnF4-E8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ei5056earr437g6rcvec.png" alt="Pie Chart" width="750" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Histograms&lt;/strong&gt;&lt;br&gt;
It is used for checking the distribution of continuous data such as sales, profits, etc. It can be viewed as the equivalent of a bar graph but for continuous data: it groups the data into ranges, called bins, and displays the total count of records in each bin as bars.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7uKMzJ3Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a7ky7jdtvg27ic1cgn06.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7uKMzJ3Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a7ky7jdtvg27ic1cgn06.png" alt="Histogram" width="700" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Box and Whisker plots&lt;/strong&gt;&lt;br&gt;
This is also used to check the distribution of continuous data, but it is most popular for its usefulness in detecting outliers. The 'box' displays the 1st, 2nd (median) and 3rd quartiles. The whiskers extend to &lt;em&gt;1.5 times the interquartile range&lt;/em&gt; above the 3rd quartile and below the 1st quartile. Any records beyond the whiskers are considered outliers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7JEf0au2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hl0pfxo3ls08jklww3m9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7JEf0au2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hl0pfxo3ls08jklww3m9.png" alt="Box and Whisker" width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scatterplots&lt;/strong&gt;&lt;br&gt;
They are used in bivariate analysis in checking whether two variables could have a correlation to each other. They are typically used with continuous data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k-8OsBUv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/trfg7ph90vfa1h4dfrn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k-8OsBUv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/trfg7ph90vfa1h4dfrn4.png" alt="Scatterplot" width="730" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For example&lt;/em&gt;: The figure on the left shows that the two variables are likely to have a high positive correlation. Whereas the figure on the right shows that there is a low possibility of correlation between the variables. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wordcloud&lt;/strong&gt;&lt;br&gt;
It is a technique for visualizing text data. The size of each word is dictated by its frequency, with large words signifying high frequency. It is popular in the analysis of comments and in sentiment analysis. &lt;br&gt;
For example, a telecom company seeking to find out what its customers think about its products.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AclU8OM9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6cb5smozw90i8ekp5cz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AclU8OM9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h6cb5smozw90i8ekp5cz.png" alt="Wordcloud" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlation Matrix&lt;/strong&gt;&lt;br&gt;
This is a type of heatmap visualization that uses correlation values between variables in data to determine the 'heatcolor' to assign to a relationship between variables. It is useful in finding patterns and relationships in large datasets as it saves on the time that would have been used to inspect the variable relationships individually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bbSc50QO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ns9h3v6tyh38n11zpp2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bbSc50QO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ns9h3v6tyh38n11zpp2d.png" alt="Correlation Matrix" width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Exploratory data analysis (EDA) is an important aspect of making sense of data and extracting valuable, actionable insights from it. The good news is that there are a myriad of ways to perform EDA, from the tools to the techniques. It is important to understand the use cases of the various techniques, because applying them to unsuitable cases would lead to errors in interpretation or illogical visualizations.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>datascience</category>
      <category>dataanalysis</category>
    </item>
    <item>
      <title>Data Science for Beginners: 2023 - 2024 Road Map.</title>
      <dc:creator>Nick Kimani</dc:creator>
      <pubDate>Wed, 04 Oct 2023 20:14:47 +0000</pubDate>
      <link>https://dev.to/nickkimani/data-science-for-beginners-2023-2024-road-map-4jkg</link>
      <guid>https://dev.to/nickkimani/data-science-for-beginners-2023-2024-road-map-4jkg</guid>
      <description>&lt;h1&gt;
  
  
&lt;strong&gt;Guide to starting a career in data science&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v87aYY4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4r72mxnf3fx7fy3kbuj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v87aYY4V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4r72mxnf3fx7fy3kbuj0.png" alt="Roadmap Picture" width="281" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Get familiar with the basics.
&lt;/h2&gt;

&lt;p&gt;There are a lot of tools used in data science, but some fundamentals are needed first. You would need to get familiar with statistics, calculus, linear algebra, a programming language of your choice (Python, R or Java) and Structured Query Language (SQL).&lt;/p&gt;

&lt;h2&gt;
  
  
  Apply the basics.
&lt;/h2&gt;

&lt;p&gt;It is one thing to read about something and another to put it in practice. The next step would be to actively apply the tools you've become familiar with, learning their use cases and methods of application. This would be in the form of data analysis, data visualization, and/or data cleaning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn intermediate data science concepts.
&lt;/h2&gt;

&lt;p&gt;You should familiarize yourself with concepts such as machine learning, the preprocessing that leads up to it, and its do's and don'ts. Learn how to preprocess data and create machine learning models, and how to improve those models through hyperparameter tuning and/or selecting the most relevant features. &lt;/p&gt;

&lt;h2&gt;
  
  
  Participate in competitions or learning forums.
&lt;/h2&gt;

&lt;p&gt;Sites such as Kaggle, Zindi and others hold various competitions which are a good platform to polish your acquired skills and also identify areas of improvement. &lt;/p&gt;

&lt;h2&gt;
  
  
  Learn advanced data science concepts
&lt;/h2&gt;

&lt;p&gt;You would need to learn advanced machine learning techniques. These include deep learning, computer vision, natural language processing (NLP), frameworks such as Keras, and cloud computing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keep Learning!
&lt;/h2&gt;

&lt;p&gt;The data field is always growing. You should join data communities and platforms not only for networking (which is important) but also to keep in touch with current trends and keep your skills sharp. These communities can be found on Kaggle, Zindi, LinkedIn, and Twitter.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Godspeed on your data science journey!!&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
