Leveling with cluster analysis in Python: general concepts

#beginners #tutorial #python #datascience

Financial markets have discontinuities: sometimes a price jumps up or down so quickly that, if the time measured by our clocks were truly continuous in the real-line sense, the jump could be regarded as a genuine discontinuity.

Those discontinuities create problems for many forms of mathematical modelling, because the models are built on continuous functions. For instance, many price oscillations look like periodic functions, but as soon as the data contains a discontinuity, any harmonic analysis becomes troublesome.

A trend can also get in the way of fitting periodic functions to financial data. In that case, however, fitting a low-degree polynomial to the data filters out the trend, and a series of periodic functions can then be fitted to the residuals, that is, the differences between the original data and the fitted polynomial.
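The detrending idea can be sketched in a few lines. This is only an illustration under stated assumptions: the trend is taken to be a straight line (a polynomial of degree 1), the data is synthetic, and the helper names `fit_line` and `detrend` are hypothetical, not part of this series.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def detrend(xs, ys):
    """Return the fitted trend and the residuals (data minus trend)."""
    slope, intercept = fit_line(xs, ys)
    trend = [slope * x + intercept for x in xs]
    residuals = [y - t for y, t in zip(ys, trend)]
    return trend, residuals

# Synthetic data (an assumption): a linear trend plus a sine oscillation.
xs = list(range(100))
ys = [0.5 * x + 10 + 3 * math.sin(2 * math.pi * x / 20) for x in xs]

slope, intercept = fit_line(xs, ys)
trend, residuals = detrend(xs, ys)
```

The residuals keep the oscillation but not the trend, so a periodic model can be fitted to them. Adding the trend back to the residuals recovers the original data.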

The purpose of this series of articles is to present a simple method to eliminate jumps from the observed data. Of course, when the original data is reconstructed from the fitted data, the discontinuity is added back.

Only very basic knowledge of Python and time series is needed, as most concepts will be explained with care and with references to longer tutorials.

Roadmap

This article, the first of five short ones, introduces the general concepts behind the solution. The second presents the basic Python concepts and techniques used in the solution, and the third presents a solution implemented in Python. The fourth adds a sinusoidal decomposition of the data after it has been filtered by the solution. The fifth and last uses all these elements to address a real problem in cryptocurrencies.

Cluster analysis as a means to group similar data

Cluster analysis is a well-known technique for grouping data elements based on their similarities. In a metric space, similarity means smaller distances. There are several ways to devise the groups, or clusters, of data, and one of the simplest is called k-means clustering. In very few words, it represents each cluster by a centroid, the point whose coordinates are the mean of the coordinates of the cluster's points, and it assigns every point to the cluster with the nearest centroid. Throughout these articles we shall use only k-means clustering.
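To make the idea concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is an illustration, not the implementation used later in this series. The function name `kmeans`, the sample points, and the simple deterministic initialization are assumptions made for the example; production libraries normally use smarter initialization.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm for k >= 2; returns (centroids, labels)."""
    # Assumed initialization: k points taken evenly across the input list.
    n = len(points)
    centroids = [points[i * (n - 1) // (k - 1)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Two small, well-separated groups of 2-D points (made-up data).
points = [(1, 1), (1.5, 2), (2, 1), (1, 2), (8, 8), (9, 9), (8, 9), (9, 8)]
centroids, labels = kmeans(points, k=2)
```

The two steps inside the loop are the whole method: assign each point to the nearest centroid, then move each centroid to the mean of its points, and repeat until the assignments stop changing.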

The following image is a typical two-dimensional representation of two groups.

The points are in blue, and the centroids are in red.

Cluster analysis in a curve

Since k-means clustering is based upon the distances between points, an interesting effect appears when the points lie along a curve: they are much closer to their neighbours than points dispersed in a cloud, like those in the previous image.

Please consider the following image that shows a time series with a discontinuity.

When a k-means cluster analysis is applied to it, the centroids of the clusters are shown in red.

It's easy to see that there are two groups at different levels, as shown in the following image:

Group 2 is around the green line, while Group 1 is around the red line.

To eliminate the discontinuity, then, it's enough to lower Group 1 to the level of Group 2. That is, subtract from the y coordinate of each point in Group 1 the difference between the levels of the two groups.
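The leveling step can be sketched as follows. This is a simplified illustration and not necessarily how the later articles implement it. It assumes that only the y values are clustered, that exactly two groups exist, and that the level of a group is the mean of its y values. The names `two_level_split` and `level`, and the sample prices, are hypothetical.

```python
def two_level_split(ys, max_iter=100):
    """1-D k-means with k=2 on the y values; returns (low, high, labels)."""
    low, high = min(ys), max(ys)  # assumed initialization: the extremes
    labels = None
    for _ in range(max_iter):
        # Label 0 = nearer to the low level, label 1 = nearer to the high one.
        new_labels = [0 if abs(y - low) <= abs(y - high) else 1 for y in ys]
        if new_labels == labels:
            break
        labels = new_labels
        lows = [y for y, lab in zip(ys, labels) if lab == 0]
        highs = [y for y, lab in zip(ys, labels) if lab == 1]
        if lows:
            low = sum(lows) / len(lows)
        if highs:
            high = sum(highs) / len(highs)
    return low, high, labels

def level(ys):
    """Lower the high group by the difference between the two levels."""
    low, high, labels = two_level_split(ys)
    jump = high - low
    leveled = [y - jump if lab == 1 else y for y, lab in zip(ys, labels)]
    return leveled, jump, labels

# Made-up series with a jump of about 5 units halfway through.
ys = [10.0, 10.2, 9.9, 10.1, 15.1, 14.9, 15.2, 15.0]
leveled, jump, labels = level(ys)

# Reconstruction: add the jump back to the points that were lowered.
restored = [y + jump if lab == 1 else y for y, lab in zip(leveled, labels)]
```

This sketch only behaves well when the jump is clearly larger than the oscillation inside each group. Keeping `jump` and `labels` is what allows the discontinuity to be added back after a model has been fitted to the leveled data.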

That can be shown in the following image:

Now no difference in level can be seen between the two groups of points.

Next step

The next article in this series will introduce the basic Python concepts needed to create the first of the images presented here, which are also used in the other articles.

DEV Community
