Leveling with cluster analysis in Python: basic Python concepts

#datascience #tutorial #beginners #python

This is the 2nd in a series of 5 short articles that present a simple time series idea and its implementation in Python. The purpose here is to present a time series problem and to show how it can be solved with simple Python code.

Only very basic knowledge of Python and time series is needed, as most concepts will be explained with care and with references to longer tutorials.

Roadmap

The 1st article presented the basic concepts of this series. This one, the 2nd, presents the basic Python concepts and techniques to be used in the solution. The 3rd will present a solution implemented in Python. The 4th will add a sinusoidal decomposition of the data after the filtering done by the solution. And the 5th and last one will use all these elements to address a real problem in cryptocurrencies.

Some simple Python ideas

Libraries, modules and submodules

Isaac Newton, who created a large part of modern Physics and Mathematics, once said that he could see further because he stood on the shoulders of giants.

This concept is behind most software: programs do not create everything from scratch; they reuse a large part of what has already been created, mainly in the form of software libraries. Python is very good at this, and here is the part of the code that brings in the libraries used in this article. In Python, a library is usually called a module.

import numpy as np 

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

The idea that the current code brings in information from another one is captured by the word import in the code above. Another important point is that programmers usually prefer to write less. So, in the 1st line, the library numpy is given the shorter name np. In the next line of code, the module matplotlib.pyplot is given the name plt.

The last line of the code shows another way of shortening: instead of giving a code fragment a shorter name, Python lets a programmer pick only what is needed. In this case, only KMeans is picked from sklearn.cluster.

A less important point is that a dot selects a part of a module, that is, a submodule. In the code shown, matplotlib.pyplot means the submodule pyplot of the module matplotlib. And, of course, sklearn.cluster is the submodule cluster of the module sklearn. Submodules improve the organization of larger modules by dividing them into specialized parts.
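The three import forms described above can be tried without installing anything. The sketch below is only an illustration: it uses the standard-library modules math and urllib.parse as stand-ins for numpy, matplotlib.pyplot and sklearn.cluster, and the aliases m and up are arbitrary names chosen for this example.

```python
# Form 1: import a whole module under a shorter alias.
import math as m

# Form 2: import a submodule (parse) of a module (urllib) under an alias.
import urllib.parse as up

# Form 3: pick a single name out of a submodule.
from urllib.parse import quote

# The alias gives access to everything inside the module.
print(m.sqrt(16.0))

# Both routes below reach the very same function object.
print(up.quote("time series"))
print(quote("time series"))
print(up.quote is quote)
```

An alias does not copy anything: np, plt, m and up are just shorter names for the same module objects.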

Pieces of information in scalar variables

The next part of the code deals with storing the information to be processed. Here is a list of scalar variables, that is, variables that each hold a single value:

coeff: float = 0.25

base: float = 0.0
ladder: float = 1.0
n_points: int = 101
n_half: int = n_points // 2

delta: float = 1.0 / n_points

There are two types of information in this code fragment. The first is int, which holds only integer values: in this case, the number of points (n_points) and half of that number (n_half).

The double slash in the line for n_half makes sure that the division of n_points by 2 produces a number of type int, and not a float. That is, n_half will hold 50, and not the 50.5 that the ordinary division of 101 by 2 would produce.

The type float holds numbers with a decimal part. For instance, coeff holds a coefficient, in this case 0.25. Since the problem here is to represent a discontinuity, one set of values will be close to a base level, held in base, while the others will sit above it by the amount held in ladder.
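The difference between the two division operators can be checked directly. This is a minimal sketch; the names half_floor and half_true are hypothetical and exist only for this comparison.

```python
n_points: int = 101

half_floor = n_points // 2   # floor division: int // int gives an int
half_true = n_points / 2     # true division: always gives a float

print(half_floor, type(half_floor).__name__)   # 50 int
print(half_true, type(half_true).__name__)     # 50.5 float
```

Note that // rounds down rather than toward zero, which only matters for negative numbers; with a positive count such as n_points, it simply discards the decimal part.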

And finally, delta holds the step that will be used as a clock tick in our time series.

A data generator

A time series is usually a series of data collected over time, for instance the mean wage in each year. But to avoid the need for real data, this code will generate its own data by means of a pseudorandom number generator, aka PRNG. In a few words, a PRNG is a mathematical algorithm that generates a sequence of numbers without any apparent pattern; that is, they look random. It's called pseudorandom because, if such an algorithm is fed the same seed, it will always generate the same sequence of numbers.

In this case it is:

rng = np.random.default_rng(42)

In this code, rng is the name of the factory of numbers, and it's created by calling the function (or piece of code) default_rng with the parameter 42, which is the seed. This function is in the submodule np.random.
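The effect of the seed can be seen with a small experiment. This is a sketch for illustration only: the names rng_a, rng_b and rng_c, and the second seed 7, are arbitrary choices that do not belong to the code of this series.

```python
import numpy as np

# Two generators created with the same seed...
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# ...produce exactly the same "random" numbers.
sample_a = rng_a.random(5)   # 5 floats in the interval [0, 1)
sample_b = rng_b.random(5)
print(np.array_equal(sample_a, sample_b))   # True

# A generator with another seed follows a different sequence.
rng_c = np.random.default_rng(7)
sample_c = rng_c.random(5)
print(np.array_equal(sample_a, sample_c))
```

Fixing the seed is what makes a tutorial reproducible: every reader who runs the code gets the same data.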

Pieces of information in arrays

Since a time series contains many values, a single scalar variable can't hold it. The module numpy offers arrays (of type ndarray) for this purpose. In an array, each element is identified by an index, which is analogous to an apartment number in a building.

The following code fragment creates the arrays needed for this problem:
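To make the idea of an index concrete, here is a minimal sketch with a small array. The array values and its contents are hypothetical, made up only for this illustration.

```python
import numpy as np

# A small array with five made-up measurements.
values = np.array([10.0, 11.5, 9.8, 12.1, 10.7])

print(values[0])     # first element: indices start at 0, not 1
print(values[-1])    # negative indices count from the end
print(values[1:3])   # a slice: the elements at indices 1 and 2
print(values.shape)  # (5,): a one-dimensional array with 5 elements
```

Unlike most buildings, the first "apartment" of an array has the number 0.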

x: np.ndarray = np.linspace(0.0, 1.0, n_points)
y: np.ndarray = rng.uniform(-0.5, 0.5, size=n_points)

y[n_half:] += ladder

In the 1st line, the array x receives n_points (that is, 101) evenly spaced numbers from 0.0 to 1.0, in increments of 0.01; that is: 0.0, 0.01, 0.02, ..., 0.99, 1.0.

In the 2nd line, the array y receives n_points numbers chosen pseudorandomly between -0.5 and 0.5.

In the 3rd line, the 2nd half of y is incremented by ladder. That creates the discontinuity to be handled in the remaining articles of the series.
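The description above can be checked numerically. The sketch below repeats the definitions from this article so that it runs on its own; the name step is a hypothetical helper used only for this check.

```python
import numpy as np

n_points = 101
n_half = n_points // 2
ladder = 1.0

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, n_points)
y = rng.uniform(-0.5, 0.5, size=n_points)
y[n_half:] += ladder   # shift only the elements from index n_half onwards

# Distance between consecutive x values: about 0.01.
step = x[1] - x[0]
print(step)

# The slices y[:n_half] and y[n_half:] split the array in two parts.
print(len(y[:n_half]), len(y[n_half:]))   # 50 51

# The first part stays below 0.5; the shifted part starts at 0.5 or above.
print(y[:n_half].max(), y[n_half:].min())
```

Because n_points is odd, the shifted part has one element more than the unshifted one.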

Plotting the data

Please consider the following code fragment:

fig, ax = plt.subplots()
ax.plot(x, y)

ax.set_xlabel('x')
ax.set_ylabel('y')

plt.grid(visible=True)
plt.show()

The 1st line is a generic one that can also be used to create plots (aka charts) much more complicated than the ones used here, for instance several plots in the same image.

The 2nd line plots the elements of the two arrays, connecting consecutive points with lines.

The 3rd and 4th lines of the code label the x and y axes.

The 5th line of the code creates a grid to ease the visualization of the data. And the 6th and final line causes the chart to be shown on the screen.

The final result is like this: