<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Giotto_ai</title>
    <description>The latest articles on DEV Community by Giotto_ai (@giotto_ai).</description>
    <link>https://dev.to/giotto_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F272407%2Fbc0e992e-30db-4b75-ac35-873c9accb051.jpg</url>
      <title>DEV Community: Giotto_ai</title>
      <link>https://dev.to/giotto_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/giotto_ai"/>
    <language>en</language>
    <item>
      <title>Detecting stock market crashes with topological data analysis</title>
      <dc:creator>Giotto_ai</dc:creator>
      <pubDate>Thu, 21 Nov 2019 13:33:44 +0000</pubDate>
      <link>https://dev.to/giotto_ai/detecting-stock-market-crashes-with-topological-data-analysis-322c</link>
      <guid>https://dev.to/giotto_ai/detecting-stock-market-crashes-with-topological-data-analysis-322c</guid>
      <description>&lt;p&gt;As long as there will be financial markets, there will be financial crashes. Most mortals will suffer when the market takes a dip… but those who can foresee it coming, they can protect their assets or take risky short positions to turn a profit (a nevertheless stressful situation to be in, as depicted in the &lt;a href="https://www.youtube.com/watch?v=vgqG3ITMv1Q" rel="noopener noreferrer"&gt;Big-short&lt;/a&gt;).&lt;br&gt;
An asset on the market is associated to a dynamical system, whose price varies in function of the available information. The price of an asset on a financial market is determined by a wide range of information, and in the &lt;a href="https://en.wikipedia.org/wiki/Efficient-market_hypothesis" rel="noopener noreferrer"&gt;efficient market hypothesis&lt;/a&gt;, a simple change in the information will be immediately priced in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The dynamics of financial systems are comparable to those of physical systems
&lt;/h3&gt;

&lt;p&gt;In the same way that phase transitions occur between solids, liquids, and gases, we can discern a normal regime on the market from a chaotic one.&lt;br&gt;
Observations show that financial crashes are preceded by a period of increased oscillation in asset prices. This period of high uncertainty is associated with an abnormal change in the geometric structure of the time series.&lt;br&gt;
In this post we use topological data analysis (TDA) to capture these geometric changes in the time series in order to produce a reliable detector for stock market crashes. The &lt;a href="https://github.com/giotto-ai/stock-market-crashes" rel="noopener noreferrer"&gt;code&lt;/a&gt; implements the ideas given by &lt;a href="https://arxiv.org/abs/1703.04385" rel="noopener noreferrer"&gt;Gidea and Katz&lt;/a&gt;, thanks to &lt;a href="https://giotto.ai/" rel="noopener noreferrer"&gt;Giotto&lt;/a&gt;’s open-source TDA library.&lt;/p&gt;

&lt;h3&gt;
  
  
  There is little consensus about the exact definition of a financial crash
&lt;/h3&gt;

&lt;p&gt;Intuitively, stock market crashes are rapid drops in asset prices. The drop is caused by massive selling of assets, as investors try to close their positions before prices fall even further.&lt;br&gt;
Awareness of a large speculative bubble (as in the case of the subprime crisis), or a catastrophic event, will cause markets to crash. In the last two decades we have seen two major crashes: the 2000 dot-com crash and the 2008 global financial crisis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Our results in a nutshell
&lt;/h3&gt;

&lt;p&gt;We analyse daily prices of the S&amp;amp;P 500 index from 1980 to the present day. The S&amp;amp;P 500 is commonly used to benchmark the state of the financial market, as it measures the stock performance of 500 large-cap US companies.&lt;br&gt;
&lt;em&gt;Compared to a simple baseline, we find that topological signals tend to be robust to noise and hence less prone to produce false positives.&lt;/em&gt;&lt;br&gt;
This highlights one of the key motivations behind TDA, namely that topology and geometry can provide a powerful method to abstract subtle structure in complex data.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2lkmpmn1a1jy9oggicrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F2lkmpmn1a1jy9oggicrm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Detection of stock market crashes from baseline (left) and topological (right) models, discussed in detail below.&lt;/em&gt;&lt;br&gt;
Let’s describe in more detail the two approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  A simple baseline
&lt;/h3&gt;

&lt;p&gt;Given that market crashes represent a sudden decline of stock prices, one simple approach to detect these changes involves tracking the first derivative of average price values over a rolling window. Indeed, in the figure below we can see that this naïve approach already captures the Black Monday crash (1987), the burst of the dot-com bubble (2000–2004), and the financial crisis (2007–2008).&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc1x7xu3bg4m3home9vwo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc1x7xu3bg4m3home9vwo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Magnitude of the first derivative of mean close prices between successive windows.&lt;/em&gt;&lt;br&gt;
By normalising this time series to take values in the [0,1] interval, we can apply a threshold to label points on our original time series where a crash occurred.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fw8liq51ofexrdesapcs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fw8liq51ofexrdesapcs0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Crash probability for baseline model (left), with points above threshold shown on original time series (right).&lt;/em&gt;&lt;/p&gt;
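The baseline can be sketched in a few lines of NumPy; the window length and threshold below are illustrative choices, not the exact parameters used in the article:

```python
import numpy as np

def baseline_crash_signal(prices, window=30, threshold=0.3):
    """Naive crash detector: normalised first derivative of rolling mean prices.

    prices: 1-D array of daily close prices.
    Returns (signal in [0, 1], boolean crash flags). `window` and `threshold`
    are illustrative choices, not the article's exact parameters.
    """
    prices = np.asarray(prices, dtype=float)
    # Rolling mean over the window.
    kernel = np.ones(window) / window
    rolling_mean = np.convolve(prices, kernel, mode="valid")
    # Magnitude of the first derivative between successive windows.
    deriv = np.abs(np.diff(rolling_mean))
    # Normalise to [0, 1] so a single threshold applies everywhere.
    signal = (deriv - deriv.min()) / (deriv.max() - deriv.min())
    return signal, signal > threshold
```

Tuning the threshold trades off sensitivity against the false positives discussed next.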

&lt;h3&gt;
  
  
  Evidently this simple method is rather noisy.
&lt;/h3&gt;

&lt;p&gt;With many points labelled as a crash, following this advice will result in over-panicking and selling your assets too soon. Let’s see if TDA can help us reduce the noise in the signal and obtain a more robust detector!&lt;/p&gt;

&lt;h2&gt;
  
  
  The TDA pipeline
&lt;/h2&gt;

&lt;p&gt;The mathematics underlying TDA is deep and won’t be covered in this article — we suggest this &lt;a href="https://towardsdatascience.com/persistent-homology-with-examples-1974d4b9c3d0?gi=856d8bda11eb" rel="noopener noreferrer"&gt;overview&lt;/a&gt;. For our purposes, it is sufficient to think of TDA as a means to extract informative features which can be used for modeling downstream.&lt;br&gt;
The pipeline we developed consists of: 1) embedding the time series into a point cloud and constructing sliding windows of point clouds, 2) building a filtration on each window to obtain an evolving structure that encodes its geometric shape, 3) extracting the relevant features of those windows using persistent homology, 4) comparing successive windows by measuring the difference between their features, and 5) constructing a crash indicator based on this difference.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fywud7n85ti9h9ekz0aox.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fywud7n85ti9h9ekz0aox.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;TDA pipeline&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Time series as point clouds — Takens’ embedding
&lt;/h3&gt;

&lt;p&gt;A typical starting point in a TDA pipeline is to generate a simplicial complex from a point cloud. Thus, the crucial question in time series applications is: how do we generate such point clouds? Discrete time series, like the ones we are considering, are typically visualised as scatter plots in two dimensions. This representation makes the local behaviour of the time series easy to track by scanning the plot from left to right. But it is often ineffective at conveying important effects which may be occurring over larger time scales.&lt;br&gt;
One well-known set of techniques for capturing periodic behaviour comes from Fourier analysis. For instance, the &lt;a href="https://en.wikipedia.org/wiki/Discrete_Fourier_Transform" rel="noopener noreferrer"&gt;discrete Fourier transform&lt;/a&gt; of a temporal window over the time series gives information on whether the signal in that window arises as the sum of a few simple periodic signals.&lt;br&gt;
For our purposes we consider a different way of encoding a time-evolving process. It is based on the idea that some key properties of the dynamics can be unveiled effectively in higher dimensions. The key idea is to represent a univariate time series (or a single temporal window over that time series) as a point cloud, i.e. a set of vectors in a Euclidean space of arbitrary dimension.&lt;br&gt;
The procedure works as follows: we pick two integers d and τ. For each time tᵢ ∈ (t₀, t₁, …), we collect the values of the variable y at d distinct times, evenly spaced by τ and starting at tᵢ, and present them as a vector with d entries, namely:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feed7vny0cqyyo8qqek5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feed7vny0cqyyo8qqek5x.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The result is a set of vectors in d-dimensional space! τ is called the time delay parameter, and d the embedding dimension.&lt;br&gt;
This time-delay embedding technique is also called Takens’ embedding after Floris Takens, who demonstrated its significance with a celebrated &lt;a href="https://en.wikipedia.org/wiki/Takens%27s_theorem" rel="noopener noreferrer"&gt;theorem&lt;/a&gt; in the context of nonlinear dynamical systems.&lt;br&gt;
The result of this procedure is a time series of point clouds with possibly interesting topologies. The GIF below shows how such a point cloud is generated in two dimensions.&lt;/p&gt;
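giotto-tda ships this construction as a transformer (`TakensEmbedding`); here is a minimal NumPy equivalent of the procedure just described:

```python
import numpy as np

def takens_embedding(y, dimension=2, delay=1):
    """Time-delay (Takens') embedding of a univariate series.

    Each row is the vector (y[t], y[t + delay], ..., y[t + (dimension-1)*delay]),
    so a series of length n yields n - (dimension - 1) * delay points
    in `dimension`-dimensional space.
    """
    y = np.asarray(y, dtype=float)
    n_points = len(y) - (dimension - 1) * delay
    return np.stack(
        [y[i * delay : i * delay + n_points] for i in range(dimension)],
        axis=1,
    )
```

With d=2 and τ=1 a sine wave embeds as a loop — exactly the kind of structure persistent homology is designed to detect.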

&lt;p&gt;
&lt;a href="https://i.giphy.com/media/TdLHiJWMYBCZnxLk91/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/TdLHiJWMYBCZnxLk91/giphy.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Illustration of Takens’ embedding with embedding dimension d=2 and time delay τ=1&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  From point clouds to persistence diagrams
&lt;/h3&gt;

&lt;p&gt;Now that we know how to generate a time series of point clouds, what can we do with this information? Enter persistent homology, which looks for topological features in a simplicial complex that persist over some range of parameter values. Typically, a feature, such as a hole, will initially not be observed, then will appear, and after a range of values of the parameter will disappear again.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ff9nmkyzidf4zuecbilkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ff9nmkyzidf4zuecbilkr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fazbfryhzo2hl9zsn8vbk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fazbfryhzo2hl9zsn8vbk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Point clouds from two successive windows and their associated persistence diagram&lt;/em&gt;&lt;/p&gt;
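As a simplified, self-contained illustration of the idea: in a Vietoris–Rips filtration on a point cloud, every connected component is born at scale 0 and dies when it merges with another, and those death scales are exactly the edge lengths of a Euclidean minimum spanning tree. The sketch below computes this 0-dimensional part in pure NumPy (the diagrams above also include 1-dimensional loops, which giotto-tda's `VietorisRipsPersistence` computes for you):

```python
import numpy as np

def h0_persistence(points):
    """Death times of 0-dimensional persistent homology classes.

    In a Vietoris-Rips filtration every connected component is born at
    scale 0 and dies when it merges with another; the death scales are
    exactly the edge lengths of a Euclidean minimum spanning tree.
    Returns the sorted death times (n - 1 of them; the one component
    that never dies is omitted).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Prim's algorithm: grow the MST from vertex 0.
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dists[0].copy()  # cheapest connection of each vertex to the tree
    deaths = []
    for _ in range(n - 1):
        best[in_tree] = np.inf
        v = int(np.argmin(best))
        deaths.append(best[v])
        in_tree[v] = True
        best = np.minimum(best, dists[v])
    return np.sort(np.array(deaths))
```

Two tight clusters far apart, for instance, yield short death times within each cluster and one long-lived class for the gap between them.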

&lt;h3&gt;
  
  
Distances between persistence diagrams
&lt;/h3&gt;

&lt;p&gt;Given two windows and their corresponding persistence diagrams, we can calculate a variety of distance metrics. Here we compare two distances, one based on the notion of a persistence landscape, the other on Betti curves.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F123v9xyxg4uqg5erpvy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F123v9xyxg4uqg5erpvy9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Magnitude of the landscape (left) and Betti curve (right) distances between successive windows.&lt;/em&gt;&lt;br&gt;
From these figures we can infer that the metric based on the landscape distance is less noisy than the one based on Betti curves.&lt;/p&gt;
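Of the two, the Betti-curve distance is the simpler to write down: a Betti curve counts, at each filtration scale, how many features are alive (born but not yet dead), and two diagrams can then be compared by the L1 distance between their curves on a common grid. A minimal sketch (giotto-tda provides `BettiCurve` and `PairwiseDistance` transformers covering both this and the landscape metric):

```python
import numpy as np

def betti_curve(diagram, grid):
    """Number of persistence pairs alive at each scale in `grid`.

    diagram: array of (birth, death) pairs; grid: 1-D array of scales.
    A pair is alive at scale t when birth <= t < death.
    """
    diagram = np.asarray(diagram, dtype=float)
    births, deaths = diagram[:, 0], diagram[:, 1]
    return ((births[None, :] <= grid[:, None]) &
            (grid[:, None] < deaths[None, :])).sum(axis=1)

def betti_distance(diag_a, diag_b, grid):
    """L1 distance between the Betti curves of two diagrams."""
    return np.abs(betti_curve(diag_a, grid) - betti_curve(diag_b, grid)).sum()
```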

&lt;h3&gt;
  
  
  A topological indicator
&lt;/h3&gt;

&lt;p&gt;Using the landscape distance between windows as our topological feature, it is a simple matter to normalise it as we did for the baseline model. Below we show the resulting detection of stock market crashes for the dot-com bubble and global financial crisis. Compared to our simple baseline, we can see that using topological features appears to reduce the noise in the signal of interest.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fahrtihxvybwq9uyddfeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fahrtihxvybwq9uyddfeo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvjtg17paxw2tyxhid8au.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fvjtg17paxw2tyxhid8au.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Crash probabilities and detections using topological features. The time ranges correspond to the dot-com bubble in 2000 (upper) and the global financial crisis in 2008 (lower).&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our results suggest that the periods of high volatility preceding a crash produce geometric signatures that can be detected more robustly using topological data analysis. However, these results concern only a single market over a limited period of time, so the robustness of the procedure should be investigated further on different markets and with varying thresholds. Nevertheless, the results are encouraging and open up some interesting ideas for future development.&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The shape of football games</title>
      <dc:creator>Giotto_ai</dc:creator>
      <pubDate>Mon, 18 Nov 2019 14:53:23 +0000</pubDate>
      <link>https://dev.to/giotto_ai/the-shape-of-football-games-17j5</link>
      <guid>https://dev.to/giotto_ai/the-shape-of-football-games-17j5</guid>
      <description>&lt;p&gt;If you are a football fan, you’d die to have Messi join your squad. Would your team win the championship? Would it at least avoid relegation? Since Messi cannot join everybody’s team, we use data and simulations to infer the answer. EA Sports’ FIFA dataset is our proxy for player characteristics, and TDA (topological data analysis) is the spice we use to model the probability of each match outcome. Using simulations on those probabilities, we generate the most likely final ranking of the Premier League.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kixhNKn_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/8ldlx1r1l1sipjx2r85y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kixhNKn_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/8ldlx1r1l1sipjx2r85y.jpeg" alt="alt text for accessibility"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Messi is only joining Aston Villa in our Python Jupyter Notebook&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Who do you need to beat Klopp this year?
&lt;/h3&gt;

&lt;p&gt;Our model allows you to compose your squad and measure the overall effect on the team’s end-of-year position in the leaderboard. Try the model for yourself: get the &lt;a href="https://github.com/giotto-ai/giotto-learn/tree/master/examples"&gt;Python code&lt;/a&gt;, and get the data here: &lt;a href="https://www.openml.org/d/42197"&gt;matches&lt;/a&gt;, &lt;a href="https://www.openml.org/d/42198"&gt;odds&lt;/a&gt;, &lt;a href="https://www.openml.org/d/42188"&gt;TDA features&lt;/a&gt;, &lt;a href="https://www.openml.org/d/42194"&gt;player stats&lt;/a&gt;, &lt;a href="https://www.openml.org/d/42199"&gt;player names&lt;/a&gt; (original &lt;a href="https://www.kaggle.com/hugomathien/soccer"&gt;source&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Modelling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ASSUMPTION 1: THE OUTCOME OF A FOOTBALL MATCH DEPENDS ONLY ON THE SPECIFIC AND COMBINED ATTRIBUTES OF THE PLAYERS ON THE FIELD.
&lt;/h3&gt;

&lt;p&gt;Most coaches will disagree with this assumption because we are omitting the non-negligible influence of things like team-spirit, weather, tiredness due to intra-week matches, injuries, yellow/red cards, substitutions, tactics, personal commitment of individual players, time of the season, special commitment of fans and many other internal and external factors that can influence the outcome of a match.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Rhiz66Wf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/akr82j4lqauetg1hfvbp.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Rhiz66Wf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/akr82j4lqauetg1hfvbp.jpeg" alt="alt text for accessibility"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Walter Mazzarri, former Watford manager, became a reference while at F.C. Internazionale in 2014 for the iconic sentence: “we were playing well, and then it started raining”. (source:&lt;a href="https://www.standard.co.uk"&gt;https://www.standard.co.uk&lt;/a&gt;)&lt;/em&gt;&lt;br&gt;
Having said that, you might have figured out the first undeniable truth:&lt;/p&gt;

&lt;h3&gt;
  
  
  Our model is clearly wrong (as any model is…).
&lt;/h3&gt;

&lt;p&gt;Whoever has been on the field as a player, coach, fan, steward, bench warmer or gardener knows that when it comes to predicting the outcome of a match, the amount of information to take into account is wider than the information that can be recorded. A ferocious scream from the stands, a mistaken whistle from the ref, or the shrimps on the lunch menu may jeopardise the whole outcome of the match. Some respected scientists claim that football is just random, or that predicting it is as hard as proving Fermat’s last theorem (except that we have 129 pages of mathematical proof for the latter).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zKcRZ24F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ossm15nroe70tfmihqlj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zKcRZ24F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/ossm15nroe70tfmihqlj.jpeg" alt="alt text for accessibility"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Professor Andrew Wiles proved Fermat’s last theorem in 1994. After 358 years we finally have a proof that “there are no whole number solutions to the equation x^n + y^n = z^n when n is greater than 2, unless xyz=0”. Will somebody ever find a solution for predicting the outcome of football matches? (source:&lt;a href="http://www.ox.ac.uk"&gt;http://www.ox.ac.uk&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We are of course aware that no predictive model can perfectly predict the outcome of football matches. Our ambition is rather to see whether the agnostic methods of topological data analysis can identify relevant patterns in the minuscule set of seven aggregated characteristics per team.&lt;br&gt;
We use the 24 attributes EA assigns to each player to engineer attack and defense features; the correlation matrix of the initial attributes guides their construction. To describe the whole team, we then build the following 7 features from the starting line-up:&lt;br&gt;
· Rating of the goalkeeper&lt;br&gt;
· Maximum attack value in the team&lt;br&gt;
· Maximum defense value in the team&lt;br&gt;
· Average attack in the team&lt;br&gt;
· Average defense in the team&lt;br&gt;
· Standard deviation of attack in the team, in percentage&lt;br&gt;
· Standard deviation of defense in the team, in percentage&lt;br&gt;
We train a model to estimate the probability of each match outcome based on 2591 matches from the past six Premier League seasons. We test on the 380 matches of season 14/15 for which we provide a simulation of the final leaderboard.&lt;/p&gt;
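A sketch of the aggregation step, assuming per-player `attack` and `defense` values and the goalkeeper's `rating` have already been engineered from the 24 EA attributes (the names are illustrative):

```python
import numpy as np

def team_features(gk_rating, attack, defense):
    """The seven aggregated team features used as model input.

    gk_rating: goalkeeper's rating; attack/defense: arrays of the outfield
    players' engineered attack and defense values. The relative (percentage)
    standard deviations capture how evenly skill is spread across the team.
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        gk_rating,
        attack.max(),                          # best attacker
        defense.max(),                         # best defender
        attack.mean(),                         # average attack
        defense.mean(),                        # average defense
        100 * attack.std() / attack.mean(),    # std of attack, in percent
        100 * defense.std() / defense.mean(),  # std of defense, in percent
    ])
```

A match is then described by concatenating the home and away vectors, giving 7+7=14 features.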

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;It is the 31st of December 2011: Sir Alex Ferguson is turning 70 and Manchester United, at Old Trafford, are playing Blackburn Rovers, then bottom of the leaderboard. In the two sides’ previous meeting, the Rovers had suffered an abysmal 7–1 defeat. That night, those expecting another demonstration from Ferguson’s team were in for a surprise: after going 2–0 up, Blackburn sealed a 3–2 win with a goal in the last 10 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  That night the lucky bettors were making 28:1 at the bookies.
&lt;/h3&gt;

&lt;p&gt;Unfortunately, this miracle win at Old Trafford didn’t save Blackburn from relegation. The points dropped that night, however, would prove fatal for the Red Devils in the long run. The missed opportunity led to an incredible tie at the top of the final leaderboard: City and United both closed the season on 89 points. Thanks to a better goal difference, the Citizens took the title, leaving Manchester United with a bitter aftertaste.&lt;/p&gt;

&lt;h3&gt;
  
  
  That’s great. But why topology?
&lt;/h3&gt;

&lt;p&gt;Although this event seemed unpredictable for Manchester United, topology makes a clear separation between this match against Blackburn and confrontations against teams of the same calibre. For comparison, we consider Man-Utd vs West Brom and Man-Utd vs Bolton from the same season.&lt;br&gt;
Let’s try to understand why our match is so special (so you can plan your next trip to the bookies). The first thing we can do is study the space of matches: a match is a point in a 14-dimensional space (remember, each team has 7 features, and a match therefore has 7+7=14). We use the first two components of a PCA centred on each of the matches we consider to visualise similar matches.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fbRSy1LE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/f8s7xsvxc0fctt0exwka.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fbRSy1LE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/f8s7xsvxc0fctt0exwka.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;2-dimensional PCA representation for Man-Utd vs West Brom. 3–0, Man-Utd. vs Bolton 2–0 and Man-Utd. vs Blackburn 2–3&lt;/em&gt;&lt;/p&gt;
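The projection itself is ordinary PCA on the 14-dimensional match vectors; a minimal NumPy sketch via SVD of the centred data (component signs are arbitrary, and the notebook presumably uses a library implementation):

```python
import numpy as np

def pca_2d(X):
    """Project rows of X (n_matches x 14 features) onto the first two
    principal components, via SVD of the centred data matrix."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T
```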

&lt;h3&gt;
  
  
  What you can’t see with PCA, you can see with TDA
&lt;/h3&gt;

&lt;p&gt;The three plots are projections of the 14 features describing a match down to two, and projections are known to lose information along the way. We use TDA to recover and visualise structure from the original space. The tool we use is a persistence diagram (available in Giotto!). A persistence diagram is a representation of the dataset in terms of the connectivity of its points; it is obtained by progressively connecting neighbouring points and measuring the homology of the construction. It is a new way to understand, visualise and extract features from data. If you want to learn more about TDA, we recommend this &lt;a href="https://towardsdatascience.com/persistent-homology-with-examples-1974d4b9c3d0"&gt;post&lt;/a&gt;.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kgmev2ng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9p070wrjubosm4eucyo3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kgmev2ng--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9p070wrjubosm4eucyo3.jpeg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Persistence diagram for Man-Utd vs West Brom. 3–0, Man-Utd. vs Bolton 2–0 and Man-Utd. vs Blackburn 2–3&lt;/em&gt;&lt;br&gt;
The three persistence diagrams are computed on the same point clouds as in the PCA case. The points in the diagrams are no longer matches; rather, they describe relationships between the points in the original space. In our case, they characterise the shape of the point cloud around each of the three selected matches.&lt;/p&gt;

&lt;h3&gt;
  
  
  The persistence diagrams inform you about local and global structures
&lt;/h3&gt;

&lt;p&gt;From the first two diagrams, we can see that all the connected components (represented by the orange points) are concentrated on the y-axis in the interval [5,10]. Furthermore, the loops (represented by the green points) are concentrated in the box [6,8]x[6,8], and their maximum distance to the line y=x is one.&lt;br&gt;
In the last diagram, the orange points are more spread out; the extremal point (0,17) represents a component that is late to connect with the rest of the dataset. The green points, meanwhile, are spread over a wider range along the line y=x, and sit much closer to it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The structure of the third diagram is suggestive of an outlier
&lt;/h3&gt;

&lt;p&gt;Indeed, the orange point (0,17) represents the merge of Man. Utd.–Blackburn with the rest of the matches; this means that the match lies further away from the others than the PCA projection suggests.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---BZOE2-D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/agbq1yn6s83wf4azysgz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---BZOE2-D--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/agbq1yn6s83wf4azysgz.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Persistence diagrams are great; however, they cannot be fed directly into a predictive model. We first need to convert each diagram into a feature a model can use.&lt;/p&gt;

&lt;h3&gt;
  
  
  We use a trick called the amplitude function to synthesise the information in the diagram.
&lt;/h3&gt;

&lt;p&gt;A more detailed explanation of how we extracted features from persistence diagrams is also contained in the Python Jupyter Notebook we are sharing.&lt;/p&gt;
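One common choice of amplitude — an assumption here, since the notebook may use a different metric — is a norm of the points' distances to the diagonal y = x, i.e. of their persistences (giotto-tda exposes several such metrics through an `Amplitude` transformer):

```python
import numpy as np

def amplitude(diagram, p=2):
    """Collapse a persistence diagram into a single scalar feature.

    diagram: array of (birth, death) pairs. Each point's distance to the
    diagonal y = x is (death - birth) / sqrt(2); the amplitude is the
    p-norm of these distances, so diagrams whose points hug the diagonal
    (short-lived, noisy features) get small amplitudes.
    """
    diagram = np.asarray(diagram, dtype=float)
    lifetimes = (diagram[:, 1] - diagram[:, 0]) / np.sqrt(2)
    return np.linalg.norm(lifetimes, ord=p)
```

On the diagrams above, the outlier point (0,17) alone would dominate the amplitude of the Blackburn diagram.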

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;The model for individual matches is trained: we are now ready to run some simulations for the whole season. You can choose a squad and see how far they go. As a test for the model, we studied the impact of transferring Messi in every Premier League team.&lt;/p&gt;
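The season simulation itself is a straightforward Monte Carlo over the per-match outcome probabilities produced by the classifier; a sketch with illustrative inputs (the 3/1/0 points rule is the standard league scoring):

```python
import numpy as np

def simulate_season(teams, fixtures, probs, n_sims=1000, seed=0):
    """Monte Carlo simulation of a league season.

    teams: list of team names; fixtures: list of (home_idx, away_idx);
    probs: array of shape (n_matches, 3) with P(home win, draw, away win)
    per fixture, as output by the match-outcome model.
    Returns the average points per team over the simulated seasons.
    """
    rng = np.random.default_rng(seed)
    total = np.zeros(len(teams))
    for _ in range(n_sims):
        points = np.zeros(len(teams))
        # Sample one outcome per fixture: 0 = home win, 1 = draw, 2 = away win.
        outcomes = [rng.choice(3, p=p) for p in probs]
        for (home, away), out in zip(fixtures, outcomes):
            if out == 0:
                points[home] += 3
            elif out == 1:
                points[home] += 1
                points[away] += 1
            else:
                points[away] += 3
        total += points
    return total / n_sims
```

Sorting teams by average points (or by the distribution of final ranks across simulations) yields the simulated leaderboard; swapping Messi into a squad just changes that team's feature vectors, and hence `probs`.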

&lt;h3&gt;
  
  
  No surprise, hiring Messi is always good.
&lt;/h3&gt;

&lt;p&gt;With Messi in your squad, your chances of being relegated drop by 12% on average, the probability of bringing the trophy home increases by 4% on average, and that of making the top 4 increases by 14%. The team most in need of Messi is Queens Park Rangers, who would climb 11 positions in the leaderboard. Leicester City, who originally finished 14th, would qualify for the Champions League with a simulated probability of 72%.&lt;br&gt;
Below is the original leaderboard for season 14–15, paired with the simulated probabilities for:&lt;br&gt;
-winning the title,&lt;br&gt;
-getting in the top4,&lt;br&gt;
-and being relegated,&lt;br&gt;
with and without Messi in each squad.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YN9lnuxu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nntsifcmidi4e9c95obx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YN9lnuxu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/nntsifcmidi4e9c95obx.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Leaderboard for Premier League season 14–15, including simulated probabilities with and without Messi&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model evaluation
&lt;/h2&gt;

&lt;p&gt;The quality of the leaderboard simulation directly reflects the model’s accuracy in predicting match outcomes. We use a random forest classifier on the 14 features plus the features extracted from the persistence diagrams, and we test the framework against some baseline prediction strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always predicting the home team to win (baseline)&lt;/li&gt;
&lt;li&gt;An Elo rating computed on each team’s performance&lt;/li&gt;
&lt;li&gt;Market predictions given by betting odds&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HDzbGNFq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/l5g60fba6dacadapu50m.png" alt="Alt Text"&gt;&lt;br&gt;
&lt;em&gt;Accuracy of predictive strategies on the 14–15 season&lt;/em&gt;&lt;br&gt;
In the table above, we present the accuracy of each strategy on the test set. Below we compare the predictions in terms of their confusion matrices.&lt;br&gt;
&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LP1wKw63--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/xda3l02wyxvzf7f0q3te.jpeg" alt="Alt Text"&gt;&lt;br&gt;
&lt;em&gt;Confusion matrices of the different predictive strategies&lt;/em&gt;&lt;/p&gt;
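The evaluation machinery is small: an accuracy score and a 3x3 confusion matrix over the outcome classes, plus the always-home-win baseline. A NumPy sketch (encoding home win / draw / away win as 0/1/2 is an assumption of ours):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Confusion matrix over outcome classes 0/1/2 (home win / draw / away win)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # row = true class, column = predicted class
    return cm

def accuracy(y_true, y_pred):
    """Fraction of matches whose outcome was predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def home_win_baseline(n_matches):
    """The naive strategy: always predict a home win (class 0)."""
    return np.zeros(n_matches, dtype=int)
```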

&lt;h3&gt;
  
  
  The results confirm it: football is random. Even the bookies’ betting odds nail the outcome for only 53% of the matches.
&lt;/h3&gt;

&lt;p&gt;Our results are comparable to those given by the betting odds, with which there is a surprisingly strong correlation. This is interesting, since our model relies on over-simplistic data. Our model also has the uncommon capacity to predict draws (which represent 27% of the outcomes).&lt;br&gt;
The model generalises well when presented with data from other years and other championships. Without ever having “seen” an Italian match, the same model achieved 52% accuracy in predicting Serie A matches of the 2015/16 season. This would not be possible for team-specific strategies like the Elo rating.&lt;br&gt;
Maybe the best attribute of the model is its flexibility in building and testing squads. Not only can we mix teams and simulate championships, we can also make smart transfer decisions: for a fixed budget, you can optimise a portfolio of players based on their costs and benefits.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j0xeEgAU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/jty3325eljj6fxxw63ih.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j0xeEgAU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/jty3325eljj6fxxw63ih.jpeg" alt="Alt Text"&gt;&lt;/a&gt; &lt;br&gt;
&lt;em&gt;Lionel Messi has been playing in Barcelona since 2004; maybe it’s time for him to join other teams, at least virtually… (source:&lt;a href="https://metro.co.uk"&gt;https://metro.co.uk&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our attempt is to give a simple solution to a complicated ternary classification problem. The topological model achieves decent accuracy on a very limited set of features, and is comparable to common, though less flexible, approaches.&lt;br&gt;
We have tried it with Messi; now we’re curious: would Ronaldo perform better?&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>sports</category>
    </item>
  </channel>
</rss>
