<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Darth Espressius</title>
    <description>The latest articles on DEV Community by Darth Espressius (@_aadidev).</description>
    <link>https://dev.to/_aadidev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F407894%2Fdadcb23f-f41f-4de3-a6e7-9064ed51f251.jpg</url>
      <title>DEV Community: Darth Espressius</title>
      <link>https://dev.to/_aadidev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/_aadidev"/>
    <language>en</language>
    <item>
      <title>Graph Features for Graph Machine Learning</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Tue, 17 May 2022 22:58:31 +0000</pubDate>
      <link>https://dev.to/_aadidev/graph-features-for-graph-machine-learning-ajh</link>
      <guid>https://dev.to/_aadidev/graph-features-for-graph-machine-learning-ajh</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/_aadidev/node-features-for-graph-machine-learning-30el"&gt;previous post&lt;/a&gt;, we went through what a Graph is in the context of math, as well as node-based features used for machine-learning on graphs.&lt;br&gt;
&lt;strong&gt;TLDR:&lt;/strong&gt; a graph is a set of nodes connected by edges, both of with can contain features. Graph ML is concerned with using graph-based representations to infer these features on new graphs, or in some cases to learn structures existing within a graph. &lt;/p&gt;

&lt;p&gt;The previous post covered features which assumed that we wanted to perform inference on the nodes or edges of a graph. Some applications, however, warrant whole-graph operations; for example, predicting whether a new molecule is toxic, or whether a new protein is compatible with particular enzymes. &lt;/p&gt;

&lt;p&gt;To this end, there are three common approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bag of nodes&lt;/li&gt;
&lt;li&gt;Weisfeiler-Lehman Kernel&lt;/li&gt;
&lt;li&gt;Graphlets&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bag of Nodes
&lt;/h2&gt;

&lt;p&gt;This is the simplest method. Summary statistics from node-level operations, such as histograms of node degree or centrality measures, can be aggregated and used as a graph-level representation. However, this is based entirely on node-level data, which means that larger, structural features of the graph may be missed.&lt;/p&gt;
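&lt;p&gt;As a minimal sketch (pure Python; the adjacency-list representation and the five-bucket degree histogram are my own assumptions, not a standard), a bag-of-nodes feature vector might look like:&lt;/p&gt;

```python
from statistics import mean

def bag_of_nodes_features(adj):
    """Aggregate node-level statistics into one graph-level feature vector.

    adj: dict mapping each node to a list of its neighbours.
    Returns [min, max, mean] of the degree sequence plus a fixed-size
    degree histogram, so graphs of different sizes stay comparable.
    """
    degrees = [len(neighbours) for neighbours in adj.values()]
    hist = [0] * 5  # buckets for degrees 0..3, plus one for "4 or more"
    for d in degrees:
        hist[min(d, 4)] += 1
    return [min(degrees), max(degrees), mean(degrees)] + hist
```

&lt;p&gt;Any classical model (SVM, random forest) can then consume this fixed-length vector, at the cost of discarding all structural information.&lt;/p&gt;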

&lt;h2&gt;
  
  
  Weisfeiler-Lehman Kernel
&lt;/h2&gt;

&lt;p&gt;The idea behind this method is to iteratively aggregate node-level information, so that each node's representation comes to contain data extending past its local &lt;em&gt;ego&lt;/em&gt; graph (its immediate neighbourhood). This method can get mathsy quite quickly.&lt;/p&gt;

&lt;p&gt;A label is assigned to each node, such as the node degree. A hash function then iteratively assigns each node a new label using the multi-set of current labels within the current node's neighbourhood (a multi-set, since some neighbours may have the same degree). This is run a fixed number of times (depending on graph size and how much data we wish to capture). Each node then encodes the structure of its neighbourhood, which can be summarized for further processing.&lt;/p&gt;
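&lt;p&gt;As a rough sketch (pure Python, adjacency-list representation assumed; a real WL kernel would use a compressed label dictionary rather than Python's built-in &lt;code&gt;hash&lt;/code&gt;), one refinement loop might look like:&lt;/p&gt;

```python
from collections import Counter

def wl_relabel(adj, labels, iterations=2):
    """Iteratively refine node labels with the Weisfeiler-Lehman scheme.

    adj: dict mapping each node to a list of its neighbours.
    labels: dict mapping each node to an initial label (e.g. its degree).
    """
    for _ in range(iterations):
        new_labels = {}
        for node, neighbours in adj.items():
            # Multiset of neighbour labels, sorted so the hash is
            # independent of neighbour ordering
            multiset = tuple(sorted(labels[n] for n in neighbours))
            new_labels[node] = hash((labels[node], multiset))
        labels = new_labels
    # Graph-level feature: a "bag" (histogram) of the final labels
    return Counter(labels.values())
```

&lt;p&gt;Two graphs can then be compared by how much their label histograms overlap, which is the essence of the WL kernel.&lt;/p&gt;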

&lt;h2&gt;
  
  
  Graphlets
&lt;/h2&gt;

&lt;p&gt;This is a typically combinatorially difficult problem, since it analyses the different possible subgraph structures (called graphlets) existing in a graph. A graphlet kernel encodes the number of times graphlets of each type occur in a graph, typically as a column vector. A similar approach looks at paths that occur in a graph, and encodes the number of times particular degree-sequences occur. A slight variation of this approach encodes the same data using only the shortest paths between nodes.&lt;/p&gt;
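&lt;p&gt;As an illustrative sketch (pure Python, brute force over all node triples, so only viable for tiny graphs), counting the two connected 3-node graphlets gives a minimal graphlet feature vector:&lt;/p&gt;

```python
from itertools import combinations

def three_node_graphlet_counts(adj):
    """Count the two connected 3-node graphlets: open wedges (a path of
    two edges) and closed triangles. adj maps node -> list of neighbours."""
    wedges = triangles = 0
    for a, b, c in combinations(adj, 3):
        edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if edges == 2:
            wedges += 1
        elif edges == 3:
            triangles += 1
    return [wedges, triangles]
```

&lt;p&gt;A graphlet kernel compares these count vectors (usually normalised) between graphs; the combinatorial cost is why practical implementations restrict graphlet size and use clever sampling.&lt;/p&gt;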

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Node Features for Graph Machine Learning</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Sat, 07 May 2022 18:05:19 +0000</pubDate>
      <link>https://dev.to/_aadidev/node-features-for-graph-machine-learning-30el</link>
      <guid>https://dev.to/_aadidev/node-features-for-graph-machine-learning-30el</guid>
      <description>&lt;p&gt;A graph in the context of mathematics is was almost every other field refers to as a &lt;em&gt;network&lt;/em&gt;. It consists a series of nodes connected by edges, both of which can contain meta- information, or what we refer to as &lt;em&gt;features&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cR3zZOwC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nf1r7okcm0t7qbhva70t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cR3zZOwC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nf1r7okcm0t7qbhva70t.png" alt="A graph of nodes and edges" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, a graph can describe power-stations across some geographic area; Each station could have features describing its maximum output, current output and current demand as its &lt;em&gt;node&lt;/em&gt; features. The edges could describe the energy flowing to distribution stations, or other power-stations in times of high demand. An interesting application may be to forecast the required power output for any station or for the consumption of any distribution station at a given point in time. The benefit of using a graph-based approach in this case is an implicit way to capture the dependencies between stations that deliver power to similar geographic areas, where each node is "aware" of other nodes at given time-steps, and therefore can adjust itself in relation to other nodes for more efficient energy use.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://arxiv.org/pdf/2105.13399.pdf"&gt;this paper&lt;/a&gt; for an interesting application similar to what was described above.&lt;/p&gt;

&lt;p&gt;Before we get ahead of ourselves, we should take a step back and think about how we go about making predictions on a graph. Supervised machine learning uses features constructed from our data to establish (hopefully) pronounced similarities and differences in the input data that may not have been immediately obvious. In order to do this, we take our data (graph with node and edge features) and construct more appropriate features that can more easily be digested by a traditional machine-learning model. &lt;/p&gt;

&lt;p&gt;There are three (traditional) ways of going about this: we can construct features on the nodes, on the edges, or on the entire graph at a time. In this post, I will go through node features, where they are used, and some high-level intuition about each.&lt;/p&gt;




&lt;h1&gt;
  
  
  Node Importance
&lt;/h1&gt;

&lt;p&gt;Identifying &lt;em&gt;important&lt;/em&gt; vertices in a graph is an interesting problem that usually requires the development of some node-level embedding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Node Degree
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Caveat: this is technically a type of centrality, but I keep it separate as it's sometimes referred to as a node "feature"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The most obvious way to create a smart feature describing a node u in a graph with vertex set V is to take the node's &lt;em&gt;degree&lt;/em&gt;: the number of edges leaving or entering the node (or simply the number of incident edges if the graph is undirected).&lt;br&gt;


&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;du=∑u∈VA[u,v]
d_u = \sum_{u\in V}\bold{A}[u, v]
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;u&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;u&lt;/span&gt;&lt;span class="mrel mtight"&gt;∈&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;A&lt;/span&gt;&lt;span class="mopen"&gt;[&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord 
mathnormal"&gt;v&lt;/span&gt;&lt;span class="mclose"&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;As intuitive as this is, there are some drawbacks. Counting the number of edges of a given node only accounts for the node's immediate local neighborhood, and it ignores other node features (such as how important a neighbor may be) that could contribute informative statistics about the graph. Lastly, the importance of a given node should arguably depend on the importance of its neighbors. For example, in a given social network, if you had a direct edge to the President of the US, your "importance" should be more heavily weighted. &lt;/p&gt;
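&lt;p&gt;As a quick sketch (a made-up 4-node undirected graph, with plain Python lists standing in for the adjacency matrix &lt;strong&gt;A&lt;/strong&gt;), the degree feature is just a row sum:&lt;/p&gt;

```python
# Adjacency matrix of a toy undirected 4-node graph:
# node 0 connects to 1 and 2; node 2 also connects to 3.
A = [[0, 1, 1, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]]

# d_u = sum over v of A[u][v]: one degree value per node
degrees = [sum(row) for row in A]
```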

&lt;h2&gt;
  
  
  Node Centrality
&lt;/h2&gt;

&lt;p&gt;Node centrality aims to improve upon vanilla node degree by addressing its main shortcoming: not accounting for neighbors' importance. Additionally, the idea of "centrality" encompasses a range of different methods, a few of which I go through here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Betweenness Centrality
&lt;/h3&gt;

&lt;p&gt;This measure accounts for the number of shortest paths that pass through a given node. Going back to the social-media example: if lots of my friends are directly connected to an important person, then that makes them more important, and therefore (assuming it's actually a &lt;em&gt;lot&lt;/em&gt; of friends) I may then be weighted as a more "important" person. &lt;/p&gt;

&lt;p&gt;This measure counts the number of shortest paths which go through a given node, therefore making it more "central" to the graph. Another way is to think of junctions in a city as nodes, and edges as roads; if you have to pass through the main junction to reach most of the other junctions, it's highly likely that it is indeed important.&lt;/p&gt;

&lt;p&gt;For completeness, here's how we calculate betweenness centrality of a node v:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;C(v)=∑all node pairs in graphNumber of shortest paths between two nodes that pass through node vNumber of shortest path between two nodes
C(v) = \sum_{\text{all node pairs in graph}}\frac{\text{Number of shortest paths between two nodes that pass through node v}}{\text{Number of shortest path between two nodes}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;all node pairs in graph&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Number of shortest path between two nodes&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Number of shortest paths between two nodes that 
pass through node v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
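&lt;p&gt;As a brute-force sketch (pure Python BFS over an unweighted adjacency list; real libraries use Brandes' algorithm, which is far more efficient), the ratio above can be computed pair by pair:&lt;/p&gt;

```python
from collections import deque

def shortest_path_count(adj, s, t):
    """BFS returning (number of shortest s-t paths, their length)."""
    dist, count, queue = {s: 0}, {s: 1}, deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], count[w] = dist[u] + 1, count[u]
                queue.append(w)
            elif dist[w] == dist[u] + 1:
                count[w] += count[u]
    return count.get(t, 0), dist.get(t)

def betweenness(adj, v):
    """Sum, over node pairs, of the fraction of shortest paths through v."""
    others = [n for n in adj if n != v]
    score = 0.0
    for i, s in enumerate(others):
        for t in others[i + 1:]:
            total, d_st = shortest_path_count(adj, s, t)
            if total == 0:
                continue
            n_sv, d_sv = shortest_path_count(adj, s, v)
            n_vt, d_vt = shortest_path_count(adj, v, t)
            # v lies on a shortest s-t path iff the distances add up
            if d_sv is not None and d_vt is not None and d_sv + d_vt == d_st:
                score += n_sv * n_vt / total
    return score
```

&lt;p&gt;On a path graph 0–1–2, the middle node scores 1.0 (every shortest path between the endpoints passes through it), while the endpoints score 0.&lt;/p&gt;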


&lt;h3&gt;
  
  
  Closeness Centrality
&lt;/h3&gt;

&lt;p&gt;This measure captures how far away a node is from every other node. The idea is: if a node is further from every other node, it is less important (note that the concept of "important" here can be flipped if we rank importance by rarity, or if we want to find outlier nodes).&lt;/p&gt;

&lt;p&gt;For a given node v, this measure is taken as the reciprocal of the sum, over every other node, of the number of edges in the shortest path to that node.&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;C(v)=1∑every other node(Number of edges in shortest path)
C(v) = \frac{1}{\sum_{\text{every other node}}\text{(Number of edges in shortest path)}} 
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;C&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop"&gt;&lt;span class="mop op-symbol small-op"&gt;∑&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord text mtight"&gt;&lt;span class="mord mtight"&gt;every other node&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;(Number of edges in shortest path)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
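&lt;p&gt;A minimal sketch (unweighted adjacency list, a single BFS from v; this assumes the graph is connected, so every node is reachable):&lt;/p&gt;

```python
from collections import deque

def closeness(adj, v):
    """Reciprocal of the summed shortest-path lengths from v to all others."""
    dist, queue = {v: 0}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return 1.0 / sum(dist.values())
```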


&lt;h2&gt;
  
  
  Eigenvector Centrality
&lt;/h2&gt;

&lt;p&gt;This is an interesting way of updating node importance based on neighbor importance, and requires a bit more background. We represent a graph as an adjacency matrix (other representations include an edge list or adjacency set), which is square, with the number of rows/columns equal to the number of nodes in the graph. A non-zero entry indicates a connection between two nodes. A node's &lt;em&gt;eigenvector centrality&lt;/em&gt; is defined as follows:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;eu=1λ∑u∈VA[u,v]ev
e_u = \frac{1}{\lambda}\sum_{u\in V}\bold{A}[u, v]e_v
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;u&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span 
class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;u&lt;/span&gt;&lt;span class="mrel mtight"&gt;∈&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;V&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathbf"&gt;A&lt;/span&gt;&lt;span class="mopen"&gt;[&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;span class="mclose"&gt;]&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;If we rewrite the above in vector notation, with &lt;em&gt;e&lt;/em&gt; as a vector of node centralities:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;λe=Ae
\lambda\bold{e} = \bold{Ae}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;λ&lt;/span&gt;&lt;span class="mord mathbf"&gt;e&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathbf"&gt;Ae&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The above is of the form of the eigenvector-eigenvalue decomposition (see Section 4.4 of &lt;a href="https://mml-book.github.io/book/mml-book.pdf"&gt;this free online book&lt;/a&gt; for a great breakdown).&lt;/p&gt;

&lt;p&gt;One view of this measure is that it ranks nodes by the likelihood that they are visited on a random walk of infinite length over the graph.&lt;/p&gt;

&lt;p&gt;The part to note here is that, assuming we require positive centrality values, there are &lt;a href="https://people.math.harvard.edu/~knill/teaching/math19b_2011/handouts/lecture34.pdf"&gt;theorems&lt;/a&gt; which allow us to solve this iteratively and computationally.&lt;/p&gt;
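&lt;p&gt;A minimal sketch of that iterative approach (power iteration on a plain list-of-lists adjacency matrix; this assumes a connected, non-bipartite graph so the iteration settles on the dominant eigenvector):&lt;/p&gt;

```python
def eigenvector_centrality(A, iterations=100):
    """Repeatedly apply A to a vector and renormalise; the result
    converges toward the dominant eigenvector e satisfying lambda*e = A*e."""
    n = len(A)
    e = [1.0] * n
    for _ in range(iterations):
        e = [sum(A[u][v] * e[v] for v in range(n)) for u in range(n)]
        scale = max(e)  # renormalise so values stay bounded
        e = [x / scale for x in e]
    return e

# Toy graph: a triangle (0-1-2) with a pendant node 3 hanging off node 0
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
centrality = eigenvector_centrality(A)
```

&lt;p&gt;The hub of the toy graph ends up most central, its symmetric neighbours tie, and the pendant node scores lowest, matching the "importance flows from neighbours" intuition.&lt;/p&gt;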




&lt;h1&gt;
  
  
  Structure Based Features
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Clustering Coefficient
&lt;/h2&gt;

&lt;p&gt;This is taken as the number of existing connections among a node's neighbors divided by the total number of possible connections between them.&lt;/p&gt;

&lt;p&gt;It corresponds to the probability that two nearest neighbors of a node are themselves connected. In another view: the clustering coefficient measures the proportion of closed triangles in a node's local neighborhood, giving an idea of how tightly knit that neighborhood may be.&lt;/p&gt;

&lt;p&gt;There are many variations of this metric, of which a popular version, the &lt;em&gt;local&lt;/em&gt; clustering coefficient, is computed as follows:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;cv=Number of edges between neighbors of node vNumber of pairs of nodes in neighborhood
c_v = \frac{\text{Number of edges between neighbors of node }v}{\text{Number of pairs of nodes in neighborhood}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;c&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Number of pairs of nodes in neighborhood&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Number of edges between neighbors of node &lt;/span&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;v&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose 
nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
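&lt;p&gt;A minimal sketch (adjacency-list representation; neighbour lookups use &lt;code&gt;in&lt;/code&gt;, so lists or sets both work):&lt;/p&gt;

```python
def local_clustering(adj, v):
    """Edges among v's neighbours divided by the possible pairs k*(k-1)/2."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: nodes with under two neighbours score zero
    links = sum(1 for i, a in enumerate(neighbours)
                for b in neighbours[i + 1:] if b in adj[a])
    return links / (k * (k - 1) / 2)
```

&lt;p&gt;A node in a triangle scores 1.0 (its neighbourhood is fully connected), while a node in a 4-cycle scores 0.0 (its two neighbours never link to each other).&lt;/p&gt;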


&lt;h2&gt;
  
  
  Graphlet Degree Vector
&lt;/h2&gt;

&lt;p&gt;A graphlet is a small collection of nodes that forms a &lt;em&gt;subgraph&lt;/em&gt; of a given network. This method counts the number of graphlets &lt;em&gt;rooted&lt;/em&gt; at each given node (up to a given size, or of a given type). The graphlet degree vector is a column vector recording how many graphlets of each type appear rooted at a given node. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mZvteFPb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l68sh641qwgtux56ry61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mZvteFPb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l68sh641qwgtux56ry61.png" alt="From Pruzi et. al. 2004" width="752" height="441"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;From Pržulj et al., 2004&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For example, to get the graphlet degree vector of a node for graphlets of up to five nodes, every node would have a vector of 73 values representing the various graphlet orbits, where each number counts how many times a given graphlet appears rooted at that node. An interesting note is that the graphlet-frequency vector is fairly robust to random node addition/deletion and rewiring. This may be beneficial for classification tasks, but may not be favorable for outlier-analysis tasks. Additionally, this measure has been used as a basis for graph comparison in fairly recent papers such as &lt;a href="https://www.nature.com/articles/srep35098#Sec5"&gt;Graphlet-based Characterization of Directed Networks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;See &lt;a href="https://academic.oup.com/bioinformatics/article/23/2/e177/202080?login=true"&gt;here&lt;/a&gt; for a more thorough application of the graphlet-degree vector.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>programming</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Automating Data Validation using Kedro Hooks</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Sat, 26 Mar 2022 10:41:03 +0000</pubDate>
      <link>https://dev.to/_aadidev/automating-data-validation-using-kedro-hooks-3gd6</link>
      <guid>https://dev.to/_aadidev/automating-data-validation-using-kedro-hooks-3gd6</guid>
      <description>&lt;h2&gt;
  
  
  30 Second Intro to Kedro
&lt;/h2&gt;

&lt;p&gt;Kedro is a fairly un-opinionated Python framework for running data pipelines. At a high level, Kedro is a DAG-solver, consisting of a series of discrete steps abstracted as &lt;strong&gt;Nodes&lt;/strong&gt;, connected by datasets abstracted as &lt;strong&gt;Catalog&lt;/strong&gt; entries. Nodes are grouped into higher-order constructs called &lt;strong&gt;Pipelines&lt;/strong&gt;, and the order in which nodes run is determined by the data dependencies between each node's inputs and outputs.&lt;/p&gt;
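&lt;p&gt;To make the "order from data dependencies" idea concrete, here is a toy, framework-free sketch (plain Python; the node and dataset names are made up, and real Kedro does considerably more than this):&lt;/p&gt;

```python
def topological_order(nodes):
    """Toy Kedro-style pipeline resolution: each node declares the dataset
    names it consumes and produces, and run order falls out of those data
    dependencies rather than declaration order.

    nodes: list of (node_name, input_datasets, output_datasets) tuples.
    """
    all_outputs = {out for _, _, outs in nodes for out in outs}
    # Datasets nobody produces are "raw" catalog entries, available up front
    produced = {i for _, ins, _ in nodes for i in ins} - all_outputs
    order, remaining = [], list(nodes)
    while remaining:
        ready = [n for n in remaining if set(n[1]) <= produced]
        if not ready:
            raise ValueError("cycle or missing dataset in pipeline")
        for name, _, outs in ready:
            order.append(name)
            produced |= set(outs)
        remaining = [n for n in remaining if n[0] not in order]
    return order
```

&lt;p&gt;Even if "train" is declared first, it runs after "preprocess", because it consumes the dataset that "preprocess" produces.&lt;/p&gt;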

&lt;h2&gt;
  
  
  Hooks
&lt;/h2&gt;

&lt;p&gt;Hooks, according to the Kedro &lt;a href="https://kedro.readthedocs.io/en/stable/07_extend_kedro/02_hooks.html?highlight=Hooks"&gt;documentation&lt;/a&gt;, allow you to extend the behaviour of Kedro's main execution in an easy and consistent manner. A Hook is built from a specification and an implementation, and lives in the standard project structure created by running &lt;code&gt;kedro new&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Hooks are &lt;em&gt;implemented&lt;/em&gt; in &lt;code&gt;hooks.py&lt;/code&gt; and should consist of a set of related functions grouped into a class (one class for each set of related hooks). A hook is then &lt;em&gt;specified&lt;/em&gt; in &lt;code&gt;src/&amp;lt;project_name&amp;gt;/settings.py&lt;/code&gt; by registering the hook class. This is done by importing your newly created hook class and adding it to the &lt;code&gt;HOOKS&lt;/code&gt; key. &lt;/p&gt;

&lt;p&gt;There are several types of hooks, depending on what type of event your hook should follow and when it should execute. In this post, I will be focusing on one specific hook, used to validate data after it has been loaded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Hooks to Validate Data
&lt;/h2&gt;

&lt;p&gt;One Kedro hook, &lt;code&gt;after_dataset_loaded&lt;/code&gt;, allows you to consistently execute a user-defined function every time an entry in your data catalog is loaded. This is helpful in, for example, ensuring the distribution of your data source is as expected. This can be a common issue in building machine-learning pipelines, where monitoring data &lt;em&gt;drift&lt;/em&gt; is crucial to maintaining the performance and trustworthiness of your model. In this post, we will be writing a hook to monitor data drift using the &lt;a href="https://scholarworks.wmich.edu/cgi/viewcontent.cgi?article=4249&amp;amp;context=dissertations"&gt;Population Stability Index&lt;/a&gt;. &lt;/p&gt;
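&lt;p&gt;Before wiring it into a hook, here is one way the check itself might look (a hedged sketch: the ten-bucket equal-width binning and the small epsilon for empty buckets are my own assumptions, not part of the Kedro API or the linked reference):&lt;/p&gt;

```python
import math

def population_stability_index(expected, actual, buckets=10):
    """PSI between a reference sample and a new sample of a numeric column.

    Bins are equal-width over the reference sample's range; each term is
    (p_i - q_i) * ln(p_i / q_i), summed over buckets. As a common rule of
    thumb, values below ~0.1 are read as "no significant drift".
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / buckets or 1.0  # guard against a constant column

    def proportions(sample):
        counts = [0] * buckets
        for x in sample:
            index = min(max(int((x - lo) / width), 0), buckets - 1)
            counts[index] += 1
        # epsilon keeps empty buckets from producing log(0) or div-by-zero
        return [(c + 1e-6) / (len(sample) + 1e-6 * buckets) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

&lt;p&gt;Inside an &lt;code&gt;after_dataset_loaded&lt;/code&gt; implementation, you would call something like this against a stored reference sample and log or raise when the index crosses a threshold.&lt;/p&gt;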

&lt;h3&gt;
  
  
  Hook Definition
&lt;/h3&gt;

&lt;p&gt;We will be using the &lt;code&gt;after_dataset_loaded&lt;/code&gt; Hook to ensure our data for a (potential) machine-learning model is consistent. If we look at the definition of the &lt;code&gt;after_dataset_loaded&lt;/code&gt; Hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;hook_spec&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_dataset_loaded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="s"&gt;"""Hook to be invoked after a dataset is loaded from the catalog.
        Args:
            dataset_name: name of the dataset that was loaded from the catalog.
            data: the actual data that was loaded from the catalog.
        """&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;we see that the hook definition requires a dataset name and the data that was loaded from the catalog. (Don't worry, we don't need to actually supply those ourselves; Kedro handles it. We simply need to implement the above-defined interface in our &lt;code&gt;hooks.py&lt;/code&gt; file and add the &lt;code&gt;hook_impl&lt;/code&gt; decorator to the correctly-named function.) &lt;/p&gt;

&lt;p&gt;For example, let us create a new class in &lt;code&gt;hooks.py&lt;/code&gt;, called &lt;code&gt;PSIHooks&lt;/code&gt; and create the required hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PSIHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;hook_impl&lt;/span&gt; 
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_dataset_loaded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="n"&gt;dataset_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's also assume that our data is a DataFrame with numeric columns, and that we would like to validate each column against a reference series of values stored as arrays. Using &lt;a href="https://github.com/mwburke/population-stability-index/blob/master/psi.py"&gt;this implementation&lt;/a&gt; (not mine) of PSI, we can add the following to the body of our hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# convert dataframe to numpy matrix 
&lt;/span&gt;&lt;span class="n"&gt;actual_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;

&lt;span class="n"&gt;psi_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_psi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_values&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actual_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'f Dataset Name: {dataset_name}'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'PSI Values'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;psi_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, you could add conditional logic to determine which reference data is needed to validate each individual dataset, and to skip monitoring for datasets that are the output of some intermediate data operation.&lt;/p&gt;
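&lt;p&gt;As a sketch of that conditional logic (the naming convention and catalog entries here are hypothetical), you could key off the dataset name that Kedro passes to the hook:&lt;/p&gt;

```python
# Hypothetical convention: only raw catalog entries are monitored, and
# anything produced by an intermediate node is prefixed "intermediate_"
MONITORED_DATASETS = {"companies", "shuttles"}  # hypothetical entries

def should_monitor(dataset_name: str) -> bool:
    """Decide inside after_dataset_loaded whether to compute the PSI."""
    if dataset_name.startswith("intermediate_"):
        return False
    return dataset_name in MONITORED_DATASETS
```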

&lt;p&gt;While the above establishes the general PSI calculation, there is no way to keep track of what our PSI is, or how it changes over time. In this case, we can use an experiment-tracking framework such as MLflow, Neptune.ai or wandb.ai in the body of our hook to log how our PSI changes over time.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Protocol Buffers, Neural Networks and Python Generators</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Mon, 07 Feb 2022 12:15:23 +0000</pubDate>
      <link>https://dev.to/_aadidev/protocol-buffers-neural-networks-and-python-generators-2o5i</link>
      <guid>https://dev.to/_aadidev/protocol-buffers-neural-networks-and-python-generators-2o5i</guid>
      <description>&lt;p&gt;&lt;em&gt;NB: There is an interactive, Google-Collab-style version of this post available &lt;a href="https://colab.research.google.com/github/aadi350/Blogs/blob/main/Protocol_Buffers%2C_Neural_Networks_and_Python_Generators.ipynb#scrollTo=zaIWCsNhidfj"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;TLDR: So I was working on my thesis, and wanted to implement a particular paper that I would be able to iterate upon. Long story short: &lt;a href="https://ieeexplore.ieee.org/document/8451652"&gt;this paper&lt;/a&gt; presented a Fully-Convolutional Siamese Neural Network for Change Detection. And me, being me, was not satisfied with simply cloning their model &lt;a href="https://github.com/rcdaudt/fully_convolutional_change_detection"&gt;from GitHub&lt;/a&gt; and using it as-is. I had to implement it, using TensorFlow (instead of PyTorch), so that I could &lt;em&gt;really&lt;/em&gt; experience the intricacies of their model. (So I did, and you can find it &lt;a href="https://gist.github.com/aadi350/d17e6ddecd51845738ce5d506a25a10c"&gt;here&lt;/a&gt;, but that's beside the point of this post.)&lt;/p&gt;

&lt;p&gt;12 hours and two days later, I was ready to train my model. A &lt;a href="https://ieeexplore.ieee.org/document/9467555"&gt;recent 2022 paper&lt;/a&gt; released a dataset of 20,000 image pairs, with painstakingly labelled masks, for the purposes of training the very type of network I had written. So there I was, ready with data, my training loop &lt;a href="https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch"&gt;written from scratch&lt;/a&gt; and a freshly brewed cup of coffee, ready to type the all-so-crucial command&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python src/train.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But then, after about 15 seconds or so, the stacktrace in my terminal immediately gave me the sense that all was not right....&lt;br&gt;
Garbled, nearly unintelligible collections of words, all hinting that I was running out of memory (somehow 64 gigabytes of system RAM and an 8GB GPU wasn't enough?!), and then the magic error message brought my model training to a screeching halt, indicating something about my "protos" not allowing for such large graph nodes (or something along those lines).&lt;/p&gt;

&lt;p&gt;A quick side-quest: TensorFlow 2.x's default mode of operation is &lt;em&gt;eager&lt;/em&gt; mode: when I hit run, the function runs as-is, with no low-level awareness of the commands that came before or after. However, using special decorators, there is the possibility of a performance enhancement through &lt;em&gt;Graph&lt;/em&gt; execution, where a really smart piece of code optimally chooses how to execute my hand-written code as an execution graph. To get a better understanding of this, see the &lt;a href="https://www.tensorflow.org/guide/intro_to_graphs"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  "proto"?
&lt;/h1&gt;

&lt;p&gt;Now that you have an idea of what Graph execution is, and a general idea of the error I was facing, there remains one vital gap in information: what the hell is a "proto"?! According to &lt;a href="https://stackoverflow.com/questions/34128872/google-protobuf-maximum-size"&gt;this stackoverflow post&lt;/a&gt;, Protobuf has a hard limit of 2GB, since the arithmetic used is typically 32-bit signed. As &lt;a href="https://medium.com/@ouwenhuang/tensorflow-graphs-are-just-protobufs-9df51fc7d08d"&gt;this medium post explained&lt;/a&gt;, TF graphs are simply protobufs. Each operation in TensorFlow is a symbolic handle for a graph-based operation, and the graph is stored as a &lt;a href="https://developers.google.com/protocol-buffers"&gt;Protocol Buffer&lt;/a&gt;. A Protocol Buffer (proto for short) is Google's language-neutral, extensible mechanism for serializing structured data. The specially generated code is used to easily read and write structured data (in this case a TensorFlow graph) regardless of data stream and programming language.&lt;/p&gt;
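&lt;p&gt;That 2GB figure falls straight out of the 32-bit signed arithmetic; as a quick sanity check:&lt;/p&gt;

```python
# The largest value a 32-bit signed integer can hold caps the size of
# a serialized protobuf message at just under 2 GiB
PROTO_SIZE_LIMIT = 2**31 - 1  # bytes
print(PROTO_SIZE_LIMIT)  # 2147483647
```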

&lt;p&gt;To the best of my understanding, my gigantic dataset was causing individual operations in the execution graph to exceed the proto hard limit of 2GB, since I was using the &lt;code&gt;tf.data&lt;/code&gt; API and the &lt;code&gt;from_tensor_slices&lt;/code&gt; function to keep my entire dataset in memory and perform operations from there. Now, the dataset is about 8GB large, wayyyyy smaller than my 64GB of RAM, however performing multiple layers of convolutions (not to mention, in &lt;em&gt;parallel&lt;/em&gt;) quickly caused the entire training pipeline to shut down.&lt;/p&gt;

&lt;p&gt;So I needed to somehow use this large dataset, but without having to keep all the images in memory, and for this, we now move to Python generators.&lt;/p&gt;
&lt;h1&gt;
  
  
  &lt;code&gt;yield&lt;/code&gt;
&lt;/h1&gt;

&lt;p&gt;A &lt;em&gt;generator&lt;/em&gt; function allows you to declare a function that behaves like an iterator. For example, in order to read lines of a text file, I could do the following, which loads the entire file first, then returns it as a list. The downside of this is that the entire file must be kept in memory&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If instead, I do the following&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"r"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I could then call &lt;code&gt;csv_reader&lt;/code&gt; to obtain a &lt;em&gt;generator&lt;/em&gt; object, where the next row is loaded &lt;em&gt;only when it is requested&lt;/em&gt; (via &lt;code&gt;next&lt;/code&gt; or a loop) and the previous row (possibly already processed) can be discarded.&lt;/p&gt;

&lt;p&gt;So something along the lines of&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_reader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
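&lt;p&gt;One subtlety worth noting: calling the generator function returns a fresh generator object, and it is that object which holds the iteration state between &lt;code&gt;next&lt;/code&gt; calls. A tiny self-contained illustration (with an in-memory list standing in for the file):&lt;/p&gt;

```python
def rows():
    # stand-in for lazily reading lines from a file
    for row in ["a", "b", "c"]:
        yield row

reader = rows()
print(next(reader))  # a
print(next(reader))  # b -- state lives on the generator object

# calling rows() again starts a brand-new generator from the top
print(next(rows()))  # a
```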



&lt;h1&gt;
  
  
  Generators and &lt;code&gt;tf.Data&lt;/code&gt;
&lt;/h1&gt;

&lt;p&gt;TensorFlow's &lt;code&gt;tf.data&lt;/code&gt; API is extremely powerful, and the ability to define a Dataset &lt;em&gt;from a generator&lt;/em&gt; is all the more powerful. This is how I solved my issue from above: first, I defined a generator for both the train and validation sets: &lt;/p&gt;

&lt;p&gt;(the &lt;a href="https://gist.github.com/aadi350/b6f5ef46359f20f2cb4a2856a692067a"&gt;preprocessing functions&lt;/a&gt; simply load each image from its file path, convert it to floats and normalize it)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;tensorflow&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'train'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'data/'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;'/time1'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;'/time2'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;'/label'&lt;/span&gt;&lt;span class="p"&gt;))):&lt;/span&gt;
        &lt;span class="c1"&gt;# get full paths
&lt;/span&gt;
        &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_path_rgb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/time1/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_path_rgb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/time2/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_path_grey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/label/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;val_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'val'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'data/'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_path&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;'/time1'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;'/time2'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="nb"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="s"&gt;'/label'&lt;/span&gt;&lt;span class="p"&gt;))):&lt;/span&gt;
        &lt;span class="c1"&gt;# get full paths
&lt;/span&gt;
        &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_path_rgb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/time1/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_path_rgb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/time2/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;process_path_grey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s"&gt;'data/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/label/'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that since my model is a &lt;em&gt;Siamese&lt;/em&gt; neural network, it has two &lt;em&gt;heads&lt;/em&gt; and therefore requires &lt;strong&gt;two&lt;/strong&gt; inputs (t1 and t2 above refer to time-1 and time-2, or before-and-after, while l is the label mask indicating the areas that actually underwent change). Finally, I passed these generators to the &lt;code&gt;tf.data&lt;/code&gt; API calls as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;train_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;val_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;val_gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uint8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following section is more for performance and batching, which further reduces how much data is actually held in memory at any given point in time. The &lt;code&gt;from_generator&lt;/code&gt; call achieves exactly what I wanted: data is loaded on an as-needed basis, and this (thus far) has avoided my headache with Protocol Buffers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;
&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;

&lt;span class="n"&gt;train_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;train_ds&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;val_batches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;val_ds&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
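&lt;p&gt;Conceptually, &lt;code&gt;.batch()&lt;/code&gt; just groups consecutive elements; a pure-Python analogue (shown only for intuition, not part of the training code) looks like this:&lt;/p&gt;

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to batch_size consecutive items, lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```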



&lt;p&gt;This is a very, &lt;strong&gt;very&lt;/strong&gt; problem-specific post, however it does cover some key aspects of dealing with large sets of image data, TensorFlow and Python generators. I hope that you learnt something!&lt;/p&gt;

&lt;p&gt;For any changes, suggestions or overall comments, feel free to reach out to me &lt;a href="https://www.linkedin.com/in/aadidev-sooknanan/"&gt;on LinkedIn&lt;/a&gt; or on Twitter &lt;a href="https://twitter.com/__aadiDev__"&gt;@__aadiDev&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>python</category>
      <category>tensorflow</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>3 Common Loss Functions for Image Segmentation</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Sun, 30 Jan 2022 00:12:18 +0000</pubDate>
      <link>https://dev.to/_aadidev/3-common-loss-functions-for-image-segmentation-545o</link>
      <guid>https://dev.to/_aadidev/3-common-loss-functions-for-image-segmentation-545o</guid>
      <description>&lt;p&gt;Image segmentation has a wide range of applications, from &lt;a href="https://paperswithcode.com/paper/spleeter-a-fast-and-state-of-the-art-music" rel="noopener noreferrer"&gt;music spectrum separation&lt;/a&gt; and &lt;a href="https://deepmind.com/blog/article/how-evolutionary-selection-can-train-more-capable-self-driving-cars" rel="noopener noreferrer"&gt;self-driving-cars&lt;/a&gt; to &lt;a href="https://paperswithcode.com/paper/u-net-convolutional-networks-for-biomedical" rel="noopener noreferrer"&gt;biomedical imaging&lt;/a&gt; and &lt;a href="https://paperswithcode.com/paper/optimized-u-net-for-brain-tumor-segmentation" rel="noopener noreferrer"&gt;brain-tumor segmentation&lt;/a&gt;. The aim of image segmentation is to visually separate &lt;em&gt;(segment)&lt;/em&gt; parts of an image (or image-sequence) into separate objects. For example in the image below from the &lt;a href="https://arxiv.org/pdf/1909.11065v6.pdf" rel="noopener noreferrer"&gt;OCR: Transformer Segmentation paper&lt;/a&gt;, the car at the center of the image was "detected" on a pixel-wise basis. Whilst object detection would simply return the coordinates of say, a bounding box around the car, segmentation aims to return an image mask (1 for "is car", 0 for "is not car") for a given image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup2iz5ys3esazeesczuf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup2iz5ys3esazeesczuf.png" alt="Car Segmented from OCR Paper"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deep learning has affected (in my opinion) the area of computer vision more so than any other field. There have been multiple innovations across various fields using a &lt;a href="https://paperswithcode.com/method/u-net" rel="noopener noreferrer"&gt;variety of techniques&lt;/a&gt; over the past five years. Image segmentation can be thought of as a classification task on the pixel level, and the choice of &lt;em&gt;loss function&lt;/em&gt; for the task of segmentation is key in determining both the speed at which a machine-learning model converges and, to some extent, the accuracy of the model. &lt;/p&gt;

&lt;p&gt;A &lt;em&gt;loss function&lt;/em&gt; gives the model feedback during supervised training (learning from already-labelled data) on how well it is &lt;em&gt;converging&lt;/em&gt; towards the optimal model parameters. It guides the model in its search for the "ideal" approximation mapping input data to output data (images to masks, in the case of image segmentation). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/2006.14822.pdf" rel="noopener noreferrer"&gt;This review paper from Shruti Jadon (IEEE Member)&lt;/a&gt; bucketed loss functions into four main groupings: Distribution-based, region-based, boundary-based and compounded loss. In this blog post, I will focus on three of the more commonly-used loss functions for semantic image segmentation: Binary Cross-Entropy Loss, Dice Loss and the Shape-Aware Loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  Binary Cross-Entropy
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Cross-entropy&lt;/em&gt; is used to measure the difference between two probability distributions. It serves as a similarity metric telling how close one distribution of random events is to another, and is used both for classification (in the more general sense) and for segmentation. &lt;/p&gt;

&lt;p&gt;The binary cross-entropy (BCE) loss therefore attempts to measure the difference in information content between the actual and predicted image masks. It is based on the Bernoulli distribution and works best when the data is equally distributed amongst classes. In other words, image masks with very heavy class imbalance (such as in finding very small, rare tumors in X-ray images) may not be adequately evaluated by BCE. &lt;/p&gt;

&lt;p&gt;This is because BCE treats positive (1) and negative (0) samples in the image mask equally. Since the pixels representing a given object (say, the car from the first image above) may be heavily outnumbered by the rest of the image, the BCE loss may not effectively represent the performance of the deep-learning model.&lt;/p&gt;

&lt;p&gt;Binary Cross Entropy is defined as:&lt;br&gt;


&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L(y,y^)=−ylog⁡(y^)−(1−y)log(1−y^)
 L(y,\hat{y}) = -y\log(\hat{y}) - (1-y)log(1-\hat{y})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;g&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;em&gt;Quick primer on mathematical notation: if 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is our target image-segmentation mask, and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y^\hat{y}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is our predicted mask from our deep-learning model, the loss measures the difference between what we want (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;yy&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
) and what the model gave us (
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;y^\hat{y}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
)&lt;/em&gt;

&lt;p&gt;BCE is implemented in TensorFlow's &lt;code&gt;keras.losses&lt;/code&gt; package and can be used as-is in your image segmentation models.&lt;/p&gt;
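&lt;p&gt;As a minimal NumPy sketch of the formula above (an illustration of the math only, not the Keras implementation itself), per-pixel BCE can be computed directly:&lt;/p&gt;

```python
import numpy as np

# Toy sketch of per-pixel binary cross-entropy from the formula above.
# Illustrative only; in practice use tf.keras.losses.BinaryCrossentropy.
def bce(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth mask pixels
y_pred = np.array([0.9, 0.6, 0.2, 0.1])  # predicted probabilities

mean_loss = bce(y_true, y_pred).mean()   # confident, correct pixels give low loss
```

Note that confident but wrong predictions are penalised heavily, and every pixel contributes equally, which is exactly what drives the class-imbalance issue described above.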

&lt;p&gt;A common adaptation of vanilla BCE is weighted BCE, which scales the positive-pixel term by a coefficient. It is heavily used in medical imaging (and other areas with highly skewed datasets) and is defined as follows:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L(y,y^)=−βylog⁡(y^)−(1−y)log(1−y^)
 L(y,\hat{y}) = -\beta y\log(\hat{y}) - (1-y)log(1-\hat{y})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;g&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β\beta&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 parameter can be tuned: for example, to reduce the number of false-negative pixels, set 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β&amp;gt;1\beta &amp;gt; 1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
; to reduce the number of false positives, set 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;β&amp;lt;1\beta &amp;lt; 1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;β&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 &lt;/p&gt;
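&lt;p&gt;A minimal NumPy sketch of weighted BCE (illustrative only, following the notation above) shows how the coefficient changes the penalty on missed positives:&lt;/p&gt;

```python
import numpy as np

# Sketch of weighted BCE from the formula above: beta scales only the
# positive-class term, so missed positives can be penalised more heavily.
def weighted_bce(y, y_hat, beta=1.0, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(beta * y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# A positive pixel predicted with low confidence (a near false negative):
plain = weighted_bce(1.0, 0.2)               # beta = 1: vanilla BCE
weighted = weighted_bce(1.0, 0.2, beta=2.0)  # beta > 1: larger penalty
```

Negative (0) pixels are unaffected by the coefficient, so only the positive class is re-weighted.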
&lt;h2&gt;
  
  
  Dice Coefficient
&lt;/h2&gt;

&lt;p&gt;The Dice Coefficient is widely used to calculate the similarity between images and is closely related to the &lt;em&gt;Intersection-over-Union&lt;/em&gt; heuristic. It has, as such, been adapted into a loss function, the Dice Loss:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;DL(y,y^)=1−2yy^+1y+y^+1
DL(y, \hat{y}) = 1 - \frac{2y\hat{y}+1}{y+\hat{y}+1}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;D&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;2&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose 
nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
 

&lt;p&gt;A common criticism is that its resulting search space is non-convex; several modifications have been made to make the Dice Loss more tractable for methods such as L-BFGS and Stochastic Gradient Descent. The Dice Loss can be implemented in TensorFlow by subclassing &lt;code&gt;tf.keras.losses.Loss&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DiceLoss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keras&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;losses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Loss&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gama&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DiceLoss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NDL&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;smooth&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gama&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gama&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;nominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; \
            &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt;
        &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gama&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reduce_sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_true&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gama&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;smooth&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;divide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nominator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
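&lt;p&gt;A NumPy mirror of the same logic (a sketch with the same smoothing and exponent defaults, not the TensorFlow class itself) is handy for sanity-checking on toy masks: identical masks should give a loss near 0, disjoint masks a loss near 1.&lt;/p&gt;

```python
import numpy as np

# NumPy sketch of the Dice Loss logic above, for quick checks outside TensorFlow.
def dice_loss(y_true, y_pred, smooth=1e-6, gamma=2):
    numerator = 2 * np.sum(y_true * y_pred) + smooth
    denominator = np.sum(y_pred ** gamma) + np.sum(y_true ** gamma) + smooth
    return 1 - numerator / denominator

mask = np.array([[0.0, 1.0], [1.0, 1.0]])
perfect = dice_loss(mask, mask)       # identical masks: loss ~ 0
disjoint = dice_loss(mask, 1 - mask)  # no overlap: loss ~ 1
```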
&lt;h2&gt;
  
  
  Shape-Aware Loss
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1505.04597.pdf" rel="noopener noreferrer"&gt;The U-Net paper&lt;/a&gt; forced their fully-connected convolutional network to learn small separation borders by using a pre-computed weight map for each ground truth pixel. This was aimed at compensating for the different frequency of pixels from certain classes in the training data set, and is computed using morphological operations. This weight map was computed as:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;w(x)=wc(x)+w0e−(d1(x)+d2(x))22σ2
w(\bold{x}) = w_c(\bold{x}) + w_0 e^{-\frac{
(d_1(\bold{x}) + d_2(\bold{x}))^2}{2\sigma^2}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathbf"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathbf"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mord"&gt;&lt;span class="mord mathnormal"&gt;e&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;σ&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line mtight"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size3 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathbf mtight"&gt;x&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;span class="mbin mtight"&gt;+&lt;/span&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen mtight"&gt;(&lt;/span&gt;&lt;span class="mord mathbf mtight"&gt;x&lt;/span&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;span class="mclose mtight"&gt;&lt;span class="mclose mtight"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter sizing reset-size3 size6"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;d1d_1&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;d2d_2&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 functions give distances to the nearest and second nearest cells. 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;wcw_c&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;c&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is a weight map, manually tuned to weight classes of object instances within an image according to the class distribution. &lt;/p&gt;

&lt;p&gt;This weight term is then used in the typical cross-entropy loss, which results in the following loss function:&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L(y,y^)=−w(x)×[ylog⁡(y^)+(1−y)log⁡(1−p^)]
L(y, \hat{y}) = -w(\bold{x})\times \left[ y\log(\hat{y}) + (1-y)\log(1-\hat{p})\right]
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathbf"&gt;x&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="minner"&gt;&lt;span class="mopen delimcenter"&gt;[&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span 
class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop"&gt;lo&lt;span&gt;g&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mclose delimcenter"&gt;]&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
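&lt;p&gt;As a quick sanity check, the weighted loss above can be sketched in plain Python for a single scalar prediction (the scalar &lt;code&gt;w&lt;/code&gt; here stands in for the per-pixel weight map, rather than full tensors):&lt;/p&gt;

```python
import math

def weighted_bce(y, y_hat, w):
    """Weighted binary cross-entropy for a single pixel/example."""
    return -w * (y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# a confident, correct prediction is penalised only slightly...
print(weighted_bce(1.0, 0.9, 1.0))
# ...while a confident wrong one is penalised heavily, scaled further by the weight
print(weighted_bce(1.0, 0.1, 2.0))
```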


</description>
      <category>tensorflow</category>
      <category>computervision</category>
      <category>deeplearning</category>
      <category>imageprocessing</category>
    </item>
    <item>
      <title>3 Ways to Handle non UTF-8 Characters in Pandas</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Thu, 20 Jan 2022 18:13:44 +0000</pubDate>
      <link>https://dev.to/_aadidev/3-ways-to-handle-non-utf-8-characters-in-pandas-242</link>
      <guid>https://dev.to/_aadidev/3-ways-to-handle-non-utf-8-characters-in-pandas-242</guid>
      <description>&lt;p&gt;So we've all gotten that error, you download a CSV from the web or get emailed it from your manager, who wants analysis done ASAP, and you find a card in your Kanban labelled &lt;em&gt;URGENT AFF&lt;/em&gt;,so you open up VSCode, import Pandas and then type the following: &lt;code&gt;pd.read_csv('some_important_file.csv')&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, instead of the actual import happening, you get the following, near un-interpretable stacktrace:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq8ywh0oiu8t1feo0eoo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgq8ywh0oiu8t1feo0eoo.png" alt="Unintelligible stacktrace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does that even mean?! And what the heck is &lt;code&gt;utf-8&lt;/code&gt;? As a brief primer/crash course: your computer (like all computers) stores &lt;em&gt;everything&lt;/em&gt; as &lt;em&gt;bits&lt;/em&gt; (series of ones and zeros). In order to represent human-readable things (think letters) with ones and zeros, early standards efforts produced the &lt;a href="https://en.wikipedia.org/wiki/ASCII" rel="noopener noreferrer"&gt;ASCII&lt;/a&gt; mappings (character-set names are nowadays registered with the &lt;a href="https://en.wikipedia.org/wiki/Internet_Assigned_Numbers_Authority" rel="noopener noreferrer"&gt;Internet Assigned Numbers Authority&lt;/a&gt;). These map bytes to numeric &lt;em&gt;codes&lt;/em&gt; (in base-10) which represent various characters. For example, &lt;code&gt;00111111&lt;/code&gt; is the binary for &lt;code&gt;63&lt;/code&gt;, which is the code for &lt;code&gt;?&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;These letters then come together to form words, which form sentences. The number of unique characters that ASCII can handle is limited by the number of unique byte values available: 8 bits allow for only 256 unique characters, which is nowhere close to handling every single character from every single language. This is where &lt;a href="https://home.unicode.org/" rel="noopener noreferrer"&gt;Unicode&lt;/a&gt; comes in; Unicode assigns a &lt;em&gt;code point&lt;/em&gt; in &lt;a href="https://learn.sparkfun.com/tutorials/hexadecimal/all" rel="noopener noreferrer"&gt;hexadecimal&lt;/a&gt; to each character. For example, &lt;code&gt;U+1F602&lt;/code&gt; maps to 😂. This allows for over a million possible code points, far broader than the original ASCII.&lt;/p&gt;
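&lt;p&gt;These mappings are easy to poke at from Python itself; a quick sketch using only built-ins:&lt;/p&gt;

```python
# ASCII: '?' is code 63, i.e. the byte 00111111
print(ord('?'))                  # 63
print(format(ord('?'), '08b'))   # 00111111

# Unicode: code point U+1F602 is the 'face with tears of joy' emoji
print(hex(ord('😂')))            # 0x1f602
print(chr(0x1F602))              # 😂
```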

&lt;h1&gt;
  
  
  UTF-8
&lt;/h1&gt;

&lt;p&gt;UTF-8 translates Unicode characters &lt;em&gt;to a unique binary string&lt;/em&gt;, and vice versa. However, UTF-8, as its name suggests, is built from 8-bit units: each character is encoded as one to four bytes, with the most common characters (the ASCII range) taking a single byte. This is similar in spirit to &lt;a href="https://en.wikipedia.org/wiki/Huffman_coding" rel="noopener noreferrer"&gt;Huffman Coding&lt;/a&gt;, which represents the most-used characters or &lt;em&gt;tokens&lt;/em&gt; with the &lt;em&gt;shortest&lt;/em&gt; codes. This is intuitive in the sense that we can afford to assign the least-used characters to longer byte sequences, since they appear rarely. If every character were sent as 4 bytes instead, every mostly-ASCII text file you have would take up roughly four times the space. &lt;/p&gt;
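&lt;p&gt;You can see the variable-length trade-off directly in Python (standard library only):&lt;/p&gt;

```python
# common ASCII characters cost a single byte in UTF-8...
print(len('a'.encode('utf-8')))       # 1
# ...rarer characters cost more: two bytes for 'é', four for '😂'
print(len('é'.encode('utf-8')))       # 2
print(len('😂'.encode('utf-8')))      # 4
# a fixed four-byte encoding pays the maximum price for everything
print(len('a'.encode('utf-32-le')))   # 4
```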

&lt;h2&gt;
  
  
  Caveat
&lt;/h2&gt;

&lt;p&gt;However, this also means that a file produced with some &lt;em&gt;other&lt;/em&gt; encoding (such as UTF-16) will generally not decode cleanly as UTF-8. This raises a key limitation, especially in the field of data science: sometimes we don't need the non-UTF-8 characters, can't process them, or need to save on space. Therefore, here are three ways I handle non-UTF-8 characters when reading into a Pandas dataframe:&lt;/p&gt;
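&lt;p&gt;Before the fixes, here is a minimal sketch of the failure mode itself (hypothetical bytes, standard library only), showing what happens when non-UTF-8 bytes hit a UTF-8 decoder:&lt;/p&gt;

```python
# 'résumé' encoded as Latin-1 contains the byte 0xE9, which is not valid UTF-8
data = 'résumé'.encode('latin-1')

try:
    data.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)  # the same class of error pandas surfaces on read_csv

# decoding with the encoding the bytes were actually written in works fine
print(data.decode('latin-1'))  # résumé
```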

&lt;h3&gt;
  
  
  Find the correct Encoding Using Python
&lt;/h3&gt;

&lt;p&gt;Pandas, by default, assumes &lt;code&gt;utf-8&lt;/code&gt; encoding every time you do &lt;code&gt;pandas.read_csv&lt;/code&gt;, and it can feel like staring into a crystal ball trying to figure out the correct encoding. Your first bet is to use vanilla Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file_name.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Most&lt;/em&gt; of the time, the output resembles the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&amp;lt;_io.TextIOWrapper &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'file_name.csv'&lt;/span&gt; &lt;span class="nv"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'r'&lt;/span&gt; &lt;span class="nv"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'utf16'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;


&lt;span class="sb"&gt;```&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
If that fails, we can move onto the second option

&lt;span class="c"&gt;### Find Using Python Chardet&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;chardet]&lt;span class="o"&gt;(&lt;/span&gt;https://github.com/chardet/chardet&lt;span class="o"&gt;)&lt;/span&gt; is a library &lt;span class="k"&gt;for &lt;/span&gt;decoding characters, once installed you can use the following to determine encoding:
&lt;span class="sb"&gt;```&lt;/span&gt;python


import chardet
with open&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'file_name.csv'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; as f:
    chardet.detect&lt;span class="o"&gt;(&lt;/span&gt;f&lt;span class="o"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output should resemble the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'encoding'&lt;/span&gt;: &lt;span class="s1"&gt;'EUC-JP'&lt;/span&gt;, &lt;span class="s1"&gt;'confidence'&lt;/span&gt;: 0.99&lt;span class="o"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Finally
&lt;/h3&gt;

&lt;p&gt;The last option is using the Linux CLI (fine, I lied when I said three methods &lt;em&gt;using Pandas&lt;/em&gt;)&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

iconv &lt;span class="nt"&gt;-f&lt;/span&gt; utf-8 &lt;span class="nt"&gt;-t&lt;/span&gt; utf-8 &lt;span class="nt"&gt;-c&lt;/span&gt; filepath &lt;span class="nt"&gt;-o&lt;/span&gt; CLEAN_FILE


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;The first &lt;code&gt;utf-8&lt;/code&gt; (after &lt;code&gt;-f&lt;/code&gt;) defines what we think the original file's encoding is&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-t&lt;/code&gt; is the target encoding we wish to convert to (in this case &lt;code&gt;utf-8&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-c&lt;/code&gt; skips invalid sequences&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-o&lt;/code&gt; outputs the cleaned file to an actual filepath (instead of the terminal)&lt;/li&gt;
&lt;/ol&gt;
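&lt;p&gt;If you'd rather stay in Python, the standard library's decode error handlers give a rough equivalent of &lt;code&gt;iconv -c&lt;/code&gt; (a sketch with hypothetical bytes, not a drop-in replacement):&lt;/p&gt;

```python
raw = b'caf\xe9 ok'  # hypothetical file contents; 0xE9 is not valid UTF-8

# errors='ignore' drops invalid sequences, much like iconv -c
print(raw.decode('utf-8', errors='ignore'))   # caf ok

# errors='replace' substitutes U+FFFD instead of silently dropping bytes
print(raw.decode('utf-8', errors='replace'))  # caf� ok
```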

&lt;p&gt;Now that you have your encoding, you can go on to read your CSV file successfully by specifying it in your &lt;code&gt;read_csv&lt;/code&gt; command (substituting whichever encoding you actually detected):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some_csv.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>pandas</category>
      <category>datascience</category>
      <category>python</category>
      <category>linux</category>
    </item>
    <item>
      <title>Why I despise IPython Notebooks</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Wed, 12 Jan 2022 00:07:36 +0000</pubDate>
      <link>https://dev.to/_aadidev/why-i-despise-ipython-notebooks-4eok</link>
      <guid>https://dev.to/_aadidev/why-i-despise-ipython-notebooks-4eok</guid>
      <description>&lt;h1&gt;
  
  
  Reason 1: Esc twice ain’t it
&lt;/h1&gt;

&lt;p&gt;I’m keyboard-driven, I use a tiling window manager, my mouse is seen as a luxury and I find it a pain to lift my hands from my keyboard and break my stream of thought to shift what feels like a 16 hour flight with 2 connections all the way to my mouse. (And it isn’t that I have a bad mouse, I like my mouse! But I like my keyboard more...). &lt;/p&gt;

&lt;p&gt;For those who don’t know what a tiling window manager is, it’s essentially a way of interacting with your open windows, grouped into workspaces (which are switched by key-combinations). I like the idea of auto-aligning open windows (trust me, I’ve been through Alt-Tab hell and back), and having my applications open by keystrokes and automatically fill exactly a given part of the screen is a Godsend!&lt;/p&gt;

&lt;p&gt;Everything you're seeing happens with TWO keystrokes&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--17lz4fbz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffcj6qq7gwbry5il9im6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--17lz4fbz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffcj6qq7gwbry5il9im6.gif" alt="TWM" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
However, my biggest gripe with the entire notebook environment isn’t necessarily related directly to notebooks: it has to do with actual, WORKING vim-shortcut support. &lt;/p&gt;

&lt;p&gt;I love shortcuts, and my entire being is wired to hit &lt;code&gt;j&lt;/code&gt; and &lt;code&gt;k&lt;/code&gt; as soon as I see any sort of code-related text to navigate up and down. My &lt;code&gt;caps-lock&lt;/code&gt; key is remapped to &lt;code&gt;Esc&lt;/code&gt; so that I have more control switching back to ‘Normal’ mode in vim. Usually this wouldn't be an issue, since I use the vim-extension in VSCode and evil-mode in emacs (I use &lt;a href="https://github.com/hlissner/doom-emacs"&gt;doom-emacs&lt;/a&gt;, which uses vim-keybindings by default). &lt;/p&gt;

&lt;p&gt;However, God forbid that you happen to hit the &lt;code&gt;esc&lt;/code&gt; key twice whilst in a Jupyter/IPython cell with the Vim extension loaded (either via the web interface or via VSCode), and you're taken OUT of the entire editor. This means that I now have to reach (what feels like) halfway across the room to my mouse to re-click the cell I was working on. This is just terrible for ergonomics.&lt;/p&gt;

&lt;p&gt;Now, this may not be an IPython-specific issue, it's just that I haven't found an extension which works, since technically speaking all the extensions work as they are supposed to! Hitting &lt;code&gt;esc&lt;/code&gt; takes you to "normal" mode in Vim; however, this now breaks the ability to shift between cells as you normally would in a notebook, because you technically aren't in the actual cell.&lt;/p&gt;
&lt;h1&gt;
  
  
  Reason 2: &lt;code&gt;print(type(df))&lt;/code&gt; is NOT debugging
&lt;/h1&gt;

&lt;p&gt;An IPython/Jupyter notebook is meant to be run sequentially, which I may or may not have an issue with. My MAIN gripe around notebooks, however, comes in the form of debugging. Typically, a break-point is set at a particular line in code, a debugger temporarily halts code execution at that point, and we can set certain variables to be "watched"; i.e. the debugger can keep track of these variables &lt;em&gt;during&lt;/em&gt; execution. For example, in the following code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Conv2D&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;some_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'm applying a &lt;a href="https://learnopencv.com/image-filtering-using-convolution-in-opencv/"&gt;convolution&lt;/a&gt; to what is presumably some sort of image. Don't worry about the actual operation, the important part is that I'm taking an image (or a matrix of floating-point values) and applying some operation to it, which may change the actual values, change the &lt;em&gt;type&lt;/em&gt; of values (some may go to &lt;code&gt;NaN&lt;/code&gt;) and possibly change the actual shape of the original image depending on convolutions. Now, in order to actually check what the operation is doing, I could potentially do the following set of abominations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# get shape
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# make sure I actually get a return Tensor
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="c1"&gt;# make sure no NaNs popped up
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and probably the WORST of them all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# I'm PRINTING an IMAGE as its 
#   RAW values, how in the 
#   actual heck is this helpful..?
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also to the above, remember that IPython notebooks grow in length as cell outputs are populated. Imagine doing the &lt;code&gt;print(output)&lt;/code&gt; MULTIPLE times in one notebook just to validate pre-processing steps (which is very common in computer vision). This is just highly unproductive! &lt;/p&gt;

&lt;p&gt;Compare this to the below, which requires &lt;strong&gt;none&lt;/strong&gt; of the print statements (which usually have to be removed or re-added every time you need to find the values or properties of an object: tiring, unnecessary and overall unproductive). EVERYTHING I could ever want to know about any variable is visible, which is much more useful and cleaner coding practice:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is from using &lt;code&gt;watch&lt;/code&gt; in VSCode's debugger on the output variable, the shape, type, etc etc are all visible without additional code&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jJBiiKJe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87wtwlc0hf5y0y18eioa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jJBiiKJe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/87wtwlc0hf5y0y18eioa.png" alt="Watch Window" width="676" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Finally: Becoming familiar with how an Engineer might look at your code to deploy
&lt;/h1&gt;

&lt;p&gt;This may be the least "ranty" reason. Deployable code does NOT contain random print statements throughout, and typically forms a directed acyclic graph between functions! This structure is broken in a Jupyter notebook, since changing a cell above another cell does NOT automatically update the output of the cell below it. Let me show you what I mean. If I do the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NzIVqVbY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xomiworr6lium5svwavc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NzIVqVbY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xomiworr6lium5svwavc.png" alt="Simple math" width="709" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output is quite clearly correct. However, if I NOW change the value of &lt;code&gt;a&lt;/code&gt; (or &lt;code&gt;b&lt;/code&gt;) and re-run only the upper cell, clearly the following does not remain true:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--suelmrA4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6sapw8aq1rejqjcfmu20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--suelmrA4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6sapw8aq1rejqjcfmu20.png" alt="Definitely not" width="807" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hence the entire notebook needs to be re-run (which is not a big deal, there's a convenient drop-down, but then I have to use my mouse again AND it breaks the natural flow of functions throughout my pipeline). But the biggest issue I have here is modularity: I can't easily swap in one notebook for another without redoing ALL the data pre-processing steps in separate cells. You can't import one notebook into another the same way you'd import packages.&lt;/p&gt;

&lt;p&gt;And deployment?! YIKES, that's a data engineer's job right....? WRONG!! How about we actually think about how our code is to be deployed, and follow some sort of paradigm where our resulting code can be modularly swapped in and out, and, most importantly, actually follows some format so it can be easily tested. There are frameworks that assist in setting up your entire pipeline as a directed-acyclic-graph (such as Kedro), but it's still a chore, and it's highly inefficient not to consider that a data-science pipeline, outside of purely academic research, is ultimately headed for deployment in some setting.&lt;/p&gt;

&lt;p&gt;In conclusion, there are benefits to IPython notebooks, ease of demonstration, etc etc. However, this article covers my &lt;strong&gt;opinion&lt;/strong&gt; of Jupyter/IPython notebooks, and why I try my best to steer clear of them for data-science/machine-learning related tasks.&lt;/p&gt;

&lt;p&gt;If you like rants like these, feel free to follow me on &lt;a href="https://twitter.com/_aadiDev"&gt;twitter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>jupyter</category>
      <category>datascience</category>
      <category>programming</category>
      <category>linux</category>
    </item>
    <item>
      <title>Replacing terms using ^ in the Linux Terminal</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Wed, 05 Jan 2022 23:05:11 +0000</pubDate>
      <link>https://dev.to/_aadidev/replacing-terms-using-in-the-linux-terminal-405c</link>
      <guid>https://dev.to/_aadidev/replacing-terms-using-in-the-linux-terminal-405c</guid>
      <description>&lt;p&gt;I use the terminal quite a lot in my day-to-day activities. This includes, but is not limited to: copying files, installing packages, running updates, searching for folders, etc. Sometimes, the commands are simple &lt;code&gt;ls -la&lt;/code&gt; or &lt;code&gt;grep&lt;/code&gt; to show files and search for text respectively. &lt;/p&gt;

&lt;p&gt;But sometimes these commands are longer, MUCH longer, well past the point where I should have written them in a script file and run them that way. AND, God forbid I make a mishap typing some unnecessarily convoluted command and hit enter, OR I need to use a similar command again; then I have to go through the trouble of either typing it from scratch or hunting through terminal history to find, modify and re-execute the command. &lt;/p&gt;

&lt;p&gt;Until I discovered the &lt;code&gt;^&lt;/code&gt; operator....&lt;/p&gt;

&lt;p&gt;Let's start with an example: say I wanted to use &lt;code&gt;conda&lt;/code&gt; (a package manager commonly used for data-science development) to create a new environment. In this case I'm using the following code from &lt;a href="https://rapids.ai/start.html#get-rapids" rel="noopener noreferrer"&gt;NVIDIA's RAPIDS install instructions&lt;/a&gt; for ease of demonstration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conda create &lt;span class="nt"&gt;-n&lt;/span&gt; rapids-21.12 &lt;span class="nt"&gt;-c&lt;/span&gt; rapidsai &lt;span class="nt"&gt;-c&lt;/span&gt; nvidia &lt;span class="nt"&gt;-c&lt;/span&gt; conda-forge &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;rapids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;21.12 &lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3.8 &lt;span class="nv"&gt;cudatoolkit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;11.5 dask-sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, let's say that I realised that I wanted to actually create the environment using a different version of python (again, for demonstration purposes, although there are packages that work with 3.7 but not 3.8).&lt;/p&gt;

&lt;p&gt;In order to replace the &lt;code&gt;3.8&lt;/code&gt; from the above command, here's the code I can use for in-place substitution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;^3.8^3.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code will replace the first occurrence of &lt;code&gt;3.8&lt;/code&gt; from my previous command with &lt;code&gt;3.7&lt;/code&gt; and proceed to re-execute the command. &lt;/p&gt;

&lt;p&gt;I realise the above is REALLY application-specific, so let's break this down:&lt;br&gt;
If I type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;star wars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the &lt;em&gt;output&lt;/em&gt; is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;star wars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If I type the following after the above&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;^wars^trek
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output becomes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;star trek
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me show that in-terminal so you get an idea of exactly what we're doing&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg50bb2ndaeivih6op8ka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg50bb2ndaeivih6op8ka.png" alt="Basic Replacement"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;^&lt;/code&gt; operator is a shorthand way of using the &lt;code&gt;gs&lt;/code&gt; command for global substitution. In other terms, &lt;code&gt;^wars^trek&lt;/code&gt; is equivalent to&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;!!&lt;/span&gt;:gs/wars/trek
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's say I wanted to replace &lt;strong&gt;all&lt;/strong&gt; instances of a term from the most-recently run command, so if I ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;luke luke
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;followed by&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;^luke^leia^:&amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;the result is the equivalent of running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;leia leia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, to see the continuity of how these commands work, we need to see them in-shell:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdot9byqy4l555s8ij59g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdot9byqy4l555s8ij59g.png" alt="Full replace"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And now back to the original example (ignore the &lt;code&gt;CondaError&lt;/code&gt; as this was necessary to cancel the original command)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj77bpb38tu1a0ftex0te.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj77bpb38tu1a0ftex0te.png" alt="conda command"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>tutorial</category>
      <category>linux</category>
    </item>
    <item>
      <title>Anomaly Detection I - Distance-Based Methods</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Sun, 21 Nov 2021 14:16:01 +0000</pubDate>
      <link>https://dev.to/_aadidev/anomaly-detection-i-distance-based-methods-278g</link>
      <guid>https://dev.to/_aadidev/anomaly-detection-i-distance-based-methods-278g</guid>
      <description>&lt;p&gt;In my &lt;a href=""&gt;previous post&lt;/a&gt;, I went through the basics of what anomaly detection is, why it is important and current challenges in the field. To give a TLDR&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An outlier is generally considered a data point which is significantly different from other data points or which does not conform to the expected normal pattern of the phenomenon it represents&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HYJvTTSA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6nqdnjndw6c9g7y4ujv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HYJvTTSA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6nqdnjndw6c9g7y4ujv3.png" alt="Anomalous in Cartesian Plane" width="629" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Distance-Based Approaches
&lt;/h1&gt;

&lt;p&gt;A distance-based approach to anomaly detection is one which relies on some measure of distance, or distance-derived metric between points and sets of points. This results in multiple concepts of distance-based anomaly detection:&lt;/p&gt;

&lt;h2&gt;
  
  
  Less Than &lt;em&gt;p&lt;/em&gt; Samples
&lt;/h2&gt;

&lt;p&gt;In this approach, points with fewer than &lt;em&gt;p&lt;/em&gt; neighbouring points are classified as anomalous or outliers. For example, in the image below, the test point at the bottom left may be classified as anomalous based on the neighbour distance (represented as the dashed circle) and the number &lt;em&gt;p&lt;/em&gt;, which is chosen depending on how sensitive the algorithm must be to anomalies. Very high values of &lt;em&gt;p&lt;/em&gt; will result in high numbers of anomalies, since few points will have the necessary number of neighbours within their radius of consideration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--68to4Z7I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/247rr0joaq9qxn4d6i4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--68to4Z7I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/247rr0joaq9qxn4d6i4f.png" alt="p-nearest" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  kNN Methods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Distance to All Points
&lt;/h3&gt;

&lt;p&gt;This is the simplest possible method: an algorithm evaluates a single point against every other point, and the sum of the distances may be used as the anomaly score. However, computing the scores of all data points is computationally intensive, since all pairwise distances between points must be calculated. This becomes very expensive when the number of data points is large.&lt;/p&gt;
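
&lt;p&gt;As a sketch (the NumPy array below is purely illustrative), the sum-of-distances score can be read straight off the pairwise distance matrix:&lt;/p&gt;

```python
import numpy as np

def sum_distance_scores(points):
    # Anomaly score for each point: sum of its distances to every other point
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return dists.sum(axis=1)

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = sum_distance_scores(points)
# The isolated point accumulates by far the largest total distance
assert scores.argmax() == 3
```

&lt;p&gt;Note that materialising the full N&amp;times;N distance matrix is exactly the quadratic cost the paragraph above warns about.&lt;/p&gt;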

&lt;h3&gt;
  
  
  Distance to Nearest Neighbour
&lt;/h3&gt;

&lt;p&gt;This is simple but can be misleading at times. A new point is considered anomalous if the distance to its nearest point is greater than some threshold. &lt;/p&gt;
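
&lt;p&gt;A minimal sketch of the nearest-neighbour threshold rule (the points and threshold are purely illustrative):&lt;/p&gt;

```python
import numpy as np

def nearest_neighbour_flags(points, threshold):
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # ignore each point's zero self-distance
    # Flag points whose single nearest neighbour is further than the threshold
    return dists.min(axis=1) > threshold

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
flags = nearest_neighbour_flags(points, threshold=1.0)
# flags -> [False, False, True]
```

&lt;p&gt;The misleading case is easy to see here: two outliers lying close to each other would each have a small nearest-neighbour distance, and so both would escape detection.&lt;/p&gt;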

&lt;h3&gt;
  
  
  Average Distance to kNN
&lt;/h3&gt;

&lt;p&gt;In an unsupervised approach to anomaly detection, it is impossible to know the correct value of &lt;em&gt;k&lt;/em&gt; for any particular algorithm, as this is highly dataset-dependent. A range of values of &lt;em&gt;k&lt;/em&gt; may be tested instead. The average distance to a test point's &lt;em&gt;k&lt;/em&gt; nearest neighbours is less sensitive to the choice of &lt;em&gt;k&lt;/em&gt;, since it effectively averages the exact k-nearest-neighbour scores over a range of values. &lt;/p&gt;

&lt;p&gt;In the image below, depending on whether the distance to the furthest of the &lt;em&gt;k&lt;/em&gt; neighbours OR the &lt;strong&gt;average&lt;/strong&gt; distance to the &lt;em&gt;k&lt;/em&gt; nearest neighbours is used, the encircled point may or may not be considered an anomaly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XntdD29n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0bwtf3iode5v94jjylg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XntdD29n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0bwtf3iode5v94jjylg1.png" alt="kNN-possible-anomaly" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Median Distance to kNN
&lt;/h3&gt;

&lt;p&gt;This is very simple to interpret in low dimensions, and is additionally useful for building models that involve non-standard data types, such as text. However, unlike the above, there is no standard way to choose &lt;em&gt;k&lt;/em&gt; (except through cross-validation, etc.); the method is also computationally expensive and has large storage requirements.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Pruning Methods
&lt;/h2&gt;

&lt;p&gt;This can be thought of as an extension to the aforementioned, with the main goal being a reduction in computational complexity. Pruning methods first partition the input space into discrete regions, each with summary statistics such as the minimum bounding rectangle, number of points, etc. During the nearest-neighbour search, a test example is compared to the bounding rectangle within which it lies, to determine first if it is possible at all for a nearby region to contain neighbours. If not, that region is eliminated. This reduces the search complexity of actually finding nearby points to run distance calculations. &lt;/p&gt;
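
&lt;p&gt;A toy sketch of the pruning idea, assuming a simple uniform grid as the partitioning scheme (real implementations use structures such as R-trees or k-d trees): cells whose bounding box cannot intersect the query radius are skipped without computing a single point-to-point distance.&lt;/p&gt;

```python
import numpy as np
from collections import defaultdict

def build_grid(points, cell):
    # Partition the plane into square cells of side `cell`
    grid = defaultdict(list)
    for x, y in points:
        grid[(int(x // cell), int(y // cell))].append((x, y))
    return grid

def neighbours_within(grid, cell, query, radius):
    # Visit only the cells that can intersect the query radius;
    # every other region is pruned outright
    qx, qy = query
    reach = int(np.ceil(radius / cell))
    cx, cy = int(qx // cell), int(qy // cell)
    found = []
    for i in range(cx - reach, cx + reach + 1):
        for j in range(cy - reach, cy + reach + 1):
            for x, y in grid.get((i, j), []):
                if 0 < np.hypot(x - qx, y - qy) <= radius:
                    found.append((x, y))
    return found

pts = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0)]
grid = build_grid(pts, cell=1.0)
# Only the nearby point is returned; the far-away cell is never scanned
assert neighbours_within(grid, 1.0, (0.0, 0.0), radius=0.5) == [(0.2, 0.1)]
```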

&lt;h1&gt;
  
  
  Metrics
&lt;/h1&gt;

&lt;p&gt;There are multiple methods to determine the "distance" in distance-based methods; the following are a few of the most common:&lt;/p&gt;

&lt;h3&gt;
  
  
  Euclidean
&lt;/h3&gt;

&lt;p&gt;This is possibly the most common, and is the easiest to work with. The Euclidean distance works well for well-clustered data, but is very sensitive to outliers. It may seem counter-intuitive to consider a metric's sensitivity to outliers when the entire point of anomaly detection is in detecting outliers, however an overly-sensitive metric may result in an unbearable false positive rate.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In all equations, assume that p and q are two points in n dimensions, indexed by i&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑i=1n(qi−pi)2
\sqrt{\sum_{i=1}^n(q_i-p_i)^2}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;q&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Weighted Euclidean
&lt;/h3&gt;

&lt;p&gt;If the relative importance of each dimension (which represents a feature in the dataset) is known, then the weighted Euclidean distance can be used. For example, if attempting to determine whether readings from a car's engine are anomalous, the oil temperature may be far more important than the noise its engine makes. In other terms, more weight must be placed on the oil temperature, since less variability is to be tolerated, whilst the sound in decibels can be down-weighted since it may have a wider operating range.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑i=1nwi(qi−pi)2
\sqrt{\sum_{i=1}^{n}w_i(q_i-p_i)^2}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;w&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;q&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Minkowski
&lt;/h3&gt;

&lt;p&gt;This is a generalization of the Euclidean distance, and similarly performs well for isolated, well-clustered data. However, large-scale attributes will dominate smaller-scale attributes (owing to the index term), hence careful feature scaling is advised.&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;(qi−pi)nn
 \sqrt[n]{(q_i-p_i)^n}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="root"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size1 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;q&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Manhattan
&lt;/h3&gt;

&lt;p&gt;This is less sensitive to outliers than the Euclidean distance, since differences are not squared. The Manhattan distance results in a radius surrounding points which is hyper-rectangular (rectangular in high dimensions). The caveat in using the Manhattan distance is that anomalies may be a function of both orientation and distance, which is not typically desired.&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;∑i=1n∣pi−qi∣
\sum_{i=1}^n|p_i-q_i|
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span 
class="mord mathnormal"&gt;q&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;∣&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Mahalanobis
&lt;/h3&gt;

&lt;p&gt;This is most applicable when the scales of the different axes are wildly non-comparable. It was originally derived to define regions that are hyper-ellipsoidal, and can alleviate distortion caused by linear correlation amongst features (via what is known as a &lt;em&gt;whitening&lt;/em&gt; transformation). However, it is &lt;strong&gt;incredibly&lt;/strong&gt; computationally expensive, and should be used only when absolutely necessary.&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;(p−q)S−1(p−q)T
\sqrt{(p-q)S^{-1}(p-q)^T}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord sqrt"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span class="svg-align"&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;q&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;S&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;−&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;q&lt;/span&gt;&lt;span class="mclose"&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;T&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="hide-tail"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
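
&lt;p&gt;All five metrics can be sketched in a few lines of NumPy (the points, weights and covariance data below are purely illustrative):&lt;/p&gt;

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])

# Euclidean: sqrt of the sum of squared per-dimension differences
euclidean = np.sqrt(((q - p) ** 2).sum())

# Weighted Euclidean: per-feature weights w_i encode relative importance
w = np.array([1.0, 0.5, 0.25])
weighted = np.sqrt((w * (q - p) ** 2).sum())

# Minkowski of order n (n = 2 recovers the Euclidean distance)
n = 3
minkowski = (np.abs(q - p) ** n).sum() ** (1.0 / n)

# Manhattan: sum of absolute per-dimension differences
manhattan = np.abs(p - q).sum()

# Mahalanobis: needs an inverse covariance matrix S^-1 estimated from data
data = np.random.default_rng(0).normal(size=(100, 3))
S_inv = np.linalg.inv(np.cov(data, rowvar=False))
mahalanobis = np.sqrt((p - q) @ S_inv @ (p - q))
```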


&lt;h2&gt;
  
  
  Advantages of Distance-Based Methods
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;These methods scale well for large datasets with medium-to-high dimensionality&lt;/li&gt;
&lt;li&gt;These are also more computationally efficient than corresponding density-based statistical techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Disadvantages of Distance-Based Methods
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Extremely high dimensionality drastically reduces performance due to elevated computational complexity
&lt;/li&gt;
&lt;li&gt;The search algorithms for nearest-neighbour methods can be inefficient unless a specialised indexing structure is used (such as a k-D Tree), at the cost of increased storage. &lt;/li&gt;
&lt;li&gt;Distance-based methods cannot usually deal with data streams and may not detect local outliers (such as between clusters of data points), since only global data is considered&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>An Introduction to Anomaly Detection</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Tue, 26 Oct 2021 23:36:57 +0000</pubDate>
      <link>https://dev.to/_aadidev/an-introduction-to-anomaly-detection-4j8i</link>
      <guid>https://dev.to/_aadidev/an-introduction-to-anomaly-detection-4j8i</guid>
      <description>&lt;h1&gt;
  
  
  Outliers
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism&lt;/em&gt; ~ Hawkins 1980&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An anomaly, also known as an &lt;em&gt;outlier&lt;/em&gt;, is a rare-event data point or pattern which does not conform to the notion of normal behaviour. An object in a data set is usually called an outlier if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It deviates from the known/normal behaviour of the data&lt;/li&gt;
&lt;li&gt;The point is far away from the expected/average value of the data, or&lt;/li&gt;
&lt;li&gt;It is not connected/similar to any other object in terms of its characteristics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Anomaly Detection&lt;/strong&gt; is the process of flagging unusual cases in data, and spans multiple industries across &lt;a href="https://paperswithcode.com/task/anomaly-detection"&gt;multiple types of data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This may seem like a trivial task; however, humans have evolved to perform pattern-recognition at levels which far surpass even the most complex machine-learning model which exists today. We can differentiate between the expected variance in data and outliers after having only seen a small number of normal instances (an infant is able to differentiate its biological parents from relatives before the age of 1 after exposure to only two humans; that's one &lt;em&gt;heck&lt;/em&gt; of a cold-start performance metric).&lt;/p&gt;

&lt;p&gt;The property which defines an outlier may be attributed to various properties of the data, and each property may lead to a specific characterization of outliers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Classifying Outliers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Size-Based
&lt;/h3&gt;

&lt;p&gt;An outlier can quantitatively correspond to the size of a data neighbourhood. For example, according to network theory, the degree distribution of social networks typically follows a power law. In other terms, the number of nodes with degree 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;kk&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (number of connections) is proportional to 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;k−αk^{-\alpha}&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;k&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mtight"&gt;−&lt;/span&gt;&lt;span class="mord mathnormal mtight"&gt;α&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
. A community made up of a collection of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;nn&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 nodes (persons) all with the same degree, for a sufficiently large number of persons, can be thought of as an outlier, or anomalous community, as it does not follow the expected size-related pattern.&lt;/p&gt;
&lt;h3&gt;
  
  
  Diversity Based
&lt;/h3&gt;

&lt;p&gt;Outliers may be classified based on how different they are, generally speaking, from other data points. For example, if a search engine optimizes its results according to how fast a website loads (yes, it's a dumb metric, but bear with me), then a page with significantly faster-loading behaviour compared to others can get a better ranking. Here, incredibly fast-loading websites, such as static pages, can adversely skew the rankings the search engine produces. &lt;/p&gt;
&lt;h1&gt;
  
  
  Applications of Anomaly Detection
&lt;/h1&gt;
&lt;h3&gt;
  
  
  Network Intrusion Detection
&lt;/h3&gt;

&lt;p&gt;This is mainly applicable to time-series and graph-based data. Network security is of paramount importance, and the rise of cyber-attacks (DDoS and the like) has continued to cement the need for robust detection of, and response to, these types of attack. This is particularly challenging, since an anomaly detection system must be able to differentiate between an actual anomalous event and some other non-anomalous, high-traffic event such as a new product release (such as when Google's online store &lt;a href="https://9to5google.com/2021/10/19/pixel-6-google-store-down/"&gt;went down&lt;/a&gt; for the launch of its Pixel 6). &lt;/p&gt;
&lt;h3&gt;
  
  
  Medical Diagnosis
&lt;/h3&gt;

&lt;p&gt;ECGs, MRIs and simpler readouts such as glucose and oximetry readouts directly or indirectly indicate individuals' health status. This can be a potentially life-or-death application of anomaly detection, and is further complicated by the low-latency need of such a system. &lt;/p&gt;
&lt;h3&gt;
  
  
  Industrial Visual Defect Classification
&lt;/h3&gt;

&lt;p&gt;This application uses anomaly detection for (in my opinion) its second most immediately-tangible application yet. Measurements from various sensors and cameras are used as input into an anomaly-detection system as a form of quality-assurance. This is another challenging area, as defects can vary from subtle changes such as thin scratches to larger structural defects like missing components. &lt;/p&gt;
&lt;h3&gt;
  
  
  The Sciences
&lt;/h3&gt;

&lt;p&gt;A black hole is an anomaly from our perspective. We only &lt;a href="https://www.nasa.gov/mission_pages/chandra/news/black-hole-image-makes-history"&gt;recently&lt;/a&gt; managed to get a decent image of what was previously strictly theorized. Even Einstein himself did not believe in its existence. Anomaly detection systems can be used to detect previously-unseen physical phenomena, such as in parsing the input from out-of-visible-light telescopes or to detect genetic mutations in DNA.&lt;/p&gt;
&lt;h2&gt;
  
  
  Feature Selection
&lt;/h2&gt;

&lt;p&gt;This is possibly the most difficult part of building an anomaly detection system, owing to its unsupervised nature. A common way of measuring the non-uniformity of a set of univariate points is the &lt;em&gt;Kurtosis measure&lt;/em&gt;. The data is standardized to zero mean and unit variance. The resultant data points are raised to the fourth power, followed by summation and normalization:&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;K(z)=1N∑i=1Nzi4
K(\textbf{z}) = \frac{1}{N}\sum_{i=1}^{N}z_i^4
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;K&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord text"&gt;&lt;span class="mord textbf"&gt;z&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;span class="mrel mtight"&gt;=&lt;/span&gt;&lt;span class="mord mtight"&gt;1&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol 
large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;N&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;z&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;4&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8W8OebVL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mfsqb7tjq1iebielfoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8W8OebVL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mfsqb7tjq1iebielfoh.png" alt="Types of Kurtosis" width="677" height="681"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Types of Kurtosis&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Feature distributions that are highly non-uniform exhibit a high level of &lt;em&gt;Kurtosis&lt;/em&gt;; for example, when the data contains a few extreme values, the &lt;em&gt;Kurtosis&lt;/em&gt; measure tends to increase owing to the use of the fourth power. Features may then be selected based on their level of &lt;em&gt;Kurtosis&lt;/em&gt;, as a learning algorithm may better differentiate actual hazards from non-anomalous points using such features.&lt;/p&gt;
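&lt;p&gt;As a rough sketch, the Kurtosis measure above can be computed with nothing but the standard library (the two feature vectors below are made-up illustrations, not real data):&lt;/p&gt;

```python
import statistics

def kurtosis(data):
    # Standardize to zero mean and unit variance,
    # then average the fourth powers of the standardized values
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    z = [(x - mu) / sigma for x in data]
    return sum(zi ** 4 for zi in z) / len(z)

# A feature with a few extreme values scores much higher than an evenly-spread one
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
spiky = [5, 5, 5, 5, 5, 5, 5, 5, 5, 50]
print(kurtosis(flat))   # roughly 1.78
print(kurtosis(spiky))  # roughly 8.11 -- the single extreme value dominates
```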
&lt;h2&gt;
  
  
  Approaches to Anomaly Detection
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Extreme-Value Analysis
&lt;/h3&gt;

&lt;p&gt;This is the most widely used knowledge-based method. A decision-tree-type structure is used to classify data as anomalous or not. This differs from the classification approach in that these rules are hand-defined and hardcoded into the algorithm. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xy0uzziV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzkx0tnk9vgz8za00wzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xy0uzziV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzkx0tnk9vgz8za00wzg.png" alt="Basic Anomaly Detection" width="880" height="463"&gt;&lt;/a&gt;&lt;br&gt;
An even simpler approach would assign specific thresholds to data values and simply report anomalies when the data crosses this threshold. This system is the least flexible and is not able to learn with new data, or adapt to different data distributions. It is however, very simple to set up and highly interpretable, as the overall structure is defined beforehand. &lt;/p&gt;
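&lt;p&gt;A minimal sketch of the threshold idea, where the band limits are arbitrary, hand-picked values standing in for domain knowledge:&lt;/p&gt;

```python
def flag_out_of_band(readings, lower=10.0, upper=90.0):
    """Report the indices of readings outside a hand-set operating band."""
    return [i for i, r in enumerate(readings) if not (lower <= r <= upper)]

readings = [42.0, 95.5, 57.3, 8.1, 60.2]
print(flag_out_of_band(readings))  # [1, 3]
```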
&lt;h3&gt;
  
  
  Statistical Techniques
&lt;/h3&gt;

&lt;p&gt;This approach assumes that data follows a specific distribution. The most basic form computes the parameters of a probability density function for each known class of network traffic, and tests unknown samples to determine to which class each belongs.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VFzsh6u6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csl0e5cho3vz1z6abamx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VFzsh6u6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/csl0e5cho3vz1z6abamx.png" alt="Statistical Anomaly" width="595" height="551"&gt;&lt;/a&gt;&lt;br&gt;
In most network-based time-series applications, the Gaussian distribution is used to model each class of data; however, other approaches such as Association Rule mining (counting the co-occurrences of items in transactions) have been used for one-class anomaly detection by generating rules from the data in an unsupervised fashion.&lt;/p&gt;
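&lt;p&gt;In its simplest form, the Gaussian approach reduces to a z-score test against parameters fitted on 'normal' samples. A sketch, where the traffic numbers and the 3-sigma cutoff are assumptions for the example:&lt;/p&gt;

```python
import statistics

def gaussian_score(train, x):
    """Fit a Gaussian to 'normal' samples and score a new point.

    Returns the z-score: how many standard deviations x lies from the mean."""
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    return abs(x - mu) / sigma

normal_traffic = [100, 102, 98, 101, 99, 100, 103, 97]
print(gaussian_score(normal_traffic, 101) < 3.0)  # well within 3 sigma
print(gaussian_score(normal_traffic, 150) > 3.0)  # far outside: flagged
```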
&lt;h3&gt;
  
  
  As a Supervised Classification Task
&lt;/h3&gt;

&lt;p&gt;An algorithm learns a function mapping input features to outputs based on example input-output pairs. The goal is to reframe anomaly detection as a binary classification task (either an anomaly or not). However, owing to the incredibly skewed data distribution (remember, by definition an anomaly is rare), each anomaly is potentially highly underrepresented. Additionally, there may be many types of anomalies (an aircraft engine can under-perform by either spinning too slowly or by &lt;a href="https://en.wikipedia.org/wiki/Qantas_Flight_32"&gt;catastrophically exploding&lt;/a&gt;, both of which can potentially lead to hazardous situations for vastly different reasons). This further leads to sparsity and intense skews in the data used to train these models.&lt;/p&gt;
&lt;h3&gt;
  
  
  Unsupervised Proximity-Based Learning
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ERYw4gze--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xknmjllccbs0qrby13io.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ERYw4gze--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xknmjllccbs0qrby13io.png" alt="Clustering" width="675" height="682"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;In the above, the group, or cluster, of points represents normal operation, whilst the three to the bottom-left can be considered anomalous&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since labeled anomalous data is rare, unsupervised approaches tend to be more popular than supervised ones in anomaly detection. Here, no input-output pairs are readily available for training; instead, the algorithm learns what is 'normal' over time, and reports anything which deviates to some degree from the usual data distribution. The caveat is that many anomalies may correspond to noise, or may not be of interest to the task at hand. The actual approach used for this sort of unsupervised learning may vary wildly. For example, in detecting anomalous events from video, &lt;a href="https://arxiv.org/abs/1511.05440"&gt;an algorithm&lt;/a&gt; may predict the next video frame and compare the actual frame to the predicted one. The deviation between predicted and actual may then be thresholded in order to classify frames as anomalous or not.&lt;/p&gt;
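&lt;p&gt;One simple proximity-based scheme scores each point by its average distance to its k nearest neighbours: the point whose neighbourhood is emptiest is the most anomalous. A toy sketch with made-up 2-D points:&lt;/p&gt;

```python
def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

cluster = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 0.9)]
outlier = [(8.0, 8.0)]
scores = knn_outlier_scores(cluster + outlier)
print(scores.index(max(scores)))  # 4 -- the far-away point gets the top score
```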
&lt;h3&gt;
  
  
  Information Theoretic Models
&lt;/h3&gt;

&lt;p&gt;The idea behind this approach is that outliers increase the &lt;em&gt;minimum description length&lt;/em&gt; (MDL) required to describe the data set, because they represent deviations from natural attempts to summarize the data. The following example, taken from &lt;a href="https://www.amazon.com/Outlier-Analysis-Charu-C-Aggarwal/dp/3319475770/ref=sr_1_1?crid=39KDCHD95BEL&amp;amp;dchild=1&amp;amp;keywords=outlier+analysis&amp;amp;qid=1634982671&amp;amp;sprefix=outlier+analysi%2Caps%2C138&amp;amp;sr=8-1"&gt;Outlier Analysis&lt;/a&gt;, describes this idea. &lt;/p&gt;

&lt;p&gt;Consider the following two strings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ABABABABABABABABABABABABABABABABAB
ABABACABABABABABABABABABABABABABAB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The second string is the same length as the first, differing only at a single position containing the unique symbol 'C'. The first string can be concisely described as "AB 17 times", whilst the second string cannot be described in the same manner (an additional description is needed to account for 'C'). These models are closely related to conventional models, in that they learn a concise representation of the data as a baseline for comparison.&lt;/p&gt;
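&lt;p&gt;This intuition can be sketched with a toy summarizer that only knows how to say "unit x count" for pure repetitions and otherwise falls back to spelling the string out; it is a crude stand-in for a real MDL coder, but it shows how one outlier symbol inflates the description length:&lt;/p&gt;

```python
def describe(s):
    """Toy summarizer: 'AB x 17' if s is a pure repetition, else the raw string."""
    for size in range(1, len(s) // 2 + 1):
        unit = s[:size]
        if len(s) % size == 0 and unit * (len(s) // size) == s:
            return f"{unit} x {len(s) // size}"
    return s

first = "AB" * 17
second = "ABABAC" + "AB" * 14  # the single 'C' breaks the pattern

print(describe(first))        # AB x 17
print(len(describe(first)))   # 7
print(len(describe(second)))  # 34 -- the outlier forces a verbatim description
```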
&lt;h3&gt;
  
  
  Semi-Supervised Learning
&lt;/h3&gt;

&lt;p&gt;This is a hybrid approach using both labelled and unlabelled data. It is well suited to applications like network intrusion detection, where one may have multiple examples of the normal class and some examples of intrusion classes, but new kinds of intrusions may arise over time. &lt;br&gt;
Another idea is based on initializing a neural network with pre-trained weights and then improving it by adaptation to training data. This is a relatively new concept, and has not yet seen much advanced research. This approach might include training an auto-encoder on normal data, and then using the encoder on previously unseen data. The difference between the encoded features and the usual data distribution can then be used to indicate the presence of an anomaly.&lt;/p&gt;
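&lt;p&gt;The reconstruction-error idea can be caricatured without any neural network at all: below, a per-feature mean learned from normal samples stands in for the auto-encoder, and the distance between a new sample and that learned profile stands in for the reconstruction error. All numbers are made up for illustration:&lt;/p&gt;

```python
def fit_profile(normal_samples):
    """Learn a per-feature mean from normal data (a crude stand-in for an
    auto-encoder's learned representation of 'normal')."""
    n = len(normal_samples)
    return [sum(s[i] for s in normal_samples) / n
            for i in range(len(normal_samples[0]))]

def reconstruction_error(profile, sample):
    # Euclidean distance between the sample and the learned profile
    return sum((a - b) ** 2 for a, b in zip(profile, sample)) ** 0.5

normal = [[1.0, 2.0, 3.0], [1.1, 1.9, 3.1], [0.9, 2.1, 2.9]]
profile = fit_profile(normal)
print(reconstruction_error(profile, [1.0, 2.0, 3.0]) < 1.0)  # close to normal
print(reconstruction_error(profile, [9.0, 9.0, 9.0]) > 1.0)  # flagged as anomalous
```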
&lt;h1&gt;
  
  
  Evaluation Criteria
&lt;/h1&gt;

&lt;p&gt;By definition, anomaly detection expects that the distribution between normal and abnormal data classes may be highly skewed; this is known as the &lt;em&gt;class imbalance&lt;/em&gt; problem. Models which learn from this type of data may not be robust, as they tend to perform poorly when attempting to classify anomalous examples. For example, imagine trying to classify 1000 time-series snippets representing web traffic for your website, 950 of which are typical, everyday usage patterns. &lt;em&gt;Any&lt;/em&gt; algorithm which classifies all of your data samples as normal (non-anomalous) immediately achieves 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;95%95\%&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;95%&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 accuracy! In other terms, a simple rule, 'return negative', appears to perform remarkably well. The issue here is that the anomalous class is under-represented, and the accuracy metric does not account for this.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Accuracy=Number of Correct Predictions MadeTotal Number of Predictions
Accuracy = \frac{\text{Number of Correct Predictions Made}}{\text{Total Number of Predictions}}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cc&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;cy&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Total Number of Predictions&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord text"&gt;&lt;span class="mord"&gt;Number of Correct Predictions Made&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;



&lt;p&gt;Based on the prior argument, we must conclude that Accuracy alone may not be suitable for adequately evaluating an anomaly-detection system. This is where we turn to more informative measures.&lt;/p&gt;
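&lt;p&gt;The 95%-accuracy trap above is easy to reproduce directly:&lt;/p&gt;

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the ground truth
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 950 normal (0) snippets and 50 anomalous (1), as in the traffic example
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000  # the trivial 'return negative' rule
print(accuracy(y_true, y_pred))  # 0.95
```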

&lt;h2&gt;
  
  
  Precision and Recall
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt; gives an idea of the ratio between correct positive predictions (&lt;em&gt;true positives&lt;/em&gt;) and the sum of all positive predictions (&lt;em&gt;true positives plus false positives&lt;/em&gt;). An algorithm which optimizes strictly for precision is less concerned about false negatives, and instead optimizes for extreme confidence in its positive predictions. &lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Precision=TPTP+FP
Precision = \frac{TP}{TP+FP}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;P&lt;/span&gt;&lt;span class="mord mathnormal"&gt;rec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;s&lt;/span&gt;&lt;span class="mord mathnormal"&gt;i&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;n&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;FP&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Recall&lt;/strong&gt; gives the ratio of correct positive predictions (&lt;em&gt;true positives&lt;/em&gt;) to the sum of all truly positive data points (&lt;em&gt;true positives plus false negatives&lt;/em&gt;). This metric, also known as sensitivity, gives a more balanced idea of how well an algorithm detects positive samples. For example, the recall of our 'return negative' model will be zero, as the number of &lt;em&gt;true positives&lt;/em&gt; (correct positive predictions) will also be zero. Optimizing for recall may be more appropriate when the cost of a false negative is very high, for example in an airport security system, where it is better to flag many items for human inspection than to accidentally allow dangerous items onto a flight.&lt;br&gt;

&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Recall=TPTP+FN
Recall = \frac{TP}{TP+FN}
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;R&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ec&lt;/span&gt;&lt;span class="mord mathnormal"&gt;a&lt;/span&gt;&lt;span class="mord mathnormal"&gt;ll&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;FN&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;TP&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;
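&lt;p&gt;Putting both formulas into code makes the contrast with accuracy concrete; the trivial 'return negative' rule from the traffic example gets zero on both counts:&lt;/p&gt;

```python
def precision_recall(y_true, y_pred):
    # Count the confusion-matrix cells relevant to the two formulas
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 950 normal (0) snippets, 50 anomalous (1), and a model that predicts all-normal
y_true = [0] * 950 + [1] * 50
y_pred = [0] * 1000
print(precision_recall(y_true, y_pred))  # (0.0, 0.0) -- recall exposes the failure
```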


&lt;p&gt;Hopefully you now have a general understanding of what anomaly detection is, why it's useful, a few challenges in the field, and a few ways of framing anomaly-detection problems. In my next few posts, I'll be delving into anomaly detection models in detail, and walking through some Python code implementing a few of them.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>modeling</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Five thousand processing cores?</title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Mon, 11 Oct 2021 00:46:46 +0000</pubDate>
      <link>https://dev.to/_aadidev/five-thousand-processing-cores-2035</link>
      <guid>https://dev.to/_aadidev/five-thousand-processing-cores-2035</guid>
      <description>&lt;p&gt;Even if you're not in the 'tech' industry, in today's computing age you may have heard the term 'CPU' tossed around. A CPU or &lt;em&gt;Central Processing Unit&lt;/em&gt; is a general term for the 'brain' of today's computers. (I use the term computer here very loosely to refer to any sort of desktop, laptop, server, etc without attempting to fully encapsulate the infinite array of microprocessing units in our fridges, watches, and elevator controls). &lt;/p&gt;

&lt;h1&gt;
  
  
  How a CPU Works
&lt;/h1&gt;

&lt;p&gt;A CPU may divide a series of tasks by time, where any given slot (or series of slots) may be dedicated to a given task or series of tasks. These tasks are assigned to a single computational unit (also known as a core) at any given point in time, and the single core is freed to move on to its next task once the previously-running task has completed. &lt;/p&gt;

&lt;p&gt;The issue with this model is readily apparent: what if I want two tasks to happen at the same time? &lt;/p&gt;

&lt;h2&gt;
  
  
  RTOS
&lt;/h2&gt;

&lt;p&gt;Since the 1980s, the &lt;em&gt;real-time operating system&lt;/em&gt; or &lt;strong&gt;RTOS&lt;/strong&gt; was the only way by which a single CPU could achieve, or at least appear to achieve, some sort of concurrent operation. However, this "appear to achieve" is a bit of a gotcha, since the &lt;strong&gt;real-time&lt;/strong&gt; in RTOS translates to &lt;em&gt;finishing within a predetermined time-interval&lt;/em&gt;. This is achieved by some sort of scheduling algorithm, and a series of programming constructs for holding resources (mutexes), signalling (semaphores) and a host of other methods by which some sort of deterministic behaviour is effected.&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; purely concurrent: there is no way to actually carry out simultaneous operations, say, on a large chunk of data.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Multi-Core Processor
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aP-mPoPy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59uo7wmwegsx7bm6dsz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aP-mPoPy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59uo7wmwegsx7bm6dsz4.png" alt="IBM 100 Power 4"&gt;&lt;/a&gt;&lt;em&gt;IBM 100 Power 4&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.ibm.com/ibm/history/ibm100/us/en/icons/power4/#:~:text=In%202001%2C%20IBM%20introduced%20the,more%20than%20170%20million%20transistors"&gt;first&lt;/a&gt; multi-core processor was the POWER4; however, the first commercially available multi-core processors in familiar socket-mounted packages were the Intel Pentium D for home consumer usage and the AMD Opteron for server usage. This took the single-core idea and solved the "I want to do two things at once" problem in the most brute-force way possible: if you want to do two (or more) things at once, then you need two (or more) cores.&lt;/p&gt;

&lt;p&gt;This wasn't (and still isn't) an absurdly irrational concept, as the proliferation of &lt;a href="https://www.supermicro.com/products/motherboard/Xeon7000/7300/X7QC3.cfm"&gt;multi-socket motherboards&lt;/a&gt; prior to the multi-core era demonstrated the desire for concurrency in the enterprise space.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6yHjVS4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26kw1euitnb822qczgpn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6yHjVS4Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/26kw1euitnb822qczgpn.jpeg" alt="Multi-Socket Motherboard"&gt;&lt;/a&gt;&lt;em&gt;Multi-Socket Motherboard&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The main limitation before the introduction of the first multi-core CPUs was (and still is) power delivery. Having to power two cores in a single package introduces complications and adds to the heat that must be dissipated. Moreover, ensuring that the overhead introduced by core synchronization and memory sharing does not outweigh the benefits of multi-core has seen many creative solutions over the years, the most recent of which is AMD's &lt;a href="https://en.wikichip.org/wiki/amd/infinity_fabric"&gt;Infinity Fabric&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;When the first multi-core processors came out, the issue was fitting enough transistors on a chip to build more than one core. The issue regarding transistor size has more or less disappeared, as we approach the opposite problem in transistor design: as transistors shrink below the 1-2nm mark, new quantum effects such as tunneling introduce an entirely new class of nondeterminism into chips' operation. &lt;/p&gt;

&lt;p&gt;Okay, so now you understand where CPUs came from, and how being able to do more than two things at once was physically achieved, but where do GPUs come in?&lt;/p&gt;

&lt;h1&gt;
  
  
  The CUDA Core!!
&lt;/h1&gt;

&lt;h4&gt;
  
  
  or streaming processor?
&lt;/h4&gt;

&lt;p&gt;Okay, I love NVIDIA and AMD, but their definition of a 'core' is a bit, err...ambitious? On the CPU side of things, a 'core' should be able to fetch instructions, load the data required by an instruction into memory, perform the operation the instruction indicates, and return the complete, processed data at the end of the operation. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--unx6N_oA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dgtslgq5a1a0h41veyzc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--unx6N_oA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dgtslgq5a1a0h41veyzc.jpg" alt="Ampere Architecture"&gt;&lt;/a&gt;&lt;em&gt;Layout of the latest 3000-series GPUs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;CUDA core&lt;/strong&gt; or Stream Processor (depending on which colour flag you're currently waving; mine is currently green, so we'll stick with "CUDA Core" for now) is simply a floating-point unit. It receives data, performs some operation on it, and returns the result. It does &lt;strong&gt;not&lt;/strong&gt; independently handle fetching instructions or loading data into memory. &lt;/p&gt;

&lt;p&gt;With terminology out of the way: modern-day GPUs have &lt;em&gt;thousands&lt;/em&gt; of CUDA cores; the GA104 in my NVIDIA RTX 3060 Ti has nearly &lt;strong&gt;five thousand&lt;/strong&gt;. Heck, the measly mobile GTX 1060M in my laptop has over a thousand, and that launched &lt;em&gt;five years&lt;/em&gt; ago. A GPU is essentially a set of floating-point processors bundled into a well-powered, well-ventilated chip, which makes GPUs incredibly well-suited to huge levels of parallelism. &lt;/p&gt;

&lt;p&gt;GPUs have been used for &lt;a href="https://www.usenix.org/legacy/events/atc11/tech/final_files/atc11_proceedings.pdf#page=27"&gt;real-time scheduling&lt;/a&gt;, &lt;a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.418.233&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;graph algorithms&lt;/a&gt;, &lt;a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.3825&amp;amp;rep=rep1&amp;amp;type=pdf"&gt;HPC&lt;/a&gt;, &lt;a href="https://arxiv.org/pdf/1811.05588.pdf"&gt;object detection using neural networks&lt;/a&gt;, and the list goes on. This is due in no small part to the nature of machine-learning applications, and the ability of neural networks to be split across multiple processing cores. Major Deep Learning frameworks such as &lt;a href="https://www.tensorflow.org/install/gpu"&gt;TensorFlow&lt;/a&gt; and &lt;a href="https://pytorch.org/docs/stable/notes/cuda.html"&gt;PyTorch&lt;/a&gt; now offer GPU support by default (once the CUDA toolkit and cuDNN are installed). &lt;/p&gt;

&lt;p&gt;Cost for cost, GPUs have higher instruction throughput and memory bandwidth than CPUs. Additionally, GPUs tend to have significantly higher raw arithmetic capability than CPUs, centered around a large number of fine-grained parallel processors.&lt;/p&gt;

&lt;p&gt;I could go on and on about the wonders of GPUs and where they are used, and probably do some more hand-wavy stuff in an attempt to convince you that GPUs are really cool, but I'd rather go into a bit more detail on how exactly GPUs do what they do, and the thinking that goes into developing a GPU program.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;From here forward, most of the technical details are NVIDIA-specific, however they can for the most part be ported to AMD/ATI GPUs&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Programming Model
&lt;/h1&gt;

&lt;p&gt;GPUs work on the &lt;strong&gt;SIMD&lt;/strong&gt; model, or the &lt;em&gt;single-instruction-multiple-data&lt;/em&gt; idea, where a single operation is carried out on multiple data points in parallel. These operations must be independent, as there is no data-sharing between them. This is in direct contrast to the &lt;strong&gt;MIMD&lt;/strong&gt; model of the CPU (or &lt;em&gt;multiple-instruction-multiple-data&lt;/em&gt;, where the CPU's inherent complexity lets it handle many different types of task at once). &lt;br&gt;
GPUs are still fairly general-purpose, however, as their floating-point units can be adapted to a wide range of applications by means of a programming interface (such as CUDA). &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QPFE0I9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ntmf9tmmdt937qgrys0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QPFE0I9o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ntmf9tmmdt937qgrys0t.png" alt="Block of Threads"&gt;&lt;/a&gt;&lt;em&gt;How Threads are grouped into blocks which are grouped into grids&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;streaming multiprocessor&lt;/strong&gt;, or SM in NVIDIA-land, can be thought of as a multithreaded CPU core: it has its own shared memory, a set of 32-bit registers (think of these as the GPU equivalent of L1 cache), and a collection of floating-point units. A group of &lt;strong&gt;threads&lt;/strong&gt; called a &lt;em&gt;block&lt;/em&gt; runs on an SM and executes a custom GPU function called a &lt;strong&gt;kernel&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;That's a lot of terminology; the important bit to note is that current GPUs have a limit of 1024 threads per block, and this number is further limited by the memory requirements of your specific kernel. For a more in-depth explanation, see &lt;a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy"&gt;here&lt;/a&gt;&lt;/p&gt;
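&lt;p&gt;To make the numbers concrete: covering N data points at 1024 threads per block means launching a grid of ceil(N / 1024) blocks. A small, hypothetical helper (plain Python, not CUDA) for that calculation:&lt;/p&gt;

```python
import math

MAX_THREADS_PER_BLOCK = 1024  # the current per-block limit noted above

def launch_config(n_elements, threads_per_block=MAX_THREADS_PER_BLOCK):
    """Return (blocks_per_grid, threads_per_block) covering n_elements."""
    blocks = math.ceil(n_elements / threads_per_block)
    return blocks, threads_per_block

# e.g. one million elements need 977 blocks of 1024 threads
assert launch_config(1_000_000) == (977, 1024)
```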

&lt;h2&gt;
  
  
  Thinking About Problems
&lt;/h2&gt;

&lt;p&gt;The GPU architecture is centered around fine-grained (or &lt;em&gt;thread-based&lt;/em&gt;) parallelism: a problem is partitioned into coarse sub-problems solved independently by blocks of threads, and each sub-problem is split into finer pieces that can be solved cooperatively, in parallel, by &lt;strong&gt;all&lt;/strong&gt; the threads in a block.&lt;/p&gt;

&lt;p&gt;There can be a few issues here, however: bad branching in your custom GPU program or &lt;em&gt;kernel&lt;/em&gt; incurs massive overhead, because of the GPU's limitation that a group of threads can only do &lt;em&gt;one&lt;/em&gt; thing at a time. For example, if your kernel needs all the even-numbered threads to do one thing, and the odd-numbered threads to do another, one set of threads will always be waiting on the other to complete its task, which effectively &lt;strong&gt;doubles&lt;/strong&gt; the processing time for your given task (or set of tasks).&lt;/p&gt;
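&lt;p&gt;A rough way to picture this divergence in ordinary Python: under lockstep execution the hardware effectively runs the two branches one after the other, masking off the threads that don't take each branch. The sketch below (plain NumPy, purely illustrative) mimics those two serialized passes:&lt;/p&gt;

```python
import numpy as np

thread_ids = np.arange(8)
even_mask = thread_ids % 2 == 0

out = np.zeros(8)
# Pass 1: odd threads sit idle while even threads run their branch
out[even_mask] = thread_ids[even_mask] * 10
# Pass 2: even threads sit idle while odd threads run theirs
out[~even_mask] = thread_ids[~even_mask] + 100
# Two serialized passes for one logical step: the divergence penalty
```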

&lt;h2&gt;
  
  
  Sharing is Caring
&lt;/h2&gt;

&lt;p&gt;Remember when I said that each thread can only work on independent data points? In theory this seems feasible, but in practice the idea falls apart. Consider the simplest case of needing to square a series of numbers, then sum the results. Every "square" operation can happen on a separate thread, but when summing the outputs, the threads need to talk to each other, or at least have some central location in which to sync their outputs. This is where &lt;strong&gt;shared memory&lt;/strong&gt; comes in. (This is one of the main types of memory available in the CUDA programming model, along with &lt;em&gt;global&lt;/em&gt;, &lt;em&gt;texture&lt;/em&gt; and &lt;em&gt;host&lt;/em&gt; memory.) &lt;/p&gt;

&lt;p&gt;Shared memory is memory that can be accessed by all threads within a block; it is orders of magnitude faster than system memory, with significantly lower latency. (It can be thought of as programmer-controlled L1 cache.) The CUDA programming model introduces a special barrier function, &lt;code&gt;__syncthreads()&lt;/code&gt;, which makes every thread in a block wait until all of them have reached it, helping to ensure no race conditions occur. &lt;/p&gt;
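&lt;p&gt;The square-then-sum example above is usually written as a tree reduction: each thread squares its own element into shared memory, then pairs of partial sums are folded together round by round, with a barrier between rounds. This plain-Python sketch (illustrative, not CUDA) mimics the pattern; the comment marks where &lt;code&gt;__syncthreads()&lt;/code&gt; would sit:&lt;/p&gt;

```python
import numpy as np

def block_square_sum(values):
    """Square each element, then tree-reduce to a single sum.

    `shared` stands in for shared memory. Each loop iteration ends where
    __syncthreads() would sit on a real GPU: all "threads" must finish the
    round before the stride halves. Assumes a power-of-two input length,
    as real block reductions typically do.
    """
    shared = np.array([v * v for v in values], dtype=float)  # per-thread work
    stride = len(shared) // 2
    while stride > 0:
        # threads 0..stride-1 each fold in one partner's partial sum
        shared[:stride] += shared[stride:2 * stride]
        stride //= 2  # barrier between rounds in real CUDA
    return shared[0]

assert block_square_sum([1, 2, 3, 4]) == 30.0  # 1 + 4 + 9 + 16
```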

&lt;p&gt;A &lt;strong&gt;race condition&lt;/strong&gt; occurs when two processes need to access a single memory location, and one of the two threads attempts to read from or write to that location before the other is done with its own operation. This can lead to failed reads and corrupt writes. &lt;/p&gt;

&lt;p&gt;This has been a very basic introduction to why GPUs are useful, and how they function at a high level. If you have any questions, feel free to contact me via the email listed in my profile, and happy reading!&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Gradient Boosting on the GPU </title>
      <dc:creator>Darth Espressius</dc:creator>
      <pubDate>Sun, 26 Sep 2021 23:10:05 +0000</pubDate>
      <link>https://dev.to/_aadidev/gradient-boosting-on-the-gpu-1pbf</link>
      <guid>https://dev.to/_aadidev/gradient-boosting-on-the-gpu-1pbf</guid>
<description>&lt;p&gt;Decisions, decisions; it seems like every data-centric problem simmers down to making some sort of choice. Whether it be choosing the class of object present in an image, or modeling churn by predicting a probability, solving a data-related problem typically centres around making decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Trees
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56xbvx1i26a6nopqry6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56xbvx1i26a6nopqry6a.png" alt="Decision Tree"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Song et. al., 2015&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Decision and regression trees are a form of supervised learning. These algorithms infer implicit rules from structured data to predict some given outcome.&lt;br&gt;
The first regression tree algorithm was published in a &lt;a href="http://cda.psych.uiuc.edu/statistical_learning_course/morgan_sonquist.pdf" rel="noopener noreferrer"&gt;1963 paper&lt;/a&gt;, whilst the first decision tree algorithm was published in &lt;a href="https://www.proquest.com/openview/4d0c5d0c515e62cfaa5d112c6b3b3bac/1?pq-origsite=gscholar&amp;amp;cbl=40685" rel="noopener noreferrer"&gt;this 1972 paper&lt;/a&gt;. The premise of a decision tree is to recursively divide the input attributes into smaller, "purer" subsets, which can then more accurately determine a given output.&lt;/p&gt;

&lt;p&gt;In other words, say you are attempting to predict what type of personal computer someone may purchase. You have a list of previous purchases, along with each buyer's age and occupation. Naturally, you may first group your input data by age (the younger folk may make a different sort of purchasing decision, aiming for portability, performance or flashier features, whilst the elderly may prefer a larger, more tactile keyboard and included software). &lt;/p&gt;

&lt;p&gt;After you split your data by age, you end up with two groups; put another way, your decision-making process has created its first 'branch'. These two groups can then be further split by occupation. Persons working in software or AI may look to more performance-oriented options, while accountants, managers and writers may focus on ergonomics and portability. This multi-way split into various occupations then divides your original two groups further, resulting in "purer" input attributes. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes6u4m04k4o9cmpv93fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fes6u4m04k4o9cmpv93fe.png" alt="Groups"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In other terms, your resulting groups or &lt;em&gt;leaves&lt;/em&gt; each contain &lt;em&gt;one&lt;/em&gt; or a few age groups and one or a few occupations. Yes, a &lt;em&gt;few&lt;/em&gt;: it may not be necessary to split on every single occupation, owing to a concept known as overfitting, where your model does not generalize well because of its incredibly high specificity. To the stats folks: this is when your model exhibits unreasonably high variance owing to unconstrained optimization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs53fkytx937bwbdxftxi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs53fkytx937bwbdxftxi.png" alt="Leaves"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This results in an incredibly interpretable model, since the decision made at each branch is easily found. However, a decision tree may leave some room for accuracy improvement, and can require complete re-training if the data changes. The latter point is a significant consideration, as organizations may have large amounts of data, where data drift may result in models slowly becoming less effective over time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Strength In Numbers
&lt;/h2&gt;

&lt;p&gt;A decision tree &lt;strong&gt;ensemble&lt;/strong&gt; is a &lt;em&gt;group&lt;/em&gt; of decision tree classifiers/regressors. A data tuple is mapped to an output leaf by each of a series of decision trees, and the &lt;em&gt;average&lt;/em&gt; of their outputs is taken. This can assist in improving accuracy. A specific implementation of the ensemble method is the &lt;em&gt;gradient-boosted&lt;/em&gt; tree. &lt;/p&gt;

&lt;p&gt;The decision tree ensemble is trained to optimize some given loss function&lt;br&gt;


&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L=∑il(yi^,yi)+∑kΩ(fk)
\mathcal{L} = \sum_{i}l(\hat{y_{i}}, y_{i}) + \sum_{k}\Omega(f_{k})
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathcal"&gt;L&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;l&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord accent"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span 
class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="accent-body"&gt;&lt;span class="mord"&gt;^&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;y&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;i&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mop op-limits"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="mop op-symbol large-op"&gt;∑&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span 
class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord"&gt;Ω&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;k&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;The Ω term above penalizes the impact of any single decision tree. This may sound strange at first; however, the situation may arise where two decision trees in the ensemble over- and under-weight a given input attribute, resulting in a net-zero effect. For example, if one tree thinks that age correlates very strongly and positively with a person buying a less portable machine (i.e. the older folk buy desktops instead of tablets), another decision tree added to the model may learn an exactly opposite relationship that effectively cancels out the first tree's impact. &lt;/p&gt;

&lt;p&gt;This is a phenomenon known as parameter explosion, and it is a serious issue in statistical learning algorithms. (Regularization of this sort is not limited to decision trees: regression has its Lasso and Ridge variants, whilst deep learning employs dropout, etc.)&lt;/p&gt;
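&lt;p&gt;To make the penalty concrete: in XGBoost's formulation, the per-tree term is Ω(f) = γT + ½λ‖w‖², where T is the number of leaves and w the vector of leaf weights. A toy sketch of the regularized objective, assuming squared-error for l and purely illustrative values for γ and λ:&lt;/p&gt;

```python
import numpy as np

def regularized_objective(y_true, y_pred, leaf_weights, n_leaves,
                          gamma=1.0, lam=1.0):
    """Squared-error loss plus an XGBoost-style penalty
    Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, which discourages any
    single tree from growing many large, over-confident leaves."""
    loss = np.sum((y_true - y_pred) ** 2)
    omega = gamma * n_leaves + 0.5 * lam * np.sum(np.square(leaf_weights))
    return loss + omega

# A perfect fit still pays for its leaves: 0 + (1*1 + 0.5*0.25) = 1.125
assert regularized_objective(np.array([1.0, 2.0]), np.array([1.0, 2.0]),
                             np.array([0.5]), n_leaves=1) == 1.125
```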
&lt;h3&gt;
  
  
  Gradient Boosting
&lt;/h3&gt;

&lt;p&gt;An additive function (a new tree) is added greedily to the existing decision-making function so as to minimize your objective function. In other terms, a gradient-boosting scheme trains and adds the next-best-performing tree at each iteration. &lt;br&gt;
In more mathematical terms: at each iteration, a newly trained tree which most optimally minimizes the chosen objective function is added to the overall ensemble. &lt;/p&gt;
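&lt;p&gt;For squared-error loss, "add the next-best tree" reduces to fitting each new learner to the current residuals of the ensemble. The toy sketch below uses one-variable stumps in plain NumPy to show the additive loop; it is purely illustrative, not the library's actual algorithm:&lt;/p&gt;

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split of x that best predicts the residual."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = np.sum((residual - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

def boost(x, y, n_rounds=20, lr=0.5):
    """Greedily add stumps, each fit to the residual of the ensemble so far."""
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)  # fit the current residual
        pred += lr * stump(x)           # additive update to the ensemble
    return pred

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 1.0, 3.0, 3.0])
# residuals shrink geometrically, so the ensemble converges on y
assert np.allclose(boost(x, y), y, atol=1e-3)
```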

&lt;p&gt;There are a few methods by which trees split input data. The following list does not claim to be exhaustive, but rather gives a general introduction to popular split-finding techniques.&lt;/p&gt;
&lt;h4&gt;
  
  
  Exact Greedy
&lt;/h4&gt;

&lt;p&gt;Every possible split of the input data (for a given attribute) is enumerated, and the split resulting in the maximal increase in &lt;em&gt;Information Gain&lt;/em&gt; is chosen. &lt;em&gt;Information Gain&lt;/em&gt; represents the difference in &lt;strong&gt;entropy&lt;/strong&gt; before and after your data is split, whilst &lt;strong&gt;entropy&lt;/strong&gt; measures how "impure" your attribute is (how many different values occur in the split). Following our example above: if we split based on occupation, and a given group of data &lt;em&gt;after&lt;/em&gt; the split contains multiple, unrelated occupations, that data is said to be high-entropy, as no one consistent theme is present for the occupation attribute. If, however, each split contains a single occupation, the data is low-entropy. &lt;/p&gt;

&lt;p&gt;This implies then, that if the difference in entropy before and after a split is high, the data has become "purified", or has experienced positive information gain. As stated in the name for this method, the next split is chosen greedily (which means the next best option is chosen out of all possible options).&lt;/p&gt;
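&lt;p&gt;Entropy and information gain as described can be computed directly. A small sketch with hypothetical helper names, using Shannon entropy in bits:&lt;/p&gt;

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: 0 when a group is pure, higher when mixed."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, splits):
    """Entropy before the split minus the size-weighted entropy after."""
    n = len(parent)
    after = sum(len(s) / n * entropy(s) for s in splits)
    return entropy(parent) - after

# Splitting a 50/50 mixed group into two pure groups recovers one full bit
occupations = ["dev", "dev", "writer", "writer"]
gain = information_gain(occupations, [["dev", "dev"], ["writer", "writer"]])
assert gain == 1.0
```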

&lt;p&gt;This technique can be computationally expensive for continuous features, however, since enumerating every split involves sorting all values for a given attribute and accumulating gradient statistics for &lt;em&gt;every possible split&lt;/em&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  Approximate Greedy
&lt;/h4&gt;

&lt;p&gt;The exact greedy method above cannot work for data which is not held in main system memory. When dealing with tera- and petabytes of data, it is unreasonable to expect all of it to be held in memory for any sort of statistical analysis. The approximate greedy method of split-finding proposes split points based on percentiles of a given feature, computed from samples of the main database. Continuous features are then bucketed using these candidate points, and the best split is found based on aggregated statistics. &lt;/p&gt;
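&lt;p&gt;Proposing candidate splits from percentiles of a sample is straightforward to sketch in NumPy (illustrative only; the real implementation uses a weighted quantile sketch rather than plain percentiles):&lt;/p&gt;

```python
import numpy as np

def candidate_splits(feature_sample, n_candidates=4):
    """Propose split points at evenly spaced percentiles of a sample,
    so continuous values can be bucketed instead of fully sorted."""
    percentiles = np.linspace(0, 100, n_candidates + 2)[1:-1]  # skip min/max
    return np.percentile(feature_sample, percentiles)

rng = np.random.default_rng(0)
sample = rng.normal(size=10_000)       # stand-in for a sampled column
splits = candidate_splits(sample)
buckets = np.digitize(sample, splits)  # statistics are aggregated per bucket
assert len(splits) == 4 and buckets.max() == 4
```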
&lt;h4&gt;
  
  
  Sparsity Aware
&lt;/h4&gt;

&lt;p&gt;Data is never perfect; that's a fact of life. In the library we will be using, a default split direction is defined for missing data, based on the existing data. This ensures that all data within your input database is used to train the model, and provides robustness against future missing data. &lt;a href="https://arxiv.org/pdf/1603.02754.pdf" rel="noopener noreferrer"&gt;This paper&lt;/a&gt; found an exponential decrease in classification time when using sparsity-aware methods. In other words, the algorithm did not have to estimate adjusted gradients at classification time when faced with missing features, as a pre-baked option had already been selected. This can be customised depending on application needs.&lt;/p&gt;
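&lt;p&gt;The default-direction idea can be sketched as follows: when learning a split, try routing the missing-valued rows left, then right, and bake in whichever direction scores better on the training data. An illustrative squared-error version in plain NumPy (not the library's actual implementation):&lt;/p&gt;

```python
import numpy as np

def default_direction(x, y, threshold):
    """Pick where rows with missing x should go at this split by trying
    both directions on the training data and keeping the better one."""
    missing = np.isnan(x)
    best = None
    for direction in ("left", "right"):
        go_left = np.where(missing, direction == "left", x <= threshold)
        left, right = y[go_left], y[~go_left]
        pred = np.where(go_left, left.mean(), right.mean())
        sse = np.sum((y - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, direction)
    return best[1]

x = np.array([1.0, 2.0, np.nan, 8.0, 9.0])
y = np.array([0.0, 0.0, 0.0,    1.0, 1.0])  # the missing row behaves "left"
assert default_direction(x, y, threshold=5.0) == "left"
```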
&lt;h2&gt;
  
  
  The GPU
&lt;/h2&gt;

&lt;p&gt;Or as I like to call it, a concurrency nerd's DREAM. I'm planning on writing an article completely dedicated to the wonders of GPU processing, and how to do some cool things using NVIDIA CUDA, but for now, we will be using NVIDIA's RAPIDS library. &lt;/p&gt;

&lt;p&gt;RAPIDS allows execution of end-to-end pipelines entirely on GPUs, and is built on CUDA primitives (a blend of C++ and C code). This allows highly parallelizable (single-instruction-multiple-data) workloads to be scaled outwards across multiple GPU cores. Whilst your GPU's individual cores may be much simpler than a CPU's, a GPU typically has hundreds (to thousands) of these cores; the lowest-end current NVIDIA GPU, for example, ships with upwards of EIGHT HUNDRED. The &lt;em&gt;lowest end&lt;/em&gt; GPU has orders of magnitude more cores than many high-end desktop CPUs.&lt;/p&gt;

&lt;p&gt;Additionally, RAPIDS is highly compatible with typical Pandas function calls. What this translates to is a very natural transition from typical Pandas function calls (which run on your CPU) to RAPIDS API calls, which run by default on your GPU. See &lt;a href="https://docs.rapids.ai/api/cudf/stable/api.html" rel="noopener noreferrer"&gt;here&lt;/a&gt; for a complete reference of RAPIDS' direct analogs to Pandas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GPU-powered API
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cudf&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cd&lt;/span&gt;

&lt;span class="c1"&gt;# usual import
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# does exactly the same thing from a programmer's perspective
&lt;/span&gt;&lt;span class="n"&gt;cd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Gradient Boosting on GPU
&lt;/h2&gt;

&lt;p&gt;Now the fun part. Since gradient boosting involves iteratively adding decision trees to a main model, at first it may seem completely counter-intuitive to attempt to run it on a GPU. However, we are not parallelizing tree creation; RAPIDS works to parallelize across data. The data points for each iteration are spread across your GPU's many cores in the background, essentially "unrolling" the summation expressed in the additive equation above.&lt;/p&gt;

&lt;p&gt;Enough talk, it's time to code.&lt;br&gt;
Note, this will require you to have a RAPIDS-compatible GPU, the latest version of the CUDA Toolkit and RAPIDS installed. See the following links on how to check for compatibility with RAPIDS, and how to install both RAPIDS and the CUDA toolkit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rapids.ai/start.html" rel="noopener noreferrer"&gt;Getting started with RAPIDS AI&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" rel="noopener noreferrer"&gt;Install the CUDA Toolkit&lt;/a&gt; &lt;br&gt;
&lt;a href="https://docs.nvidia.com/deploy/cuda-compatibility/" rel="noopener noreferrer"&gt;Check CUDA Compatibility&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;(If you have a 10-series NVIDIA GPU or newer, you should be fine)&lt;/p&gt;

&lt;p&gt;This does not claim to be a fully end-to-end tutorial; rather, it highlights some of the main features of the RAPIDS and XGBoost APIs.&lt;/p&gt;

&lt;p&gt;Let's ensure we have our libraries imported&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cupy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cp&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's also say we have sorted our data set, and have numpy arrays of the training and validation data. From these arrays, we need to wrap the data in XGBoost's DMatrix format. This, according to the &lt;a href="https://xgboost.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;XGBoost Documentation&lt;/a&gt;, is a more optimized data wrapper for extreme gradient boosting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dtrain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dvalidation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_validation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;y_validation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Tilling the Soil
&lt;/h4&gt;

&lt;p&gt;Now we need to specify some parameters for our gradient-boosted tree. I'm ignoring some safety-checking for the purposes of clarity; if this were being deployed, the code would first check for the existence of a GPU, then proceed to create the tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tree_method&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpu_hist&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_gpus&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;eval_metric&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# your choice of metric
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;objective&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# some objective function
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important part of the above is the assignment &lt;code&gt;'tree_method': 'gpu_hist'&lt;/code&gt;. This tells XGBoost to use a CUDA-accelerated GPU-based tree construction method.&lt;/p&gt;

&lt;h4&gt;
  
  
  Light, Water and Love
&lt;/h4&gt;

&lt;p&gt;Now it's time to grow our tree. XGBoost and RAPIDS make this incredibly simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;num_round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="n"&gt;bast&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtrain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_round&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! You have successfully grown a Gradient-boosted tree on the GPU.&lt;/p&gt;

</description>
      <category>rapids</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
