<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Fortune Adekogbe</title>
    <description>The latest articles on DEV Community by Fortune Adekogbe (@enutrof).</description>
    <link>https://dev.to/enutrof</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F546491%2F268d5967-c404-4fe1-8cf1-a6f7a68e2143.jpeg</url>
      <title>DEV Community: Fortune Adekogbe</title>
      <link>https://dev.to/enutrof</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/enutrof"/>
    <language>en</language>
    <item>
      <title>Automating Data Transformation with Machine Learning</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Tue, 09 Apr 2024 14:13:06 +0000</pubDate>
      <link>https://dev.to/enutrof/automating-data-transformation-with-machine-learning-1f4g</link>
      <guid>https://dev.to/enutrof/automating-data-transformation-with-machine-learning-1f4g</guid>
      <description>&lt;p&gt;Change is the only constant in our universe, and by default, it leads to increased disorder. To function as a society, we have learned to transform resources in the opposite direction using energy, and one of these resources is data. &lt;/p&gt;

&lt;p&gt;Data is, by definition, raw, and to be most useful, it must be processed. The operations used to accomplish this are referred to as transformations because they change the state of the data to make it more useful for decision-making and analysis. Traditional approaches have ranged from statistical methods to regex queries, sometimes built into full-fledged applications. These techniques, however, are limited in the types of data they can handle and the effort required to use them. To overcome these constraints, machine learning has taken its place among data transformation tools.&lt;/p&gt;

&lt;p&gt;Machine learning is the process of teaching programs, also known as models, to perform tasks through iterative exposure to data samples. Two popular methods involve exposing the model to input data in unsupervised learning or to both the input data and the corresponding target output in supervised learning. In either case, the goal is to efficiently train the model to learn the transformation operation and process data in a way that makes it more useful.&lt;/p&gt;

&lt;p&gt;This article discusses how machine learning can be used to transform data. You will understand the benefits of doing so, as well as any challenges you may encounter along the way. You will also see case studies of machine learning applications in various industries and organisations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Basics of Data Transformation
&lt;/h2&gt;

&lt;p&gt;Data transformation is the process of modifying data to make it more useful for decision-making. It is fundamental to data processing because data must change to progress from its raw state to what is considered useful information: it must be transformed. For example, consider a journalist who has recorded an interview with a person of interest and plans to publish an article. The recording contains the raw data they have gathered, and while it could probably be shared directly, publishing an article worth the public's attention means transforming that back-and-forth conversation into a compelling, truthful story that reflects the core message of the interview. However tedious this may be, it must be done.&lt;/p&gt;

&lt;p&gt;One method for transforming data is to use statistical techniques such as &lt;a href="https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php" rel="noopener noreferrer"&gt;measures of central tendency&lt;/a&gt;, which summarize a set of data about a specific quantity; examples include the mean, median, and mode. More advanced techniques include using fast Fourier transforms to convert audio data from the time domain to the frequency domain in the form of spectrograms. Matrix operations are another interesting class of transformations, commonly applied to images during resizing, color conversion, splitting, and so on. On text data, transformation may involve identifying and extracting patterns using regular expressions (regex). These patterns typically have to be configured carefully to ensure that all the desired information is extracted and edge cases are handled.&lt;/p&gt;
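&lt;p&gt;To make these techniques concrete, here is a small sketch using only Python's standard library. The response times and the email pattern are made up for illustration; a production regex would need many more edge cases:&lt;/p&gt;

```python
import re
import statistics

# Hypothetical response times (in ms), summarized with
# measures of central tendency
response_times = [120, 85, 85, 240, 95, 85, 150]
print(statistics.mean(response_times))
print(statistics.median(response_times))  # 95
print(statistics.mode(response_times))    # 85

# Extracting a pattern (here, email addresses) from raw text with a regex
text = "Contact ada@example.com or grace@example.org for details."
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['ada@example.com', 'grace@example.org']
```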

&lt;h3&gt;
  
  
  Limitations of Traditional Data Transformation Operations
&lt;/h3&gt;

&lt;p&gt;While these data transformation operations are useful, their applicability is limited. Traditional data transformation techniques, for example, can be used to change the size and color of an image but not to identify the objects in it or their location. Similarly, while audio data can be transformed into spectrograms, traditional techniques cannot be used to identify the speaker or transcribe the speech.&lt;/p&gt;

&lt;p&gt;Using these techniques also requires a significant amount of effort in many cases. For example, a regex query designed to extract information from text data may necessitate manually reviewing hundreds of samples and dealing with numerous edge cases to ensure that you have developed a pattern that works consistently. Even so, a new sample may present a previously unconsidered edge case.&lt;/p&gt;

&lt;p&gt;The learning curve for these data transformation techniques may also be steep. As a result, you may find yourself in a situation where you must spend engineering hours learning how to use a new tool before encountering the limitations that come with it.&lt;/p&gt;

&lt;p&gt;As an ingenious species, we found ways to get by despite our limited technology. However, we also invested resources in research to find other ways to ease our burdens. As a result, the popularity of machine learning tools has grown over the last two decades.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Role of Machine Learning in Data Transformation
&lt;/h2&gt;

&lt;p&gt;Machine learning models accept input and process it to generate useful outputs for you. In this sense, they are fundamentally data transformation tools trained to learn a transformation function that achieves a specific goal. When you send a prompt to a language model, such as GPT-3.5, it performs a transformation operation to produce an output that satisfies your query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgur.com%2F5N0zM7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgur.com%2F5N0zM7f.png" alt="Pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a data transformation tool, machine learning opens up a world of possibilities. Many previously infeasible data transformation tasks are now within reach. With a new goal in mind, the only constraints on developing a system that learns a custom transformation are the availability of data and computing resources. However, in many cases, APIs and open-source models are available to help you achieve your data transformation objectives without building your own from scratch. If necessary, you can fine-tune an existing model to make it more suitable for your use case. In either case, your task becomes significantly easier to automate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai/whisper" rel="noopener noreferrer"&gt;Whisper&lt;/a&gt;, an OpenAI open-sourced model, is a good example of a model like this and is one of the best models for speech transcription. The journalist in the previous example, who may have recorded an interview, can now easily convert that content into text using this model. Another interesting model for data transformation is &lt;a href="https://docs.ultralytics.com/models/yolov8/" rel="noopener noreferrer"&gt;YOLOv8&lt;/a&gt; (You Only Look Once). This model is part of a family of models that help you detect and locate objects in images. YOLOv8 is also useful for segmentation, pose estimation, and tracking. Another fascinating application of machine learning in data transformation is the creation of embeddings. These are compressed representations of your data (text, image, or audio) that can be used for tasks such as search, recommendation, classification, and so on. For text data, there are a variety of pre-trained models and tools available to assist with transformations such as translation, intent detection, entity detection, summarization, and question answering.&lt;/p&gt;

&lt;p&gt;While these models' capabilities are limited, you can fine-tune them to work for your use case after curating a corresponding dataset. This dataset will also not need to be as large as the one required for training from scratch.&lt;/p&gt;
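&lt;p&gt;To see why embeddings are useful for search and recommendation, here is a minimal sketch of comparing them with cosine similarity. The 4-dimensional vectors are made up for illustration; real embedding models output hundreds or thousands of dimensions:&lt;/p&gt;

```python
import numpy as np

# Toy "embeddings"; the numbers are invented so that words used in
# similar contexts point in similar directions
embeddings = {
    "near":  np.array([0.9, 0.1, 0.0, 0.2]),
    "close": np.array([0.8, 0.2, 0.1, 0.2]),
    "piano": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine_similarity(a, b):
    # 1.0 means identical direction, values near 0.0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "near" is far more similar to "close" than to "piano"
print(cosine_similarity(embeddings["near"], embeddings["close"]))
print(cosine_similarity(embeddings["near"], embeddings["piano"]))
```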

&lt;h2&gt;
  
  
  Benefits of Automating Data Transformation with Machine Learning
&lt;/h2&gt;

&lt;p&gt;Some benefits of augmenting or replacing traditional data transformation operations with machine-learning-powered ones are discussed below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Increased Scope of Operation:&lt;/strong&gt;  Data transformation has vast applications in almost every digital industry. However, machine learning broadens the scope of data transformation and automation to include industries and tasks that would normally require human intervention. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Efficiency:&lt;/strong&gt; Using machine learning in data transformations allows you to automate complex tasks with a reasonable level of accuracy. It increases the efficiency of your operations by allowing you to complete human-level tasks faster and, in some cases, with comparable accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduction in Manual Labor and Human Error:&lt;/strong&gt; Humans perform a great deal of repetitive work that has long been outside the scope of automation and traditional data transformation processes; examples include audio/video transcription, language translation, and object detection. Machine learning uses quality data gathered from humans completing such tasks to teach programs to do the same. This frees you up for tasks that are more difficult and beyond the scope of your automation tools. It also reduces the likelihood of human error, especially in the long term: human mistakes only need to be accounted for in the data used to train a model, after which a well-trained model performs consistently and predictably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time data transformation capabilities:&lt;/strong&gt; Because many data transformation tasks are complex, humans may need to analyze data in batches and then return the results. This is not always feasible, and it is rarely the best option. However, you can automate and process data in real time by leveraging machine learning transformations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Innovative Solutions in Data Transformation: The Case of GlassFlow
&lt;/h2&gt;

&lt;p&gt;A key part of automating data transformation with machine learning is how you build the systems that make them work. &lt;a href="https://www.glassflow.dev/" rel="noopener noreferrer"&gt;GlassFlow&lt;/a&gt; is a platform that removes bottlenecks in the development of data streaming pipelines. It simplifies and accelerates the pipeline building and deployment process for you by abstracting complex technology setup and decision-making. Thus, it allows you to concentrate on ensuring that the functional components of your data pipelines perform as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgur.com%2FEH1K0m1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgur.com%2FEH1K0m1.png" alt="GlassFlow Logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One significant benefit of GlassFlow's offering is that you no longer have to worry about scaling issues. Regardless of the number of instances required to handle your production workload, GlassFlow is built with powerful tools like &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt; and &lt;a href="https://nats.io/" rel="noopener noreferrer"&gt;NATS&lt;/a&gt; to help you manage them. This matters particularly for machine learning data transformations, which can be computationally intensive.&lt;/p&gt;

&lt;p&gt;GlassFlow's Python-based pipelines are a natural fit for you, as Python is the most popular language for modern data workflows. Once you have installed their package, you can quickly begin building, testing, and deploying pipelines. You can define your applications as Python functions and deploy them with simple but effective commands.&lt;/p&gt;

&lt;p&gt;GlassFlow's setup is also serverless. This means you will not have to worry about provisioning and managing machines in the cloud, which can be extremely stressful, especially with machine learning workflows. GlassFlow also manages updates and upgrades for you. This relieves the burden of maintenance and allows you to devote your engineering hours to more challenging tasks. You can also be confident that there will be no data loss and that your pipeline will be operational at least 99.95 percent of the time. You also have access to a human customer support team that is available to assist you whenever you require it.&lt;/p&gt;

&lt;p&gt;All of this is provided through a transparent pricing system: if the generous free tier does not cover your needs, you pay only for the services you use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Case Studies
&lt;/h2&gt;

&lt;p&gt;Organisations are already using machine learning in a variety of industries to automate and augment their existing workflows. One of these industries is manufacturing, which employs machine learning to detect defective products, monitor workers for safety compliance, manage inventory, and perform predictive maintenance. Machine learning is also used in the healthcare industry to speed up diagnostics by analyzing medical images, as well as to determine whether patients are at risk of certain conditions by processing their data and recommending preventive measures. The e-commerce industry incorporates machine learning into its recommendation systems. They collect user data and, using the models they create, select corresponding products that users might be interested in. Ad recommendation systems and social media platforms use a similar technology. Machine learning transformations are also used in the finance industry to detect fraudulent transactions and estimate credit scores. &lt;/p&gt;

&lt;p&gt;Two interesting organisations that use machine learning are discussed below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Airbus
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.airbus.com/en" rel="noopener noreferrer"&gt;Airbus&lt;/a&gt; manufactures aircraft, ranging from passenger planes to fighter jets, to pioneer sustainable aerospace. Airbus has &lt;a href="https://www.airbus.com/en/innovation/industry-4-0/artificial-intelligence" rel="noopener noreferrer"&gt;leveraged machine learning&lt;/a&gt; to improve its operations in a variety of ways. For example, as a decades-old company, Airbus needed a way to search through its vast amount of data to find information about an aircraft, airline, and so on. They accomplished this by using &lt;a href="https://curiosity.ai" rel="noopener noreferrer"&gt;Curiosity&lt;/a&gt;, a search application that &lt;a href="https://curiosity.ai/case-study-airbus" rel="noopener noreferrer"&gt;extracts knowledge from both structured and unstructured documents&lt;/a&gt; using natural language processing.&lt;/p&gt;

&lt;p&gt;On another front, Airbus employs machine learning to analyze telemetry data from the International Space Station. In particular, they ensure the continued operation of the &lt;a href="https://www.nasa.gov/international-space-station/columbus-laboratory-module/" rel="noopener noreferrer"&gt;Columbus module&lt;/a&gt;, a laboratory on the International Space Station, as well as the health of the astronauts on board, by sending thousands of telemetry data points to Earth for analysis. In the past, operators manually reviewed the data to identify anomalies, which were then fixed by engineers. However, the volume of data being processed, as well as the cost of human errors, necessitated a shift to an automated approach. This &lt;a href="https://blog.tensorflow.org/2020/04/how-airbus-detects-anomalies-iss-telemetry-data-tfx.html" rel="noopener noreferrer"&gt;involved training models using anomaly detection techniques&lt;/a&gt;, which reduced the need for an additional human in the loop and increased process efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bosch
&lt;/h3&gt;

&lt;p&gt;Bosch is a manufacturing company that makes a variety of household appliances and power tools. When products are created, they are usually manually inspected for defects by humans. This step is critical because detecting flaws early in the supply chain is the most cost-effective approach. To make this process more efficient, Bosch developed the Visual Quality Inspector (VQI), which uses computer vision to automate the quality inspection process. &lt;/p&gt;

&lt;p&gt;In detail, Bosch takes in data using visual sensors such as cameras and then preprocesses it with a variety of data transformation operations before passing it to machine learning models. These models transform the data from its image-like state to a target value that indicates whether or not a defect exists. A detected defect triggers an alert to the appropriate personnel, who handle it from there. As a result, Bosch used machine learning to automate a task that traditional data transformation tools simply cannot handle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges and Considerations
&lt;/h2&gt;

&lt;p&gt;While machine learning techniques improve your data transformation pipelines in numerous ways, as with any other technology, there are some drawbacks to consider when using them. Some of these are discussed below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity of implementation:&lt;/strong&gt; There are numerous highly accurate and pre-trained machine learning models available for use in a wide range of applications. Many of them are ready to use right away, either directly or via an API. However, you may need to fine-tune these models to handle your specific transformation task or, in some cases, create your own from scratch. Even with pre-trained models, you must carefully consider how to package them for deployment in a way that meets your latency requirements. The process of implementing these systems can be non-trivial.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data privacy and security:&lt;/strong&gt; The concept of privacy and security is fundamentally about handling data so that only authorized individuals have access to or use it. As a result, it comes as no surprise that it is critical in data transformation, especially for machine learning-powered systems that may require massive amounts of data to train models. Ensuring the privacy and security of those who own the data you are using is not only a technical challenge but also a legal one. You must consider the laws that govern your industry and location when deciding how to process the data you collect from your users and/or the internet as a whole.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The need for skilled professionals in ML technologies:&lt;/strong&gt; Based on the challenges discussed thus far, the people who implement these ML data transformations must have some domain knowledge in the field, the depth of which varies depending on whether they are simply using APIs or training models themselves. This can be difficult because it requires engineering hours to upskill or additional costs to hire skilled professionals. As a result, you must weigh the long-term benefits of using ML technology and make the best decision for your company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical use of machine learning in data processing:&lt;/strong&gt; Almost every machine learning technique has the potential to infringe on the rights of others or manipulate them. As a result, ethics has long been a source of contention in the machine-learning industry. So, you must critically consider the ethical implications of applying machine learning to your industry and use case.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, you learned about data transformation and how to improve it with machine learning. You saw the benefits of doing this, as well as some of the challenges and considerations that come with it. You also read case studies about organisations that use machine learning models as data transformation tools.&lt;/p&gt;

&lt;p&gt;The selling point of machine learning is that it automates tasks that would otherwise require a lot of work hours or be done inefficiently with other tools. And as time goes by, these models keep getting better. So you should be looking in this direction because, as in many industries, it has the potential to transform how you process data and make your systems more efficient. As they say, the future is data-driven.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.glassflow.dev/" rel="noopener noreferrer"&gt;GlassFlow&lt;/a&gt; assists you in getting on board the data-driven train by preventing you from wasting valuable engineering time on data engineering tooling and setups. Instead, it offers a serverless, Python-based data streaming pipeline that allows you to get started quickly, from development to deployment. Hop on to the GlassFlow pipeline by joining &lt;a href="https://www.glassflow.dev/join-waitlist" rel="noopener noreferrer"&gt;their waitlist today&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://glassflow.dev" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimgur.com%2FWlmCGXz.png" title="Glassflow" alt="alt_text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>On Transformers and Vectors</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Mon, 08 Apr 2024 10:03:03 +0000</pubDate>
      <link>https://dev.to/enutrof/on-transformers-and-vectors-22ig</link>
      <guid>https://dev.to/enutrof/on-transformers-and-vectors-22ig</guid>
      <description>&lt;p&gt;A friend asked me some questions about how tokens are converted into vectors, how matrix multiplication can lead to anything resembling understanding, and what to do about understanding highly dimensional spaces, conceptually. I gave him a sufficiently detailed response which was appreciated by others and they suggested that I make a post about it. Hence this. I hope you find it interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: In simple terms, how are tokens encoded into vectors?
&lt;/h3&gt;

&lt;h3&gt;
  
  
  A:
&lt;/h3&gt;

&lt;p&gt;First, I want to highlight the what and why of tokenization.&lt;/p&gt;

&lt;p&gt;Raw data in text, audio, or video form has to be broken down into smaller bits because of our skill issues: we don't yet have the computational facilities and efficient algorithmic techniques to process these things as a whole. The resulting bits are called tokens, and they are created using something we call a tokenizer.&lt;/p&gt;

&lt;p&gt;(I will assume text data during this explanation.)&lt;/p&gt;

&lt;p&gt;The tokenization step is important because the way you break down data directly affects the amount of contextual understanding you can get from it. For that reason, you probably don't want to tokenize your sentences at the character level. &lt;/p&gt;

&lt;p&gt;Word-level tokenization is a popular strategy because words mean more to us, but a constraint here is that you can only understand the words in your dataset's vocabulary. If someone brings something else, your tokenizer will not be able to handle it.&lt;/p&gt;

&lt;p&gt;This led us to use subword tokenization which involves breaking some words into parts. For instance, "reformers" could become ("re", "form", "ers") and "translate" could become ("trans", "late"). In this way, if we get a word like "transformers" that was not in the original set of words, we can break it into ("trans", "form", "ers"). If we also need to handle a word like “relate”, we can break it into (“re”, “late”). This strategy of breaking down words means that we can use the information we have to handle these new words. Fascinating right?&lt;/p&gt;
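&lt;p&gt;The subword idea above can be sketched as a greedy longest-match tokenizer. The tiny vocabulary below is hand-picked for illustration; real tokenizers (BPE, WordPiece, and friends) learn their vocabularies from data:&lt;/p&gt;

```python
# Hand-picked toy vocabulary of subword pieces
VOCAB = {"re", "form", "ers", "trans", "late"}

def tokenize(word, vocab):
    tokens = []
    i, n = 0, len(word)
    while i != n:  # i only ever advances, up to exactly n
        for j in range(n, i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # no vocabulary entry matches: fall back to a single character
            tokens.append(word[i])
            i += 1
    return tokens

# "transformers" was never seen whole, but its pieces were
print(tokenize("transformers", VOCAB))  # ['trans', 'form', 'ers']
print(tokenize("relate", VOCAB))        # ['re', 'late']
```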

&lt;p&gt;Now, to the encoding, we take these tokens that we have and the goal is to represent them mathematically in a way that similar tokens (words, if it's easier) have similar representations, aka vectors.&lt;/p&gt;

&lt;p&gt;The vectors for "near" and "close" should be very similar. Same thing for "fortune" and "money" as well as “schism” and “division”. This is because those words are more likely to be used in similar contexts.&lt;/p&gt;

&lt;p&gt;Practically, there are a lottttt of different ways that we can go about this. I will explain one of the simpler ones to make it easier to understand.&lt;/p&gt;

&lt;p&gt;Remember those English "fill-in-the-gap" questions where they give you a list of options and ask you to pick the one that best completes the sentence? You can answer them because you understand what words should come after the parts before the gap and before the parts after the gap. &lt;/p&gt;

&lt;p&gt;For instance, if I say:&lt;/p&gt;

&lt;p&gt;The world is _____ here. You know that "quiet" fits that better than "academic", "courtesy" and "cloth".&lt;/p&gt;

&lt;p&gt;Similarly, we teach models to learn what tokens (read words, if you prefer) fit into a particular context in a sentence. At the end of this, words like "father,"  "man," and "male" for instance, will be very highly correlated.&lt;/p&gt;

&lt;p&gt;In slightly more mathematical terms, we train these models to maximize the probability of getting a target word given the words that are used in the surrounding context.&lt;/p&gt;

&lt;p&gt;There is something very fascinating about this too. We realized that if you take the numerical difference between the vectors for words like "mother" and "father" and add or subtract it from that of "queen,"  you get something close to the vector for "king."&lt;/p&gt;
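&lt;p&gt;With toy vectors constructed so the analogy holds (real embeddings learn this kind of structure from data), the arithmetic looks like this:&lt;/p&gt;

```python
import numpy as np

# Invented 3-dimensional word vectors; the second component loosely
# encodes "male", the third "female", the first "royalty"
vec = {
    "father": np.array([1.0, 1.0, 0.0]),
    "mother": np.array([1.0, 0.0, 1.0]),
    "queen":  np.array([5.0, 0.0, 1.0]),
    "king":   np.array([5.0, 1.0, 0.0]),
}

# Take the mother-to-father difference and apply it to "queen"
result = vec["queen"] - vec["mother"] + vec["father"]
print(result)  # [5. 1. 0.], which is exactly vec["king"] here
```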

&lt;p&gt;Transformers do something very similar but they also include something known as "positional encoding". This tries to also factor in the position of a token in the sentence to get a better representation. The idea here is that words can mean different things based on their position in a sentence and that context is important.&lt;/p&gt;
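&lt;p&gt;One common scheme is the sinusoidal positional encoding from the original Transformer paper, sketched below; learned positional embeddings are another popular option:&lt;/p&gt;

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions get
    sin(pos / 10000**(2i/d_model)), odd dimensions get the matching cos."""
    positions = np.arange(seq_len)[:, np.newaxis]   # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]  # shape (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Each of the 10 positions gets a distinct 8-dimensional pattern that is
# combined with the token's own vector
pe = positional_encoding(seq_len=10, d_model=8)
print(pe.shape)  # (10, 8)
```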

&lt;h3&gt;
  
  
  Q: How does matrix multiplication encode meaning into vectors?
&lt;/h3&gt;

&lt;h3&gt;
  
  
  A:
&lt;/h3&gt;

&lt;p&gt;Welcome to gradient descent. &lt;/p&gt;

&lt;p&gt;For starters, thinking about this in terms of matrix multiplication is accurate but a bit too general so I understand your question.&lt;/p&gt;

&lt;p&gt;The most common operations we carry out are addition and multiplication. While this sounds basic, there are various ways that these matrices are combined. What makes them work is, first, the actual sequence of operations. This is what is called model architecture. It is essentially what happens to an input when it goes through the model and leaves as a predicted output.&lt;/p&gt;

&lt;p&gt;Pinning down the contents of this architecture is the goal of model training. On a general level, deep learning aims to find a way to encode the function that transforms an input into an output without explicitly knowing what that function is. All we know is that whatever that function is, it is somehow represented in the architecture.&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;output=f(input)output = f({input}) &lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;o&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;tp&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;f&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;in&lt;/span&gt;&lt;span class="mord mathnormal"&gt;p&lt;/span&gt;&lt;span class="mord mathnormal"&gt;u&lt;/span&gt;&lt;span class="mord mathnormal"&gt;t&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;To achieve this “function encoding”, we iteratively expose the architecture, which is initially a series of randomly initialized matrices (it could even start as all zeros, for instance), to input-output pairs.&lt;/p&gt;

&lt;p&gt;When each input (or batch of input) goes from the entrance to the exit of the architecture, we compare the predicted output to the actual output and compute the difference. This difference is then used to update all the matrices that were initially randomly initialized in the architecture.&lt;/p&gt;

&lt;p&gt;We continue doing this until we can no longer do it because of cost or because the model doesn't seem to be improving anymore. At this point, the matrices are very different from how they started because of all the updates.&lt;/p&gt;

&lt;p&gt;To be clear, in every step of this process, we are just doing matrix multiplication, but we also figured out that by experimenting with different sequences of operations in the architecture, we can get better results.&lt;/p&gt;

&lt;p&gt;Transformers are only the most recent result of this experimentation. They were derived from the Attention mechanism which tries to mimic attention in humans. This was preceded by a range of specialized architectures like "long short-term memory networks,"  "convolutional neural networks," and so on. All of these came from the fully connected network, which was derived from the perceptron, which is essentially a glorified y = mx + c aka linear regression.&lt;/p&gt;
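&lt;p&gt;The whole training loop above can be demonstrated on that glorified y = mx + c. This sketch uses synthetic data generated from known parameters and recovers them by repeatedly nudging zero-initialized parameters against the prediction error:&lt;/p&gt;

```python
import numpy as np

# Synthetic input-output pairs generated from the "true" function y = 3x + 2
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 2.0

m, c = 0.0, 0.0   # initialized parameters (zeros, in this case)
lr = 0.1          # learning rate: how big each update step is
for _ in range(500):
    error = (m * x + c) - y       # predicted output minus actual output
    m -= lr * np.mean(error * x)  # step against the error gradient w.r.t. m
    c -= lr * np.mean(error)      # step against the error gradient w.r.t. c

print(round(m, 2), round(c, 2))  # 3.0 2.0
```

(The mean-squared-error gradient carries a factor of 2 that is absorbed into the learning rate here.)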

&lt;h3&gt;
  
  
  Q: What's the deal with the myriad of dimensions? How am I supposed to wrap my mind around a 12,000-dimensional space?
&lt;/h3&gt;

&lt;h3&gt;
  
  
  A:
&lt;/h3&gt;

&lt;p&gt;As regards the number of dimensions, you are NOT meant to wrap your head around it. 😂 But on a high level, you can take it to mean that to properly describe the data we have, we need to specify 12,000 characteristics per data point. If we go any lower, we may lose information.&lt;/p&gt;

&lt;p&gt;For instance, if I try to describe you with just two words, I will either throw you into some category that you may not like or drop something that can barely be regarded as a description.&lt;/p&gt;

&lt;p&gt;But the more words I am allowed to use, the better I can represent my understanding of you. That said, one can reasonably argue that 12,000 characteristics would be overdoing it and that by about 8,000 words, I should have a good enough description of the data point. Anything beyond that is unnecessary verbosity. But that is a different discussion.&lt;/p&gt;
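&lt;p&gt;A tiny numpy illustration of this (the numbers are made up): two data points that look identical when described by only two characteristics become distinguishable once more characteristics are kept.&lt;/p&gt;

```python
import numpy as np

# Two different "data points" described by six characteristics each.
a = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
b = np.array([0.9, 0.1, 0.1, 0.9, 0.2, 0.8])

def cosine(u, v):
    # Cosine similarity: 1.0 means the descriptions point the same way.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# With only the first two characteristics, a and b look identical...
print(cosine(a[:2], b[:2]))  # 1.0 -- indistinguishable
# ...but with all six kept, the description is rich enough to tell them apart.
print(cosine(a, b))
```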

</description>
      <category>machinelearning</category>
      <category>encoding</category>
      <category>vectors</category>
      <category>transformers</category>
    </item>
    <item>
      <title>Getting Started with Machine Learning: The Quantum Edition</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Fri, 18 Mar 2022 10:34:14 +0000</pubDate>
      <link>https://dev.to/enutrof/getting-started-with-machine-learning-the-quantum-edition-2f6o</link>
      <guid>https://dev.to/enutrof/getting-started-with-machine-learning-the-quantum-edition-2f6o</guid>
      <description>&lt;p&gt;We all want to automate the repetitive stuff and to take that to the next level, we need &lt;a href="https://www.oracle.com/ng/data-science/machine-learning/what-is-machine-learning/" rel="noopener noreferrer"&gt;machine learning&lt;/a&gt;. The legend says that if you have enough data, sufficient compute resources and a proficient engineer, you can build applications that perform repetitive tasks with a speed and accuracy that matches and sometimes exceeds the capabilities of the average human. Data here ranges from what you have in your excel sheets to text, images, audio and videos. So whether the goal is predicting a quantity like the price of housing, identifying objects in an image or recommending movies to you, machine learning is here to save the day.&lt;/p&gt;

&lt;p&gt;In this article, you will learn about machine learning systems, quantum computing, how quantum computing helps machine learning and tools that are available to help you get started with quantum machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quantum Computing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To understand the quantum aspect, we will take a brief detour into the realm of Physics (stay with me). A quantum is the smallest discrete unit of a physical quantity, and quantum mechanics refers to the study of small particles like atoms and electrons. Scientists studying these particles in the 20th century realized that most of the rules that govern larger bodies (classical mechanics) break down on such a small scale. This means that those rules were really approximations that only seemed universal because we hadn’t encountered this edge case. Two important phenomena that will help us understand this quantum business are quantum superposition and quantum entanglement.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Quantum Superposition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You are currently looking at an electronic device (don’t look away) and are probably 100% sure about what it is (a phone, laptop or any other device). Would you believe me if I said that what you observe is only part of the story? Quantum superposition tells us that before a quantum system is measured, it exists as a combination of its possible states. The act of measurement fixes a particular state, and that is what we experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9kue0u1a015zcu3e9sx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9kue0u1a015zcu3e9sx.jpeg" alt="Quantum meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we are initially made to believe that machines fundamentally interact with bits that are either 0s or 1s, as most machines are built on this assumption. A quantum bit, however, does not have to commit: before observation, it exists in a superposition of 0 and 1. Now isn’t that interesting? Working off this, the qubit (quantum bit) was introduced and it is essentially a bit that is both 0 and 1 simultaneously until it is measured. This gives us a tractable handle on what would otherwise be an infinite number of possible states.&lt;/p&gt;
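&lt;p&gt;You can sketch this on a classical machine with plain numpy (a simulation of the mathematics, not real quantum hardware): a qubit is just two amplitudes whose squared magnitudes give the odds of observing 0 or 1.&lt;/p&gt;

```python
import numpy as np

# A qubit is described by two amplitudes (alpha, beta)
# with |alpha|^2 + |beta|^2 = 1.
qubit = np.array([1, 1]) / np.sqrt(2)  # equal superposition of 0 and 1

probs = np.abs(qubit) ** 2             # Born rule: probability of each outcome
print(probs)                           # [0.5 0.5]

# "Observing" the qubit fixes one particular state.
rng = np.random.default_rng(42)
outcome = rng.choice([0, 1], p=probs)
print(outcome)                         # either 0 or 1, each half the time
```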

&lt;h3&gt;
  
  
  &lt;strong&gt;Quantum Entanglement&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, imagine that you passed by a random stranger, brushed shoulders with them by chance, apologized and moved on. However, upon reaching your destination, you find out that you can precisely determine their location, name and internet passwords. Wouldn’t that be strange? Well, that is what quantum entanglement is all about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtcmccyv5nk9s7hokwdh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtcmccyv5nk9s7hokwdh.jpeg" alt="Umbrella academy meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two quantum particles that somehow end up interacting with each other can become entangled: their properties remain correlated regardless of how far apart they are, so measuring one immediately tells you something about the other. Einstein called this “spooky action at a distance” and to be honest, I agree. Since entangled qubits know about each other and one qubit can be entangled with multiple other qubits, entanglement allows quantum computers to process information dramatically faster for certain problems.&lt;/p&gt;
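&lt;p&gt;Entanglement can also be simulated with numpy (again a mathematical sketch of my own, not quantum code): in the Bell state below, each qubit alone looks like a fair coin, yet the two measurement outcomes always agree.&lt;/p&gt;

```python
import numpy as np

# Bell state: two qubits in the state (|00> + |11>) / sqrt(2).
# The 4 amplitudes index the joint outcomes 00, 01, 10, 11.
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
probs = np.abs(bell) ** 2              # [0.5, 0, 0, 0.5]

rng = np.random.default_rng(7)
samples = rng.choice(4, size=1000, p=probs)

# Measuring either qubit alone is a 50/50 coin flip, but the two
# outcomes are always identical: knowing one fixes the other.
first = samples // 2
second = samples % 2
print((first == second).all())         # True: perfectly correlated
```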

&lt;h2&gt;
  
  
  &lt;strong&gt;Quantum Computing in Machine Learning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The typical machine learning process up to building a model on a high level involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gathering data from all relevant sources&lt;/li&gt;
&lt;li&gt;Labelling this data and confirming these labels (validation)&lt;/li&gt;
&lt;li&gt;Transforming the data into a form that helps the model understand it&lt;/li&gt;
&lt;li&gt;Building possible candidate models based on various approaches and selecting the best based on a test.&lt;/li&gt;
&lt;/ul&gt;
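&lt;p&gt;The last two steps can be sketched classically in a few lines of numpy (a toy example of my own): two candidate models are built and the better one is selected based on held-out test error.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)

# Gathered data: a noisy linear relationship.
x = rng.uniform(-1, 1, size=200)
y = 3 * x + 0.5 + rng.normal(scale=0.1, size=200)

# Transform / split: hold out a test set for model selection.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Candidate models: predict the training mean vs. fit a line.
def mean_model(xs):
    return np.full_like(xs, y_train.mean())

line = np.polyfit(x_train, y_train, deg=1)
def linear_model(xs):
    return np.polyval(line, xs)

# Select the best candidate based on a test.
errors = {name: np.mean((model(x_test) - y_test) ** 2)
          for name, model in [("mean", mean_model), ("linear", linear_model)]}
best = min(errors, key=errors.get)
print(best)  # linear
```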

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdstdislekllk84eesqwh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdstdislekllk84eesqwh.jpg" title="The Machine Learning Process. Credit: Machine World (Blogspot)" alt="Machine learning flow diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quantum machine learning works by making the data processing and/or model building stages quantum-based. If neither is quantum, we have a regular classical machine learning setup. If either the input data is loaded and preprocessed as a quantum state or the model is quantum-based (but not both), we have a hybrid quantum-classical system. Finally, if both are quantum (data and modelling algorithm), we have a full-blown quantum-based machine learning system.&lt;/p&gt;
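&lt;p&gt;The taxonomy above boils down to two yes/no questions, which a tiny sketch makes explicit (the function name is mine):&lt;/p&gt;

```python
def ml_system_type(quantum_data: bool, quantum_model: bool) -> str:
    # Classify a setup by which stages are quantum-based.
    if quantum_data and quantum_model:
        return "quantum"
    if quantum_data or quantum_model:
        return "hybrid quantum-classical"
    return "classical"

print(ml_system_type(False, False))  # classical
print(ml_system_type(True, False))   # hybrid quantum-classical
print(ml_system_type(True, True))    # quantum
```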

&lt;p&gt;These quantum machine learning models are the result of either the conversion of classical models into a quantum version or the development of quantum models from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What quantum computing brings to the table&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As you would now agree, quantum machine learning has the potential to reduce the&lt;a href="https://research.ibm.com/blog/quantum-kernels" rel="noopener noreferrer"&gt; time it takes a system to learn&lt;/a&gt;. Also, training on a quantum computer means that more data can be processed quickly by taking advantage of the range of available states. Finally, the accuracy of the resulting systems can increase, even for reasonably complicated problems and even when the same quantity of data is used.&lt;/p&gt;

&lt;p&gt;Quantum advantages have been seen when applied to algorithms like&lt;a href="https://arxiv.org/abs/1904.05803" rel="noopener noreferrer"&gt; principal component analysis&lt;/a&gt;,&lt;a href="https://qiskit.org/documentation/stable/0.24/tutorials/machine_learning/01_qsvm_classification.html" rel="noopener noreferrer"&gt; support vector machines and kernel methods&lt;/a&gt;,&lt;a href="https://arxiv.org/abs/2103.15084" rel="noopener noreferrer"&gt; quantum reinforcement learning&lt;/a&gt;,&lt;a href="https://www.nature.com/articles/s41567-019-0648-8" rel="noopener noreferrer"&gt; quantum convolutional neural networks&lt;/a&gt; and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Getting into the thick of things&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Though quantum computers are still under active development, with IBM recently creating the&lt;a href="https://research.ibm.com/blog/127-qubit-quantum-processor-eagle" rel="noopener noreferrer"&gt; first quantum processor with 127 quantum bits&lt;/a&gt; (qubits), enthusiasts and independent researchers are not left out. There are libraries and environments built so that we can use quantum computing hardware and simulated environments. Some of these are discussed below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;IBM Quantum&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.ibm.com/quantum-computing/" rel="noopener noreferrer"&gt;IBM Quantum&lt;/a&gt; provides a cloud-based environment for carrying out quantum computing. It consists of the&lt;a href="https://quantum-computing.ibm.com/composer/docs/iqx/" rel="noopener noreferrer"&gt; quantum composer&lt;/a&gt; a tool that enables you to build quantum circuits via a graphical interface, access &lt;a href="https://quantum-computing.ibm.com/docs/services/#services-overview" rel="noopener noreferrer"&gt;quantum services&lt;/a&gt; like systems and simulators as well as a &lt;a href="https://quantum-computing.ibm.com/lab" rel="noopener noreferrer"&gt;quantum lab&lt;/a&gt;. In the quantum lab, you can (without download or installation) program quantum circuits and take advantage of other functions of&lt;a href="https://qiskit.org/" rel="noopener noreferrer"&gt; Qiskit&lt;/a&gt; (an open-source package for programming on quantum hardware or simulators).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapgaknefk77w59hewmqt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fapgaknefk77w59hewmqt.jpg" title="The innards of an IBM quantum computer. Credit: IBM" alt="IBM Quantum computer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qiskit is&lt;a href="https://qiskit.org/overview" rel="noopener noreferrer"&gt; described&lt;/a&gt; as the most feature-rich, popular and open quantum computing SDK. The SDK is in Python and has modules for solving problems in machine learning, nature (chemistry), finance and optimization. For machine learning specifically, it can be used to implement quantum support vector machines, quantum generative adversarial networks and variational quantum classifiers. It also has sub-modules for implementing other forms of quantum neural networks, and even a Pytorch runtime for hybrid quantum-classical machine learning. It's quite the package.‌‌&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Azure Quantum&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-us/services/quantum/" rel="noopener noreferrer"&gt;Azure Quantum&lt;/a&gt; is a Microsoft provided cloud service that lets you write quantum programs and use quantum computing in production. They introduce the Quantum Development Kit (QDK) in a programming language called Q# which works with Python and C#. The Quantum Development Kit has libraries for solving problems in chemistry, numerics and machine learning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92xff69nnxa2pqz9ao7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92xff69nnxa2pqz9ao7e.png" title="Azure Quantum Homepage" alt="Azure quantum homepage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Focusing on machine learning: with Q#, model architectures are instantiated, and then with Python, these models can be imported and data can be passed in for training. A little weird, I know, but it works. The most interesting bit of this arrangement is how scripts in Q# can be made to interact directly with Python scripts using namespaces.&lt;/p&gt;

&lt;p&gt;With Q# and the quantum development kit, applications can be created and connected to host programs that are written either in Python or a .NET language. Microsoft offers $10,000&lt;a href="https://microsoft.qualtrics.com/jfe/form/SV_3fl9dfFrkC3g0aG?aq_source=organic" rel="noopener noreferrer"&gt; in free credits&lt;/a&gt; for use on quantum hardware after which use will require a charge.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TensorFlow Quantum (TFQ)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.tensorflow.org/quantum" rel="noopener noreferrer"&gt;TensorFlow Quantum&lt;/a&gt; (TFQ) is a library for quantum machines created by Google to promote the development of hybrid quantum-classical ML models. Google also created&lt;a href="https://quantumai.google/cirq" rel="noopener noreferrer"&gt; Cirq&lt;/a&gt; an open-source quantum computing library written in Python. These libraries enable you to load data in a quantum state and build models (classical or quantum). ‌‌&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fnrendbbp5zqxkyg3lx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fnrendbbp5zqxkyg3lx.png" title="TensorFlow Quantum Convolutional Neural Network. Credit: TensorFlow" alt="TensorFlow Quantum CNN"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, with TF Quantum, you can learn to build quantum convolutional neural networks, quantum reinforcement learning systems and so on. You can also create hybrid systems where quantum data is used with classical models of any kind.&lt;/p&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, you learnt about machine learning systems, quantum computing, how quantum computing helps machine learning and tools that are available to help you get started with quantum machine learning.&lt;/p&gt;

&lt;p&gt;Generally, hybrid quantum-classical ML models seem to be able to achieve better results when compared to quantum models. With the development of better quantum processors and algorithms through active research, quantum-based machine learning systems are expected to outdo current systems and speed up computation as is already done in other fields. I hope you enjoy exploring the world of quantum machine learning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All puns were intended.&lt;/em&gt;‌‌&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>quantumcomputing</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Staying in Tune: A guide to optimizing hyperparameters</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Fri, 20 Aug 2021 20:34:46 +0000</pubDate>
      <link>https://dev.to/enutrof/staying-in-tune-a-guide-to-optimizing-hyperparameters-52cj</link>
      <guid>https://dev.to/enutrof/staying-in-tune-a-guide-to-optimizing-hyperparameters-52cj</guid>
      <description>&lt;p&gt;I used to play the clarinet and I'm a big fan of music but this article is not about that. I also used to listen to static noise by adjusting radio frequencies but this is not about that either.&lt;/p&gt;

&lt;p&gt;Ronald H. Coase, a renowned British economist, once said, "If you torture the data long enough, it will confess to anything". In this piece, I will walk us through a step-by-step guide to torturing data efficiently. Let's go!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm921byt1g0a4kszv7s8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm921byt1g0a4kszv7s8a.png" alt="Loss landscape"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;small&gt;The Loss Landscape&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Manual tuning is a huge pain for the ML Engineer. The image above (for those who do not know) is a depiction of the loss landscape. Depending on your initial parameters, you could end up starting at any point in that landscape. Your goal is to end up at the global minimum of your search space by the end of the tuning process. Though the human brain is undisputedly powerful, this is nigh impossible via manual tuning.&lt;/p&gt;

&lt;p&gt;I mean, you'll first have to think up a strategy. Do I change all parameters per training run or use the one-factor-at-a-time approach? The sad truth is that either way is a fool’s errand. The next question is, "How do I tackle this problem?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy60rrju0b1sdzwrcaj5l.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy60rrju0b1sdzwrcaj5l.jpg" alt="meme"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;small&gt;Abi kin salo ni? (Should I just run away?)&lt;/small&gt;&lt;/center&gt;

&lt;p&gt;Introducing...drum roll...&lt;/p&gt;
&lt;h2&gt;
  
  
  The KerasTuner
&lt;/h2&gt;

&lt;p&gt;KerasTuner is an easy-to-use, scalable hyperparameter optimization framework that solves the pain points of hyperparameter search. Simply put, it helps with getting the best hyperparameters within a predefined search space for your model. The API provides access to tuners that use the Bayesian Optimization, Hyperband, and Random Search algorithms. As you may have guessed, it is most useful for TensorFlow models. Let’s get technical.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installing the KerasTuner
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
  import kerastuner as kt
except ImportError:
  !pip install -q -U keras-tuner
  import kerastuner as kt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The code snippet above helps with the installation of the Keras Tuner. The key part—on the first run—is the except block.&lt;br&gt;
There we use the good ol' pip package manager with a few arguments. The &lt;code&gt;-q&lt;/code&gt; argument just means quiet (to control the console log level) while the &lt;code&gt;-U&lt;/code&gt; argument upgrades the package to the latest available version.&lt;/p&gt;

&lt;p&gt;I have grown accustomed to setting up my installation with the try-except block so that I get to run the same Jupyter cell without edits regardless of whether my goal is just importing or also installing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Importing other required libraries
&lt;/h2&gt;

&lt;p&gt;In the code snippet below, we import other libraries that will be used. If you don't have any of them in your development environment, run &lt;code&gt;pip install [package name]&lt;/code&gt; in a terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import matplotlib.pyplot as plt
from numpy.random import seed
from tensorflow import random
from tensorflow.keras.callbacks import EarlyStopping
from sklearn import metrics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Loading the Fashion MNIST Dataset
&lt;/h2&gt;

&lt;p&gt;We will be using the fashion MNIST dataset to explore the tuner. This dataset is pretty much like the digits MNIST but instead of numbers, we have 10 classes of fashion items. It was created because there was too much focus on the already easy MNIST digits dataset. Below is an image of samples from the dataset, with every 3 rows containing a separate class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5wkg91htwb344gz96s9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5wkg91htwb344gz96s9.png" alt="fashion MNIST"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;small&gt;Fashion MNIST Dataset&lt;/small&gt;&lt;/center&gt;
&lt;h3&gt;
  
  
  Labels
&lt;/h3&gt;

&lt;p&gt;Each training and test example is assigned to one of the following labels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Label&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;T-shirt/top&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Trouser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Pullover&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Dress&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Coat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Sandal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Shirt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Sneaker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Bag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Ankle boot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
# summarize loaded dataset
print(f'Train: X = {X_train.shape}, y = {y_train.shape}')
print(f'Test:  X = {X_test.shape}, y = {y_test.shape}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Next, we split the dataset into training and validation sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train, X_valid, y_train, y_valid = model_selection.train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sample Model
&lt;/h3&gt;

&lt;p&gt;Below is a simple 3-layered Keras model depicting how one might normally start the parameter tuning process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = keras.Sequential([
    # input layer
    keras.layers.Flatten(),

    # 1st hidden layer
    keras.layers.Dense(units=512, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.5),

    # 2nd hidden layer
    keras.layers.Dense(units=256, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.5),

    # 3rd hidden layer
    keras.layers.Dense(units=128, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
    keras.layers.Dropout(0.5),

    # output layer
    keras.layers.Dense(10, activation="softmax")
])

# compile network    

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss=keras.losses.SparseCategoricalCrossentropy(),  # the softmax layer outputs probabilities, not logits
            metrics=['accuracy'])

stop_early = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

model.fit(X_train, y_train, 
            validation_data=(X_valid, y_valid),
            epochs=10,
            callbacks=[stop_early],
            )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this for 10 epochs, we are only able to achieve a 71.82% accuracy. This will be compared to the results from the optimization techniques we will try out.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Optimizers
&lt;/h2&gt;

&lt;p&gt;We shall investigate the performance of all 3 built-in tuners on a simple 3-layered multi-layer perceptron in Keras.&lt;/p&gt;

&lt;p&gt;The first step is building the hyper model. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Hyper Model
&lt;/h3&gt;

&lt;p&gt;The hyper model is a Python function that takes the hyperparameter object &lt;code&gt;hp&lt;/code&gt; as an argument. This object helps with defining various aspects of the search space. After this, the function returns the compiled model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def model_builder(hp):
    # build the network architecture
    units1=hp.Int('units_1', 512, 2048, step=64)
    units2=hp.Int('units_2', 256, 1024, step=64)
    units3=hp.Int('units_3', 128, 768, step=32)

    model = keras.Sequential([
        # input layer
        keras.layers.Flatten(),

        # 1st hidden layer
        keras.layers.Dense(units=units1, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
        keras.layers.Dropout(0.5),

        # 2nd hidden layer
        keras.layers.Dense(units=units2, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
        keras.layers.Dropout(0.5),

        # 3rd hidden layer
        keras.layers.Dense(units=units3, activation="relu", kernel_regularizer=keras.regularizers.l2(0.001)),
        keras.layers.Dropout(0.5),

        # output layer
        keras.layers.Dense(10, activation="softmax")
    ])

    # compile network    
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss=keras.losses.SparseCategoricalCrossentropy(),  # the softmax layer outputs probabilities, not logits
                metrics=['accuracy'])
    return model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After defining the function and its single argument &lt;code&gt;hp&lt;/code&gt;, the first thing we do is define the search space for the units of each layer. This part is arbitrary, as the limits and steps depend on personal choices. Since the units are integers, the &lt;code&gt;hp.Int()&lt;/code&gt; method is used, and its minimum, maximum and step values are set.&lt;/p&gt;

&lt;p&gt;Next, we move on to define the model pipeline. A sequential model is used and a dropout of &lt;code&gt;0.5&lt;/code&gt; is added after each hidden layer. Of course, this can also be tuned if you wish. The same goes for the activation function and kernel regularizer used in each layer, which are set to &lt;code&gt;relu&lt;/code&gt; and an l2 regularization of &lt;code&gt;0.001&lt;/code&gt; respectively.&lt;/p&gt;

&lt;p&gt;The last layer outputs the probabilities for each class and uses a softmax activation function.&lt;/p&gt;

&lt;p&gt;Next, we define the options for the learning rate and use the &lt;code&gt;hp.Choice()&lt;/code&gt; method. The &lt;code&gt;hp.Float()&lt;/code&gt; method could work but it is less direct.&lt;/p&gt;

&lt;p&gt;Finally, we compile the model with an Adam optimizer, Sparse Categorical Cross-Entropy loss and metric set as accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Random Search
&lt;/h3&gt;

&lt;p&gt;After building the model, the next step is defining the tuner. The first one we will be considering is the Random Search. As you would have guessed, this pretty much picks pseudo-random sets of hyperparameters from the search space and hopes that they give good results. It is a very useful approach if one is running a reasonable number of trials and if the model is feeling lucky.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner = kt.tuners.RandomSearch(model_builder,
                                objective='val_accuracy',
                                max_trials=25,
                                directory='.',
                                project_name='random_search'
                                )

stop_early = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting up the tuner is not complicated. Here, we just pass in the hyper model function, set the optimization objective (the validation accuracy) and set a maximum number of trials. After these, the directory for caching results and the project name are set.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;EarlyStopping&lt;/code&gt; callback is also defined; it monitors the validation loss and stops training after &lt;code&gt;5&lt;/code&gt; epochs without improvement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner.search(X_train, y_train, 
            validation_data=(X_valid, y_valid),
            epochs=10,
            callbacks=[stop_early],
            )

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print("The hyperparameter search is complete.")
print(best_hps.get_config()['values'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the &lt;code&gt;tuner.search&lt;/code&gt; method is called. This is pretty much like the &lt;code&gt;model.fit&lt;/code&gt; function in Keras.&lt;/p&gt;

&lt;p&gt;Then, &lt;code&gt;tuner.get_best_hyperparameters&lt;/code&gt; is called to retrieve and display the best hyperparameters at the end of the tuner's search. The results are displayed below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Best val_accuracy: 0.8577499985694885
Total elapsed time: 00h 24m 01s

The hyperparameter search is complete.

{'learning_rate': 0.0001, 'units_1': 1088, 'units_2': 1024, 'units_3': 640}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Bayesian Optimizer
&lt;/h3&gt;

&lt;p&gt;Bayesian Optimization runs models many times with different sets of hyperparameter values and evaluates the information from past trials to select the hyperparameter values for newer models. This allows it to—very quickly—home in on the most accurate model architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lu1w5vof3flyx8v7is4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lu1w5vof3flyx8v7is4.png" alt="Bayesian"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner = kt.tuners.bayesian.BayesianOptimization(model_builder, objective='val_accuracy',max_trials=25,directory='.',project_name='bayesian_mlp')

stop_early = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tuner and &lt;code&gt;EarlyStopping&lt;/code&gt; callback are defined as before, except that the &lt;code&gt;kt.tuners.bayesian.BayesianOptimization&lt;/code&gt; class is used instead and the &lt;code&gt;project_name&lt;/code&gt; is changed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner.search(X_train, y_train, 
            validation_data=(X_valid, y_valid),
            epochs=10,
            callbacks=[stop_early],
            )

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print("The hyperparameter search is complete.")
best_hps.get_config()['values']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This block of code, where the search is run, is essentially the same as for the random search. The results are shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Best val_accuracy: 0.8615833520889282

The hyperparameter search is complete.

{'learning_rate': 0.0001, 'units_1': 1536, 'units_2': 1024, 'units_3': 608}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Hyperband
&lt;/h3&gt;

&lt;p&gt;The Hyperband tuning algorithm focuses on speeding up random search through adaptive resource allocation and early stopping. Here, hyperparameter optimization is formulated as a pure-exploration, non-stochastic, infinite-armed bandit problem in which a predefined resource, such as iterations, data samples, or features, is allocated to randomly sampled configurations.&lt;/p&gt;

&lt;p&gt;What does all this mean? The Hyperband algorithm tries out many configurations with a small resource budget and increases the allocation for the trials that look promising. It successively halves the number of surviving trials until a single strong configuration remains.&lt;/p&gt;
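&lt;p&gt;The successive-halving loop described above can be sketched in a few lines of plain Python (an illustrative toy, not Hyperband's exact bracketing scheme; the configurations and scoring function are made up):&lt;/p&gt;

```python
def successive_halving(configs, train_eval, min_budget=1, factor=2):
    """Repeatedly evaluate surviving configs, keep the best fraction,
    and grow the budget, until a single configuration remains."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        # Evaluate every surviving configuration with the current budget.
        scores = {c: train_eval(c, budget) for c in survivors}
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[: max(1, len(survivors) // factor)]
        budget *= factor  # promising configs get more resources next round
    return survivors[0]

# Toy "training": a config's score improves with budget, scaled by its quality.
best = successive_halving(
    configs=[0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.7, 0.5],
    train_eval=lambda quality, budget: quality * (1 - 0.5 ** budget),
)
```

With eight starting configurations and a halving factor of 2, three rounds reduce the pool 8 → 4 → 2 → 1 while doubling the budget each round.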

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn4t29uzl9jm880y9crc.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn4t29uzl9jm880y9crc.gif" alt="Successive halving"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;center&gt;&lt;small&gt;Successive Halving&lt;/small&gt;&lt;/center&gt;
&lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner = kt.Hyperband(model_builder,  
                     objective='val_accuracy',
                     max_epochs=10, 
                     directory='.',
                     project_name='hyperband',
                     overwrite= True
                 )
stop_early = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We start by instantiating &lt;code&gt;kt.Hyperband&lt;/code&gt; with several arguments. The first is the previously defined hypermodel, &lt;code&gt;model_builder&lt;/code&gt;. Next, the objective is set to &lt;code&gt;val_accuracy&lt;/code&gt;, which refers to the validation accuracy. This is followed by the maximum number of epochs &lt;code&gt;max_epochs&lt;/code&gt;, the &lt;code&gt;directory&lt;/code&gt; and the &lt;code&gt;project_name&lt;/code&gt;. The final argument, &lt;code&gt;overwrite&lt;/code&gt;, is set to &lt;code&gt;True&lt;/code&gt; so that any previous results in the directory are discarded.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;EarlyStopping&lt;/code&gt; function is also defined as usual.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner.search(X_train, y_train, 
            validation_data=(X_valid, y_valid),
            epochs=100,
            callbacks=[stop_early],
            )

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print("The hyperparameter search is complete.")
best_hps.get_config()['values']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, the &lt;code&gt;tuner.search&lt;/code&gt; method is called like before and the &lt;code&gt;tuner.get_best_hyperparameters&lt;/code&gt; method outputs the best hyperparameters at the end of the tuner's search. The results are displayed below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Best val_accuracy: `0.8496666550636292`

The hyperparameter search is complete.

{'learning_rate': 0.0001,
 'tuner/bracket': 0,
 'tuner/epochs': 10,
 'tuner/initial_epoch': 0,
 'tuner/round': 0,
 'units_1': 1088,
 'units_2': 832,
 'units_3': 320}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;b&gt;Summary:&lt;/b&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Optimizer&lt;/th&gt;
&lt;th&gt;Accuracy(%)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;71.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;Random Search&lt;/td&gt;
&lt;td&gt;85.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;Bayesian&lt;/td&gt;
&lt;td&gt;86.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MLP&lt;/td&gt;
&lt;td&gt;Hyperband&lt;/td&gt;
&lt;td&gt;84.97&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;As you have seen, all three optimization techniques reach roughly 84-86% accuracy in just 10 epochs (with the Bayesian optimizer slightly ahead), performing far better than the initial parameters, which only reached 71.82%. To get better results, you could either re-run the optimizers with a higher number of training epochs or take the resulting parameters from any of them and train a model for more epochs. &lt;/p&gt;

&lt;p&gt;I would advise that you consider the time and resources it takes to complete one epoch before deciding to increase the number of epochs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What if?
&lt;/h3&gt;

&lt;p&gt;What if we decided to tune all the parameters we defined in our &lt;code&gt;model_builder&lt;/code&gt;, as well as the number of layers? Below is what the hypermodel would look like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def model_builder(hp):

    i= hp.Int("Layers", 3,8, step=1)
    dropout = hp.Float("Drop_out", 0.1, 0.7, step=0.2)
    activation = hp.Choice('activation', values=['relu', 'tanh'])
    l2_reg = hp.Float('l2 Regularization', 1e-4, 1e-1, sampling='log')
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])


    model = keras.Sequential([keras.layers.Flatten()])

    for j in range(i):
        if j &amp;lt; i//3:
            model.add(keras.layers.Dense(
                units=hp.Int(f'units_{j}', 512, 2048, step=64),
                activation=activation, kernel_regularizer=keras.regularizers.l2(l2_reg)))
        elif i//3 &amp;lt;= j &amp;lt; 2*(i//3):
            model.add(keras.layers.Dense(
                units=hp.Int(f'units_{j}', 128, 1024, step=64),
                activation=activation, kernel_regularizer=keras.regularizers.l2(l2_reg)))
        else:
            model.add(keras.layers.Dense(
                units=hp.Int(f'units_{j}', 32, 768, step=64),
                activation=activation, kernel_regularizer=keras.regularizers.l2(l2_reg)))
    model.add(keras.layers.Dense(10, activation='softmax'))

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    metrics=['accuracy'])

    return model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, what if we used a CNN instead? Will it perform better as the literature says? To sate our curiosity, let us find out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def cnn_model_builder(hp):

    conv2d_1 = hp.Int('Convd_1', 32, 128, step=32)
    conv2d_2_3 = hp.Int('Convd_', 64, 256, step=32)

    dense = hp.Int('Dense 1', 64, 512, step=64)
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])


    model = keras.models.Sequential()
    model.add(keras.layers.Conv2D( 32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(64, activation='relu'))
    model.add(keras.layers.Dense(10, activation='softmax'))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                        metrics=['accuracy'])
    return model

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuner = kt.tuners.bayesian.BayesianOptimization(cnn_model_builder,
                                                objective='val_accuracy',
                                                max_trials=15,
                                                directory='.',
                                                project_name='bayesian_cnn',
                                                overwrite=True
                                              )

stop_early = keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

tuner.search(X_train.reshape(48000,28,28,1), y_train, 
            validation_data=(X_valid.reshape(12000,28,28,1), y_valid),
            epochs=10,
            callbacks=[stop_early],
            )

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print("The hyperparameter search is complete.")
best_hps.get_config()['values']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The hyperparameter search is complete.
Hyperparameters
Convd_1: 128
Convd_: 256
Dense 1: 64
learning_rate: 0.0001
Score: 0.9011666774749756
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CNN model, as expected, performs better than all the multi-layer perceptrons, with an accuracy of 90.12%, consistent with what is reported in the literature.&lt;/p&gt;

&lt;p&gt;Optimizing deep learning models can be very time-consuming and draining, but I believe that by applying the approaches discussed here, the process can be made simpler. Leave a comment if you found this useful or interesting.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.automl.org/blog-2nd-automl-challenge/" rel="noopener noreferrer"&gt;Biedenkapp A, et al. (2018). World champions in AutoML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jmlr.org/papers/v18/16-558.html" rel="noopener noreferrer"&gt;Li L, et al. (2018). Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mlconf.com/blog/lets-talk-bayesian-optimization/" rel="noopener noreferrer"&gt;MLConf (2018). Let’s Talk Bayesian Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://keras.io/keras_tuner/" rel="noopener noreferrer"&gt;Tensorflow (2021). KerasTuner.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/zalandoresearch/fashion-mnist/blob/master/README.md" rel="noopener noreferrer"&gt;Xiao H, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>algorithms</category>
      <category>machinelearning</category>
      <category>optimization</category>
    </item>
    <item>
      <title>What is the difference between the way Essentia and Librosa generate MFCCs?</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Sat, 03 Jul 2021 13:22:15 +0000</pubDate>
      <link>https://dev.to/enutrof/what-is-the-difference-between-the-way-essentia-and-librosa-generate-mfccs-13n3</link>
      <guid>https://dev.to/enutrof/what-is-the-difference-between-the-way-essentia-and-librosa-generate-mfccs-13n3</guid>
      <description>&lt;p&gt;I have been working on a music genre classification project for some time now and from the literature, I figured that MFCCs are the best features to start with. Though there are various libraries that implement the feature extraction, my focus has been on &lt;code&gt;librosa&lt;/code&gt; and &lt;code&gt;essentia&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt;&lt;br&gt;
This is not a piece that aims to answer the question but merely shed more light on why it is being asked and get responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  MFCC
&lt;/h2&gt;

&lt;p&gt;MFCC stands for Mel-Frequency Cepstral Coefficient, a fundamental audio feature. MFCC computation uses the Mel scale to divide the frequency band into sub-bands and then extracts cepstral coefficients using the Discrete Cosine Transform (DCT). The Mel scale is based on the way humans distinguish between frequencies, which makes it very convenient for processing sound.&lt;/p&gt;

&lt;p&gt;It is a scale of pitches judged by listeners to be equally spaced from one another. Because of how humans perceive sound, the Mel scale is non-linear, and the distances between pitches increase with frequency.&lt;/p&gt;
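&lt;p&gt;One widely used version of this mapping is the HTK formula, &lt;code&gt;m = 2595 * log10(1 + f / 700)&lt;/code&gt;. A small sketch shows that equally spaced bands on the Mel scale cover increasingly wide ranges on the frequency axis:&lt;/p&gt;

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (HTK formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Four equal-width bands on the mel scale between 0 Hz and 8 kHz...
mel_edges = [i * hz_to_mel(8000.0) / 4 for i in range(5)]
# ...become progressively wider bands on the frequency axis.
hz_edges = [mel_to_hz(m) for m in mel_edges]
band_widths = [b - a for a, b in zip(hz_edges, hz_edges[1:])]
```

This widening of the bands is exactly the non-linearity described above: equal perceptual steps correspond to ever-larger steps in raw frequency.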

&lt;h3&gt;
  
  
  LIBROSA
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;librosa&lt;/code&gt; is a Python library for audio feature extraction and processing. &lt;code&gt;librosa.feature.mfcc&lt;/code&gt; simplifies the process of obtaining MFCCs by providing arguments to set the number of frames, hop length, number of MFCCs and so on. Based on the arguments that are set, a 2D array is returned.&lt;/p&gt;

&lt;h3&gt;
  
  
  ESSENTIA
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;essentia&lt;/code&gt; is a full-featured workflow environment for high- and low-level features, facilitating audio input, preprocessing and statistical analysis of output. It is written in C++ with Python bindings and exports data in YAML or JSON format.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;essentia.standard.MFCC&lt;/code&gt; function has a parameter to fix the number of coefficients in the MFCC but processes the entire file in one pass, returning a 1D array. The library, however, also has a &lt;code&gt;FrameGenerator&lt;/code&gt; method that takes in other parameters, which can make it yield results similar to those of &lt;code&gt;librosa&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making Essentia's MFCCs like Librosa
&lt;/h3&gt;

&lt;p&gt;I used the &lt;code&gt;FrameGenerator&lt;/code&gt; method to set other parameters like the hop length, number of frames and number of MFCCs to be the same as those used with librosa. Also, the sample rate and windowing type were modified to be the same for both libraries.&lt;br&gt;
I then used both functions to generate MFCCs of the same shape for 20 tracks. Two of these are visualized below.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7IwHOcH8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inqwzch7knk0ka4stgjn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7IwHOcH8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/inqwzch7knk0ka4stgjn.jpg" alt="MFCC of Song"&gt;&lt;/a&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L0dWErBJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4gp0jixma5c37lepm5p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L0dWErBJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4gp0jixma5c37lepm5p.jpg" alt="MFCC of Song"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ti_wkVAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sgamebtx4yy604yqf7z1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ti_wkVAG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sgamebtx4yy604yqf7z1.jpg" alt="MFCC of Song"&gt;&lt;/a&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sdT8Wvaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iv61gb0yq6ko3rmby87e.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sdT8Wvaz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iv61gb0yq6ko3rmby87e.jpg" alt="MFCC of Song"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My observation was that even with this modification, &lt;code&gt;essentia&lt;/code&gt; was still about twice as fast as &lt;code&gt;librosa&lt;/code&gt; (speed was the primary metric I wanted to compare). However, I also noticed something else: the MFCCs did not look the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  How different are the MFCCs from Librosa and Essentia?
&lt;/h3&gt;

&lt;p&gt;Upon seeing the visual difference between them, I computed the cosine similarity between the two sets of MFCCs to quantify it. For the two tracks displayed, the similarities were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Africa Yako:&lt;/strong&gt; &lt;code&gt;0.9019551277160645&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;So To Where:&lt;/strong&gt; &lt;code&gt;0.9127510786056519&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Generally, the similarities ranged between &lt;code&gt;0.90&lt;/code&gt; and &lt;code&gt;0.94&lt;/code&gt;. &lt;/p&gt;
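&lt;p&gt;For reference, the cosine similarity used for this comparison can be computed by flattening both MFCC matrices into vectors and taking their normalised dot product. A sketch with synthetic stand-in arrays (the real comparison used the extracted MFCCs themselves):&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two equally shaped arrays."""
    a, b = np.ravel(a), np.ravel(b)  # flatten 2D MFCC matrices to vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-ins for two MFCC matrices of the same shape.
rng = np.random.default_rng(0)
mfcc_a = rng.normal(size=(13, 100))
mfcc_b = mfcc_a + 0.3 * rng.normal(size=(13, 100))  # a perturbed copy
similarity = cosine_similarity(mfcc_a, mfcc_b)
```

A value of 1.0 means the two matrices point in exactly the same direction; values around 0.90-0.94, as observed here, indicate a consistent but non-trivial difference.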

&lt;p&gt;If you know the reason for this difference between the MFCCs or perhaps can identify a parameter that I am not considering, please do not hesitate to drop a comment. Thanks.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/ilyamich/mfcc-implementation-and-tutorial"&gt;MFCC implementation and tutorial&lt;/a&gt;&lt;br&gt;
&lt;a href="http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/"&gt;Practical Cryptography&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>discuss</category>
      <category>audio</category>
    </item>
    <item>
      <title>Implementing the K Nearest Neighbors algorithm from scratch in Python</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Tue, 02 Mar 2021 23:02:02 +0000</pubDate>
      <link>https://dev.to/enutrof/implementing-the-k-nearest-neighbors-algorithm-from-scratch-in-python-1b42</link>
      <guid>https://dev.to/enutrof/implementing-the-k-nearest-neighbors-algorithm-from-scratch-in-python-1b42</guid>
      <description>&lt;p&gt;Machine learning to most is a black box technique. This means that the inputs and outputs are known but the process is not fully understood. However, I think that for most classical algorithms this should not be the case. This piece aims to help you learn to implement the K Nearest Neighbor algorithm in Python.&lt;/p&gt;

&lt;p&gt;The principle behind nearest neighbor methods, in general, is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples, in this case, is a user-defined constant.&lt;/p&gt;

&lt;p&gt;Despite its simplicity, nearest neighbor methods have been successful in a large number of classification and regression problems, including handwritten digits and satellite image scenes.&lt;/p&gt;

&lt;p&gt;The optimal choice of the value of k is data-dependent: in general, a larger value suppresses the effects of noise but makes the classification boundaries less distinct. Enough of the talking, let's begin...&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from scipy.spatial.distance import euclidean
from sklearn.base import BaseEstimator

class KNNBase(BaseEstimator):
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Firstly, we import the relevant modules: &lt;code&gt;numpy&lt;/code&gt;, &lt;code&gt;euclidean&lt;/code&gt; and &lt;code&gt;BaseEstimator&lt;/code&gt;. &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html"&gt;euclidean&lt;/a&gt; computes the Euclidean distance between two points, determining how distance is calculated, while &lt;code&gt;BaseEstimator&lt;/code&gt; is the base class for all estimators in sklearn.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;KNNBase&lt;/code&gt; class thus takes this class as its parent and inherits its methods. This then becomes the base class for the K Nearest Neighbors classifier and regressor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class KNNBase(BaseEstimator):
    def __init__(self, n_neighbors = 5, distance_func = euclidean):
        self.n_neighbours = n_neighbors
        self.distance_func = euclidean

    def _fit(self, X, Y):
      self.X = X
      self.Y = Y
      return self
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;KNNBase&lt;/code&gt; class is initialized in the &lt;code&gt;__init__&lt;/code&gt; method with two parameters: the number of neighbors and the distance function. The former determines how many neighbors contribute to the prediction for a particular data point, while the latter determines how the distance is calculated.&lt;/p&gt;

&lt;p&gt;The default values are taken as 5 for &lt;code&gt;n_neighbors&lt;/code&gt; and &lt;code&gt;euclidean&lt;/code&gt; for &lt;code&gt;distance_func&lt;/code&gt; (any function from &lt;code&gt;scipy.spatial.distance&lt;/code&gt; will do for &lt;code&gt;distance_func&lt;/code&gt;).&lt;/p&gt;
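&lt;p&gt;For instance, &lt;code&gt;euclidean&lt;/code&gt; takes two coordinate sequences, and any other &lt;code&gt;scipy.spatial.distance&lt;/code&gt; function with the same signature (such as &lt;code&gt;cityblock&lt;/code&gt;) could be passed in its place:&lt;/p&gt;

```python
from scipy.spatial.distance import cityblock, euclidean

# Distances between the points (0, 0) and (3, 4).
d_euclidean = euclidean([0, 0], [3, 4])  # sqrt(3**2 + 4**2)
d_cityblock = cityblock([0, 0], [3, 4])  # |3 - 0| + |4 - 0|
```

Swapping the metric changes which training points count as "nearest" without touching the rest of the class.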

&lt;p&gt;The &lt;code&gt;_fit&lt;/code&gt; method takes in the training features &lt;code&gt;X&lt;/code&gt; and targets &lt;code&gt;Y&lt;/code&gt; and stores them as instance variables. It then returns the now updated instance via &lt;code&gt;self&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    def _vote(self, neighbors_targets):
        raise NotImplementedError()

    def _predict(self, X_test):
        Y_pred = np.empty(X_test.shape[0])
        for index, entry in enumerate(X_test):
          distances = [self.distance_func(i, entry) for i in self.X]
          neighbors = np.argsort(distances)[: self.n_neighbors]

          neighbors_prediction = [self.Y[j] for j in neighbors]

          Y_pred[index] = self._vote(neighbors_prediction)
        return Y_pred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;_vote&lt;/code&gt; method is a helper that is not implemented in the &lt;code&gt;KNNBase&lt;/code&gt; class, since it differs for the classifier and the regressor. Its purpose is to return a single prediction for a data point, given the targets of its neighbors.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;_predict&lt;/code&gt; method is the powerhouse of the algorithm, and it takes in the data points &lt;code&gt;X_test&lt;/code&gt; whose targets are to be predicted. The first thing we do is create an empty array with the same number of elements as there are data points in &lt;code&gt;X_test&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Next, we loop through &lt;code&gt;X_test&lt;/code&gt;, and for each data point, the distances from every training example are calculated with the distance function.&lt;/p&gt;

&lt;p&gt;These distances are then sorted with &lt;code&gt;np.argsort&lt;/code&gt;, which returns the indices that would sort the array rather than the sorted values themselves. This is useful because we need the indices to look up the corresponding targets of the nearest neighbors. &lt;/p&gt;
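&lt;p&gt;A quick illustration of how &lt;code&gt;np.argsort&lt;/code&gt; returns indices rather than values:&lt;/p&gt;

```python
import numpy as np

distances = np.array([2.5, 0.3, 1.7, 0.9])
order = np.argsort(distances)  # indices of the training points, nearest first
nearest_two = order[:2]        # indices of the 2 nearest neighbours
```

Here the smallest distance (0.3) sits at index 1, so index 1 comes first in the result, and slicing the first k entries gives the k nearest neighbours directly.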

&lt;p&gt;The first k entries are sliced from the sorted indices, and the targets at those positions are stored in &lt;code&gt;neighbors_prediction&lt;/code&gt;. A vote is made on these values by calling the &lt;code&gt;self._vote&lt;/code&gt; method, and the result fills the corresponding slot of the once-empty array &lt;code&gt;Y_pred&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The complete &lt;code&gt;KNNBase&lt;/code&gt; class is thus:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class KNNBase(BaseEstimator):
  def __init__(self, n_neighbors = 5, distance_func = euclidean):
    self.n_neighbors = n_neighbors
    self.distance_func = distance_func

  def fit(self, X, Y):
    self.X = X
    self.Y = Y
    return self

  def _vote(self, neighbors_targets):
    raise NotImplementedError()

  def predict(self, X_test):
    """
    Predict the targets of the test data.
    """

    Y_pred = np.empty(X_test.shape[0])
    for index, entry in enumerate(X_test):
      distances = [self.distance_func(i,entry) for i in self.X]
      neighbors = np.argsort(distances)[:self.n_neighbors]

      neighbors_prediction = [self.Y[j] for j in neighbors]

      Y_pred[index] = self._vote(neighbors_prediction)
    return Y_pred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;The Classifier&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class KNNClassifier(KNNBase):

  def _vote(self, neighbors_target):
    count_ = np.bincount(neighbors_target)
    return np.argmax(count_)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class inherits from the &lt;code&gt;KNNBase&lt;/code&gt; class and implements the &lt;code&gt;_vote&lt;/code&gt; method. In the &lt;code&gt;_vote&lt;/code&gt; method, we take in the &lt;code&gt;neighbors_target&lt;/code&gt; list and use &lt;code&gt;np.bincount&lt;/code&gt; to count the occurrences of each class in the list. Then we return the label with the highest count using &lt;code&gt;np.argmax&lt;/code&gt;. If there is a tie for the most common label among the neighbors, &lt;code&gt;np.argmax&lt;/code&gt; returns the first, i.e. smallest, of the tied labels.&lt;/p&gt;
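&lt;p&gt;For example, with neighbour labels &lt;code&gt;[2, 1, 2, 0, 2]&lt;/code&gt;, &lt;code&gt;np.bincount&lt;/code&gt; tallies one 0, one 1 and three 2s, and &lt;code&gt;np.argmax&lt;/code&gt; picks the label with the highest tally:&lt;/p&gt;

```python
import numpy as np

neighbors_target = [2, 1, 2, 0, 2]
counts = np.bincount(neighbors_target)  # counts[label] = number of occurrences
predicted = np.argmax(counts)           # the most common label wins the vote
```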

&lt;h2&gt;
  
  
  &lt;strong&gt;The Regressor&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class KNNRegressor(KNNBase):
  def _vote(self, neighbors_target):
    return np.mean(neighbors_target)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class also inherits from the &lt;code&gt;KNNBase&lt;/code&gt; class and implements the &lt;code&gt;_vote&lt;/code&gt; method. Here, however, &lt;code&gt;_vote&lt;/code&gt; takes in the &lt;code&gt;neighbors_target&lt;/code&gt; list and uses &lt;code&gt;np.mean&lt;/code&gt; to find and return the mean of its values.&lt;br&gt;
To be sure that our algorithm works, we will compare its results on the Boston housing dataset with those of the sklearn implementation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.datasets import load_boston
from sklearn.neighbors import KNeighborsRegressor
X, y = load_boston(return_X_y = True)
model = KNNRegressor().fit(X,y)
pred= model.predict(X[:10])
print("For the first 10 data points,")
print(f"Our implementation gives: {pred}")

model1= KNeighborsRegressor().fit(X,y)
pred1= model1.predict(X[:10])
print(f"The Sklearn Implementation gives: {pred1}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;For the first 10 data points,
Our implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]
The Sklearn Implementation gives: [21.78 22.9  25.36 26.06 27.1  27.1  20.88 19.1  18.4  19.48]


&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have now successfully implemented the K Nearest Neighbors algorithm. Feel free to test it on any other dataset.&lt;/p&gt;

&lt;p&gt;I hope you enjoyed reading this. If you have any questions, you can contact me on Twitter &lt;a href="https://www.twitter.com/fortsadek"&gt;@phortz&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://scikit-learn.org/stable/modules/neighbors.html"&gt;Sklearn Neighbours documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://experiencelife.com/article/how-to-cope-with-a-bad-neighbor/"&gt;Display Image Source&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
      <category>algorithms</category>
      <category>python</category>
    </item>
    <item>
      <title>Pipelines in ML: A guide to developing good workflows</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Sun, 14 Feb 2021 13:02:58 +0000</pubDate>
      <link>https://dev.to/enutrof/pipelines-in-ml-a-guide-to-developing-good-workflows-1p8j</link>
      <guid>https://dev.to/enutrof/pipelines-in-ml-a-guide-to-developing-good-workflows-1p8j</guid>
      <description>&lt;p&gt;When I got started with machine learning, I did not consider pipelines to be very important. They just seemed like one of those things some people used but isn't really essential. I was creating reasonably accurate models in my notebooks and it felt alright.&lt;/p&gt;

&lt;p&gt;Eventually, I began to focus more on the later part of a model's life, deployment and use in production, and the importance of a structured workflow became glaring.&lt;/p&gt;

&lt;p&gt;In this article, I will walk us through a machine learning problem with the preprocessing handled in a pipeline. You may be familiar with using Kubeflow and Docker on your cloud platform of choice to do this, but this article will make use of the good ol' Sklearn. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We will focus on just the implementation of various preprocessing and feature extraction steps with transforms. Little to no attention will be given to detailed data exploration, visualization and model accuracy. Check out &lt;a href="https://github.com/TEAM-PORTFOLIO/Insurance-Claim-Prediction/blob/master/Insurance_Prediction"&gt;this link&lt;/a&gt; to see a version of the notebook that focused on that.&lt;/li&gt;
&lt;li&gt;If any of the packages used is not installed in your environment make sure to install with: &lt;code&gt;pip install [package_name]&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://github.com/Fortune-Adekogbe/pipelines-in-ml/blob/main/train_data.csv"&gt;data&lt;/a&gt; used in this article comes from a Data Science Nigeria 2019 challenge, and the aim is to build a predictive model to determine whether a building will have an insurance claim during a certain period based on its characteristics. The target variable, Claim, is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 if the building has at least one claim over the insured period.&lt;/li&gt;
&lt;li&gt;0 if the building does not have a claim over the insured period.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Making Relevant Imports
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np

train_df = pd.read_csv('train_data.csv')

Claim = train_df.pop('Claim')
train_df.pop('Customer Id');
train_df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, we import &lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;numpy&lt;/code&gt; to aid our data wrangling, and then we read in the training data using the &lt;code&gt;read_csv&lt;/code&gt; function from pandas. We then use the &lt;code&gt;pop&lt;/code&gt; method of DataFrames to remove the customer ID column (since the IDs are unique and contain no useful information). The target column is also removed and stored in a variable named &lt;code&gt;Claim&lt;/code&gt; for later use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
X_train, X_valid, y_train,y_valid = train_test_split(train_df,Claim,test_size=0.25,
                                                   stratify=Claim,random_state=50
                                              )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The training data is split into training and validation sets using the &lt;code&gt;train_test_split&lt;/code&gt; function in &lt;code&gt;sklearn&lt;/code&gt;. A test size of 25% is set alongside a &lt;code&gt;random_state&lt;/code&gt; for reproducibility. The target is passed to the &lt;code&gt;stratify&lt;/code&gt; argument to ensure that the proportion of 0s and 1s in the original data is preserved in both splits.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Transformers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;scikit-learn provides a library of transformers, which may clean, reduce, expand or generate feature representations. In this tutorial, however, we will be building our own custom transformers.&lt;/p&gt;

&lt;p&gt;Like other estimators, these are represented by classes with a &lt;code&gt;fit&lt;/code&gt; method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a &lt;code&gt;transform&lt;/code&gt; method which applies this transformation model to unseen data. The &lt;code&gt;fit_transform&lt;/code&gt; method may be more convenient and efficient for modelling and transforming the training data simultaneously and so we will be using it.&lt;/p&gt;
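&lt;p&gt;As a quick illustration of this API (a minimal sketch using scikit-learn's built-in &lt;code&gt;StandardScaler&lt;/code&gt;, which is not part of our pipeline), &lt;code&gt;fit&lt;/code&gt; learns the column statistics, &lt;code&gt;transform&lt;/code&gt; applies them, and &lt;code&gt;fit_transform&lt;/code&gt; does both in one call:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# fit learns the column mean and standard deviation; transform applies them
scaler = StandardScaler()
scaled_a = scaler.fit(X).transform(X)

# fit_transform performs both steps in a single call
scaled_b = StandardScaler().fit_transform(X)
```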

&lt;h3&gt;
  
  
  &lt;strong&gt;The Empty Transformer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I have found it useful to have a base class to build from whenever a new transformer needs to be created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.base import BaseEstimator, TransformerMixin

class Transformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        ...

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The class above inherits from two other classes namely &lt;code&gt;BaseEstimator&lt;/code&gt; and &lt;code&gt;TransformerMixin&lt;/code&gt;. &lt;code&gt;BaseEstimator&lt;/code&gt; is the base class for all estimators in scikit-learn. The &lt;code&gt;TransformerMixin&lt;/code&gt; is a class that defines and implements the &lt;code&gt;fit_transform&lt;/code&gt; method (this fits to data, then transforms it).&lt;/p&gt;

&lt;p&gt;Its &lt;code&gt;__init__&lt;/code&gt; method is empty, its &lt;code&gt;fit&lt;/code&gt; method returns the instance, and its &lt;code&gt;transform&lt;/code&gt; method returns its input unchanged.&lt;/p&gt;

&lt;p&gt;To create a custom transformer, at least one of these functions must be modified. We will explore a couple of transformers for the stated problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class NanFillerTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column, normalize_ = False):
        self.column = column
        self.normalize_ = normalize_

    def fit(self, X, y=None):
        return self

    def replace_nan(self, X,feature_name,new_values,weights):
        assert len(new_values)==len(weights),'New values do not correspond with weights'
        from random import choices
        mask= X[feature_name].isna()
        length = sum(mask)
        replacement = choices(new_values,weights =weights,k=length)
        X.loc[mask,feature_name]=replacement
        return X[feature_name]

    def transform(self, X):
        x = X[self.column].value_counts(normalize=True)
        X[self.column] = self.replace_nan(X,self.column,x.keys(),x.values)
        if self.normalize_: X[self.column] = X[self.column]/X[self.column].max()
        return X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This transformer replaces missing values in a column with values sampled from the column's observed distribution. The "brain" of the transformer is the &lt;code&gt;replace_nan&lt;/code&gt; function, which is used in the &lt;code&gt;transform&lt;/code&gt; method such that the candidate replacement values are the unique values in the column and the sampling weights are their normalized value counts.&lt;/p&gt;

&lt;p&gt;The transformer takes two arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;column&lt;/code&gt;: the label of the column to be filled as a string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;normalize_&lt;/code&gt;: a boolean that determines whether the values in the column are normalized after the missing values are replaced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since this transformer does not learn any parameters, the &lt;code&gt;fit&lt;/code&gt; method stays unchanged.&lt;br&gt;
&lt;/p&gt;
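&lt;p&gt;To see the idea in isolation, here is a minimal sketch of the same weighted-sampling trick on a toy column (the data and column name are made up for illustration):&lt;/p&gt;

```python
from random import choices, seed

import numpy as np
import pandas as pd

seed(0)  # reproducibility

# A toy column with missing values; the data is made up for illustration
df = pd.DataFrame({"Garden": ["V", "O", "V", np.nan, "V", np.nan]})

# Weights are the normalized value counts of the observed entries (V: 0.75, O: 0.25)
probs = df["Garden"].value_counts(normalize=True)
mask = df["Garden"].isna()

# Sample replacements in proportion to the observed distribution
df.loc[mask, "Garden"] = choices(list(probs.index), weights=probs.values, k=int(mask.sum()))
```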

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class EncoderTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,show_map = False):
        self.show_map = show_map

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['NoWindows'] = X['NumberOfWindows'].map(lambda x: 1 if x == '   .' else 0)
        X['3-7Windows'] = X['NumberOfWindows'].map(lambda x: 1 if x in '34567' else 0)
        X['Other_Windows'] = X['NumberOfWindows'].map(lambda x: 1 if x in '1 2 8 9 &amp;gt;=10'.split(' ') else 0)
        X['Sectioned_Insured_Period'] = X['Insured_Period'].map(lambda x: 1 if x==1 else 0)
        X['Grouped_Date_of_Occupancy'] = X['Date_of_Occupancy'].map(lambda x: 1 if x&amp;gt;1900 else 0)
        return X

class YearEncoderTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, show_map = False):
        self.show_map = show_map

    def fit(self, X, y=None):
        return self

    def map_counts(self, X, feat):
        mapp = X[feat].value_counts(normalize=True)
        X[feat] = X[feat].map(lambda x: mapp[x])
        return mapp

    def transform(self, X):
        X['2012-13YearOfObs'] = X['YearOfObservation'].map(lambda x: 1 if x in [2012,2013] else 0)
        X['2014YearOfObs'] = X['YearOfObservation'].map(lambda x: 1 if x in [2014] else 0)
        X['2015-16YearOfObs'] = X['YearOfObservation'].map(lambda x: 1 if x in [2015,2016] else 0)
        X['YearOfObservation'] = X['YearOfObservation'].map(lambda x: 2021 - x)
        if self.show_map:
            self.map_counts(X, 'YearOfObservation')
        return X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;EncoderTransformer&lt;/code&gt; and &lt;code&gt;YearEncoderTransformer&lt;/code&gt; carry out operations that are specific to this problem. The operations in their &lt;code&gt;transform&lt;/code&gt; methods were chosen by observing the data distribution in the modified columns.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;show_map&lt;/code&gt; attribute of the &lt;code&gt;YearEncoderTransformer&lt;/code&gt; determines whether the &lt;code&gt;map_counts&lt;/code&gt; function is run. This function replaces each value in the column labelled &lt;code&gt;feat&lt;/code&gt; with its normalized value count.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class FeatureCombiningTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns, drop_any=[]):
        self.columns = columns
        self.drop_any = drop_any

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        suffix = ''.join([i[0] for i in self.columns])
        X[f'Combined_{suffix}'] = X[self.columns].sum(axis=1)
        for j in self.drop_any:
          X.pop(self.columns[j])
          print(f"&amp;gt;&amp;gt;&amp;gt; Removed {self.columns[j]} from dataframe")
        return X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;FeatureCombiningTransformer&lt;/code&gt; sums the values across a subset of columns for each row. The class is instantiated with the columns whose values are to be combined. The &lt;code&gt;drop_any&lt;/code&gt; attribute contains a list of indexes of the original columns that are to be dropped afterwards. The &lt;code&gt;suffix&lt;/code&gt; variable, built from the first letter of each column name, keeps the new column's name unique.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class DummyFeaturesTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column = None):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.column:   
            X = pd.get_dummies(X,columns=[self.column])
        else:
            X = pd.get_dummies(X)
        return X


class NormalizedFrequencyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column, replace_original= True):
        self.column = column
        self.replace_original = replace_original

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        mapper = X[self.column].value_counts(normalize=True)
        if self.replace_original:
          X[self.column] = X[self.column].map(lambda x:mapper[x])    
        else:
          X[f"Coded_{self.column}"] = X[self.column].map(lambda x:mapper[x])    
        return X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;DummyFeaturesTransformer&lt;/code&gt; creates a pandas DataFrame containing dummy (one-hot) columns made from the categorical data in the specified column, or in the entire DataFrame if no column is given.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;NormalizedFrequencyTransformer&lt;/code&gt; replaces (or, with &lt;code&gt;replace_original = False&lt;/code&gt;, augments) each value in a column with its normalized value count.&lt;/p&gt;
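&lt;p&gt;Both operations boil down to one-liners in pandas; here is a minimal sketch on a toy &lt;code&gt;Settlement&lt;/code&gt; column (the values are made up for illustration):&lt;/p&gt;

```python
import pandas as pd

# Toy categorical column; the values are made up for illustration
df = pd.DataFrame({"Settlement": ["U", "R", "U", "U"]})

# One-hot dummies for a single column, as DummyFeaturesTransformer does
dummies = pd.get_dummies(df, columns=["Settlement"])

# Frequency encoding: replace each category with its normalized value count
freq = df["Settlement"].value_counts(normalize=True)
coded = df["Settlement"].map(freq)
```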

&lt;h2&gt;
  
  
  &lt;strong&gt;The Pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Things get a little more interesting here. We have created a number of transformers but how do they get used together? How is the pipeline formed? Well, it starts with an &lt;code&gt;import&lt;/code&gt; statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.pipeline import Pipeline

data_pipeline = Pipeline([
        ('et',EncoderTransformer()),
        ('yet', YearEncoderTransformer()),
        ('fct',FeatureCombiningTransformer(['Garden', 'Building_Fenced', 'Settlement'], [0])),
        ('nanft1', NanFillerTransformer('Building Dimension', normalize_ = True)),
        ('nanft2', NanFillerTransformer('Date_of_Occupancy')),
        ('normft1', NormalizedFrequencyTransformer("Date_of_Occupancy")),
        ('nanft3', NanFillerTransformer('Geo_Code')),
        ('normft2', NormalizedFrequencyTransformer("Geo_Code")),
        ('normft3', NormalizedFrequencyTransformer("Insured_Period", replace_original = False)),
        ('normft4', NormalizedFrequencyTransformer("YearOfObservation", replace_original = False)),
        ('dft', DummyFeaturesTransformer()),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we create a pipeline object with the &lt;code&gt;Pipeline&lt;/code&gt; class from &lt;code&gt;sklearn&lt;/code&gt;. The same pipeline is used for the training and validation data. The order of the transformers was predetermined, but by paying attention to each transformer and its parameters, you should be able to follow how the steps fit together.&lt;/p&gt;

&lt;p&gt;Note that the different transformers are included as tuples whose first element is a string naming the step. This naming can be avoided with the &lt;code&gt;sklearn.pipeline.make_pipeline&lt;/code&gt; function, which generates the names automatically, but we will not be using it here.&lt;br&gt;
&lt;/p&gt;
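&lt;p&gt;For completeness, here is a small sketch of how &lt;code&gt;make_pipeline&lt;/code&gt; names the steps for you, using two built-in scikit-learn transformers as stand-ins:&lt;/p&gt;

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
names = [name for name, _ in pipe.steps]
```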

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.preprocessing import LabelEncoder

class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X[self.column]

class Encoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        ...

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        le = LabelEncoder()
        return le.fit_transform(X.values).reshape(-1,1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we wrap things up, we create two final transformers. The first extracts a column from the DataFrame, while the second encodes its content with the &lt;code&gt;LabelEncoder&lt;/code&gt; class from &lt;code&gt;sklearn&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Unions
&lt;/h2&gt;

&lt;p&gt;The transformation above could have been written like the others and added to the main pipeline, but it lets us introduce a new concept: transformers can be combined both in series and in parallel. &lt;code&gt;Pipeline&lt;/code&gt; implements the series combination, while &lt;code&gt;FeatureUnion&lt;/code&gt; implements the parallel combination of transformers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.pipeline import FeatureUnion

Settlement_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['Settlement'])),
    ("le",Encoder())
])

Building_Fenced_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['Building_Fenced'])),
    ("le",Encoder())

])

Building_Painted_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['Building_Painted'])),
    ("le",Encoder())

])

categorical_features = FeatureUnion([
    ("Settlement_onehot", Settlement_onehot),
    ("Building_Fenced_onehot", Building_Fenced_onehot),
    ("Building_Painted_onehot", Building_Painted_onehot),
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code snippet above, we have implementations of both the series and parallel combinations, culminating in a combined transformer named &lt;code&gt;categorical_features&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data_transformer = FeatureUnion([
    ('features', data_pipeline),
    ('categorical', categorical_features),
])

train = data_transformer.fit_transform(X_train)
validate = data_transformer.fit_transform(X_valid)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we use the &lt;code&gt;FeatureUnion&lt;/code&gt; class we just learned about to combine the &lt;code&gt;categorical_features&lt;/code&gt; union with the &lt;code&gt;data_pipeline&lt;/code&gt; we created earlier.&lt;/p&gt;

&lt;p&gt;We then transform the training and validation data using the &lt;code&gt;fit_transform&lt;/code&gt; method. Since none of our custom transformers learns anything in &lt;code&gt;fit&lt;/code&gt;, &lt;code&gt;fit_transform&lt;/code&gt; on the validation set behaves the same as &lt;code&gt;transform&lt;/code&gt; here.&lt;/p&gt;
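&lt;p&gt;This is safe here only because our custom transformers learn nothing in &lt;code&gt;fit&lt;/code&gt;. With a stateful transformer, refitting on the validation data would use different statistics than the training data, as this small &lt;code&gt;StandardScaler&lt;/code&gt; sketch (not part of our pipeline) shows:&lt;/p&gt;

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0]])   # mean 5, std 5
X_valid = np.array([[5.0], [15.0]])

scaler = StandardScaler().fit(X_train)
reused = scaler.transform(X_valid)               # training statistics: [[0.], [2.]]
refit = StandardScaler().fit_transform(X_valid)  # refitted statistics: [[-1.], [1.]]
```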

&lt;h2&gt;
  
  
  &lt;strong&gt;Training&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In this tutorial, the CatBoost (categorical boosting) classifier is used with a learning rate of &lt;code&gt;0.01&lt;/code&gt; and &lt;code&gt;400&lt;/code&gt; estimators. &lt;code&gt;log_loss&lt;/code&gt; is used as the metric for comparing models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from catboost import CatBoostClassifier
from sklearn.metrics import log_loss

cbc1 = CatBoostClassifier(verbose=0,learning_rate = 0.01, n_estimators=400)
cbc1.fit(train, y_train)
print(f'Train score: {log_loss(y_train, cbc1.predict_proba(train))}')
print(f'Validation score: {log_loss(y_valid, cbc1.predict_proba(validate))}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Saving our Model&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The model is then saved using the &lt;code&gt;joblib&lt;/code&gt; library in Python as is shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import joblib
filename = 'Insurance_model.sav'
joblib.dump(cbc1, filename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
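&lt;p&gt;A saved model can later be restored with &lt;code&gt;joblib.load&lt;/code&gt;. Below is a minimal round-trip sketch; the &lt;code&gt;LogisticRegression&lt;/code&gt; model, toy data and filename are stand-ins for illustration:&lt;/p&gt;

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model and data as stand-ins; the filename is illustrative
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# joblib.dump accepts a filename directly; no need to open the file yourself
joblib.dump(model, "toy_model.sav")
restored = joblib.load("toy_model.sav")
```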



&lt;p&gt;In practice, I would advise finalizing the specific preprocessing steps before creating custom transformers and a pipeline.&lt;/p&gt;

&lt;p&gt;Thanks for reading through; I hope you enjoyed this and learnt from it. Check out &lt;a href="https://github.com/Fortune-Adekogbe/pipelines-in-ml"&gt;this repository&lt;/a&gt; for the data and script(s) related to this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Reference&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://scikit-learn.org/stable/data_transforms.html"&gt;SKlearn Data Transforms&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A Lightweight Path to Virtualisation</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Wed, 13 Jan 2021 22:36:00 +0000</pubDate>
      <link>https://dev.to/enutrof/a-lightweight-path-to-virtualisation-3hl1</link>
      <guid>https://dev.to/enutrof/a-lightweight-path-to-virtualisation-3hl1</guid>
      <description>&lt;p&gt;We do not always have all that we need. Regardless of what device you use, you may find that you need to set up an environment unlike the one your system was built on, whether to develop and test new applications, run old applications, and so on.&lt;/p&gt;

&lt;p&gt;You could &lt;a href="https://www.howtogeek.com/187789/dual-booting-explained-how-you-can-have-multiple-operating-systems-on-your-computer/" rel="noopener noreferrer"&gt;dual boot&lt;/a&gt; a system or set up a full-scale virtual machine with a &lt;a href="https://www.vmware.com/topics/glossary/content/hypervisor" rel="noopener noreferrer"&gt;hypervisor&lt;/a&gt;, but not everyone can afford the time or compute these require.&lt;/p&gt;

&lt;p&gt;As I have been in such a situation myself, I will be describing a relatively lightweight path to setting up a Linux virtual machine on either a Windows or macOS host machine. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Hypervisor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;hypervisor&lt;/strong&gt; or &lt;strong&gt;virtual machine monitor&lt;/strong&gt; is computer &lt;a href="https://en.wikipedia.org/wiki/Software" rel="noopener noreferrer"&gt;software&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Firmware" rel="noopener noreferrer"&gt;firmware&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Computer_hardware" rel="noopener noreferrer"&gt;hardware&lt;/a&gt; that creates and runs &lt;a href="https://en.wikipedia.org/wiki/Virtual_machine" rel="noopener noreferrer"&gt;virtual machines&lt;/a&gt;. The hypervisor presents each guest operating system (virtual machine) with a &lt;a href="https://en.wikipedia.org/wiki/Platform_virtualization" rel="noopener noreferrer"&gt;virtual operating platform&lt;/a&gt; and manages its execution. There are lots of hypervisors, but we will be using &lt;a href="https://www.virtualbox.org/" rel="noopener noreferrer"&gt;VirtualBox&lt;/a&gt; in this article. Click &lt;a href="https://www.oracle.com/virtualization/technologies/vm/downloads/virtualbox-downloads.html" rel="noopener noreferrer"&gt;this link&lt;/a&gt; to download it for your operating system. The installation process is quite direct; follow all recommendations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbyu4jl5pettu6g138ild.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbyu4jl5pettu6g138ild.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
When the screen below appears, click 'Yes' and continue until the installation completes, then click 'Finish'.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbd5i1kew3rvvnpzqjzyo.png"&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fx9t5gwepdiukwzkapk6a.png"&gt;&lt;/td&gt;
  &lt;/tr&gt;
 &lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Multipass&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;I stumbled upon &lt;a href="https://multipass.run/" rel="noopener noreferrer"&gt;Multipass&lt;/a&gt; whilst exploring the &lt;a href="https://ubuntu.com/" rel="noopener noreferrer"&gt;Ubuntu&lt;/a&gt; website. It was described as a mini cloud option that promised a faster set-up process for virtual machines, and that caught my attention.&lt;/p&gt;

&lt;p&gt;Click &lt;a href="https://multipass.run/download/windows" rel="noopener noreferrer"&gt;Windows&lt;/a&gt; or &lt;a href="https://multipass.run/download/macos" rel="noopener noreferrer"&gt;macOS&lt;/a&gt; to download multipass for your operating system. The installation process is pretty straight forward, just make sure to follow the default settings and recommendations.&lt;/p&gt;

&lt;p&gt;While setting up the installation on Windows, you will notice that it selects &lt;a href="https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/about/" rel="noopener noreferrer"&gt;Hyper-V&lt;/a&gt; as the default hypervisor if you use Windows Pro or Workstation. If you use the Home edition, however, &lt;a href="https://www.oracle.com/virtualization/technologies/vm/downloads/virtualbox-downloads.html" rel="noopener noreferrer"&gt;VirtualBox&lt;/a&gt; should be selected as the hypervisor, as Hyper-V will be unavailable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fphhmmigl60rblhmebtc4.png"&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4bd4prx45pddh3p4m328.png"&gt;&lt;/td&gt;
  &lt;/tr&gt;
 &lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With Multipass and VirtualBox installed, we make sure that VirtualBox support is enabled for Multipass as follows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If VirtualBox was selected over Hyper-V, in an admin terminal run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multipass set local.driver=virtualbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;macOS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In a terminal, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo multipass set local.driver=virtualbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this done, we can then create a virtual machine with the &lt;code&gt;multipass launch&lt;/code&gt; command. If we want a specific version of the OS, we need to include an optional image name; otherwise, the latest stable version of Ubuntu will be installed. &lt;br&gt;
To launch an instance of the Bionic version of Ubuntu (18.04) named Delta, with 1GB of memory and 5GB of disk space, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multipass launch -d 5G -m 1G -n Delta bionic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;-d&lt;/code&gt; specifies the disk space to allocate, &lt;code&gt;-m&lt;/code&gt; specifies the amount of memory to allocate and &lt;code&gt;-n&lt;/code&gt; specifies the name of the virtual machine. If the name is omitted, Multipass picks a random two-word name (joined by a hyphen). The names are amusing, so you could try out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multipass launch bionic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that done, you now have two virtual machines running on your system. Isn’t this so stress-free?&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;multipass list&lt;/code&gt; should show you all VMs and their status (running, stopped, deleted).&lt;/p&gt;

&lt;p&gt;You can delete the second VM if you have no use for it, re-run &lt;code&gt;multipass list&lt;/code&gt; to see its changed status, and reclaim the allocated disk space by executing the commands below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multipass delete [generated-name]
multipass list
multipass purge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, we have a VM named Delta running but have not used it.&lt;/p&gt;

&lt;p&gt;Multipass lets us run commands in the VM without entering it, using &lt;code&gt;multipass exec [name of vm] [command]&lt;/code&gt;, but that quickly becomes a bother, so we shall break into the shell with the command below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;multipass shell Delta
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loads for a bit then displays a Linux terminal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpaqv4roq652hl9o0yty5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpaqv4roq652hl9o0yty5.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now one would naturally expect that, since we have linked VirtualBox with Multipass, we should be able to view our VM instance in VirtualBox's interface. That, in fact, is what &lt;a href="https://multipass.run/docs/using-virtualbox-in-multipass-windows" rel="noopener noreferrer"&gt;this part of the docs&lt;/a&gt; directs us to do. It did not work out for me, however, but you can go ahead and try it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up a GUI for your virtual machine&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Now here we are with just a shell into our virtual machine. I found two ways of installing a GUI. If your machine runs the Windows Home edition, the “lightweight path” is the one to pick; otherwise, read through both options and use whichever fits your end goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A “not so lightweight” option&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this path, we install Xrdp, an open-source implementation of the Microsoft Remote Desktop Protocol (RDP) that allows us to graphically control a remote system. To install this, run the following sequence of commands in your opened shell.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
sudo apt install ubuntu-desktop xrdp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This might take a while depending on your internet connection speed so feel free to grab a cup of tea or whatever drink you like.&lt;/p&gt;

&lt;p&gt;Once this is done, you can then set a password for the default ubuntu user by running the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo passwd ubuntu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will then be asked to enter and confirm a password. This concludes the server-side setup.&lt;/p&gt;

&lt;p&gt;For the client, we can use the &lt;a href="https://support.microsoft.com/en-us/windows/how-to-use-remote-desktop-5fe128d5-8fb1-7a23-3b8a-41e636865e8c" rel="noopener noreferrer"&gt;Remote Desktop Connection&lt;/a&gt; application on Windows, or the &lt;a href="https://apps.apple.com/us/app/microsoft-remote-desktop/id1295203466?mt=12" rel="noopener noreferrer"&gt;Microsoft Remote Desktop&lt;/a&gt; application from the Mac App Store on macOS. There, we enter the virtual machine's IP address (which can be found by issuing the command &lt;code&gt;ip addr&lt;/code&gt; in the shell), set the session to Xorg, and enter the username and password we created in the previous step.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A lightweight option&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you only want Multipass to launch one or a couple of windows rather than a complete desktop, this is the right path. Here, we use the &lt;a href="https://www.techopedia.com/definition/10101/x-window-system" rel="noopener noreferrer"&gt;X window system&lt;/a&gt; to connect the applications running in the VM to your host machine, letting them draw on its display. Our aim is to show individual application windows, not a full desktop.&lt;/p&gt;

&lt;p&gt;To use X11 on Windows, install the X server, which can be found &lt;a href="https://sourceforge.net/projects/vcxsrv/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. The installation is straightforward; just make sure to follow the defaults and recommendations, if any.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F64xz9f2s07aicls5pvyn.png"&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0a3u9i98itohn8ajc1gl.png"&gt;&lt;/td&gt;
  &lt;/tr&gt;
 &lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Launch the newly installed X server through the XLaunch application on your desktop or start menu; a series of options will be displayed. Below is a walkthrough of the advised selections.&lt;/p&gt;

&lt;p&gt;Screen 0:&lt;/p&gt;

&lt;p&gt;Stick with Multiple windows and leave the display number as -1.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F56r5g8dgdj4hx3l8q0kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F56r5g8dgdj4hx3l8q0kk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Screen 1: After clicking Next, we see the client startup window, where we should select "Start no client". &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwwpfidvpxs9anslfd5zw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwwpfidvpxs9anslfd5zw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Screen 2: Next are the extra settings; make sure to check "Disable access control".&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi9njmfejskbzsef7c6ua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fi9njmfejskbzsef7c6ua.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Screen 3: Finally, we save the configuration on the next page and start the X server. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl5akmeq2pq7q7hq9a3p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fl5akmeq2pq7q7hq9a3p8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
An X server icon will appear in the taskbar. On Windows, you may be asked to allow the server to communicate on private networks; click "Allow access".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fytqoyt0eul6bxcd1qyjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fytqoyt0eul6bxcd1qyjf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To configure the Multipass instance Delta, we will need the host computer's IP address.&lt;/p&gt;

&lt;p&gt;To find it, open another terminal and run &lt;code&gt;ipconfig&lt;/code&gt;. Copy the IPv4 address of your active network adapter, then return to the Delta shell.&lt;/p&gt;

&lt;p&gt;Here, we set the DISPLAY environment variable to display 0.0 of the X server on the host, filling in the underscores with the IP address.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export DISPLAY=__.__.__.__:0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To test the setting, we will install and run a simple X11 program in the instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install x11-apps
xcalc &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A small window containing a scientific calculator will show up. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F594ddj702cr70zy40nmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F594ddj702cr70zy40nmi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Follow &lt;a href="https://launchpad.net/ubuntu/trusty/+package/x11-apps" rel="noopener noreferrer"&gt;this link&lt;/a&gt; to see a list of all the other X11 apps, and try some of them out with the command &lt;code&gt;[app-name] &amp;amp;&lt;/code&gt; as shown above.&lt;/p&gt;

&lt;p&gt;The instance comes with Python 3 installed, but if you also want Python 2, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get python-minimal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install and launch VSCode, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sudo snap --classic code
code &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Make sure to always set the DISPLAY environment variable before attempting to launch an application.&lt;/strong&gt;&lt;/p&gt;
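&lt;p&gt;Since the variable only lasts for the current shell session, one optional convenience (a sketch; the IP address below is a placeholder for your own host address from &lt;code&gt;ipconfig&lt;/code&gt;) is to persist it in the instance's &lt;code&gt;~/.bashrc&lt;/code&gt; so every new shell picks it up:&lt;/p&gt;

```shell
# Append the DISPLAY setting to ~/.bashrc so every new shell inherits it.
# 192.168.1.10 is a placeholder -- substitute your host's IPv4 address.
echo 'export DISPLAY=192.168.1.10:0.0' >> ~/.bashrc
```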

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvyt43xoft048q9cunr59.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvyt43xoft048q9cunr59.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this piece. Questions, comments and contributions are welcome.&lt;/p&gt;

&lt;p&gt;Reference:&lt;br&gt;
&lt;a href="https://discourse.ubuntu.com/t/stand-alone-windows-in-multipass/16340" rel="noopener noreferrer"&gt;Stand-alone windows in Multipass - Documentation - Ubuntu Community Hub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>ubuntu</category>
      <category>python</category>
    </item>
    <item>
      <title>The Art of doing nothing: Python Edition</title>
      <dc:creator>Fortune Adekogbe</dc:creator>
      <pubDate>Sat, 26 Dec 2020 20:57:34 +0000</pubDate>
      <link>https://dev.to/enutrof/the-art-of-doing-nothing-python-edition-6e3</link>
      <guid>https://dev.to/enutrof/the-art-of-doing-nothing-python-edition-6e3</guid>
      <description>&lt;p&gt;Writing programs can be stressful. As a developer, I have never been one to always pen down pseudo-codes before writing my first lines of code so it happens that at some point I do not know what to write. Also, when writing modular scripts, I define all classes/functions and their doc-strings at the start before completing them one after the other. &lt;/p&gt;

&lt;p&gt;This means I have had to find ways to keep empty classes or functions from raising errors when the script is run. I have discovered a couple of ways to do this, and I will share them in this article. Whether or not you work like me, I have no doubt you will find them interesting and useful.&lt;/p&gt;
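&lt;p&gt;As a quick illustration of that workflow (the names here are made up for the example), a docstring alone is a valid function body, so a skeleton like this already runs without errors:&lt;/p&gt;

```python
# A module skeleton: every function is declared with only its docstring,
# which Python accepts as a complete (empty) body.

def load_data(path):
    """Read the raw input file at `path`."""

def clean_data(rows):
    """Drop duplicates and fix column types."""

def main():
    """Wire the steps together."""

# Calling a docstring-only function simply returns None.
print(load_data("input.csv"))  # prints: None
```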

&lt;h2&gt;
  
  
  &lt;strong&gt;The package called nothing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The mere fact that a Python developer created this is quite fascinating. Here we have a package that contains &lt;a href="https://pypi.org/project/nothing/" rel="noopener noreferrer"&gt;nothing&lt;/a&gt; but a version number and a one-sentence description in its &lt;a href="https://stackoverflow.com/questions/448271/what-is-init-py-for#:~:text=The%20__init__.py%20file%20makes%20Python%20treat%20directories,the%20submodules%20to%20be%20exported." rel="noopener noreferrer"&gt;init&lt;/a&gt; file. In fact, this is the only description for this package on &lt;a href="https://github.com/Carreau/nothing" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and &lt;a href="https://pypi.org/nothing" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;, nothing else. Like every other package, it can be installed with pip via the command &lt;code&gt;pip install nothing&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once it is installed, we can &lt;code&gt;import nothing&lt;/code&gt; and then use it as a placeholder, as shown below.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fup5i4601sv578yvx8suw.png"&gt;&lt;/td&gt;
    &lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1gdfmosapve43isvozxm.png"&gt;&lt;/td&gt;
  &lt;/tr&gt;
 &lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The pass statement&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is the more popular option. &lt;code&gt;pass&lt;/code&gt; is a built-in null statement in Python that stands in for code that is yet to be written. Being built in makes it most Python developers' go-to statement for code blocks that should do nothing. It is used like the nothing package, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs2yedkrwoczpw8esdbya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs2yedkrwoczpw8esdbya.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Ellipsis (...)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the English language, an ellipsis marks the omission of a word, phrase, line, paragraph, or more from a quoted passage. In Python, however, the ellipsis is a singleton object that serves a range of purposes, from multi-dimensional array slicing to type hinting, but we will focus on its use as a placeholder. This is a personal favourite because I find the similarity between its use in Python and in regular English text interesting. The image below shows it used in a recursive function with only the base case specified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flp8hvq11nb6fgueb68w3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flp8hvq11nb6fgueb68w3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The final straw&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This last case is quite intriguing. It's something you have probably thought of at some point, and simply put, it is &lt;em&gt;leaving your editor empty&lt;/em&gt;. Why type when you are not ready yet? Finish that pseudocode, then open your editor and code away. Yes, this is a message to me too😂&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffcxlf10o6wzm660g59t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffcxlf10o6wzm660g59t3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope I successfully communicated the art of doing nothing in Python, and that you found this enjoyable.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
