<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Johannes Pfeiffer</title>
    <description>The latest articles on DEV Community by Johannes Pfeiffer (@johannespfeiffer).</description>
    <link>https://dev.to/johannespfeiffer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F451219%2Fd6edacd8-9542-431d-a36f-cf7533bbd80c.jpeg</url>
      <title>DEV Community: Johannes Pfeiffer</title>
      <link>https://dev.to/johannespfeiffer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/johannespfeiffer"/>
    <language>en</language>
    <item>
      <title>Machine Learning: A classification problem in FinTech with Node.js and TensorFlow</title>
      <dc:creator>Johannes Pfeiffer</dc:creator>
      <pubDate>Mon, 24 Aug 2020 13:35:46 +0000</pubDate>
      <link>https://dev.to/kontist/machine-learning-a-classification-problem-in-fintech-with-node-js-and-tensorflow-1jg9</link>
      <guid>https://dev.to/kontist/machine-learning-a-classification-problem-in-fintech-with-node-js-and-tensorflow-1jg9</guid>
      <description>&lt;h1&gt;
  
  
  Machine Learning
&lt;/h1&gt;

&lt;p&gt;Solving a classification problem in FinTech with Node.js and TensorFlow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://kontist.com"&gt;Kontist&lt;/a&gt; we provide a banking app for freelancers. The user can select a category for each of their transactions, for example, “Business expense,” “Private,” “Business income,” “Tax pay,” and more. Based on this selection, we then do tax calculations for the freelancer to support their savings.&lt;/p&gt;

&lt;p&gt;In the current user interface flow, the user selects a category from a list every time a new transaction comes in. To improve the user experience, we would like to automate the category selection. The naïve approach is to create manual rules like, “If the sender of a transaction was used in a transaction before, then just use the same category.” Obviously, this has some shortcomings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imagine the sender “Amazon.” One transaction could be “Private,” but the next one could be a “Business expense,” and our approach would fail.&lt;/li&gt;
&lt;li&gt;How should we categorize transactions from new and unknown senders?&lt;/li&gt;
&lt;li&gt;You could refine the rules by including more data. For example, we could not only look at the sender but also at the transaction amounts. But adding more manual rules to improve the accuracy would make the code complex and unwieldy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead, the approach we took was to create a machine learning solution. First, we built a model and trained it with existing transactions for which the category was known. Then we used that model to make predictions about upcoming transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to classification
&lt;/h2&gt;

&lt;p&gt;Classification is a task that assigns a label to some data, based on what was learned from previous data. In our case, the labels are categories (“Business expense,” “Private,” “Business income,” “Tax pay,” et cetera) and the data are the transactions.&lt;/p&gt;

&lt;p&gt;In general, the process looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define our model.&lt;/li&gt;
&lt;li&gt;Train the model with known data.&lt;/li&gt;
&lt;li&gt;Use the model to make predictions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Choose the right inputs
&lt;/h3&gt;

&lt;p&gt;Not all properties of a transaction help with the classification. For example, it is obvious that a random UUID property cannot help the model make predictions. In fact, we found that only a couple of properties have any influence on the prediction at all. The properties that we do use as input for training and prediction are called “input features.” The categories, on the other hand, are called the “output labels.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Layers and Neurons
&lt;/h3&gt;

&lt;p&gt;(Image: simplified overview of the network)&lt;br&gt;
Looking at this image, we can see that each input feature corresponds to one neuron on the left, and each output label corresponds to one neuron on the right.&lt;/p&gt;

&lt;p&gt;In between we have several neurons organized in multiple hidden layers. Neurons are connected from one layer to the next, each connection having a specific and custom weight. You could say the values (also called probabilities) of the output labels are just a sum of the neuron values multiplied by their weights. Put simply, training the model is a process of finding the correct weights for all connections between the neurons. &lt;/p&gt;
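&lt;p&gt;As an illustration (with made-up neuron values and weights), the value of a single output neuron is the weighted sum of the neuron values in the previous layer:&lt;/p&gt;

```typescript
// Illustrative only: made-up values, not weights from a real model.
const previousLayer = [0.2, 0.9, 0.4];
const weights = [0.5, 0.6, -0.1];

// Weighted sum: 0.2*0.5 + 0.9*0.6 + 0.4*(-0.1) ≈ 0.6
const outputValue = previousLayer.reduce(
  (sum, value, i) => sum + value * weights[i],
  0
);
```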

&lt;p&gt;(Image: sample weights; 62% of the input data is predicted to be in the business expense category.)&lt;/p&gt;
&lt;h2&gt;
  
  
  Our setup
&lt;/h2&gt;

&lt;p&gt;The backend is a Node.js and TypeScript environment. The transaction data comes from various sources, but we can access all of it via a PostgreSQL database.&lt;/p&gt;

&lt;p&gt;Luckily, there is already a JavaScript binding for TensorFlow (called TensorFlow.js).&lt;/p&gt;

&lt;p&gt;So, we can define a sequential model as described above. It consists of four layers. The first is the input layer, where we enter our features. This is implicitly added to the model. In addition, we have two hidden layers and a layer for the output labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;tf&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@tensorflow/tfjs-node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inputFeaturesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;categoriesCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sequential&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;units&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;inputFeaturesCount&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;categoriesCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dense&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;units&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;inputShape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;inputFeaturesCount&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;relu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dense&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;units&lt;/span&gt; &lt;span class="p"&gt;}));&lt;/span&gt;
&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dense&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;units&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;categoriesCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;softmax&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;adam&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;categoricalCrossentropy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;accuracy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Normalize everything
&lt;/h3&gt;

&lt;p&gt;Before we can start to train our model, it is time to normalize the data; the input features must be numerical values.&lt;/p&gt;

&lt;p&gt;For example, take the date of the booking, "2019-05-28 22:12." With the help of the &lt;a href="https://momentjs.com/"&gt;moment.js&lt;/a&gt; library, this can be extracted into three input features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const dayOfMonth = +moment(bookingDate).format("D");
const dayOfWeek = +moment(bookingDate).format("d");
const hour = +moment(bookingDate).format("H");
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To avoid complications, we want all values to be normalized to the range between 0 and 1. Therefore, as an extra step, we divide each value by its maximum possible value.&lt;/p&gt;
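&lt;p&gt;A minimal sketch of this step (the helper name is assumed for illustration): scale a list of feature values into the range 0 to 1 by dividing by the maximum value.&lt;/p&gt;

```typescript
// Sketch: divide each value by the column's maximum so all values
// end up between 0 and 1 (a max of 0 is left untouched).
const normalize = (values: number[]): number[] => {
  const max = Math.max(...values);
  return max === 0 ? values : values.map((v) => v / max);
};
```

&lt;p&gt;For example, the day-of-month feature would be divided by 31, its largest possible value.&lt;/p&gt;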

&lt;p&gt;Another part of the preparation for training is to evenly distribute the data. In our case, we have a lot more training data marked as “Business expense” than as “Private.” TensorFlow offers a nice way to handle that: it lets the user set a class weight for each label, corresponding to the distribution in the training data set. Note that these class weights are not to be confused with the actual weights of the connections between the neurons.&lt;/p&gt;
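&lt;p&gt;One common way to derive such class weights (a sketch, not our exact code) is to make each weight inversely proportional to how often its label occurs in the training data:&lt;/p&gt;

```typescript
// Sketch: labels that occur rarely get a proportionally higher weight,
// so they contribute more to the loss during training.
const classWeightsFor = (labelCounts: number[]): number[] => {
  const total = labelCounts.reduce((a, b) => a + b, 0);
  return labelCounts.map((count) => total / (labelCounts.length * count));
};
```

&lt;p&gt;The resulting values can then be handed to the training call so that underrepresented labels are not drowned out by the dominant ones.&lt;/p&gt;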

&lt;h3&gt;
  
  
  What does the crowd say?
&lt;/h3&gt;

&lt;p&gt;It turns out that some good input features do not come directly from the transaction itself. We can look at how the user in question, or other users, categorized transactions with the same IBAN in the past. This can give a strong indication of how to predict future transactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;Time to train our model. We take our training data, shuffle it, and split it into two parts.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The actual training data (80%)&lt;/li&gt;
&lt;li&gt;Some validation data (20%)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;First, TensorFlow uses the training data to try to find good weight values for the connections between the neurons. Training is a process of determining weight values so that the sum of the neuron values multiplied by their connection weights produces good output label values.&lt;/p&gt;

&lt;p&gt;The validation data will then be used to check if the training worked. We cannot use the training data to verify this; it would of course return perfect results since we just used it to create this model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputFeatureTensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;labelTensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="na"&gt;validationSplit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;callbacks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;earlyStopping&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;monitor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;val_loss&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;min&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;How does TensorFlow find these values? It iteratively applies a function to adjust the weights so that the discrepancy between the label results and the expected results is minimized. If the discrepancy is below a given threshold, training is complete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making predictions
&lt;/h3&gt;

&lt;p&gt;We now have a model and can start making predictions. Our incoming data must be in the same format as our training data, meaning we must apply the same normalization.&lt;/p&gt;

&lt;p&gt;All that is left to do is call &lt;code&gt;model.predict&lt;/code&gt; which will return a list of the probabilities that the input matches each category. The one with the highest probability will be selected as the category of the transaction.&lt;/p&gt;
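&lt;p&gt;Picking the winning category from the probability list boils down to an argmax. The function and variable names here are assumed for illustration:&lt;/p&gt;

```typescript
// Sketch: return the category whose predicted probability is highest.
const pickCategory = (
  probabilities: number[],
  categories: string[]
): string => {
  let best = 0;
  probabilities.forEach((p, i) => {
    if (p > probabilities[best]) best = i;
  });
  return categories[best];
};
```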

&lt;h2&gt;
  
  
  Learnings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Native Binary
&lt;/h3&gt;

&lt;p&gt;Internally, TensorFlow is a native binary that runs completely separately from Node.js; TensorFlow.js provides the bindings to it. The following sections explain two resulting considerations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dispose variables
&lt;/h4&gt;

&lt;p&gt;TensorFlow doesn't automatically clean up memory after model operations like &lt;code&gt;model.fit&lt;/code&gt;, &lt;code&gt;model.predict&lt;/code&gt;, et cetera. Therefore, we have to wrap these operations in &lt;code&gt;tf.engine()&lt;/code&gt; scope calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;startScope&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;endScope&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;disposeVariables&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
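&lt;p&gt;In practice it helps to wrap this pattern in a small helper so cleanup also runs when the operation throws. This is a sketch; the &lt;code&gt;ScopedEngine&lt;/code&gt; interface just mirrors the three &lt;code&gt;tf.engine()&lt;/code&gt; methods used above:&lt;/p&gt;

```typescript
// Sketch: run an async operation inside an engine scope and always
// clean up afterwards, even on failure.
interface ScopedEngine {
  startScope(): void;
  endScope(): void;
  disposeVariables(): void;
}

async function withTfScope(engine: ScopedEngine, operation: () => any) {
  engine.startScope();
  try {
    return await operation();
  } finally {
    engine.endScope();
    engine.disposeVariables();
  }
}
```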



&lt;h4&gt;
  
  
  Running in parallel
&lt;/h4&gt;

&lt;p&gt;We run multiple workers and processes. If they interact with the same TensorFlow.js instance, this creates complications. Possible solutions are to run the processes in sequence, to block concurrent access, or to separate them into their own instances.&lt;/p&gt;
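&lt;p&gt;Running the processes in sequence can be sketched with a minimal promise queue (names assumed) that guarantees only one TensorFlow job runs at a time:&lt;/p&gt;

```typescript
// Sketch: chain every job onto the previous one so jobs never overlap.
let tail: any = Promise.resolve();

function enqueue(job: () => any) {
  const run = tail.then(() => job());
  // keep the chain alive even if a job fails
  tail = run.catch(() => undefined);
  return run;
}
```

&lt;p&gt;Each call to the hypothetical &lt;code&gt;enqueue&lt;/code&gt; waits for all previously queued jobs before starting its own.&lt;/p&gt;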

&lt;h3&gt;
  
  
  Limitation of tools
&lt;/h3&gt;

&lt;p&gt;A lot of the tools for optimizing and researching models are written in Python instead of JavaScript. For example, one cannot simply use TensorBoard to watch how the models behave. For further optimization of our machine learning code, we plan to investigate deeper integration with such external tools.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>machinelearning</category>
      <category>node</category>
    </item>
    <item>
      <title>Implementing GraphQL in an existing code base</title>
      <dc:creator>Johannes Pfeiffer</dc:creator>
      <pubDate>Mon, 24 Aug 2020 13:35:27 +0000</pubDate>
      <link>https://dev.to/kontist/implementing-graphql-in-an-existing-code-base-3mf3</link>
      <guid>https://dev.to/kontist/implementing-graphql-in-an-existing-code-base-3mf3</guid>
      <description>&lt;h1&gt;
  
  
  Implementing GraphQL in an existing code base
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Motivation
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://kontist.com" rel="noopener noreferrer"&gt;Kontist&lt;/a&gt;, we provide a mobile banking app for freelancers. As our app grew in popularity, more and more of our external partners wanted to integrate with our services. To add to our offering, we developed a browser-based web application. Soon, our customers wanted an API as well. However, each client had different requests for an API, with its own idea about what data should be returned and how. Previously, we only had a custom-tailored REST-API for our own mobile application, which was hardly able to fulfill all those needs.&lt;/p&gt;

&lt;p&gt;One problem was that the API endpoints did not return the data that was required by the client. Either it was not available at all, or the client had to make a lot of follow-up requests to find a simple value (so-called underfetching). At the other extreme, an endpoint returned too much data, much of which was never used by the client and which had a negative impact on performance (so-called overfetching). Both underfetching and overfetching are common problems for traditional REST-APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GraphQL does
&lt;/h2&gt;

&lt;p&gt;Several options exist to overcome overfetching and underfetching. One popular solution is called GraphQL.&lt;/p&gt;

&lt;p&gt;The main idea of GraphQL is that &lt;em&gt;clients define the data structure.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of enforcing a server-defined data structure across multiple endpoints, each client can define exactly what data it needs via a single endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example REST vs. GraphQL
&lt;/h3&gt;

&lt;p&gt;Let's say we want to pull up an overview of the last three account transactions.  In our UI, we only need the description and the amount of the transactions.&lt;/p&gt;

&lt;p&gt;In the REST-API we would need (at least) two requests:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get information about the accounts to which the user has access, then filter the response for the primary account.&lt;/li&gt;
&lt;li&gt;Using the ID of the main account, pull all transaction data for this account. Then drop all transactions except the three we want to show, and drop all other data associated with those three transactions besides description and amount (e.g., we never show the booking date, the sender, the IBAN, et cetera).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foi21mcor927bt45ocrm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foi21mcor927bt45ocrm3.png" alt="Traditional approach with REST-API"&gt;&lt;/a&gt;&lt;br&gt;
(Traditional approach with REST-API)&lt;/p&gt;

&lt;p&gt;In contrast, GraphQL could fetch this transaction data like this:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj63jtvjjxch2ajr6hy7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fj63jtvjjxch2ajr6hy7d.png" alt="Approach with GraphQL-API"&gt;&lt;/a&gt;&lt;br&gt;
(Approach with GraphQL-API)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send one request with the response structure defined in the body:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight graphql"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="n"&gt;viewer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="n"&gt;mainAccount&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;In the body we use the GraphQL query language; it defines how the response should look. The available fields and their types are defined by the server up front. This allows mistakes in the client code to be detected before runtime of the application.&lt;/p&gt;

&lt;p&gt;This approach clearly has some benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid over- and underfetching&lt;/li&gt;
&lt;li&gt;Performance improvements because of:
&lt;ul&gt;
&lt;li&gt;less usage of network bandwidth&lt;/li&gt;
&lt;li&gt;no need to fetch unused data on the server&lt;/li&gt;
&lt;li&gt;no unused data polluting client memory&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Integrated validation and type-checking&lt;/li&gt;
&lt;li&gt;Documentation can be auto-generated&lt;/li&gt;
&lt;li&gt;Easier evolution of the API (without versioning)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;
&lt;h3&gt;
  
  
  API First
&lt;/h3&gt;

&lt;p&gt;Introducing the GraphQL-API was more complex than just adding a small component to our code base; we started using a different development model, called “API First.” Previously, we only added new features to our mobile application and then adjusted the backend for them. Now, a feature will be used by different clients at launch (web application, mobile application, our external partners and users). API First means we carefully design the model in our API, which can then be used by our own and all other clients.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reusing code
&lt;/h3&gt;

&lt;p&gt;When we started the GraphQL-API, we already had a well-functioning system. It did not make sense to start from scratch; that would have cost resources and introduced new bugs. Our idea was to reuse the existing models wherever possible. We did this by introducing decorators (also called annotations) to the code of the models to expose certain properties via GraphQL. Alongside GraphQL, we introduced a new ORM version which allowed us to use decorators. Our models are now mostly clean POJOs with some decorators added.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Account&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;DataType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Column&lt;/span&gt;
    &lt;span class="nx"&gt;iban&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we see a small excerpt of our account model class with the properties &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;iban&lt;/code&gt;. The &lt;code&gt;@Column&lt;/code&gt; decorator comes from our ORM, and the &lt;code&gt;@Field&lt;/code&gt; decorator is used to expose this property in GraphQL. This example shows that we do not expose the internal database ID property via GraphQL.&lt;/p&gt;

&lt;p&gt;Besides these decorators, we have &lt;code&gt;resolver&lt;/code&gt; classes, which find the correct entities for a given field in the GraphQL tree.&lt;/p&gt;
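&lt;p&gt;A resolver can be pictured like this (a simplified sketch with assumed names, not our production code): it receives the parent entity and the field arguments, and returns the matching child entities:&lt;/p&gt;

```typescript
// Hypothetical data source for illustration.
function fetchTransactionsForAccount(accountId: number) {
  return [
    { id: 1, accountId, description: "Office chair", amount: -189 },
    { id: 2, accountId, description: "Invoice 2020-07", amount: 1200 },
    { id: 3, accountId, description: "Hosting", amount: -29 },
  ];
}

// Sketch: resolve the `transactions` field of an `Account` parent,
// honoring the `first` argument from the query.
const resolvers = {
  Account: {
    transactions: (
      account: { id: number },
      args: { first: number }
    ) => fetchTransactionsForAccount(account.id).slice(0, args.first),
  },
};
```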

&lt;p&gt;All of this allowed us to introduce GraphQL very quickly without much overhead or boilerplate code. Instead, we could spend the time on more advanced topics like API design, paging, sorting, and filtering. For paging, there was already a common standard called the connection pattern. We implemented our own solution for sorting and filtering, which is now published as an open-source library called &lt;a href="https://github.com/kontist/type-graphql-filter" rel="noopener noreferrer"&gt;type-graphql-filter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common problems
&lt;/h2&gt;

&lt;p&gt;When dealing with GraphQL, common problems are performance and security. These become a concern simply because you can do so much in a single call. We implemented the following measures to mitigate these risks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Flat structure: We do not allow recursive structures in our queries. The hierarchy for a transaction is &lt;code&gt;user &amp;gt; account &amp;gt; transaction&lt;/code&gt;, but this transaction entity has no property leading back to its account or user.&lt;/li&gt;
&lt;li&gt;Implementation of the DataLoader pattern: For performance-intensive properties, we cache the database result for the duration of the request.&lt;/li&gt;
&lt;li&gt;Hard limits: To avoid misuse of our API, we implemented limits on request and response sizes as well as on the number of items that we return in one response.&lt;/li&gt;
&lt;/ol&gt;
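&lt;p&gt;The core idea behind the per-request caching in point 2 can be sketched in a few lines (names assumed): within one request, identical lookups hit the database only once.&lt;/p&gt;

```typescript
// Sketch: memoize a load function for the lifetime of one request.
// A new loader is created per request, so the cache never goes stale
// across requests.
function makeCachedLoader(load: (key: string) => any) {
  const cache = new Map();
  return (key: string) => {
    if (!cache.has(key)) {
      cache.set(key, load(key));
    }
    return cache.get(key);
  };
}
```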

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The feedback from both internal and external users of the GraphQL-API has been very positive. We now have the ability to quickly develop new models. Currently, we are migrating all remaining REST endpoints to GraphQL; then we will shut down the old legacy API.&lt;/p&gt;

</description>
      <category>graphql</category>
      <category>typescript</category>
      <category>decorators</category>
    </item>
  </channel>
</rss>
