DEV Community: Benjamin Trent

Log clustering in Rust

Benjamin Trent — Fri, 20 Nov 2020 18:01:37 +0000

Log clustering is a powerful tool for finding insights in large amounts of logs. Spikes in log categories can indicate a change point in the system warranting investigation.

Here is a simple library I wrote, drain-rs, along with lg-rs, an example command-line utility that uses the drain library.

Drain is an online, unsupervised, semi-structured text clustering algorithm. It is based on the original work by logpai. Drain boasts some impressive numbers when compared against other clustering algorithms.

Pairing this clustering efficacy with the throughput and low overhead of Rust is a no-brainer.
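To give a feel for what "online, unsupervised clustering" means here, below is a much-simplified sketch of the Drain idea in Python. It is not the drain-rs API: `SimpleDrain`, `sim_threshold`, and the two-level routing (token count, then first token) are illustrative assumptions, whereas real Drain uses a configurable fixed-depth parse tree and extra preprocessing.

```python
from collections import defaultdict

WILDCARD = "<*>"


class SimpleDrain:
    """Much-simplified, Drain-style online log clusterer (illustrative only)."""

    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        # (token count, first token) -> list of templates (each a list of tokens)
        self.buckets = defaultdict(list)

    @staticmethod
    def _similarity(template, tokens):
        # Fraction of positions where the template token equals the log token.
        same = sum(1 for t, tok in zip(template, tokens) if t == tok)
        return same / len(tokens)

    def add(self, line):
        tokens = line.split()
        if not tokens:
            return []
        # Route first by message length, then by the leading token.
        bucket = self.buckets[(len(tokens), tokens[0])]
        best = max(bucket, key=lambda t: self._similarity(t, tokens), default=None)
        if best is not None and self._similarity(best, tokens) >= self.sim_threshold:
            # Merge into the existing template: differing positions become wildcards.
            for i, tok in enumerate(tokens):
                if best[i] != tok:
                    best[i] = WILDCARD
            return best
        # No template is similar enough, so this line starts a new cluster.
        template = list(tokens)
        bucket.append(template)
        return template


drain = SimpleDrain()
for log in [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.2 port 8080",
    "Disk usage at 91 percent",
]:
    drain.add(log)

templates = [" ".join(t) for bucket in drain.buckets.values() for t in bucket]
print(templates)
```

Each line is processed once as it arrives, which is what makes the approach "online": the two "Connected" lines collapse into one template with wildcards, while the disk message starts its own cluster.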

Let me know what you think. It's OSS, so issues/PRs are welcome.

Measure twice, write twice

Benjamin Trent — Thu, 06 Aug 2020 20:48:00 +0000

Continually measure

When writing code, it's tempting to get complex, especially when you are concerned about performance. But you should write it simply first and measure its performance. With that data in hand, write your improvements, using your performance measurements as guides.

My failure to measure

I was writing a statistics-gathering service in Java. The statistics were very simple counts of items seen over time. The constraints of the system were:

High throughput
Thread safe

Java has a great class for thread-safe incremental statistics: LongAdder. It's fantastic for fast writes, but LongAdder#reset is not safe to call while other threads are incrementing, because updates that race with the reset can be lost. I needed to be able to grab the latest full count and then reset. In comes ReadWriteLock! ReadWriteLock#readLock can be used for all the increment actions and ReadWriteLock#writeLock for grabbing the latest total and resetting. The resulting service ended up looking like this:

public static class Accumulator {
    private final LongAdder statsAccumulator = new LongAdder();
    private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock(true);

    // Increments share the read lock, so they never block each other
    public Accumulator inc() {
        readWriteLock.readLock().lock();
        try {
            this.statsAccumulator.increment();
            return this;
        } finally {
            readWriteLock.readLock().unlock();
        }
    }

    // The write lock excludes all increments while we read and reset
    public Stats currentStatsAndReset() {
        readWriteLock.writeLock().lock();
        try {
            Stats stats = currentStats(Instant.now());
            this.statsAccumulator.reset();
            return stats;
        } finally {
            readWriteLock.writeLock().unlock();
        }
    }

    public Stats currentStats(Instant timeStamp) {
        return new Stats(statsAccumulator.longValue(), timeStamp);
    }
}

I thought I had the perfect high-throughput, thread-safe, statistics-gathering class. I mean, it never blocks writes unless we call currentStatsAndReset. The common hot path of inc() is normally non-blocking.

Perfect.

But I didn't do one thing. I never measured the performance of this implementation against a dead-simple synchronized version. 🤦

Start simple, then measure

Here is the simple version:

public static class Accumulator {
    private long statsAccumulator = 0L;

    public synchronized Accumulator inc() {
        this.statsAccumulator++;
        return this;
    }

    public synchronized Stats currentStatsAndReset() {
        Stats stats = currentStats(Instant.now());
        this.statsAccumulator = 0L;
        return stats;
    }

    public Stats currentStats(Instant timeStamp) {
        return new Stats(statsAccumulator, timeStamp);
    }
}

It doesn't use any of those classes designed for low-contention locking. Just plain ol' synchronized methods. Surely, all that use of synchronized would increase contention on write.

Another developer on my team called me out on this. He was curious to see if my version was truly faster. I knew I was right, so I wrote a JMH benchmark to prove him wrong.

The results were not on my side:

    Benchmark                                                   Mode  Cnt        Score        Error  Units
    MultiThreadedStatsAccumulatorBenchmark.rwAccumulator_1      avgt   20     5957.399 ±    112.892  us/op
    MultiThreadedStatsAccumulatorBenchmark.rwAccumulator_128    avgt   20  7480921.908 ± 255364.820  us/op
    MultiThreadedStatsAccumulatorBenchmark.syncAccumulator_1    avgt   20      421.662 ±      2.616  us/op
    MultiThreadedStatsAccumulatorBenchmark.syncAccumulator_128  avgt   20   792910.927 ±  52219.577  us/op

My complex version (rwAccumulator) was roughly 14x SLOWER with 1 thread and more than 9x SLOWER with 128 threads. The simple, fully synchronized version (syncAccumulator) kicked my butt in both cases!

Measure, Measure, Measure

My lessons were humbling reminders.

Always write the simple way first.
If you think it should be faster, change it and measure it.
Measure, measure, measure
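The same habit carries over to any language. As a rough illustration (not a substitute for the JMH benchmark, and with hypothetical names such as `SyncAccumulator` and `measure`), here is a minimal Python sketch of the "start simple, then measure" loop: write the single-lock version first, then time it under concurrent load before considering anything fancier.

```python
import threading
import time


class SyncAccumulator:
    """Simplest thread-safe counter: one lock around every operation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def inc(self):
        with self._lock:
            self._count += 1
        return self

    def current_and_reset(self):
        # Read and reset under the same lock so no increment is lost.
        with self._lock:
            count, self._count = self._count, 0
        return count


def measure(accumulator, n_threads, incs_per_thread):
    """Return (elapsed seconds, final count) for a burst of concurrent increments."""

    def worker():
        for _ in range(incs_per_thread):
            accumulator.inc()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return elapsed, accumulator.current_and_reset()


elapsed, total = measure(SyncAccumulator(), n_threads=8, incs_per_thread=10_000)
print(f"{total} increments in {elapsed:.4f}s")
```

Any cleverer accumulator can be passed to the same `measure` function, so the comparison is made with numbers rather than intuition. Timings from Python threads are not comparable to the Java results above.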

Software Engineers are Draftsmen

Benjamin Trent — Mon, 27 Jul 2020 20:22:53 +0000

We design, write, and deliver structures to be built. We build nothing. The computer does the building - millions of times. We use IntelliJ instead of CAD. Our designs evolve and are flexible; filled with manifold uses and complexities. But, in the end, it is a specification to build a structure. Like architects and draftsmen we are artists, engineers, designers, creators. Software Engineers are draftsmen.

Software Engineers are:

Writers?
Artists?
Engineers?
What do you think?

ML Model Inference in Painless

Benjamin Trent — Thu, 09 Jul 2020 19:02:08 +0000

Inference in Painless

I am an employee of Elastic at the time of writing.

Machine learning inference is just math. You have some parameters, pump them through some functions, and boom, you get a result. While this is simple on the surface, all the tooling can get complex. Could I script simple model inference in Elasticsearch?

What is [Elasticsearch | Painless]

Elasticsearch is a distributed, RESTful, and open data store. The underlying store is Lucene with a bunch of goodies built on top.

Painless is a secure, simple, and flexible scripting language purpose-built for Elasticsearch. You can use custom scripts at search time, in many different aggregations, and even at ingest time. It's crazy powerful and flexible. But with great power comes great responsibility.

Machine Learning inference in Painless

Painless 100.5 (not even 101)

Painless can be used in a couple of ways:

Inline: the whole script is included in the API call.
Stored: the script is stored in Elasticsearch's cluster state.

Painless scripts can reference fields in the given context (doc fields, _source fields). They also have access to a params object. This can be provided for script reuse on different input parameters.

Simple models

Linear regression, being intuitive and simple, is a very nice place to start the experiments.

It is trivial to implement linear regression with a single target in Painless.

# Storing a simple linear regression function script
PUT _scripts/linear_regression_inference
{
  "script": {
    "lang": "painless",
    "source": """
    // This assumes the parameter definitions will be given when used
    // This also assumes a single target.
    double total = params.intercept;
    for (int i = 0; i < params.coefs.length; ++i) {
      total += params.coefs.get(i) * doc[params['x'+i]].value;
    }
    return total;
    """
  }
}

I trained a simple model in scikit-learn on the diabetes data set. Here the model's resulting parameters are passed to the stored script to return a script field.

GET diabetes_test/_search
{
  "script_fields": {
    "regression_score": {
      "script": {
        "id": "linear_regression_inference",
        # Here are the model parameters. The linear regression coefficients and intercept. 
        "params": {
          # coef_ attribute from sklearn 
          "coefs": [-35.55683674, -243.1692265, 562.75404632, 305.47203008, -662.78772128, 324.27527477, 24.78193291, 170.33056502, 731.67810787, 43.02846824],
          # intercept_ attribute from sklearn
          "intercept": 152.53813351954059,
          "x0": "age",
          "x1": "sex",
          "x2": "bmi",
          "x3": "bp",
          "x4": "s1",
          "x5": "s2",
          "x6": "s3",
          "x7": "s4",
          "x8": "s5",
          "x9": "s6"
        }
      }
    }
  }
}
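To sanity-check the arithmetic outside of Elasticsearch, the stored script's logic can be mirrored in a few lines of Python. This is only a sketch: `linear_regression_inference` is a hypothetical helper, `doc` is a plain dict standing in for the document's field values, and the toy parameters are made up rather than taken from the diabetes model.

```python
def linear_regression_inference(params, doc):
    """Mirror of the stored Painless script: intercept + sum(coef_i * x_i)."""
    total = params["intercept"]
    for i, coef in enumerate(params["coefs"]):
        # params["x0"], params["x1"], ... name the document field for each coefficient
        total += coef * doc[params[f"x{i}"]]
    return total


# Toy parameters and document (made-up values, not the diabetes model).
params = {"intercept": 1.0, "coefs": [2.0, -3.0], "x0": "a", "x1": "b"}
doc = {"a": 5.0, "b": 1.0}
score = linear_regression_inference(params, doc)
print(score)  # 1.0 + 2.0*5.0 + (-3.0)*1.0 = 8.0
```

Feeding the same function the coefficients, intercept, and field mapping from the request above, plus one document's field values, should reproduce that document's regression_score.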

Writing custom inference code for every model type could get tiring. More complex models will demand an ever-growing library of functions. There are plenty of inference libraries out there to experiment with. Why reinvent the wheel? Can one be made to work with Painless?

m2cgen to Painless

m2cgen is a Python library that translates trained models into static code. While only specific models are supported, the generated code works great. Painless supports a subset of Java, and m2cgen has Java as a potential output. Generating Painless scripts from trained models is possible!

Well, it's not out of the box. The entry point for Painless is an object called params. So, m2cgen's Java functions have to be adjusted for how Painless accepts outside parameters. Here is an example translating the Java output to a Painless-acceptable one:

import m2cgen as m2c
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% for testing
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.2, random_state=0)
print(diabetes.feature_names)

# Train a small gradient-boosted regressor
model = xgb.XGBRegressor(max_depth=6, learning_rate=0.3, n_estimators=50)
model.fit(X_train, y_train)

# Export to Java, then rewrite positional input[i] lookups as params["feature_name"]
java_model = m2c.export_to_java(model)
java_model = java_model.replace("input", "params")
for idx, val in enumerate(diabetes.feature_names):
    java_model = java_model.replace("[" + str(idx) + "]", "[\"" + val + "\"]")
print(java_model)

Here is the output (truncated):

double var0;
        if ((params["s5"]) >= (0.0216574483)) {
            if ((params["bmi"]) >= (0.0131946635)) {
                var0 = 72.2889786;
            } else {
            ...
            return ((((((((((((((((((((((((((((((((((((((((((((((((((0.5) + (var0)) + (var1)) + (var2)) + (var3)) + (var4)) + (var5)) + (var6)) + (var7)) + (var8)) + (var9)) + (var10)) + (var11)) + (var12)) + (var13)) + (var14)) + (var15)) + (var16)) + (var17)) + (var18)) + (var19)) + (var20)) + (var21)) + (var22)) + (var23)) + (var24)) + (var25)) + (var26)) + (var27)) + (var28)) + (var29)) + (var30)) + (var31)) + (var32)) + (var33)) + (var34)) + (var35)) + (var36)) + (var37)) + (var38)) + (var39)) + (var40)) + (var41)) + (var42)) + (var43)) + (var44)) + (var45)) + (var46)) + (var47)) + (var48)) + (var49);
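The rewriting step can be tried in isolation without training a model. The sketch below applies the same two string replacements to a single hand-written line; `java_to_painless` is a hypothetical helper and `sample` is a stand-in, not real m2cgen output.

```python
FEATURE_NAMES = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]


def java_to_painless(java_src, feature_names):
    """Rewrite positional input[i] lookups into params["name"] lookups."""
    src = java_src.replace("input", "params")
    for idx, name in enumerate(feature_names):
        src = src.replace(f"[{idx}]", f'["{name}"]')
    return src


# Hand-written stand-in for one generated line (not real m2cgen output).
sample = "if ((input[8]) >= (0.0216574483)) { var0 = 72.2889786; }"
converted = java_to_painless(sample, FEATURE_NAMES)
print(converted)
```

Because this is plain string replacement, it rewrites every occurrence of `input` and of a bracketed index such as `[8]`, not only the feature lookups. That is fine for this generated code but worth keeping in mind for anything less regular.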

The generated script is humongous. Almost 7,000 lines. Anybody will tell you, that is too much.

But, does it work?

Ugh:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "exceeded max allowed stored script size in bytes [65535] with size [307597] for script [diabetes_xgboost_model]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "exceeded max allowed stored script size in bytes [65535] with size [307597] for script [diabetes_xgboost_model]"
  },
  "status" : 400
}

Script size limits are wise. Stored scripts are put in the cluster state object. The more stored scripts there are (and the larger they are), the more overall cluster performance will drag, which might lead to other problems.
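A quick local check can flag an oversized script before it is sent. The sketch below assumes the limit is compared against roughly the UTF-8 byte length of the script source, which is an approximation of what Elasticsearch actually measures. `script_size_report` is a hypothetical helper, and 65535 is simply the limit quoted in the error above.

```python
DEFAULT_MAX_SCRIPT_BYTES = 65535  # limit quoted in the error message above


def script_size_report(source, max_bytes=DEFAULT_MAX_SCRIPT_BYTES):
    """Return (size_in_bytes, fits) for a script source string."""
    size = len(source.encode("utf-8"))
    return size, size <= max_bytes


small = "return params.intercept;"
large = "double var0;\n" * 30_000  # stand-in for a very large generated script

print(script_size_report(small))
print(script_size_report(large))
```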

But what if I didn't care about my cluster health? I want my model and I want it now!

PUT _cluster/settings
{
  "transient": {
    "script.max_size_in_bytes": 10000000
  }
}

Sane limitations can't stop me!

Time to put the script:

PUT _scripts/diabetes_xgboost_model
{
  "script": {
    "lang": "painless",
    "source": """
    ...very large source...
    """
  }
}

Now I can use my stored script!

"regression": {
    "bucket_script": {
        "buckets_path": {
            "age": "age",
            "sex": "sex",
            "bmi": "bmi",
            "bp": "bp",
            "s1": "s1",
            "s2": "s2",
            "s3": "s3",
            "s4": "s4",
            "s5": "s5",
            "s6": "s6"
        },
        "script": {
          "id": "diabetes_xgboost_model"
        }
    }
}

This example is using it as a bucket_script aggregation.

Should you do this in production? No.

Was it a fun exploration? For me, yes! m2cgen is such a wonderful discovery, and pairing it with the flexibility of Painless was worth exploring.

Here is a gist containing the Python code + full generated script

Outlier detection from scratch (sort of) in python

Benjamin Trent — Tue, 30 Jun 2020 21:30:10 +0000

note: this is a cross-post originally written on my blog

Outlier Detection

Outlier detection can be achieved through some very simple, but powerful algorithms. All the examples here are either density or distance measurements. The code here is non-optimized because, more often than not, optimized code is hard-to-read code. Additionally, these measurements make heavy use of K-Nearest-Neighbors. Consequently, they may not be as useful at higher dimensions.

First, let's generate some data with some random outliers.

%matplotlib inline
# Generate some fake data clusters
from sklearn.datasets import make_blobs
from matplotlib import pyplot
from pandas import DataFrame
import random
import numpy as np

r_seed = 42

# Generate three 2D clusters totalling 1000 points 
X, y = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=r_seed)
random.seed(r_seed)
random_pts = []

# Generate random noise points that could be or could not be close to the clustered neighborhoods
for i in range(50):
    random_pts.append([random.randint(-10, 10), random.randint(-10, 10)])

X = np.append(X, random_pts, axis=0)

df = DataFrame(dict(x=X[:,0], y=X[:,1]))
df.plot(kind='scatter', x='x', y='y')
pyplot.show()

The Algorithms

Four separate algorithms are shown below:

Local Outlier Factor (LoF): A density metric that determines how dense a point's local neighborhood is. The neighborhood is determined via the K nearest neighbors.
Local Distance-based Outlier Factor (LDoF): A density + distance algorithm that is similar to LoF, but instead of worrying about neighborhood density, it looks at how far a point is from the perceived center of the neighborhood.
K^th Nearest Neighbor Distance (K^thNN): A distance metric that looks at how far away a point is from its K^th nearest neighbor.
K Nearest Neighbors Total Distance (TNN): A distance metric that is the averaged distance to the K nearest neighbors.

Kth Nearest Neighbor Distance (K^thNN)

This is a very intuitive measure. How far away are you from your K^th neighbor? The farther away, the more likely you are to be an outlier from the set.

from sklearn.neighbors import NearestNeighbors

k = 10

knn = NearestNeighbors(n_neighbors=k)

knn.fit(X)
# kneighbors returns (distances, indices), each row sorted by increasing distance.
# Note: because we query with the training points themselves, each point's
# first "neighbor" is normally the point itself (at distance 0).
knn_distances, neighbors = knn.kneighbors(X)
# Gather the kth nearest neighbor distance
kth_distance = [x[-1] for x in knn_distances]

Average distance to K Nearest Neighbors (TNN)

Very similar to K^thNN, but we average out all the distances to the K nearest neighbors. Since K^thNN only takes a single neighbor into consideration, it may miss certain outliers that TNN finds.

# Gather the average distance from each point to its K nearest neighbors
tnn_distance = np.mean(knn_distances, axis=1)

Notice the point in the upper-right corner: TNN determines that it is more likely an outlier because of how far it is from all its neighbors.
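To make the difference between the two measures concrete, here is a tiny pure-Python example on four hand-picked points. It is a sketch under a few assumptions: the helper name is hypothetical, the neighbor search is brute force, and a point is never counted as its own neighbor.

```python
import math


def knn_distances(points, k):
    """For each point, the sorted distances to its k nearest *other* points."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result


points = [(0, 0), (1, 0), (2, 0), (10, 0)]  # the last point is the obvious outlier
k = 2

dists = knn_distances(points, k)
kth_nn = [d[-1] for d in dists]    # distance to the k-th neighbor
tnn = [sum(d) / k for d in dists]  # average distance to the k neighbors

print(kth_nn)  # [2.0, 1.0, 2.0, 9.0]
print(tnn)     # [1.5, 1.0, 1.5, 8.5]
```

Both scores single out the point at (10, 0). TNN also uses the distances to the closer neighbors, which is why it can separate points that share the same K^th-neighbor distance.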

Local Distance-based Outlier Factor (LDoF)

This algorithm is slightly more complicated, though not by much.

The paper explaining it in depth is here.

Here is the simplified version.

We have already calculated one part of this algorithm through TNN. Let's call this value TNN(x), for some point x.

The other part is what the paper calls the "KNN inner distance". This is the average of all the distances between all the points in the set of K nearest neighbors, referred to here as KNN(x).

So, LDoF(x) = TNN(x) / KNN_inner_distance(KNN(x))

This combination makes this method a density and a distance measurement. The idea is that a point with an LDoF score >> 1.0 is well outside the cloud of K nearest neighbors. Any point with an LDoF score less than or "near" 1.0 could be considered "surrounded" by the cloud of neighbors.

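# Gather the inner distance for pts: the average distance over all distinct pairs
def knn_inner_distance(pts):
    summation = 0
    n = len(pts)
    for i in range(n):
        pt = pts[i]
        # Start at i + 1 so each unordered pair is counted exactly once
        for other_pt in pts[i + 1:]:
            summation = summation + np.linalg.norm(pt - other_pt)
    # There are n * (n - 1) / 2 unordered pairs
    return summation / (n * (n - 1) / 2)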

inner_distances = [knn_inner_distance(X[ns]) for ns in neighbors]

ldofs = [x/y for x,y in zip(tnn_distance, inner_distances)]

You can notice the effect of the "cloud" idea. All the points between the clusters are marked with a much lower probability of being an outlier, while those outside the cloud have a much higher likelihood.
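The ratio is easy to check by hand on a tiny data set. The following pure-Python sketch uses brute-force neighbor search, excludes the point itself from its neighbor set, and needs k >= 2 so that at least one neighbor pair exists. The helper name `ldof` and the four points are made up for illustration.

```python
import math
from itertools import combinations


def ldof(points, i, k):
    """LDoF of points[i]: mean distance to its k neighbors / mean pairwise distance among them."""
    others = [q for j, q in enumerate(points) if j != i]
    neighbors = sorted(others, key=lambda q: math.dist(points[i], q))[:k]
    tnn = sum(math.dist(points[i], q) for q in neighbors) / k
    pairs = list(combinations(neighbors, 2))
    inner = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return tnn / inner


points = [(0, 0), (1, 0), (2, 0), (10, 0)]
scores = [ldof(points, i, k=2) for i in range(len(points))]
print(scores)  # [1.5, 0.5, 1.5, 8.5]
```

The middle point (1, 0) sits between its two neighbors and scores below 1.0, while (10, 0) is far outside its neighbors' cloud and scores well above it.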

Local Outlier Factor (LoF)

LoF is a density focused measurement. The core concept of this algorithm is reachability_distance. This is defined as reachability_distance(A, B) = max{distance(A,B), KthNN(B)}. In other words, it is the true distance between A and B, but it has to be AT LEAST the distance between B and its K^th nearest neighbor.

This makes reachability_distance asymmetrical. Since A and B have a different set of K nearest neighbors, their own distances to their K^th neighbor will differ.

Using reachability_distance we can calculate local_reach_density, a measure of a point's neighborhood density.

For some point x, its local_reach_density is 1 divided by the average of all the reachability_distance(x, y) for all y in KNN(x), i.e. the set of x's K nearest neighbors.

Armed with this, we can then compare point x's local_reach_density to that of its neighbors to get the LoF(x).

The Wikipedia article on LoF gives an excellent, succinct mathematical and visual explanation.

local_reach_density = []
for i in range(X.shape[0]):
    pt = X[i]
    sum_reachability = 0
    neighbor_distances = knn_distances[i]
    pt_neighbors = neighbors[i]
    for neighbor_distance, neighbor_index in zip(neighbor_distances, pt_neighbors):
        neighbors_kth_distance = kth_distance[neighbor_index]
        sum_reachability = sum_reachability + max([neighbor_distance, neighbors_kth_distance])

    avg_reachability = sum_reachability / k
    local_reach_density.append(1/avg_reachability)

local_reach_density = np.array(local_reach_density)
lofs = []
for i in range(X.shape[0]):
    pt = X[i]
    avg_lrd = np.mean(local_reach_density[neighbors[i]])
    lofs.append(avg_lrd/local_reach_density[i])

# Or just use
# from sklearn.neighbors import LocalOutlierFactor
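For a version that can be traced by hand, here is a dependency-free sketch of the same three steps (K^th-neighbor distances, local reach density, LoF) on four points. It assumes brute-force neighbor search and does not count a point as its own neighbor. `lof_scores` is a hypothetical helper, not the scikit-learn API.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor for every point, using brute-force neighbor search."""
    n = len(points)
    neighbors, kth_dist = [], []
    for i in range(n):
        ranked = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(ranked[:k])       # (distance, index) pairs
        kth_dist.append(ranked[k - 1][0])  # distance to the k-th neighbor

    lrd = []
    for i in range(n):
        # reachability_distance(i, j) = max(distance(i, j), kth_dist[j])
        reach = [max(d, kth_dist[j]) for d, j in neighbors[i]]
        lrd.append(1 / (sum(reach) / k))

    # LoF = average density of the neighbors / the point's own density
    return [
        (sum(lrd[j] for _, j in neighbors[i]) / k) / lrd[i]
        for i in range(n)
    ]


points = [(0, 0), (1, 0), (2, 0), (10, 0)]
scores = lof_scores(points, k=2)
print([round(s, 3) for s in scores])  # [0.875, 1.333, 0.875, 4.958]
```

Scores near 1.0 mean a point is about as dense as its neighbors. The isolated point at (10, 0) is far less dense than the points it considers neighbors, so its score is much larger.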

What did you learn this week? July 12, 2019

Benjamin Trent — Fri, 12 Jul 2019 17:53:05 +0000

What did y'all learn this week?

Distributed work

Benjamin Trent — Sat, 15 Jun 2019 21:17:19 +0000

Remote work is now

I (at the time of this writing) work for Elastic, a medium-sized, fully distributed company. My team is all over the world. My manager is in London. My closest team member is ~1500 miles away. But we are a cohesive team, and we all love working where we are. This is my first fully distributed job. It is my first remote job of any kind. There were times in the past when my previous employers would allow a "work from home" day. It is nothing compared to being fully distributed. Here are some lessons learned so far.

Have intentional interactions

When distributed, you do not randomly run into people in the hallways. There is no watercooler. No coffee breaks. Teams succeed when there is a high level of trust between the members. This trust only occurs when team members know each other. Don't skimp on meetings. At Elastic, we have things called Always On (AoN) video rooms. These are public video chat rooms, usually divided up by team or product focus area, where anybody can join and chat. Don't shy away from randomly meeting with people from your team.

Have ceremony

You don't have a commute. There is nothing incidentally separating your time between work and everything else. This can easily cause work to drift earlier and later into your days. Also, your brain has no signal to switch from "work mode". Walk your dog, meditate, exercise, read, do anything that signals your brain that you have started and then ended your work day. A fake commute will make your life better.

Have open comms and calendars

When you are distributed, almost all communication is asynchronous. Don't hide communications and rely on personal messages. Have open comms as much as possible. This simulates the group discussions you had around your desk when you used to work in an office. Maybe somebody else has important feedback. Possibly somebody else can answer your question. Don't be afraid of Reply-All. Just respect its power.

As for calendars and scheduling meetings, don't be afraid to make your meetings public on the company calendar. This can build trust with others. It is not happenstance that openness and trust go hand-in-hand.

Don't forget to set your "Do not disturb" hours in Slack!

Have multiple irons in the fire

It took a while for me to get used to more than half of my team ending work before noon. What if what I am working on is blocked by a colleague in Germany? Well, put it off until the next day. Pick up something else and get working. When working, especially at the start or the end of larger projects, there will be times when you are blocked due to time zone differences. This is just a consequence of the universe. Deal with it. Do more than one thing.

Have a love for where you are

Never take your local community for granted. Get plugged in, share experiences, and grow. Even with Slack, Zoom, IRC, and email, humans need individual interactions. Get plugged into your local tech (or non-tech) community. It will help get you out of the house.
