DEV Community: Alex Burlacu

Interviewing for a Senior ML Engineer position

Alex Burlacu — Sun, 24 Jul 2022 22:22:00 +0000

Originally posted @ alexandruburlacu.github.io/posts/2022-07-23-senior-ml-interview

Interviewing is always a tiring and sometimes awkward process. Thankfully there are lots of resources online to help you prepare. But what if you need more specific advice for a more niche position?

This post is based on my personal experience going through the interviewing process at five non-FAANG companies. I also interviewed for non-senior ML Engineering roles at another three companies last year, so I will do a comparative analysis too.

Before we begin...

Let me start with a short prologue to explain why I'm writing this piece. In January 2022, I decided, again, that it was time to search for another job outside of my home country. But this time, I decided to be sneaky/smart about it, so I changed my LinkedIn location to show that I'm in London. I also polished my LinkedIn page a bit more to show some highlights of my recent experience. And then magic happened. For weeks, recruiters invited me to interviews. I didn't even have to apply to anything myself, only accept or reject opportunities arriving from recruiters. What surprised me was that the majority of options were senior or even lead roles. So, I felt like an imposter, but I still accepted a few of these and started the process. And then I searched for tips on how to nail senior ML engineering interviews... and found almost nothing. Sh*t. And that's how I ~~met your mother~~ decided to write this blog post.

I brushed up my interviewing skills through mock interviews. I was also searching for technical questions for Senior ML roles. Surprisingly, I couldn't find anything. All the info was only for MLE roles. It seemed a bit strange. In retrospect, it all makes sense now.

I know you are eager to find out why, so I'll just give the TL;DR right away - technical questions for ML and Senior ML roles have more or less the same complexity/difficulty. Surprise!

I bet you didn't expect that. I know I didn't. But then, what is different? And how does the interviewing process work for Senior ML Engineers?

Senior vs non-senior ML interviews

Based on my experience, I haven't noticed much difference between senior ML and ML engineering interviews at the technical level.

What I did notice is the focus on soft skills for senior positions, and I don't necessarily mean communication skills. Instead, interviewers look at how a candidate handled failures, team-level conflicts, and cross-team communication, how they solved their most challenging problems, or how they handled a poor decision.

I recall the first technical interview I had for a Senior ML role. I was anxious about what kind of questions I would receive. It wasn't so bad, I've had tougher questions than that, but the focus was undoubtedly more on how I handled some scenarios or how I would handle them now.

Aspect	ML engineer interview	Senior ML engineer interview
Coding	Your usual leetcode-medium questions	Same, haven't seen dynamic programming at this stage
Take-home assignment	Either do EDA or deploy an ML model, focus on code quality, ease of use and tests	Same, take-home assignments are not harder for senior positions
ML Trivia	How do the algorithms work? What would be the best solution for a given type of problem?	On average, the same as for ML engineers
System Design	How to implement a system for a given scenario? Data collection issues?	On average, same as for ML engineers, just be more conscious of budget constraints
Behavioral	Focus on collaboration, individual growth, and adaptability	Focus on failures, conflict management, and cross-team collaboration

One position for which I did notice some big differences when it comes to the technical questions is Research Engineer. I'm talking about questions like how JPEG compresses images, how to compute the nth Fibonacci number in O(log n) time, or how to compute PCA from scratch. Now, for a research engineering position, these kinds of questions do make sense because of the innovative and research-oriented nature of the projects. These can frequently involve a lot of convert-math-to-code or let's-break-it-down-and-then-improve types of tasks.
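To give a feel for the Fibonacci question, here's a minimal sketch of one well-known way to get O(log n), the "fast doubling" identities. It's an illustration rather than the answer any particular interviewer expects, and the function name `fib` is my own choice. Note that "O(log n)" counts arithmetic operations; multiplying very large integers has its own cost.

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with fib(0) = 0 and fib(1) = 1."""
    if n < 0:
        raise ValueError("n must be non-negative")

    def pair(k):
        # Returns (F(k), F(k+1)); k is halved on every recursive call,
        # so the recursion depth is about log2(k).
        if k == 0:
            return 0, 1
        a, b = pair(k // 2)   # a = F(m), b = F(m+1), where m = k // 2
        c = a * (2 * b - a)   # F(2m)   = F(m) * (2*F(m+1) - F(m))
        d = a * a + b * b     # F(2m+1) = F(m)^2 + F(m+1)^2
        return (d, c + d) if k % 2 else (c, d)

    return pair(n)[0]


print([fib(i) for i in range(10)])
```

An equivalent route is raising the matrix [[1, 1], [1, 0]] to the nth power by repeated squaring; fast doubling is the same idea with less bookkeeping.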

Anyway, to give you a more detailed view, let's see what the general interviewing process is for these kinds of roles.

The general interviewing flow

First, let's go over the main steps in the process. Generally, there are at least 4 steps:

1. You have the first call with a recruiter or hiring manager. You get to know each other, go over your CV in general, and discuss what makes you search for jobs or accept invitations to interview, what you know about the company, what you are looking for, and so on. A pretty simple step if you ask me. Then, if the hiring manager thinks your goals and interests align with what the company seeks, you will be invited to the second, technical step. The dreaded one.
2. I call this step just technical for a reason. Some companies split it in two: a take-home assignment and then a discussion based on it. Others have the typical coding interview. And others yet just have a technical discussion. The technical discussion usually covers ML theory and some specifics, like what transfer learning is or what transformer architectures are. It might also be a pen-and-paper exercise where you can be asked to derive how PCA works. The latter is more common for research-oriented roles.
3. Most of the time, there are two technical interviews, the second being more focused on system design. Or maybe some more technical challenges and discussions. YMMV, because this is very company- and team-specific.
4. Finally, the last round of interviews is usually reserved for everything else that wasn't covered in the previous steps, usually the behavioral interview. Some companies have three rounds, combining the 3rd step with the 4th.

Now, let's dive into details.

1st interview

Pretty simple. Make sure to learn about the company, even if you were invited to interview with them. At this point, the company searching for candidates has a few objectives:

to understand how interested you are in the company/position
to check whether there are any legal constraints that need to be acknowledged, like visa status
or personal constraints, like the need to work remotely

Also, at this stage, the recruiter is assessing whether you'd be a good fit based on your career aspirations, personal opinions, and past experiences.

But don't be fooled, you can fail even at this stage, for example, if the recruiter feels you're not interested in the position or if your career plans don't align with its responsibilities.

2nd/3rd interview

As mentioned, different companies do this stage differently. I found three types. Given that we have two steps here, most companies do a mix of these three methods.

The "take-home-assignment tribe"

Take home - either an ML serving solution or EDA + modeling. No one will expect you to deliver a robust, production-ready solution for the ML serving project, nor will anyone complain that your Jupyter notebook doesn't contain a SotA ML model for a given dataset. The focus is on code quality, the presence of tests and features, ease of running the code for the former, and reproducibility and soundness of the solution for the latter.

Focus on quality over quantity. A good way to show professionalism is to follow up with clarifying questions once you receive the task. And please, read it carefully. Too often have I seen people doing it all wrong and not even bothering to check the exact constraints for the homework.

The "coding challengers"

Too much has been said about these already. One point I consider worth reiterating is how important it is to actually talk through your problem-solving process and ask clarifying questions. I would argue that this could be even more important than solving the problem. Also, don't forget about:

Asking about possible edge cases and then covering them.
Explaining the time and space complexity of your solution.
If you have the time, extra points for going through your code "debugger-style". That is, step by step, while stating the current values of all your variables.

The "technical discussionists"

Discussion with a team of engineers. It usually goes like this: Technical/ML Trivia + NotSoOptional[ML System Design] + Optional[Behavioral]. ML questions are mostly one of:

"How would you handle X scenario"
"What is Y? How does this work?"
Occasionally, for research-heavy roles - "Could you compute Z from scratch, here's a Google Doc", as a follow-up to the previous questions.

Where $$ Y \in \{\text{BatchNorm}, \text{DropOut}, \text{SkipConnections}, \text{DataAugmentation}, \text{SGD}, \text{Transformers}, \text{Attention}, \text{et al.}\} $$
$$ Z \in \{\text{PCA}, \text{Linear Regression}, \text{kNN}, \text{kMeans}\} $$
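For a sense of what a "compute Z from scratch" follow-up can look like, here's a minimal NumPy sketch for Z = PCA, done via eigendecomposition of the covariance matrix. Treat it as one possible answer under my own assumptions, not as what any specific interviewer expects; the function name `pca_from_scratch` and its return values are made up for this example.

```python
import numpy as np


def pca_from_scratch(X, n_components):
    """Project X of shape (n_samples, n_features) onto its top principal components.

    Returns the projected data, the component vectors (as columns),
    and the variance explained by each kept component.
    """
    # 1. Center the data: PCA is defined on mean-zero features.
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features, shape (n_features, n_features).
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigendecomposition; eigh is meant for symmetric matrices
    #    and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Keep the directions with the largest variance.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Project the centered data onto those directions.
    return X_centered @ components, components, eigvals[order]


# Toy data with very different variances along each axis (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])
Z_demo, components_demo, variances_demo = pca_from_scratch(X_demo, n_components=2)
print(Z_demo.shape, variances_demo)
```

In an interview, it's worth saying out loud that an SVD of the centered data gives the same components and is usually the numerically safer choice.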

Sometimes technical discussions take a more ML-System-Design flavor.

It's COVID, so system design is usually only verbal unless you can also text-draw a solution while sharing your screen. Pseudo-code also helps.
ML System Design seems not to be any different. It's still one of "Design a Search Engine for X", or "How are you going to design an X-which-is-actually-a-recommender-system".


---------   r/w  ----------    ----------   HTTP/2
|  DB   | <------| API    |<-- | NGINX  |  <-------  Client
|       |        |        |    ---------- 
---------        ----------

Example of "text-drawing" #1

                                 /-------> Users Service --> MySQL
                                /
Client w/ Browser Cache ---> Gateway -----> Posts Service  --> Cassandra x 6
                                                |                 write_to: 2
                                              Redis               read_from: 1

Example of "text-drawing" #2

Extra points for talking through efficiency/budget/business considerations at this step. For example, proposing to split the application in two, with ML logic on a GPU-enabled machine and business logic on a more conventional server. Or thinking out loud about a buy vs. build decision about some sub-component.

Some personal opinions

I prefer take-home projects + technical discussions. This combination makes for a more meaningful technical discussion. It allows the candidate to express their ideas about how a proper production system should be designed based on the take-home assignment. Plus, a good take-home project can highlight candidates' abilities to write code and how they handle logging, testing, documentation, and deployment. I would argue it's much better than just solving leetcode problems.

I even used take-home assignments to filter candidates when we were hiring for my team. I know the main cons of it, but I believe that a well-defined problem can be solved in one or two evenings, a couple of hours each. Not great, but I feel much more relaxed doing that than a 45-minute coding interview. Speaking of the devil...

I don't like coding challenges. IMO, it's usually just lazy bs. These kinds of practices can be understandable for FAANG (well, more like MANGA nowadays) companies because of their scale*. But when coding challenges are done by small companies, I mostly find it to be in bad taste.

Disclaimer *: I don't mean that at Google-scale, they need their devs to know very well how to sort an array or find 2 numbers that add up to something. I mean that they have to go through so many candidates that they need a standardized, time-efficient, and repeatable way to check their capabilities. It doesn't seem realistic for companies this big to give take-home assignments and thoroughly check these without incurring significant time and productivity losses. That's the sad reality.

To add to the mess of coding interviews, companies are actually misusing them. Coding interviews are supposed to check a candidate's problem-solving and communication skills. You need to show the interviewer what your thought process is and how you tackle a new problem. Usually, it shouldn't matter much whether the solution you implemented is optimal or not, although you do need to be aware of whether it is. Regretfully, interviewers usually just look for the "correct" answers, like it's an exam and not a discussion, making the whole experience miserable.

In theory, coding tests are even worse, because there's no way to see the candidate's thought process and the way they tackle problems. Thus, it becomes just a timed exam that has no actual value in assessing how good a candidate is. In practice, because most interviewers are no better, I would take a coding test over a coding interview almost any day of the week.

So, if I were to rank coding interviews, I would arrange them like this:

1. "Discussion" coding interview
2. Coding test with no interviewer at all
3. Exam-like coding interview, without much support from the interviewer

Of course, there are exceptions. One time, at band camp (jk), I had a fantastic experience with a no-interviewer coding challenge. It was a 3.5h HackerRank challenge, in 3 stages, for a research engineering position. The questions ranged from probability to ML model serving, numerical stability, and basic ML theory. Then, for the second stage, it was a code review exercise! I was given a piece of code and had to identify a bug and suggest an improvement. How cool is that?! The final part was an actual coding challenge to implement a graph algorithm. It was exhausting, but at least it wasn't generic, and because it was so diverse, I felt like it enabled people to show where their true strength lies.

Alright, I'll stop complaining and move on to the next section of this post.

4th interview

This one is primarily behavioral. I would say candidates are asked behavioral questions at every stage, but at this one they are the primary focus.

I really like the questions about past experiences: how something could have been done better, or, if something didn't work, why.
I feel these questions correlate more with actual skill than generic theory questions do.

A few questions that I really liked were:

If I ask your manager what's your greatest weakness, what would they tell me?
What was a situation in which you made a mistake? How would you prevent it now, with more experience?
Give me an example where you made a poor technical decision and then had to fix it. How did you do it?

Generally, any question which asks you to reflect on past mistakes is especially cool. Why? Such questions help uncover how you have grown since then, how humble you are, and how your critical thinking works.

I have no recollection of such questions in a non-senior ML interview, but plenty of those for senior/lead positions. So maybe think about such scenarios before your next interview.

Some final tips to prepare

To really nail that interview process, I like doing mock interviews. The best way to do it (that I found) is Pramp.com. It's not an advertisement, you can check the link - it has no referral code or anything. I just really find them helpful, especially for coding interviews and somewhat for system design interviews.

For ML system design, the best thing I have found so far is Chip Huyen's booklet - Machine Learning Systems Design. And of course, for generic system design - The System Design Primer.

And remember to really prepare for the behavioral interviews. Be ready to answer questions about how you failed and what you learned from it. Focus more on behavioral questions, specifically ones highlighting your leadership potential and learning-from-mistakes type of situations. For a good list of behavioral questions, see this PDF from LinkedIn.

Throughout the process, ask questions and show your interviewers that you are engaged in conversations with them and are interested in the role. Ask them about their technical and business priorities, how specific processes are implemented in the organization, and their current pain points. Here's a good list of questions you can ask.

Interested in becoming a senior engineer? You'll need both strong ML and superior soft skills to get that senior position. Also, maybe check my post Becoming a Senior Engineer, which should help you define your own roadmap.

A little disclaimer (last one in this post)

These posts have been almost done since February, but due to the tragic events unfolding in Ukraine, I thought it wouldn't be nice, to say the least, to publish them back then. In Moldova, there's a saying, "Satu' arde da baba sî chiaptănă", which translates to something like "The (unreasonable) old lady is grooming herself while the whole village burns". I didn't want to be that lady, so I thought it would be better to wait until things became at least somewhat less chaotic.

#Слава Україні! #Героям слава!

Choosing programming languages for real-world projects

Alex Burlacu — Sat, 18 Jun 2022 22:46:21 +0000

Originally published at alexandruburlacu.github.io

A few years ago, when I was in my senior year at the university, during the distributed systems lecture our professor asked us a very nice question:

If we were to choose between a fancy new programming language, or Java/C#, for a greenfield commercial project, what would we choose and why?

If you're wondering what it has to do with distributed systems, I have to say - half of it was about software architecture.

The classroom was split into 2 camps, obviously. The fun and somewhat sad fact was that the Java camp won. I was part of that camp, even though I don't like Java, to say the least. We had much better arguments. So, what were those winning arguments? Rich library and tooling ecosystem, and the relative availability of professionals in our local market, for a fair price too. Our professor deemed us project managers, not real programmers, then said we were right, and for a few seconds the atmosphere in the classroom turned sad and hopeless. Then we moved on with the lecture.

TL;DR: We all want to play with the shiniest new toys, but when money is at stake, better stick to something tried and true.

So here are some questions to keep in mind when choosing a programming language, or any software tool for that matter, for a project. The focus will be on commercial projects, but some of the tips work for research projects and simple pet projects too.

Basic level

Initially, the decision-making process is usually guided by a very narrow understanding of the consequences of choosing a specific tool. In increasing order of maturity, here are some basic reasons to make a choice:

1. I would like to learn this new tool/language/framework, people say it's hot right now
2. People say this is the best tool/language for this kind of problem
3. I know this language/tool very well and can be very productive with it
4. I and my team know this language/tool quite well and we can all be productive with it

1 and 2 are only acceptable reasons for a pet project, with a small caveat, which I'll explain later*. Although I would recommend sometimes taking a look at more niche, possibly peculiar tools to learn. Because, you know, if a language doesn't change the way you think, it's not worth learning.

4 is a decent reason, see Paul Graham's post about using LISP to build a startup, but in the long run, it's not that simple.

Higher-level decision making

The difference between programming (getting stuff done) and software engineering is that the latter has significantly harder constraints (see Software Engineering at Google). Not just any code can be developed productively by a changing team of people and maintained over time. And most commercial software isn't one-time scripts, but code that lives on for years, if not decades. That's why, when choosing a tool, language, or an entire stack, try to guide your decision-making with these questions, in no particular order:

How well documented is this tool/language?
How actively used/developed is it?
How many dependencies of any sort does it have?
How stable is this tool/language?
What is the size and quality of the ecosystem for this tool/language?
How productive can someone be using this tool/language?

More constraints, but doable.

Business-level decision-making

Now we've reached the final frontier. Until now, it wasn't particularly hard to make a choice, you just had to do your research. But now, we're gonna have to enter the realm of never-ending trade-offs. Keep in mind that software is written by people, whom you have to employ and pay salaries to, ideally with a positive return on investment.

How easy is it to teach someone, or how much time does it take to make someone productive with the given tool/language?
How large is the reachable supply of professionals for this tool/language? Is it sufficient for you?
How much do professionals who are knowledgeable with this tool/language ask for (money, perks, whatever)?
What is the quality of the supply? Are the engineers mostly newbies or seasoned professionals?
How many people would like to work with the chosen tool/language? How excited are they?

The raw performance of a tool or language is rarely a big issue. Some domains do care about that characteristic, like scientific computing, low-latency systems, and maybe embedded systems. More recently, how energy-efficient, or "green", a language or tool is has become more important. Yes, I'm not kidding. For example, Amazon cares about such things, although like all things at this level, it's not so simple.

An example of picking a language

Let's do a "demo". We will assume that we're a remote-first startup and we want to build ~~a snowman~~ a serverless platform. How do we pick the programming stack? Well, at least the programming language. We will assume that the technical founders are capable of writing any language. No, they are not spherical.

An important technical constraint for our project is that serverless technology is especially effective when the startup time of a serverless function is quick. If it's not, why bother? Optionally, we might want to dive into serverless edge computing, meaning we need a programming language that can work even on resource-constrained devices. Maybe not microcontrollers, but something like a newer Raspberry Pi shouldn't be considered unrealistic.
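If you want a rough number to back the startup-time argument, here's a small sketch that times how long it takes to launch a process and wait for it to exit. Big caveat: this measures plain OS process startup for a no-op program, which is only a crude proxy for a real serverless cold start (no container, no network, no dependencies). The helper name `cold_start_seconds` and the choice of five runs are my own assumptions.

```python
import statistics
import subprocess
import sys
import time


def cold_start_seconds(command, runs=5):
    """Median wall-clock time, in seconds, to launch `command` and wait for it to exit."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,                  # fail loudly if the command errors out
            stdout=subprocess.DEVNULL,   # we only care about the timing
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # A no-op program for the interpreter running this script.
    python_noop = [sys.executable, "-c", "pass"]
    print(f"python no-op: {cold_start_seconds(python_noop) * 1000:.1f} ms")
```

To compare runtimes, pass the equivalent "do nothing" command for each one you have installed, and take the numbers as relative, not absolute.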

We are also budget-constrained because we're a startup. We need to execute fast, or else we might not reach escape velocity, and no one will bother.

With that said, let's prune some candidates. Because of our startup latency constraint, we can't afford to run anything which needs a VM-like runtime. So no Java, no C#, and not even Erlang or Elixir. Although Erlang and Elixir have less substantial problems with VM cold start, they have the other downside of a smaller talent pool. On yet another hand, this talent pool is usually very enthusiastic and professional. What a shame we're not building a messaging system.

Language	Verdict	Talent Pool Size	Tooling	Excitement Factor	Startup Latency
Java	No	Very Large	Very Good	Can we go lower?	Half of Java jokes are about this
C#	No	Large	Very Good	A bit better than Java	A bit better than Java
Elixir/Erlang	No	Small	Good	Almost through the roof	Good, for a VM-based language

If we are planning for maximum efficiency, maybe we should use C++? Definitely no. C++ is quite dangerous. Besides, we need to keep in mind that we want to develop fast and preferably without much risk of segmentation faults, resource leaks, and other C++ surprises. Btw, a good C++ dev is quite expensive and hard to find nowadays.

Language	Verdict	Talent Pool Size	Tooling	Excitement Factor	Startup Latency
...	...	...	...	...	...
C++	No	Moderate	Moderate, hard to use IMO	Depends on what kind of person you are	Sonic the hedgehog approves

We know that development speed is important. But we also want a performant language without VM cold start problems. How about Python, or JS? These are popular and fast to work with, have a considerable talent pool, and JS can be speedy too. To be fair, this wouldn't be the worst idea. Python, specifically CPython, can be slow, but with the right tooling, or by substituting it with PyPy, we can solve these problems. As for JS, one issue is that the language is not the most pleasant to debug, with its unholy trinity of no-values and subpar traceback messages. Regretfully, there are lots of not-so-good devs out there using these tools, so that's an issue. Finally, these are not the best systems programming languages.

Language	Verdict	Talent Pool Size	Tooling	Excitement Factor	Startup Latency
...	...	...	...	...	...
JS	Maybe/No	Very Large	Good	Depends on what flavor you are using	Good
Python (CPython)	Maybe/No	Very Large	Good	It will be a bummer that it's not used for DS/ML/AI	Good
Python (PyPy)	Maybe/Yes	Very Large (but there's a catch)	Good	If you know, you know	Good, and it's very fast overall

Ok, so I said it, systems programming languages. And we dropped C++. What do we have left? Go, Rust, Crystal. We drop Crystal right away due to the lack of a sizeable community, talent pool, and libraries. So, it's Go vs Rust? Hold on, there's another contestant - OCaml. So, why did it come to these 3 languages? All of these are very suitable for systems programming, that is, interacting with lower-level OS constructs, are quite efficient at working closer to hardware, and in general, are fast and resource-efficient. Of all 3, Go is the most mainstream, so it's a plus. Also, it's easy to onboard people to use it. On the other hand, Rust and OCaml provide nicer guarantees for the programs you write, and although less popular, the quality of developers using them is usually pretty high. OCaml and Rust are pretty close idiomatically, but Rust syntax will be much more familiar to non-hardcore FP people, aka common folk, so it's probably 10 points to Rust. All in all, let's see the final table.

Language	Verdict	Talent Pool Size	Tooling	Excitement Factor	Startup Latency
Java	No	Very Large	Very Good	Can we go lower?	Half of Java jokes are about this
C#	No	Large	Very Good	A bit better than Java	A bit better than Java
Elixir/Erlang	No	Small	Good	Almost through the roof	Good, for a VM-based language
C++	No	Moderate	Moderate, hard to use IMO	Depends on what kind of person you are	Sonic the hedgehog approves
JS	Maybe/No	Very Large	Good	Depends on what flavor you are using	Good
Python (CPython)	Maybe/No	Very Large	Good	It will be a bummer that it's not used for DS/ML/AI	Good
Python (PyPy)	Maybe/Yes	Very Large (but there's a catch)	Good	If you know, you know	Good, and it's very fast overall
Crystal	No	Very Small	So-so	If you know, you know v2	Very Good, and it's blazing fast overall
Rust	Maybe/Strong Yes	Small-Moderate	Moderate	Almost through the roof	Very good, and it's very fast overall
Go	Yes	Large	Good	Pretty good	Good, and it's very fast overall
OCaml	Maybe/Yes	Small	Moderate	Almost through the roof, but only for FP geeks	Very good, and it's very fast overall

All things considered, probably the safest choice would be to use Go. And the next best thing would be Rust. A very good option would be PyPy, IMO. It's an almost one-to-one equivalent of CPython, but considerably faster. If you like it more hardcore FP, you could try OCaml. You could in fact go polyglot and pick 2 languages, but don't escalate to more than that. There's a reason most full-stack engineers are writing JS-only.

*Time to discuss that caveat.

Yes, picking a risky tool only because it's hot or seems interesting will rarely be a good idea, except when it is. You see, a tool is usually "hot" for a reason. Maybe it's solving a common pain in the industry, and does so elegantly. Or maybe it boosts productivity, efficiency, or the long-term maintainability of a system. Still, this isn't enough to make such a risky move.

On the other hand, there's an interesting aspect here. If a tool is hot people will want to work with it. This phenomenon boosts the desire to work for your team/business because you're using this New Hot Thing ©. Combined with the intrinsic qualities of the new tool, it might make sense to actually give it a try. It is just as risky to never take a risk. Failing to grow and innovate will leave your business hard to hire for, your talent pool shrinking, and your operational efficiency slowly dying.

Follow sage's advice 😏 Made with: imgflip.com

A substitute for a conclusion

I hope I haven't fried your brains with this many things to consider. Even I sometimes don't do the whole process, or am being sloppy when assessing some of the aspects. Still, having a checklist of things to consider is always a good thing, so I hope you'll benefit from this.

Maybe a bit anti-climactic, but consider this - if you picked the wrong tool, it will rarely doom your project to failure. What will is not realizing you made a bad choice, or realizing it and not trying to fix it. Technical stacks are problems which can be fixed with money, and that's a good thing.

Not the ending you expected? 😏

P.S.

I should add a clarification about Java. Don't get me wrong - I don't "hate" Java, I just like pointing to its flaws, sometimes vehemently 😀. Java's unnecessary verbosity is the main issue that I have with it. It wasn't the only issue, but with the sped-up release cycle and a lot of ideas borrowed from other languages and communities, it's becoming a better language. Brilliant engineers use Java for many important, actively developed projects with no plans to retire or rewrite these. Ergo, it can't be an objectively "bad" language.

K-Means tricks for fun and profit

Alex Burlacu — Wed, 23 Jun 2021 19:49:14 +0000

This will be a pretty small post, but an interesting one nevertheless.

Originally published at https://alexandruburlacu.github.io/posts/2021-06-18-kmeans-trick

K-Means is an elegant algorithm. It's easy to understand (make random points, move them iteratively to become centers of some existing clusters) and works well in practice. When I first learned about it, I recall being fascinated. But then, in time, the interest faded away as I noticed numerous limitations, among which are the spherical cluster prior, its linear nature, and, what I found especially annoying in EDA scenarios, the fact that it doesn't find the optimal number of clusters by itself, so you need to tinker with this parameter too. And then, a couple of years ago, I found out about a few neat tricks on how to use K-Means. So here it goes.
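To make the parenthetical above concrete, here's a minimal NumPy sketch of the classic K-Means loop (Lloyd's algorithm). It's an illustration only, not what scikit-learn does internally. The name `kmeans_sketch` and its parameters are made up for this example, and as a simplification the initial centers are sampled from the data points instead of being fully random.

```python
import numpy as np


def assign_clusters(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for every row of X."""
    # distances has shape (n_samples, n_clusters)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Return (centroids, labels) after at most n_iter update rounds."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = assign_clusters(X, centroids)
        # Move every centroid to the mean of the points assigned to it;
        # a centroid that lost all its points stays where it was.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centroids, centroids):
            break  # nothing moved, so we have converged
        centroids = new_centroids
    return centroids, assign_clusters(X, centroids)
```

Notice that the number of clusters is an input and that assignment is purely by distance to a center, which is where the limitations listed above come from.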

The first trick

First, we need to establish a baseline. I'll use mostly the breast cancer dataset, but you can play around with any other dataset.

from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

import numpy as np

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

svm = LinearSVC(random_state=17)
svm.fit(X_train, y_train)
svm.score(X_test, y_test) # should be ~0.93

So, what's this neat trick that reignited my interest in K-Means?

K-Means can be used as a source of new features.

How, you might ask? Well, K-Means is a clustering algorithm, right? You can add the inferred cluster as a new categorical feature.

Now, let's try this.

# imports from the example above

svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=3, random_state=17)
X_clusters = kmeans.fit_predict(X_train).reshape(-1, 1)

svm.fit(np.hstack([X_train, X_clusters]), y_train)
svm.score(np.hstack([X_test, kmeans.predict(X_test).reshape(-1, 1)]), y_test) # should be ~0.937

Source: knowyourmeme.com
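A side note on the encoding: in the snippet above, the cluster id is appended as a single integer column, so a linear model will read it as an ordered quantity (cluster 2 being "more" than cluster 0). One common alternative is to one-hot encode the id. Here's a minimal NumPy sketch; the helper name `one_hot` and the example labels are made up, and I'm not claiming it improves the score on this particular dataset.

```python
import numpy as np


def one_hot(labels, n_clusters):
    """Turn integer cluster ids of shape (n_samples,) into 0/1 indicator columns."""
    labels = np.asarray(labels).ravel()
    # Row i of the identity matrix is the indicator vector for cluster i.
    return np.eye(n_clusters)[labels]


# Hypothetical cluster ids, standing in for the output of kmeans.predict(...)
example_labels = np.array([0, 2, 1, 2])
print(one_hot(example_labels, n_clusters=3))
```

With the objects from the example above, that would be `np.hstack([X_train, one_hot(kmeans.predict(X_train), 3)])`, and the same for the test set.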

These features are categorical, but we can ask the model to output distances to all the centroids, thus obtaining (hopefully) more informative features.

# imports from the example above

svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=3, random_state=17)
X_clusters = kmeans.fit_transform(X_train)
#                       ^^^^^^^^^
#                       Notice the `transform` instead of `predict`
# Scikit-learn supports this method as early as version 0.15

svm.fit(np.hstack([X_train, X_clusters]), y_train)
svm.score(np.hstack([X_test, kmeans.transform(X_test)]), y_test) # should be ~0.727

Wait, what's wrong? Could it be that there's a correlation between existing features and the distances to the centroids?

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

columns = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension',
       'distance to cluster 1', 'distance to cluster 2', 'distance to cluster 3']
data = pd.DataFrame.from_records(np.hstack([X_train, X_clusters]), columns=columns)
sns.heatmap(data.corr())
plt.xticks(rotation=-45)
plt.show()

Notice the last 3 columns, especially the last one, and their color on every row.

You've probably heard that we want the features in a dataset to be as independent as possible. The reason is that a lot of machine learning models assume this independence to keep the algorithm simpler. Some more info on this topic can be found here and here, but the gist of it is that redundant information destabilizes linear models, which in turn makes them more likely to mess up. I've noticed this problem on numerous occasions, sometimes even with non-linear models, and purging the dataset of correlated features usually gives a slight increase in the model's performance.
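To make "purging correlated features" a bit more concrete, here is a small pure-Python sketch. The `pearson` and `drop_correlated` helpers, the 0.95 threshold, and the toy columns are my own assumptions, not something from a library, and constant columns (zero variance) are not handled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if it is not strongly correlated with one already kept."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return list(kept)


columns = {
    "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [6.3, 12.6, 18.8, 25.1, 31.4],  # roughly 2 * pi * radius, so redundant
    "texture": [3.0, 1.0, 4.0, 1.0, 5.0],
}
kept_columns = drop_correlated(columns)
print(kept_columns)
```

Which of two correlated columns survives depends only on their order here; in a real project you would probably want a smarter rule.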

Back to our main topic. Given that our new features are indeed correlated with some of the existing ones, what if we use only the distances to the cluster means as features? Will it work then?

# imports from the example above

svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=3, random_state=17)
X_clusters = kmeans.fit_transform(X_train)

svm.fit(X_clusters, y_train)
svm.score(kmeans.transform(X_test), y_test) # should be ~0.951

Much better. With this example, you can see that we can use KMeans as a way to do dimensionality reduction. Neat.

So far so good. But the pièce de résistance is yet to be shown.

The second trick

K-Means can be used as a substitute for the kernel trick

You heard me right. You can, for example, define more centroids for the K-Means algorithm to fit than there are features, much more.

# imports from the example above

svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=250, random_state=17)
X_clusters = kmeans.fit_transform(X_train)

svm.fit(X_clusters, y_train)
svm.score(kmeans.transform(X_test), y_test) # should be ~0.944

Well, not as good, but pretty decent. In practice, the greatest benefit of this approach shows up when you have a lot of data. Also, predictive performance-wise, your mileage may vary. I, for one, have run this method with n_clusters=1000, and it worked better than with only a few clusters.
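Why would distances to many centroids behave like a kernel at all? A toy sketch may help. XOR is the classic problem a linear model can't solve in the original features, yet it becomes linearly separable once every point is described by its distances to four centroids. Everything here is an assumption for illustration: the centroids are placed by hand (pretending K-Means found them) and the weights are hand-picked, not learned.

```python
import math

# XOR: no straight line in the original 2 features separates the classes.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [0, 0, 1, 1]

# Assumption: pretend K-Means with 4 clusters put one centroid on each point.
centroids = list(X)


def distance_features(point, centroids):
    """Same idea as KMeans.transform: one distance per centroid."""
    return [math.dist(point, c) for c in centroids]


# Hand-picked (not learned) weights for a linear decision rule.
weights = [1.0, 1.0, -1.0, -1.0]


def predict(point):
    features = distance_features(point, centroids)
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0 else 0


predictions = [predict(p) for p in X]
print(predictions)
```

So 2 features became 4, and in that bigger space a plain linear rule is enough. That's the same spirit as the kernel trick, just done explicitly.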

SVMs are known to be slow to train on big datasets. Impossibly slow. Been there, done that. That's why, for example, there are numerous techniques to approximate the kernel trick with far fewer computational resources.

By the way, let's compare how this K-Means trick will do against classic SVM and some alternative kernel approximation methods.

The code below is inspired by these two scikit-learn examples.

import matplotlib.pyplot as plt
import numpy as np
from time import time

from sklearn.datasets import load_breast_cancer
from sklearn.svm import LinearSVC, SVC
from sklearn import pipeline
from sklearn.kernel_approximation import RBFSampler, Nystroem, PolynomialCountSketch
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.cluster import MiniBatchKMeans


mm = pipeline.make_pipeline(MinMaxScaler(), Normalizer())

X, y = load_breast_cancer(return_X_y=True)
X = mm.fit_transform(X)

data_train, data_test, targets_train, targets_test = train_test_split(X, y, random_state=17)

We will test 3 methods for kernel approximation available in the scikit-learn package, against the K-Means trick, and as baselines, we will have a linear SVM and an SVM that uses the kernel trick.

# Create a classifier: a support vector classifier
kernel_svm = SVC(gamma=.2, random_state=17)
linear_svm = LinearSVC(random_state=17)

# create pipeline from kernel approximation and linear svm
feature_map_fourier = RBFSampler(gamma=.2, random_state=17)
feature_map_nystroem = Nystroem(gamma=.2, random_state=17)
feature_map_poly_cm = PolynomialCountSketch(degree=4, random_state=17)
feature_map_kmeans = MiniBatchKMeans(random_state=17)
fourier_approx_svm = pipeline.Pipeline([("feature_map", feature_map_fourier),
                                        ("svm", LinearSVC(random_state=17))])

nystroem_approx_svm = pipeline.Pipeline([("feature_map", feature_map_nystroem),
                                        ("svm", LinearSVC(random_state=17))])

poly_cm_approx_svm = pipeline.Pipeline([("feature_map", feature_map_poly_cm),
                                        ("svm", LinearSVC(random_state=17))])

kmeans_approx_svm = pipeline.Pipeline([("feature_map", feature_map_kmeans),
                                        ("svm", LinearSVC(random_state=17))])

Let's collect the timing and score results for each of our configurations.

# fit and predict using linear and kernel svm:
kernel_svm_time = time()
kernel_svm.fit(data_train, targets_train)
kernel_svm_score = kernel_svm.score(data_test, targets_test)
kernel_svm_time = time() - kernel_svm_time

linear_svm_time = time()
linear_svm.fit(data_train, targets_train)
linear_svm_score = linear_svm.score(data_test, targets_test)
linear_svm_time = time() - linear_svm_time

sample_sizes = 30 * np.arange(1, 10)
fourier_scores = []
nystroem_scores = []
poly_cm_scores = []
kmeans_scores = []

fourier_times = []
nystroem_times = []
poly_cm_times = []
kmeans_times = []

for D in sample_sizes:
    fourier_approx_svm.set_params(feature_map__n_components=D)
    nystroem_approx_svm.set_params(feature_map__n_components=D)
    poly_cm_approx_svm.set_params(feature_map__n_components=D)
    kmeans_approx_svm.set_params(feature_map__n_clusters=D)
    start = time()
    nystroem_approx_svm.fit(data_train, targets_train)
    nystroem_times.append(time() - start)

    start = time()
    fourier_approx_svm.fit(data_train, targets_train)
    fourier_times.append(time() - start)

    start = time()
    poly_cm_approx_svm.fit(data_train, targets_train)
    poly_cm_times.append(time() - start)

    start = time()
    kmeans_approx_svm.fit(data_train, targets_train)
    kmeans_times.append(time() - start)

    fourier_score = fourier_approx_svm.score(data_test, targets_test)
    fourier_scores.append(fourier_score)
    nystroem_score = nystroem_approx_svm.score(data_test, targets_test)
    nystroem_scores.append(nystroem_score)
    poly_cm_score = poly_cm_approx_svm.score(data_test, targets_test)
    poly_cm_scores.append(poly_cm_score)
    kmeans_score = kmeans_approx_svm.score(data_test, targets_test)
    kmeans_scores.append(kmeans_score)

Now let's plot all the collected results.

plt.figure(figsize=(16, 4))
accuracy = plt.subplot(211)
timescale = plt.subplot(212)

accuracy.plot(sample_sizes, nystroem_scores, label="Nystroem approx. kernel")
timescale.plot(sample_sizes, nystroem_times, '--',
               label='Nystroem approx. kernel')

accuracy.plot(sample_sizes, fourier_scores, label="Fourier approx. kernel")
timescale.plot(sample_sizes, fourier_times, '--',
               label='Fourier approx. kernel')

accuracy.plot(sample_sizes, poly_cm_scores, label="Polynomial Count Sketch approx. kernel")
timescale.plot(sample_sizes, poly_cm_times, '--',
               label='Polynomial Count Sketch approx. kernel')

accuracy.plot(sample_sizes, kmeans_scores, label="K-Means approx. kernel")
timescale.plot(sample_sizes, kmeans_times, '--',
               label='K-Means approx. kernel')

# horizontal lines for exact rbf and linear kernels:
accuracy.plot([sample_sizes[0], sample_sizes[-1]],
              [linear_svm_score, linear_svm_score], label="linear svm")
timescale.plot([sample_sizes[0], sample_sizes[-1]],
               [linear_svm_time, linear_svm_time], '--', label='linear svm')

accuracy.plot([sample_sizes[0], sample_sizes[-1]],
              [kernel_svm_score, kernel_svm_score], label="rbf svm")
timescale.plot([sample_sizes[0], sample_sizes[-1]],
               [kernel_svm_time, kernel_svm_time], '--', label='rbf svm')

And some more plot adjustments, to make it pretty.

# legends and labels
accuracy.set_title("Classification accuracy")
timescale.set_title("Training times")
accuracy.set_xlim(sample_sizes[0], sample_sizes[-1])
accuracy.set_xticks(())
accuracy.set_ylim(np.min(fourier_scores), 1)
timescale.set_xlabel("Sampling steps = transformed feature dimension")
accuracy.set_ylabel("Classification accuracy")
timescale.set_ylabel("Training time in seconds")
accuracy.legend(loc='best')
timescale.legend(loc='best')
plt.tight_layout()
plt.show()

Meh. So was it all for nothing?

You know what? Not in the slightest. Even if it's the slowest, K-Means as an approximation of the RBF kernel is still a good option. I'm not kidding. Scikit-learn has a special kind of K-Means called MiniBatchKMeans, which is one of the few algorithms that support the .partial_fit method. Combine it with a machine learning model that has .partial_fit too, like a PassiveAggressiveClassifier, and you can create a pretty interesting solution.

Note that the beauty of .partial_fit is twofold. First, it makes it possible to train algorithms in an out-of-core fashion, which is to say, with more data than fits in the RAM. Second, depending on your type of problem, if in principle (very-very in principle) you never need to switch the model, it can be trained further right where it is deployed. That's called online learning, and it's super interesting. Something like this is what some Chinese companies are doing, and in general it can be pretty useful for AdTech, because you can find out within seconds whether your ad recommendation was right or wrong.

You know what, here's a little example of this approach for out-of-core learning.

from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import PassiveAggressiveClassifier

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

import numpy as np

def batch(iterable, n=1):
    # source: https://stackoverflow.com/a/8290508/5428334
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)

kmeans = MiniBatchKMeans(n_clusters=100, random_state=17) # K-Means has a constraint, n_clusters <= n_samples to fit
pac = PassiveAggressiveClassifier(random_state=17)

for x_batch, y_batch in zip(batch(X_train, n=100), batch(y_train, n=100)):
    kmeans.partial_fit(x_batch)          # fit K-Means a bit (it is unsupervised, labels are not needed)
    x_dist = kmeans.transform(x_batch)   # obtain distances to the centroids
    pac.partial_fit(x_dist, y_batch, classes=[0, 1])  # train the classifier a bit, the classes must be given up front
    print(pac.score(kmeans.transform(X_test), y_test))

# 0.909 after 100 samples
# 0.951 after 200 samples
# 0.951 after 300 samples
# 0.944 after 400 samples
# 0.902 after 426 samples


# VS
kmeans = MiniBatchKMeans(n_clusters=100, random_state=17)
pac = PassiveAggressiveClassifier(random_state=17)

pac.fit(kmeans.fit_transform(X_train), y_train)
pac.score(kmeans.transform(X_test), y_test)
# should be ~0.951

Epilogue

So you've made it to the end. I hope your ML toolset is now richer. Maybe you've heard about the so-called "no free lunch" theorem; basically, there's no silver bullet, in this case for ML problems. Maybe the methods outlined in this post won't work for your next project, but they will for the one after that. So just experiment, and see for yourself. And if you need an online learning algorithm/method, well, there's a bigger chance that K-Means as a kernel approximation is the right tool for you.

By the way, there's another blog post, also on ML, in the works now. What's even nicer, among many other nice things in it, it describes a rather interesting way to use K-Means. But no spoilers for now. Stay tuned.

Finally, if you’re reading this, thank you! If you want to leave some feedback or just have a question, you've got quite a menu of options (see the footer of this page for contacts + you have the Disqus comment section).


Acknowledgements

Special thanks to @dgaponcic for style checks and content review, and thank you @anisoara_ionela for grammar checking this article more thoroughly than any AI ever could. You're the best <3

P.S. I believe you noticed all these random_states in the code. If you're wondering why I added them, it's to make the code samples reproducible. Tutorials frequently don't do this, which leaves room for cherry-picking: the author presents only the best results, and the reader trying to replicate them either can't or spends a lot of time on it. But know this: you can play around with the values of random_state and get wildly different results. For example, when running the snippet with the original features plus distances to the 3 centroids, the one with a 0.727 score, with a random seed of 41 instead of 17, you can get an accuracy score of 0.944. So yeah, random_state, or whatever the random seed is called in your framework of choice, is an important aspect to keep in mind, especially when doing research.

Logging, Tracing, Monitoring, et al.

Alex Burlacu — Fri, 21 May 2021 19:55:03 +0000

So, you want to launch your code/app/system in production?

Wait, before you do, ask yourself this question: If something goes south, how will I know what exactly happened?

A good question, indeed.

A more seasoned engineer might say: I will use logs!!! But what if I tell you that logs are only the beginning?

[Disclaimer Time] This article is not about some concrete technology, framework, or library, although it references some of these. It's more of an overview/tips about what logging/tracing/et al. are and how to approach these when designing and operating software systems. The information here is based mostly on my own experience, but also on information available in papers and industry blog posts. You might need to google some stuff while/after reading it, especially if you've never operated a system running in production.

[Disclaimer Time #2 ] This post was originally published at alexandruburlacu.github.io

Act 1: I'll set up logs, alright...

So, what exactly is a log?

Technically, this is a log, but I want to talk about other kinds of logs.

Logs are a record about some event in a system

Pretty abstract, huh? A log is like an entry in a journal about something that happened, maybe with some context. Somewhat like the Twitter feed of an Apple-reporter during the WWDC event. You have time, you have a record of something that just happened, and maybe you have context too. Now, jokes aside, logs are necessary for a system running in production. They help you uncover what was happening moments before applications crash. Or malicious activity. Or other stuff. But how do we make good logs?

Tenets of a good log message

So, how should we design our logs? Here are some tenets:

Thy logs must be hierarchical: we need to respect the distinction between DEBUG/INFO/WARNING/ERROR and possibly other levels. We should not crowd the system with WARNING logs when INFO or DEBUG logs are more appropriate. Crowding also refers to how much information a log contains. That said, a good idea for an ERROR log is to register as much information as possible to aid in debugging. Use DEBUG-level logs to register information about what setting the program is using, even how much time or resources some subroutine is using, but don't abuse this. As for INFO logs, anything in between. Like information about a call to a top-level route handler in an HTTP server. Also, INFO logs are the right way to use prints in a system.
Thy logs must be informative: A good rule of thumb is to log everything that might help you debug your system. If an error happens, you will want to log the traceback. Also, logging the context in which the error happened will prove to be useful. By context, I mean some surrounding variables, which might have something to do with the failure. If your system is running with multiple processes or is multithreaded, or multi-whatever, do yourself a favor and log the PIDs/Thread IDs. Finally, be very careful with how you represent time. Explaining why would require an entire blog post, but time in computer systems is a pain, see for yourself.

ERROR: Error name, message, traceback, variables in scope if possible
WARNING: Warning name, message
INFO: Calls to top-level functions/handlers, like: [2021-05-17 00:06:23] INFO: GET /posts 200 OK
DEBUG: Program setup/initialization info, possibly memory or performance information*

*: more on that later

Thy logs must be filterable: logs are meant to be analyzed. Make them as searchable as possible. Consider formatting them as JSON documents, and don’t abuse nesting.

Why not? If the JSON is too nested, it becomes hard to search/analyze, defying its purpose.

For example, Elasticsearch can't properly index JSONs with two or more levels of nesting. That is, something like the example below can be indexed:

{"timestamp": "2021-05-18T21:09:54Z", "level": "error", "msg": "bad thing happened"}

Even something like this:

{"timestamp": {"date": "17th May, 2021", "time": "11:30:30am"}, "level": "error", "msg": "bad thing happened"}

But do something like this:

{"timestamp": {
    "date": "17th May, 2021",
    "time": [11, 30, 30, 124]
    },
 "level": "error",
 "msg": "bad thing happened",
 "context": {
    "some_key_for_multiple_values": []
    }
}

And Elastic will treat your deeply nested elements like strings, and then good luck filtering and aggregating these logs. So keep it flat, whenever possible.
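If your logs already come out nested, one way to "keep it flat" is to squash them into dotted keys before shipping. A minimal sketch follows; the `flatten` helper and the sample record are hypothetical, and lists are left untouched.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


nested = {
    "timestamp": {"date": "2021-05-17", "time": "11:30:30"},
    "level": "error",
    "msg": "bad thing happened",
    "context": {"request": {"path": "/posts"}},
}
flat = flatten(nested)
print(flat)
```

Every value now lives at the top level under a predictable name, which is much friendlier to filtering and aggregation.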

Another good format is the NCSA Common log format, but if possible, choose JSON. Why? Most log analysis tools use JSON. Something like the NCSA Common log format is better for smaller systems, where you can search your logs with grep and friends. Finally, whatever format you choose, be consistent across your whole system.

Bad log (1): [2021-05-17 12:30:30] ERROR: KeyError // JSON version would be just as bad
Bad log (2): {"datetime": {"date": "17th May, 2021", "time": "11:30:30am"}, "type": "ERROR", "msg": "A KeyError error occured in function some_function"}
Better log: {"timestamp": "2021-05-18T21:09:54Z", "level": "error", "pid": 1201, "traceback": <your traceback as a string>, "msg": "KeyError: 'key_name'"}
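Here is one possible way to produce something like the "better log" with Python's standard logging module. Treat it as a sketch: the `JsonFormatter` class and the logger name are made up for this example, and the field names simply mirror the ones used above.

```python
import json
import logging
import os
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one flat JSON object per line."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "pid": record.process,
            "thread": record.thread,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback as a single string so it stays searchable.
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("json_demo")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

try:
    {}["key_name"]
except KeyError:
    logger.exception("lookup failed")  # ERROR level, traceback attached
```

The timestamp is always UTC in ISO 8601 form, which sidesteps at least some of the time-related pain mentioned earlier.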

Some wisdom on logging ops

So you have well-written logs. That's great!!

But now you have to decide how to access and analyze them. Funny thing, these decisions should also be guided by the stage and the scale of your system. In other words, I would advise against a complex infrastructure if you have one app serving a few hundred people.

Now we should dive into details.

You will roughly have three stages.

Log collection/shipment
Log storage
Log processing/analytics

First, log collection. We want to save our logs somewhere and not just let them print to stderr/stdout. So, now we have to think about where to write them. It could be a file, or Syslog, for example, or we could even write them into a TCP or UDP socket, sending them away to some logging server. To be honest, all choices are somewhat good. As long as you don't block the thread where the action happens, you should be fine. Otherwise, prepare for a performance hit.
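One way to keep slow log I/O off the thread "where the action happens" is the queue-based pair from Python's standard library. This is a minimal sketch with a temporary file standing in for the real destination; the logger name is made up.

```python
import logging
import logging.handlers
import os
import queue
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "app.log")  # stand-in for a real log location
log_queue = queue.Queue()

# The application thread only puts records on an in-memory queue (cheap).
logger = logging.getLogger("queue_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(log_queue))

# A background thread drains the queue and does the slow I/O
# (a file here; a socket handler would be wired the same way).
file_handler = logging.FileHandler(log_path)
file_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

logger.info("GET /posts 200 OK")

listener.stop()  # process whatever is still queued before exiting
file_handler.close()

with open(log_path) as f:
    log_text = f.read()
print(log_text)
```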

Regarding storage, for a simple app leaving them in file format should work for a while, but eventually, a storage solution with indexing support or really anything that can help you quickly search your logs will be advised.

Once you have multiple services, you can think of a centralized logging server, something like an ELK (Elasticsearch, Logstash, Kibana) cluster, with one or a few Elastic instances in a cluster setup.

So here comes my personal opinion: you should start by logging into a file, and you absolutely must set up log file rotation, because you don't want a single 10GB text file. Believe me... you don't. At some point, you will also have to think of log compression and possibly log shipping. Log shipping means transferring the logs from where they were created to where they will be analyzed and stored for a long time.
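In Python, file rotation comes with the standard library. A small sketch, with deliberately tiny limits so the rotation is visible; the temp directory and the logger name are stand-ins.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for a real log directory
log_path = os.path.join(log_dir, "app.log")

# Roll over before the file exceeds ~1 KB and keep at most 3 old files
# (tiny numbers chosen only to make the rotation visible).
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=1024, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("rotation_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)

for i in range(200):
    logger.info("handled request number %d", i)

handler.close()
log_files = sorted(os.listdir(log_dir))
print(log_files)
```

Older entries beyond the last 3 backups are simply discarded, so pick `maxBytes` and `backupCount` with your retention needs in mind. There is also a time-based sibling, `TimedRotatingFileHandler`, if you prefer daily files.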

When it comes to log shipping, I would strongly suggest using TCP or HTTP over UDP and other protocols. Why, you may ask? First of all, with UDP you might lose logs in transit, because there is (1) no retransmission of lost packets and (2) no flow control, which can itself cause packet loss. On top of that, a UDP message is limited to roughly 65KB of data, or even less depending on network settings, which quite frankly might not be nearly enough. Also, your company firewalls might block this kind of traffic. So, a lot of trouble.

With a centralized logging solution, you will now absolutely need to ship the logs, and writing them to a file first will prove to be a very nice idea, because your logs won't be lost in case of network outages, server failures, logging system failures, or any of the above being too slow.

Nice.

Act 1.1: Hey, I think I can make a chatbot to notify me when something blows up

Yup, you can. And if you want to reduce MTTR you most likely should. Just take into account a few things.

First and foremost, if you have the possibility, set up alerting thresholds. You don't want to be notified when something is even slightly off every. single. time. Maybe it's some unique (non-critical) event, no need to bother, while if the issue happens frequently, you better be notified.
Another consideration, when it comes to alerting, is the possibility to have escalation alerting. First, send an alert via email. If no action was taken, now send it to a chat group of the responsible team. Still no activity? Send it in DM to an engineer, or even to a technical manager.
Finally, just aggregate the stuff, no need for 12, or a hundred, emails/Slack messages of the same issue. Something like one log message and then some text like X occurred 25 times in the last Y seconds should be good.
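The threshold and aggregation ideas above can be sketched in a few lines. This is a toy, in-memory version (escalation is left out); the `AlertAggregator` class and its parameters are my own invention, not an API of Sentry or Kibana.

```python
from collections import defaultdict, deque


class AlertAggregator:
    """Alert only when an issue repeats `threshold` times within `window` seconds."""

    def __init__(self, threshold=5, window=60.0):
        self.threshold = threshold
        self.window = window
        self.events = defaultdict(deque)  # issue key -> recent timestamps

    def record(self, key, now):
        """Register one occurrence; return an alert message or None."""
        stamps = self.events[key]
        stamps.append(now)
        # Forget occurrences that fell out of the time window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        if len(stamps) >= self.threshold:
            message = f"{key} occurred {len(stamps)} times in the last {self.window:.0f} seconds"
            stamps.clear()  # start counting again so the alert is not repeated on every event
            return message
        return None


aggregator = AlertAggregator(threshold=3, window=10.0)
timestamps = (0.0, 1.0, 30.0, 31.0, 32.0)
alerts = [aggregator.record("KeyError in some_function", now=t) for t in timestamps]
print(alerts)
```

The first two errors are too far from the rest to count, so only the burst of three close together produces a single message.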

When it comes to what tools to use for alerting, well, you have Sentry. To my knowledge, it is also possible to set up alerting in Kibana, although I don't know whether this is a paid option or free, and there are of course other tools.

This is by no means a definitive guide on how to do it, only some things to keep in mind. This whole blog post isn't a definitive guide if you haven't noticed yet.

Act 2: My system is slow, I guess I'll log execution time, and # of requests, and ...

... just. Stop. Please. The fact that you can do it, doesn't mean you should. Welcome to the world of telemetry and performance monitoring, where you will initially wonder, why not just use logs? I mean, in principle you could do this, but better to have a different infrastructure, to not mess everything up.

Mess up how? Well, if you're like me, you might want to set up performance monitoring not just at the route controller level, to see how long requests take to be handled and responded to (assuming a hypothetical server). You will also want to track how much time queries to the database take to execute, even functions! And now you have a ton of very fine-grained info, which will for sure overload the logging infrastructure. You don't want that. Besides, even if all runs smoothly, your read and write patterns will be different. Log analysis queries can be much more complex than the analysis required for performance monitoring. Also, performance monitoring usually has smaller messages that need to be recorded with lower latency.
All in all, better set up a dedicated infrastructure for this.

The easiest thing is of course to use TRACE level logging, and as said earlier, dedicated infrastructure for performance monitoring. But this works only on a small scale, where frankly, you don't even need performance monitoring.

As the system scales, you might start looking towards a more restricted type of logs, maybe some binary protocols, given that you will be sending small packets of information right away, very frequently.

Performance monitoring has somewhat different write and query patterns than log analytics (ik, said it earlier), so different storage is recommended. Queries are simpler, mainly showing trends, time series, current values, or some simple aggregate values, like counts, means, medians, and percentiles. Writes are very frequent but carry little data, only a few metrics, compared with logging tracebacks and contexts and stuff like that.
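To give a feel for how small these "metrics" messages and queries are, here is a toy sketch: a timing decorator that records one number per call, and a summary with the aggregates mentioned above. The `timed` decorator, the in-memory `durations` store, and the metric name are all made up; a real setup would push these values to something like Prometheus instead.

```python
import statistics
import time
from functools import wraps

durations = {}  # metric name -> list of durations in seconds (in-memory stand-in for a metrics store)


def timed(name):
    """Decorator that records how long each call takes under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                durations.setdefault(name, []).append(time.perf_counter() - start)
        return wrapper
    return decorator


def summarize(samples):
    """Simple aggregates a dashboard would typically show."""
    ordered = sorted(samples)
    # Nearest-rank 95th percentile: ceil(0.95 * n) - 1, done in integer arithmetic.
    p95_index = (95 * len(ordered) + 99) // 100 - 1
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }


@timed("db.query")  # hypothetical metric name
def fake_query():
    return sum(range(1000))


for _ in range(20):
    fake_query()

print(summarize(durations["db.query"]))
```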

That's why for example ELK stack is more common in logging infrastructure, where Elasticsearch can index and analyze even very unstructured data, and stuff like Grafana + Prometheus are more commonly used for performance monitoring. Prometheus, among other things, contains a time-series database, just the right thing to store and quickly query performance metrics.

Also, when it comes to performance analysis, you will want to monitor your system utilization, not just the stuff intrinsic to your code. If you're using Prometheus, that's easy to do.

Act 3: My microservice system is slow, but I can't figure out why

First, a likbez (crash-course) on networking and dynamic systems: Against our intuition, a computer network is a shared resource with a limited capacity. This basically means if one service is very chatty, it will influence the throughput and latency for all the rest. Also given that networks are a priori not 100% reliable and we mostly use TCP-based traffic, in the network, there will be plenty of packets (chunks of data, retransmissions, packets from administrative protocols). That's only half the problem though. There's more 😉

Our services are dependent upon each other and upon 3rd parties. So if one service is slow, it might influence other services, even ones that are not directly interacting with it. One metaphor to help you think of it is a spider web. When you touch it on one side, it will ripple on the other side. Kinda like a butterfly effect. And that's not just a simple comparison, you could indeed see failure due to some other service being somewhat slower.

So, how do we monitor this?

Maybe logs? Or something like performance monitoring from the previous act?

Well, I mean, it's a start, but only logs won't cut it. Because we don't see the full picture, specifically, we don't see the interaction between services, only each individual's performance. We need something more. Enter tracing.

First, a good mental model about tracing is that it's like logging, but with a correlation identifier, which makes it possible to combine said logs into a "trace".
A trace like this can now show us how, for example, a single request spans multiple services, how much time each step takes, and even how much time was spent on communication. All this can help uncover bugs and performance bottlenecks that a simple performance monitoring tool, or just logs, won't be able to reveal. Tracing will help you find bottleneck services, and sometimes even aid you in debugging distributed systems.
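The "logs with a correlation identifier" mental model fits in a short sketch. This is not how Jaeger, Zipkin, or OpenTelemetry clients look; the `span` helper, the service functions, and the in-memory collector are all hypothetical, and both "services" run in one process just to show the id being passed along.

```python
import time
import uuid
from contextlib import contextmanager

collected_spans = []  # stand-in for a real trace collector


@contextmanager
def span(name, trace_id=None, parent_id=None):
    """Record one timed step; spans sharing a trace_id form a trace."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,  # the correlation identifier
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_s"] = time.perf_counter() - start
        collected_spans.append(record)


def inventory_service(trace_id, parent_id):
    # In a real system the ids would arrive in request headers.
    with span("inventory.lookup", trace_id, parent_id):
        time.sleep(0.01)


def api_gateway():
    with span("GET /orders") as root:
        inventory_service(root["trace_id"], root["span_id"])


api_gateway()
for s in collected_spans:
    print(s["trace_id"][:8], s["name"], f"{s['duration_s']:.3f}s")
```

Grouping the collected records by `trace_id` and following `parent_id` gives you the request's path through the system, with a duration for every hop.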

Traces should be thought of as an extension to performance monitoring tools, rather than logs. Traces' primary purpose is to uncover performance issues, also sometimes pinpoint the reason a specific operation failed. You could use them as logs, but don't overload them with information, otherwise, your collection, storage, and analysis infrastructure will cry.

How to structure your traces? The easiest thing to do is to use tools that automagically will patch your dependencies like database clients, web servers, and HTTP/RPC clients and be done with it. Sensible defaults, you know. If you want to have more control, be prepared to write some boilerplate, especially if you want to manually control what things will be propagated between services. When it comes to adding info to your spans (the pieces which combined form a trace) don't add your whole application context, only the most important things, for example, current configurations of your system.

Side note, sometimes it is important to correlate traces with logs, for this you can use yet another correlation identifier, for a more in-depth analysis of your system, combining traces with individual logs.

There are some existing Open Source tools with great support, like Jaeger and Zipkin, there are also industry initiatives like OpenTracing, OpenCensus and "their combination" OpenTelemetry, not to mention a few trace formats, like W3C Trace Context and Zipkin B3 formats.

A common architecture for tracing subsystems is a combination of a sidecar, collector, storage, and "presenter" components, not to mention the client library. When it comes to using tracing in a serverless setup it gets tricky, one solution would be to bypass the sidecar and send data directly to the collector, but you will lose some nice features.

Tracing, in general, is a huuuuge topic, and covering it would require at least one more long-read article. That's why, for more information, I'd like to point you towards these two articles and this post from Uber. In these you'll find more "war stories" on how such systems were implemented (first article and the post from Uber) and also such important topics as trace sampling strategies and trace visualizations (second article).

Final act: Welcome to observability!!!

Observability, what?

Observability is the property of a system to be understood. It's a measure of how well one can infer the internal state of something from its external outputs.
It’s a spectrum and depending on where your system stands, you can use monitoring and alerting more or less efficiently.
In other words, if a system is observable you can understand what is happening within it from its outputs.

We need to design our systems with observability in mind. And with all the stuff outlined above, that should become a doable task.

I prefer to think of observability, with a proper incident response procedure, of course, as a way to make said system anti-fragile (see the works of Nassim Taleb), because with every failure and issue that happens, it "learns", on the organizational level, to be better. Or one could argue that, on the contrary, the system now becomes more fragile, because with every fix we believe more and more that the system is now unkillable, which it never will be.

Pick for yourself, but don't forget to use logging. At least you'll know when and why things go south, and that's something.

Epilogue

You've made it! Congrats! Now you have some very important knowledge of how to be prepared when manure hits the proverbial fan in production.
This knowledge should help you debug even super-obscure bugs. Of course, this isn't going to be easy, plus you now have an entire infrastructure to take care of,
but hey, if this helps reduce the time to solve an issue from 1 week (or more) to 1, maybe 2 days, it might be worth it.

I know for a fact that it was worth it for me, time and time again when it helped me quickly identify edge cases, stupid misconfigurations, and performance bottlenecks.

So yeah, that's it for now. Incredibly, it didn't take much time since my last blog post.

Finally, if you’re reading this, I’d like to thank you. Let me know your thoughts about it via Twitter, for now, until I plug in some form of a comment section. Your feedback is valuable to me.

Elixir pattern matching magic

Alex Burlacu — Sun, 09 May 2021 13:10:43 +0000

Prologue

So, a while ago, while preparing for an off-topic lecture about polymorphism and type systems, I recalled an interesting concept called multiple dispatch. I won't go into details of what it is, so if you're interested, check these links: 1, 2, 3.
Anyway, while brushing up my knowledge about multiple dispatch, I found an even more powerful technique, called predicate dispatch.

To me, it seemed a lot like what is possible through pattern matching in functional languages. After some research, I asked on SO whether my assumption was right, here. TL;DR: no answer as of today (2021/05/07).

Why am I telling you all this? Because that's how I decided to write an article about how cool pattern matching is, specifically in Elixir, and even if it's not the same as predicate dispatch, it's pretty damn powerful nevertheless.

So let's get started!

Btw, this post was originally published at https://alexandruburlacu.github.io/posts/2021-05-07-elixir-pattern-matching-magic

Basics

I will quickly go through the basics of Elixir pattern matching, before diving into real neat stuff.

In Elixir = doesn't just assign some value to some variable, it also matches the left-hand side of the expression with the right-hand side. So as a result, doing something like the code below is entirely possible.

iex(0)> x = 2
2
iex(1)> y = 4
4
iex(2)> 4 = y
4
iex(3)> 2 = y
** (MatchError) no match of right hand side value: 4

No big deal, right? Wrong! Because of this interesting property, we can do matching on composite data types, for example on lists.

Lists

In Elixir [] = [] is a valid expression. But now, we can also write something like:

iex(0)> xs = [1, 2]
[1, 2]
iex(1)> [x, y] = xs
[1, 2]
iex(2)> x
1
iex(3)> y
2

Starts to get interesting, eh? But wait, there's more!

iex(0)> [head | tail] = [1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
iex(1)> head
1
iex(2)> tail
[2, 3, 4, 5]

Aaaaand moreeee!!!!

iex(0)> [head, next_to_it | tail] = [1, 2, 3, 4, 5]
[1, 2, 3, 4, 5]
iex(1)> head
1
iex(2)> next_to_it
2
iex(3)> tail
[3, 4, 5]

Noice.

Tuples

Alright, so pattern matching can do interesting stuff. In Elixir it's so deeply ingrained that it's used, for example, to signal whether or not we have an error. For this, pattern matching on tuples is used.

iex(0)> {:ok, value} = SomeModule.some_function()
{:ok, "the value"}
iex(1)> # If the function returns something other than {:ok, value}, say {:error, "some reason"}
iex(2)> {:ok, value} = SomeModule.some_function()
** (MatchError) no match of right hand side value: {:error, "some reason"}

Elixir has a special control structure to enable more flexible usage of pattern matching, it's called case.

iex(0)> x = {:ok, "is fine"}
iex(1)> case x do
...(1)>   {:ok, v} -> v
...(1)>   _ -> "nothing at all"
...(1)> end
"is fine"
iex(2)> x = {:err, "not ok"}
iex(3)> case x do
...(3)>   {:ok, v} -> v
...(3)>   _ -> "nothing at all"
...(3)> end
"nothing at all"

This concludes the basics part, so now we're gonna dive into more interesting stuff.

Functions

Elixir can use pattern matching even in function definitions, like in Haskell. And by the way, that's one of the most performant options, usually.

defmodule FactorialM do
    def factorial(0), do: 1
    def factorial(1), do: 1
    def factorial(x) do
        x * factorial(x-1)
    end
end

Can you spot a problem with this function? Well, what if we pass a floating-point value instead of an integer? Think what would happen, and compare with the answer¹ at the end of the article.

How can you fix it? Enter guards.

defmodule FactorialM do
    def factorial(0), do: 1
    def factorial(1), do: 1
    # the x > 1 check also keeps negative integers from recursing forever
    def factorial(x) when is_integer(x) and x > 1 do
        x * factorial(x-1)
    end
    def factorial(_), do: raise RuntimeError, "Input should be a non-negative integer"
end

So now we can also define different paths for code execution depending on whether or not our guard(s) are satisfied. Guards in Elixir are fairly limited, and hard-ish to extend, for safety reasons. Guards should be pure functions, and even if you try to define them using macros, the compiler can still check whether or not they can be distilled down to existing guards and logical operators. For more information, see this documentation page and this little tutorial/blog post on how to write guards.

Finally, we can combine pattern matching capabilities of functions with those of tuples and lists and implement fairly interesting things. For example a map function.

defmodule FairlyInteresting do
    def map([], _func), do: []
    def map([head | tail], func) when is_function(func) do
        [func.(head) | map(tail, func)]
    end
end

Also, using pattern matching on tuples in function prototypes is the go-to way of using Elixir's GenServer, GenStage, and other Gen-things. This pattern is inherited from Erlang's OTP and is pretty beautiful if you ask me.

defmodule Stack do
    @moduledoc "Taken from: https://hexdocs.pm/elixir/master/GenServer.html"
    use GenServer

    # Callbacks
    @impl true
    def init(stack) do
        {:ok, stack}
    end

    @impl true
    def handle_call(:pop, _from, [head | tail]) do
        {:reply, head, tail}
    end

    @impl true
    def handle_cast({:push, element}, state) do
        {:noreply, [element | state]}
    end
end

It's getting more interesting

Remember I told you pattern matching can be applied to composite data? Well, it's not just lists, it's also maps, and by extension structs, here:

iex(0)> kv = %{key: :value}
%{key: :value}
iex(1)> %{key: data} = kv
%{key: :value}
iex(2)> data
:value

And with structs:

iex(0)> defmodule AStruct do
...(0)>     defstruct [:state]
...(0)> end
# I'll omit this for brevity
iex(1)> s = %AStruct{state: 12}
%AStruct{state: 12}
iex(3)> s.state
12
iex(4)> %{state: st} = s # recall, a struct is just syntactic sugar for a map
%AStruct{state: 12}
iex(5)> st
12
iex(6)> %AStruct{state: st} = s
%AStruct{state: 12}
iex(7)> st
12

The as-pattern

What if you need to match a function parameter with some specific structure, but you also need a reference to the entire value? Have you ever heard about as-patterns?

defmodule FairlyInteresting do
    def merge([], xs), do: xs
    def merge(xs, []), do: xs
    def merge(first=[x|xs], second=[y|ys]) do
        if x < y do
            [x | merge(xs, second)]
        else
            [y | merge(ys, first)]
        end
    end
end

# ...

iex(0)> FairlyInteresting.merge [1, 3, 4, 7], [2, 2, 4, 8, 9]
[1, 2, 2, 3, 4, 4, 7, 8, 9]

You still with us? Yes? Good, because the fun part hasn't even started yet.

Partial functions

Moving on, in Elixir it is possible to define partial functions. Mathematically speaking, a partial function is a function defined only for some values, not for the whole set of values. For example, the division is technically a partial function, because we can't define it when the divisor is 0. We can make a partial function explicit via pattern matching. And it also works for anonymous functions!

iex(0)> partial = fn 
...(0)>     {:ok, value} when is_number(value) -> value * 2
...(0)>     {:notok, _} -> :i_mean_its_not_ok
...(0)> end
iex(1)> partial.(12)
# raises a FunctionClauseError
iex(2)> partial.({:ok, 12})
24

The pin (not my card's)

Finally, no discussion about pattern matching in Elixir would be complete without the ^ operator. So what is it?
It is commonly known as the pin operator, and it allows pattern matching without any assignment.

Recall that normally, using = we perform both pattern matching and assignment. That is, we check whether the left-hand side of the expression matches the right-hand side, and if so, all the variables in the expression get assigned to corresponding values.

Ex. [1, x, y] = [1, 2, 3] # x = 2, y = 3.

So, using ^ we can pattern match, but not assign. Like this:

iex(0)> x = 2
2
iex(1)> ^x = 3
** (MatchError) no match of right hand side value: 3
iex(2)> ^x = 2
2

You might ask, where would I use this? Well, what about deciding at runtime which matching criteria you are interested in? For example:

iex(0)> status_of_interest = :wip # imagine that it is decided while the program is running
iex(1)> # maybe can even change throughout the program lifetime
iex(2)> partial = fn 
...(2)>     {^status_of_interest, value} when is_number(value) -> value * 2
...(2)>     {:notok, _} -> :i_mean_its_not_ok
...(2)> end
iex(3)> partial.({:ok, 12})
# raises a FunctionClauseError
iex(4)> partial.({:wip, 11})
22

Leveling up

Now we've seen enough, let's combine everything!

defmodule Measurement do
    defstruct [:prob, status: :ok]
end

defmodule Measurement.CDF do
    defstruct [:value]
end

defmodule FairlyInteresting do
    @doc "CDF stands for cumulative distribution function"
    def kinda_cdf([], acc, _func), do: [%Measurement.CDF{value: acc}]

    # So, now we have pattern matching on structs, inside lists,
    #  with as-patterns and guards, isn't it cool?
    def kinda_cdf([%Measurement{prob: t, status: :ok}=head | tail], acc, func)
        when is_function(func, 2) do
        [%Measurement.CDF{value: acc} | kinda_cdf(tail, func.(t, acc), func)]
    end

    def kinda_cdf([_head | tail], acc, func) when is_function(func, 2) do
        kinda_cdf(tail, acc, func)
    end
end

# ...

iex(0)> ms = [%Measurement{prob: 0.11}, %Measurement{prob: 0.07},
...(0)>       %Measurement{prob: 0.31, status: :notok}, %Measurement{prob: 0.21},
...(0)>       %Measurement{prob: 0.17, status: :ok}, %Measurement{prob: 0.08, status: :notok}]
[
  %Measurement{prob: 0.11, status: :ok},
  %Measurement{prob: 0.07, status: :ok},
  %Measurement{prob: 0.31, status: :notok},
  %Measurement{prob: 0.21, status: :ok},
  %Measurement{prob: 0.17, status: :ok},
  %Measurement{prob: 0.08, status: :notok}
]
iex(1)> FairlyInteresting.kinda_cdf ms, 0.0, &(&1+&2)
[
  %Measurement.CDF{value: 0.0},
  %Measurement.CDF{value: 0.11},
  %Measurement.CDF{value: 0.18},
  %Measurement.CDF{value: 0.39},
  %Measurement.CDF{value: 0.56}
]

If you're like Patrick right now, I don't blame you, even I was a bit shocked while writing this.

And if that's not enough, we move onto the mindbending stuff. Bear with me.

Working with bits

Something cool that Erlang and Elixir can do is pattern matching on binary data. Isn't this amazing?

Binary pattern matching in Erlang and Elixir exists because Erlang was initially developed to be used for network and telecom programming, that is implementing software for switches, routers, and servers; developing protocols, and doing this efficiently. Binary matching allows for very concise parsing of binary protocols.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Data |                                                        |
   |Offset|                      data                              |
   |      |                                                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                        Some Binary Header Format

iex(0)> source_port = <<12070 :: 16>>
"/&"
iex(1)> destination_port = <<80 :: 16>>
<<0, 80>>
iex(2)> seq_num = <<12_345_678 :: 32>>
<<0, 188, 97, 78>>
iex(3)> offset = <<0 :: 3>> # you can even specify bit-strings
<<0::size(3)>>

Now let's assemble the packet.

iex(4)> header = << source_port <> destination_port <> seq_num <> seq_num, offset :: bitstring>>
<<47, 38, 0, 80, 0, 188, 97, 78, 0, 188, 97, 78, 0::size(3)>>
iex(5)> packet = <<header :: bitstring, <<"and here comes the data">> >>
<<47, 38, 0, 80, 0, 188, 97, 78, 0, 188, 97, 78, 12, 45, 204, 132, 13, 12, 174,
  76, 164, 12, 109, 237, 172, 174, 100, 14, 141, 12, 164, 12, 140, 46, 140,
  1::size(3)>>

Notice the strange way we assemble the packet. Because we are working on the bit level, not even on the byte level, sometimes concatenation (<>) isn't possible.
That's why we use the list-like behaviour of <<>>, that is to say this form: << 12::3, <<1::2, <<3::3>> >> >> will be equivalent to this <<12::3, 1::2, 3::3>>.

Also, we need to specify that we're dealing with a bitstring not a sequence of bytes.

And now we match.

iex(6)> <<_sp :: 16, _dp :: 16, _seq_num :: 32, ack_num :: 32, _offset :: 3, data :: bitstring>> = packet
<<47, 38, 0, 80, 0, 188, 97, 78, 0, 188, 97, 78, 12, 45, 204, 132, 13, 12, 174,
  76, 164, 12, 109, 237, 172, 174, 100, 14, 141, 12, 164, 12, 140, 46, 140,
  1::size(3)>>
iex(7)> data
"and here comes the data"

Nice, but can we use ^ for more powerful matching?

iex(5)> <<^i, 32, ^have, 32, ^a, "n ", apple>> = "I have an apple"
** (MatchError) no match of right hand side value: "I have an apple"

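Now, this is one thing you can't do, at least not this naively: a segment without a type is treated as an integer, so a pinned variable holding a string never matches this way. Shame. But binary matching is useful nevertheless, you'll see in a moment.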

A bit about strings

In Elixir you can even match on text. Nice, isn't it? But text, or strings, are just sequences of bytes, so it's pretty much obvious why we can do it.

iex(0)> partial = fn
...(0)>     {:ok, "he" <> v} -> v
...(0)>     _ -> :nope
...(0)> end
iex(1)> partial.({:ok, "hello"})
"llo"
iex(2)> partial = fn # won't compile
...(2)>     {:ok, "he" <> v} -> v
...(2)>     {:still_ok, v <> "ou"} -> v # because of this
...(2)> _ -> :nope
...(2)> end
** (ArgumentError) the left argument of <> operator inside a match should always be a literal binary because its size can't be verified.

Well, the capability is very limited, because potentially you could have a very long string, and checking its end potentially could be a very expensive operation.
There's a way tho.

Back to bit sequences. So, because of the Erlang legacy, strings can be treated as sequences of characters, which in turn are just sequences of short unsigned integers. So, if you know the size of the matchable subsequence, you could potentially match even in the middle of the string.

Just in case someone needs it: if you need to match on a part of the string that sits somewhere in the middle, and you are aware of its position and length, then you can use binary matching:

iex(1)> <<"I ", v::binary-size(9), "ing">> = "I got a string"
iex(2)> v
"got a str"

Strings and bits and bytes and pattern matching in Elixir is a huge topic, so don't worry if you're confused at this moment. You could check this post about exactly that if my ramblings didn't make sense ;)

Epilogue

I hope you like it. I don't know about you, but I like to discover weird powerful things like all the stuff above. I mean, lists and tuples are fine, but to be able to pattern match on bits, that's some Voodoo magic in here.

So yeah, that's it for now. I might write some more about advanced Elixir stuff, most likely related to the actor model. Let's hope it won't take as long as usual.

If you’re reading this, I’d like to thank you, and I hope all of the above will be of great help to you, as it was for me. Let me know your thoughts about it in the comments. Also, be sure to check my new blog at alexandruburlacu.github.io. Eventually all my blogs will be published exclusively there.

Oh, yeah, the answer¹: it will run until killed by the OS, because 1.0 doesn't match 1 in Elixir, nor does 0.0 match 0, so the recursion never hits a base case.

P.S.

For your efforts, I'd like to reward you with the possibility to run all these examples in a sandbox (elixir <filename.exs>). Knock yourself out ;)

Understanding a Black-Box

Alex Burlacu — Sun, 12 Aug 2018 13:52:47 +0000

An overview of model interpretability methods… and why it’s important. Originally published here.

Before we dive into some popular and quite powerful methods to crack open black box machine learning models, like deep learning ones, let’s first make clear why it is so important.

You see, there are a lot of domains that would benefit from understandable models, like self-driving cars, or ad targeting, and there are even more that demand this interpretability, like creditworthiness assignment, banking, healthcare, human resources. Being able to audit the model for these critical domains is very important.

Understanding the most important features of a model gives us insights into its inner workings and gives directions for improving its performance and removing bias.

Besides that, sometimes it helps to debug models (happens all the time). The most important reason, however, for providing explanations along with the predictions is that explainable ML models are necessary to gain end-user trust (think of medical applications as an example).

I hope now you also believe that understandable machine learning is of high importance, so let’s dive into concrete examples to solve this problem.

Simple(st) methods

The simplest method one can think of is slight alterations of input data to observe how the underlying black box is reacting. For visual data, using partially occluded images is the easiest method. For text, it is the substitution of words, and for numerical/categorical data, the alteration of variables. Easy as that!

The greatest benefit of this method is that it is model-agnostic: you can even check someone else’s models without direct access to them.

Even if it sounds easy, the benefits are immense. I used this method numerous times to debug both Machine-Learning-as-a-Service solutions and neural networks trained on my own machine to find that the trained models choose irrelevant features to decide the class of images, thus saving hours of work. Truly 80/20 rule in action.
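To make the occlusion idea concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: `toy_predict` is a hypothetical stand-in for the black box you are probing (in practice an API call or your model's predict function), and the 4x4 "image" is just a nested list. The sketch masks one patch at a time and records how much the score drops.

```python
def toy_predict(image):
    """Hypothetical black box: the score depends only on the top-left 2x2 corner."""
    return sum(image[r][c] for r in range(2) for c in range(2)) / 4.0


def occlusion_map(image, predict, patch=2, fill=0.0):
    """Slide a patch over the image and record the score drop each occlusion causes."""
    base = predict(image)
    height, width = len(image), len(image[0])
    drops = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            occluded = [list(r) for r in image]  # copy, keep the original intact
            for r in range(top, min(top + patch, height)):
                for c in range(left, min(left + patch, width)):
                    occluded[r][c] = fill
            row.append(base - predict(occluded))
        drops.append(row)
    return drops


image = [[1.0] * 4 for _ in range(4)]
sensitivity = occlusion_map(image, toy_predict)
print(sensitivity)  # only the top-left patch changes the score
```

A large drop means the model relied on that patch. If the patches that matter are not the ones a human would look at, you have probably found a bug in the data or the model.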

GradCAM

Gradient-weighted Class Activation Maps: a more advanced and specialized method. The constraints of this method are that you need to have access to the model’s internals, and it should work with images. To give you a simple intuition of the method, given a sample of data (image), it will output a heat map of the regions of the image where the neural network had the most and greatest activations, and therefore the features in the image that the model correlates the most with the class.

Essentially, you get a more fine-grained understanding of what the important features for the model are than with the previous method.

Here’s a nice demo of the GradCAM interpretability method.
To learn how the GradCAM works, check the “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” paper.
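The core of the computation fits in a few lines once you already have two things from your framework: the activations of the last convolutional layer and the gradients of the class score with respect to them. The sketch below uses tiny hand-written nested lists as hypothetical stand-ins for both; extracting them from a real model is framework-specific and not shown here.

```python
def grad_cam(activations, gradients):
    """activations, gradients: lists of K feature maps, each H x W.

    Returns an H x W heat map: ReLU of the gradient-weighted sum of the maps.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # one weight per feature map: the spatial average of its gradients
    weights = [
        sum(sum(row) for row in grad) / (height * width) for grad in gradients
    ]
    heat = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heat[i][j] += weight * fmap[i][j]
    # ReLU keeps only the regions that push the class score up
    return [[max(0.0, v) for v in row] for row in heat]


# Two made-up 2x2 feature maps and their gradients
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heat_map = grad_cam(acts, grads)
print(heat_map)
```

In a real pipeline the resulting map is upsampled to the input resolution and overlaid on the image.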

LIME

Maybe you’ve heard about this one. If not, first take a look at this short intro.

The paper “Why Should I Trust You?: Explaining the Predictions of Any Classifier”, published in 2016, introduced LIME (Local Interpretable Model-agnostic Explanations). Let’s derive its capabilities from the name!

Local Interpretable: You should know that the higher the complexity of a machine learning model, the less interpretable the model is. That is, logistic regression and decision trees are much more interpretable than, say, random forests and neural networks. The assumption of the LIME method is that non-linear, complex models like random forests or neural networks can be approximated as linear and simple locally, that is, on small patches of the whole decision boundary. And recall, we said that simple models are interpretable.

Model-agnostic: This part is easier. LIME doesn’t make any assumptions about the model being interpreted.
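To get a feel for the "linear locally" assumption, here is a toy sketch in plain Python. It is not the API of the lime package: `black_box` is a hypothetical one-feature model, and the evenly spaced perturbations and the Gaussian proximity kernel are simplifying assumptions. The idea is only to show that a proximity-weighted line fitted around a point recovers the local behaviour of a non-linear model.

```python
import math


def black_box(x):
    """Hypothetical non-linear model we want to explain."""
    return x ** 2


def local_linear_slope(model, x0, radius=0.5, steps=21, kernel_width=0.25):
    """Fit a proximity-weighted line to the model around x0 and return its slope."""
    # symmetric, evenly spaced perturbations around the point of interest
    offsets = [radius * (2 * i / (steps - 1) - 1) for i in range(steps)]
    xs = [x0 + d for d in offsets]
    ys = [model(x) for x in xs]
    # closer samples get a larger weight (Gaussian kernel)
    ws = [math.exp(-(d ** 2) / (kernel_width ** 2)) for d in offsets]
    total = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / total
    y_mean = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    return cov / var


slope = local_linear_slope(black_box, x0=3.0)
print(round(slope, 3))  # close to 6, the derivative of x**2 at x = 3
```

The slope of the local line is the explanation: around x = 3, increasing the feature by one unit raises the prediction by roughly six.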

The best thing about LIME is that it is also available as a PyPI package. pip install lime and you’re ready to go! For more information, here’s their GitHub repo with benchmarks and some tutorial notebooks, and here is the link to their paper. FYI, LIME was also used by the Fast Forward Labs (now part of Cloudera) in their demo on the importance of model interpretability.

SHAP

SHapley Additive exPlanations: a more recent solution for understanding black-box models. In a way, it is very similar to LIME. Both are powerful, unified solutions that are model-agnostic and relatively easy to get started with.

But what is special about SHAP is that it builds on LIME's idea internally: its model-agnostic Kernel SHAP explainer fits a LIME-style local linear model. SHAP actually has a plethora of explainers behind it (for tree ensembles, for deep networks, and a model-agnostic one), so the most appropriate one can be used for the problem at hand, giving you the needed explanations with the right tool.

Moreover, if we break down the capabilities of this solution, we actually find out that SHAP explains the output of any machine learning model using Shapley values. This means that SHAP assigns a value to each feature for each prediction (i.e. feature attribution); the larger the absolute value, the larger the feature’s attribution to the specific prediction. It also means that the sum of these values, added to the base value (the average model output), should be close to the original model prediction.
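The additivity property is easy to check on a toy example. The sketch below computes exact Shapley values by brute force over all feature orderings, which is only feasible for a handful of features and is not how the shap library does it. The `model` function, the input and the all-zeros baseline are hypothetical, and "removing" a feature is simplified to replacing it with its baseline value.

```python
from itertools import permutations


def model(x):
    """Hypothetical model with an interaction between features 0 and 2."""
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[2]


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature "absent"
        previous = predict(current)
        for feature in order:
            current[feature] = x[feature]  # switch the feature "on"
            value = predict(current)
            phi[feature] += value - previous
            previous = value
    return [p / len(orderings) for p in phi]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)
print(sum(phi) + model(baseline), model(x))  # the two numbers agree
```

Note how the interaction term x[0] * x[2] gets split evenly between the two features involved, which is exactly the "fair share" idea behind Shapley values.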

Check their GitHub repo, just like LIME, they have some tutorials and it is also possible to install SHAP via pip. Also, for more nitty-gritty details, check their NIPS pre-print paper.

Final Notes

Machine Learning model interpretability and explainability is a hot topic in the AI community, with immense implications. In order for AI-enabled products and services to enter new, highly regulated markets, it is mandatory to understand how these novel algorithms are making decisions.

Moreover, knowing the reasons behind a machine learning model’s decisions provides an outstanding advantage in debugging it, and even improving it.

It is highly advisable to design Deep Learning/Machine Learning systems with interpretability in mind, so that it is always easy to inspect the model and, in a critical situation, to suppress its decisions.

If you’ve made it so far, thank you! I encourage you to do your own research in this new domain and share your findings with us in the comments section.

Also, don’t forget to clap if you liked the article 😏 or even follow me for more articles on miscellaneous topics from machine learning and deep learning.

P.S. Sorry, no Colab Notebook this time because there are already lots of very good tutorials on this topic to get you started.

Speeding up Convolutional Neural Networks

Alex Burlacu — Fri, 06 Jul 2018 07:41:23 +0000

Originally published on Medium, quite some time ago, here

An overview of methods to speed up training of convolutional neural networks without significant impact on the accuracy.

It’s funny how fully connected layers are the main cause for big memory footprint of neural networks, but are fast, while convolutions eat most of the computing power although being compact in the number of parameters. Actually, convolutions are so compute hungry that they are the main reason we need so much compute power to train and run state-of-the-art neural networks.

Can we design convolutions that are both fast and efficient?

To some extent — Yes!

There are methods to speed up convolutions without critical degradation of the accuracy of models. In this blog post, we’ll consider the following methods.

Factorization/Decomposition of convolution’s kernels
Bottleneck Layers
Wider Convolutions
Depthwise Separable Convolutions

Below, I’ll dive into the implementation of, and the reasoning behind, all these methods.

Simple Factorization

Let’s start with the following example in NumPy

>>> from numpy.random import random
>>> random((3, 3)).shape == (random((3, 1)) @ random((1, 3))).shape
True

You might ask, why am I showing you this silly snippet? Well, the answer is, it shows that an NxN matrix, think of a convolutional kernel, can be built as a product of 2 smaller matrices/kernels, of shapes Nx1 and 1xN (strictly speaking, only rank-1 kernels factorize exactly like this; the factorized layers simply learn within that constraint). Recall that the convolution operation requires in_channels * n * n * out_channels parameters or weights. Also, recall that every weight/parameter is multiplied with an activation at every output position. So, any reduction in the number of parameters will reduce the number of operations required and the computational cost.

Given that the convolution operation is in fact done using tensor multiplications, which are polynomially dependent on the size of the tensors, correctly applied factorization should yield a tangible speedup.
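A quick back-of-the-envelope check of the savings, using the parameter formula above. The channel counts and the 7x7 kernel are arbitrary example values, biases are ignored, and the helper names are made up for this sketch.

```python
def conv_params(in_channels, out_channels, kernel_h, kernel_w):
    """Number of weights in a convolution (biases ignored)."""
    return in_channels * kernel_h * kernel_w * out_channels


def factorized_params(in_channels, out_channels, k):
    """A 1xk convolution followed by a kx1 one, both with out_channels filters."""
    first = conv_params(in_channels, out_channels, 1, k)
    second = conv_params(out_channels, out_channels, k, 1)
    return first + second


full = conv_params(64, 64, 7, 7)
factorized = factorized_params(64, 64, 7)
print(full, factorized, full / factorized)  # 200704 57344 3.5
```

With equal channel counts the ratio is k / 2, which is why the trick pays off for big kernels and barely helps for 3x3 ones.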

In Keras it will look like this:

from keras.layers import Conv2D

# k - kernel size, for example 3, 5, 7...
# n_filters - number of filters/channels
# inp - the input tensor of the factorized block
# Note that you shouldn't apply any activation
# or normalization between these 2 layers
fact_conv1 = Conv2D(n_filters, (1, k))(inp)
fact_conv2 = Conv2D(n_filters, (k, 1))(fact_conv1)

Still, note that it is not recommended to factorize the convolutional layers closest to the input. Also, factorizing 3x3 convolutions can even damage the network’s performance. Better keep this trick for bigger kernel sizes.

Before we dive deeper into the topic, there’s a more stable way to factorize big kernels: just stack smaller ones instead. For example, instead of using 5x5 convolutions, stack two 3x3 ones, or 3 if you want to substitute a 7x7 kernel. For more information see [4].
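Why two 3x3 layers for a 5x5 kernel and three for a 7x7 one? Because, with stride 1, the stacks see the same input window while using fewer weights. A small sketch of that arithmetic (helper names are made up, and the channel count is assumed to stay the same across the stacked layers):

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions with the same kernel."""
    return 1 + layers * (kernel - 1)


def weights_per_channel_pair(kernel, layers=1):
    """Weights per input/output channel pair, biases ignored."""
    return layers * kernel * kernel


for big, layers in [(5, 2), (7, 3)]:
    # the stack of 3x3 kernels covers the same window as the big kernel
    assert stacked_receptive_field(3, layers) == big
    print(big, weights_per_channel_pair(big), weights_per_channel_pair(3, layers))
# 5 25 18
# 7 49 27
```

As a bonus, the stacked version gets an extra non-linearity between the layers.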

Bottleneck Layers

The main idea behind a bottleneck layer is to reduce the size of the input tensor in a convolutional layer with kernels bigger than 1x1 by reducing the number of input channels aka the depth of the input tensor.

Here’s the Keras code for it:

from keras.layers import Conv2D

# given that conv1 has shape (None, N, N, 128)

conv2 = Conv2D(96, (1, 1), ...)(conv1) # squeeze
conv3 = Conv2D(96, (3, 3), ...)(conv2) # map
conv4 = Conv2D(128, (1, 1), ...)(conv3) # expand

Almost all CNNs, ranging from the revolutionary InceptionV1 to the modern DenseNet, use Bottleneck Layers in one way or another. This technique helps in keeping the number of parameters, and thus the computational cost, low.
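For the exact layer sizes from the snippet above, the savings can be counted by hand. This is only a weight count with biases ignored, compared against a hypothetical plain 3x3 convolution that keeps all 128 channels.

```python
def conv_params(in_channels, out_channels, k):
    """Weights of a k x k convolution, biases ignored."""
    return in_channels * k * k * out_channels


# A plain 3x3 convolution that keeps all 128 channels
plain = conv_params(128, 128, 3)

# squeeze -> map -> expand, mirroring the three layers above
bottleneck = (
    conv_params(128, 96, 1)    # squeeze
    + conv_params(96, 96, 3)   # map
    + conv_params(96, 128, 1)  # expand
)
print(plain, bottleneck)  # 147456 107520
```

A more aggressive squeeze (say, 32 channels instead of 96) saves much more, at the price of a narrower representation in the middle.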

Wider Convolutions

Another easy way to speed up convolutions is the so-called wide convolutional layer. You see, the more convolutional layers your model has, the slower it will be. Yet, you need the representation power of lots of convolutions. What do you do? You use less-but-fatter layers, where fat means more kernels per layer. Why does it work? Because it’s easier for the GPU, or other massively parallel machines for that matter, to process a single big chunk of data instead of a lot of smaller ones. More information can be found in [6].

# convert from
conv = Conv2D(96, (3, 3), ...)(conv)
conv = Conv2D(96, (3, 3), ...)(conv)
# to
conv = Conv2D(128, (3, 3), ...)(conv)
# roughly, take the sqrt of the number of layers you want
# to merge and multiply it by
# the number of filters/channels in the initial convolutions
# to get the number of filters/channels in the new layer

Depthwise Separable Convolutions

Before diving into this method, be aware that it’s extremely dependent upon how Separable Convolutions were implemented in a given framework. As far as I know, TensorFlow might have some specific optimizations for this method, while for other backends, like Caffe, CNTK or PyTorch, it is unclear.

Vincent Vanhoucke, April 2014, “Learning Visual Representations at Scale”

The idea is that instead of convolving jointly across all channels of an image, you run a separate 2D convolution on each channel with a depth of channel_multiplier. The in_channels * channel_multiplier intermediate channels get concatenated together, and mapped to out_channels using a 1x1 convolution.[5] This way one ends up with significantly fewer parameters to train.[2]
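Counting the weights makes the difference tangible. The sketch follows the description above (a k x k depthwise step with channel_multiplier, then a 1x1 pointwise step); the 32 and 64 channel counts are arbitrary example values and biases are ignored.

```python
def standard_conv_params(in_channels, out_channels, k):
    """Weights of a regular k x k convolution."""
    return in_channels * k * k * out_channels


def separable_conv_params(in_channels, out_channels, k, channel_multiplier=1):
    """Depthwise k x k step plus the 1x1 pointwise step that mixes channels."""
    depthwise = in_channels * channel_multiplier * k * k
    pointwise = in_channels * channel_multiplier * out_channels
    return depthwise + pointwise


standard = standard_conv_params(32, 64, 3)
separable = separable_conv_params(32, 64, 3)
separable_x4 = separable_conv_params(32, 64, 3, channel_multiplier=4)
print(standard, separable, separable_x4)  # 18432 2336 9344
```

Even with a channel multiplier of 4, the separable version stays at about half the size of the regular convolution in this example.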

# in Keras
from keras.layers import SeparableConv2D
...
net = SeparableConv2D(32, (3, 3))(net)
...
# it's almost 1:1 similar to the simple Keras Conv2D layer

It’s not so simple tho. Beware that Separable Convolutions sometimes fail to train. In such cases, change the depth multiplier from 1 to 4 or 8. Also note that these are not that efficient on small datasets like CIFAR 10, let alone MNIST. Another thing to keep in mind: don’t use Separable Convolutions in the early stages of the network.

Source: V. Lebedev et al, Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition

CP-Decomposition and Advanced Methods

The factorization schemes shown above work well in practice, but are quite simple. They are by far not the limit of what’s possible. There are numerous works, including [3] by V. Lebedev et al., that show us different tensor decomposition schemes that drastically decrease the number of parameters, hence the number of required computations.
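To see what "drastically" can mean, here is a rough parameter count for a rank-R CP-decomposed layer. It assumes the four-factor layout described in [3]: a 1x1 convolution down to R channels, two one-dimensional convolutions (d x 1 and 1 x d) applied separately per rank component, and a 1x1 convolution up to the output channels. The concrete sizes and the rank are arbitrary example values, and biases are ignored.

```python
def full_conv_params(in_channels, out_channels, d):
    """Weights of a regular d x d convolution."""
    return d * d * in_channels * out_channels


def cp_params(in_channels, out_channels, d, rank):
    """Weights of the four factors: 1x1 in, d x 1, 1 x d, 1x1 out."""
    return rank * (in_channels + d + d + out_channels)


full = full_conv_params(64, 128, 5)
decomposed = cp_params(64, 128, 5, rank=32)
print(full, decomposed)  # 204800 6464
```

The catch is the rank: too low and the accuracy suffers, which is why [3] fine-tunes the network after decomposition.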

Inspired by [1] here’s a code snippet of how to do CP-Decomposition in Keras:

# **kwargs - anything valid for Keras layers,
# like regularization, or activation function
# Though, add at your own risk

# Take a look into how ExpandDimension and SqueezeDimension
# are implemented in the associated Colab Notebook
# at the end of the article

first = Conv2D(rank, kernel_size=(1, 1), **kwargs)(inp)
expanded = ExpandDimension(axis=1)(first)
mid1  = Conv3D(rank, kernel_size=(d, 1, 1), **kwargs)(expanded)
mid2  = Conv3D(rank, kernel_size=(1, d, 1), **kwargs)(mid1)
squeezed = SqueezeDimension(axis=1)(mid2)
last  = Conv2D(out,  kernel_size=(1, 1), **kwargs)(squeezed)

It doesn’t work, regretfully, but it gives you the intuition of what it should look like in code. Btw, the image at the top of the article is the graphical explanation of how CP-Decomposition works.

Schemes such as TensorTrain (TT) and Tucker decomposition should also be noted. For PyTorch and NumPy there’s a great library called TensorLy that does all the low-level implementation for you. In TensorFlow there’s nothing close to it; still, there is an implementation of the TensorTrain aka TT scheme, here.

Epilogue

The full code is currently available as a Colaboratory notebook with a Tesla K80 GPU accelerator. Make yourself a copy and have fun tinkering around with the code.

If you’re reading this, I’d like to thank you, and I hope all of the above will be of great help to you, as it was for me. Let me know your thoughts about it in the comments section. Your feedback is valuable to me.

References

[1] https://medium.com/@krishnatejakrothapalli/hi-rain-4e76039423e2
[2] F. Chollet, Xception: Deep Learning with Depthwise Separable Convolutions, https://arxiv.org/abs/1610.02357v2
[3] V. Lebedev et al, Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition, https://arxiv.org/abs/1412.6553
[4] C. Szegedy et al, Rethinking the Inception Architecture for Computer Vision, https://arxiv.org/pdf/1512.00567v1.pdf
[5] https://stackoverflow.com/questions/37092037/tensorflow-what-does-tf-nn-separable-conv2d-do#37092986
[6] S. Zagoruyko and N. Komodakis, Wide Residual Networks, https://arxiv.org/pdf/1605.07146v1.pdf