<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rohit  Patil</title>
    <description>The latest articles on DEV Community by Rohit  Patil (@rohitpatil5).</description>
    <link>https://dev.to/rohitpatil5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F172948%2F06eac45f-a636-4912-beaa-42c2855ccca3.jpg</url>
      <title>DEV Community: Rohit  Patil</title>
      <link>https://dev.to/rohitpatil5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rohitpatil5"/>
    <language>en</language>
    <item>
      <title>Consistent Hashing Simplified</title>
      <dc:creator>Rohit  Patil</dc:creator>
      <pubDate>Wed, 13 Feb 2019 14:42:47 +0000</pubDate>
      <link>https://dev.to/rohitpatil5/consistent-hashing-simplified-44lb</link>
      <guid>https://dev.to/rohitpatil5/consistent-hashing-simplified-44lb</guid>
      <description>&lt;h3&gt;
  
  
  Distributed system problem:-
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;We want to dynamically add/remove cache servers based on usage load.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As these are cache servers, we have a set of keys and values. This could be Memcached, Redis, Hazelcast, Ignite, etc.&lt;/p&gt;

&lt;p&gt;Such setups consist of a pool of caching servers that host many key/value pairs and are used to provide fast access to data originally stored (or computed) elsewhere. For example, to reduce the load on a database server and at the same time improve performance, an application can be designed to first fetch data from the cache servers, and only if it’s not present there — a situation known as &lt;em&gt;cache miss&lt;/em&gt; — resort to the database, running the relevant query and caching the results with an appropriate key, so that it can be found next time it’s needed. We want to distribute the keys across the servers so that we can find them again.&lt;/p&gt;

&lt;p&gt;Our goal is to design a system such that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We should be able to distribute the keys uniformly among the set of “n” servers.&lt;/li&gt;
&lt;li&gt;We should be able to dynamically add or remove a server.&lt;/li&gt;
&lt;li&gt;When we add/remove a server, we need to move the minimal amount of data between the servers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the simplest approach:-&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a hash of the key from the incoming data. For example, in Python, we could use the built-in hash function.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hashValue = hash(key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Find the server to send the data to by taking the hashValue modulo the number of current servers (n):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;serverIndex = hashValue % n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
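&lt;p&gt;The two steps above can be put together as a runnable sketch. The helper names here are illustrative, and a stable MD5-based hash stands in for Python’s built-in hash, which is randomized per process for strings:&lt;/p&gt;

```python
import hashlib

def hash_value(key):
    # Stable integer hash of the key (Python's built-in hash() is
    # randomized per process for strings, so MD5 is used instead).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def server_index(key, n):
    # Map the key to one of n servers via modulo.
    return hash_value(key) % n

keys = ["key%d" % i for i in range(8)]
placement = {k: server_index(k, 4) for k in keys}
print(placement)
```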



&lt;p&gt;Now consider the following scenario:-&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imagine we have 4 servers&lt;/li&gt;
&lt;li&gt;Imagine our hash function returns a value from 0 to 7&lt;/li&gt;
&lt;li&gt;We’ll assume that “key0”, when passed through our hash function, generates a hash value of 0, “key1” generates 1, and so on.&lt;/li&gt;
&lt;li&gt;The serverIndex for “key0” is 0, “key1” is 1 and so on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The situation, assuming that the key data is uniformly distributed, is shown in the image below. We receive 8 pieces of data, and our hashing algorithm distributes them evenly across our four cache servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F686%2F1%2AGPaE8st5PCsM3yljFULzGw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F686%2F1%2AGPaE8st5PCsM3yljFULzGw.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Sharding data across 4 servers&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Problem solved, right? Not quite — there are two major drawbacks with this approach, namely, &lt;strong&gt;Horizontal Scalability&lt;/strong&gt; and &lt;strong&gt;Non-Uniform data distribution&lt;/strong&gt; across servers.&lt;/p&gt;

&lt;h4&gt;
  
  
  Horizontal Scalability:-
&lt;/h4&gt;

&lt;p&gt;The above scheme is not horizontally scalable. If we add or remove servers from the set, all our existing mappings are broken, because the value of “n” in the function that calculates the serverIndex changes. The result is that all existing data needs to be remapped and migrated to different servers, which might be a humongous task.&lt;/p&gt;

&lt;p&gt;Let us see what happens when we add another server (server4) to the original pool of servers. Notice that we’ll need to update 3 out of the original 4 servers, which means 75% of the servers need to be updated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F715%2F1%2ALGO2Fhv50XKxcK8LA5OqYA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F715%2F1%2ALGO2Fhv50XKxcK8LA5OqYA.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Effect of adding a new server to the cluster and the redistribution of the keys&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The effect is more severe when a server goes down as shown below. In this case, we’ll need to update ALL servers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F627%2F1%2A45Yqak6sZHWnnUaRy72q7w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F627%2F1%2A45Yqak6sZHWnnUaRy72q7w.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Effect of removing a server from the cluster and the redistribution of the keys&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Distribution — Avoiding “Data Hot Spots” in Cluster:-
&lt;/h4&gt;

&lt;p&gt;We cannot expect a uniform distribution of the incoming data at all times. There may be many more keys whose hashValue maps to server number 1 than to any other server, in which case server number 1 will become a hotspot for keys.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Consistent hashing allows us to solve both of these problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What exactly is Consistent Hashing?
&lt;/h3&gt;

&lt;p&gt;So, how can this problem be solved? We need a distribution scheme that does &lt;em&gt;not&lt;/em&gt; depend directly on the number of servers, so that, when adding or removing servers, the number of keys that need to be relocated is minimized. Consistent hashing facilitates the distribution of data across a set of nodes in a way that minimizes the remapping/reorganization of data when nodes are added or removed.&lt;/p&gt;

&lt;p&gt;Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed &lt;em&gt;hash table&lt;/em&gt; by assigning them a position on a &lt;em&gt;hash ring&lt;/em&gt;. This allows servers and objects to scale without affecting the overall system. Here’s how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Creating the Hash Key Space:&lt;/strong&gt; Consider a hash function that generates hash values in the range [0, 2³² − 1]. We can represent this as an array of integers with 2³² slots. We’ll call the first slot x0 and the last slot xn-1, where n = 2³².&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F864%2F1%2AaWM8uj5NG1mBAOppaZqVsA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F864%2F1%2AaWM8uj5NG1mBAOppaZqVsA.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Linear Hash Key Space&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Representing the hash space as a Ring:&lt;/strong&gt; Imagine that these integers generated after hashing are placed on a ring such that the last value wraps around and forms a cycle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F594%2F1%2Ab2zrdJkPQciCb-0PSn2HfA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F594%2F1%2Ab2zrdJkPQciCb-0PSn2HfA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Placing servers on the HashRing:&lt;/strong&gt; We’re given a list of servers to start with. Using the hash function, we map each server to a specific place on the ring. This places the four servers at different points on the ring, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F513%2F1%2AILnfwKrlTfUtXvHvFDJakg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F513%2F1%2AILnfwKrlTfUtXvHvFDJakg.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Placing servers on a hash ring&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Determining Placement of Keys on Servers:&lt;/strong&gt; To find which server an incoming key resides on, we do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate the hash for the key using the hash function.&lt;/li&gt;
&lt;li&gt;After hashing the key, we’ll get an integer value which will be contained in the hash space, i.e., it can be mapped to some position in the hash ring. There can be two cases:&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;The hash value maps to a place on the ring which does not have a server. In this case, we travel clockwise on the ring from the point where the key is mapped to until we find the first server. Once we find the first server traveling clockwise on the ring, we insert the key there. The same logic would apply while trying to find a key in the ring.&lt;/li&gt;
&lt;li&gt;The hash value of the key maps directly onto the hash value of a server, in which case we place the key on that server.&lt;/li&gt;
&lt;/ol&gt;
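&lt;p&gt;Both cases above reduce to “first server at or after the key’s position, wrapping around”, which can be sketched with a sorted list and binary search. The ring-position function and names here are illustrative, not a reference implementation:&lt;/p&gt;

```python
import bisect
import hashlib

RING_SIZE = 2**32

def ring_position(name):
    # Stable position of a server name or key on the ring.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % RING_SIZE

servers = ["server0", "server1", "server2", "server3"]
ring = sorted((ring_position(s), s) for s in servers)
positions = [p for p, _ in ring]

def lookup(key):
    # Case 2: bisect_left lands exactly on a server sharing the key's
    # position. Case 1: otherwise it finds the next server clockwise,
    # wrapping to the start of the ring past the end.
    idx = bisect.bisect_left(positions, ring_position(key)) % len(ring)
    return ring[idx][1]

print(lookup("key0"))
```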

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Assume we have 4 incoming keys: key0, key1, key2, key3, and none of them maps directly to the hash value of any of the 4 servers on our hash ring. So we travel clockwise from the point each key maps to on the ring until we find the first server, and insert the key there. This is shown in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F609%2F1%2A2QliibITykaLRoPAfKlpkg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F609%2F1%2A2QliibITykaLRoPAfKlpkg.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Key placements on servers in a hash ring&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Adding a server to the Ring:&lt;/strong&gt; If we add another server, server4, to the hash ring, we’ll need to remap some keys. However, only the keys that reside between server3 and server4 need to be remapped to server4. &lt;strong&gt;On average, we’ll need to remap only k/n keys, where k is the number of keys and n is the number of servers.&lt;/strong&gt; In the modulo-based approach, we needed to remap nearly all the keys.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The figure below shows the effect of inserting a new server, server4. As server4 now lies between key3 and server0, key3 will be remapped from server0 to server4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F592%2F1%2Al4LD0o6JzG80sWSoXlYwGw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F592%2F1%2Al4LD0o6JzG80sWSoXlYwGw.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Effect of adding a server to the hash ring&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;
&lt;strong&gt;Removing a server from the ring:&lt;/strong&gt; A server might go down, and the consistent hashing scheme ensures that this has minimal effect on the keys and servers involved.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As we can see in the figure below, if server0 goes down, only the keys between server3 and server0 will need to be remapped to server1. The rest of the keys are unaffected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F665%2F1%2AT0JUu9g-R3PdYWnhm-g6Nw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F665%2F1%2AT0JUu9g-R3PdYWnhm-g6Nw.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Effect of removing a server from the hash ring&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hence we can say that &lt;strong&gt;consistent hashing&lt;/strong&gt; successfully solves the &lt;strong&gt;horizontal scalability problem&lt;/strong&gt; by ensuring that every time we scale up or down, we do not have to redistribute all the keys.&lt;/p&gt;

&lt;p&gt;Now let us talk about the second problem of &lt;strong&gt;non-uniform distribution&lt;/strong&gt; of data across servers.&lt;/p&gt;

&lt;p&gt;To ensure object keys are evenly distributed among servers, we need to apply a simple trick: &lt;strong&gt;&lt;em&gt;To assign not one, but many labels to each server on the hash ring.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So instead of having labels server0, server1, server2, server3 we could have, say server00…server03, server10…server13, server20…server23, and server30…server33 all interspersed along the circle.&lt;/p&gt;

&lt;p&gt;As the number of replicas or virtual nodes in the hash ring increases, the key distribution becomes more and more uniform.&lt;/p&gt;

&lt;p&gt;The factor by which to increase the number of labels (server keys), known as &lt;em&gt;weight&lt;/em&gt;, depends on the situation (and may even be different for each server) to adjust the probability of keys ending up on each. For example, if server0 were twice as powerful as the rest, it could be assigned twice as many labels, and as a result, it would end up holding twice as many objects (on average).&lt;/p&gt;
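&lt;p&gt;A sketch of the labeling trick (the label format and the weight parameter are illustrative choices, not from the article). Each server is placed on the ring once per label, and the key distribution can be compared for different weights:&lt;/p&gt;

```python
import bisect
import hashlib
from collections import Counter

def position(name):
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % 2**32

def build_ring(servers, weight):
    # Each server gets `weight` labels, e.g. server0-0 ... server0-3.
    return sorted((position("%s-%d" % (s, i)), s)
                  for s in servers for i in range(weight))

def lookup(ring, key):
    positions = [p for p, _ in ring]
    idx = bisect.bisect_left(positions, position(key)) % len(ring)
    return ring[idx][1]

servers = ["server0", "server1", "server2", "server3"]
keys = ["key%d" % i for i in range(10000)]
for weight in (1, 4, 64):
    ring = build_ring(servers, weight)
    counts = Counter(lookup(ring, k) for k in keys)
    # Higher weights give a noticeably flatter distribution.
    print(weight, dict(counts))
```

&lt;p&gt;Giving one server a larger weight than the rest would, in the same way, hand it proportionally more labels and therefore more keys.&lt;/p&gt;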

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F570%2F1%2AmubrfLB-mf_eyY1JGnAOEg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F570%2F1%2AmubrfLB-mf_eyY1JGnAOEg.jpeg"&gt;&lt;/a&gt;&lt;strong&gt;&lt;em&gt;Using virtual nodes/ replication to create a better key distribution in a hash ring&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now imagine server0 is removed. To account for this, we must remove the labels server00…server03 from the circle. This results in the object keys formerly adjacent to the deleted labels now being randomly relabeled server3x and server1x, reassigning them to server3 and server1.&lt;/p&gt;

&lt;p&gt;But what happens with the other object keys, the ones that originally belonged in server3 and server1? Nothing! That’s the beauty of it: The absence of server0 labels does not affect those keys in any way. So, removing a server results in its object keys being randomly reassigned to the rest of the servers, leaving all other keys untouched.&lt;/p&gt;

&lt;p&gt;And this is how &lt;strong&gt;consistent hashing&lt;/strong&gt; solves the &lt;strong&gt;non-uniform distribution&lt;/strong&gt; problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  References:-
&lt;/h3&gt;

&lt;p&gt;I found that the authors never published an extended version with proofs, even though they said they would. The closest thing to an extended paper is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.8503&amp;amp;rep=rep1&amp;amp;type=pdf" rel="noopener noreferrer"&gt;Relieving Hot Spots on the World Wide Web&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Thank you for reading. &lt;a href="https://twitter.com/Rohitpatil5" rel="noopener noreferrer"&gt;You can find me on Twitter @Rohitpatil5&lt;/a&gt;, or connect with me on &lt;a href="https://www.linkedin.com/in/rohitrpatil/" rel="noopener noreferrer"&gt;LinkedIn.&lt;/a&gt;&lt;/p&gt;




</description>
      <category>hashing</category>
      <category>consistenthashing</category>
      <category>distributedsystems</category>
      <category>caching</category>
    </item>
    <item>
      <title>The Matrix Calculus You Need For Deep Learning (Notes from a paper by Terence Parr and Jeremy…</title>
      <dc:creator>Rohit  Patil</dc:creator>
      <pubDate>Tue, 27 Feb 2018 12:44:59 +0000</pubDate>
      <link>https://dev.to/rohitpatil5/the-matrix-calculus-you-need-for-deep-learning-notes-from-a-paper-by-terence-parr-and-jeremy-59n7</link>
      <guid>https://dev.to/rohitpatil5/the-matrix-calculus-you-need-for-deep-learning-notes-from-a-paper-by-terence-parr-and-jeremy-59n7</guid>
      <description>&lt;h3&gt;
  
  
  The Matrix Calculus You Need For Deep Learning (Notes from a paper by &lt;a href="http://parrt.cs.usfca.edu/"&gt;Terence Parr&lt;/a&gt; and &lt;a href="http://www.fast.ai/about/#jeremy"&gt;Jeremy Howard&lt;/a&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zMKIKCQg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Al_ZFM3-vJMBVcZacHwCX-Q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zMKIKCQg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Al_ZFM3-vJMBVcZacHwCX-Q.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Table of Contents&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#48c0"&gt;Review: Scalar derivative rules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#d5f1"&gt;Introduction to vector calculus and partial derivatives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#3c66"&gt;Matrix calculus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#efec"&gt;Generalization of the Jacobian&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#59c5"&gt;Derivatives of vector element-wise binary operators&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#641f"&gt;Vector sum reduction&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#ada3"&gt;The Chain Rules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/p/4f4263b7bb8#3788"&gt;Resources&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Jeremy’s courses show how to become a world-class &lt;em&gt;deep learning practitioner&lt;/em&gt; with only a minimal level of scalar calculus, thanks to leveraging the automatic differentiation built in to modern deep learning libraries. But if you really want to understand what’s going on under the hood of these libraries, and grok academic papers discussing the latest advances in model training techniques, you’ll need to understand certain bits of the field of matrix calculus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Review: Scalar derivative rules
&lt;/h3&gt;

&lt;p&gt;Hopefully you remember some of these main scalar derivative rules. If your memory is a bit fuzzy on this, have a look at the &lt;a href="https://www.khanacademy.org/math/ap-calculus-ab/ab-derivative-rules"&gt;Khan Academy video on scalar derivative rules&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xrQOMWtU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZR50K2cDpl1um4S-aOeWQw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xrQOMWtU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZR50K2cDpl1um4S-aOeWQw.png" alt=""&gt;&lt;/a&gt;Basic rules of derivatives&lt;/p&gt;

&lt;p&gt;There are other rules for trigonometry, exponential, etc., which you can find at &lt;a href="https://www.khanacademy.org/math/differential-calculus"&gt;Khan Academy differential calculus course&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction to vector calculus and partial derivatives
&lt;/h3&gt;

&lt;p&gt;Neural network layers are not single functions of a single parameter, &lt;strong&gt;&lt;em&gt;f(x)&lt;/em&gt;&lt;/strong&gt;. So, let’s move on to functions of multiple parameters such as &lt;strong&gt;&lt;em&gt;f(x,y)&lt;/em&gt;&lt;/strong&gt;. For example, what is the derivative of xy (i.e., the multiplication of x and y)?&lt;/p&gt;

&lt;p&gt;Well, it depends on whether we are changing &lt;em&gt;x&lt;/em&gt; or &lt;em&gt;y&lt;/em&gt;. We compute derivatives with respect to one variable (parameter) at a time, giving us two different &lt;em&gt;partial derivatives&lt;/em&gt; for this two-parameter function (one for &lt;em&gt;x&lt;/em&gt; and one for &lt;em&gt;y&lt;/em&gt;). Instead of the operator d/dx, the partial derivative operator is &lt;strong&gt;∂/∂x&lt;/strong&gt; (a stylized &lt;em&gt;d&lt;/em&gt; and not the Greek letter δ). So &lt;strong&gt;∂(xy)/∂x&lt;/strong&gt; and &lt;strong&gt;∂(xy)/∂y&lt;/strong&gt; are the partial derivatives of &lt;em&gt;xy&lt;/em&gt;; often, these are just called the &lt;em&gt;partials&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The partial derivative with respect to &lt;em&gt;x&lt;/em&gt; is just the usual scalar derivative, simply treating any other variable in the equation as a constant. Consider the function f(x,y) = 3x²y. The partial derivative with respect to &lt;em&gt;x&lt;/em&gt; is written &lt;strong&gt;∂(3x²y)/∂x&lt;/strong&gt;. There are three constants from the perspective of &lt;strong&gt;∂/∂x&lt;/strong&gt;: 3, 2, and &lt;em&gt;y&lt;/em&gt;. Therefore, ∂(3x²y)/∂x = 3y∂(x²)/∂x = 3y(2x) = 6xy. The partial derivative with respect to &lt;em&gt;y&lt;/em&gt; treats &lt;em&gt;x&lt;/em&gt; like a constant, and we get ∂(3x²y)/∂y = 3x². You can learn more in the &lt;a href="https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives"&gt;Khan Academy video on partials&lt;/a&gt;.&lt;/p&gt;
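&lt;p&gt;These partials can be checked symbolically, for example with SymPy:&lt;/p&gt;

```python
from sympy import symbols, diff

x, y = symbols("x y")
f = 3 * x**2 * y

# Differentiate with respect to one variable, treating the other as a constant.
df_dx = diff(f, x)   # 6*x*y
df_dy = diff(f, y)   # 3*x**2

print(df_dx, df_dy)
```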

&lt;p&gt;So, from the above example, if f(x,y) = 3x²y, then&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RGg7FVdZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/657/1%2A8a9yL0FQmO70v-Juo5c3Bg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RGg7FVdZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/657/1%2A8a9yL0FQmO70v-Juo5c3Bg.png" alt=""&gt;&lt;/a&gt;&lt;em&gt;Gradient&lt;/em&gt; of f(x,y)&lt;/p&gt;

&lt;p&gt;So the &lt;strong&gt;&lt;em&gt;gradient of f(x,y) is simply a vector of its partials.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Matrix calculus
&lt;/h3&gt;

&lt;p&gt;When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. Let us bring in one more function, g(x,y) = 2x + y⁸. The gradient of g(x,y) is&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lBc7kWkm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/194/1%2AIUpOnL8zNi_fPcDKcFP3BQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lBc7kWkm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/194/1%2AIUpOnL8zNi_fPcDKcFP3BQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Gradient vectors organize all of the partial derivatives for a specific scalar function. If we have two functions, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the &lt;strong&gt;&lt;em&gt;Jacobian matrix&lt;/em&gt;&lt;/strong&gt; (or just the &lt;strong&gt;&lt;em&gt;Jacobian&lt;/em&gt;&lt;/strong&gt; ) where the gradients are rows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_DB94Oyq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/549/1%2A7xKV9D7qXX44GQQbvYlBgA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_DB94Oyq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/549/1%2A7xKV9D7qXX44GQQbvYlBgA.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://en.wikipedia.org/wiki/Matrix_calculus#Layout_conventions"&gt;Numerator layout&lt;/a&gt; of Jacobian&lt;/p&gt;

&lt;h3&gt;
  
  
  Generalization of the Jacobian
&lt;/h3&gt;

&lt;p&gt;To define the Jacobian matrix more generally, let’s combine multiple parameters into a single vector argument: &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;x,y,z&lt;/em&gt;) =&amp;gt; &lt;em&gt;f&lt;/em&gt;(&lt;strong&gt;x&lt;/strong&gt;). Lowercase letters in bold font such as &lt;strong&gt;x&lt;/strong&gt; are vectors, and those in italic font such as &lt;em&gt;x&lt;/em&gt; are scalars. &lt;em&gt;xi&lt;/em&gt; is the ith element of vector &lt;strong&gt;x&lt;/strong&gt; and is in italics because a single vector element is a scalar. We also have to define an orientation for vector &lt;strong&gt;x&lt;/strong&gt;. We’ll assume that all vectors are vertical (column vectors) by default, of size &lt;em&gt;n&lt;/em&gt; X 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qIod3W29--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/127/1%2ATnlD_RDMVQhGizRLp3jd-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qIod3W29--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/127/1%2ATnlD_RDMVQhGizRLp3jd-Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With multiple scalar-valued functions, we can combine them all into a vector, just like we did with the parameters. Let &lt;strong&gt;y = f(x)&lt;/strong&gt; be a vector of &lt;em&gt;m&lt;/em&gt; scalar-valued functions that each take a vector &lt;strong&gt;x&lt;/strong&gt; of length n = |&lt;strong&gt;x&lt;/strong&gt;|, where |&lt;strong&gt;x&lt;/strong&gt;| is the cardinality (count) of elements in &lt;strong&gt;x&lt;/strong&gt;. Each &lt;em&gt;fi&lt;/em&gt; function within &lt;strong&gt;f&lt;/strong&gt; returns a scalar, just as in the previous section:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ECo0ePJI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/177/1%2AwyqevFkocbibjZjs1E3MOA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ECo0ePJI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/177/1%2AwyqevFkocbibjZjs1E3MOA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Generally speaking, though, the Jacobian matrix is the collection of all &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt; X &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; possible partial derivatives (&lt;em&gt;m&lt;/em&gt; rows and &lt;em&gt;n&lt;/em&gt; columns), which is the stack of &lt;em&gt;m&lt;/em&gt; gradients with respect to  &lt;strong&gt;x&lt;/strong&gt; :&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Er8rgAs6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/762/1%2Af75-0xoIgirN-kkL2d_vEg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Er8rgAs6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/762/1%2Af75-0xoIgirN-kkL2d_vEg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Derivatives of vector element-wise binary operators
&lt;/h3&gt;

&lt;p&gt;By “element-wise binary operations” we simply mean applying an operator to the first item of each vector to get the first item of the output, then to the second items of the inputs for the second item of the output, and so forth. We can generalize element-wise binary operations with the notation &lt;strong&gt;y&lt;/strong&gt; = f(&lt;strong&gt;w&lt;/strong&gt;) O g(&lt;strong&gt;x&lt;/strong&gt;), where &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt; = &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; = |&lt;strong&gt;&lt;em&gt;y&lt;/em&gt;&lt;/strong&gt;| = |&lt;strong&gt;&lt;em&gt;w&lt;/em&gt;&lt;/strong&gt;| = |&lt;strong&gt;&lt;em&gt;x&lt;/em&gt;&lt;/strong&gt;|. The O symbol represents any element-wise operator (such as +), not the ∘ function composition operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gSwEkR-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A6SVHGcQcijq6aEYUoC_yZw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gSwEkR-2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A6SVHGcQcijq6aEYUoC_yZw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s quite a furball, but fortunately the Jacobian is very often a diagonal matrix, a matrix that is zero everywhere but the diagonal.&lt;/p&gt;
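&lt;p&gt;That diagonal structure is easy to confirm numerically for an element-wise + (illustrative helper names, finite differences again):&lt;/p&gt;

```python
import numpy as np

def elementwise_add(w, x):
    return w + x

def jacobian_wrt_w(w, x, eps=1e-6):
    # Finite-difference Jacobian of elementwise_add with respect to w.
    n = len(w)
    J = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        J[:, i] = (elementwise_add(w + step, x) - elementwise_add(w - step, x)) / (2 * eps)
    return J

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 5.0, 6.0])
J = jacobian_wrt_w(w, x)
# dy_i/dw_j is 1 when i == j and 0 otherwise: the identity, a diagonal matrix.
print(J)
```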

&lt;h3&gt;
  
  
  Vector sum reduction
&lt;/h3&gt;

&lt;p&gt;Summing up the elements of a vector is an important operation in deep learning, such as the network loss function, but we can also use it as a way to simplify computing the derivative of vector dot product and other operations that reduce vectors to scalars.&lt;/p&gt;

&lt;p&gt;Let &lt;strong&gt;y = sum(f(x))&lt;/strong&gt; = Σ&lt;strong&gt;&lt;em&gt;fi&lt;/em&gt;&lt;/strong&gt;(&lt;strong&gt;x&lt;/strong&gt;). Notice we were careful here to leave the parameter as a vector &lt;strong&gt;x&lt;/strong&gt; because each function &lt;em&gt;fi&lt;/em&gt; could use all values in the vector, not just &lt;em&gt;xi&lt;/em&gt;. The sum is over the &lt;strong&gt;results&lt;/strong&gt; of the function and not the parameter. The gradient (&lt;strong&gt;1&lt;/strong&gt; X &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; Jacobian) of vector summation is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nPP6mBzr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/960/1%2A-MEhl1gPbjiIjXyla9C3wQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nPP6mBzr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/960/1%2A-MEhl1gPbjiIjXyla9C3wQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Chain Rules
&lt;/h3&gt;

&lt;p&gt;We can’t compute partial derivatives of very complicated functions using just the basic matrix calculus rules. Part of our goal here is to clearly define and name three different chain rules and indicate in which situation they are appropriate.&lt;/p&gt;

&lt;p&gt;The chain rule is conceptually a divide and conquer strategy (like Quicksort) that breaks complicated expressions into sub-expressions whose derivatives are easier to compute. Its power derives from the fact that we can process each simple sub-expression in isolation yet still combine the intermediate results to get the correct overall result.&lt;/p&gt;

&lt;p&gt;The chain rule comes into play when we need the derivative of an expression composed of nested subexpressions. For example, we need the chain rule when confronted with expressions like d(sin(x²))/dx.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://m.wolframalpha.com/input/?i=chain+rule"&gt;&lt;strong&gt;Single-variable chain rule&lt;/strong&gt;&lt;/a&gt; ** :-** Chain rules are typically defined in terms of nested functions, such as &lt;strong&gt;&lt;em&gt;y&lt;/em&gt;&lt;/strong&gt; = &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;(&lt;em&gt;u&lt;/em&gt;)&lt;/strong&gt; where &lt;strong&gt;&lt;em&gt;u&lt;/em&gt;= &lt;em&gt;g&lt;/em&gt;(&lt;em&gt;x&lt;/em&gt;) so &lt;em&gt;y&lt;/em&gt;= &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;g&lt;/em&gt;(&lt;em&gt;x&lt;/em&gt;))&lt;/strong&gt; for single-variable chain rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cNfWIG2n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/160/1%2Au3k7gS3vAPEXukvkWmyDzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cNfWIG2n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/160/1%2Au3k7gS3vAPEXukvkWmyDzg.png" alt=""&gt;&lt;/a&gt;Formulation of the single-variable chain rule&lt;/p&gt;

&lt;p&gt;To deploy the single-variable chain rule, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Introduce intermediate variables for nested sub-expressions and sub-expressions for both binary and unary operators; for example, × is binary, while &lt;strong&gt;&lt;em&gt;sin(x)&lt;/em&gt;&lt;/strong&gt; and other trigonometric functions are usually unary because there is a single operand. This step normalizes all equations to single operators or function applications.&lt;/li&gt;
&lt;li&gt;Compute derivatives of the intermediate variables with respect to their parameters.&lt;/li&gt;
&lt;li&gt;Combine all derivatives of intermediate variables by multiplying them together to get the overall result.&lt;/li&gt;
&lt;li&gt;Substitute intermediate variables back in if any are referenced in the derivative equation.&lt;/li&gt;
&lt;/ol&gt;
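&lt;p&gt;A sketch of the four steps above applied to d(sin(x²))/dx (the sample point is illustrative):&lt;/p&gt;

```python
import math

x = 0.5  # sample point

# 1. Introduce an intermediate variable for the nested sub-expression.
u = x ** 2          # u = g(x)
y = math.sin(u)     # y = f(u)

# 2. Differentiate each intermediate variable with respect to its parameter.
du_dx = 2 * x
dy_du = math.cos(u)

# 3. Multiply the intermediate derivatives together.
dy_dx = dy_du * du_dx

# 4. Substitute the intermediate variable back in: dy/dx = cos(x**2) * 2x.
assert math.isclose(dy_dx, math.cos(x ** 2) * 2 * x)

# Sanity check against a central finite difference.
eps = 1e-7
num = (math.sin((x + eps) ** 2) - math.sin((x - eps) ** 2)) / (2 * eps)
assert abs(dy_dx - num) < 1e-6
```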

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-variable total-derivative chain rule :-&lt;/strong&gt; The total derivative assumes all variables are potentially codependent whereas the partial derivative assumes all variables but &lt;em&gt;x&lt;/em&gt; are constants.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uJDjhT8v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/905/1%2ANxfy9U1Eh7_skLznwc3U-w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uJDjhT8v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/905/1%2ANxfy9U1Eh7_skLznwc3U-w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chain rule that takes into consideration the total derivative degenerates to the single-variable chain rule when all intermediate variables are functions of a single variable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A word of caution about terminology on the web. Unfortunately, the chain rule given in this section, based upon the total derivative, is universally called “multivariable chain rule” in calculus discussions, which is highly misleading! Only the intermediate variables are multivariate functions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vector chain rule :-&lt;/strong&gt; Vector chain rule for vectors of functions and a single parameter mirrors the single-variable chain rule.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If &lt;strong&gt;&lt;em&gt;y&lt;/em&gt; = &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;g&lt;/em&gt;(&lt;em&gt;x&lt;/em&gt;))&lt;/strong&gt; where &lt;em&gt;x&lt;/em&gt; is a single parameter, the derivative of vector &lt;strong&gt;y&lt;/strong&gt; with respect to scalar &lt;em&gt;x&lt;/em&gt; is a vertical vector with elements computed using the single-variable total-derivative chain rule.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CBYKWzVO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/412/1%2AjtPtLuhQTBVK_0Be5rIpXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CBYKWzVO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/412/1%2AjtPtLuhQTBVK_0Be5rIpXQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goal is to convert the above vector of scalar operations to a vector operation, so the RHS matrix above can also be implemented as a product of two vectors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ewqf-cau--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/317/1%2AAIrI5dTFm_X_ybGL0UqayA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ewqf-cau--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/317/1%2AAIrI5dTFm_X_ybGL0UqayA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That means that the Jacobian is the multiplication of two other Jacobians. To make this formula work for multiple parameters or vector &lt;strong&gt;x&lt;/strong&gt;, we just have to change &lt;em&gt;x&lt;/em&gt; to vector &lt;strong&gt;x&lt;/strong&gt; in the equation. The effect is that &lt;strong&gt;∂g/∂x&lt;/strong&gt; and the resulting Jacobian, &lt;strong&gt;∂f/∂x&lt;/strong&gt;, are now matrices instead of vertical vectors. Our complete &lt;em&gt;vector chain rule&lt;/em&gt; is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nXHdyB7c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/223/1%2AsGJplnNYP7leOS-lNmRNFg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nXHdyB7c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/223/1%2AsGJplnNYP7leOS-lNmRNFg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Please note that matrix multiplication does not commute; the order of &lt;strong&gt;(∂f/∂g)(∂g/∂x)&lt;/strong&gt; matters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For completeness, here are the two Jacobian components :-&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vO-bS5EV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/666/1%2Ag5ph9kQOynRPVYfr1nYytA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vO-bS5EV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/666/1%2Ag5ph9kQOynRPVYfr1nYytA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt; = | &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt; |, &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; = | &lt;strong&gt;&lt;em&gt;x&lt;/em&gt;&lt;/strong&gt; | and &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; = | &lt;strong&gt;&lt;em&gt;g&lt;/em&gt;&lt;/strong&gt; |. The resulting Jacobian is &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt; X &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; (an &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt; X &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; matrix multiplied by a &lt;strong&gt;&lt;em&gt;k&lt;/em&gt;&lt;/strong&gt; X &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; matrix).&lt;/p&gt;

&lt;p&gt;We can simplify further because, for many applications, the Jacobians are square ( &lt;strong&gt;&lt;em&gt;m&lt;/em&gt;&lt;/strong&gt; = &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; ) and the off-diagonal entries are zero.&lt;/p&gt;
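&lt;p&gt;A small NumPy sketch of the vector chain rule for element-wise &lt;em&gt;f&lt;/em&gt; and &lt;em&gt;g&lt;/em&gt;, where both Jacobians are diagonal (the functions chosen are illustrative):&lt;/p&gt;

```python
import numpy as np

# y = f(g(x)) with element-wise f and g, so both Jacobians are diagonal,
# and their product (the chain rule) is diagonal too.
def g(x):
    return x ** 2          # g_i depends only on x_i

def f(u):
    return np.sin(u)       # f_i depends only on u_i

x = np.array([0.3, 0.7, 1.1])

df_dg = np.diag(np.cos(g(x)))   # Jacobian of f evaluated at g(x)
dg_dx = np.diag(2 * x)          # Jacobian of g evaluated at x
jac = df_dg @ dg_dx             # vector chain rule: (df/dg)(dg/dx)

# Numerical Jacobian of the composition for comparison.
eps = 1e-6
num = np.column_stack([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(x.size)
])
assert np.allclose(jac, num, atol=1e-5)
```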

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;The original &lt;a href="http://parrt.cs.usfca.edu/doc/matrix-calculus/index.html#sec3"&gt;paper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;There are some online tools which can differentiate a matrix for you:

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.wolframalpha.com/input/?i=D%5B%7Bx%5E2,+x%5E3%7D.%7B%7B1,2%7D,%7B3,4%7D%7D.%7Bx%5E2,+x%5E3%7D,+x%5D"&gt;Wolfram Alpha&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.matrixcalculus.org/"&gt;Matrix Calculus Differentiator&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;More &lt;a href="https://atmos.washington.edu/~dennis/MatrixCalculus.pdf"&gt;matrix calculus&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IZl3bwWV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/201/1%2AxhCj5o5sYa_8VBUw_mR_Xg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IZl3bwWV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/201/1%2AxhCj5o5sYa_8VBUw_mR_Xg.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>calculus</category>
      <category>mathematics</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How to use TensorFlow Object Detection API On Windows</title>
      <dc:creator>Rohit  Patil</dc:creator>
      <pubDate>Wed, 31 Jan 2018 21:25:37 +0000</pubDate>
      <link>https://dev.to/rohitpatil5/how-to-use-tensorflow-object-detection-api-on-windows-21b4</link>
      <guid>https://dev.to/rohitpatil5/how-to-use-tensorflow-object-detection-api-on-windows-21b4</guid>
      <description>&lt;p&gt;Around July 2017, TensorFlow’s Object Detection API was released. The TensorFlow Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models.&lt;/p&gt;

&lt;p&gt;What makes this API huge is that, unlike models such as YOLO and SSD, you do not need a complex hardware setup to run it.&lt;/p&gt;

&lt;p&gt;They have published a paper titled &lt;a href="https://arxiv.org/pdf/1611.10012.pdf"&gt;&lt;strong&gt;Speed/accuracy trade-offs for modern convolutional object detectors&lt;/strong&gt;&lt;/a&gt;. Here they discuss the various architectures available for object detection, such as YOLO, Faster R-CNN, SSD and R-FCN.&lt;/p&gt;

&lt;p&gt;This API is capable of identifying many types of objects like cars, pedestrians, person, kite, dog and many more. You can find the whole list &lt;a href="https://github.com/tensorflow/models/tree/master/research/object_detection/data"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have used this API for detecting traffic signals in a live video stream for the capstone project of Udacity’s Self-Driving Car Nanodegree program. In this project we had to run Carla (Udacity’s self-driving car) on the road.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf_EjmWY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/347/1%2AQWfzSzTFabLWSginzECvDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yf_EjmWY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/347/1%2AQWfzSzTFabLWSginzECvDQ.png" alt=""&gt;&lt;/a&gt;Output from the simulator’s video feed.&lt;/p&gt;

&lt;p&gt;Let’s begin the setup.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Clone the &lt;a href="https://github.com/tensorflow/models"&gt;tensorflow-model repository&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The main API documentation is at &lt;a href="https://github.com/tensorflow/models/tree/master/research/object_detection"&gt;https://github.com/tensorflow/models/tree/master/research/object_detection&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Install TensorFlow.&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# For CPU
pip install tensorflow
# For GPU
pip install tensorflow-gpu
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;Install all other dependencies.&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pillow
pip install lxml
pip install jupyter
pip install matplotlib
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;Download the Google Protobuf Windows v3.4.0 release “protoc-3.4.0-win32.zip” from &lt;a href="https://github.com/google/protobuf"&gt;https://github.com/google/protobuf&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extract the Protobuf download to Program Files, specifically:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Program Files\protoc-3.4.0-win32
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ol start="7"&gt;
&lt;li&gt;Now cd into models\research.&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd path\to\models\research
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ol start="8"&gt;
&lt;li&gt;Execute the protobuf compile:&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“C:\Program Files\protoc-3.4.0-win32\bin\protoc.exe” object\_detection/protos/\*.proto --python\_out=.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This is the most important step in the installation process.&lt;/p&gt;

&lt;ol start="9"&gt;
&lt;li&gt;&lt;p&gt;Now navigate to models\research\object_detection\protos and verify that the .py files were created successfully as a result of the compilation (only the .proto files were there to begin with).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cd to \models\research\object_detection. Open the jupyter notebook object_detection_tutorial.ipynb. Here you can play with the API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Problem you will probably face :
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If you move the notebook to any other directory and run it, you will get an error:&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ModuleNotFoundError: No module named 'utils'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The source of this error is these two lines in the code:&lt;/li&gt;
&lt;/ul&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from utils import label\_map\_util
from utils import visualization\_utils as vis\_util
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This error occurs because we have not yet told Python how to find the utils directory that these lines import from.&lt;/li&gt;
&lt;/ul&gt;
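&lt;p&gt;As a lighter-weight alternative sketch that avoids environment variables, you can append the directories to sys.path at the top of the notebook. The path below is an assumption; adjust it to wherever you cloned the repository:&lt;/p&gt;

```python
import os
import sys

# Hypothetical clone location -- replace with your own path.
RESEARCH_DIR = r"C:\path\to\models\research"

# Make `from utils import ...` resolvable inside the notebook.
sys.path.append(RESEARCH_DIR)
sys.path.append(os.path.join(RESEARCH_DIR, "object_detection"))
```

&lt;p&gt;Note this only affects the current Python process, whereas the environment-variable approach below works system-wide.&lt;/p&gt;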

&lt;h4&gt;
  
  
  Issue Resolution :
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Go to System -&amp;gt; Advanced system settings -&amp;gt; Environment Variables -&amp;gt; New, and add a variable with the name PYTHON_PATH and these values:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5L12vPfz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/806/1%2ASKX64WwYgCYKXToHPZusXg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5L12vPfz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/806/1%2ASKX64WwYgCYKXToHPZusXg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In system variables, edit PATH and add %PYTHON_PATH%.&lt;/li&gt;
&lt;li&gt;You will need to restart the system and then you are free to use this code anywhere in the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Some output samples
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sHhEDB8b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1000/1%2Al33kFqOZt7ti98kAKLwadQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sHhEDB8b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1000/1%2Al33kFqOZt7ti98kAKLwadQ.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_VowbTiN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/655/1%2A62JsN-0BZ2rl1QdhNLSN8A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_VowbTiN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/655/1%2A62JsN-0BZ2rl1QdhNLSN8A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For experimenting with this API, I used my webcam and my mobile’s camera. I used the &lt;a href="https://play.google.com/store/apps/details?id=com.pas.webcam&amp;amp;hl=en"&gt;IP Webcam Android App&lt;/a&gt; for interfacing with the mobile camera. You can check out the repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rohts-patil/TensorFlow-Object-Detection-API-On-Live-Video-Feed"&gt;rohts-patil/TensorFlow-Object-Detection-API-On-Live-Video-Feed&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading. &lt;a href="https://twitter.com/Rohitpatil5"&gt;You can find me on Twitter @Rohitpatil5&lt;/a&gt;, or connect with me on &lt;a href="https://www.linkedin.com/in/rohitrpatil/"&gt;LinkedIn.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python3</category>
      <category>deeplearning</category>
      <category>computervision</category>
      <category>objectdetection</category>
    </item>
    <item>
      <title>Basic Mathematics for Machine Learning</title>
      <dc:creator>Rohit  Patil</dc:creator>
      <pubDate>Tue, 09 Jan 2018 07:11:57 +0000</pubDate>
      <link>https://dev.to/rohitpatil5/basic-mathematics-for-machine-learning-d2a</link>
      <guid>https://dev.to/rohitpatil5/basic-mathematics-for-machine-learning-d2a</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/@rohitrpatil/the-matrix-calculus-you-need-for-deep-learning-notes-from-a-paper-by-terence-parr-and-jeremy-4f4263b7bb8"&gt;&lt;em&gt;EDIT :- For calculus go through my post on matrix calculus.&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many reasons why mathematics is important for machine learning. Some of them are below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Selecting the right algorithm which includes giving considerations to accuracy, training time, model complexity, number of parameters and number of features.&lt;/li&gt;
&lt;li&gt;Choosing parameter settings and validation strategies.&lt;/li&gt;
&lt;li&gt;Identifying underfitting and overfitting by understanding the Bias-Variance tradeoff.&lt;/li&gt;
&lt;li&gt;Estimating the right confidence interval and uncertainty.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VJtCYYCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AeUYIqyg9aomn55O-Q5XRVg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VJtCYYCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AeUYIqyg9aomn55O-Q5XRVg.png" alt=""&gt;&lt;/a&gt;What really is used!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the best resources for learning?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D0vz6psG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/856/1%2AODJPbEcabUyMrMPnCeOYKg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D0vz6psG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/856/1%2AODJPbEcabUyMrMPnCeOYKg.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have tried to summarize the mathematics taught in both of the resources above. So let’s begin!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scalars, Vectors, Matrices and Tensors :&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalars :&lt;/strong&gt; A scalar is just a single number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vectors :&lt;/strong&gt; A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. &lt;em&gt;x=[x1 x2 x3 …. xn].&lt;/em&gt; We can think of vectors as identifying points in space, with each element giving the coordinate along a diﬀerent axis. Sometimes we need to index a set of elements of a vector. In this case, we deﬁne a set containing the indices and write the set as a subscript. For example, to access &lt;em&gt;x1, x3 and x6&lt;/em&gt; we deﬁne the set &lt;em&gt;S&lt;/em&gt; ={1,3,6} and write &lt;em&gt;xS&lt;/em&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matrices :&lt;/strong&gt; A matrix is a 2-D array of numbers, so each element is identiﬁed by two indices instead of just one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z4pEN0OO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/166/1%2AYIZqGzeRxmfiFW-fj8qp-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z4pEN0OO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/166/1%2AYIZqGzeRxmfiFW-fj8qp-Q.png" alt=""&gt;&lt;/a&gt;When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;transpose&lt;/strong&gt; of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6YGbQPcc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/397/1%2AIfvcwFO2FLdr-xsE--BRCw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6YGbQPcc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/397/1%2AIfvcwFO2FLdr-xsE--BRCw.png" alt=""&gt;&lt;/a&gt;Transpose of a matrix&lt;/p&gt;

&lt;p&gt;We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where &lt;em&gt;Ci,j = Ai,j+ Bi,j.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We allow the addition of matrix and a vector, yielding another matrix: C=A+b, where &lt;em&gt;Ci,j = Ai,j +bj&lt;/em&gt;. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to deﬁne a matrix with b copied into each row before doing the addition. This implicit copying of b to many locations is called &lt;strong&gt;&lt;em&gt;broadcasting&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;
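&lt;p&gt;Broadcasting is exactly what NumPy does when you add a matrix and a vector; a small sketch with illustrative values:&lt;/p&gt;

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([10.0, 20.0])

# C = A + b: b is implicitly copied ("broadcast") to each row of A,
# so C[i, j] = A[i, j] + b[j].
C = A + b
assert np.array_equal(C, np.array([[11.0, 22.0],
                                   [13.0, 24.0]]))
```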

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tensors :&lt;/strong&gt; An array of numbers arranged on a regular grid with a variable number of axes is known as a tensor.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multiplying Matrices and Vectors:
&lt;/h3&gt;

&lt;p&gt;The matrix product of matrices A and B is a third matrix C. In order for this product to be deﬁned, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p.&lt;/p&gt;

&lt;p&gt;The product operation is deﬁned by&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1MPXwn73--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/195/1%2AUwUwRGXR4K2ulM7yLLaVfg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1MPXwn73--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/195/1%2AUwUwRGXR4K2ulM7yLLaVfg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matrix multiplication is distributive and associative, but unlike scalar multiplication it is not commutative (the condition AB = BA does not always hold).&lt;/p&gt;
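&lt;p&gt;The shape rule and non-commutativity can be checked in a few lines of NumPy (the matrices are illustrative):&lt;/p&gt;

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # shape (m, n) = (2, 3)
B = np.arange(12.0).reshape(3, 4)  # shape (n, p) = (3, 4)

C = A @ B                 # defined because A has 3 columns and B has 3 rows
assert C.shape == (2, 4)  # the result is m x p

# Non-commutativity: even for square matrices, AB != BA in general.
P = np.array([[1.0, 2.0], [3.0, 4.0]])
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
assert not np.array_equal(P @ Q, Q @ P)
```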

&lt;p&gt;For learning more, you can go through this course offered by MIT OpenCourseWare (Prof. Gilbert Strang).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/"&gt;Linear Algebra&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Probability Theory:
&lt;/h3&gt;

&lt;p&gt;Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty as well as axioms for deriving new uncertain statements.&lt;/p&gt;

&lt;p&gt;Let us understand some of the terminologies used in probability theory:-&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Random Variables :&lt;/strong&gt; A random variable is a variable that can take on diﬀerent values randomly. They may be continuous or discrete. A discrete random variable is one that has a ﬁnite or countably inﬁnite number of states. A continuous random variable is associated with a real value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probability Distributions&lt;/strong&gt;  : A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. A probability distribution over discrete variables may be described using a &lt;strong&gt;probability mass function (PMF)&lt;/strong&gt; denoted by P(&lt;em&gt;x&lt;/em&gt;). When working with continuous random variables, we describe probability distributions using a &lt;strong&gt;probability density function (PDF)&lt;/strong&gt; denoted by &lt;em&gt;p(x)&lt;/em&gt;. A probability density function &lt;em&gt;p(x)&lt;/em&gt; does not give the probability of a speciﬁc state directly; instead the probability of landing inside an inﬁnitesimal region with volume δx is given by &lt;em&gt;p(x)&lt;/em&gt;δx.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conditional Probability&lt;/strong&gt;  : In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = &lt;em&gt;y&lt;/em&gt; given x = &lt;em&gt;x&lt;/em&gt; as P(y=&lt;em&gt;y&lt;/em&gt; | x=&lt;em&gt;x&lt;/em&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kQhppBP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/443/1%2Af4LedO-X3k4cmlSeAv9_pQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kQhppBP6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/443/1%2Af4LedO-X3k4cmlSeAv9_pQ.png" alt=""&gt;&lt;/a&gt;The conditional probability is only deﬁned when P(x=x) &amp;gt;0&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Chain Rule of Conditional Probabilities :&lt;/strong&gt; Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YW_YuIVb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/650/1%2Aj9BZZVw55hnb2jI0_aKM3A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YW_YuIVb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/650/1%2Aj9BZZVw55hnb2jI0_aKM3A.png" alt=""&gt;&lt;/a&gt;Chain rule or product rule of probability&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expectation&lt;/strong&gt;  : The expectation, or expected value, of some function &lt;em&gt;f(x)&lt;/em&gt; with respect to a probability distribution P(&lt;em&gt;x&lt;/em&gt;) is the average, or mean value, that &lt;em&gt;f&lt;/em&gt; takes on when &lt;em&gt;x&lt;/em&gt; is drawn from P.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sdmYab-U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/335/1%2AoSXW9vM5TjP5ihdhGqCfow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sdmYab-U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/335/1%2AoSXW9vM5TjP5ihdhGqCfow.png" alt=""&gt;&lt;/a&gt;For discrete variables this can be computed with a summation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variance :&lt;/strong&gt; The variance gives a measure of how much the values of a function of a random variable x vary as we sample diﬀerent values of &lt;em&gt;x&lt;/em&gt; from its probability distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--imyGkViA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/418/1%2A0AQxWEY6sE5H09wtRiiJaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--imyGkViA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/418/1%2A0AQxWEY6sE5H09wtRiiJaw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The square root of the variance is known as the &lt;strong&gt;standard deviation&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Covariance&lt;/strong&gt;  : The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QnG8nRkW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/648/1%2A9bqZmpDeghnAE_8-Gt6fGg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QnG8nRkW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/648/1%2A9bqZmpDeghnAE_8-Gt6fGg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;High absolute values of the &lt;strong&gt;covariance&lt;/strong&gt; means that the values change very much and are both far from their respective means at the same time. If the sign of the &lt;strong&gt;covariance&lt;/strong&gt; is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the &lt;strong&gt;covariance&lt;/strong&gt; is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa.&lt;/p&gt;
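&lt;p&gt;A quick NumPy sketch of these covariance behaviors on sampled data (the distributions and seed are illustrative):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.1, size=10_000)  # strongly linearly related to x
z = rng.normal(size=10_000)                       # independent of x

# Positive covariance: x and y tend to be high (or low) simultaneously.
assert np.cov(x, y)[0, 1] > 1.0

# Near-zero covariance for independent variables.
assert abs(np.cov(x, z)[0, 1]) < 0.1

# The standard deviation is the square root of the variance.
assert np.isclose(np.std(x), np.sqrt(np.var(x)))
```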

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bayes’ Rule&lt;/strong&gt;: Bayes’ theorem is a formula that describes how to update the probabilities of hypotheses when given evidence. It follows directly from the axioms of conditional probability, but it can be used to reason powerfully about a wide range of problems involving belief updates. We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ufQZ5JZ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/292/1%2AHBBtRBuXKvtQ6LGm7bRLMw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ufQZ5JZ6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/292/1%2AHBBtRBuXKvtQ6LGm7bRLMw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
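&lt;p&gt;As a sketch, here is Bayes’ rule applied to a toy diagnostic test; all of the numbers (sensitivity, false-positive rate, base rate) are made up for illustration:&lt;/p&gt;

```python
# Bayes' rule on a hypothetical test: 99% sensitivity, 5% false-positive
# rate, 1% base rate of the condition. All numbers are made up.
p_x = 0.01              # P(x): prior probability of having the condition
p_y_given_x = 0.99      # P(y | x): probability of a positive test if present
p_y_given_not_x = 0.05  # false-positive rate

# P(y) by the law of total probability
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Bayes' rule: P(x | y) = P(y | x) P(x) / P(y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # 0.167
```

&lt;p&gt;Even with a 99%-sensitive test, the low base rate drags the posterior down to about 1 in 6 — exactly the kind of belief update the formula captures.&lt;/p&gt;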

&lt;h3&gt;
  
  
  Common Probability Distributions:-
&lt;/h3&gt;

&lt;p&gt;Some of the common probability distributions used in machine learning are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bernoulli Distribution&lt;/strong&gt;: It is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0,1], which gives the probability of the random variable being equal to 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z8-rgZ56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/305/1%2Ao_yH8xdu93G68_SAbnV8PQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z8-rgZ56--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/305/1%2Ao_yH8xdu93G68_SAbnV8PQ.png" alt=""&gt;&lt;/a&gt;Properties of Bernoulli Distribution&lt;/p&gt;
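&lt;p&gt;A quick way to check these properties numerically (a sketch using NumPy, which exposes the Bernoulli distribution as a Binomial with n = 1):&lt;/p&gt;

```python
# Bernoulli distribution: P(x=1) = phi, P(x=0) = 1 - phi,
# with mean phi and variance phi * (1 - phi).
import numpy as np

phi = 0.3
rng = np.random.default_rng(0)
# A Bernoulli draw is a single Binomial trial (n=1).
samples = rng.binomial(n=1, p=phi, size=100_000)

print(round(samples.mean(), 2))  # close to phi = 0.3
print(round(samples.var(), 2))   # close to 0.3 * 0.7 = 0.21
```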

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multinoulli Distribution&lt;/strong&gt;: The multinoulli, or categorical, distribution is a distribution over a single discrete variable with &lt;em&gt;k&lt;/em&gt; different states, where &lt;em&gt;k&lt;/em&gt; is finite. Multinoulli distributions are often used to refer to distributions over categories of objects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gaussian Distribution&lt;/strong&gt;: The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BjphSg9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/521/1%2AmPhKrBsFbbH9L8kHpSNRQQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BjphSg9c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/521/1%2AmPhKrBsFbbH9L8kHpSNRQQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The two parameters &lt;em&gt;µ&lt;/em&gt; ∈ R and &lt;em&gt;σ&lt;/em&gt; ∈ (0, ∞) control the normal distribution. The parameter &lt;em&gt;µ&lt;/em&gt; gives the coordinate of the central peak. This is also the &lt;strong&gt;mean&lt;/strong&gt; of the distribution: E[x] = &lt;em&gt;µ&lt;/em&gt;. The &lt;strong&gt;standard deviation&lt;/strong&gt; of the distribution is given by &lt;em&gt;σ&lt;/em&gt;, and the variance by &lt;em&gt;σ&lt;/em&gt;².&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mFHW6Hwz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/650/1%2AHRj-NCJWQyfZTGa_835ylg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mFHW6Hwz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/650/1%2AHRj-NCJWQyfZTGa_835ylg.png" alt=""&gt;&lt;/a&gt;Plot of the normal distribution density function&lt;/p&gt;
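&lt;p&gt;The density formula above can be written out directly; this sketch checks the peak height and the sample mean and standard deviation (the parameter values chosen here are arbitrary):&lt;/p&gt;

```python
# The Gaussian density, written out:
# N(x; mu, sigma^2) = (1 / sqrt(2 * pi * sigma**2)) * exp(-(x - mu)**2 / (2 * sigma**2))
import math
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std sigma."""
    coef = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The central peak sits at x = mu; for the standard normal its height
# is 1 / sqrt(2 * pi), roughly 0.3989.
print(round(normal_pdf(0.0), 4))  # 0.3989

# Empirically, samples from N(mu, sigma^2) have mean mu and std sigma.
rng = np.random.default_rng(0)
draws = rng.normal(loc=2.0, scale=1.5, size=100_000)
print(round(draws.mean(), 1), round(draws.std(), 1))  # close to 2.0 and 1.5
```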

&lt;p&gt;Khan Academy has a very good course on statistics and probability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.khanacademy.org/math/statistics-probability"&gt;Statistics and Probability | Khan Academy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will wrap up here. I hope this post helps you revise some of the concepts you learned in high school. 😄 Thank you for reading!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N1dv4LsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/375/1%2ADnDyYN3E9t50nn4Iu30bZw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N1dv4LsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/375/1%2ADnDyYN3E9t50nn4Iu30bZw.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/Rohitpatil5"&gt;You can find me on Twitter @Rohitpatil5&lt;/a&gt;, or connect with me on &lt;a href="https://www.linkedin.com/in/rohitrpatil/"&gt;LinkedIn.&lt;/a&gt;&lt;/p&gt;




</description>
      <category>deeplearning</category>
      <category>mathematics</category>
      <category>machinelearning</category>
      <category>linearalgebra</category>
    </item>
    <item>
      <title>Indian Government says BIG NO to Self Driving Cars</title>
      <dc:creator>Rohit  Patil</dc:creator>
      <pubDate>Tue, 25 Jul 2017 12:21:03 +0000</pubDate>
      <link>https://dev.to/rohitpatil5/indian-government-says-big-no-to-self-driving-cars-42a</link>
      <guid>https://dev.to/rohitpatil5/indian-government-says-big-no-to-self-driving-cars-42a</guid>
<description>&lt;p&gt;On 25th July 2017, Nitin Gadkari, Union Minister for Roads and Transport, said &lt;strong&gt;India will not allow driver-less cars to ply on its roads&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The reason given is that the government’s focus is on creating more jobs to arrest unemployment. His exact words were “How can we allow such vehicles when we already have huge number of unemployed people?” Gadkari also said there is already an annual shortage of about 22,000 trained drivers in India.&lt;/p&gt;

&lt;p&gt;These words come from the same government that could not stop the sale of BS-III vehicles (a measure meant to protect the environment), and in the end the Supreme Court had to intervene to settle the issue once and for all.&lt;/p&gt;

&lt;p&gt;This is where we are wrong. India may well be the last country in the world where a self-driving car goes live on roads, but even so, this is not a cogent reason.&lt;/p&gt;

&lt;p&gt;This kind of thinking is not new in India. Across the rest of the world computers were quickly adopted, but in India they faced opposition for the same reason (“unemployment”), and today we see that computers ended up creating many jobs.&lt;/p&gt;

&lt;p&gt;I don’t know what happens to companies that are developing SDCs and need to test them on roads.&lt;/p&gt;




</description>
      <category>india</category>
      <category>indiangovernment</category>
      <category>selfdrivingcars</category>
    </item>
  </channel>
</rss>
