<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aniruddha Karajgi</title>
    <description>The latest articles on DEV Community by Aniruddha Karajgi (@polaris000).</description>
    <link>https://dev.to/polaris000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F318402%2F9ed4e36c-d03d-41dc-903f-e51db05943d6.jpeg</url>
      <title>DEV Community: Aniruddha Karajgi</title>
      <link>https://dev.to/polaris000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/polaris000"/>
    <language>en</language>
    <item>
      <title>Python: Decorators in OOP</title>
      <dc:creator>Aniruddha Karajgi</dc:creator>
      <pubDate>Wed, 27 Jan 2021 13:14:53 +0000</pubDate>
      <link>https://dev.to/polaris000/python-decorators-in-oop-175</link>
      <guid>https://dev.to/polaris000/python-decorators-in-oop-175</guid>
      <description>&lt;h4&gt;
  
  
  A guide on classmethods, staticmethods and the property decorator
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sg-UKVa4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG6JSLU01uhka9uJ2okEcgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sg-UKVa4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG6JSLU01uhka9uJ2okEcgg.png" alt=""&gt;&lt;/a&gt;Image by author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Object Oriented Programming paradigm&lt;/strong&gt; became popular in the ’60s and ’70s, in languages like Simula and Smalltalk. Object-oriented features were also added to existing languages like Ada, Fortran and Pascal.&lt;/p&gt;

&lt;p&gt;Python is an object oriented programming language, though it doesn’t support strong encapsulation.&lt;/p&gt;

&lt;p&gt;Introductory topics in object-oriented programming in Python — and more generally — include things like defining classes, creating objects, instance variables, the basics of inheritance, and maybe even some special methods like __str__. A more advanced treatment covers things like decorators, writing a custom __new__ method, metaclasses, and multiple inheritance.&lt;/p&gt;

&lt;p&gt;In this post, we’ll first discuss what decorators are, followed by a discussion on classmethods and staticmethods along with the property decorator.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Classmethods, staticmethods and property are examples of what are called descriptors: objects which implement the __get__, __set__ or __delete__ methods.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But, that’s a topic for another post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;

&lt;p&gt;We’ll talk about the following in this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**- what are decorators?  
- classmethods  
- staticmethods  
- @property**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  An example
&lt;/h3&gt;

&lt;p&gt;Let’s work on a simple example: a Student class.&lt;/p&gt;

&lt;p&gt;For now, this class has three attributes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;name&lt;/li&gt;
&lt;li&gt;score&lt;/li&gt;
&lt;li&gt;total&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll add a simple __init__ method to instantiate an object when these three attributes are provided.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
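&lt;p&gt;Concretely, a minimal version of the class might look like this (a sketch, using the attribute names from the examples later in the post):&lt;/p&gt;

```python
class Student:
    # Minimal sketch: attribute names follow the examples later in the post.
    def __init__(self, name, score, total):
        self.name = name
        self.score = score
        self.total = total
```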



&lt;p&gt;We’ll modify this as we go throughout the post.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decorators
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Decorators are functions (or classes) that provide enhanced functionality to the original function (or class) without the programmer having to modify their structure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A simple example?&lt;/p&gt;

&lt;p&gt;Suppose we want to add a method to our Student class that takes a student’s score and total marks and then returns a percentage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---PXCkPHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjpUxr9kXGK0nnHcUMFkVMQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---PXCkPHu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjpUxr9kXGK0nnHcUMFkVMQ.png" alt=""&gt;&lt;/a&gt;The get_percent function — Image by author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our percent function can be defined like so:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
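&lt;p&gt;A minimal sketch of get_percent:&lt;/p&gt;

```python
def get_percent(score, total):
    # Convert a raw score out of a total into a percentage.
    return score / total * 100
```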


&lt;p&gt;Let’s define our decorator, creatively named grade_decorator. It takes a function as input and outputs another function (wrapper, in this case).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PVF2O-a9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AkX0alKprN_8ICsC-hrQfEw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PVF2O-a9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AkX0alKprN_8ICsC-hrQfEw.png" alt=""&gt;&lt;/a&gt;Our decorator — Image by author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The wrapper function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;takes our two arguments, score and total&lt;/li&gt;
&lt;li&gt;calls the function object passed to grade_decorator&lt;/li&gt;
&lt;li&gt;calculates the grade corresponding to the percent scored&lt;/li&gt;
&lt;li&gt;finally, returns the calculated percentage along with the grade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tZCLXO6H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A_0KivLuoTGOw2_YQyX3EUw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tZCLXO6H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A_0KivLuoTGOw2_YQyX3EUw.png" alt=""&gt;&lt;/a&gt;How applying the decorator works — Image by author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can implement our decorator like so.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
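&lt;p&gt;A sketch of the decorator; the grade bands here are an assumption for illustration, chosen so that 25% maps to a D as in the output further down:&lt;/p&gt;

```python
def grade_decorator(func):
    # Takes a function and returns wrapper, which adds a letter grade
    # to the percentage. Grade bands are an assumption for illustration.
    def wrapper(score, total):
        percent = func(score, total)
        if percent >= 90:
            grade = "A"
        elif percent >= 75:
            grade = "B"
        elif percent >= 60:
            grade = "C"
        else:
            grade = "D"
        return percent, grade
    return wrapper
```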


&lt;p&gt;Now, to improve the get_percent function, just use the @ symbol with the decorator name above our function, which has exactly the same definition as before.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
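&lt;p&gt;Put together, applying the decorator might look like this (grade bands again assumed for illustration):&lt;/p&gt;

```python
def grade_decorator(func):
    # Grade bands are an assumption for illustration.
    def wrapper(score, total):
        percent = func(score, total)
        if percent >= 90:
            grade = "A"
        elif percent >= 60:
            grade = "B"
        else:
            grade = "D"
        return percent, grade
    return wrapper

@grade_decorator
def get_percent(score, total):
    # The function body is unchanged; only the @ line is new.
    return score / total * 100
```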


&lt;p&gt;To use this, we don’t need to modify our call statement. Executing this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**get\_percent(25, 100)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;returns&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**25.0, D**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;What basically happens is that when we apply the decorator, the name get_percent is rebound to wrapper, so calling get_percent now runs wrapper.&lt;/p&gt;

&lt;p&gt;We’ll place the get_percent method inside the Student class, and place our decorator outside the class. Since get_percent is an instance method, we add a self argument to it.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
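&lt;p&gt;A sketch of this arrangement; note that the wrapper now forwards self (grade bands assumed for illustration):&lt;/p&gt;

```python
def grade_decorator(func):
    # Defined outside the class; wrapper forwards self because it now
    # wraps an instance method. Grade bands assumed for illustration.
    def wrapper(self, score, total):
        percent = func(self, score, total)
        grade = "D" if percent < 60 else "A"
        return percent, grade
    return wrapper

class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.score = score
        self.total = total

    @grade_decorator
    def get_percent(self, score, total):
        return score / total * 100
```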



&lt;p&gt;How are decorators used in classes?&lt;/p&gt;

&lt;p&gt;We’ll see three popular decorators used in classes and their use-cases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mYJELp0C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ATaqEDwhHTSZS9-DrLXe8uA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mYJELp0C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ATaqEDwhHTSZS9-DrLXe8uA.png" alt=""&gt;&lt;/a&gt;The kinds of methods in our class A — Image by author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  classmethod
&lt;/h3&gt;

&lt;p&gt;Let’s first talk about instance methods. Instance methods are methods called on an object, and hence they are passed information about that object. By convention, this happens through the self parameter: when the method is called, the object is passed implicitly as self.&lt;/p&gt;

&lt;p&gt;For example, we could add a method to our class that calculates a student’s grade and percentage (using the get_percent method) and generates a report as a string with the student’s name, percentage, and grade.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
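&lt;p&gt;A sketch of such a method; the name report and the exact string format are illustrative assumptions:&lt;/p&gt;

```python
class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.score = score
        self.total = total

    def get_percent(self, score, total):
        return score / total * 100

    def report(self):
        # Hypothetical method name; builds a report string from instance data.
        percent = self.get_percent(self.score, self.total)
        return f"Name: {self.name}, Percent: {percent}"
```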


&lt;p&gt;&lt;strong&gt;Coming to a class method&lt;/strong&gt;, this type of method is called on a class, and hence it is passed the class itself. By convention, this happens through the cls parameter. We also add the @classmethod decorator to our class method.&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class A:
    def instance_method(self):
        return self

 **@classmethod  
 def class\_method(cls):  
 return cls**  

A.class_method()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  use-cases for class-methods
&lt;/h4&gt;

&lt;p&gt;Since class-methods work with a class, and not an instance, they can be used as part of a &lt;strong&gt;factory pattern,&lt;/strong&gt; where objects are returned based on certain parameters.&lt;/p&gt;

&lt;p&gt;For example, there are multiple ways to create a DataFrame in pandas. Methods like DataFrame.from_records() and DataFrame.from_dict() all return a DataFrame object. Though their actual implementations are fairly complex, they essentially take something like a dictionary, parse that data, and return a DataFrame object.&lt;/p&gt;

&lt;p&gt;Coming back to our example, let's define a few ways to create instances of our Student class.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;from separate arguments: for example, a name, 20 and 85&lt;/li&gt;
&lt;li&gt;from a comma-separated string: for example, “name, 20, 85”&lt;/li&gt;
&lt;li&gt;from a tuple: for example, (name, 20, 85)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To accomplish this in Java, we could simply overload the constructor, with one overload per input format.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;In Python, a clean way to do this is through classmethods:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
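&lt;p&gt;A sketch of these alternative constructors, using the from_str and from_tuple names mentioned later in the post:&lt;/p&gt;

```python
class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.score = score
        self.total = total

    @classmethod
    def from_str(cls, text):
        # "Jack, 60, 100" -> Student("Jack", 60, 100)
        name, score, total = [part.strip() for part in text.split(",")]
        return cls(name, int(score), int(total))

    @classmethod
    def from_tuple(cls, data):
        # ("Jill", 125, 200) -> Student("Jill", 125, 200)
        return cls(*data)
```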


&lt;p&gt;We also define the __str__ method, so we can directly print a Student object to see if it has been instantiated properly.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now, to test this, let’s create three Student objects, each from a different kind of data.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
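&lt;p&gt;A sketch of the full class and the three constructions; the __str__ format is chosen to match the output shown next:&lt;/p&gt;

```python
class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.score = score
        self.total = total

    @classmethod
    def from_str(cls, text):
        name, score, total = [part.strip() for part in text.split(",")]
        return cls(name, int(score), int(total))

    @classmethod
    def from_tuple(cls, data):
        return cls(*data)

    def __str__(self):
        # Format matches the output shown in the post.
        return f"Name: {self.name} Score: {self.score} Total : {self.total}"

# One object from each kind of data:
print(Student("John", 25, 100))
print(Student.from_str("Jack, 60, 100"))
print(Student.from_tuple(("Jill", 125, 200)))
```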


&lt;p&gt;The output is exactly as expected from the definition of the __str__ method above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: John Score: 25 Total : 100
Name: Jack Score: 60 Total : 100
Name: Jill Score: 125 Total : 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  staticmethod
&lt;/h3&gt;

&lt;p&gt;A static method, unlike an instance method, isn’t concerned with an instance; and unlike a class method, it isn’t passed the class implicitly either.&lt;/p&gt;

&lt;p&gt;A static method is essentially a regular function placed inside a class. It can be called on both the class and its instances. We use the @staticmethod decorator for these kinds of methods.&lt;/p&gt;

&lt;p&gt;A simple example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class A:
    def instance_method(self):
        return self

    @classmethod
    def class_method(cls):
        return cls

    @staticmethod
    def static_method():
        return

a = A()

a.static_method()
A.static_method()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  use-cases for the staticmethod
&lt;/h4&gt;

&lt;p&gt;Why would this be useful? Why not just place such functions outside the class?&lt;/p&gt;

&lt;p&gt;Static methods are used instead of regular functions when it makes more sense to place the function inside the class. For example, placing utility methods that deal solely with a class or its objects is a good idea, since those methods won’t be used by anyone else.&lt;/p&gt;

&lt;p&gt;Coming to our example, we can make our get_percent method static, since it serves a general purpose and need not be bound to our objects. To do this, we can simply add @staticmethod above the get_percent method.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
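&lt;p&gt;A sketch of the class with get_percent made static:&lt;/p&gt;

```python
class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.score = score
        self.total = total

    @staticmethod
    def get_percent(score, total):
        # No self: the method doesn't touch instance state.
        return score / total * 100
```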



&lt;h3&gt;
  
  
  property
&lt;/h3&gt;

&lt;p&gt;The property decorator provides methods for accessing (&lt;em&gt;getter&lt;/em&gt;), modifying (&lt;em&gt;setter&lt;/em&gt;), and deleting (&lt;em&gt;deleter&lt;/em&gt;) the attributes of an object.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S1150VzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A9oOfB4c_Wqvnde8SOgdiqQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S1150VzN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A9oOfB4c_Wqvnde8SOgdiqQ.png" alt=""&gt;&lt;/a&gt;The property decorator — Image by author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The getter and setter methods
&lt;/h4&gt;

&lt;p&gt;Let’s start with getter and setter methods. These methods are used to access and modify (respectively) a private instance variable. In Java, we would define explicit methods like getX() and setX() on the class.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now, any time you access or modify this value, you go through these methods. Since the variable x is private, it can’t be accessed directly outside JavaClass.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;In Python, there is no private keyword. We prefix a variable name with a double underscore (__) to signal that it is private and shouldn’t be accessed or modified directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Prefixing a variable name with __ triggers name mangling: the attribute’s name changes from varname to _Classname__varname, so direct access and modification like print(obj.__varname) and obj.__varname = 5 won’t work. Still, this isn’t strong encapsulation, since you can use the mangled name directly to read or modify the variable.&lt;/p&gt;

&lt;p&gt;Let’s work through an example to see this in practice.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Adding getter and setter methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Taking our Student class example, let’s make the score attribute “private” by adding a __ before the variable name.&lt;/p&gt;

&lt;p&gt;If we went ahead and added get_score and set_score methods like in Java, the main issue is that applying this to existing code would mean changing every access from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**print("Score: " + str(student1.score))**
 **student1.score = 100**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;to this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**print(student1.get\_score())  
student1.set\_score(100)**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here’s where the @property decorator comes in. You can simply define getter, setter and deleter methods using this feature.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
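&lt;p&gt;A sketch of the score property with all three methods:&lt;/p&gt;

```python
class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.__score = score   # "private" via name mangling
        self.total = total

    @property
    def score(self):           # getter
        return self.__score

    @score.setter
    def score(self, value):    # setter
        self.__score = value

    @score.deleter
    def score(self):           # deleter
        del self.__score
```

Existing code like student1.score or student1.score = 100 keeps working unchanged, which is the whole point.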



&lt;p&gt;Our class now uses these property methods in place of explicit getters and setters.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;To make the score attribute read-only,&lt;/strong&gt; just remove the setter method.&lt;/p&gt;

&lt;p&gt;Then, when we try to update score, we get the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traceback (most recent call last):
  File "main.py", line 16, in &amp;lt;module&amp;gt;
    student.score = 10
AttributeError: can't set attribute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The deleter method&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The deleter method lets you delete a protected or private attribute using the del statement. Using the same example as before, if we directly try to delete the score attribute, we get the following error:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;student = Student("Tom", 50, 100)
del student.score

This gives:
Traceback (most recent call last):
  File "&amp;lt;string&amp;gt;", line 17, in &amp;lt;module&amp;gt;
AttributeError: can't delete attribute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;But when we add a deleter method, we can delete our private variable score.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;The attribute has been successfully removed. Printing out the value of score now gives “object has no attribute…”, since we deleted it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traceback (most recent call last):
  File "&amp;lt;string&amp;gt;", line 23, in &amp;lt;module&amp;gt;
File "&amp;lt;string&amp;gt;", line 9, in x
**AttributeError: 'PythonClass' object has no attribute '\_\_score'**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  use-cases for the property decorator
&lt;/h4&gt;

&lt;p&gt;The property decorator is very useful &lt;strong&gt;when defining methods for data validation&lt;/strong&gt;, for example when deciding whether a value about to be assigned is valid and won’t lead to issues later in the code.&lt;/p&gt;

&lt;p&gt;Another use-case is &lt;strong&gt;displaying information in a specific way&lt;/strong&gt;. Coming back to our example, if we wanted to display a student’s name as “Student Name: Bob” instead of just “Bob”, we could return the formatted string from a property getter on the name attribute:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
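&lt;p&gt;A sketch of such a getter; the format string matches the output shown below:&lt;/p&gt;

```python
class Student:
    def __init__(self, name, score, total):
        self.__name = name
        self.score = score
        self.total = total

    @property
    def name(self):
        # Return the stored name in a display-friendly format.
        return "Student Name: " + self.__name
```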



&lt;p&gt;Now, any time we access name, we get a formatted result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;student = Student("Bob", 350, 500)
print(student.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The output:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Student Name: Bob**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The property decorator can also be used for logging changes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, in a setter method, you could add code to log the updating of a variable.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
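&lt;p&gt;A sketch of a logging setter; the getter here formats the score as a percentage of the total, so the run shown next produces the logged output below:&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)

class Student:
    def __init__(self, name, score, total):
        self.name = name
        self.__score = score
        self.total = total

    @property
    def score(self):
        # Shown as a percentage of the total, matching the output in the post.
        return str(self.__score / self.total * 100) + " %"

    @score.setter
    def score(self, value):
        logging.info("Setting new value...")
        self.__score = value
```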



&lt;p&gt;Now, whenever the setter is called (that is, whenever the variable is modified), the change is logged. Let’s say there was a totaling error in Bob’s math exam and he ends up getting 50 more marks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;student = Student("Bob", 350, 500)
print(student.score)
student.score = 400
print(student.score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The above gives the following output, with the logged change visible:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;70.0 %
INFO:root:Setting new value...
80.0 %
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Finally, our class combines everything we’ve added so far: the alternative constructors, the static get_percent method, and the property-based getters, setters and deleters.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h4&gt;
  
  
  Note #1: Where should you define a decorator with respect to a class?
&lt;/h4&gt;

&lt;p&gt;There are many places you could define a decorator: outside the class, in a separate class, or maybe even in an inner class (relative to the class using the decorator). In this example, we simply defined grade_decorator outside the Student class. Though this works, the decorator now has nothing tying it to our class, which we may not prefer.&lt;/p&gt;

&lt;p&gt;For a more detailed discussion on this, check out this post:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@vadimpushtaev/decorator-inside-python-class-1e74d23107f6"&gt;Decorator inside Python class&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Note #2: Are there options other than constructor overloading in Java to simulate the methods we discussed (like from_str or from_tuple)?
&lt;/h4&gt;

&lt;p&gt;Apart from overloading the constructor, we could make use of &lt;strong&gt;static factory methods in Java&lt;/strong&gt;. We could define a static method like from_str that extracts the key information from the string passed to it and returns an object.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Object-oriented programming is a very important paradigm to learn and use. Regardless of whether you’ll ever need to use the topics discussed here in your next project, it’s necessary to know the basics really well. Topics like the ones in this post aren’t used all that often compared to more basic concepts — like inheritance or the basic implementation of classes and objects — on which they are built. In any case, I hope this post gave you an idea of the other kinds of methods in Python OOP (apart from instance methods) and the property decorator.&lt;/p&gt;




</description>
      <category>programming</category>
      <category>java</category>
      <category>objectoriented</category>
      <category>python</category>
    </item>
    <item>
      <title>How Neural Networks Solve the XOR Problem</title>
      <dc:creator>Aniruddha Karajgi</dc:creator>
      <pubDate>Wed, 04 Nov 2020 17:56:49 +0000</pubDate>
      <link>https://dev.to/polaris000/how-neural-networks-solve-the-xor-problem-3mm1</link>
      <guid>https://dev.to/polaris000/how-neural-networks-solve-the-xor-problem-3mm1</guid>
      <description>&lt;h4&gt;
  
  
  And why hidden layers are so important
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wnKAip8i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A7YhLRr9Xz4wFECQ-0Q3Spg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wnKAip8i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A7YhLRr9Xz4wFECQ-0Q3Spg.png" alt=""&gt;&lt;/a&gt;Image by Author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The perceptron&lt;/strong&gt; is a classification algorithm. Specifically, it works as a linear binary classifier. It was invented in the late 1950s by Frank Rosenblatt.&lt;/p&gt;

&lt;p&gt;The perceptron basically works as a threshold function — non-negative outputs are put into one class while negative ones are put into the other class.&lt;/p&gt;

&lt;p&gt;Though there’s a lot to talk about when it comes to neural networks and their variants, we’ll be discussing a specific problem that highlights the major differences between a single layer perceptron and one that has a few more layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**The Perceptron**
   Structure and Properties
   Evalutation
   Training algorithm

**2d Xor problem**   
   The XOR function

**Attempt #1: The Single Layer Perceptron**   
Implementing the Perceptron algorithm
   Results
   The need for non-linearity

**Attempt #2: Multiple Decision Boundaries**  
   Intuition
   Implementing the OR and NAND parts

**The Multi-layered Perceptron**  
   Structure and Properties
   Training algorithm

**Attempt #3: The Multi-layered Perceptron**  
   Implementing the MLP
   Results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Structure and Properties&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A perceptron has the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input nodes&lt;/li&gt;
&lt;li&gt;Output node&lt;/li&gt;
&lt;li&gt;An activation function&lt;/li&gt;
&lt;li&gt;Weights and biases&lt;/li&gt;
&lt;li&gt;Error function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MxwRlWZg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ATzjf5GuL-1xu2tMb2Rsg1g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MxwRlWZg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ATzjf5GuL-1xu2tMb2Rsg1g.png" alt=""&gt;&lt;/a&gt;A representation of a single-layer perceptron with 2 input nodes — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Input Nodes
&lt;/h4&gt;

&lt;p&gt;These nodes contain the input to the network. In any iteration — whether testing or training — these nodes are passed the input from our data.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Weights and Biases&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;These parameters are what we update when we talk about “training” a model. They are initialized to some random value or set to 0 and updated as the training progresses. The bias is analogous to a weight independent of any input node. Basically, it makes the model more flexible, since you can “move” the activation function around.&lt;/p&gt;
&lt;h4&gt;
  
  
  Evaluation
&lt;/h4&gt;

&lt;p&gt;The output calculation is straightforward.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compute the dot product of the input and weight vector&lt;/li&gt;
&lt;li&gt;Add the bias&lt;/li&gt;
&lt;li&gt;Apply the activation function.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can be expressed like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AY1RE8TX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/753/1%2AkgmKKBVoFIXJC0qiqpC50A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AY1RE8TX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/753/1%2AkgmKKBVoFIXJC0qiqpC50A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is often simplified and written as the dot product of the weight and input vectors, plus the bias.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KsvsFEDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/958/1%2A0GjDWpgA_x2a4QbjNh6HKQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KsvsFEDP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/958/1%2A0GjDWpgA_x2a4QbjNh6HKQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Activation Function
&lt;/h4&gt;

&lt;p&gt;This function allows us to fit the output in a way that makes more sense. For example, in the case of a simple classifier, an output of say -2.5 or 8 doesn’t make much sense with regards to classification. If we use something called a sigmoidal activation function, we can fit that within a range of 0 to 1, which can be interpreted directly as a probability of a datapoint belonging to a particular class.&lt;/p&gt;

&lt;p&gt;Though there are many kinds of activation functions, we’ll be using a simple linear activation function for our perceptron. The linear activation function has no effect on its input and outputs it as is.&lt;/p&gt;
&lt;h4&gt;
  
  
  Classification
&lt;/h4&gt;

&lt;p&gt;How does a perceptron assign a class to a datapoint?&lt;/p&gt;

&lt;p&gt;We know that a datapoint’s evaluation is expressed by the relation wX + b . We define a threshold ( &lt;strong&gt;θ&lt;/strong&gt; ) which classifies our data. Generally, this threshold is set to 0 for a perceptron.&lt;/p&gt;

&lt;p&gt;So points for which wX + b is greater than or equal to 0 will belong to one class while the rest (wX + b is negative) are classified as belonging to the other class. We can express this as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--06KJKepg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ApzNeD1NCpKEpOWUMI_-xiQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--06KJKepg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ApzNeD1NCpKEpOWUMI_-xiQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Training algorithm
&lt;/h4&gt;

&lt;p&gt;To train our perceptron, we must ensure that we correctly classify all of our training data. Note that this is different from how you would train a neural network, where you wouldn’t try to correctly classify your entire training data, since that would lead to overfitting in most cases.&lt;/p&gt;

&lt;p&gt;We start the training algorithm by calculating the &lt;strong&gt;gradient&lt;/strong&gt;, or Δw. It’s the product of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the value of the input node corresponding to that weight&lt;/li&gt;
&lt;li&gt;the difference between the actual value and the computed value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qVEoM4QQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AYoaCWleG8stS4AobGX_H_w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qVEoM4QQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AYoaCWleG8stS4AobGX_H_w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We get our new weights by simply incrementing our original weights with the computed gradients multiplied by the learning rate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZwcPig8h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/943/1%2A7h2ZWZXOr_Ae5ZERGRpLLg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZwcPig8h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/943/1%2A7h2ZWZXOr_Ae5ZERGRpLLg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A simple intuition for how this works: if our perceptron correctly classifies an input data point, actual_value - computed_value is 0, so the gradient is 0 and there is no change in our weights.&lt;/p&gt;
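&lt;p&gt;One update step can be sketched like so (the learning-rate value is an assumption):&lt;/p&gt;

```python
def update_weights(weights, bias, inputs, actual, computed, lr=0.1):
    # Perceptron update rule: delta_w = input * (actual - computed),
    # scaled by the learning rate; the bias is treated as a weight on input 1.
    error = actual - computed
    new_weights = [w + lr * x * error for w, x in zip(weights, inputs)]
    new_bias = bias + lr * error
    return new_weights, new_bias
```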
&lt;h3&gt;
  
  
  The 2D XOR problem
&lt;/h3&gt;

&lt;p&gt;In the XOR problem, we are trying to train a model to mimic a 2D XOR function.&lt;/p&gt;
&lt;h4&gt;
  
  
  The XOR function
&lt;/h4&gt;

&lt;p&gt;The function is defined like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bmsNy7Sa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/496/1%2Abpc3E6NlkPQpyGtL-azRsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bmsNy7Sa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/496/1%2Abpc3E6NlkPQpyGtL-azRsw.png" alt=""&gt;&lt;/a&gt;The XOR Truth table — Image by Author&lt;/p&gt;

&lt;p&gt;If we plot it, we get the following chart. This is what we’re trying to classify. The ⊕ (“o-plus”) symbol you see in the legend is conventionally used to represent the XOR boolean operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LBltDpBw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AaN7_uKSN8iWUktGOKa1Vgg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LBltDpBw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AaN7_uKSN8iWUktGOKa1Vgg.png" alt=""&gt;&lt;/a&gt;The XOR output plot — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our algorithm, regardless of how it works, must correctly output the XOR value for each of the 4 points. We’ll be modelling this as a classification problem, so Class 1 would represent an XOR value of 1, while Class 0 would represent a value of 0.&lt;/p&gt;
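&lt;p&gt;In code, the target function we’re trying to mimic is trivial, which is what makes it such a clean benchmark:&lt;/p&gt;

```python
def xor(a: int, b: int) -> int:
    """2-input XOR: outputs 1 exactly when the inputs differ."""
    return a ^ b

truth_table = [((a, b), xor(a, b)) for a in (0, 1) for b in (0, 1)]
# [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```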
&lt;h3&gt;
  
  
  Attempt #1: The Single Layer Perceptron
&lt;/h3&gt;

&lt;p&gt;Let's model the problem using a single layer perceptron.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Input data&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The data we’ll train our model on is the table we saw for the XOR function.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data    Target
[0, 0]  0
[0, 1]  1
[1, 0]  1
[1, 1]  0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  Implementation
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Imports&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apart from the usual visualization (matplotlib and seaborn) and numerical (numpy) libraries, we’ll use cycle from itertools. We need it because our algorithm cycles through the data indefinitely until it manages to correctly classify the entire training set without any mistakes along the way.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
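&lt;p&gt;The gist embed isn’t reproduced here; a sketch of the relevant import and why it’s useful:&lt;/p&gt;

```python
from itertools import cycle

# cycle() repeats an iterable endlessly, which suits a training loop that only
# terminates once a full consecutive pass over the data is classified correctly.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
stream = cycle(data)
first_six = [next(stream) for _ in range(6)]
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1)] (wraps around)
```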



&lt;p&gt;&lt;strong&gt;The data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We next create our training data. This data is the same for each kind of logic gate, since they all take in two boolean variables as input.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
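&lt;p&gt;A sketch of how that shared input data and the per-gate targets might be laid out (the exact variable names in the gist may differ):&lt;/p&gt;

```python
# The same four input pairs work for every 2-input gate; only the targets change.
train_data = [[0, 0], [0, 1], [1, 0], [1, 1]]

target_xor  = [0, 1, 1, 0]
target_or   = [0, 1, 1, 1]
target_nand = [1, 1, 1, 0]
target_and  = [0, 0, 0, 1]
```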


&lt;p&gt;&lt;strong&gt;The training function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here, we cycle through the data indefinitely, keeping track of how many consecutive datapoints we correctly classified. If we manage to classify everything in one stretch, we terminate our algorithm.&lt;/p&gt;

&lt;p&gt;If not, we reset our counter, update our weights and continue the algorithm.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
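&lt;p&gt;Since the gists aren’t visible here, a self-contained sketch of that loop follows (the details differ from the article’s version):&lt;/p&gt;

```python
from itertools import cycle

def train_perceptron(data, targets, learning_rate=0.1, max_steps=10_000):
    """Cycle through the data until one full consecutive pass is classified
    correctly. Returns (weights, bias), or None if we give up (non-separable)."""
    weights, bias = [0.0] * len(data[0]), 0.0
    correct_counter = 0
    for step, (x, t) in enumerate(cycle(list(zip(data, targets)))):
        if step >= max_steps:
            return None  # safety valve: on XOR this loop would never terminate
        computed = int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)
        if computed == t:
            correct_counter += 1
            if correct_counter == len(data):  # one clean pass: converged
                return weights, bias
        else:
            correct_counter = 0  # reset the counter, update weights, continue
            error = t - computed
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error

inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
or_result  = train_perceptron(inputs, [0, 1, 1, 1])  # converges
xor_result = train_perceptron(inputs, [0, 1, 1, 0])  # returns None
```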


&lt;p&gt;To visualize how our model performs, we create a mesh of datapoints, or a grid, and evaluate our model at each point in that grid. Finally, we colour each point based on how our model classifies it. So the Class 0 region would be filled with the colour assigned to points belonging to that class.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
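&lt;p&gt;Stripped of the actual plotting calls, the idea looks something like this (the classifier below is a hypothetical stand-in for a trained model):&lt;/p&gt;

```python
# Build a mesh of points and record the predicted class at each one; a plotting
# library would then colour each point by its class to fill in the regions.
def classify(point):
    x1, x2 = point
    return int(x1 + x2 > 0.5)  # hypothetical linear decision boundary

step = 0.25
xs = [i * step for i in range(9)]  # 0.0, 0.25, ..., 2.0
mesh = [((x1, x2), classify((x1, x2))) for x1 in xs for x2 in xs]
```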


&lt;p&gt;&lt;strong&gt;The Perceptron class&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To bring everything together, we create a simple Perceptron class with the functions we just discussed. We have some instance variables like the training data, the target, the number of input nodes and the learning rate.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
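&lt;p&gt;A compact sketch of such a class (the article’s gist differs in details, and the plotting function is omitted):&lt;/p&gt;

```python
from itertools import cycle

class Perceptron:
    """A single-layer perceptron with a step activation on its output node."""

    def __init__(self, train_data, target, num_inputs=2, learning_rate=0.1):
        self.train_data = train_data
        self.target = target
        self.learning_rate = learning_rate
        self.weights = [0.0] * num_inputs
        self.bias = 0.0

    def forward(self, x):
        z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
        return int(z > 0)

    def update_weights(self, x, error):
        self.weights = [w + self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias += self.learning_rate * error

    def train(self, max_steps=10_000):
        correct_counter = 0
        pairs = list(zip(self.train_data, self.target))
        for step, (x, t) in enumerate(cycle(pairs)):
            if step >= max_steps:
                return False  # never converged (as happens on XOR)
            error = t - self.forward(x)
            if error == 0:
                correct_counter += 1
                if correct_counter == len(pairs):
                    return True
            else:
                correct_counter = 0
                self.update_weights(x, error)
```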


&lt;h4&gt;
  
  
  Results
&lt;/h4&gt;

&lt;p&gt;Let’s create a perceptron object and train it on the XOR data.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You’ll notice that the training loop never terminates, since a perceptron can only converge on linearly separable data. Linearly separable data basically means that you can separate data with a point in 1D, a line in 2D, a plane in 3D and so on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A perceptron can only converge on linearly separable data. Therefore, it isn’t capable of imitating the XOR function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Remember that a perceptron must correctly classify the entire training data in one go. If we keep track of how many points it correctly classified consecutively, we get something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--en6rJa8c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AylW5N5ObDpDf6TxzCeJDXw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--en6rJa8c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AylW5N5ObDpDf6TxzCeJDXw.png" alt=""&gt;&lt;/a&gt;The value of correct_counter over 100 cycles of training — Image by Author&lt;/p&gt;

&lt;p&gt;The algorithm only terminates when correct_counter hits 4 — which is the size of the training set — so this will go on indefinitely.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Need for Non-Linearity
&lt;/h4&gt;

&lt;p&gt;It is clear that a single perceptron will not serve our purpose: the classes aren’t linearly separable. This boils down to the fact that a single linear decision boundary isn’t going to work.&lt;/p&gt;

&lt;p&gt;Non-linearity allows for more complex decision boundaries. One potential decision boundary for our XOR data could look like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d1vXvi9M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AfgcoOFn-gfKnA_U5BOw-XA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d1vXvi9M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AfgcoOFn-gfKnA_U5BOw-XA.png" alt=""&gt;&lt;/a&gt;A potential non-linear decision boundary for our XOR model — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The 2D XOR problem — Attempt #2
&lt;/h3&gt;

&lt;p&gt;We know that imitating the XOR function requires a non-linear decision boundary.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But why do we have to stick with a single decision boundary?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  The Intuition
&lt;/h4&gt;

&lt;p&gt;Let’s first break down the XOR function into its AND and OR counterparts.&lt;/p&gt;

&lt;p&gt;The XOR function on two boolean variables A and B is defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wh2C9xB7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/811/1%2ALuINCL4Tf_yDGUlMD44T4A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wh2C9xB7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/811/1%2ALuINCL4Tf_yDGUlMD44T4A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s add A.~A and B.~B to the equation. Since they both equate to 0, the equation remains valid.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I-cZDs7V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AvRRn_PvwGlMego2Xxe_LIg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I-cZDs7V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AvRRn_PvwGlMego2Xxe_LIg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s rearrange the terms so that we can pull out A from the first part and B from the second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XtZIOhtb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJMB4L2oK5PJVYqGQrm5y0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XtZIOhtb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJMB4L2oK5PJVYqGQrm5y0A.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simplifying it further, we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7UVx7QsD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/890/1%2AM9nguc5_2v1A1ABR-IUF0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7UVx7QsD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/890/1%2AM9nguc5_2v1A1ABR-IUF0g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using De Morgan’s laws for boolean algebra, ~A + ~B = ~(AB), we can replace the second term in the above equation like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BfmBPhkB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/755/1%2AJBrvvYfm-jVFmgtGW-LiIw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BfmBPhkB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/755/1%2AJBrvvYfm-jVFmgtGW-LiIw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s replace A and B with x_1 and x_2 respectively since that’s the convention we’re using in our data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TlN2Xbr5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/819/1%2AHutPiBbJJ2iafsRo_g0O2Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TlN2Xbr5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/819/1%2AHutPiBbJJ2iafsRo_g0O2Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The XOR function can be condensed into two parts: &lt;strong&gt;a NAND and an OR&lt;/strong&gt;. If we can calculate these separately, we can just combine the results, using &lt;strong&gt;an AND gate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s call the OR section of the formula part I, and the NAND section part II.&lt;/p&gt;
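&lt;p&gt;The decomposition is easy to verify exhaustively in a few lines:&lt;/p&gt;

```python
# Check that XOR(A, B) = AND(OR(A, B), NAND(A, B)) on all four input rows.
def xor_decomposed(a, b):
    part_one = int(a or b)           # part I: the OR section
    part_two = int(not (a and b))    # part II: the NAND section
    return int(part_one and part_two)

checks = [xor_decomposed(a, b) == (a ^ b) for a in (0, 1) for b in (0, 1)]
# all four rows agree
```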

&lt;h4&gt;
  
  
  &lt;strong&gt;Modelling the OR part&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We’ll use the same Perceptron class as before, only that we’ll train it on OR training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jSy4YUhF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/492/1%2AroabWS1sgBDoqKHtCde0oQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jSy4YUhF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/492/1%2AroabWS1sgBDoqKHtCde0oQ.png" alt=""&gt;&lt;/a&gt;The OR truth table — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This converges, since the data for the OR function is linearly separable. If we plot the number of correctly classified consecutive datapoints as we did in our first attempt, we get this plot. It’s clear that around iteration 50, it hits the value 4, meaning that it classified the entire dataset correctly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;correct_counter measures the number of consecutive datapoints correctly classified by our Perceptron&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s38-xT3z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AIybVY22exz_e-AoTJeIF9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s38-xT3z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AIybVY22exz_e-AoTJeIF9g.png" alt=""&gt;&lt;/a&gt;The correct_counter plot for our OR perceptron — Image by Author&lt;/p&gt;

&lt;p&gt;The decision boundary plot looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jWu-tWKb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG7r0FU4_2YZJzPixFXExcA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jWu-tWKb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG7r0FU4_2YZJzPixFXExcA.png" alt=""&gt;&lt;/a&gt;The Output plot of our OR perceptron — Image by Author&lt;/p&gt;

&lt;h4&gt;
  
  
  Modelling the NAND part
&lt;/h4&gt;

&lt;p&gt;Let’s move on to the second part. We need to model a NAND gate. Just like the OR part, we’ll use the same code, but train the model on the NAND data. So our input data would be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L2pOT8rm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/492/1%2A0J2g4A35NfCUT82y3GZcVQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L2pOT8rm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/492/1%2A0J2g4A35NfCUT82y3GZcVQ.png" alt=""&gt;&lt;/a&gt;The NAND Truth table — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After training, the following plots show that our model converged on the NAND data and mimics the NAND gate perfectly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0A39a74b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A_FSuCnOqRtg1uXYSgnD9KQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0A39a74b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A_FSuCnOqRtg1uXYSgnD9KQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IWaqHU98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Aq4H9dJUjyDkAcUO72R3Nkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IWaqHU98--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Aq4H9dJUjyDkAcUO72R3Nkg.png" alt=""&gt;&lt;/a&gt;Decision boundary and correct_counter plots for the NAND perceptron — Image by Author&lt;/p&gt;

&lt;h4&gt;
  
  
  Bringing everything together
&lt;/h4&gt;

&lt;p&gt;Two things are clear from this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we are performing a logical AND on the outputs of two logic gates (where the first one is an OR and the second one a NAND)&lt;/li&gt;
&lt;li&gt;and that both functions are being passed the same input (x1 and x2).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s model this into our network. First, let’s consider our two perceptrons as black boxes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_IYvRMSN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A6E1lvXJv9phURco9mBr08A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_IYvRMSN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A6E1lvXJv9phURco9mBr08A.png" alt=""&gt;&lt;/a&gt;The plan for our model — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After adding our input nodes x_1 and x_2, we can finally implement this through a simple function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9mvbIajq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AtTBFlBgXysCfR2OHXKWP_A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9mvbIajq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AtTBFlBgXysCfR2OHXKWP_A.png" alt=""&gt;&lt;/a&gt;Adding input nodes — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, we need an AND gate, which we’ll train just as we trained the other gates.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HgOh8BEl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Apc4V2KACa6MBjHdI5Y3bAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HgOh8BEl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Apc4V2KACa6MBjHdI5Y3bAg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fKH1gres--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZ5FCj0Nhk31n3XDX9SZWdA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fKH1gres--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AZ5FCj0Nhk31n3XDX9SZWdA.png" alt=""&gt;&lt;/a&gt;The correct_count and output plots of our AND perceptron. — Image by Author&lt;/p&gt;

&lt;p&gt;What we now have is a model that mimics the XOR function.&lt;/p&gt;

&lt;p&gt;If we were to implement our XOR model, it would look something like this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
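&lt;p&gt;A self-contained sketch of that combination, with hand-set weights standing in for the three trained perceptrons (the article trains them instead):&lt;/p&gt;

```python
# Each gate is a linear unit: int(w . x + b > 0). The weights below are
# hand-picked stand-ins for what training would find.
def linear_unit(weights, bias):
    def predict(x):
        return int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)
    return predict

or_gate   = linear_unit([1.0, 1.0], -0.5)   # fires when x1 + x2 exceeds 0.5
nand_gate = linear_unit([-1.0, -1.0], 1.5)  # fires unless both inputs are 1
and_gate  = linear_unit([1.0, 1.0], -1.5)   # fires only when both inputs fire

def xor_model(x):
    # feed the same input to OR and NAND, then AND their outputs
    return and_gate([or_gate(x), nand_gate(x)])

outputs = [xor_model([a, b]) for a in (0, 1) for b in (0, 1)]
# [0, 1, 1, 0]
```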


&lt;p&gt;If we plot the decision boundaries from our model — which is basically an AND of our OR and NAND models — we get something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--j16LlGC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A5yVX7bID4pp0svaEOAPuZw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--j16LlGC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A5yVX7bID4pp0svaEOAPuZw.png" alt=""&gt;&lt;/a&gt;The Output plot of our 2nd Attempt, showing a correct classification on our XOR data— Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Out of all the 2 input logic gates, the XOR and XNOR gates are the only ones that are not linearly-separable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Though our model works, it doesn’t seem like a viable solution to most non-linear classification or regression tasks. It’s really specific to this case, and most problems can’t be split into simple intermediate sub-problems that can be solved individually and then combined. For something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NzQm2ALJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A--0y9w95J7lZAUqjGetHVQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NzQm2ALJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A--0y9w95J7lZAUqjGetHVQ.png" alt=""&gt;&lt;/a&gt;A binary classification problem in two dimensions — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A potential decision boundary could be something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7-PqLdS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AC0fYqaCJIGZLsoiKGZgP7Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7-PqLdS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AC0fYqaCJIGZLsoiKGZgP7Q.png" alt=""&gt;&lt;/a&gt;A potential decision boundary that fits our example — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to look for a more general model, which would allow for non-linear decision boundaries, like a curve, as is the case above. Let’s see how an MLP solves this issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Multi-layered Perceptron
&lt;/h3&gt;

&lt;p&gt;The overall components of an MLP (input and output nodes, activation functions, and weights and biases) are the same as those we just discussed for the perceptron.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The biggest difference? An MLP can have hidden layers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Hidden layers
&lt;/h4&gt;

&lt;p&gt;Hidden layers are the layers of nodes that sit between the input and output layers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In its simplest form, an MLP has just a single hidden layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The hidden layer allows for non-linearity.&lt;/strong&gt; A node in the hidden layer isn’t too different from an output node: nodes in the previous layer connect to it with their own weights and biases, and an output is computed, generally with an activation function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qTXbNzSQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AiB0NCmG2OFUGI-DCokHT2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qTXbNzSQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AiB0NCmG2OFUGI-DCokHT2g.png" alt=""&gt;&lt;/a&gt;The general structure of a multi-layered perceptron — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Activation Function
&lt;/h4&gt;

&lt;p&gt;Remember the linear activation function we used on the output node of our perceptron model? There are several more complex activation functions. You may have heard of the sigmoid and the tanh functions, which are some of the most popular non-linear activation functions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Activation functions should be differentiable, so that a network’s parameters can be updated using backpropagation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Training algorithm
&lt;/h4&gt;

&lt;p&gt;Though the output generation process is a direct extension of that of the perceptron, updating weights isn’t so straightforward. Here’s where backpropagation comes into the picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpropagation&lt;/strong&gt; is a way to update the weights and biases of a model starting from the output layer all the way to the beginning. The main principle behind it is that each parameter changes in proportion to how much it affects the network’s output. A weight that has barely any effect on the output of the model will show a very small change, while one that has a large negative impact will change drastically to improve the model’s prediction power.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Backpropagation&lt;/strong&gt; is an algorithm for updating the weights and biases of a model based on their gradients with respect to the error function, starting from the output layer and working back to the first layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The method of updating weights follows directly from differentiation and the chain rule.&lt;/p&gt;

&lt;p&gt;There’s a lot to cover when talking about backpropagation. It warrants its own article. So if you want to find out more, have a look at this excellent article by &lt;a href="https://medium.com/@simonnoff?source=post_page-----7bb3aa2f95fd--------------------------------"&gt;Simeon Kostadinov&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd"&gt;Understanding Backpropagation Algorithm&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Attempt #3: the Multi-layered Perceptron
&lt;/h3&gt;

&lt;h4&gt;
  
  
  The architecture
&lt;/h4&gt;

&lt;p&gt;There are no fixed rules on the number of hidden layers or the number of nodes in each layer of a network. The best performing models are obtained through trial and error.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The architecture of a network refers to its general structure — the number of hidden layers, the number of nodes in each layer and how these nodes are inter-connected.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Let’s go with a single hidden layer with two nodes in it.&lt;/strong&gt; We’ll be using the sigmoid function in each of our hidden layer nodes and, of course, in our output node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ER11UqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AkAZGp-basYjFSHexmQteUA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ER11UqZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AkAZGp-basYjFSHexmQteUA.png" alt=""&gt;&lt;/a&gt;The final architecture of our MLP — Image by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Implementation
&lt;/h4&gt;

&lt;p&gt;The libraries used here, like NumPy and pyplot, are the same as those used for the Perceptron class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The training algorithm&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The algorithm here is slightly different: we iterate through the training data a fixed number of times — num_epochs to be precise. In each iteration, we do a forward pass, followed by a backward pass where we update the weights and biases as necessary. This is called backpropagation.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The sigmoid activation function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here, we define a sigmoid function. As discussed, it’s applied to the output of each hidden layer node and the output node. It’s differentiable, which lets us comfortably perform backpropagation to improve our model.&lt;/p&gt;

&lt;p&gt;Its derivative is also implemented, through the _delsigmoid function.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;The forward and backward pass&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the forward pass, we apply the wX + b relation multiple times, applying a sigmoid function after each call.&lt;/p&gt;

&lt;p&gt;In the backward pass, implemented as the update_weights function, we calculate the gradients of each of our 6 weights and 3 biases with respect to the error function and update them by the factor learning rate * gradient.&lt;/p&gt;

&lt;p&gt;Finally, the classify function works as expected: since a sigmoid function outputs values between 0 and 1, we simply interpret them as probabilities of belonging to a particular class. Hence, outputs greater than or equal to 0.5 are classified as belonging to Class 1, while outputs less than 0.5 belong to Class 0.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
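&lt;p&gt;A sketch of the forward pass and classification for the 2-2-1 architecture (the parameter layout here is illustrative):&lt;/p&gt;

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """One forward pass: wX + b then sigmoid at each hidden node, and again
    at the output node."""
    (w_hidden, b_hidden), (w_out, b_out) = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    z_out = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(z_out)

def classify(x, params):
    # sigmoid outputs read as probabilities: 0.5 and above means Class 1
    return int(forward(x, params) >= 0.5)
```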


&lt;p&gt;&lt;strong&gt;The MLP class&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s bring everything together by creating an MLP class. All the functions we just discussed are placed in it. The plot function is exactly the same as the one in the Perceptron class.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
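&lt;p&gt;Since the gist embeds don’t render here, a compact, self-contained sketch of the whole class follows. It hard-codes the 2-2-1 architecture and mean squared error, and the details (initialization, update order) are assumptions rather than the article’s exact code:&lt;/p&gt;

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class MLP:
    """A 2-2-1 network trained with plain gradient descent on mean squared
    error. A rough sketch; the article's gist differs in details."""

    def __init__(self, train_data, target, learning_rate=0.2, seed=42):
        rng = random.Random(seed)  # seeded so runs are reproducible
        self.x, self.t = train_data, target
        self.lr = learning_rate
        # 6 weights and 3 biases, matching the article's architecture
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
        self.b_hidden = [rng.uniform(-1, 1) for _ in range(2)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(2)]
        self.b_out = rng.uniform(-1, 1)

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
                  for ws, b in zip(self.w_hidden, self.b_hidden)]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        return hidden, out

    def update_weights(self, x, t):
        hidden, out = self.forward(x)
        # output node: dE/dz = (out - t) * sigmoid'(z), with sigmoid' = out(1-out)
        d_out = (out - t) * out * (1.0 - out)
        # hidden nodes: chain rule back through the output weights
        d_hidden = [d_out * w * h * (1.0 - h)
                    for w, h in zip(self.w_out, hidden)]
        self.w_out = [w - self.lr * d_out * h for w, h in zip(self.w_out, hidden)]
        self.b_out -= self.lr * d_out
        for j in range(2):
            self.w_hidden[j] = [w - self.lr * d_hidden[j] * xi
                                for w, xi in zip(self.w_hidden[j], x)]
            self.b_hidden[j] -= self.lr * d_hidden[j]

    def train(self, num_epochs=5000):
        losses = []
        for _ in range(num_epochs):
            loss = 0.0
            for x, t in zip(self.x, self.t):
                self.update_weights(x, t)
                loss += (self.forward(x)[1] - t) ** 2
            losses.append(loss / len(self.x))
        return losses

mlp = MLP([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0])
losses = mlp.train(5000)
```

&lt;p&gt;Plotting losses then gives the kind of loss curve shown in the Results section.&lt;/p&gt;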


&lt;h4&gt;
  
  
  Results
&lt;/h4&gt;

&lt;p&gt;Let’s train our MLP with a learning rate of 0.2 over 5000 epochs.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If we plot the values of our loss function, we get the following plot after about 5000 iterations, showing that our model has indeed converged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A2rpTE5H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AXDHpv6hzQ7Mo17auo5G3VQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A2rpTE5H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AXDHpv6hzQ7Mo17auo5G3VQ.png" alt=""&gt;&lt;/a&gt;The Loss Plot over 5000 epochs of our MLP — Image by Author&lt;/p&gt;

&lt;p&gt;A clear non-linear decision boundary is created here with our generalized neural network, or MLP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CKTBzTvw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ATrja6lUuVl71Xn1w27XRSQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CKTBzTvw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ATrja6lUuVl71Xn1w27XRSQ.png" alt=""&gt;&lt;/a&gt;The Decision Boundary plot, showing the decision boundary and the classes — Image by Author&lt;/p&gt;

&lt;h4&gt;
  
  
  Note #1: Adding more layers or nodes
&lt;/h4&gt;

&lt;p&gt;Adding more layers or nodes gives increasingly complex decision boundaries. But this could also lead to something called overfitting — where a model achieves very high accuracies on the training data, but fails to generalize.&lt;/p&gt;

&lt;p&gt;A good resource is the Tensorflow Neural Net playground, where you can try out different network architectures and view the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://playground.tensorflow.org/#activation=sigmoid&amp;amp;batchSize=30&amp;amp;dataset=xor&amp;amp;regDataset=reg-plane&amp;amp;learningRate=0.1&amp;amp;regularizationRate=0&amp;amp;noise=0&amp;amp;networkShape=2&amp;amp;seed=0.21709&amp;amp;showTestData=false&amp;amp;discretize=true&amp;amp;percTrainData=70&amp;amp;x=true&amp;amp;y=true&amp;amp;xTimesY=false&amp;amp;xSquared=false&amp;amp;ySquared=false&amp;amp;cosX=false&amp;amp;sinX=false&amp;amp;cosY=false&amp;amp;sinY=false&amp;amp;collectStats=false&amp;amp;problem=classification&amp;amp;initZero=false&amp;amp;hideText=false&amp;amp;batchSize_hide=false"&gt;Tensorflow - Neural Network Playground&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Note #2: Choosing a loss function
&lt;/h4&gt;

&lt;p&gt;The loss function we used in our MLP model is the mean squared error loss. Though this is a very popular loss function, it makes some assumptions about the data (such as it being Gaussian) and isn’t always convex for classification problems. It was used here to make it easier to understand how a perceptron works, but for classification tasks there are better alternatives, like &lt;strong&gt;binary cross-entropy loss.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/"&gt;How to Choose Loss Functions When Training Deep Learning Neural Networks - Machine Learning Mastery&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Neural nets used in production or research are never this simple, but they almost always build on the basics outlined here. Hopefully, this post gave you some idea on how to build and train perceptrons and vanilla networks.&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;




</description>
      <category>perceptron</category>
      <category>datascience</category>
      <category>algorithms</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Understanding Dynamic Programming</title>
      <dc:creator>Aniruddha Karajgi</dc:creator>
      <pubDate>Sun, 04 Oct 2020 17:02:41 +0000</pubDate>
      <link>https://dev.to/polaris000/understanding-dynamic-programming-46h3</link>
      <guid>https://dev.to/polaris000/understanding-dynamic-programming-46h3</guid>
      <description>&lt;h4&gt;
  
  
  An intuitive guide to the popular optimization technique.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0IHMLjyZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AyEugK-e5TyuSMPsjLDpb1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0IHMLjyZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AyEugK-e5TyuSMPsjLDpb1Q.png" alt=""&gt;&lt;/a&gt;Image by author&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic programming&lt;/strong&gt;, or DP, is an optimization technique. It is used in several fields, though this article focuses on its applications in algorithms and computer programming. It’s a topic that comes up often in algorithmic interviews.&lt;/p&gt;

&lt;p&gt;Since DP isn’t very intuitive, most people (myself included!) often find it tricky to model a problem as a dynamic programming model. In this post, we’ll discuss when we use DP, followed by its types and then finally work through an example.&lt;/p&gt;

&lt;h3&gt;
  
  
  When is DP used?
&lt;/h3&gt;

&lt;p&gt;There are two necessary conditions a problem must satisfy for DP to work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overlapping Sub-problems&lt;/li&gt;
&lt;li&gt;Optimal substructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's go over these in a little more detail.&lt;/p&gt;

&lt;h4&gt;
  
  
  Overlapping Sub-Problems
&lt;/h4&gt;

&lt;p&gt;This property is exactly what it sounds like: repeating sub-problems. But for this to make sense, we need to know what a sub-problem is.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;sub-problem&lt;/strong&gt; is simply a smaller version of the problem at hand. In most cases, this would mean smaller parameter values which you would pass to your recursive function.&lt;/p&gt;

&lt;p&gt;If you’re looking for a particular page in a book, what would you do? You’d open the book to a particular page and compare the page number you’re on with the page number you’re looking for.&lt;/p&gt;

&lt;p&gt;If the current page is smaller than the required page, you’d start looking in between the current page and the last page. On the other hand, if the current page number is greater, you’d start searching between the start of the book and the current page.&lt;/p&gt;

&lt;p&gt;You’d continue this until you found the page.&lt;/p&gt;

&lt;p&gt;If you had to model this as a recursive function, what would that look like? Maybe something like this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;The following snippets have been written in a form of pseudocode to improve readability&lt;/em&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
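&lt;p&gt;One way that search could look in runnable Python (an illustrative sketch using the &lt;strong&gt;&lt;em&gt;getpage&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;from_page&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;to_page&lt;/em&gt;&lt;/strong&gt; names discussed below, not necessarily the original snippet):&lt;/p&gt;

```python
def getpage(from_page, to_page, target_page):
    # Nothing left to search: the target isn't in this range.
    if from_page > to_page:
        return -1

    # Open the book roughly in the middle of the current range.
    mid_page = (from_page + to_page) // 2

    if mid_page == target_page:
        return mid_page
    if mid_page < target_page:
        # Current page is smaller: search between it and the last page.
        return getpage(mid_page + 1, to_page, target_page)
    # Current page is bigger: search between the start and it.
    return getpage(from_page, mid_page - 1, target_page)
```

&lt;p&gt;Each recursive call is a sub-problem: the same search, over half the range.&lt;/p&gt;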


&lt;p&gt;Pretty straightforward. There’s a &lt;strong&gt;&lt;em&gt;getpage&lt;/em&gt;&lt;/strong&gt; function which returns the page ( &lt;strong&gt;&lt;em&gt;target_page&lt;/em&gt;&lt;/strong&gt; , here) we’re looking for. The function looks at the middle page between &lt;strong&gt;&lt;em&gt;from_page&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;to_page&lt;/em&gt;&lt;/strong&gt; and checks if we have a match.&lt;/p&gt;

&lt;p&gt;If not, the function looks at either the left half or the right half of the section we are looking at.&lt;/p&gt;

&lt;p&gt;But what do those two recursive calls to &lt;strong&gt;&lt;em&gt;getpage&lt;/em&gt;&lt;/strong&gt; represent? You’ll notice that at each recursive call, we are reducing our search space by half. What we’re doing is solving the same problem, that is, looking for a specific page, in a smaller space. We’re solving sub-problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Divide and Conquer,&lt;/strong&gt; or &lt;strong&gt;DAC&lt;/strong&gt; algorithms work through the principle of sub-problems. The “divide” part refers to splitting a problem into sub-problems. Sorting algorithms like mergesort and quicksort are great examples. Note that binary search isn’t exactly a DAC algorithm for the simple reason that it doesn’t have a “combine” step, whereas an actual divide and conquer algorithm would combine the results of its sub-problems to get the final solution.&lt;/p&gt;

&lt;p&gt;Now that we have answered the question of what a sub-problem is, we move on to the other word: “ &lt;strong&gt;overlapping&lt;/strong&gt; ”.&lt;/p&gt;

&lt;p&gt;When these sub-problems have to be solved more than once, they are said to be overlapping. Look at the call graph for computing the value of the nth Fibonacci term.&lt;/p&gt;

&lt;p&gt;The recurrence relation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the relation  
**f(n) = f(n - 1) + f(n-2)**  

the base case
**f(0) = 0  
f(1) = 1**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ubxmpDsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AN-g5eJ6XhOVp0cm0EdE-Ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ubxmpDsY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AN-g5eJ6XhOVp0cm0EdE-Ag.png" alt=""&gt;&lt;/a&gt;The recursive Fibonacci call tree. f(n) is the nth Fibonacci number — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The calls have been shaded to represent overlapping subproblems. Compare this with something like binary search, where the subproblems aren’t overlapping.&lt;/p&gt;
&lt;h4&gt;
  
  
  The optimal substructure property
&lt;/h4&gt;

&lt;p&gt;The optimal substructure property is slightly more intricate: it refers to the scenario where the optimal solutions to sub-problems can be used directly when computing the overall optimal solution.&lt;/p&gt;

&lt;p&gt;A quick example? Say you want to find the shortest path from &lt;strong&gt;&lt;em&gt;A&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;B&lt;/em&gt;&lt;/strong&gt;. Let &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt; be an intermediate point between &lt;strong&gt;&lt;em&gt;A&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;B&lt;/em&gt;&lt;/strong&gt; with a &lt;strong&gt;single edge&lt;/strong&gt; connecting it to  &lt;strong&gt;&lt;em&gt;A&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XR0uTX7l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AefUCb45YaKrBQR9XBF0qOg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XR0uTX7l--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AefUCb45YaKrBQR9XBF0qOg.png" alt=""&gt;&lt;/a&gt;Finding the shortest path using intermediate nodes — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To solve this, we can find the shortest path from every intermediate node ( &lt;strong&gt;&lt;em&gt;X&lt;/em&gt;&lt;/strong&gt; ) to B, and then choose the X for which the edge from A to X plus the shortest path from X to B is smallest.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shortest(A, B) = min(AX + shortest(X, B)) for all intermediate nodes X.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;What we’re doing here is taking an optimal intermediate solution (&lt;strong&gt;&lt;em&gt;shortest(X, B)&lt;/em&gt;&lt;/strong&gt;) and using that (as opposed to considering every solution for a sub-problem) to find the final optimal answer.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;The two kinds of DP&lt;/strong&gt;
&lt;/h3&gt;
&lt;h4&gt;
  
  
  The top-down (memoization) approach
&lt;/h4&gt;

&lt;p&gt;In a top-down approach, we start from the highest level of our problem. We first check whether we have already solved the current sub-problem. If we have, we just return that value. If not, we solve that sub-problem using recursive calls.&lt;/p&gt;

&lt;p&gt;Since those calls require solving smaller sub-problems which we haven’t seen before, we continue this way, until we encounter a sub-problem we have either solved or know the answer to trivially.&lt;/p&gt;
&lt;h4&gt;
  
  
  The bottom-up (tabulation) approach
&lt;/h4&gt;

&lt;p&gt;In this approach, we start at the very bottom and then work our way to the top. Since we start from the “base case”, and use our recurrence relation, we don’t really need recursion, and so, this approach is iterative.&lt;/p&gt;

&lt;p&gt;The main difference between the two approaches is that bottom-up calculates all solutions, while top-down computes only those that are required. For example, to find the shortest path between source and destination, using the top-down approach, we only compute the distances with intermediate points near the shortest path, choosing the minimum at each stage.&lt;/p&gt;

&lt;p&gt;On the other hand, in a bottom-up approach, we end up calculating the shortest distance between each point on the grid and the destination, finally returning the shortest distance from the start to the end.&lt;/p&gt;

&lt;p&gt;As a comparison, let's look at a possible top-down and bottom-up function that returns the nth Fibonacci term.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
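&lt;p&gt;A possible top-down version in Python, caching each term in a dictionary so repeated calls are looked up rather than recomputed (the function name is illustrative):&lt;/p&gt;

```python
def fib_top_down(n, memo=None):
    # memo maps n -> nth Fibonacci term, so each sub-problem is solved once.
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    # Base cases from the recurrence: f(0) = 0, f(1) = 1.
    if n < 2:
        return n
    memo[n] = fib_top_down(n - 1, memo) + fib_top_down(n - 2, memo)
    return memo[n]
```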




&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
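&lt;p&gt;And a possible bottom-up version in Python, starting from the base cases f(0) = 0 and f(1) = 1 and iterating up to n (again, an illustrative sketch):&lt;/p&gt;

```python
def fib_bottom_up(n):
    # Start from the base cases and build every term up to n iteratively.
    if n < 2:
        return n
    prev, curr = 0, 1  # f(0), f(1)
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr  # f(i) = f(i-1) + f(i-2)
    return curr
```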


&lt;p&gt;While both approaches have the same asymptotic time complexities, the recursive calls in a top-down implementation may lead to a stack overflow, which is a non-issue owing to the iterative nature of the bottom-up approach.&lt;/p&gt;

&lt;p&gt;Remember that though we implement the latter iteratively, the logic still uses the recurrence relation from the basic recursive approach, as we shall see in this example.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;An example&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let's go over a problem which we’ll solve using both approaches to dynamic programming.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Problem
&lt;/h4&gt;

&lt;p&gt;Find the maximum sum of elements in an array ensuring that no adjacent elements are included. Let’s assume that no elements are negative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**example 1:  
[1, 2, 3] =&amp;gt; 1 + 3 = 4**  

**example 2:  
[1, 1, 1, 1] =&amp;gt; 1 + 1 = 2**  

**example 3:  
[2, 5, 2] =&amp;gt; 5 = 5**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The Analysis&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;First, let's try a &lt;strong&gt;greedy approach.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since our goal is to maximize the sum of the elements we choose, we could hope to accomplish this by choosing the biggest elements, ignoring its neighbours, and then continuing this way. Here, we’re ensuring that at each step of the way, we have a maximum sum. But this would be correct only in a local context, while we are, of course, looking for a global solution.&lt;/p&gt;

&lt;p&gt;This approach could work in certain cases.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**[1, 9, 1, 10, 1, 9, 1]**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here, we first choose 10, since it’s the biggest element. We then ignore its neighbours, so that we don’t violate the condition that we aren’t allowed to choose adjacent elements.&lt;/p&gt;

&lt;p&gt;Next, we choose both the 9’s, since they’re the next biggest elements, and then ignore their neighbours. Our algorithm ends here since there aren’t any elements left. The result we get — 10 + 9 + 9 — is, in fact, the right answer.&lt;/p&gt;

&lt;p&gt;But this won’t always work. Take the following example:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**[1, 1, 9, 10, 9, 1, 1]**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At every step, if you chose the maximum element, ignored its neighbours and continued that way, you’d end up choosing 10, then 1 and then 1 again after ignoring both the 9's, which would add up to 12, but the right answer would be 1 + 9 + 9 + 1, which is 20.&lt;/p&gt;

&lt;p&gt;It’s clear this approach isn’t the right one. Let’s start from a basic recursive solution and work up to one that uses dynamic programming.&lt;/p&gt;

&lt;p&gt;This is the difference between the greedy and dynamic programming approaches. While a greedy approach focuses on doing its best to reach the goal at every step, DP looks at the overall picture. With a greedy approach, there’s no guarantee you’ll even end up with an optimal solution, unlike DP. Greedy algorithms often get trapped in local maxima, leading to sub-optimal solutions.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;The recursive solution&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;After thinking for a bit, you can probably see that we have a condition to keep in mind: &lt;strong&gt;no adjacent elements&lt;/strong&gt;. You can probably figure out that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we can choose to either consider an element in our sum or ignore it&lt;/li&gt;
&lt;li&gt;if we consider it, we will have to ignore its adjacent element&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the sake of brevity, let &lt;strong&gt;&lt;em&gt;f(a..b)&lt;/em&gt;&lt;/strong&gt; represent a call to &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt; on our array from index &lt;strong&gt;&lt;em&gt;a&lt;/em&gt;&lt;/strong&gt; to index &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; (both inclusive). That function &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt; is our recursive function, which solves the problem.&lt;/p&gt;

&lt;p&gt;So &lt;strong&gt;&lt;em&gt;f(0..4)&lt;/em&gt;&lt;/strong&gt; would mean running the function from index 0 to index 4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fETKvUC4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/524/1%2AY8zXc6mPImLyFPeB2Ae_8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fETKvUC4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/524/1%2AY8zXc6mPImLyFPeB2Ae_8g.png" alt=""&gt;&lt;/a&gt;Our function call representation — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The two arrows pointing from a cell represent our choices of subsequent function calls. Since this is a maximization problem, we’d have to choose the maximum out of these options.&lt;/p&gt;

&lt;p&gt;Let’s take the following array.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**[5, 10, 100, 10, 5]**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Keeping the conditions discussed above in mind let’s actually write down what we would be doing.&lt;/p&gt;

&lt;p&gt;Our first call would be on the entire array, which is of length 5 as can be seen above.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(0..4)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For the element at index 0 (which happens to be 5 here), we can either choose to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;include it in our sum:&lt;/strong&gt; our current sum would then be 5 + the maximum sum of the rest of the array, but excluding the next element (index 1). Thus, our sum becomes &lt;strong&gt;&lt;em&gt;5 + f(2..4)&lt;/em&gt;&lt;/strong&gt;. Or to generalize it, &lt;strong&gt;&lt;em&gt;arr[0] + f(2..4)&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;exclude it:&lt;/strong&gt; our current sum would then just be equal to the maximum sum of the remaining array. This can be written as: &lt;strong&gt;&lt;em&gt;0 + f(1..4)&lt;/em&gt;&lt;/strong&gt;. Notice that our next call is from index 1 and not 2 as in the previous case. Since we aren’t considering the element at index 0, we are free to consider the element at index 1 — we aren’t forced to ignore it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AFdFFX8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/908/1%2A16XvPir53Yo38Kd1r1wz6Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AFdFFX8p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/908/1%2A16XvPir53Yo38Kd1r1wz6Q.png" alt=""&gt;&lt;/a&gt;The few first calls of our function — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The graph here visually explains this. As mentioned earlier, all arrows at a given level represent our choices, from which we choose the greatest one.&lt;/p&gt;

&lt;p&gt;So our final answer would be:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(0..4) = max(arr[0] + f(2..4), f(1..4))**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let’s expand this for the next iteration.&lt;/p&gt;

&lt;p&gt;First, we’ll do it for the left tree, which is &lt;strong&gt;&lt;em&gt;f(2..4)&lt;/em&gt;&lt;/strong&gt;. This is just like what we did for the first call to f. Remember that the &lt;strong&gt;&lt;em&gt;arr[0] +&lt;/em&gt;&lt;/strong&gt; part is still there. It will be added to the value of &lt;strong&gt;&lt;em&gt;f(2..4)&lt;/em&gt;&lt;/strong&gt; on our way back up the call tree.&lt;/p&gt;

&lt;p&gt;Our choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;consider &lt;em&gt;arr[2]&lt;/em&gt; in our sum:&lt;/strong&gt; our sum at this stage becomes &lt;strong&gt;&lt;em&gt;arr[2] + f(4..4)&lt;/em&gt;&lt;/strong&gt;. Remember that since we’re considering the element at index 2, we would have to ignore the next element — index 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ignore &lt;em&gt;arr[2]&lt;/em&gt;:&lt;/strong&gt; our sum here is the same as the maximum result of the remaining array without having to ignore the adjacent element. So, that's &lt;strong&gt;&lt;em&gt;f(3..4)&lt;/em&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gbyvkZxQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AuLrX5BoNJSRF_ZR3P1sbow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gbyvkZxQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AuLrX5BoNJSRF_ZR3P1sbow.png" alt=""&gt;&lt;/a&gt;The third level of our call tree — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just like before, the value of &lt;strong&gt;&lt;em&gt;f(2..4)&lt;/em&gt;&lt;/strong&gt; would be the maximum of our two choices.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(2..4) = max(arr[2] + f(4..4), f(3..4))**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The base case&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;What do you think &lt;strong&gt;&lt;em&gt;f(4..4)&lt;/em&gt;&lt;/strong&gt; would evaluate to? Following our notation, it is the result of our function call on the array from index 4 to … well, index 4. That means that we are calling the function on a single element. The maximum sum of a single element is itself.&lt;/p&gt;

&lt;p&gt;Another thing to keep in mind: in &lt;strong&gt;&lt;em&gt;f(a..b)&lt;/em&gt;&lt;/strong&gt;, a should never be greater than b. Since this call represents starting from index a and going up to index b, we would have to return 0 if &lt;strong&gt;&lt;em&gt;a&lt;/em&gt;&lt;/strong&gt; ever gets bigger than &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;. There is no maximum sum if there are no elements.&lt;/p&gt;

&lt;p&gt;We have our base case here. Our function &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt;, when called on a single element, returns that element directly, and returns 0 if we are not in a valid range. There are no further recursive calls. That’s why it’s called the base case.&lt;/p&gt;

&lt;p&gt;In our case, our call to &lt;strong&gt;&lt;em&gt;f(3..4)&lt;/em&gt;&lt;/strong&gt; leads to an invalid call to &lt;strong&gt;&lt;em&gt;f(5..4)&lt;/em&gt;&lt;/strong&gt;, which we handle by returning 0. We’ll generalize this later.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(4..4) = arr[4]  
f(5..4) = 0**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The recurrence relation&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Let’s have another look at our results.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;first call:  
**f(0..4) = max(arr[0] + f(2..4), f(1..4))**  

second call:
**f(2..4) = max(arr[2] + f(4..4), f(3..4))**

the base case:
**f(4..4) = arr[4]  
f(5..4) = 0**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice a pattern in the first two results? If we generalize these, we get:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(a..b) = max(arr[a] + f(a+2 .. b), f(a+1, b))**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This still isn’t the most simplified version of our relation. Notice the occurrences of &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; here. In fact, go back and look at our specific calls in the previous block.&lt;/p&gt;

&lt;p&gt;They don’t change. There’s no &lt;strong&gt;&lt;em&gt;b + 1&lt;/em&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;em&gt;b + 2&lt;/em&gt;&lt;/strong&gt;. It’s always &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;. And what’s the value of &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; in our first call? The last index. Since &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; is constant throughout our algorithm, we can remove it.&lt;/p&gt;

&lt;p&gt;Our recurrence relation becomes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(a) = max(arr[a] + f(a+2), f(a+1))**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;where &lt;strong&gt;&lt;em&gt;f(a)&lt;/em&gt;&lt;/strong&gt; is a call on the array from index &lt;strong&gt;&lt;em&gt;a&lt;/em&gt;&lt;/strong&gt;  onwards.&lt;/p&gt;

&lt;p&gt;Another thing to realize is that similar to how we removed &lt;strong&gt;&lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; since it was always equal to the last index in the array, the base case, which refers to a single element, would only happen if that element was the last in the array.&lt;/p&gt;

&lt;p&gt;A generalized version of our base case is:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(n-1) = arr[n-1]** where **n** is the size of the array
**f(a) = 0** if **a** &amp;gt;= **n** where **n** is the size of the array
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Thus, we have our relation:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**f(a) = max(arr[a] + f(a+2), f(a+1))  
f(n-1) = arr[n-1] **where** n** is the size of the array
**f(a) = 0** if **a** &amp;gt;= **n** where **n** is the size of the array
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Let’s implement the recursive approach based on this relation.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
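&lt;p&gt;One way this recurrence could be written in runnable Python (an illustrative sketch following the relation above, not necessarily the original snippet):&lt;/p&gt;

```python
def f(arr, a):
    n = len(arr)
    # Base case: past the end, there's nothing left to sum.
    if a >= n:
        return 0
    # Base case: a single element's maximum sum is itself.
    if a == n - 1:
        return arr[n - 1]
    # Either take arr[a] and skip its neighbour, or skip arr[a].
    return max(arr[a] + f(arr, a + 2), f(arr, a + 1))
```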



&lt;p&gt;This function would be called like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**array := [1, 5, 2, 4, ...]  
return f(array, 0)**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;What would be the complexity of this?&lt;/p&gt;

&lt;p&gt;If we were to approximate the complexity based on the size of the array ( &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; ) we are operating on, we get something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**T(n) = T(n-2) + T(n-1) + O(1)**

**T(0) = O(1)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Intuitively, every call to f on an array of size n — represented as &lt;strong&gt;&lt;em&gt;T(n)&lt;/em&gt;&lt;/strong&gt; — leads to two calls on f on arrays of size &lt;strong&gt;&lt;em&gt;n-2&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;n-1&lt;/em&gt;&lt;/strong&gt;. That is, at each stage, we’re doubling the number of calls to  &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The asymptotic time complexity is exponential. With the above reasoning, we get &lt;strong&gt;&lt;em&gt;O(2^n).&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a loose estimate of the upper bound, since the &lt;strong&gt;&lt;em&gt;n-2&lt;/em&gt;&lt;/strong&gt; tree is bound to end before the &lt;strong&gt;&lt;em&gt;n-1&lt;/em&gt;&lt;/strong&gt; tree, and so we are doing slightly less than doubling the calls. The actual complexity is &lt;strong&gt;&lt;em&gt;O(phi^n) — phi&lt;/em&gt;&lt;/strong&gt; is the golden ratio — or &lt;strong&gt;&lt;em&gt;O(1.618^n),&lt;/em&gt;&lt;/strong&gt; which is slightly less than our original estimate, but let’s stick to &lt;strong&gt;O(2^n)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Another thing to notice is that the recurrence relation above is similar to that of the nth Fibonacci term, which would hence give a similar complexity.&lt;/p&gt;
&lt;h4&gt;
  
  
  A dynamic programming approach
&lt;/h4&gt;

&lt;p&gt;Here’s where dynamic programming comes into the picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A3-Dbwx0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AyH_GlIjc8O1Ehy7LvU4Stg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A3-Dbwx0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AyH_GlIjc8O1Ehy7LvU4Stg.png" alt=""&gt;&lt;/a&gt;Notice the repeating sub-problems in the call graph — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you look closely, you’ll see the overlapping sub-problems we were talking about earlier.&lt;/p&gt;

&lt;p&gt;Now comes the important part — converting this recursive implementation to a dynamic programming approach. What if we stored the values of the function calls that are being repeated?&lt;/p&gt;

&lt;p&gt;Let’s maintain an array where the ith element is the value of &lt;strong&gt;&lt;em&gt;f(i)&lt;/em&gt;&lt;/strong&gt;, which in turn, is the maximum sum of the array from index &lt;strong&gt;&lt;em&gt;i&lt;/em&gt;&lt;/strong&gt; to the end.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[i] = f(i..n) = f(i)**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And since we already have a result for f(i),&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[i] = max(arr[i] + f(i + 2), f(i + 1))**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now that we have this relation, we can go two different ways. Either we go the top-down route, where our function is still recursive, like our result above, or we remove all recursive calls and go the bottom-up route.&lt;/p&gt;

&lt;p&gt;We’ll focus on the bottom-up route, but let's discuss the top-down approach.&lt;/p&gt;
&lt;h4&gt;
  
  
  - The Top-down approach
&lt;/h4&gt;

&lt;p&gt;Look at our previous result.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[i] = max(arr[i] + f(i + 2), f(i + 1))**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That’s all we need to implement the top-down approach. For any call to &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt; , we’ll first check in our array &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; if we have already made that call earlier, and if we have, we use the pre-calculated value directly.&lt;/p&gt;

&lt;p&gt;On the other hand, if the call we are making has never been done before, we have to compute the entire thing. In that case, once we arrive at a value, we make sure to store it in our array &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; so that we won’t have to repeat the whole process.&lt;/p&gt;

&lt;p&gt;The call tree should look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MTN0Nu9p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A8EpjkUlVOchxAy9F4G8_7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MTN0Nu9p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A8EpjkUlVOchxAy9F4G8_7g.png" alt=""&gt;&lt;/a&gt;The call tree in the top-down dynamic programming approach — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s implement this algorithm.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
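&lt;p&gt;A possible Python sketch of this top-down approach, with &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; as a dictionary of already-computed calls (the function name is illustrative):&lt;/p&gt;

```python
def f_top_down(arr, a, dp=None):
    n = len(arr)
    if dp is None:
        dp = {}
    # Reuse a previously computed sub-problem if we have it.
    if a in dp:
        return dp[a]
    if a >= n:
        return 0
    if a == n - 1:
        dp[a] = arr[n - 1]
    else:
        # dp[i] = max(arr[i] + f(i + 2), f(i + 1))
        dp[a] = max(arr[a] + f_top_down(arr, a + 2, dp),
                    f_top_down(arr, a + 1, dp))
    return dp[a]
```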



&lt;p&gt;The additional space required to store the results of our sub-problems grows linearly with the size of the input array. Hence, apart from the &lt;strong&gt;&lt;em&gt;O(n)&lt;/em&gt;&lt;/strong&gt; space required due to the recursive stack, we have an &lt;strong&gt;&lt;em&gt;O(n)&lt;/em&gt;&lt;/strong&gt; space for the &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; array, n being the size of the input array.&lt;/p&gt;

&lt;p&gt;The time complexity, though harder to compute, is linear in the input size. This is because we store the answers to the sub-problems we have already solved, and so, we have only &lt;strong&gt;&lt;em&gt;O(n)&lt;/em&gt;&lt;/strong&gt; unique sub-problems to solve. This result can also be verified with the complexity we get using the bottom-up approach.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Bottom-up approach
&lt;/h4&gt;

&lt;p&gt;Recall that in this approach, we seek to eliminate all recursive calls by following an iterative approach, where we start from the base case, or the “bottom” and make our way up.&lt;/p&gt;

&lt;p&gt;Let’s replace the other calls to &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt; with accessing elements of  &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[i] = max(arr[i] + dp[i + 2], dp[i + 1])**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;What about the base case, &lt;strong&gt;&lt;em&gt;f(n-1) = arr[n-1]&lt;/em&gt;&lt;/strong&gt;? This would be the last element of the array  &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[n-1] = arr[n-1]**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And just like that, we have our solution for a bottom-up &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; approach!&lt;/p&gt;

&lt;p&gt;Let’s implement this, just like we did for the recursive approach.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
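&lt;p&gt;In the same pseudocode style, the bottom-up implementation might look like this (note that &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; gets n + 1 slots, so that &lt;strong&gt;&lt;em&gt;dp[i + 2]&lt;/em&gt;&lt;/strong&gt; is defined when i = n - 2):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(arr):
    n := length(arr)
    dp := array of size n + 1
    dp[n] := 0                      // an empty suffix contributes nothing
    dp[n - 1] := arr[n - 1]         // base case
    for i := n - 2 down to 0:
        dp[i] := max(arr[i] + dp[i + 2], dp[i + 1])
    return dp[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;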



&lt;p&gt;This function would be called like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**array := [1, 5, 2, 4, ...]  
output(f(array))**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The complexity here would be linear in both space and time.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;We are running a single for-loop &lt;strong&gt;&lt;em&gt;n-1&lt;/em&gt;&lt;/strong&gt; times, and in each iteration, we are performing constant time operations — a linear time complexity.&lt;/p&gt;

&lt;p&gt;Since the size of the array &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; depends on the size of the input array — which, of course, is variable — our space complexity is also linear.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;Improving the algorithm&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;But can we do better? Let’s see.&lt;/p&gt;

&lt;p&gt;In terms of asymptotic time complexity, we can’t: to find the answer, we have to examine every element of the array at least once, so we can’t do better than linear time.&lt;/p&gt;

&lt;p&gt;But what about space complexity? Do we need to maintain an array of size &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;&lt;/strong&gt; to solve the problem?&lt;/p&gt;

&lt;p&gt;Look closely at the line inside the for-loop:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[i] = max(arr[i] + dp[i + 2], dp[i + 1])**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;At any point in time, all we need to populate &lt;strong&gt;&lt;em&gt;dp[i]&lt;/em&gt;&lt;/strong&gt; is the next two elements of &lt;strong&gt;&lt;em&gt;dp&lt;/em&gt;&lt;/strong&gt; — at indices &lt;strong&gt;&lt;em&gt;i + 1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;i + 2&lt;/em&gt;&lt;/strong&gt;. There’s no reason to maintain all of our results. We just need to keep track of the last two iterations.&lt;/p&gt;

&lt;p&gt;Let’s use three variables, named &lt;strong&gt;&lt;em&gt;i_0&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;i_1&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;i_2&lt;/em&gt;&lt;/strong&gt;, to make it easier to relate them.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[i] --&amp;gt; i\_0  
dp[i+1] --&amp;gt; i\_1  
dp[i+2] --&amp;gt; i\_2**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice that since we’re decrementing &lt;strong&gt;&lt;em&gt;i&lt;/em&gt;&lt;/strong&gt; in each iteration, the current index &lt;strong&gt;&lt;em&gt;i&lt;/em&gt;&lt;/strong&gt; plays the role of &lt;strong&gt;&lt;em&gt;i + 1&lt;/em&gt;&lt;/strong&gt; in the next iteration. So &lt;strong&gt;&lt;em&gt;dp[i + 1]&lt;/em&gt;&lt;/strong&gt; would be the next &lt;strong&gt;&lt;em&gt;dp[i + 2]&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;dp[i]&lt;/em&gt;&lt;/strong&gt; would be the next &lt;strong&gt;&lt;em&gt;dp[i + 1]&lt;/em&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;em&gt;dp[i + 2]&lt;/em&gt;&lt;/strong&gt; — which is no longer needed, since &lt;strong&gt;&lt;em&gt;dp[i + 3]&lt;/em&gt;&lt;/strong&gt; isn’t required — can be reused as the next &lt;strong&gt;&lt;em&gt;dp[i]&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Replacing this with our three new variables, the code inside our loop becomes:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**i\_0 := max(arr[i] + i\_2, i\_1)  
i\_2 := i\_1  
i\_1 := i\_0**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;We initialize these variables just as we did in the array implementation.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**dp[n-1] = arr[n-1] --&amp;gt; i\_1 = arr[n-1]  
dp[n] = 2 --&amp;gt; i\_2 = 0**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One last thing to keep in mind: what if the input array has only a single element? Our loop, which runs from &lt;strong&gt;&lt;em&gt;n-2&lt;/em&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;em&gt;0&lt;/em&gt;&lt;/strong&gt; , wouldn’t run even once.&lt;/p&gt;

&lt;p&gt;Hence, we initialize &lt;strong&gt;&lt;em&gt;i_0&lt;/em&gt;&lt;/strong&gt; with the value of &lt;strong&gt;&lt;em&gt;i_1&lt;/em&gt;&lt;/strong&gt;. So if the loop never runs — that is, the input array has only one element — returning &lt;strong&gt;&lt;em&gt;i_0&lt;/em&gt;&lt;/strong&gt; returns the value of &lt;strong&gt;&lt;em&gt;i_1&lt;/em&gt;&lt;/strong&gt;, which is the array’s only element.&lt;/p&gt;

&lt;p&gt;Finally, we return &lt;strong&gt;&lt;em&gt;i_0&lt;/em&gt;&lt;/strong&gt; instead of &lt;strong&gt;&lt;em&gt;dp[0]&lt;/em&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**return dp[0] --&amp;gt; return i\_0**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Thus, our final algorithm would look something like this.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
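&lt;p&gt;Putting the pieces together, a sketch of the constant-space version in the same pseudocode style:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(arr):
    n := length(arr)
    i_1 := arr[n - 1]               // plays the role of dp[i + 1]
    i_2 := 0                        // plays the role of dp[i + 2]
    i_0 := i_1                      // covers a single-element array
    for i := n - 2 down to 0:
        i_0 := max(arr[i] + i_2, i_1)
        i_2 := i_1
        i_1 := i_0
    return i_0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;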



&lt;p&gt;Just like the previous dynamic programming approach, this function would be called by simply passing in an array or a reference to one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**array := [1, 5, 2, 4, ...]  
return f(array)**  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For an array of any length, all we need is three variables. Thus, the space complexity of our algorithm is now &lt;strong&gt;&lt;em&gt;O(1)&lt;/em&gt;&lt;/strong&gt; — constant.&lt;/p&gt;

&lt;p&gt;Summarizing our results,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cfpqiqYY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AM4M0loVZVUoAhWTCKb-KAw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cfpqiqYY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AM4M0loVZVUoAhWTCKb-KAw.png" alt=""&gt;&lt;/a&gt;A summary of our implementations — image created by Author using &lt;a href="https://app.diagrams.net/"&gt;draw.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Comparing the recursive approach with our top-down approach, it's clear that &lt;strong&gt;we are trading space complexity for better time complexity&lt;/strong&gt;. Of course, since both are recursive, they have the additional space required for the recursive call stack.&lt;/p&gt;

&lt;p&gt;In a similar vein, the lowest two rows are the results of our bottom-up approaches. They are iterative, so they don’t require storing function records recursively on the stack. And since they’re essentially the same algorithm as the top-down approach, they have the same linear time complexity.&lt;/p&gt;

&lt;p&gt;The best case is the &lt;strong&gt;bottom-up approach requiring O(1) space&lt;/strong&gt; — meaning that the space our dp algorithm uses doesn’t change with the input size n.&lt;/p&gt;
&lt;h4&gt;
  
  
  &lt;strong&gt;The code&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Let's implement our final algorithm of &lt;strong&gt;constant space bottom-up dynamic programming&lt;/strong&gt; in C++. The variable and function names are the same as before.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;




&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; the final space complexity optimization step is slightly harder to notice, but drastically improves your space usage, as we just saw. See if you can spot a similar relation for the bottom-up approach for the nth Fibonacci term.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Dynamic Programming is often not very intuitive or straightforward. Then again, most complex things aren’t. But things do get easier with practice. There are tonnes of dynamic programming practice problems online, which should help you get better at knowing when to apply dynamic programming, and how to apply it well. Hopefully, this post served as a good starting point.&lt;/p&gt;




</description>
      <category>algorithms</category>
      <category>editorspick</category>
      <category>programming</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>Understanding Maximum Likelihood Estimation</title>
      <dc:creator>Aniruddha Karajgi</dc:creator>
      <pubDate>Mon, 10 Aug 2020 02:59:30 +0000</pubDate>
      <link>https://dev.to/polaris000/understanding-maximum-likelihood-estimation-3pg2</link>
      <guid>https://dev.to/polaris000/understanding-maximum-likelihood-estimation-3pg2</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EXSKX7Lc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2APFPNg3jngAjU8j_RQnO9aw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EXSKX7Lc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2APFPNg3jngAjU8j_RQnO9aw.jpeg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximum Likelihood Estimation&lt;/strong&gt; , or MLE, for short, is the process of estimating the parameters of a distribution that maximize the likelihood of the observed data belonging to that distribution.&lt;/p&gt;

&lt;p&gt;Simply put, when we perform MLE, we are trying to &lt;strong&gt;find the distribution that best fits our data.&lt;/strong&gt; The resulting value of the distribution’s parameter is called the &lt;strong&gt;maximum likelihood estimate.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;MLE is a very prominent frequentist technique. Many conventional machine learning algorithms work with the principles of MLE. For example, the best-fit line in linear regression calculated using least squares is identical to the result of MLE.&lt;/p&gt;

&lt;h3&gt;
  
  
  The likelihood function
&lt;/h3&gt;

&lt;p&gt;Before we move forward, we need to understand the likelihood function.&lt;/p&gt;

&lt;p&gt;The likelihood function helps us find the best parameters for our distribution. It can be defined as shown:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dQutipSB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/598/1%2Ar90eozWc7pjdxOhjgH52mw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dQutipSB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/598/1%2Ar90eozWc7pjdxOhjgH52mw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;strong&gt;&lt;em&gt;θ&lt;/em&gt;&lt;/strong&gt; is the parameter to maximize, &lt;strong&gt;&lt;em&gt;x_1, x_2, … x_n&lt;/em&gt;&lt;/strong&gt; are observations for &lt;strong&gt;n&lt;/strong&gt; random variables from a distribution and &lt;strong&gt;&lt;em&gt;f&lt;/em&gt;&lt;/strong&gt; is the joint density function of our distribution with the parameter &lt;em&gt;θ&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The pipe (“ | “) is often replaced by a semi-colon, since &lt;em&gt;θ&lt;/em&gt; isn’t a random variable, but an unknown parameter.&lt;/p&gt;

&lt;p&gt;Of course, &lt;em&gt;θ&lt;/em&gt; could also be a set of parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DtoribRu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/481/1%2AZ7D0XQ0L7N0NYZa_hswerg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DtoribRu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/481/1%2AZ7D0XQ0L7N0NYZa_hswerg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, in the case of a normal distribution, we would have&lt;br&gt;&lt;br&gt;
&lt;em&gt;θ = (μ,σ)&lt;/em&gt;, with μ and σ representing the two parameters of our distribution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intuition
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Likelihood&lt;/strong&gt; is often used interchangeably with probability, but they are not the same. Likelihood is not a probability density function, meaning that integrating it over a specific interval would not yield a “probability” over that interval. Rather, it describes how well a distribution with certain values for its parameters fits our data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gXSHrHp7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AOfxcaYeexRFT9SiOSoeB4Q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gXSHrHp7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AOfxcaYeexRFT9SiOSoeB4Q.jpeg" alt=""&gt;&lt;/a&gt;&lt;em&gt;θ_MLE is the value that maximizes the likelihood of our data x&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Looking at it this way, we can say that likelihood measures how well a distribution with a particular parameter value fits our data. So, if &lt;em&gt;L(θ_1|x)&lt;/em&gt; is greater than &lt;em&gt;L(θ_2|x)&lt;/em&gt;, the distribution with parameter value &lt;em&gt;θ_1&lt;/em&gt; fits our data better than the one with parameter value &lt;em&gt;θ_2.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Process
&lt;/h3&gt;

&lt;p&gt;To reiterate, we’re looking for the parameter (or parameters, as the case may be) that maximizes our likelihood function. How do we do that?&lt;/p&gt;

&lt;p&gt;To simplify our calculations, let’s assume that our data is &lt;strong&gt;independent and identically distributed&lt;/strong&gt;, or i.i.d. for short, meaning that the observations are independent of each other and are all drawn from the same distribution.&lt;/p&gt;

&lt;p&gt;The i.i.d assumption allows us to easily calculate the cumulative likelihood considering all data points as a product of individual likelihoods.&lt;/p&gt;

&lt;p&gt;Also, most likelihood functions have a single maximum, allowing us to simply equate the derivative to 0 to get the value of our parameter. If multiple maxima exist, we would need to look at the global maximum to get our answer.&lt;/p&gt;

&lt;p&gt;When no such closed-form solution exists, more complex numerical methods are required to find the maximum likelihood estimate.&lt;/p&gt;

&lt;h3&gt;
  
  
  An Example
&lt;/h3&gt;

&lt;p&gt;To understand the math behind MLE, let’s try a simple example. We’ll derive the maximum likelihood estimate of the &lt;strong&gt;exponential distribution&lt;/strong&gt;’s parameter.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Exponential Distribution
&lt;/h4&gt;

&lt;p&gt;The exponential distribution is a continuous probability distribution commonly used to model the time between events.&lt;/p&gt;

&lt;p&gt;It has a single parameter, called &lt;strong&gt;λ&lt;/strong&gt; by convention. λ is called the &lt;strong&gt;rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Its &lt;strong&gt;mean&lt;/strong&gt; and &lt;strong&gt;variance&lt;/strong&gt; are &lt;em&gt;1/λ&lt;/em&gt; and &lt;em&gt;1/λ²&lt;/em&gt;, respectively.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;probability density function&lt;/strong&gt; for the exponential distribution is as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SIOlYai4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/639/0%2APSmW16GwuFfn2-Tf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SIOlYai4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/639/0%2APSmW16GwuFfn2-Tf.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RBz8jOxh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/382/1%2Ai74ZYpgdxhwFed5eH9_DKw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RBz8jOxh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/382/1%2Ai74ZYpgdxhwFed5eH9_DKw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8KiDdFs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/752/1%2Av4F-0RLZGil-do2IiXxSwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8KiDdFs0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/752/1%2Av4F-0RLZGil-do2IiXxSwA.png" alt=""&gt;&lt;/a&gt;PDF plots with variable λ&lt;/p&gt;

&lt;p&gt;There’s a single parameter &lt;em&gt;λ&lt;/em&gt;. Let’s calculate its value, given n random points x_1 to x_n.&lt;/p&gt;

&lt;p&gt;As discussed earlier, we know that the &lt;strong&gt;likelihood&lt;/strong&gt; for a given point x_i is given by the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v-vc7RMb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/592/1%2A2qEje_VyqWgYBmNxj-v5XA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v-vc7RMb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/592/1%2A2qEje_VyqWgYBmNxj-v5XA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We calculate the likelihood for each of our n points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kCgdCcCe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/327/1%2AKNphGWrNpSNGBk0cnbkzhA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kCgdCcCe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/327/1%2AKNphGWrNpSNGBk0cnbkzhA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The combined likelihood for all n points would just be the product of their individual likelihoods, since we are considering independent and identically distributed points.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zxSBfI9---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/789/1%2AAMda9GmZskFl9tMD-_6_uQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zxSBfI9---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/789/1%2AAMda9GmZskFl9tMD-_6_uQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  The log-likelihood
&lt;/h4&gt;

&lt;p&gt;Our next step would be to find the derivative of our likelihood function and set it to 0, since we want to find the value of our distribution parameter (in this case, &lt;em&gt;λ&lt;/em&gt;) which gives the maximum likelihood.&lt;/p&gt;

&lt;p&gt;Since a function and its logarithm have the same &lt;strong&gt;stationary points&lt;/strong&gt; (points where the derivative equals 0), we can simplify our calculations by considering the logarithm of our likelihood function.&lt;/p&gt;

&lt;p&gt;Let’s plot a simple graph that represents the following equations when a and b are both set to 1. The rate parameter, &lt;em&gt;λ,&lt;/em&gt; has been replaced by x.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f_9pOKb6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/541/1%2ALcfDP_g3D-E6kipohOH3WQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f_9pOKb6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/541/1%2ALcfDP_g3D-E6kipohOH3WQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The terms a and b represent two datapoints, say, x_1 and x_2. Our likelihood function is represented by the orange curve. It is the product of likelihoods of the two individual datapoints.&lt;/p&gt;

&lt;p&gt;The logarithm of the likelihood function, or the &lt;strong&gt;log-likelihood&lt;/strong&gt; , is represented by the pink curve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--quRY0BLN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/968/1%2ALhrfOJtNI-OQbhj3195XgA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--quRY0BLN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/968/1%2ALhrfOJtNI-OQbhj3195XgA.png" alt=""&gt;&lt;/a&gt;The likelihood and the log-likelihood function for our points x_1 and x_2.&lt;/p&gt;

&lt;p&gt;The blue-dotted line will be covered later.&lt;/p&gt;

&lt;p&gt;Two things to notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both the likelihood function (orange) and its logarithm (pink) have the same stationary point (the derivative is 0).&lt;/li&gt;
&lt;li&gt;That common stationary point occurs at x = 1 (the blue dotted line), which may not make much sense at the moment, but it is essentially a quick check of our result. We’ll revisit this point after obtaining our result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iQqv98cZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/767/1%2A8-YAJaYLrDbeYkd-VmIj0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iQqv98cZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/767/1%2A8-YAJaYLrDbeYkd-VmIj0w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The product of many small probabilities, as arises when calculating the likelihood over several data points, can also lead to numerical underflow, giving us another reason to prefer working with a “sum of logs” rather than a product.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--id3tzCfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/869/1%2Av6qJVLQl2YiJ5Ub3g9d_Dw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--id3tzCfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/869/1%2Av6qJVLQl2YiJ5Ub3g9d_Dw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simplifying our result, we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qqrd34fV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/419/1%2A8G_2Lq3hteWzvbkXlZ-t5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qqrd34fV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/419/1%2A8G_2Lq3hteWzvbkXlZ-t5g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the log-likelihood for the exponential distribution.&lt;/p&gt;
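&lt;p&gt;In symbols (my notation, consistent with the images above), for i.i.d. observations x_1, …, x_n:&lt;/p&gt;

```latex
L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
           = \lambda^{n} e^{-\lambda \sum_{i=1}^{n} x_i},
\qquad
\log L(\lambda) = n \log \lambda - \lambda \sum_{i=1}^{n} x_i
```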

&lt;h4&gt;
  
  
  The derivative
&lt;/h4&gt;

&lt;p&gt;Now that we have our log-likelihood function, lets find its maxima. To do this, we simply find its first derivative with respect to λ.&lt;/p&gt;

&lt;p&gt;Differentiating &lt;em&gt;log (L(λ)),&lt;/em&gt; we get:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sUIAnQxA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/536/1%2AiEIaN1h6An6NwYnirF13nA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sUIAnQxA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/536/1%2AiEIaN1h6An6NwYnirF13nA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Following that, we end up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K73dvvtF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/344/1%2A46ZwD4CG9u_jC5v53KoWeA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K73dvvtF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/344/1%2A46ZwD4CG9u_jC5v53KoWeA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simplifying this further, we get the following relation for λ:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L2zYe8-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/494/1%2Ap8qUyW83UdUgXpgKUCuJmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L2zYe8-V--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/494/1%2Ap8qUyW83UdUgXpgKUCuJmw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the value of λ that maximizes likelihood can be calculated using the above relation. Similar calculations can be done for other continuous and even discrete distributions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--quRY0BLN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/968/1%2ALhrfOJtNI-OQbhj3195XgA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--quRY0BLN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/968/1%2ALhrfOJtNI-OQbhj3195XgA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, coming back to our graph example, the blue dotted line, with the equation: &lt;em&gt;x = 2 / (a + b),&lt;/em&gt; is our value for λ when n is 2.&lt;/p&gt;

&lt;p&gt;Remember, the value of λ we obtained is the point where the likelihood function (orange curve) attains its maximum. For a = 1 and b = 1, we get x = 1 and hence λ = 1, which is shown in the graph as the maximum of the orange curve.&lt;/p&gt;

&lt;p&gt;In distributions with multiple parameters, like the normal distribution, we consider each one in turn, keeping the others constant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;MLE isn’t the only technique for estimating the parameters of a distribution. Other techniques include&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Maximum A Posteriori Estimation (MAP),&lt;/strong&gt; which incorporates prior information as well, unlike MLE, which effectively assumes a uniform prior; and&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expectation Maximization,&lt;/strong&gt; which handles latent variables (unobservable variables that affect the observed ones), something that MLE struggles with.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s a lot more to Maximum Likelihood Estimation — and for that matter, other parameter estimation techniques. This post focuses more on the underlying math behind MLE using the exponential distribution as an example.&lt;/p&gt;

&lt;p&gt;Hopefully, this article gets you started with other, more complex techniques!&lt;/p&gt;

&lt;p&gt;Of course, please let me know in the comments if anything’s unclear and I’ll update the post. Thanks for reading!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>statistics</category>
      <category>maximumlikelihood</category>
      <category>mathematics</category>
    </item>
    <item>
      <title>Visualizing the Defective Chessboard problem</title>
      <dc:creator>Aniruddha Karajgi</dc:creator>
      <pubDate>Sat, 11 Jan 2020 02:59:30 +0000</pubDate>
      <link>https://dev.to/polaris000/visualizing-the-defective-chessboard-problem-4n0e</link>
      <guid>https://dev.to/polaris000/visualizing-the-defective-chessboard-problem-4n0e</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V-vyaiop--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/542/1%2AjMx7jrHOg4qngBL6pBp9hQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V-vyaiop--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/542/1%2AjMx7jrHOg4qngBL6pBp9hQ.png" alt=""&gt;&lt;/a&gt;A tiled chessboard&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Defective Chessboard problem,&lt;/strong&gt; also known as the &lt;strong&gt;Tiling Problem&lt;/strong&gt;, is an interesting one. It is typically solved with a “divide and conquer” approach. The algorithm has a time complexity of &lt;strong&gt;&lt;em&gt;O(n²)&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem
&lt;/h3&gt;

&lt;p&gt;Given an &lt;em&gt;n&lt;/em&gt; by &lt;em&gt;n&lt;/em&gt; board, where &lt;em&gt;n&lt;/em&gt; is of the form 2^k with k &amp;gt;= 1 (basically, n is a power of 2, with a minimum value of 2), and with one missing (defective) square, fill the board using trominoes. A tromino is an L-shaped tile: a 2 × 2 block with one 1×1 cell missing.&lt;/p&gt;

&lt;p&gt;Solving the problem efficiently isn’t the goal of this post. Visualizing the board as the algorithm runs is, in my opinion, much more interesting. Let’s discuss the algorithm first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dqeGM-Zw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/528/0%2AadonlEHar0FeZ3XM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dqeGM-Zw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/528/0%2AadonlEHar0FeZ3XM.png" alt=""&gt;&lt;/a&gt;The board with no defects added&lt;/p&gt;

&lt;h3&gt;
  
  
  The Algorithm
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, a divide-and-conquer (DAC) technique is used to solve the problem. DAC entails splitting a larger problem into sub-problems, ensuring that each sub-problem is an exact, albeit smaller, copy of the larger one. You may see where we are going with this, but let’s do it explicitly.&lt;/p&gt;

&lt;p&gt;The question we must ask before writing the algorithm is: is this even possible? Well, yes. The total number of squares on our board is &lt;em&gt;n&lt;/em&gt;², or 4^&lt;em&gt;k&lt;/em&gt;. Removing the defect, we would have 4^&lt;em&gt;k&lt;/em&gt; - 1 squares, which is always a multiple of three. This can be proved pretty easily with mathematical induction, so I won’t be discussing it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VyUnjl2u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/527/0%2AzOyl7BWBT9R4Csjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VyUnjl2u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/527/0%2AzOyl7BWBT9R4Csjd.png" alt=""&gt;&lt;/a&gt;The board in the initial state (with an added defect)&lt;/p&gt;

&lt;h3&gt;
  
  
  The base case
&lt;/h3&gt;

&lt;p&gt;Every recursive algorithm must have a base case to ensure that it terminates. For us, let’s consider the case when &lt;em&gt;n&lt;/em&gt; = 2^&lt;em&gt;k&lt;/em&gt; is 2. We thus have a 2 × 2 block with a single defect. Solving this is trivial: the remaining 3 squares naturally form a tromino.&lt;/p&gt;

&lt;h3&gt;
  
  
  The recursive step
&lt;/h3&gt;

&lt;p&gt;To make every sub-problem a smaller version of our original problem, all we have to do is add our own defects, in a very specific way. Place a “defect” tromino at the center of the board, with one square in each of the four quadrants except the one containing the original defect. This gives us four proper sub-problems, while ensuring that we can complete the solution by adding one last tromino to cover the three pseudo-defective squares we added.&lt;/p&gt;

&lt;p&gt;Once we are done adding the “defects”, recursively solving each of the four sub-problems takes us to our last step, which was already discussed in the previous paragraph.&lt;/p&gt;

&lt;h3&gt;
  
  
  The combine step
&lt;/h3&gt;

&lt;p&gt;After solving each of the four sub-problems and putting them together to form a complete board, we have 4 defects in our board: the original one lies in one of the quadrants, while the other three are those we intentionally added to solve the problem using &lt;strong&gt;DAC&lt;/strong&gt;. All we have to do is add a final tromino to cover up those three ‘defects’ and we are done.&lt;/p&gt;

&lt;p&gt;Thus, the recursive equation for time complexity becomes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;T(n) = 4T(n/2) + c&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;4&lt;/strong&gt; comes from the fact that to solve each problem of input size n, we divide it into 4 sub-problems of input size &lt;em&gt;n/2&lt;/em&gt; (half the side length of the original board). Once we are done solving those sub-problems, all that’s left is to combine them: this is done by adding the last tromino to cover up the pseudo-defects we added, which takes constant time.&lt;/p&gt;

&lt;p&gt;If you are interested in deriving the asymptotic time complexity of the recurrence relation, you could try the &lt;strong&gt;recursion tree method&lt;/strong&gt; or the &lt;strong&gt;substitution method&lt;/strong&gt;. For now, let’s just use the &lt;strong&gt;Master theorem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The master theorem says that for a recurrence relation of the form:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;T(n) = aT(n/b) + f(n)&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;the complexity depends on the complexities of &lt;em&gt;f(n)&lt;/em&gt; and &lt;em&gt;n ^ log_b(a)&lt;/em&gt; (the log is to the base &lt;em&gt;b&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;The cases in the image below tell us which case we need to use here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6ulUf3J2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AfPnVzEh00aTSG3PCX2MnXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6ulUf3J2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AfPnVzEh00aTSG3PCX2MnXQ.png" alt=""&gt;&lt;/a&gt;The Master Theorem. From &lt;a href="https://brilliant.org/wiki/master-theorem/"&gt;brilliant.org&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;em&gt;a&lt;/em&gt; = 4 and &lt;em&gt;b&lt;/em&gt; = 2, so n ^ log_b(a) = n ^ log_2(4) = &lt;em&gt;n&lt;/em&gt;². Since &lt;em&gt;f(n)&lt;/em&gt; is of constant complexity, we use &lt;strong&gt;Case 1,&lt;/strong&gt; which tells us that our algorithm has an order of &lt;strong&gt;&lt;em&gt;n&lt;/em&gt;²&lt;/strong&gt;. In other words, the time complexity of our algorithm is &lt;strong&gt;O(&lt;em&gt;n&lt;/em&gt;²).&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code
&lt;/h3&gt;

&lt;p&gt;Initially, each square is represented by a 0. A ‘-1’ represents a defective square, and appears black in the plots. Each tromino is displayed with a unique number, which is incremented as more trominoes are added.&lt;/p&gt;

&lt;p&gt;Again, the goal of the code is not optimization; it’s to do as much from scratch (in plain Python) as possible.&lt;/p&gt;

&lt;p&gt;I have commented the code below, so it should be pretty straightforward.&lt;/p&gt;

&lt;h4&gt;
  
  
  Importing required libraries
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
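&lt;p&gt;A minimal sketch of this step (seaborn and matplotlib are assumed only for the plots later on):&lt;/p&gt;

```python
# random picks the defective square; the visualization sections
# additionally assume seaborn (and matplotlib) for the heat-maps.
import random
```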


&lt;p&gt;The random library is used to randomly pick a square to be defective, just for starting the problem, and seaborn is, of course, for the visualization.&lt;/p&gt;

&lt;h4&gt;
  
  
  Creating the board
&lt;/h4&gt;

&lt;p&gt;Nothing too crazy: just creating a two-dimensional Python list, initialized with 0s. The algorithm could be sped up by using structures like numpy arrays instead of vanilla Python lists.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
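&lt;p&gt;A minimal sketch of this step (the name make_board is illustrative):&lt;/p&gt;

```python
def make_board(k):
    # The board is an n x n grid of zeros, where n = 2**k
    # (0 marks an untiled square).
    n = 2 ** k
    return [[0 for _ in range(n)] for _ in range(n)]
```

&lt;p&gt;For example, make_board(2) produces a 4 × 4 grid of zeros.&lt;/p&gt;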


&lt;h4&gt;
  
  
  Adding a defect randomly
&lt;/h4&gt;

&lt;p&gt;Here we are randomly choosing an element using randint() and making it the defective tile. We will represent defects with a  &lt;strong&gt;-1.&lt;/strong&gt;&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
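&lt;p&gt;A sketch of this step (add_defect is an illustrative name):&lt;/p&gt;

```python
import random

def add_defect(board):
    # Mark one randomly chosen square as defective with -1
    n = len(board)
    r = random.randint(0, n - 1)
    c = random.randint(0, n - 1)
    board[r][c] = -1
    return r, c
```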


&lt;h4&gt;
  
  
  Locating the defect when solving each sub-problem
&lt;/h4&gt;

&lt;p&gt;We are doing two things here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;locating the row and column of the defect&lt;/li&gt;
&lt;li&gt;determining which quadrant of the board the defect lies in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first step can be optimized by using numpy functions instead of the ‘from-scratch’ approach below.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
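&lt;p&gt;Sketched in plain Python below (find_defect is an illustrative name; it treats any nonzero square as the defect, which also covers the pseudo-defects added later):&lt;/p&gt;

```python
def find_defect(board, r1, c1, r2, c2):
    # Scan the section bounded by rows r1..r2 and columns c1..c2
    # for the single nonzero (defective) square.
    half = (r2 - r1 + 1) // 2
    for r in range(r1, r2 + 1):
        for c in range(c1, c2 + 1):
            if board[r][c] != 0:
                # Quadrant index: 0 = top-left, 1 = top-right,
                # 2 = bottom-left, 3 = bottom-right
                quad = 2 * ((r - r1) // half) + ((c - c1) // half)
                return r, c, quad
    return None
```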


&lt;h4&gt;
  
  
  Adding trominoes
&lt;/h4&gt;

&lt;p&gt;This function adds a tromino to a 2 × 2 section of the board. Here, the quadrant of the defect comes in handy, as we can simply define a dictionary to decide how to place the tromino.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
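&lt;p&gt;One way to sketch this (place_tromino and the quadrant convention are my own; the dictionary maps the defect’s quadrant to the corner that should stay uncovered):&lt;/p&gt;

```python
def place_tromino(board, r1, c1, quad, tile):
    # Fill the 2 x 2 block whose top-left corner is (r1, c1) with the
    # number `tile`, skipping the corner holding the (pseudo-)defect.
    skip = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}[quad]
    for dr in (0, 1):
        for dc in (0, 1):
            if (dr, dc) != skip:
                board[r1 + dr][c1 + dc] = tile
```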


&lt;h4&gt;
  
  
  Recursive Tiling Function
&lt;/h4&gt;

&lt;p&gt;This is a divide-and-conquer implementation of our tiling algorithm. The function accomplishes the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;determining the location of the defect in the given section of the board (the rows r1 and r2 and the columns c1 and c2 allow the function to focus on a particular section of the board).&lt;/li&gt;
&lt;li&gt;adding a tromino if we are dealing with a 2 × 2 section of the board&lt;/li&gt;
&lt;li&gt;otherwise, adding three defects to the center&lt;/li&gt;
&lt;li&gt;recursively solving each quadrant of the board&lt;/li&gt;
&lt;li&gt;adding a final tromino to cover up the three defects we added at the center.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
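&lt;p&gt;A self-contained sketch of the recursion (solve and counter are illustrative names; marking the three center squares with the current tile number adds the pseudo-defects and places the combining tromino in one step):&lt;/p&gt;

```python
def solve(board, r1, c1, r2, c2, counter):
    # counter is a one-element list used as a mutable tile counter
    size = r2 - r1 + 1
    half = size // 2
    # locate the single (pseudo-)defect in this section
    for r in range(r1, r2 + 1):
        for c in range(c1, c2 + 1):
            if board[r][c] != 0:
                dr, dc = r, c
    quad = 2 * ((dr - r1) // half) + ((dc - c1) // half)
    counter[0] += 1
    tile = counter[0]
    if size == 2:
        # base case: the three free squares form one tromino
        for r in range(r1, r2 + 1):
            for c in range(c1, c2 + 1):
                if board[r][c] == 0:
                    board[r][c] = tile
        return
    # mark the three center squares lying outside the defective
    # quadrant: these are the pseudo-defects (and the final tromino)
    centers = [(r1 + half - 1, c1 + half - 1), (r1 + half - 1, c1 + half),
               (r1 + half, c1 + half - 1), (r1 + half, c1 + half)]
    for i, (r, c) in enumerate(centers):
        if i != quad:
            board[r][c] = tile
    # recursively solve the four quadrants
    solve(board, r1, c1, r1 + half - 1, c1 + half - 1, counter)
    solve(board, r1, c1 + half, r1 + half - 1, c2, counter)
    solve(board, r1 + half, c1, r2, c1 + half - 1, counter)
    solve(board, r1 + half, c1 + half, r2, c2, counter)
```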


&lt;h4&gt;
  
  
  The Parent Tiling Function
&lt;/h4&gt;

&lt;p&gt;This just makes the interface cleaner since we technically have only two independent arguments: the board and the parameter k (well, k can be calculated or used as a global variable, but that’s up to the programmer).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Visualizing
&lt;/h3&gt;

&lt;p&gt;I made use of a simple &lt;strong&gt;seaborn heat-map&lt;/strong&gt; to display the board. The drawboard() method creates the heat-map.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
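&lt;p&gt;Roughly, the drawing step looks like this (a sketch assuming seaborn and matplotlib are available; drawboard matches the name used above):&lt;/p&gt;

```python
def drawboard(board):
    # Render the board as a seaborn heat-map; defective (-1)
    # squares are masked out of the annotation pass.
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns

    mask = np.array([[cell == -1 for cell in row] for row in board])
    # first call: colours only, no labels
    sns.heatmap(board, cbar=False, annot=False)
    # second call: annotate everything except the masked defects
    sns.heatmap(board, cbar=False, annot=True, mask=mask)
    plt.show()
```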


&lt;p&gt;The two heat-map calls are to more easily distinguish between defective and non-defective squares:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the first one creates the heat-map without any labels or masks.&lt;/li&gt;
&lt;li&gt;the second heat-map has a mask which hides the defective squares, but annotates the rest, allowing us to distinguish each tromino by number while leaving defective squares blank.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The result
&lt;/h3&gt;

&lt;p&gt;I’ve dragged this post on long enough. To wrap up, we call the relevant functions in order, viewing each of the board’s states along the way.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Su-m5OcZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/765/0%2AMxZqUAWFbqnmn4Ja.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Su-m5OcZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/765/0%2AMxZqUAWFbqnmn4Ja.gif" alt=""&gt;&lt;/a&gt;A GIF of the board as the algorithm runs on it&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="https://polaris000.github.io/blog/defective_chessboard"&gt;&lt;em&gt;https://polaris000.github.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on January 11, 2020.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>programming</category>
      <category>tutorial</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>A Guide to the Google Summer of Code</title>
      <dc:creator>Aniruddha Karajgi</dc:creator>
      <pubDate>Fri, 03 Jan 2020 02:59:30 +0000</pubDate>
      <link>https://dev.to/polaris000/a-guide-to-the-google-summer-of-code-2334</link>
      <guid>https://dev.to/polaris000/a-guide-to-the-google-summer-of-code-2334</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dM14JS7M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AAZ9DtM3WofuoNXNoiKP8kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dM14JS7M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AAZ9DtM3WofuoNXNoiKP8kw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Over the last few days, I’ve received several messages and emails on the subject and after responding to some, I decided to compile everything into a single post.&lt;/p&gt;

&lt;h3&gt;
  
  
  A short note on the program
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://summerofcode.withgoogle.com/"&gt;Google Summer of Code&lt;/a&gt; (GSoC, for short) is a program whose primary goal is to boost interest in open-source. The program targets college and university students and gives them the opportunity to contribute to open-source organizations of their choice over the summer. Potential candidates are required to write a proposal detailing the work they would be doing along with a timeline with specific deadlines for each sub-task. The coding period is the main part of the program and is when you work with your mentors. It is divided into three sections by periods of evaluation. This is where your mentor provides feedback and evaluates your work in that period.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1hS1K_BJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/379/1%2AMgmLnxx7ZgrMxF4klbHLqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1hS1K_BJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/379/1%2AMgmLnxx7ZgrMxF4klbHLqw.png" alt="The GSoC 2020 TimeLine"&gt;&lt;/a&gt;The GSoC 2020 timeline&lt;/p&gt;

&lt;h3&gt;
  
  
  The myth of the ideal candidate
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Goals
&lt;/h4&gt;

&lt;p&gt;The first thing I would like to say is that GSoC is not a competitive examination. It is an opportunity for you to play a role in open-source software development, implying that you would benefit a lot more from this program if you are actually keen on contributing.&lt;/p&gt;

&lt;p&gt;That’s why it is often said that choosing a project that you actually use in your life is a pretty good idea. It would be much harder, and oftentimes less rewarding, to finish the Summer of Code program successfully if you are just doing it to be more “successful”. Personally, did finishing the program open more opportunities for me? Probably. Would I have been less ‘successful’ if I hadn’t participated? Maybe not.&lt;/p&gt;

&lt;h4&gt;
  
  
  Skills
&lt;/h4&gt;

&lt;p&gt;A common question I get asked is: &lt;em&gt;Am I good enough?&lt;/em&gt; Though skills are an important part of your application, I would say that having less-than-stellar skills is not a deal breaker. If you know how to code in the language your project uses, or if you have a decent idea of a particular framework, you should not have any difficulty getting selected. Of course, it is imperative that you spend a little more time honing your skills than someone with “ideal” skills would.&lt;/p&gt;

&lt;h3&gt;
  
  
  Shortlisting organizations
&lt;/h3&gt;

&lt;p&gt;This is an important step of the process. It is necessary to be realistic but, at the same time, a little optimistic. It would probably be too hard to be a candidate for, say, the Git project if you have no idea how version control works or no knowledge of the C programming language. That’s not to say it’s impossible, just that most people would be doing this alongside their regular course loads, and it would certainly be more difficult to balance both.&lt;/p&gt;

&lt;p&gt;One often gets the advice: “&lt;em&gt;Just pick an organization and start contributing…&lt;/em&gt;” It’s apparent from the vague nature of this response that starting out with GSoC isn’t that simple. I suggest going through previous years’ projects and organizations (present in the &lt;a href="https://summerofcode.withgoogle.com/archive/"&gt;archives&lt;/a&gt;) and focusing on the ones that use the languages and frameworks you are interested in. Shortlist about two to four organizations that interest you: their projects may seem cool to work on, and your skills may align with them more closely than with other organizations.&lt;/p&gt;

&lt;p&gt;Make sure that your selected organizations have participated in at least the past year’s program. Though it is very rare for a regularly participating organization to not be selected for the next year’s program, it is still possible. Unfortunately, not much can be done about this, except to keep your options open: hence the advice of shortlisting multiple organizations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contributing
&lt;/h3&gt;

&lt;p&gt;Once you are done with the shortlisting step, you should ideally end up with a set of organizations that you are happy to work with. Now, I’d suggest introducing yourself on their communication channels (generally mailing lists or Slack): talk about who you are, why you are interested, which particular area of the project you are interested in, your skills, and so on. Make sure to go through any contribution and newcomer guidelines prior to this. It does not look too good if you ask them where to start when there is an entire section with that exact title on their website.&lt;/p&gt;

&lt;p&gt;Don’t be afraid to ask if you’re unsure of how to start (but only if it is not clearly mentioned anywhere, though it usually is); you’ll save everyone some time. And time is likely the most important factor in your selection. If there is no obvious place to start contributing (again, there usually is), just ask. They’ll like your enthusiasm. You could also suggest to the maintainers that they add some contributing or newcomer guidelines. Maybe this could be your first issue.&lt;/p&gt;

&lt;p&gt;Communication is an important aspect of your journey — you’ll spend hours in contact with your mentor(s). Being polite, enthusiastic and giving prompt replies to emails or messages gets you closer to being the ideal candidate.&lt;/p&gt;

&lt;p&gt;Around February, start thinking about which organizations you are going to move forward with. And that’s it: just stay active, show your skills and enthusiasm and with a little luck, you should see yourself as a participant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Announcing the Participating Organizations
&lt;/h3&gt;

&lt;p&gt;Though the GSoC timeline varies a little (have a look at the &lt;a href="https://summerofcode.withgoogle.com/how-it-works/#timeline"&gt;timeline&lt;/a&gt; on the official website), you should expect participating organizations to be announced in the second half of February. As mentioned earlier, almost every organization continues participating year after year. This is probably a good time to finalize which organization you’ll actually move forward with, so that you have enough time to dedicate your complete attention to it and, more importantly, your potential mentors notice your enthusiasm.&lt;/p&gt;

&lt;h3&gt;
  
  
  The proposal
&lt;/h3&gt;

&lt;p&gt;This was the most daunting part of my journey, and I’m sure some would agree. The scary part is probably the fact that you have to come up with it all by yourself, and it seems impossible to get help (not 100% true). Until now, your journey was a little more mechanical: you went with the flow. Personally, I feel that the difference between a good proposal and a great one is a deeper understanding of the project and of one’s own abilities.&lt;/p&gt;

&lt;p&gt;For the uninitiated, your proposal is basically your ticket to the program. If you ace this part, your chances of selection are greatly increased. Your proposal outlines what you, as a candidate, will do and when. This means that you should have a timeline in mind, along with clear-cut goals. Ideally, there should be some big goals, broken down week by week. Each coding period (there are three) can have its own goal.&lt;/p&gt;

&lt;p&gt;This is, of course, not the only way. There are several excellent posts out there that you could look at to fine-tune your proposal. Just make sure that you follow your organization’s proposal guidelines (formatting, structure, etc.) and deadlines, and most importantly, ask your future mentors for tips if they are willing (they most likely will be).&lt;/p&gt;

&lt;h3&gt;
  
  
  The last stretch
&lt;/h3&gt;

&lt;p&gt;There are a few weeks between submitting your proposal and the announcement of the selected candidates. Use this time to focus on the areas you’ll be working in over the summer. Make sure you are comfortable with tools like virtual environments, git, GitHub, Docker and the command line: whatever your project uses. Continue to contribute, all the while increasing the significance of your contributions. When the day arrives, you’ll thank yourself for your efforts.&lt;/p&gt;

&lt;h3&gt;
  
  
  In conclusion
&lt;/h3&gt;

&lt;p&gt;Just remember that even if you are not chosen, it may not be only about you. As a mentor, you’d definitely want to go with the ‘best fit’, and that may not always be the ‘most skilled’. Many people don’t get into the program in their first attempt but are much stronger candidates in subsequent years. Be sure to ask for some feedback as to where you can improve the next year. Then again, there are bound to be some candidates who end up failing due to missed deadlines, so make sure you are organized and have everything on your calendar (I hope you use a calendar :)).&lt;/p&gt;

&lt;h3&gt;
  
  
  F.A.Q.s
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Am I skilled enough?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ans&lt;/strong&gt;. Yes. (addressed in this post)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;2. Which year in my undergraduate is the best time to participate in the program?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ans.&lt;/strong&gt; That depends on your commitments and responsibilities. With respect to that metric, the general order of years from best to worst is: 1 &amp;gt; 2 &amp;gt; 3 ~ 4. The obvious problem with this is that your experience and skills follow the opposite trend. The happy medium is probably the second year, but if you are skilled and interested enough, you could potentially do it in your first summer. Make sure you have enough time during the summer. In my opinion, the only thing worse than not getting selected is failing the program because you have too many things on your plate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;3. I have done the Andrew Ng machine learning course on Coursera. What else should I do to improve my chances of getting selected for GSoC?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ans.&lt;/strong&gt; If you want to pursue an ML project for GSoC, please work on as many personal projects as possible. Almost everyone has done that course, so you’ll need to build experience beyond it. Your project portfolio will show your potential mentors your style, skills and approach to problem solving.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;4. Do I need to know Linux?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ans.&lt;/strong&gt; A vast majority of open-source projects use Linux. It’s hard to give a definite answer, since the operating system you will be using is the one your project’s development happens on. I’d recommend learning to use Linux, at least the basics: the terminal, the directory structure, creating and using virtual environments, and installing packages and tools.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;5. Will I be able to get into the program if I start now?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ans&lt;/strong&gt;. Yes. But the earlier you start, the better (in my opinion).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;6. How do I start contributing?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Ans.&lt;/strong&gt; Most organizations have some issues reserved for beginners and new-comers. They are usually labeled “easy”, “beginner”, “low-hanging fruit”, or something like that.&lt;/p&gt;

&lt;p&gt;Remember that your first contribution need not be world-changing. It can be as simple as fixing a typo in their documentation. As you get comfortable with the project, you’ll naturally be able to solve more complex issues and bugs. Use the fact that you are new, and let the maintainers know if the “Getting Started” section needs any improvements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Though I have attempted to make this guide thorough but generic enough to apply to everyone, it may not be as exhaustive as you may like. Please put any suggestions and queries in the comments and I’ll add them to the article.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As I get more questions, they will be added to this post.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Good luck!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at&lt;/em&gt; &lt;a href="http://polaris000.github.io"&gt;&lt;em&gt;polaris000.github.io&lt;/em&gt;&lt;/a&gt; &lt;em&gt;on January 3, 2020.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>softwaredevelopment</category>
      <category>gsoc</category>
      <category>google</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
