DEV Community: Ravi Suresh Mashru

A Brief Introduction to Dynamic Programming

Ravi Suresh Mashru — Thu, 01 Jul 2021 04:29:16 +0000

Dynamic programming is a technique that can be used to solve a particular class of problems. Let's take a look at how to determine if you can use dynamic programming for a given problem, and the different approaches (top-down and bottom-up) that you can use.

When can you use dynamic programming?

To use dynamic programming, you need to be able to break down the problem into smaller subproblems. If you are sure you can do that, then you need to check if the problem has the following properties:

Overlapping subproblems
Optimal substructure

If a problem can be broken down into smaller subproblems and has these two properties, then dynamic programming is a good fit for solving it.

Let's dive deeper into what each of these means while trying to solve problem #70 - Climbing Stairs on LeetCode.

Here is what the problem statement says:

You are climbing a staircase. It takes n steps to reach the top.

Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?

Break down the problem

Let us first see if we can break down the problem into smaller subproblems. Let us say that the answer we want (the number of distinct ways we can climb a staircase with n steps) can be expressed as countDistinctPaths(n).

Now, we know that we can take either one or two steps at a time. If we take one step, we need to follow the same rules to climb the remaining n-1 steps. Similarly, if we climb two steps, we again need to follow the same rules to climb the remaining n-2 steps.

So, every way of climbing the staircase either starts with one step followed by one of the countDistinctPaths(n-1) paths for the remaining n-1 steps, or starts with two steps followed by one of the countDistinctPaths(n-2) paths for the remaining n-2 steps.

We can write the total number of paths we can take as follows:

countDistinctPaths(n) = countDistinctPaths(n-1) + countDistinctPaths(n-2)

As you can see, we've managed to break the problem down into smaller subproblems! If we have the answer for n-1 and n-2, then we can combine those answers to calculate the answer for n.

There's one more thing we need to think about though - how small can we keep making the problems?

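Well, the smallest staircase we can have is one with a single step. And there is only one way we can climb that step (since we can't take two steps in this case - because there's only a single step!) Also, if there is no staircase at all, there is exactly one way to "climb" it: take no steps at all. Counting the empty staircase as one way is what makes the recurrence give the right answer for two steps, which can be climbed in two ways (1 + 1 or 2).

So we can rewrite the problem as:

countDistinctPaths(0) = 1 (no staircase, so the only way is to take no steps)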

countDistinctPaths(1) = 1 (only one step, only one way to climb it)

For any value of n greater than 1, countDistinctPaths(n) = countDistinctPaths(n-1) + countDistinctPaths(n-2)

This kind of expression is also commonly known as a recurrence relation.

Overlapping subproblems

To check if there are overlapping subproblems, let us think about how many times we will call countDistinctPaths if we want the answer for, say, a staircase with 5 steps.

We know that the answer we're looking for is countDistinctPaths(5).

From our recurrence relation, we know that countDistinctPaths(5) = countDistinctPaths(4) + countDistinctPaths(3).

We can then use the recurrence relation to further break down countDistinctPaths(4) into countDistinctPaths(4) = countDistinctPaths(3) + countDistinctPaths(2).

Now if we put this back in the first expression, we get countDistinctPaths(5) = (countDistinctPaths(3) + countDistinctPaths(2)) + countDistinctPaths(3).

Similarly, we can replace countDistinctPaths(3) with countDistinctPaths(2) + countDistinctPaths(1).

The result is then countDistinctPaths(5) = (countDistinctPaths(3) + countDistinctPaths(2)) + (countDistinctPaths(2) + countDistinctPaths(1)).

If you look closely, you'll notice that we're computing countDistinctPaths(3) and countDistinctPaths(2) multiple times.

It's easier to see all the computations we have to do in the form of a tree:

Each node in this tree is a call to countDistinctPaths and the value in the node is the parameter we are passing to the function. As you can see, we're repeating the calls for 2 and 3. This tells us that the problem we are trying to solve has overlapping subproblems.
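To make the repetition concrete, here is a minimal sketch that tallies how often each argument shows up when the recurrence is expanded naively. The helper name countCalls is hypothetical and not part of the solution; it only mirrors the shape of the recursion.

```javascript
// Hypothetical helper: records how many times each value of n is requested
// when the recurrence is expanded without any caching.
function countCalls(n, calls = {}) {
  calls[n] = (calls[n] || 0) + 1;

  // Same branching as the recurrence: n depends on n-1 and n-2
  if (n > 1) {
    countCalls(n - 1, calls);
    countCalls(n - 2, calls);
  }

  return calls;
}

const calls = countCalls(5);
console.log(calls[3]); // 2 calls for n = 3
console.log(calls[2]); // 3 calls for n = 2
```

Every count above 1 is work that a plain recursive solution would repeat.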

Optimal substructure

If you can find the optimal solution to a problem using optimal solutions to its subproblems, then the problem is said to have an optimal substructure.

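What this means for us is that to find the solution for countDistinctPaths(n), we only need the solutions for countDistinctPaths(n-1) and countDistinctPaths(n-2). Climbing Stairs is a counting problem rather than an optimization problem, so there is nothing to maximize or minimize here: "optimal" simply means the exact number of paths for each subproblem. Since the exact answers to the subproblems combine into the exact answer for the original problem, this problem has an optimal substructure.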

I personally had a tough time understanding this property and found looking at examples of problems that DON'T have optimal substructure helped understand this better. You can find a list of such problems on the optimal substructure page on Wikipedia.

How can you apply dynamic programming to a problem?

Once you have verified that a problem can be solved using dynamic programming, there are two approaches you can use to solve it: the top-down approach or the bottom-up approach. Let's take a closer look at each of these.

The top-down approach

The top-down approach involves converting the recurrence relation we wrote to recursive function calls and then making some minor tweaks to prevent the repeated evaluation of the overlapping subproblems.

Let's start with a recursive implementation of our recurrence relation in JavaScript:

function countDistinctPaths(n) { 

  // If there are no stairs, the only way to "climb" is to take no steps
  if (n === 0) { 
    return 1; 
  } 

  // If there is only one stair 
  if (n === 1) { 
    return 1; 
  } 

  // The recurrence relation 
  return countDistinctPaths(n-1) + countDistinctPaths(n-2); 

}

As we saw in the tree above, this plain recursive implementation makes repeated calculations for the overlapping subproblems. These repeated calls mean we are spending time calculating the same values over and over again. And the number of repeated calls is much higher for larger values of n so we're wasting a lot of time!

With dynamic programming, we can compute the value of each subproblem just once and store it in memory - a technique called "memoization" (nope, not a typo, it's not memorization. See this Wikipedia page for how this term was coined). The next time we encounter the same subproblem and need to calculate the value, instead of using the recurrence relation, we can just retrieve the result from memory!

// We've added "memo"ry to store values we compute. It is empty in the beginning. 
function countDistinctPaths(n, memo={}) { 

  // If there are no stairs, the only way to "climb" is to take no steps
  if (n === 0) { 
    return 1; 
  } 

  // If there is only one stair 
  if (n === 1) { 
    return 1; 
  } 

  // Check if we have already computed this before 
  if (n in memo) { 
    return memo[n]; 
  } 

  // The recurrence relation with two minor tweaks: 
  // 1. We store the computed value in memo[n] for future use 
  // 2. We pass the memory object to the recursive calls 
  return memo[n] = countDistinctPaths(n-1, memo) + countDistinctPaths(n-2, memo); 

}

This approach is called top-down because we start with the biggest problem first, countDistinctPaths(n) and keep recursively breaking it down into smaller problems until we reach the smallest possible subproblems - countDistinctPaths(0) and countDistinctPaths(1).

The bottom-up approach

With the bottom up approach, we start from the smallest subproblems and iteratively combine them until we find a solution to the original problem.

For us, this means we start with countDistinctPaths(0) and countDistinctPaths(1), to which we know the answer, and then combine them to get the answer to countDistinctPaths(2) and then countDistinctPaths(3) and so on until we find the answer to countDistinctPaths(n).

We need a way to store the value of countDistinctPaths(n) for each value of n. This storage is commonly known as a "table" and this bottom-up approach is also commonly called "tabulation".

As you can see, there is no recursion involved here. Just good old iteration!

function countDistinctPaths(n) { 

  // Our "table" to store the result for each value of n
  const table = new Array(n + 1); 

  // The case with no stairs: one way (take no steps)
  table[0] = 1; 

  // The case with one stair 
  table[1] = 1; 

  // We keep combining subproblems until we find a solution to the original problem 
  for (let i = 2; i <= n; i++) { 
    table[i] = table[i-1] + table[i-2]; 
  } 

  return table[n]; 
}

And that's it! Once the loop finishes, we'll have the result we need at index n of the table.
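Since each table entry only depends on the two entries before it, the table can be shrunk to two variables. The sketch below is one possible variation, not the only way to do it. The function name countDistinctPathsConstantSpace is made up for illustration, and it assumes that an empty staircase counts as one way (take no steps), so that a two-step staircase gets the expected answer of 2.

```javascript
// Bottom-up variant that keeps only the last two results instead of a full table.
// Assumption: ways(0) = 1 (the empty climb) and ways(1) = 1.
function countDistinctPathsConstantSpace(n) {
  let prev = 1; // ways to climb i-2 steps (starts as ways(0))
  let curr = 1; // ways to climb i-1 steps (starts as ways(1))

  for (let i = 2; i <= n; i++) {
    // Slide the two-value window forward by one step
    [prev, curr] = [curr, prev + curr];
  }

  return curr;
}

console.log(countDistinctPathsConstantSpace(2)); // 2
console.log(countDistinctPathsConstantSpace(5)); // 8
```

The running time is still proportional to n, but the extra memory no longer grows with n.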

How can you learn more about dynamic programming?

Like the title of this post says, this was just a brief introduction to dynamic programming. The example I chose to work with was quite simple. There will be more complicated questions in which either the recurrence relation is not very obvious, or you need a two or even three dimensional table. The only way to get used to identifying whether you can use dynamic programming to solve a particular problem and, if so, how to break down the problem and identify the key components, is practice.

To practice problems, I highly recommend LeetCode. Here's a list of all the dynamic programming problems on LeetCode. LeetCode has great quality questions for every topic. But my personal favorite feature on LeetCode is the Discussions tab in each problem. After you've solved the problem you can see and discuss how others are solving the same problem. Learning from other, more experienced programmers has always worked great for me.

So go ahead and start solving problems! Even better, make a challenge out of it! I'm currently doing a "100 Days of Code" challenge (which I log in this repo on GitHub) where I try to spend at least an hour every day solving problems on LeetCode or other similar sites.

Understanding the object-fit CSS property

Ravi Suresh Mashru — Sat, 26 Jun 2021 11:14:52 +0000

The object-fit property determines how the content of a replaced element is resized to fit inside its container on a web page.

What is a replaced element?

A replaced element is an element whose contents are not affected by CSS in the current document.

An example of a replaced element is the <img> tag. We can specify the position of the image, but we can't actually influence the contents of the image displayed inside the <img> tag using CSS.

Using object-fit with `<img>`

Let's look at a few examples of how we can use the object-fit property to influence how an image is resized to fit inside its container element (i.e. the img tag).

The behaviour we see is applicable to all other replaced elements as well. Some other examples of replaced elements apart from <img> are: <embed>, <iframe> and <video>. This page on MDN has more information about replaced elements.

Consider the following two images of dimensions 100x150 and 250x300 respectively:

We will place both images in <img> tags with a width and height of 200px and see how the various object-fit property values affect how the images are resized. I have given the <img> tag a gray border so that its edges are easy to identify.

Note: the object-fit CSS property is used on the <img> tag.

`object-fit: fill`

This is the default value of the object-fit property. With this value, the image inside an <img> tag will be resized to the size of the container (i.e. the <img> tag).

As you can see, if the aspect ratio of the image and the <img> tag aren't the same, the image will be stretched.

`object-fit: contain`

With this value, the image will be resized with its aspect ratio maintained so that the entire image fits within the container.

As you can see, if the aspect ratio of the image is different from that of the <img> tag, then there will be a portion of the container that doesn't contain the image.

`object-fit: cover`

The image is resized with its aspect ratio maintained for this value as well. However, unlike with object-fit: contain, the image is resized so that it doesn't leave any empty space in the container.

If the aspect ratio of the image is different from that of the container, then there will be parts of the image that are clipped off.

`object-fit: none`

With this value, the image is not resized at all and maintains its original dimensions. If the image is smaller than the container, the entire image is displayed with its original size. If the image is bigger than the container, it is clipped off.

`object-fit: scale-down`

This value behaves as if the object-fit property has either the value none or the value contain depending on which one results in a smaller image.

This means that images smaller than the container size remain the same size (like with object-fit: none) and images that are bigger than the container size are resized to fit the container (like with object-fit: contain).
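The resizing rules described above can be summarized as a small calculation. The following is only a sketch of the arithmetic, not how a browser is implemented; the function name renderedSize and its parameters are hypothetical. It returns the size the image is drawn at, before any clipping by the container.

```javascript
// Hypothetical helper: computes the drawn size of an image inside a box
// for each object-fit value, following the descriptions in this post.
function renderedSize(imageWidth, imageHeight, boxWidth, boxHeight, fit) {
  const scaleX = boxWidth / imageWidth;
  const scaleY = boxHeight / imageHeight;
  const scaled = (s) => ({ width: imageWidth * s, height: imageHeight * s });

  switch (fit) {
    case "fill":
      // Stretch to the box, ignoring the aspect ratio
      return { width: boxWidth, height: boxHeight };
    case "contain":
      // Largest scale at which the whole image still fits
      return scaled(Math.min(scaleX, scaleY));
    case "cover":
      // Smallest scale at which the box is completely covered
      return scaled(Math.max(scaleX, scaleY));
    case "none":
      // Original dimensions
      return scaled(1);
    case "scale-down":
      // Whichever of "none" and "contain" gives the smaller image
      return scaled(Math.min(1, scaleX, scaleY));
    default:
      throw new Error(`Unknown object-fit value: ${fit}`);
  }
}

// The 100x150 image in a 200x200 <img> tag
console.log(renderedSize(100, 150, 200, 200, "cover"));
console.log(renderedSize(100, 150, 200, 200, "scale-down"));
```

For cover and none, any part of the drawn image that falls outside the 200x200 box is what gets clipped.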

An Introduction to Hash Tables

Ravi Suresh Mashru — Fri, 04 Jun 2021 06:03:59 +0000

Before diving into hash tables, let us quickly review two other data structures: arrays and linked lists.

Arrays

An array is used to store elements of the same type in contiguous memory locations. Since contiguous memory locations need to be reserved, you need to specify the size of the array upfront (when creating the array). Any element in the array can then be accessed in constant time since we know the exact memory location of each element in the array.

Since contiguous memory locations are used, we know exactly where to find the element at index 2 given the memory address at which the array starts.

Linked Lists

A linked list is also used to store a collection of elements. However, each element in the list is stored at a different location in memory. Each element in the list has a link pointing to the next element in the list.

Starting from the first element in the list (also known as the head of the list), we can follow this chain of links pointing to the next element in the list to sequentially access every item in the list. The last item in the list doesn't have a link pointing to the next element. That's how we know that we have reached the end of the list.

Since the items in a linked list can be stored anywhere in memory, we do not need to specify upfront what the size of the list should be. The linked list grows and shrinks dynamically as elements are added to and removed from it. However, we lose the ability to directly access elements at a given index in the list. Instead, we have to start at the head of the list and follow the links to the next item until we arrive at the index we want.

Hash Tables - the middle ground

Arrays give us the advantage of quick random access, but require us to define a size in advance. On the other hand, linked lists allow us to add and remove items without the need to specify a size. However, they take away our ability to access an item at a given index in constant time.

Hash tables give us the best of both worlds. They allow us to add, look up and remove elements in constant time¹, and they also allow us to add and remove elements without having to specify a list size upfront.

The Hash Function

When storing an element in a hash table, it is first passed through a hash function. The output of the hash function is the location in the hash table where the element should be stored.

Since we need to look up the element from the table later, we need to ensure that when identical elements are passed through the hash function, it produces the same value.

However, it is possible that when different elements are passed through the hash function, it produces the same value. When this happens, we call it a collision.

A good hash function minimizes the chance of collision, but collisions cannot be eliminated entirely.
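As an illustration, here is a minimal sketch of a hash function for string elements. The name hashString and the multiplier 31 are assumptions chosen for the example, not something a particular library prescribes. It always returns the same index for the same string, yet different strings can still land on the same index.

```javascript
// Hypothetical hash function: maps a string to an index in [0, tableSize).
function hashString(element, tableSize) {
  let hash = 0;
  for (const ch of element) {
    // Mix in every character so the whole element influences the result
    hash = (hash * 31 + ch.codePointAt(0)) % tableSize;
  }
  return hash;
}

console.log(hashString("a", 8)); // 1
console.log(hashString("i", 8)); // 1 (a collision with "a")
```

With only 8 slots, collisions like this one are unavoidable as soon as more than 8 distinct elements are hashed.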

Handling Collisions

There are two main ways of dealing with collisions: linear probing and chaining.

Linear Probing

When using linear probing, when there is a collision, we just keep looking at the next available location in the table until we find a free slot (or realize that the table is full and we can't store any more elements²).

When using this approach, we need to adjust the way we look up values from the hash table as well. If we hash an element and don't find it at the index specified by the hash function, then there is still a possibility that it was stored at the next available free slot in the table. We therefore have to look at other locations in the table, one at a time, starting from the index specified by the hash function, to ensure that the element really isn't stored in the hash table. That sounds very similar to traversing all elements in a linked list!³
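A minimal sketch of linear probing for a set of strings could look like this. The class name LinearProbingSet, its hash method and the fixed capacity are all assumptions made for the example. Deleting elements is left out on purpose, because simply emptying a slot would break the lookup of elements stored after it.

```javascript
// Sketch of a fixed-size hash table that resolves collisions with linear probing.
class LinearProbingSet {
  constructor(capacity = 8) {
    this.slots = new Array(capacity).fill(undefined); // undefined marks a free slot
  }

  // Hypothetical hash function for string elements
  hash(element) {
    let h = 0;
    for (const ch of element) {
      h = (h * 31 + ch.codePointAt(0)) % this.slots.length;
    }
    return h;
  }

  // Returns true if the element is stored, false if the table is full
  add(element) {
    const n = this.slots.length;
    const start = this.hash(element);
    for (let step = 0; step < n; step++) {
      const i = (start + step) % n; // wrap around at the end of the table
      if (this.slots[i] === element) return true; // already stored
      if (this.slots[i] === undefined) {
        this.slots[i] = element;
        return true;
      }
    }
    return false;
  }

  has(element) {
    const n = this.slots.length;
    const start = this.hash(element);
    for (let step = 0; step < n; step++) {
      const i = (start + step) % n;
      if (this.slots[i] === undefined) return false; // a free slot ends the search
      if (this.slots[i] === element) return true;
    }
    return false; // probed the whole table
  }
}

const probingSet = new LinearProbingSet(8);
probingSet.add("a"); // hashes to index 1
probingSet.add("i"); // also hashes to index 1, so it is stored at index 2
console.log(probingSet.has("i")); // true
```

The lookup stops at the first free slot it meets, which is why it only scans the whole table in the worst case.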

Chaining

The other way we can deal with collisions is by storing all elements that get mapped to the same index in a linked list within the hash table.

We are now not restricted by the size of the hash table when storing elements, since the elements themselves are stored in a linked list outside the hash table. The hash table just contains links to the head of the various linked lists.

However, if we have too many collisions and a lot of elements get mapped to the same index then the linked list for that index grows. We then have to traverse the entire linked list to find the element we are looking for. This is why it is important that we use a hash function that minimizes collisions.
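Chaining can be sketched in a similar way. Again, the class name ChainedHashSet and its hash method are hypothetical. Each slot of the table holds the head of a linked list (or null), and new elements are added at the head of the list.

```javascript
// Sketch of a hash table that resolves collisions by chaining.
class ChainedHashSet {
  constructor(bucketCount = 8) {
    // Each bucket is the head node of a linked list, or null if it is empty
    this.buckets = new Array(bucketCount).fill(null);
  }

  // Hypothetical hash function for string elements
  hash(element) {
    let h = 0;
    for (const ch of element) {
      h = (h * 31 + ch.codePointAt(0)) % this.buckets.length;
    }
    return h;
  }

  has(element) {
    // Walk the linked list stored at the element's index
    let node = this.buckets[this.hash(element)];
    while (node !== null) {
      if (node.value === element) return true;
      node = node.next;
    }
    return false;
  }

  add(element) {
    // This check avoids duplicates; skipping it would make the insert itself
    // a constant-time operation, at the cost of storing repeated elements.
    if (this.has(element)) return;
    const i = this.hash(element);
    this.buckets[i] = { value: element, next: this.buckets[i] };
  }
}

const chainedSet = new ChainedHashSet(8);
["a", "i", "q"].forEach((element) => chainedSet.add(element)); // all map to index 1
console.log(chainedSet.has("q")); // true
console.log(chainedSet.has("b")); // false
```

All three elements end up in one list here, which is exactly the situation where lookups start to slow down.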

Choosing a Good Hash Function

A good hash function minimizes the number of collisions by trying to uniformly distribute values over the entire hash table. It also uses all the information available in the element to calculate the hash value, and tries to map similar elements to different locations in a hash table. Finally, a hash function should be very fast to compute since we use it every time we insert into or look up a value in a hash table.

Complexity Analysis

Assuming there are no collisions, it takes constant time to insert into and look up values from a hash table. The actual time taken for these operations depends on how fast the hash function is. However, when there are collisions, the hash table performs slightly worse.

When using chaining, it still takes a constant time to add a new element to the hash table since adding a new element at the beginning of a linked list can be done in constant time. When using linear probing, however, we may take up to O(n) time (where n is the size of the hash table) in the worst case where the hash table is full and we probe the entire table looking for a free slot.

When looking up values in a hash table that has collisions, we can take up to O(n) time (where n is the number of elements stored in the hash table), since in the worst case all elements are mapped to the same index and we have to traverse the entire list of elements to find the one we are looking for. This applies to deleting elements from a hash table as well, since to delete an element we first need to find it.

We can use a balanced search tree instead of a linked list while chaining to reduce the worst case time complexity to O(log n).

Conclusion

Although hash tables provide a good balance between an array and a linked list, they are by no means a silver bullet.

Since their worst case performance is O(n), if you know how many elements you need to store upfront, you would be better off using an array which guarantees adding and updating elements in constant time.

Also, if you care about the order of items in the list and don't know how many items you need to store upfront, then you need to use a linked list since a hash table does not store this information.

Resources

¹ The operations take constant time on average, and perform worse in the worst case. See the complexity analysis section for more details.
² When this happens, we can tell the user that the hash table is full, but then we wouldn't be doing any better than an array. Instead, we can dynamically increase the size of the hash table once it is full and redistribute the existing items and make space for new items.
³ The worst case time complexity using this approach is in fact O(n) - the same as that of finding a value in a linked list.

Git: Merge vs Rebase

Ravi Suresh Mashru — Thu, 11 Mar 2021 12:23:26 +0000

I personally struggled to understand the difference between merging and rebasing in Git, especially since they both did what I wanted in the end (bring changes in another branch into my current branch), until I visualized what exactly happens in both processes.

To compare the two, let us assume that the main branch of your repository is where all the code is kept up-to-date.

Each circle in the sequence represents a commit. An arrow goes from a parent commit to a child commit (from a previous commit to the next commit).

To work on a new feature, you check out a new branch cool-feature and make a few commits.

In the meantime, some of your team members finished a feature they were working on and merged their commits to the main branch.

You now need to bring the new changes in the main branch into your cool-feature branch.

Git Merge

If you want to merge the changes into your branch, you would run the command git merge main while still on your cool-feature branch.

This will combine the changes made on the main branch since the most recent commit that both branches have in common with the changes on your cool-feature branch. A new merge commit will be created in your cool-feature branch that contains all the latest changes from the main branch.

If both branches have changes to the same lines, then Git cannot determine which of the two changes to keep (the one in the main branch or the one in the cool-feature branch). This results in what Git calls a conflict that the user has to resolve. After making necessary changes in the file that has a conflict (i.e. deciding whether to keep the changes from the main branch, the cool-feature branch or a mix of both) the user can commit these changes as a part of the merge commit created.

The merge commit has two parent commits: the latest commits on both the branches.

The algorithm Git uses to merge changes this way is called the 3-way merge algorithm because it uses three commits to generate the merge commit:

The latest commit on the main branch.
The latest commit on the cool-feature branch.
The latest commit that is common in both branches.

Git Rebase

To perform a rebase, you would run the command git rebase main while on your cool-feature branch.

This will replay each commit in your cool-feature branch on top of the latest state of the main branch. For each commit in your cool-feature branch, a new commit will be created which contains the same changes as the initial commit.

In case there are conflicts when applying each commit, Git will stop rebasing and notify you so that you can resolve them. The changes you make to resolve the conflicts will be added to the new commit.

Git also gives you the option of performing an interactive rebase where you can decide what to do with each commit that will be reapplied on top of the main branch.

Which One Should You Use?

The first question I asked myself after understanding the difference between the two was: which one should I use?

Not surprisingly, like most things in software engineering, the answer turned out to be it depends. Merging and rebasing give you two completely different views of your repository's commit history.

The downside of using merge is the creation of merge commits that can clutter the commit history of your repository. In a project where a lot of changes are actively being added to the main branch, your feature branch will have a new merge commit in it every time you merge it with the main branch. This can make it difficult to clearly see the actual commits you are making to implement the feature.

On the other hand, using rebase is not as straightforward as using merge. Since rebasing rewrites history (by dropping old commits and creating new ones), rebasing a branch that you have previously pushed (e.g. to GitHub) and others have started using can cause problems. You will now have a different commit history after the rebase from what others accessing the same branch have. If they do not pull this new rewritten history, they will be working on an outdated copy of the branch. This might become a problem when they try to push their changes to a branch with an outdated history.

Personally, if I am working on a feature branch that I have not yet pushed upstream (and as a result no one would have checked it out), I use rebase to avoid the extra merge commit. However, if the branch I am working on has already been made public, I avoid rewriting history with a rebase and live with the extra merge commit.

An Introduction to Character Encodings

Ravi Suresh Mashru — Wed, 06 Jan 2021 11:52:04 +0000

In this post, we'll take a look at the following:

What character encodings are and why we need them.
A few common character encoding formats (e.g. ASCII, The ISO 8859 Family).
How Unicode and UTF-8 solved the problem of encoding characters from all the different languages in the world (including emojis! 😃).

Letters, words and sentences are all human constructs created to communicate. Computers, however, understand only the language of binary - 0s and 1s. By understanding character encodings, we can understand how computers store all the text that we see on our digital devices - tweets, Facebook posts, and even this blog post!

All language can be broken down into a sequence of characters. Different encodings store these characters in different ways. To keep things simple in the beginning, let us assume that we are interested in only letters of the English alphabet (lowercase and uppercase), the 10 digits (0 to 9), and a few special symbols (e.g. +, -, ?, *).

ASCII

In ASCII (American Standard Code for Information Interchange), each character is stored as a sequence of 7 bits. Each bit can be either 0 or 1. Therefore, there are $2^{7} = 128$ possible characters that can be represented using ASCII. This collection of 128 characters is called the ASCII character set.

Most computers deal with memory in chunks of 8 bits (also known as a byte). In this case, the left-most bit is left unused (kept with a value of 0) and the 7 bits on the right-hand side are used to represent the character.

For example, the character "A" has a value of 65 (in decimal). This means it would be stored in memory as: 01000001.

Since writing in binary can be space and time-consuming, we can use the hexadecimal equivalent instead. So the character "A" can be represented as 41.
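These conversions are easy to check in JavaScript. The helper below is just a sketch, and the name asciiRepresentations is made up for this example; it relies on the built-in charCodeAt, toString and padStart methods.

```javascript
// Hypothetical helper: shows the decimal, binary and hexadecimal forms
// of a single ASCII character.
function asciiRepresentations(char) {
  const code = char.charCodeAt(0);
  if (code > 127) {
    throw new RangeError(`"${char}" is not in the ASCII character set`);
  }
  return {
    decimal: code,
    // Pad to 8 bits: the left-most bit of the byte stays 0
    binary: code.toString(2).padStart(8, "0"),
    hexadecimal: code.toString(16).toUpperCase().padStart(2, "0"),
  };
}

console.log(asciiRepresentations("A"));
// { decimal: 65, binary: '01000001', hexadecimal: '41' }
```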

Exercise 1: Decode the following hexadecimal ASCII encoded text: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21
(Solution)

Exercise 2: Encode the following text using ASCII (in hexadecimal): That's All Folks!
(Solution)

You can find a list of how each character is represented in ASCII here.

Since ASCII can be used to represent only 128 characters, it isn't enough for all the different characters in various languages.

An attempt to fix this issue was to make the left-most bit that is unused in ASCII do something. This gave birth to ISO 8859.

The ISO 8859 Family

The ISO 8859 family is a series of 15 different standards, each of which is a superset of ASCII.

In each of these standards, when the left-most bit is 0, the remaining bits represent ASCII characters as usual.

When the left-most bit is 1, each of the standards uses the remaining 7 bits to represent 128 new characters.

You can find a list of all the characters represented by each of the standards when the left-most bit is 1 here.

Although this doubled the number of characters that could be represented, each standard assigned different characters to the same 128 extra values. If you saved data on a computer that used one standard and read it on a computer that used another, the ASCII characters would be the same, but the extra 128 characters would be displayed differently, so you might not be able to make sense of the message.

Also, a total of 256 characters was still not enough for all the languages in the world that collectively have thousands of letters!

Unicode

The solution to the problem of representing thousands of different characters across different languages turned out to be: splitting character sets and character encodings.

A character set is a mapping of each character to a unique number. These numbers are called code points.

A character encoding is a mapping of these code points to actual bytes that are stored in a computer.

Unicode is a character set - it is a mapping of over 140,000 characters to a unique number. You can find a complete list of all characters and their corresponding numeric value on the Unicode homepage.

Unicode values are usually represented by "U+" followed by hexadecimal numbers, e.g. U+0033 is the Unicode number for the digit "3".

The way these code points are stored as bytes in a computer's memory depends on the character encoding used, e.g. UTF-8, UTF-16 and UTF-32.

UTF-8

UTF-8 is a variable-width encoding format. This means that, unlike in ASCII, the number of bytes used to represent a character is not fixed. Each character can be between 1 and 4 bytes long.

All code points between 0 - 127 are the same as ASCII and are also stored in a single byte with the left-most bit set to 0. Therefore, all valid ASCII is valid UTF-8.

Exercise 3: Encode the following text from exercise 2 using UTF-8 (in hexadecimal): That's All Folks!
(Solution)

All code points above 127 require multiple bytes to be encoded. The number of left-most 1s followed by a 0 in the first byte indicates how many bytes there are in the encoding, and every byte after the first starts with 10. For example, two-byte encodings have the format 110XXXXX 10XXXXXX.

Similarly, three-byte encodings would have the format 1110XXXX 10XXXXXX 10XXXXXX.

The Xs are the positions that can be used to store the actual encoding of the character in binary format. However, not all two-byte encodings that have this format are valid.

For example, consider the character "A" with the code point 65 (1000001 in binary). It may be tempting to encode it using two bytes as follows: 11000001 10000001.

This is an invalid encoding since all code points between 0 - 127 require a single byte.

The smallest code point that can be encoded using two bytes is 128. Therefore, 11000010 10000000 is the smallest valid UTF-8 two-byte encoding.

The biggest two-byte encoding is 11011111 10111111, which represents the code point 2047 (in decimal)/7FF (in hexadecimal).

The following table summarizes the range of code points that each multi-byte encoding can be used to represent.

Number of bytes	Smallest code point (decimal/hexadecimal)	Largest code point (decimal/hexadecimal)
1	0 / 00	127 / 7F
2	128 / 80	2047 / 7FF
3	2048 / 800	65535 / FFFF
4	65536 / 10000	1114111 / 10FFFF

When trying to encode a character using UTF-8, you need to:

Determine the code point value of the character.
Use the table above to determine how many bytes are required to encode the character.
Convert the code point value of the character to binary.
Place the bits in the binary representation in the right places in the multi-byte encoding.

Let's say we want to encode the Greek capital letter delta (Δ) using UTF-8. We can use this site to find the Unicode value of this character - U+0394. This means the code point value of this character is 394 (in hexadecimal).

Since 394 falls between 80 and 7FF, we need 2 bytes to encode this character. The two bytes will have the format 110XXXXX 10XXXXXX, where the Xs will be replaced by the binary value of the code point.

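The hexadecimal value 394 in binary is: 1110010100. This can now be placed in the two bytes as follows: 11001110 10010100 (CE 94 in hexadecimal).

When placing these bits, start from the right-hand side. Put 0 in all the additional Xs on the left-hand side. Here, the 10 bits fill the six Xs of the second byte and four of the five Xs of the first byte, so the one remaining X is set to 0.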
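The encoding steps can be sketched as a function that takes a code point and returns its UTF-8 bytes. This is a simplified illustration: the name encodeCodePoint is hypothetical, and in practice you would normally rely on a built-in such as TextEncoder instead of writing this by hand.

```javascript
// Hypothetical helper: encodes a single Unicode code point as UTF-8 bytes.
function encodeCodePoint(codePoint) {
  const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff || isSurrogate) {
    throw new RangeError(`Invalid code point: ${codePoint}`);
  }

  // 1 byte: 0XXXXXXX
  if (codePoint <= 0x7f) {
    return [codePoint];
  }
  // 2 bytes: 110XXXXX 10XXXXXX
  if (codePoint <= 0x7ff) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  // 3 bytes: 1110XXXX 10XXXXXX 10XXXXXX
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // 4 bytes: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Format the bytes as hexadecimal for display
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeCodePoint(0x0394))); // CE 94 (the letter delta)
```

The shifts and masks do the same job as filling the Xs from the right-hand side: each & 0x3f picks out six bits, and the | adds the fixed prefix of the byte.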

Exercise 4: Encode the following emoji using UTF-8: 😃
(Solution)

Decoding a sequence of bytes encoded using UTF-8 is a two-stage process. Since a character may be encoded using multiple bytes, we first need to group the bytes that are part of a multi-byte encoding together. Then we can convert each byte or group of bytes into the character it represents.

For example, consider the following byte sequence: 72 C3 A9 73 75 6D C3 A9.

It would be easier to see which bytes are part of a multi-byte sequence if we convert to binary.

01110010 11000011 10101001 01110011 01110101 01101101 11000011 10101001

These bytes can be divided as follows:

The first byte starts with 0 and therefore represents a character on its own (from the ASCII character set).
The second byte starts with 110 and is therefore the first byte of a 2-byte sequence made up of the 2nd and 3rd byte.
The 4th, 5th and 6th bytes all start with 0 and each represent characters from the ASCII character set.
The 7th byte starts with 110 and is also the first of a 2-byte sequence (the 7th and 8th).

01110010 | 11000011 10101001 | 01110011 | 01110101 | 01101101 | 11000011 10101001

Using the ASCII table, we can see that the first byte represents the character "r".
In the following 2-byte sequence, the bits after the 110 and 10 prefixes hold the binary representation of the encoded character: 110 00011 10 101001. When extracted, they form the number 00011101001, or 11101001 once the leading zeros are dropped (E9 in hexadecimal). In Unicode, this code point is for the character "é".
Using the ASCII table again, we find the 4th, 5th and 6th bytes represent the characters "s", "u" and "m" respectively.
The 7th and 8th bytes are exactly the same as the 2nd and 3rd bytes, and represent the character "é".

Therefore, this byte sequence is an encoding of the word "résumé" in UTF-8.
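The two stages (grouping the bytes, then extracting the code point bits) can be sketched in Python like this. `utf8_decode` is a name invented for this example, and the sketch deliberately skips checks a real decoder performs, such as rejecting two-byte encodings of code points below 128.

```python
def utf8_decode(data: bytes) -> str:
    """Decode UTF-8 by hand (illustrative sketch, not a full validator)."""
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        # Stage 1: the prefix of the first byte tells us how many bytes to group.
        if first >> 7 == 0b0:
            n_cont, code_point = 0, first
        elif first >> 5 == 0b110:
            n_cont, code_point = 1, first & 0b00011111
        elif first >> 4 == 0b1110:
            n_cont, code_point = 2, first & 0b00001111
        elif first >> 3 == 0b11110:
            n_cont, code_point = 3, first & 0b00000111
        else:
            raise ValueError(f"byte {i} cannot start a sequence")

        group = data[i + 1 : i + 1 + n_cont]
        if len(group) != n_cont or any(b >> 6 != 0b10 for b in group):
            raise ValueError(f"bad continuation bytes after byte {i}")

        # Stage 2: append the 6 payload bits of each continuation byte.
        for b in group:
            code_point = (code_point << 6) | (b & 0b00111111)
        chars.append(chr(code_point))
        i += 1 + n_cont
    return "".join(chars)

print(utf8_decode(bytes.fromhex("72 C3 A9 73 75 6D C3 A9")))  # résumé
```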

Exercise 5: Decode the following bytes that have been encoded using UTF-8: 74 61 64 61 20 F0 9F 8E 89
(Solution)

Conclusion

Hopefully, the following things were made clear by reading this post:

ASCII is an encoding that requires 7 bits to represent each character. It can represent up to 128 characters.
Since memory deals with groups of 8 bits, the left-most bit is set to 0 in ASCII.
ISO 8859 is a family of encodings that uses the values where the left-most bit of a byte is 1 to make room for more characters, for a total of 256.
Unicode is not an encoding, but a mapping of characters to code points.
UTF-8 is one of the encodings that can be used to convert code points into sequences of bytes to be stored in a computer's memory. Other encodings include UTF-16 and UTF-32.

The Evolution of Recurrent Neural Networks

Ravi Suresh Mashru — Thu, 03 Dec 2020 13:21:58 +0000

In this post, we will take a look at the evolution of recurrent architectures of neural networks and why the concept of attention is so exciting.

After reading this post, you should be able to appreciate why the Transformer architecture, introduced in the paper Attention Is All You Need (from which the title of this post is also borrowed), uses only attention mechanisms to achieve previously unimaginable feats like OpenAI's GPT-3.

Recurrent Networks

In feedforward networks, information flows in one direction: from the input layer, through your hidden layers, to the output layer. This type of architecture works for tasks like image classification, where you need to look at the entire input just once.

Recurrent networks, on the other hand, are well suited for inputs like natural language that are sequential in nature: made up of word after word in a sequence. They take one input at a time from the sequence. With each input, they also get their previous output, which enables recurrent neural networks to process the input as a sequence.

An easier way to visualize recurrent networks is to "unroll" them - draw them separately for each time step.

Although RNNs have access to their previous output, they are not very good at remembering things that came long before in a sentence. These are more commonly known as "long-term dependencies".
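To make the "unrolled" view concrete, here is a minimal sketch of a vanilla recurrent layer in plain Python. Everything specific in it is an assumption made for illustration: the tanh activation, the function names `rnn_step` and `run_rnn`, and the toy weights.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One time step: combine the current input with the previous output."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[j], x))
            + sum(w * hi for w, hi in zip(W_h[j], h_prev))
            + b[j]
        )
        for j in range(len(b))
    ]

def run_rnn(sequence, W_x, W_h, b):
    """Unroll the network: the same weights are reused at every time step."""
    h = [0.0] * len(b)          # there is no previous output at the first step
    outputs = []
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
        outputs.append(h)
    return outputs

# Toy example: 1-dimensional inputs, 2 hidden units, made-up weights.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]
for step, h in enumerate(run_rnn([[1.0], [0.5], [-1.0]], W_x, W_h, b)):
    print(step, h)
```

Because the previous output is squashed and mixed with new input at every step, the influence of early inputs shrinks quickly, which is one way to see why long-term dependencies are hard for this kind of network.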

LSTMs

LSTMs (short for Long Short-Term Memory networks) are a type of recurrent network that has a vector called the "cell state" that helps it remember long-term dependencies.

At each time step, an LSTM performs the following operations:

It decides how much information it should forget from its cell state, using a hidden layer called the forget gate.
It decides what new information it should remember given the current input, using a hidden layer called the input gate.
It decides what information it should use from its memory to provide an output, using a hidden layer called the output gate.

The above diagram is a little over-simplified so that it is easy to visualize the main parts that an LSTM is made up of. Let's take a closer look at each of the gates to see what goes on underneath.

The Forget Gate

The forget gate concatenates the current input and previous output and passes the result through a hidden layer with a sigmoid activation. The output of this layer is then multiplied elementwise with the cell state.

Since sigmoid activations are between 0 and 1, values close to 0 instruct the LSTM to forget information in those parts of the cell state, and retain information where the sigmoid activation values are close to 1.

The Input Gate

The input gate also works on the concatenation of the current input and previous output.

There is a sigmoid layer that decides which values to keep (between 0 and 1), and there is a tanh layer that provides candidate values for what could be stored in the cell state. These two outputs are multiplied elementwise and the result is then added elementwise to the cell state.

The Output Gate

The output gate then multiplies elementwise the output of a sigmoid layer with the cell state passed through a tanh activation. The tanh activation brings the cell state into a range of -1 to 1, and the sigmoid activation selects which parts of the cell state should be used for the output.
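Putting the three gates together, a single LSTM time step can be sketched in plain Python as below. This is a simplified illustration under assumptions: the parameter dictionary `p`, its key names, and the helper functions are all invented for this example, and real implementations use matrix libraries instead of lists.

```python
import math

def affine(W, b, v):
    """A fully connected layer without activation: W @ v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh_vec(v):
    return [math.tanh(a) for a in v]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step; `p` holds hypothetical weights and biases."""
    z = x + h_prev                                    # concatenate input and previous output

    f = sigmoid_vec(affine(p["W_f"], p["b_f"], z))    # forget gate
    i = sigmoid_vec(affine(p["W_i"], p["b_i"], z))    # input gate: which values to keep
    g = tanh_vec(affine(p["W_g"], p["b_g"], z))       # input gate: candidate values
    o = sigmoid_vec(affine(p["W_o"], p["b_o"], z))    # output gate

    # Forget part of the old cell state, then add the selected new information.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The output is the squashed cell state, filtered by the output gate.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every sigmoid outputs 0.5 and the candidate values are 0, so the cell state is simply halved at each step. That makes a handy smoke test for the wiring.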

For a better understanding of what happens inside LSTMs, take a look at this blog post.

GRUs

GRUs (short for Gated Recurrent Units) are different from LSTMs in two major ways:

GRUs don't have a separate cell state. The hidden state of the cell and its output are the same.
GRUs have just 2 gates instead of 3: the reset gate and the update gate. The assumption is that if the network is trying to forget something from the hidden state, then it is because it wants to store some other information there. Therefore, we don't need separate forget and input gates. The complement of the forget gate (1 minus its value) can be used as the input gate.

The reset gate decides what information from the hidden state needs to be suppressed during this time step.

The update gate acts like a combination of the forget gate and input gate in an LSTM.
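A single GRU time step can be sketched in plain Python as follows. The parameter names and helpers are assumptions made for this example. Note also that papers and libraries differ on whether the update gate or its complement multiplies the old hidden state; this sketch picks the convention where the update gate plays the "keep" role.

```python
import math

def affine(W, b, v):
    """A fully connected layer without activation: W @ v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x, h_prev, p):
    """One GRU time step; `p` holds hypothetical weights and biases."""
    z = x + h_prev                                    # concatenate input and hidden state

    r = sigmoid_vec(affine(p["W_r"], p["b_r"], z))    # reset gate
    u = sigmoid_vec(affine(p["W_u"], p["b_u"], z))    # update gate

    # The candidate state only sees the parts of the hidden state
    # that the reset gate did not suppress.
    z_reset = x + [rk * hk for rk, hk in zip(r, h_prev)]
    cand = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z_reset)]

    # u keeps old information (forget-gate role); 1 - u admits new
    # information (input-gate role).
    return [uk * hk + (1.0 - uk) * ck for uk, hk, ck in zip(u, h_prev, cand)]
```

There is no separate cell state to return here: the new hidden state is also the output of the cell.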

Sequence modelling with Encoder-Decoder architecture

Although LSTMs and GRUs perform better than vanilla RNNs, they share a limitation: they produce exactly one output for each input, at every time step.

This is not a problem for tasks like sentiment analysis where we are interested in just the last output. However, for tasks like language translation where the length of the output sequence and even the position of the translated words with respect to each other can be different, this is a major limitation.

Sequence-to-sequence (often abbreviated seq2seq) models solve this problem by splitting the entire process into two steps.

The input sequence is first run through an encoder. The encoder generates a fixed-length vector representation of the input at the last time step.
This fixed representation is then run through a decoder which produces the output sequence of interest.

The importance of paying attention

A major limitation of seq2seq models is that the encoder tries to cram all the information it can into a single vector. The model is therefore limited by how much information it can put in there. Also, the information in this vector will be biased towards the words that the encoder has seen most recently.

An attention layer works on all the output states of the encoder. At each step, the decoder looks at its previous output and a weighted sum of all the encoder outputs that the attention layer generates.

The attention layer is a fully connected network that is trained to select the right encoder outputs to focus on at each time step. It receives the previous output of the decoder and all the hidden states of the encoder. Its output is a scalar value for each encoder output that acts as the weight for that time step.

These scalar values are passed through a softmax layer to ensure they add up to 1, and then multiplied by the corresponding encoder outputs and added together.

The resulting vector of the weighted sums of the encoder outputs is provided as an input to the decoder along with the output of the decoder at the previous time step. That way, the decoder can "see" which part of the input it needs to focus on to generate the next output.
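The scoring, softmax and weighted sum can be sketched in plain Python as follows. To keep it short, the "fully connected network" is reduced to a single layer with a made-up weight vector `w` and bias `b`; real attention layers are usually a bit larger, and the names here are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that add up to 1."""
    highest = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_prev, encoder_outputs, w, b=0.0):
    """Return the attention weights and the weighted sum of encoder outputs."""
    # One scalar score per encoder output, computed from the previous
    # decoder output concatenated with that encoder output.
    scores = [
        sum(wi * vi for wi, vi in zip(w, decoder_prev + h)) + b
        for h in encoder_outputs
    ]
    weights = softmax(scores)

    # Weighted sum of the encoder outputs, one dimension at a time.
    dim = len(encoder_outputs[0])
    context = [
        sum(a * h[k] for a, h in zip(weights, encoder_outputs))
        for k in range(dim)
    ]
    return weights, context

# Toy example: three encoder outputs of size 2, decoder output of size 1.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([0.5], encoder_outputs, w=[0.2, 1.0, -1.0])
print(weights)
print(context)
```

The `context` vector is what would be handed to the decoder together with its previous output.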

The concept of attention is so powerful that the Transformer architecture used by models like GPT-3 does away completely with recurrent and convolutional operations and uses only attention!

Further reading
I highly recommend the following articles that helped me understand what I have written in this post:
