<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JB</title>
    <description>The latest articles on DEV Community by JB (@jjb).</description>
    <link>https://dev.to/jjb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F261666%2F44cd0463-f41b-4bea-8dce-78c4bdb39bb5.png</url>
      <title>DEV Community: JB</title>
      <link>https://dev.to/jjb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jjb"/>
    <language>en</language>
    <item>
      <title>Traveling Salesman Problem</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Sat, 13 Jun 2020 00:43:49 +0000</pubDate>
      <link>https://dev.to/jjb/traveling-salesman-problem-32c5</link>
      <guid>https://dev.to/jjb/traveling-salesman-problem-32c5</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://nbviewer.jupyter.org/url/norvig.com/ipython/TSP.ipynb"&gt;In depth explanation + implementations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem"&gt;Wikipedia article&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The traveling salesman problem (TSP) is:

&lt;ul&gt;
&lt;li&gt;Given a list of cities &amp;amp; the distances between each pair of cities: what is the shortest possible route/tour that visits each city and returns to the origin city?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;With vanilla TSP you can assume the following:

&lt;ul&gt;
&lt;li&gt;The distance &lt;code&gt;D&lt;/code&gt; between city &lt;code&gt;A&lt;/code&gt; and city &lt;code&gt;B&lt;/code&gt; is the same as the distance between city &lt;code&gt;B&lt;/code&gt; and city &lt;code&gt;A&lt;/code&gt;. Thus, &lt;code&gt;D[A][B] == D[B][A]&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;If this is not true, TSP (&amp;amp; our graph of cities/distances) is considered asymmetric - the distance depends on the direction of travel, so there are up to twice as many distinct directed edges to consider and a solution is more challenging to arrive at.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;There is no given origin/start city. We only care about the shortest (optimal) route.

&lt;ul&gt;
&lt;li&gt;If we must find the optimal route given a starting city, the solution is largely the same.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;TSP is an &lt;a href="https://dev.to/jjb/part-15-np-complete-fibonacci-heap-2k79"&gt;NP-hard problem&lt;/a&gt; and the brute force approach to solving it is &lt;code&gt;O(n!)&lt;/code&gt; (factorial time) - as we must explore all possible routes.&lt;/li&gt;
&lt;li&gt;Exact algorithms that solve TSP optimally can work reasonably well for small inputs, however for larger (or unknown) inputs heuristic algorithms are popular.

&lt;ul&gt;
&lt;li&gt;Heuristic algorithms solve problems but produce solutions that are not guaranteed to be optimal. When we talk about TSP heuristic algorithms we mean algorithms that will produce an approximate solution.&lt;/li&gt;
&lt;li&gt;Heuristic approaches to solving TSP are typically faster than optimal approaches, meaning we can get an answer to TSP in a reasonable amount of time. If we choose the right algorithm(s), then an approximate solution can be acceptably close (and sometimes identical) to an optimal one.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Personally, I chose to solve a variation of TSP using two heuristic approaches that result in approximate solutions.

&lt;ul&gt;
&lt;li&gt;The variation is minor. I chose to solve TSP for a given start city (if none is provided, then I just solve starting at the first city in the given list).

&lt;ul&gt;
&lt;li&gt;The solution will be very similar for vanilla TSP, in fact slightly less complex as we do not care about final ordering of cities. In my implementations, I had to take extra care, especially when attempting to optimize a tour, to retain the specified start city.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The algorithms I implemented:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preorder Minimum Spanning Tree&lt;/strong&gt; (PMST) using Kruskal's algorithm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nearest Neighbour&lt;/strong&gt; (NN)

&lt;ul&gt;
&lt;li&gt;Optimized using repetition &amp;amp; sampling.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In addition, I chose to optimize the resulting solutions (where appropriate) using 2-opt

&lt;ul&gt;
&lt;li&gt;2-opt is a local search algorithm often used to improve existing TSP solutions (that are not already optimal)&lt;/li&gt;
&lt;li&gt;Per Skiena, in &lt;a href="http://www.algorist.com/"&gt;The Algorithm Design Manual&lt;/a&gt;: "Two-opting a tour is a fast and effective way to improve any other heuristic." (p.535).&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/2-opt"&gt;Here is the Wikipedia page for 2-opt&lt;/a&gt;. The algorithm as described is what I ended up using.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;I like the PMST approach because I had already covered minimum spanning tree (MST) in &lt;a href="https://dev.to/jjb/minimum-spanning-tree-kruskal-s-algorithm-19al"&gt;an earlier post&lt;/a&gt;, it also allowed me to brush up on &lt;a href="https://dev.to/jjb/union-find-disjoint-set-28c5"&gt;Union-Find&lt;/a&gt; which Kruskal's algorithm relies on (Kruskal's algorithm constructs an MST).&lt;/li&gt;
&lt;li&gt;The PMST approach has the benefit, as implemented, of being guaranteed to be no worse than 2x the optimal route (or tour), provided distances obey the triangle inequality. Meaning if an optimal tour is 20, at worst a PMST solution will be 40. 

&lt;ul&gt;
&lt;li&gt;An MST is a tree - a subset of the graph's edges that connects all vertices with the minimum possible total edge weight&lt;/li&gt;
&lt;li&gt;If we could rearrange an MST at no cost, we could rearrange it to be an optimal tour.&lt;/li&gt;
&lt;li&gt;As rearranging an MST does have a cost, we instead traverse it in a general way (preorder using DFS) to come up with an ordering that is "good enough" i.e. an approximation.&lt;/li&gt;
&lt;li&gt;When rearranging an MST into a tour, the most we could traverse an edge is twice; instead of doing this, we "skip" to the next city.

&lt;ul&gt;
&lt;li&gt;This skip is a straight line &amp;amp; represents a single edge in our graph. &lt;/li&gt;
&lt;li&gt;Going via one edge in this way will always be the same or better than going via two edges. Why? Because of the triangle inequality.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triangle inequality&lt;/strong&gt; states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;So if a subsection of our MST is a triangle with each side of our triangle being an edge and each point of our triangle being a vertex (&lt;code&gt;C--A--B&lt;/code&gt;):

&lt;ul&gt;
&lt;li&gt;If we start at vertex &lt;code&gt;A&lt;/code&gt; and then visit &lt;code&gt;B&lt;/code&gt;, we will need to visit &lt;code&gt;C&lt;/code&gt; next (to complete our tour).&lt;/li&gt;
&lt;li&gt;To visit &lt;code&gt;C&lt;/code&gt; we can skip from &lt;code&gt;B&lt;/code&gt; to &lt;code&gt;C&lt;/code&gt; via a single edge &lt;code&gt;edge(B,C)&lt;/code&gt; or go via &lt;code&gt;A&lt;/code&gt; which will mean visiting two edges &lt;code&gt;edge(B,A) &amp;amp; edge(A,C)&lt;/code&gt;. &lt;/li&gt;
&lt;li&gt;We will always choose to visit &lt;code&gt;C&lt;/code&gt; via a single edge &lt;code&gt;edge(B,C)&lt;/code&gt; because: &lt;code&gt;edge(B,C) &amp;lt;= edge(B,A) + edge(A,C)&lt;/code&gt; - meaning &lt;code&gt;edge(B,C)&lt;/code&gt; (due to triangle inequality) can be shorter, but never longer.&lt;/li&gt;
&lt;li&gt;Therefore, preordering an MST means that in a worst case scenario our "skips" (going via one edge) will be as bad as going via two edges. Meaning at worst a preordering of vertices in an MST will produce a tour that is 2x the distance of an optimal tour (but often better than 2x).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;NN is much simpler to reason about and the code is more concise, so I decided to provide an implementation for this algorithm as well.&lt;/li&gt;
&lt;li&gt;At a high level, this is how the chosen algorithms work:

&lt;ul&gt;
&lt;li&gt;PMST:

&lt;ul&gt;
&lt;li&gt;This approach first constructs an MST from the input graph of cities.&lt;/li&gt;
&lt;li&gt;This MST will have its edges duplicated, to form an Eulerian graph.

&lt;ul&gt;
&lt;li&gt;We do this so that we can perform depth-first search (DFS) on our MST and preorder its vertices (cities).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Once an MST is constructed we can perform DFS and the resulting order of vertices (cities) is our approximate solution.&lt;/li&gt;
&lt;/ul&gt;
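&lt;p&gt;To make the PMST steps above concrete, here is a minimal Python sketch (my actual implementations are in C#; the names here are illustrative). Because an MST is a tree, an iterative DFS visits every vertex directly - the edge duplication is implicit in walking down a branch and "skipping" back up:&lt;/p&gt;

```python
def pmst_tour(dist, start=0):
    """Approximate TSP tour: build an MST with Kruskal's algorithm,
    then visit its vertices in DFS preorder starting from `start`."""
    n = len(dist)
    # edge list of the complete graph, sorted by weight (the O(v^2) transform)
    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    adj = {i: [] for i in range(n)}   # the MST as an adjacency list
    for w, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:                  # edge joins two components: keep it
            parent[ra] = rb
            adj[a].append(b)
            adj[b].append(a)

    tour, seen, stack = [], set(), [start]
    while stack:                      # iterative DFS preorder of the MST
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        tour.append(v)
        stack.extend(adj[v])
    return tour
```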
&lt;/li&gt;
&lt;li&gt;NN:

&lt;ul&gt;
&lt;li&gt;This approach starts at any city &lt;code&gt;A&lt;/code&gt; and then proceeds to visit the nearest city to it &lt;code&gt;B&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The algorithm repeats this process (so will visit city &lt;code&gt;C&lt;/code&gt; that is closest to &lt;code&gt;B&lt;/code&gt; etc.) until all cities have been visited, then it will return to the origin city.&lt;/li&gt;
&lt;li&gt;We can optimize NN by executing it for every city in our list of cities, whilst keeping track of what the shortest tour produced was. We can call this repeated nearest neighbour.&lt;/li&gt;
&lt;li&gt;In the same way, we can take a subset (or sample) of cities at random and execute NN for each city in our subset, then return the shortest tour. We can call this sampled repeated nearest neighbour.&lt;/li&gt;
&lt;li&gt;Both repetition and sampling improve NN, but sampling is often &lt;em&gt;almost&lt;/em&gt; as good as repetition whilst being less expensive (quicker) - this is of course dependent on the sample size.&lt;/li&gt;
&lt;/ul&gt;
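&lt;p&gt;A minimal Python sketch of nearest neighbour and its repeated variant (illustrative only - my implementations are in C#; the sampled variant simply passes a random subset of cities as &lt;code&gt;starts&lt;/code&gt;):&lt;/p&gt;

```python
def nearest_neighbour_tour(dist, start=0):
    """Greedy tour: repeatedly hop to the closest unvisited city."""
    n = len(dist)
    unvisited = set(range(n))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        here = tour[-1]
        nxt = min(unvisited, key=lambda c: dist[here][c])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(dist, tour):
    # includes the closing edge back to the origin city
    return sum(dist[tour[i - 1]][tour[i]] for i in range(len(tour)))

def repeated_nn(dist, starts=None):
    """Run NN from every start city (or a sample) and keep the shortest tour."""
    if starts is None:
        starts = range(len(dist))
    return min((nearest_neighbour_tour(dist, s) for s in starts),
               key=lambda t: tour_length(dist, t))
```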


&lt;/li&gt;
&lt;li&gt;2-opt:

&lt;ul&gt;
&lt;li&gt;2-opt's goal is to "uncross" any part of the route and see if preventing routes from crossing over each other's paths improves the solution.&lt;/li&gt;
&lt;li&gt;To do this 2-opt swaps 2 edges in a tour.&lt;/li&gt;
&lt;li&gt;To swap the 2 edges, 2-opt takes a tour and creates 3 subsections of it.&lt;/li&gt;
&lt;li&gt;The middle subsection is reversed (i.e 2 edges are swapped), with the leading and trailing subsections staying the same.&lt;/li&gt;
&lt;li&gt;If the resulting tour is better, we recurse and begin the 2-opt routine again (passing in our new, shorter, tour) - this way 2-opt will continue to optimize an already optimized tour, and will produce the best tour it can.

&lt;ul&gt;
&lt;li&gt;Don't confuse this as meaning 2-opt can produce an optimal tour - most of the time it cannot, but it can improve tours.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;If the resulting tour is no better than (or worse than) the original tour, then the window size is increased and the process is repeated.&lt;/li&gt;
&lt;li&gt;A bigger window means a larger middle subsection will get reversed.&lt;/li&gt;
&lt;/ul&gt;
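&lt;p&gt;An illustrative Python sketch of the 2-opt move (not my C# implementation). For clarity it re-computes full tour lengths, which costs &lt;code&gt;O(n)&lt;/code&gt; per candidate - a real implementation would compare just the two swapped edges. The start city &lt;code&gt;tour[0]&lt;/code&gt; is pinned, matching my TSP variation:&lt;/p&gt;

```python
def tour_length(dist, tour):
    # includes the closing edge back to the origin city
    return sum(dist[tour[i - 1]][tour[i]] for i in range(len(tour)))

def two_opt(dist, tour):
    """Keep reversing middle subsections while doing so strictly shortens the tour."""
    n = len(tour)
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, n):
            for j in range(i + 1, n + 1):
                # reverse the middle subsection, keep leading/trailing subsections
                cand = best[:i] + best[i:j][::-1] + best[j:]
                cand_len = tour_length(dist, cand)
                best_len = tour_length(dist, best)
                if min(cand_len, best_len) == best_len:
                    continue  # candidate is not strictly shorter
                best, improved = cand, True
    return best
```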
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;So what are the time complexities of PMST, NN, &amp;amp; 2-opt?

&lt;ul&gt;
&lt;li&gt;PMST:

&lt;ul&gt;
&lt;li&gt;Time complexity for Kruskal's algorithm (used to construct our MST) is &lt;code&gt;O(e log v)&lt;/code&gt; where &lt;code&gt;e&lt;/code&gt; is the number of edges and &lt;code&gt;v&lt;/code&gt; is the number of vertices in the graph. Space is &lt;code&gt;O(v + e)&lt;/code&gt;. 

&lt;ul&gt;
&lt;li&gt;In our PMST solution of TSP we will double the edges, so space is actually worse, but constants are factored out in Big O.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Preorder traversal of our MST is &lt;code&gt;O(v + e)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Total time complexity of the PMST approach is &lt;code&gt;O(e log v)&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;My implementations take an adjacency matrix for city distances. I transform this matrix into an edge list. This incurs an &lt;code&gt;O(v^2)&lt;/code&gt; cost. This means that, as implemented, my solution is &lt;code&gt;O(v^2 + e log v)&lt;/code&gt; - and since a complete graph has &lt;code&gt;e = v^2&lt;/code&gt; edges, that works out to &lt;code&gt;O(v^2 log v)&lt;/code&gt;. However, if supplied an edge/adjacency list as an input, the &lt;code&gt;O(v^2)&lt;/code&gt; transformation cost is not incurred.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;NN:

&lt;ul&gt;
&lt;li&gt;NN's time complexity is &lt;code&gt;O(n^2)&lt;/code&gt; in the worst case. Space is &lt;code&gt;O(n)&lt;/code&gt;. Repeated NN is &lt;code&gt;O(n^3)&lt;/code&gt; and sampled NN is &lt;code&gt;O(kn^2)&lt;/code&gt; where &lt;code&gt;k&lt;/code&gt; is the size of the sample (we run the &lt;code&gt;O(n^2)&lt;/code&gt; algorithm &lt;code&gt;k&lt;/code&gt; times).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;2-opt:

&lt;ul&gt;
&lt;li&gt;2-opt's complexity varies on the type of tour it is given:

&lt;ul&gt;
&lt;li&gt;Each full pass of 2-opt examines &lt;code&gt;O(n^2)&lt;/code&gt; candidate swaps; what varies is how many improving passes are needed. In the average case, a tour made using a greedy algorithm (which our PMST &amp;amp; NN produce) needs roughly &lt;code&gt;O(n)&lt;/code&gt; improving moves, whereas a tour that is more randomly arrived at needs roughly &lt;code&gt;O(n log n)&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Extras:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Held%E2%80%93Karp_algorithm"&gt;Held-Karp algorithm&lt;/a&gt; is a dynamic programming algorithm that solves the TSP optimally. It is &lt;code&gt;O(n^2 * 2^n)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/3-opt"&gt;3-opt&lt;/a&gt; is similar to 2-opt, but it swaps 3 edges instead of 2. It is  &lt;code&gt;O(n^3)&lt;/code&gt; for a single run, but can produce close to an optimal tour. 

&lt;ul&gt;
&lt;li&gt;Both 2-opt &amp;amp; 3-opt belong to the family of k-opt heuristics, which produce K-optimal tours.&lt;/li&gt;
&lt;li&gt;Per Skiena in &lt;a href="http://www.algorist.com/"&gt;The Algorithm Design Manual&lt;/a&gt;: K-optimal tours are "local refinements to an initially arbitrary tour in the hopes of improving it. In particular, subsets of k edges are deleted from the tour and the k remaining subchains rewired to form a different tour with hopefully a better cost. A tour is k-optimal when no subset of k edges can be deleted and rewired to reduce the cost of the tour." (p.535). 

&lt;ul&gt;
&lt;li&gt;When &lt;code&gt;K &amp;gt; 3&lt;/code&gt; the time complexity increases faster than the solution improves. So we often only deal with 2-opt or 3-opt in the context of TSP.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
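&lt;p&gt;For the curious, Held-Karp can be sketched in a few lines of Python using subsets as dictionary keys (an illustrative sketch returning only the optimal tour length - bitmask implementations are more common and faster in practice):&lt;/p&gt;

```python
from itertools import combinations

def held_karp(dist):
    """Exact TSP (optimal tour length) via Held-Karp DP - O(n^2 * 2^n) time."""
    n = len(dist)
    # best[(subset, j)]: shortest path from city 0 through `subset`, ending at j
    best = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            s = frozenset(subset)
            for j in subset:
                rest = s - {j}
                best[(s, j)] = min(best[(rest, k)] + dist[k][j] for k in rest)
    whole = frozenset(range(1, n))
    # close the tour by returning to city 0
    return min(best[(whole, j)] + dist[j][0] for j in range(1, n))
```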

&lt;p&gt;Below you will find implementations of, &amp;amp; test cases for, preorder minimum spanning tree, variations of nearest neighbour, &amp;amp; 2-opt approximate solutions for the Traveling Salesman problem:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Dynamic Programming</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Sun, 31 May 2020 21:20:02 +0000</pubDate>
      <link>https://dev.to/jjb/dynamic-programming-3mal</link>
      <guid>https://dev.to/jjb/dynamic-programming-3mal</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.educative.io/courses/grokking-dynamic-programming-patterns-for-coding-interviews/m2G1pAq0OO0"&gt;In-depth explanation with examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Dynamic_programming"&gt;Wikipedia article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=xCbYmUPvc2Q"&gt;0/1 Knapsack problem video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ASoaQq66foQ"&gt;Longest Common Subsequence (LCS) problem video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/williamfiset/Algorithms/blob/master/src/main/java/com/williamfiset/algorithms/dp/LongestCommonSubsequence.java"&gt;Bottom-up LCS implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bephrem1/backtobackswe/blob/master/Dynamic%20Programming,%20Recursion,%20&amp;amp;%20Backtracking/LongestCommonSubsequence/TopDown.java"&gt;Top-down LCS implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/bephrem1/backtobackswe/blob/master/Dynamic%20Programming%2C%20Recursion%2C%20%26%20Backtracking/Knapsack01/TopDownNoMemoization.java"&gt;Top-down 0/1 Knapsack implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/williamfiset/Algorithms/blob/master/src/main/java/com/williamfiset/algorithms/dp/Knapsack_01.java"&gt;Bottom-up 0/1 Knapsack implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic programming is a process by which a larger problem is broken down into overlapping sub-problems. &lt;/li&gt;
&lt;li&gt;These sub-problems are easier to reason about, easier to solve individually, and typically boil down to a single decision. &lt;/li&gt;
&lt;li&gt;By solving these sub-problems, dynamic programming enables us to build up an answer to the larger, more complex, problem. &lt;/li&gt;
&lt;li&gt;If a problem can be solved optimally in this way then the problem is considered to have an optimal substructure. 

&lt;ul&gt;
&lt;li&gt;Optimal substructure just means that a problem has an optimal solution that has been constructed from optimal solutions of its sub-problems.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Dynamic programming solutions are usually either &lt;strong&gt;top-down&lt;/strong&gt; or &lt;strong&gt;bottom-up&lt;/strong&gt;.

&lt;ul&gt;
&lt;li&gt;Top-down means we start from our input(s) and work downwards to our base-case using recursion (remember recursive functions have a base-case that determines when the recursion should end, similar to an exit condition in a while loop). &lt;/li&gt;
&lt;li&gt;Bottom-up means we start from the base-case and work upwards.

&lt;ul&gt;
&lt;li&gt;Bottom-up is recursion free and thus will avoid building up a call stack of &lt;code&gt;O(n)&lt;/code&gt; size. This means bottom-up approaches can be more memory efficient than their top-down counterparts. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Top-down solutions use recursion and memoization to solve the larger problem, whereas bottom-up solutions will use a nested for-loop to build a solution matrix/table before constructing the answer from the matrix.

&lt;ul&gt;
&lt;li&gt;Memoization is sort of like &lt;em&gt;caching&lt;/em&gt;. We remember the results of a computation so we don't have to recompute the same thing over and over. We instead check the "cache" and see if we have already done the work previously. &lt;a href="https://www.interviewcake.com/concept/java/memoization"&gt;Here's a good article on memoization.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A sub-problem solution table, or matrix, is a matrix with each coordinate (&lt;code&gt;x,y&lt;/code&gt;) containing the answer to a sub-problem (&lt;code&gt;x,y&lt;/code&gt; axis are representations of the problem inputs). &lt;/li&gt;
&lt;li&gt;For example, &lt;code&gt;x&lt;/code&gt; might represent a weight input, and &lt;code&gt;y&lt;/code&gt; might represent a &lt;code&gt;cost&lt;/code&gt; input. Given a weight of &lt;code&gt;1&lt;/code&gt; and a cost of &lt;code&gt;$3&lt;/code&gt; we can find out in our matrix what the optimal solution is for those inputs by going to the &lt;code&gt;x,y&lt;/code&gt; coordinates &lt;code&gt;1,3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Top-down dynamic programming solutions memoize the solutions to sub-problems to avoid duplicated work. &lt;/li&gt;
&lt;li&gt;Bottom-up solutions merely keep a solution matrix/table in memory, but only use it for constructing the solution - meaning technically there is no memoization, as the same sub-problem does not get solved more than once. &lt;/li&gt;
&lt;/ul&gt;
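&lt;p&gt;As a quick illustration of memoization on its own (a hypothetical example, unrelated to the 0/1 Knapsack and LCS problems discussed in this post), here is a memoized Fibonacci in Python:&lt;/p&gt;

```python
def fib(n, memo={}):
    """Memoized Fibonacci. The shared `memo` dict acts as the "cache":
    each sub-problem is computed once, then looked up on later calls."""
    if n in memo:
        return memo[n]      # cache hit - no recomputation
    if n in (0, 1):
        return n            # base-case
    memo[n] = fib(n - 1) + fib(n - 2)
    return memo[n]
```

Without the cache this recursion is exponential; with it, each value of `n` is solved exactly once.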


&lt;/li&gt;
&lt;li&gt;A famous dynamic programming problem is the 0/1 Knapsack problem&lt;/li&gt;
&lt;li&gt;The 0/1 Knapsack problem statement is:

&lt;ul&gt;
&lt;li&gt;Given a knapsack &amp;amp; a set of items (each item has a weight and value) determine the items to put in your knapsack. The total weight of the knapsack must be less than or equal to a given limit, and the total value should be as large as possible.&lt;/li&gt;
&lt;li&gt;The items cannot be split/divided. Which is where the &lt;code&gt;0/1&lt;/code&gt; comes from - we either take an item or leave it.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The following is an overview of solutions to 0/1 Knapsack (at the conclusion of this post you will find actual code for both top-down and bottom-up solutions): &lt;/li&gt;
&lt;li&gt;The naive, brute force, solution to 0/1 Knapsack is to try all possible combinations of items, keeping track of which subset yields the greatest value whilst being under the weight limit. 

&lt;ul&gt;
&lt;li&gt;This approach is exponential &lt;code&gt;O(2^n)&lt;/code&gt;, with space being &lt;code&gt;O(n)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The algorithm would be something like:

&lt;ul&gt;
&lt;li&gt;Define a recursive function (e.g &lt;code&gt;solveKnapsack(int[] values, int[] weights, int weightLimit, int index)&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Define a base-case at the start of our function that will ensure our index and weight limit are not out of bounds.&lt;/li&gt;
&lt;li&gt;For each item &lt;code&gt;i&lt;/code&gt; (i.e value of &lt;code&gt;i&lt;/code&gt; is &lt;code&gt;values[i]&lt;/code&gt;, weight of &lt;code&gt;i&lt;/code&gt; is &lt;code&gt;weight[i]&lt;/code&gt;):

&lt;ul&gt;
&lt;li&gt;Compute a value &lt;code&gt;v1&lt;/code&gt; which &lt;strong&gt;includes&lt;/strong&gt; item &lt;code&gt;i&lt;/code&gt; - if its weight is within our weight limit, deduct &lt;code&gt;i&lt;/code&gt;'s weight from our limit &amp;amp; then recursively process the remaining items.&lt;/li&gt;
&lt;li&gt;Also compute a value &lt;code&gt;v2&lt;/code&gt; which &lt;strong&gt;excludes&lt;/strong&gt; item &lt;code&gt;i&lt;/code&gt; (leaving the limit unchanged) and recursively process the remaining items. If &lt;code&gt;i&lt;/code&gt;'s weight is greater than our limit, &lt;code&gt;v2&lt;/code&gt; is the only option.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;At the end of our recursive function, return the maximum between &lt;code&gt;v1&lt;/code&gt; and &lt;code&gt;v2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
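&lt;p&gt;The recursive routine described above might be sketched in Python as follows (hypothetical names; my implementations are in C#). Passing an empty dictionary as &lt;code&gt;memo&lt;/code&gt; turns the naive &lt;code&gt;O(2^n)&lt;/code&gt; recursion into a memoized top-down solution:&lt;/p&gt;

```python
def solve_knapsack(values, weights, limit, index=0, memo=None):
    """Top-down 0/1 knapsack: at each index decide to include or exclude the item."""
    if index == len(values):                 # base-case: no items left
        return 0
    if memo is not None and (index, limit) in memo:
        return memo[(index, limit)]          # sub-problem already solved
    # v2: exclude item `index`, limit unchanged
    v2 = solve_knapsack(values, weights, limit, index + 1, memo)
    v1 = 0
    # v1: include item `index`, but only if its weight fits within the limit
    if weights[index] in range(limit + 1):   # i.e. weight is 0 .. limit inclusive
        v1 = values[index] + solve_knapsack(
            values, weights, limit - weights[index], index + 1, memo)
    answer = max(v1, v2)
    if memo is not None:
        memo[(index, limit)] = answer
    return answer
```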
&lt;/li&gt;
&lt;li&gt;If our naive approach is written recursively similar to described, then we can see that this approach is top-down - it is working from our inputs downwards to a base-case. &lt;/li&gt;
&lt;li&gt;Each call to &lt;code&gt;solveKnapsack&lt;/code&gt; in our naive solution is solving a sub-problem: do we include or exclude item &lt;code&gt;i&lt;/code&gt;? And whatever the answer, what is the value (&lt;code&gt;v1&lt;/code&gt; or &lt;code&gt;v2&lt;/code&gt;) of that decision?&lt;/li&gt;
&lt;li&gt;We can add an extra parameter (representing a table or matrix) to our function that could be used to memoize the results of these sub-problems. This will help to avoid duplicated work (assuming we also add a check that will return the memoized value/sub-problem solution, if it has already been solved)&lt;/li&gt;
&lt;li&gt;A bottom-up approach merely removes recursion from our previous solution and builds up the matrix at the start of the function. We can then use the matrix to determine things like: what is the maximum value? what items were selected? (via backtracking).&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time &amp;amp; space complexities for the top-down and bottom-up solutions are &lt;code&gt;O(nm)&lt;/code&gt; where &lt;code&gt;n&lt;/code&gt; is the number of items and &lt;code&gt;m&lt;/code&gt; is the weight limit.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another famous problem that can be solved with dynamic programming is &lt;strong&gt;Longest Common Subsequence&lt;/strong&gt; (LCS). &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;LCS problem statement is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Given two strings &lt;code&gt;inputA&lt;/code&gt; and &lt;code&gt;inputB&lt;/code&gt;, return the length of their longest common subsequence.&lt;/li&gt;
&lt;li&gt;Other variations might also ask you to return the LCS string itself. In my implementations I return the length, but also print the LCS before returning. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;So what exactly is a subsequence?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A subsequence of a string &lt;code&gt;A&lt;/code&gt; is a new string &lt;code&gt;B&lt;/code&gt; that has zero or more of the characters from &lt;code&gt;A&lt;/code&gt; deleted, without changing the relative order of the remaining characters. 

&lt;ul&gt;
&lt;li&gt;This is unlike a substring, which is simply a sub-section of another string. &lt;/li&gt;
&lt;li&gt;For example: given &lt;code&gt;A = "abcdefg"&lt;/code&gt; a subsequence &lt;code&gt;B&lt;/code&gt; could be &lt;code&gt;"acdeg"&lt;/code&gt; or &lt;code&gt;"ag"&lt;/code&gt; or &lt;code&gt;"abd"&lt;/code&gt;. Notice how &lt;code&gt;B&lt;/code&gt;'s characters all exist in &lt;code&gt;A&lt;/code&gt;? And how each character's &lt;em&gt;relative&lt;/em&gt; position is unchanged? &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"acb"&lt;/code&gt; &amp;amp; &lt;code&gt;"dac"&lt;/code&gt; would represent a change in the relative order and so are examples of invalid subsequences of &lt;code&gt;A&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Further, we can say that if two strings share a subsequence it is a common subsequence. An example of a common subsequence: Given two strings &lt;code&gt;A&lt;/code&gt; &amp;amp; &lt;code&gt;B&lt;/code&gt; (&lt;code&gt;A = "hello", B = "below"&lt;/code&gt;) a common subsequence &lt;code&gt;C&lt;/code&gt; of the two is &lt;code&gt;C = "el"&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;The solutions to LCS are strikingly similar to the ones for 0/1 Knapsack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The naive approach, again, is brute force and checks every possible subsequence in &lt;code&gt;inputA[0..n]&lt;/code&gt; and &lt;code&gt;inputB[0..m]&lt;/code&gt;. The longest subsequence that exists in both inputs will be returned.

&lt;ul&gt;
&lt;li&gt;This approach is &lt;code&gt;O(n * 2^m)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;But what decisions does our naive approach make? How is it finding the subsequences?

&lt;ul&gt;
&lt;li&gt;For each recursive call to &lt;code&gt;solveLCS(string inputA, string inputB)&lt;/code&gt;, if both &lt;code&gt;inputA&lt;/code&gt; &amp;amp; &lt;code&gt;inputB&lt;/code&gt; end with the same character then we remove the trailing character from each and recursively call &lt;code&gt;solveLCS(inputA[0..n-1], inputB[0..m-1])&lt;/code&gt; and return the result.

&lt;ul&gt;
&lt;li&gt;Why? Because if they end with the same character, that means we have found a subsequence - we remove that character and process the rest of each string to determine how long the longest subsequence is.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;If our inputs do not have the same last character, then we have yet to identify a subsequence so we must remove a trailing character from one of the inputs and recursively call &lt;code&gt;solveLCS()&lt;/code&gt;, returning the result. 

&lt;ul&gt;
&lt;li&gt;We do this for &lt;em&gt;both&lt;/em&gt; &lt;code&gt;inputA&lt;/code&gt; and &lt;code&gt;inputB&lt;/code&gt; (&lt;code&gt;(inputA[0..n-1], inputB[0..m])&lt;/code&gt; &amp;amp; &lt;code&gt;(inputA[0..n], inputB[0..m-1])&lt;/code&gt;), after computing the LCS of both we return the &lt;em&gt;larger&lt;/em&gt; of the two values. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The base-case for our naive recursive algorithm checks if &lt;code&gt;inputA&lt;/code&gt; or &lt;code&gt;inputB&lt;/code&gt; are empty strings (as we will be removing characters from each during the recursion).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
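&lt;p&gt;The bottom-up version of LCS is compact enough to sketch here in Python (illustrative; my implementations are in C#). Cell &lt;code&gt;table[i][j]&lt;/code&gt; holds the LCS length of the first &lt;code&gt;i&lt;/code&gt; characters of &lt;code&gt;inputA&lt;/code&gt; and the first &lt;code&gt;j&lt;/code&gt; characters of &lt;code&gt;inputB&lt;/code&gt;:&lt;/p&gt;

```python
def lcs_length(input_a, input_b):
    """Bottom-up LCS: fill the solution table with a nested for-loop."""
    n, m = len(input_a), len(input_b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if input_a[i - 1] == input_b[j - 1]:
                # matching trailing characters extend the subsequence by one
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # otherwise drop a trailing character from one input
                # and take the larger of the two sub-problem answers
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```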
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Like 0/1 Knapsack, to transform the naive recursive solution into a dynamic top-down solution all we need to do is add an additional data structure as a parameter. This parameter will represent our table/matrix and will be used to memoize each sub-problem's solution. We effectively cache each decision made in the recursive routine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Again, like with 0/1 Knapsack, a bottom-up solution will not use recursion and instead, upfront, build the solution matrix using a nested for-loop. From the resulting matrix we can deduce solutions to: what is the longest common subsequence? What characters are in that subsequence?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time &amp;amp; space complexities for the top-down and bottom-up LCS solutions are &lt;code&gt;O(nm)&lt;/code&gt; where &lt;code&gt;n&lt;/code&gt; is the number of characters in &lt;code&gt;inputA&lt;/code&gt; and &lt;code&gt;m&lt;/code&gt; is the number of characters in &lt;code&gt;inputB&lt;/code&gt;. The top-down solution, due to recursion, is technically &lt;code&gt;O(nm + n)&lt;/code&gt; - but asymptotically we don't factor in the extra cost (because it is dominated by the &lt;code&gt;nm&lt;/code&gt; term). &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lastly, you will notice that both 0/1 Knapsack and LCS problems are solvable in exponential time when approached naively. Dynamic programming not only gives us a blueprint for solving both, but enables us to reduce the running time for both these problems to polynomial running time (specifically &lt;code&gt;O(nm)&lt;/code&gt;). &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polynomial running time is simply a category of running times. Formally, it is any running time that is &lt;code&gt;O(n^k)&lt;/code&gt; where &lt;code&gt;k&lt;/code&gt; is a non-negative constant. Polynomial running time is &lt;em&gt;much&lt;/em&gt; better than exponential running time.  &lt;a href="https://stackoverflow.com/a/34517541/3574076"&gt;Here is a good comparison of various time complexities and their rates of growth&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are top-down &amp;amp; bottom-up solutions to LCS &amp;amp; 0/1 Knapsack problems, with test cases:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Fenwick Tree (Binary Indexed Tree)</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Fri, 15 May 2020 23:48:55 +0000</pubDate>
      <link>https://dev.to/jjb/fenwick-tree-binary-indexed-tree-3e7l</link>
      <guid>https://dev.to/jjb/fenwick-tree-binary-indexed-tree-3e7l</guid>
      <description>&lt;p&gt;To understand Fenwick trees you will need to first understand binary representation of numbers and basic bit manipulation, &lt;a href="https://dev.to/jjb/part-11-learning-binary-bit-manipulation-3a7n"&gt;a previous post of mine covers these topics&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=uSFzHCZ4E-8"&gt;Video overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=CWDQJGaN1gY"&gt;Video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/williamfiset/Algorithms/blob/master/src/main/java/com/williamfiset/algorithms/datastructures/fenwicktree/FenwickTreeRangeQueryPointUpdate.java"&gt;Implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Fenwick tree is a data structure that can efficiently calculate prefix sums for a range of numbers, whilst also supporting updates efficiently.

&lt;ul&gt;
&lt;li&gt;A prefix sum is essentially a running total of a range of numbers. The prefix sums of &lt;code&gt;sourceArray = [1,2,3,4,5]&lt;/code&gt; are &lt;code&gt;prefixSumArray = [1,3,6,10,15]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A plain array can also be used to store prefix sums. &lt;/li&gt;
&lt;li&gt;Using the previous example: calculating the prefix sum at &lt;code&gt;prefixSumArray[i]&lt;/code&gt; is an &lt;code&gt;O(1)&lt;/code&gt; operation. &lt;/li&gt;
&lt;li&gt;Updating an element at &lt;code&gt;sourceArray[i]&lt;/code&gt; means we also need to update our prefix sum array. This is an expensive operation that takes &lt;code&gt;O(n)&lt;/code&gt;, because up to &lt;code&gt;n&lt;/code&gt; prefixes in &lt;code&gt;prefixSumArray&lt;/code&gt; must be updated. &lt;/li&gt;
&lt;li&gt;To construct a prefix sum array takes &lt;code&gt;O(n)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;By contrast, a Fenwick tree also takes &lt;code&gt;O(n)&lt;/code&gt; for construction but is &lt;code&gt;O(log n)&lt;/code&gt; for both updates &amp;amp; prefix sum operations.&lt;/li&gt;
&lt;li&gt;Space complexity of a Fenwick tree is &lt;code&gt;O(n)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;A Fenwick tree is an array that stores, at each index, prefix sums of &lt;code&gt;n&lt;/code&gt; elements from a source array

&lt;ul&gt;
&lt;li&gt;This is different from a prefix array because each index has a relation to other indices. Because of this, at each index we do not store the prefix sum of all elements that precede it - instead we store the prefix sum of all its child indices. &lt;/li&gt;
&lt;li&gt;This relationship between indices means that at each index in a Fenwick tree we store data that is the sum of &lt;code&gt;n&lt;/code&gt; indices, but not necessarily the sum of all the indices up to and including &lt;code&gt;i&lt;/code&gt; (which is what a prefix array does).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;A Fenwick tree is not actually a tree structure; instead, it is stored as an array that represents a tree.

&lt;ul&gt;
&lt;li&gt;Index 0 of the array is the virtual root node of the tree and is not used for data (prefixes)&lt;/li&gt;
&lt;li&gt;Half of the elements have a range of responsibility for 1 element&lt;/li&gt;
&lt;li&gt;A quarter of elements have responsibility for 2 elements&lt;/li&gt;
&lt;li&gt;An eighth of elements have a responsibility for 4 elements&lt;/li&gt;
&lt;li&gt;A sixteenth of the elements have a responsibility for 8 elements&lt;/li&gt;
&lt;li&gt;And so on...&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Range of responsibility, in this context, means that the elements contain a prefix sum for the range of elements they are responsible for. 

&lt;ul&gt;
&lt;li&gt;This means, unlike prefix arrays, we do not need to check every element in a range of indices to calculate a prefix sum.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;A Fenwick tree does this dividing/assigning of range of responsibility based on the binary representation of the indices of the array. This works like so:

&lt;ul&gt;
&lt;li&gt;If the rightmost bit of &lt;code&gt;i&lt;/code&gt; is set to &lt;code&gt;1&lt;/code&gt; (value 1), then the same data from the source array is stored at the same position in the Fenwick tree array. These are the "half of elements" and have a range of responsibility of 1 - as they contain the exact values from the source array.&lt;/li&gt;
&lt;li&gt;If the first bit of &lt;code&gt;i&lt;/code&gt; is &lt;code&gt;0&lt;/code&gt; (0) but the second bit is set to &lt;code&gt;1&lt;/code&gt; (2) then the range of responsibility is 2. These are the "quarter of elements". Again, the data is stored at the same index in the Fenwick tree array as it was in the source array - however this time the data will be the sum of &lt;code&gt;sourceArray[i] + sourceArray[i - 1]&lt;/code&gt; (indices of the "quarter of elements" are parents of the "half of elements" indices).&lt;/li&gt;
&lt;li&gt;If the first two bits of &lt;code&gt;i&lt;/code&gt; are &lt;code&gt;0&lt;/code&gt; but the third bit is set to &lt;code&gt;1&lt;/code&gt; (4) then the range of responsibility is 4. These are the "eighth of elements". These elements are the parents of the "quarter of elements" and grandparents of the "half of elements". The sum of the previous 3 elements, and the current element (&lt;code&gt;sourceArray[i]&lt;/code&gt;), are stored in the Fenwick tree array at the current index. &lt;/li&gt;
&lt;li&gt;This process continues until all indices are filled with sums of elements from the source array. The Fenwick tree will now contain elements that are all prefixes across a certain range of elements (1,2,4,8 etc.).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;To calculate a prefix sum using a Fenwick tree, if we are given an index &lt;code&gt;i&lt;/code&gt; we simply loop &lt;code&gt;while (i &amp;gt; 0)&lt;/code&gt; (0 is our root node with no data in it) and inside the loop add the current element at &lt;code&gt;i&lt;/code&gt; to a running sum &lt;code&gt;sum += tree[i]&lt;/code&gt;. Then we decrement &lt;code&gt;i&lt;/code&gt; by flipping its last set bit to &lt;code&gt;0&lt;/code&gt; (&lt;code&gt;i -= i &amp;amp; (-i)&lt;/code&gt;). The process is repeated until all of &lt;code&gt;i&lt;/code&gt;'s bits are flipped to &lt;code&gt;0&lt;/code&gt;. 

&lt;ul&gt;
&lt;li&gt;Flipping last set bits like this means we traverse up the tree to &lt;code&gt;i&lt;/code&gt;'s parent &amp;amp; any ancestor nodes (indices). As indices can have a range of responsibility larger than one, we do not have to visit &lt;em&gt;all&lt;/em&gt; the elements in a range to calculate the prefix sum - just &lt;code&gt;i&lt;/code&gt; and its ancestors in the tree.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;To increase or decrease the value in a Fenwick tree stored at an index &lt;code&gt;i&lt;/code&gt; by an integer &lt;code&gt;x&lt;/code&gt;, the logic is very similar to prefix sum:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;while(i &amp;lt; tree.Length)&lt;/code&gt; we add &lt;code&gt;x&lt;/code&gt; to the current index &lt;code&gt;tree[i] += x&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;After updating the current index, we need to update all of &lt;code&gt;i&lt;/code&gt;'s ancestors. &lt;/li&gt;
&lt;li&gt;To do this we add the last set bit of &lt;code&gt;i&lt;/code&gt; to itself, &lt;code&gt;i += i &amp;amp; (-i)&lt;/code&gt; (so &lt;code&gt;0010&lt;/code&gt; (2) becomes &lt;code&gt;0100&lt;/code&gt; (4)). &lt;/li&gt;
&lt;li&gt;We continue adding &lt;code&gt;x&lt;/code&gt; to &lt;code&gt;tree[i]&lt;/code&gt; and incrementing &lt;code&gt;i&lt;/code&gt; whilst &lt;code&gt;i&lt;/code&gt; is smaller than our tree length (to prevent going out of bounds).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The logic in our add operation (just described), of incrementing &lt;code&gt;i&lt;/code&gt; by its last set bit (which leaves us with an index representing &lt;code&gt;i&lt;/code&gt;'s parent), is used when constructing a Fenwick tree in &lt;code&gt;O(n)&lt;/code&gt;. We loop over every element in the source array, find &lt;code&gt;i&lt;/code&gt;'s parent and populate data in our tree. This effectively partitions our Fenwick tree array into indices with ranges of responsibility.&lt;/li&gt;
&lt;/ul&gt;
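&lt;p&gt;The query and add loops just described are only a few lines each. A minimal sketch with 1-based indexing (the names here are illustrative):&lt;/p&gt;

```csharp
using System;

public class FenwickTree
{
    private readonly long[] tree; // index 0 is the unused virtual root

    public FenwickTree(int n) { tree = new long[n + 1]; }

    // Add x to the element at 1-based index i, then update every
    // ancestor responsible for i by adding the last set bit of i to itself.
    public void Add(int i, long x)
    {
        for (; i < tree.Length; i += i & (-i)) tree[i] += x;
    }

    // Prefix sum of elements 1..i: walk toward the root by
    // repeatedly clearing the last set bit of i.
    public long PrefixSum(int i)
    {
        long sum = 0;
        for (; i > 0; i -= i & (-i)) sum += tree[i];
        return sum;
    }
}
```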

&lt;p&gt;Below you will find a Fenwick tree implementation with test cases. There are other basic operations a Fenwick tree can support by reusing the prefix sum or add operations; some of these have been implemented below:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>String Searching Using Rabin-Karp</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Tue, 05 May 2020 03:39:41 +0000</pubDate>
      <link>https://dev.to/jjb/detecting-plagiarism-string-searching-using-rabin-karp-5gf2</link>
      <guid>https://dev.to/jjb/detecting-plagiarism-string-searching-using-rabin-karp-5gf2</guid>
      <description>&lt;p&gt;The Rabin-Karp algorithm is a string-searching algorithm that uses hashing to find a pattern (or patterns) in an input string.&lt;/p&gt;

&lt;p&gt;It is not the quickest algorithm for finding a single pattern, however a small tweak can allow the algorithm to perform well for multiple patterns. Using Rabin-Karp to search for multiple patterns could come in handy for things like detecting common words, sentences, or characters between documents or even source code.&lt;/p&gt;

&lt;p&gt;In this post I will detail resources for learning about Rabin-Karp, go over its key features, &amp;amp; provide a complete + efficient implementation of the algorithm (with test cases).&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm"&gt;Rabin-Karp Wikipedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/BRO7mVIFt08?list=PLUl4u3cNGP61Oq3tWYp6V_F-5jb5L2iHb&amp;amp;t=1729"&gt;MIT Video Explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=BfUejqd07yo"&gt;Rolling Hash Video Explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ellard.org/dan/www/Q-97/HTML/root/node43.html"&gt;Article Explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://algs4.cs.princeton.edu/53substring/RabinKarp.java.html"&gt;Implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Takeaways:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Aside from a tricky &lt;strong&gt;rolling-hash&lt;/strong&gt;, the algorithm is simple compared to other string-searching algorithms.&lt;/li&gt;
&lt;li&gt;The algorithm is as follows:

&lt;ul&gt;
&lt;li&gt;Accept an input &lt;code&gt;i&lt;/code&gt; and a pattern &lt;code&gt;p&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Define a variable &lt;code&gt;n&lt;/code&gt; to be the length of &lt;code&gt;i&lt;/code&gt; and a variable &lt;code&gt;m&lt;/code&gt; to be the length of &lt;code&gt;p&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Create a hash of pattern &lt;code&gt;p&lt;/code&gt; (&lt;code&gt;hashP&lt;/code&gt;) and a hash of length &lt;code&gt;m&lt;/code&gt; of input &lt;code&gt;i&lt;/code&gt; (&lt;code&gt;hashI&lt;/code&gt;).

&lt;ul&gt;
&lt;li&gt;I.e for an input &lt;code&gt;i = "abcde"&lt;/code&gt; and a pattern length &lt;code&gt;m&lt;/code&gt; of 2, &lt;code&gt;hashI&lt;/code&gt; would be a hash of the substring &lt;code&gt;"ab"&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;hashP == hashI&lt;/code&gt; then we have possibly found pattern &lt;code&gt;p&lt;/code&gt; at the start of input &lt;code&gt;i&lt;/code&gt;. Check each character of the substring representing &lt;code&gt;hashI&lt;/code&gt; with our pattern &lt;code&gt;p&lt;/code&gt; to be sure (call this function &lt;code&gt;CheckEqual()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Otherwise iterate from pattern length &lt;code&gt;m&lt;/code&gt; to input length &lt;code&gt;n&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;For each iteration recalculate the hash &lt;code&gt;hashI&lt;/code&gt;:

&lt;ul&gt;
&lt;li&gt;We are sliding a window across our input&lt;/li&gt;
&lt;li&gt;Window is of size &lt;code&gt;m&lt;/code&gt; (pattern length)&lt;/li&gt;
&lt;li&gt;This means for each iteration we are removing a leading character from our hash and adding a trailing character&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;i = "abcde"&lt;/code&gt;, &lt;code&gt;m = 2&lt;/code&gt;, and &lt;code&gt;hashI&lt;/code&gt; represents &lt;code&gt;"ab"&lt;/code&gt; - then our first iteration would remove &lt;code&gt;"a"&lt;/code&gt; from &lt;code&gt;hashI&lt;/code&gt; and add &lt;code&gt;"c"&lt;/code&gt;. Meaning &lt;code&gt;hashI&lt;/code&gt; now represents &lt;code&gt;"bc"&lt;/code&gt;. This part of the operation is done in constant time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;After rehashing, check if &lt;code&gt;hashP == hashI&lt;/code&gt;. If they are the same, run our character equality check &lt;code&gt;CheckEqual()&lt;/code&gt; to be certain.&lt;/li&gt;
&lt;li&gt;Explore all of &lt;code&gt;i&lt;/code&gt; until we have found our pattern or no pattern exists in &lt;code&gt;i&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A rolling hash is a hash function where the input is hashed in a &lt;strong&gt;sliding window&lt;/strong&gt;. As the window moves through the input, the hash is updated more quickly than rehashing all the characters in the new window - as the previous window and the current window share characters. 

&lt;ul&gt;
&lt;li&gt;E.g. if our input is &lt;code&gt;"abcde"&lt;/code&gt; and our window size is 3, we first hash &lt;code&gt;"abc"&lt;/code&gt;, then slide our window and next hash &lt;code&gt;"bcd"&lt;/code&gt; - a rolling hash takes advantage of the fact that &lt;code&gt;"bc"&lt;/code&gt; was already hashed. Instead of hashing &lt;code&gt;"bc"&lt;/code&gt; again, a rolling hash removes &lt;code&gt;"a"&lt;/code&gt;'s contribution and appends &lt;code&gt;"d"&lt;/code&gt;'s. This means the hash for &lt;code&gt;"bc"&lt;/code&gt; is only calculated once across the two window hashes.&lt;/li&gt;
&lt;li&gt;Extrapolated over larger inputs, and more times, a rolling hash can be very efficient. It's the reason Rabin-Karp is reasonably fast at string-searching.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Rabin-Karp's rolling hash is calculated modulo a large prime number. This keeps hash values within a bounded range and makes their distribution more uniform.

&lt;ul&gt;
&lt;li&gt;Choosing a large prime helps us avoid false positives - where two hashes are the same but the values they represent are not.&lt;/li&gt;
&lt;li&gt;Choosing a random prime means we can guard against the worst case running time, as no specific inputs will affect the running time predictably.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The rolling hash is computed using a Base/Radix that is relatively prime to the modulus. This value should be at least as large as the character set. For my implementation I used 256 - the size of the extended ASCII character set.&lt;/li&gt;
&lt;li&gt;A common hash for an &lt;code&gt;m&lt;/code&gt;-character window in Rabin-Karp is: &lt;code&gt;(S[0]*R^(m-1) + S[1]*R^(m-2) + ... + S[m-1]) % Q&lt;/code&gt;. Where &lt;code&gt;S&lt;/code&gt; is our input string, &lt;code&gt;R&lt;/code&gt; is the radix/base, &lt;code&gt;m&lt;/code&gt; is the pattern length, and &lt;code&gt;Q&lt;/code&gt; is the prime. 

&lt;ul&gt;
&lt;li&gt;Using &lt;a href="https://en.wikipedia.org/wiki/Horner%27s_method"&gt;Horner's Method&lt;/a&gt; this can be simplified in code to:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight csharp"&gt;&lt;code&gt;        &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;Hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;patternLength&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;patternLength&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;++)&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;ascii&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
                &lt;span class="c1"&gt;// Assume Base/Prime are already defined&lt;/span&gt;
                &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Base&lt;/span&gt; &lt;span class="p"&gt;*&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="n"&gt;ascii&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;%&lt;/span&gt; &lt;span class="n"&gt;Prime&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Time complexity of Rabin-Karp is linear (specifically, amortized &lt;code&gt;O(n + m)&lt;/code&gt; where &lt;code&gt;n&lt;/code&gt; is input length and &lt;code&gt;m&lt;/code&gt; is pattern length).&lt;/li&gt;
&lt;li&gt;Brute-force approaches, and worst case Rabin-Karp (more likely with small, non-random primes and/or a bad rolling hash), are &lt;code&gt;O(n * m)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Space is &lt;code&gt;O(1)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;We can modify Rabin-Karp with a &lt;a href="https://llimllib.github.io/bloomfilter-tutorial/"&gt;Bloom Filter&lt;/a&gt; or simply a hash set to search for multiple patterns. &lt;/li&gt;
&lt;li&gt;In my implementations I return the index (aka offset) the pattern starts at in the input. You could easily modify either the single or multiple pattern implementations to do other things (like count occurrences of patterns).&lt;/li&gt;
&lt;/ul&gt;
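&lt;p&gt;Putting the pieces together - the initial hash, the constant-time window slide, and the &lt;code&gt;CheckEqual()&lt;/code&gt; verification - a minimal single-pattern sketch looks like this (the fixed prime here is an illustrative simplification; my full implementation below uses a random one):&lt;/p&gt;

```csharp
using System;

public static class RabinKarp
{
    private const long Q = 1000000007; // large prime modulus (fixed here for simplicity)
    private const long R = 256;        // radix: size of the extended ASCII character set

    // Returns the first index where pattern occurs in input, or -1 if absent.
    public static int Search(string input, string pattern)
    {
        int n = input.Length, m = pattern.Length;
        if (m == 0 || m > n) return m == 0 ? 0 : -1;

        long rm = 1; // R^(m-1) % Q: the weight of the window's leading character
        for (int k = 1; k < m; k++) rm = rm * R % Q;

        long hashP = Hash(pattern, m), hashI = Hash(input, m);
        if (hashP == hashI && CheckEqual(input, 0, pattern)) return 0;

        for (int i = m; i < n; i++)
        {
            // Slide the window: drop input[i - m]'s contribution, append input[i].
            hashI = (hashI + Q - rm * input[i - m] % Q) % Q;
            hashI = (hashI * R + input[i]) % Q;
            int start = i - m + 1;
            if (hashP == hashI && CheckEqual(input, start, pattern)) return start;
        }
        return -1;
    }

    // Horner's method hash of the first `length` characters of s.
    private static long Hash(string s, int length)
    {
        long hash = 0;
        for (int i = 0; i < length; i++) hash = (R * hash + s[i]) % Q;
        return hash;
    }

    // Character-by-character check to rule out a hash false positive.
    private static bool CheckEqual(string input, int start, string pattern)
    {
        for (int i = 0; i < pattern.Length; i++)
            if (input[start + i] != pattern[i]) return false;
        return true;
    }
}
```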

&lt;p&gt;Below are Rabin-Karp implementations for single &amp;amp; multiple patterns (of the same size). The implementations assume ASCII input and will work for inputs up to a reasonably large size. The rolling hash uses a relatively large random prime to guard against poor running times:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Sliding Window Technique</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Mon, 20 Apr 2020 03:10:34 +0000</pubDate>
      <link>https://dev.to/jjb/sliding-window-technique-dj7</link>
      <guid>https://dev.to/jjb/sliding-window-technique-dj7</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=MK-NZ4hN7rs"&gt;Video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://leetcode.com/problems/permutation-in-string"&gt;Leetcode problem&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The sliding window technique is where we use two pointers on a collection. We say that one pointer represents the start of the window and the other the end of the window. The space/elements between the two pointers is the window size.&lt;/li&gt;
&lt;li&gt;A sliding window can be of fixed length or it can be variable length (meaning the window expands/contracts dynamically).

&lt;ul&gt;
&lt;li&gt;Fixed length is often easier to grasp/think about. Once the window reaches its maximum size, both pointers, say &lt;code&gt;i&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt;, are incremented by the same amount on each iteration of a loop.&lt;/li&gt;
&lt;li&gt;Variable length is where &lt;code&gt;i&lt;/code&gt; and &lt;code&gt;j&lt;/code&gt; move at different paces. Often questions requiring this technique don't provide a window size, but ask for the largest/smallest possible window given some constraints.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Here are some questions we can use the sliding window technique on:

&lt;ul&gt;
&lt;li&gt;In an array of &lt;code&gt;n&lt;/code&gt; integers, find a contiguous &lt;code&gt;k&lt;/code&gt; length subarray that has the largest value when its elements are summed.

&lt;ul&gt;
&lt;li&gt;We can use two pointers here and our window would be of size &lt;code&gt;k&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;What is the size of the minimum subarray that when summed will equal a target sum?

&lt;ul&gt;
&lt;li&gt;We can use two pointers here, but because we want the &lt;em&gt;minimum&lt;/em&gt; length subarray (i.e the question is asking how big/small the window is) the window can't be of fixed length.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;What is the largest subarray of &lt;code&gt;k&lt;/code&gt; distinct characters?

&lt;ul&gt;
&lt;li&gt;We can use two pointers here. This will again be a variable length window, as we are being asked what the size of the window is (largest subarray). We will also need to use an auxiliary data structure to keep track of how many distinct characters are in a given subarray.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;A less obvious question where sliding window would come in handy:

&lt;ul&gt;
&lt;li&gt;Given two strings &lt;code&gt;inputA&lt;/code&gt; &amp;amp; &lt;code&gt;inputB&lt;/code&gt;, determine if any permutation of &lt;code&gt;inputA&lt;/code&gt; is a substring of &lt;code&gt;inputB&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;Here we can use two pointers and the window will be of fixed size/won't ever grow larger than the length of &lt;code&gt;inputA&lt;/code&gt;. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
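&lt;p&gt;To make the two window styles concrete, here is a short sketch of a fixed-size window (largest sum of a &lt;code&gt;k&lt;/code&gt;-length subarray) and a variable-size window (smallest subarray whose sum reaches a target - the common "at least the target" variant). The names here are illustrative:&lt;/p&gt;

```csharp
using System;

public static class SlidingWindow
{
    // Fixed window: largest sum of any contiguous subarray of length k.
    public static int MaxSumOfK(int[] nums, int k)
    {
        int windowSum = 0;
        for (int i = 0; i < k; i++) windowSum += nums[i]; // seed the first window
        int best = windowSum;
        for (int j = k; j < nums.Length; j++)
        {
            windowSum += nums[j] - nums[j - k]; // slide: add trailing, drop leading
            best = Math.Max(best, windowSum);
        }
        return best;
    }

    // Variable window: length of the smallest contiguous subarray with sum >= target
    // (0 if no such subarray exists).
    public static int MinSubarrayLength(int[] nums, int target)
    {
        int i = 0, sum = 0, best = int.MaxValue;
        for (int j = 0; j < nums.Length; j++)
        {
            sum += nums[j];              // expand the window at the right
            while (sum >= target)        // contract from the left while valid
            {
                best = Math.Min(best, j - i + 1);
                sum -= nums[i++];
            }
        }
        return best == int.MaxValue ? 0 : best;
    }
}
```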

&lt;p&gt;Below are solutions to all the problems mentioned. I have annotated the source code with time &amp;amp; space complexities for each solution:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Minimum Spanning Tree (Kruskal's Algorithm)</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Sun, 12 Apr 2020 00:33:40 +0000</pubDate>
      <link>https://dev.to/jjb/minimum-spanning-tree-kruskal-s-algorithm-19al</link>
      <guid>https://dev.to/jjb/minimum-spanning-tree-kruskal-s-algorithm-19al</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;p&gt;This post requires knowledge of graphs and union-find (covered in earlier posts).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=fAuF0EuZVCk"&gt;Kruskal's algorithm video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=JZBQLXgSGfs&amp;amp;list=PLDV1Zeh2NRsBI1C-mR6ZhHTyfoEJWlxvq"&gt;Another video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mission-peace/interview/blob/master/src/com/interview/graph/KruskalMST.java"&gt;OO implementation of Kruskal's&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Minimum_spanning_tree"&gt;Wikipedia article on Minimum Spanning Tree&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Minimum Spanning Tree (MST) is a subset of edges of an undirected, connected, weighted graph. 

&lt;ul&gt;
&lt;li&gt;This means a MST connects all vertices together, without cycles, using the smallest possible total edge weight.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;One algorithm for finding the MST of a graph is &lt;strong&gt;Kruskal's Algorithm&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Kruskal's algorithm is a &lt;strong&gt;greedy algorithm&lt;/strong&gt; - this means it will make locally optimum choices, with an intent of finding the overall optimal solution.&lt;/li&gt;
&lt;li&gt;Kruskal's algorithm relies on the &lt;strong&gt;union-find&lt;/strong&gt; data structure.

&lt;ul&gt;
&lt;li&gt;First the algorithm sorts the graph's edges in ascending order (by weight). &lt;/li&gt;
&lt;li&gt;Then for every edge, if its vertices have different root vertices (determined by union-find's &lt;code&gt;Find()&lt;/code&gt;), it will add the edge to a list &amp;amp; &lt;code&gt;Union()&lt;/code&gt; its vertices within the union-find data structure.&lt;/li&gt;
&lt;li&gt;If roots are the same, it will skip the edge.&lt;/li&gt;
&lt;li&gt;The final list represents the MST of the graph.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Another common algorithm for finding the MST of a graph is &lt;a href="https://en.wikipedia.org/wiki/Prim%27s_algorithm"&gt;Prim's Algorithm&lt;/a&gt;. Commonly, Prim's uses a heap or priority queue in its implementation.&lt;/li&gt;
&lt;li&gt;Time complexity for Kruskal's algorithm is &lt;code&gt;O(e log v)&lt;/code&gt; where &lt;code&gt;e&lt;/code&gt; is the number of edges and &lt;code&gt;v&lt;/code&gt; is the number of vertices in the graph. Space is &lt;code&gt;O(e + v)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
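&lt;p&gt;The three steps above - sort, compare roots, union - can be sketched in a few lines. This compact version inlines a simple union-find and only tallies the MST weight; my full implementation below is more complete:&lt;/p&gt;

```csharp
using System;
using System.Linq;

public static class Kruskal
{
    // edges[i] = {weight, u, v}, with vertices numbered 0..vertexCount-1.
    // Returns the total weight of the MST; assumes the graph is connected.
    public static int MstWeight(int vertexCount, int[][] edges)
    {
        // Minimal inlined union-find: every vertex starts as its own root.
        int[] parent = Enumerable.Range(0, vertexCount).ToArray();

        int Find(int x)
        {
            while (parent[x] != x) { parent[x] = parent[parent[x]]; x = parent[x]; }
            return x;
        }

        int total = 0;
        foreach (var e in edges.OrderBy(e => e[0])) // 1. sort edges ascending by weight
        {
            int ru = Find(e[1]), rv = Find(e[2]);
            if (ru == rv) continue; // 2. same root: edge would create a cycle, skip it
            parent[ru] = rv;        // 3. different roots: union the two components
            total += e[0];          // keep the edge (here we just tally its weight)
        }
        return total;
    }
}
```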

&lt;p&gt;Below is my implementation of Kruskal's algorithm:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Union-find (Disjoint-set)</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Sun, 05 Apr 2020 23:28:01 +0000</pubDate>
      <link>https://dev.to/jjb/union-find-disjoint-set-28c5</link>
      <guid>https://dev.to/jjb/union-find-disjoint-set-28c5</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/playlist?list=PLDV1Zeh2NRsBI1C-mR6ZhHTyfoEJWlxvq"&gt;Youtube playlist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ID00PMy0-vE"&gt;Video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://algs4.cs.princeton.edu/15uf/"&gt;Princeton explanation &amp;amp; implementations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Disjoint-set_data_structure"&gt;Wikipedia article&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The union-find data structure (also known as disjoint-set) keeps track of elements that are partitioned into disjoint subsets (also called components).&lt;/li&gt;
&lt;li&gt;When elements are initially added to union-find, they are all considered members of their own component. That means if we create a union-find with 3 elements it will also contain 3 components (each component being a single element). This is achieved by having each element specify itself as its parent (self-referencing).&lt;/li&gt;
&lt;li&gt;Union-find has three main operations: &lt;code&gt;MakeSet()&lt;/code&gt;, &lt;code&gt;Union()&lt;/code&gt;, &amp;amp; &lt;code&gt;Find()&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MakeSet()&lt;/code&gt; takes an element, adds it to the underlying collection/tree, gives it either a &lt;code&gt;rank&lt;/code&gt; of 0 or &lt;code&gt;size&lt;/code&gt; of 1 (explained later on), and specifies a parent pointer to itself (which creates a new component).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Find()&lt;/code&gt; takes an element &lt;code&gt;x&lt;/code&gt; and traverses its parent chain. &lt;code&gt;Find()&lt;/code&gt; tells us what component &lt;code&gt;x&lt;/code&gt; is in by finding the root element of that component. If two elements have the same root, they are in the same component.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Union()&lt;/code&gt; takes two elements &lt;code&gt;x &amp;amp; y&lt;/code&gt; and merges the components they belong to together. &lt;code&gt;Union()&lt;/code&gt; leverages &lt;code&gt;Find()&lt;/code&gt; to determine what components &lt;code&gt;x &amp;amp; y&lt;/code&gt; are in. If the roots are the same, &lt;code&gt;x &amp;amp; y&lt;/code&gt; are in the same component, and no action is taken. If the roots are different, this means &lt;code&gt;x &amp;amp; y&lt;/code&gt; are in separate components. These components are merged by pointing the root of one of the components to the root of the other (one root becomes the parent of the other).

&lt;ul&gt;
&lt;li&gt;How do we determine which component is the one to merge into the other? By using &lt;code&gt;rank&lt;/code&gt; or &lt;code&gt;size&lt;/code&gt; mentioned earlier.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Union by size&lt;/strong&gt; merges the smallest component (fewest elements) into the larger one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Union by rank&lt;/strong&gt; merges the component with the shorter tree into the taller one. Each component has a rank, which starts as 0. If two components are unioned and they have the same rank then the resulting component's rank is increased by 1. If the unioned components have a different rank, then the resulting component's rank is equal to the highest rank between the two. 

&lt;ul&gt;
&lt;li&gt;We use rank instead of tree height or depth because of a technique called &lt;strong&gt;path compression&lt;/strong&gt; that changes the height/depth of the components over time.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Path compression is utilized in the &lt;code&gt;Find()&lt;/code&gt; operation to flatten a component's tree. It does this by making every element on the traversed chain point directly to the component's root (each such element's parent is now the root of the component). This makes the &lt;code&gt;Find()&lt;/code&gt; operation &lt;em&gt;faster&lt;/em&gt; for that component, as it has a flatter chain to traverse when looking for the component's root.&lt;/li&gt;
&lt;li&gt;Union-find is often implemented with either an array or using a more &lt;a href="https://github.com/mission-peace/interview/blob/master/src/com/interview/graph/DisjointSet.java"&gt;object-oriented approach&lt;/a&gt;. In my implementations I use an array.&lt;/li&gt;
&lt;li&gt;Union-find is a useful data structure when used on graphs. Union-find can be used for things such as efficiently connecting vertices, finding connected components, &amp;amp; determining &lt;a href="https://en.wikipedia.org/wiki/Minimum_spanning_tree"&gt;minimum spanning tree&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;With union by &lt;code&gt;rank&lt;/code&gt; or &lt;code&gt;size&lt;/code&gt; alone, both &lt;code&gt;Union()&lt;/code&gt; &amp;amp; &lt;code&gt;Find()&lt;/code&gt; operations are &lt;code&gt;O(log n)&lt;/code&gt;. Adding path compression brings the amortized complexity of both operations down to &lt;code&gt;O(α(n))&lt;/code&gt; (the inverse Ackermann function), which is effectively constant. Space is &lt;code&gt;O(n)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
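&lt;p&gt;A compact sketch of the array-based approach with union by rank and path compression (the names here are illustrative; the full implementations with test cases follow below):&lt;/p&gt;

```csharp
using System;

public class UnionFind
{
    private readonly int[] parent;
    private readonly int[] rank;

    public UnionFind(int n)
    {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i; // each element starts as its own root
    }

    // Find with path compression: point every node on the chain at the root.
    public int Find(int x)
    {
        if (parent[x] != x) parent[x] = Find(parent[x]);
        return parent[x];
    }

    // Union by rank: attach the shorter tree under the taller one.
    // Returns false if x and y were already in the same component.
    public bool Union(int x, int y)
    {
        int rx = Find(x), ry = Find(y);
        if (rx == ry) return false;
        if (rank[rx] < rank[ry]) (rx, ry) = (ry, rx); // ensure rx is the taller root
        parent[ry] = rx;
        if (rank[rx] == rank[ry]) rank[rx]++;         // equal ranks: result grows by 1
        return true;
    }
}
```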

&lt;p&gt;Below are array based union-find implementations by size &amp;amp; rank:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Extending An Iterator</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Sun, 29 Mar 2020 04:21:57 +0000</pubDate>
      <link>https://dev.to/jjb/part-22-extending-an-iterator-125c</link>
      <guid>https://dev.to/jjb/part-22-extending-an-iterator-125c</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://leetcode.com/discuss/interview-question/341818/Google-or-Onsite-or-Skip-Iterator"&gt;How to design a skip iterator&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.oracle.com/javase/7/docs/api/java/util/Iterator.html"&gt;Java's Iterator reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.ienumerator-1"&gt;C#'s IEnumerator reference&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Iterator"&gt;Wikipedia article on iterators&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iterators allow us to traverse and access elements in collections without the use of indexing.&lt;/li&gt;
&lt;li&gt;Iterators expose a consistent way to traverse all types of data structures, making the code more resilient to change.&lt;/li&gt;
&lt;li&gt;Iterators can allow the underlying collection to be modified during traversal, unlike when using indexing to iterate over a collection (like using a for loop on an array). This can mean insertion/deletion of items during traversal.&lt;/li&gt;
&lt;li&gt;Iterators exist in many languages; in C# they are called &lt;strong&gt;enumerators&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;Time &amp;amp; space complexities are noted in the code comments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below you will find a class that takes an enumerator in its constructor and extends the basic functionality of an iterator with the addition of &lt;code&gt;Skip()&lt;/code&gt;, &lt;code&gt;Peek()&lt;/code&gt;, &amp;amp; &lt;code&gt;Remove()&lt;/code&gt; operations:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
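&lt;p&gt;To make the idea concrete, here is a minimal Java sketch of a skip iterator (illustrative names, not the gist's code; &lt;code&gt;Remove()&lt;/code&gt; is omitted for brevity). Skipped values are recorded in a map and discarded lazily while advancing:&lt;/p&gt;

```java
import java.util.*;

// Wraps an Iterator and adds skip() and peek().
class SkipIterator implements Iterator<Integer> {
    private final Iterator<Integer> it;
    private final Map<Integer, Integer> skipCounts = new HashMap<>();
    private Integer buffered; // next element to hand out; null when exhausted

    SkipIterator(Iterator<Integer> it) {
        this.it = it;
        advance();
    }

    // Pull elements until one is not pending a skip. O(1) amortized per element.
    private void advance() {
        buffered = null;
        while (it.hasNext()) {
            int candidate = it.next();
            Integer pending = skipCounts.get(candidate);
            if (pending == null) { buffered = candidate; return; }
            if (pending == 1) skipCounts.remove(candidate);
            else skipCounts.put(candidate, pending - 1);
        }
    }

    @Override public boolean hasNext() { return buffered != null; }

    @Override public Integer next() {
        if (buffered == null) throw new NoSuchElementException();
        int result = buffered;
        advance();
        return result;
    }

    // Look at the next element without consuming it.
    Integer peek() {
        if (buffered == null) throw new NoSuchElementException();
        return buffered;
    }

    // Skip the next occurrence of value (the buffered element, or a later one).
    void skip(int value) {
        if (buffered != null && buffered == value) { advance(); return; }
        skipCounts.merge(value, 1, Integer::sum);
    }
}
```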


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Checking If An Undirected Graph Is Bipartite</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Thu, 19 Mar 2020 02:29:35 +0000</pubDate>
      <link>https://dev.to/jjb/part-21-checking-if-an-undirected-graph-is-bipartite-2n2d</link>
      <guid>https://dev.to/jjb/part-21-checking-if-an-undirected-graph-is-bipartite-2n2d</guid>
      <description>&lt;p&gt;&lt;em&gt;2021/01/05 - The previous C# code sample implemented the algorithm incorrectly. A corrected Java implementation has been added in its place.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you are unfamiliar with graphs, check out some of my earlier posts on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=za_BGCGJzSs"&gt;Video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://youtu.be/GhjwOiJ4SqU?t=28"&gt;Brief video overview (clip in larger video)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Bipartite_graph#Testing_bipartiteness"&gt;Explanation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bipartite graph (bigraph) is a graph whose vertices can be divided into two disjoint, independent sets &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt;. Every edge connects a vertex from one set to a vertex in the other (there are no intra-set edges, i.e. edges going from a vertex in &lt;code&gt;u&lt;/code&gt; to another vertex in &lt;code&gt;u&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;One way to visualize a bipartite graph is to colour all the vertices in each set the same colour. Set &lt;code&gt;u&lt;/code&gt; could be red vertices, whereas &lt;code&gt;v&lt;/code&gt; could be black. Every edge would then consist of a red and black pair of vertices.&lt;/li&gt;
&lt;li&gt;This type of two-colouring is impossible in non-bipartite graphs. Think of a graph with three vertices arranged in a triangle. We cannot represent this graph as two independent sets, and we cannot two-colour it in such a way that will allow each edge to have different coloured endpoints.&lt;/li&gt;
&lt;li&gt;One way to check whether a graph is bipartite is to run a depth-first search (DFS) over the vertices, applying a two-colouring as we go.

&lt;ul&gt;
&lt;li&gt;Start at a random vertex &lt;code&gt;v&lt;/code&gt; and colour it colour1 (red, for example).&lt;/li&gt;
&lt;li&gt;Colour all adjacent vertices &lt;code&gt;u&lt;/code&gt; the opposite colour of &lt;code&gt;v&lt;/code&gt;. For each adjacent &lt;code&gt;u&lt;/code&gt;, also recursively call our DFS routine.&lt;/li&gt;
&lt;li&gt;If a graph is bipartite, we can complete this two-colouring without a contradiction.&lt;/li&gt;
&lt;li&gt;If the graph is not bipartite, then at some point a vertex will get both colours - and this contradiction means we cannot achieve a two-colouring of the graph.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Time complexity is &lt;code&gt;O(v + e)&lt;/code&gt; for an adjacency list. Space complexity is &lt;code&gt;O(v)&lt;/code&gt;. For an adjacency matrix, the time &amp;amp; space complexity would be &lt;code&gt;O(v^2)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Undirected graph that can be two-coloured:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hkSb89PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m5kgo8y9kkjcg8zcv1cq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hkSb89PU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m5kgo8y9kkjcg8zcv1cq.jpeg" alt="Bipartite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Undirected graph that cannot be two-coloured:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q-VkfNsV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b24uq3w61dabt8p7rejl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q-VkfNsV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/b24uq3w61dabt8p7rejl.jpeg" alt="Not Bipartite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below are implementations for checking if undirected graphs are bipartite. There are solutions for both adjacency list &amp;amp; adjacency matrix representations of undirected graphs:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
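&lt;p&gt;The DFS two-colouring described above can be sketched as follows (a minimal Java illustration for an adjacency-list graph; names are illustrative):&lt;/p&gt;

```java
import java.util.*;

// DFS two-colouring check on an undirected adjacency-list graph (vertices 0..n-1).
class BipartiteCheck {
    // colour[v]: 0 = unvisited; 1 and 2 are the two colours.
    static boolean isBipartite(List<List<Integer>> adj) {
        int n = adj.size();
        int[] colour = new int[n];
        for (int v = 0; v < n; v++) {
            // Start a DFS in every unvisited vertex - the graph may be disconnected.
            if (colour[v] == 0 && !dfs(adj, colour, v, 1)) return false;
        }
        return true;
    }

    private static boolean dfs(List<List<Integer>> adj, int[] colour, int v, int c) {
        colour[v] = c;
        for (int u : adj.get(v)) {
            if (colour[u] == c) return false; // contradiction: both endpoints same colour
            // Colour unvisited neighbours the opposite colour (3 - c flips 1 <-> 2).
            if (colour[u] == 0 && !dfs(adj, colour, u, 3 - c)) return false;
        }
        return true;
    }
}
```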


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>java</category>
    </item>
    <item>
      <title>Finding Strongly Connected Components in Directed Graphs using Tarjan's Algorithm</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Mon, 09 Mar 2020 03:51:58 +0000</pubDate>
      <link>https://dev.to/jjb/part-20-finding-strongly-connected-components-in-directed-graphs-using-tarjan-s-algorithm-48eb</link>
      <guid>https://dev.to/jjb/part-20-finding-strongly-connected-components-in-directed-graphs-using-tarjan-s-algorithm-48eb</guid>
      <description>&lt;p&gt;If you are unfamiliar with graphs, check out some of my earlier posts on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=TyWtx7q2D7Y"&gt;Video explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm"&gt;Article explanation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/williamfiset/Algorithms/blob/master/src/main/java/com/williamfiset/algorithms/graphtheory/TarjanSccSolverAdjacencyList.java"&gt;Implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tz-8X3gD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e8d54d3ffyhms5as94o8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tz-8X3gD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/e8d54d3ffyhms5as94o8.jpg" alt="Strongly Connected Components"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;strongly connected component&lt;/strong&gt; in a directed graph is a partition or sub-graph where each vertex of the component is reachable from every other vertex in the component. &lt;/li&gt;
&lt;li&gt;Strongly connected components are always the &lt;em&gt;maximal&lt;/em&gt; sub-graph, meaning none of their vertices are part of another strongly connected component.&lt;/li&gt;
&lt;li&gt;Finding strongly connected components is very similar to finding articulation points &amp;amp; bridges. Many of the solutions for finding them involve &lt;strong&gt;depth-first search&lt;/strong&gt; (DFS).&lt;/li&gt;
&lt;li&gt;One way to find strongly connected components is using &lt;strong&gt;Tarjan's Algorithm&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Tarjan's Algorithm uses DFS and a stack to find strongly connected components.&lt;/li&gt;
&lt;li&gt;The algorithm overview is:

&lt;ul&gt;
&lt;li&gt;For every unvisited vertex &lt;code&gt;v&lt;/code&gt; in the graph, perform DFS.&lt;/li&gt;
&lt;li&gt;At the start of each DFS routine, mark the current vertex &lt;code&gt;v&lt;/code&gt; as visited, push &lt;code&gt;v&lt;/code&gt; onto the stack, and assign &lt;code&gt;v&lt;/code&gt; an ID and low-link value.&lt;/li&gt;
&lt;li&gt;Initially, like with articulation points &amp;amp; bridges, the ID and low-link values of &lt;code&gt;v&lt;/code&gt; will be the same. &lt;/li&gt;
&lt;li&gt;Increment the integer value we are using as the ID/low-link seed. So the next vertex &lt;code&gt;w&lt;/code&gt; to enter our DFS routine will get a higher value for both (compared to &lt;code&gt;v&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;For every adjacent vertex &lt;code&gt;u&lt;/code&gt; of &lt;code&gt;v&lt;/code&gt;, if &lt;code&gt;u&lt;/code&gt; is unvisited, perform DFS (recursive call).&lt;/li&gt;
&lt;li&gt;On exit of the recursive DFS call, or if &lt;code&gt;u&lt;/code&gt; had already been visited, check if &lt;code&gt;u&lt;/code&gt; is on the stack.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;u&lt;/code&gt; is on the stack, update the low-link value of &lt;code&gt;v&lt;/code&gt; to be the smallest of &lt;code&gt;low-link[v]&lt;/code&gt; and &lt;code&gt;low-link[u]&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;low-link represents the &lt;em&gt;earliest&lt;/em&gt; ancestor of a vertex (a lower value means the vertex entered DFS earlier). In other words, &lt;code&gt;low-link[v]&lt;/code&gt; is the ID of the earliest vertex &lt;code&gt;y&lt;/code&gt; that &lt;code&gt;v&lt;/code&gt; can reach.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;After exploring all adjacent vertices of &lt;code&gt;v&lt;/code&gt;, check whether the ID of &lt;code&gt;v&lt;/code&gt; is the same as its low-link value.&lt;/li&gt;
&lt;li&gt;If they are the same, then &lt;code&gt;v&lt;/code&gt; is the root of a strongly connected component.&lt;/li&gt;
&lt;li&gt;In that case, pop all vertices off the stack up to and including &lt;code&gt;v&lt;/code&gt;. Together these vertices form a single strongly connected component.&lt;/li&gt;
&lt;li&gt;Complete the above steps for the entire graph, and all strongly connected components will be discovered.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Time complexity of Tarjan's Algorithm is &lt;code&gt;O(v + e)&lt;/code&gt; - where &lt;code&gt;v&lt;/code&gt; is the number of vertices, and &lt;code&gt;e&lt;/code&gt; the number of edges, in a graph.&lt;/li&gt;
&lt;li&gt;Space complexity is &lt;code&gt;O(v)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are implementations for finding strongly connected components in directed adjacency list &amp;amp; adjacency matrix representations of graphs:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
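&lt;p&gt;The algorithm overview above can be sketched like this (a minimal Java illustration for an adjacency-list graph; names are illustrative):&lt;/p&gt;

```java
import java.util.*;

// Tarjan's algorithm on a directed adjacency-list graph (vertices 0..n-1).
class TarjanScc {
    private static final int UNVISITED = -1;
    private final List<List<Integer>> adj;
    private final int[] ids, low;          // discovery ID and low-link per vertex
    private final boolean[] onStack;
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final List<List<Integer>> sccs = new ArrayList<>();
    private int nextId = 0;                // ID/low-link seed

    TarjanScc(List<List<Integer>> adj) {
        this.adj = adj;
        int n = adj.size();
        ids = new int[n];
        low = new int[n];
        onStack = new boolean[n];
        Arrays.fill(ids, UNVISITED);
        for (int v = 0; v < n; v++) if (ids[v] == UNVISITED) dfs(v);
    }

    private void dfs(int v) {
        ids[v] = low[v] = nextId++;        // ID and low-link start out equal
        stack.push(v);
        onStack[v] = true;
        for (int u : adj.get(v)) {
            if (ids[u] == UNVISITED) dfs(u);
            // On exit of the recursive call, or if u was already visited:
            if (onStack[u]) low[v] = Math.min(low[v], low[u]);
        }
        if (ids[v] == low[v]) {            // v is the root of an SCC
            List<Integer> scc = new ArrayList<>();
            int w;
            do { w = stack.pop(); onStack[w] = false; scc.add(w); } while (w != v);
            sccs.add(scc);
        }
    }

    List<List<Integer>> components() { return sccs; }
}
```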


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Finding Articulation Points &amp; Bridges in Undirected Graphs</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Wed, 26 Feb 2020 05:03:32 +0000</pubDate>
      <link>https://dev.to/jjb/part-19-finding-articulation-points-bridges-in-undirected-graphs-26pc</link>
      <guid>https://dev.to/jjb/part-19-finding-articulation-points-bridges-in-undirected-graphs-26pc</guid>
      <description>&lt;p&gt;If you are unfamiliar with graphs, check out some of my earlier posts on them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=aZXi1unBdJA"&gt;Video explanation of articulation points &amp;amp; bridges&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=2kREIkF9UAs"&gt;Video explanation and implementation for articulation points&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=iGsxKUzW3cs"&gt;Video on bridges &amp;amp; blocks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/mission-peace/interview/blob/master/src/com/interview/graph/ArticulationPoint.java"&gt;Articulation points implementation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/williamfiset/Algorithms/blob/master/src/main/java/com/williamfiset/algorithms/graphtheory/BridgesAdjacencyList.java"&gt;Bridges implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;component&lt;/strong&gt; is a sub-graph that is not connected to the rest of the graph.&lt;/li&gt;
&lt;li&gt;An articulation point is a vertex that when removed (along with its edges) creates more components in the graph.&lt;/li&gt;
&lt;li&gt;Another name for articulation point is &lt;strong&gt;cut vertex&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;A bridge is an edge that when removed creates more components in the graph.&lt;/li&gt;
&lt;li&gt;Another name for a bridge is &lt;strong&gt;cut-edge&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;We can use &lt;strong&gt;depth-first search&lt;/strong&gt; (DFS) to find all the articulation points or bridges in a graph.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There are two rules for finding articulation points in undirected graphs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The root vertex in a graph is an articulation point if it has more than one child.&lt;/li&gt;
&lt;li&gt;Any other vertex &lt;code&gt;v&lt;/code&gt; in the graph is an articulation point if it has a child &lt;code&gt;u&lt;/code&gt; that cannot reach any ancestor of &lt;code&gt;v&lt;/code&gt; without also visiting &lt;code&gt;v&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;This means that there is no &lt;strong&gt;back-edge&lt;/strong&gt; between &lt;code&gt;u&lt;/code&gt; and any ancestor before &lt;code&gt;v&lt;/code&gt;. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;How do we apply DFS and follow the above two rules?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We maintain a discovery time (or ID) for each vertex &lt;code&gt;v&lt;/code&gt; in the graph. We also keep track of the earliest vertex &lt;code&gt;y&lt;/code&gt; it is connected to - we call this the &lt;strong&gt;low-link&lt;/strong&gt; value (&lt;code&gt;y&lt;/code&gt; is the low-link vertex).&lt;/li&gt;
&lt;li&gt;So during DFS, for every vertex &lt;code&gt;v&lt;/code&gt; we visit we assign it an ID &amp;amp; a low-link value. To start, we set both ID/low-link to be the same integer that we will increment by 1 each time we visit a new vertex. (For example: root vertex will get 0 assigned to it for ID/low-link, then we will increment both by 1 for the next vertex).&lt;/li&gt;
&lt;li&gt;For every vertex &lt;code&gt;u&lt;/code&gt;, that is adjacent to &lt;code&gt;v&lt;/code&gt;, we will recursively call our DFS routine.&lt;/li&gt;
&lt;li&gt;When our DFS finds no more adjacent vertices to visit, the stack begins to unwind. As it unwinds, we will update our low-link value for each adjacent vertex (to represent the earliest ancestor it is connected to).&lt;/li&gt;
&lt;li&gt;If we find that the ID (visited time) of the current vertex &lt;code&gt;v&lt;/code&gt; is &lt;code&gt;&amp;lt;=&lt;/code&gt; the low-link value of our adjacent vertex &lt;code&gt;u&lt;/code&gt;, then &lt;code&gt;v&lt;/code&gt; is an articulation point.

&lt;ul&gt;
&lt;li&gt;This is because it means &lt;code&gt;u&lt;/code&gt; cannot reach one of its ancestor vertices &lt;em&gt;without&lt;/em&gt; also visiting/going via &lt;code&gt;v&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;

&lt;p&gt;Finding bridges is very similar to finding articulation points. The main changes to the algorithm are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We no longer need to keep track of how many children the root vertex has.&lt;/li&gt;
&lt;li&gt;Instead of checking if the ID of &lt;code&gt;v&lt;/code&gt; is &lt;code&gt;&amp;lt;=&lt;/code&gt; the low-link of &lt;code&gt;u&lt;/code&gt;, we check only whether it is strictly &lt;code&gt;&amp;lt;&lt;/code&gt;. 

&lt;ul&gt;
&lt;li&gt;We check only for strictly less than because, if the values are equal, a back-edge is present - and a back-edge means a bridge cannot exist.&lt;/li&gt;
&lt;li&gt;A back-edge means that a neighbour &lt;code&gt;w&lt;/code&gt; of &lt;code&gt;u&lt;/code&gt; has an edge connecting back to &lt;code&gt;v&lt;/code&gt;, creating a cycle. So if &lt;code&gt;edge(u,v)&lt;/code&gt; is removed, &lt;code&gt;u&lt;/code&gt; still connects to &lt;code&gt;v&lt;/code&gt; via &lt;code&gt;edge(w,v)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If there is no such back-edge (cycle), then the ID of &lt;code&gt;v&lt;/code&gt; will always be less than the low-link of &lt;code&gt;u&lt;/code&gt; when a bridge is present. This means that if &lt;code&gt;edge(u,v)&lt;/code&gt; is removed, &lt;code&gt;u&lt;/code&gt; and all its adjacent vertices become disconnected from the rest of the graph - forming another component (and increasing the number of components in the graph).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Time complexity is &lt;code&gt;O(v + e)&lt;/code&gt; for an adjacency list. Space complexity is &lt;code&gt;O(v)&lt;/code&gt;. For an adjacency matrix, the time &amp;amp; space complexity would be &lt;code&gt;O(v^2)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Below is an example of an undirected graph with articulation points and bridges:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M8k6vY1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a2b6382pk7aeiuelnawo.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M8k6vY1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a2b6382pk7aeiuelnawo.jpg" alt="Articulation points and bridgs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below are implementations for finding articulation points &amp;amp; bridges in undirected adjacency list &amp;amp; adjacency matrix representations of graphs:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
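&lt;p&gt;As a concrete sketch of the bridge-finding variant described above (minimal Java for an adjacency-list graph; names are illustrative, and parallel edges between the same pair of vertices are not handled):&lt;/p&gt;

```java
import java.util.*;

// DFS bridge-finding on an undirected adjacency-list graph (vertices 0..n-1).
class Bridges {
    private final List<List<Integer>> adj;
    private final int[] ids, low;          // discovery ID and low-link per vertex
    private final boolean[] visited;
    private final List<int[]> bridges = new ArrayList<>();
    private int nextId = 0;

    Bridges(List<List<Integer>> adj) {
        this.adj = adj;
        int n = adj.size();
        ids = new int[n];
        low = new int[n];
        visited = new boolean[n];
        for (int v = 0; v < n; v++) if (!visited[v]) dfs(v, -1);
    }

    private void dfs(int v, int parent) {
        visited[v] = true;
        ids[v] = low[v] = nextId++;
        for (int u : adj.get(v)) {
            if (u == parent) continue;     // the tree edge back to the parent is not a back-edge
            if (!visited[u]) {
                dfs(u, v);
                low[v] = Math.min(low[v], low[u]);
                // Strictly less than: u cannot reach v or any ancestor of v another way.
                if (ids[v] < low[u]) bridges.add(new int[]{v, u});
            } else {
                low[v] = Math.min(low[v], ids[u]); // back-edge to an earlier vertex
            }
        }
    }

    List<int[]> found() { return bridges; }
}
```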


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
    <item>
      <title>Topological Sorting of Directed Acyclic Graphs (DAGs)</title>
      <dc:creator>JB</dc:creator>
      <pubDate>Fri, 21 Feb 2020 04:36:02 +0000</pubDate>
      <link>https://dev.to/jjb/part-18-topological-sorting-directed-acyclic-graphs-dags-37ca</link>
      <guid>https://dev.to/jjb/part-18-topological-sorting-directed-acyclic-graphs-dags-37ca</guid>
      <description>&lt;h2&gt;
  
  
  Resources:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=AfSk24UTFS8&amp;amp;t=2511s"&gt;MIT video explaining topological sort&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=eL-KzMXSXXI"&gt;Video explanation and walkthrough&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=ddTC4Zovtbc"&gt;Video explanation and implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topological sort is an algorithm that produces a &lt;strong&gt;linear ordering&lt;/strong&gt; of a graph's vertices such that for every directed edge &lt;code&gt;v -&amp;gt; u&lt;/code&gt;, vertex &lt;code&gt;v&lt;/code&gt; comes before vertex &lt;code&gt;u&lt;/code&gt; in the ordering. &lt;/li&gt;
&lt;li&gt;There can be more than one valid topological ordering of a graph's vertices.&lt;/li&gt;
&lt;li&gt;Topological sort only works for &lt;strong&gt;Directed Acyclic Graphs&lt;/strong&gt; (&lt;strong&gt;DAGs&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Undirected graphs, or graphs with cycles (cyclic graphs), have edges with no clear start and end. Think of &lt;code&gt;v -&amp;gt; u&lt;/code&gt;: in an undirected graph this edge would be &lt;code&gt;v &amp;lt;--&amp;gt; u&lt;/code&gt;. The same is true for a graph with a back edge (cycle) - the ordering is ambiguous, as we cannot say which vertex comes before the other.&lt;/li&gt;
&lt;li&gt;Here is a visualization of what a topological sort might produce:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DAG before topological sort&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UqjMlNR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1ckd1qm7h2ikbi3nkiiq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UqjMlNR5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1ckd1qm7h2ikbi3nkiiq.jpeg" alt="DAG"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three valid topological orderings of the DAG's vertices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MbXkXOC7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/v1w4m4k5rxadsc88v3t6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MbXkXOC7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/v1w4m4k5rxadsc88v3t6.jpeg" alt="Topological ordering"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;ul&gt;
&lt;li&gt;One important thing to consider when thinking about topological ordering is the &lt;strong&gt;in-degree&lt;/strong&gt; and &lt;strong&gt;out-degree&lt;/strong&gt; of each vertex.

&lt;ul&gt;
&lt;li&gt;In-degree is how many edges point to a vertex (incoming edges).&lt;/li&gt;
&lt;li&gt;Out-degree is how many edges point from a vertex (outgoing edges).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Vertices with an in-degree of 0 will tend to be towards the beginning of a topological ordering, because no other vertex points to them.&lt;/li&gt;
&lt;li&gt;Vertices with an out-degree of 0 will be towards the end of a topological ordering, because they don't point to any vertices - they are solely destination vertices in directed edges.&lt;/li&gt;
&lt;li&gt;There are two main ways to perform topological sort: &lt;strong&gt;Kahn's Algorithm&lt;/strong&gt; &amp;amp; &lt;strong&gt;Depth-First Search&lt;/strong&gt; (DFS).&lt;/li&gt;
&lt;li&gt;Robert Tarjan is credited with being the first to write about using DFS for topological sorting.&lt;/li&gt;
&lt;li&gt;Kahn's algorithm relies on pre-calculating the in-degree of each vertex.&lt;/li&gt;
&lt;li&gt;Tarjan's approach uses DFS to achieve a topological ordering of a graph's vertices.&lt;/li&gt;
&lt;li&gt;In this post I will be covering DFS and &lt;em&gt;not&lt;/em&gt; Kahn's algorithm.&lt;/li&gt;
&lt;li&gt;If you wish to learn more about Kahn's algorithm &lt;a href="https://www.educative.io/edpresso/what-is-topological-sort"&gt;see this explanation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Using DFS for topological sorting is actually &lt;em&gt;almost identical&lt;/em&gt; to vanilla DFS, the main difference is that we keep track of a collection of vertices to return (which will be the topological ordering of vertices).&lt;/li&gt;
&lt;li&gt;In Tarjan's approach we loop through each vertex of the graph

&lt;ul&gt;
&lt;li&gt;If we haven't visited the vertex before we run DFS on that vertex.&lt;/li&gt;
&lt;li&gt;We mark each vertex in the DFS routine as visited before exploring each of the vertex's neighbours. &lt;/li&gt;
&lt;li&gt;If a neighbour has already been visited, it is skipped. &lt;/li&gt;
&lt;li&gt;Before each DFS execution completes, it will add the current vertex to a &lt;code&gt;Stack&lt;/code&gt; or prepend it to a &lt;code&gt;List&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;List&lt;/code&gt; or &lt;code&gt;Stack&lt;/code&gt; represents the topological ordering of vertices. &lt;/li&gt;
&lt;li&gt;If we use a &lt;code&gt;Stack&lt;/code&gt; we can pop each vertex from the stack, and the order in which they are popped will be the topological order. &lt;/li&gt;
&lt;li&gt;For &lt;code&gt;List&lt;/code&gt;, if we always prepend (add new vertices at the start), then the list will represent our topological ordering at completion of the algorithm.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;Stack&lt;/code&gt; incurs its cost &lt;em&gt;after&lt;/em&gt; all vertices have been added (popping them off in order). A &lt;code&gt;List&lt;/code&gt; incurs a cost on each addition (we cannot append; we have to prepend).&lt;/li&gt;
&lt;li&gt;We could also append to a collection (like &lt;code&gt;List&lt;/code&gt;) and then reverse it after adding all vertices.&lt;/li&gt;
&lt;li&gt;Either way we choose will incur an &lt;code&gt;O(v)&lt;/code&gt; cost - so it is a matter of preference.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Run time of DFS for topological sort of an adjacency list is linear &lt;code&gt;O(v + e)&lt;/code&gt; - where &lt;code&gt;v&lt;/code&gt; is number of vertices and &lt;code&gt;e&lt;/code&gt; is number of edges. Space complexity is &lt;code&gt;O(v)&lt;/code&gt;. For an adjacency matrix, both are &lt;code&gt;O(v^2)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Some applications of topological sort:

&lt;ul&gt;
&lt;li&gt;Can be used to detect cycles and find strongly connected components in graphs. &lt;/li&gt;
&lt;li&gt;School: class prerequisites &amp;amp; what order you have to take the classes in.&lt;/li&gt;
&lt;li&gt;Build systems: What order should the dependencies get built in? Also helps to enforce that there are no circular dependencies (as topological sort detects cycles).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below are implementations of DFS for topological sorting of adjacency list &amp;amp; adjacency matrix representations of graphs:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
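&lt;p&gt;The stack-based DFS described above can be sketched as follows (a minimal Java illustration for an adjacency-list graph; names are illustrative, and the input is assumed to be a DAG - no cycle detection):&lt;/p&gt;

```java
import java.util.*;

// DFS-based topological sort of a DAG given as an adjacency list (vertices 0..n-1).
class TopoSort {
    static List<Integer> sort(List<List<Integer>> adj) {
        int n = adj.size();
        boolean[] visited = new boolean[n];
        Deque<Integer> stack = new ArrayDeque<>(); // vertices pushed in post-order
        for (int v = 0; v < n; v++) if (!visited[v]) dfs(adj, visited, stack, v);
        List<Integer> order = new ArrayList<>();
        while (!stack.isEmpty()) order.add(stack.pop()); // pop order = topological order
        return order;
    }

    private static void dfs(List<List<Integer>> adj, boolean[] visited,
                            Deque<Integer> stack, int v) {
        visited[v] = true;
        for (int u : adj.get(v)) if (!visited[u]) dfs(adj, visited, stack, u);
        stack.push(v); // push only after all of v's descendants are done
    }
}
```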


&lt;p&gt;As always, if you found any errors in this post please let me know!&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>algorithms</category>
      <category>datastructures</category>
      <category>csharp</category>
    </item>
  </channel>
</rss>
