<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vinit Jogani</title>
    <description>The latest articles on DEV Community by Vinit Jogani (@vnjogani).</description>
    <link>https://dev.to/vnjogani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F69914%2F5fe60bd4-bdb9-427e-a6c1-690fbfc5c53c.webp</url>
      <title>DEV Community: Vinit Jogani</title>
      <link>https://dev.to/vnjogani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vnjogani"/>
    <language>en</language>
    <item>
      <title>Essential Tips for Junior Developers: What Thousands of Code Reviews Taught Me</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Tue, 25 Jun 2024 11:54:19 +0000</pubDate>
      <link>https://dev.to/vnjogani/essential-tips-for-junior-developers-what-thousands-of-code-reviews-taught-me-2cga</link>
      <guid>https://dev.to/vnjogani/essential-tips-for-junior-developers-what-thousands-of-code-reviews-taught-me-2cga</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Over the last year or so, I reviewed over 2,000 merge requests from almost 50 engineers, many of whom were junior engineers just starting their careers. After a while, I started to notice a pattern of frequently occurring issues in the code reviews. With some GPT-assisted analysis, I compiled the following set of tips for junior developers to help them write better code. This goes beyond the basics like exception handling, documentation, or unit tests—those are important, but I believe everyone understands that to some extent. Below are the often underestimated aspects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 1: The IDE is your best friend!
&lt;/h2&gt;

&lt;p&gt;Many developers do not fully utilize the tools available in modern IDEs, from auto-formatters to linters that can catch stylistic issues and even some errors. This is especially important for interpreted languages like Python, where there is no compiler to catch errors beforehand. Setting up tools like Pylint, Flake8, and Black can save you from many runtime exceptions and make your code more consistent.&lt;/p&gt;
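&lt;p&gt;As a starting point, here is a minimal sketch of shared tool configuration in &lt;code&gt;pyproject.toml&lt;/code&gt; (Black and Pylint both read this file; Flake8 uses its own config file instead):&lt;/p&gt;

```toml
# pyproject.toml -- minimal sketch of centralized tool configuration
[tool.black]
line-length = 88

[tool.pylint.messages_control]
disable = ["missing-module-docstring"]
```

Checking this file into the repository means every contributor gets the same formatting and lint rules automatically.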

&lt;p&gt;Configuring your IDE properly can significantly improve productivity. Use hotkeys for faster search and navigation, and take advantage of tools like port-forwarding when connecting to remote SSH systems. Stack-specific extensions, such as those for Django templates or YAML files, can also make development much easier and faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 2: Avoid nesting whenever possible
&lt;/h2&gt;

&lt;p&gt;From a code structure and complexity point of view, nesting makes the code hard to read and reason about. Deep indentation requires keeping track of more context, which can be mentally taxing. Using early returns in functions and early continues in loops can dramatically simplify your code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example: Avoiding nesting by using early return and continue
def process_items(items):
    if not items:
        return

    for item in items:
        if not item.is_valid():
            continue
        # Process valid items
        process(item)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tip 3: Avoid DB queries inside loops at all costs
&lt;/h2&gt;

&lt;p&gt;One of the biggest performance pitfalls is the overhead of database queries within loops. Each query adds IO latency, and ORMs can hide the fact that certain property accesses may result in multiple queries per iteration. This can severely slow down your application and database server.&lt;/p&gt;

&lt;p&gt;Leverage joins, prefetching of related fields, or other ORM features to minimize queries. Use logging to track which queries are being executed unexpectedly often. Understanding your ORM and the SQL it generates is a valuable skill for any project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Example: Using select_related to avoid queries in a loop
# Django ORM example: user is a foreign key, so a single joined
# query fetches it up front instead of once per iteration
orders = Order.objects.select_related('user')
for order in orders:
    process(order.user)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tip 4: Understand data access patterns and choose appropriate data structures
&lt;/h2&gt;

&lt;p&gt;When implementing most features, the lazy option is to use a &lt;code&gt;List&lt;/code&gt; or &lt;code&gt;Dictionary&lt;/code&gt; for everything. Many junior developers fall prey to this despite their better judgement and despite knowing all the various data structures from school. &lt;/p&gt;

&lt;p&gt;Different data access patterns require different data structures. Using a Set instead of a List can dramatically improve performance when making many existence checks in a loop. Similarly, using a Dictionary instead of .FirstOrDefault() on a list in .NET can significantly improve performance.&lt;/p&gt;
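&lt;p&gt;As a rough illustration of the Set-vs-List point, a quick micro-benchmark sketch in Python (sizes and repetition counts are arbitrary):&lt;/p&gt;

```python
import timeit

haystack_list = list(range(10_000))
haystack_set = set(haystack_list)

# A list membership check scans O(n) elements; a set does an O(1)
# expected-time hash lookup.
list_time = timeit.timeit(lambda: 9_999 in haystack_list, number=1_000)
set_time = timeit.timeit(lambda: 9_999 in haystack_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```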

&lt;p&gt;More advanced data structures &lt;em&gt;should&lt;/em&gt; be considered too when applicable. For example, one merge request I reviewed required lookups based on 4-5 fields, with the caveat that one of those fields was numeric and needed to be compared using range checks. Naturally, built-in data structures did not help much, but a custom binary-search-based method dramatically improved performance.&lt;/p&gt;
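&lt;p&gt;To sketch the idea (with hypothetical field names: exact match on &lt;code&gt;category&lt;/code&gt;, range lookup on &lt;code&gt;price&lt;/code&gt;), one can bucket by the exact-match fields and binary-search the sorted numeric field within each bucket:&lt;/p&gt;

```python
import bisect
from collections import defaultdict

def build_index(rows):
    # Bucket rows by the exact-match field, then sort each bucket
    # by the numeric field so it supports binary search.
    index = defaultdict(list)
    for row in rows:
        index[row["category"]].append((row["price"], row["name"]))
    for group in index.values():
        group.sort()
    return index

def lookup(index, category, lo, hi):
    # Binary-search the sorted (price, name) pairs for prices in [lo, hi].
    group = index.get(category, [])
    start = bisect.bisect_left(group, (lo, ""))
    end = bisect.bisect_right(group, (hi, "\uffff"))
    return [name for _, name in group[start:end]]
```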

&lt;p&gt;Premature optimization is discouraged, but a basic understanding of performance implications can guide better coding practices. Be aware of the time complexity of algorithms and the memory footprint of your data structures. Profiling tools can help identify bottlenecks in your application. Optimize only after identifying real performance issues, and make data-driven decisions based on profiling results.&lt;/p&gt;
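&lt;p&gt;The standard library's &lt;code&gt;cProfile&lt;/code&gt; is enough to get started; a small sketch of a helper that profiles a single call:&lt;/p&gt;

```python
import cProfile
import io
import pstats

def profile_report(fn, *args):
    # Run fn under cProfile and return the top of its stats report as text.
    profiler = cProfile.Profile()
    profiler.enable()
    fn(*args)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()
```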

&lt;h2&gt;
  
  
  Tip 5: Become an expert at searching through code
&lt;/h2&gt;

&lt;p&gt;In a large codebase, someone has likely implemented similar functionality to what you need. By becoming proficient in searching through code, you can find reusable chunks of code or helper methods that have already been reviewed, tested, and optimized. This not only saves time but also ensures consistency across the codebase.&lt;/p&gt;

&lt;p&gt;Effective code search skills involve knowing how to use your IDE's search functionality, understanding the project's structure, and being familiar with naming conventions used by your team. Additionally, exploring the version history and previous implementations can provide valuable insights into why certain decisions were made.&lt;/p&gt;

&lt;p&gt;Duplication not only makes the codebase larger and harder to maintain but also increases the risk of inconsistencies and bugs. By leveraging existing machinery, you can eliminate redundancies and build on a solid foundation. This approach encourages collaboration and knowledge sharing within the team, as you become more familiar with your colleagues' work and the overall project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 6: Discipline to make small, cohesive merge requests
&lt;/h2&gt;

&lt;p&gt;Junior developers often try to make massive merge requests to ensure everything works before sending it for review. This makes it challenging for reviewers to thoroughly examine the code and provide constructive feedback. The longer the merge request, the fewer comments you are likely to receive, as reviewers might be overwhelmed by the sheer volume of changes.&lt;/p&gt;

&lt;p&gt;Smaller merge requests offer several benefits. They make it easier to write unit tests and ensure each change is well-tested. They also make large tasks more manageable by breaking them into meaningful steps. This approach promotes adherence to the single responsibility principle, where each class or function has a clear, focused purpose.&lt;/p&gt;

&lt;p&gt;To achieve this, plan your work in advance and identify logical checkpoints where you can split your changes. Write self-documenting code by using meaningful variable names, function names, and class names that clearly convey their purpose and usage. This makes your code easier to understand without extensive comments, improving readability and maintainability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 7: Read through a lot of code
&lt;/h2&gt;

&lt;p&gt;Reading through your own code and others' code is a useful exercise. Often, even reading your own merge request a day after submitting it can reveal hard-to-understand sections or obvious mistakes. This practice helps you gain a fresh perspective on your work and identify areas for improvement.&lt;/p&gt;

&lt;p&gt;Engage in peer code reviews as much as possible. Reviewing code is a critical part of the development process, and it provides an opportunity for junior developers to learn from their peers. Focus on both the correctness and readability of the code. Provide constructive feedback and suggest improvements, while also being open to receiving feedback on your own code.&lt;/p&gt;

&lt;p&gt;Through code reviews, you can learn different approaches and techniques to solve problems. This exposure helps you understand the rationale behind certain design choices and fosters a deeper understanding of the codebase. Over time, you will develop a keen eye for spotting potential issues and areas for optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 8: Master version control systems and the terminal in general
&lt;/h2&gt;

&lt;p&gt;Understanding and effectively using version control systems, particularly Git, is crucial. This involves more than just knowing how to commit, push, and pull changes. It includes understanding branching strategies, handling merge conflicts, and writing meaningful commit messages that provide context for your changes.&lt;/p&gt;

&lt;p&gt;Familiarize yourself with advanced Git commands and workflows, such as rebasing, cherry-picking, and bisecting to identify problematic commits. Learn how to use Git's history and log features to trace changes and understand the evolution of the codebase.&lt;/p&gt;

&lt;p&gt;Additionally, becoming comfortable with the terminal can greatly enhance your efficiency. Tools like screen, grep, sed, and awk are extremely useful for various tasks, such as searching through logs, editing files, and automating repetitive tasks. Embrace the terminal as a powerful tool that complements your development workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tip 9: Prioritize security from the start
&lt;/h2&gt;

&lt;p&gt;Security should be a fundamental consideration in your development process. Familiarize yourself with common security vulnerabilities, such as SQL injection, cross-site scripting (XSS), and cross-site request forgery (CSRF). Implement security best practices, such as input validation, encryption, and secure authentication methods.&lt;/p&gt;

&lt;p&gt;Regularly review and update your code to address potential security issues. Ensure that all routes have appropriate authorization from the start, and define certain invariants that all queries should follow. Always validate data on the backend, as client-side validation alone is insufficient.&lt;/p&gt;

&lt;p&gt;Understand the principles of secure coding and apply them consistently. For example, avoid storing sensitive information in JWTs: their payloads are base64-encoded, not encrypted, so anyone holding the token can read them. Use environment variables to store sensitive configuration details and avoid hardcoding them in your source code.&lt;/p&gt;
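&lt;p&gt;To make the JWT point concrete, a small sketch that decodes a token's payload with nothing but the standard library (illustrative only; real code should use a JWT library and verify the signature):&lt;/p&gt;

```python
import base64
import json

def jwt_payload(token):
    # A JWT is header.payload.signature, each part base64url-encoded,
    # not encrypted: anyone holding the token can read the payload.
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```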

&lt;h2&gt;
  
  
  Tip 10: Become very comfortable with the entire request lifecycle
&lt;/h2&gt;

&lt;p&gt;When implementing a new feature, it is easy to overlook intermediate steps like the reverse proxy (e.g., nginx), middlewares, decorators, and filters. Understanding the request lifecycle helps prevent mistakes ranging from security vulnerabilities to logical errors. Knowing how data is transferred from start to finish can help you appreciate why things are the way they are.&lt;/p&gt;

&lt;p&gt;For instance, understanding how middlewares can modify requests and responses can help you implement features such as logging, authentication, and error handling more effectively. Similarly, knowing how reverse proxies handle requests can help you optimize performance and ensure the security of your application.&lt;/p&gt;
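&lt;p&gt;For instance, a minimal WSGI middleware sketch (framework-agnostic; the hypothetical logging is just a stand-in for whatever cross-cutting concern you need):&lt;/p&gt;

```python
def logging_middleware(app):
    # Wrap a WSGI app: the middleware sees every request before the
    # application does, and can inspect or modify the response on the way out.
    def wrapped(environ, start_response):
        print("request:", environ.get("PATH_INFO"))
        return app(environ, start_response)
    return wrapped
```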

&lt;p&gt;Be aware of which parts of the state are ephemeral (i.e., die with the request) and which are stateful. For example, global or static variables may have a different lifespan than the request, requiring careful handling to avoid unintended side effects. This knowledge is crucial for debugging issues related to state management and concurrency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By incorporating these tips into your development routine, you can significantly enhance your coding skills, making your code more efficient, readable, and secure. Remember, continuous learning and adaptation are key in the ever-evolving field of software development. Embrace these best practices, and you'll find yourself writing better code and contributing more effectively to your team.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>python</category>
      <category>programming</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Making local development easier with Proxyman</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Wed, 17 Apr 2024 11:09:30 +0000</pubDate>
      <link>https://dev.to/vnjogani/making-local-development-easier-with-proxyman-21e3</link>
      <guid>https://dev.to/vnjogani/making-local-development-easier-with-proxyman-21e3</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many modern applications use a REST architecture with one frontend framework or another. For example, we have several projects where we use Django as an API server with a Svelte or Vue frontend. Likewise, we have projects with a .NET Core API and an Angular frontend. &lt;/p&gt;

&lt;p&gt;When we deploy these apps, we would typically build out the frontend code into a static package e.g. using ng build or vite build. Then, we would serve them using a reverse proxy like Nginx such that they are both served from the same URL. &lt;/p&gt;

&lt;p&gt;In development, however, the frontend and API servers are run separately, on separate ports. For example, you might use &lt;code&gt;ng serve&lt;/code&gt; or &lt;code&gt;vite serve&lt;/code&gt; to run your UI server with hot reloading and all that magic, and then run your API server separately. &lt;/p&gt;

&lt;p&gt;This presents many challenges, from storing cookies for XSRF protection to CORS policies that must be set up to allow access from the new origin. In development, you can always configure your environment to allow these hosts, but it is added complexity with every new project. In fact, that's why I see a lot of projects keep these servers separate even in production just to avoid the overhead, which isn't ideal. Besides, wouldn't you want your dev environment to more closely resemble your production environment?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So with all the above, I came to the following conclusion: Wouldn't it be easier to have a quick and dirty reverse proxy setup that is as easy as running one command while emulating how the app would eventually be deployed in production?&lt;/p&gt;

&lt;p&gt;Proxyman is just that. It allows you to define a quick development reverse proxy. Here are some examples; in each, the proxy server itself runs on port 5000, passed as the first argument.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example 1:&lt;/em&gt; Map all requests to / to localhost:8000 and all requests to /ui to localhost:5173.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;proxyman 5000 /:8000 /ui:5173
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example 2:&lt;/em&gt; Map all requests to / to localhost:8000 but fall back to localhost:8001 if it returns a 404.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;proxyman 5000 /:8000:8001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's pretty much it. You can use any combination of reverse proxies to run over localhost using this simple process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to get started?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All you need to do is install the &lt;a href="https://www.npmjs.com/package/@vjog/proxyman" rel="noopener noreferrer"&gt;NPM Package&lt;/a&gt; for Proxyman globally:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm i -g @vjog/proxyman&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

</description>
      <category>node</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>The Top Trick to Scalable Search Algorithms</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Wed, 14 Dec 2022 19:39:28 +0000</pubDate>
      <link>https://dev.to/vnjogani/the-top-trick-to-scalable-search-algorithms-4nk8</link>
      <guid>https://dev.to/vnjogani/the-top-trick-to-scalable-search-algorithms-4nk8</guid>
      <description>&lt;p&gt;&lt;em&gt;Aside&lt;/em&gt;: Okay, I agree -- that title is a bit clickbait, but it's not clickbait if it's (mostly) true... right? At least I did not call it a "secret" trick to build up the suspense for no reason... he says, as he only filibusters farther away from the point. But enough about that, let's jump into it.&lt;/p&gt;

&lt;p&gt;I'm going to come right out with this one -- I think, by far, the most powerful weapon in scaling search algorithms is &lt;strong&gt;hashing&lt;/strong&gt;. When I talk about search algorithms here, by the way, I mean something more general than search engines -- most complex algorithms that use Big Data are actually search algorithms. You may want to find similar customers (lookalike modeling) or compare certain parameters of products or recommend books to readers. It's all search! OK -- back to hashing.&lt;/p&gt;

&lt;p&gt;Hashing is so commonly used that it's almost underrated. Every computer scientist has likely studied all the nitty-gritty properties of hashing every which way in their data structures class but apart from its applications in hash-tables, I don't think many people realize all the powerful things hashing is capable of.&lt;/p&gt;

&lt;h2&gt;
  
  
  Brief Background
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;(feel free to skip if you already know what hashing is)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the uninitiated, hashing is a technique that takes an input of any size and converts it into a fixed-size output using a mathematical function. This output, called a hash, can be used to represent the original input in a more concise and efficient manner. Hashing is widely used in computer science for various purposes, including data storage and retrieval, security, and identification.&lt;/p&gt;

&lt;p&gt;Like mentioned earlier, perhaps the most common use of hashing is in hashtables. The goal here is that you want to efficiently store mappings of keys to values such that search/read/write is near constant time (for a sufficiently large hash table). Hashing allows us to map any arbitrary key into a fixed number of K buckets in a pseudo-random way such that this problem becomes possible and probabilistically efficient.&lt;/p&gt;

&lt;p&gt;Now there are a lot of other applications of hashing that are also very common. You may hash passwords so that the plaintext passwords, if leaked, are not as useful to attackers. You may also use hashes to verify the integrity of a file: e.g. after downloading a file, you can hash it and double-check that the hash exactly matches the one advertised by the author, ensuring the file has not been corrupted or tampered with. &lt;/p&gt;

&lt;p&gt;That's all cool, but I want to specifically discuss the more powerful applications of hashing within the context of writing scalable algorithms for difficult problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Start
&lt;/h2&gt;

&lt;p&gt;Suppose you have N files with M bytes each, and you are tasked with finding duplicates within these files. Here is pseudocode for a very naive algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;foreach pair of files A, B {
    duplicate = True
    foreach byte Ai, Bi {
        if Ai != Bi {
            duplicate = False
            break
        }
    }
    if duplicate: print(A, B)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is easy to see that this algorithm will take roughly &lt;code&gt;O(MN^2)&lt;/code&gt; steps because for each pair of files, we would need to compare each of the M bytes. Especially when N or M is large, this is going to be intractable.&lt;/p&gt;

&lt;p&gt;However, with hashing, you can do much better:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;foreach file F {
    hash = 0
    foreach byte B in F {
        hash = hash_function(hash, B)
    }
    if hash_table.contains(hash) {
        print(F, hash_table[hash])
    } else {
        hash_table.add(hash, F)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of comparing each pair of files byte by byte, we first compute the hash of each file and then compare the hashes instead. If two files have the same hash, there is a good chance that they are duplicates, and we can then perform a more expensive comparison to confirm this. This algorithm has a complexity of &lt;code&gt;O(MN)&lt;/code&gt;, assuming that the number of duplicates is actually a small subset, which is a significant improvement over the previous &lt;code&gt;O(MN^2)&lt;/code&gt; algorithm. &lt;/p&gt;
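&lt;p&gt;A concrete Python rendering of this idea, using SHA-256 over in-memory contents (a stand-in for reading real files from disk):&lt;/p&gt;

```python
import hashlib

def find_duplicates(files):
    # files: dict of name -> bytes content.
    # Hash each file once, then compare fixed-size digests
    # instead of comparing every pair of files byte by byte.
    seen = {}
    duplicates = []
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates.append((seen[digest], name))
        else:
            seen[digest] = name
    return duplicates
```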

&lt;p&gt;While this problem is a bit contrived, many problems share this structure. Say you want to find all products that have the same values for N features: instead of comparing each of the N features pairwise, you can compare just the hashes, which is likely a good enough approximation. You can also reuse these hashes over time, so this works well for online algorithms too.&lt;/p&gt;

&lt;p&gt;In fact, one reason this is especially powerful is that it fits very nicely into the MapReduce structure, which we can heavily parallelize across many cores/machines. When you need to compare pairs of files directly, every computer in your cluster needs access to all files. Hashing, however, is an independent step that you can run for all files in parallel across many machines. Then, you can map each file name to its hash and use MapReduce to find the collisions. Generally, hashing improves efficiency on a single core/machine, but its powers multiply when you combine it with Big Data tools like Spark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fuzzy Matching?!
&lt;/h2&gt;

&lt;p&gt;So exact matching of data is an obvious application of hashing which we can scale tremendously with parallel computing. However, what about more fuzzy matching?&lt;/p&gt;

&lt;p&gt;The interesting problem here is comparing sets. Say you have many sets of items, and you want to find all pairs of sets A and B that are similar, i.e. have a high Jaccard Similarity, which is just &lt;code&gt;(A intersection B)/(A union B)&lt;/code&gt;. Hashing the entire set to a single number won't help us here because that is only good enough for exact matching.&lt;/p&gt;
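&lt;p&gt;For reference, Jaccard Similarity over Python sets is a one-liner:&lt;/p&gt;

```python
def jaccard(a, b):
    # |A intersection B| / |A union B|; treat two empty sets as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```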

&lt;p&gt;One solution here is an algorithm called &lt;strong&gt;Min-Hashing&lt;/strong&gt;. In this algorithm, each item in a set is hashed to a random number, and the smallest hash value for each set is chosen as the hash value for that set. Typically we use K hash functions for this, and then we compute the similarity using element-wise comparisons of these K hash values per set. It can be shown that this is actually a very good approximation of the Jaccard Similarity of the original sets without having to do an intersection over union operation over the possibly many more elements in those original sets. Not only is this K a fixed size vector, we also only need to do element-wise comparisons i.e. compare the i-th hash value in two sets instead of all pairs of hash values.&lt;/p&gt;
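&lt;p&gt;A toy sketch of Min-Hashing, using Python's built-in tuple hashing seeded per function as a stand-in for properly independent hash functions:&lt;/p&gt;

```python
def minhash_signature(items, seeds):
    # One hash function per seed; the signature stores each function's minimum.
    return [min(hash((seed, x)) for x in items) for seed in seeds]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions approximates the
    # Jaccard similarity of the original sets.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

With K = 200 seeds, the estimate for two overlapping ranges typically lands close to the true Jaccard similarity at a fraction of the comparison cost.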

&lt;p&gt;Let's say each set has M items and we pick K to be much smaller than M (in practice, K can be very small and still give decent results). If intersections and unions are implemented with hash tables, they each take &lt;code&gt;O(M)&lt;/code&gt; steps, and we need to do this for all pairs of sets, so again &lt;code&gt;O(MN^2)&lt;/code&gt;. With Min-Hashing, we can do this in &lt;code&gt;O(KN^2)&lt;/code&gt;, which is better but, let's face it, not that great since we still have that &lt;code&gt;N^2&lt;/code&gt; term.&lt;/p&gt;

&lt;p&gt;Well, need I remind you: we still have the parallelism of the cloud. We can compute these K hashes in parallel and then use MapReduce, keying each set by each of its K hash values, for a highly parallel implementation of the above. The idea is that we may not need to compare all &lt;code&gt;N^2&lt;/code&gt; pairs at all: if we map sets to each of their K hash values, we only need to compare sets that share at least one hash value, which may be far fewer in practice.&lt;/p&gt;

&lt;p&gt;The recurring theme here is that hashing allows us to do very good approximations -- generally with Big Data, it's hard for us to get scalability at the same time as precision. In the vast majority of cases though, it doesn't matter!&lt;/p&gt;

&lt;p&gt;A very similar problem is finding near-duplicate documents, for example in Search at Google: given a large corpus, find all documents that are almost, but not exactly, identical. This is a difficult problem to solve with traditional algorithms. With Min-Hashing, we get very efficient implementations (if we consider documents to be sets of words or terms). In general, many problems can be framed as decomposing larger entities into sets of objects, e.g. a video is a set of images, a URL is a set of path segments, etc.&lt;/p&gt;

&lt;p&gt;There is another similar algorithm worth pointing out: &lt;strong&gt;Sim-Hashing&lt;/strong&gt;. It solves a very similar problem but operates on bit representations of the inputs. Say you have two images where one has some noise added to it: sim-hash may give you a fingerprint that is robust to that noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Context!!
&lt;/h2&gt;

&lt;p&gt;So far we have talked about matching mostly categorical data where there is no contextual meaning of the exact values themselves. If we were hashing real numbers, there is actually some semantic meaning that they capture in their relative distances. For example, 4 and 4.1 are more similar than 4 and 400. In some fuzzy matching algorithms, we would like our hashing to be more respectful of this additional context. &lt;/p&gt;

&lt;p&gt;Enter, &lt;strong&gt;locality sensitive hashing&lt;/strong&gt; (LSH). This is a technique for efficiently finding approximate nearest neighbors in high-dimensional data. It does this by hashing the input items into a series of buckets, such that similar items (e.g. close-by numbers) are more likely to be hashed to the same bucket. One of the key advantages of LSH is that it is highly scalable, making it well-suited to working with large datasets. By using randomization and hashing, LSH can find approximate nearest neighbors much more quickly than traditional methods, which often involve calculating distances between every pair of items in the dataset. This can greatly improve the performance of algorithms such as recommendation systems and anomaly detection.&lt;/p&gt;

&lt;p&gt;There is actually a very interesting paper by Google about &lt;a href="https://arxiv.org/pdf/2007.12002.pdf" rel="noopener noreferrer"&gt;Grale&lt;/a&gt; which uses LSH for graph learning -- the idea is to learn a graph by automatically building edges between similar items in a scalable way such that we don't need to compute pairwise distances.&lt;/p&gt;

&lt;p&gt;I actually had to use a similar algorithm recently: I had a long list of products which each had several numeric parameters: say length, width, and height. I wanted to group all products which had similar dimensions and doing pairwise matching on every query was just too inefficient. Using LSH, I was able to significantly cut down the comparisons that I actually had to make.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So hopefully I have expressed how much I love hashing-derived algorithms and the breadth of their applications. Like I have mentioned a few times, this is especially useful when approximate solutions are good enough and you can parallelize your operations across a cluster -- both of which are likely true for Big Data applications.&lt;/p&gt;

&lt;p&gt;I have personally used hashing tricks in many diverse real-world projects to great effect. I haven't even covered all the non-search, non-scalability related applications of hashing like cryptography. As a computer scientist, I think this is one of the most useful tools to have in your pocket given how versatile and powerful it can be.&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>cloud</category>
    </item>
    <item>
      <title>The optimal strategy for solving a Wordle....</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Thu, 31 Mar 2022 02:58:48 +0000</pubDate>
      <link>https://dev.to/vnjogani/the-optimal-strategy-for-solving-a-wordle-5fd7</link>
      <guid>https://dev.to/vnjogani/the-optimal-strategy-for-solving-a-wordle-5fd7</guid>
      <description>&lt;p&gt;Okay, so like many other people, I've been addicted to &lt;a href="https://www.nytimes.com/games/wordle/index.html" rel="noopener noreferrer"&gt;that word game&lt;/a&gt; for a while now. As any engineer, I began to wonder if there is an optimal strategy to play the game and if we can write a program to discover this. This post just goes through my thought process and a sequence of strategies that achieve varying levels of success on the game. Enough talk, let's dive right in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-work
&lt;/h2&gt;

&lt;p&gt;To test out all the algorithms in this post, we would first need a dictionary of all possible 5-letter words that Wordle could use. I ended up using NLTK to retrieve this list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from nltk.corpus import words
[w for w in words.words() if len(w) == 5]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gave me ~10K words to work with. In practice, Wordle uses a subset of these to be valid words so it's not a completely accurate dictionary, but hopefully that subset is a relatively uniform sample of the 10K words so our estimates on the full dictionary carry forward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idea 1
&lt;/h2&gt;

&lt;p&gt;Alright, intuitively, what do we try to do when we solve a Wordle? A common (and, at first glance, reasonable) strategy is to start by covering the most frequent letters, e.g. words with many vowels (like "adieu") that can really help us narrow down our search.&lt;/p&gt;

&lt;p&gt;We can implement this by a simple character-frequency counter to compute the most frequent characters. Then, we can score each of the words in our dictionary as the sum of the frequencies of each individual character. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score['adieu'] = freq['a'] + freq['d'] + ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
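&lt;p&gt;As a quick sketch of this scoring (using a tiny made-up three-word dictionary purely for illustration), the frequency-sum idea can be implemented like so:&lt;/p&gt;

```python
from collections import Counter

def score_words(dictionary):
    # Global letter frequencies across the whole dictionary.
    freq = Counter(c for w in dictionary for c in w)
    # A word's score is the sum of its letters' frequencies,
    # with repeated letters counted every time they appear.
    return {w: sum(freq[c] for c in w) for w in dictionary}

scores = score_words(["adieu", "alala", "crane"])
```

&lt;p&gt;Even on this toy dictionary, "alala" outscores "adieu" because its repeated "a"s are counted three times.&lt;/p&gt;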



&lt;p&gt;When we get some feedback from the game, e.g. that "a" and "e" are exact matches, "d" is a soft match, and "i" and "u" don't exist, we can prune our dictionary to only include words that satisfy these constraints and re-run the scoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1&lt;/strong&gt;: The first problem we run into here is that the optimal selection of the word may not be a word that satisfies all the existing constraints. At least in the "easy mode" on Wordle, where we are not &lt;em&gt;required&lt;/em&gt; to make use of all evidence received so far, it may be beneficial to use a word that violates some constraints but still offers more information. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2&lt;/strong&gt;: Another problem is that this scoring naturally favors words that repeat high-frequency letters. For example, "alala" happened to be a valid word in the dictionary, and with 3 "a"s it is naturally scored higher. However, that's clearly not valuable because it lacks coverage.&lt;/p&gt;

&lt;p&gt;To solve these problems, we come up with our second idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idea 2
&lt;/h2&gt;

&lt;p&gt;To solve the first problem, we can maintain two sets of words: (1) A = all words in the dictionary, and (2) S = words that satisfy the constraints so far. We rank &lt;em&gt;all&lt;/em&gt; words in &lt;code&gt;A&lt;/code&gt; but only compute the frequency maps using the words in &lt;code&gt;S&lt;/code&gt;. This makes sense because it lets us choose a word outside &lt;code&gt;S&lt;/code&gt; as long as it makes decent use of the information we have so far.&lt;/p&gt;

&lt;p&gt;To solve the second problem, we just add the score for each &lt;em&gt;unique&lt;/em&gt; letter in the word. That is, repeated letters don't get multiple points. This should help by scoring words like "alala" much lower because it will only get points from "a" and "l".&lt;/p&gt;
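&lt;p&gt;A minimal sketch of the unique-letter fix, again on a made-up toy dictionary:&lt;/p&gt;

```python
from collections import Counter

dictionary = ["adieu", "alala", "crane"]  # toy stand-in for the real word list
freq = Counter(c for w in dictionary for c in w)

def unique_score(word):
    # Each distinct letter is counted once, so repeats earn nothing extra.
    return sum(freq[c] for c in set(word))
```

&lt;p&gt;With this change, "adieu" now outscores "alala", as desired.&lt;/p&gt;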

&lt;p&gt;&lt;strong&gt;Problem 1&lt;/strong&gt;: Some answers do contain repeated letters, so we might need special checks to detect when we're close to solving the puzzle and, in that case, switch to a scoring mechanism that allows repeats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2&lt;/strong&gt;: While this strategy makes use of the green/black responses from the game, it doesn't make the best use of yellow responses (i.e. soft matches). That is, suppose we know that "d" exists in the target word but NOT at the second position, it should ideally rank "midst" higher than "adieu" because it will give us more information about the location of "d" assuming all other letters have a score of 0.&lt;/p&gt;

&lt;p&gt;We patch this issue in our third idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idea 3
&lt;/h2&gt;

&lt;p&gt;Let's start with the second problem first. The real issue there is that we treat letters in all positions equally when we know that position matters. Instead, we can compute position-wise character frequencies rather than global frequencies. Then, we can score words as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score['adieu'] = freq[('a', 0)] + freq[('d', 1)] + ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is, the score for "adieu" is how many times we have seen "a" in the 0-th position plus how many times we have seen "d" in the 1-st position, etc.&lt;/p&gt;
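&lt;p&gt;A sketch of the position-wise counting, assuming a small stand-in dictionary:&lt;/p&gt;

```python
from collections import Counter

dictionary = ["adieu", "aroma"]  # toy stand-in for the real word list
# Count (letter, position) pairs instead of bare letters.
freq = Counter((c, i) for w in dictionary for i, c in enumerate(w))

def positional_score(word):
    return sum(freq[(c, i)] for i, c in enumerate(word))
```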

&lt;p&gt;This nicely incorporates yellow feedback because the positional scores for those words will go down. It turns out that this actually works fairly well, and problem 1 seems to resolve itself automatically -- after just a few guesses, the repeat-lettered words start scoring high enough to be considered by the model.&lt;/p&gt;

&lt;p&gt;There are still some big problems, however.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1&lt;/strong&gt;: This strategy is still not perfect -- it uses the evidence a bit too much. For example, consider that we tried the word "bribe" and we got the feedback that all letters are correct except the one on the 4th position. What our model will do at the moment is try all the words like "brice", "bride" and "brine" one after the other. However, we could gain a lot more information by trying a word like "dance" which would tell us if "d", "n" or "c" occur and help us eliminate the other two directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2&lt;/strong&gt;: The BIG problem here is that our initial intuition about picking high-frequency letters is actually completely wrong. &lt;em&gt;HOW?&lt;/em&gt; Well, consider this. You find that "a" has the highest frequency. You try a word with "a" and you find that it does indeed exist in your word. So what can you do with it? Not much -- because it's already a high-frequency &lt;em&gt;letter&lt;/em&gt;, you aren't able to eliminate many words. Instead, you ideally want to pick a letter with frequency somewhere in the middle so that, either way, you can eliminate about 50% of the remaining words.&lt;/p&gt;

&lt;p&gt;We can fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idea 4
&lt;/h2&gt;

&lt;p&gt;Again, let's start with the second problem. The "pick a letter from the middle" concept actually has a formal name: find the letters with the highest entropy. That is, we don't care for letters that occur in 99% of the words or in 1% of the words; we care about letters that occur in ~50% of the words. These letters have higher entropy. In particular, let's normalize all our frequencies by dividing by the size of the dictionary so that they become relative frequencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;p[('a', 0)] = freq[('a', 0)] / N  # where N is len(A)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are 0-1 values. Then, we compute entropy as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;entropy[x] = p[x] * (1 - p[x])  # for all x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This value is highest when &lt;code&gt;p[x]&lt;/code&gt; is close to 50% and low on either side. This means we can then swap out our scoring mechanism to be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score['adieu'] = entropy[('a', 0)] + entropy[('d', 1)] + ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which brings us to the first problem -- combining local (position-dependent) and global (position-independent) information. Here, we can use a little hack -- instead of relying solely on global frequency counts or position-wise counts, compute both scores and then average them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local_score['adieu'] = entropy[('a', 0)] + entropy[('d', 1)] + ...
global_score['adieu'] = entropy['a'] + entropy['d'] + ...

score = (local_score['adieu'] + global_score['adieu']) / 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is no strong conceptual reason for combining the scores in this exact manner -- why not, say, multiply the two instead? It turns out that all of these ways of combining the scores seem to work about equally well. &lt;/p&gt;
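&lt;p&gt;Putting the pieces together, a sketch of the combined scoring might look like this (the &lt;code&gt;p * (1 - p)&lt;/code&gt; proxy stands in for entropy, as above):&lt;/p&gt;

```python
from collections import Counter

def info(p):
    # Peaks at p = 0.5: letters that split the remaining words
    # roughly in half carry the most information.
    return p * (1 - p)

def combined_score(word, dictionary):
    n = len(dictionary)
    global_freq = Counter(c for w in dictionary for c in set(w))
    local_freq = Counter((c, i) for w in dictionary for i, c in enumerate(w))
    g = sum(info(global_freq[c] / n) for c in set(word))
    l = sum(info(local_freq[(c, i)] / n) for i, c in enumerate(word))
    return (g + l) / 2
```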

&lt;p&gt;At this point, our method gets a success rate of about 99%! That is, it is able to guess the actual word within 6 tries 99% of the time. Pretty good! But can we do better? The real reason this isn't perfect is that we are computing a joint likelihood of each word while assuming each letter is independent of the other letters in the word. The assumption makes sense because computing the true joint distribution gets very hard very quickly. However, perhaps we can take a more holistic approach.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idea 5
&lt;/h2&gt;

&lt;p&gt;At this point, I decided to take a step back and gather the high level learnings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The entropy trick works better than the frequency trick because it splits our dictionary in a more consistent manner.&lt;/li&gt;
&lt;li&gt;The dictionary pruning strategy helps incorporate new information into the scoring strategy.&lt;/li&gt;
&lt;li&gt;Using a combination of local and global scores is necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Really, though, the metric to determine whether a selected word is good or not boils down to how many other words it can eliminate on average. That is, consider this brute force algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each query word, w1, in the dictionary:

&lt;ul&gt;
&lt;li&gt;For each target word, w2, in the dictionary:

&lt;ul&gt;
&lt;li&gt;Compute the feedback we would receive if we guessed w1 when the actual word was w2&lt;/li&gt;
&lt;li&gt;Calculate the number of words this feedback would help us prune, &lt;code&gt;num_pruned&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Compute the average of &lt;code&gt;num_pruned&lt;/code&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In effect, this would give us the expected value of the number of eliminations a word can provide, where the expectation is over all possible target words (which is unknown to us).&lt;/p&gt;

&lt;p&gt;When implemented naively, this is an &lt;code&gt;O(N^3)&lt;/code&gt; algorithm due to the nested loops and the fact that computing &lt;code&gt;num_pruned&lt;/code&gt; is itself an &lt;code&gt;O(N)&lt;/code&gt; operation. That is very expensive. However, note that the number of possible feedbacks for any pair of words is just &lt;code&gt;3^5=243&lt;/code&gt;, which is relatively small (compared to 10K). So we can actually do this in &lt;code&gt;O(N^2)&lt;/code&gt; time.&lt;/p&gt;

&lt;p&gt;To understand this, however, you need to have a solid understanding of how we can represent feedback. Say the actual word for today is "bribe", and you guess a word like "bikes". You would receive some combination of green, yellow and black boxes. You can represent this as a string of symbols where "+" is green, "-" is black and "~" is yellow. So in this case, you might get "+~-~-". Nice, right?&lt;/p&gt;
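&lt;p&gt;A simplified sketch of this feedback encoding (it ignores the duplicate-letter subtleties of the real game's scoring):&lt;/p&gt;

```python
def feedback(guess, target):
    # "+" = green (exact match), "~" = yellow (present elsewhere),
    # "-" = black (absent).
    out = []
    for i, c in enumerate(guess):
        if target[i] == c:
            out.append("+")
        elif c in target:
            out.append("~")
        else:
            out.append("-")
    return "".join(out)
```

&lt;p&gt;For the example above, &lt;code&gt;feedback("bikes", "bribe")&lt;/code&gt; returns &lt;code&gt;"+~-~-"&lt;/code&gt;.&lt;/p&gt;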

&lt;p&gt;Now, suppose you are given a guess word &lt;code&gt;w1&lt;/code&gt;. You want to find how many other words it will eliminate on average. You go through all possible target words and find that feedback &lt;code&gt;f1&lt;/code&gt; occurs 2 times, &lt;code&gt;f2&lt;/code&gt; occurs 5 times and &lt;code&gt;f3&lt;/code&gt; occurs 1 time. Whenever you get a feedback, you can eliminate all the words that would have produced any of the other feedbacks. So eliminations are just the inverse of the feedback distribution, e.g. &lt;code&gt;e1 = N - f1&lt;/code&gt;. You can compute this elimination distribution in &lt;code&gt;O(N^2)&lt;/code&gt; time.&lt;/p&gt;

&lt;p&gt;Now, you can do another &lt;code&gt;O(N^2)&lt;/code&gt; loop where you go through all possible guess word and target word pairs &lt;code&gt;(w1, w2)&lt;/code&gt; and compute the feedback, &lt;code&gt;f&lt;/code&gt;. You already know how many words the guess-feedback pair &lt;code&gt;(w1, f)&lt;/code&gt; will allow you to eliminate based on the distribution above. You can average this over all &lt;code&gt;w2&lt;/code&gt; and you get the average number of eliminations for &lt;code&gt;w1&lt;/code&gt;. &lt;/p&gt;
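&lt;p&gt;The elimination computation can be sketched as follows (self-contained, with a simplified feedback encoding that ignores duplicate-letter edge cases):&lt;/p&gt;

```python
from collections import Counter

def feedback(guess, target):
    # Simplified green/yellow/black encoding.
    return "".join(
        "+" if target[i] == c else "~" if c in target else "-"
        for i, c in enumerate(guess)
    )

def avg_eliminations(guess, targets):
    # Distribution of feedback patterns this guess can induce.
    counts = Counter(feedback(guess, t) for t in targets)
    n = len(targets)
    # A feedback f eliminates the n - counts[f] words that would have
    # produced a different pattern; weight each f by how often it occurs.
    return sum(c * (n - c) for c in counts.values()) / n
```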

&lt;p&gt;When you get some actual feedback (instead of simulated feedbacks over all possible target words), you can simply eliminate those words from the list of possible targets and the algorithm should just work.&lt;/p&gt;

&lt;p&gt;It is possible to make this run even faster with some Monte Carlo simulation. That is, since we only care about the expectation, we can randomly sample a small set of target words &lt;code&gt;w2&lt;/code&gt; instead of going through all possibilities. While this is noisier, it approximates the same results with a large enough sample size!&lt;/p&gt;

&lt;p&gt;And that's that -- this would actually give us the theoretically optimal starting word, which happens to be the word "raise"! This actually has a 100% success rate but is much slower to run. To speed it up, I decided to implement it in Golang instead of Python for easy parallelization with goroutines. Also, once this "raise" word was computed (took ~5mins), we could just assume you used that as your starting point which would prune over 97% of the words on average. Since you will now only have 300 or so remaining words, the quadratic runtime is very tractable.&lt;/p&gt;

&lt;p&gt;The code for this algorithm can be found &lt;a href="https://github.com/vinitjogani/wordle-solver" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through all these ideas, we really discussed two algorithms. The first is a noisier algorithm that runs in &lt;code&gt;O(N)&lt;/code&gt; time. It is not perfect because it makes the naive assumption of independence (kinda like Naive Bayes!) between the letters of a word. Then, we presented an &lt;code&gt;O(N^2)&lt;/code&gt; algorithm that performs a more thorough estimation of the number of eliminations each word can provide. &lt;/p&gt;

&lt;p&gt;Are we done? Well, not quite. This is the optimal solution to the one-shot problem. However, even if our choice of the first word is very good in isolation, it might use up all the nice letters and leave only bad words for our second guess. That means in the two-shot problem, there might be a different, more optimal strategy. In our case, it is a six-shot problem. Solving that exactly would take &lt;code&gt;O((N^2)^6) = O(N^12)&lt;/code&gt; steps, which is prohibitively expensive. Until we get some quantum computers working full time on the Wordle problem, we will have to be satisfied with the approximation we have achieved so far...&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>go</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>C++ is awesome, here's why...</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Sat, 24 Jul 2021 01:18:30 +0000</pubDate>
      <link>https://dev.to/vnjogani/c-is-awesome-here-s-why-1be0</link>
      <guid>https://dev.to/vnjogani/c-is-awesome-here-s-why-1be0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdkde0ewkautm4ytvft0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdkde0ewkautm4ytvft0.jpg" alt="17262-art-file-color-nav_a1095839-fec0-48e0-ac70-05af2f9d3cf2_650x650" width="650" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;C++ is, hands down, one of the most misunderstood languages in software developer pop culture today. People often compare it to C because it is a "low-level" language. Consequently, it has received a reputation of being an esoteric language that only the performance paranoid care about. This is far from true. I have been programming in C++ as my primary language for a while now and the developer experience is actually really good -- much better than I had imagined it would be.&lt;/p&gt;

&lt;p&gt;In this article, I want to debunk some common myths about C++ that I heard before I started using it. Then, I want to talk about the actual superpowers that C++ provides you that most other languages do not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 1: memory this, memory that
&lt;/h2&gt;

&lt;p&gt;We all know that C is infamous for manual memory management, e.g. with &lt;code&gt;malloc&lt;/code&gt; and &lt;code&gt;free&lt;/code&gt;. They are hard to work with, lead to many error conditions that need to be checked for manually, and are overall a nightmare. When people hear that C++ also has great performance, they assume it comes from dabbling in all the specifics of memory allocation much like in C, so they conclude that it would also be a nightmare. This is categorically false.&lt;/p&gt;

&lt;p&gt;For a while now, C++ has had smart pointers. Using these, you can get the same behavior as you get with objects in other languages like Java and Python. In particular, the &lt;code&gt;std::shared_ptr&lt;/code&gt; works by wrapping a regular object in a copy-able and movable object with a reference counting mechanism. Thus, when no code is referencing a &lt;code&gt;shared_ptr&lt;/code&gt;, it is safely destructed and freed like in most languages. The simplest way to construct a shared pointer is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;std::shared_ptr cat(new Pet("cat"));
// or
std::shared_ptr cat = std::make_shared&amp;lt;Pet&amp;gt;("cat");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this is the common pattern in most other languages, C++ allows you to have further control on how an object is accessed e.g. using a &lt;code&gt;unique_ptr&lt;/code&gt;, but more on that later.&lt;/p&gt;

&lt;p&gt;Overall, with smart pointers, managing memory in C++ is no harder than in any other language. It is, however, &lt;em&gt;intentional&lt;/em&gt; in that you need to clarify that this is your expected behavior, because you can still create and pass around regular pointers the good (ugly?) old way.&lt;/p&gt;

&lt;p&gt;Typically, with good use of these, you are also very unlikely to run into segmentation faults which were oh-so-common in C.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 2: it is old and outdated
&lt;/h2&gt;

&lt;p&gt;C++ is very actively maintained, with new features rolled out on a regular basis. One of the most admired "new" features in many languages is lambdas. Surely C++ doesn't have lambdas, right? Wrong. A C++ lambda function can be defined as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto square = [=](int x) { return x * x; };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For context, Java got lambdas in 2014 with the release of Java 8. C++ has had lambdas since C++11 (2011). Yep.&lt;/p&gt;

&lt;p&gt;And there continue to be major updates e.g. C++20 most recently, that introduce even more features to make C++ easier to work with. For example, variable-length arguments have been in C++ for a while with variadic templates. Generics work great in C++ as well, although differently from what you may be used to in Java. These features continue to improve the way we develop software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 3: it is easy to get wrong
&lt;/h2&gt;

&lt;p&gt;On the contrary, while learning some of its quirks may take a while, C++ makes it very hard for your code to do undesired things. For example, many object-oriented languages have no support for "pure" methods, i.e. methods that are guaranteed not to mutate state. In C++, you can mark methods of a class as &lt;code&gt;const&lt;/code&gt; if they don't modify the class's state. These methods can then also be called on constant instances of the class. Here's an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Greeting {
public:
Greeting(std::string greeting) : greeting_(greeting) {}

std::string get_greeting() const { 
  return greeting_; 
}

std::string set_greeting(std::string new_) { 
  greeting_ = new_; 
}

private:
std::string greeting_;
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, you can initialize this class as a constant and still call the getter. If you try to mutate the state of the class, the compiler will complain!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const Greeting g("hello");
g.get_greeting(); // returns "hello"
g.set_greeting("hi"); // does not compile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To be accurate, these functions are not fully pure in that if you don't type some of your variables correctly, it is possible to mutate the resources. For example, if you have a &lt;code&gt;const&lt;/code&gt; pointer to a non-&lt;code&gt;const&lt;/code&gt; object, you may not modify the pointer but can modify the object pointed to by the pointer. However, these problems can typically be avoided by typing the pointer correctly (i.e. making it a &lt;code&gt;const&lt;/code&gt; pointer to a &lt;code&gt;const&lt;/code&gt; object). &lt;/p&gt;

&lt;p&gt;It may seem like I am contradicting myself by mentioning such an edge case in a common use case. However, I don't think it contradicts my larger claim: C++ makes it hard to go wrong, assuming you know what you want, by giving you the tools to express exactly that in code. While programming languages like Python might abstract it all away from you, that abstraction comes at a cost. Imagine going into an ice cream shop and only being served chocolate because that's what most people generally want -- that's Python. Depending on how you look at it, it may be harder to go wrong with chocolate, but in general it is not up to the shop but the customer to know what they need.&lt;/p&gt;

&lt;p&gt;Good &lt;code&gt;const&lt;/code&gt;-ness is a big plus, but there are several other things C++ allows you to do that prevent production bugs in larger projects. It allows you to configure move/copy/delete semantics for the classes you design if you need to. It allows you to pass things by value, and offers advanced features like multiple inheritance. All these things make C++ less restrictive. &lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 4: it is verbose
&lt;/h2&gt;

&lt;p&gt;Okay, this one is not completely inaccurate. Coming from Python, where folks are so tired of typing &lt;code&gt;numpy&lt;/code&gt; that they collectively decided to import it as &lt;code&gt;np&lt;/code&gt;, typing more than 2 letters does feel verbose. But modern C++ is a lot less verbose than it used to be! For example, type inference like Go's is available in C++ via the &lt;code&gt;auto&lt;/code&gt; keyword.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto x = 1;
x = 2; // works
x = "abc"; // compilation error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use &lt;code&gt;auto&lt;/code&gt; in return types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;auto func(int x) { return x * x; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use it in for loops to loop over maps for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for (auto&amp;amp; [key, value]: hashmap) {...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;auto&lt;/code&gt; does not mean that the types are dynamic -- they are still inferred at compile time and once assigned, they cannot be changed. This is probably for the best. In practice for large codebases, it arguably also helps readability to specifically type out the full type instead of using &lt;code&gt;auto&lt;/code&gt;. However, the feature is there if you need it.&lt;/p&gt;

&lt;p&gt;You can also specify type and/or namespace aliases like in Python. For example, you can do something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;using tf = tensorflow;
// Now you can use tf::Example instead of tensorflow::Example.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;C++ templates (similar to Java Generics, but pretty different at the same time) can also help significantly cut down duplicate code and may be an elegant solution for many use cases.&lt;/p&gt;

&lt;p&gt;Overall, C++ is definitely more verbose than many new programming languages like Kotlin and Python. However, it is not a lot more verbose than C# or Java or even JavaScript to some extent, and the verbosity of those languages has not affected their popularity too much.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth 5: hard to do common things
&lt;/h2&gt;

&lt;p&gt;Again, this is not completely inaccurate. Common operations like joining strings by a delimiter are more complicated than they need to be. However, this is a problem solved rather easily with open source libraries like Google's &lt;a href="https://abseil.io/" rel="noopener noreferrer"&gt;Abseil&lt;/a&gt; with thousands of utility functions that make it very easy to work with. Apart from string utilities, Abseil also contains special implementations of hashmaps, concurrency helpers, and debugging tools. Other libraries like Boost make it easy to work with BLAS routines (e.g. dot products, matrix multiplications, etc.) and are very performant.&lt;/p&gt;

&lt;p&gt;Using libraries can itself be a challenging task in C++, with &lt;code&gt;CMake&lt;/code&gt; files to maintain, although in many ways those are not much different from the &lt;code&gt;gradle&lt;/code&gt; or &lt;code&gt;package.json&lt;/code&gt; files in other languages. However, Google's open source build tool &lt;a href="https://bazel.build/" rel="noopener noreferrer"&gt;Bazel&lt;/a&gt; makes dependencies easy to work with, even for cross-language builds. It takes some getting used to, but provides really quick builds and, in general, a very good developer experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Superpowers
&lt;/h2&gt;

&lt;p&gt;So after busting all those common myths about C++, here are some things in C++ that a lot of other languages don't allow you to do:&lt;/p&gt;

&lt;h3&gt;
  
  
  Customize semantics of your classes
&lt;/h3&gt;

&lt;p&gt;Suppose you have a class that contains some large data. You would prefer it not be copied and instead be passed by reference always. You can enforce this as a designer of the interface. Moreover, you can configure exactly how you would like it to be copied if at all it is required. &lt;/p&gt;

&lt;p&gt;What about moving objects -- instead of copying all data from the previous block of memory to a new block and then deleting the old block of memory, maybe you can optimize it by just switching the pointer locations.&lt;/p&gt;

&lt;p&gt;What about destroying objects -- when an object goes out of scope, maybe you automatically want to release some resources (e.g. think of mutexes that are automatically released at the end of a function). This works much like the &lt;code&gt;defer&lt;/code&gt; functionality in Go.&lt;/p&gt;

&lt;p&gt;As a designer of an interface, you can customize every small aspect of how users will use your class. In most cases, this is not necessary but when it is, C++ allows you to fully express your requirements. That is very powerful and can save your company hours of time on large projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimize memory access
&lt;/h3&gt;

&lt;p&gt;I briefly mentioned smart pointers above. Apart from &lt;code&gt;shared_ptr&lt;/code&gt;, you can also have a &lt;code&gt;unique_ptr&lt;/code&gt; that ensures only one object owns a resource at a time. Having a single owner for data makes large projects easier to organize and reason about. In fact, while &lt;code&gt;shared_ptr&lt;/code&gt; most closely mimics Java and other OOP languages, you're generally better off using a &lt;code&gt;unique_ptr&lt;/code&gt;. This also adds to the safety of the language.&lt;/p&gt;

&lt;p&gt;You can also specify compile-time constants to enable the compiler to do more work during the build so that the binary runs faster overall.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strong types
&lt;/h3&gt;

&lt;p&gt;In general, I have found that working with typed objects is SO much easier to debug than other languages. For example, after hours of debugging a JavaScript project, I found that the error arose because I was passing in 1 argument to a 2 argument function. JavaScript threw no errors and just produced undesired outputs. I would much rather have something fail to compile than fail during runtime.&lt;/p&gt;

&lt;p&gt;However, there are many typed languages out there, so it may seem overkill to call this a superpower. But C++ does more. First of all, C++ lets you pass anything by reference, by pointer, or by value. That means you can pass a reference to an integer and have a function mutate it instead of using the return value (handy in some cases). It also means you can perform memory-level operations using a pointer on anything if need be. Usually, though, you would pass things as a constant reference (e.g. &lt;code&gt;const A&amp;amp;&lt;/code&gt;), which avoids a copy and keeps your object safe from unintentional mutation. Those are strong guarantees to have, and they make your code much easier to reason about. In TypeScript, for example, you cannot reassign a &lt;code&gt;const&lt;/code&gt; object but you &lt;em&gt;can&lt;/em&gt; mutate it. Why even -- just uggh.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;So C++ is great and all, but there are obvious limitations to what I would use it for. You can write microservices (e.g. in gRPC) rather easily but I would likely not use C++ for the actual web server (I would likely use TypeScript for that). Despite its speed, I would likely not use C++ for data analysis (I would likely use &lt;code&gt;pandas&lt;/code&gt; for that). There are some things that C++ is great for and there are other things that it's just not suitable for. Ultimately, you still have to choose the right tool for whatever job you're trying to accomplish. Having said that, hopefully this article made C++ a bit more attractive in your eyes than you're used to seeing it as.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>microservices</category>
      <category>performance</category>
      <category>programming</category>
    </item>
    <item>
      <title>A Guide to Git-Secret</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Sun, 01 Mar 2020 01:02:59 +0000</pubDate>
      <link>https://dev.to/vnjogani/a-guide-to-git-secret-49g3</link>
      <guid>https://dev.to/vnjogani/a-guide-to-git-secret-49g3</guid>
      <description>&lt;h1&gt;
  
  
  Git what?
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://git-secret.io/" rel="noopener noreferrer"&gt;&lt;code&gt;Git-secret&lt;/code&gt;&lt;/a&gt; is a tool to manage API secrets in source control -- but it doesn't have to be just API keys. This git extension allows you to encrypt/decrypt any files as you push/pull from source control. This guide should walk you through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why &lt;code&gt;git-secret&lt;/code&gt;? What are the advantages and disadvantages?&lt;/li&gt;
&lt;li&gt;How do you set it up? And what does a normal workflow look like? &lt;/li&gt;
&lt;li&gt;What are the alternatives? &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why &lt;code&gt;git-secret&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Git-secret&lt;/code&gt; has a few key advantages that made our team use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It lets you encrypt any kind of file, not just plain-text files!&lt;/li&gt;
&lt;li&gt;It lets you add multiple users to the keyring so that they can simultaneously encrypt/decrypt it. This is very valuable when you have a team of developers who all need access to the secret file.&lt;/li&gt;
&lt;li&gt;It works well on both Linux and Mac, and integrates seamlessly with git. &lt;/li&gt;
&lt;li&gt;The setup is relatively simple and a lot of recurring tasks can be automated with simple scripts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That said, there are a few disadvantages that we experienced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since it encrypts the whole file, it often causes merge conflicts that cannot be auto-resolved, because git cannot tell which specific part of the file changed.&lt;/li&gt;
&lt;li&gt;For every new key added, the file has to be decrypted and re-encrypted. No solution avoids that, but it can pollute the commit history.&lt;/li&gt;
&lt;li&gt;It does not support Windows well, so on Windows it must be used through the Windows Subsystem for Linux.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to &lt;code&gt;git-secret&lt;/code&gt;?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;If you are on Mac OS X, this tutorial assumes that you have &lt;a href="https://brew.sh/" rel="noopener noreferrer"&gt;homebrew&lt;/a&gt; installed; you will need it to install some of the dependencies below.&lt;/p&gt;

&lt;p&gt;If you are using Windows, you will have to use the Windows Subsystem for Linux to follow along. Some of the dependencies below do not have good support for Windows.&lt;/p&gt;

&lt;p&gt;The first step is to make sure you have &lt;code&gt;git&lt;/code&gt; installed and working, obviously. Most developers will have it already, but run &lt;code&gt;git --version&lt;/code&gt; just to be sure. If you are on a Mac, run &lt;code&gt;brew install git&lt;/code&gt;. On a Debian machine, run &lt;code&gt;sudo apt-get install git&lt;/code&gt;. Similar commands exist for other Linux distributions; visit &lt;a href="https://www.atlassian.com/git/tutorials/install-git" rel="noopener noreferrer"&gt;this site&lt;/a&gt; to find yours.&lt;/p&gt;

&lt;p&gt;Next up, you will need &lt;code&gt;gnupg&lt;/code&gt;. Again, most Linux machines come with this pre-installed, so you should be able to run &lt;code&gt;gpg --version&lt;/code&gt; to verify. Otherwise, install it with &lt;code&gt;sudo apt-get install gnupg&lt;/code&gt;. On macOS, install it with &lt;code&gt;brew install gnupg&lt;/code&gt;. Nothing fancy so far.&lt;/p&gt;

&lt;p&gt;Finally, you will have to install &lt;code&gt;git-secret&lt;/code&gt; itself. Detailed instructions can be found &lt;a href="https://git-secret.io/installation" rel="noopener noreferrer"&gt;here&lt;/a&gt;. On a Mac, you can just run &lt;code&gt;brew install git-secret&lt;/code&gt;. On Debian, you can run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;echo "deb https://dl.bintray.com/sobolevn/deb git-secret main" | sudo tee -a /etc/apt/sources.list
wget -qO - https://api.bintray.com/users/sobolevn/keys/gpg/public.key | sudo apt-key add -
sudo apt-get update &amp;amp;&amp;amp; sudo apt-get install git-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Setup
&lt;/h3&gt;

&lt;p&gt;The first thing you want to do is to generate a new GPG key that you will be using with git-secret. I used RSA &amp;amp; RSA with 4096-bit keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpg --gen-key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
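&lt;p&gt;If you want to script key generation (say, for onboarding new teammates), &lt;code&gt;gpg&lt;/code&gt; also supports unattended generation from a parameters file. A minimal sketch matching the RSA &amp;amp; RSA 4096-bit choice above (the name and email are placeholders; drop &lt;code&gt;%no-protection&lt;/code&gt; if you want a passphrase on the key):&lt;br&gt;
&lt;/p&gt;

```
%no-protection
Key-Type: RSA
Key-Length: 4096
Subkey-Type: RSA
Subkey-Length: 4096
Name-Real: Your Name
Name-Email: you@example.com
Expire-Date: 0
%commit
```

&lt;p&gt;Save this as e.g. &lt;code&gt;keyparams.txt&lt;/code&gt; and run &lt;code&gt;gpg --batch --gen-key keyparams.txt&lt;/code&gt;.&lt;/p&gt;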



&lt;p&gt;Then, you need to import everyone else's keys. I set up a folder in our repo called &lt;code&gt;public_keys&lt;/code&gt; and added a shell script to automatically import/export all keys, as follows (set &lt;code&gt;$EMAIL&lt;/code&gt; to the email address associated with your key, perhaps via a shell argument):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpg --import public_keys/* 
gpg --batch --yes -a -o public_keys/$USER.gpg --export $EMAIL
echo "Finished importing and exporting keys"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! This syncs your local keyring with the keys of everyone on your team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Normal workflow
&lt;/h3&gt;

&lt;p&gt;From here on, the workflow is simple.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 0&lt;/strong&gt;: Add the secret file to &lt;code&gt;.gitignore&lt;/code&gt;. Suppose you want to encrypt &lt;code&gt;config/.env&lt;/code&gt;: add it to &lt;code&gt;.gitignore&lt;/code&gt; so that you don't accidentally push it to source control. In fact, &lt;code&gt;git-secret&lt;/code&gt; will refuse to add a file to the vault unless it is in your &lt;code&gt;.gitignore&lt;/code&gt;.&lt;/p&gt;
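&lt;p&gt;A quick sketch of that setup, using the same hypothetical &lt;code&gt;config/.env&lt;/code&gt; path:&lt;br&gt;
&lt;/p&gt;

```shell
# Create the secret file and register it in .gitignore so it can
# never be committed in plaintext by accident
mkdir -p config
touch config/.env
echo "config/.env" >> .gitignore

# Confirm the entry made it in
grep "config/.env" .gitignore   # prints "config/.env"
```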

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Add a file to git-secret.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git secret add config/.env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Give people access to the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git secret tell &amp;lt;EMAIL&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the only GPG keys in my keyring were the ones used for this repo, I could grant access to everyone on my team with one loop. In general, blindly telling every key in your keyring may not be a safe operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for i in `gpg -k | grep -Eo "[^&amp;lt;]+@\S+\.[^&amp;gt;]+"`; do
   git secret tell $i 
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: Encrypt!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git secret hide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Decrypt as you wish:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git secret reveal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that every time you grant access to new people, you should &lt;code&gt;reveal&lt;/code&gt; and then &lt;code&gt;hide&lt;/code&gt; so that the file is encrypted using new keys.&lt;/p&gt;

&lt;p&gt;And that's pretty much it! You can find more detailed docs &lt;a href="https://git-secret.io/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying
&lt;/h3&gt;

&lt;p&gt;When deploying to production servers or CI, I inject the configuration as a base64-encoded secret environment variable (e.g. through GitHub Secrets), then use the Python &lt;code&gt;gnupg&lt;/code&gt; package alongside &lt;code&gt;dotenv&lt;/code&gt; to decode and load the configuration file.&lt;/p&gt;
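&lt;p&gt;The encode/decode halves of that pipeline look roughly like this in shell (the file names and the &lt;code&gt;ENV_B64&lt;/code&gt; variable are placeholders for whatever your CI setup exposes):&lt;br&gt;
&lt;/p&gt;

```shell
# Locally: pack the config file into a single-line string that can be
# pasted into a CI secret (e.g. GitHub Secrets).
printf 'API_KEY=abc123\n' > example.env   # stand-in for the real config file
ENV_B64=$(base64 example.env | tr -d '\n')

# On the CI/production side: decode the secret back into a file before
# the app starts, so dotenv can load it as usual.
printf '%s' "$ENV_B64" | base64 -d > decoded.env
```

&lt;p&gt;Base64 keeps the secret on a single line, which avoids multi-line secret headaches in most CI systems.&lt;/p&gt;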

&lt;h2&gt;
  
  
  What else, if not &lt;code&gt;git-secret&lt;/code&gt;?
&lt;/h2&gt;

&lt;p&gt;There are several other tools like &lt;a href="https://github.com/Shopify/ejson" rel="noopener noreferrer"&gt;&lt;code&gt;ejson&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/StackExchange/blackbox" rel="noopener noreferrer"&gt;&lt;code&gt;blackbox&lt;/code&gt;&lt;/a&gt;, and &lt;a href="https://github.com/AGWA/git-crypt" rel="noopener noreferrer"&gt;&lt;code&gt;git-crypt&lt;/code&gt;&lt;/a&gt;. Most of these tools either involve many installation steps, use a very similar GPG-based encryption scheme, or lack some of the advantages listed in the first section above.&lt;/p&gt;

&lt;p&gt;That said, there is a class of tools that use cloud-managed keys, like &lt;a href="https://github.com/mozilla/sops" rel="noopener noreferrer"&gt;&lt;code&gt;sops&lt;/code&gt;&lt;/a&gt;, which might be more appropriate for larger teams. It involves more boilerplate and setup, but it can be a cleaner solution once the number of keys grows well beyond a handful, and it addresses the drawbacks of &lt;code&gt;git-secret&lt;/code&gt; mentioned above.&lt;/p&gt;

</description>
      <category>api</category>
      <category>git</category>
    </item>
    <item>
      <title>Why Google AppScript is a real blessing</title>
      <dc:creator>Vinit Jogani</dc:creator>
      <pubDate>Sat, 29 Feb 2020 22:07:07 +0000</pubDate>
      <link>https://dev.to/vnjogani/why-google-appscript-is-a-real-blessing-4pdp</link>
      <guid>https://dev.to/vnjogani/why-google-appscript-is-a-real-blessing-4pdp</guid>
      <description>&lt;h1&gt;
  
  
  The Problem
&lt;/h1&gt;

&lt;p&gt;As a developer, sometimes you work on small-scale projects, e.g. a personal project or a short-term freelance gig. The requirement for these projects is usually not scalability or extensibility, but getting the job done as quickly and cheaply as possible. &lt;/p&gt;

&lt;p&gt;One thing about this whole process that has personally frustrated me in the past is that the client may keep requesting small changes, slight data updates, or data snapshots. You could, of course, build a configuration screen into your application to let the client tune the knobs, and build functionality for exporting/importing data more easily. But all of this adds a lot of development-time overhead! &lt;/p&gt;

&lt;p&gt;Why spend all of this time reinventing the wheel when you could leverage the existing functionality of an app in the Google Suite? For example, you &lt;em&gt;could&lt;/em&gt; provision a SQL database, but if all you need is to save POST data from a few forms and organize it with some logic, why not just save the data to a Google Sheet? Sure, a SQL database provides a lot of extra features in terms of performance, advanced queries, etc. However, if you expect ~5-10 users and the client cares only about the functionality, this can be of significant value: the client suddenly gets everything Google Sheets provides at no additional cost. They can create their own graphs and analyses without asking you to implement a full-fledged dashboard! Every small change in the data no longer needs to go through you. The Software Architect within you may cringe at this idea, but as my former manager would say, "Speed Wins!".&lt;/p&gt;

&lt;p&gt;Here's the second problem: often you need to connect to external APIs from your app. For example, you might need directions data (e.g. from Google Maps), need to integrate with a mail app (e.g. Gmail), or want to automate calendar events (e.g. in Google Calendar). Setting all of this up requires managing API keys and dependencies in your app, plus writing the boilerplate. That can be a big pain for a small app. You would also need to provision a server to run your app (unless you go serverless). Either way, connecting to these APIs can be a real chore!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: For small projects, sometimes you just need a quick way to connect a few APIs and let the client make small changes without going through you. Whether it's a personal automation project or a short-term freelance gig, speed and efficiency are of the essence!&lt;/p&gt;

&lt;h1&gt;
  
  
  What Google Apps Script provides
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://developers.google.com/apps-script" rel="noopener noreferrer"&gt;Google AppScript&lt;/a&gt; is a lesser known service by Google that allows you to automate several Google services to address the problems above. Some of the key features it provides is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seamless integration with &lt;em&gt;many&lt;/em&gt; Google services (including Gmail, Google Drive, Calendar, Maps, etc.) without any API key management!&lt;/li&gt;
&lt;li&gt;Publishing as web apps and/or extensions (so you can expose a public API)&lt;/li&gt;
&lt;li&gt;Ability to set up automatic triggers based on recurring events, incoming emails, etc.&lt;/li&gt;
&lt;li&gt;Familiar JavaScript-like syntax (most JavaScript features work, except DOM access).&lt;/li&gt;
&lt;li&gt;Store user properties across sessions.&lt;/li&gt;
&lt;li&gt;All of this for free!&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  A Simple Example
&lt;/h1&gt;

&lt;p&gt;Let's walk through a simple example of how to implement something in Google Apps Script. We're going to build a simple API that takes in a person's name and fetches their phone number from your contacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;: Head to Google Drive and create a new Google Apps Script:&lt;/p&gt;

&lt;p&gt; 
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8uh9e18n24x2rkaadmoj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8uh9e18n24x2rkaadmoj.png" width="780" height="720"&gt;&lt;/a&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F663ithlegyghpewrkvsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F663ithlegyghpewrkvsm.png" width="772" height="895"&gt;&lt;/a&gt; 
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;: Create a function as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function getPhones() {
  var contacts = ContactsApp.getContactsByName('Dad');
  for (var i in contacts) {
    var contact = contacts[i];
    var phones = contact.getPhones();
    for (var j in phones) {
      Logger.log(phones[j].getPhoneNumber());
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should be pretty self-explanatory. We first get a list of contacts which match some name, then we loop through them and get all their phones, and then we print the phone number for each one of those phones to the logging console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;: From the toolbar, select &lt;code&gt;getPhones&lt;/code&gt; as the function and hit the play button. Alternatively, you can press Ctrl+R (or Cmd+R on OS X) to run the function. At this point, you will have to grant some permissions so that your function can access your contacts. After granting access, the function should run without any issues. You can then go to View &amp;gt; Logs (or press Ctrl+Enter) to view the phone numbers printed to the console.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;: Now it's time to convert this into an API! I am going to update my function to take in a query parameter, and return a list of phone numbers. Then, I am going to create a new function called &lt;code&gt;doGet&lt;/code&gt; (the exact name is important) as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function getPhones(query) {
  var contacts = ContactsApp.getContactsByName(query);
  var output = []
  for (var i in contacts) {
    var contact = contacts[i];
    var phones = contact.getPhones();
    for (var j in phones) {
      output.push(phones[j].getPhoneNumber());
    }
  }
  return output;
}

function doGet(e) {
  var phones = getPhones(e.parameters.query);
  var response = JSON.stringify(phones);
  var response = ContentService.createTextOutput(response);
  var response = response.setMimeType(ContentService.MimeType.JSON);
  return response;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;doGet&lt;/code&gt; function extracts the &lt;code&gt;query&lt;/code&gt; parameter from the query string, passes it to &lt;code&gt;getPhones&lt;/code&gt;, and returns the list of phone numbers as JSON.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5&lt;/strong&gt;: Now you can deploy your app from "Publish &amp;gt; Deploy as Web App". You can configure parameters such as who has access, then hit Publish. This gives you a URL. Navigate to &lt;code&gt;https://&amp;lt;YOUR_URL&amp;gt;?query=&amp;lt;YOUR_QUERY&amp;gt;&lt;/code&gt; and it should work as expected! If you set the permissions accordingly, this is now accessible by any client app (not very safe if you have sensitive contacts!). &lt;/p&gt;

&lt;p&gt;And that's it! Just like that, without any API Key nonsense, or servers, we could deploy a simple app in minutes. &lt;/p&gt;

&lt;h1&gt;
  
  
  How I've used it so far
&lt;/h1&gt;

&lt;p&gt;Here are some of the use cases where I ended up reaching for Apps Script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I had a team calendar with events named after me. I wanted to turn notifications on for all my events but by default, I would get notifications for every event on the calendar. I wrote an App Script to periodically sync all my events into a personal calendar that I could turn on notifications for.&lt;/li&gt;
&lt;li&gt;One of my freelance projects required a hierarchical form where each choice depended on previous selections, with an output shown at the end based on the choices. The dataset was likely to change over time, and I didn't want to keep updating it myself, so I automated a data pull from Google Sheets and exposed it as a JSON API for my Vue frontend to use.&lt;/li&gt;
&lt;li&gt;I once had to build an MVP for a database design project. The fields for the database were rendered automatically from an Excel template that the user could update over time without my intervention.&lt;/li&gt;
&lt;li&gt;I wanted to use the Google Translate API to translate short English phrases into Hindi phrases for an Android app I was building. I didn't want to set up API keys or an app server, and ended up deploying a 10-line function through Google Apps Script.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, I hope you see how this can make life so much easier.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;There are many more things you can do in this environment, but hopefully you now see how easy and helpful it is when you want to automate your workflow or build small projects! Check out the documentation to explore the full power of this service. (&lt;a href="https://developers.google.com/apps-script" rel="noopener noreferrer"&gt;View Docs&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;I'd be happy to hear any feedback on this post. Feel free to share any interesting projects you make with this.&lt;/p&gt;

</description>
      <category>web</category>
    </item>
  </channel>
</rss>
