<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kiran U Kamath</title>
    <description>The latest articles on DEV Community by Kiran U Kamath (@kiranukamath).</description>
    <link>https://dev.to/kiranukamath</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F283777%2F8759e3d5-62d8-462d-aaeb-316e049010e5.jpg</url>
      <title>DEV Community: Kiran U Kamath</title>
      <link>https://dev.to/kiranukamath</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kiranukamath"/>
    <language>en</language>
    <item>
      <title>DNS Resolver: The Unsung Hero of the Internet</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Tue, 08 Oct 2024 04:37:00 +0000</pubDate>
      <link>https://dev.to/kiranukamath/dns-resolver-the-unsung-hero-of-the-internet-48ap</link>
      <guid>https://dev.to/kiranukamath/dns-resolver-the-unsung-hero-of-the-internet-48ap</guid>
      <description>&lt;p&gt;We use the internet every day without even thinking about the gears turning behind the scenes. One of those crucial gears is the DNS resolver—the behind-the-curtain magician that converts those easy-to-remember domain names (like example.com) into IP addresses that computers need to communicate.&lt;/p&gt;

&lt;p&gt;But what is a DNS resolver, how does it work, and why should you care? Let's break this down, layer by layer, into a comprehensive yet digestible explanation. Buckle up, because we're about to get into the workings of the internet!&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is a DNS Resolver?
&lt;/h2&gt;

&lt;p&gt;At its core, a DNS resolver is a specialized server that processes DNS queries from clients (like your web browser or an application).&lt;/p&gt;

&lt;p&gt;When you type in a domain name like example.com, your computer doesn’t magically know how to find it on the web. It relies on the DNS resolver to figure out what IP address corresponds to that domain, much like how a phone book helps you find someone’s number based on their name.&lt;/p&gt;

&lt;p&gt;The resolver is often provided by your ISP (Internet Service Provider), but there are many third-party services, such as Google DNS (8.8.8.8), Cloudflare DNS (1.1.1.1), and OpenDNS (208.67.222.222), which offer enhanced speed, privacy, and security features.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Types of DNS Resolvers
&lt;/h2&gt;

&lt;p&gt;Before diving into the nuts and bolts, let’s clarify the two primary types of DNS resolvers:&lt;/p&gt;

&lt;p&gt;Stub Resolver: This lives on your local machine (or within an application). It sends DNS queries to the recursive resolver and waits for a response. Its job is simple—start the process and display the result.&lt;/p&gt;

&lt;p&gt;Recursive Resolver: This is where the magic happens. It takes the query from the stub resolver and performs the heavy lifting, often making multiple requests to various DNS servers to resolve the domain name to an IP address.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full DNS Resolution Journey
&lt;/h2&gt;

&lt;p&gt;Let's walk through what happens behind the scenes when you type &lt;a href="http://www.example.com" rel="noopener noreferrer"&gt;www.example.com&lt;/a&gt; into your browser and hit enter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Local Cache Check
&lt;/h3&gt;

&lt;p&gt;The first stop for the DNS resolution journey is your own machine's cache. Modern operating systems cache DNS records for a while to avoid unnecessary network trips. If you’ve recently visited example.com, there’s a chance the IP address is already sitting in your local DNS cache, and the process ends right here. No need to bother the DNS resolver.&lt;/p&gt;

&lt;p&gt;But let’s assume the IP address isn’t cached. Your machine now sends the query to a recursive resolver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Recursive Resolver: The Mastermind
&lt;/h3&gt;

&lt;p&gt;Now, the recursive resolver steps in. This resolver acts as a middleman, tasked with finding the right IP address for the domain name. To do that, it might have to talk to several other servers, which we’ll go over next. It may have cached the answer from a previous query; if it hasn’t, the resolver begins its journey through the DNS hierarchy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Query to the Root Servers
&lt;/h3&gt;

&lt;p&gt;If the resolver has never encountered example.com, it first asks one of the root DNS servers, a vital part of the internet’s infrastructure. There are 13 named root servers (A through M), but each name is actually served by hundreds of physical instances distributed around the globe using Anycast routing.&lt;/p&gt;

&lt;p&gt;Their job isn’t to know the IP address for example.com directly, but they can point the resolver in the right direction, typically by directing it to the appropriate Top-Level Domain (TLD) DNS servers.&lt;/p&gt;

&lt;p&gt;In the case of example.com, the root server refers the resolver to the .com TLD servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Query to the TLD DNS Servers
&lt;/h3&gt;

&lt;p&gt;Next, the recursive resolver queries the TLD DNS servers. For example.com, the resolver heads over to the .com DNS servers. These servers don’t return the final IP address, but they provide a referral to the authoritative name servers for example.com.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Query to Authoritative DNS Servers
&lt;/h3&gt;

&lt;p&gt;Finally, the recursive resolver reaches out to the authoritative DNS servers for the example.com domain. These servers hold the DNS records (such as A, AAAA, CNAME, etc.) for the domain and can give the exact IP address. The authoritative server looks up its records and responds with the requested IP address for the domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Caching and Returning the Answer
&lt;/h3&gt;

&lt;p&gt;The recursive resolver now knows the answer, but it’s smart about things. Instead of repeating this process for every query, it caches the IP address and stores it for a set period (determined by the Time-To-Live (TTL) of the DNS record). This caching helps reduce the load on DNS infrastructure and speeds up subsequent lookups.&lt;/p&gt;
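&lt;p&gt;The caching behavior can be sketched with a toy TTL cache (a simplified illustration in Python; real resolvers are far more involved):&lt;/p&gt;

```python
import time

class TTLCache:
    """A minimal DNS-style cache: each answer expires after its TTL."""
    def __init__(self):
        self._store = {}  # name -> (ip, expiry timestamp)

    def put(self, name, ip, ttl_seconds):
        self._store[name] = (ip, time.monotonic() + ttl_seconds)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None            # never seen: full resolution needed
        ip, expiry = entry
        if time.monotonic() >= expiry:
            del self._store[name]  # TTL elapsed: record is stale
            return None
        return ip

cache = TTLCache()
cache.put("example.com", "93.184.215.14", ttl_seconds=300)
print(cache.get("example.com"))  # served from cache while the TTL is live
```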

&lt;p&gt;Finally, the resolver sends the answer back to the stub resolver on the client machine, and the browser can now load the website by connecting to 93.184.215.14.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Ask Cloudflare's public resolver (1.1.1.1) for the A record of www.example.com:
dig @1.1.1.1 www.example.com A

# The ANSWER SECTION of the output contains the resolved IP address.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DNS Query Types
&lt;/h2&gt;

&lt;p&gt;Throughout the resolution process, different types of DNS records can be requested, each serving a distinct purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A (Address) Record: Returns the IPv4 address of the domain.&lt;/li&gt;
&lt;li&gt;AAAA Record: Returns the IPv6 address of the domain.&lt;/li&gt;
&lt;li&gt;CNAME (Canonical Name) Record: Returns an alias for another domain.&lt;/li&gt;
&lt;li&gt;MX (Mail Exchange) Record: Returns the mail server responsible for receiving emails for the domain.&lt;/li&gt;
&lt;li&gt;TXT Record: Provides additional text information, often used for verification or security purposes (e.g., SPF records).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: CNAME Resolution&lt;/p&gt;

&lt;p&gt;Let’s take an example of a CNAME resolution. If blog.example.com is an alias for &lt;a href="http://www.example.com" rel="noopener noreferrer"&gt;www.example.com&lt;/a&gt;, a query for blog.example.com will first return the CNAME record, which points to &lt;a href="http://www.example.com" rel="noopener noreferrer"&gt;www.example.com&lt;/a&gt;. The resolver will then make another query for &lt;a href="http://www.example.com" rel="noopener noreferrer"&gt;www.example.com&lt;/a&gt; to get the actual IP address.&lt;/p&gt;
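&lt;p&gt;The two-step lookup can be sketched with a toy record table (hypothetical data standing in for live DNS answers):&lt;/p&gt;

```python
# Toy record table standing in for authoritative DNS data (illustrative values).
RECORDS = {
    ("blog.example.com", "CNAME"): "www.example.com",
    ("www.example.com", "A"): "93.184.215.14",
}

def resolve_a(name, max_chain=8):
    """Follow CNAME records until an A record is found."""
    for _ in range(max_chain):  # bound the chain to avoid CNAME loops
        ip = RECORDS.get((name, "A"))
        if ip:
            return ip
        cname = RECORDS.get((name, "CNAME"))
        if cname is None:
            return None
        name = cname  # re-query with the canonical name
    return None

print(resolve_a("blog.example.com"))  # follows the CNAME, then returns the A record
```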

&lt;h2&gt;
  
  
  DNS Resolver Caching: The Speed Booster
&lt;/h2&gt;

&lt;p&gt;Caching is crucial for DNS resolvers, especially for performance. Imagine if every single DNS query had to go through all these steps—we’d all be waiting a lot longer for our websites to load.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Machine Cache: Your device’s operating system caches DNS responses.&lt;/li&gt;
&lt;li&gt;Recursive Resolver Cache: The DNS resolver caches results to reduce the need for repeated queries to authoritative DNS servers.&lt;/li&gt;
&lt;li&gt;Browser Cache: Some modern browsers even perform DNS caching internally to make browsing faster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cache Expiry: TTL (Time-To-Live)
&lt;/h3&gt;

&lt;p&gt;The duration for which a DNS record is cached is governed by its Time-To-Live (TTL), a per-record value set by the domain’s administrator. Once the TTL expires, the cached entry is discarded and the next query must be resolved afresh.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-Life Example of TTL Impact
&lt;/h2&gt;

&lt;p&gt;Let’s say you run a website and recently migrated to a new server with a different IP address. If you’ve set a long TTL (e.g., 24 hours), some users might still be routed to the old server until the TTL expires and the new IP address propagates through the DNS system. On the other hand, a short TTL (e.g., 5 minutes) allows changes to propagate more quickly but at the cost of increased DNS query volume.&lt;/p&gt;
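&lt;p&gt;A back-of-envelope calculation makes the trade-off concrete (the client count is purely illustrative):&lt;/p&gt;

```python
# How TTL trades propagation delay for query volume (illustrative numbers).
clients = 100_000          # clients that each re-resolve once per TTL expiry
seconds_per_day = 86_400

for ttl in (86_400, 300):  # 24 hours vs 5 minutes
    queries_per_day = clients * (seconds_per_day // ttl)
    print(f"TTL {ttl:>6}s -> worst-case staleness {ttl}s, "
          f"~{queries_per_day:,} authoritative queries/day")
```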

&lt;h2&gt;
  
  
  DNS Security and Extensions
&lt;/h2&gt;

&lt;p&gt;DNSSEC (DNS Security Extensions): DNSSEC is a suite of security protocols that provide authenticity and integrity to DNS data, ensuring that the information returned during DNS resolution has not been tampered with. DNSSEC uses digital signatures to verify the authenticity of the DNS records.&lt;/p&gt;

&lt;p&gt;DoH (DNS over HTTPS) and DoT (DNS over TLS): To enhance user privacy and security, DNS queries can be encrypted using DoH or DoT. These protocols prevent eavesdropping and man-in-the-middle attacks by encrypting DNS queries between the client and the recursive resolver.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Why You Should Care
&lt;/h2&gt;

&lt;p&gt;The DNS resolver may seem like an obscure part of the internet’s plumbing, but without it, the web as we know it wouldn’t function. From converting domain names to IP addresses, handling various DNS record types, caching responses for faster browsing, to securing the resolution process with DNSSEC, the resolver does a lot more than meets the eye.&lt;/p&gt;

&lt;p&gt;As a software engineer, understanding how DNS resolution works gives an edge in diagnosing network issues, optimizing performance, and ensuring that your systems and applications run efficiently.&lt;/p&gt;

&lt;p&gt;So, the next time you browse the web, remember: DNS resolvers are the silent heroes that make it all possible.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Maximizing Redis Efficiency: Cutting Memory Costs with Redis Hashes</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Tue, 08 Oct 2024 04:30:00 +0000</pubDate>
      <link>https://dev.to/kiranukamath/maximizing-redis-efficiency-cutting-memory-costs-with-redis-hashes-1b7g</link>
      <guid>https://dev.to/kiranukamath/maximizing-redis-efficiency-cutting-memory-costs-with-redis-hashes-1b7g</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb5spepqu8c4z6xtaq4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb5spepqu8c4z6xtaq4i.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In-memory databases like Redis are known for their speed and efficiency, but their memory-centric design makes memory usage a critical factor when scaling an application. As applications grow, developers need to consider memory optimization strategies to keep costs low and performance high.&lt;/p&gt;

&lt;p&gt;Redis Hashes are highly useful when you have multiple related fields associated with a single entity. Instead of creating multiple individual keys for each field, you can consolidate them into a single key as a hash with multiple fields.&lt;/p&gt;

&lt;p&gt;This blog explores how Redis Hashes save memory and reduce infrastructure costs when your use case involves multiple related fields associated with a single entity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Redis: A Memory-Centric Database
&lt;/h2&gt;

&lt;p&gt;Redis stores all its data in memory, making it incredibly fast but also sensitive to memory consumption. Efficient memory usage directly impacts the performance and cost of a Redis instance, especially in high-scale systems where millions of keys may be managed.&lt;/p&gt;

&lt;p&gt;Every key stored in Redis incurs memory overhead, typically around 40 bytes per key, spent primarily on key management, pointers, and hash table buckets. When dealing with a massive number of keys, this overhead can quickly add up, leading to higher memory costs. For systems that require significant scaling, optimizing memory becomes crucial, and Redis provides several data structures to facilitate this, including Hashes, Sets, and Lists.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Redis Hashes?
&lt;/h2&gt;

&lt;p&gt;A Redis Hash is a key-value data structure where each Redis key contains a field-value pair, similar to how a dictionary works in programming languages. Unlike storing each field as a separate Redis key, multiple fields can be stored under one key.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Plain Key-Value:&lt;/p&gt;

&lt;p&gt;User:123:Name -&amp;gt; "John"&lt;br&gt;&lt;br&gt;
User:123:Age -&amp;gt; "30"&lt;/p&gt;

&lt;p&gt;Using Redis Hash:&lt;/p&gt;

&lt;p&gt;User:123 -&amp;gt; &lt;br&gt;
{ Name: "John", &lt;br&gt;
Age: "30" }&lt;/p&gt;

&lt;p&gt;With a hash, we store related fields (such as name and age) within a single Redis key, effectively reducing the number of keys and thus, the overhead incurred.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Redis Hashes Instead of Plain Key-Value Pairs?
&lt;/h2&gt;

&lt;p&gt;Now that we understand the basics, let’s dig deeper into why Redis Hashes are so effective for memory optimization.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reduced Memory Overhead&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The memory overhead of managing individual keys adds up quickly in Redis. By combining related fields into a single Redis Hash, you significantly reduce the number of keys in your dataset. This directly reduces the 40-byte overhead per key.&lt;/p&gt;
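&lt;p&gt;A rough sketch of the saving, using the ~40-byte figure above and hypothetical counts (only the key-management overhead is modeled here):&lt;/p&gt;

```python
# Rough estimate of key-management overhead alone (~40 bytes/key, as above).
PER_KEY_OVERHEAD = 40          # bytes, approximate
users = 1_000_000
fields_per_user = 10           # e.g. name, age, last_login, ... (illustrative)

plain_keys = users * fields_per_user   # one Redis key per field
hash_keys = users                      # one hash per user

plain_overhead = plain_keys * PER_KEY_OVERHEAD
hash_overhead = hash_keys * PER_KEY_OVERHEAD
print(f"plain keys: {plain_overhead / 1e6:.0f} MB, hashes: {hash_overhead / 1e6:.0f} MB")
# Hash fields carry their own (smaller) cost, so real savings are less dramatic,
# but the key-count term alone drops 10x in this sketch.
```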

&lt;ol&gt;
&lt;li&gt;Ziplist Compression&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Redis Hashes are stored as ziplists when they contain fewer fields than a configurable threshold (hash-max-ziplist-entries, 128 by default). Ziplists are optimized for memory efficiency, as they store data in a contiguous block of memory, avoiding the overhead of pointers and metadata associated with each field.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Better Memory Management in Large Scale Systems&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In systems where you might have hundreds of millions of key-value pairs, Redis Hashes allow you to organize and compress data more effectively. By reducing the number of keys, Redis spends less time resizing its hash tables, which improves performance and reduces memory fragmentation. Redis hash tables grow and shrink dynamically, and with fewer keys, Redis avoids frequent resizing operations.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost Savings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One of the most compelling benefits of using Redis Hashes is the cost savings.&lt;br&gt;
Let’s take a real-world example: consider an application with millions of users where each user’s activity is tracked in Redis. If you store each piece of user data as a separate Redis key, the memory overhead grows rapidly. However, by switching to Redis Hashes, you could reduce memory consumption significantly—by as much as 60%, depending on the dataset.&lt;/p&gt;

&lt;p&gt;This means you can run your Redis instance on smaller, less expensive hardware or reduce your cloud infrastructure costs. Memory optimization through Redis Hashes can lead to massive cost savings over time, especially at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges of Using Redis Hashes
&lt;/h2&gt;

&lt;p&gt;While Redis Hashes offer significant memory optimization benefits, there are a few challenges to be aware of:&lt;/p&gt;

&lt;p&gt;Memory Usage for Large Hashes&lt;/p&gt;

&lt;p&gt;If a hash grows beyond the hash-max-ziplist-entries threshold, Redis will convert the ziplist to a traditional hash table, which incurs more memory overhead. While this is generally acceptable for larger datasets, it’s important to monitor hash sizes and adjust the hash-max-ziplist-entries setting accordingly to balance memory efficiency and performance.&lt;/p&gt;

&lt;p&gt;Granular TTL&lt;/p&gt;

&lt;p&gt;In Redis versions before 7.4, Redis Hashes did not support individual TTLs for fields inside the hash: you could only set an expiration time for the entire hash key, so if one field needed to expire sooner than the others, hashes alone could not achieve that. Redis Community Edition 7.4 introduced the ability to specify an expiration time, or time-to-live (TTL), for individual hash fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Case: Real-Time Analytics in Gaming Leaderboards
&lt;/h2&gt;

&lt;p&gt;Consider an online gaming platform that tracks players' scores across multiple games. Initially, you might store each player's score for each game as a separate key:&lt;/p&gt;

&lt;p&gt;Player:123:Game:567:Score -&amp;gt; 100&lt;br&gt;
Player:123:Game:890:Score -&amp;gt; 150&lt;/p&gt;

&lt;p&gt;In this scenario, as more players engage with more games, the number of keys in Redis rapidly grows, leading to high memory overhead and management complexity. This explosion of keys makes Redis inefficient as it needs to handle a large number of keys, increasing lookup times and memory usage due to the metadata overhead associated with each key.&lt;/p&gt;

&lt;p&gt;Optimization with Redis Hashes:&lt;/p&gt;

&lt;p&gt;To optimize this, you can store scores in a Redis Hash where the player ID is the key, and the game IDs with their respective scores are stored as fields within the hash:&lt;/p&gt;

&lt;p&gt;Player:123 -&amp;gt; {Game:567 -&amp;gt; 100, Game:890 -&amp;gt; 150}&lt;/p&gt;

&lt;p&gt;This approach significantly reduces the number of keys, minimizing memory consumption. Instead of maintaining a separate key for each player-game combination, Redis handles just one key per player, with game-specific scores stored inside the hash.&lt;/p&gt;
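&lt;p&gt;Using standard Redis commands, the hash version of the example above looks like this (the key and field names follow the example and are illustrative):&lt;/p&gt;

```
HSET Player:123 Game:567 100 Game:890 150
HGETALL Player:123
HGET Player:123 Game:567
```

&lt;p&gt;HGETALL returns every game score for the player in a single round trip, while HGET fetches one field.&lt;/p&gt;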

&lt;p&gt;Advantages:&lt;/p&gt;

&lt;p&gt;Memory Efficiency: You reduce the memory overhead by collapsing multiple keys into one, avoiding the roughly 40-byte per-key overhead associated with Redis key management.&lt;/p&gt;

&lt;p&gt;Faster Retrieval: All game scores for a player can be retrieved in one go, improving performance for leaderboard queries or score lookups.&lt;/p&gt;

&lt;p&gt;Reduced Complexity: Managing scores for millions of players and thousands of games becomes more manageable, with fewer keys to handle during data replication or backup processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Switching from plain key-value pairs to Redis Hashes is one of the most powerful and effective strategies for optimizing memory usage in Redis. By consolidating multiple related key-value pairs into a single hash, you significantly reduce the number of individual keys in the database, which in turn minimizes overhead and improves overall memory efficiency. Redis also applies additional memory optimizations, such as ziplist compression for small hashes, allowing you to further conserve space.&lt;/p&gt;

&lt;p&gt;In large-scale applications where millions of keys are being managed or where high interaction rates are the norm, these optimizations can lead to substantial reductions in memory consumption. This not only results in better system performance but also mitigates memory fragmentation, which can degrade performance over time. More importantly, by lowering memory usage, Redis Hashes enable significant reductions in infrastructure costs, allowing you to achieve greater efficiency with the same hardware or cloud resources.&lt;/p&gt;

&lt;p&gt;When used properly, Redis Hashes provide an excellent tool for managing complex datasets efficiently. By grouping related data under a single key, you not only simplify your data model but also ensure that Redis performs optimally even under heavy load. This approach is particularly valuable in memory-constrained environments, or in scenarios where optimizing for cost is a priority.&lt;/p&gt;

&lt;p&gt;Ultimately, organizations that adopt Redis Hashes can expect to reduce infrastructure expenses while improving the responsiveness and scalability of their Redis clusters—making it a smart choice for any high-demand, data-intensive application.&lt;/p&gt;

</description>
      <category>redis</category>
    </item>
    <item>
      <title>Understanding hashCode() and equals() in Java</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Mon, 07 Oct 2024 04:33:00 +0000</pubDate>
      <link>https://dev.to/kiranukamath/understanding-hashcode-and-equals-in-java-4imi</link>
      <guid>https://dev.to/kiranukamath/understanding-hashcode-and-equals-in-java-4imi</guid>
      <description>&lt;p&gt;Java’s hashCode() and equals() methods are fundamental to the functioning of many core Java classes, particularly those in the Collections framework, such as HashMap, HashSet, and Hashtable. These methods define how Java objects behave when stored in collections that use hashing for efficient retrieval.&lt;/p&gt;

&lt;p&gt;In this blog, we’ll dive deep into how these methods work, the rationale behind them, and their impact on performance. We’ll also look at the relationship between hashCode() and equals(), explore best practices, and investigate a real-world example with Java's String class.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of hashCode() and equals()
&lt;/h2&gt;

&lt;p&gt;In Java, objects are often stored in collections such as HashMap or HashSet, which use hashing for efficient access and storage. For these collections to work as expected, the objects need to adhere to certain rules regarding equality (equals()) and hashing (hashCode()).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hashCode(): This method returns an integer hash code, which is used by hash-based collections to determine the "bucket" where an object should be placed.&lt;/li&gt;
&lt;li&gt;equals(): This method checks if two objects are meaningfully equal. It compares their internal state to determine if they represent the same logical entity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct implementation of these methods ensures that objects can be retrieved efficiently from collections and behave correctly when compared.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Dive into hashCode()
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Hash Code?
&lt;/h3&gt;

&lt;p&gt;A hash code is a numerical value generated for an object. It serves as a compact representation of the object’s contents. Hash codes are particularly useful for storing objects in hash-based collections (like HashMap, HashSet, and Hashtable) because they allow these collections to quickly find an object in a large dataset by using the hash code to determine the storage location (bucket).&lt;/p&gt;

&lt;h3&gt;
  
  
  How hashCode() Works
&lt;/h3&gt;

&lt;p&gt;The hashCode() method is defined in Java’s Object class and can be overridden to provide custom hash codes for objects. Here’s the method signature:&lt;/p&gt;

&lt;p&gt;public int hashCode();&lt;/p&gt;

&lt;p&gt;When you insert an object into a hash-based collection, the collection first calls the object’s hashCode() method to get its hash code. This hash code is then used to find the correct bucket where the object should be stored.&lt;/p&gt;

&lt;p&gt;But what happens if two objects have the same hash code? This is called a hash collision, and it’s something we’ll explore in more depth later. For now, the important takeaway is that two objects with the same hash code may be stored in the same bucket, and the collection will use equals() to differentiate between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overriding hashCode()
&lt;/h2&gt;

&lt;p&gt;Here’s an example of how you can override hashCode() in a custom class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
public class Person {
    private String name;
    private int age;

    @Override
    public int hashCode() {
        int result = 17;  // Start with a non-zero constant
        result = 31 * result + name.hashCode();  // Combine fields
        result = 31 * result + age;
        return result;
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example of Hash Code Calculation:
&lt;/h2&gt;

&lt;p&gt;Let's consider the string "abc":&lt;/p&gt;

&lt;p&gt;String str = "abc"; &lt;br&gt;
System.out.println(str.hashCode());&lt;/p&gt;

&lt;p&gt;The hash code is calculated as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the first character 'a' (ASCII 97): h = 31 * 0 + 97 = 97&lt;/li&gt;
&lt;li&gt;For the second character 'b' (ASCII 98): h = 31 * 97 + 98 = 3105&lt;/li&gt;
&lt;li&gt;For the third character 'c' (ASCII 99): h = 31 * 3105 + 99 = 96354&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus, the hash code for "abc" is 96354.&lt;/p&gt;
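&lt;p&gt;You can check this arithmetic directly. The sketch below reimplements the polynomial in Python (Java applies 32-bit signed overflow, which doesn’t occur for a string this short, but the sketch handles it anyway):&lt;/p&gt;

```python
def java_string_hash(s):
    """Java's String.hashCode polynomial: h = 31*h + char, with 32-bit signed wrap."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Interpret the result as a signed 32-bit int, as Java does.
    return h - 0x100000000 if h >= 0x80000000 else h

print(java_string_hash("abc"))  # 96354
```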
&lt;h2&gt;
  
  
  The Role of Prime Numbers in hashCode()
&lt;/h2&gt;

&lt;p&gt;You might have seen prime numbers like 31 commonly used in hash code calculations (e.g., 31 * h + val[i]). This is not arbitrary—prime numbers help reduce the likelihood of hash collisions. By multiplying intermediate hash values by a prime number, you distribute the hash values more evenly across possible buckets, ensuring better performance for hash-based collections.&lt;/p&gt;

&lt;p&gt;Prime numbers ensure that hash codes are distributed more uniformly because their factors are less predictable, reducing the likelihood that different combinations of object fields will produce the same hash code.&lt;/p&gt;
&lt;h2&gt;
  
  
  Hash Collisions and Buckets
&lt;/h2&gt;

&lt;p&gt;A hash collision occurs when two different objects return the same hash code. When this happens, both objects are stored in the same bucket, but they still need to be distinguishable. This is where the equals() method comes into play.&lt;/p&gt;

&lt;p&gt;In a HashMap, for instance, if two objects have the same hash code, they will be placed in the same bucket. However, the collection will then use equals() to check if the objects are truly equal. If they are, the collection considers them duplicates; if not, both objects are stored in the same bucket but treated as distinct elements.&lt;/p&gt;
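&lt;p&gt;A concrete collision: the strings "Aa" and "BB" hash identically under Java's polynomial. Sketched here in Python for illustration:&lt;/p&gt;

```python
def java_string_hash(s):
    """Java's String.hashCode polynomial (31*h + char, 32-bit signed wrap)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# "Aa" and "BB" are a classic Java hash collision:
print(java_string_hash("Aa"), java_string_hash("BB"))  # both 2112
# Equal hash codes, but "Aa".equals("BB") is false, so a HashMap keeps both
# keys in the same bucket and tells them apart with equals().
```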
&lt;h2&gt;
  
  
  Deep Dive into equals()
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is Equality?
&lt;/h3&gt;

&lt;p&gt;In Java, the equals() method defines logical equality between two objects. It is used to determine whether two objects represent the same logical entity, even if they are different instances.&lt;/p&gt;

&lt;p&gt;Here’s the method signature of equals():&lt;/p&gt;

&lt;p&gt;public boolean equals(Object obj);&lt;/p&gt;

&lt;p&gt;The default implementation of equals() in the Object class compares object references using ==, which checks for reference equality—i.e., whether two objects point to the same memory location. However, this default behavior is not suitable for most applications, where you want to compare the actual contents or values of objects, not just their memory addresses.&lt;/p&gt;
&lt;h3&gt;
  
  
  Overriding equals()
&lt;/h3&gt;

&lt;p&gt;Here’s an example of how you might override equals() in a custom class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
public class Person {
    private String name;
    private int age;

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        Person person = (Person) obj;
        return age == person.age &amp;amp;&amp;amp; name.equals(person.name);
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Contract Between equals() and hashCode()
&lt;/h2&gt;

&lt;p&gt;There is an important contract between equals() and hashCode() that every class must honor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If two objects are equal according to equals(), they must have the same hash code.&lt;/li&gt;
&lt;li&gt;If two objects are not equal, they can have the same or different hash codes (hash collisions).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hash-based collections rely on this contract to function properly. If two equal objects have different hash codes, they might be stored in different buckets, causing the collection to behave incorrectly (e.g., failing to retrieve an object or incorrectly treating two objects as distinct).&lt;/p&gt;

&lt;p&gt;In addition, an equals() implementation must satisfy the following properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reflexive: For any non-null object x, x.equals(x) must return true.&lt;/li&gt;
&lt;li&gt;Symmetric: For any non-null objects x and y, if x.equals(y) returns true, then y.equals(x) must also return true.&lt;/li&gt;
&lt;li&gt;Transitive: For any non-null objects x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) must also return true.&lt;/li&gt;
&lt;li&gt;Consistent: For any non-null objects x and y, repeated calls to x.equals(y) must consistently return true or false, provided no information used in equals() comparisons is modified.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Importance of hashCode() and equals() in Collections
&lt;/h2&gt;

&lt;p&gt;In hash-based collections like HashMap, HashSet, and Hashtable, the hashCode() and equals() methods are used together to store and retrieve elements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Insertion in a HashMap:
&lt;/h3&gt;

&lt;p&gt;Hashing: The hashCode() method is called on the key to determine the bucket (array index) where the object should be placed.&lt;/p&gt;

&lt;p&gt;Equality Check: If another object already exists in the bucket (due to a hash collision), the equals() method is used to check if the two objects are equal. If they are equal, the existing value is replaced with the new one; otherwise, the new object is stored in the same bucket using techniques like chaining or probing.&lt;/p&gt;
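&lt;p&gt;A toy model of this insertion logic (illustrative Python; a real HashMap also spreads hash bits and converts long chains to trees):&lt;/p&gt;

```python
# A toy model of hash-bucket insertion with chaining.
CAPACITY = 16
buckets = [[] for _ in range(CAPACITY)]

def put(key, value):
    idx = hash(key) % CAPACITY           # step 1: the hash code picks the bucket
    chain = buckets[idx]
    for i, (k, _) in enumerate(chain):
        if k == key:                     # step 2: equality check finds duplicates
            chain[i] = (key, value)      # equal key: replace the existing value
            return
    chain.append((key, value))           # distinct key: chained in the same bucket

put("alice", 1)
put("alice", 2)   # replaces the value, does not add a duplicate entry
put("bob", 3)
```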

&lt;h3&gt;
  
  
  Retrieval from a HashMap:
&lt;/h3&gt;

&lt;p&gt;When retrieving a value from a HashMap using a key:&lt;/p&gt;

&lt;p&gt;Hashing: The hashCode() of the key is used to find the correct bucket.&lt;/p&gt;

&lt;p&gt;Equality Check: The equals() method is used to identify the correct key-value pair within the bucket (if multiple objects are stored due to hash collisions).&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact of Incorrect Implementation:
&lt;/h3&gt;

&lt;p&gt;If equals() is overridden without hashCode(), hash-based collections might not work correctly. Two objects that are equal according to equals() could end up in different buckets because their hash codes differ.&lt;/p&gt;

&lt;p&gt;If hashCode() is poorly implemented, you might experience frequent hash collisions, leading to performance degradation in collections like HashMap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Mistakes and Best Practices:
&lt;/h2&gt;

&lt;p&gt;Always Override hashCode() When Overriding equals():&lt;/p&gt;

&lt;p&gt;Ensure that equal objects have the same hash code.&lt;/p&gt;

&lt;p&gt;Failing to override hashCode() when overriding equals() violates the contract between the two methods and can lead to bugs in collections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Immutable Fields&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s a good practice to use immutable fields (e.g., final fields) in hashCode() and equals() to prevent the state of an object from changing after it’s been inserted into a collection.&lt;/p&gt;

&lt;p&gt;If the fields that equals() and hashCode() rely on can be modified, the object’s behavior in hash-based collections may become unpredictable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Prime Numbers for Hashing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prime numbers like 31 are commonly used in hash functions because they help distribute hash values more uniformly across the hash table, reducing collisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid Using Floating-Point Numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using float or double in hashCode() can be tricky due to precision issues. If you must include them, consider converting them to int using Float.floatToIntBits() or Double.doubleToLongBits().&lt;/p&gt;
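&lt;p&gt;As a quick sketch (Point is a made-up class), a double field can be folded into equals() and hashCode() via Double.doubleToLongBits(), which compares the raw bit patterns instead of relying on floating-point equality:&lt;/p&gt;

```java
// Hypothetical value class with double fields: doubleToLongBits() turns
// each double into a stable 64-bit pattern, which is folded into an int.
class Point {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return Double.doubleToLongBits(x) == Double.doubleToLongBits(p.x)
            && Double.doubleToLongBits(y) == Double.doubleToLongBits(p.y);
    }

    @Override public int hashCode() {
        long bits = Double.doubleToLongBits(x);
        int h = 31 * 17 + (int) (bits ^ (bits >>> 32)); // fold 64 bits into 32
        bits = Double.doubleToLongBits(y);
        return 31 * h + (int) (bits ^ (bits >>> 32));
    }
}
```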

&lt;p&gt;&lt;strong&gt;Caching Hash Codes for Immutable Objects&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If an object is immutable (like String), you can cache its hash code to avoid recomputation, improving performance. This is done in the String class, where the hashCode is computed once and stored for future use.&lt;/p&gt;

&lt;h2&gt;
  
  
  String Class implementation
&lt;/h2&gt;

&lt;p&gt;In Java, String is a class that represents a sequence of characters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation of equals() in String&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the implementation of the equals() method in the String class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String) anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breakdown of equals() Logic:&lt;/p&gt;

&lt;p&gt;Reference Check: The method first checks if the two references (this and anObject) point to the same object (this == anObject). If true, they are considered equal, and the method returns true.&lt;/p&gt;

&lt;p&gt;Type Check: If the two objects are not the same reference, the method checks whether anObject is an instance of String. If not, false is returned.&lt;/p&gt;

&lt;p&gt;Length Check: If both objects are strings, their lengths are compared. If the lengths differ, the strings cannot be equal, and the method returns false.&lt;/p&gt;

&lt;p&gt;Content Comparison: If the lengths are the same, the method compares the individual characters of the two strings. If any character differs, false is returned. If all characters match, true is returned, indicating that the strings are equal.&lt;/p&gt;

&lt;p&gt;This implementation ensures that two strings are considered equal if and only if they contain the exact same sequence of characters in the same order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation of hashCode() in String&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here’s the actual implementation of hashCode() in the String class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
public int hashCode() {
    int h = hash;
    if (h == 0 &amp;amp;&amp;amp; value.length &amp;gt; 0) {
        char val[] = value;

        for (int i = 0; i &amp;lt; value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breakdown of hashCode() Logic:&lt;/p&gt;

&lt;p&gt;Cached Hash Code: The String class caches the hash code once it has been calculated, in the field hash. The method first checks whether the hash has been computed yet (h == 0 means it has not). If h is non-zero, the method simply returns the cached hash code without recomputing it.&lt;/p&gt;

&lt;p&gt;Hash Calculation: If the hash code hasn’t been calculated yet, the method iterates over the characters in the string (char val[] = value). For each character, the current hash is multiplied by 31 and added to the character’s value (h = 31 * h + val[i]). This results in a final hash code that represents the string.&lt;/p&gt;

&lt;p&gt;Return Value: Once the hash code is computed, it is cached in the hash variable and returned.&lt;/p&gt;


</description>
    </item>
    <item>
      <title>Understanding Git Rebase</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Sun, 06 Oct 2024 04:30:00 +0000</pubDate>
      <link>https://dev.to/kiranukamath/understanding-git-rebase-9ca</link>
      <guid>https://dev.to/kiranukamath/understanding-git-rebase-9ca</guid>
      <description>&lt;p&gt;In the world of version control, Git rebase stands as one of the most powerful yet often misunderstood tools. Especially for developers like us, working on collaborative projects, mastering git rebase can transform how we manage code history, resolve conflicts, and maintain clean, linear commit histories.&lt;/p&gt;

&lt;p&gt;However, it's equally known for being dangerous when used incorrectly, as it involves rewriting history. This guide explores all facets of git rebase, including real-world scenarios, practical command-line examples, and why it can be both powerful and hazardous.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Git Rebase?
&lt;/h2&gt;

&lt;p&gt;Git rebase allows you to move or "reapply" commits from one branch on top of another. This operation rewrites the commit history in the process. When working with multiple collaborators, rebase offers a way to synchronize the local work with the upstream repository cleanly.&lt;/p&gt;

&lt;p&gt;In a typical git pull operation, the default merge strategy creates a new "merge commit" to combine changes. In contrast, rebase moves your changes onto the tip of another branch, creating a linear history and avoiding merge commits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Points:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Rewriting History: Rebase changes the commit history.&lt;/li&gt;
&lt;li&gt;Commit Cherry-picking: It picks commits from one branch and replays them on another.&lt;/li&gt;
&lt;li&gt;Linear History: This is the primary advantage—simplifying the history by removing unnecessary merge commits.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Does Everyone Think Git Rebase Is Dangerous?
&lt;/h2&gt;

&lt;p&gt;The primary danger of git rebase comes from the fact that it rewrites commit history. Once history is rewritten and pushed to a shared repository, it can lead to confusion, lost commits, or merge conflicts for others working on the same branch.&lt;/p&gt;

&lt;p&gt;A few important points:&lt;/p&gt;

&lt;p&gt;Changes Commit IDs: Each commit gets a new hash, so any reference to old commit IDs will be broken.&lt;/p&gt;

&lt;p&gt;Pushing After Rebase Can Break Shared Repos: If you push after rebasing a branch that others have already pulled, their version of the branch will conflict with the rewritten history.&lt;/p&gt;

&lt;p&gt;Losing Uncommitted Work: If conflicts are not properly resolved during rebasing, uncommitted work might be lost.&lt;/p&gt;

&lt;p&gt;To safely use git rebase, follow this golden rule:&lt;/p&gt;

&lt;p&gt;Never rebase commits that have been pushed to a shared repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rebase vs. Merge: Key Differences
&lt;/h2&gt;

&lt;p&gt;While both rebase and merge aim to combine changes from different branches, they take radically different approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Git Merge&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit History: Creates a merge commit, keeps all history.&lt;/li&gt;
&lt;li&gt;Workflow: No changes to existing commits.&lt;/li&gt;
&lt;li&gt;Merge Commits: Creates merge commits.&lt;/li&gt;
&lt;li&gt;Usage: Suitable for preserving full history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Git Rebase&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Commit History: Rewrites history by moving commits.&lt;/li&gt;
&lt;li&gt;Workflow: Re-applies commits on top of the new branch.&lt;/li&gt;
&lt;li&gt;Merge Commits: Does not create merge commits.&lt;/li&gt;
&lt;li&gt;Usage: Ideal for a linear, clean history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Let's say you have the following commit history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  A---B---C feature-branch
 /
D---E---F master
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using git merge feature-branch from master will result in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  A---B---C   feature-branch
 /         \
D---E---F---M   merge commit master

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using git rebase master from feature-branch will reapply the feature branch commits on top of master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;D---E---F---A---B---C rebased feature-branch

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  History Rewrite: What Does It Mean?
&lt;/h2&gt;

&lt;p&gt;When we talk about rewriting history in Git, we mean changing the commit graph or altering the details of past commits (like commit IDs, messages, or ordering). Rebase changes the parent commit of each commit in the rebased branch. This makes it appear as if the commits were made directly on top of the target branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interactive Rebase: The Power at Your Fingertips
&lt;/h2&gt;

&lt;p&gt;Interactive rebase (git rebase -i) is one of the most powerful features in Git. It allows you to rewrite commits in a detailed, customizable way. You can:&lt;/p&gt;

&lt;p&gt;Squash commits: Combine multiple commits into one.&lt;/p&gt;

&lt;p&gt;Reword commit messages: Edit the message of specific commits.&lt;/p&gt;

&lt;p&gt;Drop commits: Remove unnecessary commits from the history.&lt;/p&gt;
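&lt;p&gt;For example, running git rebase -i HEAD~3 opens a todo list in your editor; the commit hashes and messages below are purely illustrative:&lt;/p&gt;

```
pick a1b2c3d Add login endpoint        # keep this commit as-is
squash b2c3d4e Fix typo in login       # meld into the previous commit
reword c3d4e5f Update README           # keep, but edit the commit message
```

&lt;p&gt;Saving and closing the file makes Git apply the chosen actions in order, prompting you for new commit messages where needed.&lt;/p&gt;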

&lt;h2&gt;
  
  
  Basic Rebase Workflow
&lt;/h2&gt;

&lt;p&gt;Let's consider a scenario where you're working on a feature-branch and the master branch has progressed with new commits. You want to incorporate the latest changes from master into your feature-branch while keeping a linear history.&lt;/p&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the latest changes from the remote: git fetch origin&lt;/li&gt;
&lt;li&gt;Switch to feature-branch: git checkout feature-branch&lt;/li&gt;
&lt;li&gt;Rebase on top of master: git rebase origin/master&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This will move all your commits in feature-branch on top of the latest master branch commits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Complex Rebase Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Rebase Conflicts and How to Resolve Them
&lt;/h3&gt;

&lt;p&gt;Rebase conflicts are inevitable when your branch modifies the same files as the branch you’re rebasing onto.&lt;/p&gt;

&lt;p&gt;When rebasing feature-branch onto master, Git may detect that both branches modified the same line in file foo.txt.&lt;/p&gt;

&lt;p&gt;Git will stop at the conflicting commit and mark the file as conflicted.&lt;/p&gt;

&lt;p&gt;Open the conflicted file and manually resolve the differences.&lt;/p&gt;

&lt;p&gt;After resolving, run:&lt;/p&gt;

&lt;p&gt;git add foo.txt&lt;br&gt;
git rebase --continue&lt;/p&gt;

&lt;h3&gt;
  
  
  Rebase with Multiple Branches
&lt;/h3&gt;

&lt;p&gt;In a scenario with more than two branches (say develop, release, and a feature branch), rebasing becomes more complex but follows the same principles.&lt;/p&gt;

&lt;p&gt;git checkout feature-branch &lt;br&gt;
git rebase release&lt;/p&gt;

&lt;p&gt;Your commits will be replayed on top of release, ensuring your changes are up-to-date with the release branch while still staying synced with develop.&lt;/p&gt;

&lt;p&gt;If a rebase goes wrong at any point, you can abort it and restore the branch to its original state:&lt;/p&gt;

&lt;p&gt;git rebase --abort&lt;/p&gt;

&lt;h2&gt;
  
  
  Rebase Best Practices
&lt;/h2&gt;

&lt;p&gt;Rebase Locally, Merge Remotely: Rebase is perfect for keeping your local history clean, but when pushing changes, consider using merge to avoid overwriting shared history.&lt;/p&gt;

&lt;p&gt;Rebase Before Push: Always rebase your feature branch onto the latest master before pushing to avoid merge commits.&lt;/p&gt;

&lt;p&gt;Interactive Rebase for History Cleanup: Use git rebase -i to curate your commit history before making your changes public.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Git rebase is an incredibly powerful tool, but it should be used with caution. Its ability to rewrite history makes it ideal for keeping commit histories clean, but also risky when used on shared branches. By mastering rebase commands and understanding when to apply them, you can streamline Git workflows while avoiding common pitfalls.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Power of Java Virtual Threads: A Deep Dive into Scalable Concurrency</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Sat, 05 Oct 2024 03:25:02 +0000</pubDate>
      <link>https://dev.to/kiranukamath/power-of-java-virtual-threads-a-deep-dive-into-scalable-concurrency-3529</link>
      <guid>https://dev.to/kiranukamath/power-of-java-virtual-threads-a-deep-dive-into-scalable-concurrency-3529</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_webp%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F64db0c10-9f23-4a9e-868a-fb897a160f00_1779x944.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_webp%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F64db0c10-9f23-4a9e-868a-fb897a160f00_1779x944.png" alt="Virtual threads"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Java 21 introduces a groundbreaking feature: virtual threads, designed to address the limitations of traditional threading models and make high-concurrency applications more accessible and efficient. In this blog, we'll dive deep into the why, what, and how of virtual threads, compare them with other concurrency models, and explore practical use cases with coding examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Virtual Threads?
&lt;/h2&gt;

&lt;p&gt;Virtual threads are lightweight threads that are managed by the Java runtime rather than the OS, that reduce the effort of writing, maintaining, and debugging high-throughput concurrent applications. They provide a similar programming model to traditional threads but with much lower resource overhead, enabling the creation and management of a large number of concurrent tasks more efficiently.&lt;/p&gt;

&lt;p&gt;There are two kinds of threads: platform threads and virtual threads. Like a platform thread, a virtual thread is also an instance of java.lang.Thread. However, a virtual thread isn't tied to a specific OS thread; when code running in a virtual thread calls a blocking I/O operation, the Java runtime suspends the virtual thread until it can be resumed. The OS thread associated with the suspended virtual thread is then free to perform operations for other virtual threads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Virtual Threads?
&lt;/h2&gt;

&lt;p&gt;Use virtual threads in high-throughput concurrent applications, especially those that consist of a large number of concurrent tasks that spend much of their time waiting.&lt;/p&gt;

&lt;p&gt;Traditional threads, or platform threads, in Java are directly mapped to operating system (OS) threads. While they are powerful, they come with several limitations:&lt;/p&gt;

&lt;p&gt;Resource Heavy: Each platform thread consumes a significant amount of memory (typically 1MB stack size by default).&lt;/p&gt;

&lt;p&gt;Scalability Issues: Managing a large number of threads can lead to high context-switching overhead, making it difficult to scale applications efficiently.&lt;/p&gt;

&lt;p&gt;Complexity: Writing scalable and maintainable multi-threaded code is complex and error-prone.&lt;/p&gt;

&lt;p&gt;These limitations hinder the development of highly concurrent applications, especially those that need to handle tens of thousands or even millions of concurrent tasks, such as web servers or real-time data processing systems.&lt;/p&gt;

&lt;p&gt;Virtual threads are not faster threads; they exist to provide scale (higher throughput), not speed (lower latency). Virtual threads are suitable for running tasks that spend most of the time blocked, often waiting for I/O operations to complete. However, they aren't intended for long-running CPU-intensive operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do Virtual Threads Work?
&lt;/h2&gt;

&lt;p&gt;Virtual threads decouple the application-level concurrency from the OS-level threading model. This decoupling allows the JVM to manage thousands or millions of virtual threads efficiently by multiplexing them onto a smaller number of platform threads.&lt;/p&gt;

&lt;p&gt;When the Java runtime schedules a virtual thread, it assigns, or mounts, the virtual thread on a platform thread; the operating system then schedules that platform thread as usual. This platform thread is called a carrier. After running some code, the virtual thread can unmount from its carrier. This usually happens when the virtual thread performs a blocking I/O operation. After a virtual thread unmounts from its carrier, the carrier is free, which means the Java runtime scheduler can mount a different virtual thread on it. However, a virtual thread cannot be unmounted during blocking operations when it is pinned to its carrier. A virtual thread is pinned when it runs code inside a synchronized block, so it is recommended to use ReentrantLock instead.&lt;/p&gt;
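&lt;p&gt;A minimal sketch of that recommendation (LockDemo is a made-up class): guarding a critical section with a ReentrantLock instead of a synchronized block, so a virtual thread that blocks while waiting for the lock can still unmount from its carrier.&lt;/p&gt;

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch: ReentrantLock instead of synchronized, so a virtual thread
// blocked on the lock does not pin its carrier thread.
public class LockDemo {
    private static final ReentrantLock LOCK = new ReentrantLock();
    static int counter = 0;

    static void increment() {
        LOCK.lock();       // a virtual thread blocked here can unmount
        try {
            counter++;     // critical section
        } finally {
            LOCK.unlock(); // always release in finally
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = Thread.ofVirtual().start(LockDemo::increment);
        Thread t2 = Thread.ofVirtual().start(LockDemo::increment);
        t1.join();
        t2.join();
        System.out.println("counter = " + counter); // prints "counter = 2"
    }
}
```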

&lt;h2&gt;
  
  
  Creating Virtual Threads:
&lt;/h2&gt;

&lt;p&gt;Creating virtual threads in Java 21 is straightforward. Here’s an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thread.startVirtualThread(() -&amp;gt; {
 // Simulate some work
 System.out.println("Running in virtual thread: " +  Thread.currentThread());
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Virtual threads can also be created with an ExecutorService, which makes them easier to manage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ExecutorService myExecutor=Executors.newVirtualThreadPerTaskExecutor());
Future&amp;lt;?&amp;gt; future = myExecutor.submit(() -&amp;gt; System.out.println("Running thread"));
future.get();
System.out.println("Task completed");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Can I create 1,000,000 Virtual Threads?
&lt;/h2&gt;

&lt;p&gt;Yes. I tried creating 1,000,000, and it took about 4.5 seconds to create and run them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
public static void main(String[] args) throws InterruptedException {
        Instant start = Instant.now();
        // use a concurrent set: a plain HashSet is not safe for concurrent adds
        Set&amp;lt;Long&amp;gt; vThreadIds = ConcurrentHashMap.newKeySet();

        var vThreads = IntStream.range(0, 1_000_000)
                .mapToObj(i -&amp;gt; Thread.ofVirtual().unstarted(() -&amp;gt; {
                    vThreadIds.add(Thread.currentThread().getId());
                    try {
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                })).toList();

        vThreads.forEach(Thread::start);
        for (var thread : vThreads) {
            thread.join();
        }

        Instant end = Instant.now();
        System.out.println("Time =" + Duration.between(start, end).toMillis() + " ms");
        System.out.println("Number of unique vThreads used " + vThreadIds.size());
    }

Output:
Time = 4482 ms
Number of unique vThreads used 1000000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Memory Efficiency: Virtual threads use a much smaller stack size than platform threads, often just a few kilobytes compared to megabytes. This reduction in stack size allows the JVM to manage a much larger number of virtual threads without exhausting memory.&lt;/p&gt;

&lt;p&gt;JVM Management: The JVM handles the scheduling of virtual threads onto a smaller number of platform threads, reducing the overhead of context switching and improving efficiency.&lt;/p&gt;

&lt;p&gt;Scalability: By decoupling application-level concurrency from OS-level threads, virtual threads can scale to handle hundreds of thousands of concurrent tasks, making them ideal for modern applications requiring high concurrency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Can’t I Create 100,000 Normal Threads?
&lt;/h2&gt;

&lt;p&gt;Creating 100,000 normal (platform) threads is not feasible due to their heavy memory consumption and the OS's limitations in handling such a large number of threads. Each platform thread typically uses around 1MB of memory for its stack, so 100,000 threads would require around 100GB of memory just for the stacks, which is impractical for most systems. Try it: I created Executors.newFixedThreadPool(100000) and got an OutOfMemoryError. Please do try it; it's interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Virtual Threads vs. Other Concurrency Models
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Virtual Threads and Parallel Streams&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Parallel streams abstract the complexity of parallel processing but are limited by the number of available cores. Virtual threads can handle a larger number of concurrent tasks by efficiently managing the available resources, making them suitable for tasks that involve I/O operations and need higher concurrency. Note: parallel streams should never be used where I/O operations are involved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Threads and CompletableFutures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CompletableFutures offer a way to handle asynchronous computations but can become complex when dealing with many interdependent futures. Virtual threads simplify this by allowing a straightforward threading model without the overhead of managing numerous futures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Virtual Threads and Reactive WebFlux&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reactive programming with WebFlux is powerful for I/O-bound applications but requires a different programming model. Virtual threads offer a simpler, more intuitive model while achieving similar scalability for many concurrent I/O tasks. (Personally, I like reactive programming.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Virtual Threads in One Request Per Thread Model
&lt;/h2&gt;

&lt;p&gt;The "one request per thread" model is a common pattern where each incoming request is handled by a separate thread(tomcat). This model is simple and intuitive but scales poorly with platform threads due to their high resource usage. Virtual threads can revolutionize this model by making it feasible to handle thousands or even more of concurrent requests efficiently.Simple and Scalable: This example sets up an HTTP server where each request is handled by a new virtual thread, using the Executors.newVirtualThreadPerTaskExecutor(). This approach combines the simplicity of the one request per thread model with the scalability of virtual threads.Low Overhead: Virtual threads allow the server to handle a massive number of concurrent connections without the resource overhead associated with platform threads.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Are Virtual Threads So Efficient?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_webp%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Fd24886a3-b166-4348-8cc1-e449e8548f6f_800x388.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2Fw_1456%2Cc_limit%2Cf_webp%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Fd24886a3-b166-4348-8cc1-e449e8548f6f_800x388.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assume a scenario where we have one carrier thread(platform thread) and three virtual threads. We will focus on what happens when a virtual thread performs an I/O operation and how the other virtual threads are managed.&lt;/p&gt;

&lt;p&gt;VirtualThread-1 starts executing on CarrierThread-1. At some point, VirtualThread-1 needs to perform a blocking I/O operation (e.g., reading from a file or a network socket). VirtualThread-2 and VirtualThread-3 are in a ready-to-run state but are not currently running.&lt;/p&gt;

&lt;p&gt;When VirtualThread-1 hits the blocking I/O operation, it cannot continue execution until the I/O is complete. The JVM detects that VirtualThread-1 is about to block and performs a context switch.&lt;/p&gt;

&lt;p&gt;The JVM unlinks VirtualThread-1 from CarrierThread-1. VirtualThread-1 is now parked and placed in a waiting state. CarrierThread-1 becomes available to run other virtual threads.&lt;/p&gt;

&lt;p&gt;The JVM scheduler selects VirtualThread-2 to run next. VirtualThread-2 is now linked to CarrierThread-1 and starts executing on this carrier thread. &lt;/p&gt;

&lt;p&gt;VirtualThread-2 runs on CarrierThread-1 until it either completes or hits another blocking operation. When VirtualThread-2 also performs a blocking operation, a similar unlinking process occurs. CarrierThread-1 would then be free to run VirtualThread-3. &lt;/p&gt;

&lt;p&gt;Once the I/O operation that blocked VirtualThread-1 completes, the virtual thread is ready to resume execution. The JVM scheduler finds an available carrier thread for VirtualThread-1. &lt;/p&gt;

&lt;p&gt;If CarrierThread-1 is available, VirtualThread-1 is re-linked to CarrierThread-1. If CarrierThread-1 is busy, the JVM scheduler will find another available carrier thread. VirtualThread-1 resumes execution from the point where it was blocked. The execution continues until the virtual thread completes or hits another blocking operation.&lt;/p&gt;

&lt;p&gt;Virtual threads have their stacks stored in the heap, unlike platform threads that use OS-provided stack space. This allows the JVM to efficiently manage and switch stacks without the overhead of OS-level thread context switching. &lt;/p&gt;

&lt;p&gt;Virtual threads leverage a continuation-based model. When a virtual thread is unlinked from a carrier thread, its state is captured as a continuation. This state can be stored and resumed later without needing a one-to-one mapping with carrier threads. &lt;/p&gt;

&lt;p&gt;The JVM's scheduler ensures that carrier threads are not idle when there are runnable virtual threads. This efficient scheduling minimizes the time a carrier thread is idle and maximizes CPU utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Continuation-based model
&lt;/h2&gt;

&lt;p&gt;A continuation is a mechanism that allows a computation to be paused and resumed at a later point. In the context of virtual threads, continuations enable the JVM to suspend and resume the execution of virtual threads efficiently.&lt;/p&gt;

&lt;p&gt;Pausing Execution: When a virtual thread encounters a blocking operation (e.g., waiting for I/O), the JVM can pause the execution of the thread. This is done by saving the current state of the virtual thread, including its call stack and local variables.&lt;/p&gt;

&lt;p&gt;Resuming Execution: Once the blocking operation completes, the JVM can resume the execution of the virtual thread from the exact point it was paused. The saved state is restored, and the computation continues as if it were never interrupted.&lt;/p&gt;

&lt;p&gt;Encountering a Blocking Operation: When a virtual thread performs a blocking operation (such as waiting for an I/O operation), the JVM detects this and initiates the process of unlinking the virtual thread from its current platform thread. &lt;/p&gt;

&lt;p&gt;Unlinking from Platform Threads: The JVM suspends the execution of the virtual thread by creating a continuation. The current state of the virtual thread is saved and the virtual thread is unlinked from the platform thread. The platform thread is then free to execute other tasks. &lt;/p&gt;

&lt;p&gt;Moving to the Heap: The state of the virtual thread, now encapsulated as a continuation, is stored in the heap. This means that while the virtual thread is waiting for the I/O operation to complete, it does not consume the resources of a platform thread. &lt;/p&gt;

&lt;p&gt;Completing the I/O Operation: Once the I/O operation completes, the JVM reactivates the virtual thread. The saved continuation is retrieved from the heap, and the virtual thread is linked back to an available platform thread. &lt;/p&gt;

&lt;p&gt;Resuming Execution: The virtual thread resumes execution from the point where it was paused, continuing with its task as if it was never interrupted.&lt;/p&gt;

&lt;p&gt;The JDK's virtual thread scheduler is a work-stealing ForkJoinPool that operates in FIFO mode. The parallelism of the scheduler is the number of platform threads available for the purpose of scheduling virtual threads. By default, it is equal to the number of available processors. A virtual thread can be scheduled on different carriers over the course of its lifetime, i.e., the scheduler does not maintain affinity between a virtual thread and any particular platform thread.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: When to Use Virtual Threads
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When to Use Virtual Threads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;High Concurrency Needs: When your application requires handling a large number of concurrent tasks, such as a web server handling many simultaneous connections.&lt;/p&gt;

&lt;p&gt;I/O-Bound Tasks: Ideal for applications with many I/O-bound tasks, where threads spend most of their time waiting for I/O operations to complete.&lt;/p&gt;

&lt;p&gt;Simplifying Thread Management: When you want to simplify concurrency management without dealing with the complexities of asynchronous programming or reactive frameworks.&lt;/p&gt;

&lt;p&gt;Request per Thread Model: Perfect for web servers and other applications following the request per thread model, allowing each request to be handled independently and efficiently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Not to Use Virtual Threads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Heavy Computational Tasks: For CPU-bound tasks that require intensive computation, traditional threading models or ForkJoinPool or parallel streams might be more efficient.&lt;/p&gt;

&lt;p&gt;Limited Threading Needs: If your application only requires a manageable number of threads, the benefits of virtual threads might not be significant.&lt;/p&gt;

&lt;p&gt;Credits: the official Java, Oracle, and OpenJDK documentation, and the Java YouTube channel.&lt;/p&gt;

</description>
      <category>java</category>
      <category>multithreading</category>
    </item>
    <item>
      <title>Unlocking Efficiency: How Bloom Filters Save Space and Supercharge Data Access</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Tue, 19 Sep 2023 16:01:00 +0000</pubDate>
      <link>https://dev.to/kiranukamath/unlocking-efficiency-how-bloom-filters-save-space-and-supercharge-data-access-2kg5</link>
      <guid>https://dev.to/kiranukamath/unlocking-efficiency-how-bloom-filters-save-space-and-supercharge-data-access-2kg5</guid>
      <description>&lt;p&gt;Bloom filters stand out as a clever and efficient way to determine whether an element is a member of a set. This probabilistic data structure is particularly useful when dealing with large datasets and applications where memory efficiency and fast set membership testing are essential. In this blog post, we will delve into the fascinating world of Bloom filters, exploring their inner workings, use cases, advantages, and limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a Bloom Filter?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A Bloom filter is a space-efficient probabilistic data structure designed to quickly test whether an element belongs to a set or not. It accomplishes this by using a bit array of a fixed size and a series of hash functions. When an element is added to the Bloom filter, the hash functions generate a set of positions in the bit array where bits are set to 1. To check for membership, the same hash functions are applied to the query element, and if all corresponding bits are set to 1, it suggests that the element may be in the set. However, false positives are possible, but false negatives are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Does a Bloom Filter Work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialisation&lt;/strong&gt;: A Bloom filter begins as an array of bits, all initially set to 0.&lt;br&gt;
&lt;strong&gt;Adding Elements&lt;/strong&gt;: To add an element to the filter, it undergoes multiple hash functions that generate a set of bit positions. These positions are then set to 1 in the filter. Each hash function takes the element as input and produces an output, typically a numeric value. This output is then mapped to a position in the bit array using modulo arithmetic. For example, if you have a Bloom filter with a bit array of size m, and one of the hash functions produces an output h, you can map it to a position in the bit array using h mod m.&lt;br&gt;
&lt;strong&gt;Membership Query&lt;/strong&gt;: To check if an element is a member of the set, the same hash functions are applied to it. If all corresponding bits in the filter are set to 1, the element is considered a possible member. If any bit is 0, it is definitely not in the set.&lt;/p&gt;

&lt;p&gt;Let’s take an example to illustrate this. Suppose we have a Bloom filter with 8 bits (a bit array of size 8) and two hash functions.&lt;/p&gt;

&lt;p&gt;Hash Function 1 takes the element “java” and produces an output of 3.&lt;br&gt;
Hash Function 2 takes the same element “java” and produces an output of 6.&lt;br&gt;
To add “java” to the Bloom filter:&lt;/p&gt;

&lt;p&gt;Position 3 and Position 6 in the bit array are set to 1.&lt;br&gt;
To check whether “java” is in the Bloom filter, apply Hash Function 1 and Hash Function 2 to “java” and check whether both Position 3 and Position 6 in the bit array are set to 1. If both bits are set to 1, “java” is considered a possible member.&lt;/p&gt;

&lt;p&gt;It’s important to note that the positions in the bit array for different elements can overlap, which is why false positives can occur when checking for membership. False positives happen when the bits set to 1 for one element overlap with the bits set to 1 for another element, making the filter think an element is present when it’s not. The probability of false positives depends on the size of the bit array, the number of hash functions, and the number of elements added to the filter.&lt;/p&gt;
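&lt;p&gt;The add-and-query mechanics described above can be sketched in a few lines of Python. This is a toy illustration, not code from any library: the salted SHA-256 "hash functions", the 64-bit array, and the choice of two hash functions are arbitrary choices for the example.&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    """A toy Bloom filter: a fixed-size bit array plus k hash functions
    derived from SHA-256 with different salts."""

    def __init__(self, size=64, num_hashes=2):
        self.size = size
        self.bits = [0] * size
        self.num_hashes = num_hashes

    def _positions(self, item):
        # Each "hash function" is SHA-256 salted with the function index,
        # mapped into the bit array with modulo arithmetic (h mod size).
        for i in range(self.num_hashes):
            h = int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16)
            yield h % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not in the set"
        return all(self.bits[pos] == 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("java")
print(bf.might_contain("java"))  # True: no false negatives are possible
```

&lt;p&gt;Querying an element that was never added will usually return False, but it can return True when that element's positions happen to overlap with bits set by other elements; that is exactly the false-positive case described above.&lt;/p&gt;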

&lt;p&gt;&lt;strong&gt;How Bloom Filters Save Space in Data Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bloom filters are ingenious data structures known for their space-efficient characteristics. They accomplish this by making a few trade-offs and using probabilistic techniques. Here’s how Bloom filters save space in data storage:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compact Representation&lt;/strong&gt;: Bloom filters use a compact representation of data compared to other data structures like hash tables or binary search trees. Instead of storing the actual elements, a Bloom filter employs a fixed-size bit array.&lt;br&gt;
Each element is mapped to multiple positions (bits) in this array through hash functions. As a result, the storage requirements are proportional to the size of the bit array, which can be significantly smaller than storing the elements themselves.&lt;br&gt;
&lt;strong&gt;Elimination of Redundant Data&lt;/strong&gt;: Traditional data structures often store all the elements individually. In contrast, a Bloom filter doesn’t store the elements themselves. By using multiple hash functions, it efficiently encodes the presence or absence of elements in a highly compressed form. This eliminates the need to store redundant data, which can be especially advantageous when dealing with large datasets.&lt;br&gt;
&lt;strong&gt;Constant Size&lt;/strong&gt;: The space occupied by a Bloom filter is not directly related to the number of elements it contains. Instead, it depends on parameters like the desired false positive probability and the expected number of insertions. This means that regardless of the size of the dataset, the Bloom filter maintains a relatively constant size, making it suitable for memory-constrained applications.&lt;br&gt;
&lt;strong&gt;Probabilistic Nature&lt;/strong&gt;: One of the trade-offs made by Bloom filters is their probabilistic nature. They allow for a small probability of false positives, which means that in some cases, the filter might incorrectly suggest an element is in the set when it’s not. This trade-off enables Bloom filters to achieve their space efficiency, as they don’t need to maintain complete and precise information about the elements.&lt;br&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Bloom filters can scale effectively for large datasets without a significant increase in memory usage. The size of the bit array and the number of hash functions can be adjusted to balance memory consumption and false positive rates.&lt;br&gt;
&lt;strong&gt;Parallelism&lt;/strong&gt;: Due to the independence of hash functions and bit positions in the array, Bloom filters allow for efficient parallel processing. Multiple membership tests can be performed concurrently, making them suitable for multi-threaded or distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages of Bloom Filters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Space Efficiency&lt;/strong&gt;: Bloom filters use relatively small amounts of memory compared to other data structures like hash tables, making them ideal for applications with limited memory.&lt;br&gt;
&lt;strong&gt;Fast Membership Testing&lt;/strong&gt;: Checking membership in a Bloom filter is extremely fast. The number of hash functions and size of the bit array can be adjusted to balance space and accuracy.&lt;br&gt;
&lt;strong&gt;Parallelism&lt;/strong&gt;: Multiple membership tests can be performed in parallel since each query element’s hash positions are independent.&lt;br&gt;
&lt;strong&gt;No False Negatives&lt;/strong&gt;: Bloom filters never produce false negatives. If the filter says an element is not present, it’s definitely not in the set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitations of Bloom Filters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Positives&lt;/strong&gt;: The probabilistic nature of Bloom filters means there can be false positives. If all bits are set to 1 for a query, it suggests membership, but the element might not be in the set.&lt;br&gt;
&lt;strong&gt;No Deletion&lt;/strong&gt;: Bloom filters do not support element deletion. Removing an element is not straightforward as it could affect other elements.&lt;br&gt;
&lt;strong&gt;Hash Functions&lt;/strong&gt;: The quality of hash functions used is crucial. Poor hash functions can increase false positives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt;: Web browsers use Bloom filters to quickly determine if a website is in a local cache, reducing network requests.&lt;br&gt;
&lt;strong&gt;Spell Checkers&lt;/strong&gt;: Bloom filters can help identify whether a word exists in a dictionary, improving the speed of spell checkers.&lt;br&gt;
&lt;strong&gt;Network Routers&lt;/strong&gt;: Routers use Bloom filters to efficiently decide whether an IP address is in a blacklist.&lt;br&gt;
&lt;strong&gt;Duplicate Elimination&lt;/strong&gt;: In distributed systems, Bloom filters can be employed to eliminate duplicate data transmission.&lt;br&gt;
&lt;strong&gt;Indexing&lt;/strong&gt;: When a document is inserted or updated in the database, its key or ID is hashed using one or more hash functions. These hash values are then used to determine the positions in a Bloom filter. Bits at these positions in the Bloom filter are set to 1. For example, consider a NoSQL database that uses a Bloom filter in its index. When a new document with the key “doc123” is added to the index, the Bloom filter is updated based on the hash values of “doc123.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bloom filters are an ingenious data structure for efficient set membership testing, offering space-efficient solutions in various applications.&lt;/p&gt;

&lt;p&gt;Bloom filters save space in data storage by using a compact representation, eliminating redundant data, maintaining a constant size, minimising overhead, and leveraging their probabilistic nature, at the cost of limitations such as the possibility of false positives.&lt;/p&gt;

&lt;p&gt;Understanding their limitations and use cases is essential for harnessing their power effectively. When memory efficiency and fast querying are essential, Bloom filters are a valuable tool in a programmer’s toolbox.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Why should one minute be 60 seconds and not 100 seconds?</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Tue, 15 Jun 2021 04:59:11 +0000</pubDate>
      <link>https://dev.to/kiranukamath/why-should-one-minute-be-60-seconds-and-not-100-seconds-1knm</link>
      <guid>https://dev.to/kiranukamath/why-should-one-minute-be-60-seconds-and-not-100-seconds-1knm</guid>
      <description>&lt;p&gt;To understand why 60 seconds.&lt;br&gt;&lt;br&gt;
Let's see where all this began?&lt;/p&gt;

&lt;p&gt;Well, to start with, we have to know the history of the Babylonian period. The Babylonians derived their number system from the Sumerians, who were using it as early as 3500 BC. They divided the 24-hour day into two parts: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a day lasting 12 hours and a night lasting 12 hours, &lt;/li&gt;
&lt;li&gt;one hour is 60 minutes and one minute is 60 seconds. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But why??&lt;/p&gt;

&lt;p&gt;Because they used the duodecimal (base 12) and sexagesimal (base 60) systems, which led to 12 hours of day and 12 hours of night. &lt;/p&gt;

&lt;p&gt;Hipparchus gave us the “Equinoctial hours” by proposing the division of a day into 24 equal hours.  &lt;/p&gt;

&lt;p&gt;** &lt;em&gt;'12' is a very special number&lt;/em&gt; **&lt;/p&gt;

&lt;p&gt;The importance of the number 12 is typically attributed either to the fact that it equals the number of lunar cycles in a year or the number of finger joints on each hand (three in each of the four fingers, excluding the thumb), making it possible to count to 12 with the thumb.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yzimKb0d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1616648113924/h9qOIyCr_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yzimKb0d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1616648113924/h9qOIyCr_.png" alt="image.png" width="412" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The number of finger joints on each hand (excluding the thumb) makes it possible to count to 12 by using the thumb. So it can be said that the structure of our fingers may be the reason! &lt;/p&gt;

&lt;p&gt;Another reason behind using '12' is that ‘12’ can be written as ‘6 X 2’ or ‘3 X 4’, so a day can be divided into halves, thirds, and quarters easily, whereas 10 has only four divisors (1, 2, 5, and 10) - whole numbers that divide it a whole number of times. &lt;/p&gt;

&lt;p&gt;Sixty has 12 divisors and because 60 = 5 x 12 it combines the advantages of both 10 and 12. It is notably convenient for expressing fractions since 60 is the smallest number divisible by the first six counting numbers(1,2,3,4,5,6) as well as by 10, 12, 15, 20, and 30.&lt;/p&gt;
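&lt;p&gt;The divisibility claims above are easy to verify with a couple of lines of Python (a quick check, purely for illustration):&lt;/p&gt;

```python
def divisors(n):
    # All whole numbers that divide n a whole number of times
    return [d for d in range(1, n + 1) if n % d == 0]

print(divisors(10))  # [1, 2, 5, 10]
print(divisors(12))  # [1, 2, 3, 4, 6, 12]
print(divisors(60))  # [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60] -- 12 divisors
```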

&lt;p&gt;Nobody really cared about seconds until after the Middle Ages and all the division of time was mainly to communicate time to others.&lt;br&gt;&lt;br&gt;
If someone asks you what time is it, you can say “half a day completed” or “quarter day completed”. &lt;/p&gt;

&lt;p&gt;But imagine what it would be like if a day were 10 hours: it would be hard to tell the time, since 10/3 is 3.3333333….. &lt;br&gt;
 Obviously, it would be difficult to say.&lt;br&gt;
60 can be divided by 6, 5, 4, 3, and 2, so it is easy to communicate the time.&lt;/p&gt;

&lt;p&gt;Another reason might be with the calculation of degrees.&lt;br&gt;&lt;br&gt;
Longitude is measured by imaginary lines that run around the Earth vertically (up and down) and meet at the North and South Poles. These lines are known as meridians. Each meridian measures one arcdegree of longitude. The distance around the Earth measures 360 degrees. Each degree was divided into 60 parts, each of which was again subdivided into 60 smaller parts.&lt;br&gt;&lt;br&gt;
The first division, partes minutae primae, or first minute, became known simply as the "minute."&lt;br&gt;&lt;br&gt;
The second segmentation, partes minutae secundae, or "second minute," became known as the second. &lt;/p&gt;

&lt;p&gt;So now the division of time can be connected to the rotation of the earth.&lt;br&gt;&lt;br&gt;
The earth rotates 1 degree every 4 minutes, so in one hour (60 minutes) it turns 15 degrees.&lt;br&gt;
Since a full rotation is 360 degrees, it takes 24 hours to complete (24 × 15 = 360).&lt;/p&gt;

&lt;p&gt;Isn't this fun to know?&lt;br&gt;&lt;br&gt;
Hope you learned something new here and don't forget to comment below your thoughts. Thanks for reading!&lt;br&gt;
Keep learning, Keep Growing.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Measures of Similarity and Distance</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Tue, 15 Jun 2021 04:54:11 +0000</pubDate>
      <link>https://dev.to/kiranukamath/measures-of-similarity-and-distance-542g</link>
      <guid>https://dev.to/kiranukamath/measures-of-similarity-and-distance-542g</guid>
      <description>&lt;p&gt;In statistics, a similarity measure is a real-valued function that quantifies the similarity between two objects. The purpose of a measure of similarity is to compare two vectors and compute a single number which evaluates their similarity. The main objective is to determine to what extent two vectors co-vary.&lt;/p&gt;

&lt;p&gt;We will look at metrics that help in understanding similarity through measures of correlation.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pearson's Correlation Coefficient&lt;/li&gt;
&lt;li&gt;Spearman's Correlation Coefficient&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's take a look at each of these individually.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pearson's Correlation Coefficient
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pearson's correlation coefficient&lt;/strong&gt; is a measure related to the strength and direction of a &lt;strong&gt;linear&lt;/strong&gt; relationship.   The value for this coefficient will be between -1 and 1 where -1 indicates a strong, negative linear relationship and 1 indicates a strong, positive linear relationship. &lt;/p&gt;

&lt;p&gt;If we have two vectors x and y, we can compare them in the following way to calculate Pearson's correlation coefficient:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PolqfDCm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.upmath.me/svg/CORR%28%255Ctextbf%257Bx%257D%252C%2520%255Ctextbf%257By%257D%29%2520%253D%2520%255Cfrac%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x_i%2520-%2520%255Cbar%257Bx%257D%29%28y_i%2520-%2520%255Cbar%257By%257D%29%257D%257B%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x_i-%255Cbar%257Bx%257D%29%255E2%257D%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28y_i-%255Cbar%257By%257D%29%255E2%257D%257D%2520" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PolqfDCm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.upmath.me/svg/CORR%28%255Ctextbf%257Bx%257D%252C%2520%255Ctextbf%257By%257D%29%2520%253D%2520%255Cfrac%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x_i%2520-%2520%255Cbar%257Bx%257D%29%28y_i%2520-%2520%255Cbar%257By%257D%29%257D%257B%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x_i-%255Cbar%257Bx%257D%29%255E2%257D%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28y_i-%255Cbar%257By%257D%29%255E2%257D%257D%2520" alt="CORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}} " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;/p&gt;

&lt;p&gt;$$\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}x_i$$&lt;/p&gt;

&lt;p&gt;or it can also be written as&lt;/p&gt;

&lt;p&gt;$$CORR(x, y) = \frac{\text{COV}(x, y)}{\text{STDEV}(x)\text{ }\text{STDEV}(y)}$$&lt;/p&gt;

&lt;p&gt;where &lt;/p&gt;

&lt;p&gt;$$\text{STDEV}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$&lt;/p&gt;

&lt;p&gt;and &lt;/p&gt;

&lt;p&gt;$$\text{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$&lt;/p&gt;

&lt;p&gt;where n is the length of the vector, which must be the same for both x and y and &lt;br&gt;
$$\bar{x}$$ is the mean of the observations.  &lt;/p&gt;

&lt;p&gt;Function to get Pearson correlation coefficient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def pearson_corr(x, y):
    '''
    Parameters
    x - an array of matching length to array y
    y - an array of matching length to array x
    Return
    corr - the pearson correlation coefficient for comparing x and y
    '''
    mean_x, mean_y = np.sum(x)/len(x), np.sum(y)/len(y) 

    x_diffs = x - mean_x
    y_diffs = y - mean_y
    corr_numerator = np.sum(x_diffs*y_diffs)
    corr_denominator = np.sqrt(np.sum(x_diffs**2))*np.sqrt(np.sum(y_diffs**2))

    corr = corr_numerator/corr_denominator
    return corr                                                       
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Spearman's Correlation coefficient
&lt;/h3&gt;

&lt;p&gt;Spearman's correlation is a &lt;a href="https://en.wikipedia.org/wiki/Nonparametric_statistics"&gt;non-parametric&lt;/a&gt; measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function. &lt;/p&gt;

&lt;p&gt;The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not).&lt;/p&gt;

&lt;p&gt;You can quickly change from the raw data to the ranks using the &lt;strong&gt;.rank()&lt;/strong&gt; method, as shown in the code cell below.&lt;/p&gt;

&lt;p&gt;We first map each of our data vectors to ranked values:&lt;/p&gt;

&lt;p&gt;$$\textbf{x} \rightarrow \textbf{x}^{r}$$&lt;br&gt;
$$\textbf{y} \rightarrow \textbf{y}^{r}$$&lt;/p&gt;

&lt;p&gt;Here &lt;strong&gt;r&lt;/strong&gt; indicates these are ranked values (it is not raising any value to the power of r).  Then we compute Spearman's correlation coefficient as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--427_2cNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.upmath.me/svg/SCORR%28%255Ctextbf%257Bx%257D%252C%2520%255Ctextbf%257By%257D%29%2520%253D%2520%255Cfrac%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x%255E%257Br%257D_i%2520-%2520%255Cbar%257Bx%257D%255E%257Br%257D%29%28y%255E%257Br%257D_i%2520-%2520%255Cbar%257By%257D%255E%257Br%257D%29%257D%257B%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x%255E%257Br%257D_i-%255Cbar%257Bx%257D%255E%257Br%257D%29%255E2%257D%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28y%255E%257Br%257D_i-%255Cbar%257By%257D%255E%257Br%257D%29%255E2%257D%257D%2520" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--427_2cNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.upmath.me/svg/SCORR%28%255Ctextbf%257Bx%257D%252C%2520%255Ctextbf%257By%257D%29%2520%253D%2520%255Cfrac%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x%255E%257Br%257D_i%2520-%2520%255Cbar%257Bx%257D%255E%257Br%257D%29%28y%255E%257Br%257D_i%2520-%2520%255Cbar%257By%257D%255E%257Br%257D%29%257D%257B%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28x%255E%257Br%257D_i-%255Cbar%257Bx%257D%255E%257Br%257D%29%255E2%257D%255Csqrt%257B%255Csum%255Climits_%257Bi%253D1%257D%255E%257Bn%257D%28y%255E%257Br%257D_i-%255Cbar%257By%257D%255E%257Br%257D%29%255E2%257D%257D%2520" alt="SCORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x^{r}_i - \bar{x}^{r})(y^{r}_i - \bar{y}^{r})}{\sqrt{\sum\limits_{i=1}^{n}(x^{r}_i-\bar{x}^{r})^2}\sqrt{\sum\limits_{i=1}^{n}(y^{r}_i-\bar{y}^{r})^2}} " width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where &lt;/p&gt;

&lt;p&gt;$$\bar{x}^r = \frac{1}{n}\sum\limits_{i=1}^{n}x^r_i$$&lt;/p&gt;

&lt;p&gt;Function that takes in two vectors and returns the Spearman correlation coefficient.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def spearman_corr(x, y):
    '''
    Parameters
    x - an array of matching length to array y
    y - an array of matching length to array x
    Return
    corr - the spearman correlation coefficient for comparing x and y
    '''
    # Change each vector to ranked values
    # (x and y must be pandas Series so that .rank() is available)
    x = x.rank()
    y = y.rank()

    # Compute Mean Values
    mean_x, mean_y = np.sum(x)/len(x), np.sum(y)/len(y) 

    x_diffs = x - mean_x
    y_diffs = y - mean_y
    corr_numerator = np.sum(x_diffs*y_diffs)
    corr_denominator = np.sqrt(np.sum(x_diffs**2))*np.sqrt(np.sum(y_diffs**2))

    corr = corr_numerator/corr_denominator

    return corr  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
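&lt;p&gt;To see the difference between the two coefficients in action, here is a small self-contained sketch (the helper functions and sample data are made up for illustration): a relationship that is monotonic but non-linear gives a Pearson correlation below 1, while the Pearson correlation of the ranks, which is exactly Spearman's coefficient, is 1.&lt;/p&gt;

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation, written out from the formula above
    x_diffs, y_diffs = x - x.mean(), y - y.mean()
    return np.sum(x_diffs * y_diffs) / (
        np.sqrt(np.sum(x_diffs ** 2)) * np.sqrt(np.sum(y_diffs ** 2)))

def rank(v):
    # Simple ranking for data without ties: smallest value gets rank 1
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(1, len(v) + 1)
    return r

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # monotonic but non-linear in x

print(round(pearson(x, y), 2))              # 0.94 -- linearity is imperfect
print(round(pearson(rank(x), rank(y)), 2))  # 1.0  -- a perfect monotonic relationship
```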



&lt;h3&gt;
  
  
  Measures of distance
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Euclidean Distance&lt;/li&gt;
&lt;li&gt;Manhattan Distance&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Euclidean Distance
&lt;/h3&gt;

&lt;p&gt;Euclidean distance is a measure of the straight-line distance from one vector to another. In other words, euclidean distance is the square root of the sum of squared differences between corresponding elements of the two vectors.  Since this is a measure of distance, larger values indicate that two vectors are more different from one another. Euclidean distance is the basis of many measures of similarity and dissimilarity. &lt;/p&gt;

&lt;p&gt;Euclidean distance is only appropriate for data measured on the same scale. &lt;br&gt;
Consider two vectors &lt;strong&gt;x&lt;/strong&gt; and &lt;strong&gt;y&lt;/strong&gt;, we can compute Euclidean Distance as:&lt;/p&gt;

&lt;p&gt;$$ EUC(\textbf{x}, \textbf{y}) = \sqrt{\sum\limits_{i=1}^{n}(x_i - y_i)^2}$$&lt;/p&gt;

&lt;p&gt;Function to compute Euclidean distance (using NumPy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def eucl_dist(x, y):
    '''
    Parameters
    x - an array of matching length to array y
    y - an array of matching length to array x
    Return
    euc - the euclidean distance between x and y
    '''  
    return np.linalg.norm(x - y)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manhattan Distance
&lt;/h3&gt;

&lt;p&gt;Different from euclidean distance, Manhattan distance is a 'city block' distance from one vector to another.  Imagine it as the distance between two points when you cannot cut through buildings and must travel along the street grid.&lt;/p&gt;

&lt;p&gt;Specifically, this distance is computed as:&lt;/p&gt;

&lt;p&gt;$$ MANHATTAN(\textbf{x}, \textbf{y}) = \sum\limits_{i=1}^{n}|x_i - y_i|$$&lt;/p&gt;

&lt;p&gt;Function to calculate Manhattan distance&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def manhat_dist(x, y):
    '''
    INPUT
    x - an array of matching length to array y
    y - an array of matching length to array x
    OUTPUT
    manhat - the manhattan distance between x and y
    '''  
    return sum(abs(e - s) for s, e in zip(x, y))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
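&lt;p&gt;A quick worked example (with points chosen for illustration) shows how the two distances differ for the same pair of vectors:&lt;/p&gt;

```python
import numpy as np

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

euclidean = np.linalg.norm(x - y)   # straight-line distance
manhattan = np.sum(np.abs(x - y))   # "city block" distance

print(euclidean)  # 5.0
print(manhattan)  # 7.0
```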



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QgdGoAMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1615212596981/zQz2mA6-G.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QgdGoAMP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1615212596981/zQz2mA6-G.png" alt="image.png" width="200" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above image, the &lt;strong&gt;blue&lt;/strong&gt; line gives the &lt;strong&gt;Manhattan&lt;/strong&gt; distance, while the &lt;strong&gt;green&lt;/strong&gt; line gives the &lt;strong&gt;Euclidean&lt;/strong&gt; distance between two points.&lt;/p&gt;

&lt;p&gt;When measuring similarity by distance, no scaling is performed in the denominator (unlike the correlation coefficients above).  Therefore, you need to make sure all of your data are on the &lt;strong&gt;same scale&lt;/strong&gt; when using these metrics.&lt;/p&gt;

&lt;p&gt;Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale.&lt;br&gt;&lt;br&gt;
It becomes a problem if some measures are on a 5 point scale, while others are on a 100 point scale, and we are most likely to have non-optimal results due to the difference in variability of features.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Is Confusion Matrix really Confusing?</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Mon, 31 May 2021 05:05:59 +0000</pubDate>
      <link>https://dev.to/kiranukamath/is-confusion-matrix-really-confusing-45dc</link>
      <guid>https://dev.to/kiranukamath/is-confusion-matrix-really-confusing-45dc</guid>
      <description>&lt;p&gt;After reading this blog, I am sure you will not be confused with the confusion matrix. &lt;/p&gt;

&lt;p&gt;Let's get started. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;confusion matrix&lt;/strong&gt; is the table that is used to describe the performance of the model.&lt;/p&gt;

&lt;p&gt;We can use accuracy as a metric to analyze the performance of the model, so why a confusion matrix?&lt;br&gt;&lt;br&gt;
What is the need for a confusion matrix?&lt;/p&gt;

&lt;p&gt;So, to understand this, let us consider an &lt;strong&gt;example&lt;/strong&gt; of a cancer prediction model.&lt;br&gt;&lt;br&gt;
Since this is a binary classification model, its job is to detect cancerous patients based on some features. Considering that only a few people out of a population of millions get cancer, assume that only 1% of the data provided is cancer-positive.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Having cancer is labeled as 1 and not having cancer is labeled as 0.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
An interesting thing to note here is that if a system predicts all 0’s, even then the prediction accuracy will be 99%. It is similar to just writing print(0) as the model output, which would also have an accuracy of 99%.&lt;br&gt;&lt;br&gt;
But this is not correct, right?&lt;/p&gt;

&lt;p&gt;Now that you know the problem and the need for a new metric in this situation, let us see how the confusion matrix solves it.&lt;/p&gt;

&lt;p&gt;Let us consider an example with a classification dataset having 1000 data points. &lt;/p&gt;

&lt;p&gt;We get the below confusion matrix:&lt;br&gt;&lt;br&gt;
There will be two classes, 1 and 0.&lt;br&gt;&lt;br&gt;
1 would mean the person has cancer, and 0 would mean they don't have cancer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u1-CMOem--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622345479330/UIbruZuI9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u1-CMOem--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622345479330/UIbruZuI9.jpeg" alt="IMG_20210530_085941.jpg" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at this table, we have 4 different combinations of predicted and actual values.&lt;br&gt;
The predicted values are labeled Positive or Negative, and the prefix True or False indicates whether the prediction matches the actual value.&lt;/p&gt;

&lt;p&gt;Just hold on... this is easy and you will understand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;True Positive&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Interpretation: Model predicted positive and it’s true.&lt;br&gt;&lt;br&gt;
Example understanding: The model predicted that a person has cancer and a person actually has it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;True Negative&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Interpretation: Model predicted negative and it’s true.&lt;br&gt;&lt;br&gt;
Example understanding: The model predicted that a person does not have cancer and he actually doesn't have cancer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Positive&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Interpretation: Model predicted positive and it’s false.&lt;br&gt;&lt;br&gt;
Example understanding: The model predicted that a person has cancer but he actually doesn't have cancer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Negative&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Interpretation: Model predicted negative and it’s false.&lt;br&gt;&lt;br&gt;
Example understanding: The model predicted that a person does not have cancer, but the person actually has cancer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Out of all the instances the model predicted as positive, how many are actually positive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tt4jElgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622334721927/DhF2NBE0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tt4jElgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622334721927/DhF2NBE0g.png" alt="image.png" width="249" height="84"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recall&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
Out of all the actually positive instances, how many did we predict correctly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HDHEqO5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622334857706/WQxDBv2UZ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HDHEqO5m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622334857706/WQxDBv2UZ.png" alt="image.png" width="246" height="79"&gt;&lt;/a&gt;&lt;/p&gt;
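&lt;p&gt;To see how these metrics behave on an imbalanced dataset like the cancer example, here is a small sketch with made-up counts (illustrative only, not the numbers from the table above): accuracy stays high even though precision is poor.&lt;/p&gt;

```python
# Hypothetical counts for a 1000-point dataset where 1% (10 points) are
# actually positive -- illustrative numbers, not the ones from the table above.
TP, FP = 8, 10   # predicted positive: 8 correctly, 10 incorrectly
FN, TN = 2, 980  # predicted negative: 2 incorrectly, 980 correctly

precision = TP / (TP + FP)                  # of all predicted positives, how many are right
recall = TP / (TP + FN)                     # of all actual positives, how many we caught
accuracy = (TP + TN) / (TP + FP + FN + TN)  # can look great despite poor precision

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
# accuracy=0.988, precision=0.444, recall=0.800
```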

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DHapH6Mp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622334167725/PfpXzVcsM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DHapH6Mp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622334167725/PfpXzVcsM.png" alt="image.png" width="350" height="636"&gt;&lt;/a&gt;&lt;br&gt;
Image credit: Wikipedia&lt;/p&gt;

&lt;p&gt;Calculating precision and recall for the above table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VBBaqFkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622342128850/VQYydB3C5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VBBaqFkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622342128850/VQYydB3C5.png" alt="image.png" width="245" height="128"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's compare this with accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YXPzHrB7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622342164674/itD37EtZi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YXPzHrB7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622342164674/itD37EtZi.png" alt="image.png" width="291" height="57"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q3Bj5m9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622342176939/GyhuZxGqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q3Bj5m9G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622342176939/GyhuZxGqu.png" alt="image.png" width="313" height="71"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model got an accuracy of 96%, but a precision of 0.5 and a recall of 0.75,&lt;br&gt;&lt;br&gt;
which means that only 50% of the cases predicted as cancer actually turned out to be cancer, whereas 75% of the actual cancer cases were successfully identified by our model.&lt;/p&gt;
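&lt;p&gt;These numbers can be sketched in a few lines of Python. The exact confusion-matrix counts below (TP = 30, FP = 30, FN = 10, TN = 930) are assumed, chosen to be consistent with the 96% accuracy, 0.5 precision, and 0.75 recall above.&lt;/p&gt;

```python
# Assumed confusion-matrix counts, consistent with the example above.
TP, FP, FN, TN = 30, 30, 10, 930

precision = TP / (TP + FP)                   # of predicted positives, how many are right
recall = TP / (TP + FN)                      # of actual positives, how many were found
accuracy = (TP + TN) / (TP + FP + FN + TN)   # of all predictions, how many are right

print(precision)  # 0.5
print(recall)     # 0.75
print(accuracy)   # 0.96
```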

&lt;p&gt;Consider an example where the prediction is replaced by &lt;code&gt;print(0)&lt;/code&gt;, so that the model outputs 0 every time. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Actual y=0&lt;/th&gt;
&lt;th&gt;Actual y=1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Predicted y = 0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;914&lt;/td&gt;
&lt;td&gt;86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Predicted y = 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here the accuracy will be 91.4%, but what happens to precision and recall?&lt;br&gt;&lt;br&gt;
Precision becomes 0, since TP is 0.&lt;br&gt;&lt;br&gt;
Recall becomes 0, since TP is 0.&lt;br&gt;&lt;br&gt;
This is a classic example for understanding precision and recall.&lt;/p&gt;

&lt;p&gt;So now you understand why accuracy is not so useful for an imbalanced dataset, and how precision and recall play a key role.&lt;/p&gt;
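&lt;p&gt;The always-predict-0 "model" can be sketched directly; the counts (914 actual negatives, 86 actual positives) are taken from the table above.&lt;/p&gt;

```python
# A degenerate "model" that always predicts 0, on an imbalanced dataset.
y_true = [0] * 914 + [1] * 86   # 914 actual negatives, 86 actual positives
y_pred = [0] * 1000             # the model outputs 0 every time

TP = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
TN = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
FP = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
FN = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (TP + TN) / len(y_true)               # 0.914 -- looks impressive
precision = TP / (TP + FP) if TP + FP else 0.0   # 0.0 -- TP is 0
recall = TP / (TP + FN) if TP + FN else 0.0      # 0.0 -- TP is 0
```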

&lt;p&gt;One important thing to understand is when to use precision and when to use recall.&lt;/p&gt;

&lt;p&gt;Precision is a useful metric when false positives are of greater concern than false negatives.&lt;br&gt;&lt;br&gt;
For example, in recommendation systems like YouTube and Google, this is an important metric, since wrong recommendations may cause users to leave the platform.&lt;/p&gt;

&lt;p&gt;Recall is a useful metric when false negatives are of greater concern than false positives. For example, in the medical field, flagging a healthy patient as positive can be tolerated to an extent, but a patient who actually has the disease should never be missed.&lt;/p&gt;

&lt;p&gt;So what do we do when we are not sure whether to use precision or recall?&lt;br&gt;&lt;br&gt;
Or, when only one of the two is high, is the model still good?&lt;/p&gt;

&lt;p&gt;To answer this, let us look at the F1 score. &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;F1-score&lt;/strong&gt; is the harmonic mean of precision and recall; it is high only when precision and recall are both high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why the harmonic mean instead of the normal arithmetic mean?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Because the arithmetic mean can give a high value when just one of the two is high, while the harmonic mean is high only when both values are high.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cispXzj_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622343891175/QMs195OjA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cispXzj_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn.hashnode.com/res/hashnode/image/upload/v1622343891175/QMs195OjA.png" alt="image.png" width="575" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So from our example, the F1 score becomes&lt;br&gt;&lt;br&gt;
F1 = 2TP / (2TP + FP + FN) = 2*30 / (2*30 + 30 + 10) = 0.6&lt;/p&gt;
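&lt;p&gt;Here is a small sketch of the F1 computation, and of why the harmonic mean is preferred: with a lopsided model (the precision/recall pair 1.0/0.02 below is an invented illustration), the arithmetic mean still looks respectable while the harmonic mean collapses.&lt;/p&gt;

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# From the example above: precision 0.5, recall 0.75 -> F1 = 0.6
print(f1_score(0.5, 0.75))   # 0.6, matching 2*30 / (2*30 + 30 + 10)

# A lopsided model: near-perfect precision, terrible recall.
p, r = 1.0, 0.02
print((p + r) / 2)           # 0.51 -- arithmetic mean hides the problem
print(f1_score(p, r))        # ~0.039 -- harmonic mean exposes it
```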

&lt;p&gt;I believe that after reading this, the confusion matrix is not so confusing anymore! &lt;/p&gt;

&lt;p&gt;Hope you learned something new here, and don't forget to comment your thoughts below. Thanks for reading!&lt;br&gt;&lt;br&gt;
Keep learning, Keep Growing.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Between Stimulus and response - We have freedom of Choice</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Mon, 03 May 2021 15:58:54 +0000</pubDate>
      <link>https://dev.to/kiranukamath/between-stimulus-and-response-we-have-freedom-of-choice-1450</link>
      <guid>https://dev.to/kiranukamath/between-stimulus-and-response-we-have-freedom-of-choice-1450</guid>
<description>&lt;p&gt;A stimulus is an event that happens to us, and a response is our reaction or action towards that event.&lt;br&gt;&lt;br&gt;
We respond in a particular way to a particular stimulus.  &lt;/p&gt;

&lt;p&gt;But between stimulus and response, we have the freedom to choose how we respond.&lt;br&gt;&lt;br&gt;
And based on that freedom of choice, we become either reactive or proactive.  &lt;/p&gt;

&lt;p&gt;Reactive people are affected by their physical, social, or psychological environment, and their response to a stimulus is driven by their surroundings. Their response keeps changing as the environment changes.&lt;br&gt;&lt;br&gt;
Proactive people are also influenced by external stimuli, but their response to a stimulus is a value-based choice.&lt;/p&gt;

&lt;p&gt;How do we shift our attitude from reactive to proactive?&lt;/p&gt;

&lt;p&gt;For this paradigm shift from reactive to proactive to happen, we should start by taking initiative.&lt;br&gt;&lt;br&gt;
We should make a conscious effort to change. Many people wait for something to happen or for someone to help them, but in reality, we should help ourselves, and instead of being a problem ourselves, we should start finding solutions to problems.&lt;br&gt;&lt;br&gt;
And that comes with taking responsibility to act. Just planning is not enough; taking that first step is much more important.&lt;/p&gt;

&lt;p&gt;When facing any critical situation, we can ask these 3 questions to find a solution to the problem:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's happening to me? What is the stimulus?
&lt;/li&gt;
&lt;li&gt;What's going to happen in the future based on the decision I make? &lt;/li&gt;
&lt;li&gt;What is my response? What can I do? What initiative can I take in this situation?
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Our initiative in responding to a stimulus determines our degree of proactivity.   &lt;/p&gt;

&lt;p&gt;Many people say &lt;code&gt;be a positive person&lt;/code&gt; and &lt;code&gt;have positive thinking in life&lt;/code&gt;. But that alone is not enough.&lt;br&gt;&lt;br&gt;
Thinking positively without understanding reality can be dangerous. And that is exactly the difference between positive thinking and proactivity.   &lt;/p&gt;

&lt;p&gt;The stimulus may be good or bad, but the response towards that stimulus is our own choice, which may make the situation better or worse, and mostly depends on the choice we make.   &lt;/p&gt;

&lt;p&gt;Just thinking positive is not the solution; we should face reality. We should consider the current and future impact, and still find the power to choose a positive response to that stimulus.   &lt;/p&gt;

&lt;p&gt;Using the correct language also has an impact on our proactive behavior. For example, instead of "I have to", start using "I choose to".  &lt;/p&gt;

&lt;p&gt;To become proactive, we should start focusing more of our time and energy on the Circle of Influence. These are the things within our Circle of Concern that we can actually do something about. This also determines our degree of proactivity.   &lt;/p&gt;

&lt;p&gt;Try to work on the Circle of Influence by making small commitments and working on them. Don't stay in a blaming, accusing mode; rather, work on the things that you have control over. Work on yourself.  &lt;/p&gt;

&lt;p&gt;Concentrate on the Freedom of Choice you choose for responding to stimuli.&lt;/p&gt;

&lt;p&gt;You can read my other Blogs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.learnwithdata.me/why-do-i-write-blogs"&gt;Why do I write Blogs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Series of blogs based on my &lt;a href="https://blog.learnwithdata.me/series/learnings-from-ikigai-the-japanese-secret-to-a-long-happy-life"&gt;Learnings from IKIGAI : the Japanese secret to a long happy life&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.learnwithdata.me/invention-vs-innovation"&gt;Invention vs Innovation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.learnwithdata.me/www-of-new-years-resolution"&gt;www of new years resolution&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hope you learned something new here, and don't forget to comment your thoughts below.&lt;br&gt;
Thanks for reading!&lt;br&gt;&lt;br&gt;
Keep learning, Keep Growing.  &lt;/p&gt;

</description>
      <category>productivity</category>
    </item>
    <item>
      <title>Why do I write blogs?</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Fri, 16 Apr 2021 13:26:35 +0000</pubDate>
      <link>https://dev.to/kiranukamath/why-do-i-write-blogs-bj0</link>
      <guid>https://dev.to/kiranukamath/why-do-i-write-blogs-bj0</guid>
      <description>&lt;h3&gt;
  
  
  Documenting my learning and creating beginner-friendly content.
&lt;/h3&gt;

&lt;p&gt;The main reason for me to blog is to write about the things I learned and the difficulties I faced while learning something new, so that a beginner starting on the same journey might quickly get through the beginner material and move on to the intermediate level.&lt;/p&gt;

&lt;p&gt;We usually learn complicated concepts with hard work and dedication, but after we work with them for a few days, those concepts become easy for us. At that point, if a beginner asks us about such a concept, we explain it with a lot of jargon and might not be able to explain it in a beginner-friendly way.&lt;/p&gt;

&lt;p&gt;So I feel the solution for this is to write a blog about what I learn, so that it might help beginners and even serve as a reference for me in the future. So basically, beginner-friendly documentation. &lt;/p&gt;

&lt;h3&gt;
  
  
  Learning by teaching
&lt;/h3&gt;

&lt;p&gt;I believe I learn more when I am able to teach a concept to someone in simple words. Teaching others also improves communication skills.&lt;br&gt;&lt;br&gt;
If I am able to explain a concept to a person without much jargon and in a simple way, it means that I have learned that concept correctly. So the second reason is to teach what I have learned, through which I learn in more depth. One of the best ways to test whether I understood something is to explain it to someone else.&lt;/p&gt;

&lt;h3&gt;
  
  
  Habit &amp;gt; motivation
&lt;/h3&gt;

&lt;p&gt;I want to develop the habit of learning a new concept in machine learning every week, and when I learn something, I will write a blog on it.&lt;br&gt;&lt;br&gt;
So indirectly, the excitement of publishing a blog every week, and maintaining that consistency, will keep me motivated to learn a new ML concept every week.&lt;br&gt;&lt;br&gt;
Making a habit of writing blogs consistently is difficult, but once it becomes a habit, it is much better than striving for motivation. Building a system of habits is much better than relying on motivation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expand my network and learn from Feedback
&lt;/h3&gt;

&lt;p&gt;I want to expand my network, connect with like-minded people, and learn from them. As I expand my network, I will get more feedback on the blogs I write.&lt;br&gt;&lt;br&gt;
Feedback plays a vital role for a writer and helps them improve. Feedback and suggestions always improve writing, and may sometimes be eye-opening.&lt;br&gt;&lt;br&gt;
Positive feedback motivates me to write more, whereas negative feedback helps me grow and get better. &lt;/p&gt;

&lt;h2&gt;
  
  
  Confidence
&lt;/h2&gt;

&lt;p&gt;Blogging increased my confidence. I don't know why, but I am now much more confident posting a blog than I was posting my first one. I still remember not publishing my first blog or posting it on social media for 7 days after writing it, repeatedly checking the entire content to see if everything was correct. &lt;br&gt;
Writing blogs has increased my confidence, both in the depth of my understanding of a concept and in expressing my voice on social media.&lt;/p&gt;

&lt;p&gt;Coming to monetization,&lt;br&gt;&lt;br&gt;
&lt;strong&gt;I don't want to monetize my blog. Actually, I don't feel like reading a blog that has ads all around it, so I don't want to give that trouble to my readers.&lt;/strong&gt;&lt;br&gt;
Maybe in the future, if there is a way to monetize the blog without disturbing readers, I may surely try that option.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Future plan in blogging
&lt;/h2&gt;

&lt;p&gt;I want to cover as many topics as I can in data science that are necessary for beginners starting their journey in data science. I also want to write about Python and how to use it effectively for competitive programming. &lt;/p&gt;

&lt;p&gt;P.S.: I have learned a lot about deep learning from fastai, and was inspired by Rachel Thomas's blog &lt;a href="https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045"&gt;Why you should blog?&lt;/a&gt;&lt;br&gt;&lt;br&gt;
If you have not read it, please do read it. &lt;/p&gt;

&lt;p&gt;Thanks for your time and reading!&lt;/p&gt;

&lt;p&gt;See you until the next article! &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploring Couchbase Analytics Service</title>
      <dc:creator>Kiran U Kamath</dc:creator>
      <pubDate>Fri, 16 Apr 2021 13:25:24 +0000</pubDate>
      <link>https://dev.to/kiranukamath/exploring-couchbase-analytics-service-4k13</link>
      <guid>https://dev.to/kiranukamath/exploring-couchbase-analytics-service-4k13</guid>
<description>&lt;p&gt;Recently I have been exploring the Couchbase Analytics Service and finding out how it can be more useful and efficient compared to the Couchbase Query Service.&lt;/p&gt;

&lt;p&gt;The Couchbase Analytics Service comes with the Enterprise Edition.&lt;/p&gt;

&lt;p&gt;Couchbase Analytics is a parallel data management capability for Couchbase Server which is designed to efficiently run large ad hoc join, set, aggregation, and grouping operations over many records.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why analytics?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every business does these three things in a cycle or a spiral [The Goal]. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the business process to deliver products or services to the customers. &lt;/li&gt;
&lt;li&gt;Analyze the business to determine what to change and what to change to. &lt;/li&gt;
&lt;li&gt;Make the change happen. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Query Service is used by the applications needed to run the business; it is designed for a large number of concurrent queries, each doing a small amount of work. In the RDBMS world, this workload is called the OLTP (online transaction processing) workload. &lt;/p&gt;

&lt;p&gt;Applications or tools used for analysis have different workload characteristics. These typically use the Analytics Service; it is designed for a smaller number of concurrent queries, each analyzing a larger number of documents. In the RDBMS world, this workload is called the OLAP (online analytical processing) workload. &lt;/p&gt;

&lt;p&gt;Advantages of the Couchbase Analytics approach: why Couchbase Analytics? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common data model&lt;/strong&gt;: we don’t have to force our data into a flat, predefined, relational model to analyze it.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Workload isolation&lt;/strong&gt;: operational query latency and throughput are protected from slow-downs due to the analytical query workload.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;High data freshness&lt;/strong&gt;: Analytics runs on data that’s extremely current, without ETL or delays.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Continuous data sync&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Easy-to-manage SQL++ interface&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
A single system for operations and analytics reduces infrastructure complexity, application development complexity, and cost. &lt;/p&gt;

&lt;p&gt;Due to the large scale and duration of the operations it is likely to perform, the Analytics Service should run alone on its own cluster node, with no other Couchbase service running on that node. &lt;/p&gt;

&lt;p&gt;The minimum memory (RAM) quota required for the Analytics Service is 1024 MB. &lt;/p&gt;

&lt;p&gt;Analytics queries never touch the Couchbase data nodes; instead, they run on real-time shadow copies of the data in parallel. Because of this, there is no worry about slowing down the Couchbase Server nodes with complex queries.&lt;/p&gt;
&lt;h3&gt;
  
  
  Dataverse
&lt;/h3&gt;

&lt;p&gt;The top-level organizing concept in the Analytics data world is the dataverse.  &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;dataverse&lt;/strong&gt; is a namespace that gives you a place to create and manage datasets and other artifacts for a given Analytics application.  &lt;/p&gt;

&lt;p&gt;In that respect, a dataverse is similar to a database or a schema (a schema is the skeleton structure that represents the logical view of the entire database; it defines how the data is organized and how the relations among the data are associated) in a relational DBMS. &lt;/p&gt;

&lt;p&gt;Datasets are containers that hold collections of JSON objects. They are similar to tables in an RDBMS or keyspaces in N1QL. A dataset is linked to a Couchbase bucket so that the dataset can ingest data from Couchbase Server.&lt;/p&gt;

&lt;p&gt;A fresh Analytics instance starts out with two dataverses: one called Metadata (the system catalog) and one called Default (available for holding data). &lt;/p&gt;

&lt;p&gt;The first task is to tell Analytics about the Couchbase Server data that you want to shadow and the datasets where you want it to live.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DATASET datasetName ON `bucketName`; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the bucket has data of different types or categories, we can create a separate dataset for each of them by adding a &lt;code&gt;WHERE&lt;/code&gt; clause, for example &lt;code&gt;CREATE DATASET beers ON `bucketName` WHERE type = "beer";&lt;/code&gt; (the dataset and field names here are only illustrative). Once the datasets are defined, connect them to the Data Service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONNECT LINK Local; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;Local&lt;/code&gt; refers to the local server (the Data Service in the same cluster). &lt;/p&gt;

&lt;p&gt;Now perform a basic query from the Analytics Service to check that everything is working fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT meta(k) AS meta, k AS data 
FROM datasetName k 
LIMIT 1; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As in SQL, the query’s FROM clause binds the variable &lt;code&gt;k&lt;/code&gt; incrementally to the data instances residing in the dataset named &lt;code&gt;datasetName&lt;/code&gt;. The SELECT clause returns all of the meta-information plus the data value for each binding.&lt;/p&gt;

&lt;p&gt;Once this is set up you can access this service from Couchbase java SDK too.&lt;/p&gt;

&lt;p&gt;Follow &lt;a href="https://kirankamath.hashnode.dev/establishing-connection-to-couchbase-server-from-aws-ec2-instance"&gt;Establishing connection to Couchbase server using Couchbase Java SDK&lt;/a&gt; to set up Java SDK.&lt;/p&gt;

&lt;p&gt;After following that blog, edit the App.java code as follows to use Couchbase Analytics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import com.couchbase.client.java.*;
import com.couchbase.client.java.kv.*;
import com.couchbase.client.java.json.*;
import com.couchbase.client.java.query.*;
import com.couchbase.client.core.error.CouchbaseException;
import com.couchbase.client.java.analytics.*;


public class App {
    public static void main(String[] args) {
        try {
            Cluster cluster = Cluster.connect("localhost", "username", "password");
              final AnalyticsResult result = cluster
                .analyticsQuery("SELECT * FROM datasetName LIMIT 2;");


              for (JsonObject row : result.rowsAsObject()) {
                    System.out.println("Found row: " + row);
                    System.out.println();

              }

            } catch (CouchbaseException ex) {
              ex.printStackTrace();
            }
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Credits:  &lt;a href="https://docs.couchbase.com/home/index.html"&gt;Couchbase Documentation&lt;/a&gt; &lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
