DEV Community

Geoffrey Kim
Geoffrey Kim

Posted on

Exploring C: Grouping Anagrams Using Hash Tables

Introduction

In this blog post, we are going to deep dive into a piece of C code that groups anagrams together using hash tables. Hash tables, or hash maps, are a type of data structure that implements an associative array abstract data type, a structure that can map keys to values. Hash tables use a hash function to compute an index into an array of buckets or slots from which the desired value can be found. This makes accessing values very efficient, even for large amounts of data.

Anagrams are words or phrases formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once. The goal of the code we will be examining is to take an array of words and return a two-dimensional array, where each sub-array contains all the words that are anagrams of each other.

This code also makes use of the qsort function. This is a built-in C library function that allows us to sort arrays of any built-in data type or objects constructed from these types. qsort requires four arguments: the array to be sorted, the number of elements in the array, the size of each element, and a comparison function that dictates the order in which elements should be sorted. In our code, qsort is used to sort the characters in a string to form a key for anagrams.

Now that we have an understanding of the main components, let's delve into the code.

Code Explanation

Libraries and Definitions

The code begins with the inclusion of standard C libraries, stdio.h, stdlib.h, and string.h that provide the basic input/output, memory management, and string operations respectively.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
Enter fullscreen mode Exit fullscreen mode

Next, we have a #define directive for HASH_SIZE, setting it to 50000. This size is chosen as a relatively large number to minimize the possibility of hash collisions, which occur when different keys produce the same hash.

#define HASH_SIZE 50000
Enter fullscreen mode Exit fullscreen mode

Structures and Functions

A structure Entry is defined. This structure will serve as our hash table entry. It contains a dynamic array of strings strs, a string key that will be used as the hash key, an integer strsSize to keep track of the current size of strs, and a pointer next that will point to the next Entry in case of a hash collision (making it a linked list).

typedef struct Entry {
    char **strs;
    char *key;
    int strsSize;
    struct Entry *next;
} Entry;
Enter fullscreen mode Exit fullscreen mode

The function createEntry is used to allocate memory for a new Entry, initializing it with a provided key.

Entry* createEntry(char *key) {
    // Allocate memory and initialize
}
Enter fullscreen mode Exit fullscreen mode

The function hash calculates a hash for a given string. It uses a common hash function design that is effective for string hashing, which involves multiplying the current hash by a prime number (in this case, 31), and adding the ASCII value of the current character. This is done for each character in the string. Afterward, the resulting hash is taken modulo HASH_SIZE to ensure it falls within the bounds of the hash table. It's worth noting that good hash functions should distribute keys uniformly across the hash table to minimize collisions.

int hash(char *str) {
    // The function computes a hash value for the input string by iterating over each character
    // It multiplies the current hash value by 31 (a prime number), adds the ASCII value of the current character
    // Finally, it uses modulo operator to ensure the hash value falls within the hash table boundaries
}
Enter fullscreen mode Exit fullscreen mode

cmp is a comparison function used by the qsort function to sort the characters of a string. The qsort function is part of the C standard library and requires a comparison function to determine the order of elements. Here, cmp simply subtracts the ASCII value of one character from another. This returns a negative value if the first character is smaller (should come first), a positive value if it's larger (should come after), and zero if they're equal.

int cmp(const void *a, const void *b) {
    // This comparison function subtracts the ASCII value of the character pointed by 'b' from 'a'
    // The result determines the order in which 'qsort' will arrange the characters
    // If the result is negative, 'a' should come before 'b'
    // If the result is positive, 'a' should come after 'b'
    // If the result is zero, both 'a' and 'b' are equal and their order doesn't matter
}
Enter fullscreen mode Exit fullscreen mode

The Main Function

The groupAnagrams function is the crux of this program. It's responsible for grouping all anagrams together.

Firstly, a hash table is created and initialized. The hash table is an array of pointers to Entry structure.

Entry **hashTable = (Entry **) malloc(HASH_SIZE * sizeof(Entry *));
memset(hashTable, 0, HASH_SIZE * sizeof(Entry *));
Enter fullscreen mode Exit fullscreen mode

For every string in the given list, the function creates a sorted version of it (this is used as a key for anagrams), calculates its hash, and creates a new Entry in the hash table at the corresponding index if one does not exist already. If an entry already exists, it simply adds the string to the list of strings in the found Entry.

for (int i = 0; i < strsSize; i++) {
    // For each string in the input array:
    // 1. Create a copy of the string and sort its characters (this sorted string is used as a key for anagrams)
    // 2. Compute a hash value for the sorted string
    // 3. Search the hash table for an entry with the same key. If it doesn't exist, create a new entry
    // 4. Add the original (unsorted) string to the list of strings in the found or created entry
}
Enter fullscreen mode Exit fullscreen mode

Finally, it creates a two-dimensional array and populates it with all anagrams grouped together. It also populates returnSize and returnColumnSizes with the number of groups and the sizes of each group respectively.

char ***result = (char ***) malloc(strsSize * sizeof(char **));
*returnColumnSizes = (int *) malloc(strsSize * sizeof(int));
*returnSize = 0;
// Here, we're preparing for the final output:
// 1. We allocate memory for the result, which will be a two-dimensional array.
// 2. We also allocate memory for 'returnColumnSizes'. This array will hold the count of strings in each group of anagrams.
// 3. We initialize 'returnSize' to zero. This variable will be updated to reflect the number of groups of anagrams.
Enter fullscreen mode Exit fullscreen mode

After filling the result, the function frees the hash table and returns the result.

free(hashTable);
return result;
Enter fullscreen mode Exit fullscreen mode

Complete Code

Below is the full C code that we've discussed throughout this blog post. It combines all the components we've explained, including data structures, the hash function, sorting, and the main function. It illustrates how these components come together to solve the problem of grouping anagrams:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 50000

typedef struct Entry {
    char **strs;
    char *key;
    int strsSize;
    struct Entry *next;
} Entry;

Entry* createEntry(char *key) {
    Entry *entry = (Entry *) malloc(sizeof(Entry));
    entry->strs = (char **) malloc(10000 * sizeof(char *));
    entry->key = key;
    entry->strsSize = 0;
    entry->next = NULL;
    return entry;
}

int hash(char *str) {
    int hash = 0;
    while (*str) {
        hash = (hash * 31 + *(str++)) % HASH_SIZE;
    }
    return hash;
}

int cmp(const void *a, const void *b) {
    return *(char *)a - *(char *)b;
}

char *** groupAnagrams(char ** strs, int strsSize, int* returnSize, int** returnColumnSizes) {
    Entry **hashTable = (Entry **) malloc(HASH_SIZE * sizeof(Entry *));
    memset(hashTable, 0, HASH_SIZE * sizeof(Entry *));

    for (int i = 0; i < strsSize; i++) {
        char *str = strs[i];
        char *key = (char *) malloc(101 * sizeof(char));
        strcpy(key, str);
        int len = strlen(key);
        qsort(key, len, sizeof(char), cmp);
        int index = hash(key);
        Entry *entry = hashTable[index];
        Entry *prev = NULL;
        while (entry && strcmp(entry->key, key)) {
            prev = entry;
            entry = entry->next;
        }
        if (entry == NULL) {
            entry = createEntry(key);
            entry->strs[entry->strsSize++] = str;
            if (prev == NULL) {
                hashTable[index] = entry;
            } else {
                prev->next = entry;
            }
        } else {
            free(key);
            entry->strs[entry->strsSize++] = str;
        }
    }

    char ***result = (char ***) malloc(strsSize * sizeof(char **));
    *returnColumnSizes = (int *) malloc(strsSize * sizeof(int));
    *returnSize = 0;

    for (int i = 0; i < HASH_SIZE; i++) {
        Entry *entry = hashTable[i];
        while (entry) {
            result[*returnSize] = entry->strs;
            (*returnColumnSizes)[*returnSize] = entry->strsSize;
            (*returnSize)++;
            entry = entry->next;
        }
    }
    free(hashTable);

    return result;
}
Enter fullscreen mode Exit fullscreen mode

This comprehensive view of the code should help you understand how all the individual parts we've explained fit together to solve the problem at hand.

Summary and Conclusion

Throughout this blog post, we've walked through a piece of C code that groups anagrams together using a combination of hashing, sorting, and linked lists. Let's summarize how it all comes together:

  1. Data Structures: We started by defining a custom data structure, Entry, which includes an array of strings (anagrams), a key string, the size of the array, and a pointer to the next Entry. This struct is used in a hash table, with each entry in the table serving as the head of a linked list of entries that have collided (i.e., share the same hash value).

  2. Hash Function: Our hash function takes a string as input and returns an index in the hash table. The goal of the hash function is to distribute the keys evenly across the hash table to minimize collisions.

  3. Sorting: The qsort function sorts the characters in each string, which allows us to create a "key" for each group of anagrams.

  4. Main Function: The groupAnagrams function takes an array of strings and a few other parameters related to the size of the input and output arrays. It creates a hash table, fills it with entries, and then constructs the final output array of arrays.

In the end, the function returns a two-dimensional array where each sub-array contains a group of anagrams. Additionally, it populates two output variables with the number of groups (returnSize) and the size of each group (returnColumnSizes).

In conclusion, this code is a great example of how different data structures and algorithms can be combined to solve a complex problem efficiently. Understanding this code can deepen your grasp on hash tables, sorting, memory management, and problem-solving strategies in C.

Top comments (0)