DEV Community

Mihai
Mihai

Posted on

Exploring Hash Functions in Python: Distribution, Collisions, and Performance

This script demonstrates different hash functions and tests how they distribute elements in a hash table. It includes functions for measuring distribution, collisions, execution time, and sensitivity to minor changes.

  1. Simple Hash Function (simple_hash):

This hash function sums the ASCII values of each character in the input string and then takes the result modulo the table size. It's a basic form of hashing and can lead to poor distribution.

def simple_hash(input_str, table_size):
    hash_value = 0
    for char in input_str:
        hash_value += ord(char)
    return hash_value % table_size
Enter fullscreen mode Exit fullscreen mode
  1. Polynomial Hash (polynomial_hash):

Uses a polynomial accumulation of ASCII values, with a prime number (default is 31) to give more weight to characters earlier in the string.

def polynomial_hash(input_str, table_size, prime=31):
    hash_value = 0
    for i, char in enumerate(input_str):
        hash_value += ord(char) * (prime ** i)
    return hash_value % table_size
Enter fullscreen mode Exit fullscreen mode
  1. FNV-1a Hash (fnv1a_hash):

A 32-bit version of the FNV-1a hash, which multiplies and XORs the input characters using constants to generate a hash value. It's commonly used for its good distribution characteristics.

def fnv1a_hash(key, table_size):
    FNV_prime = 16777619
    FNV_offset_basis = 2166136261
    hash_value = FNV_offset_basis
    for char in key:
        hash_value ^= ord(char)
        hash_value *= FNV_prime
        hash_value &= 0xffffffff  # Ensure 32-bit hash
    return hash_value % table_size

Enter fullscreen mode Exit fullscreen mode
  1. XXHash (xx_hash):

Utilizes the xxhash library to compute a fast, non-cryptographic hash. It’s very efficient for large datasets.

def xx_hash(input_str, table_size):
    return xxhash.xxh32(input_str).intdigest() % table_size

Enter fullscreen mode Exit fullscreen mode
  1. SipHash (sip_hash):

Uses the hmac library to wrap the input string with a secret key and apply the sha256 hash function. This can provide security, though it's slower than non-cryptographic hashes.

def sip_hash(input_str, table_size, key=b'secretkey'):
    hash_value = hmac.new(key, input_str.encode(), digestmod='sha256').hexdigest()
    return int(hash_value, 16) % table_size
Enter fullscreen mode Exit fullscreen mode
  1. MurmurHash (murmur_hash):

A fast, non-cryptographic hash function using the mmh3 library. It's widely used in hash tables and bloom filters.

def murmur_hash(input_str, table_size):
    hash_value = mmh3.hash(input_str) % table_size
    return hash_value
Enter fullscreen mode Exit fullscreen mode

Generating Random Strings (generate_random_strings):

This function generates a list of random alphanumeric strings of a specified length. It's used for testing the hash functions.

def generate_random_strings(n, length=12):
    random_strings = []
    for _ in range(n):
        random_str = ''.join(random.choices(string.ascii_letters + string.digits, k=length))
        random_strings.append(random_str)
    return random_strings
Enter fullscreen mode Exit fullscreen mode

Printing the Distribution (pretty_print_distribution):

Displays how many elements are present in each location of the hash table, giving insight into the distribution and potential collisions.

def pretty_print_distribution(distribution):
    print("Element distribution in the table:")
    for num_elements, count in sorted(distribution.items()):
        print(f"  Locations with {num_elements} element(s): {count}")
Enter fullscreen mode Exit fullscreen mode

Testing Distribution and Collisions (test_distribution_and_collisions):

This function evaluates how well a hash function distributes a set of elements across the hash table. It checks for collisions and calculates the final distribution.

def test_distribution_and_collisions(hash_func, table_size, num_elements):
    hash_table = [0] * table_size
    elements = generate_random_strings(num_elements)
    collisions = 0
    for elem in elements:
        hash_index = hash_func(elem, table_size)
        if hash_table[hash_index] != 0:
            collisions += 1
        hash_table[hash_index] += 1
    distribution = collections.Counter(hash_table)
    print(f"Total number of collisions: {collisions}")
    pretty_print_distribution(distribution)
Enter fullscreen mode Exit fullscreen mode

Testing Execution Time (test_execution_time):

Measures the time it takes to apply a hash function to a set of elements, helping compare performance between different hash functions.

def test_execution_time(hash_func, table_size, num_elements):
    elements = generate_random_strings(num_elements)
    start_time = time.time()
    for elem in elements:
        hash_func(elem, table_size)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Total execution time for {num_elements} elements: {execution_time:.6f} seconds")
Enter fullscreen mode Exit fullscreen mode

Testing Sensitivity to Minor Changes (test_sensitivity):

This function checks whether the hash function is sensitive to small changes in input. It compares the hash values of two very similar strings.

def test_sensitivity(hash_func, table_size):
    original_str = "example"
    modified_str = "exampLe"  # Changing a single character
    original_hash = hash_func(original_str, table_size)
    modified_hash = hash_func(modified_str, table_size)
    print(f"Original hash for '{original_str}': {original_hash}")
    print(f"Modified hash for '{modified_str}': {modified_hash}")
    if original_hash != modified_hash:
        print("The hash function is sensitive to small changes.")
    else:
        print("The hash function is NOT sensitive to small changes.")
Enter fullscreen mode Exit fullscreen mode

Top comments (0)