Solving the Hash Code 2022 Practice Challenge with 70 lines of code

#programming #python #contest #hashcode

Introduction

The Google Hash Code is a team coding competition organized by Google. The qualification round usually takes place in February and the organizers usually publish a practice challenge to warm you up. In this post, I explain how I solved this challenge with just 70 lines of Python code.

Description of the challenge

As usual, the practice challenge for the Hash Code 2022 is about Pizzas. This time, the goal is to find the only pizza that appeals to as many people as possible.

The input of the challenge is a list of potential customers with the list of ingredients they like and the list of ingredients they dislike. For example:

3
2 cheese peppers
0
1 basil
1 pineapple
2 mushrooms tomatoes
1 basil

means that we have 3 customers :

The first potential customers likes 2 ingredients, namely "cheese" and "peppers" and dislikes nothing (0 ingredient).
The second potential customers likes 1 ingredient, namely "basil" and also dislikes 1 ingredient, namely "pineapple".
The third potential customers likes 2 ingredients, namely "mushrooms" and "tomatoes" and dislikes 1 ingredient, namely "basil".

A customer would buy the pizza if it has all the ingredients he likes and none of the ingredient he dislikes.

So using the set theory notation, given a pizza with the set of ingredients $p$ , the customer with the set of ingredients he likes $l$ and the set of ongredients he dislikes $d$ would buy it if and only if $p \cap l = l$ and $p \cap d = \emptyset$ .

To submit the solution, we just write a single line with the number of ingredients followed by the list of the ingredients on the pizza. For example:

4 cheese mushrooms tomatoes peppers

means that the only pizza we serve has 4 ingredients, namely "cheese", "mushrooms", "tomatoes" and "peppers".

Analysis

One possible solution for this problem would be to try all possible combination of ingredients. This would work if the number of ingredients is rather small, but the most difficult input has 10000 ingredients and there are $2^{10000} \approx 2 \cdot 10^{3010}$ possible combinations (this is a 2 followed by 3010 zeros)!

As usually with such challenges, we need another approach. The most basic one consists of computing many pizzas with a random set of ingredients and keeping the one with the best score. This would work and would be very fast, but it will not converge toward the best solution. But we can keep the idea and use a genetic algorithm to improve the convergence. This is what we are going to do!

Coding

The Hash Code can be solved with any programming language, but for this example, I will use Python 3. Python 3 is a rather simple language and there is a good library for genetic algorithm. So let's begin.

I start by implementing two classes :

The customer (with the ingredients he likes and dislikes)
The challenge itself (with the list of customers, the set of all ingredients, the solution, and the score)

from dataclasses import dataclass, field

@dataclass
class Customer:
    likes: set[str]
    dislikes: set[str]

@dataclass
class Challenge:
    customers: list[Customer] = field(default_factory=list)
    all_ingredients: set = field(default_factory=set)
    solution: set = field(default_factory=set)
    score: int = 0

Then in the Challenge class, I implement a method to reset the challenge (if I want to keep the instance of this class for several inputs):

class Challenge:
    def reset(self):
        self.customers = []
        self.all_ingredients = set()
        self.solution = set()
        self.score = 0

I also implement a method add_customer that appends the customer to the list and update the set of all ingredients with the one that this customer likes. Note that there is no need to add the ingredients that he dislikes because there is no benefit in having them on our pizza.

class Challenge:
    def add_customer(self, customer):
        self.customers.append(customer)
        self.all_ingredients |= customer.likes

Now I need a method that computes the score of a given pizza. The score is the number of customers who would buy this pizza. To do that, we iterate over all customers and test if they would buy it with the formula that we found earlier ( $p \cap l = l$ and $p \cap d = \emptyset$ ):

class Challenge:
    def evaluate(self, pizza: set[str]) -> int:
        result = 0
        for c in self.customers:
            if c.likes & pizza == c.likes and c.dislikes & pizza == set():
                result += 1
        return result

Now this is where the magic occurs! In the method solve, we implement a fitness function that takes a solution candidate as argument and returns the score. The candidate is a NumPy binary vector of the size of the set of all ingredients and where a 1 means that this ingredient is chosen and a 0 means that the ingredient is not chosen.

We let then the PyGAD package do the job of selecting the best candidate:

class Challenge:
    def solve(self, generations=100):
        ingr_list = sorted(list(self.all_ingredients))

        def fitness_func(solution, solution_idx):
            pizza = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
            return self.evaluate(pizza)

        ga_instance = pygad.GA(
            num_generations=generations,
            num_parents_mating=2,
            sol_per_pop=3,
            num_genes=len(ingr_list),
            fitness_func=fitness_func,
            init_range_low=0,
            init_range_high=2,
            random_mutation_min_val=0,
            random_mutation_max_val=2,
            mutation_by_replacement=True,
            gene_type=int)

        ga_instance.run()

        solution, solution_fitness, solution_idx = ga_instance.best_solution()
        self.solution = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
        self.score = solution_fitness

If we plot the fitness function for the difficult challenge (600 ingredients), we see that we almost reach a score of 1550 points in 2000 generations:

We then add a process method that deal with file reading and writing and this is the final code:

from dataclasses import dataclass, field
import pygad

@dataclass
class Customer:
    likes: set[str]
    dislikes: set[str]

@dataclass
class Challenge:
    customers: list[Customer] = field(default_factory=list)
    all_ingredients: set = field(default_factory=set)
    solution: set = field(default_factory=set)
    score: int = 0

    def reset(self):
        self.customers = []
        self.all_ingredients = set()
        self.solution = set()
        self.score = 0

    def add_customer(self, customer):
        self.customers.append(customer)
        self.all_ingredients |= customer.likes

    def evaluate(self, pizza: set[str]) -> int:
        result = 0
        for c in self.customers:
            if (c.likes & pizza == c.likes and 
                c.dislikes & pizza == set()):
                result += 1
        return result

    def solve(self, generations=100):
        ingr_list = sorted(list(self.all_ingredients))

        def fitness_func(solution, solution_idx):
            pizza = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
            return self.evaluate(pizza)

        ga_instance = pygad.GA(
            num_generations=generations,
            num_parents_mating=2,
            sol_per_pop=3,
            num_genes=len(ingr_list),
            fitness_func=fitness_func,
            init_range_low=0,
            init_range_high=2,
            random_mutation_min_val=0,
            random_mutation_max_val=2,
            mutation_by_replacement=True,
            gene_type=int)

        ga_instance.run()

        solution, solution_fitness, solution_idx = ga_instance.best_solution()
        self.solution = set([ingr_list[k] for (k,v) in enumerate(solution) if v == 1])
        self.score = solution_fitness

    def process(self, filename, generations=100):
        self.reset()
        with open(f"in/{filename}.in.txt") as f:
            n = int(f.readline())
            for i in range(n):
                customer = Customer(
                    set(f.readline().strip().split()[1:]),
                    set(f.readline().strip().split()[1:])
                )
                self.add_customer(customer)

            print(len(self.all_ingredients))
            self.solve(generations)

        with open(f"out/{filename}.out-{self.score}.txt", "w") as f:
            f.write(f"{len(self.solution)} ")
            f.write(" ".join(self.solution))

c = Challenge()
c.process("a_an_example", 1000)
c.process("b_basic", 1000)
c.process("c_coarse", 1000)
c.process("d_difficult", 1000)
c.process("e_elaborate", 1000)

The input files are expected in the ./in directory and you need a ./out directory where the program can write the results.

Conclusion

We were able to solve the Hash Code 2022 Practice Challenge with just 70 lines of Python 3 code and the magic of the PyGAD. The code is not yet optimized and I am sure that we can improve the operations on sets and also tune the settings of PyGAD to get better and faster results, but this goes beyond the scope of this post.

Credit : Cover photo by Vitalii Chernopyskyi on Unsplash