
vincent d warmerdam

Originally published at koaning.io

A `for`-loop to stop writing.

Let's make life a whole lot simpler.

When you're working in a notebook you've probably written a for-loop like the one below.

import numpy as np 

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# This is the for-loop everybody keeps writing! 
for size in [10, 15, 20, 25, 30]:
    # At every turn in the loop we add a number to the list.
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

We're doing a simulation here, but it might as well be a grid-search. We're looping over settings in order to collect data in our list.

Pandas

We can expand this loop to get our data into a pandas dataframe.

import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

sizes = [10, 15, 20, 25, 30]
for size in sizes:
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

# At the end we put everything in a DataFrame. Neeto! 
pd.DataFrame({"size": sizes, "result": data})

So far, so good. But will this pattern work for larger grids?

It gets bad.

Let's see what happens when we add more elements we'd like to loop over.

import numpy as np 

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        data.append(birthday_experiment(class_size=size, n_sim=n_sim))

We now need two loops, and that has a consequence: how do we link each result to the size and n_sim values that produced it when we cast everything into a dataframe? You could do something like this:

import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        result = birthday_experiment(class_size=size, n_sim=n_sim)
        row = [size, n_sim, result]
        data.append(row)

# More manual labor. Kind of error prone.
df = pd.DataFrame(data, columns=["size", "n_sim", "result"])

But suddenly we're spending a lot of effort on maintaining a for-loop.
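One partial workaround, using only the standard library, is to flatten the nested loops with itertools.product. The sketch below is an illustration rather than part of the original post: it swaps in a slower pure-Python stand-in for birthday_experiment so that it runs without NumPy, and the name rows is made up for the example.

```python
import itertools
import random


def birthday_experiment(class_size, n_sim):
    """Pure-Python stand-in for the NumPy version (slower, same idea)."""
    hits = 0
    for _ in range(n_sim):
        days = [random.randint(1, 365) for _ in range(class_size)]
        # A class with fewer unique birthdays than pupils has a shared birthday.
        hits += int(len(set(days)) != class_size)
    return hits / n_sim


sizes = [10, 20, 30]
n_sims = [100, 1_000]

# itertools.product yields every (size, n_sim) pair, so one loop is enough.
rows = [
    {"size": size, "n_sim": n_sim, "result": birthday_experiment(size, n_sim)}
    for size, n_sim in itertools.product(sizes, n_sims)
]
print(len(rows))
```

This removes the nesting, but the parameter names still have to be kept in sync by hand in three places: the lists, the loop variables and the dictionary keys.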

Been here before?

I've noticed that this for-loop keeps getting re-written in a lot of notebooks. You find it in simulations, but also in lots of grid-searches. It's a source of complexity, especially when our nested loops increase in size. So I figured I'd write a small package that can make all this easier.

Decorators

Let's make three minor changes to the code.

import numpy as np 
from memo import memlist

data = []

@memlist(data=data)
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

for size in [5, 10, 20, 30]:
    for n_sim in [1000, 1_000_000]:
        birthday_experiment(class_size=size, n_sim=n_sim)

We've changed three things.

  • We've added the memlist decorator from the memo package to our original function. This lets us configure a place to relay our stats into. Note that the decorator receives an empty list as input. It's this data list that will receive new data.
  • We've changed our function to output a dictionary. This way we can attach names to our output and we're able to support functions that output more than one number.
  • The for-loops now only run the function and don't handle any state any more.
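To get a feel for how a decorator like this can work, here is a minimal sketch. It is not memo's actual implementation: record_to_list, log and add are hypothetical names, and the sketch assumes the function is always called with keyword arguments and returns a dictionary.

```python
import functools


def record_to_list(data):
    """Hypothetical stand-in for a memlist-style decorator (not memo's code)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            result = func(**kwargs)
            # Merge the keyword arguments with the returned dictionary
            # and store the combined record in the list we were given.
            data.append({**kwargs, **result})
            return result

        return wrapper

    return decorator


log = []


@record_to_list(data=log)
def add(a, b):
    return {"total": a + b}


add(a=1, b=2)
add(a=3, b=4)
print(log)  # [{'a': 1, 'b': 2, 'total': 3}, {'a': 3, 'b': 4, 'total': 7}]
```

Because the wrapper sees both the inputs and the output, the calling loop no longer has to pair them up itself.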

If you were to run this code, the data variable would now contain the following information:

[
    {"class_size": 5, "n_sim": 1000, "est_proba": 0.024},
    {"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178},
    {"class_size": 10, "n_sim": 1000, "est_proba": 0.104},
    {"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062},
    {"class_size": 20, "n_sim": 1000, "est_proba": 0.415},
    {"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571},
    {"class_size": 30, "n_sim": 1000, "est_proba": 0.703},
    {"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033},
]
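As a sanity check on these estimates, the probability of a shared birthday can also be computed exactly, assuming 365 equally likely birthdays. The helper below, exact_birthday_proba, is a name made up for this sketch and is not part of memo or of the original post.

```python
def exact_birthday_proba(class_size, n_days=365):
    """Exact probability that at least two people share a birthday."""
    p_all_distinct = 1.0
    for k in range(class_size):
        # The (k + 1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (n_days - k) / n_days
    return 1.0 - p_all_distinct


for size in [5, 10, 20, 30]:
    print(size, round(exact_birthday_proba(size), 3))
```

The exact values come out at roughly 0.027, 0.117, 0.411 and 0.706, which agrees well with the runs that use 1,000,000 simulations.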

What's nice about a list of dictionaries is that pandas can parse it directly, without you having to worry about column names.

import pandas as pd

pd.DataFrame(data)

Let's do more.

This pattern is nice, but we're still dealing with for-loops. So let's fix that and add some extra features.

import numpy as np 
from memo import memlist, memfile, grid, time_taken

data = []

@memfile(filepath="results.jsonl")
@memlist(data=data)
@time_taken()
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis = 1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

setting_grid = grid(class_size=[5, 10, 20, 30], n_sim=[1000, 1_000_000])
for settings in setting_grid:
    birthday_experiment(**settings)

Pay attention to the following changes.

  1. We've got two mem-decorators now. One decorator passes the stats to a list while the other appends the results to a file ("results.jsonl" to be exact).
  2. We've also added a decorator called time_taken which will make sure that we also log how long the function took to complete.
  3. We've used the grid function to generate a grid of settings on our behalf. It represents a generator of settings that can be passed directly to our function. This way we need one (and only one) for-loop, even when we're working with large grids. You can even configure it to show a neat little progress bar!
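Conceptually, a grid of settings is the Cartesian product of the parameter lists, with each combination turned into a dictionary of keyword arguments. The sketch below illustrates that idea with the standard library only. simple_grid is a hypothetical name and memo's own grid may be implemented differently (it also supports extras such as the progress bar).

```python
import itertools


def simple_grid(**param_lists):
    """Hypothetical sketch of a grid helper; memo's own `grid` may differ."""
    names = list(param_lists)
    for values in itertools.product(*param_lists.values()):
        # Pair each parameter name with one value from its list.
        yield dict(zip(names, values))


settings = list(simple_grid(class_size=[5, 10], n_sim=[1000, 1_000_000]))
for s in settings:
    print(s)
```

Each yielded dictionary can be unpacked with ** into a function call, which is why a single loop covers the whole grid no matter how many parameters it has.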

If you were to inspect the "results.jsonl" file it would look like this:

{"class_size": 5, "n_sim": 1000, "est_proba": 0.024, "time_taken": 0.0004899501800537109}
{"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178, "time_taken": 0.19407916069030762}
{"class_size": 10, "n_sim": 1000, "est_proba": 0.104, "time_taken": 0.000598907470703125}
{"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062, "time_taken": 0.3751380443572998}
{"class_size": 20, "n_sim": 1000, "est_proba": 0.415, "time_taken": 0.0009679794311523438}
{"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571, "time_taken": 0.7928380966186523}
{"class_size": 30, "n_sim": 1000, "est_proba": 0.703, "time_taken": 0.0018239021301269531}
{"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033, "time_taken": 1.1375510692596436}
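Since every line in the file is a standalone JSON object, the results are easy to load again later. The sketch below uses only the standard library and writes a small sample file to a temporary directory first, so the path and the two sample records are there purely for illustration.

```python
import json
import os
import tempfile

# Write a tiny sample file so the sketch is self-contained.
sample_lines = [
    '{"class_size": 5, "n_sim": 1000, "est_proba": 0.024}',
    '{"class_size": 10, "n_sim": 1000, "est_proba": 0.104}',
]
path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
with open(path, "w") as f:
    f.write("\n".join(sample_lines) + "\n")

# One JSON document per line: parse each non-empty line on its own.
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(records[0]["est_proba"])  # 0.024
```

The resulting list of dictionaries can go straight into pd.DataFrame, and pandas can also read such a file in one step with pd.read_json(path, lines=True).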

When is this useful?

The goal of memo is to let you stop worrying about that one for-loop we all write, so you can just collect stats instead. Note that you can use the decorators from this package to send information to files and lists, but also to callback functions or to a central logging service as the payload of a POST request.

I've found it useful in many projects. The main example for me is running benchmarks with scikit-learn. I do a lot of NLP, and many of my components cannot be serialized with Python's pickle, which means that I cannot use the standard GridSearch from scikit-learn. You can even combine it with ray to gather statistics from compute happening in parallel. It also plays very nicely with hiplot if you're interested in visualising your statistics.

If this tool sounds useful, feel free to install it via:

pip install memo

If you'd like to understand more of the details, check out the GitHub repo and the documentation page. There's also a full tutorial on calmcode.io in case you're interested.

Oldest comments (1)

Nico Oteiza

This is great! I can imagine this becoming the new standard way of doing experiments! Definitely less error prone and more explicit. I love it!