Let’s make life a whole lot simpler.
When you’re working in a notebook you’ve probably written a for-loop like the one below.
import numpy as np

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis=1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# This is the for-loop everybody keeps writing!
for size in [10, 15, 20, 25, 30]:
    # At every turn in the loop we add a number to the list.
    data.append(birthday_experiment(class_size=size, n_sim=10_000))
We’re doing a simulation here, but it might as well be a grid-search. We’re looping over settings in order to collect data in our list.
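As a sanity check on what the simulation estimates, the same probability can also be computed exactly. The snippet below is a minimal sketch, assuming 365 equally likely birthdays (the same assumption the simulation makes). The name exact_birthday_proba is made up for this illustration and is not part of any package.

```python
import math

def exact_birthday_proba(class_size, n_days=365):
    """Exact probability that at least two people share a birthday."""
    # Probability that all birthdays differ: (365/365) * (364/365) * ...
    p_all_distinct = math.prod((n_days - k) / n_days for k in range(class_size))
    return 1 - p_all_distinct

for size in [10, 15, 20, 25, 30]:
    print(size, round(exact_birthday_proba(size), 4))
```

The simulated estimates should hover around these values, with less noise as n_sim grows.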
We can expand this loop to get our data into a pandas dataframe.
import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis=1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

sizes = [10, 15, 20, 25, 30]
for size in sizes:
    data.append(birthday_experiment(class_size=size, n_sim=10_000))

# At the end we put everything in a DataFrame. Neato!
pd.DataFrame({"size": sizes, "result": data})
So far, so good. But will this pattern work for larger grids?
Let’s see what happens when we add more elements we’d like to loop over.
import numpy as np

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis=1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        data.append(birthday_experiment(class_size=size, n_sim=n_sim))
We now need to write two loops, but this has a consequence. How can we possibly link up the size parameter with the n_sim parameter when we cast this into a dataframe? You could do something like this:
import numpy as np
import pandas as pd

data = []

def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis=1) + 1
    proba = np.mean(n_uniq != class_size)
    return proba

# We're now looping over a larger grid!
for size in [10, 15, 20, 25, 30]:
    for n_sim in [1_000, 10_000, 100_000]:
        result = birthday_experiment(class_size=size, n_sim=n_sim)
        row = [size, n_sim, result]
        data.append(row)

# More manual labor. Kind of error prone.
df = pd.DataFrame(data, columns=["size", "n_sim", "result"])
But suddenly we’re spending a lot of effort in maintaining a for-loop.
I’ve noticed that this for-loop keeps getting re-written in a lot of notebooks. You find it in simulations, but also in lots of grid-searches. It’s a source of complexity, especially when our nested loops increase in size. So I figured I’d write a small package that can make all this easier.
Let’s make three minor changes to the code.
import numpy as np
from memo import memlist

data = []

@memlist(data=data)
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis=1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

for size in [5, 10, 20, 30]:
    for n_sim in [1000, 1_000_000]:
        birthday_experiment(class_size=size, n_sim=n_sim)
We’ve changed three things. First, we’ve added the memlist decorator from the memo package to our original function. This lets us configure a place to relay our stats into. Note that the decorator receives an empty list as input; it’s this data list that will receive new data. Second, the function now returns a dictionary instead of a bare number. Third, the loop just calls the function and no longer appends anything itself.
If you were to run this code, the data variable would now contain the following information:
[
{"class_size": 5, "n_sim": 1000, "est_proba": 0.024},
{"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178},
{"class_size": 10, "n_sim": 1000, "est_proba": 0.104},
{"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062},
{"class_size": 20, "n_sim": 1000, "est_proba": 0.415},
{"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571},
{"class_size": 30, "n_sim": 1000, "est_proba": 0.703},
{"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033},
]
What’s nice about a list of dictionaries is that pandas can parse it directly, without you having to worry about column names.
pd.DataFrame(data)
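To get a feel for what a decorator like this does, here is a minimal sketch of the idea. It is not memo's actual implementation: the name collect_into is hypothetical, and the sketch assumes the function is always called with keyword arguments and always returns a dictionary.

```python
import functools

def collect_into(data):
    """Hypothetical stand-in for a memlist-style decorator (not memo's code)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            result = func(**kwargs)
            # Merge the keyword arguments with the returned dictionary
            # so every call leaves one self-describing row behind.
            data.append({**kwargs, **result})
            return result
        return wrapper
    return decorator

rows = []

@collect_into(rows)
def add(a, b):
    return {"total": a + b}

add(a=1, b=2)
add(a=3, b=4)
print(rows)
```

Because each row carries both the inputs and the outputs, the column names come along for free when the list is handed to pandas.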
This pattern is nice, but we’re still dealing with for-loops. So let’s fix that and add some extra features.
import numpy as np
from memo import memlist, memfile, grid, time_taken

data = []

@memfile(filepath="results.jsonl")
@memlist(data=data)
@time_taken()
def birthday_experiment(class_size, n_sim):
    """Simulates the birthday paradox. Vectorized = Fast!"""
    sims = np.random.randint(1, 365 + 1, (n_sim, class_size))
    sort_sims = np.sort(sims, axis=1)
    n_uniq = (sort_sims[:, 1:] != sort_sims[:, :-1]).sum(axis=1) + 1
    proba = np.mean(n_uniq != class_size)
    return {"est_proba": proba}

setting_grid = grid(class_size=[5, 10, 20, 30], n_sim=[1000, 1_000_000])
for settings in setting_grid:
    birthday_experiment(**settings)
Pay attention to the following changes.
We’re using two mem-decorators now. One decorator passes the stats to a list while the other appends the results to a file ("results.jsonl" to be exact).
We’ve added time_taken, which makes sure that we also log how long the function took to complete.
We’re using the grid function to generate a grid of settings on our behalf. It represents a generator of settings that can be passed directly to our function. This way, we need one (and only one) for-loop, even if we are working on large grids. You can even configure it to show a neat little progress bar!
If you were to inspect the "results.jsonl" file it would look like this:
{"class_size": 5, "n_sim": 1000, "est_proba": 0.024, "time_taken": 0.0004899501800537109}
{"class_size": 5, "n_sim": 1000000, "est_proba": 0.027178, "time_taken": 0.19407916069030762}
{"class_size": 10, "n_sim": 1000, "est_proba": 0.104, "time_taken": 0.000598907470703125}
{"class_size": 10, "n_sim": 1000000, "est_proba": 0.117062, "time_taken": 0.3751380443572998}
{"class_size": 20, "n_sim": 1000, "est_proba": 0.415, "time_taken": 0.0009679794311523438}
{"class_size": 20, "n_sim": 1000000, "est_proba": 0.411571, "time_taken": 0.7928380966186523}
{"class_size": 30, "n_sim": 1000, "est_proba": 0.703, "time_taken": 0.0018239021301269531}
{"class_size": 30, "n_sim": 1000000, "est_proba": 0.706033, "time_taken": 1.1375510692596436}
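The idea behind a grid of settings can be sketched with the standard library. This is only an illustration under my own assumptions, not memo's implementation, and settings_grid is a hypothetical name. It yields one dictionary per combination, which is why a single loop is enough.

```python
import itertools

def settings_grid(**options):
    """Hypothetical sketch of a grid generator (not memo's implementation)."""
    names = list(options)
    # itertools.product walks every combination, last name varying fastest.
    for values in itertools.product(*options.values()):
        yield dict(zip(names, values))

for settings in settings_grid(class_size=[5, 10], n_sim=[1000, 1_000_000]):
    print(settings)
```

Each yielded dictionary can be unpacked straight into a function call with **settings, so adding another parameter to the grid does not add another level of nesting.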
The goal of memo is to make it easier to stop worrying about that one for-loop that we all write. We should just collect stats instead. Note that you can use the decorators from this package to send information to files and lists, but also to callback functions or as a POST request payload to a central logging service.
I’ve found it useful in many projects. The main example for me is running benchmarks with scikit-learn. I do a lot of NLP, and many of my components cannot be serialized with a Python pickle, which means that I cannot use the standard GridSearchCV from scikit-learn. You can even combine it with ray to gather statistics from compute happening in parallel. It also plays very nicely with hiplot if you’re interested in visualising your statistics.
If this tool sounds useful, feel free to install it via:
pip install memo
If you’d like to understand more of the details, check out the GitHub repo and the documentation page. There’s also a full tutorial on calmcode.io in case you’re interested.
For attribution, please cite this work as
Warmerdam (2021, March 3). koaning.io: A Loop to Stop Writing.. Retrieved from https://koaning.io/posts/for-loop-memo/
BibTeX citation
@misc{warmerdam2021a, author = {Warmerdam, Vincent}, title = {koaning.io: A Loop to Stop Writing.}, url = {https://koaning.io/posts/for-loop-memo/}, year = {2021} }