Parallelized vectorization with Dask - a Monte-Carlo example

João M.C. Teixeira

Today, I came across an article on Medium about parallelization in Python (here). I used that post as an example to practice vectorization principles with NumPy; you can read my previous post on DEV here. The performance gain NumPy delivers on a single core is outstanding.

Can we improve the performance of the vectorized Monte-Carlo approach even further?

Dask offers a NumPy-like interface with automatic parallelization. So, let us try it!

This is the solution I came up with to compute pi with a Monte-Carlo approach; in other words, it reproduces the same algorithm as the two posts mentioned above, but with Dask. Here I am using the default configuration and not exploring Dask tweaks that might gain further performance. I find it amazing how low Dask keeps the memory profile: it managed the parallelization across my laptop's 8 threads and the available memory seamlessly.

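import time

import dask.array as da

start = time.time()
sample = 10_000_000_000  # <- this is huge!
# Row 0 holds the x coordinates, row 1 the y coordinates, both in [-1, 1).
xxyy = da.random.uniform(-1, 1, size=(2, sample))
# Distance of each point from the origin.
norm = da.linalg.norm(xxyy, axis=0)
# Count the points that fall inside the unit circle (lazy until .compute()).
summ = da.sum(norm <= 1)
insiders = summ.compute()
pi = 4 * insiders / sample
print("pi ~= {}".format(pi))
print("Finished in: {:.2f}s".format(time.time()-start))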

On my laptop:

pi ~= 3.141615808
Finished in: 107.14s
CPU~Quad core Intel Core i7-8550U (-MT-MCP-)
speed/max~800/4000 MHz
Kernel~4.15.0-99-generic x86_64
HDD~2250.5GB(56.6% used)
inxi~2.3.56
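To see where the low memory profile comes from, here is a minimal sketch of the same estimate at a much smaller scale, with the chunking made explicit. A Dask array is split into blocks that are generated and reduced piece by piece, so roughly only the blocks currently being worked on need to sit in memory. The function name `estimate_pi`, the `chunk` value, and the fixed seed are my own illustrative choices; the run above relied on Dask's default chunking.

```python
import math

import dask.array as da


def estimate_pi(sample, chunk):
    """Monte-Carlo estimate of pi from `sample` points, `chunk` points per block."""
    # Fixed seed, used here only so that the sketch is repeatable.
    rs = da.random.RandomState(42)
    # Each block has shape (2, chunk): one row of x values, one row of y values.
    xxyy = rs.uniform(-1, 1, size=(2, sample), chunks=(2, chunk))
    norm = da.linalg.norm(xxyy, axis=0)
    # Nothing is computed until .compute() is called on the reduction.
    insiders = da.sum(norm <= 1).compute()
    return 4 * int(insiders) / sample


sample, chunk = 1_000_000, 100_000
# A float64 block of shape (2, chunk) takes 2 * chunk * 8 bytes.
print("block size: {:.1f} MB".format(2 * chunk * 8 / 1e6))
estimate = estimate_pi(sample, chunk)
print("pi ~= {} (error {:.5f})".format(estimate, abs(estimate - math.pi)))
```

Changing `chunk` trades memory per block against the number of tasks Dask has to schedule, which is one of the tweaks I did not explore here.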

Additional notes:

It is possible to write this statement:

summ = da.sum(norm <= 1)

using masked arrays:

mask = da.ma.masked_inside(norm, 0, 1)
trues = da.ma.getmaskarray(mask)
summ = da.sum(trues)

Yet this latter form is slower, by about 20% on my machine.
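As a quick sanity check that the two forms count the same points, here is a small sketch that evaluates both on the same sample. The sample size, chunk size, and seed are arbitrary values picked for illustration, and the variable names `count_cmp` and `count_mask` are mine. Both reductions are passed to a single `dask.compute` call so that they share the same underlying random blocks.

```python
import dask
import dask.array as da

# Small seeded sample so both forms see exactly the same points.
rs = da.random.RandomState(0)
xxyy = rs.uniform(-1, 1, size=(2, 200_000), chunks=(2, 50_000))
norm = da.linalg.norm(xxyy, axis=0)

# Form 1: plain boolean comparison.
summ_cmp = da.sum(norm <= 1)

# Form 2: masked arrays; the mask is True where 0 <= norm <= 1.
mask = da.ma.masked_inside(norm, 0, 1)
trues = da.ma.getmaskarray(mask)
summ_mask = da.sum(trues)

# Evaluate both reductions in one pass over the shared graph.
count_cmp, count_mask = dask.compute(summ_cmp, summ_mask)
print("comparison:", int(count_cmp), "masked:", int(count_mask))
```

Because a norm is never negative, `masked_inside(norm, 0, 1)` selects exactly the points with `norm <= 1`, so the two counts should agree.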

What are your thoughts?

