MLOps Community

MLOps Coffee Sessions #14 Conversation with the Creators of Dask // Hugo Bowne-Anderson and Matthew Rocklin

Dask
What is it?
Parallelism for analytics
What is parallelism?
Doing a lot at once by splitting tasks into smaller subtasks which can be processed in parallel (at the same time)
Distributing work across multiple cores or machines and then combining the results
Helpful for CPU-bound work - doing a bunch of calculations on the CPU. The rate at which a process progresses is limited by the speed of the CPU
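The split-then-combine idea above can be sketched with Python's standard library alone. This is a minimal illustration, not how Dask does it internally: the function names (`sum_of_squares`, `split`, `parallel_sum_of_squares`) and the `executor_cls` parameter are hypothetical, chosen only for this example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_of_squares(chunk):
    """CPU-bound subtask: works on one chunk, independent of the others."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Split the work, run the subtasks in a pool, then combine the results."""
    chunks = split(data, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)  # the "combine" step


if __name__ == "__main__":
    # Processes (not threads) are the default here because the work is CPU bound.
    print(parallel_sum_of_squares(list(range(1_000_000))))
```

Passing `executor_cls=ThreadPoolExecutor` runs the same code on threads, which is convenient for quick experiments but will not speed up pure-Python CPU-bound work.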
Concurrency?
Similar, but things don’t have to happen at the same time; they can happen asynchronously. They can overlap.
Tasks often share state
Helpful for I/O-bound work - networking, reading from disk, etc. The rate at which a process progresses is limited by the speed of the I/O subsystem.
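To make the "overlap" point concrete, here is a small sketch using `asyncio`, with `asyncio.sleep` standing in for a real network or disk wait. The names `fake_io`, `fetch_all`, and `timed_fetch` are made up for this illustration.

```python
import asyncio
import time


async def fake_io(name, delay):
    """Stand-in for an I/O wait (network call, disk read)."""
    await asyncio.sleep(delay)
    return name


async def fetch_all(names, delay=0.1):
    """Start all the waits together so they overlap on a single thread."""
    return await asyncio.gather(*(fake_io(n, delay) for n in names))


def timed_fetch(names, delay=0.1):
    """Run the concurrent fetch and report how long it took in seconds."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(names, delay))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_fetch(["a", "b", "c"])
    print(results, f"{elapsed:.2f}s")
```

The three waits overlap, so the total time should be close to one delay rather than the sum of three, even though nothing runs in parallel on the CPU.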
Multi-core vs distributed
Multi-core is a single processor with 2 or more cores that can cooperate through threads (multithreading)
Distributed is across multiple nodes communicating via HTTP or RPC
Why is this hard?
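Python has its challenges due to the GIL (global interpreter lock), which lets only one thread run Python bytecode at a time in CPython; many other languages don't have this limitation
Shared state can lead to potential race conditions, deadlocks, etc.
Coordinating work across the machines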
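As a small illustration of the shared-state problem, the sketch below has several threads update one counter. The `Counter` class and `run_workers` helper are hypothetical names for this example; the lock is what keeps the read-modify-write in `increment` from interleaving between threads.

```python
import threading


class Counter:
    """Shared state guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could read the same old value
        # and one of the updates would be lost (a race condition).
        with self._lock:
            self.value += 1


def run_workers(n_threads=8, n_increments=10_000):
    """Have n_threads threads each increment one shared counter."""
    counter = Counter()

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers())
```

Locks fix this particular race, but they bring their own risks: taking multiple locks in inconsistent order is a classic route to deadlock.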
For analytics?
Calculating some statistics on a large dataset can be tricky if it can’t fit in memory
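One common way around the memory limit is to process the data in chunks and keep only small running totals. The sketch below computes a mean that way. It uses plain Python rather than Dask, and `read_in_chunks` and `chunked_mean` are hypothetical helpers in which an in-memory iterable stands in for a file too large to load at once.

```python
def read_in_chunks(values, chunk_size):
    """Yield lists of up to chunk_size items, simulating chunked reads from disk."""
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def chunked_mean(chunks):
    """Compute the mean while holding only one chunk in memory at a time."""
    count = 0
    total = 0.0
    for chunk in chunks:
        count += len(chunk)   # per-chunk partial result
        total += sum(chunk)   # per-chunk partial result
    if count == 0:
        raise ValueError("no data")
    return total / count      # combine the partial results


if __name__ == "__main__":
    print(chunked_mean(read_in_chunks(range(1, 101), chunk_size=7)))
```

Because each chunk contributes only a count and a sum, the per-chunk step could also be handed to separate workers and combined afterwards, which is the same split-and-combine pattern described under parallelism.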

// Show Notes

Coiled Cloud: https://cloud.coiled.io/
Coiled Launch Announcement: https://medium.com/coiled-hq/coiled-dask-for-everyone-everywhere-376f5de0eff4
OSS article: https://www.forbes.com/sites/glennsolomon/2020/09/15/monetizing-open-source-business-models-that-generate-billions/#2862e47234fd
Amish barn raising: https://www.youtube.com/watch?v=y1CPO4R8o5M
MessagePassingInterface: https://en.wikipedia.org/wiki/Message_Passing_Interface

----------- Connect With Us ✌️-------------

Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Matthew on LinkedIn: https://www.linkedin.com/in/matthew-rocklin-461b4323/

Timestamps:
0:00 - Intro to Matthew Rocklin and Hugo Bowne-Anderson
0:37 - Matthew Rocklin's Background
1:17 - Hugo Bowne-Anderson's Background
3:47 - Where did that inspiration come from?
10:04 - Is there a close relationship between Best Practices and Tooling or are these two separate things?
11:27 - Why is Data Literacy important with Coiled?
14:46 - How do you think about the balance between enabling Data Science to have a lot of powerful compute?
17:05 - Machine Learning as a space for tracking best practices experimentation
19:32 - What makes Data Science so difficult?
24:07 - How can a for-profit company complement Open Source Software (OSS)?
29:40 - Amazon becoming a competitor with your own open-source technology (?)
32:50 - How do you encourage more people to contribute and ensure quality?
34:58 - Do you see Coiled operating within the Dask ecosystem?
37:30 - What is Dask?
39:19 - What should people know about parallelism?
41:28 - Why is it so hard to put things back together?
41:34 - Why does Python need a whole new tool to enable that? Or maybe some other tools as well?
44:44 - Dynamic Tasks Scheduling as being useful to Data Scientists
47:15 - Why is reliability in particular important in Data Science?
52:27 - What's in store for Dask?
