<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: holden karau</title>
    <description>The latest articles on DEV Community by holden karau (@holdenkarau).</description>
    <link>https://dev.to/holdenkarau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F501887%2F9cbdbfff-a3c9-4c8e-851a-aac6fa03bc7a.jpg</url>
      <title>DEV Community: holden karau</title>
      <link>https://dev.to/holdenkarau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/holdenkarau"/>
    <language>en</language>
    <item>
      <title>Building a Physical Test K8s Cluster</title>
      <dc:creator>holden karau</dc:creator>
      <pubDate>Thu, 19 Nov 2020 20:16:46 +0000</pubDate>
      <link>https://dev.to/holdenkarau/building-a-physical-test-k8s-cluster-4fpb</link>
      <guid>https://dev.to/holdenkarau/building-a-physical-test-k8s-cluster-4fpb</guid>
      <description>&lt;h1&gt;
  
  
  Building the Test Cluster
&lt;/h1&gt;

&lt;p&gt;To ensure that the results between tests are as comparable as possible, I'm using a consistent hardware setup whenever possible. Rather than use a cloud provider, I (with the help of Nova) set up a rack with a few different nodes. Using my own hardware lets me avoid the &lt;a href="https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors"&gt;noisy neighbor problem&lt;/a&gt; in my performance numbers and gives me more control over simulating network partitions. A downside is that the environment is not as easily re-creatable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the Rack
&lt;/h2&gt;

&lt;p&gt;If I'm honest, a large part of why I wanted to do this project is that ever since I was a small kid, I've dreamed of running "proper" networking gear (expired CCNA club represent). I got a &lt;a href="https://amzn.to/32OCQEq"&gt;rack&lt;/a&gt; and some shelves. (I also got an avocado tree to put on top and a &lt;a href="https://www.etsy.com/listing/787021025/kubectl-corgi-kubernetes-sticker?ga_order=most_relevant&amp;amp;ga_search_type=all&amp;amp;ga_view_type=gallery&amp;amp;ga_search_query=kubernetes&amp;amp;ref=sr_gallery-1-2&amp;amp;organic_search_click=1&amp;amp;col=1"&gt;cute kubecuddle sticker&lt;/a&gt; for good luck.)&lt;/p&gt;

&lt;p&gt;It turns out that putting together a rack is not nearly as much like LEGO as I had imagined. Some of the shelves I got ended up being very heavy (and some did not fit), but thankfully Nova came to the rescue when things got too heavy for me to move.&lt;/p&gt;

&lt;p&gt;After running the rack for about a day, I got a complaint from my neighbor about how loud the fans were, so I swapped them out for some &lt;a href="https://amzn.to/32NpeJN"&gt;quieter fans&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hosts
&lt;/h2&gt;

&lt;p&gt;The hosts themselves are a mixture of machines. I picked up three &lt;a href="https://www.raspberrypi.org/products/raspberry-pi-4-model-b/"&gt;Raspberry Pi 4Bs&lt;/a&gt;. I'm also running a &lt;a href="https://amzn.to/3kBFG6c"&gt;Jetson Nano&lt;/a&gt; and three &lt;a href="https://amzn.to/3jzO58O"&gt;Jetson AGX Xaviers&lt;/a&gt; to let me experiment with GPU acceleration. To support any x86-only code, I also have a small refurbished x86 machine present.&lt;/p&gt;

&lt;p&gt;For storage, I scrounged up some of the free flash drives I've collected from conferences over the years. This initial setup was not very fast, so I added some inexpensive on-sale external SSDs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up Kubernetes
&lt;/h2&gt;

&lt;p&gt;Since I want to be able to swap between the different Python scaling tools easily, I chose Kubernetes as the base cluster layer rather than installing the tools directly on the nodes. Because it is easy to deploy, I used K3s as the cluster manager. The biggest pain here was figuring out why the storage provisioning I was trying to use wasn't working, but thankfully Duffy came to the rescue, and we figured it out.&lt;/p&gt;
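&lt;p&gt;For reference, K3s ships with Rancher's local-path provisioner as its bundled default StorageClass (named &lt;code&gt;local-path&lt;/code&gt;). A minimal PersistentVolumeClaim against it looks something like this sketch (the claim name and size are made up for illustration):&lt;/p&gt;

```yaml
# Illustrative PVC using K3s's bundled local-path provisioner.
# "local-path" is the default StorageClass name K3s ships with;
# the claim name and requested size are hypothetical.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
```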

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Up next, I'll start exploring how the different tools work in this environment. At the very start, I'll just run through each tool's tutorials and simulate some network and node failures to see how resilient they are. Once I've got a better handle on how each tool works, I plan to explore how each of them approaches the problem of scaling pandas operations. Once that's done, we can dig in a lot deeper and see where each tool shines. If you are interested in following along, check out my &lt;a href="https://www.youtube.com/user/holdenkarau"&gt;YouTube channel on open source programming&lt;/a&gt;, where I will try to stream the process that goes into each post. You can also &lt;a href="https://www.introductiontomlwithkubeflow.com/?from=introductiontomlwithkubeflow.com"&gt;subscribe to the mailing list for my books&lt;/a&gt; to get notified when I get something working well enough to make a new post :)&lt;/p&gt;

&lt;h3&gt;
  
  
  Disclaimer
&lt;/h3&gt;

&lt;p&gt;This blog does not represent any of my employers, past or present, and does not represent any of the software projects or foundations I'm involved with. I am one of the developers of Apache Spark and have &lt;a href="https://amzn.to/2O6KYYH"&gt;some books published on the topic&lt;/a&gt; that may influence my views, but my views do not represent the project.&lt;/p&gt;

&lt;p&gt;As much as possible, I've used a common cluster environment for testing these different tools, although some parts have been easier to test on Minikube.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A First (Brief) Look at Ray on Kubernetes</title>
      <dc:creator>holden karau</dc:creator>
      <pubDate>Thu, 29 Oct 2020 20:56:27 +0000</pubDate>
      <link>https://dev.to/holdenkarau/a-first-brief-look-at-ray-on-kubernetes-21cn</link>
      <guid>https://dev.to/holdenkarau/a-first-brief-look-at-ray-on-kubernetes-21cn</guid>
      <description>&lt;h1&gt;
  
  
  A First (Brief) Look at Ray on Kubernetes
&lt;/h1&gt;

&lt;p&gt;After my motorcycle/Vespa crash last year, I took some time away from work. While I was out and practicing getting my typing speed back up, I decided to play with Ray, which was pretty cool. Ray comes out of the same&lt;sup id="fnref1"&gt;1&lt;/sup&gt; research lab that created the initial work that became the basis of Apache Spark. Like Spark's, Ray's primary authors have now started a company (Anyscale) to grow the project. Unlike Spark, Ray is a Python-first library and does not depend on the Java Virtual Machine (JVM) -- and as someone who's spent way more time than she would like getting the JVM and Python to play together, I find Ray and its cohort quite promising.&lt;/p&gt;

&lt;p&gt;This blog does not represent any of my employers, past or present, and does not represent any of the software projects or foundations I'm involved with. I am one of the developers of Apache Spark &lt;a href="https://amzn.to/2O6KYYH"&gt;and have some books published on the topic&lt;/a&gt; that may influence my views, but my views do not represent the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing Ray
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.ray.io/en/latest/installation.html"&gt;Installing Ray&lt;/a&gt; was fairly simple, especially due to its lack of JVM dependencies. The one weird thing I encountered while installing Ray was that its developers decided to "vendor" Apache Arrow. This was disappointing because I'm interested in using Arrow to get some of these tools to play together, and vendored libraries could make that a bit harder. I filed an issue with the ray-project folks, and they quickly responded that they were working on it and then resolved it, so this is something I want to come back to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Ray on K8s
&lt;/h2&gt;

&lt;p&gt;Since I had not yet built my dedicated test cluster, I decided to give Ray on Kubernetes a shot. The documentation had some room for improvement, and I got lost a few times along the way, but on my second try a few days later, using the nightly builds, I managed to get it running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fault Tolerance
&lt;/h2&gt;

&lt;p&gt;Fault tolerance is especially important in distributed systems like Spark and Ray since, as we add more and more computers, the chance of one of them failing (or the network between them failing) increases. Different distributed systems take different approaches to fault tolerance: MapReduce achieves its fault tolerance by using distributed persistent storage, while Spark recomputes on failure.&lt;sup id="fnref2"&gt;2&lt;/sup&gt;&lt;/p&gt;
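&lt;p&gt;As a toy illustration of the recompute approach (plain Python here, not Spark's actual machinery): if each "partition" is a deterministic function of its inputs and remembers how it was derived, a lost result can simply be recomputed from the surviving lineage instead of being read back from persistent storage:&lt;/p&gt;

```python
# Toy sketch of lineage-based recovery. Each partition stores the function
# that produced it plus its dependencies, so a lost value can be rebuilt
# on demand. All names here are made up for illustration.

def make_partition(fn, *deps):
    """A lazy partition: a function plus the partitions it depends on."""
    return {"fn": fn, "deps": deps, "value": None}

def compute(part):
    if part["value"] is None:  # lost (or never computed): recompute from lineage
        inputs = [compute(d) for d in part["deps"]]
        part["value"] = part["fn"](*inputs)
    return part["value"]

source = make_partition(lambda: list(range(5)))
doubled = make_partition(lambda xs: [x * 2 for x in xs], source)

print(compute(doubled))   # [0, 2, 4, 6, 8]
doubled["value"] = None   # simulate losing the node that held this partition
print(compute(doubled))   # recomputed from lineage: [0, 2, 4, 6, 8]
```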

&lt;h2&gt;
  
  
  Fault Tolerance Limitations
&lt;/h2&gt;

&lt;p&gt;One of the things that really excites me about Ray is its actor model for state. This is really important for some machine learning algorithms, and in Spark, our limitations around handling state (like model weights) have made streaming machine learning algorithms very challenging. One of the big reasons for the limitations around how state is handled is fault tolerance.&lt;/p&gt;

&lt;p&gt;To simulate a failure I created an actor and then killed the pod that was running the actor. Ray did not seem to have any automatic recovery here, which could be the right answer. In the future, I want to experiment and see if there is a way to pair Ray with a durable distributed database (or another system) to allow the recovery of actors.&lt;/p&gt;

&lt;p&gt;I want to be clear: This is about the same as in Spark. Spark only&lt;sup id="fnref3"&gt;3&lt;/sup&gt; allows state to accrue on the driver, and recovery of state on the failure of the driver requires some additional custom code.&lt;/p&gt;
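&lt;p&gt;The idea of pairing an actor with durable storage can be sketched in plain Python (this is a toy stand-in, not Ray's API; the class and store names are made up):&lt;/p&gt;

```python
# Toy sketch: an actor checkpoints its state to a durable store so a
# replacement actor can pick up where the failed one left off. Plain
# Python; a real version would pair Ray actors with an external database.

durable_store = {}  # stand-in for a durable distributed database

class CounterActor:
    def __init__(self, actor_id):
        self.actor_id = actor_id
        # On start, recover any state a previous incarnation checkpointed.
        self.count = durable_store.get(actor_id, 0)

    def increment(self):
        self.count += 1
        durable_store[self.actor_id] = self.count  # checkpoint each update
        return self.count

actor = CounterActor("counter-1")
actor.increment()
actor.increment()

del actor                              # simulate the pod being killed
recovered = CounterActor("counter-1")  # a fresh actor recovers the state
print(recovered.count)  # 2
```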

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;The Ray project looks really interesting. Along with Dask and other new Python-first tools, we're entering a new era of options for scaling our Python ML code. Seeing Apache Arrow inside of Ray is reassuring, since one of my considerations is how we can make our tools work together, and I think Arrow has the potential to serve as a bridge between the different parts of our ecosystem. Up next, I'm going to try to set up Dask on my new K8s cluster and then re-create this initial experiment on physical hardware instead of Minikube. If you've got thoughts or suggestions for what you'd like to see next, please send me an e-mail or file an issue against the webpage on GitHub.&lt;/p&gt;

&lt;p&gt;You can also follow along with my streams around &lt;a href="https://www.youtube.com/user/holdenkarau"&gt;distributed computing and open-source on my YouTube channel&lt;/a&gt;. The two videos for this post are &lt;a href="https://www.youtube.com/watch?v=WBNmF-wyAlE"&gt;Installing &amp;amp; Poking at Ray&lt;/a&gt; and &lt;a href="https://www.youtube.com/watch?v=IUI5okVvgbQ"&gt;Trying the Ray Project on Kubernetes&lt;/a&gt;. This post originally appeared on my new blog "Scaling Python ML" - &lt;a href="http://scalingpythonml.com/2020/08/16/poke-at-ray.html"&gt;http://scalingpythonml.com/2020/08/16/poke-at-ray.html&lt;/a&gt; :)&lt;/p&gt;

&lt;p&gt;If you're interested in learning more about Ray and don't want to wait for me, there is a &lt;a href="https://github.com/ray-project/"&gt;great collection of tutorials in the project&lt;/a&gt;.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Well… same-ish. It's technically a bit more complicated because of the way the professors chose to run their labs, but if you look at the advisors you'll notice a lot of overlap. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Technically it's a bit more complicated, and Spark can use a hybrid of these two models. In some internal places (like its ALS implementation and other iterative algorithms), Spark uses distributed persistent storage for fault tolerance. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Spark Streaming is a bit different. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>kubernetes</category>
      <category>ray</category>
      <category>python</category>
    </item>
  </channel>
</rss>
