<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: João Ferrão</title>
    <description>The latest articles on DEV Community by João Ferrão (@joaoferrao).</description>
    <link>https://dev.to/joaoferrao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F624328%2Fe3d3e63f-c282-44c3-8591-84f5c29b14ae.png</url>
      <title>DEV Community: João Ferrão</title>
      <link>https://dev.to/joaoferrao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joaoferrao"/>
    <language>en</language>
    <item>
      <title>(Slightly) Quicker PySpark Tests</title>
      <dc:creator>João Ferrão</dc:creator>
      <pubDate>Fri, 11 Jun 2021 21:27:11 +0000</pubDate>
      <link>https://dev.to/mklabs/slightly-quick-pyspark-tests-46j3</link>
      <guid>https://dev.to/mklabs/slightly-quick-pyspark-tests-46j3</guid>
      <description>&lt;p&gt;First of all, let me start by stating this isn't a post about the usage of fancy new technology but rather about sharing how to slightly improve the timing of automated pipelines that require testing PySpark.&lt;/p&gt;

&lt;p&gt;Recently I needed to implement an automated workflow, used across multiple git repositories, which required &lt;code&gt;pyspark&lt;/code&gt; tests to run on every commit made to an open pull/merge request, as well as after merging (because, you know... we love tests).&lt;/p&gt;

&lt;p&gt;As usual, the team was in a rush to put together a lot of different pieces simultaneously but, in the spirit of Agile development, we wanted to make sure that (1) everybody understood how to run the tests and (2) the experience was the same whether running them on a local machine or in our CICD pipeline.&lt;/p&gt;

&lt;p&gt;Anyone with (Py)Spark experience will know how to handle its common mishaps, but newcomers to the technology often find it hard to identify issues such as: wrong Java version installed/selected, wrong Python version, a missing input test file, etc. If a small change is introduced to the business logic, it should be easy to know if (and where) the intended use of the logic was broken.&lt;/p&gt;

&lt;p&gt;Originally, we ended up with this very practical &lt;code&gt;Dockerfile&lt;/code&gt; and &lt;code&gt;docker-compose.yaml&lt;/code&gt;, which installed whatever &lt;code&gt;requirements.txt&lt;/code&gt; we placed in the root of the repo and ran all the tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; openjdk:8&lt;/span&gt;
&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; PYTHON_VERSION=3.7&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; PATH="/root/miniconda3/bin:${PATH}"&lt;/span&gt;

&lt;span class="c"&gt;# provision python with miniconda&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; /root/.conda &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; bash Miniconda3-latest-Linux-x86_64.sh &lt;span class="nt"&gt;-b&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; Miniconda3-latest-Linux-x86_64.sh &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; conda install python=$PYTHON_VERSION -y &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; conda init bash

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt; /root/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We placed it in &lt;code&gt;tests/Dockerfile&lt;/code&gt; and then the following &lt;code&gt;docker-compose.yaml&lt;/code&gt; in the root of the repo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.0"&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./tests/Dockerfile&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;PYTHON_VERSION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3.7&lt;/span&gt;
    &lt;span class="na"&gt;working_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app/&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/bash&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install pytest&lt;/span&gt;
        &lt;span class="s"&gt;pip install -r ./requirements.txt # including pyspark==2.4.2&lt;/span&gt;
        &lt;span class="s"&gt;pytest ./tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can just run &lt;code&gt;docker-compose up test&lt;/code&gt;, either locally or in the designated CICD pipeline step, and see the results. At the very least, there's no excuse for anyone to say they can't run the tests.&lt;/p&gt;
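&lt;p&gt;As an illustration, the CICD step can stay a one-liner. A hypothetical GitHub Actions job could look like the sketch below (the job name and triggers are placeholder assumptions, not our actual setup); the &lt;code&gt;--exit-code-from&lt;/code&gt; flag makes the job fail when the tests fail:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# hypothetical sketch: job name and triggers are placeholders
name: tests
on: [pull_request, push]

jobs:
  pyspark-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # same entrypoint as on a local machine
      - run: docker-compose up --exit-code-from test test
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;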

&lt;h2&gt;
  
  
  But waiting for this on every commit...?
&lt;/h2&gt;

&lt;p&gt;If you try to run all of this, you'll notice that just provisioning Java, Python and then PySpark itself takes somewhere between 2 and 4 minutes, depending on your CICD engine's muscle and network bandwidth. That doesn't seem very significant, but if you run this on every commit, and it is required to pass before a Pull Request can be merged, it might become relevant over time.&lt;/p&gt;

&lt;p&gt;For this reason, we &lt;a href="https://hub.docker.com/repository/docker/mklabsio/pyspark/general"&gt;created a public image&lt;/a&gt; on Docker Hub, which you can use in your project and &lt;strong&gt;which is by no means rocket science&lt;/strong&gt; but will hopefully be as convenient for you as it was for us.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Yeah, but there are &lt;code&gt;jupyter pyspark notebook&lt;/code&gt; images out there with a similar pre-installed stack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Right, right... But we wanted (1) something stripped of unneeded packages, (2) the ability to choose specific combinations of Python and PySpark versions per Docker image and (3) a smaller download footprint - our images are almost half the size when decompressed (1.7 GB vs 3.3 GB).&lt;/p&gt;

&lt;h2&gt;
  
  
  Slightly quicker
&lt;/h2&gt;

&lt;p&gt;The quick improvement is then to simply reference the &lt;code&gt;mklabsio/pyspark:py37-spark242&lt;/code&gt; image directly in your &lt;code&gt;docker-compose.yaml&lt;/code&gt;, get rid of the &lt;code&gt;Dockerfile&lt;/code&gt; and leave everything else as before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3.0"&lt;/span&gt;

&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
 &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="err"&gt;*&lt;/span&gt;&lt;span class="nv"&gt;*image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mklabsio/pyspark:py37-spark242**&lt;/span&gt;
    &lt;span class="na"&gt;working_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/app&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;.:/app/&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/bash&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;pip install pytest&lt;/span&gt;
        &lt;span class="s"&gt;pip install -r ./requirements.txt # including pyspark==2.4.2&lt;/span&gt;
        &lt;span class="s"&gt;pytest ./tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also took the opportunity to create a simple GitHub Actions workflow to build all the required combinations of Python, Java and PySpark.&lt;/p&gt;
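&lt;p&gt;A minimal sketch of such a workflow is shown below. The matrix values and the &lt;code&gt;SPARK_VERSION&lt;/code&gt; build argument are illustrative assumptions rather than our exact workflow, and the published tags drop the dots (e.g. &lt;code&gt;py37-spark242&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# illustrative sketch: matrix values and build args are assumptions
name: build-images
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.6", "3.7"]
        spark: ["2.4.2", "2.4.5"]
    steps:
      - uses: actions/checkout@v2
      # build and tag one image per Python/Spark combination
      - run: |
          docker build \
            --build-arg PYTHON_VERSION=${{ matrix.python }} \
            --build-arg SPARK_VERSION=${{ matrix.spark }} \
            -t mklabsio/pyspark:py${{ matrix.python }}-spark${{ matrix.spark }} .
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;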

&lt;p&gt;Hope this helps you in some way, and if you see room for improvement, let us know in the comments below or feel free to open an issue in the &lt;a href="https://github.com/mklabs-io/pyspark-docker"&gt;dedicated GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pyspark</category>
      <category>cicd</category>
      <category>testing</category>
    </item>
  </channel>
</rss>
