<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nicoda-27</title>
    <description>The latest articles on DEV Community by Nicoda-27 (@nda_27).</description>
    <link>https://dev.to/nda_27</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2866016%2F6f804060-53c1-482e-ad43-288a0ba75a99.jpg</url>
      <title>DEV Community: Nicoda-27</title>
      <link>https://dev.to/nda_27</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nda_27"/>
    <language>en</language>
    <item>
      <title>How to for developers: Mastering your corporate MacBook Setup</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 17 May 2025 09:51:05 +0000</pubDate>
      <link>https://dev.to/nda_27/how-to-for-developers-mastering-your-corporate-macbook-setup-5eoe</link>
      <guid>https://dev.to/nda_27/how-to-for-developers-mastering-your-corporate-macbook-setup-5eoe</guid>
      <description>&lt;p&gt;Starting with a fresh &lt;em&gt;MacBook&lt;/em&gt; can be exciting, but navigating corporate IT requirements can feel daunting. This article demystifies the process, offering a step-by-step guide to ensure a smooth and efficient setup that aligns with your company's policies and empowers your productivity from day one.&lt;/p&gt;

&lt;p&gt;This article focuses on the persona of a &lt;em&gt;Python&lt;/em&gt; developer, but the advice transfers to any developer.&lt;/p&gt;

&lt;h2&gt;
  
  
  A corporate &lt;em&gt;MacBook&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Depending on your company, the &lt;em&gt;MacBook&lt;/em&gt; provided to you as a developer workstation can pose certain difficulties compared with a privately owned &lt;em&gt;MacBook&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You will not have sudo (administrator) rights on your workstation&lt;/li&gt;
&lt;li&gt;A proxy may be enforced as company policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not having sudo rights is the main difficulty, and we will see how to work around it while still complying with your company policies: we will not hack the system, but leverage what macOS already allows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Homebrew
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://brew.sh/" rel="noopener noreferrer"&gt;&lt;em&gt;Homebrew&lt;/em&gt;&lt;/a&gt; is the go to for developer using &lt;em&gt;MacOs&lt;/em&gt; to be able to install applications. It's the equivalent of &lt;a href="https://documentation.ubuntu.com/server/how-to/software/package-management/index.html" rel="noopener noreferrer"&gt;&lt;em&gt;Aptitude&lt;/em&gt;&lt;/a&gt; in &lt;em&gt;Ubuntu&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The official installation method requires sudo. However, you can also install it for your specific user with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/homebrew &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/Homebrew/brew/tarball/master | &lt;span class="nb"&gt;tar &lt;/span&gt;xz &lt;span class="nt"&gt;--strip&lt;/span&gt; 1 &lt;span class="nt"&gt;-C&lt;/span&gt; homebrew
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a &lt;code&gt;homebrew&lt;/code&gt; directory in your &lt;code&gt;$HOME&lt;/code&gt;, where brew is installed.&lt;/p&gt;

&lt;p&gt;You can now add it to your PATH permanently. In the rest of this article, it's assumed that &lt;em&gt;zsh&lt;/em&gt; is your shell (feel free to adapt this for any other shell you prefer)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH=$HOME/homebrew/bin:$PATH'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
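&lt;p&gt;Note that a plain append adds a duplicate line every time you re-run it. If you expect to replay your setup steps, a guarded variant (a sketch, not required) keeps your &lt;code&gt;.zshrc&lt;/code&gt; clean:&lt;/p&gt;

```shell
# Append the export only if the exact line is not already present,
# so re-running the setup stays idempotent.
line='export PATH=$HOME/homebrew/bin:$PATH'
grep -qxF "$line" ~/.zshrc 2>/dev/null || echo "$line" >> ~/.zshrc
```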



&lt;p&gt;Let's check that &lt;em&gt;Homebrew&lt;/em&gt; is detected and works:&lt;/p&gt;

&lt;p&gt;The following command makes sure the above export takes effect in your current terminal&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following command should display a version number, proving that &lt;em&gt;Homebrew&lt;/em&gt; was properly installed and is usable on your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CLI tools
&lt;/h3&gt;

&lt;p&gt;Let's give it a try and install something&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now install any of your favorite developer tools, as long as they are packaged in &lt;em&gt;Homebrew&lt;/em&gt;. They are installed for your user only, without sudo permissions, yay!&lt;/p&gt;

&lt;h3&gt;
  
  
  Rich applications
&lt;/h3&gt;

&lt;p&gt;A rich application requires an Applications directory to be installed into; by default, brew chooses the &lt;code&gt;/Applications&lt;/code&gt; directory, which is located at the root of your file system and is not writable for you.&lt;/p&gt;

&lt;p&gt;For instance, let's try to install &lt;a href="https://github.com/MuhammedKalkan/OpenLens" rel="noopener noreferrer"&gt;&lt;em&gt;OpenLens&lt;/em&gt;&lt;/a&gt;, a developer tool that provides a UI to observe your &lt;em&gt;Kubernetes&lt;/em&gt; cluster; it's available as a cask in &lt;a href="https://formulae.brew.sh/cask/openlens" rel="noopener noreferrer"&gt;homebrew&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you try to install it with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;openlens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installation fails because &lt;em&gt;Homebrew&lt;/em&gt; cannot write to &lt;code&gt;/Applications&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead, let's create our own &lt;code&gt;Applications&lt;/code&gt; directory for our user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/Applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can leverage it when using &lt;em&gt;Homebrew&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;openlens &lt;span class="nt"&gt;--appdir&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/Applications
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now try to open &lt;em&gt;OpenLens&lt;/em&gt; as an application from the Spotlight Search.&lt;/p&gt;

&lt;p&gt;You can now install rich applications without sudo, yay!&lt;/p&gt;
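&lt;p&gt;If you don't want to pass the app directory on every install, Homebrew honors the &lt;code&gt;HOMEBREW_CASK_OPTS&lt;/code&gt; environment variable. A sketch of making it permanent (assuming the &lt;code&gt;$HOME/Applications&lt;/code&gt; directory created above):&lt;/p&gt;

```shell
# Apply --appdir to every cask install by default.
# Assumes you created $HOME/Applications as shown above.
echo 'export HOMEBREW_CASK_OPTS="--appdir=$HOME/Applications"' >> ~/.zshrc
```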

&lt;h2&gt;
  
  
  Rosetta
&lt;/h2&gt;

&lt;p&gt;There might be circumstances where you need a specific target architecture for your application to run (namely either ARM or Intel).&lt;/p&gt;

&lt;p&gt;The native architecture of &lt;em&gt;M1&lt;/em&gt;, &lt;em&gt;M2&lt;/em&gt;, &lt;em&gt;M3&lt;/em&gt; and &lt;em&gt;M4&lt;/em&gt; MacBooks is &lt;em&gt;arm&lt;/em&gt;, but &lt;em&gt;Rosetta&lt;/em&gt; provides a way to emulate the &lt;em&gt;x86_64&lt;/em&gt; architecture. It's very useful when libraries have been compiled for only one specific architecture; x86_64 being older, libraries are more likely to be available for it than for &lt;em&gt;arm&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;You can define aliases to switch Rosetta emulation on or off. Add these two aliases to your .zshrc&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;arm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"env /usr/bin/arch -arm64 /bin/zsh --login"&lt;/span&gt;
&lt;span class="nb"&gt;alias &lt;/span&gt;&lt;span class="nv"&gt;intel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"env /usr/bin/arch -x86_64 /bin/zsh --login"&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check which target you are on, you can type&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;arch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should display either &lt;code&gt;i386&lt;/code&gt; (x86_64/Intel architecture) or &lt;code&gt;arm64&lt;/code&gt; (arm/M* architecture)&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing language specific tools
&lt;/h2&gt;

&lt;p&gt;To help you manage different versions of python, node, awscli, or cargo, the usage of &lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;mise&lt;/a&gt; will be demonstrated.&lt;/p&gt;

&lt;p&gt;We will install two versions of mise so it can handle environments with both architectures, as advised &lt;a href="https://mise.jdx.dev/tips-and-tricks.html#macos-rosetta" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just follow the installation guide linked above, and you should have mise for x64 available with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise-x64 &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you can also have mise available for &lt;em&gt;arm&lt;/em&gt; if you're sure you don't need &lt;em&gt;x86_64&lt;/em&gt; specifics.&lt;/p&gt;

&lt;p&gt;Just follow the usual installation with &lt;em&gt;Homebrew&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;mise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now install &lt;em&gt;python&lt;/em&gt;, &lt;em&gt;node&lt;/em&gt; or many &lt;a href="https://mise.jdx.dev/registry.html#tools" rel="noopener noreferrer"&gt;tools&lt;/a&gt; with &lt;em&gt;mise&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The following command will install python for the specific &lt;em&gt;x86_64&lt;/em&gt; target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise-x64 use python@3.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Containerization tools
&lt;/h2&gt;

&lt;p&gt;When it comes to containerization tools, there are multiple candidates.&lt;/p&gt;

&lt;p&gt;The most used is &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;&lt;em&gt;docker&lt;/em&gt;&lt;/a&gt;, but open-source alternatives like &lt;a href="https://podman-desktop.io/" rel="noopener noreferrer"&gt;&lt;em&gt;podman&lt;/em&gt;&lt;/a&gt; also exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker
&lt;/h3&gt;

&lt;p&gt;You will need support from your IT team to install Docker, as it requires sudo access.&lt;/p&gt;

&lt;h3&gt;
  
  
  Podman
&lt;/h3&gt;

&lt;p&gt;Podman can act as a drop-in replacement for Docker but does not require elevated privileges. You can easily install it with &lt;em&gt;Homebrew&lt;/em&gt;, or follow the &lt;a href="https://podman-desktop.io/docs/installation/macos-install" rel="noopener noreferrer"&gt;guide&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;podman
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can alias podman to docker to make it even more transparent. Add the alias in your &lt;code&gt;.zshrc&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alias docker=podman
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the podman cli is compliant with the docker cli, it should work transparently.&lt;/p&gt;

&lt;p&gt;Give it a try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run hello-world
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and it should be working. Note that before the first run, you will have to run the following commands to initialise podman:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;podman machine init
podman machine start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is a dedicated &lt;a href="https://podman-desktop.io/docs/troubleshooting/troubleshooting-podman-on-macos" rel="noopener noreferrer"&gt;troubleshooting&lt;/a&gt; section to help you out.&lt;/p&gt;

&lt;p&gt;Disclaimer:&lt;br&gt;
Some tools like &lt;a href="https://github.com/testcontainers/testcontainers-python" rel="noopener noreferrer"&gt;testcontainers&lt;/a&gt; rely entirely on real docker, so you will have issues using podman as a replacement.&lt;/p&gt;

&lt;p&gt;This can be solved thanks to &lt;a href="https://podman-desktop.io/docs/migrating-from-docker/customizing-docker-compatibility" rel="noopener noreferrer"&gt;podman-mac-helper&lt;/a&gt;, but it requires elevated privileges from your IT team.&lt;/p&gt;
&lt;h2&gt;
  
  
  Proxy and certificates
&lt;/h2&gt;

&lt;p&gt;Your company might put in place a proxy that is using a certificate. &lt;/p&gt;

&lt;p&gt;When using python tools like &lt;a href="https://learn.microsoft.com/en-us/cli/azure/?view=azure-cli-latest" rel="noopener noreferrer"&gt;az-cli&lt;/a&gt;, you may be blocked and stumble upon issues like &lt;a href="https://learn.microsoft.com/en-us/cli/azure/use-azure-cli-successfully-troubleshooting?view=azure-cli-latest#work-behind-a-proxy" rel="noopener noreferrer"&gt;this one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To solve it, as documented, you will need to set an environment variable called &lt;code&gt;REQUESTS_CA_BUNDLE&lt;/code&gt; pointing to the path of your certificate.&lt;/p&gt;

&lt;p&gt;The same applies when installing dependencies, or whenever a python tool requires internet access.&lt;/p&gt;
&lt;h3&gt;
  
  
  Finding the certificate
&lt;/h3&gt;

&lt;p&gt;You first need the name of the certificate so you can export it. Open the Keychain Access application, go to the certificates section, open the certificate added by your proxy team, and copy its "Common Name". Let's assume the certificate is called "corporate_proxy_certificate".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security find-certificate &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"corporate_proxy_certificate"&lt;/span&gt; /Library/Keychains/System.keychain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should display the raw certificate value. If it does not, look in the other usual certificate locations; it can also be under &lt;code&gt;/System/Library/Keychains/SystemRootCertificates.keychain&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Export the certificate
&lt;/h3&gt;

&lt;p&gt;Now you can export it to a pem file like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;security find-certificate &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"corporate_proxy_certificate"&lt;/span&gt; /Library/Keychains/System.keychain &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;/corporate_proxy_certificate.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the content of the .pem file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="nv"&gt;$HOME&lt;/span&gt;/corporate_proxy_certificate.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Leverage the certificate
&lt;/h3&gt;

&lt;p&gt;You can now export the environment variable and add it in the &lt;code&gt;.zshrc&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export REQUESTS_CA_BUNDLE=/$HOME/corporate_proxy_certificate.pem'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now, under the proxy any call to &lt;code&gt;az&lt;/code&gt; cli tool should be working like a charm.&lt;/p&gt;

&lt;p&gt;It also means that when the proxy is disabled, you will need to unset &lt;code&gt;REQUESTS_CA_BUNDLE&lt;/code&gt; (since it's set in your &lt;em&gt;.zshrc&lt;/em&gt;) like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;unset &lt;/span&gt;REQUESTS_CA_BUNDLE
az login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
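&lt;p&gt;To avoid editing your shell state by hand each time you move on or off the proxy, you could wrap both states in small functions (hypothetical helpers, named &lt;code&gt;proxy_on&lt;/code&gt;/&lt;code&gt;proxy_off&lt;/code&gt; here, to put in your &lt;code&gt;.zshrc&lt;/code&gt;):&lt;/p&gt;

```shell
# Hypothetical .zshrc helpers to toggle the proxy certificate bundle.
proxy_on()  { export REQUESTS_CA_BUNDLE="$HOME/corporate_proxy_certificate.pem"; }
proxy_off() { unset REQUESTS_CA_BUNDLE; }
```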



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this article helps you navigate corporate compliance while maintaining high developer productivity.&lt;/p&gt;

&lt;p&gt;Don't hesitate to reach out in the comments if further explanation is necessary, or if topics are missing.&lt;/p&gt;

</description>
      <category>macos</category>
      <category>development</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 5: Leverage spark in a container</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 15 Mar 2025 07:33:58 +0000</pubDate>
      <link>https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74</link>
      <guid>https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74</guid>
      <description>&lt;p&gt;This goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials; the previous chapters can be found here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;chapter 3&lt;/a&gt;, it was demonstrated that the current testing approach relies on &lt;em&gt;Java&lt;/em&gt; being available on the developer's setup. As mentioned, this is not ideal, as there is limited control and unexpected behavior can happen. A good testing practice is to have reproducible and idempotent tests, which means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launching the tests an infinite number of times should always have the same results&lt;/li&gt;
&lt;li&gt;A test should leave a clean slate after it has run; there should be no side effects from a test running (no files written, no changed environment variables, no leftover data in a database, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is so important because otherwise you will spend most of your time relaunching the tests due to false positives; you can never be sure whether you actually broke something or whether the test is failing randomly. In the end, you will no longer trust the tests and will skip some of them, which defeats their purpose.&lt;/p&gt;
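&lt;p&gt;As a small, language-agnostic illustration of the clean-up requirement above, a shell script can guarantee its teardown with a trap, regardless of whether the commands in between fail (a sketch; the same idea is expressed in pytest with yield fixtures):&lt;/p&gt;

```shell
# Create a scratch directory and guarantee its removal on exit,
# even if a command below fails: the trap is the teardown step.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
echo "test data" > "$workdir/sample.txt"
grep -q "test data" "$workdir/sample.txt"
```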

&lt;h2&gt;
  
  
  Why use a container?
&lt;/h2&gt;

&lt;p&gt;If you are unfamiliar with the concept of containers and docker images, I suggest you have a look at &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;docker&lt;/a&gt;. It will be leveraged here to start the &lt;em&gt;Spark&lt;/em&gt; server for the tests; it's important to mention that there are other open-source alternatives like &lt;a href="https://podman.io/" rel="noopener noreferrer"&gt;podman&lt;/a&gt; or &lt;a href="https://github.com/containerd/nerdctl" rel="noopener noreferrer"&gt;nerdctl&lt;/a&gt; for containerization.&lt;/p&gt;

&lt;p&gt;Docker will be used hereafter, as it has become the de facto standard for most companies, and it's available in the &lt;em&gt;GitHub&lt;/em&gt; CI runner. It will be assumed that you have enough knowledge of the technology to use it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container with spark connect
&lt;/h2&gt;

&lt;p&gt;There is a small subtlety that needs to be understood. Previously, the &lt;em&gt;Java Virtual Machine (JVM)&lt;/em&gt; was used to communicate with the python spark implementation (through the &lt;code&gt;spark_session&lt;/code&gt;); the java binary was used to create a swarm of workers that handled the data processing. At the end, all the results were collected and communicated to the &lt;code&gt;spark_session&lt;/code&gt;, which exposed them to the python code.&lt;/p&gt;

&lt;p&gt;If you start a container with this, the &lt;code&gt;spark_session&lt;/code&gt; will never be able to find the &lt;em&gt;JVM&lt;/em&gt; inside the container, as it's a binary. The container you want to create needs a way to communicate with the outside &lt;code&gt;spark_session&lt;/code&gt; through the network. Luckily, &lt;a href="https://spark.apache.org/docs/3.5.3/spark-connect-overview.html" rel="noopener noreferrer"&gt;&lt;em&gt;Spark&lt;/em&gt; Connect&lt;/a&gt; provides a solution, and its documentation is a must-read. This is the chosen approach to containerize the &lt;em&gt;Spark&lt;/em&gt; server and the worker creation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Spark&lt;/em&gt; is already providing a docker &lt;a href="https://hub.docker.com/r/apache/spark" rel="noopener noreferrer"&gt;image&lt;/a&gt; that you will leverage. If you don't have docker available on your setup, you will need to install it, see the official &lt;a href="https://docs.docker.com/engine/install/ubuntu/#installation-methods" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's uninstall &lt;code&gt;openjdk&lt;/code&gt; to make sure &lt;code&gt;spark_session&lt;/code&gt; will use the new setup; this requires elevated privileges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt-get autoremove openjdk-8-jre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now relaunch the tests; they are expected to fail with the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR tests/test_minimal_transfo.py::test_minimal_transfo - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
ERROR tests/test_minimal_transfo.py::test_transfo_w_synthetic_data - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Start the container
&lt;/h2&gt;

&lt;p&gt;You will need to start the container with Spark Connect; you can launch&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 8081:8081 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;SPARK_NO_DAEMONIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True &lt;span class="nt"&gt;--name&lt;/span&gt; spark_connect apache/spark /opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master &lt;span class="nt"&gt;--packages&lt;/span&gt; org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.driver.extraJavaOptions&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'-Divy.cache.dir=/tmp -Divy.home=/tmp'&lt;/span&gt; &lt;span class="nt"&gt;--conf&lt;/span&gt; spark.connect.grpc.binding.port&lt;span class="o"&gt;=&lt;/span&gt;8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will print a lot of output to the terminal, and at the end you should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;24/12/27 14:04:27 INFO SparkConnectServer: Spark Connect server started at: 0:0:0:0:0:0:0:0%0:8081
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows that the &lt;em&gt;Spark&lt;/em&gt; server is up and running.&lt;/p&gt;

&lt;p&gt;Each argument in the above command has a meaning and importance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;docker run&lt;/code&gt; is the docker command to start a container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-p 8081:8081&lt;/code&gt; is an argument to &lt;code&gt;docker run&lt;/code&gt; that maps port 8081 of the container to port 8081 of the host&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-e SPARK_NO_DAEMONIZE=True&lt;/code&gt; is an environment variable passed at container creation; it's necessary for the server to run as a foreground process&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--name spark_connect&lt;/code&gt; names the created container&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;apache/spark&lt;/code&gt; is the docker image that is used; if you have never used it, it will be downloaded from &lt;a href="https://hub.docker.com/r/apache/spark" rel="noopener noreferrer"&gt;&lt;em&gt;Docker Hub&lt;/em&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of the command is what is called an &lt;a href="https://docs.docker.com/reference/dockerfile/#entrypoint" rel="noopener noreferrer"&gt;entrypoint&lt;/a&gt;; it's the command that will be executed inside the container. Here it contains multiple elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/opt/spark/sbin/start-connect-server.sh&lt;/code&gt; is the script that starts the spark connect server&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;org.apache.spark.deploy.master.Master&lt;/code&gt; is an argument to the script; here it is asked to deploy a Master server, and the same script can be used to deploy a Worker&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0&lt;/code&gt; is an optional argument to pin specific versions of the spark and delta dependencies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--conf spark.driver.extraJavaOptions='-Divy.cache.dir=/tmp -Divy.home=/tmp'&lt;/code&gt; is an extra argument asking the server to write its caches to &lt;code&gt;/tmp&lt;/code&gt; inside the container; it's not a mandatory argument&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--conf spark.connect.grpc.binding.port=8081&lt;/code&gt; is an extra argument to start the server on port 8081 inside the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last argument is where the magic happens: the server is started on port 8081, and docker exposes this container port on the docker host. Meaning, a spark connect server is now reachable at &lt;code&gt;localhost:8081&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use the container
&lt;/h2&gt;

&lt;p&gt;Keep the previous terminal opened to keep the server running and open a new terminal. Now run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_transfo_w_synthetic_data &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same error should appear; indeed, the &lt;code&gt;spark_session&lt;/code&gt; fixture needs to be adapted to connect to the server you have just created. In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_5/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://localhost:8081&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# type: ignore
&lt;/span&gt;        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basically, it indicates the &lt;em&gt;Spark&lt;/em&gt; connect server &lt;em&gt;url&lt;/em&gt; to the &lt;em&gt;Spark&lt;/em&gt; session.&lt;/p&gt;

&lt;p&gt;You also need to add an extra dependency, which is mandatory to communicate with the spark connect server. It's worth pointing out the usage of &lt;a href="https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies" rel="noopener noreferrer"&gt;extras&lt;/a&gt; in uv:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pyspark &lt;span class="nt"&gt;--extra&lt;/span&gt; connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As this project uses &lt;em&gt;Python&lt;/em&gt; 3.12, another error will appear related to &lt;a href="https://stackoverflow.com/questions/69919970/no-module-named-distutils-util-but-distutils-installed/76691103#76691103" rel="noopener noreferrer"&gt;distutils&lt;/a&gt;, as it was removed from the latest python versions, yet some dependencies still require it. You will have to add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add setuptools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_minimal_transfo &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It should run successfully, and you should also see logs from the &lt;em&gt;Spark&lt;/em&gt; server in the &lt;code&gt;docker run&lt;/code&gt; terminal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improve the container usage
&lt;/h2&gt;

&lt;p&gt;As mentioned at the beginning of this chapter, the tests need to leave a clean slate. With the previous approach, a container is still running even though the tests are done, which is not ideal.&lt;/p&gt;

&lt;p&gt;To improve this, you will leverage &lt;a href="https://github.com/testcontainers/testcontainers-python" rel="noopener noreferrer"&gt;testcontainers&lt;/a&gt;, which makes it easy to create and remove Docker containers at the test level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add testcontainers &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the container can be started at the session-fixture level. In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_5/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;, you can add an extra fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testcontainers.core.container&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DockerContainer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;testcontainers.core.waiting_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wait_for_logs&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_connect_start&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entrypoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/opt/spark/sbin/start-connect-server.sh org.apache.spark.deploy.master.Master --packages org.apache.spark:spark-connect_2.12:3.5.2,io.delta:delta-core_2.12:2.3.0 --conf spark.driver.extraJavaOptions=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-Divy.cache.dir=/tmp -Divy.home=/tmp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; --conf spark.connect.grpc.binding.port=8081&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;DockerContainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apache/spark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_bind_ports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_env&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_NO_DAEMONIZE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_kwargs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wait_for_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SparkConnectServer: Spark Connect server started at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a container with the arguments described above; the great thing about fixtures is that the container is killed at the end of the test session. There is an extra step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;        &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;wait_for_logs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SparkConnectServer: Spark Connect server started at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the container is yielded only once &lt;code&gt;SparkConnectServer: Spark Connect server started at&lt;/code&gt; has appeared in the container logs; it's necessary to wait for the server to be ready before it can be called.&lt;/p&gt;
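&lt;p&gt;The readiness check is essentially a polling loop. As a dependency-free illustration (a hypothetical &lt;code&gt;wait_until&lt;/code&gt; helper, not the actual &lt;em&gt;testcontainers&lt;/em&gt; API), the same pattern can be sketched in plain &lt;em&gt;Python&lt;/em&gt;:&lt;/p&gt;

```python
import time


def wait_until(predicate, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    A simplified sketch of what `wait_for_logs` does: repeatedly check a
    readiness condition (here an arbitrary callable, there the container
    logs) before letting the test proceed.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout} seconds")


# Example: a fake log buffer that becomes "ready" after a few polls
logs: list[str] = []
ticks = iter(range(10))


def server_started() -> bool:
    logs.append(f"boot step {next(ticks)}")
    return "boot step 3" in logs[-1]


assert wait_until(server_started, timeout=5.0, interval=0.01)
```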

&lt;p&gt;The yielded value is the container, which also exposes the server host; you need to reuse it in the &lt;code&gt;spark_session&lt;/code&gt; fixture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_connect_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DockerContainer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_connect_start&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_container_host_ip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:8081&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# type: ignore
&lt;/span&gt;        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now stop the container you started earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop spark_connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And run the tests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will notice that all the tests pass, and at the end of the test session there are no running containers.&lt;/p&gt;

&lt;p&gt;The following command lists the containers that are still running; the &lt;em&gt;Spark&lt;/em&gt; container should not appear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps &lt;span class="nt"&gt;-a&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You can now run local tests using &lt;em&gt;Spark&lt;/em&gt;, iterate quickly on your codebase and implement new features. You no longer depend on a Spark server being launched for you in the cloud, nor on waiting for it to process the data for you.&lt;/p&gt;

&lt;p&gt;The feedback loop is quicker, you no longer pay a cloud provider for testing purposes, and you offer developers an easy setup to iterate on your project.&lt;/p&gt;

&lt;p&gt;They can simply launch &lt;code&gt;pytest&lt;/code&gt; and everything is transparent; this also means less documentation for you to write describing the expected developer setup.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_5" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_4" rel="noopener noreferrer"&gt;Chapter 4&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_5" rel="noopener noreferrer"&gt;Chapter 5&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;Several ideas come to mind on how to improve our very small codebase:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage &lt;a href="https://containers.dev/" rel="noopener noreferrer"&gt;devcontainer&lt;/a&gt; to improve CI and local development&lt;/li&gt;
&lt;li&gt;Templatize the repository for easier reuse with the help of &lt;a href="https://github.com/ffizer/ffizer" rel="noopener noreferrer"&gt;ffizer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Explore &lt;a href="https://github.com/ibis-project/ibis?tab=readme-ov-file" rel="noopener noreferrer"&gt;ibis&lt;/a&gt; to handle multiple transformation backends transparently&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pyspark</category>
      <category>python</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 4 - Leaning into Property Based Testing</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sun, 09 Mar 2025 08:38:56 +0000</pubDate>
      <link>https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln</link>
      <guid>https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln</guid>
      <description>&lt;p&gt;This goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials; the previous chapters can be found here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The test that you implemented in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/tutorials/chapter_3_spark_test.md" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt; is great, yet incomplete, as it covers only a small amount of data. Since &lt;em&gt;Spark&lt;/em&gt; is used to process data at scale, you have to test at scale too.&lt;/p&gt;

&lt;p&gt;There are several solutions. The first is to take a snapshot of production data and reuse it at the test level (integration or local test). The second is to generate synthetic data based on the data schema, which leans into a property-based testing approach.&lt;/p&gt;

&lt;p&gt;The second approach is leveraged here, since it delegates test case creation to automated generation.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Python&lt;/em&gt; ecosystem provides &lt;a href="https://hypothesis.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;&lt;em&gt;Hypothesis&lt;/em&gt;&lt;/a&gt; for proper property-based testing and &lt;a href="https://faker.readthedocs.io/en/master/" rel="noopener noreferrer"&gt;&lt;em&gt;Faker&lt;/em&gt;&lt;/a&gt; for fake data generation. &lt;em&gt;Hypothesis&lt;/em&gt; is more powerful than &lt;em&gt;Faker&lt;/em&gt; in that it generates test cases for you based on data properties (being a string, being an integer, etc.) and shrinks the test cases when unexpected behavior happens. &lt;em&gt;Faker&lt;/em&gt; will be used here to generate synthetic data based on business properties.&lt;/p&gt;

&lt;h2&gt;
  
  
  A data driven test
&lt;/h2&gt;

&lt;p&gt;You need two new fixtures, similar to &lt;code&gt;persons&lt;/code&gt; and &lt;code&gt;employments&lt;/code&gt;, that will generate synthetic data. First, install &lt;em&gt;Faker&lt;/em&gt; as a dev dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add faker &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create &lt;code&gt;persons_synthetic&lt;/code&gt; in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;nb_elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pyint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nb_elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above, a data frame of up to 100,000 rows is generated; feel free to increase the bound to generate larger data frames. Fake names, surnames and dates are generated on the fly according to business needs.&lt;/p&gt;

&lt;p&gt;You can also create &lt;code&gt;employments_synthetic&lt;/code&gt; in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;; its foreign keys must reference ids from &lt;code&gt;persons_synthetic&lt;/code&gt;, which needs to be handled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;fake&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Faker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;persons_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;person_ids_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons_sample&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;collect_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_fk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;job&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id_fk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person_ids_sample&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The foreign keys are reused from a sample of &lt;code&gt;persons_synthetic&lt;/code&gt;, and job names are generated on the fly.&lt;/p&gt;
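&lt;p&gt;Stripped of &lt;em&gt;Spark&lt;/em&gt; and &lt;em&gt;Faker&lt;/em&gt;, the foreign-key handling boils down to sampling ids from the parent rows and enumerating child rows over them. A minimal standard-library sketch (hypothetical names, for illustration only):&lt;/p&gt;

```python
import random

random.seed(42)  # deterministic output for the example

# Parent rows: (id, name) tuples standing in for persons_synthetic
persons = [(i, f"person_{i}") for i in range(100)]

# Keep roughly 80% of the parent ids, mimicking persons_synthetic.sample(0.8)
person_ids_sample = [pid for pid, _ in persons if random.random() < 0.8]

# Child rows: (id, person_fk, job), each foreign key drawn from the sample
employments = [
    (idx, id_fk, f"job_{idx}") for idx, id_fk in enumerate(person_ids_sample)
]

# Every foreign key refers to an existing parent id
parent_ids = {pid for pid, _ in persons}
assert all(fk in parent_ids for _, fk, _ in employments)
```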

&lt;p&gt;The test can now be created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfo_w_synthetic_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now launch &lt;code&gt;pytest -k test_transfo_w_synthetic_data -s&lt;/code&gt;, which should pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to handle slow tests
&lt;/h2&gt;

&lt;p&gt;You might notice that &lt;code&gt;test_transfo_w_synthetic_data&lt;/code&gt; is a bit slow; indeed, it generates a decent amount of data (though far from big data scale), modifies the data frames and joins the two together.&lt;/p&gt;

&lt;p&gt;In a test driven approach, it's necessary to have a quick feedback loop to iterate quickly on your local setup. Yet these tests need to be launched anyway, as they validate behavior with a decent amount of data.&lt;/p&gt;

&lt;p&gt;A solution is to add markers to tests like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.slow&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_transfo_w_synthetic_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;persons_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments_synthetic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This marker can be leveraged by pytest to filter tests at execution time; see the &lt;a href="https://docs.pytest.org/en/stable/example/markers.html#mark-examples" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then add the expected markers for &lt;em&gt;Pytest&lt;/em&gt; to &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_4/pyproject.toml" rel="noopener noreferrer"&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.pytest.ini_options]&lt;/span&gt;
&lt;span class="py"&gt;pythonpath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;markers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"slow"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Pytest&lt;/em&gt; is now aware of this new marker, as you can verify by launching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;--markers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"not slow"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will run only the tests not marked as slow.&lt;/p&gt;

&lt;p&gt;In the CI, there is nothing to change, as &lt;em&gt;Pytest&lt;/em&gt; launches all the tests by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;The next chapter will focus on test repeatability by improving how Java is used for &lt;em&gt;Spark&lt;/em&gt; at the test level.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>python</category>
      <category>pyspark</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 3 - First Spark test</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 01 Mar 2025 08:01:00 +0000</pubDate>
      <link>https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le</link>
      <guid>https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le</guid>
      <description>&lt;p&gt;This goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials; the previous chapters can be found here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Chapter 3: Implement a first test with &lt;em&gt;spark&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;This chapter will focus on implementing a first &lt;em&gt;spark&lt;/em&gt; data manipulation with an associated test. It will go through the issues that will be encountered and how to solve them.&lt;/p&gt;

&lt;h3&gt;
  
  
  The data
&lt;/h3&gt;

&lt;p&gt;A dummy use case is used to demonstrate the workflow.&lt;/p&gt;

&lt;p&gt;The scenario is that the production data consists of two tables, &lt;code&gt;persons&lt;/code&gt; and &lt;code&gt;employments&lt;/code&gt;, with the following schemas and data types. Here is a sample of the data.&lt;/p&gt;

&lt;h4&gt;
  
  
  Persons
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id: int&lt;/th&gt;
&lt;th&gt;PersonalityName: str&lt;/th&gt;
&lt;th&gt;PersonalitySurname: str&lt;/th&gt;
&lt;th&gt;birth: datetime(str)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;George&lt;/td&gt;
&lt;td&gt;Washington&lt;/td&gt;
&lt;td&gt;1732-02-22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Henry&lt;/td&gt;
&lt;td&gt;Ford&lt;/td&gt;
&lt;td&gt;1863-06-30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Benjamin&lt;/td&gt;
&lt;td&gt;Franklin&lt;/td&gt;
&lt;td&gt;1706-01-17&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Martin&lt;/td&gt;
&lt;td&gt;Luther King Jr.&lt;/td&gt;
&lt;td&gt;1929-01-15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Employments
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id: int&lt;/th&gt;
&lt;th&gt;person_fk: int&lt;/th&gt;
&lt;th&gt;Employment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;president&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;industrialist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;inventor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;minister&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal is to rename the columns and to join the two tables. The data here is just a sample; it would be overkill to use &lt;em&gt;spark&lt;/em&gt; to process so little data. Yet, in a big data context, you have to foresee that the data will contain far more rows and require more complex joins. The sample is here purely as a demonstration.&lt;/p&gt;

&lt;h3&gt;
  
  
  The dummy test
&lt;/h3&gt;

&lt;p&gt;First, you need to add the &lt;em&gt;spark&lt;/em&gt; dependency:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before diving into the implementation, you need to make sure you can reproduce a very simple use case. It's not worth attempting complex data manipulation if you cannot reproduce a simple documentation snippet.&lt;/p&gt;

&lt;p&gt;You will write your first test, &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/test_minimal_transfo.py" rel="noopener noreferrer"&gt;&lt;code&gt;test_minimal_transfo.py&lt;/code&gt;&lt;/a&gt;. First, you will use &lt;em&gt;pyspark&lt;/em&gt; to do a simple data frame creation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part creates or fetches a local &lt;em&gt;spark&lt;/em&gt; session; the second part leverages the session to create a data frame.&lt;/p&gt;

&lt;p&gt;Then you can launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_minimal_transfo &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have a minimal developer setup, this should fail: &lt;em&gt;pyspark&lt;/em&gt; relies on &lt;em&gt;Java&lt;/em&gt;, which you might be missing, and the following error will be displayed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FAILED tests/test_minimal_transfo.py::test_minimal_transfo - pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's a bit annoying, because you need to have &lt;em&gt;Java&lt;/em&gt; installed on your &lt;em&gt;dev&lt;/em&gt; setup, the CI setup and all your collaborators' setups. In future chapters, a better alternative will be described.&lt;/p&gt;

&lt;p&gt;There are different flavors of &lt;em&gt;Java&lt;/em&gt;; you can simply install the &lt;a href="https://openjdk.org/" rel="noopener noreferrer"&gt;&lt;em&gt;openjdk&lt;/em&gt;&lt;/a&gt; one. Installing it requires elevated privileges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-8-jre
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now relaunch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest &lt;span class="nt"&gt;-k&lt;/span&gt; test_minimal_transfo &lt;span class="nt"&gt;-s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and it should display&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+----+                                                                     
|col1|col2|
+----+----+
|   3|   4|
|   1|   2|
+----+----+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a small victory, but you can now use a local &lt;em&gt;spark&lt;/em&gt; session to manage data frames, yay!&lt;/p&gt;

&lt;h2&gt;
  
  
  The real test case - version 0
&lt;/h2&gt;

&lt;p&gt;The previous sample shows that the &lt;code&gt;spark session&lt;/code&gt; plays a pivotal role: it will be instantiated differently in the test context than in the production context.&lt;/p&gt;

&lt;p&gt;This means you can leverage a &lt;em&gt;pytest&lt;/em&gt; fixture that is reused by all the tests; it can be created at the session scope so there is only one spark session for the whole test suite. In other words, you can create a &lt;code&gt;tests/conftest.py&lt;/code&gt; to factorize common behavior. If you are not familiar with &lt;em&gt;pytest&lt;/em&gt; and fixtures, it's advised to have a look at the &lt;a href="https://docs.pytest.org/en/6.2.x/fixture.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;master&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Testing PySpark Example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, it can be reused in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/test_minimal_transfo.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/test_minimal_transfo.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can again run &lt;code&gt;pytest -k test_minimal_transfo -s&lt;/code&gt; to check the behavior has not changed. It's important in a test driven approach to keep launching the tests after code modification to ensure nothing was broken.&lt;/p&gt;

&lt;p&gt;To be closer to the business context, you can implement a data transformation object, with a clear separation between data generation and data transformation. You can do so in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/src/pyspark_tdd/data_processor.py" rel="noopener noreferrer"&gt;&lt;code&gt;src/pyspark_tdd/data_processor.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nb"&gt;NotImplementedError&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that there is a prototype for &lt;code&gt;DataProcessor&lt;/code&gt;, the test can be improved to actually assert on the output, like so in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/test_minimal_transfo.py" rel="noopener noreferrer"&gt;&lt;code&gt;test_minimal_transfo.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark_tdd.data_processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataProcessor&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;persons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;George&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Washington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1732-02-22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Henry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1863-06-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Benjamin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Franklin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1706-01-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Martin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luther King Jr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1929-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;employments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industrialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minister&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The example above ensures that the output data frame fits some criteria, but it will raise a &lt;code&gt;NotImplementedError&lt;/code&gt; because the actual data processing is yet to be implemented. This is intended: the actual processing code can be written once the testing is properly set up.&lt;/p&gt;

&lt;p&gt;The test is still not ideal, as the test data generation is part of the test itself. &lt;em&gt;Pytest&lt;/em&gt; &lt;a href="https://docs.pytest.org/en/stable/how-to/parametrize.html" rel="noopener noreferrer"&gt;parametrization&lt;/a&gt; can be leveraged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt; 

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.mark.parametrize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;persons,employments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;George&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Washington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1732-02-22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Henry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1863-06-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Benjamin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Franklin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1706-01-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Martin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luther King Jr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1929-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industrialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minister&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;persons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;employments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The example above shows how test case generation can be separated from the test run itself. It makes clear at first glance what the test is about, without the noise of the test data. Since the test data frames could most likely be reused in other tests, the code can be refactored once more. The test part becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark_tdd.data_processor&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataProcessor&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_minimal_transfo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEmpty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and two fixtures &lt;code&gt;persons&lt;/code&gt; and &lt;code&gt;employments&lt;/code&gt; are created in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_3/tests/conftest.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/conftest.py&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;George&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Washington&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1732-02-22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Henry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ford&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1863-06-30&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Benjamin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Franklin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1706-01-17&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Martin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Luther King Jr.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1929-01-15&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Generator&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;president&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industrialist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inventor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minister&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now relaunch &lt;code&gt;pytest -k test_minimal_transfo -s&lt;/code&gt; and notice that the &lt;code&gt;NotImplementedError&lt;/code&gt; is still raised, which is a good thing: the code has changed three times, yet the behavior remains the same, and the tests confirm it.&lt;/p&gt;
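As a reminder, at this stage the processor is still only a stub whose `run` raises `NotImplementedError`. A minimal sketch of such a stub (hypothetical, for illustration; the exact stub in the repository may differ) looks like:

```python
# Hypothetical stub, shown for illustration: DataProcessor before any
# transformation logic exists. run() raises NotImplementedError so the
# tests exercise the interface without asserting on real output yet.
class DataProcessor:
    def __init__(self, spark_session):
        self.spark_session = spark_session

    def run(self, persons, employments):
        # No transformation implemented yet: callers fail loudly here,
        # which is exactly what the test observes at this point.
        raise NotImplementedError
```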

&lt;h2&gt;
  
  
  The real test case - version 1
&lt;/h2&gt;

&lt;p&gt;Now that proper testing is in place, the source code can be implemented. There could be many variations of this; the intent here is not to provide the best source code, but the best way to test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;to_date&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persons_rename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalityName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PersonalitySurname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;surname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employments_rename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;persons&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;birth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;withColumnsRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colsMap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;persons_rename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;employments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;colsMap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employments_rename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;joined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;persons&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;employments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;person_fk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;joined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;joined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person_fk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;joined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you rerun &lt;code&gt;pytest -k test_minimal_transfo -s&lt;/code&gt;, the test now passes.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about ci?
&lt;/h3&gt;

&lt;p&gt;A strong dependency on &lt;em&gt;Java&lt;/em&gt; is now in place: running the tests in ci depends on whether the ci runner has &lt;em&gt;Java&lt;/em&gt; installed. This is an issue because it requires the developer to maintain a &lt;em&gt;dev&lt;/em&gt; setup outside of the &lt;em&gt;Python&lt;/em&gt; ecosystem, adding extra steps for anyone who wants to launch the tests.&lt;/p&gt;

&lt;p&gt;Keep in mind that there is limited control over the developer setup: what if the &lt;em&gt;Java&lt;/em&gt; version already installed is not compatible with &lt;em&gt;Spark&lt;/em&gt;? It will then be frustrating for the developer to investigate and most likely install another &lt;em&gt;Java&lt;/em&gt; version, which might impact other projects. You can see the mess.&lt;/p&gt;

&lt;p&gt;Luckily, the ci runner on &lt;em&gt;Github&lt;/em&gt; comes with &lt;em&gt;Java&lt;/em&gt; preinstalled, so the ci should run.&lt;/p&gt;
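If you would rather not rely on the runner's preinstalled Java, a step could pin a JVM explicitly with the `actions/setup-java` action. This is a sketch only: the distribution and version below are assumptions, and you should check which JVM versions your Spark release supports.

```yaml
# Hypothetical extra step: pin a specific JVM instead of relying on the
# runner's preinstalled Java. Distribution and version are examples only.
- name: Set up Java
  uses: actions/setup-java@v4
  with:
    distribution: "temurin"
    java-version: "17"
```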

&lt;h3&gt;
  
  
  Clean up
&lt;/h3&gt;

&lt;p&gt;You can now also clean up the repository to start from a clean slate. For instance, &lt;code&gt;src/pyspark_tdd/multiply.py&lt;/code&gt; and &lt;code&gt;tests/test_dummy.py&lt;/code&gt; can be removed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;You now have a comfortable setup to modify and tweak the code: you can run the tests and be confident the results are reproducible.&lt;/p&gt;

&lt;p&gt;In the next chapter, a more data driven approach to test case generation will be explored.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. This repository exposes what's the expected repository layout at the end of each chapter in each branch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_3" rel="noopener noreferrer"&gt;Chapter 3&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[09/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt; has been released&lt;br&gt;
[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>spark</category>
      <category>python</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 2 - CI</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sun, 23 Feb 2025 10:21:54 +0000</pubDate>
      <link>https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28</link>
      <guid>https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28</guid>
      <description>&lt;p&gt;This goal of this tutorial is to provide a way to easily be test driven with spark on your local setup without using cloud resources.&lt;/p&gt;

&lt;p&gt;This is a series of tutorials; the previous chapters can be found here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8"&gt;Chapter 0 and 1&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Chapter 2: Continuous Integration (ci)
&lt;/h2&gt;

&lt;p&gt;Having a ci is mandatory for any project that aims to have multiple contributors. In this chapter, a proposed ci will be implemented.&lt;/p&gt;

&lt;p&gt;As the ci implementation is specific to the collaborative platform (&lt;em&gt;Github&lt;/em&gt;, &lt;em&gt;Gitlab&lt;/em&gt;, &lt;em&gt;Bitbucket&lt;/em&gt;, &lt;em&gt;Azure Devops&lt;/em&gt;, etc.), this chapter will try to remain as technology agnostic as possible.&lt;/p&gt;

&lt;p&gt;Similar concepts are available in every ci system; you will have to transpose the ones used here to your own platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content of the ci
&lt;/h3&gt;

&lt;p&gt;The ci here will be very minimal, but it showcases the concepts implemented in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/tutorials/chapter_1_setup.md" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python setup&lt;/li&gt;
&lt;li&gt;Project setup&lt;/li&gt;
&lt;li&gt;Code Formatting&lt;/li&gt;
&lt;li&gt;Test automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more additions to the continuous integration that will not be tackled here. A minimal ci is required to guarantee non-regression in terms of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code styling rules, to guarantee that no individual contributor diverges from the coding style&lt;/li&gt;
&lt;li&gt;tests, namely all tests must be passing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Github&lt;/em&gt; provides extensive &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for you to tweak your ci.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Github&lt;/em&gt; is expecting ci files to be provided at a specific location, you can therefore create a file in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/.github/workflows/ci.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;.github/workflows/ci.yaml&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this file, you can add&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;run-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Continuous-Integration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;run-name&lt;/code&gt; define the names of the pipeline that will run.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;on&lt;/code&gt; defines the event that triggers the pipeline; &lt;code&gt;push&lt;/code&gt; means the pipeline runs for every pushed commit.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;jobs&lt;/code&gt; defines a list of jobs, the ci is made of one job with multiple steps for the sake of simplicity.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;runs-on&lt;/code&gt; defines the image the environment runs on (the runner); it is chosen from a list of &lt;a href="https://github.com/actions/runner-images" rel="noopener noreferrer"&gt;runner images&lt;/a&gt; maintained by &lt;em&gt;Github&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
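As a variant, the `on` trigger can be narrowed so the pipeline does not run on every push to every branch. This is a sketch; the branch name is an assumption and depends on your repository:

```yaml
# Hypothetical variant of the trigger section: run on pull requests and
# on pushes to main only, instead of on every push.
on:
  pull_request:
  push:
    branches: [main]
```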

&lt;p&gt;Now into the steps section we can add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository code&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdx/mise-action@v2&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Formatting&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;uv run ruff check&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Tests&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;uv run pytest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;actions/checkout@v4&lt;/code&gt; is the &lt;em&gt;Github&lt;/em&gt; action that checks out the current branch of the repository.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;jdx/mise-action@v2&lt;/code&gt; is the &lt;em&gt;Github&lt;/em&gt; action that reads the &lt;code&gt;mise.toml&lt;/code&gt; and installs everything for us.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Run Formatting&lt;/code&gt; step installs the dependencies and runs the formatting check. If there is an error, the command fails and so does the pipeline.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Run Tests&lt;/code&gt; step runs the tests. If there is an error, the command fails and so does the pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ci as documentation
&lt;/h3&gt;

&lt;p&gt;As stated earlier, the ci is the single source of truth: if it passes on ci, it should pass on your local setup. If not, there are discrepancies between the ci environment and yours.&lt;/p&gt;

&lt;p&gt;Going through the ci implementation will help you with reproducibility. Maybe you're not installing the python version the same way, or not using the same dependency management tool. You need to align your tools with the ones presented in &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/tutorials/chapter_1_setup.md" rel="noopener noreferrer"&gt;chapter 1&lt;/a&gt;; they are chosen not to conflict with your local setup. You might have installed python packages globally, or manually changed &lt;code&gt;PYTHON_HOME&lt;/code&gt; or your &lt;code&gt;PATH&lt;/code&gt;, and this can easily become a mess.&lt;/p&gt;
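&lt;p&gt;As a quick sanity check of one common discrepancy, you can compare the interpreter you are running against the pinned &lt;code&gt;.python-version&lt;/code&gt;. This is a minimal sketch; the helper name is ours and not part of mise or uv:&lt;/p&gt;

```python
import sys
from pathlib import Path


def python_matches_pin(pin_file: str = ".python-version") -> bool:
    """Return True when the running interpreter matches the pinned version.

    `pin_file` is assumed to contain a bare version such as "3.12",
    the format mise writes; only major.minor are compared.
    """
    pinned = Path(pin_file).read_text().strip()
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    return pinned == current
```

&lt;p&gt;Running it at the project root tells you immediately whether your shell picked up the same python as the ci.&lt;/p&gt;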

&lt;p&gt;To help with reproducibility, a &lt;a href="https://code.visualstudio.com/docs/devcontainers/containers" rel="noopener noreferrer"&gt;dev container&lt;/a&gt; approach can be used: the ci runs inside a container, and this same container can be reused as a development environment. This will not be implemented for the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  A better ci structure
&lt;/h3&gt;

&lt;p&gt;To improve readability and segregate code formatting from testing, the &lt;em&gt;Github&lt;/em&gt; actions can be implemented as jobs with interdependencies. The workflow then becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;run-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Continuous Integration&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Formatting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdx/mise-action@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Formatting&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;uv run ruff check&lt;/span&gt;
  &lt;span class="na"&gt;Tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Formatting&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check out repository code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jdx/mise-action@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;uv run pytest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In here we added the &lt;code&gt;needs: [Formatting]&lt;/code&gt; to create a dependency between ci jobs. It means we will not run the tests until the code style is compliant, which saves time and resources. Indeed, if the code is not formatted, don't even bother running the tests. The execution graph looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F946pqc92i72xoltqjdw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F946pqc92i72xoltqjdw0.png" alt="Ci Execution graph" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see some duplication here, which is not ideal: any future improvement will have to be made in two places at once. This is technical debt that one would have to tackle using a &lt;a href="https://docs.github.com/en/actions/sharing-automations/creating-actions/creating-a-composite-action" rel="noopener noreferrer"&gt;composite action&lt;/a&gt;. We will consider it acceptable for now.&lt;/p&gt;
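&lt;p&gt;For reference, the duplicated mise step could eventually move into such a composite action; a minimal sketch, with a hypothetical path and name not taken from the repository:&lt;/p&gt;

```yaml
# .github/actions/setup/action.yaml (hypothetical path)
name: Setup tools
description: Install the tools declared in mise.toml
runs:
  using: composite
  steps:
    - uses: jdx/mise-action@v2
```

&lt;p&gt;Each job would then keep its own checkout step and call &lt;code&gt;uses: ./.github/actions/setup&lt;/code&gt; instead of repeating the mise step.&lt;/p&gt;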

&lt;h3&gt;
  
  
  Caching dependency resolution
&lt;/h3&gt;

&lt;p&gt;You will see additional steps in the &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_2/.github/workflows/ci.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;ci.yaml&lt;/code&gt;&lt;/a&gt;, namely related to caching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore uv cache&lt;/span&gt;
        &lt;span class="s"&gt;uses&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
        &lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/.uv-cache&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}&lt;/span&gt;
          &lt;span class="na"&gt;restore-keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;uv-${{ runner.os }}-${{ hashFiles('uv.lock') }}&lt;/span&gt;
            &lt;span class="s"&gt;uv-${{ runner.os }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These steps cache uv's cache directory (&lt;code&gt;/tmp/.uv-cache&lt;/code&gt;) and restore it when the &lt;code&gt;uv.lock&lt;/code&gt; has not changed. The intent is to speed up the ci execution, as dependency resolution and installation can be time consuming.&lt;/p&gt;
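&lt;p&gt;The cache key combines the runner OS with a digest of &lt;code&gt;uv.lock&lt;/code&gt;, so any change to the lock file invalidates the cache. A rough python sketch of how such a key behaves, approximating &lt;code&gt;hashFiles&lt;/code&gt; with SHA-256 (the helper is illustrative only, not part of uv or Github actions):&lt;/p&gt;

```python
import hashlib
import platform


def cache_key(lock_contents: bytes) -> str:
    """Build a key shaped like uv-<os>-<digest of uv.lock>.

    Identical lock files produce identical keys (cache hit);
    any change to the lock file changes the digest (cache miss).
    """
    digest = hashlib.sha256(lock_contents).hexdigest()
    return f"uv-{platform.system()}-{digest}"
```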

&lt;p&gt;An extra step is added to minimize the cache size, as &lt;em&gt;uv&lt;/em&gt; provides such a feature; an environment variable is also added to configure the location of the cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Minimize uv cache&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv cache prune --ci&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;UV_CACHE_DIR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/tmp/.uv-cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;In the next chapter, you will write your first spark code and implement a way to guarantee its test automation. This is long overdue, as we spent 3 chapters on setup...&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. This repository shows the expected repository layout at the end of each chapter, one branch per chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;Chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;Chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_2" rel="noopener noreferrer"&gt;Chapter 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[03/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt; has been released&lt;br&gt;
[09/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt; has been released&lt;br&gt;
[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>python</category>
      <category>ci</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How to be Test Driven with Spark: Chapter 0 and 1 - Modern Python Setup</title>
      <dc:creator>Nicoda-27</dc:creator>
      <pubDate>Sat, 15 Feb 2025 09:24:09 +0000</pubDate>
      <link>https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8</link>
      <guid>https://dev.to/nda_27/how-to-be-tdd-with-spark-chapter-0-and-1-modern-python-setup-3df8</guid>
      <description>&lt;h2&gt;
  
  
  Chapter 0: Why this tutorial
&lt;/h2&gt;

&lt;p&gt;The goal of this tutorial is to provide a way to easily be test driven with spark on your local setup, without using cloud resources.&lt;/p&gt;

&lt;p&gt;Before diving deep into spark and how to test it, we must first align on our setup environment to ease reproducibility; this will be the focus of this article.&lt;/p&gt;

&lt;p&gt;The official &lt;a href="https://spark.apache.org/docs/latest/api/python/getting_started/testing_pyspark.html#Putting-It-All-Together!" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; describes how to create tests with pyspark.&lt;/p&gt;

&lt;p&gt;It requires a spark server with spark connect support for it to work, as described in the &lt;a href="https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a reminder, this is how spark connect works:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpe4jivn6hzz01od4wxq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpe4jivn6hzz01od4wxq.png" alt="spark connect" width="800" height="882"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Namely, a specific server needs to be created so your tests can connect to this server and process the data as intended.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why is it not enough?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Launching the server requires some extra software on your machine, namely a java virtual machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Launching the server requires a specific script called &lt;code&gt;start-connect-server.sh&lt;/code&gt;, which is to be found in the spark distribution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some data engineers might argue they can just use a spark server already deployed to be able to test; but there are several drawbacks to this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are charged to launch simple tests or run experiments, keeping cloud providers very happy&lt;/li&gt;
&lt;li&gt;You slow down the &lt;strong&gt;developer feedback loop&lt;/strong&gt;, the time necessary to implement a feature and validate that no regression has been introduced. A developer is more confident that no regression was introduced when all tests are executed&lt;/li&gt;
&lt;li&gt;You create &lt;strong&gt;external dependencies&lt;/strong&gt; that you have no control over. You might encounter issues with testing when the cloud provider is down, when you don't have internet access, or when someone changes the configuration of the server by accident.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to have a test environment that is self-descriptive, quick to set up, quick to start and reliable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Chapter 1: Setup
&lt;/h2&gt;

&lt;p&gt;In this chapter, multiple tools will be introduced and set up. The intent is to have a clean python environment to reproduce the code. This is a very opinionated section, but it might be useful to challenge your existing tools against it.&lt;/p&gt;
&lt;h3&gt;
  
  
  Python version management
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://mise.jdx.dev/" rel="noopener noreferrer"&gt;&lt;em&gt;Mise&lt;/em&gt;&lt;/a&gt; will be leveraged to handle python versions. It claims to be the &lt;em&gt;The front-end to your dev env&lt;/em&gt; and it will be used to install specific versions of languages and tools.&lt;/p&gt;

&lt;p&gt;It can be used for much more, and it is strongly advised to look at the &lt;a href="https://mise.jdx.dev/getting-started.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; to understand the true power of this tool, which is not limited to python development.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mise&lt;/em&gt; first needs to be installed, see &lt;a href="https://mise.jdx.dev/getting-started.html#_1-install-mise-cli" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for further instructions. You can launch the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://mise.run | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, you will have to customize your &lt;code&gt;.bashrc&lt;/code&gt; or your &lt;code&gt;.zshrc&lt;/code&gt; (or other shell configuration) to activate &lt;em&gt;mise&lt;/em&gt; in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'eval "$(~/.local/bin/mise activate bash)"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Mise&lt;/em&gt; can now be used to install python at a specific version with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise &lt;span class="nb"&gt;install &lt;/span&gt;python@3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will download a pre-compiled version of python and make it available globally.&lt;/p&gt;

&lt;p&gt;Let's now use it: first position yourself at the root of the project and launch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise use python@3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It will create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/mise.toml" rel="noopener noreferrer"&gt;mise.toml&lt;/a&gt; file with the following section&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tools]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"3.12"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/.python-version" rel="noopener noreferrer"&gt;.python-version&lt;/a&gt; file containing&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the help of these files, &lt;em&gt;mise&lt;/em&gt; will activate when you are located at the root of your project. It's also a great way to document, for other contributors, the requirements to launch this project, without relying on a README that easily becomes outdated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python dependency management
&lt;/h3&gt;

&lt;p&gt;A tool to help us add, remove and download dependencies is necessary. &lt;a href="https://docs.astral.sh/uv/" rel="noopener noreferrer"&gt;Uv&lt;/a&gt; will be used, as it's very fast and easy to use.&lt;/p&gt;

&lt;p&gt;To install it, you can follow the official &lt;a href="https://docs.astral.sh/uv/getting-started/installation/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;; but in this tutorial &lt;em&gt;mise&lt;/em&gt; will be leveraged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mise use uv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will both install and set up uv for the project. See how &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/mise.toml" rel="noopener noreferrer"&gt;mise.toml&lt;/a&gt; has been modified with the addition of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tools]&lt;/span&gt;
&lt;span class="py"&gt;python&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"3.12"&lt;/span&gt;
&lt;span class="py"&gt;uv&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"latest"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it can be used to initialize the project, namely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a folder structure for you and a &lt;code&gt;hello.py&lt;/code&gt;. In this project, we have customized it a bit to add a &lt;code&gt;tests&lt;/code&gt; folder and a pyspark_tdd package under &lt;code&gt;src&lt;/code&gt;, so it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── src
│   ├── hello.py
├── tests
├── .python-version
├── .mise.toml
└── pyproject.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ignoring files
&lt;/h3&gt;

&lt;p&gt;Every repository needs a set of files to ignore before committing. This is done via a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/.gitignore" rel="noopener noreferrer"&gt;.gitignore&lt;/a&gt; file, and you can leverage existing templates for your language of preference.&lt;/p&gt;

&lt;p&gt;If you start a project from scratch, you will first need to set up git:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Github&lt;/em&gt; maintains gitignore &lt;a href="https://github.com/github/gitignore/tree/main" rel="noopener noreferrer"&gt;templates&lt;/a&gt; for each language. You can leverage them with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; .gitignore https://raw.githubusercontent.com/github/gitignore/refs/heads/main/Python.gitignore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this project, the chosen gitignore template is the python one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding formatting and linting
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Python
&lt;/h4&gt;

&lt;p&gt;Linters and formatters are powerful tools to enforce code writing rules among developers. It takes away the pain of having to care how the code is written at the syntax level.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ruff&lt;/em&gt; will be leveraged to format our python code, as it's very powerful and can run at file save without latency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ruff&lt;/em&gt; will be added as a project &lt;em&gt;dev&lt;/em&gt; dependency. A &lt;em&gt;dev&lt;/em&gt; dependency is one that the project does not need to run; it can be related to tests, experimentation, formatting, etc. Everything that is not meant to be shipped to production should stay a &lt;em&gt;dev&lt;/em&gt; dependency, to keep your python package as self contained as possible.&lt;/p&gt;

&lt;p&gt;We can add &lt;em&gt;ruff&lt;/em&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add ruff &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will add a &lt;em&gt;dev&lt;/em&gt; dependency in the &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/pyproject.toml" rel="noopener noreferrer"&gt;pyproject.toml&lt;/a&gt; with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependency-groups]&lt;/span&gt;
&lt;span class="py"&gt;dev&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="py"&gt;"ruff&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="s"&gt;",&lt;/span&gt;&lt;span class="err"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will also create a &lt;code&gt;.venv&lt;/code&gt; in the current working directory. You might notice that the &lt;code&gt;.venv&lt;/code&gt; is ignored from git, which is intended. Indeed, you don't want to commit your &lt;code&gt;.venv&lt;/code&gt; directory, as it's a copy of your project's dependencies and can be quite large.&lt;/p&gt;

&lt;p&gt;It will also create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/uv.lock" rel="noopener noreferrer"&gt;uv.lock&lt;/a&gt; that records the versions of your direct dependencies and of the indirect ones (the dependencies of your dependencies). This mechanism segregates the dependencies of your project from the rest.&lt;/p&gt;

&lt;p&gt;Your project should now look like&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
├── .venv
├── src
│   ├── hello.py
├── tests
├── .python-version
├── .gitignore
├── .mise.toml
├── pyproject.toml
└── uv.lock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Other languages
&lt;/h4&gt;

&lt;p&gt;As a project is not just python files, but also configuration, pipelines, documentation, etc., formatting these files is also necessary.&lt;/p&gt;

&lt;p&gt;Documenting how these files will be formatted is done using &lt;a href="https://editorconfig.org/#overview" rel="noopener noreferrer"&gt;editorconfig&lt;/a&gt;.&lt;br&gt;
We will use the one from the &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/.editorconfig" rel="noopener noreferrer"&gt;editorconfig&lt;/a&gt; website.&lt;/p&gt;
&lt;h4&gt;
  
  
  Your Integrated Development Environment (IDE)
&lt;/h4&gt;

&lt;p&gt;Whichever &lt;em&gt;IDE&lt;/em&gt; you use, it's very important to set up formatting on file save; it saves time and removes the pain of handling it by hand.&lt;/p&gt;

&lt;p&gt;If you are using VSCode, you can install the &lt;a href="https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff" rel="noopener noreferrer"&gt;ruff&lt;/a&gt; extension and add the following to your &lt;em&gt;settings.json&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"editor.formatOnSave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"[python]"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"editor.formatOnSave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"editor.codeActionsOnSave"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"source.fixAll"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"explicit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"source.organizeImports"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"explicit"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"editor.defaultFormatter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"charliermarsh.ruff"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The first test
&lt;/h3&gt;

&lt;p&gt;To see if everything works as expected, you will write a very simple unit test. In a test driven approach, the test is written before the source code.&lt;/p&gt;

&lt;p&gt;A test framework is required to launch the test automation, &lt;a href="https://docs.pytest.org/en/stable/" rel="noopener noreferrer"&gt;&lt;em&gt;pytest&lt;/em&gt;&lt;/a&gt; will be used. You need to add it as a &lt;em&gt;dev&lt;/em&gt; dependency&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv add pytest &lt;span class="nt"&gt;--dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create a &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/tests/test_dummy.py" rel="noopener noreferrer"&gt;&lt;code&gt;tests/test_dummy.py&lt;/code&gt;&lt;/a&gt; with the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;your_python_package.multiply&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_my_dummy_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requires a function &lt;code&gt;multiply&lt;/code&gt; that can be defined as in &lt;code&gt;src/your_python_package/multiply.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now run the tests, make sure you're using the right python from the &lt;code&gt;.venv&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;which python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should display something like &lt;code&gt;$HOME/somepath/your_project/.venv/bin/python&lt;/code&gt;. If it does not, open a new terminal; &lt;em&gt;mise&lt;/em&gt; should then resolve the correct interpreter.&lt;/p&gt;

&lt;p&gt;Then run&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The run will fail with an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tests/test_dummy.py:1: in &amp;lt;module&amp;gt;
    from your_python_package.multiply import multiply
E   ModuleNotFoundError: No module named 'your_python_package'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need to add an extra entry so that &lt;em&gt;pytest&lt;/em&gt; can resolve imports from the &lt;code&gt;src&lt;/code&gt; layout. In &lt;a href="https://github.com/Nicoda-27/spark_tdd/blob/doc/chapter_1/pyproject.toml" rel="noopener noreferrer"&gt;pyproject.toml&lt;/a&gt;, you can add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.pytest.ini_options]&lt;/span&gt;
&lt;span class="py"&gt;pythonpath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"src"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
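&lt;p&gt;As an alternative to the &lt;code&gt;pythonpath&lt;/code&gt; entry, you could install the package itself in editable mode so that imports resolve without any &lt;em&gt;pytest&lt;/em&gt;-specific configuration. This is only a sketch, and it assumes your &lt;code&gt;pyproject.toml&lt;/code&gt; declares a build backend (e.g. hatchling or setuptools):&lt;br&gt;
&lt;/p&gt;

```shell
# Sketch: install the project in editable mode into the active .venv.
# Assumes pyproject.toml declares a build backend; not needed if you
# use the pythonpath entry shown above.
pip install -e .
```

&lt;p&gt;With an editable install, &lt;code&gt;your_python_package&lt;/code&gt; is importable from anywhere in the virtual environment, not just from &lt;em&gt;pytest&lt;/em&gt; runs.&lt;/p&gt;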



&lt;p&gt;Now&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;should display&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;============================================================= test session starts ===============================================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/somepath/src/your_project
configfile: pyproject.toml
collected 1 item                                                                                                                                 

tests/test_dummy.py .                                                                                                                      [100%]

=============================================================== 1 passed in 0.01s ================================================================
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can do some housekeeping and remove the unnecessary &lt;code&gt;src/your_python_package/hello.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You now have a proper setup to start working.&lt;/p&gt;
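&lt;p&gt;From here, a natural next step is to cover more cases with &lt;em&gt;pytest&lt;/em&gt;'s parametrization. Here is a minimal sketch; &lt;code&gt;multiply&lt;/code&gt; is inlined so the snippet is self-contained, but in the project you would import it from &lt;code&gt;your_python_package.multiply&lt;/code&gt; as before:&lt;br&gt;
&lt;/p&gt;

```python
import pytest


def multiply(a: int, b: int) -> int:
    # Stand-in for your_python_package.multiply.multiply
    return a * b


# Each tuple is one test case: (a, b, expected product).
@pytest.mark.parametrize(
    ("a", "b", "expected"),
    [(1, 2, 2), (3, 4, 12), (-2, 5, -10), (0, 7, 0)],
)
def test_multiply(a: int, b: int, expected: int) -> None:
    assert multiply(a, b) == expected
```

&lt;p&gt;Each tuple becomes its own test in the &lt;em&gt;pytest&lt;/em&gt; report, so a failing edge case is pinpointed immediately instead of being hidden behind a single assertion.&lt;/p&gt;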

&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;Now that one test is implemented, the continuous integration (CI) must be set up. In a collaborative way of working, the CI is the single source of truth that tells you whether anything is broken.&lt;/p&gt;

&lt;p&gt;Notice we still have not touched any &lt;em&gt;Spark&lt;/em&gt; components: it is very important to have a clean, reproducible codebase before diving in.&lt;/p&gt;

&lt;p&gt;That will be the topic of the next chapter.&lt;/p&gt;

&lt;p&gt;You can find the original materials in &lt;a href="https://github.com/Nicoda-27/spark_tdd" rel="noopener noreferrer"&gt;spark_tdd&lt;/a&gt;. Each branch of this repository shows the expected repository layout at the end of the corresponding chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_0" rel="noopener noreferrer"&gt;chapter 0&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Nicoda-27/spark_tdd/tree/doc/chapter_1" rel="noopener noreferrer"&gt;chapter 1&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;[23/02/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-2-ci-4a28"&gt;Chapter 2&lt;/a&gt; has been released&lt;br&gt;
[03/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-3-first-spark-test-le"&gt;Chapter 3&lt;/a&gt; has been released&lt;br&gt;
[09/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-4-leaning-into-property-based-testing-2hln"&gt;Chapter 4&lt;/a&gt; has been released&lt;br&gt;
[15/03/25 UPDATE]: &lt;a href="https://dev.to/nda_27/how-to-be-test-driven-with-spark-chapter-5-leverage-spark-in-a-container-1p74"&gt;Chapter 5&lt;/a&gt; has been released&lt;/p&gt;

</description>
      <category>python</category>
      <category>ruff</category>
      <category>testing</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
