<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saša Zejnilović</title>
    <description>The latest articles on DEV Community by Saša Zejnilović (@zejnilovic).</description>
    <link>https://dev.to/zejnilovic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F81773%2Ff911397e-8b9e-4373-adff-565638cd7af0.png</url>
      <title>DEV Community: Saša Zejnilović</title>
      <link>https://dev.to/zejnilovic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zejnilovic"/>
    <language>en</language>
    <item>
      <title>5 things to watch out for in automated regression tests</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Wed, 06 Jan 2021 21:33:51 +0000</pubDate>
      <link>https://dev.to/zejnilovic/5-things-to-watch-out-for-in-automated-regression-tests-5dg5</link>
      <guid>https://dev.to/zejnilovic/5-things-to-watch-out-for-in-automated-regression-tests-5dg5</guid>
      <description>&lt;ul&gt;
&lt;li&gt;What are regression tests? (in short)&lt;/li&gt;
&lt;li&gt;
The problems

&lt;ul&gt;
&lt;li&gt;1. Change of the output formats&lt;/li&gt;
&lt;li&gt;2. Designed-in assumptions about the test environment&lt;/li&gt;
&lt;li&gt;3. Errors in maintenance&lt;/li&gt;
&lt;li&gt;4. Changing operators&lt;/li&gt;
&lt;li&gt;5. Not treating your tests as any other codebase&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are regression tests? (in short)
&lt;/h2&gt;

&lt;p&gt;You design regression tests to detect issues in features you have already tested, issues that may appear as side effects of implementing other features.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problems
&lt;/h2&gt;

&lt;p&gt;The biggest problems facing regression tests are:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Change of the output formats
&lt;/h3&gt;

&lt;p&gt;This is the most common problem. The changes may be so minor that manual testers would barely notice. Automated tests, however, are sensitive and brittle, unable to differentiate between improvements and bugs. The whole suite might need updating even if we only change a metadata field from Map[String, String] to Map[String, Any] to keep all formats.&lt;/p&gt;
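
&lt;p&gt;A minimal sketch of how such a type change can break a check, and one way to make the check less brittle. The names &lt;code&gt;oldOutput&lt;/code&gt;, &lt;code&gt;newOutput&lt;/code&gt; and &lt;code&gt;metadataMatches&lt;/code&gt; are hypothetical and only serve the illustration; they do not come from any real suite.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Hypothetical metadata before and after the format change.
val oldOutput: Map[String, String] = Map("version" -&amp;gt; "1", "source" -&amp;gt; "db")
val newOutput: Map[String, Any] = Map("version" -&amp;gt; 1, "source" -&amp;gt; "db")

// Brittle: exact equality fails because "1" (String) is not equal to 1 (Int).
val brittleCheck = oldOutput == newOutput

// More tolerant: compare only the expected keys, with values normalised to strings.
def metadataMatches(expected: Map[String, String], actual: Map[String, Any]): Boolean =
  expected.forall { case (key, value) =&amp;gt; actual.get(key).exists(_.toString == value) }

val tolerantCheck = metadataMatches(oldOutput, newOutput)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whether the tolerant comparison is appropriate depends on what the test is meant to guard; if the value type itself is part of the contract, the strict check is the right one.&lt;/p&gt;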

&lt;h3&gt;
  
  
  2. Designed-in assumptions about the test environment
&lt;/h3&gt;

&lt;p&gt;Test suites may break when moved to a different environment or when the configuration changes, that is, whenever the tests do not control their own environment.&lt;/p&gt;
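
&lt;p&gt;One way to reduce such assumptions is to make every environment-specific value an explicit input with a documented default. The sketch below assumes a hypothetical setting called &lt;code&gt;TEST_DATA_DIR&lt;/code&gt; and a hypothetical helper &lt;code&gt;testDataDir&lt;/code&gt;; neither is part of any particular framework.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Hypothetical helper: resolve the test data location from the given settings,
// falling back to a default that works on a developer machine.
def testDataDir(env: Map[String, String]): String =
  env.getOrElse("TEST_DATA_DIR", "target/test-data")

// The same test code resolves a different location in each environment.
val onCi = testDataDir(Map("TEST_DATA_DIR" -&amp;gt; "/mnt/ci/data"))
val onLocal = testDataDir(Map.empty)

// In a real suite, sys.env could be passed instead of a hand-built Map.
val resolved = testDataDir(sys.env)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;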

&lt;h3&gt;
  
  
  3. Errors in maintenance
&lt;/h3&gt;

&lt;p&gt;People who repair automated tests make mistakes and introduce bugs into the test suites. Regression test suites then develop regression bugs of their own, which may only show up after some time.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Changing operators
&lt;/h3&gt;

&lt;p&gt;Test suites may require skilled people and specific knowledge to run and maintain. People change positions and jobs. If person X disables a test and is then let go, person Y keeps running the suite, unaware that there might be a problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Not treating your tests as any other codebase
&lt;/h3&gt;

&lt;p&gt;Very often, and not only in regression testing, people treat their tests the way they treat their documentation. Tests should be treated like any other codebase, with standards and design principles applied. You stop just shy of creating tests for your tests. The two should actually work in unison: you can view it as your Test Code testing your Product Code and vice versa.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, people tend to invest a lot in regression automation. Sadly, they often find that the tests stop working sooner rather than later. The tests are out of sync with the product. They demand repair. They no longer help find bugs. Testers respond by updating the tests or just adding new ones, ending up with 5,000-6,000 test cases whose purpose no one knows, while everyone just prays it is OK. I know a company where two full-time SDETs were needed to patch the tests every day and add more. Hundreds of tests were disabled because nobody had time to fix them.&lt;/p&gt;

&lt;p&gt;Of all these outcomes, uncontrolled maintenance cost is probably the most common. As a result, companies would rather "forget" the regression suite, and the testing with it, than repair it.&lt;/p&gt;

</description>
      <category>testing</category>
      <category>regression</category>
      <category>management</category>
    </item>
    <item>
      <title>Working with nested structures in Spark</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Sun, 20 Sep 2020 11:01:55 +0000</pubDate>
      <link>https://dev.to/zejnilovic/working-with-nested-structures-in-spark-4c97</link>
      <guid>https://dev.to/zejnilovic/working-with-nested-structures-in-spark-4c97</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Intro&lt;/li&gt;
&lt;li&gt;Add Column&lt;/li&gt;
&lt;li&gt;Drop Column&lt;/li&gt;
&lt;li&gt;Map Column&lt;/li&gt;
&lt;li&gt;Afterword&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;I want to introduce you to a library called &lt;a href="https://github.com/AbsaOSS/spark-hats"&gt;spark-hats&lt;/a&gt;, full name Spark &lt;strong&gt;H&lt;/strong&gt;elpers for &lt;strong&gt;A&lt;/strong&gt;rray &lt;strong&gt;T&lt;/strong&gt;ransformation&lt;strong&gt;s&lt;/strong&gt;, but do not let the name fool you. It works with structs as well. This library saves me a lot of time and energy when developing new Spark applications that have to work with nested structures. I hope it helps you too.&lt;/p&gt;

&lt;p&gt;The core of the library consists of methods that &lt;strong&gt;add&lt;/strong&gt; a column, &lt;strong&gt;map&lt;/strong&gt; a column, and &lt;strong&gt;drop&lt;/strong&gt; a column. All of these are engineered so you can turn this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dfOut&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;select&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;struct&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;as&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="nv"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"b"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;as&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"b"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;
  &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"a"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;as&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"c"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;}).&lt;/span&gt;&lt;span class="py"&gt;as&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;into this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dfOut&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedMapColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.a"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"c"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's get started with imports and the structure that will be used for examples.&lt;/p&gt;

&lt;p&gt;I will start spark-shell with the package loaded, using this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$&amp;gt;&lt;/span&gt; spark-shell &lt;span class="nt"&gt;--packages&lt;/span&gt; za.co.absa:spark-hats_2.11:0.2.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then in the spark-shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.spark.hats.Extensions._&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;za.co.absa.spark.hats.Extensions._&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;printSchema&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;root&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;my_array&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;struct&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;containsNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+---+------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;my_array&lt;/span&gt;                      &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+---+------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;foo&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;                    &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;bar&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;2&lt;/span&gt;, &lt;span class="kt"&gt;baz&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;3&lt;/span&gt;, &lt;span class="kt"&gt;foz&lt;/span&gt;&lt;span class="o"&gt;]]|&lt;/span&gt;
&lt;span class="o"&gt;+---+------------------------------+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
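
&lt;p&gt;The session above assumes that a DataFrame named &lt;code&gt;df&lt;/code&gt; already exists; how it was created is not shown. As a sketch, one way to build a DataFrame of the same shape inside spark-shell is to parse a couple of JSON strings. The JSON documents below are an assumption reconstructed from the printed rows, not the original data source.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Assumption: this runs inside spark-shell, where `spark` is already defined.
import spark.implicits._

// One JSON document per row, mirroring the rows printed by df.show above.
val json = Seq(
  """{"id": 1, "my_array": [{"a": 1, "b": "foo"}]}""",
  """{"id": 2, "my_array": [{"a": 1, "b": "bar"}, {"a": 2, "b": "baz"}, {"a": 3, "b": "foz"}]}"""
)

// Let Spark infer the nested schema from the JSON strings.
val df = spark.read.json(json.toDS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;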



&lt;p&gt;Now let's move to the methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  Add Column
&lt;/h2&gt;

&lt;p&gt;Add column comes in two variants: simple and extended. The simple variant adds a new field to a nested structure. The extended variant does the same while allowing you to reference other elements.&lt;/p&gt;

&lt;p&gt;The simple one is straightforward. You take your DataFrame, and instead of calling &lt;code&gt;withColumn&lt;/code&gt;, you call &lt;code&gt;nestedWithColumn&lt;/code&gt;. Let's add a literal to a struct.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedWithColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.c"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="py"&gt;printSchema&lt;/span&gt;
&lt;span class="n"&gt;root&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;my_array&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;struct&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;containsNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedWithColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.c"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"hello"&lt;/span&gt;&lt;span class="o"&gt;)).&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------------------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;my_array&lt;/span&gt;                                           &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------------------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;foo&lt;/span&gt;, &lt;span class="kt"&gt;hello&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;                                  &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;bar&lt;/span&gt;, &lt;span class="kt"&gt;hello&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;2&lt;/span&gt;, &lt;span class="kt"&gt;baz&lt;/span&gt;, &lt;span class="kt"&gt;hello&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;3&lt;/span&gt;, &lt;span class="kt"&gt;foz&lt;/span&gt;, &lt;span class="kt"&gt;hello&lt;/span&gt;&lt;span class="o"&gt;]]|&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------------------------------------------+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extended version can also use other elements of the array. The API differs as well. Here the method &lt;code&gt;nestedWithColumnExtended&lt;/code&gt; expects a function returning a column as its second parameter. Moreover, this function takes an argument that is itself a function, the &lt;code&gt;getField()&lt;/code&gt; function. You can use &lt;code&gt;getField()&lt;/code&gt; in the transformation to reference other columns in the DataFrame by their fully qualified name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dfOut&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedWithColumnExtended&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.c"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;getField&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt;
         &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;cast&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"string"&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;getField&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.b"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
       &lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;dfOut&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;printSchema&lt;/span&gt;
&lt;span class="n"&gt;root&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;my_array&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;struct&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;containsNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;dfOut&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+---+------------------------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;my_array&lt;/span&gt;                                        &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+---+------------------------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;foo&lt;/span&gt;, &lt;span class="err"&gt;1&lt;/span&gt;&lt;span class="kt"&gt;foo&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;                                &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;bar&lt;/span&gt;, &lt;span class="err"&gt;2&lt;/span&gt;&lt;span class="kt"&gt;bar&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;2&lt;/span&gt;, &lt;span class="kt"&gt;baz&lt;/span&gt;, &lt;span class="err"&gt;2&lt;/span&gt;&lt;span class="kt"&gt;baz&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;3&lt;/span&gt;, &lt;span class="kt"&gt;foz&lt;/span&gt;, &lt;span class="err"&gt;2&lt;/span&gt;&lt;span class="kt"&gt;foz&lt;/span&gt;&lt;span class="o"&gt;]]|&lt;/span&gt;
&lt;span class="o"&gt;+---+------------------------------------------------+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that for root-level columns it is enough to use &lt;code&gt;col&lt;/code&gt;, but &lt;code&gt;getField&lt;/code&gt; would still be fine.&lt;/p&gt;
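
&lt;p&gt;As a sketch of that second option, the same transformation can be written with &lt;code&gt;getField&lt;/code&gt; for both columns. This relies only on the statement above that &lt;code&gt;getField&lt;/code&gt; accepts root-level names; the name &lt;code&gt;dfOutWithGetField&lt;/code&gt; is made up for the example, and the snippet assumes the same spark-shell session as before.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;// Assumption: same spark-shell session, with `df` and the spark-hats Extensions import in scope.
val dfOutWithGetField = df.nestedWithColumnExtended("my_array.c", getField =&amp;gt;
  // Reference both the root-level column and the nested field through getField.
  concat(getField("id").cast("string"), getField("my_array.b"))
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;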

&lt;h2&gt;
  
  
  Drop Column
&lt;/h2&gt;

&lt;p&gt;By the second method, you might have already caught on to the naming convention. This method is called &lt;code&gt;nestedDropColumn&lt;/code&gt; and is the most straightforward of the three. Just provide a fully qualified name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedDropColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.b"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;printSchema&lt;/span&gt;
&lt;span class="n"&gt;root&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;my_array&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;struct&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;containsNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedDropColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_array.b"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;my_array&lt;/span&gt;       &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;          &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;2&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;3&lt;/span&gt;&lt;span class="o"&gt;]]|&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Map Column
&lt;/h2&gt;

&lt;p&gt;Map column is probably the one with the most use cases. It applies a function to each element of your struct and puts the output on the same level by default, or somewhere else if specified.&lt;/p&gt;

&lt;p&gt;If the input column is a primitive, then a simple function will suffice. If it is a struct, then you will have to use &lt;code&gt;getField&lt;/code&gt; again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedMapColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputColumnName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my_array.a"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputColumnName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"c"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expression&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;printSchema&lt;/span&gt;
&lt;span class="n"&gt;root&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;my_array&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;struct&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;containsNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;scala&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;nestedMapColumn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputColumnName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my_array.a"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;outputColumnName&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"c"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expression&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="py"&gt;show&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;my_array&lt;/span&gt;                               &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------------------------------+&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;foo&lt;/span&gt;, &lt;span class="err"&gt;2&lt;/span&gt;&lt;span class="o"&gt;]]&lt;/span&gt;                          &lt;span class="o"&gt;|&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="o"&gt;|[[&lt;/span&gt;&lt;span class="err"&gt;1&lt;/span&gt;, &lt;span class="kt"&gt;bar&lt;/span&gt;, &lt;span class="err"&gt;2&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;2&lt;/span&gt;, &lt;span class="kt"&gt;baz&lt;/span&gt;, &lt;span class="err"&gt;3&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="err"&gt;3&lt;/span&gt;, &lt;span class="kt"&gt;foz&lt;/span&gt;, &lt;span class="err"&gt;4&lt;/span&gt;&lt;span class="o"&gt;]]|&lt;/span&gt;
&lt;span class="o"&gt;+---+---------------------------------------+&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Afterword
&lt;/h2&gt;

&lt;p&gt;I hope these methods and the library will help you as much as they helped me. They make working with structures a lot easier and keep my code more concise, which in my head means less error-prone.&lt;/p&gt;

&lt;p&gt;For more info go to &lt;a href="https://github.com/AbsaOSS/spark-hats"&gt;https://github.com/AbsaOSS/spark-hats&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Good luck and happy coding!&lt;/p&gt;

</description>
      <category>spark</category>
      <category>bigdata</category>
      <category>scala</category>
      <category>library</category>
    </item>
    <item>
      <title>Black Box Testing Misconceptions</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Thu, 28 May 2020 12:37:19 +0000</pubDate>
      <link>https://dev.to/zejnilovic/black-box-testing-misconceptions-1mcm</link>
      <guid>https://dev.to/zejnilovic/black-box-testing-misconceptions-1mcm</guid>
      <description>&lt;p&gt;After some time in QA and being a self-proclaimed SDET, I have seen that there are a lot of misconceptions regarding testing.  You can see articles all around the net with "Regression testing vs Retesting", "Performance Done Right", and other similar, but I have not seen one that adequately addresses &lt;strong&gt;Black Box Testing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From my experience &lt;strong&gt;Black Box Testing&lt;/strong&gt; is seen as something that can be &lt;em&gt;done quickly&lt;/em&gt;, or with &lt;em&gt;unskilled people&lt;/em&gt;. Some would label it as &lt;em&gt;cheap while providing a lot of feedback&lt;/em&gt;. Let's break this down. &lt;/p&gt;

&lt;p&gt;The easier misconception to disprove is "&lt;strong&gt;it is cheap&lt;/strong&gt;". That goes totally against the basic testing pyramid. Black box testing is done after you have some user interface, CLI or GUI, whatever you want to throw at the users. This means your integration could be garbage, and your supporting code could be garbage, but for some reason, it just worked together, until a user sat behind it. Now when something goes wrong, there is a chance you will have to dig really deep into your underlying code to make it work, and this could also break the integration with other modules. This seems so expensive to me, but then again, I am not a manager.&lt;/p&gt;

&lt;p&gt;The second misconception is about it &lt;strong&gt;"being easy"&lt;/strong&gt;. I am not saying black box testing is one of the more complex types of testing, but I am sure it is not as ignorance-based as some would think. Yes, you can throw a team of 20 people on a UI and tell them to go nuts, but does this actually bring the most value out of it? In my experience, proper black box testing profits when the people setting it up are &lt;strong&gt;knowledgeable about&lt;/strong&gt; the business &lt;strong&gt;use cases and issues&lt;/strong&gt;, and when they &lt;strong&gt;understand the users&lt;/strong&gt;. Give a tester a one-on-one with a user, let them chat, see what happens. It also helps when the testers &lt;strong&gt;understand the technology and configurations&lt;/strong&gt; of the system under test, which other software it will interact with, and what the &lt;strong&gt;expectations for the data flow&lt;/strong&gt; are.&lt;/p&gt;

&lt;p&gt;I hope this clears it up a bit. If I was not clear enough in everything I said, comment below and I will correct myself ASAP.&lt;/p&gt;

&lt;p&gt;Good luck and happy coding!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>testing</category>
    </item>
    <item>
      <title>Github Awesome Lists</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Tue, 26 May 2020 12:38:49 +0000</pubDate>
      <link>https://dev.to/zejnilovic/github-awesome-lists-4mll</link>
      <guid>https://dev.to/zejnilovic/github-awesome-lists-4mll</guid>
      <description>&lt;p&gt;&lt;em&gt;TL;DR: &lt;a href="https://github.com/sindresorhus/awesome"&gt;Awesome repo&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the last couple of weeks, I have seen a lot of custom made lists and suggestions for apps, frameworks and others, which is fantastic. A good tip is an excellent way to jumpstart someone's work or project. A curated list from someone experienced is just irreplaceable. &lt;/p&gt;

&lt;p&gt;With that said, allow me to present to you a GitHub repo called &lt;a href="https://github.com/sindresorhus/awesome"&gt;Awesome&lt;/a&gt;. This repository is full of different lists of impressive &lt;strong&gt;open-source&lt;/strong&gt; applications, libraries and frameworks for a plethora of usages.&lt;/p&gt;

&lt;p&gt;Please have a look. Give some of them a try. I know it helped me a lot of times. Sometimes it even sparks an idea.&lt;/p&gt;

&lt;p&gt;Good luck and happy coding!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Short: The biggest mistake of juniors</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Sat, 23 May 2020 16:57:24 +0000</pubDate>
      <link>https://dev.to/zejnilovic/short-the-biggest-mistake-of-juniors-5bmj</link>
      <guid>https://dev.to/zejnilovic/short-the-biggest-mistake-of-juniors-5bmj</guid>
      <description>&lt;p&gt;In my approximately five years of professional IT experience (a weird mix of QA, Dev, DevOps) I had the honour of looking at CVs, reviewing candidates and teaching or guiding more junior team members. Let's say my "mentoring" started three years ago. &lt;/p&gt;

&lt;p&gt;In these three years, I have seen a lot of people, both my teammates, colleagues and others, like open-source participants, make the same fundamental mistake. This mistake is something which even the wise &lt;strong&gt;Vesemir&lt;/strong&gt; told us not to do. He said: &lt;strong&gt;"Don't train alone; it only embeds your errors."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This might sound funny. Taking quotes from fictional characters, but it is something that I think is crucial. I have many times seen starting programmers learn something wrongly and burn it into their mind and then spread the harmful code like a disease everywhere in the codebase. &lt;/p&gt;

&lt;p&gt;I want to emphasize that I don't think this is their mistake. The Internet is big, dark and full of errors. They are trying, they are learning alone, and they should be praised and encouraged. But they should also start with peer reviewing as soon as possible. This is the "cure" of sorts. The Internet is also full of beginning programmers, and not everyone is lucky enough to get a mentor. What I am saying is do things together as soon as possible. Work together but try to learn separately. This will allow all of you to learn new things, share knowledge and discuss better ways, not allowing you to get comfortable with what you know and lowering the risk of embedding wrong ideas into your daily routine.&lt;/p&gt;

&lt;p&gt;Now how to find a place where you could code and someone will review your code for free? GitHub is a start. You can explore repositories; there is a button for it on the main page. You can filter by topics and languages. Pick a smaller project, look through the issues, play with it a bit. Smaller projects tend to be more open to newcomers. Not only do you learn and build meaningful things, but it will also show on your CV.&lt;/p&gt;

&lt;p&gt;Good luck and Happy Coding!&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>productivity</category>
      <category>career</category>
    </item>
    <item>
      <title>How to compare your data in/with Spark</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Fri, 01 May 2020 08:18:54 +0000</pubDate>
      <link>https://dev.to/zejnilovic/how-to-compare-your-data-in-with-spark-3m7c</link>
      <guid>https://dev.to/zejnilovic/how-to-compare-your-data-in-with-spark-3m7c</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Intro&lt;/li&gt;
&lt;li&gt;The problem&lt;/li&gt;
&lt;li&gt;The solution&lt;/li&gt;
&lt;li&gt;Who exactly is behind this project&lt;/li&gt;
&lt;li&gt;Hermes dataset comparison Features&lt;/li&gt;
&lt;li&gt;Usage - Spark application&lt;/li&gt;
&lt;li&gt;Summing-up&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Apache Spark, as is, provides quite a lot of different capabilities and features, but it is missing one that I, as a self-proclaimed SDET, find pretty valuable: the comparison of data. I'm talking about comparing complex data and complex structures, and generating a report that can be used to see where the problem lies; more than just a normal true/false comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;The main problem that we are trying to solve is that when using standard solutions, running a comparison of sorts on a large dataset returns a basic &lt;code&gt;true&lt;/code&gt; or &lt;code&gt;false&lt;/code&gt; result, after which you then need to comb through all of the data and try to find the root cause.&lt;br&gt;
There is no fast response. Fast feedback loops are essential, but that's for a different article. You also need some basic metrics about the dataset to be provided. Testing without proper results is putting your trust in hope, and hope alone cannot build your big data solutions.&lt;/p&gt;
&lt;h2&gt;
  
  
  The solution
&lt;/h2&gt;

&lt;p&gt;For these reasons, my teammates from &lt;a href="https://github.com/AbsaOSS"&gt;AbsaOSS&lt;/a&gt; and I have written a tool called &lt;a href="https://github.com/AbsaOSS/hermes"&gt;Hermes&lt;/a&gt;. Hermes consists of three modules, and one of its modules is a data comparison tool which works either as a Spark application or as a library, and it can compare whichever format is supported by Apache Spark. This tool is written in Scala, so it should be possible to use within any JVM application of your own. (I have even seen people using py-spark use our libraries, so it's not only JVM compatible. I am, however, not an expert on that, and I am not sure how "clean" of a solution that is.)&lt;/p&gt;

&lt;p&gt;In this article, I would like to give a brief overview of the features of this Spark comparison tool and how to use it as a Spark app. Usage as a library is a bit more complex, and I believe it deserves a full article of its own. Let me first explain who we are.&lt;/p&gt;
&lt;h2&gt;
  
  
  Who exactly is behind this project
&lt;/h2&gt;

&lt;p&gt;AbsaOSS is an initiative of &lt;a href="https://www.absa.africa/absaafrica/"&gt;Absa Group Limited&lt;/a&gt;, a South African bank that wants to go open source. You want to standardize your data, move it from COBOL to something current or track what and how your data is handled? We do that (&lt;a href="https://github.com/AbsaOSS/enceladus"&gt;Enceladus&lt;/a&gt;), that (&lt;a href="https://github.com/AbsaOSS/cobrix"&gt;Cobrix&lt;/a&gt;) and that (&lt;a href="https://github.com/AbsaOSS/spline"&gt;Spline&lt;/a&gt;). And some other interesting stuff.&lt;/p&gt;

&lt;p&gt;Hermes and all other projects are under the Apache License. Meaning, feel free to use it and contribute. The projects are active, and we spend almost our whole days on GitHub, so we are usually quite fast to respond. All of our projects are in some way currently used by ABSA in production.&lt;/p&gt;

&lt;p&gt;Hermes' significant advantage is that even though it is used in production, it is quite young and still looking for ideas. It is still growing.&lt;/p&gt;

&lt;p&gt;Current real-world usages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A testing framework for the &lt;a href="https://github.com/AbsaOSS/enceladus"&gt;Enceladus&lt;/a&gt; project.&lt;/li&gt;
&lt;li&gt;A data check tool that gives us an assurance that new tools work as well as the old ones that are being decommissioned.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Hermes dataset comparison Features
&lt;/h2&gt;

&lt;p&gt;This feature list should be the same for the people who use it as a library as for those that use it as a Spark app. The features are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can be used as an Apache Spark application or Spark library&lt;/li&gt;
&lt;li&gt;Compares virtually any data type if you provide the needed library for the source type on the classpath. Spark already supports a lot of source types, but you might need to read Oracle, Hive or Avro. Just provide the application with proper packages, and you are good to go&lt;/li&gt;
&lt;li&gt;JDBC, Spark and other packages are not packaged together with the application. They have a provided dependency. This allows us to keep the jar to 150 Kb and provide users with more flexibility&lt;/li&gt;
&lt;li&gt;Can compare two different source types&lt;/li&gt;
&lt;li&gt;Writes output as parquet (this is planned to be configurable. Issue &lt;a href="https://github.com/AbsaOSS/hermes/issues/72"&gt;#72&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Only compares data sets with the same schema. We have a complex schema comparison so the schema does not have to be aligned, but it has to be the same. (We have a plan for selective comparisons in the future)&lt;/li&gt;
&lt;li&gt;Will write &lt;code&gt;_METRICS&lt;/code&gt; file at the end (this will be written next to the &lt;code&gt;parquet&lt;/code&gt;)

&lt;ul&gt;
&lt;li&gt;If you passed or failed&lt;/li&gt;
&lt;li&gt;How many rows were processed&lt;/li&gt;
&lt;li&gt;If any duplicate rows were found&lt;/li&gt;
&lt;li&gt;Number of differences found&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Provides a precise path to what was wrong in the datasets. Even if the structure is complex (arrays of structs and the likes). This is written to the final parquet&lt;/li&gt;
&lt;li&gt;Final parquet holds only the rows that were different&lt;/li&gt;
&lt;li&gt;Prints summary to STDOUT&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Usage - Spark application
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; I will try to cover all of the tool's functionalities, but I will be skipping over &lt;code&gt;spark-submit&lt;/code&gt; configurations. That is beyond the scope of this text. I will also not cover how to set up your Hadoop and Apache Spark.&lt;/p&gt;

&lt;p&gt;In this use case, I will try to show possibilities of Hermes's dataset comparison. This use case covers usage as a Spark application. For usage as a library, look forward to a second article.&lt;/p&gt;

&lt;p&gt;To use Hermes's Dataset Comparison, you just need to know how to run &lt;code&gt;spark-submit&lt;/code&gt;, your data types, their properties/options and where the data is. Let's start with an easy example:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example 1&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-submit &lt;span class="se"&gt;\&lt;/span&gt;
&amp;lt;spark-options&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
dataset-comparison-0.2.0.jar &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--new-format&lt;/span&gt; csv &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--new-header&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--new-path&lt;/span&gt; /new/path &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ref-format&lt;/span&gt; xml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ref-rowTag&lt;/span&gt; alfa &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ref-path&lt;/span&gt; /ref/path &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--out-path&lt;/span&gt; /out/path &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--keys&lt;/span&gt; ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example 2&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spark-submit &lt;span class="se"&gt;\&lt;/span&gt;
&amp;lt;spark-options&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
dataset-comparison-0.2.0.jar &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--format&lt;/span&gt; xml &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--rowTag&lt;/span&gt; alfa &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--new-path&lt;/span&gt; /new/path &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--ref-path&lt;/span&gt; /ref/path &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--out-path&lt;/span&gt; /out/path &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--keys&lt;/span&gt; ID
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let's go over what these are. The job has one independent parameter, and that is &lt;code&gt;--keys&lt;/code&gt;. Keys refers to the set of primary keys. You can provide either a single primary key or a number of keys as a comma-delimited list in the form &lt;code&gt;ID1,ID2,ID3&lt;/code&gt;.&lt;/p&gt;
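&lt;p&gt;If your key columns live in a script, a small helper can build the comma-delimited value for you. This is only a sketch: &lt;code&gt;join_keys&lt;/code&gt; is a hypothetical helper name, not part of Hermes, and the column names are placeholders.&lt;/p&gt;

```shell
# join_keys is a hypothetical helper, not part of Hermes.
# It joins its arguments with commas, producing the form expected by --keys.
# The body runs in a subshell so the IFS change does not leak out.
join_keys() (
  IFS=,
  printf '%s\n' "$*"
)

# Placeholder key column names.
keys_arg="$(join_keys ID1 ID2 ID3)"
echo "--keys ${keys_arg}"
```

&lt;p&gt;A single key works the same way: &lt;code&gt;join_keys ID&lt;/code&gt; simply returns &lt;code&gt;ID&lt;/code&gt;.&lt;/p&gt;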

&lt;p&gt;Next up is &lt;code&gt;--out-path&lt;/code&gt;. For now, out-path can only be configured to specify the destination path for the parquet file which will contain the output differences and metrics. This is planned to change (&lt;a href="https://github.com/AbsaOSS/hermes/issues/72"&gt;#72&lt;/a&gt;), and it will have the same rules as &lt;code&gt;--ref&lt;/code&gt; and &lt;code&gt;--new&lt;/code&gt; prefixes.&lt;/p&gt;

&lt;p&gt;Last and (probably) hardest to grasp are the &lt;code&gt;--ref&lt;/code&gt; and &lt;code&gt;--new&lt;/code&gt; parameters. These are only prefixes to the Spark source type's standard options. Add &lt;code&gt;-format&lt;/code&gt; to specify the source format (type). Add &lt;code&gt;-path&lt;/code&gt; to specify the path, unless you are using the JDBC connector, in which case use &lt;code&gt;-dbtable&lt;/code&gt;. Any other options are prepended with the correct prefix (&lt;code&gt;--ref&lt;/code&gt; or &lt;code&gt;--new&lt;/code&gt;), depending on whether they apply to the reference data or to the new data that you are testing.&lt;/p&gt;

&lt;p&gt;These options can also be generalized. Taking a look at &lt;em&gt;Example 2&lt;/em&gt;, it has only &lt;code&gt;--format&lt;/code&gt;; no &lt;code&gt;--new-format&lt;/code&gt; or &lt;code&gt;--ref-format&lt;/code&gt;. This is because both source types are &lt;code&gt;XML&lt;/code&gt; and both have the same &lt;code&gt;rowTag&lt;/code&gt;.&lt;br&gt;
In this case, there is no need to specify this twice. If both source types were &lt;code&gt;XML&lt;/code&gt; but had different &lt;code&gt;rowTag&lt;/code&gt;s, then the &lt;code&gt;--ref-rowTag&lt;/code&gt; and &lt;code&gt;--new-rowTag&lt;/code&gt; options would need to be specified.&lt;/p&gt;
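&lt;p&gt;The prefix rule can be sketched in a few lines of shell. The helper below is hypothetical (it is not shipped with Hermes) and only illustrates how a source option such as &lt;code&gt;format&lt;/code&gt; or &lt;code&gt;rowTag&lt;/code&gt; turns into a prefixed command-line option.&lt;/p&gt;

```shell
# prefixed_opts is a hypothetical helper, not part of Hermes.
# Usage: prefixed_opts PREFIX key=value [key=value ...]
# Prints each pair as "--PREFIX-key value", one per line.
# The body runs in a subshell so its variables stay local.
prefixed_opts() (
  prefix="$1"
  shift
  for kv in "$@"; do
    key="${kv%%=*}"    # text before the first "="
    value="${kv#*=}"   # text after the first "="
    printf '%s\n' "--${prefix}-${key} ${value}"
  done
)

# The per-source options from Example 1, written as key=value pairs.
prefixed_opts new format=csv header=true path=/new/path
prefixed_opts ref format=xml rowTag=alfa path=/ref/path
```

&lt;p&gt;Each printed line corresponds to one of the prefixed options used in &lt;em&gt;Example 1&lt;/em&gt;.&lt;/p&gt;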

&lt;p&gt;After running this, just run &lt;code&gt;hdfs dfs -ls /out/path&lt;/code&gt; and take a look at the results. If there were any differences, you should find a parquet file that has a new column added called &lt;code&gt;errCol&lt;/code&gt;. This error column will be filled with paths highlighting differences in your structure.&lt;br&gt;
Its schema is (pretty simple):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="n"&gt;root&lt;/span&gt;
 &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;errCol&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;array&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;nullable&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
 &lt;span class="o"&gt;|&lt;/span&gt;    &lt;span class="o"&gt;|--&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="k"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;containsNull&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Summing-up
&lt;/h2&gt;

&lt;p&gt;Hermes should be an easy-to-use testing tool and framework. Its dataset comparison module currently holds the most value, even outside of AbsaOSS, and I hope it can help you solve an issue or two. If you have any questions about this or any of our projects, just send us a message or create a &lt;code&gt;Question&lt;/code&gt; issue on GitHub.&lt;/p&gt;

&lt;p&gt;I am looking forward to your comments. See you in the next article: usage as a library.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>scala</category>
      <category>testing</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building Hadoop native libraries on Mac in 2019</title>
      <dc:creator>Saša Zejnilović</dc:creator>
      <pubDate>Mon, 20 May 2019 15:37:17 +0000</pubDate>
      <link>https://dev.to/zejnilovic/building-hadoop-native-libraries-on-mac-in-2019-1iee</link>
      <guid>https://dev.to/zejnilovic/building-hadoop-native-libraries-on-mac-in-2019-1iee</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR to be found at the end&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recently I found myself in a situation where I "needed" Hadoop native libraries. Well, when I say "needed", I mean I was just getting fed up with constant warnings like this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;WARN util.NativeCodeLoader: Unable to load native-hadoop library &lt;span class="k"&gt;for &lt;/span&gt;your platform... using builtin-java classes where applicable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;So I thought I would build my own Hadoop native libraries. How hard can it be, right? Honest answer? Less than an hour if you don't have a tutorial. Fifteen minutes if you do, and most of that is compilation time. In my search, I found that a lot of tutorials and guides were either outdated or didn't cover everything needed for a full compilation and installation. That is why I wrote my own, which I tested on two independent Macs, so it should be "tested enough".&lt;/p&gt;
&lt;h2&gt;
  
  
  Why do it
&lt;/h2&gt;

&lt;p&gt;There was no real-world issue I was hoping to solve. I just had a few minutes on my hands and I used them to learn something new. But I did read that there are cases of speed improvements, which is good if you are developing or testing something locally, because local machines tend to be slow and any improvement is more than welcome. I also saw two random articles a while back whose authors had some issues with the Java libraries, but the chances of you having the same issues are really small.&lt;/p&gt;
&lt;h2&gt;
  
  
  Dependencies
&lt;/h2&gt;

&lt;p&gt;First of all, we need to install the dependencies for the build and I am including links so you can check what you are going to install exactly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://gcc.gnu.org/"&gt;gcc&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gnu.org/software/autoconf/"&gt;autoconf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gnu.org/software/automake/"&gt;automake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gnu.org/software/libtool/"&gt;libtool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cmake.org/"&gt;cmake&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google/snappy"&gt;snappy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gzip.org/"&gt;gzip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.bzip.org/"&gt;bzip2&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zlib.net/"&gt;zlib&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gnu.org/software/wget/"&gt;wget&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.openssl.org/"&gt;openssl 1.0&lt;/a&gt;

&lt;ul&gt;
&lt;li&gt;1.1 on Brew has an issue. More in the comments section. Thanks to &lt;a class="mentioned-user" href="https://dev.to/imasli"&gt;@imasli&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/protocolbuffers/protobuf"&gt;protobuf 2.5.0&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Please note I am skipping Maven, Java and others that I think you would already have, as well as the Hadoop installation. If I am wrong, tell me and let's update the article. There is a beautiful article about Hadoop installation on Mac by Zhang Hao &lt;a href="https://isaacchanghau.github.io/post/install_hadoop_mac/"&gt;here&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the installation of most of these, I will be using &lt;a href="https://brew.sh/"&gt;Homebrew&lt;/a&gt;. It's a good tool, has a one-liner installation and a very short average time to be productive with it. As the link provides everything you need I am skipping the installation here. &lt;/p&gt;

&lt;p&gt;If you are not using Homebrew for the first time, update and upgrade your tools. If you have been using it for some time already and would like to keep some things at their current version, use &lt;code&gt;brew pin&lt;/code&gt; like &lt;a href="https://docs.brew.sh/FAQ#how-do-i-stop-certain-formulae-from-being-updated"&gt;this&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update&lt;/span&gt;
brew update
brew upgrade

&lt;span class="c"&gt;# Then the installation&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;wget gcc autoconf automake libtool cmake snappy &lt;span class="nb"&gt;gzip &lt;/span&gt;bzip2 zlib openssl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
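&lt;p&gt;After the installation you may want to confirm that the command-line tools actually ended up on your &lt;code&gt;PATH&lt;/code&gt;. The helper below is a hypothetical sketch, not part of Homebrew. It only checks executables, so libraries such as snappy, zlib and openssl are not covered, and a tool installed under a different command name would be reported as missing.&lt;/p&gt;

```shell
# missing_tools is a hypothetical helper for illustration.
# It prints every name from its arguments that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>/dev/null; then
      echo "$tool"
    fi
  done
}

# Check a few of the executables installed above; no output means all were found.
missing_tools wget gcc autoconf automake cmake gzip bzip2
```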


&lt;p&gt;As you may have noticed, one of the listed dependencies is missing from the install command above. Yes! It is &lt;code&gt;protobuf&lt;/code&gt; 2.5.0, which has been deprecated and can't be easily installed from Homebrew. So let's build our own. It's cleaner that way and much more fun than it sounds. We will first need to get it from GitHub and unarchive it somewhere. You can delete it right after, so you don't need a special folder structure.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; protobuf-2.5.0.tar.gz
&lt;span class="nb"&gt;cd &lt;/span&gt;protobuf-2.5.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Then comes the process of building and making sure everything went smoothly. It takes some time and I advise you to run it step by step to see and know what is happening. Some warnings here and there are normal so you can skip those.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./configure
make
make check
make &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;span class="c"&gt;# And just to check if everything is ok.&lt;/span&gt;
&lt;span class="c"&gt;# This should print libprotoc 2.5.0&lt;/span&gt;
protoc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
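&lt;p&gt;If you script these steps, the version check at the end can be turned into a test instead of something you read by eye. This is a small sketch under the assumption that the expected output is exactly &lt;code&gt;libprotoc 2.5.0&lt;/code&gt;, as stated above; the function name &lt;code&gt;is_protoc_2_5_0&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```shell
# Hypothetical helper: succeeds only when the given version line
# is exactly the one expected from the protobuf 2.5.0 build.
is_protoc_2_5_0() {
  [ "$1" = "libprotoc 2.5.0" ]
}

# Usage (assumes protoc is already on PATH):
#   is_protoc_2_5_0 "$(protoc --version)" || echo "unexpected protoc version"
```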

&lt;h2&gt;
  
  
  OpenSSL setup
&lt;/h2&gt;

&lt;p&gt;Next, link the OpenSSL headers by hand: Homebrew refuses to link OpenSSL (the formula is keg-only), but the compiler needs them. This is known behavior, and the workaround is to create the symlink yourself with &lt;code&gt;ln&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /usr/local/include
&lt;span class="nb"&gt;ln&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; ../opt/openssl/include/openssl &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This fixes an error that looks something like the output below.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; Configuring incomplete, errors occurred!
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; See also /Users/user/github/hadoop/hadoop-tools/hadoop-pipes/target/native/CMakeFiles/CMakeOutput.log.
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; CMake Error at /usr/local/Cellar/cmake/3.14.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137 &lt;span class="o"&gt;(&lt;/span&gt;message&lt;span class="o"&gt;)&lt;/span&gt;:
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;   Could NOT find OpenSSL, try to &lt;span class="nb"&gt;set &lt;/span&gt;the path to OpenSSL root folder &lt;span class="k"&gt;in &lt;/span&gt;the
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;   system variable OPENSSL_ROOT_DIR &lt;span class="o"&gt;(&lt;/span&gt;missing: OPENSSL_INCLUDE_DIR&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; Call Stack &lt;span class="o"&gt;(&lt;/span&gt;most recent call first&lt;span class="o"&gt;)&lt;/span&gt;:
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;   /usr/local/Cellar/cmake/3.14.3/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378 &lt;span class="o"&gt;(&lt;/span&gt;_FPHSA_FAILURE_MESSAGE&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;   /usr/local/Cellar/cmake/3.14.3/share/cmake/Modules/FindOpenSSL.cmake:413 &lt;span class="o"&gt;(&lt;/span&gt;find_package_handle_standard_args&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;   CMakeLists.txt:20 &lt;span class="o"&gt;(&lt;/span&gt;find_package&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
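&lt;p&gt;Running &lt;code&gt;ln -s&lt;/code&gt; a second time fails because the link already exists, so if you keep these steps in a script it may be worth guarding the call. A minimal sketch, assuming the same relative layout as in the commands above; the function name &lt;code&gt;link_openssl_headers&lt;/code&gt; is hypothetical, and the prefix is passed as an argument because it may differ between setups.&lt;/p&gt;

```shell
# Hypothetical helper: create PREFIX/include/openssl as a symlink to
# ../opt/openssl/include/openssl, but only when nothing exists there yet.
link_openssl_headers() {
  prefix="$1"
  target="$prefix/include/openssl"
  if [ -e "$target" ] || [ -L "$target" ]; then
    echo "already present: $target"
    return 0
  fi
  ln -s ../opt/openssl/include/openssl "$target"
  echo "linked: $target"
}

# Usage, with the prefix from the commands above:
#   link_openssl_headers /usr/local
```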

&lt;h2&gt;
  
  
  Building native libraries
&lt;/h2&gt;

&lt;p&gt;And finally, the building of the libraries! Again, this creates a folder that you can delete at the end. This is probably the first place where you need to modify something, namely the version of Hadoop you are using.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/apache/hadoop.git
&lt;span class="nb"&gt;cd &lt;/span&gt;hadoop
&lt;span class="c"&gt;# Change the version as needed&lt;/span&gt;
git checkout branch-&amp;lt;VERSION&amp;gt;
&lt;span class="c"&gt;# And just package.&lt;/span&gt;
mvn package &lt;span class="nt"&gt;-Pdist&lt;/span&gt;,native &lt;span class="nt"&gt;-DskipTests&lt;/span&gt; &lt;span class="nt"&gt;-Dtar&lt;/span&gt;
&lt;span class="c"&gt;# After the build, copy your newly created libraries.&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; hadoop-dist/target/hadoop-&amp;lt;VERSION&amp;gt;/lib &lt;span class="nv"&gt;$HADOOP_HOME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Setting up environment variables
&lt;/h2&gt;

&lt;p&gt;Now the critical part: making your shell see the libraries. Whichever shell you use, put this into its profile (&lt;code&gt;.bashrc&lt;/code&gt;, &lt;code&gt;.zshrc&lt;/code&gt;, etc.):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HADOOP_OPTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"-Djava.library.path=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/lib/native"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;LD_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$LD_LIBRARY_PATH&lt;/span&gt;:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/lib/native
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;JAVA_LIBRARY_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$JAVA_LIBRARY_PATH&lt;/span&gt;:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HADOOP_HOME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/lib/native
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
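&lt;p&gt;One detail of the exports above: when &lt;code&gt;LD_LIBRARY_PATH&lt;/code&gt; or &lt;code&gt;JAVA_LIBRARY_PATH&lt;/code&gt; is empty to begin with, the result starts with a stray &lt;code&gt;:&lt;/code&gt;. If that bothers you, a small helper can skip the separator. This is only a sketch; &lt;code&gt;append_path&lt;/code&gt; is a hypothetical name, not something Hadoop provides.&lt;/p&gt;

```shell
# Hypothetical helper: append DIR to a colon-separated list, leaving out
# the separator when the list is empty so no leading ":" is produced.
append_path() {
  current="$1"
  dir="$2"
  if [ -z "$current" ]; then
    echo "$dir"
  else
    echo "$current:$dir"
  fi
}

# Usage (assumes HADOOP_HOME is already exported):
#   export LD_LIBRARY_PATH="$(append_path "$LD_LIBRARY_PATH" "$HADOOP_HOME/lib/native")"
```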


&lt;p&gt;These variables point Hadoop at the native libraries, and everything should fall into place. The last thing left is to check that everything is OK (and by everything I mean almost everything: &lt;code&gt;bzip2&lt;/code&gt; is acting up and I have not yet found a way to fix it; when I do, I will update this).&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;hadoop checknative &lt;span class="nt"&gt;-a&lt;/span&gt;

&lt;span class="c"&gt;#The output should be something like this.&lt;/span&gt;
19/05/17 19:00:14 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
19/05/17 19:00:14 INFO zlib.ZlibFactory: Successfully loaded &amp;amp; initialized native-zlib library
Native library checking:
hadoop:  &lt;span class="nb"&gt;true&lt;/span&gt; /usr/local/Cellar/hadoop/2.7.5/lib/native/libhadoop.dylib
zlib:    &lt;span class="nb"&gt;true&lt;/span&gt; /usr/lib/libz.1.dylib
snappy:  &lt;span class="nb"&gt;true&lt;/span&gt; /usr/local/lib/libsnappy.1.dylib
lz4:     &lt;span class="nb"&gt;true &lt;/span&gt;revision:99
bzip2:   &lt;span class="nb"&gt;false
&lt;/span&gt;openssl: &lt;span class="nb"&gt;true&lt;/span&gt; /usr/lib/libcrypto.35.dylib
19/05/17 19:00:14 INFO util.ExitUtil: Exiting with status 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
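&lt;p&gt;To spot the failing libraries without reading the whole report, the status column can be filtered. A minimal sketch, assuming the output keeps the &lt;code&gt;name: true|false&lt;/code&gt; layout shown above; the function name &lt;code&gt;failed_native_libs&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
# Hypothetical helper: read `hadoop checknative -a` output on stdin and
# print the name of every library whose status column is "false".
failed_native_libs() {
  awk '$2 == "false" { sub(/:$/, "", $1); print $1 }'
}

# Usage:
#   hadoop checknative -a | failed_native_libs
```

&lt;p&gt;For the sample output above, this would print only &lt;code&gt;bzip2&lt;/code&gt;.&lt;/p&gt;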

&lt;h2&gt;
  
  
  Afterword
&lt;/h2&gt;

&lt;p&gt;Hopefully, everything is running smoothly and you no longer get those warnings. If I helped even one person with this, I am glad, because without added value for the reader it is just me talking to my wall. On the other hand, if you find any issues in the code or the article, please tell me and I will fix everything I can.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;a&gt;&lt;/a&gt; TL;DR
&lt;/h2&gt;

&lt;p&gt;This is just a step-by-step shell script extracted from the text above.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



</description>
      <category>bigdata</category>
      <category>hadoop</category>
      <category>macos</category>
    </item>
  </channel>
</rss>
