<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Exacaster</title>
    <description>The latest articles on DEV Community by Exacaster (@exacaster).</description>
    <link>https://dev.to/exacaster</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4893%2Fee98febf-ca51-4453-ba24-b4c95b0641bb.png</url>
      <title>DEV Community: Exacaster</title>
      <link>https://dev.to/exacaster</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/exacaster"/>
    <language>en</language>
    <item>
      <title>Lightweight HTTP API for Big Data on S3</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Wed, 15 Mar 2023 15:50:29 +0000</pubDate>
      <link>https://dev.to/exacaster/lightweight-http-api-for-big-data-on-s3-3fnb</link>
      <guid>https://dev.to/exacaster/lightweight-http-api-for-big-data-on-s3-3fnb</guid>
      <description>&lt;p&gt;We are happy to announce our third open-source project - &lt;a href="https://github.com/exacaster/delta-fetch"&gt;Delta Fetch&lt;/a&gt;.&lt;br&gt;
Delta Fetch is a configurable HTTP API service for accessing &lt;a href="https://delta.io/"&gt;Delta Lake&lt;/a&gt; tables, with the ability to filter your Delta tables by selected columns.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;Delta Fetch relies heavily on Delta table metadata, which contains statistics about each Parquet file. The same metadata that is used for &lt;a href="https://docs.delta.io/latest/optimizations-oss.html#data-skipping"&gt;data skipping&lt;/a&gt; is used to read only the relevant files - in particular, the minimum and maximum value of each column in each file. The Delta table metadata is cached for better performance and can be refreshed by enabling auto cache update or by making API requests with the &lt;code&gt;...?exact=true&lt;/code&gt; query parameter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request handling flow:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user makes an API request to one of the configured API resources.&lt;/li&gt;
&lt;li&gt;Delta Fetch reads Delta table metadata from file storage and stores it in memory.&lt;/li&gt;
&lt;li&gt;Delta Fetch finds the relevant file paths in the stored metadata and starts reading them.&lt;/li&gt;
&lt;li&gt;Delta Fetch uses the Hadoop Parquet Reader implementation, which supports filter push down to avoid reading the entire file.&lt;/li&gt;
&lt;li&gt;Delta Fetch continues reading Parquet files one by one until the requested or configured limit is reached.&lt;/li&gt;
&lt;/ul&gt;
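&lt;p&gt;The min/max statistics step above can be sketched in a few lines. This is an illustrative toy (our own code with made-up file stats, not Delta Fetch internals):&lt;/p&gt;

```python
# Toy illustration of metadata-based file skipping: keep only Parquet files
# whose per-column [min, max] statistics range can contain the requested value.
# The file list and statistics below are made-up examples.

def relevant_files(files, column, value):
    """Return files whose min/max stats for `column` can contain `value`."""
    return [
        f for f in files
        if f["stats"][column]["min"] <= value <= f["stats"][column]["max"]
    ]

files = [
    {"path": "part-0001.parquet", "stats": {"user_id": {"min": "100", "max": "499"}}},
    {"path": "part-0002.parquet", "stats": {"user_id": {"min": "500", "max": "999"}}},
]

# Only part-0002.parquet can contain user_id "742", so only it is read.
print([f["path"] for f in relevant_files(files, "user_id", "742")])  # → ['part-0002.parquet']
```

&lt;p&gt;The real service applies the same idea to the statistics stored in the Delta transaction log, and pushes the remaining filtering down into the Parquet reader.&lt;/p&gt;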
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;Resources can be configured in the following way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/data/{table}/{identifier}&lt;/span&gt;
      &lt;span class="na"&gt;schema-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api/schemas/{table}/{identifier}&lt;/span&gt;
      &lt;span class="na"&gt;delta-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3a://bucket/delta/{table}/&lt;/span&gt;
      &lt;span class="na"&gt;response-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SINGLE&lt;/span&gt;
      &lt;span class="na"&gt;filter-variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
          &lt;span class="na"&gt;path-variable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;identifier&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;path&lt;/code&gt; property defines the API path that will be used to query your Delta tables. Path variables can be defined using curly braces, as shown in the example.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema-path&lt;/code&gt; (optional) property can be used to define an API path for the Delta table schema.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;delta-path&lt;/code&gt; property defines the S3 path of your Delta table. Path variables in this path will be filled in with the variables provided in the API path.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response-type&lt;/code&gt; (optional, default: &lt;code&gt;SINGLE&lt;/code&gt;) property defines whether to search for multiple resources or a single one. Use the &lt;code&gt;LIST&lt;/code&gt; type for multiple resources.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max-results&lt;/code&gt; (optional, default: &lt;code&gt;100&lt;/code&gt;) the maximum number of rows that can be returned when &lt;code&gt;response-type&lt;/code&gt; is &lt;code&gt;LIST&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;filter-variables&lt;/code&gt; (optional) additional filters applied to the Delta table.&lt;/li&gt;
&lt;/ul&gt;
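&lt;p&gt;To make the interplay between &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;delta-path&lt;/code&gt; and &lt;code&gt;filter-variables&lt;/code&gt; concrete, here is a hypothetical sketch of how such a resource definition could resolve an incoming request. The &lt;code&gt;resolve&lt;/code&gt; function is ours, for illustration only - it is not Delta Fetch code:&lt;/p&gt;

```python
# Hypothetical sketch: map a request path to a Delta table location and a
# column filter, mirroring the YAML resource configuration shown above.
import re

resource = {
    "path": "/api/data/{table}/{identifier}",
    "delta-path": "s3a://bucket/delta/{table}/",
    "filter-variables": [{"column": "id", "path-variable": "identifier"}],
}

def resolve(resource, request_path):
    # Turn "/api/data/{table}/{identifier}" into a named-group regex.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", resource["path"])
    variables = re.fullmatch(pattern, request_path).groupdict()
    # Fill the S3 path template with the captured path variables.
    delta_path = resource["delta-path"].format(**variables)
    # Map each configured filter column to its path-variable value.
    filters = {
        fv["column"]: variables[fv["path-variable"]]
        for fv in resource["filter-variables"]
    }
    return delta_path, filters

print(resolve(resource, "/api/data/users/42"))
# → ('s3a://bucket/delta/users/', {'id': '42'})
```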

&lt;p&gt;You can also configure one of two security mechanisms - Basic Auth or OAuth2 - and some caching parameters for better performance. Refer to the &lt;a href="https://github.com/exacaster/delta-fetch"&gt;Delta Fetch&lt;/a&gt; GitHub repo for more information.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommendations
&lt;/h2&gt;

&lt;p&gt;To be able to access the data in Parquet files quickly, you need to configure the block size to a smaller value than you normally would. We got acceptable results by setting &lt;code&gt;parquet.block.size&lt;/code&gt; to &lt;code&gt;1048576&lt;/code&gt; (1 MB).&lt;/p&gt;

&lt;p&gt;We also highly recommend &lt;strong&gt;not&lt;/strong&gt; using &lt;code&gt;OPTIMIZE ... ZORDER ...&lt;/code&gt; on tables that are exposed through Delta Fetch, since this command usually stores data split into 1 GB chunks. We suggest relying on simple data ordering by the columns that you plan to use as "keys" in the Delta Fetch API.&lt;/p&gt;

&lt;p&gt;More recommendations and considerations can be found on our &lt;a href="https://github.com/exacaster/delta-fetch/docs/recommendations.md"&gt;recommendations page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With those recommendations applied, we managed to get ~1s response times when requesting a single row by a single column value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/872480210503_234678
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:5,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"872480210503_234678"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 1% cpu 0.982 total
&lt;span class="nt"&gt;---&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/579520210231_237911
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:5,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"579520210231_237911"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-06-24"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 0% cpu 1.250 total
&lt;span class="nt"&gt;---&lt;/span&gt;
&lt;span class="nb"&gt;time &lt;/span&gt;curl http://localhost:8080/api/data/disable_optimize_ordered/875540210000_245810
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"version"&lt;/span&gt;:2,&lt;span class="s2"&gt;"data"&lt;/span&gt;:&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"user_id"&lt;/span&gt;:&lt;span class="s2"&gt;"875540210000_245810"&lt;/span&gt;,&lt;span class="s2"&gt;"sub_type"&lt;/span&gt;:&lt;span class="s2"&gt;"PREPAID"&lt;/span&gt;,&lt;span class="s2"&gt;"activation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt;,&lt;span class="s2"&gt;"status"&lt;/span&gt;:&lt;span class="s2"&gt;"ACTIVE"&lt;/span&gt;,&lt;span class="s2"&gt;"deactivation_date"&lt;/span&gt;:&lt;span class="s2"&gt;"9999-01-01"&lt;/span&gt;&lt;span class="o"&gt;}}&lt;/span&gt;
curl   0.00s user 0.01s system 1% cpu 0.870 total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We consider this API service experimental and hope to get some feedback and contributions from the open-source (and also dev.to :)) community. Let us know what you think about our new project.&lt;/p&gt;

</description>
      <category>deltalake</category>
      <category>bigdata</category>
      <category>opensource</category>
      <category>s3</category>
    </item>
    <item>
      <title>Testing PySpark &amp; Pandas in style</title>
      <dc:creator>Paulius</dc:creator>
      <pubDate>Thu, 10 Feb 2022 07:49:05 +0000</pubDate>
      <link>https://dev.to/exacaster/testing-pyspark-pandas-in-style-31cg</link>
      <guid>https://dev.to/exacaster/testing-pyspark-pandas-in-style-31cg</guid>
      <description>&lt;p&gt;Today we'd like to share a small utility package for testing DataFrames in PySpark and Pandas.&lt;/p&gt;

&lt;p&gt;If you are a fan of test-driven development and have had a chance to work on PySpark (or Pandas) projects, you've probably written tests similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;your_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_event_aggregation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"even_type"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"item_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"event_time"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"dt"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;expected_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'page_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2017-12-31"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;123456&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'item_view'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;68471513&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2017&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"uk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"2017-12-31"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;schema&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It works OK for small applications, but as your project gets bigger, the data gets more complicated and the number of tests grows, you might want a less tedious way to define test data.&lt;/p&gt;

&lt;p&gt;Exacaster alumnus &lt;a href="https://www.linkedin.com/in/vaidasarmonas"&gt;Vaidas Armonas&lt;/a&gt; came up with the idea of representing Spark DataFrames as markdown tables. This idea materialized into the testing package &lt;a href="https://pypi.org/project/markdown-frames/"&gt;markdown-frames&lt;/a&gt;. With this package, the test shown before can be replaced with this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark_test&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;markdown_frames.spark_dataframe&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;your_module&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_event_aggregation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;input_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;""" 
        |  user_id   | event_type  | item_id  |    event_time       | country  |     dt      |
        |   bigint   |   string    |  bigint  |    timestamp        |  string  |   string    |
        | ---------- | ----------- | -------- | ------------------- | -------- | ----------- |
        |   123456   |  page_view  |   None   | 2017-12-31 23:50:50 |   uk     | 2017-12-31  |
        |   123456   |  item_view  | 68471513 | 2017-12-31 23:50:55 |   uk     | 2017-12-31  |
    """&lt;/span&gt;
    &lt;span class="n"&gt;expected_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;assert_pyspark_df_equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It makes tests more readable and self-explanatory.&lt;/p&gt;
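&lt;p&gt;The underlying idea is simple: parse the markdown table (header row, type row, separator, data rows) into rows. A toy sketch of that parsing step - our own simplified code, not the markdown-frames implementation, and it ignores the type row that the real package uses for casting values:&lt;/p&gt;

```python
# Toy parser for markdown-style test tables: header row, type row,
# separator row, then data rows. Returns plain dicts (all values as strings);
# markdown-frames additionally casts values using the type row.

def parse_markdown_table(table):
    lines = [l.strip().strip("|") for l in table.strip().splitlines()]
    rows = [[cell.strip() for cell in line.split("|")] for line in lines]
    header, types, data = rows[0], rows[1], rows[3:]  # rows[2] is the separator
    return [dict(zip(header, row)) for row in data]

table = """
    |  user_id  | country |
    |  bigint   | string  |
    | --------- | ------- |
    |  123456   |   uk    |
"""
print(parse_markdown_table(table))  # → [{'user_id': '123456', 'country': 'uk'}]
```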

&lt;p&gt;Everything looks almost the same when you need to build a DataFrame for Pandas; you just need to use a different function:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from markdown_frames.pandas_dataframe import pandas_df&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Share in the comments if you know any other convenient tips &amp;amp; tricks for writing PySpark (and Pandas) unit tests.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>pandas</category>
      <category>testing</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Spark is lit once again</title>
      <dc:creator>Mindaugas</dc:creator>
      <pubDate>Fri, 29 Oct 2021 13:30:23 +0000</pubDate>
      <link>https://dev.to/exacaster/spark-is-lit-once-again-41p7</link>
      <guid>https://dev.to/exacaster/spark-is-lit-once-again-41p7</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;a class="mentioned-user" href="https://dev.to/pdambrauskas"&gt;@pdambrauskas&lt;/a&gt; and I are marking Hacktoberfest by releasing our little in-house project...&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lighter - Running Spark applications on Kubernetes
&lt;/h2&gt;

&lt;p&gt;Here at &lt;a href="https://exacaster.com" rel="noopener noreferrer"&gt;Exacaster&lt;/a&gt;, &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Spark&lt;/a&gt; applications have been used extensively for years. We started using them on our &lt;a href="https://hadoop.apache.org/" rel="noopener noreferrer"&gt;Hadoop&lt;/a&gt; clusters with &lt;a href="https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html" rel="noopener noreferrer"&gt;YARN&lt;/a&gt; as the application manager. However, with our recent product, we started moving towards a Cloud-based solution and decided to use Kubernetes for our infrastructure needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Livy
&lt;/h3&gt;

&lt;p&gt;When running Spark applications on YARN, you can submit jobs using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spark client&lt;/li&gt;
&lt;li&gt;Apache &lt;a href="https://github.com/apache/incubator-livy/" rel="noopener noreferrer"&gt;Livy&lt;/a&gt; - an open-source REST API for interacting with Apache Spark from anywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latter was our go-to solution at the time, when we were only using Spark on YARN. Sadly, Apache Livy is not maintained anymore: it has no K8s support, and its Spark client gets more outdated with every passing day. For some time we used &lt;a href="https://github.com/jahstreet/incubator-livy/tree/kubernetes-support-initial" rel="noopener noreferrer"&gt;@jahstreet's fork&lt;/a&gt;, which added K8s support. But since the Livy project wasn't receiving any updates, we decided to implement our own solution - &lt;a href="https://github.com/exacaster/lighter" rel="noopener noreferrer"&gt;Exacaster Lighter&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lighter
&lt;/h3&gt;

&lt;p&gt;Exacaster Lighter is heavily inspired by Apache Livy. The idea is the same: hide the Spark application client behind a REST API. However, we focus on running those applications on a K8s cluster; YARN mode is also supported. We designed our application to be extensible with different execution backends.&lt;/p&gt;

&lt;p&gt;Lighter has a lightweight, React-based UI written in TypeScript and a back end written in Java, with minor Python integration points.&lt;/p&gt;

&lt;p&gt;Simplified illustration of the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                              ┌────────────────────────────────────────────────────────────────────────────┐
                                              │ Lighter                                                                    │
                                              │     ┌────────────────────────────────────────────────────────────────┐     │
                                              │     │                                                                │     │
                                              │     │                         Internal storage                       │     │
                                              │     │                                                                │     │
                                              │     │                                                                │     │
                                              │     └▲────────▲────────────────────┬─────────────────────────┬───────┘     │
                                              │      │        │                    │                         │             │
                                              │  store app    │                 get│new apps            sync status        │
                                              │      │     check status            │                         │             │
┌────────────────────┐                    ┌───┴──────┴──────────┐           ┌──────▼─────────┐      ┌────────▼────────┐    │
│                    │                    │                     │           │                │      │                 │    │
│                    │  Submit            │                     │           │                │      │                 │    │
│                    ├────────────────────►                     │           │                │      │                 │    │
│      Client        │                    │       REST api      │           │  App executor  │      │ Status tracker  │    │
│                    │  Check status      │                     │           │                │      │                 │    │
│                    ◄────────────────────┤                     │           │                │      │                 │    │
│                    │                    │                     │           │                │      │                 │    │
│                    │                    │                     │           │                │      │                 │    │
└────────────────────┘                    └───┬─────────────────┘           └────────┬───────┘      └────────┬────────┘    │
                                              │                                      │                       │             │
                                              │                                   execute               get status         │
                                              │                                      │                       │             │
                                              │                              ┌───────▼───────────────────────▼──────┐      │
                                              │                              │                                      │      │
                                              │                              │                                      │      │
                                              │                              │                Backend               │      │
                                              │                              │               (YARN/K8s)             │      │
                                              │                              │                                      │      │
                                              │                              │                                      │      │
                                              │                              └──────────────────────────────────────┘      │
                                              │                                                                            │
                                              └────────────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More information can be found on our &lt;a href="https://github.com/exacaster/lighter/tree/master/docs/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; page.&lt;/p&gt;

&lt;h3&gt;
  
  
  UI
&lt;/h3&gt;

&lt;p&gt;This is the job list view:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l13dnf409vje3ych64k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0l13dnf409vje3ych64k.png" alt="Job list"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;You can see the configuration of the submitted job inside:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnthv9qjcm3t7ego932o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnthv9qjcm3t7ego932o.png" alt="Job configurations"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Driver logs are also available for each job:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojcy9jesy0il7bldj2jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojcy9jesy0il7bldj2jm.png" alt="Job logs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Glad you asked. It is quite simple. Lighter uses the &lt;a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/launcher/SparkLauncher.html" rel="noopener noreferrer"&gt;Spark Launcher&lt;/a&gt; to launch Spark applications on a Kubernetes cluster. The launcher takes care of creating all the Pods needed for the Spark application to run. When launching applications, we tag them with a unique identifier by setting the config property &lt;code&gt;spark.kubernetes.driver.label.spark-app-tag&lt;/code&gt;. We then use that identifier to check application status and retrieve application logs by calling the Pods API with the &lt;code&gt;labelSelector&lt;/code&gt; property.&lt;/p&gt;

&lt;p&gt;Things get a bit more complicated for interactive sessions. We've created a &lt;a href="https://github.com/jupyter-incubator/sparkmagic" rel="noopener noreferrer"&gt;Sparkmagic&lt;/a&gt;-compatible REST API so that the Sparkmagic kernel can communicate with Lighter the same way it does with Apache Livy. When a user creates an interactive session, the Lighter server submits a custom PySpark application containing an infinite loop that constantly checks for new commands to execute. Each Sparkmagic command is saved in a Java collection, retrieved by the PySpark application through the Py4J gateway, and executed.&lt;/p&gt;
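&lt;p&gt;That session loop can be sketched in miniature. In this toy version everything runs in one Python process, whereas in Lighter the queue lives on the Java side and is polled over Py4J:&lt;/p&gt;

```python
# Simplified, in-process sketch of the interactive-session loop: the server
# side enqueues submitted statements, and the session application repeatedly
# fetches and executes them against shared session state.
from queue import Queue, Empty

commands = Queue()                 # stands in for the Java-side collection
commands.put("result = 2 + 2")     # statements a Sparkmagic user might submit
commands.put("result *= 10")

session_state = {}                 # variables shared across statements
while True:
    try:
        code = commands.get(timeout=0.1)  # the real loop polls forever
    except Empty:
        break                              # toy version stops when drained
    exec(code, session_state)              # submitted statements run here

print(session_state["result"])  # → 40
```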

&lt;h3&gt;
  
  
  Use cases
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Spark on K8s
&lt;/h4&gt;

&lt;p&gt;Since Apache Spark 2.4, applications can be executed on a &lt;a href="https://spark.apache.org/docs/latest/running-on-kubernetes.html" rel="noopener noreferrer"&gt;K8s cluster&lt;/a&gt;. When you submit your Spark application, driver and executor pods are created for it and removed after the application completes. But if you want to track application status and report it to end users in a nice manner, things get complicated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Spark on YARN
&lt;/h4&gt;

&lt;p&gt;In the early days of the Big Data era, when K8s hadn't even been born yet, the common open-source go-to solution was the Hadoop stack. We had written several old-fashioned MapReduce jobs and scripts using &lt;a href="https://pig.apache.org/" rel="noopener noreferrer"&gt;Pig&lt;/a&gt; until we came across Spark. Since then, Spark has become one of the most popular data processing engines. It is very easy to start using Lighter on YARN deployments: just run the &lt;a href="https://github.com/exacaster/lighter/blob/master/docs/docker.md" rel="noopener noreferrer"&gt;docker&lt;/a&gt; image with the proper configuration and mount the necessary configurations in all the default paths.&lt;/p&gt;

&lt;h4&gt;
  
  
  Jupyterlab
&lt;/h4&gt;

&lt;p&gt;For ad-hoc data analysis, &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyterlab&lt;/a&gt; on top of Spark is an elegant solution. However, these two great tools cannot communicate with each other, so Lighter together with &lt;a href="https://github.com/jupyter-incubator/sparkmagic" rel="noopener noreferrer"&gt;SparkMagic&lt;/a&gt; acts as a bridge. You only need to provide the correct &lt;a href="https://github.com/exacaster/lighter/blob/master/docs/sparkmagic.md" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; to SparkMagic to get it working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing remarks
&lt;/h2&gt;

&lt;p&gt;Lighter is a freshly baked tool, open-sourced for everyone to use. Since we developed it for the use cases familiar to us, feel free to contribute if you see any opportunities to make it better.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>opensource</category>
      <category>hacktoberfest</category>
      <category>spark</category>
    </item>
  </channel>
</rss>
