<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DataPotion</title>
    <description>The latest articles on DEV Community by DataPotion (@datapotion).</description>
    <link>https://dev.to/datapotion</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F763822%2F2814be41-355d-423b-b963-c7bc52dc7474.png</url>
      <title>DEV Community: DataPotion</title>
      <link>https://dev.to/datapotion</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datapotion"/>
    <language>en</language>
    <item>
      <title>How to filter columns in HBase Shell</title>
      <dc:creator>DataPotion</dc:creator>
      <pubDate>Fri, 08 Jul 2022 20:27:59 +0000</pubDate>
      <link>https://dev.to/datapotion/how-to-filter-columns-in-hbase-shell-4384</link>
      <guid>https://dev.to/datapotion/how-to-filter-columns-in-hbase-shell-4384</guid>
      <description>&lt;p&gt;One of the most useful functions available in HBase are filters. In this article, I will show you how to use filters to get lookup columns with specific values. Every example is shown in Hbase Shell, but they are also available in the API, so you can use it in your ETL application as well.&lt;/p&gt;

&lt;p&gt;First, let's check which filters are available to us from the HBase Shell. To do this, run the &lt;code&gt;show_filters&lt;/code&gt; command:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hdL1ES6Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xb6g0cxte4bs00hpelhh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hdL1ES6Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xb6g0cxte4bs00hpelhh.gif" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, there are many filters available, but since we are focusing on filtering by column values, the three of interest are ColumnValueFilter, ValueFilter, and SingleColumnValueFilter. All of these filters take very similar arguments, and all of them can be used for filtering columns.&lt;/p&gt;

&lt;p&gt;Let's assume we have a table named myTable with the following contents.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rowkey&lt;/th&gt;
&lt;th&gt;myColumnFamily:columnA&lt;/th&gt;
&lt;th&gt;myColumnFamily:columnB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;row1&lt;/td&gt;
&lt;td&gt;value1&lt;/td&gt;
&lt;td&gt;valueX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;row2&lt;/td&gt;
&lt;td&gt;value2&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;row3&lt;/td&gt;
&lt;td&gt;value3&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;special_row3&lt;/td&gt;
&lt;td&gt;value4&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hbase(main):003:0&amp;gt; scan 'myTable'
ROW                             COLUMN+CELL
 row1                           column=myColumnFamily:columnA, timestamp=1656785481871, value=value1
 row2                           column=myColumnFamily:columnA, timestamp=1656785500397, value=value2
 row3                           column=myColumnFamily:columnA, timestamp=1656785504786, value=value3
 special_row3                   column=myColumnFamily:columnA, timestamp=1656785539020, value=value4

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ColumnValueFilter
&lt;/h2&gt;

&lt;p&gt;The first option for filtering values is ColumnValueFilter. &lt;br&gt;
As arguments we pass the column family and the column name in which we want to search, a comparison operator, and a comparator. &lt;br&gt;
In this example we search for a value that equals value1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scan 'myTable', {FILTER =&amp;gt; "ColumnValueFilter('myColumnFamily','columnA',=,'binary:value1')"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ColumnValueFilter returns only the cells that match our filtering statement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fBuJqzsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nq3e9rw8eqsxpbu5ksvz.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fBuJqzsl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nq3e9rw8eqsxpbu5ksvz.gif" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hbase(main):010:0&amp;gt; scan 'myTable', {FILTER =&amp;gt; "ColumnValueFilter('myColumnFamily','columnA',=,'binary:value1')"}
ROW                             COLUMN+CELL
 row1                           column=myColumnFamily:columnA, timestamp=1656785481871, value=value1
1 row(s)
Took 0.0188 seconds
hbase(main):011:0&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SingleColumnValueFilter
&lt;/h2&gt;

&lt;p&gt;SingleColumnValueFilter returns the entire row when the specified condition is matched. Apart from that, it works exactly like ColumnValueFilter.&lt;/p&gt;

&lt;p&gt;This example also shows another form of comparison, the substring comparator: a value matches when it contains the given substring.&lt;/p&gt;
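The comparator string before the colon selects the comparison type: binary requires an exact match, while substring matches when the value contains the given fragment. Here is a minimal pure-Python sketch of how such comparator strings can be interpreted (illustrative only, not HBase's actual implementation):

```python
def matches(comparator: str, value: str) -> bool:
    """Interpret an HBase-shell-style comparator string against a cell value."""
    kind, _, operand = comparator.partition(":")
    if kind == "binary":
        return value == operand          # exact match, e.g. 'binary:value1'
    if kind == "substring":
        return operand in value          # containment, e.g. 'substring:value'
    raise ValueError(f"unknown comparator: {kind}")

print(matches("binary:value1", "value1"))     # True
print(matches("substring:value", "value3"))   # True
print(matches("substring:X", "value3"))       # False
```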

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1Hlb_gR1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4741qhty6bhmq6fwr9mw.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1Hlb_gR1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4741qhty6bhmq6fwr9mw.gif" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hbase(main):022:0&amp;gt; scan 'myTable', {FILTER =&amp;gt; "SingleColumnValueFilter('myColumnFamily','columnA',=,'substring:value')"}

ROW                             COLUMN+CELL
 row1                           column=myColumnFamily:columnA, timestamp=1656785481871, value=value1
 row1                           column=myColumnFamily:columnB, timestamp=1656802766312, value=valueX
 row2                           column=myColumnFamily:columnA, timestamp=1656785500397, value=value2
 row3                           column=myColumnFamily:columnA, timestamp=1656785504786, value=value3
 special_row3                   column=myColumnFamily:columnA, timestamp=1656785539020, value=value4
4 row(s)
Took 0.0253 seconds
hbase(main):023:0&amp;gt; scan 'myTable', {FILTER =&amp;gt; "SingleColumnValueFilter('myColumnFamily','columnA',=,'substring:X')"}
ROW                             COLUMN+CELL
0 row(s)
Took 0.0047 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ &lt;strong&gt;WARNING&lt;/strong&gt; ⚠️&lt;br&gt;
If the searched column is not found in a row, every column of that row will be emitted, i.e. the row passes the filter as if it had matched.&lt;/p&gt;
&lt;h2&gt;
  
  
  ValueFilter
&lt;/h2&gt;

&lt;p&gt;With ValueFilter we don't provide a column family or column name, just the comparison itself, which makes it search in every possible column. ValueFilter returns only the cells that match the statement.&lt;/p&gt;

&lt;p&gt;From ValueFilter documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To test the value of a single qualifier when scanning multiple qualifiers, use SingleColumnValueFilter&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---9H1hW9a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l55ldh5mca4m9e5rw7yu.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---9H1hW9a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l55ldh5mca4m9e5rw7yu.gif" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hbase(main):004:0&amp;gt; scan 'myTable', {FILTER =&amp;gt; "ValueFilter(=,'binary:valueX')"}
ROW                                   COLUMN+CELL
 row1                                 column=myColumnFamily:columnB, timestamp=1656802766312, value=valueX
1 row(s)
Took 0.0106 seconds

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ValueFilter searches for specific values in every column and returns only the matching cells&lt;/li&gt;
&lt;li&gt;SingleColumnValueFilter returns the entire row when the value in the specified column matches&lt;/li&gt;
&lt;li&gt;ColumnValueFilter returns only the matching cells from the specified column&lt;/li&gt;
&lt;/ul&gt;
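To make those differences concrete, here is a small pure-Python simulation of the three behaviors, using the same data as the scans above. This only mimics the semantics described in this article; it is not HBase code:

```python
# Each row maps column name to value, mirroring the scans of myTable above.
table = {
    "row1": {"columnA": "value1", "columnB": "valueX"},
    "row2": {"columnA": "value2"},
    "row3": {"columnA": "value3"},
    "special_row3": {"columnA": "value4"},
}

def column_value_filter(col, wanted):
    """ColumnValueFilter: only the matching cells from the specified column."""
    return {rk: {col: cols[col]} for rk, cols in table.items()
            if cols.get(col) == wanted}

def single_column_value_filter(col, wanted):
    """SingleColumnValueFilter: entire rows whose specified column matches."""
    return {rk: cols for rk, cols in table.items() if cols.get(col) == wanted}

def value_filter(wanted):
    """ValueFilter: search every column, return only the matching cells."""
    out = {}
    for rk, cols in table.items():
        hits = {c: v for c, v in cols.items() if v == wanted}
        if hits:
            out[rk] = hits
    return out

print(column_value_filter("columnA", "value1"))        # just the one cell
print(single_column_value_filter("columnA", "value1")) # the whole row1
print(value_filter("valueX"))                          # found in columnB
```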

</description>
      <category>database</category>
      <category>nosql</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>What's new in Apache Spark 3.3.0</title>
      <dc:creator>DataPotion</dc:creator>
      <pubDate>Sun, 19 Jun 2022 11:57:03 +0000</pubDate>
      <link>https://dev.to/datapotion/whats-new-in-apache-spark-330-4m1b</link>
      <guid>https://dev.to/datapotion/whats-new-in-apache-spark-330-4m1b</guid>
      <description>&lt;p&gt;On June 16, a new version of Apache Spark was released. In this article, I'm going to present some of its highlighted features, including Kubernetes custom operators, row-level runtime filtering, and more. Grab your favorite beverage and have a pleasant reading 🍵&lt;/p&gt;

&lt;h2&gt;
  
  
  Row-level Runtime Filtering
&lt;/h2&gt;

&lt;p&gt;This feature aims to improve the performance of large table joins by reducing the amount of shuffled data.&lt;br&gt;
When a broadcast join occurs, Spark generates a dynamic filter based on the smaller side of the join and pushes this filter down to the scan of the larger table. &lt;br&gt;
For more information and benchmarks, check out the design document of this feature: &lt;a href="https://docs.google.com/document/d/1ndwynp1RPxaQ0dxykCWD-KW4MYxFWv-ojJzOMAei1S8/edit#"&gt;link&lt;/a&gt;&lt;/p&gt;
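The idea can be sketched in plain Python: derive a filter from the join keys of the small side and apply it to the large side before any shuffle, so rows that cannot possibly join are dropped early. This is only a conceptual sketch of the technique, not Spark's implementation (Spark uses bloom filters and semi-join reduction internally):

```python
# Conceptual sketch of row-level runtime filtering for a join.
small_side = [("k1", "dim1"), ("k3", "dim3")]
large_side = [("k1", 10), ("k2", 20), ("k3", 30), ("k4", 40)]

# 1. Derive a runtime filter from the small side's join keys.
runtime_filter = {key for key, _ in small_side}

# 2. Push the filter down to the large side's scan: rows that cannot
#    possibly join are discarded before any shuffle happens.
pre_filtered = [row for row in large_side if row[0] in runtime_filter]

# 3. Only the surviving rows participate in the actual join.
joined = [(k, v, d) for k, v in pre_filtered
          for k2, d in small_side if k == k2]

print(pre_filtered)  # [('k1', 10), ('k3', 30)]
print(joined)        # [('k1', 10, 'dim1'), ('k3', 30, 'dim3')]
```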
&lt;h2&gt;
  
  
  Better ANSI compliance
&lt;/h2&gt;

&lt;p&gt;ANSI compliance is a special mode for Spark SQL that changes dialect from Hive to... ANSI compliant. &lt;br&gt;
When &lt;strong&gt;spark.sql.ansi.enabled&lt;/strong&gt; is set to &lt;strong&gt;true&lt;/strong&gt;, Spark SQL uses an ANSI compliant dialect instead of being Hive compliant. It causes the following changes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark SQL will throw runtime exceptions on invalid operations, including integer overflow errors, string parsing errors, etc.
&lt;/li&gt;
&lt;li&gt;Spark will use different type coercion rules for resolving conflicts among data types. The rules are consistently based on data type precedence.&lt;/li&gt;
&lt;/ol&gt;
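As an illustration of change 1, here is a pure-Python simulation of the difference for integer overflow (the helper name is hypothetical; in real Spark you simply toggle spark.sql.ansi.enabled): the Hive-compatible dialect silently wraps an overflowing 32-bit integer, while the ANSI dialect raises an error.

```python
INT_MAX = 2_147_483_647  # 32-bit signed integer maximum

def add_ints(a, b, ansi_enabled):
    """Simulate Spark's integer addition under the two SQL dialects."""
    result = a + b
    # Overflow if result is above INT_MAX or below INT_MIN (-INT_MAX - 1).
    if result > INT_MAX or -result > INT_MAX + 1:
        if ansi_enabled:
            raise ArithmeticError("integer overflow")  # ANSI: fail loudly
        # Hive-compatible: wrap like 32-bit two's-complement arithmetic
        result = (result + 2**31) % 2**32 - 2**31
    return result

print(add_ints(INT_MAX, 1, ansi_enabled=False))  # -2147483648 (silent wrap)
# add_ints(INT_MAX, 1, ansi_enabled=True) would raise ArithmeticError
```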

&lt;p&gt;With version 3.3.0 Apache Spark brings the following changes to ANSI mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New type coercion syntax rules in ANSI mode &lt;a href="https://issues.apache.org/jira/browse/SPARK-34246"&gt;SPARK-34246&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;ANSI mode: disable ANSI reserved keywords by default &lt;a href="https://issues.apache.org/jira/browse/SPARK-37724"&gt;SPARK-37724&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;New explicit cast syntax rules in ANSI mode &lt;a href="https://issues.apache.org/jira/browse/SPARK-33354"&gt;SPARK-33354&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Elt() should return null if index is null under ANSI mode &lt;a href="https://issues.apache.org/jira/browse/SPARK-38304"&gt;SPARK-38304&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Profiling for Python/Pandas UDF
&lt;/h2&gt;

&lt;p&gt;Profiling was already available in PySpark, but only for RDDs. Starting with 3.3.0 you can also use it on UDFs! &lt;br&gt;
Profiling is not enabled by default so you have to set &lt;strong&gt;spark.python.profile&lt;/strong&gt; to &lt;strong&gt;true&lt;/strong&gt;.&lt;br&gt;
If you want to give it a shot locally in REPL then run it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pyspark &lt;span class="nt"&gt;--conf&lt;/span&gt; &lt;span class="s2"&gt;"spark.python.profile=true"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is a little demonstration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;math&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"float"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"float"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;applied&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show_profiles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="o"&gt;============================================================&lt;/span&gt;
&lt;span class="n"&gt;Profile&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;UDF&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;============================================================&lt;/span&gt;
         &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;

   &lt;span class="n"&gt;Ordered&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;internal&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cumulative&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

   &lt;span class="n"&gt;ncalls&lt;/span&gt;  &lt;span class="n"&gt;tottime&lt;/span&gt;  &lt;span class="n"&gt;percall&lt;/span&gt;  &lt;span class="n"&gt;cumtime&lt;/span&gt;  &lt;span class="n"&gt;percall&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;lineno&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="mi"&gt;10&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;built&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
       &lt;span class="mi"&gt;10&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="mi"&gt;10&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="s"&gt;'disable'&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="s"&gt;'_lsprof.Profiler'&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="o"&gt;============================================================&lt;/span&gt;
&lt;span class="n"&gt;Profile&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;UDF&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="o"&gt;============================================================&lt;/span&gt;
         &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt;

   &lt;span class="n"&gt;Ordered&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;internal&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cumulative&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

   &lt;span class="n"&gt;ncalls&lt;/span&gt;  &lt;span class="n"&gt;tottime&lt;/span&gt;  &lt;span class="n"&gt;percall&lt;/span&gt;  &lt;span class="n"&gt;cumtime&lt;/span&gt;  &lt;span class="n"&gt;percall&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;lineno&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="mi"&gt;10&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;stdin&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="mi"&gt;10&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;built&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
       &lt;span class="mi"&gt;10&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt;    &lt;span class="mf"&gt;0.000&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="s"&gt;'disable'&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="s"&gt;'_lsprof.Profiler'&lt;/span&gt; &lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Support Parquet complex types in vectorized reader
&lt;/h2&gt;

&lt;p&gt;The vectorized reader for Parquet speeds up decoding of files. It is used by default to read Parquet files and can be toggled with the &lt;strong&gt;spark.sql.parquet.enableVectorizedReader&lt;/strong&gt; property. To enable it for complex types you have to set &lt;strong&gt;spark.sql.parquet.enableNestedColumnVectorizedReader&lt;/strong&gt; to &lt;strong&gt;true&lt;/strong&gt;. This option speeds up reading Parquet files with nested column types such as struct, array, and map.&lt;/p&gt;

&lt;h2&gt;
  
  
  New DataSource V2 capabilities
&lt;/h2&gt;

&lt;p&gt;DataSource V2 (DSV2 for short) is a mechanism in Spark SQL for accessing data from external sources. Besides just fetching the data, it offers optimizations that speed up transfer by delegating work, such as aggregates, to the data source itself, reducing network IO. Version 3.3.0 adds the following push-down capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Support push down top N to JDBC data source V2&lt;/li&gt;
&lt;li&gt;Support partial aggregate push-down AVG&lt;/li&gt;
&lt;li&gt;Support DS V2 complete aggregate push down&lt;/li&gt;
&lt;/ol&gt;
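The benefit of push-down is easy to demonstrate against any SQL source: instead of fetching every row and computing top-N or an average in Spark, the query is rewritten so the database does the work. A sketch with Python's built-in sqlite3 standing in for a JDBC source (illustrative only; with DSV2, Spark generates this SQL for you):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("a", 10.0), ("b", 30.0), ("c", 20.0), ("d", 40.0)])

# Without push-down, the engine would fetch all rows and sort/aggregate
# itself. With push-down, the source executes the ORDER BY/LIMIT and AVG:
top_2 = conn.execute(
    "SELECT item, amount FROM sales ORDER BY amount DESC LIMIT 2").fetchall()
avg = conn.execute("SELECT AVG(amount) FROM sales").fetchone()[0]

print(top_2)  # [('d', 40.0), ('b', 30.0)]
print(avg)    # 25.0
```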

&lt;h2&gt;
  
  
  Introduce Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches
&lt;/h2&gt;

&lt;p&gt;The trigger setting describes the timing of streaming data processing: whether the query is going to be executed as a micro-batch query with a fixed batch interval or as a continuous processing query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.spark.sql.streaming.Trigger&lt;/span&gt;

&lt;span class="c1"&gt;// Available-now trigger&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;writeStream&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"console"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;trigger&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;Trigger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;AvailableNow&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;start&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Until now, streaming queries with Trigger.Once would always load all of the available data in a single batch. Because of this, the amount of data such queries can process is limited; beyond that, the Spark driver runs out of memory.&lt;br&gt;
That is why a new trigger called Trigger.AvailableNow was introduced: it processes all currently available data, just like Trigger.Once, but splits it into multiple batches.&lt;/p&gt;
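Conceptually, the difference can be sketched as splitting the available data into bounded micro-batches instead of one giant batch. This is a pure-Python illustration of the batching idea, not Spark's API:

```python
def trigger_once(available):
    """Trigger.Once: everything in a single batch (may exhaust memory)."""
    return [available]

def trigger_available_now(available, max_batch_size):
    """Trigger.AvailableNow: the same data, split into bounded batches,
    after which the query stops on its own."""
    return [available[i:i + max_batch_size]
            for i in range(0, len(available), max_batch_size)]

records = list(range(10))
print(trigger_once(records))              # one batch of 10 records
print(trigger_available_now(records, 4))  # batches of 4, 4, and 2 records
```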
&lt;h2&gt;
  
  
  Kubernetes custom operators
&lt;/h2&gt;

&lt;p&gt;Spark on Kubernetes has been supported since Spark 2.3 in 2017 and was marked GA in Spark 3.1.1 a year ago. The default Kubernetes scheduler only supports pod-based scheduling; it cannot provide job-based scheduling for Spark applications. New operators have emerged on the horizon, like &lt;a href="https://volcano.sh/en/"&gt;Volcano&lt;/a&gt; from CNCF and &lt;a href="https://yunikorn.apache.org"&gt;Yunikorn&lt;/a&gt; from the Apache Foundation. They support older versions of Spark, but only by injecting the operator after the application is deployed.&lt;br&gt;
Version 3.3.0 brings native support, and a custom scheduler can be set at the moment of submitting our application, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/spark-submit  &lt;span class="se"&gt;\&lt;/span&gt;
    --master k8s://https://127.0.0.1:60250  &lt;span class="se"&gt;\&lt;/span&gt;
    --deploy-mode cluster  &lt;span class="se"&gt;\&lt;/span&gt;
    --conf spark.executor.instances&lt;span class="o"&gt;=&lt;/span&gt;1  &lt;span class="se"&gt;\&lt;/span&gt;
    --conf spark.kubernetes.scheduler&lt;span class="o"&gt;=&lt;/span&gt;volcano &lt;span class="se"&gt;\&lt;/span&gt;
    --conf spark.kubernetes.job.queue&lt;span class="o"&gt;=&lt;/span&gt;queue1 &lt;span class="se"&gt;\&lt;/span&gt;
    --conf spark.kubernetes.namespace&lt;span class="o"&gt;=&lt;/span&gt;spark &lt;span class="se"&gt;\&lt;/span&gt;
    --conf spark.kubernetes.authenticate.driver.serviceAccountName&lt;span class="o"&gt;=&lt;/span&gt;spark-sa  &lt;span class="se"&gt;\&lt;/span&gt;
    --conf spark.kubernetes.container.image&lt;span class="o"&gt;=&lt;/span&gt;spark:latest  &lt;span class="se"&gt;\&lt;/span&gt;
    --class org.apache.spark.examples.SparkPi  &lt;span class="se"&gt;\&lt;/span&gt;
    --name spark-pi  &lt;span class="se"&gt;\&lt;/span&gt;
    local:///opt/spark/examples/jars/spark-examples_2.12-3.3.0-SNAPSHOT.jar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more information, check the &lt;br&gt;
&lt;a href="https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#"&gt;design document of the feature&lt;/a&gt;&lt;br&gt;
and the related Jira issue &lt;a href="https://issues.apache.org/jira/browse/SPARK-36057"&gt;SPARK-36057&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Thanks for reading!
&lt;/h2&gt;

&lt;p&gt;For a full list of new features check out &lt;a href="https://spark.apache.org/releases/spark-release-3-3-0.html"&gt;release notes&lt;/a&gt; &lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>scala</category>
      <category>python</category>
      <category>news</category>
    </item>
  </channel>
</rss>
