<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subash Sivaji</title>
    <description>The latest articles on DEV Community by Subash Sivaji (@subashsivaji).</description>
    <link>https://dev.to/subashsivaji</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F180660%2Ff5874617-7312-4ace-9777-6d11d44478fc.jpg</url>
      <title>DEV Community: Subash Sivaji</title>
      <link>https://dev.to/subashsivaji</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/subashsivaji"/>
    <language>en</language>
    <item>
      <title>Azure event grid advanced filter limit</title>
      <dc:creator>Subash Sivaji</dc:creator>
      <pubDate>Tue, 25 Feb 2020 13:29:09 +0000</pubDate>
      <link>https://dev.to/subashsivaji/azure-event-grid-advanced-filter-limit-296n</link>
      <guid>https://dev.to/subashsivaji/azure-event-grid-advanced-filter-limit-296n</guid>
      <description>&lt;p&gt;The azure event grid advanced filter limit is 25 filter values on a single event grid subscription, so this can be across one or more advanced filters within that single event grid subscription.&lt;/p&gt;

&lt;p&gt;If you have just one advanced filter in an Event Grid subscription, that single filter can still hold all 25 filter values.&lt;/p&gt;

&lt;p&gt;However, the Azure portal only allows us to create 5 filter values within an advanced filter (even if you have just one advanced filter!). Programmatically (CLI, PowerShell, etc.) you can add more than 5 filter values. Here I use the Azure CLI to create an Event Grid subscription.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;az eventgrid event-subscription create  --name "egsub-test" `
    --source-resource-id $source_resource_id `
    --endpoint-type "storagequeue" `
    --endpoint $endpoint `
    --advanced-filter data.api StringIn PutBlob PutBlockList FlushWithClose `
    --advanced-filter subject StringContains $test_objects `
    --subject-begins-with "$($generic_blob_prefix)/source/blobs/testpath/" `
    --subject-ends-with ".csv"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this example I have 2 advanced filters, "data.api" and "subject".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The "data.api" contains 3 filter values "PutBlob" "PutBlockList" "FlushWithClose"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I can have 22 more filter values in this Event Grid subscription.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The $test_objects variable within the "subject" filter contains 22 filter values (note: I have hidden the 22 values in a variable for brevity).&lt;/li&gt;
&lt;/ul&gt;
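&lt;p&gt;The counting above can be sketched as a quick pre-flight check (a minimal illustration; the "subject" values below are hypothetical stand-ins for the contents of $test_objects):&lt;/p&gt;

```python
# Event Grid allows at most 25 filter values across ALL advanced
# filters on a single event subscription.
MAX_ADVANCED_FILTER_VALUES = 25

advanced_filters = {
    "data.api": ["PutBlob", "PutBlockList", "FlushWithClose"],
    # hypothetical stand-ins for the 22 values hidden in $test_objects
    "subject": ["object-{}".format(i) for i in range(22)],
}

total = sum(len(values) for values in advanced_filters.values())
if total > MAX_ADVANCED_FILTER_VALUES:
    raise ValueError("{} filter values exceeds the limit of {}".format(
        total, MAX_ADVANCED_FILTER_VALUES))
print(total)  # 25 -- exactly at the limit
```

&lt;p&gt;Running a check like this before calling the CLI avoids a failed deployment when the 25-value limit would be exceeded.&lt;/p&gt;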

&lt;p&gt;Once executed, this is how it looks in the portal. The Event Grid subscription was tested and filters as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ex6F0Wal--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3dq3zaz94luf4hfckl0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ex6F0Wal--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3dq3zaz94luf4hfckl0h.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I increase the $test_objects variable to 23 values, the CLI script throws the following error:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The total number of AdvancedFilter values allowed on a single event subscription is 25.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At the time of writing, I found the documentation and the portal UI slightly misleading, hence this post.&lt;br&gt;
&lt;a href="https://docs.microsoft.com/en-us/azure/event-grid/event-filtering#advanced-filtering"&gt;https://docs.microsoft.com/en-us/azure/event-grid/event-filtering#advanced-filtering&lt;/a&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>eventgrid</category>
    </item>
    <item>
      <title>Types of Apache Spark tables and views</title>
      <dc:creator>Subash Sivaji</dc:creator>
      <pubDate>Wed, 27 Nov 2019 11:37:15 +0000</pubDate>
      <link>https://dev.to/subashsivaji/types-of-apache-spark-tables-and-views-4pai</link>
      <guid>https://dev.to/subashsivaji/types-of-apache-spark-tables-and-views-4pai</guid>
      <description>&lt;h5&gt;
  
  
  Global Managed Table
&lt;/h5&gt;

&lt;p&gt;A managed table is a Spark SQL table for which Spark manages both the data and the metadata. A global managed table is available across all clusters. When you drop the table, both the data and the metadata are dropped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Global Unmanaged/External Table
&lt;/h5&gt;

&lt;p&gt;Spark manages the metadata, while you control the data location. As soon as you add the ‘path’ option to the DataFrame writer, the table is treated as a global external/unmanaged table. When you drop the table, only the metadata is dropped. A global unmanaged/external table is available across all clusters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;your-storage-path&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Local Table (a.k.a) Temporary Table (a.k.a) Temporary View
&lt;/h5&gt;

&lt;p&gt;Scoped to the Spark session. A local table is not accessible from other clusters (or, if you are using Databricks notebooks, from other notebooks) and is not registered in the metastore.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Global Temporary View
&lt;/h5&gt;

&lt;p&gt;Scoped to the Spark application, global temporary views are tied to the system-preserved temporary database global_temp. This view can be shared across different Spark sessions (or, if you are using Databricks notebooks, across notebooks).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceGlobalTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_global_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It can be accessed as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"global_temp.my_global_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Global Permanent View
&lt;/h5&gt;

&lt;p&gt;Persists a DataFrame as a permanent view. The view definition is recorded in the underlying metastore. You can only create a permanent view on a global managed or unmanaged table; you are not allowed to create one on top of a temporary view or a DataFrame. Note: permanent views are only available via the SQL API, not the DataFrame API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CREATE VIEW permanent_view AS SELECT * FROM table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There isn’t a function like &lt;em&gt;dataframe.createOrReplacePermanentView()&lt;/em&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Reference:
&lt;/h5&gt;

&lt;p&gt;&lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.saveAsTable"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.saveAsTable&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.azuredatabricks.net/spark/latest/spark-sql/language-manual/create-table.html"&gt;https://docs.azuredatabricks.net/spark/latest/spark-sql/language-manual/create-table.html&lt;/a&gt;&lt;br&gt;
&lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceTempView"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceTempView&lt;/a&gt;&lt;br&gt;
&lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceGlobalTempView"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceGlobalTempView&lt;/a&gt;&lt;br&gt;
&lt;a href="https://stackoverflow.com/questions/48552620/is-it-possible-to-create-persistent-view-in-spark"&gt;https://stackoverflow.com/questions/48552620/is-it-possible-to-create-persistent-view-in-spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Migrated from Medium; &lt;a href="https://medium.com/@subashsivaji/types-of-apache-spark-tables-and-views-f468e2e53af2"&gt;originally posted here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Dissecting azure storage cost for an analytical workload</title>
      <dc:creator>Subash Sivaji</dc:creator>
      <pubDate>Fri, 22 Nov 2019 18:09:15 +0000</pubDate>
      <link>https://dev.to/subashsivaji/dissecting-azure-storage-cost-for-an-analytical-workload-9lc</link>
      <guid>https://dev.to/subashsivaji/dissecting-azure-storage-cost-for-an-analytical-workload-9lc</guid>
      <description>&lt;p&gt;Azure Storage is cheap to just store data. For example 1 Terabytes of data storage cost starts from ~£15 per month. &lt;/p&gt;

&lt;p&gt;In reality, though, we want to read and write the data and run some analytics on top of it (i.e. perform operations on the data). When you embark on an analytical project, it is important to have a decent estimate of both storage and operation costs.&lt;/p&gt;

&lt;p&gt;For this example I am only going to focus on the &lt;em&gt;read&lt;/em&gt; operation. API calls such as ReadFile and ListFilesystemFile are considered read operations.&lt;/p&gt;

&lt;p&gt;As per the &lt;a href="https://azure.microsoft.com/en-gb/pricing/details/storage/data-lake/"&gt;Azure pricing doc&lt;/a&gt;, reading 4MB of data is counted as 10,000 operations.&lt;/p&gt;

&lt;p&gt;If I read 8MB of data &lt;strong&gt;once&lt;/strong&gt;, that is 8MB/4MB = 2 chunks of 4MB, so 2 x 10,000 = 20,000 operations.&lt;/p&gt;

&lt;p&gt;If I read 16MB of data &lt;strong&gt;once&lt;/strong&gt;, that is 16MB/4MB = 4 chunks of 4MB, so 4 x 10,000 = 40,000 operations.&lt;/p&gt;

&lt;p&gt;If I read 500MB of data &lt;strong&gt;once&lt;/strong&gt;, that is 500MB/4MB = 125 chunks of 4MB, so 125 x 10,000 = 1,250,000 operations.&lt;/p&gt;

&lt;p&gt;As per the Azure pricing doc, £0.0042 is the unit cost per 10,000 read operations.&lt;/p&gt;

&lt;p&gt;£0.0042 x 125 units (1,250,000/10,000) = £0.525 for reading 500MB of data &lt;strong&gt;once&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Say I have a Spark or Azure Databricks script that is scheduled every 15 minutes and reads this 500MB of data each time. The price per day (1,440 minutes) for reading the data will be:&lt;/p&gt;

&lt;p&gt;Cost per run = £0.525&lt;br&gt;
Cost per day (there are 96 fifteen-minute intervals per day): 96 x £0.525 = £50.40&lt;/p&gt;

&lt;p&gt;That's £1,512 per month (30 days) to read 500MB of data every 15 minutes.&lt;/p&gt;
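&lt;p&gt;The arithmetic above can be sketched in Python (a minimal illustration; the £0.0042 rate and the 4MB/10,000-operation accounting are taken from the pricing page at the time of writing, so check current rates before relying on these numbers):&lt;/p&gt;

```python
import math

CHUNK_MB = 4                # reads are billed in 4MB chunks
PRICE_PER_10K_OPS = 0.0042  # GBP per 10,000 read operations

def read_cost_gbp(size_mb):
    """Cost in GBP of reading size_mb of data once."""
    chunks = math.ceil(size_mb / CHUNK_MB)  # each chunk counts as 10,000 ops
    return chunks * PRICE_PER_10K_OPS

cost_per_run = read_cost_gbp(500)           # 125 chunks -> 0.525
runs_per_day = (24 * 60) // 15              # one run every 15 minutes -> 96
cost_per_day = cost_per_run * runs_per_day  # -> 50.40
cost_per_month = cost_per_day * 30          # -> 1,512.00
```

&lt;p&gt;Parameterising the calculation this way makes it easy to see how data size and schedule frequency each drive the monthly bill.&lt;/p&gt;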

&lt;p&gt;The above calculations helped me estimate the true cost of an analytical workload. Based on this, where possible we can apply optimisations such as partition pruning, projections or selections to the workload/scripts so that we read only the data we need.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
