<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subash Sivaji</title>
    <description>The latest articles on DEV Community by Subash Sivaji (@subashsivaji).</description>
    <link>https://dev.to/subashsivaji</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F180660%2Ff5874617-7312-4ace-9777-6d11d44478fc.jpg</url>
      <title>DEV Community: Subash Sivaji</title>
      <link>https://dev.to/subashsivaji</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/subashsivaji"/>
    <language>en</language>
    <item>
      <title>Azure event grid advanced filter limit</title>
      <dc:creator>Subash Sivaji</dc:creator>
      <pubDate>Tue, 25 Feb 2020 13:29:09 +0000</pubDate>
      <link>https://dev.to/subashsivaji/azure-event-grid-advanced-filter-limit-296n</link>
      <guid>https://dev.to/subashsivaji/azure-event-grid-advanced-filter-limit-296n</guid>
      <description>&lt;p&gt;The azure event grid advanced filter limit is 25 filter values on a single event grid subscription, so this can be across one or more advanced filters within that single event grid subscription.&lt;/p&gt;

&lt;p&gt;If you have just one advanced filter in an Event Grid subscription, that single filter can still hold all 25 filter values.&lt;/p&gt;

&lt;p&gt;However, the Azure portal only allows us to create 5 filter values within an advanced filter (even if you have just one advanced filter!). Programmatically (CLI, PowerShell, etc.) you can add more than 5 filter values. Here I use the Azure CLI to create an Event Grid subscription.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;az eventgrid event-subscription create  --name "egsub-test" `
    --source-resource-id $source_resource_id `
    --endpoint-type "storagequeue" `
    --endpoint $endpoint `
    --advanced-filter data.api StringIn PutBlob PutBlockList FlushWithClose `
    --advanced-filter subject StringContains $test_objects `
    --subject-begins-with "$($generic_blob_prefix)/source/blobs/testpath/" `
    --subject-ends-with ".csv"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In this example I have 2 advanced filters, "data.api" and "subject".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The "data.api" contains 3 filter values "PutBlob" "PutBlockList" "FlushWithClose"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I can have 22 more filter values in this Event Grid subscription.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The $test_objects variable within the "subject" filter contains 22 filter values (note: I have hidden the 22 values in a variable for brevity).&lt;/li&gt;
&lt;/ul&gt;
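&lt;p&gt;The counting above can be sketched as a quick pre-flight check (a minimal illustration; the "subject" values below are hypothetical stand-ins for the contents of $test_objects):&lt;/p&gt;

```python
# Event Grid allows at most 25 filter values across ALL advanced
# filters on a single event subscription.
MAX_ADVANCED_FILTER_VALUES = 25

advanced_filters = {
    "data.api": ["PutBlob", "PutBlockList", "FlushWithClose"],
    # hypothetical stand-ins for the 22 values hidden in $test_objects
    "subject": ["object-{}".format(i) for i in range(22)],
}

total = sum(len(values) for values in advanced_filters.values())
if total > MAX_ADVANCED_FILTER_VALUES:
    raise ValueError("{} filter values exceeds the limit of {}".format(
        total, MAX_ADVANCED_FILTER_VALUES))
print(total)  # 25 -- exactly at the limit
```

&lt;p&gt;Running a check like this before calling the CLI avoids a failed deployment when the 25-value limit would be exceeded.&lt;/p&gt;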

&lt;p&gt;Once executed, this is how it looks in the portal. The Event Grid subscription was tested and filters as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ex6F0Wal--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3dq3zaz94luf4hfckl0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ex6F0Wal--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3dq3zaz94luf4hfckl0h.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I increase the $test_objects variable to 23 values, the CLI script throws the following error:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The total number of AdvancedFilter values allowed on a single event subscription is 25.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At the time of writing, I found the documentation and the portal UI slightly misleading, hence this post.&lt;br&gt;
&lt;a href="https://docs.microsoft.com/en-us/azure/event-grid/event-filtering#advanced-filtering"&gt;https://docs.microsoft.com/en-us/azure/event-grid/event-filtering#advanced-filtering&lt;/a&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>eventgrid</category>
    </item>
    <item>
      <title>Types of Apache Spark tables and views</title>
      <dc:creator>Subash Sivaji</dc:creator>
      <pubDate>Wed, 27 Nov 2019 11:37:15 +0000</pubDate>
      <link>https://dev.to/subashsivaji/types-of-apache-spark-tables-and-views-4pai</link>
      <guid>https://dev.to/subashsivaji/types-of-apache-spark-tables-and-views-4pai</guid>
      <description>&lt;h5&gt;
  
  
  Global Managed Table
&lt;/h5&gt;

&lt;p&gt;A managed table is a Spark SQL table for which Spark manages both the data and the metadata. A global managed table is available across all clusters. When you drop the table, both the data and the metadata are dropped.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Global Unmanaged/External Table
&lt;/h5&gt;

&lt;p&gt;Spark manages the metadata, while you control the data location. As soon as you add the ‘path’ option to the DataFrame writer, the table is treated as a global external/unmanaged table. When you drop the table, only the metadata is dropped. A global unmanaged/external table is available across all clusters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'path'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"&amp;lt;your-storage-path&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Local Table (a.k.a) Temporary Table (a.k.a) Temporary View
&lt;/h5&gt;

&lt;p&gt;Scoped to the Spark session. A local table is not accessible from other clusters (or, if you are using Databricks notebooks, from other notebooks) and is not registered in the metastore.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Global Temporary View
&lt;/h5&gt;

&lt;p&gt;Scoped to the Spark application, global temporary views are tied to the system-preserved temporary database global_temp. This view can be shared across different Spark sessions (or, if you are using Databricks notebooks, across notebooks).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceGlobalTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"my_global_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It can be accessed as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"global_temp.my_global_view"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h5&gt;
  
  
  Global Permanent View
&lt;/h5&gt;

&lt;p&gt;Persists a DataFrame as a permanent view. The view definition is recorded in the underlying metastore. You can only create a permanent view on a global managed or unmanaged table; you are not allowed to create one on top of a temporary view or a DataFrame. Note: permanent views are only available via the SQL API, not the DataFrame API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CREATE VIEW permanent_view AS SELECT * FROM table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There isn’t a function like &lt;em&gt;dataframe.createOrReplacePermanentView()&lt;/em&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  Reference:
&lt;/h5&gt;

&lt;p&gt;&lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.saveAsTable"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.saveAsTable&lt;/a&gt;&lt;br&gt;
&lt;a href="https://docs.azuredatabricks.net/spark/latest/spark-sql/language-manual/create-table.html"&gt;https://docs.azuredatabricks.net/spark/latest/spark-sql/language-manual/create-table.html&lt;/a&gt;&lt;br&gt;
&lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceTempView"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceTempView&lt;/a&gt;&lt;br&gt;
&lt;a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceGlobalTempView"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createOrReplaceGlobalTempView&lt;/a&gt;&lt;br&gt;
&lt;a href="https://stackoverflow.com/questions/48552620/is-it-possible-to-create-persistent-view-in-spark"&gt;https://stackoverflow.com/questions/48552620/is-it-possible-to-create-persistent-view-in-spark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Migrated from Medium; &lt;a href="https://medium.com/@subashsivaji/types-of-apache-spark-tables-and-views-f468e2e53af2"&gt;originally posted here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>databricks</category>
    </item>
    <item>
      <title>Dissecting azure storage cost for an analytical workload</title>
      <dc:creator>Subash Sivaji</dc:creator>
      <pubDate>Fri, 22 Nov 2019 18:09:15 +0000</pubDate>
      <link>https://dev.to/subashsivaji/dissecting-azure-storage-cost-for-an-analytical-workload-9lc</link>
      <guid>https://dev.to/subashsivaji/dissecting-azure-storage-cost-for-an-analytical-workload-9lc</guid>
      <description>&lt;p&gt;Azure Storage is cheap to just store data. For example 1 Terabytes of data storage cost starts from ~£15 per month. &lt;/p&gt;

&lt;p&gt;In reality, though, we want to read and write the data and run some analytics on top of it (i.e. perform operations on the data). When you embark on an analytical project, it is important to have a decent estimate of both storage and operation costs.&lt;/p&gt;

&lt;p&gt;For this example I am only going to focus on the &lt;em&gt;read&lt;/em&gt; operation. API calls such as ReadFile and ListFilesystemFile are considered read operations.&lt;/p&gt;

&lt;p&gt;As per the &lt;a href="https://azure.microsoft.com/en-gb/pricing/details/storage/data-lake/"&gt;Azure pricing doc&lt;/a&gt;, reading 4MB of data is counted as 10,000 operations.&lt;/p&gt;

&lt;p&gt;If I read 8MB of data &lt;strong&gt;once&lt;/strong&gt;, that is 8MB/4MB = 2 chunks of 4MB, so 2 x 10,000 = 20,000 operations.&lt;/p&gt;

&lt;p&gt;If I read 16MB of data &lt;strong&gt;once&lt;/strong&gt;, that is 16MB/4MB = 4 chunks of 4MB, so 4 x 10,000 = 40,000 operations.&lt;/p&gt;

&lt;p&gt;If I read 500MB of data &lt;strong&gt;once&lt;/strong&gt;, that is 500MB/4MB = 125 chunks of 4MB, so 125 x 10,000 = 1,250,000 operations.&lt;/p&gt;

&lt;p&gt;As per the Azure pricing doc, £0.0042 is the unit cost per 10,000 read operations.&lt;/p&gt;

&lt;p&gt;£0.0042 x 125 units (1,250,000/10,000) = £0.525 for reading 500MB of data &lt;strong&gt;once&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Say I have a Spark or Azure Databricks script that is scheduled every 15 minutes and reads this 500MB of data each time. The price per day (1,440 minutes) for reading the data will be:&lt;/p&gt;

&lt;p&gt;Cost per run = £0.525&lt;br&gt;
Cost per day (there are 96 fifteen-minute intervals per day): 96 x £0.525 = £50.40&lt;/p&gt;

&lt;p&gt;That's £1,512 per month (30 days) to read 500MB of data every 15 minutes.&lt;/p&gt;
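&lt;p&gt;The arithmetic above can be sketched in Python (a minimal illustration; the £0.0042 rate and the 4MB/10,000-operation accounting are taken from the pricing page at the time of writing, so check current rates before relying on these numbers):&lt;/p&gt;

```python
import math

CHUNK_MB = 4                # reads are billed in 4MB chunks
PRICE_PER_10K_OPS = 0.0042  # GBP per 10,000 read operations

def read_cost_gbp(size_mb):
    """Cost in GBP of reading size_mb of data once."""
    chunks = math.ceil(size_mb / CHUNK_MB)  # each chunk counts as 10,000 ops
    return chunks * PRICE_PER_10K_OPS

cost_per_run = read_cost_gbp(500)           # 125 chunks -> 0.525
runs_per_day = (24 * 60) // 15              # one run every 15 minutes -> 96
cost_per_day = cost_per_run * runs_per_day  # -> 50.40
cost_per_month = cost_per_day * 30          # -> 1,512.00
```

&lt;p&gt;Parameterising the calculation this way makes it easy to see how data size and schedule frequency each drive the monthly bill.&lt;/p&gt;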

&lt;p&gt;The above calculations helped me estimate the true cost of an analytical workload. Based on this, where possible we can apply optimisations such as partition pruning, projections or selections to the workload/scripts so that we read only the data we need.&lt;/p&gt;

</description>
      <category>azure</category>
      <category>analytics</category>
    </item>
  </channel>
</rss>
