Adi Polak

Spark & AI summit and a glimpse of Spark 3.0

If there is a framework that super excites me, it's Apache Spark.
If there is a conference that excites me, it's the Spark & AI summit.

This year, because of the COVID-19 pandemic, the North America edition of the Spark & AI summit is online and free. No need to travel, buy an expensive flight ticket, or pay for accommodation and conference fees.

One caveat: the schedule follows Pacific time (PDT) friendly hours. I kind of wish the organizers would take a more global approach.

Having said that, the agenda and content look promising!

To get ready for the conference and learn about Spark 3.0, I decided to spin up a Spark 3.0 cluster with Azure Databricks. You can do the same or use the Databricks Community Edition.
Please note that with the Community Edition, there are no worker nodes.

The Workspace:

[Screenshot of the workspace]

Many exciting features; let's briefly look at two of them

  • Pandas UDFs and Python Type Hints
    This will probably be used mostly by the data science and Python developer communities. The feature lets us write more readable code and supports static code analysis by IDEs such as PyCharm.
    Read about it here.

  • SQL Join hints
    Before this change, the only join hint was the broadcast hash join hint.
    Meaning, if one of the tables in a join is small enough to fit in memory, the hint tells Spark to broadcast it to the executors for a faster join. The class in charge of it, ResolveBroadcastHints, was replaced with ResolveJoinStrategyHints.
    To learn more, check out the JIRA ticket: SPARK-27225.
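
As a rough illustration of the first feature, here is a minimal sketch of a Pandas UDF declared through Python type hints. The function and the temperature conversion are made up for this example, and `as_spark_udf` is a hypothetical helper that imports PySpark lazily; it assumes PySpark 3.0+, pandas, and PyArrow are installed.

```python
import pandas as pd


def fahrenheit_to_celsius(temps: pd.Series) -> pd.Series:
    """Plain pandas function: the type hints say Series in, Series out."""
    return (temps - 32) * 5.0 / 9.0


def as_spark_udf():
    """Wrap the function as a Pandas UDF (hypothetical helper for this sketch).

    PySpark is imported here, not at module level, so the pandas logic
    above can be tried out on its own without a Spark installation.
    """
    from pyspark.sql.functions import pandas_udf

    # With Series -> Series hints, Spark 3.0 can infer a scalar Pandas UDF,
    # so no explicit function type argument is passed.
    return pandas_udf(fahrenheit_to_celsius, returnType="double")
```

On a DataFrame `df` with a numeric `temp_f` column (both assumed here), `df.select(as_spark_udf()("temp_f"))` should apply the conversion batch by batch. Keeping the logic in a plain, type-hinted function is also what makes it easy for an IDE to analyze.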

List of available hints: BROADCAST, SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL.

To better understand how they work, I recommend checking out the Apache Spark open source code, specifically this file:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/hints.scala
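
To make the hint syntax concrete, here is a small sketch that builds a Spark SQL query with a join strategy hint. `hinted_join_sql` and the table names are hypothetical helpers for illustration only; the hint names are the Spark 3.0 join strategy hints, and the one-line descriptions are my own summaries, not text from the Spark source.

```python
# Join strategy hint names in Spark 3.0, with a short summary of each.
# Illustrative only; the authoritative definitions live in hints.scala.
JOIN_HINTS = {
    "BROADCAST": "broadcast join",
    "SHUFFLE_MERGE": "shuffle sort merge join",
    "SHUFFLE_HASH": "shuffle hash join",
    "SHUFFLE_REPLICATE_NL": "shuffle-and-replicate nested loop join",
}


def hinted_join_sql(left: str, right: str, key: str,
                    hint: str, hinted_table: str) -> str:
    """Return a SELECT that joins two tables with a join strategy hint.

    The hint goes in a /*+ ... */ comment right after SELECT and names
    the table the strategy should be applied to.
    """
    name = hint.upper()
    if name not in JOIN_HINTS:
        raise ValueError(f"Unknown join hint: {hint}")
    return (
        f"SELECT /*+ {name}({hinted_table}) */ * "
        f"FROM {left} JOIN {right} ON {left}.{key} = {right}.{key}"
    )


query = hinted_join_sql("orders", "customers", "customer_id",
                        "shuffle_hash", "customers")
print(query)
```

With a SparkSession `spark` and both tables registered as temporary views (assumed here), `spark.sql(query).explain()` should show which join strategy the planner picked. Keep in mind that a hint is a request: Spark may fall back to another strategy when the hinted one does not apply.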

If you are interested in learning more about Catalyst, the Spark SQL optimization engine, I wrote a deep dive on it here.

Top 4 recommended sessions

-1- End-to-End Deep Learning with Horovod on Apache Spark

For the last few months, I have been working on various autonomous car scenarios that involve a high load of data. One of the challenges I faced is enabling the data science team to run deep learning at scale. After digging in, I discovered the Horovod framework and the HorovodEstimator. I am excited to attend this session and learn more about it!
Curious? Read more about it here.

Session link.

-2- Building Reliable ML Pipelines with MLflow

If you have been following me for a while, you know I'm deep into how to build machine learning pipelines at scale.
Here is a GitHub repo describing an end-to-end platform I built for a Microsoft Build 2020 session. The platform includes MLflow, Azure Databricks, Azure Machine Learning, and social media text classification with scikit-learn. The repository includes data flow, architecture, tutorials, and code.

Session link.

  • Please note that this session is long (~1 hour) and is running multiple times during the online conference.

-3- An Approach to Data Quality for Netflix Personalization Systems

If you have watched my sessions on Big Data and ML, you know I always mention that:

You are only as Good as your Data

I am referring here to machine learning models, of course. We see many biased machine learning models due to unbalanced data and the misuse or lack of tools for assessing data quality. Many times during the data quality process we need to filter data out, and this is where having a large dataset can help. However, a large dataset brings challenges of its own.

This is why I am excited to hear from Netflix how they tackle these challenges.

BTW, if you would like to get familiar with data bias challenges, I recommend this short read from the Microsoft Research Blog.

Session link.

-4- The Apache Spark File Format Ecosystem

The Veraset software development team is closely involved with open source Spark initiatives such as DataSource V2 and the External Shuffle Service, so it will be interesting to hear from them how using the right file format can improve performance and enable predicate pushdown.
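
For readers new to predicate pushdown, here is a toy model of the idea in plain Python. It assumes a columnar-style layout where each row group stores min/max statistics, which lets a reader skip whole groups that cannot match a filter. This is a simplified sketch with made-up names, not how Spark or any specific file format implements it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of column values plus the statistics stored next to it."""
    min_value: int
    max_value: int
    values: List[int]


def write_row_groups(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into row groups and record min/max for each group."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_greater_than(groups: List[RowGroup],
                      threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushdown step: the statistics prove no value here can match,
        # so the group is skipped without reading its values.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = write_row_groups(list(range(1, 13)), group_size=4)
matches, groups_read = scan_greater_than(groups, threshold=8)
print(matches, groups_read)
```

In this run only one of the three row groups is read. Skipping works best when the data is sorted or clustered on the filtered column, which is one reason the choice and layout of a file format matter for performance.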

Session link.

That's it for now!

Thank you for reading so far.

These are my personal opinions about the summit.
If you enjoy reading, please follow me here on dev.to, Twitter, and LinkedIn.

Always happy to take your thoughts and opinions.

💡 Which session can't you wait to attend? What excites you about Apache Spark?