<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JY @ DataMechanics</title>
    <description>The latest articles on DEV Community by JY @ DataMechanics (@jystephan).</description>
    <link>https://dev.to/jystephan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F383959%2F32eeb0a8-d58a-4190-921d-890a1f36011b.jpg</url>
      <title>DEV Community: JY @ DataMechanics</title>
      <link>https://dev.to/jystephan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jystephan"/>
    <language>en</language>
    <item>
      <title>The New &amp; Improved Spark UI &amp; Spark History Server is now Generally Available</title>
      <dc:creator>JY @ DataMechanics</dc:creator>
      <pubDate>Fri, 07 May 2021 10:56:18 +0000</pubDate>
      <link>https://dev.to/jystephan/the-new-improved-spark-ui-spark-history-server-is-now-generally-available-4j71</link>
      <guid>https://dev.to/jystephan/the-new-improved-spark-ui-spark-history-server-is-now-generally-available-4j71</guid>
      <description>&lt;p&gt;Delight is a free, hosted, cross-platform monitoring dashboard for Apache Spark with memory and CPU metrics that will hopefully delight you!&lt;/p&gt;

&lt;p&gt;A year ago, we released a widely shared blog post: &lt;a href="https://www.datamechanics.co/blog-post/building-a-better-spark-ui-data-mechanics-delight"&gt;We’re building a better Spark UI&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;After a lot of engineering work, we’re proud to finally release &lt;a href="https://www.datamechanics.co/delight"&gt;Delight&lt;/a&gt;, our free, hosted, cross-platform monitoring dashboard for &lt;a href="https://www.datamechanics.co/apache-spark"&gt;Apache Spark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It works on top of Databricks, EMR, Dataproc, HDInsight, HDP/CDP, and open-source platforms too (anything that runs spark-submit, uses the spark-on-kubernetes operator, or uses Apache Livy).&lt;/p&gt;

&lt;p&gt;Delight helps you understand and improve the performance of your Spark applications. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Spark-centric CPU &amp;amp; memory metrics that we hope will delight you!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Spark UI — so you don’t need to run the Spark History Server yourself.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What we’ll cover in this article:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The motivation and timeline of the project&lt;/li&gt;
&lt;li&gt;Its high-level architecture&lt;/li&gt;
&lt;li&gt;A tour of the main screens and how they can help you in your day-to-day work&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Our vision and motivations for Delight&lt;/h2&gt;

&lt;p&gt;It’s difficult to troubleshoot the performance of an application from the Spark UI alone. There’s useful information in it, but it’s buried under a lot of noise, and a lot of tribal knowledge is involved in interpreting it. Moreover, the Spark UI does not show critical system metrics such as CPU usage, memory usage, and I/O.&lt;/p&gt;

&lt;p&gt;For those, you have to use a general-purpose observability tool (such as Ganglia). The problem with these tools is that they were not built for Spark: you need to jump between per-node metrics to build a rough aggregate in your head, then use timestamps to hop back and forth between them and the Spark UI.&lt;/p&gt;

&lt;p&gt;Our mission at &lt;a href="https://www.datamechanics.co"&gt;Data Mechanics&lt;/a&gt; is to make Apache Spark much more developer-friendly and cost-effective. This is why we built our managed &lt;a href="https://www.datamechanics.co/apache-spark-on-kubernetes"&gt;Spark-on-Kubernetes&lt;/a&gt; platform (available on AWS, GCP, and Azure), which automates the infrastructure parameters and Spark configurations to make our customers’ pipelines stable and performant.&lt;/p&gt;

&lt;p&gt;We also wanted to simplify the developer experience of data scientists and engineers, and the observability layer is a critical piece of this puzzle. We want to make it easy for anyone to understand the performance of their Spark code.&lt;/p&gt;

&lt;p&gt;This is why we set out to build Delight, an open-source monitoring dashboard for Spark. In June 2020, we shared &lt;a href="https://www.datamechanics.co/blog-post/building-a-better-spark-ui-data-mechanics-delight"&gt;our vision for Delight in a widely shared blog post&lt;/a&gt; featuring the GIF below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cAZCTttP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AQs2SXH0uhNaQIwmv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cAZCTttP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AQs2SXH0uhNaQIwmv.gif" alt="Our Delight design prototype, published in June 2020. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;Our Delight design prototype, published in June 2020. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Encouraged by hundreds of sign-ups and comments on our announcement, we got to work, and Delight turned out to be a large-scale engineering effort.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In November 2020, we released an MVP consisting of a hosted dashboard giving you access to the Spark UI for your terminated applications (saving you the trouble of running and maintaining your own Spark History Server).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In Q1 2021, we released Delight internally to our customers, and iterated on the feedback they gave, fixing many bugs and stability issues as we uncovered them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Today, a lot of work remains to achieve our final vision. But Delight has already proven valuable to many, so we’re excited to open it up to the Spark community!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Architecture: Open-Source Agent + Hosted Backend&lt;/h2&gt;

&lt;p&gt;Delight works for free on top of any Spark platform, including Databricks, Amazon EMR, Google Dataproc, or an open-source Spark-on-Kubernetes or Spark-on-YARN setup. It is compatible with running Spark applications using spark-submit, Apache Livy, or the Spark-on-Kubernetes operator.&lt;/p&gt;

&lt;p&gt;Delight consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;An &lt;a href="https://github.com/datamechanics/delight"&gt;open-sourced agent&lt;/a&gt; running inside your Spark applications, streaming Spark event metrics to the Delight backend.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A hosted backend responsible for collecting these metrics, parsing them, and hosting the dashboard at &lt;a href="https://delight.datamechanics.co/"&gt;https://delight.datamechanics.co&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
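&lt;p&gt;For the curious, here’s a minimal sketch of what wiring the agent into a Spark application looks like. The config keys and Maven coordinates below come from the Delight GitHub README at the time of writing; treat them as assumptions and check the repo for current values ("MY_ACCESS_TOKEN" is a placeholder for your own token):&lt;/p&gt;

```python
# Sketch: Spark properties that attach the open-source Delight agent.
# Keys and coordinates are taken from the Delight README at the time of
# writing (assumptions); "MY_ACCESS_TOKEN" is a placeholder.
delight_conf = {
    # Listener that streams Spark event metrics to the Delight backend
    "spark.extraListeners": "co.datamechanics.delight.DelightListener",
    # Secret token generated from the Delight dashboard
    "spark.delight.accessToken.secret": "MY_ACCESS_TOKEN",
    # Maven coordinates of the agent, resolved at submit time
    "spark.jars.packages": "co.datamechanics:delight_2.12:latest-SNAPSHOT",
    "spark.jars.repositories": "https://oss.sonatype.org/content/repositories/snapshots",
}

# These would typically be passed as spark-submit flags, or applied to a
# SparkSession builder via .config(key, value):
for key, value in delight_conf.items():
    print(f"--conf {key}={value}")
```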

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fKSNuBOC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AZhf6ZndZUeNEo8I5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fKSNuBOC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AZhf6ZndZUeNEo8I5.png" alt="High-level architecture: an open-source agent &amp;amp; a hosted backend. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;High-level architecture: an open-source agent &amp;amp; a hosted backend. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The metrics collected by Delight contain low-level metadata about your Spark application’s execution, like how much data is read/written by each task and how much memory &amp;amp; CPU is used. These metrics are called Spark event logs (&lt;a href="https://github.com/datamechanics/delight/blob/main/documentation/resources/example_spark_event_log.txt"&gt;view an example&lt;/a&gt;), and they’re the same information that lets the Spark History Server play back the Spark UI of completed applications.&lt;/p&gt;

&lt;p&gt;These metrics are protected using a secret access token you can generate in our dashboard, which uniquely identifies you. They are not shared with any third party, and they are automatically deleted from our servers after 30 days. An additional limit of 1,000 applications stored per organization applies (if you run more than 1,000 applications within 30 days, we’ll start deleting the oldest ones).&lt;/p&gt;

&lt;h2&gt;Taking a tour of the main screens&lt;/h2&gt;

&lt;h2&gt;1. The Main Dashboard&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NmD6_uOC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AJV-CkRDKsR_mlHL-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NmD6_uOC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AJV-CkRDKsR_mlHL-.png" alt="The main dashboard of Delight lists all your recently completed Spark applications. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;The main dashboard of Delight lists all your recently completed Spark applications. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you log in to Delight, you see this table of the most recent Spark applications you ran. Note that live applications are not visible yet; they only appear a couple of minutes after completion. Most of the information on this screen should be self-explanatory, except for three columns: Spark Tasks, CPU, and Efficiency. Before explaining them, here’s a quick reminder of how Spark distributed execution works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In a Spark &lt;strong&gt;application&lt;/strong&gt;, you have a single &lt;strong&gt;driver&lt;/strong&gt; process (the “brain” of your application), and many &lt;strong&gt;executor&lt;/strong&gt; processes. The driver is responsible for understanding your code and splitting it into &lt;strong&gt;jobs&lt;/strong&gt;, &lt;strong&gt;stages&lt;/strong&gt; and &lt;strong&gt;tasks&lt;/strong&gt;. A stage is a set of tasks which can be executed in parallel on your executors. One executor core can run (at most) one Spark task at a time; in other words, a Spark executor with 4 CPUs can run up to 4 tasks in parallel.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
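&lt;p&gt;To make the parallelism rule above concrete, here’s a small back-of-the-envelope model (plain arithmetic, not Spark code): tasks run in waves of at most one task per executor core.&lt;/p&gt;

```python
import math

def stage_wall_time(num_tasks, task_duration_s, total_cores):
    """Lower bound on a stage's wall time when all tasks take the same
    duration: each core runs at most one task at a time, so tasks
    proceed in waves of at most `total_cores` tasks."""
    waves = math.ceil(num_tasks / total_cores)
    return waves * task_duration_s

# One executor with 4 CPUs runs up to 4 tasks in parallel:
# 8 equal 10-second tasks finish in 2 waves, i.e. 20 seconds, not 80.
print(stage_wall_time(8, 10, 4))   # 20
# A stage with only 2 tasks cannot keep 4 cores busy, so extra cores sit idle:
print(stage_wall_time(2, 10, 4))   # 10
```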

&lt;p&gt;Let’s now define three columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark Tasks&lt;/strong&gt;. This is the sum of the duration of all the tasks in your application. &lt;em&gt;Side-note: This is the metric that the Data Mechanics platform &lt;a href="https://www.datamechanics.co/pricing"&gt;pricing&lt;/a&gt; is based on.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt;. This is the sum of the lifetime duration of your Spark executors, multiplied by the number of cores allocated to each executor. In other words, this is your “Executor Cores uptime”. This metric is generally proportional to your infrastructure costs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficiency&lt;/strong&gt;. This is the ratio of Spark Tasks to CPU, expressed as a percentage. It represents the fraction of the time that your CPU resources are effectively used to run Spark tasks. 100% is ideal (perfect parallelism); 0% means your compute resources are wasted (containers are running, but they’re idle).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
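&lt;p&gt;These definitions boil down to a few lines of arithmetic. A sketch (illustrative, not Delight’s actual implementation):&lt;/p&gt;

```python
def spark_tasks_seconds(task_durations_s):
    # Sum of the duration of all tasks in the application
    return sum(task_durations_s)

def cpu_seconds(executors):
    # "Executor Cores uptime": each executor's lifetime times its core count.
    # `executors` is a list of (uptime_seconds, num_cores) pairs.
    return sum(uptime * cores for uptime, cores in executors)

def efficiency_pct(task_durations_s, executors):
    return 100.0 * spark_tasks_seconds(task_durations_s) / cpu_seconds(executors)

# Two 4-core executors up for 600s each: CPU = 2 * 600 * 4 = 4800 core-seconds.
# 3600 seconds of total task time gives 75% efficiency.
print(efficiency_pct([1200, 1200, 1200], [(600, 4), (600, 4)]))  # 75.0
```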

&lt;p&gt;This is one of the key metrics we look at at Data Mechanics, as our platform has automation features (autoscaling, autotuning, …) to increase our customers’ efficiency and therefore reduce their cloud costs (check this &lt;a href="https://www.datamechanics.co/blog-post/migrating-from-emr-to-spark-on-kubernetes-with-data-mechanics"&gt;story of a customer who migrated away from EMR&lt;/a&gt; to learn more).&lt;/p&gt;

&lt;h2&gt;2. Executor Cores Usage&lt;/h2&gt;

&lt;p&gt;Once you click on a specific application, you get to an app overview page with high-level statistics and a link to the Spark UI at the top. Below that is the executor cores usage graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nF0MKxIv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AyQRCXpWAJlzJec6A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nF0MKxIv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2AyQRCXpWAJlzJec6A.png" alt="The executor cores usage graph gives a high-level view of your application’s performance. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;The executor cores usage graph gives a high-level view of your application’s performance. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The executor cores usage graph aggregates the low-level CPU metrics that Spark measures during your application’s execution into a simple visual overview. You first get a summary of your executor cores usage, broken down into the main categories (click the ℹ️ icon to read their definitions):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt; — Time spent actually computing Spark tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;I/O&lt;/strong&gt; — Time spent waiting for data to be read or written (usually over the network). &lt;em&gt;Note: Spark currently counts work done in a Python process as I/O (from the JVM’s point of view, it is waiting for Python to provide data). This happens, for example, when you execute Python user-defined functions. Check this &lt;a href="https://stackoverflow.com/questions/61816236/does-pyspark-code-run-in-jvm-or-python-subprocess"&gt;StackOverflow post&lt;/a&gt; to better understand how the JVM and Python work together in PySpark.&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt; — Time spent waiting for data blocks to be written or fetched during shuffles (data exchange phases across your Spark executors).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Garbage Collection&lt;/strong&gt; — Time spent by the JVM doing garbage collection. For most (healthy) applications, this shouldn’t be more than a few percentage points.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark Internals&lt;/strong&gt; — Time spent by Spark on overhead operations such as scheduler delay and task serialization/deserialization. Again, this should be small for most apps.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Some Cores Idle&lt;/strong&gt; — Time during which tasks are executing, but only on a portion of your executor cores. A high value here is a sign of imperfect parallelism.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;All Cores Idle&lt;/strong&gt; — Time during which no Spark tasks are running. This could be full idleness, which often happens when, say, you use Spark interactively and then take a break. Or it could mean that some non-Spark work is happening: for example, you may be executing pure Python or Scala code on the driver, leaving all the executors idle.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
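&lt;p&gt;As a rough mental model of that summary, each measured category is a share of your total executor-core uptime, and whatever is left over is idle time (again, an illustrative sketch, not Delight’s actual code):&lt;/p&gt;

```python
def usage_breakdown_pct(category_core_seconds, total_core_seconds):
    """Express each measured category as a percentage of executor-core
    uptime; the remainder is time when no Spark task was running."""
    shares = {
        name: 100.0 * seconds / total_core_seconds
        for name, seconds in category_core_seconds.items()
    }
    shares["All Cores Idle"] = round(100.0 - sum(shares.values()), 6)
    return shares

breakdown = usage_breakdown_pct(
    {"CPU": 2400, "I/O": 600, "Shuffle": 300, "Garbage Collection": 60,
     "Spark Internals": 40, "Some Cores Idle": 800},
    total_core_seconds=4800,
)
print(breakdown["CPU"])             # 50.0
print(breakdown["All Cores Idle"])  # 12.5
```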

&lt;p&gt;This information is then plotted on a graph where the X-axis is the timeline of your app and the Y-axis is the number of executor cores you have (which may vary over time if you enable dynamic allocation). Under this graph, you can see a timeline of your Spark jobs and stages.&lt;/p&gt;

&lt;p&gt;The main benefit of this graph is that you can very quickly focus your attention on the main performance bottleneck of your application. In this example, it’s clear that &lt;strong&gt;job-1/stage-1&lt;/strong&gt; takes up most of the application time. The large gray area indicates that most of the cores are idle during this time. If you click on “stage-1”, you will be taken to the Spark UI, where you can troubleshoot further (in this example, the problem is that the stage only has 2 tasks).&lt;/p&gt;

&lt;h2&gt;3. Executors Peak Memory Usage&lt;/h2&gt;

&lt;p&gt;This graph is only visible if you use Spark 3.0 or later, as it depends on new capabilities introduced in Spark 3.0 (&lt;a href="https://issues.apache.org/jira/browse/SPARK-23429"&gt;SPARK-23429&lt;/a&gt; and &lt;a href="https://issues.apache.org/jira/browse/SPARK-27189"&gt;SPARK-27189&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--01RVbint--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2A6xXkHhdE7mFom5y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--01RVbint--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3200/0%2A6xXkHhdE7mFom5y1.png" alt="The executors peak memory usage graphs shops the memory usage breakdown of your Spark executors, at the time they reached their maximum memory usage. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;The executors peak memory usage graphs shops the memory usage breakdown of your Spark executors, at the time they reached their maximum memory usage. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While your app is running, Spark measures the memory usage of each executor. This graph reports the peak memory usage observed for your top 5 executors, broken down into categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Memory used by the &lt;strong&gt;JVM&lt;/strong&gt; (this is your maximum heap size)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory used by &lt;strong&gt;Python&lt;/strong&gt; processes (if you’re using PySpark)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory used by &lt;strong&gt;other&lt;/strong&gt; processes (if you’re using R, then R is currently listed as “Others”)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This graph should be particularly useful to PySpark users, who currently have no simple way of knowing how much memory their Python processes consume. It can help you decide whether it’s safe to use a different instance type or container size than you currently do. It can also warn you that you’re flirting with the maximum allowed memory usage. As a reminder, if your Python processes use too much memory, your resource manager will kill your Spark executor (on Kubernetes, for example, you will see Docker exit code 137).&lt;/p&gt;
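&lt;p&gt;If this graph shows Python processes eating into your headroom, the standard Spark memory properties are the levers to reach for. A sketch with example values (these are standard Spark property names; to our knowledge &lt;code&gt;spark.executor.pyspark.memory&lt;/code&gt; is available since Spark 2.4, but double-check against your Spark version’s configuration docs):&lt;/p&gt;

```python
# Sketch: standard Spark memory settings (values are examples only).
# The container's memory request is roughly the sum of these pieces;
# exceeding it on Kubernetes gets the executor OOM-killed (exit code 137).
executor_memory_conf = {
    "spark.executor.memory": "4g",           # JVM heap size
    "spark.executor.memoryOverhead": "1g",   # native / off-heap headroom
    "spark.executor.pyspark.memory": "2g",   # hard cap for Python workers
}
for key, value in executor_memory_conf.items():
    print(f"--conf {key}={value}")
```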

&lt;h2&gt;Conclusion: Try it out, and let us know your feedback!&lt;/h2&gt;

&lt;p&gt;We hope Delight will be true to its word and help you have a delightful developer experience with Spark.&lt;/p&gt;

&lt;p&gt;We encourage you to try it out! &lt;a href="https://www.datamechanics.co/delight"&gt;Sign up&lt;/a&gt;, follow the installation instructions &lt;a href="https://github.com/datamechanics/delight"&gt;on our GitHub page&lt;/a&gt;, and let us know your feedback over email (by replying to the welcome email) or using the live chat window in the product.&lt;/p&gt;

&lt;p&gt;We have ambitious plans for Delight. Our future roadmap includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A page for each executor, and for the driver, to track their memory usage over time, with a more detailed breakdown (e.g. off-heap memory usage)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automated performance tuning recommendations (e.g. alerts on data skew / inefficient data partitioning, …)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Making Delight accessible in real-time while the app is running&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And much more based on your feedback: let us know what your priority is!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will soon be publishing follow-up articles with concrete troubleshooting steps and Spark performance tuning sessions using Delight. We will also be presenting it at meetups and conferences, including the upcoming &lt;a href="https://databricks.com/session_na21/delight-a-free-and-cross-platform-apache-spark-ui-complement"&gt;Data + AI Summit&lt;/a&gt;. So stay tuned, and thank you for your support!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.datamechanics.co/blog-post/delight-the-new-improved-spark-ui-spark-history-server-is-now-ga"&gt;https://www.datamechanics.co&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>apachespark</category>
      <category>sparkui</category>
      <category>bigdata</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Spark on Kubernetes Made Easy - How Data Mechanics Improves on the Open-Source version</title>
      <dc:creator>JY @ DataMechanics</dc:creator>
      <pubDate>Wed, 09 Dec 2020 15:40:27 +0000</pubDate>
      <link>https://dev.to/jystephan/spark-on-kubernetes-made-easy-how-data-mechanics-improves-on-the-open-source-version-52fj</link>
      <guid>https://dev.to/jystephan/spark-on-kubernetes-made-easy-how-data-mechanics-improves-on-the-open-source-version-52fj</guid>
      <description>&lt;p&gt;&lt;em&gt;If you’re looking for a high-level introduction about &lt;a href="https://www.datamechanics.co/apache-spark-on-kubernetes"&gt;Spark on Kubernetes&lt;/a&gt;, check out &lt;a href="https://www.datamechanics.co/blog-post/pros-and-cons-of-running-apache-spark-on-kubernetes"&gt;The Pros And Cons of Running Spark on Kubernetes&lt;/a&gt;, and if you’re looking for a deeper technical dive, then read our guide &lt;a href="https://www.datamechanics.co/blog-post/setting-up-managing-monitoring-spark-on-kubernetes"&gt;Setting Up, Managing &amp;amp; Monitoring Spark on Kubernetes&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Data Mechanics is a managed &lt;a href="https://www.datamechanics.co/apache-spark"&gt;Spark&lt;/a&gt; platform deployed on a Kubernetes cluster inside our customers’ cloud account, available on AWS, GCP, and Azure. So our entire company is built on top of Spark on Kubernetes, and we are often asked &lt;strong&gt;how we’re different from simply running Spark on Kubernetes open-source&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The short answer is that our platform implements many features which make &lt;a href="https://www.datamechanics.co/apache-spark-on-kubernetes"&gt;Spark on Kubernetes&lt;/a&gt; much easier to use and more cost-effective. By taking care of the setup and the maintenance, our goal is to accelerate your data engineering projects by making &lt;a href="https://www.datamechanics.co/apache-spark"&gt;Spark&lt;/a&gt; as simple, flexible and performant as it should be.&lt;/p&gt;

&lt;p&gt;Let’s go over our main improvements on top of Spark-on-Kubernetes.&lt;/p&gt;

&lt;h2&gt;An intuitive user interface&lt;/h2&gt;

&lt;p&gt;Data Mechanics users get a dashboard where they can view the logs and metrics for each of their Spark applications. They can also access the Spark UI, soon to be replaced with our homegrown monitoring tool called &lt;a href="https://www.datamechanics.co/delight"&gt;Data Mechanics Delight&lt;/a&gt; (Update, December 2020: First &lt;a href="https://www.datamechanics.co/blog-post/were-releasing-a-free-cross-platform-spark-ui-and-spark-history-server"&gt;milestone&lt;/a&gt; of Delight has been released!). The goal of this project is to make it easy for Spark developers to troubleshoot their applications when there’s a failure, and to give them high-level recommendations to increase performance when necessary (for example around data partitioning and memory management).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GXHZG11U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3680/0%2AGFtfwNigL3lofoAQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GXHZG11U--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3680/0%2AGFtfwNigL3lofoAQ.png" alt="[Data Mechanics Delight](https://www.datamechanics.co/delight)"&gt;&lt;/a&gt;&lt;em&gt;&lt;a href="https://www.datamechanics.co/delight"&gt;Data Mechanics Delight&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They also have access to a “Jobs UI” which gives a historical graph of your pipelines’ main metrics, like the volume of data processed, the duration, and the costs, so that your team can easily make sure your production pipelines are running as expected and track your costs when necessary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xZvY49Zv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2AhCDwfBh1KRYpFmb-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xZvY49Zv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2AhCDwfBh1KRYpFmb-.png" alt="Data Mechanics Jobs UI"&gt;&lt;/a&gt;&lt;em&gt;Data Mechanics Jobs UI&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Dynamic optimizations&lt;/h2&gt;

&lt;p&gt;The Data Mechanics platform automatically and dynamically optimises your pipelines’ infrastructure parameters and Spark configurations to make them fast and stable. Here are the settings we tune: your pods’ memory and CPU allocations, your disk settings, and your Spark configurations around parallelism, shuffle, and memory management. We do this by analysing the logs and metrics of your applications and using the history of past runs to find their bottlenecks and optimise them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6n3To91h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2Aj0_l4EUU3bUebOkD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6n3To91h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2Aj0_l4EUU3bUebOkD.png" alt="Data Mechanics’ Auto Tuning Feature"&gt;&lt;/a&gt;&lt;em&gt;Data Mechanics’ Auto Tuning Feature&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In addition to automated tuning, our platform also implements automated scaling at the level of your Spark application (dynamic allocation) and at the level of the Kubernetes cluster. This means we manage the Kubernetes node pools, scaling the cluster up when you need more resources and down to zero when they’re unnecessary. We also make it easy to use spot nodes for your Spark executors to reduce your cloud costs further.&lt;/p&gt;
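&lt;p&gt;Dynamic allocation itself is standard Spark; here’s a sketch of the kind of settings involved (the keys are standard Spark properties, the values illustrative, and these are the knobs our platform tunes for you):&lt;/p&gt;

```python
# Sketch: standard Spark dynamic allocation properties (example values).
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # On Kubernetes (Spark 3.0+), shuffle tracking lets executors be
    # removed without requiring an external shuffle service.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}
for key, value in dynamic_allocation_conf.items():
    print(f"--conf {key}={value}")
```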

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4TtTx6sN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2AQrAA3vVD3X6sLdWc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4TtTx6sN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2AQrAA3vVD3X6sLdWc.png" alt="Data Mechanics’ Autoscaling Feature"&gt;&lt;/a&gt;&lt;em&gt;Data Mechanics’ Autoscaling Feature&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Last but not least, we offer a list of Spark images which contain optimised connectors to common data sources and sinks. You can either use these images directly, or use them as a base to &lt;a href="https://www.datamechanics.co/blog-post/spark-and-docker-your-spark-development-cycle-just-got-ten-times-faster"&gt;build your own Docker images&lt;/a&gt; with your custom dependencies.&lt;/p&gt;

&lt;p&gt;The goal of these optimizations is to give you the maximum performance Spark should provide and to reduce your cloud costs. In fact, the &lt;a href="https://www.datamechanics.co/pricing"&gt;management fee&lt;/a&gt; that we charge for our services is more than compensated by the savings we generate on your cloud provider bill. We’ve helped customers migrating from competing Spark platforms reduce their cloud bill by 50 to 75%.&lt;/p&gt;

&lt;h2&gt;Integrations&lt;/h2&gt;

&lt;p&gt;Data Mechanics has integrations with notebooks services (like &lt;a href="https://docs.datamechanics.co/docs/jupyter-notebooks"&gt;Jupyter&lt;/a&gt;, JupyterLab, JupyterHub) and scheduler/workflows services (like &lt;a href="https://docs.datamechanics.co/docs/airflow-plugin"&gt;Airflow&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Since our platform is deployed on a Kubernetes cluster that you have control over, the full ecosystem of Docker/Kubernetes compatible tools is also available to you. And since we’re deployed inside your cloud account, inside your VPC, you can also easily build your own integrations with your home-grown tools inside your company’s network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ndpLOrYH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2A99VhGQ3rgS-W9tO8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ndpLOrYH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2A99VhGQ3rgS-W9tO8.png" alt="Data Mechanics’ Native Integrations With Jupyter, Docker, Kubernetes, Airflow"&gt;&lt;/a&gt;&lt;em&gt;Data Mechanics’ Native Integrations With Jupyter, Docker, Kubernetes, Airflow&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The peace of mind of a managed service&lt;/h2&gt;

&lt;p&gt;As a managed service, we take care of the setup and the maintenance of your infrastructure. When you sign up for Data Mechanics, you give us scoped permission on your cloud account, and we use these permissions to create the Kubernetes cluster, keep it up-to-date with the latest security fixes, and push releases with new features every two weeks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MMd1L411--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2A96hH_gK9UVC6poNm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MMd1L411--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2A96hH_gK9UVC6poNm.png" alt="The Data Mechanics Platform Architecture"&gt;&lt;/a&gt;&lt;em&gt;The Data Mechanics Platform Architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s also our responsibility to keep your deployment secure. We can deploy within your company’s VPC and make your cluster private, so that it can only be accessed through your company’s VPN. We give you the tools to apply the security best practices with multiple options for data access and for user authentication (Single Sign On).&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We’re proud to build on top of, and sometimes contribute to, &lt;a href="https://www.datamechanics.co/apache-spark-on-kubernetes"&gt;Spark-on-Kubernetes&lt;/a&gt; as well as other open source projects. We’re trying to build the data platform you would build for yourself — in an open and transparent way. By being deployed in your cloud account and in your VPC, you get the flexibility of a home-grown project and the ease-of-use of a managed platform.&lt;/p&gt;

&lt;p&gt;The optimizations we built internally will more than make up for our &lt;a href="https://www.datamechanics.co/pricing"&gt;pricing&lt;/a&gt;. In fact, we’ve helped some of our customers reduce their total costs by 50 to 75% as they migrated from competing platforms.&lt;/p&gt;

&lt;p&gt;If this sounds interesting, &lt;a href="https://calendly.com/datamechanics/demo"&gt;book a time&lt;/a&gt; with our team. We’ll be happy to answer your questions and show you how to get started!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.datamechanics.co/blog-post/spark-on-kubernetes-made-easy-how-data-mechanics-improves-on-spark-on-k8s-open-source"&gt;https://www.datamechanics.co&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bigdata</category>
      <category>kubernetes</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to Use our Free Open-Source Spark UI &amp; Spark History Server + Spark Dashboard</title>
      <dc:creator>JY @ DataMechanics</dc:creator>
      <pubDate>Wed, 09 Dec 2020 13:31:08 +0000</pubDate>
      <link>https://dev.to/jystephan/how-to-use-our-free-open-source-spark-ui-spark-history-server-spark-dashboard-8fj</link>
      <guid>https://dev.to/jystephan/how-to-use-our-free-open-source-spark-ui-spark-history-server-spark-dashboard-8fj</guid>
      <description>&lt;p&gt;Accessing the Spark UI for live Spark applications is easy, but having access to it for terminated applications requires persisting logs to a cloud storage and running a server called the Spark History Server. While some commercial Spark platform provides this automatically (e.g. Databricks, EMR) ; for many Spark users (e.g. on Dataproc, Spark-on-Kubernetes) getting this requires a bit of work.&lt;/p&gt;

&lt;p&gt;Today, we’re releasing a free, hosted, partly &lt;a href="https://github.com/datamechanics/delight"&gt;open-sourced&lt;/a&gt; Spark UI and Spark History Server that work on top of any Spark platform (whether it’s on-premise or in the cloud, over Kubernetes or over YARN, using a commercial platform or not).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gWu48QvY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5200/0%2A5ZFpi4bkjHkXl6Jo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gWu48QvY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/5200/0%2A5ZFpi4bkjHkXl6Jo.png" alt="Spark UI Screenshot. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;Spark UI Screenshot. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It consists of a dashboard listing your Spark applications, and a hosted Spark History Server that will give you access to the Spark UI for your recently finished applications at the click of a button.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Can I Use it?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Create an account on &lt;a href="https://www.datamechanics.co/delight"&gt;Data Mechanics Delight&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You should use your company’s Google account if you want to share a single dashboard with your colleagues, or your personal Google account if you want the dashboard to be private to you.&lt;/p&gt;

&lt;p&gt;Once your account is created, go under Settings and create a personal access token. This will be needed in the next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Attach our open-source agent to your Spark applications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Follow the instructions on our &lt;a href="https://github.com/datamechanics/delight"&gt;Github page&lt;/a&gt;. We have instructions available for the most common setups of Spark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/submitting-applications.html"&gt;Spark-submit&lt;/a&gt; (on any platform)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/emr/"&gt;EMR&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cloud.google.com/dataproc"&gt;Dataproc&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.datamechanics.co/apache-spark-on-kubernetes"&gt;Spark-on-Kubernetes&lt;/a&gt; using the spark-operator&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://livy.incubator.apache.org/"&gt;Apache Livy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re not sure how to install it on your infrastructure, &lt;a href="https://www.datamechanics.co/contact"&gt;let us know&lt;/a&gt; and we’ll add instructions that fit your needs.&lt;/p&gt;
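&lt;p&gt;To give a flavor of the spark-submit route: attaching the agent boils down to a handful of Spark conf entries. The sketch below is illustrative only — the package coordinates and config keys follow the project README at the time of writing and may change, so double-check them there before use.&lt;/p&gt;

```python
# Illustrative sketch: Spark conf entries used to attach the Delight agent.
# Package coordinates and config keys are taken from the project README;
# verify them there, as they are assumptions here.
delight_conf = {
    "spark.jars.packages": "co.datamechanics:delight_2.12:latest-SNAPSHOT",
    "spark.jars.repositories": "https://oss.sonatype.org/content/repositories/snapshots",
    "spark.extraListeners": "co.datamechanics.delight.DelightListener",
    "spark.delight.accessToken.secret": "PASTE_YOUR_ACCESS_TOKEN_HERE",
}

# Render the entries as spark-submit --conf flags
flags = " ".join(f"--conf {key}={value}" for key, value in sorted(delight_conf.items()))
print(f"spark-submit {flags} main.py")
```

&lt;p&gt;The same entries can equally be set on a SparkSession builder, in a spark-operator manifest, or in your platform’s cluster configuration.&lt;/p&gt;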

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--F1LvCu5P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3836/0%2Am3Lnvx_YTP8e0bgD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--F1LvCu5P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3836/0%2Am3Lnvx_YTP8e0bgD.png" alt="This is the dashboard that you’ll see once logged in. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;This is the dashboard that you’ll see once logged in. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Your applications will automatically appear on the web dashboard once they complete. Clicking on an application opens up the corresponding Spark UI. That’s it!&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does It Work? Is It Secure?
&lt;/h2&gt;

&lt;p&gt;This project consists of two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An &lt;a href="https://github.com/datamechanics/delight"&gt;open-source&lt;/a&gt; Spark agent which runs inside your Spark applications. This agent will stream non-sensitive Spark event logs from your Spark application to our backend.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A closed-source backend consisting of a real-time logs ingestion pipeline, storage services, a web application, and an authentication layer to make this secure.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ooKIWJtM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2AZQ8fS6d0VfDbwdGz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ooKIWJtM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3840/0%2AZQ8fS6d0VfDbwdGz.png" alt="Architecture of [Data Mechanics Delight](https://www.datamechnaics.co/delight)."&gt;&lt;/a&gt;&lt;em&gt;Architecture of &lt;a href="https://www.datamechnaics.co/delight"&gt;Data Mechanics Delight&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The agent collects your Spark applications’ event logs. This is non-sensitive metadata about your Spark application: for example, for each Spark task there is metadata on memory usage, CPU usage, and network traffic (&lt;a href="https://github.com/datamechanics/delight/blob/main/documentation/resources/example_spark_event_log_message.json"&gt;view a sample event log&lt;/a&gt;). The agent does not record sensitive information such as the data that your Spark applications actually work on. Nor does it collect your application logs, as these typically may contain sensitive information.&lt;/p&gt;

&lt;p&gt;This data is encrypted using your personal access token and sent over the internet using the HTTPS protocol. It is then stored securely inside the Data Mechanics control plane, behind an authentication layer. Only you and your colleagues will be able to see your applications in our dashboard. The collected data is automatically deleted 30 days after your Spark application completes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next?
&lt;/h2&gt;

&lt;p&gt;The release of this hosted Spark History Server is a first step towards a Spark UI replacement tool called Data Mechanics Delight. Our &lt;a href="https://www.datamechanics.co/blog-post/building-a-better-spark-ui-data-mechanics-delight"&gt;announcement in June 2020&lt;/a&gt; to build a Spark UI replacement with new metrics and visualizations generated a lot of interest from the Spark community.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U38JzlB2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AEwg731-aMtPWuopd.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U38JzlB2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/2000/0%2AEwg731-aMtPWuopd.gif" alt="Design prototype for the next releases of Data Mechanics Delight. Image by Author."&gt;&lt;/a&gt;&lt;em&gt;Design prototype for the next releases of Data Mechanics Delight. Image by Author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The current release does not yet replace the Spark UI, but we hope it will still be valuable to some of you. The next release of &lt;a href="https://www.datamechanics.co/delight"&gt;Delight&lt;/a&gt;, scheduled in January 2021, will consist of an overview screen giving a bird’s-eye view of your applications’ performance. Links to specific jobs, stages or executor pages will still take you to the corresponding Spark UI pages until we gradually replace these pages too.&lt;/p&gt;

&lt;p&gt;Our mission at Data Mechanics is to make &lt;a href="https://www.datamechanics.co/apache-spark"&gt;Apache Spark&lt;/a&gt; much more developer-friendly and cost-effective. We hope this tool will contribute to this goal and prove useful to the Spark community!&lt;/p&gt;

</description>
      <category>apachespark</category>
      <category>monitoring</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Spark and Docker: Your Spark development cycle just got 10x faster !</title>
      <dc:creator>JY @ DataMechanics</dc:creator>
      <pubDate>Mon, 23 Nov 2020 23:06:51 +0000</pubDate>
      <link>https://dev.to/jystephan/spark-and-docker-your-spark-development-cycle-just-got-10x-faster-12hm</link>
      <guid>https://dev.to/jystephan/spark-and-docker-your-spark-development-cycle-just-got-10x-faster-12hm</guid>
      <description>&lt;h1&gt;
  
  
  Why Spark and Docker?
&lt;/h1&gt;

&lt;p&gt;The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere - locally, in dev / testing / prod environments, across all cloud providers, and on-premise - in a repeatable way. &lt;/p&gt;

&lt;p&gt;The software engineering world has fully adopted Docker, and a lot of tools around Docker have changed the way we build and deploy software - testing, CI/CD, dependency management, versioning, monitoring, security. The popularity of &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; as the new standard for container orchestration and infrastructure management follows from the popularity of &lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the big data and &lt;a href="https://www.datamechanics.co/apache-spark"&gt;Apache Spark&lt;/a&gt; scene, most applications still run directly on virtual machines without benefiting from containerization. The dominant cluster manager for Spark, Hadoop YARN, did not support Docker containers until &lt;a href="https://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/DockerContainers.html"&gt;recently (Hadoop 3.1 release)&lt;/a&gt;, and even today the support for Docker remains “experimental and non-complete”. &lt;/p&gt;

&lt;p&gt;Native support for Docker is in fact one of the main reasons companies choose to deploy &lt;a href="https://www.datamechanics.co/blog-post/pros-and-cons-of-running-apache-spark-on-kubernetes"&gt;Spark on top of Kubernetes instead of YARN&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we will illustrate the benefits of Docker for Apache Spark by going through the end-to-end development cycle used by many of our users at &lt;a href="https://www.datamechanics.co/blog-post/video-tour-of-data-mechanics-the-serverless-spark-platform"&gt;Data Mechanics&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benefits of Docker for Apache Spark
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Build your dependencies once, run everywhere
&lt;/h3&gt;

&lt;p&gt;Your application will run the same way wherever you run it: on your laptop for dev / testing, or in any of your production environments. In your Docker image, you will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Package your application code&lt;/li&gt;
&lt;li&gt;Package all your dependencies (Python: PyPI, eggs, conda; Scala / Java: jars, Maven; plus system dependencies)&lt;/li&gt;
&lt;li&gt;Define environment variables to tweak behavior at runtime&lt;/li&gt;
&lt;li&gt;Customize your operating system the way you want&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If running on Kubernetes: you’re free to choose your Spark version for each separate Docker image! This is in contrast to YARN, where you must use the same global Spark version for the entire cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Make Spark more reliable and cost-efficient in production.
&lt;/h3&gt;

&lt;p&gt;Since you control your entire environment when you build your Docker image, you don’t need any init scripts or bootstrap actions to execute at runtime.&lt;br&gt;
These scripts have several issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They’re not reliable - a library installation can fail due to a spurious network failure and be responsible for your Spark pipeline failure in production.&lt;/li&gt;
&lt;li&gt;They’re slow - these runtime scripts can add multiple minutes to the setup time of each of your Spark nodes, which increases your costs.&lt;/li&gt;
&lt;li&gt;They’re hard to develop, maintain, and debug.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Speed up your iteration cycle.
&lt;/h3&gt;

&lt;p&gt;At Data Mechanics, our users enjoy a 20-30s iteration cycle, from the time they make a code change in their IDE to the time this change runs as a Spark app on our platform.&lt;/p&gt;

&lt;p&gt;This is mostly thanks to the fact that Docker caches previously built layers and that Kubernetes is really fast at starting / restarting Docker containers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s_UHf27w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f86d89460c9ff56e7b351c6_Spark%2520%2526%2520Docker%2520Development%2520Iteration%2520Cycle%2520with%2520logo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s_UHf27w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f86d89460c9ff56e7b351c6_Spark%2520%2526%2520Docker%2520Development%2520Iteration%2520Cycle%2520with%2520logo.png" alt="Spark &amp;amp; Docker Development Iteration Cycle"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tutorial: How to speed up your Spark development cycle by 10x with Docker
&lt;/h2&gt;

&lt;p&gt;In this section, we’ll show you how to work with Spark and Docker, step by step. Example screenshots and code samples are taken from running a PySpark application on the Data Mechanics platform, but this example can easily be adapted to other environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Building the Docker Image
&lt;/h3&gt;

&lt;p&gt;We’ll start from a local PySpark project with some dependencies, and a Dockerfile that explains how to build a Docker image for this project. Once the Docker image is built, we can either run it locally, or push it to a Docker registry and then run it on the Data Mechanics cluster.&lt;/p&gt;

&lt;p&gt;Thanks to Docker, there's no need to package third-party dependencies and custom modules in brittle zip files. You can just co-locate your main script and its dependencies.&lt;/p&gt;

&lt;p&gt;In this example, we adopt a layout for our project that should feel familiar to most Python developers: a main script, a requirements file, and custom modules.&lt;/p&gt;
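&lt;p&gt;As an illustration, such a project layout might look like this (the file names below are hypothetical, not taken from the screenshot):&lt;/p&gt;

```
my-spark-app/
├── Dockerfile
├── main.py          # entry point submitted to Spark
├── requirements.txt # third-party Python dependencies
└── src/             # custom modules imported by main.py
    └── utils.py
```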

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s6or_0Vu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e0c67cb1002d75d75bf0_Spark%2520%2526%2520Docker%2520Building%2520The%2520Docker%2520Image.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s6or_0Vu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e0c67cb1002d75d75bf0_Spark%2520%2526%2520Docker%2520Building%2520The%2520Docker%2520Image.png" alt="Spark &amp;amp; Docker Dockerfile Config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s look at the Dockerfile. In this example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We start from a base image published by Data Mechanics. This image corresponds to a major Spark version (3.0 here). We discuss this choice in detail at the end of this article.&lt;/li&gt;
&lt;li&gt;We then install our dependencies. We used pip here, but we could use conda too.&lt;/li&gt;
&lt;li&gt;We finally copy our own code and main file. Since this is the code we’ll be primarily iterating on, it’s recommended to do this at the end, so that when we change this code Docker doesn’t need to rebuild the image from scratch. &lt;/li&gt;
&lt;li&gt;It’s also easy to set up environment variables at this step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TYYYciCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85d35f1d9606c034a5f707_Spark%2520%2526%2520Docker%2520Config.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TYYYciCM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85d35f1d9606c034a5f707_Spark%2520%2526%2520Docker%2520Config.png" alt="Spark &amp;amp; Docker Config"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Local Run
&lt;/h3&gt;

&lt;p&gt;You can then build this image and run it locally. This means Spark will run in local mode, as a single container on your laptop. You will not be able to process large amounts of data, but this is useful if you just want to test your code’s correctness (maybe using a small subset of the real data) or run unit tests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--stsOhkiE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e0fb54b3fca4caf81b43_Spark%2520%2526%2520Docker%2520Local%2520Run%2520Build.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--stsOhkiE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e0fb54b3fca4caf81b43_Spark%2520%2526%2520Docker%2520Local%2520Run%2520Build.png" alt="Spark &amp;amp; Docker Local Run Build"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On an average laptop, this took 15 seconds. If you’re going to iterate a lot at this stage and want to speed things up further, then instead of rebuilding the image you can simply mount your local files (from your working directory on your laptop) onto your working directory in the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gJJYAJug--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e122c11b766f9e5feb51_Spark%2520%2526%2520Docker%2520Local%2520Run%2520Mount.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gJJYAJug--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e122c11b766f9e5feb51_Spark%2520%2526%2520Docker%2520Local%2520Run%2520Mount.png" alt="Spark &amp;amp; Docker Local Run"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Run at Scale
&lt;/h3&gt;

&lt;p&gt;The code looks correct, and we now want to run this image at scale (in a distributed way) on the full dataset. We’re going to push the image to our Docker registry and then provide the image name as a parameter in the &lt;a href="https://docs.datamechanics.co/docs/first-application"&gt;REST API call&lt;/a&gt; to submit this Spark application to Data Mechanics.&lt;/p&gt;

&lt;p&gt;It’s important to have a fast iteration cycle at this stage too - even though you ran the image locally - as you’ll need to verify the correctness and stability of your pipeline on the full dataset. It’s likely you will need to keep iterating on your code and on your infrastructure configurations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WVT77sP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e138c0babd9e7ad941ee_Spark%2520%2526%2520Docker%2520Run%2520At%2520Scale%2520Kubernetes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WVT77sP2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f85e138c0babd9e7ad941ee_Spark%2520%2526%2520Docker%2520Run%2520At%2520Scale%2520Kubernetes.png" alt="Spark &amp;amp; Docker Run At Scale Kubernetes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On an average laptop, this step took 20 seconds to complete: 10 seconds to build and push the image, 10 seconds for the app to start on Data Mechanics - meaning the Spark driver starts executing our main.py code. &lt;/p&gt;

&lt;p&gt;In less than 30 seconds, you’re able to iterate on your Spark code and run it on production data! And you can do this without leaving your IDE (you don’t need to paste your code into a notebook or an interactive shell). This is a game-changer for Spark developers: a 10x speed-up compared to the industry average.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I really love that we can use Docker to package our code on Data Mechanics. It’s a huge win for simplifying PySpark deployment. Based on the iterations I have completed it takes less than 30 seconds to deploy and run the app" - Max, Data Engineer at Weather2020. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This fast iteration cycle is thanks to Docker caching the previous layers of the image, and the Data Mechanics platform (deployed on Kubernetes) optimizing fast container launches and application restarts.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose your Spark base Docker image
&lt;/h2&gt;

&lt;p&gt;To run a Spark application on Data Mechanics, you should use one of our published Spark images as a base image (in the first line of your Dockerfile). We publish new images regularly, including whenever a new Spark version is released or when a security fix is available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ElzkGWCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f86d72e42ebdd8aedd2fa1b_Data%2520Mechanics%2520Spark-Based%2520Docker%2520Images.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ElzkGWCo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f86d72e42ebdd8aedd2fa1b_Data%2520Mechanics%2520Spark-Based%2520Docker%2520Images.png" alt="Data Mechanics Spark-Based Docker Images"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These Spark images contain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Spark version, e.g. 2.4 or 3.0&lt;/li&gt;
&lt;li&gt;Some data connectors, e.g. to S3, GCS, Azure Blob Storage, and Snowflake&lt;/li&gt;
&lt;li&gt;Dependencies required by our platform features, such as the Jupyter integration or Data Mechanics Delight (our Spark UI replacement)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are freely available to everyone, and you can also use them to run Spark locally or on other platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve demonstrated how Docker can help you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Package your dependencies and control your environment in a simple way.&lt;/li&gt;
&lt;li&gt;Iterate on your code from your IDE by quickly running Spark locally or at scale.&lt;/li&gt;
&lt;li&gt;Make Spark more reliable and cost-efficient in production. Finally, you can say goodbye to slow and flaky bootstrap scripts and runtime downloads!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this sounds like something you would like to try out on the Data Mechanics platform, &lt;a href="https://calendly.com/datamechanics/demo"&gt;book a demo&lt;/a&gt; with our team to start a trial of Data Mechanics - we’ll help you adopt this Docker-based dev workflow as one of the first steps of our onboarding.&lt;/p&gt;

&lt;p&gt;Running Spark on Kubernetes in a Docker native and cloud agnostic way has many other benefits. For example, you can use the steps above to build a CI/CD pipeline that automates the building and deployment of Spark test applications whenever you submit a pull request to modify one of your production pipelines. We will cover this in a future blog post - so stay tuned!&lt;/p&gt;

</description>
      <category>spark</category>
      <category>docker</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>How-to guide: Set up, Manage &amp; Monitor Spark on Kubernetes</title>
      <dc:creator>JY @ DataMechanics</dc:creator>
      <pubDate>Fri, 20 Nov 2020 12:28:33 +0000</pubDate>
      <link>https://dev.to/jystephan/how-to-guide-set-up-manage-monitor-spark-on-kubernetes-2p8k</link>
      <guid>https://dev.to/jystephan/how-to-guide-set-up-manage-monitor-spark-on-kubernetes-2p8k</guid>
      <description>&lt;h1&gt;
  
  
  Here's your how-to guide to set up, manage, and monitor Spark on k8s
&lt;/h1&gt;

&lt;p&gt;In this post we’d like to expand on that presentation and talk to you about: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Kubernetes?&lt;/li&gt;
&lt;li&gt;Why run Spark on Kubernetes?&lt;/li&gt;
&lt;li&gt;Getting started with Spark on Kubernetes&lt;/li&gt;
&lt;li&gt;Optimizing performance and cost&lt;/li&gt;
&lt;li&gt;Monitoring your Spark applications on Kubernetes&lt;/li&gt;
&lt;li&gt;The future of Spark on Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re already familiar with k8s and why Spark on Kubernetes might be a fit for you, feel free to skip the first couple of sections and get straight to the meat of the post!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kubernetes (k8s)?
&lt;/h2&gt;

&lt;p&gt;Kubernetes (also known as Kube or k8s) is an open-source container orchestration system initially developed at Google, open-sourced in 2014 and maintained by the Cloud Native Computing Foundation. Kubernetes is used to automate deployment, scaling and management of containerized apps - most commonly Docker containers.&lt;/p&gt;

&lt;p&gt;It offers many features critical to stability, security, performance, and scalability, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Horizontal Scalability&lt;/li&gt;
&lt;li&gt;Automated Rollouts &amp;amp; Rollbacks&lt;/li&gt;
&lt;li&gt;Load Balancing&lt;/li&gt;
&lt;li&gt;Secrets &amp;amp; Config Management &lt;/li&gt;
&lt;li&gt;...and many more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.kubernetes.io"&gt;Kubernetes&lt;/a&gt; has become the standard for infrastructure management in the traditional software development world. But Kubernetes isn’t as popular in the big data scene which is too often stuck with older technologies like &lt;a href="https://www.datamechanics.co/blog-post/pros-and-cons-of-running-apache-spark-on-kubernetes"&gt;Hadoop YARN&lt;/a&gt;. Until Spark-on-Kubernetes joined the game!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Spark on Kubernetes?
&lt;/h2&gt;

&lt;p&gt;When support for natively running Spark on Kubernetes was added in Apache Spark 2.3, many companies decided to switch to it. The main reasons for this popularity include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Native containerization and Docker support.&lt;/li&gt;
&lt;li&gt;The ability to run Spark applications in full isolation of each other (e.g. on different Spark versions) while enjoying the cost-efficiency of a shared infrastructure. &lt;/li&gt;
&lt;li&gt;Unifying your entire tech infrastructure under a single cloud agnostic tool (if you already use Kubernetes for your non-Spark workloads).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On top of this, there is no setup penalty for running on Kubernetes compared to YARN (as shown by &lt;a href="https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn"&gt;benchmarks&lt;/a&gt;), and Spark 3.0 brought many additional improvements to Spark-on-Kubernetes, like support for dynamic allocation.&lt;/p&gt;

&lt;p&gt;Read our previous post on the &lt;a href="https://www.datamechanics.co/blog-post/pros-and-cons-of-running-apache-spark-on-kubernetes"&gt;Pros and Cons of Running Spark on Kubernetes&lt;/a&gt; for more details on this topic and comparison with main alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with Spark on Kubernetes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture: What happens when you submit a Spark app to Kubernetes
&lt;/h3&gt;

&lt;p&gt;You submit a &lt;a href="https://www.datamechanics.co/apache-spark"&gt;Spark&lt;/a&gt; application by talking directly to Kubernetes (precisely to the Kubernetes API server on the master node) which will then schedule a pod (simply put, a container) for the Spark driver. Once the Spark driver is up, it will communicate directly with Kubernetes to request Spark executors, which will also be scheduled on pods (one pod per executor). If dynamic allocation is enabled the number of Spark executors dynamically evolves based on load, otherwise it’s a static number.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LkDlnlB6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5fa6747b59a7f8bb12677e03_Apache%2520Spark%2520Architecture%2520on%2520Kubernetes%2520Wireframe%2520by%2520Data%2520Mechanics%2520Smaller.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LkDlnlB6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5fa6747b59a7f8bb12677e03_Apache%2520Spark%2520Architecture%2520on%2520Kubernetes%2520Wireframe%2520by%2520Data%2520Mechanics%2520Smaller.png" alt="Spark on Kubernetes Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to submit applications: spark-submit vs spark-operator
&lt;/h2&gt;

&lt;p&gt;This is a high-level choice you need to make early on. There are two ways to submit Spark applications to Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using the spark-submit method, which is bundled with Spark. Further operations on the Spark app will need to interact directly with Kubernetes pod objects.&lt;/li&gt;
&lt;li&gt;Using the spark-operator. This project was developed (and open-sourced) by GCP, but it works everywhere. It requires running a (single) pod on the cluster, but will turn Spark applications into custom Kubernetes resources which can be defined, configured and described like other Kubernetes objects. It adds other niceties like support for mounting ConfigMaps and Volumes directly from your Spark app configuration. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0tQKjSj---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f693bb9211b213475ad36c6_npc-7Nne0i47USFOPi68nmGcihI-1mA6_m8dq9pUEoFXn3eSvujoNQN8GFWzhX5kecVsqjb3ajsBaA59fiS3IXHzbgXvfRHxtEwWqH4g1XTGjrFn5-VbfMEajkhPtahcl5w3oVEl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0tQKjSj---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f693bb9211b213475ad36c6_npc-7Nne0i47USFOPi68nmGcihI-1mA6_m8dq9pUEoFXn3eSvujoNQN8GFWzhX5kecVsqjb3ajsBaA59fiS3IXHzbgXvfRHxtEwWqH4g1XTGjrFn5-VbfMEajkhPtahcl5w3oVEl.png" alt="Spark Submit vs. Spark on Kubernetes Operator App Management"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We recommend working with the spark-operator, as it’s much easier to use!&lt;/p&gt;

&lt;h3&gt;Setup Checklist&lt;/h3&gt;

&lt;p&gt;The steps below will vary depending on your current infrastructure and your cloud provider (or on-premise setup). But at a high level, here are the main things you need to set up to get started with Spark on Kubernetes entirely by yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Kubernetes cluster&lt;/li&gt;
&lt;li&gt;Define your desired node pools based on your workload requirements&lt;/li&gt;
&lt;li&gt;Tighten security based on your networking requirements (we recommend making the Kubernetes cluster private)&lt;/li&gt;
&lt;li&gt;Create a Docker registry to host your own Spark Docker images (or use open-source ones)&lt;/li&gt;
&lt;li&gt;Install the spark-operator&lt;/li&gt;
&lt;li&gt;Install the Kubernetes cluster autoscaler&lt;/li&gt;
&lt;li&gt;Set up the collection of Spark driver logs and Spark event logs to a persistent storage&lt;/li&gt;
&lt;li&gt;Install the Spark history server (to replay the Spark UI of completed applications from the aforementioned event logs)&lt;/li&gt;
&lt;li&gt;Set up the collection of node and Spark metrics (CPU, memory, I/O, disks)&lt;/li&gt;
&lt;/ul&gt;
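&lt;p&gt;As a sketch of the operator installation step: the spark-operator project publishes a Helm chart. The repository URL and release options below are illustrative, so verify them against the spark-on-k8s-operator README for your version:&lt;/p&gt;

```shell
# Illustrative sketch: install the spark-operator with Helm.
# The repository URL and chart name may differ for your version;
# check the spark-on-k8s-operator project README before running this.
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator --create-namespace
```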

&lt;p&gt;As you can see, this is a lot of work, and a lot of moving open-source pieces to maintain, if you do all of this in-house.&lt;/p&gt;

&lt;p&gt;This is the reason why we built our managed Spark platform (Data Mechanics), to make Spark on Kubernetes as easy and accessible as it should be. Our platform takes care of this setup and offers additional integrations (e.g. Jupyter, Airflow, IDEs) as well as powerful optimizations on top to make your Spark apps faster and reduce your cloud costs.&lt;/p&gt;

&lt;h2&gt;Optimizing performance and cost&lt;/h2&gt;

&lt;h3&gt;Use SSDs or large disks whenever possible to get the best shuffle performance for Spark-on-Kubernetes&lt;/h3&gt;

&lt;p&gt;Shuffles are the expensive all-to-all data exchange steps that often occur in Spark. They can take up a large portion of your entire Spark job, so optimizing shuffle performance matters. We’ve already covered this topic in our YARN vs Kubernetes performance benchmarks article (read “How to optimize shuffle with Spark on Kubernetes”), so we’ll just give our high-level tips here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use local SSD disks whenever possible&lt;/li&gt;
&lt;li&gt;When they’re not available, increase the size of your disks to boost their bandwidth&lt;/li&gt;
&lt;/ul&gt;
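&lt;p&gt;Concretely, Spark-on-Kubernetes lets you back Spark’s scratch space with a hostPath volume: volume names prefixed with spark-local-dir- are used by Spark for shuffle and spill files. The SSD mount path below is an assumption that depends on your node image:&lt;/p&gt;

```shell
# Illustrative fragment: point Spark's scratch space (shuffle, spill) at a
# node-local SSD. /mnt/disks/ssd0 is an assumed hostPath; adapt it to your nodes.
# Volume names prefixed "spark-local-dir-" are treated as local storage by Spark.
spark-submit \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/tmp/spark-local \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/mnt/disks/ssd0
# ...plus your usual master, image, and application arguments
```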

&lt;h3&gt;Optimize your Spark pod sizes to avoid wasting capacity&lt;/h3&gt;

&lt;p&gt;Let’s go through an example. Suppose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your Kubernetes nodes have 4 CPUs&lt;/li&gt;
&lt;li&gt;You want to fit exactly one Spark executor pod per Kubernetes node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you would submit your Spark apps with the configuration spark.executor.cores=4, right? Wrong. Your Spark app will get stuck because the executors cannot fit on your nodes. You need to account for the overheads described in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--njDJe-x5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a999f59b3370e65550de8_Overheads%2520from%2520Kubernetes%2520and%2520Daemonsets%2520for%2520Apache%2520Spark%2520Nodes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--njDJe-x5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a999f59b3370e65550de8_Overheads%2520from%2520Kubernetes%2520and%2520Daemonsets%2520for%2520Apache%2520Spark%2520Nodes.png" alt="Overheads from Kubernetes and Daemonsets for Apache Spark Nodes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Typically, node allocatable represents 95% of the node capacity. The resources reserved for &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/"&gt;DaemonSets&lt;/a&gt; depend on your setup, but note that DaemonSets are popular for log and metrics collection, networking, and security. Let’s assume that this leaves you with 90% of the node capacity available to your Spark executors, so 3.6 CPUs.&lt;/p&gt;

&lt;p&gt;This means you could submit a Spark application with the configuration &lt;strong&gt;spark.executor.cores=3&lt;/strong&gt;. But this would reserve only 3 CPUs, and some capacity would be wasted. Therefore in this case we recommend the following configuration:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;spark.executor.cores=4&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;spark.kubernetes.executor.request.cores=3600m&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means your Spark executors will request exactly the 3.6 CPUs available, and Spark will schedule up to 4 tasks in parallel on this executor.&lt;/p&gt;
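&lt;p&gt;The sizing arithmetic above can be sketched in a few lines. The 95% allocatable and 5% DaemonSet figures are this example’s assumptions, not universal constants, so measure them on your own cluster:&lt;/p&gt;

```python
# Sketch of the executor-sizing arithmetic from the example above.
# The overhead fractions are assumptions from this example, not constants.
def executor_cpu_request(node_cpus, allocatable_fraction=0.95, daemonset_fraction=0.05):
    """CPU request in millicores for one executor pod filling a node."""
    usable_cpus = node_cpus * (allocatable_fraction - daemonset_fraction)
    return round(usable_cpus * 1000)  # Kubernetes millicores, e.g. 3600 -> "3600m"

print(f"spark.kubernetes.executor.request.cores={executor_cpu_request(4)}m")  # 3600m
print("spark.executor.cores=4  # task slots; may exceed the CPU request")
```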

&lt;blockquote&gt;
&lt;p&gt;Advanced tip:&lt;br&gt;
Setting spark.executor.cores greater (typically 2x or 3x greater) than spark.kubernetes.executor.request.cores is called oversubscription and can yield a significant performance boost for workloads where CPU usage is low.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this example we’ve shown you how to size your Spark executor pods so they fit tightly into your nodes (1 pod per node). Companies also commonly choose to use larger nodes and fit multiple pods per node. In this case you should still pay attention to your Spark CPU and memory requests to make sure the bin-packing of executors on nodes is efficient. This is one of the dynamic optimizations provided by the &lt;a href="https://www.datamechanics.co/blog-post/video-tour-of-data-mechanics-the-serverless-spark-platform"&gt;Data Mechanics&lt;/a&gt; platform.&lt;/p&gt;

&lt;h3&gt;Enable app-level dynamic allocation and cluster-level autoscaling&lt;/h3&gt;

&lt;p&gt;This is an absolute must-have if you’re running in the cloud and want to make your data infrastructure reactive and cost-efficient. There are two levels of dynamic scaling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;App-level dynamic allocation. This is the ability for each Spark application to request Spark executors at runtime (when there are pending tasks) and delete them (when they’re idle). Dynamic allocation is available on Kubernetes since Spark 3.0 by setting the following configurations:&lt;br&gt;
&lt;strong&gt;spark.dynamicAllocation.enabled=true&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;spark.dynamicAllocation.shuffleTracking.enabled=true&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cluster-level autoscaling. This means the Kubernetes cluster can request more nodes from the cloud provider when it needs more capacity to schedule pods, and vice-versa delete the nodes when they become unused. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
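&lt;p&gt;As spark-submit flags, the app-level part looks like this; the executor bounds are illustrative values, not recommendations:&lt;/p&gt;

```shell
# Illustrative fragment: Kubernetes dynamic allocation flags (Spark 3.0+).
# min/maxExecutors are example bounds; tune them to your workload.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20
# ...plus your usual master, image, and application arguments
```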

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U4SJMFQ0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a66fbd0fb8b722a4b7a98_Kubernetes%2520Cluster%2520Autoscaling%2520for%2520Apache%2520Spark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U4SJMFQ0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a66fbd0fb8b722a4b7a98_Kubernetes%2520Cluster%2520Autoscaling%2520for%2520Apache%2520Spark.png" alt="Kubernetes Cluster Dynamic Allocation and Autoscaling for Apache Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Together, these two settings will make your entire data infrastructure dynamically scale when Spark apps can benefit from new resources and scale back down when these resources are unused. In practice, starting a Spark pod takes just a few seconds when there is capacity in the cluster. If a new node must first be acquired from the cloud provider, you typically have to wait 1-2 minutes (depending on the cloud provider, region, and type of instance).&lt;/p&gt;

&lt;p&gt;If you want to guarantee that your applications always start in seconds, you can oversize your Kubernetes cluster by scheduling what is called “pause pods” on it. These are low-priority pods which basically do nothing. When a Spark app requires space to run, Kubernetes will delete these lower priority pods, and then reschedule them (causing the cluster to scale up in the background).&lt;/p&gt;
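&lt;p&gt;A sketch of such pause pods, assuming a negative-priority PriorityClass and the standard pause image (the replica count and CPU request are placeholders to size against your own headroom needs):&lt;/p&gt;

```yaml
# Illustrative sketch: "pause pods" that reserve headroom and yield to Spark.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1              # lower than the default (0), so Spark pods preempt these
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2          # how much headroom to keep; tune to your workloads
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: "1"   # roughly one executor's worth of CPU per pause pod
```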

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u9xFjWcA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a9f8752836a0704b27e2b_App%2520level%2520dynamic%2520allocation%2520and%2520cluster%2520level%2520autoscaling%2520-%2520Apache%2520Spark%2520on%2520Kubernetes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u9xFjWcA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a9f8752836a0704b27e2b_App%2520level%2520dynamic%2520allocation%2520and%2520cluster%2520level%2520autoscaling%2520-%2520Apache%2520Spark%2520on%2520Kubernetes.png" alt="Illustration of app-level dynamic allocation and cluster-level autoscaling in action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Use Spot nodes to reduce cloud costs&lt;/h3&gt;

&lt;p&gt;Spot (also known as preemptible) nodes typically cost around 75% less than on-demand machines, in exchange for lower availability (when you ask for Spot nodes there is no guarantee that you will get them) and unpredictable interruptions (these nodes can go away at any time).  &lt;/p&gt;

&lt;p&gt;Spark workloads work really well on spot nodes as long as you make sure that only Spark executors get placed on spot while the Spark driver runs on an on-demand machine. Indeed Spark can recover from losing an executor (a new executor will be placed on an on-demand node and rerun the lost computations) but not from losing its driver.&lt;/p&gt;

&lt;p&gt;To enable spot nodes in Kubernetes you should create multiple node pools (some on-demand and some spot) and then use node-selectors and node affinities to put the driver on an on-demand node and executors preferably on spot nodes. &lt;/p&gt;
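&lt;p&gt;With the spark-operator, this placement can be expressed on the SparkApplication spec itself. The node label key and values below are assumptions; use whatever labels your own node pools carry:&lt;/p&gt;

```yaml
# Illustrative fragment: driver on on-demand nodes, executors preferring spot.
# The "lifecycle" label and its values are assumptions; match your node pools.
spec:
  driver:
    nodeSelector:
      lifecycle: on-demand
  executor:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
            - key: lifecycle
              operator: In
              values: ["spot"]
```

&lt;p&gt;Using a preferred (rather than required) affinity for executors means they fall back to on-demand nodes when no spot capacity is available.&lt;/p&gt;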

&lt;h2&gt;Monitoring your Spark applications on Kubernetes&lt;/h2&gt;

&lt;h3&gt;Monitor pod resource usage using the Kubernetes Dashboard&lt;/h3&gt;

&lt;p&gt;The Kubernetes Dashboard is an open-source, general-purpose, web-based monitoring UI for Kubernetes. It gives you visibility over the apps running on your clusters, with essential metrics to troubleshoot their performance: memory usage, CPU utilization, I/O, disks, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R4CsqJOn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f693b3a21d3db447ce240d1_MuWKwYUadSAVfdBDc_iJaLezxf83Z0nQsgTuAfQF-gYewuSnlODKFUs4vfQwBUY-QuUmwsRn8CE8P3_YPNL8OZ8wLpWIK_z8GPHdTyxfK51qfBZyYFAiLVAeYbVIuwUqhbgw2CzD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R4CsqJOn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f693b3a21d3db447ce240d1_MuWKwYUadSAVfdBDc_iJaLezxf83Z0nQsgTuAfQF-gYewuSnlODKFUs4vfQwBUY-QuUmwsRn8CE8P3_YPNL8OZ8wLpWIK_z8GPHdTyxfK51qfBZyYFAiLVAeYbVIuwUqhbgw2CzD.png" alt="Pod Resource Usage Monitoring On The Kubernetes Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main issues with this project are that it’s cumbersome to reconcile these metrics with actual Spark jobs/stages, and that most of these metrics are lost when a Spark application finishes. Persisting them is a bit challenging but possible, for example using &lt;a href="https://www.slideshare.net/databricks/native-support-of-prometheus-monitoring-in-apache-spark-30"&gt;Prometheus&lt;/a&gt; (with a built-in servlet since Spark 3.0) or &lt;a href="https://github.com/cerndb/spark-dashboard"&gt;InfluxDB&lt;/a&gt;.&lt;/p&gt;
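&lt;p&gt;For instance, the Spark 3.0 built-in Prometheus support mentioned above can be turned on through configuration alone; a sketch, assuming the PrometheusServlet sink:&lt;/p&gt;

```shell
# Illustrative fragment: enable Spark 3.0's native Prometheus endpoints.
spark-submit \
  --conf spark.ui.prometheus.enabled=true \
  --conf "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet" \
  --conf "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus"
# Executor metrics are then exposed on the driver UI under /metrics/executors/prometheus
```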

&lt;h3&gt;How to access the Spark UI&lt;/h3&gt;

&lt;p&gt;The Spark UI is the essential monitoring tool built into Spark. The way you access it differs depending on whether the app is still running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the app is running, the Spark UI is served by the Spark driver directly on port 4040. To access it, port-forward to the driver pod by running the following command:&lt;br&gt;
&lt;strong&gt;$ kubectl port-forward &amp;lt;driver-pod-name&amp;gt; 4040:4040&lt;/strong&gt;&lt;br&gt;
You can then open up the Spark UI at &lt;a href="http://localhost:4040/"&gt;http://localhost:4040/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;When the app is completed, you can replay the Spark UI by running the Spark History Server and configuring it to read the Spark event logs from a persistent storage. You should first use the configuration spark.eventLog.dir to write these event logs to the storage backend of your choice. You should then follow this &lt;a href="https://github.com/helm/charts/tree/master/stable/spark-history-server"&gt;documentation&lt;/a&gt; to install the Spark History Server from a Helm Chart and point it to your storage backend.
UPDATE: As of November 2020, we have released a &lt;a href="https://www.datamechanics.co/blog-post/setting-up-managing-monitoring-spark-on-kubernetes"&gt;free, hosted, cross-platform Spark History Server&lt;/a&gt;. This is a simpler alternative to hosting the Spark History Server yourself!&lt;/li&gt;
&lt;/ul&gt;
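&lt;p&gt;The event-log side of this setup boils down to two configurations; the bucket name below is a placeholder, and the History Server (however you host it) must point at the same location:&lt;/p&gt;

```shell
# Illustrative fragment: persist Spark event logs so the History Server can
# replay the UI after the app completes. The bucket name is a placeholder.
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://my-bucket/spark-event-logs/
# ...plus your usual master, image, and application arguments
```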

&lt;p&gt;The main issue with the Spark UI is that it’s hard to find the information you’re looking for, and it lacks the system metrics (CPU, memory, I/O usage) that the previous tools provide.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WkiFrrYn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f693b3afda6435c30ceb42d_zEjfyrQ6_B_boGGsRdJdA3Ntifl7s3ag6RnEv_tb5K0D_Xk9QJsXJlDEPIlJs5dbK2fRbS9PzBEbZmQc58dSIcMwlokEGtUnBKPGIvyB3HMovWAD50MfM3SFmXlamYuWhdSnTul5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WkiFrrYn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f693b3afda6435c30ceb42d_zEjfyrQ6_B_boGGsRdJdA3Ntifl7s3ag6RnEv_tb5K0D_Xk9QJsXJlDEPIlJs5dbK2fRbS9PzBEbZmQc58dSIcMwlokEGtUnBKPGIvyB3HMovWAD50MfM3SFmXlamYuWhdSnTul5.png" alt="Data Mechanics Delight"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this reason, we’re developing Data Mechanics Delight, a new and improved Spark UI with new metrics and visualizations. This product will be free, partially open-source, and it will work on top of any Spark platform. We’re targeting a release in early 2021. Read more about it on our blog.&lt;/p&gt;

&lt;h2&gt;The future of Spark on Kubernetes&lt;/h2&gt;

&lt;p&gt;In the upcoming Apache Spark 3.1 release (expected in December 2020), Spark on Kubernetes will be declared Generally Available, while today the official documentation still marks it as experimental. This is due to a series of usability, stability, and performance improvements that came in Spark 2.4 and 3.0 and continue to be worked on.&lt;/p&gt;

&lt;p&gt;The most exciting features currently being worked on around Spark-on-Kubernetes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[SPARK-20624] Better handling for node shutdown. This feature, targeted for Spark 3.1, will let Spark use the few seconds of notice it gets when an executor is going down (due to dynamic allocation, or a Kubernetes node going down) to copy the shuffle and cache data files to other executors, so that this work isn’t lost.&lt;/li&gt;
&lt;li&gt;[SPARK-25299] Use remote storage for persisting shuffle data. This feature will let Spark asynchronously copy shuffle files to remote storage (like S3), making the shuffle architecture more robust on Kubernetes (leading to an end state where compute and storage resources are fully disaggregated). Another option being evaluated is implementing an external shuffle service using Kubernetes sidecars (SPARK-32642).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XBXjtd9r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a9a7d45559549157db420_%255BSPARK-25299%255D%2520Using%2520remote%2520storage%2520to%2520store%2520shuffle%2520files%2520without%2520impacting%2520performance.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XBXjtd9r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/5e72486289a61e0d8c9dbb56/5f6a9a7d45559549157db420_%255BSPARK-25299%255D%2520Using%2520remote%2520storage%2520to%2520store%2520shuffle%2520files%2520without%2520impacting%2520performance.png" alt="SPARK-25299 Using remote storage to store shuffle files without impacting performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At Data Mechanics, we firmly believe that the future of Spark on Kubernetes is simply the future of Apache Spark. As one of the first commercial Spark platforms deployed on Kubernetes (alongside Google Dataproc, which has beta support for Kubernetes), we are certainly biased, but the adoption trends in the community speak for themselves.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;We hope this article has given you useful insights into Spark-on-Kubernetes and how to be successful with it. If you’d like to get started with Spark-on-Kubernetes the easy way, &lt;a href="https://calendly.com/datamechanics/demo"&gt;book a time with us&lt;/a&gt;, our team at Data Mechanics will be more than happy to help you deliver on your use case. Focus on your data, while we handle the mechanics.&lt;/p&gt;

</description>
      <category>spark</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
