How to Run Spark SQL on Encrypted Data

Pan Chasinga — Tue, 10 Aug 2021 22:58:00 +0000

Introducing Opaque SQL, an open-source platform for securely running Spark SQL queries on encrypted data. Built by top systems and security researchers at UC Berkeley, the platform uses hardware enclaves to securely execute queries on private data in an untrusted environment.

Opaque SQL partitions the codebase into trusted and untrusted sections to improve runtime and reduce the amount of code that needs to be trusted. The project was designed to introduce as few changes to the Spark API as possible. If you are familiar with Spark SQL, then you already know how to run secure queries with Opaque SQL.

🚀 Prefer a quick, hands-on ride? Follow the Quick Start Guide with Docker and tell us about your experience.

What is Spark SQL?

For those of you who are new, Apache Spark is a popular distributed computing framework used by data scientists and engineers for processing large batches of data. One of its modules, Spark SQL, allows users to interact with structured, tabular data. This can be done through a Dataset/DataFrame API available in Scala or Python, or by using standard SQL queries. Here is a quick example of both in Scala:


// Needed for `toDF`, the `$"..."` column syntax, and `lit`
// (assumes a SparkSession named `spark`, as in spark-shell)
import spark.implicits._
import org.apache.spark.sql.functions.lit

// Convert a sequence of tuples into a Spark DataFrame
val data = Seq(("dog", 4), ("chameleon", 1), ("cat", 5))
val df = data.toDF("pet", "count")

/********** DataFrame API **********/

// Create a new DataFrame of rows with `count` greater than 3
val apiResult = df.filter($"count" > lit(3))

/******* Writing SQL queries *******/

// Register `df` as a virtual table used to evaluate SQL on
df.createOrReplaceTempView("df")
// Create a new DataFrame of rows with `count` greater than 3
val sqlStrResult = spark.sql("SELECT * FROM df WHERE count > 3")

In Python via PySpark:


from pyspark.sql.functions import lit

# Convert a list of tuples into a Spark DataFrame
# (assumes a SparkSession named `spark`, as in the pyspark shell)
data = [("dog", 4), ("chameleon", 1), ("cat", 5)]
df = spark.createDataFrame(data, ["pet", "count"])

######### DataFrame API ############

# Create a new DataFrame of rows with `count` greater than 3
api_result = df.filter(df["count"] > lit(3))

######## Write SQL queries #########

# Register `df` as a virtual table used to evaluate SQL on
df.createOrReplaceTempView("pets")
# Create a new DataFrame of rows with `count` greater than 3
sql_str_result = spark.sql("SELECT * FROM pets WHERE count > 3")

😉 If you haven't already, now is a good time to head over and install Spark and play with the code at the prompt.

Spark Components

Spark adopts a master-worker architecture in which the master is known as the driver and the workers are known as executors.

The driver is the process where the main Spark program runs. It is responsible for translating a user’s code into jobs to be run on the executors. For example, given a SQL query, the driver builds the SQL plan, performs optimization, and resolves the physical operators that the execution engine will use. It then schedules the compute tasks among the workers and keeps track of their progress until completion. Any metadata, such as the number of data partitions to use or how much memory each worker should have, is set on the driver.

The executors are responsible for the actual computation. Given a task from the driver, an executor performs the computation and coordinates its progress with the driver. Executors are launched at the start of every Spark application and can be dynamically added and removed by the driver as needed.
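
To make the division of labor concrete, here is a toy model in plain Python. It is not Spark code: the function names (`split_into_partitions`, `run_task`, `run_job`) and the thread pool standing in for executors are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Driver side: deal rows round-robin into `num_partitions` lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Executor side: keep rows whose count is greater than 3."""
    return [row for row in partition if row[1] > 3]


def run_job(rows, num_partitions=2, num_executors=2):
    """Driver side: schedule one task per partition, then collect the results."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # `map` returns results in partition order
        results = list(pool.map(run_task, partitions))
    return [row for part in results for row in part]


pets = [("dog", 4), ("chameleon", 1), ("cat", 5)]
filtered = run_job(pets)
```

The "driver" functions only decide how the work is split and scheduled, while `run_task` is the only place rows are actually inspected. Real Spark adds query planning, optimization, and fault tolerance on top of this basic shape.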

Computing on encrypted data using MC²

The MC² Project is a collection of tools for secure Multi-party Collaboration and Coopetition (hence MC²). This goal is achieved through the use of hardware enclaves. Enclaves provide strong security guarantees, keeping data encrypted in memory while in use. They also provide remote attestation, which ensures that the enclaves responsible for computation are running the correct sets of instructions. The result is a platform capable of computing on sensitive data in an untrusted environment, such as a public cloud.

Opaque SQL in a Nutshell

The Opaque SQL query resolution stack. MC² components are in blue.

At a high level, Opaque SQL is a Spark package that uses hardware enclaves to partition Spark’s architecture into untrusted and trusted components. It was originally developed at UC Berkeley’s RISELab as the implementation of an NSDI 2017 paper.

Untrusted Driver

While the query and table schemas are not hidden because the Spark driver still needs to perform planning, the driver is only able to access completely encrypted data. The physical plan built will contain entirely encrypted operators in place of vanilla Spark operators. (However, since the driver is building the plan, the client needs to verify that the plan created is correct; support for this is currently a work-in-progress and will be part of the next release.)

Enclaves on Executor Machines

During execution, the executor program calls into Opaque SQL’s native library that’s loaded inside the hardware enclave. The native library provides encrypted SQL operators that can execute on encrypted, sensitive data inside the enclave. Any private column data such as SSNs, bank account numbers, or PHI remains encrypted in memory and is protected by the enclave.
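
The trust boundary described here can be sketched in a few lines of plain Python. This is a toy model and not Opaque SQL's native library: the function names are hypothetical, and the XOR keystream cipher is a deliberately simple stand-in for real authenticated encryption that must not be used to protect actual data.

```python
import hashlib
import json
import secrets


def _keystream(key, nonce, length):
    """Toy keystream from SHA-256 in counter mode. NOT secure; illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_row(key, row):
    """Serialize a row and XOR it with a fresh keystream; returns (nonce, ciphertext)."""
    plaintext = json.dumps(row).encode()
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce, bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt_row(key, blob):
    """Invert `encrypt_row` using the nonce stored alongside the ciphertext."""
    nonce, ciphertext = blob
    stream = _keystream(key, nonce, len(ciphertext))
    return json.loads(bytes(c ^ s for c, s in zip(ciphertext, stream)).decode())


def enclave_filter(key, encrypted_rows, min_count):
    """Stands in for code inside the enclave: decrypt, filter, re-encrypt."""
    kept = []
    for blob in encrypted_rows:
        row = decrypt_row(key, blob)
        if row["count"] > min_count:
            kept.append(encrypt_row(key, row))
    return kept


key = secrets.token_bytes(32)
rows = [{"pet": "dog", "count": 4}, {"pet": "chameleon", "count": 1}, {"pet": "cat", "count": 5}]
encrypted = [encrypt_row(key, r) for r in rows]          # all the untrusted side ever holds
result = enclave_filter(key, encrypted, 3)               # output is encrypted again
decrypted = [decrypt_row(key, blob) for blob in result]  # readable only with the key
```

Plaintext rows exist only inside `enclave_filter`; everything passed in or out is ciphertext. In the real system the hardware enforces that boundary, rather than a function call.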

The MC² Client: The Entry Point

The MC² Client is responsible for communicating with the Spark driver and performing remote attestation and query submission. It is a trusted component that is located on the user’s machine.

Remote attestation

Attest, what? To put it simply, remote attestation is just a way to have the user verify that the enclaves were initialized correctly with the right code to run. The client talks to the driver, which forwards attestation information to the enclaves that are running on the executors. No enclave is able to decrypt any data until attestation is complete and the results are verified by the user. Think of it as a way for you, the data owner, to sign off and trust the enclave to start running code on your behalf.
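
The idea can be illustrated with a toy model in plain Python. Everything here is an assumption for illustration: `HARDWARE_KEY`, `TRUSTED_CODE`, and the function names are hypothetical, and a shared-secret HMAC stands in for the hardware-signed report that a real enclave would produce and that the client would check against the hardware vendor's public keys.

```python
import hashlib
import hmac
import secrets

# Hypothetical stand-ins: real attestation relies on a hardware-rooted signing key.
HARDWARE_KEY = secrets.token_bytes(32)
TRUSTED_CODE = b"opaque-sql-enclave-build"


def measure(code):
    """Hash of the code loaded into the enclave (its 'measurement')."""
    return hashlib.sha256(code).digest()


def enclave_report(code, nonce):
    """Enclave side: return the measurement and a MAC binding it to the client's nonce."""
    measurement = measure(code)
    tag = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    return measurement, tag


def client_verify(report, nonce, expected_measurement):
    """Client side: accept only an authentic, fresh report for the expected code."""
    measurement, tag = report
    expected_tag = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    authentic = hmac.compare_digest(tag, expected_tag)
    right_code = hmac.compare_digest(measurement, expected_measurement)
    return authentic and right_code


def release_data_key(report, nonce, expected_measurement, data_key):
    """Client side: hand over the data key only after verification succeeds."""
    if client_verify(report, nonce, expected_measurement):
        return data_key
    return None


nonce = secrets.token_bytes(16)
expected = measure(TRUSTED_CODE)
good = client_verify(enclave_report(TRUSTED_CODE, nonce), nonce, expected)
bad = client_verify(enclave_report(b"tampered-build", nonce), nonce, expected)
```

Here `good` is `True` and `bad` is `False`: an enclave running different code produces a different measurement, so the client never releases the key that would let it decrypt data.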

Query submission

Query submission happens after attestation is completed successfully, and is the step where Spark code is remotely submitted to the driver for evaluation. Any intermediate values remain encrypted throughout the lifetime of the execution stage.

How the MC² client communicates with Opaque SQL

Usage

A key design goal of Opaque SQL is to keep its API as close to Spark SQL as possible. An encrypted DataFrame is loaded through a special format:


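// Unencrypted (the path is a placeholder)
val df = spark.read.format("com.databricks.spark.csv").load("path/to/data")
// Encrypted (the path is a placeholder)
val dfEncrypted = spark.read.format("edu.berkeley.cs.rise.opaque.EncryptedSource").load("path/to/encrypted/data")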

After loading, Spark transformations are applied exactly as in vanilla Spark; the only difference is that new encrypted physical operators are created during planning:


val result = dfEncrypted.filter($"count" > lit(3))
result.explain
// == Physical Plan ==
// EncryptedFilter (count#5 > 3)
// +- EncryptedLocalTableScan [word#4, count#5], [[foo,4], [bar,1], [baz,5]]

To save the result after a query has been created, use the same format as loading:


result.write
  .format("edu.berkeley.cs.rise.opaque.EncryptedSource")
  .save("path/to/result")

💡 Now is the time to check our complete API docs to continue your learning journey.

Wrapping Up

Opaque SQL enables analytics processing over encrypted DataFrames, with little overhead compared to vanilla Spark. In turn, this extension protects data in use in the cloud as well as data at rest. Queries are submitted remotely with the help of the MC² Client, an easy-to-use interface for communicating with all compute services on the MC² stack.

Check out more blog posts on how to securely process data with the MC² Project. We would love your contributions ✋ and support ⭐! Please check out the GitHub repo to see how you can contribute. No contribution is too small.

Edited by @pancy. Originally posted here by @octaviansima.

DEV Community: Opaque