<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Opaque </title>
    <description>The latest articles on DEV Community by Opaque  (@opaque).</description>
    <link>https://dev.to/opaque</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F4473%2F657553e8-e517-4147-9283-8d425fc4fce9.png</url>
      <title>DEV Community: Opaque </title>
      <link>https://dev.to/opaque</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/opaque"/>
    <language>en</language>
    <item>
      <title>How to Run Spark SQL on Encrypted Data</title>
      <dc:creator>Pan Chasinga</dc:creator>
      <pubDate>Tue, 10 Aug 2021 22:58:00 +0000</pubDate>
      <link>https://dev.to/opaque/how-to-run-spark-sql-on-encrypted-data-1896</link>
      <guid>https://dev.to/opaque/how-to-run-spark-sql-on-encrypted-data-1896</guid>
      <description>&lt;p&gt;Introducing &lt;a href="https://github.com/mc2-project/opaque-sql" rel="noopener noreferrer"&gt;Opaque SQL&lt;/a&gt;, an open-source platform for securely running Spark SQL queries on encrypted data. Built by top systems and security researchers at UC Berkeley, the platform uses &lt;a href="https://en.wikipedia.org/wiki/Software_Guard_Extensions" rel="noopener noreferrer"&gt;hardware enclaves&lt;/a&gt; to securely execute queries on private data in an untrusted environment. &lt;/p&gt;

&lt;p&gt;Opaque SQL partitions the codebase into trusted and untrusted sections to improve runtime and reduce the amount of code that needs to be trusted. The project was designed to introduce as few changes to the Spark API as possible. If you are familiar with Spark SQL, then you already know how to run secure queries with Opaque SQL.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Prefer a quick, hands-on ride? Follow the &lt;a href="https://github.com/mc2-project/mc2#quickstart" rel="noopener noreferrer"&gt;Quick Start Guide with Docker&lt;/a&gt; and &lt;a href="https://join.slack.com/t/mc2-project/shared_invite/zt-rt3kxyy8-GS4KA0A351Ysv~GKwy8NEQ" rel="noopener noreferrer"&gt;tell us about your experience&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is Spark SQL?
&lt;/h2&gt;

&lt;p&gt;For those of you who are new, &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt; is a popular distributed computing framework used by data scientists and engineers for processing large batches of data. One of its modules, &lt;a href="https://spark.apache.org/sql/" rel="noopener noreferrer"&gt;Spark SQL&lt;/a&gt;, allows users to interact with structured, tabular data. This can be done through a &lt;a href="https://spark.apache.org/docs/3.1.1/api/scala/org/apache/spark/sql/Dataset.html" rel="noopener noreferrer"&gt;Dataset/DataFrame&lt;/a&gt; API available in Scala or Python, or by using standard &lt;a href="https://spark.apache.org/docs/latest/api/sql/index.html" rel="noopener noreferrer"&gt;SQL queries&lt;/a&gt;. Here is a quick example of both in Scala:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;
&lt;span class="c1"&gt;// Convert a sequence of tuples into a Spark DataFrame&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Seq&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="s"&gt;"dog"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"chameleon"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cat"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;toDF&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"pet"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"count"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="cm"&gt;/********** DataFrame API **********/&lt;/span&gt;

&lt;span class="c1"&gt;// Create a new DataFrame of rows with `count` greater than 3&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;apiResult&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"count"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;

&lt;span class="cm"&gt;/******* Writing SQL queries *******/&lt;/span&gt;

&lt;span class="c1"&gt;// Register `df` as a virtual table used to evaluate SQL on&lt;/span&gt;
&lt;span class="nv"&gt;df&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"df"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Create a new DataFrame of rows with `count` greater than 3&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;sqlStrResult&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"SELECT * FROM df WHERE count &amp;gt; 3"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Python via PySpark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lit&lt;/span&gt;

&lt;span class="c1"&gt;# Convert a list of tuples into a Spark DataFrame
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chameleon&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toDF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;######### DataFrame API ############
&lt;/span&gt;
&lt;span class="c1"&gt;# Create a new DataFrame of rows with `count` greater than 3
&lt;/span&gt;&lt;span class="n"&gt;api_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;######## Write SQL queries #########
&lt;/span&gt;
&lt;span class="c1"&gt;# Register `df` as a virtual table used to evaluate SQL on
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Create a new DataFrame of rows with `count` greater than 3
&lt;/span&gt;&lt;span class="n"&gt;sql_str_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM pets WHERE count &amp;gt; 3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;😉 If you haven't already, now is a good time to head over and install &lt;a href="https://spark.apache.org/downloads.html" rel="noopener noreferrer"&gt;Spark&lt;/a&gt; and play with the code at the prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Spark Components
&lt;/h3&gt;

&lt;p&gt;Spark uses a master-worker architecture in which the master is known as the driver and the workers are known as executors.&lt;/p&gt;

&lt;p&gt;The driver is the process where the main Spark program runs. It is responsible for translating a user’s code into jobs to be run on the executors. For example, given a SQL query, the driver builds the SQL plan, performs optimization, and resolves the physical operators that the execution engine will use. It then schedules the compute tasks among the workers and keeps track of their progress until completion. Any metadata, such as the number of data partitions to use or how much memory each worker should have, is set on the driver.&lt;/p&gt;

&lt;p&gt;The executors are responsible for the actual computation. Given a task from the driver, an executor performs the computation and reports its progress back to the driver. Executors are launched at the start of every Spark application and can be dynamically added and removed by the driver as needed.&lt;/p&gt;
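
&lt;p&gt;To make the driver and executor roles concrete, here is a minimal toy model in plain Python. It is not Spark code: the function &lt;code&gt;run_job&lt;/code&gt;, its parameters, and the thread pool standing in for executors are all assumptions made for illustration. The "driver" picks the number of partitions, hands one task per partition to the pool, and merges the results.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(rows, predicate, num_partitions=2, num_executors=2):
    """Toy 'driver' (hypothetical helper, not a Spark API).

    Splits rows into partitions, schedules one task per partition on a
    pool of 'executors', then collects the per-partition results.
    """
    # Round-robin partitioning; the partition count is decided on the driver.
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]

    def task(partition):
        # Work done by a single executor: filter only its own partition.
        return [row for row in partition if predicate(row)]

    # The thread pool stands in for the set of executor processes.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        results = list(executors.map(task, partitions))

    # The driver merges the partial results into the final answer.
    return [row for part in results for row in part]


data = [("dog", 4), ("chameleon", 1), ("cat", 5)]
result = run_job(data, lambda row: row[1] > 3)
print(sorted(result))  # [('cat', 5), ('dog', 4)]
```

&lt;p&gt;Real Spark adds much more on top of this (query planning, shuffles, fault tolerance), but the division of labor is the same: planning and scheduling on the driver, row-level work on the executors.&lt;/p&gt;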

&lt;h2&gt;
  
  
  Computing on encrypted data using MC²
&lt;/h2&gt;

&lt;p&gt;The MC² Project is a collection of tools for secure &lt;strong&gt;M&lt;/strong&gt;ulti-party &lt;strong&gt;C&lt;/strong&gt;ollaboration and &lt;strong&gt;C&lt;/strong&gt;oopetition (hence MC²). This goal is achieved through the use of hardware enclaves. Enclaves provide strong security guarantees, keeping data encrypted in memory while in use. They also provide &lt;a href="https://software.intel.com/content/www/us/en/develop/topics/software-guard-extensions.html" rel="noopener noreferrer"&gt;remote attestation&lt;/a&gt;, which ensures that the enclaves responsible for computation are running the correct sets of instructions. The result is a platform capable of computing on sensitive data in an untrusted environment, such as a public cloud.&lt;/p&gt;

&lt;h2&gt;
  
  
  Opaque SQL in a Nutshell
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhms8vnk7cldt78zcb5yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhms8vnk7cldt78zcb5yn.png" alt="Opaque SQL Architectural Diagram" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The Opaque SQL query resolution stack. MC² components are in blue.&lt;/p&gt;



&lt;p&gt;At a high level, Opaque SQL is a Spark package that uses hardware enclaves to partition Spark’s architecture into untrusted and trusted components. It was originally developed at &lt;a href="https://rise.cs.berkeley.edu/" rel="noopener noreferrer"&gt;UC Berkeley’s RISELab&lt;/a&gt; as the implementation of an &lt;a href="https://www.usenix.org/system/files/conference/nsdi17/nsdi17-zheng.pdf" rel="noopener noreferrer"&gt;NSDI 2017 paper&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Untrusted Driver
&lt;/h3&gt;

&lt;p&gt;The query and table schemas are not hidden, because the Spark driver still needs them to perform planning. The data itself, however, is only ever visible to the driver in encrypted form, and the physical plan it builds contains encrypted operators in place of vanilla Spark operators. (Since the driver builds the plan, the client needs to verify that the plan created is correct; support for this is currently a work in progress and will be part of the next release.)&lt;/p&gt;

&lt;h3&gt;
  
  
  Enclaves on Executor Machines
&lt;/h3&gt;

&lt;p&gt;During execution, the executor program calls into Opaque SQL’s native library that’s loaded inside the hardware enclave. The native library provides encrypted SQL operators that can execute on encrypted, sensitive data inside the enclave. Any private column data such as SSNs, bank account numbers, or &lt;a href="https://www.hhs.gov/answers/hipaa/what-is-phi/index.html" rel="noopener noreferrer"&gt;PHI&lt;/a&gt; remains encrypted in memory and is protected by the enclave.&lt;/p&gt;
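
&lt;p&gt;The idea of an encrypted operator can be sketched in a few lines of Python. This is only a toy model under loud assumptions: Opaque SQL's real operators live in a native library running inside a hardware enclave, whereas here the "enclave" is an ordinary function, and the cipher is a hand-rolled SHA-256 keystream that is NOT secure and merely stands in for real authenticated encryption. All names (&lt;code&gt;encrypt_row&lt;/code&gt;, &lt;code&gt;decrypt_row&lt;/code&gt;, &lt;code&gt;enclave_filter&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```python
import hashlib
import json
import secrets


def _keystream(key, nonce, length):
    # Toy keystream built from SHA-256 in counter mode. NOT secure;
    # it only stands in for a real authenticated cipher.
    out = b""
    counter = 0
    while length > len(out):
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def encrypt_row(key, row):
    # Serialize the row, XOR it with the keystream, and prepend the nonce.
    nonce = secrets.token_bytes(12)
    plain = json.dumps(row).encode()
    stream = _keystream(key, nonce, len(plain))
    return nonce + bytes(a ^ b for a, b in zip(plain, stream))


def decrypt_row(key, blob):
    nonce, cipher = blob[:12], blob[12:]
    stream = _keystream(key, nonce, len(cipher))
    return json.loads(bytes(a ^ b for a, b in zip(cipher, stream)).decode())


def enclave_filter(key, encrypted_rows, min_count):
    """Stand-in for an encrypted filter operator.

    Rows are decrypted only inside this function, and every row that
    passes the predicate is re-encrypted before it leaves.
    """
    kept = []
    for blob in encrypted_rows:
        pet, count = decrypt_row(key, blob)
        if count > min_count:
            kept.append(encrypt_row(key, [pet, count]))
    return kept


key = secrets.token_bytes(32)
rows = [["dog", 4], ["chameleon", 1], ["cat", 5]]
table = [encrypt_row(key, row) for row in rows]

# Code outside the "enclave" only ever handles opaque byte strings.
encrypted_result = enclave_filter(key, table, 3)
decrypted = [decrypt_row(key, blob) for blob in encrypted_result]
print(decrypted)  # [['dog', 4], ['cat', 5]]
```

&lt;p&gt;In this sketch the key is an ordinary variable; in the real system the point of the enclave is that the key and the decrypted rows are never visible to the untrusted host.&lt;/p&gt;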

&lt;h2&gt;
  
  
  The MC² Client: The Entry Point
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/mc2-project/mc2" rel="noopener noreferrer"&gt;MC² Client&lt;/a&gt; is responsible for communicating with the Spark driver and performing remote attestation and query submission. It is a trusted component that is located on the user’s machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Remote attestation
&lt;/h3&gt;

&lt;p&gt;Attest, what? To put it simply, remote attestation is just a way to have the user verify that the enclaves were initialized correctly with the right code to run. The client talks to the driver, which forwards attestation information to the enclaves that are running on the executors. No enclave is able to decrypt any data until attestation is complete and the results are verified by the user. Think of it as a way for you, the data owner, to sign off and trust the enclave to start running code on your behalf.&lt;/p&gt;
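
&lt;p&gt;As a rough mental model of that sign-off, the Python sketch below has a client release a data key only to an "enclave" whose code measurement matches what the client expects. Everything here is a simplifying assumption: real attestation reports are produced and signed by the hardware and travel through the driver, while this toy uses a plain SHA-256 hash as the measurement and calls the enclave directly. &lt;code&gt;ToyEnclave&lt;/code&gt;, &lt;code&gt;attest_and_provision&lt;/code&gt;, and the build strings are hypothetical names.&lt;/p&gt;

```python
import hashlib
import hmac
import secrets


def measure(code):
    # Toy "measurement": a hash of the code loaded into the enclave.
    return hashlib.sha256(code).hexdigest()


class ToyEnclave:
    """Hypothetical stand-in for a hardware enclave."""

    def __init__(self, code):
        self._code = code
        self._key = None

    def report(self, challenge):
        # A real report is signed by the hardware; this one is not.
        return {"measurement": measure(self._code), "challenge": challenge}

    def provision_key(self, key):
        self._key = key

    def can_decrypt(self):
        # Without a provisioned key the enclave cannot read any data.
        return self._key is not None


def attest_and_provision(enclave, expected_measurement, data_key):
    """Client side: verify the report, then release the key."""
    challenge = secrets.token_hex(16)  # fresh value to rule out replays
    report = enclave.report(challenge)
    fresh = hmac.compare_digest(report["challenge"], challenge)
    trusted = hmac.compare_digest(report["measurement"], expected_measurement)
    if fresh and trusted:
        enclave.provision_key(data_key)
        return True
    return False


expected = measure(b"expected-enclave-build")  # placeholder build identity
good = ToyEnclave(b"expected-enclave-build")
bad = ToyEnclave(b"tampered-build")
data_key = secrets.token_bytes(32)

good_ok = attest_and_provision(good, expected, data_key)
bad_ok = attest_and_provision(bad, expected, data_key)
print(good_ok, good.can_decrypt())  # True True
print(bad_ok, bad.can_decrypt())    # False False
```

&lt;p&gt;The property to notice is the ordering: the key is handed over only after the check succeeds, so an enclave running unexpected code never gains the ability to decrypt.&lt;/p&gt;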

&lt;h3&gt;
  
  
  Query submission
&lt;/h3&gt;

&lt;p&gt;Query submission happens after attestation is completed successfully, and is the step where Spark code is remotely submitted to the driver for evaluation. Any intermediate values remain encrypted throughout the lifetime of the execution stage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx726pnmxct56slsijcik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx726pnmxct56slsijcik.png" alt="MC2 Client Diagram" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;How the MC² client communicates with Opaque SQL&lt;/p&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;A key design goal of Opaque SQL is to keep its API as similar to Spark SQL's as possible. An encrypted DataFrame is loaded through a special format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;
&lt;span class="c1"&gt;// Unencrypted ("path/to/data" is a placeholder path)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;df&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"com.databricks.spark.csv"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path/to/data"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Encrypted ("path/to/encrypted_data" is a placeholder path)&lt;/span&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;dfEncrypted&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;read&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"edu.berkeley.cs.rise.opaque.EncryptedSource"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path/to/encrypted_data"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After loading, Spark transformations are applied exactly as in vanilla Spark, except that new encrypted physical operators are created during planning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;
&lt;span class="k"&gt;val&lt;/span&gt; &lt;span class="nv"&gt;result&lt;/span&gt; &lt;span class="k"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;dfEncrypted&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;filter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;$&lt;/span&gt;&lt;span class="s"&gt;"count"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;
&lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;explain&lt;/span&gt;
&lt;span class="c1"&gt;// == Physical Plan ==&lt;/span&gt;
&lt;span class="c1"&gt;// EncryptedFilter (count#5 &amp;gt; 3)&lt;/span&gt;
&lt;span class="c1"&gt;// +- EncryptedLocalTableScan [word#4, count#5], [[foo,4], [bar,1], [baz,5]]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To save the result after a query has been created, use the same format as loading:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight scala"&gt;&lt;code&gt;
&lt;span class="nv"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;write&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;format&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"edu.berkeley.cs.rise.opaque.EncryptedSource"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="py"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"path/to/result"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Now is the time to check our complete &lt;a href="https://mc2-project.github.io/opaque-sql/usage/functionality.html" rel="noopener noreferrer"&gt;API docs&lt;/a&gt; to continue your learning journey.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Opaque SQL enables analytics processing over encrypted DataFrames, with little overhead compared to vanilla Spark. In turn, this extension protects data in use in the cloud as well as data at rest. Queries are submitted remotely with the help of the MC² Client, an easy-to-use interface for communicating with all compute services on the MC² stack.&lt;/p&gt;

&lt;p&gt;Check out &lt;a href="https://mc2-project.github.io/resources.html#blogs" rel="noopener noreferrer"&gt;more blog posts&lt;/a&gt; on how to securely process data with the MC² Project. We would love your contributions ✋ and support ⭐! Please check out the &lt;a href="https://github.com/mc2-project/mc2" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; to see how you can contribute. No contribution is too small.&lt;/p&gt;




&lt;p&gt;Edited by &lt;a class="mentioned-user" href="https://dev.to/pancy"&gt;@pancy&lt;/a&gt;. Originally posted &lt;a href="https://towardsdatascience.com/how-to-run-spark-sql-on-encrypted-data-c10adf64619" rel="noopener noreferrer"&gt;here&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/octaviansima"&gt;@octaviansima&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sql</category>
      <category>scala</category>
      <category>database</category>
    </item>
  </channel>
</rss>
