Announcing MC²: Securely perform analytics and machine learning on confidential data


We are excited to announce the initial release of the open source MC² Project, a collection of tools for computing and collaborating on confidential data. Developed by our team in the UC Berkeley RISELab, MC² (Multi-Party Collaboration and Coopetition) enables rich analytics and machine learning on encrypted data, so that the data remains concealed even while it is being processed. Because data in use stays hidden from the server running the job, confidential workloads can be offloaded to untrusted third parties or cloud providers. This not only protects confidential data from intrusions, but also enables secure collaboration: multiple data owners can jointly run analytics or ML on their collective data without revealing their individual data to anyone else, not even a trusted third party.

Need for confidential computing and secure data sharing

Personal data is becoming more pervasive, and privacy concerns continue to grow. As a result, data protection laws around the world are becoming stricter, and organizations face increasingly high noncompliance risks. At the same time, these organizations are realizing the enormous benefits of sharing their data with each other: banks can collaborate to detect financial crime, health institutions can collaborate on medical studies, and so on.

Driven by these developments, Gartner predicts that, by 2025, “50% of large organizations will adopt privacy-enhancing computation for processing data in untrusted environments and multiparty data analytics use cases.”

The goal of the MC² Project is to realize this vision and resolve the tension between expanding cloud adoption, the need for data sharing, and the increasing concern over data privacy.

Use Cases

MC² has already seen industry adoption and interest in applications surrounding finance and telecommunications: Ant Financial and Scotiabank for efforts towards anti-money laundering, fraud detection, and credit risk modeling; and Ericsson for predicting hardware faults and performance problems across different mobile network operators.

More generally, industries that have data locked down due to privacy concerns can benefit from MC². Our platform keeps confidential data, such as SSNs or PHI, hidden during computation with the use of secure enclaves such as Intel SGX.

The key technology: Secure enclaves

What are secure enclaves?

Enclaves provide isolated execution: Secure enclaves are a recent technology that enables the creation of a trusted execution environment (TEE) within an otherwise untrusted machine. Each enclave has access to a restricted portion of the memory; any data or software placed within the enclave is encrypted and isolated from the rest of the system. No other process on the same processor, not even privileged software such as the OS or the hypervisor, can access the encrypted enclave memory. This creates a layer of protection against intrusion from the operating system itself; when enclaves are used properly, anyone with root access to a machine running the workload can learn little to no information about what is happening inside the enclave.

Enclaves support remote attestation: Another key feature of secure enclaves is remote attestation. This is a feature that enables users to cryptographically verify that an enclave is running trusted, unmodified code. The MC² Project provides a remote attestation platform for users to attest any non-local compute service from a trusted local client running on their own machine.
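The idea behind remote attestation can be sketched in a few lines of Python. This is a simplified illustration, not the MC² attestation API: the names `HARDWARE_KEY`, `enclave_report`, and `client_verify` are hypothetical, and a shared HMAC key stands in for the asymmetric, hardware-rooted signature that real enclaves use.

```python
import hashlib
import hmac
import os

# Hypothetical stand-in for the hardware root of trust. Real enclaves sign
# reports with an asymmetric key rooted in the CPU vendor; a shared HMAC key
# is used here only to keep the sketch dependency-free.
HARDWARE_KEY = b"demo-hardware-key"


def measure(code: bytes) -> bytes:
    """Return a measurement (hash) of the code loaded into the enclave."""
    return hashlib.sha256(code).digest()


def enclave_report(code: bytes, nonce: bytes) -> tuple[bytes, bytes]:
    """Simulated enclave side: return (measurement, signature over measurement + nonce)."""
    measurement = measure(code)
    signature = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    return measurement, signature


def client_verify(expected_code: bytes, nonce: bytes,
                  measurement: bytes, signature: bytes) -> bool:
    """Client side: accept only an authentic, fresh report for the expected code."""
    expected_sig = hmac.new(HARDWARE_KEY, measurement + nonce, hashlib.sha256).digest()
    authentic = hmac.compare_digest(signature, expected_sig)
    matches = hmac.compare_digest(measurement, measure(expected_code))
    return authentic and matches


trusted_code = b"def run(query): ..."
nonce = os.urandom(16)  # fresh per attestation, so old reports cannot be replayed

ok = client_verify(trusted_code, nonce, *enclave_report(trusted_code, nonce))
tampered = client_verify(
    trusted_code, nonce, *enclave_report(trusted_code + b"# backdoor", nonce)
)
print(ok, tampered)
```

The client trusts the enclave only when the reported measurement matches the code it expects and the signature covers the client's own nonce; modified code changes the measurement and is rejected.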

Enclaves and side-channels: Unfortunately, loading existing software into enclaves could expose the data to certain side-channel attacks, where an attacker can learn additional information about the encrypted data by observing auxiliary information such as data access patterns during the software’s execution. Preventing such leakage is left to the software developer; MC² tackles this problem by fortifying the enclave code and ensuring it is resilient to side-channel leakage via memory access patterns.

Secure enclaves vs. other approaches

Secure enclaves are not the only privacy-enhancing approach for computing on confidential data. Here, we compare them to other popular alternatives:

MC² provides a software stack that powers secure enclaves

In particular, MC² provides a platform that can seamlessly run popular analytics and machine learning frameworks (Apache Spark, XGBoost, etc.) within enclaves securely and efficiently, abstracting away the complexities of writing enclave code from the end user.

One approach to using enclaves is to simply load the entire application (e.g., Apache Spark) into the enclave. However, doing so adversely affects both the security and efficiency of the enclave application. For instance, if the program is memory-intensive, the performance will be greatly impacted by excessive encryption/decryption and paging. Instead:

MC² partitions the enclave code for security and efficiency: MC² partitions the application so that only the components that need to compute directly on the sensitive data are loaded into the enclave. Other components, such as network communication and task scheduling, are executed outside the enclave. This also benefits security by reducing the trusted computing base, i.e., the amount of code that runs within the enclave and therefore needs to be vetted beforehand.
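The partitioning described above can be illustrated with a minimal Python sketch. It is not MC² code: `enclave_sum` and `untrusted_scheduler` are hypothetical names, the key is assumed to have been shared with the enclave after attestation, and the cipher is a toy keystream construction without integrity protection, used only to avoid third-party dependencies.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA256 in counter mode (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy encryption (no authentication): random nonce followed by XORed body."""
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


def decrypt(key: bytes, blob: bytes) -> bytes:
    """Inverse of encrypt: split off the nonce and XOR the body with the keystream."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


def enclave_sum(key: bytes, blob: bytes) -> bytes:
    """Trusted part: the only function that ever sees plaintext."""
    values = [int(v) for v in decrypt(key, blob).decode().split(",")]
    return encrypt(key, str(sum(values)).encode())


def untrusted_scheduler(blobs: list[bytes], run_in_enclave) -> list[bytes]:
    """Untrusted part: routes opaque ciphertext to workers and never holds the key."""
    return [run_in_enclave(blob) for blob in blobs]


key = os.urandom(32)  # assumed to be shared between client and enclave only
partitions = [encrypt(key, b"1,2,3"), encrypt(key, b"10,20")]
results = untrusted_scheduler(partitions, lambda blob: enclave_sum(key, blob))
totals = [int(decrypt(key, r).decode()) for r in results]
print(totals)  # [6, 30]
```

Only `enclave_sum` needs to be vetted as trusted code; the scheduler can be arbitrarily complex because it handles nothing but ciphertext.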

MC² fortifies enclave execution: MC² fortifies the enclave components using cryptographic techniques to provide stronger security guarantees. This is done in two ways. First, MC² builds in measures to verify the integrity of jobs that have distributed execution. Second, since enclaves are known to be vulnerable to side-channel leakage, MC² makes use of data-oblivious techniques in enclave code to make sure that no side-channel information is leaked via memory access patterns. Data-obliviousness ensures that the memory access patterns do not reveal any information about the sensitive data being accessed.
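To make data-obliviousness concrete, here is a small Python sketch of an oblivious table lookup. It is an illustration of the general idea rather than MC²'s implementation: `TracedList`, `leaky_lookup`, and `oblivious_lookup` are hypothetical names, and Python itself offers no constant-time guarantees, so the sketch demonstrates only the access-pattern property.

```python
class TracedList(list):
    """List that records which indices are read, standing in for an
    attacker who observes memory access patterns."""

    def __init__(self, items):
        super().__init__(items)
        self.trace = []

    def __getitem__(self, index):
        self.trace.append(index)
        return super().__getitem__(index)


def leaky_lookup(table, secret_index):
    """Touches a single slot, so the access pattern reveals secret_index."""
    return table[secret_index]


def oblivious_lookup(table, secret_index):
    """Touches every slot in the same order, whatever secret_index is."""
    result = 0
    for i in range(len(table)):
        value = table[i]
        # mask is -1 (all bits set) when i == secret_index, otherwise 0,
        # so the selection happens arithmetically rather than by indexing
        mask = -int(i == secret_index)
        result |= value & mask
    return result


def access_trace(lookup, secret_index):
    """Run a lookup on a fresh traced table; return (value, indices read)."""
    table = TracedList([40, 41, 42, 43])
    value = lookup(table, secret_index)
    return value, table.trace


oblivious_a = access_trace(oblivious_lookup, 1)
oblivious_b = access_trace(oblivious_lookup, 3)
leaky_a = access_trace(leaky_lookup, 1)
leaky_b = access_trace(leaky_lookup, 3)

print(oblivious_a[1] == oblivious_b[1])  # True: identical access pattern
print(leaky_a[1] == leaky_b[1])          # False: pattern depends on the secret
```

The oblivious version pays for its security with a full scan of the table, which is the typical trade-off of data-oblivious techniques.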

End-to-End Workflow

Above is a general workflow of a job running on the MC² stack. Trusted components are green, while untrusted are red.

System Components

The MC² Client: The entry point to all compute jobs supported by MC² is the MC² Client. This tool runs in a trusted environment, typically the user’s local machine. Through a command line or Python interface, the client software is responsible for handling remote attestation and submitting jobs to the untrusted compute cluster. The client also contains additional features to generate keys needed for the compute service and to start/stop a cluster of machines on Microsoft Azure Confidential Computing. (Visit the documentation for concrete details on how all of this can be achieved, or the quickstart for a hands-on demonstration of the workflow.)

The MC² Compute Services: MC² offers several compute services: these include Spark SQL, distributed XGBoost, and secure aggregation for federated learning. All are intended to run in a primarily untrusted environment, such as a cluster of machines hosted on a public cloud, that has support for trusted execution environments (hardware enclaves). Data is encrypted in transit using a client key and only ever decrypted inside hardware enclaves, providing the previously mentioned security guarantees for data-in-use. For all compute services, MC² leverages the Open Enclave SDK, a project intended to provide a consistent API for a variety of different enclave architectures.

The following is how everything fits together:

Research Prototypes

MC² also includes exploratory research prototypes (not integrated with the MC² Client) enabling secure computation with novel cryptographic techniques. These works were published at USENIX Security, a top security conference.

Cerebro: A general purpose Python DSL for learning with secure multiparty computation.
Delphi: Secure inference for deep neural networks.

Conclusion

MC² is a platform for running secure analytics on data that stays encrypted even when in use. By doing so, the project also enables secure collaboration among multiple organizations, where individual data owners can use our platform to jointly analyze their collective data without revealing it to one another. To learn more and to see the individual projects’ documentation, visit our landing page.

For any questions or comments, you can reach out to me at chester@opaque.co or message us on Slack. Please feel free to also drop a comment in the comment section below. We would really appreciate any comments or feedback!

Development of the MC² Project is actively maintained by Opaque Systems. To learn more about how Opaque can help you take advantage of confidential computing, visit our website at opaque.co.

We would also love your contributions! Visit our GitHub page to see all the projects under the MC² umbrella.

This article was originally posted on Medium
