rayQu

Posted on Jun 13

Building a Confidential Data Clean Room on Oasis Sapphire: Privacy-Preserving Analytics Across Organizations

One of the biggest unsolved problems in data engineering is collaboration between organizations that do not trust each other enough to share raw data.

Consider the following scenarios:

Two crypto exchanges want to identify overlapping fraudulent wallets.
Multiple banks want to detect coordinated money laundering activity.
Healthcare providers want to compare patient cohorts for research.
Marketing companies want to measure customer overlap.
Cybersecurity firms want to identify shared threat actors.

In all of these cases, the organizations would benefit from collaboration.

The problem is that they cannot expose their underlying datasets.

Traditional solutions usually involve:

Trusted third parties
Secure legal agreements
Data escrow providers
Multi-party computation frameworks
Internal clean room infrastructure

These solutions are expensive, operationally complex, and often difficult to audit.

This is where Oasis Sapphire, once again (:p), becomes interesting.

Instead of sharing raw data, organizations can:

Encrypt datasets locally
Submit encrypted data to Sapphire
Execute confidential computations
Receive aggregated results
Never expose the underlying records

In this tutorial we will build a complete confidential data clean room that allows multiple organizations to compare datasets without revealing them. I will also answer any comments or DMs as soon as possible!

We will cover:

Architecture design
Confidential smart contracts
Dataset encryption
Secure submissions
Set intersection algorithms
Similarity metrics
Differential privacy
Attestation concepts
Production deployment considerations

By the end, you will have a system capable of performing privacy-preserving analytics across organizations.

The Problem

Suppose Exchange A maintains a list of suspicious wallets:

0x111...
0x222...
0x333...
0x444...

Exchange B maintains its own list:

0x333...
0x444...
0x555...
0x666...

Both exchanges want to know:

How many suspicious wallets do we have in common?

A traditional approach would require both exchanges to reveal their full datasets.

This creates multiple risks:

Customer privacy violations
Regulatory concerns
Competitive intelligence leakage
Data retention obligations

The clean room model allows the question to be answered without exposing the datasets themselves.

The final output becomes:

{
  "overlap": 2
}

Nothing else is revealed.
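
As a plaintext reference for what the clean room should output, the overlap of the two sample lists can be computed directly. This is only a sketch for checking results outside the confidential environment: the function name `countOverlap` is hypothetical, and the truncated strings are opaque placeholders rather than real addresses.

```typescript
// Reference (non-confidential) overlap count, useful as a test oracle.
// The truncated strings are placeholders, not real wallet addresses.
const exchangeA: string[] = ["0x111...", "0x222...", "0x333...", "0x444..."];
const exchangeB: string[] = ["0x333...", "0x444...", "0x555...", "0x666..."];

// Counts the distinct entries that appear in both lists.
function countOverlap(a: string[], b: string[]): number {
  const inA = new Set(a);
  let overlap = 0;
  new Set(b).forEach((entry) => {
    if (inA.has(entry)) overlap++;
  });
  return overlap;
}

console.log(JSON.stringify({ overlap: countOverlap(exchangeA, exchangeB) }));
// prints {"overlap":2}
```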

High-Level Architecture

      Exchange A                  Exchange B
          |                           |
   Encrypt Dataset             Encrypt Dataset
          |                           |
          v                           v
      +----------------------------------+
      |          Oasis Sapphire          |
      |   Confidential Data Clean Room   |
      +----------------------------------+
                       |
                       v
               Aggregated Result

                  overlap = 2

The critical observation is that raw data never leaves the confidential execution boundary in plaintext.

Project Structure

sapphire-clean-room/
├── contracts/
│   ├── CleanRoom.sol
│   └── SimilarityEngine.sol
├── scripts/
│   └── deploy.ts
├── worker/
│   ├── encryptDataset.ts
│   └── submitDataset.ts
├── api/
│   └── server.ts
├── datasets/
│   ├── exchangeA.csv
│   └── exchangeB.csv
├── test/
│   └── cleanRoom.test.ts
└── hardhat.config.ts

Understanding Data Clean Rooms

A data clean room is a controlled environment where:

Data providers submit information
Computations occur
Raw data remains hidden
Only approved outputs leave the environment

Traditionally, cloud providers offer clean room services.

The challenge is trust.

Organizations must trust:

Infrastructure operators
Database administrators
Cloud providers

Sapphire introduces a different trust model.

On Sapphire, contracts execute inside Trusted Execution Environments (TEEs), and contract state is stored encrypted.

This means the computation is isolated from the host system itself: the operator running the node cannot inspect the data while it is being processed.

Step 1: Dataset Registration Contract

We begin by allowing organizations to register encrypted datasets.

pragma solidity ^0.8.20;

contract CleanRoom {

    struct Dataset {
        bytes encryptedData;
        address owner;
        uint256 submittedAt;
        bool active;
    }

    uint256 public datasetCount;

    mapping(uint256 => Dataset) public datasets;

    event DatasetSubmitted(
        uint256 indexed datasetId,
        address indexed owner
    );

    function submitDataset(
        bytes calldata encryptedData
    ) external {

        datasetCount++;

        datasets[datasetCount] = Dataset({
            encryptedData: encryptedData,
            owner: msg.sender,
            submittedAt: block.timestamp,
            active: true
        });

        emit DatasetSubmitted(
            datasetCount,
            msg.sender
        );
    }
}

This contract stores encrypted payloads only.
The blockchain never sees plaintext records.

Why Encryption Alone Is Not Enough

A common misconception is:

"Just encrypt the data and store it on-chain."

Encryption protects storage.

It does not solve computation.

Eventually somebody must decrypt and process the data.

The real challenge is:

How do we compute on sensitive information without exposing it?

That is where Sapphire enters the picture.

Step 2: Confidential Set Intersection

The most common clean room operation is set intersection.

We want:

A ∩ B without revealing A or B.

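Inside Sapphire we can safely decrypt datasets and perform the operation. The function below works on the already-decrypted address lists.

pragma solidity ^0.8.20;

contract SimilarityEngine {

    // Counts the addresses that appear in both lists.
    // Assumes each list has been de-duplicated beforehand.
    function computeIntersection(
        address[] memory a,
        address[] memory b
    )
        public
        pure
        returns (uint256)
    {
        uint256 overlap;

        for (uint256 i = 0; i < a.length; i++) {

            for (uint256 j = 0; j < b.length; j++) {

                if (a[i] == b[j]) {
                    overlap++;
                    // a[i] is matched; stop scanning b for it.
                    break;
                }
            }
        }

        return overlap;
    }
}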

This implementation is intentionally simple. We will optimize it later.

Why This Naive Approach Does Not Scale

Suppose:

Dataset A = 100,000 entries
Dataset B = 100,000 entries

Complexity becomes:

100,000 × 100,000 = 10 billion comparisons

This is not practical: the nested loops are O(n × m), and ten billion comparisons would far exceed any block gas limit. We need better data structures.
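
To make the gap concrete, the following sketch compares the number of basic operations for the nested-loop approach with a lookup-table approach that indexes one dataset and probes it with the other. Both function names are hypothetical, and the counts ignore constant factors such as the cost of hashing.

```typescript
// Nested loops compare every entry of A against every entry of B.
function nestedLoopComparisons(sizeA: number, sizeB: number): number {
  return sizeA * sizeB;
}

// A lookup table needs one insert per entry of A and one probe per entry of B.
function lookupTableOperations(sizeA: number, sizeB: number): number {
  return sizeA + sizeB;
}

const datasetSize = 100000;
console.log(nestedLoopComparisons(datasetSize, datasetSize)); // 10000000000
console.log(lookupTableOperations(datasetSize, datasetSize)); // 200000
```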

Step 3: Hash-Based Intersection

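Instead of nested loops we build a lookup table from one dataset and probe it with the other. Solidity has no in-memory mapping type, so the table lives in contract storage.

// Entries are namespaced by a per-call round number, so stale entries
// from earlier calls are ignored and never need to be deleted.
mapping(uint256 => mapping(bytes32 => bool)) private seen;
uint256 private round;

// Writes to storage, so this function can no longer be pure.
function computeIntersectionFast(
    bytes32[] memory a,
    bytes32[] memory b
)
    public
    returns (uint256)
{
    uint256 overlap;
    uint256 current = ++round;

    // Build the lookup table from a: one write per entry.
    for (uint256 i = 0; i < a.length; i++) {
        seen[current][a[i]] = true;
    }

    // Probe the table with b: one read per entry.
    for (uint256 j = 0; j < b.length; j++) {
        if (seen[current][b[j]]) {
            overlap++;
            // Clear the entry so duplicates in b are not counted twice.
            seen[current][b[j]] = false;
        }
    }

    return overlap;
}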

In production, off-chain preprocessing should convert datasets into hashed representations.
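
A minimal sketch of that preprocessing step, assuming Node.js and its built-in crypto module. The function name `hashRecords` and the `sharedKey` parameter are hypothetical. The keyed hash (HMAC-SHA-256) is a choice made for this sketch rather than something the contract requires: it assumes all participants agree on the key out of band, so that equal records produce equal hashes.

```typescript
import { createHmac } from "crypto";

// Normalizes, de-duplicates, and hashes records before submission.
// Assumption: every participant uses the same sharedKey.
function hashRecords(records: string[], sharedKey: Buffer): string[] {
  const seen = new Set<string>();
  const hashes: string[] = [];

  for (const record of records) {
    // Case and surrounding whitespace must not affect matching.
    const normalized = record.trim().toLowerCase();
    if (seen.has(normalized)) continue;
    seen.add(normalized);

    const digest = createHmac("sha256", sharedKey)
      .update(normalized, "utf8")
      .digest("hex");

    // 32 bytes, so each value fits a Solidity bytes32.
    hashes.push("0x" + digest);
  }

  return hashes;
}

const demoHashes = hashRecords(["0xABC", " 0xabc ", "0xdef"], Buffer.from("demo-key"));
console.log(demoHashes.length); // 2
```

The first two inputs differ only in case and whitespace, so they collapse into a single hash.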

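Benefits include:

Fixed-size payloads (32 bytes per record, whatever the original record length)
Better privacy, provided the hash is keyed or salted (plain hashes of guessable identifiers such as wallet addresses can be reversed by dictionary attack)
Faster matching

Similarity Metrics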

Sometimes overlap counts are insufficient.

Organizations may want similarity scores.

A common metric is Jaccard Similarity.

The formula is:

Jaccard Similarity = |A ∩ B| / |A ∪ B| = Number of Shared Records / Total Unique Records

Interpretation:

0.0 = completely different
1.0 = identical datasets
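
As a worked example, the sketch below computes the score for the two sample wallet lists, using |A ∪ B| = |A| + |B| - |A ∩ B| to get the union size without building the union. The function name `jaccard` is hypothetical, and returning 0 for two empty sets is a convention chosen for this sketch.

```typescript
// Jaccard similarity of two collections, treated as sets.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);

  let intersection = 0;
  setB.forEach((entry) => {
    if (setA.has(entry)) intersection++;
  });

  // |A ∪ B| = |A| + |B| - |A ∩ B|
  const union = setA.size + setB.size - intersection;

  // Convention for this sketch: two empty sets have similarity 0.
  return union === 0 ? 0 : intersection / union;
}

const listA = ["0x111...", "0x222...", "0x333...", "0x444..."];
const listB = ["0x333...", "0x444...", "0x555...", "0x666..."];

// 2 shared records out of 6 unique records.
console.log(jaccard(listA, listB).toFixed(3)); // 0.333
```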

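Implementing Jaccard Similarity

function computeJaccard(
    uint256 intersection,
    uint256 unionSize
)
    public
    pure
    returns (uint256)
{
    // An empty union would otherwise revert with a division-by-zero panic.
    require(unionSize > 0, "empty union");

    // The intersection can never be larger than the union.
    require(intersection <= unionSize, "invalid sizes");

    // Fixed-point result: 1e18 represents a similarity of 1.0.
    return (intersection * 1e18) / unionSize;
}

Solidity has no floating-point type, so the result is returned as a fixed-point value scaled by 1e18: 1e18 means 1.0 and 5e17 means 0.5.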

Step 4: Local Dataset Encryption

Before submission, organizations encrypt locally.

Example Node.js utility:

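import crypto from "crypto";

const algorithm = "aes-256-gcm";

// Encrypts a serialized dataset with AES-256-GCM.
// key must be exactly 32 bytes.
export function encryptDataset(
  plaintext: string,
  key: Buffer
) {

  // GCM is designed around a 96-bit (12-byte) IV.
  // Never reuse an IV with the same key.
  const iv = crypto.randomBytes(12);

  const cipher =
    crypto.createCipheriv(
      algorithm,
      key,
      iv
    );

  let encrypted =
    cipher.update(
      plaintext,
      "utf8",
      "hex"
    );

  encrypted += cipher.final("hex");

  // The authentication tag lets the recipient detect tampering.
  const tag =
    cipher.getAuthTag();

  return {
    iv: iv.toString("hex"),
    encrypted,
    tag: tag.toString("hex")
  };
}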

Encryption occurs before data leaves the organization.
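
For local testing of the payload format, a matching decryption helper is useful. The sketch below is self-contained, so it repeats a compact version of the encryption step with a 12-byte IV. `decryptDataset` and the `EncryptedDataset` interface are hypothetical names, not part of the project files listed earlier.

```typescript
import { randomBytes, createCipheriv, createDecipheriv } from "crypto";

interface EncryptedDataset {
  iv: string;        // hex
  encrypted: string; // hex
  tag: string;       // hex
}

// Compact version of the encryption utility: AES-256-GCM, hex-encoded fields.
function encryptDataset(plaintext: string, key: Buffer): EncryptedDataset {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const encrypted = cipher.update(plaintext, "utf8", "hex") + cipher.final("hex");
  return {
    iv: iv.toString("hex"),
    encrypted,
    tag: cipher.getAuthTag().toString("hex"),
  };
}

// Hypothetical counterpart: reverses encryptDataset and throws if the
// ciphertext or the authentication tag has been modified.
function decryptDataset(payload: EncryptedDataset, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(payload.iv, "hex"));
  decipher.setAuthTag(Buffer.from(payload.tag, "hex"));
  return decipher.update(payload.encrypted, "hex", "utf8") + decipher.final("utf8");
}

const demoKey = randomBytes(32);
const roundTrip = decryptDataset(encryptDataset("0x111...,0x222...", demoKey), demoKey);
console.log(roundTrip === "0x111...,0x222..."); // true
```

Because the IV is read from the payload, the helper does not depend on a particular IV length.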

Step 5: Submission Worker

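import { ethers } from "ethers";
import { encryptDataset } from "./encryptDataset";

// CONTRACT_ADDRESS, ABI, and signer come from your own deployment
// and wallet configuration.
// dataset is the serialized plaintext; key is the 32-byte AES key.
async function submitDataset(
  dataset: string,
  key: Buffer
) {

  const contract =
    new ethers.Contract(
      CONTRACT_ADDRESS,
      ABI,
      signer
    );

  // encryptDataset is synchronous, so no await is needed.
  const encrypted =
    encryptDataset(
      dataset,
      key
    );

  const tx =
    await contract.submitDataset(
      ethers.toUtf8Bytes(
        JSON.stringify(encrypted)
      )
    );

  // Wait until the transaction is included in a block.
  await tx.wait();
}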

This worker becomes the ingestion pipeline.
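
The worker submits the JSON form, in which every byte is hex-encoded and therefore takes two characters on-chain. One possible alternative, sketched here with the hypothetical helpers `packEnvelope` and `unpackEnvelope`, is a binary layout. The layout itself is an assumption made for this example, and the code that later decrypts the payload would have to agree on it.

```typescript
interface EncryptedDataset {
  iv: string;        // hex
  encrypted: string; // hex
  tag: string;       // hex, 16 bytes for AES-GCM
}

const TAG_BYTES = 16;

// Assumed layout: [1 byte IV length][IV][16-byte tag][ciphertext]
function packEnvelope(payload: EncryptedDataset): Buffer {
  const iv = Buffer.from(payload.iv, "hex");
  const tag = Buffer.from(payload.tag, "hex");
  if (iv.length > 255) throw new Error("IV too long");
  if (tag.length !== TAG_BYTES) throw new Error("unexpected tag length");
  return Buffer.concat([
    Buffer.from([iv.length]),
    iv,
    tag,
    Buffer.from(payload.encrypted, "hex"),
  ]);
}

function unpackEnvelope(envelope: Buffer): EncryptedDataset {
  if (envelope.length < 1 + TAG_BYTES) throw new Error("envelope too short");
  const ivEnd = 1 + envelope[0];
  const tagEnd = ivEnd + TAG_BYTES;
  if (envelope.length < tagEnd) throw new Error("envelope too short");
  return {
    iv: envelope.subarray(1, ivEnd).toString("hex"),
    tag: envelope.subarray(ivEnd, tagEnd).toString("hex"),
    encrypted: envelope.subarray(tagEnd).toString("hex"),
  };
}

const sample: EncryptedDataset = {
  iv: "00112233445566778899aabb",
  tag: "ffeeddccbbaa99887766554433221100",
  encrypted: "deadbeef",
};
console.log(packEnvelope(sample).length); // 33
```

The 33 bytes are 1 length byte, 12 IV bytes, 16 tag bytes, and 4 ciphertext bytes. A Buffer is a Uint8Array, so the packed value could be passed to submitDataset in place of the UTF-8 JSON bytes.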

Differential Privacy Layer

Even aggregate results can leak information.

Example:

Dataset overlap: 1245

Repeated queries could allow reconstruction attacks.

To mitigate this, we can introduce controlled noise.

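function addNoise(
    uint256 value,
    int256 noise
)
    internal
    pure
    returns (uint256)
{
    // Noise is signed so the reported value can move in either direction.
    int256 noisy = int256(value) + noise;

    // A count cannot be negative, so clamp at zero.
    return noisy > 0 ? uint256(noisy) : 0;
}

Result:

Actual overlap: 1245
Reported overlap: 1242

Privacy improves while preserving utility, provided the noise is random, drawn from a calibrated distribution (the Laplace mechanism is a common choice), and the number of queries is limited so that the noise cannot be averaged away.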
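
How the noise is chosen matters as much as adding it. The sketch below shows one common choice, the Laplace mechanism, for a counting query. The names `laplaceNoise` and `noisyCount` and the `epsilon` parameter are assumptions for illustration. Math.random is used only as a default for demonstration and is not suitable for real privacy guarantees.

```typescript
// Samples Laplace(0, scale) noise via the inverse CDF.
// uniform must return a value in [0, 1); it is injectable so that
// tests can supply fixed values.
function laplaceNoise(scale: number, uniform: () => number = Math.random): number {
  // Keep p strictly inside (0, 1) so the logarithm stays finite.
  const p = Math.min(Math.max(uniform(), 1e-12), 1 - 1e-12);
  const u = p - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Adds rounded Laplace noise to a count and clamps at zero.
// Adding or removing one record changes an overlap count by at most 1,
// so the sensitivity is 1 and the scale is 1 / epsilon.
function noisyCount(
  trueCount: number,
  epsilon: number,
  uniform: () => number = Math.random
): number {
  const noise = Math.round(laplaceNoise(1 / epsilon, uniform));
  return Math.max(0, trueCount + noise);
}

// With a fixed uniform value of 0.75 the noise is ln(2), about 0.69, which rounds to 1.
console.log(noisyCount(1245, 1, () => 0.75)); // 1246
```

A smaller epsilon gives more noise and stronger privacy. In the clean room itself the noise would need to be drawn inside the confidential contract, otherwise whoever supplies it can simply subtract it again.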

Multi-Party Data Clean Rooms

The real power appears when more than two organizations participate.

Exchange A
Exchange B
Exchange C
Exchange D

        |
        v

   Sapphire Clean Room

        |
        v

  Aggregated Analysis

Examples:

Shared fraud detection
Shared sanctions monitoring
Shared AML screening
Shared cybersecurity intelligence

Production Security Considerations

Dataset Poisoning

An attacker may submit fabricated data.

Mitigations:

Staking requirements
Dataset signatures
Reputation systems

Replay Attacks

An attacker may repeatedly submit old datasets.

Mitigations:

mapping(bytes32 => bool) public processedHashes;

Store the hash of every accepted payload and reject any submission whose hash has already been seen.
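
The duplicate check can be modeled off-chain for testing. This is a sketch with a hypothetical `ReplayGuard` class. It uses SHA-256 from Node's crypto module as a stand-in for the hash a contract would compute, which would typically be keccak256.

```typescript
import { createHash } from "crypto";

// In-memory model of the processedHashes mapping.
class ReplayGuard {
  private processedHashes = new Set<string>();

  // Returns true if the payload is new and was recorded,
  // false if the exact same payload was seen before.
  accept(encryptedPayload: Buffer): boolean {
    const hash = createHash("sha256").update(encryptedPayload).digest("hex");
    if (this.processedHashes.has(hash)) return false;
    this.processedHashes.add(hash);
    return true;
  }
}

const guard = new ReplayGuard();
console.log(guard.accept(Buffer.from("payload-1"))); // true
console.log(guard.accept(Buffer.from("payload-1"))); // false
```

This only catches byte-identical resubmissions. Encrypting the same dataset again with a fresh random IV produces a different ciphertext and therefore a different hash.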

Query Abuse

Attackers may repeatedly request computations.

Mitigations:

Rate limiting
Access control
Usage quotas

Membership Inference Attacks

An attacker may attempt to infer whether a specific record exists.
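
The simplest form is a differencing attack: two exact queries that differ only in the target record. The sketch below uses a hypothetical `overlapCount` function as a stand-in for a clean room that returns unprotected counts, with placeholder strings instead of real addresses.

```typescript
// Stand-in for a clean room that reveals only the exact overlap count.
function overlapCount(query: string[], hiddenDataset: Set<string>): number {
  let count = 0;
  new Set(query).forEach((entry) => {
    if (hiddenDataset.has(entry)) count++;
  });
  return count;
}

// The attacker cannot see this set, only the counts returned for queries.
const hidden = new Set(["0x333...", "0x444...", "0x555..."]);

const baseline = ["0x111...", "0x333..."];
const target = "0x555...";

// Two queries that differ only by the target record.
const withoutTarget = overlapCount(baseline, hidden);
const withTarget = overlapCount(baseline.concat(target), hidden);

// A difference of 1 reveals that the target is in the hidden dataset.
console.log(withTarget - withoutTarget); // 1
```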

Mitigations:

Differential privacy
Minimum cohort sizes
Query restrictions

Future Extensions

This architecture can evolve into:

Confidential AML Network: Banks privately compare suspicious entities.
Cross-Exchange Fraud Detection: Exchanges identify shared attackers.
Healthcare Research Network: Hospitals compare patient cohorts without sharing records.
Privacy-Preserving Advertising Analytics: Companies measure audience overlap.
Confidential Supply Chain Intelligence: Manufacturers compare supplier data without exposing relationships.

Conclusion

Most blockchain applications focus on moving value.

Far fewer explore moving information safely.

Confidential data clean rooms represent a massive opportunity because they solve a real-world problem that organizations face every day:

How can we collaborate on data without exposing the data itself?

Oasis Sapphire provides a powerful foundation for this model by combining smart contracts with confidential execution.

Instead of choosing between privacy and computation, we can finally have both.

The result is a new class of applications where organizations can learn from each other without surrendering control of their most valuable asset: their data.