DEV Community

rayQu

Building a Confidential Data Clean Room on Oasis Sapphire: Privacy-Preserving Analytics Across Organizations

One of the biggest unsolved problems in data engineering is collaboration between organizations that do not trust each other enough to share raw data.

Consider the following scenarios:

  • Two crypto exchanges want to identify overlapping fraudulent wallets.
  • Multiple banks want to detect coordinated money laundering activity.
  • Healthcare providers want to compare patient cohorts for research.
  • Marketing companies want to measure customer overlap.
  • Cybersecurity firms want to identify shared threat actors.

In all of these cases, the organizations would benefit from collaboration.

The problem is that they cannot expose their underlying datasets.

Traditional solutions usually involve:

  • Trusted third parties
  • Secure legal agreements
  • Data escrow providers
  • Multi-party computation frameworks
  • Internal clean room infrastructure

These solutions are expensive, operationally complex, and often difficult to audit.

This is where Oasis Sapphire, once again (:p), becomes interesting.

Instead of sharing raw data, organizations can:

  1. Encrypt datasets locally
  2. Submit encrypted data to Sapphire
  3. Execute confidential computations
  4. Receive aggregated results
  5. Never expose the underlying records

In this tutorial, we will build a confidential data clean room that allows multiple organizations to compare datasets without revealing them. I will also answer any comments or DMs as soon as possible!

We will cover:

  1. Architecture design
  2. Confidential smart contracts
  3. Dataset encryption
  4. Secure submissions
  5. Set intersection algorithms
  6. Similarity metrics
  7. Differential privacy
  8. Attestation concepts
  9. Production deployment considerations

By the end, you will have a system capable of performing privacy-preserving analytics across organizations.

The Problem

Suppose Exchange A maintains a list of suspicious wallets:

0x111...
0x222...
0x333...
0x444...

Exchange B maintains its own list:

0x333...
0x444...
0x555...
0x666...

Both exchanges want to know:

How many suspicious wallets do we have in common?

A traditional approach would require both exchanges to reveal their full datasets.

This creates multiple risks:

  • Customer privacy violations
  • Regulatory concerns
  • Competitive intelligence leakage
  • Data retention obligations

The clean room model allows the question to be answered without exposing the datasets themselves.

The final output becomes:

{
  "overlap": 2
}


Nothing else is revealed.
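To make the target concrete, here is a plaintext reference for the overlap query, i.e., the computation the clean room is meant to run on the participants' behalf. It is only a sketch: `countOverlap` is a hypothetical helper, and the shortened identifiers stand in for the truncated addresses above. In the real system this logic would run inside the confidential environment, never on either exchange's machine.

```typescript
// Hypothetical helper: counts how many distinct entries two lists share.
function countOverlap(a: string[], b: string[]): number {
  // Normalize case so "0xABC" and "0xabc" count as the same wallet.
  const seen = new Set(a.map((w) => w.toLowerCase()));
  const matched = new Set<string>();
  for (const w of b) {
    const key = w.toLowerCase();
    if (seen.has(key)) matched.add(key);
  }
  return matched.size;
}

// Placeholder identifiers standing in for the truncated addresses above.
const exchangeA = ["0x111", "0x222", "0x333", "0x444"];
const exchangeB = ["0x333", "0x444", "0x555", "0x666"];

console.log(JSON.stringify({ overlap: countOverlap(exchangeA, exchangeB) }));
// prints: {"overlap":2}
```

Only the final count leaves the function, which mirrors the clean room rule that approved aggregates are the sole output.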

High-Level Architecture
                Exchange A
                     |
                     |
            Encrypt Dataset
                     |
                     v

          +------------------+
          | Oasis Sapphire   |
          | Confidential     |
          | Data Clean Room  |
          +------------------+

                     ^
                     |
            Encrypt Dataset
                     |
                     |

                Exchange B

                     |
                     v

              Aggregated Result

               overlap = 2

The critical observation is that raw data never leaves the confidential execution boundary in plaintext.

Project Structure
sapphire-clean-room/

├── contracts/
│   ├── CleanRoom.sol
│   ├── SimilarityEngine.sol
│
├── scripts/
│   ├── deploy.ts
│
├── worker/
│   ├── encryptDataset.ts
│   ├── submitDataset.ts
│
├── api/
│   ├── server.ts
│
├── datasets/
│   ├── exchangeA.csv
│   ├── exchangeB.csv
│
├── test/
│   ├── cleanRoom.test.ts
│
└── hardhat.config.ts

Understanding Data Clean Rooms

A data clean room is a controlled environment where:

  • Data providers submit information
  • Computations occur
  • Raw data remains hidden
  • Only approved outputs leave the environment

Traditionally, cloud providers offer clean room services.

The challenge is trust.

Organizations must trust:

  1. Infrastructure operators
  2. Database administrators
  3. Cloud providers

Sapphire introduces a different trust model.

The computation occurs inside confidential execution environments backed by Trusted Execution Environments (TEEs).

This means computation is isolated from the host system itself.

Step 1: Dataset Registration Contract

We begin by allowing organizations to register encrypted datasets.

pragma solidity ^0.8.20;

contract CleanRoom {

    struct Dataset {
        bytes encryptedData;
        address owner;
        uint256 submittedAt;
        bool active;
    }

    uint256 public datasetCount;

    mapping(uint256 => Dataset) public datasets;

    event DatasetSubmitted(
        uint256 indexed datasetId,
        address indexed owner
    );

    function submitDataset(
        bytes calldata encryptedData
    ) external {

        datasetCount++;

        datasets[datasetCount] = Dataset({
            encryptedData: encryptedData,
            owner: msg.sender,
            submittedAt: block.timestamp,
            active: true
        });

        emit DatasetSubmitted(
            datasetCount,
            msg.sender
        );
    }
}

  • This contract stores encrypted payloads only.
  • The blockchain never sees plaintext records.

Why Encryption Alone Is Not Enough

A common misconception is:

"Just encrypt the data and store it on-chain."

Encryption protects storage.

It does not solve computation.

Eventually somebody must decrypt and process the data.

The real challenge is:

How do we compute on sensitive information without exposing it?

That is where Sapphire enters the picture.

Step 2: Confidential Set Intersection

The most common clean room operation is set intersection.

We want:

A ∩ B without revealing A or B.

Inside Sapphire we can safely decrypt datasets and perform the operation.

pragma solidity ^0.8.20;

contract SimilarityEngine {

    function computeIntersection(
        address[] memory a,
        address[] memory b
    )
        public
        pure
        returns (uint256)
    {
        uint256 overlap;

        for (uint256 i = 0; i < a.length; i++) {

            for (uint256 j = 0; j < b.length; j++) {

                if (a[i] == b[j]) {
                    overlap++;
                }
            }
        }

        return overlap;
    }
}

This implementation is intentionally simple. We will optimize it later.

Why This Naive Approach Does Not Scale

Suppose:

Dataset A = 100,000 entries
Dataset B = 100,000 entries

Complexity becomes:

100,000 × 100,000 = 10 billion comparisons

This is not practical. We need better data structures.
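The arithmetic above can be checked with a short sketch that compares the nested-loop cost with the cost of a lookup-based approach (one pass to index the first dataset, one pass to probe it with the second). The function names are hypothetical, and the counts are idealized operation counts, not measured gas or runtime.

```typescript
// Every element of A is compared with every element of B.
function nestedLoopComparisons(n: number, m: number): number {
  return n * m;
}

// Index A once (n insertions), then probe it once per element of B (m lookups).
function lookupOperations(n: number, m: number): number {
  return n + m;
}

const n = 100_000;
const m = 100_000;

console.log(nestedLoopComparisons(n, m)); // prints: 10000000000
console.log(lookupOperations(n, m)); // prints: 200000
```

The gap grows with dataset size, which is why the data structure matters more than micro-optimizing the loop body.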

Step 3: Hash-Based Intersection

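Instead of nested loops over raw addresses, we compare fixed-size hashes. Solidity has no in-memory hash map (mappings exist only in storage), so a lookup table cannot be built inside a pure function. One practical alternative is to hash, deduplicate, and sort each dataset off-chain, then walk both sorted arrays with two pointers. This needs at most a.length + b.length steps instead of a.length × b.length.

// Assumes a and b are sorted in ascending order and contain no duplicates
// (prepared off-chain). Runs in O(a.length + b.length).
function computeIntersectionFast(
    bytes32[] memory a,
    bytes32[] memory b
)
    public
    pure
    returns (uint256)
{
    uint256 overlap;
    uint256 i;
    uint256 j;

    while (i < a.length && j < b.length) {

        if (a[i] == b[j]) {
            // Match: count it and advance both pointers.
            overlap++;
            i++;
            j++;
        } else if (a[i] < b[j]) {
            // a[i] cannot appear later in b, so skip it.
            i++;
        } else {
            j++;
        }
    }

    return overlap;
}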


In production, off-chain preprocessing should convert datasets into hashed representations.

Benefits include:

  • Smaller payloads
  • Better privacy
  • Faster matching
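The preprocessing step described above could look like the following sketch. It assumes SHA-256 from Node's built-in crypto module and a hypothetical helper name, `hashDataset`; the article does not prescribe a hash function. Note that unsalted hashes of guessable identifiers such as wallet addresses can be reversed by enumeration, so hashing is a size and format normalization here, not a substitute for the confidential execution boundary.

```typescript
import { createHash } from "node:crypto";

// Hypothetical preprocessing step: normalize, hash, deduplicate, and sort.
function hashDataset(records: string[]): string[] {
  const hashes = new Set<string>();
  for (const record of records) {
    const normalized = record.trim().toLowerCase();
    if (normalized.length === 0) continue; // skip blank CSV lines
    const digest = createHash("sha256").update(normalized, "utf8").digest("hex");
    hashes.add("0x" + digest); // 32 bytes, fits a Solidity bytes32
  }
  // Fixed-length lowercase hex sorts in the same order as the underlying bytes.
  return Array.from(hashes).sort();
}

const hashed = hashDataset(["0xAbC", "0xabc ", "", "0xdef"]);
console.log(hashed.length); // prints: 2
```

Sorting is optional for a lookup-based matcher, but it lets the matching side use a single linear pass over both lists.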

Similarity Metrics

Sometimes overlap counts are insufficient.

Organizations may want similarity scores.

A common metric is Jaccard Similarity.

The formula is:

Jaccard Similarity = |A ∩ B| / |A ∪ B| = Number of Shared Records / Total Unique Records

Interpretation:

0.0 = completely different
1.0 = identical datasets

Implementing Jaccard Similarity
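function computeJaccard(
    uint256 intersection,
    uint256 unionSize
)
    public
    pure
    returns (uint256)
{
    // Two empty datasets have no defined similarity; avoid dividing by zero.
    require(unionSize > 0, "empty union");

    // Result is scaled by 1e18, so 1e18 means identical datasets.
    return (intersection * 1e18) / unionSize;
}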

Returning scaled values avoids floating-point arithmetic.
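As a worked example of the scaled result, the sketch below derives the union size from the two dataset sizes and the intersection (inclusion-exclusion), then applies the same 1e18 scaling off-chain. `jaccardScaled` and `formatScaled` are hypothetical helper names; the numbers come from the two four-entry lists in "The Problem", which share two entries.

```typescript
const SCALE = BigInt("1000000000000000000"); // 1e18, matching the contract

function jaccardScaled(sizeA: number, sizeB: number, intersection: number): bigint {
  // Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|.
  const unionSize = sizeA + sizeB - intersection;
  if (unionSize <= 0) throw new Error("union must be non-empty");
  // Integer division truncates, exactly as it does in Solidity.
  return (BigInt(intersection) * SCALE) / BigInt(unionSize);
}

// Converts the scaled integer back to a decimal string for display only.
function formatScaled(value: bigint, decimals = 4): string {
  return (Number(value) / 1e18).toFixed(decimals);
}

const score = jaccardScaled(4, 4, 2);
console.log(score.toString(), formatScaled(score));
// prints: 333333333333333333 0.3333
```

A client reading the contract's return value would divide by 1e18 in the same way before showing a similarity score.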

Step 4: Local Dataset Encryption

Before submission, organizations encrypt locally.

Example Node.js utility:

import crypto from "crypto";

const algorithm = "aes-256-gcm";

export function encryptDataset(
  plaintext: string,
  key: Buffer
) {

  // 96-bit IV, the size recommended for GCM. Never reuse an IV with the same key.
  const iv = crypto.randomBytes(12);

  const cipher =
    crypto.createCipheriv(
      algorithm,
      key,
      iv
    );

  let encrypted =
    cipher.update(
      plaintext,
      "utf8",
      "hex"
    );

  encrypted += cipher.final("hex");

  const tag =
    cipher.getAuthTag();

  return {
    iv: iv.toString("hex"),
    encrypted,
    tag: tag.toString("hex")
  };
}

Encryption occurs before data leaves the organization.
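For completeness, here is a sketch of the matching decryption step, using the same `{ iv, encrypted, tag }` hex payload shape as the utility above. The `encrypt` and `decrypt` names are hypothetical, and how the key reaches the party that decrypts is outside the scope of this sketch. The point it illustrates is that AES-GCM is authenticated: decryption fails loudly if the ciphertext or tag has been altered.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface EncryptedPayload {
  iv: string;
  encrypted: string;
  tag: string;
}

function encrypt(plaintext: string, key: Buffer): EncryptedPayload {
  const iv = randomBytes(12); // 96-bit IV, the size recommended for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const encrypted = cipher.update(plaintext, "utf8", "hex") + cipher.final("hex");
  return {
    iv: iv.toString("hex"),
    encrypted,
    tag: cipher.getAuthTag().toString("hex"),
  };
}

function decrypt(payload: EncryptedPayload, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(payload.iv, "hex"));
  decipher.setAuthTag(Buffer.from(payload.tag, "hex"));
  // final() throws if the ciphertext or the tag was modified.
  return decipher.update(payload.encrypted, "hex", "utf8") + decipher.final("utf8");
}

const demoKey = randomBytes(32); // AES-256 needs a 32-byte key
const roundTrip = decrypt(encrypt("0x111\n0x222", demoKey), demoKey);
console.log(roundTrip === "0x111\n0x222"); // prints: true
```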

Step 5: Submission Worker

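import { ethers } from "ethers";
import { encryptDataset } from "./encryptDataset";

// CONTRACT_ADDRESS and ABI are assumed to be defined elsewhere,
// for example loaded from the deployment output.
async function submitDataset(
  dataset: string,
  key: Buffer,
  signer: ethers.Signer
) {

  const contract =
    new ethers.Contract(
      CONTRACT_ADDRESS,
      ABI,
      signer
    );

  // encryptDataset is synchronous, so no await is needed here.
  const encrypted =
    encryptDataset(
      dataset,
      key
    );

  // Serialize { iv, encrypted, tag } to bytes for the contract call.
  const tx =
    await contract.submitDataset(
      ethers.toUtf8Bytes(
        JSON.stringify(encrypted)
      )
    );

  // Wait until the transaction is included in a block.
  await tx.wait();
}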

This worker becomes the ingestion pipeline.

Differential Privacy Layer

Even aggregate results can leak information.

Example:

Dataset overlap: 1245

Repeated queries could allow reconstruction attacks.

To mitigate this, we can introduce controlled noise.

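function addNoise(
    uint256 value,
    int256 noise
)
    internal
    pure
    returns (uint256)
{
    // Noise is signed so the reported value can land above or below the
    // true value. Assumes value fits in int256, which holds for any count.
    int256 noisy = int256(value) + noise;

    // A count cannot be negative, so clamp at zero.
    return noisy < 0 ? 0 : uint256(noisy);
}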

Result:

Actual overlap: 1245
Reported overlap: 1242

Privacy improves while preserving utility.
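The article does not say how the noise value is chosen. One standard choice for counting queries is the Laplace mechanism, sketched below under stated assumptions: `sampleLaplace` and `noisyCount` are hypothetical names, the sensitivity is 1 because adding or removing a single record changes a count by at most 1, and `epsilon` is the privacy parameter (smaller means more noise). `Math.random` is used only for illustration; a real deployment would draw randomness inside the confidential environment.

```typescript
// Draws Laplace(0, scale) noise by inverse transform sampling.
function sampleLaplace(scale: number, rng: () => number = Math.random): number {
  // Keep u strictly inside (-0.5, 0.5) so the logarithm stays finite.
  const u = Math.max(rng(), Number.EPSILON) - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

function noisyCount(
  trueCount: number,
  epsilon: number,
  rng: () => number = Math.random
): number {
  if (epsilon <= 0) throw new Error("epsilon must be positive");
  const sensitivity = 1; // one record changes a count by at most 1
  const noisy = trueCount + sampleLaplace(sensitivity / epsilon, rng);
  // Rounding and clamping are post-processing and do not weaken the guarantee.
  return Math.max(0, Math.round(noisy));
}

console.log(noisyCount(1245, 0.5)); // varies per call, typically within a few records of 1245
```

Each released answer spends some privacy budget, so noise works best together with a cap on how many queries a participant may run.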

Multi-Party Data Clean Rooms

The real power appears when more than two organizations participate.

Exchange A
Exchange B
Exchange C
Exchange D

        |
        v

   Sapphire Clean Room

        |
        v

  Aggregated Analysis


Examples:

  • Shared fraud detection
  • Shared sanctions monitoring
  • Shared AML screening
  • Shared cybersecurity intelligence
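With more than two participants, two aggregate queries come up naturally: the overlap between every pair, and the number of records reported by at least k participants. The sketch below shows both on plaintext for clarity. `pairwiseOverlap` and `reportedByAtLeast` are hypothetical names, and in the clean room this logic would run inside the confidential environment.

```typescript
type Datasets = Record<string, string[]>;

// Overlap count for every unordered pair of participants.
function pairwiseOverlap(datasets: Datasets): Record<string, number> {
  const names = Object.keys(datasets).sort();
  const result: Record<string, number> = {};
  for (let i = 0; i < names.length; i++) {
    const left = new Set(datasets[names[i]]);
    for (let j = i + 1; j < names.length; j++) {
      let shared = 0;
      new Set(datasets[names[j]]).forEach((r) => {
        if (left.has(r)) shared++;
      });
      result[`${names[i]}&${names[j]}`] = shared;
    }
  }
  return result;
}

// Number of distinct records reported by at least `minParties` participants.
function reportedByAtLeast(datasets: Datasets, minParties: number): number {
  const counts = new Map<string, number>();
  for (const name of Object.keys(datasets)) {
    // Deduplicate per participant so one party cannot inflate a count.
    new Set(datasets[name]).forEach((r) => counts.set(r, (counts.get(r) ?? 0) + 1));
  }
  let total = 0;
  counts.forEach((c) => {
    if (c >= minParties) total++;
  });
  return total;
}

const demo: Datasets = { A: ["1", "2", "3"], B: ["2", "3", "4"], C: ["3", "4", "5"] };
console.log(pairwiseOverlap(demo)); // { 'A&B': 2, 'A&C': 1, 'B&C': 2 }
console.log(reportedByAtLeast(demo, 2)); // prints: 3
```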

Production Security Considerations

Dataset Poisoning

An attacker may submit fabricated data.

Mitigations:

  • Staking requirements
  • Dataset signatures
  • Reputation systems
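Of the mitigations above, dataset signatures are the easiest to sketch. The example assumes each organization holds an Ed25519 key pair (an assumption, the article does not name a scheme) and signs the SHA-256 digest of its encrypted payload; `signDataset` and `verifyDataset` are hypothetical names. A signature shows who submitted a payload and that it was not altered. It does not show that the contents are truthful, which is why staking and reputation are listed alongside it.

```typescript
import { createHash, generateKeyPairSync, sign, verify } from "node:crypto";
import type { KeyObject } from "node:crypto";

function digestDataset(ciphertext: Buffer): Buffer {
  return createHash("sha256").update(ciphertext).digest();
}

function signDataset(ciphertext: Buffer, privateKey: KeyObject): Buffer {
  // Ed25519 takes no separate digest algorithm, hence the null argument.
  return sign(null, digestDataset(ciphertext), privateKey);
}

function verifyDataset(ciphertext: Buffer, signature: Buffer, publicKey: KeyObject): boolean {
  return verify(null, digestDataset(ciphertext), publicKey, signature);
}

// Illustrative key pair; a real organization would load a long-lived key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const payloadBytes = Buffer.from("encrypted dataset bytes");
const signature = signDataset(payloadBytes, privateKey);

console.log(verifyDataset(payloadBytes, signature, publicKey)); // prints: true
```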

Replay Attacks

An attacker may repeatedly submit old datasets.

Mitigations:

mapping(bytes32 => bool)
public processedHashes;


Reject duplicates.
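The same check can be mirrored off-chain, for example in an ingestion service that filters submissions before they are sent. This is a sketch with a hypothetical `ReplayGuard` class that plays the role of the `processedHashes` mapping. One caveat: because each encryption uses a fresh random IV, re-encrypting the same plaintext yields different bytes, so hashing the ciphertext only catches byte-identical replays. Detecting a resubmitted plaintext would have to happen after decryption, inside the confidential environment.

```typescript
import { createHash } from "node:crypto";

class ReplayGuard {
  private readonly processed = new Set<string>();

  // Returns true the first time a payload is seen, false on a replay.
  accept(payload: Uint8Array): boolean {
    const hash = createHash("sha256").update(payload).digest("hex");
    if (this.processed.has(hash)) return false;
    this.processed.add(hash);
    return true;
  }
}

const guard = new ReplayGuard();
const submission = new Uint8Array([1, 2, 3]);

console.log(guard.accept(submission)); // prints: true
console.log(guard.accept(submission)); // prints: false
```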

Query Abuse

Attackers may repeatedly request computations.

Mitigations:

  • Rate limiting
  • Access control
  • Usage quotas
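A usage quota can be as simple as a per-caller counter that is checked before any computation runs. The sketch below is illustrative: `QueryQuota` is a hypothetical class, and a contract version would keep the same counter in a mapping keyed by caller address.

```typescript
class QueryQuota {
  private readonly used = new Map<string, number>();
  private readonly maxQueries: number;

  constructor(maxQueries: number) {
    this.maxQueries = maxQueries;
  }

  // Returns true and records the query if the caller still has budget.
  tryConsume(caller: string): boolean {
    const count = this.used.get(caller) ?? 0;
    if (count >= this.maxQueries) return false;
    this.used.set(caller, count + 1);
    return true;
  }

  remaining(caller: string): number {
    return this.maxQueries - (this.used.get(caller) ?? 0);
  }
}

const quota = new QueryQuota(2);
console.log(quota.tryConsume("0xA"), quota.tryConsume("0xA"), quota.tryConsume("0xA"));
// prints: true true false
```

Quotas also bound how much an attacker can average away noise added to results, so they complement the differential privacy layer.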

Membership Inference Attacks

An attacker may attempt to infer whether a specific record exists.

Mitigations:

  • Differential privacy
  • Minimum cohort sizes
  • Query restrictions
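Minimum cohort sizes can be enforced with a small release gate: a result is returned only when both inputs and the overlap itself are large enough that no single record stands out. This is a sketch with hypothetical names (`releaseOverlap`, `minCohort`), and the right threshold depends on the data and the applicable policy.

```typescript
type Release =
  | { released: true; overlap: number }
  | { released: false; reason: string };

function releaseOverlap(
  overlap: number,
  sizeA: number,
  sizeB: number,
  minCohort: number
): Release {
  // Tiny inputs make it easy to test for one specific record.
  if (sizeA < minCohort || sizeB < minCohort) {
    return { released: false, reason: "input dataset below minimum cohort size" };
  }
  // Small overlaps can single out individual records.
  if (overlap < minCohort) {
    return { released: false, reason: "overlap below minimum cohort size" };
  }
  return { released: true, overlap };
}

console.log(releaseOverlap(1245, 5000, 7000, 10)); // { released: true, overlap: 1245 }
console.log(releaseOverlap(3, 5000, 7000, 10).released); // prints: false
```

A refusal still tells the caller that the overlap is below the threshold, so this gate is usually combined with noise and query limits instead of used alone.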

Future Extensions

This architecture can evolve into:

  • Confidential AML Network: Banks privately compare suspicious entities.
  • Cross-Exchange Fraud Detection: Exchanges identify shared attackers.
  • Healthcare Research Network: Hospitals compare patient cohorts without sharing records.
  • Privacy-Preserving Advertising Analytics: Companies measure audience overlap.
  • Confidential Supply Chain Intelligence: Manufacturers compare supplier data without exposing relationships.

Conclusion

Most blockchain applications focus on moving value.

Far fewer explore moving information safely.

Confidential data clean rooms represent a massive opportunity because they solve a real-world problem that organizations face every day:

How can we collaborate on data without exposing the data itself?

Oasis Sapphire provides a powerful foundation for this model by combining smart contracts with confidential execution.

Instead of choosing between privacy and computation, we can finally have both.

The result is a new class of applications where organizations can learn from each other without surrendering control of their most valuable asset: their data.
