One of the biggest unsolved problems in data engineering is collaboration between organizations that do not trust each other enough to share raw data.
Consider the following scenarios:
- Two crypto exchanges want to identify overlapping fraudulent wallets.
- Multiple banks want to detect coordinated money laundering activity.
- Healthcare providers want to compare patient cohorts for research.
- Marketing companies want to measure customer overlap.
- Cybersecurity firms want to identify shared threat actors.
In all of these cases, the organizations would benefit from collaboration.
The problem is that they cannot expose their underlying datasets.
Traditional solutions usually involve:
- Trusted third parties
- Secure legal agreements
- Data escrow providers
- Multi-party computation frameworks
- Internal clean room infrastructure
These solutions are expensive, operationally complex, and often difficult to audit.
This is where Oasis Sapphire, once again (:p), becomes interesting.
Instead of sharing raw data, organizations can:
- Encrypt datasets locally
- Submit encrypted data to Sapphire
- Execute confidential computations
- Receive aggregated results
- Never expose the underlying records
In this tutorial we will build a complete confidential data clean room that allows multiple organizations to compare datasets without revealing them. I will also answer any comments/dms I'm going to get as soon as possible!
We will cover:
- Architecture design
- Confidential smart contracts
- Dataset encryption
- Secure submissions
- Set intersection algorithms
- Similarity metrics
- Differential privacy
- Attestation concepts
- Production deployment considerations
By the end, you will have a system capable of performing privacy-preserving analytics across organizations.
The Problem
Suppose Exchange A maintains a list of suspicious wallets:
0x111...
0x222...
0x333...
0x444...
Exchange B maintains its own list:
0x333...
0x444...
0x555...
0x666...
Both exchanges want to know:
How many suspicious wallets do we have in common?
A traditional approach would require both exchanges to reveal their full datasets.
This creates multiple risks:
- Customer privacy violations
- Regulatory concerns
- Competitive intelligence leakage
- Data retention obligations
The clean room model allows the question to be answered without exposing the datasets themselves.
The final output becomes:
{
  "overlap": 2
}
Nothing else is revealed.
High-Level Architecture
     Exchange A
         |
         |
  Encrypt Dataset
         |
         v
+------------------+
|  Oasis Sapphire  |
|   Confidential   |
| Data Clean Room  |
+------------------+
         ^
         |
  Encrypt Dataset
         |
         |
     Exchange B
         |
         v
 Aggregated Result
    overlap = 2
The critical observation is that raw data never leaves the confidential execution boundary in plaintext.
Project Structure
sapphire-clean-room/
├── contracts/
│   ├── CleanRoom.sol
│   └── SimilarityEngine.sol
├── scripts/
│   └── deploy.ts
├── worker/
│   ├── encryptDataset.ts
│   └── submitDataset.ts
├── api/
│   └── server.ts
├── datasets/
│   ├── exchangeA.csv
│   └── exchangeB.csv
├── test/
│   └── cleanRoom.test.ts
└── hardhat.config.ts
Understanding Data Clean Rooms
A data clean room is a controlled environment where:
- Data providers submit information
- Computations occur
- Raw data remains hidden
- Only approved outputs leave the environment
Traditionally, cloud providers offer clean room services.
The challenge is trust.
Organizations must trust:
- Infrastructure operators
- Database administrators
- Cloud providers
Sapphire introduces a different trust model.
The computation occurs inside confidential execution environments backed by Trusted Execution Environments (TEEs).
This means computation is isolated from the host system itself.
Step 1: Dataset Registration Contract
We begin by allowing organizations to register encrypted datasets.
pragma solidity ^0.8.20;

contract CleanRoom {
    struct Dataset {
        bytes encryptedData;  // ciphertext produced off-chain
        address owner;        // account that submitted the dataset
        uint256 submittedAt;  // block timestamp of the submission
        bool active;
    }

    // Number of datasets submitted so far; also the ID of the latest one.
    uint256 public datasetCount;
    mapping(uint256 => Dataset) public datasets;

    event DatasetSubmitted(
        uint256 indexed datasetId,
        address indexed owner
    );

    // Stores an encrypted payload under the next dataset ID (IDs start at 1).
    function submitDataset(
        bytes calldata encryptedData
    ) external {
        datasetCount++;
        datasets[datasetCount] = Dataset({
            encryptedData: encryptedData,
            owner: msg.sender,
            submittedAt: block.timestamp,
            active: true
        });
        emit DatasetSubmitted(
            datasetCount,
            msg.sender
        );
    }
}
- This contract stores encrypted payloads only.
- The blockchain never sees plaintext records.
Why Encryption Alone Is Not Enough
A common misconception is:
"Just encrypt the data and store it on-chain."
Encryption protects storage.
It does not solve computation.
Eventually somebody must decrypt and process the data.
The real challenge is:
How do we compute on sensitive information without exposing it?
That is where Sapphire enters the picture.
Step 2: Confidential Set Intersection
The most common clean room operation is set intersection.
We want:
A ∩ B without revealing A or B.
Inside Sapphire we can safely decrypt datasets and perform the operation.
pragma solidity ^0.8.20;

contract SimilarityEngine {
    function computeIntersection(
        address[] memory a,
        address[] memory b
    )
        public
        pure
        returns (uint256)
    {
        uint256 overlap;
        for (uint256 i = 0; i < a.length; i++) {
            for (uint256 j = 0; j < b.length; j++) {
                if (a[i] == b[j]) {
                    overlap++;
                }
            }
        }
        return overlap;
    }
}
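This implementation is intentionally simple: it compares every pair of entries and assumes neither input contains duplicates (a repeated address would be counted more than once). We will optimize it later.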
Why This Naive Approach Does Not Scale
Suppose:
Dataset A = 100,000 entries
Dataset B = 100,000 entries
Complexity becomes:
100,000 × 100,000 = 10 billion comparisons
This is not practical. We need better data structures.
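To make the difference concrete, here is a minimal off-chain sketch of the same count using a lookup table instead of nested loops. The function name `countOverlap` is hypothetical, and the truncated addresses from the example above are used only as opaque labels. With a `Set`, each membership check takes roughly constant time, so the work grows with the combined size of the two lists rather than their product.

```typescript
// Hypothetical helper: counts distinct values present in both arrays.
// A Set acts as the lookup table, so the cost is roughly a.length + b.length.
export function countOverlap(a: string[], b: string[]): number {
  const lookup = new Set<string>(a);
  const matched = new Set<string>(); // avoids double counting duplicates in b
  b.forEach((item) => {
    if (lookup.has(item)) {
      matched.add(item);
    }
  });
  return matched.size;
}

// The example lists from "The Problem", kept as truncated placeholders.
const exchangeA = ["0x111...", "0x222...", "0x333...", "0x444..."];
const exchangeB = ["0x333...", "0x444...", "0x555...", "0x666..."];

console.log(countOverlap(exchangeA, exchangeB)); // 2
```

For two lists of 100,000 entries this is on the order of 200,000 set operations instead of 10 billion comparisons.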
Step 3: Hash-Based Intersection
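Instead of nested loops, we work on hashed records and replace the quadratic scan with a single linear pass.
Solidity cannot allocate a mapping in memory, so a pure function cannot build a lookup table directly. The version below therefore assumes both inputs arrive sorted in ascending order and deduplicated, and walks them with two pointers.
function computeIntersectionFast(
    bytes32[] memory a,
    bytes32[] memory b
)
    public
    pure
    returns (uint256)
{
    // Assumption: a and b are sorted ascending and contain no duplicates.
    uint256 overlap;
    uint256 i;
    uint256 j;
    while (i < a.length && j < b.length) {
        if (a[i] == b[j]) {
            // Match: count it and advance both pointers.
            overlap++;
            i++;
            j++;
        } else if (a[i] < b[j]) {
            i++;
        } else {
            j++;
        }
    }
    return overlap;
}
This runs in time proportional to a.length + b.length instead of their product.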
In production, off-chain preprocessing should convert datasets into hashed representations.
Benefits include:
- Smaller payloads
- Better privacy
- Faster matching
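The preprocessing step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper name `preprocess`, the normalization rules (trim and lowercase), and the choice of SHA-256 from Node's built-in `crypto` module. A real deployment would pick whichever hash the contract side expects. The output is deduplicated and sorted so that equal records from different organizations land on identical 32-byte values.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: turns raw identifiers into sorted, deduplicated
// 32-byte hashes encoded as 0x-prefixed lowercase hex strings.
export function preprocess(records: string[]): string[] {
  const hashes = new Set<string>();
  records.forEach((raw) => {
    // Assumed normalization: ignore surrounding whitespace and letter case.
    const normalized = raw.trim().toLowerCase();
    if (normalized.length === 0) {
      return; // skip blank lines from the CSV
    }
    const digest = createHash("sha256").update(normalized, "utf8").digest("hex");
    hashes.add("0x" + digest);
  });
  // Equal-length lowercase hex strings sort in the same order as their bytes.
  return Array.from(hashes).sort();
}

const hashed = preprocess([" 0xABC ", "0xabc", "0xdef"]);
console.log(hashed.length); // 2
```

One caveat worth keeping in mind: a plain hash of a guessable identifier can be reversed by hashing candidate values, so hashing alone should not be treated as the privacy boundary.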
Similarity Metrics
Sometimes overlap counts are insufficient.
Organizations may want similarity scores.
A common metric is Jaccard Similarity.
The formula is:
Jaccard Similarity = Number of Shared Records / Total Unique Records
Interpretation:
0.0 = completely different
1.0 = identical datasets
Implementing Jaccard Similarity
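function computeJaccard(
    uint256 intersection,
    uint256 unionSize
)
    public
    pure
    returns (uint256)
{
    // Two empty datasets have no defined similarity; fail with a clear message.
    require(unionSize > 0, "Empty union");
    // Fixed-point result: 1e18 means identical datasets, 0 means no overlap.
    return (intersection * 1e18) / unionSize;
}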
Returning scaled values avoids floating-point arithmetic.
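As a worked example, take the two wallet lists from earlier: each has 4 entries and they share 2. The union size follows from inclusion-exclusion (4 + 4 - 2 = 6), so the similarity is 2/6, or about 0.333. The sketch below mirrors the 1e18 scaling off-chain with `BigInt`; the helper names are hypothetical and only illustrate how a client might prepare the inputs and decode the scaled result.

```typescript
// Mirrors the 1e18 scaling factor used by computeJaccard.
const SCALE = BigInt("1000000000000000000");

// Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
export function unionSize(sizeA: bigint, sizeB: bigint, intersection: bigint): bigint {
  return sizeA + sizeB - intersection;
}

// Same integer arithmetic as the contract: truncating division.
export function jaccardScaled(intersection: bigint, union: bigint): bigint {
  if (union === BigInt(0)) {
    throw new Error("union must be non-zero");
  }
  return (intersection * SCALE) / union;
}

// Converts the fixed-point value back into a number for display only.
export function toDecimal(scaled: bigint): number {
  return Number(scaled) / 1e18;
}

const union = unionSize(BigInt(4), BigInt(4), BigInt(2)); // 6
const scaled = jaccardScaled(BigInt(2), union); // 333333333333333333
console.log(scaled.toString(), toDecimal(scaled).toFixed(3)); // "333333333333333333" "0.333"
```

Integer division truncates, so the scaled value is rounded down by less than one unit of 1e-18.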
Step 4: Local Dataset Encryption
Before submission, organizations encrypt locally.
Example Node.js utility:
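import crypto from "crypto";

const algorithm = "aes-256-gcm";

export function encryptDataset(
  plaintext: string,
  key: Buffer // must be 32 bytes for AES-256
) {
  // GCM is designed around a 96-bit (12-byte) nonce.
  // Generate a fresh one per message and never reuse it with the same key.
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv(algorithm, key, iv);

  let encrypted = cipher.update(plaintext, "utf8", "hex");
  encrypted += cipher.final("hex");

  // The authentication tag lets the recipient detect tampering.
  const tag = cipher.getAuthTag();

  return {
    iv: iv.toString("hex"),
    encrypted,
    tag: tag.toString("hex")
  };
}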
Encryption occurs before data leaves the organization.
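Before wiring this into a pipeline, it can help to confirm locally that a payload decrypts back to the original and that tampering is detected. The sketch below restates the encryption helper so it runs on its own and adds a hypothetical `decryptDataset` counterpart for local testing. It does not show how the key would reach the confidential contract, which is a separate design question.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface EncryptedDataset {
  iv: string;        // hex-encoded nonce
  encrypted: string; // hex-encoded ciphertext
  tag: string;       // hex-encoded GCM authentication tag
}

export function encryptDataset(plaintext: string, key: Buffer): EncryptedDataset {
  const iv = randomBytes(12); // 12-byte nonce, the size GCM is designed around
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const encrypted = cipher.update(plaintext, "utf8", "hex") + cipher.final("hex");
  return {
    iv: iv.toString("hex"),
    encrypted,
    tag: cipher.getAuthTag().toString("hex"),
  };
}

// Hypothetical counterpart used only to verify payloads locally.
export function decryptDataset(payload: EncryptedDataset, key: Buffer): string {
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(payload.iv, "hex"));
  decipher.setAuthTag(Buffer.from(payload.tag, "hex"));
  // final() throws if the ciphertext or tag has been modified.
  return decipher.update(payload.encrypted, "hex", "utf8") + decipher.final("utf8");
}

const demoKey = randomBytes(32); // throwaway key for the demonstration
const demoPayload = encryptDataset("0x111...,0x222...", demoKey);
console.log(decryptDataset(demoPayload, demoKey) === "0x111...,0x222..."); // true
```

Because the nonce is random, encrypting the same dataset twice produces different ciphertexts, which is the intended behavior.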
Step 5: Submission Worker
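import { ethers } from "ethers";
import { encryptDataset } from "./encryptDataset";

// The contract address, ABI, and signer are supplied by the caller
// (for example from the deploy script output and a configured wallet).
export async function submitDataset(
  contractAddress: string,
  abi: ethers.InterfaceAbi,
  signer: ethers.Signer,
  dataset: string,
  key: Buffer
) {
  const contract = new ethers.Contract(contractAddress, abi, signer);

  // Encrypt locally so plaintext never leaves the organization.
  const encrypted = encryptDataset(dataset, key);

  // Serialize { iv, encrypted, tag } to bytes for the contract call.
  const tx = await contract.submitDataset(
    ethers.toUtf8Bytes(JSON.stringify(encrypted))
  );

  // Wait until the transaction is included in a block.
  await tx.wait();
}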
This worker becomes the ingestion pipeline.
Differential Privacy Layer
Even aggregate results can leak information.
Example:
Dataset overlap: 1245
Repeated queries could allow reconstruction attacks.
To mitigate this, we can introduce controlled noise.
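function addNoise(
    uint256 value,
    int256 noise
)
    internal
    pure
    returns (uint256)
{
    // Noise may be negative, so it is a signed value.
    // It should come from a random source the querying party cannot predict.
    int256 noisy = int256(value) + noise;
    // Clamp at zero so the reported count is never negative.
    return noisy < 0 ? 0 : uint256(noisy);
}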
Result:
Actual overlap: 1245
Reported overlap: 1242
Privacy improves while preserving utility.
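The text above does not say how the noise is chosen. One common choice for count queries, used here purely as an assumption, is the Laplace mechanism: draw noise from a Laplace distribution with scale sensitivity/epsilon, where a count has sensitivity 1 and a smaller epsilon means more noise. The sketch below shows the idea off-chain; the function names and the injectable `rng` parameter are hypothetical, and `Math.random` is only a stand-in for a source of randomness that the querying party cannot predict.

```typescript
// Draws one sample from Laplace(0, scale) by inverse transform sampling.
export function sampleLaplace(scale: number, rng: () => number = Math.random): number {
  // Clamp away from 0 and 1 so the logarithm below stays finite.
  const r = Math.min(Math.max(rng(), Number.EPSILON), 1 - Number.EPSILON);
  const u = r - 0.5; // uniform in (-0.5, 0.5)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Adds Laplace noise to a count and returns a non-negative integer.
export function noisyCount(
  trueCount: number,
  epsilon: number,
  rng: () => number = Math.random
): number {
  const sensitivity = 1; // one record changes a count by at most 1
  const noise = sampleLaplace(sensitivity / epsilon, rng);
  // Rounding and clamping are post-processing and do not weaken the guarantee.
  return Math.max(0, Math.round(trueCount + noise));
}

// Example: report the overlap of 1245 with epsilon = 0.5.
console.log(noisyCount(1245, 0.5));
```

Each released answer spends part of a privacy budget, so the noise only helps against repeated queries if the total number of queries is also limited.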
Multi-Party Data Clean Rooms
The real power appears when more than two organizations participate.
Exchange A
Exchange B
Exchange C
Exchange D
|
v
Sapphire Clean Room
|
v
Aggregated Analysis
Examples:
- Shared fraud detection
- Shared sanctions monitoring
- Shared AML screening
- Shared cybersecurity intelligence
Production Security Considerations
Dataset Poisoning
An attacker may submit fabricated data.
Mitigations:
- Staking requirements
- Dataset signatures
- Reputation systems
Replay Attacks
An attacker may repeatedly submit old datasets.
Mitigations:
mapping(bytes32 => bool) public processedHashes;
Reject duplicates.
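The same check can also run in the ingestion service before a transaction is ever sent. The sketch below is a hypothetical in-memory registry that fingerprints each payload with SHA-256 and refuses byte-identical resubmissions; the class name and the choice of hash are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory mirror of the processedHashes mapping.
export class SubmissionRegistry {
  private readonly processed = new Set<string>();

  // Returns true if the payload is new, false if it was seen before.
  register(payload: Uint8Array): boolean {
    const fingerprint = createHash("sha256").update(payload).digest("hex");
    if (this.processed.has(fingerprint)) {
      return false; // duplicate submission, reject
    }
    this.processed.add(fingerprint);
    return true;
  }
}

const registry = new SubmissionRegistry();
const payload = new TextEncoder().encode("encrypted-dataset-bytes");
console.log(registry.register(payload)); // true
console.log(registry.register(payload)); // false
```

Note that this only catches byte-identical payloads. Because each encryption uses a fresh random nonce, re-encrypting the same dataset yields a different ciphertext and therefore a different fingerprint.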
Query Abuse
Attackers may repeatedly request computations.
Mitigations:
- Rate limiting
- Access control
- Usage quotas
Membership Inference Attacks
An attacker may attempt to infer whether a specific record exists.
Mitigations:
- Differential privacy
- Minimum cohort sizes
- Query restrictions
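Minimum cohort sizes and query restrictions can be combined in a small release gate. The sketch below is one possible shape, with hypothetical names and thresholds: a result is released only if the caller still has queries left and the overlap is at least the minimum cohort size.

```typescript
export interface ReleasePolicy {
  minCohortSize: number;       // smallest overlap that may be reported
  maxQueriesPerCaller: number; // total queries each caller may make
}

// Hypothetical gate applied to every result before it leaves the clean room.
export class QueryGate {
  private readonly policy: ReleasePolicy;
  private readonly used = new Map<string, number>();

  constructor(policy: ReleasePolicy) {
    this.policy = policy;
  }

  // Returns the overlap if it may be released, otherwise null.
  release(caller: string, overlap: number): number | null {
    const spent = this.used.get(caller) ?? 0;
    if (spent >= this.policy.maxQueriesPerCaller) {
      return null; // quota exhausted
    }
    // Suppressed answers still count against the quota.
    this.used.set(caller, spent + 1);
    return overlap >= this.policy.minCohortSize ? overlap : null;
  }
}

const gate = new QueryGate({ minCohortSize: 10, maxQueriesPerCaller: 2 });
console.log(gate.release("exchange-a", 5));  // null (below minimum cohort size)
console.log(gate.release("exchange-a", 50)); // 50
console.log(gate.release("exchange-a", 50)); // null (quota exhausted)
```

Counting suppressed answers against the quota is a deliberate choice here, since a refusal still tells the caller that the overlap was below the threshold.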
Future Extensions
This architecture can evolve into:
- Confidential AML Network: Banks privately compare suspicious entities.
- Cross-Exchange Fraud Detection: Exchanges identify shared attackers.
- Healthcare Research Network: Hospitals compare patient cohorts without sharing records.
- Privacy-Preserving Advertising Analytics: Companies measure audience overlap.
- Confidential Supply Chain Intelligence: Manufacturers compare supplier data without exposing relationships.
Conclusion
Most blockchain applications focus on moving value.
Far fewer explore moving information safely.
Confidential data clean rooms represent a massive opportunity because they solve a real-world problem that organizations face every day:
How can we collaborate on data without exposing the data itself?
Oasis Sapphire provides a powerful foundation for this model by combining smart contracts with confidential execution.
Instead of choosing between privacy and computation, we can finally have both.
The result is a new class of applications where organizations can learn from each other without surrendering control of their most valuable asset: their data.