How We Built Multi-Modal Screen Detection for Cryptographic Evidence Capture: VeraSnap

TL;DR: We built a system that uses LiDAR depth analysis, moiré pattern detection, rolling shutter flicker analysis, and IMU-based human presence verification to detect when someone photographs a screen instead of a real scene — and we bind the result cryptographically with RFC 3161 timestamps. This is how we did it, why it matters for digital evidence, and the cross-platform challenges we solved along the way.


The Analog Hole Problem Nobody Talks About

Every content provenance system — C2PA, Content Credentials, you name it — has a fundamental vulnerability that's rarely discussed in technical circles: the analog hole.

Here's the attack: take a manipulated image, display it on a high-resolution monitor, then photograph that monitor with a "trusted" camera app. The resulting photo carries valid provenance credentials — cryptographic signatures, timestamps, the works — despite containing synthetic or manipulated content. The camera faithfully records what it sees, and what it sees is a screen displaying a lie.

┌──────────────────────────────────────────────────┐
│  Deepfake / Manipulated Image                    │
│         ↓                                        │
│  Display on 4K Monitor                           │
│         ↓                                        │
│  Photograph with "Trusted" Camera App            │
│         ↓                                        │
│  ✅ Valid C2PA signature                         │
│  ✅ Valid RFC 3161 timestamp                     │
│  ✅ Valid GPS coordinates                        │
│  ❌ Content is NOT a real-world scene            │
└──────────────────────────────────────────────────┘

This isn't a theoretical concern. In January 2025, researchers demonstrated that Nikon's C2PA implementation could be tricked into signing fake images with valid certificates — forcing Nikon to revoke all C2PA certificates and pause their authentication service. That attack didn't even require screen photography, but the analog hole makes this class of forgery trivially easy.

We needed a way to detect this at capture time, not after the fact. And we needed it to work on consumer smartphones, not $3,000+ professional cameras.

This article walks through how we built that system for VeraSnap, our open-standard cryptographic evidence capture app.


Why Existing Approaches Fall Short

Before diving into our implementation, let's survey what's already out there — and why none of it solved our problem.

Sony Camera Authenticity Solution (PDAF-based)

Sony deploys 3D depth detection on their Alpha camera lineup (A1 II, A9 III, A7V). Their system uses Phase Detection AutoFocus (PDAF) pixel data from the imaging sensor to infer depth along a single optical axis. It works — but it requires cameras costing $2,500–$7,000, and the verification service is currently limited to select news organizations.

Key limitation: PDAF is a passive sensing technology. It infers depth from how light falls on split photodiodes during autofocus. It doesn't actively measure distance.

Serelay (Autofocus Focal Length Mapping)

Serelay's patented approach (US11012613B2) samples focal lengths at approximately 9 discrete points using the smartphone's standard autofocus mechanism. An SVM classifier distinguishes flat surfaces from 3D scenes based on whether all focus points converge to the same focal length.

Key limitation: 9 data points. That's it. And autofocus degrades significantly in low light. The patent is US-only, which is interesting for freedom-to-operate analysis.

Truepic (Software-based Image Analysis)

Truepic includes "picture of a picture detection" among their 35+ fraud tests, likely using moiré pattern detection, color distortion analysis, and edge artifact classification. No depth sensors involved.

Key limitation: Pure software analysis is inherently a cat-and-mouse game. As displays improve (higher PPI, better color accuracy, wider viewing angles), software-only detection gets harder.

The Gap

Nobody had built a system that:

  1. Uses dedicated depth sensors (not autofocus proxies)
  2. Runs on consumer smartphones (not professional cameras)
  3. Combines multiple detection modalities (not just one signal)
  4. Binds results cryptographically to an open standard
  5. Works cross-platform (iOS and Android)

That's what we set out to build.


Architecture Overview: Defense in Depth

Our screen detection system follows a defense-in-depth philosophy. No single technique is foolproof, so we layer multiple independent detection methods and fuse their results.

┌─────────────────────────────────────────────────────────┐
│                  VeraSnap Capture Pipeline                │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐ │
│  │  LiDAR   │  │  Moiré   │  │  Flicker │  │   IMU   │ │
│  │  Depth   │  │  Pattern │  │  Detect  │  │  Tremor │ │
│  │ Analysis │  │ Analysis │  │ Analysis │  │ Analysis│ │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬────┘ │
│       │              │              │              │      │
│       ▼              ▼              ▼              ▼      │
│  ┌──────────────────────────────────────────────────┐    │
│  │         Weighted Score Fusion Engine              │    │
│  │                                                   │    │
│  │  Score = w1×Depth + w2×Moiré + w3×Flicker        │    │
│  │          + w4×(1 - TremorPresent)                 │    │
│  └──────────────────┬───────────────────────────────┘    │
│                     │                                     │
│                     ▼                                     │
│  ┌──────────────────────────────────────────────────┐    │
│  │  CPP v1.5 Event (SHA-256 → RFC 3161 Timestamp)   │    │
│  └──────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

Each tier operates independently and has different strengths:

| Tier | Method | Accuracy | Hardware Needed | Lighting Dependency |
|------|--------|----------|-----------------|---------------------|
| 1 | LiDAR Depth Analysis | ~97% | LiDAR sensor | None (active IR) |
| 1 | Moiré Pattern (CNN) | 96–99% | Camera only | Moderate |
| 1 | Rolling Shutter Flicker | >95% | Camera only | Low |
| 2 | IMU Tremor Analysis | ~85% | Accelerometer | None |
| 2 | Ambient Light PWM | ~80% | Light sensor | N/A |

Let's dive deep into each one.


Tier 1: LiDAR Depth Uniformity Analysis

This is our flagship detection method — and the one that made VeraSnap the first consumer smartphone app to use dedicated LiDAR for screen detection in evidence capture.

The Core Insight

A real three-dimensional scene produces variable depth data across the frame. Objects exist at different distances — a person at 1.5m, a wall at 3m, furniture at 2m. A flat display produces uniform depth readings — every pixel on that 27" monitor is at essentially the same distance from the camera.

This is well-established in face biometric anti-spoofing (detecting printed photos held up to a face scanner), but nobody had applied it to general scene verification for evidentiary purposes.

iPhone LiDAR: The Hardware

iPhone Pro models (since iPhone 12 Pro) include a dedicated LiDAR scanner — a dToF (direct Time-of-Flight) sensor that emits infrared laser pulses and measures how long they take to bounce back. Key specs:

  • Resolution: 256 × 192 = 49,152 depth points per frame
  • Refresh rate: Up to 60 Hz
  • Range: 0.2m to 5m
  • Accuracy: ±1cm at close range
  • Lighting: Works in complete darkness (active IR illumination)

Compare this to Serelay's 9 autofocus sample points. We have over 5,000× more data.

The Algorithm

Our screen detection algorithm computes four indicators from the LiDAR depth map, then combines them with weighted scoring:

def is_likely_screen(analysis: DepthAnalysis) -> tuple[bool, float]:
    """
    Reference implementation — CPP v1.4 Depth Analysis Extension.

    Implementations MAY use different algorithms as long as
    the output format conforms to spec.
    """
    stats = analysis.statistics
    plane = analysis.plane_analysis

    # Criterion 1: Low depth variance → flat surface
    flatness_score = 1.0 - min(stats.std_deviation / 0.5, 1.0)

    # Criterion 2: Dominant plane covers most of frame
    plane_dominance = plane.dominant_plane_ratio

    # Criterion 3: Narrow depth range
    depth_uniformity = 1.0 - min(stats.depth_range / 2.0, 1.0)

    # Criterion 4: Sharp rectangular edges in depth discontinuities
    edge_sharpness = detect_rectangular_edges(analysis)

    # Weighted combination
    score = (
        flatness_score   * 0.30 +
        plane_dominance  * 0.25 +
        depth_uniformity * 0.25 +
        edge_sharpness   * 0.20
    )

    is_screen = score > 0.70
    confidence = abs(score - 0.50) * 2  # 0.0 at boundary, 1.0 at extremes

    return is_screen, confidence

Calibration Data

We tested against real-world scenarios to establish thresholds:

| Scene Type | Typical StdDev | Typical PlaneRatio | Expected Verdict |
|------------|----------------|--------------------|------------------|
| Outdoor landscape | 5.0+ m | <0.20 | ✅ NOT screen |
| Indoor room | 1.0–3.0 m | 0.20–0.40 | ✅ NOT screen |
| Document on desk | 0.3–0.8 m | 0.30–0.50 | ✅ NOT screen |
| Person portrait | 0.5–1.5 m | 0.15–0.30 | ✅ NOT screen |
| Monitor display | <0.05 m | >0.85 | 🚩 LIKELY screen |
| Smartphone screen | <0.02 m | >0.90 | 🚩 LIKELY screen |
| Printed photo (flat) | 0.01–0.05 m | >0.80 | ⚠️ Possible false positive |

The printed photo case is the known limitation — we handle this by reporting confidence levels and recommending human review for high-stakes verification.
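To make those thresholds concrete, here is a small stand-alone sketch that runs the reference scoring from above against two of the calibration rows. The numbers are taken from the table; the edge-sharpness indicator isn't recorded there, so it is pinned to a neutral 0.5:

```python
# Sketch: applying the reference weighted scoring to the calibration rows.
# Edge sharpness is not in the table, so it is fixed at a neutral 0.5 here.

def screen_score(std_dev, depth_range, plane_ratio, edge_sharpness=0.5):
    flatness = 1.0 - min(std_dev / 0.5, 1.0)        # Criterion 1
    uniformity = 1.0 - min(depth_range / 2.0, 1.0)  # Criterion 3
    return (flatness * 0.30 + plane_ratio * 0.25 +
            uniformity * 0.25 + edge_sharpness * 0.20)

# Indoor room: high variance, small dominant plane
indoor = screen_score(std_dev=2.0, depth_range=4.0, plane_ratio=0.30)
# Monitor: near-zero variance, dominant plane fills the frame
monitor = screen_score(std_dev=0.03, depth_range=0.05, plane_ratio=0.90)

print(f"indoor room: {indoor:.3f} → screen? {indoor > 0.70}")
print(f"monitor:     {monitor:.3f} → screen? {monitor > 0.70}")
```

The indoor room lands far below the 0.70 threshold while the monitor lands well above it, which is exactly the separation the calibration table reports.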

Reflectivity Analysis: The Secret Weapon

Beyond depth uniformity, LiDAR provides something no other method can: surface reflectivity data. LCD and OLED screens have characteristic infrared reflectivity patterns that differ from natural surfaces:

  • Glass panels produce specular IR reflections at certain angles
  • LCD polarizers interact distinctively with IR light
  • OLED emitters show different IR return characteristics than printed surfaces

We detect ReflectivityAnomaly as a boolean indicator in our screen detection output. This alone doesn't trigger a screen classification, but combined with the other indicators, it significantly reduces false positives.

iOS Implementation (Swift)

import ARKit

class DepthAnalyzer {

    func analyzeDepthFrame(_ depthMap: CVPixelBuffer) -> DepthAnalysis {
        let width = CVPixelBufferGetWidth(depthMap)    // 256
        let height = CVPixelBufferGetHeight(depthMap)  // 192

        CVPixelBufferLockBaseAddress(depthMap, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(depthMap, .readOnly) }

        let baseAddress = CVPixelBufferGetBaseAddress(depthMap)!
        let floatBuffer = baseAddress.assumingMemoryBound(to: Float32.self)

        // Collect valid depth values
        var depths: [Float] = []
        for i in 0..<(width * height) {
            let value = floatBuffer[i]
            if value.isFinite && value > 0.0 && value < 10.0 {
                depths.append(value)
            }
        }

        guard !depths.isEmpty else {
            return DepthAnalysis(available: false, reason: .captureFailed)
        }

        // Statistics
        let minDepth = depths.min()!
        let maxDepth = depths.max()!
        let mean = depths.reduce(0, +) / Float(depths.count)
        let variance = depths.map { ($0 - mean) * ($0 - mean) }
                             .reduce(0, +) / Float(depths.count)
        let stdDev = sqrt(variance)
        let validRatio = Float(depths.count) / Float(width * height)

        // Plane analysis via RANSAC
        let planeResult = detectDominantPlane(
            depths: depths,
            width: width, 
            height: height
        )

        // Screen detection
        let (isScreen, confidence) = computeScreenScore(
            stdDev: stdDev,
            depthRange: maxDepth - minDepth,
            planeRatio: planeResult.ratio,
            edgeSharpness: planeResult.edgeSharpness
        )

        return DepthAnalysis(
            available: true,
            sensorType: .lidar,
            frameTimestamp: Date(),
            resolution: Resolution(width: width, height: height),
            statistics: Statistics(
                minDepth: minDepth,
                maxDepth: maxDepth,
                meanDepth: mean,
                stdDeviation: stdDev,
                depthRange: maxDepth - minDepth,
                validPixelRatio: validRatio
            ),
            planeAnalysis: planeResult.analysis,
            screenDetection: ScreenDetection(
                isLikelyScreen: isScreen,
                confidence: confidence,
                indicators: /* ... */
            ),
            analysisHash: computeSHA256(depthMap)
        )
    }
}

The Privacy-Preserving Twist

Here's a design decision we're proud of: the raw depth map is never stored. We compute statistics and the screen detection verdict at capture time, then hash the raw depth data (stored as AnalysisHash). The hash proves the computation was performed on real depth data without preserving any 3D reconstruction of the scene.

This matters for GDPR compliance — depth maps could theoretically contain biometric information (face geometry), and storing them would trigger Article 9 special category protections. By hashing and discarding, we avoid this entirely.
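A minimal sketch of the hash-and-discard pattern (the function name and dict layout are illustrative, not VeraSnap's actual API): statistics and a SHA-256 commitment leave the function; the raw buffer does not.

```python
import hashlib
import struct

def analyze_and_discard(depth_values: list[float]) -> dict:
    """Summarize a depth frame and commit to it; never retain the raw map."""
    n = len(depth_values)
    mean = sum(depth_values) / n
    std = (sum((d - mean) ** 2 for d in depth_values) / n) ** 0.5

    # Commitment hash: proves the statistics were computed over this exact
    # buffer, without preserving any 3D reconstruction of the scene.
    raw = struct.pack(f"<{n}f", *depth_values)
    analysis_hash = "sha256:" + hashlib.sha256(raw).hexdigest()

    return {
        "MeanDepth": round(mean, 4),
        "StdDeviation": round(std, 4),
        "AnalysisHash": analysis_hash,
    }

proof = analyze_and_discard([0.45, 1.23, 3.82, 1.05])
print(proof["StdDeviation"], proof["AnalysisHash"][:15])
```

A verifier that is later handed the same raw bytes can reproduce `AnalysisHash`, but nothing in the stored proof lets anyone recover the depth map itself.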


Tier 1: Moiré Pattern Detection (CNN-Based)

LiDAR is powerful, but it's only available on iPhone Pro models. We needed a technique that works on every device — including budget Android phones with no depth sensor at all.

The Physics of Moiré

When you photograph a screen, two regular grids interact: the camera's sensor pixel grid and the display's pixel grid. This interference creates characteristic moiré patterns — wavy, rainbow-like artifacts that don't exist in natural scenes.

Camera Sensor Grid (e.g., 12MP)
  |||||||||||||||||||||||||||
  |||||||||||||||||||||||||||    ← Interference
  |||||||||||||||||||||||||||
Display Pixel Grid (e.g., 401 PPI)
  || || || || || || || || ||
  || || || || || || || || ||
                                → Moiré pattern artifacts

The academic literature is rich here. Garcia & de Queiroz (IEEE TIFS 2015) established the fundamental 2D DFT + Difference of Gaussians approach, achieving 92–97% accuracy on standard LCD screens.

Our Three-Tier Approach

We implement three detection methods, ranked by computational cost:

Tier 1 — Frequency Domain FFT (simplest, ~85% accuracy):

// Android (Kotlin) — Simplified
fun detectMoireFFT(image: Bitmap): Float {
    val grayscale = toGrayscale(image)
    val fft2d = computeFFT2D(grayscale)
    val magnitude = computeMagnitudeSpectrum(fft2d)

    // Look for periodic peaks at non-natural frequencies
    // Screens produce regular grid interference patterns
    val peaks = findPeriodicPeaks(magnitude, 
        minFrequency = 0.1,  // Normalized
        maxFrequency = 0.4
    )

    // Guard against an empty peak list; average peak amplitude as the score
    if (peaks.isEmpty()) return 0f
    return peaks.map { it.amplitude }.sum() / peaks.size
}
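For intuition, here is the same frequency-domain idea in about twenty lines of NumPy — a toy sketch, not the production detector. It scores the "peakiness" of mid-band spectral energy, which is high for regular grids and low for natural textures:

```python
import numpy as np

def moire_fft_score(gray, f_min=0.1, f_max=0.4):
    """Score periodic energy in a mid-frequency band of the 2D spectrum.

    Screen grids concentrate energy in narrow mid-frequency peaks;
    natural images have a smoothly decaying spectrum.
    """
    spec = np.abs(np.fft.fftshift(np.fft.fft2(gray)))
    h, w = gray.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    r = np.sqrt(fx**2 + fy**2)                   # normalized radial frequency
    band = (r >= f_min) & (r <= f_max)
    # Peakiness: max band energy relative to mean band energy
    return float(spec[band].max() / (spec[band].mean() + 1e-9))

# Synthetic check: a pure pixel grid scores far higher than random texture
y, x = np.mgrid[0:128, 0:128]
grid = np.sin(2 * np.pi * 0.2 * x) * np.sin(2 * np.pi * 0.2 * y)
noise = np.random.default_rng(0).normal(size=(128, 128))
print(moire_fft_score(grid) > moire_fft_score(noise))
```

The production path differs in the details (peak finding, band selection, normalization), but the separation between grid-like and natural spectra is the signal all three tiers exploit.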

Tier 2 — MobileNetV2 Classifier (recommended, 96%+ accuracy):

// iOS (Swift) — Core ML inference
import CoreML
import Vision

func detectMoireCNN(image: CGImage) -> ScreenDetectionResult {
    let model = try! MoireDetectorV2(configuration: .init())
    let request = VNCoreMLRequest(model: try! VNCoreMLModel(for: model.model))

    let handler = VNImageRequestHandler(cgImage: image)
    try! handler.perform([request])

    guard let result = request.results?.first as? VNClassificationObservation else {
        return .unknown
    }

    return ScreenDetectionResult(
        isScreen: result.identifier == "screen",
        confidence: result.confidence,
        modelVersion: "moiredet-v1.2-mobilenetv2"
    )
}

| Component | iOS (Swift) | Android (Kotlin) |
|-----------|-------------|------------------|
| ML Runtime | Core ML (.mlmodel) | TensorFlow Lite (.tflite) |
| Model | MobileNetV2 + classifier head | MobileNetV2 + classifier head |
| Input | 224×224 center crop | 224×224 center crop |
| Inference Time | ~15ms on A15+ | ~20ms on Snapdragon 8 Gen 1+ |
| Model Size | ~8–12 MB | ~8–12 MB |

Tier 3 — Wavelet + CNN Cascade (highest accuracy, 99%+):

Wavelet decomposition into LH, HL, HH sub-bands followed by lightweight CNN analysis of each sub-band. Highest accuracy but ~3× the compute cost.

The High-PPI Challenge

Modern displays are making moiré detection harder. A 4K display at 458 PPI pushes moiré frequencies beyond the camera's Nyquist limit at distances beyond ~30cm. We handle this by:

  1. Using multi-scale analysis (checking at multiple resolution levels)
  2. Detecting sub-pixel rendering patterns (RGB stripe vs. PenTile diamond) via Gabor filter banks at 8 orientations
  3. Falling back to LiDAR depth when moiré is ambiguous
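A back-of-envelope check of when this kicks in. The camera numbers below are assumptions (a generic smartphone main camera with ~5.7mm physical focal length and 1.4µm pixel pitch), not VeraSnap's calibration: once the display pixel pitch, projected onto the sensor, drops below two sensor pixels, the grid can no longer be directly resolved and moiré contrast falls off.

```python
# Rough geometry only; focal length and sensor pitch are assumed values.
MM_PER_INCH = 25.4
FOCAL_MM = 5.7          # assumed physical focal length
SENSOR_PITCH_UM = 1.4   # assumed sensor pixel pitch

def projected_pitch_um(ppi: float, distance_mm: float) -> float:
    """Display pixel pitch as projected onto the sensor, in microns."""
    display_pitch_mm = MM_PER_INCH / ppi
    return display_pitch_mm * FOCAL_MM / distance_mm * 1000.0

for ppi in (264, 458):            # tablet-class vs. phone-class densities
    for d_cm in (15, 30, 60):
        p = projected_pitch_um(ppi, d_cm * 10)
        ok = p >= 2 * SENSOR_PITCH_UM  # Nyquist: ≥2 sensor px per grid period
        print(f"{ppi:3d} PPI @ {d_cm:2d} cm → {p:.2f} µm  "
              f"{'resolvable' if ok else 'beyond Nyquist'}")
```

Even at 15cm, a 458 PPI grid projects to roughly 2µm per display pixel — already past the Nyquist limit for the assumed sensor, which is why the multi-scale and Gabor-filter fallbacks matter.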

Training Data

Our model was trained on a diverse dataset covering:

  • LCD, OLED, mini-LED, E-Ink displays
  • 720p through 8K resolutions
  • Various viewing angles (0°–60°) and distances (15cm–2m)
  • Multiple ambient lighting conditions
  • GAN-based de-moiréing attack samples for adversarial robustness

Tier 1: Rolling Shutter Flicker Detection

CMOS image sensors expose pixels sequentially — the top row is captured a few microseconds before the bottom row. This "rolling shutter" effect creates visible banding when photographing displays that flicker at their refresh rate.

How It Works

Display flickering at 60Hz:
  ████████████████  ← bright phase
  ░░░░░░░░░░░░░░░░  ← dark phase (backlight PWM)
  ████████████████  ← bright phase

Camera rolling shutter captures:
  Row 0:    ████████  (bright phase)
  Row 100:  ████████  (bright phase)
  Row 200:  ░░░░░░░░  (dark phase — banding!)
  Row 300:  ████████  (bright phase)
  ...

The algorithm is straightforward:

1. Extract a raw frame (before ISP processing if possible)
2. Compute row-wise brightness: B[row] = mean(Y_channel[row])
3. Apply FFT to B[] → frequency spectrum
4. Search for peaks at: 50, 60, 100, 120, 240 Hz
   (adjusted for exposure time and regional power frequency)
5. If peak SNR > threshold → screen detected
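The five steps above, sketched in NumPy against a synthetic frame. The readout time is an assumption for the demo; a real implementation needs the sensor's actual line-readout timing:

```python
import numpy as np

def flicker_peak(frame_y, readout_time_s, targets=(50, 60, 100, 120, 240)):
    """Row-mean brightness → FFT → strongest mains-related peak and its SNR."""
    rows = frame_y.mean(axis=1)                  # step 2: row-wise brightness
    rows = rows - rows.mean()                    # remove DC before the FFT
    spec = np.abs(np.fft.rfft(rows))             # step 3
    fs = len(rows) / readout_time_s              # rows exposed per second
    freqs = np.fft.rfftfreq(len(rows), d=1.0 / fs)

    def peak_at(t):
        return spec[np.argmin(np.abs(freqs - t))]

    best = max(targets, key=peak_at)             # step 4: scan target bins
    snr = peak_at(best) / (spec.mean() + 1e-9)   # step 5: peak vs. mean energy
    return best, float(snr)

# Synthetic frame: 120 Hz backlight PWM sampled over an (assumed) 33 ms readout
t = np.linspace(0, 1 / 30, 960, endpoint=False)
frame = 100 + 20 * np.sin(2 * np.pi * 120 * t)[:, None] * np.ones((1, 640))
freq, snr = flicker_peak(frame, readout_time_s=1 / 30)
print(freq, snr > 4.0)   # 120 True
```

With clean banding the 120Hz bin dominates the spectrum by orders of magnitude; real frames are noisier, which is why a peak-SNR threshold rather than a raw amplitude is used.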

Cross-Platform Implementation

| Component | iOS (Swift) | Android (Kotlin) |
|-----------|-------------|------------------|
| Frame Access | AVCaptureVideoDataOutput | Camera2 API + ImageReader (YUV_420_888) |
| Analysis | Row-mean brightness → FFT | Row-mean brightness → FFT |
| Target Frequencies | 50Hz (JP East), 60Hz (JP West/US), 100/120/240Hz | Same |

False Positive Mitigation

Fluorescent lights also flicker at 100/120Hz. PWM LED dimming creates similar artifacts. Our solution: never use flicker detection alone. It's always fused with moiré and/or depth data in the ensemble decision.

Processing cost: ~5ms per frame. Negligible.

The Variable Refresh Rate Problem

Modern displays with ProMotion/LTPO technology dynamically switch between 1–120Hz. This makes flicker detection unreliable because the frequency changes between frames. Our detection accuracy drops from >95% to 70–85% on these displays. We compensate by:

  1. Analyzing multiple consecutive frames (looking for any consistent frequency)
  2. Weighting flicker lower in the fusion score when no clear peak is found
  3. Relying more heavily on moiré and depth for ambiguous cases

Tier 2: IMU-Based Human Presence Verification

This is a subtle but clever technique. When a human holds a phone, the accelerometer and gyroscope register characteristic micro-tremors — involuntary hand movements in the 4–12Hz band that are physiologically unavoidable. A phone mounted on a tripod or mechanical arm (which might be used in a sophisticated screen-capture attack) lacks these tremors.

The Signal

Human hand-holding characteristics:
  - Tremor band:     4–12 Hz (bandpass filtered)
  - PSD ratio:       High energy in tremor band vs. total
  - Zero-crossing:   Characteristic rate
  - Jerk profile:    Follows "minimum-jerk" trajectories
                     (biological optimization principle)

Mechanical/tripod characteristics:
  - Tremor band:     Near-zero energy
  - Motion profile:  Step functions, not smooth curves
  - Jerk profile:    Discontinuous

Implementation

// iOS — Collect accelerometer data during capture
import CoreMotion

let motionManager = CMMotionManager()
motionManager.accelerometerUpdateInterval = 1.0 / 100.0  // 100Hz sampling

var samples: [CMAccelerometerData] = []

motionManager.startAccelerometerUpdates(to: .main) { data, _ in
    guard let data = data else { return }
    samples.append(data)
}

// After capture, analyze the 500ms window around shutter press
func analyzeHumanPresence(_ samples: [CMAccelerometerData]) -> Float {
    let magnitudes = samples.map {
        sqrt($0.acceleration.x * $0.acceleration.x
           + $0.acceleration.y * $0.acceleration.y
           + $0.acceleration.z * $0.acceleration.z)
    }

    // Bandpass filter: 4–12 Hz
    let filtered = bandpassFilter(magnitudes, low: 4.0, high: 12.0, fs: 100.0)

    // Power Spectral Density in tremor band
    let psd = computePSD(filtered)
    let tremorEnergy = psd.filter { $0.frequency >= 4 && $0.frequency <= 12 }
                         .map { $0.power }
                         .reduce(0, +)
    let totalEnergy = psd.map { $0.power }.reduce(0, +)

    let tremorRatio = tremorEnergy / max(totalEnergy, 1e-10)

    // High ratio → human; low ratio → tripod/mechanical
    return tremorRatio
}

Detection accuracy: ~85%. This won't catch a sophisticated attacker who hand-holds their phone while photographing a screen, but it adds another independent signal to the fusion engine.


The Fusion Engine: Combining Everything

Individual detectors have known weaknesses. The power of our system comes from fusing independent signals:

# Weighted score combination
def compute_screen_score(
    depth_result: Optional[DepthResult],
    moire_score: float,
    flicker_detected: bool,
    tremor_present: bool
) -> float:

    weights = {}
    scores = {}

    # LiDAR depth (if available)
    if depth_result and depth_result.available:
        weights['depth'] = 0.35
        scores['depth'] = depth_result.flatness_score

    # Moiré pattern
    weights['moire'] = 0.30
    scores['moire'] = moire_score

    # Flicker
    weights['flicker'] = 0.15
    scores['flicker'] = 1.0 if flicker_detected else 0.0

    # IMU tremor (inverted — no tremor suggests non-human)
    weights['tremor'] = 0.10
    scores['tremor'] = 0.0 if tremor_present else 1.0

    # Normalize weights
    total_weight = sum(weights.values())

    combined = sum(
        scores[k] * weights[k] / total_weight 
        for k in scores
    )

    return combined

When LiDAR is unavailable (non-Pro iPhones, most Android devices), the weights automatically redistribute across the remaining modalities. The system degrades gracefully rather than failing.
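The redistribution is just the normalization step at work. A toy illustration (the scores here are made up):

```python
def fuse(modalities: dict) -> float:
    """modalities: name -> (score, weight). Missing modalities simply
    don't appear; the remaining weights renormalize to sum to 1."""
    total = sum(w for _, w in modalities.values())
    return sum(s * w / total for s, w in modalities.values())

full = {
    "depth":   (0.90, 0.35),
    "moire":   (0.80, 0.30),
    "flicker": (1.00, 0.15),
    "tremor":  (0.00, 0.10),
}
no_lidar = {k: v for k, v in full.items() if k != "depth"}

print(round(fuse(full), 3))      # 0.783 — all four modalities
print(round(fuse(no_lidar), 3))  # 0.709 — same logic, three modalities
```

No code path special-cases the missing sensor; dropping a modality simply rescales the remaining weights, which is what makes the degradation graceful.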

Expected Performance

| Configuration | Accuracy | False Positive Rate |
|---------------|----------|---------------------|
| LiDAR + Moiré + Flicker + IMU | >97% | <2% |
| Moiré + Flicker + IMU (no LiDAR) | >96% | <5% |
| Moiré only (minimum config) | ~96% | ~8% |

The CPP Integration: Cryptographically Binding Results

Screen detection results are meaningless if they can be tampered with after the fact. We bind them into the Content Provenance Protocol (CPP) event chain using the same cryptographic infrastructure as every other capture event.

The JSON Schema

{
  "SensorData": {
    "GPS": { "Latitude": 35.6762, "Longitude": 139.6503, "Accuracy": 5.0 },
    "Accelerometer": [0.012, -0.003, 9.801],
    "Compass": 180.5,
    "DepthAnalysis": {
      "Available": true,
      "SensorType": "LiDAR",
      "FrameTimestamp": "2026-02-14T10:30:00.123Z",
      "Resolution": { "Width": 256, "Height": 192 },
      "Statistics": {
        "MinDepth": 0.45,
        "MaxDepth": 3.82,
        "MeanDepth": 1.23,
        "StdDeviation": 0.87,
        "DepthRange": 3.37,
        "ValidPixelRatio": 0.92
      },
      "PlaneAnalysis": {
        "DominantPlaneRatio": 0.15,
        "DominantPlaneDistance": 1.05,
        "PlaneCount": 3,
        "LargestPlaneArea": 0.12
      },
      "ScreenDetection": {
        "IsLikelyScreen": false,
        "Confidence": 0.95,
        "Indicators": {
          "FlatnessScore": 0.12,
          "DepthUniformity": 0.08,
          "EdgeSharpness": 0.25,
          "ReflectivityAnomaly": false
        }
      },
      "AnalysisHash": "sha256:a1b2c3d4e5f6..."
    }
  }
}

The Cryptographic Chain

  1. Depth analysis runs during capture → JSON generated
  2. JSON is canonicalized (RFC 8785 JCS)
  3. SHA-256 hash computed over the entire event (including screen detection)
  4. Hash signed with device key (Secure Enclave on iOS, StrongBox on Android)
  5. Hash submitted to RFC 3161 TSA → timestamp token received
  6. Event inserted into hash chain (previous hash links to this event)
Event N-1 ──hash──→ Event N (with ScreenDetection) ──hash──→ Event N+1
                          │
                          ├── SHA-256 hash
                          ├── ES256 signature (Secure Enclave)
                          └── RFC 3161 timestamp token
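Steps 1–3 and 6 can be sketched end-to-end. Signing (step 4) and the RFC 3161 TSA round-trip (step 5) are stubbed out here, and `canonical` only approximates RFC 8785 with sorted keys and minimal separators:

```python
import hashlib
import json

def canonical(event: dict) -> bytes:
    # Approximation of RFC 8785 JCS for this sketch
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode()

def chain_append(chain: list, event: dict) -> dict:
    """Link an event to its predecessor by hash (signing/TSA omitted)."""
    event = dict(event)
    event["PreviousHash"] = chain[-1]["EventHash"] if chain else "genesis"
    # Hash covers everything except the EventHash field itself
    event["EventHash"] = hashlib.sha256(canonical(event)).hexdigest()
    chain.append(event)
    return event

def verify(chain: list) -> bool:
    prev = "genesis"
    for ev in chain:
        body = {k: v for k, v in ev.items() if k != "EventHash"}
        if ev["PreviousHash"] != prev:
            return False
        if hashlib.sha256(canonical(body)).hexdigest() != ev["EventHash"]:
            return False
        prev = ev["EventHash"]
    return True

chain = []
chain_append(chain, {"Type": "Capture"})
chain_append(chain, {"Type": "ScreenDetection", "IsLikelyScreen": False})
print(verify(chain))                   # True

chain[1]["IsLikelyScreen"] = True      # flip the verdict after the fact
print(verify(chain))                   # False — the hash chain breaks
```

In the real pipeline, the signature and timestamp token add the second and third guarantees on top of this chain linkage.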

The result: modifying the screen detection verdict after the fact would break the hash chain, invalidate the signature, and conflict with the RFC 3161 timestamp. Three independent cryptographic guarantees.


Cross-Platform Compatibility: The Hard Part

VeraSnap runs on both iOS and Android. Making screen detection work identically across platforms was one of the hardest engineering challenges in the project.

The Sensor Landscape

iOS is relatively homogeneous — LiDAR exists on Pro models, TrueDepth on all models (front camera). Android is a zoo:

| Sensor Type | Platforms | CPP SensorType |
|-------------|-----------|----------------|
| LiDAR (dToF) | iPhone Pro, iPad Pro | LiDAR |
| TrueDepth (structured light) | iPhone front camera | TrueDepth |
| ToF | Samsung Galaxy S20+/Ultra, Huawei P30 Pro, Sony Xperia | ToF |
| Structured Light | Google Pixel 4/4 XL | StructuredLight |
| Stereo (dual camera) | Many dual-camera Android devices | Stereo |
| None | Budget phones, older devices | Unavailable |

CPP v1.4's Platform-Agnostic Approach

The CPP specification defines sensor types abstractly, so iOS-generated proofs verify correctly on Android and vice versa:

// iPhone 15 Pro capture
{ "SensorType": "LiDAR", "Statistics": { "StdDeviation": 0.87 } }

// Samsung Galaxy S22 Ultra capture  
{ "SensorType": "ToF", "Statistics": { "StdDeviation": 0.91 } }

// Budget Android capture
{ "SensorType": "Unavailable", "UnavailableReason": "SENSOR_NOT_AVAILABLE" }

The verifier doesn't need to know or care about platform-specific implementation details. It sees a standardized JSON schema and applies the same validation logic.

Android Camera2 API: Depth Data Access

// Android (Kotlin) — Accessing ToF/depth data via Camera2
class DepthCaptureSession(private val cameraManager: CameraManager) {

    fun startDepthCapture(cameraId: String) {
        val characteristics = cameraManager.getCameraCharacteristics(cameraId)

        // Check for depth sensor capability
        val capabilities = characteristics.get(
            CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES
        )
        val hasDepth = capabilities?.contains(
            CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES_DEPTH_OUTPUT
        ) == true

        if (!hasDepth) {
            // Fall back to software-only detection
            return startSoftwareOnlyDetection()
        }

        // Configure depth ImageReader
        val depthReader = ImageReader.newInstance(
            DEPTH_WIDTH, DEPTH_HEIGHT,
            ImageFormat.DEPTH16,  // 16-bit depth in millimeters
            2  // maxImages
        )

        depthReader.setOnImageAvailableListener({ reader ->
            val image = reader.acquireLatestImage() ?: return@setOnImageAvailableListener
            val depthMap = processDepth16(image)
            val analysis = analyzeDepth(depthMap)
            image.close()
        }, backgroundHandler)

        // Create capture session with both color and depth outputs
        val surfaces = listOf(colorSurface, depthReader.surface)
        cameraDevice.createCaptureSession(surfaces, sessionCallback, null)
    }

    private fun processDepth16(image: Image): FloatArray {
        val plane = image.planes[0]
        val buffer = plane.buffer.asShortBuffer()
        val depths = FloatArray(buffer.remaining())

        for (i in depths.indices) {
            val raw = buffer.get(i).toInt() and 0xFFFF
            // DEPTH16: lower 13 bits = depth in mm, upper 3 bits = confidence
            // (0 = maximum confidence, 1 = lowest; see ImageFormat.DEPTH16)
            val depthMm = raw and 0x1FFF
            val confidence = (raw shr 13) and 0x07

            depths[i] = if (confidence != 1 && depthMm > 0) {
                depthMm.toFloat() / 1000.0f  // Convert to meters
            } else {
                Float.NaN  // Invalid or lowest-confidence sample
            }
        }

        return depths
    }
}

JSON Compatibility: The Silent Killer

A subtle but critical issue: JSON floating-point representation must be identical across platforms or hash verification breaks. We use RFC 8785 JSON Canonicalization Scheme (JCS) to ensure:

  • Numbers use shortest representation (1.23 not 1.230000)
  • Keys are sorted lexicographically
  • No trailing commas, no comments
  • UTF-8 encoding normalized
// iOS — JCS canonicalization
func canonicalize(_ json: Any) -> Data {
    // RFC 8785: deterministic JSON serialization
    let options: JSONSerialization.WritingOptions = [.sortedKeys]
    let data = try! JSONSerialization.data(withJSONObject: json, options: options)
    // Additional JCS normalization for numbers...
    return jcsNormalize(data)
}
// Android — Must produce identical output
fun canonicalize(json: JSONObject): ByteArray {
    // Same RFC 8785 implementation
    val sorted = sortKeysRecursively(json)
    return jcsSerialize(sorted).toByteArray(Charsets.UTF_8)
}

If the iOS implementation serializes 0.95 and Android serializes 0.949999988079071 (the Float64 value you get after widening a Float32), the SHA-256 hash won't match and cross-platform verification fails. We test this extensively in CI.
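You can reproduce the failure mode in a few lines of Python, using a JCS-style approximation (sorted keys, minimal separators, shortest round-trip float formatting):

```python
import hashlib
import json
import struct

def canonical(obj) -> bytes:
    # JCS-style approximation: sorted keys, minimal separators,
    # Python's shortest round-trip float formatting
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

conf64 = 0.95                                           # computed in Float64
conf32 = struct.unpack("f", struct.pack("f", 0.95))[0]  # routed through Float32

print(conf32)  # 0.949999988079071 — the widened Float32 value

h64 = hashlib.sha256(canonical({"Confidence": conf64})).hexdigest()
h32 = hashlib.sha256(canonical({"Confidence": conf32})).hexdigest()
print(h64 == h32)  # False — cross-platform verification would fail
```

The fix is to pin every numeric field to a single precision before serialization on both platforms, then let the JCS shortest-representation rule produce identical bytes.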


Known Limitations and Honest Assessment

We follow the CPP philosophy of "Provenance ≠ Truth". Screen detection is probabilistic, not deterministic. Here's what we're transparent about:

False Positives

Flat artwork and printed photos can trigger screen detection. A large flat painting or a glossy photograph on a table shows the same depth uniformity as a screen. We mitigate this with reflectivity analysis (screens have different IR characteristics than printed paper), but it's not perfect.

Recommendation: For high-stakes legal evidence, always include the Confidence value and note that human review is recommended when confidence is below 0.80.

Evasion Attacks

A sophisticated attacker could:

  1. Display content on a curved screen (defeating depth uniformity)
  2. Place objects at varying distances in front of the screen
  3. Use a projector onto an irregular surface
  4. Apply anti-moiré filters to the display

We don't claim screen detection is foolproof. It raises the bar significantly — from "trivially easy" to "requires specialized equipment and knowledge."

Device Coverage

LiDAR is only on iPhone Pro and iPad Pro. ToF sensors on Android are becoming rarer (Samsung removed them from the S23 line). Software-only detection (moiré + flicker) remains the realistic option for most devices. We handle this gracefully:

{
  "DepthAnalysis": {
    "Available": false,
    "SensorType": "Unavailable",
    "UnavailableReason": "SENSOR_NOT_AVAILABLE"
  },
  "screen_detection": {
    "moire_analysis": {
      "score": 0.05,
      "model_version": "moiredet-v1.2-mobilenetv2",
      "is_screen_capture": false,
      "confidence": 0.95
    },
    "flicker_analysis": {
      "detected": false
    },
    "combined_screen_score": 0.04
  }
}

Why This Matters: The EU AI Act Connection

This isn't just a technical exercise. EU AI Act Article 50 mandates that AI-generated content be marked in a machine-readable format by August 2, 2026. The European Commission's draft Code of Practice explicitly calls for a multi-layered approach including:

  1. Cryptographic metadata (C2PA-style Content Credentials)
  2. Imperceptible watermarks (frequency-domain embedding)
  3. Fingerprinting/logging (fallback when metadata is stripped)

But none of these layers address the analog hole. You can watermark an AI-generated image all you want — once it's displayed on a screen and re-photographed, the watermark is destroyed and the new photo gets clean provenance credentials.

Screen detection closes this gap. It's the complement to AI content marking — one system says "this was AI-generated," the other says "this was captured from a real 3D scene."

The penalty for Article 50 non-compliance? Up to €15 million or 3% of global annual turnover. That's a powerful incentive for enterprises to adopt capture-time provenance verification.


Open Standard, Not Walled Garden

Everything described in this article is specified in the Content Provenance Protocol (CPP) v1.4–v1.5, published as an IETF Internet-Draft (draft-vso-cpp-core). The screen detection extension is fully documented with:

  • JSON schema definitions
  • Reference algorithm implementations
  • Calibration data
  • Verification procedures

The specification is open and the GitHub repos are public.

We believe content provenance is infrastructure, not a competitive moat. The more implementations adopt CPP's screen detection schema, the more valuable the ecosystem becomes for everyone.


What's Next

We're actively working on:

  1. Android Key Attestation integration — proving the detection ran on genuine hardware, not an emulator
  2. zk-img protocol — zero-knowledge proofs that verify screen detection results without revealing the underlying depth data
  3. Adversarial training — continuously updating our moiré model with GAN-generated de-moiréing attacks
  4. C2PA conformance — mapping CPP screen detection results to C2PA assertion types for interoperability

If you're building content provenance tooling and want to integrate screen detection, the CPP spec is your starting point. PRs welcome.


Try It

VeraSnap is live on both platforms:

  • iOS: App Store (LiDAR on Pro models, software detection on all)
  • Android: Google Play (ToF where available, software detection on all)

Take a photo of your monitor. Then take a photo of your desk. Compare the DepthAnalysis in the proof JSON. The difference is dramatic.


VeraSnap is developed by VeritasChain Co., Ltd. The Content Provenance Protocol is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
