DEV Community: sentinel-safety

False Positives in Child Safety AI: Architecture Tradeoffs and Why They Matter

sentinel-safety — Sun, 26 Apr 2026 10:56:39 +0000

Every time a child safety system flags the wrong person, trust in the entire system erodes. A teenager falsely banned from a platform they use to talk to friends. A teacher wrongly suspended from an educational tool. An adult gamer kicked out of a community they've been part of for years.

False positives in child safety moderation are not just technical errors. They're injustices that fall disproportionately on specific groups, create legal liability, and undermine the social license that makes any safety system viable long-term.

This post is about the false positive problem in child safety AI — what causes it, how different system architectures handle it, and why we at SENTINEL made specific engineering choices around it.

Two categories of false positives

Child safety AI has two distinct false positive problems that are often conflated:

Statistical false positives — the model is wrong on individual cases. Every classifier has a false positive rate. At scale, even a 0.1% FP rate means thousands of wrongly flagged users per day on a large platform.

Systemic false positives — the model is wrong on specific groups at higher rates than others. This is the demographic bias problem: a model trained on a non-representative dataset may flag Black users, LGBTQ+ users, non-native English speakers, or users with non-standard communication styles at rates significantly higher than their actual risk.

These are related but different problems. A model can have good overall accuracy while still systematically harming specific communities. Aggregate accuracy metrics hide demographic disparities, because a small group's elevated error rate barely moves the overall average.
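A small worked example makes the point concrete. The sketch below uses made-up counts for two hypothetical groups, not SENTINEL data. It shows how a pooled false positive rate can look acceptable while one group's rate is many times the other's.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Counts of benign users in one group (illustrative numbers, not real data)."""
    false_positives: int  # benign users the model flagged
    true_negatives: int   # benign users the model left alone

    @property
    def fpr(self) -> float:
        # FPR = FP / (FP + TN): the share of benign users who get flagged
        return self.false_positives / (self.false_positives + self.true_negatives)


def aggregate_fpr(groups: dict) -> float:
    """Pooled FPR across all groups, the number a single headline metric reports."""
    fp = sum(g.false_positives for g in groups.values())
    tn = sum(g.true_negatives for g in groups.values())
    return fp / (fp + tn)


# Hypothetical platform: 90,000 benign users in one group, 10,000 in another
groups = {
    "majority": GroupCounts(false_positives=45, true_negatives=89_955),
    "minority": GroupCounts(false_positives=50, true_negatives=9_950),
}

print(f"aggregate FPR: {aggregate_fpr(groups):.3%}")
for name, g in groups.items():
    print(f"{name} FPR: {g.fpr:.3%}")
```

The aggregate rate comes out at 0.095%, under the 0.1% figure mentioned earlier. Meanwhile the minority group's rate (0.500%) is ten times the majority's (0.050%).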

Why this is harder in behavioral detection

Keyword filters have obvious FP sources: a mention of "grooming" in a dog care context, a word with double meaning. The failure mode is legible.

Behavioral detection is more complex. SENTINEL's four signal types each have their own FP patterns:

Linguistic signals — conversation style shifts that resemble grooming escalation can occur in completely legitimate contexts: a mentor becoming more informal with a mentee over time, a coach developing a closer relationship with an athlete, a tutor's communication adapting to a student's level.

Graph signals — an adult who messages many young users might be a coach, teacher, or community organizer, not a predator. Coordinated contact patterns that look suspicious in isolation might be a team announcement or event notification.

Temporal signals — contact frequency increases that look like escalation might just be a growing friendship or project collaboration. Cross-session dynamics that match grooming velocity might match entirely benign relationship development.

Fairness signals — these are the audit mechanism, not a detection signal. They catch the other three when they have disparate impact.

The pattern that distinguishes grooming from legitimate relationship development is context-dependent and multi-signal. No single signal is sufficient for a flag, let alone an action.

How SENTINEL is architected for this

Human-in-the-loop by design, not by accident. SENTINEL does not auto-ban. Ever. Every flag routes to a human moderator with a plain-language explanation of exactly which behavioral signals triggered the score and why. The moderator sees the reasoning chain, not a number. A false positive that reaches a moderator who reviews context and clears it is vastly preferable to an automated action on a real person's account.

This is a deliberate architectural choice with real costs. Routing to humans is slower and more expensive than auto-action. We think those costs are worth paying.

Explainability as a FP mitigation tool. When a moderator can see that a flag was triggered by Signals A, B, and C, they can make a contextual judgment: "Signals A and B are present, but in context, they're explained by the user's role as a community manager. Signal C is unusual but doesn't fit the grooming pattern when viewed alongside the account history." Opaque scores don't enable this. Explainability does.
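One way to generate that reasoning chain is to keep each signal's contribution next to the score. The sketch below is hypothetical: the additive scoring, the signal names, and the wording are illustrative assumptions, not SENTINEL's actual schema.

```python
from dataclasses import dataclass


@dataclass
class SignalContribution:
    signal: str     # hypothetical signal identifier
    points: float   # contribution to the 0-100 risk score
    evidence: str   # plain-language reason shown to the moderator


def explain(contributions: list) -> tuple:
    """Return (score, explanation lines), largest contribution first.

    Assumes an additive score capped at 100, purely for illustration.
    """
    score = min(100.0, sum(c.points for c in contributions))
    ordered = sorted(contributions, key=lambda c: c.points, reverse=True)
    lines = [f"{c.signal} (+{c.points:.0f}): {c.evidence}" for c in ordered]
    return score, lines


score, reasons = explain([
    SignalContribution("temporal.contact_frequency", 12,
                       "messages per day rose from 2 to 15 over three weeks"),
    SignalContribution("graph.many_minor_contacts", 20,
                       "account messages 40 users under 16"),
    SignalContribution("linguistic.secrecy_requests", 25,
                       "asked the recipient twice to keep the conversation private"),
])
print(f"risk score {score:.0f}/100")
for line in reasons:
    print(" -", line)
```

A moderator reading this output can clear or escalate line by line. The graph signal might be fully explained by a coach's account, while the secrecy requests are not.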

The fairness gate. Before any detection model deploys in SENTINEL, it must pass a demographic parity audit. The audit tests whether the model produces significantly different false positive rates across demographic groups (where demographic signals are present in the training data). If it does, meaning one group is flagged at a rate disproportionate to its actual risk, the model cannot ship. (Strictly speaking, comparing false positive rates is a predictive-equality check; "demographic parity" in the fairness literature usually means equal flag rates regardless of actual risk.)

This is a hard gate. Not a soft recommendation. Not a documented exception. The model doesn't deploy.

This solves the systemic FP problem at the deployment level rather than the post-deployment mitigation level.
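Here is a minimal sketch of such a gate. It assumes the gate compares two groups' false positive rates with a two-proportion z-test and blocks a model only when the gap is both statistically significant and practically large. The test, the 0.05 significance level, and the 1.25 ratio limit are assumptions for illustration, not SENTINEL's documented criteria.

```python
import math


def fpr_gap_p_value(fp_a: int, neg_a: int, fp_b: int, neg_b: int) -> float:
    """Two-sided p-value for H0: both groups share the same false positive rate.

    fp_x = benign users flagged in group x; neg_x = all benign users in group x.
    """
    pooled = (fp_a + fp_b) / (neg_a + neg_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation to test
    se = math.sqrt(pooled * (1 - pooled) * (1 / neg_a + 1 / neg_b))
    z = (fp_a / neg_a - fp_b / neg_b) / se
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def fairness_gate(fp_a, neg_a, fp_b, neg_b, alpha=0.05, max_ratio=1.25) -> bool:
    """Return True if the model may deploy under these (hypothetical) criteria."""
    rate_a, rate_b = fp_a / neg_a, fp_b / neg_b
    low, high = sorted((rate_a, rate_b))
    ratio = math.inf if low == 0 and high > 0 else (high / low if low else 1.0)
    significant = fpr_gap_p_value(fp_a, neg_a, fp_b, neg_b) < alpha
    return not (significant and ratio > max_ratio)


# A 10x FPR gap on large samples fails the gate; a 2% relative gap passes
print(fairness_gate(45, 90_000, 50, 10_000))  # False
print(fairness_gate(50, 10_000, 51, 10_000))  # True
```

Requiring both conditions keeps the gate from blocking on negligible differences that happen to be significant in very large samples. The flip side is that a small group can hide a large gap behind a non-significant test, so a minimum sample size per group matters too.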

Risk score range with explicit uncertainty. SENTINEL returns a 0-100 risk score. The score is accompanied by a plain-language explanation that names the specific signals and their contribution to the score. Moderators are trained to treat mid-range scores (roughly 40-70) as "review carefully" rather than "act immediately." High scores (80+) still route to human review — they're just prioritized.
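The routing policy above can be sketched as a priority queue. Every flag reaches a human, and the score changes only the review order and the guidance shown. The 40-70 and 80+ bands come from the text. The label for other scores, and the class and function names, are assumptions.

```python
import heapq
import itertools


def guidance_for(score: int) -> str:
    if score >= 80:
        return "prioritized review"
    if 40 <= score <= 70:
        return "review carefully: weigh context before acting"
    return "standard review"  # assumption: the text names no band for other scores


class ModerationQueue:
    """Every flag goes to a human reviewer; no branch takes automated action."""

    def __init__(self) -> None:
        self._heap: list = []
        self._arrival = itertools.count()  # keeps FIFO order within a tier

    def push(self, flag_id: str, score: int) -> None:
        tier = 0 if score >= 80 else 1  # 80+ is reviewed first
        entry = (tier, next(self._arrival), flag_id, score, guidance_for(score))
        heapq.heappush(self._heap, entry)

    def pop(self) -> tuple:
        _, _, flag_id, score, guidance = heapq.heappop(self._heap)
        return flag_id, score, guidance


queue = ModerationQueue()
queue.push("flag-a", 55)
queue.push("flag-b", 92)
queue.push("flag-c", 30)
print(queue.pop())  # ('flag-b', 92, 'prioritized review')
```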

The honest v1 disclosure

We don't have production false positive rates to share. SENTINEL v1 was released this week. The v1 synthetic dataset of ~50 labeled conversations was used for initial model validation, not to generate production-representative accuracy statistics.

Anyone claiming production FP rates for a v1 system without production deployments is making up numbers. We're not doing that.

What we can say honestly:

The system is designed to route to humans, so a statistical FP becomes a human review task rather than an automated ban
The fairness gate prevents models with demographic disparities from deploying
The explainability layer enables moderators to identify and clear FPs efficiently
We will publish real-world FP data as production deployments generate it

This is the correct v1 posture. Platforms evaluating SENTINEL should weight the system architecture — how it handles uncertainty and error — more than accuracy numbers that don't exist yet.

The precision-recall tradeoff in child safety contexts

In most classification problems, you tune the precision-recall tradeoff based on the relative cost of FPs versus false negatives (FNs). The tradeoff in child safety is asymmetric and context-dependent:

False negatives (missing a real grooming case) have potentially catastrophic consequences for the child involved. False positives have serious but different consequences: platform trust, legal liability, harm to the incorrectly flagged user.

The right tradeoff point depends on the downstream action. If a flag means auto-ban, the cost of FPs is very high and you want high precision. If a flag means human review, the cost of FPs is much lower and you can afford higher recall — catching more real cases at the cost of more human review cycles.

SENTINEL's human-in-the-loop architecture shifts the optimal operating point. Higher recall at moderate precision is the right operating mode when the cost of a FP is "a human reviews and clears it" rather than "the user is auto-banned."
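Standard decision theory makes this shift explicit. Let p be a calibrated probability that a flagged case is real. Flagging then minimizes expected cost when p * C_FN > (1 - p) * C_FP, which gives a threshold of C_FP / (C_FP + C_FN). The sketch below uses arbitrary illustrative cost units, not measured values. It also assumes calibrated probabilities, which a raw risk score may not provide.

```python
def flag_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which flagging has lower expected cost than not flagging.

    Flag when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn).
    Assumes p is a calibrated probability that the case is real.
    """
    return cost_fp / (cost_fp + cost_fn)


COST_FN = 1000.0  # illustrative cost of missing a real case

auto_ban = flag_threshold(cost_fp=200.0, cost_fn=COST_FN)    # a wrongful ban is costly
human_review = flag_threshold(cost_fp=5.0, cost_fn=COST_FN)  # a reviewer clears it
print(f"auto-ban threshold:     {auto_ban:.3f}")      # 0.167
print(f"human-review threshold: {human_review:.3f}")  # 0.005
```

Lowering the cost of a false positive lowers the threshold. That lower threshold is the higher-recall operating point described above.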

What platforms actually need to track

When you deploy SENTINEL, the metrics that matter aren't just the model's FP rate. They're:

Moderator override rate — what percentage of SENTINEL flags do moderators clear? High override rates signal FPs the model is generating consistently. Low override rates validate that flags are mostly actionable.
Time-to-clear on FPs — how quickly can moderators identify and clear a wrong flag? Short time means explainability is working.
FP rate by user segment — are any user groups being flagged at rates that don't match their actual risk profile? This is your fairness monitoring loop.
Recall at platform confidence level — of the cases that eventually resulted in moderator action, what percentage did SENTINEL flag first?
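Once production data exists, all four metrics can be computed from moderation-queue records. The record shape below is hypothetical. A queue only sees flagged users, so the per-segment figure uses cleared flags per user in the segment as a proxy for FP rate. That proxy is reasonable when nearly all users are benign.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median


@dataclass
class ReviewRecord:
    case_id: str
    segment: str                  # user segment used for fairness monitoring
    sentinel_flagged_first: bool  # SENTINEL flagged it before any other source
    moderator_actioned: bool      # True = action taken, False = cleared
    minutes_to_decision: float


def moderation_metrics(records: list, segment_population: dict) -> dict:
    flagged = [r for r in records if r.sentinel_flagged_first]
    cleared = [r for r in flagged if not r.moderator_actioned]
    actioned = [r for r in records if r.moderator_actioned]
    cleared_per_segment = Counter(r.segment for r in cleared)
    return {
        # Share of SENTINEL flags that moderators cleared
        "override_rate": len(cleared) / len(flagged) if flagged else None,
        # Median minutes to clear a wrong flag
        "time_to_clear": median(r.minutes_to_decision for r in cleared) if cleared else None,
        # Cleared flags per user in each segment (FP-rate proxy)
        "fp_rate_by_segment": {
            seg: cleared_per_segment[seg] / pop for seg, pop in segment_population.items()
        },
        # Of all cases that led to action, the share SENTINEL flagged first
        "recall": (sum(r.sentinel_flagged_first for r in actioned) / len(actioned)
                   if actioned else None),
    }
```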

These metrics require production data and a functioning moderation queue. They're the instrumentation we're building with early adopters.

Where we're going

For v2, we're planning:

Active learning pipeline to improve model accuracy with production data while preserving privacy
Calibrated confidence intervals on risk scores (not just a point estimate)
Per-platform fairness calibration (different user demographics may require different threshold settings)
Published benchmark comparisons with keyword-filter baselines on the open research dataset

The FP problem in child safety AI won't be solved by any single v1 release. It's a continuous calibration problem that requires production data, iterative improvement, and honest reporting. We're committed to that process.

GitHub: https://github.com/sentinel-safety/SENTINEL

Free for platforms under $100k annual revenue. If you're building in this space and want to be part of the early production feedback loop, reach out at sentinel.childsafety@gmail.com.

COPPA Compliance for Platform Developers: What the Law Actually Requires and How to Build It

sentinel-safety — Sun, 26 Apr 2026 10:51:07 +0000

The enforcement trend is clear. Epic Games: $275 million. Microsoft (Xbox): $20 million. YouTube: $170 million. These are the FTC's signal that COPPA enforcement is no longer theoretical.

If your platform has users (or could have users) under 13 — a game, a forum, an educational tool, a community app — COPPA applies to you. "We don't target kids" is not a defense once you have actual knowledge of under-13 users. In 2026, between state-level laws proliferating and the FTC's increased scrutiny of the gaming sector specifically, the "too small to worry about it" era is over.

This post covers what COPPA actually requires and how to implement it as engineering infrastructure — not legal theory, but the specific systems you need to build.

What COPPA actually covers

COPPA (Children's Online Privacy Protection Act, 15 U.S.C. § 6501 et seq.) applies to two categories of operators:

Online services "directed to children under 13" — determined by the FTC based on subject matter, visual content, use of animated characters, music, and similar factors.
Any operator with "actual knowledge" that they're collecting personal information from children under 13.

The "actual knowledge" standard is the one that catches most platforms. If a user tells you they're 10, you have actual knowledge. If your platform's content obviously attracts children (Minecraft mods, Roblox plugins, educational tools), the FTC may treat the service as directed to children under the multi-factor test above, even without explicit age statements. COPPA itself does not use a constructive-knowledge standard.

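Personal information under COPPA is broader than most developers expect. It includes:

- first and last name
- physical address
- online contact information such as an email address
- phone number
- Social Security number
- a screen or user name that functions as online contact information
- persistent identifiers (device IDs, cookies) that can recognize a user over time and across services
- geolocation precise enough to identify a street name and city or town
- photos, videos, or audio files containing a child's image or voice
- information about the child or parent that is combined with any of the above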

The 5 core COPPA obligations

1. Verifiable parental consent before collection

Before collecting, using, or disclosing any personal information from a child under 13, you must obtain verifiable consent from the parent. "Verifiable" means using a method reasonably calculated to ensure the person providing consent is actually the parent.

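Critically: you cannot collect personal information from an under-13 user and then seek consent afterward. Consent must precede collection. The only exceptions are the narrow ones in §312.5(c), such as collecting a parent's contact information in order to request consent.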

2. A COPPA-compliant privacy policy

Your privacy policy must clearly describe what personal information you collect from children, how you use it, whether you disclose it to third parties (and who those parties are), and how parents can review and delete their child's information. Linking to a general privacy policy buried in your footer is not sufficient.

3. Data minimization

You may not condition a child's participation in an activity on collecting more personal information than is reasonably necessary. If a child wants to play a game, you cannot require their birthdate, phone number, and photo as a condition of participation.

4. The parental dashboard: review, delete, and withdraw consent

Parents must be able to review what personal information you've collected about their child, delete it, and withdraw consent (with deletion following withdrawal). This requires a parental identity verification flow and a dashboard with real deletion capability — not soft deletion, but actual erasure from your systems and backups with a documented retention schedule.

5. Data security

You must maintain reasonable procedures to protect the confidentiality, security, and integrity of personal information collected from children. The FTC expects encryption in transit and at rest, access controls, and documented security practices.

The age determination problem

This is the hardest engineering problem in COPPA compliance — and where most platforms fail.

Self-declaration ("enter your age") does not protect you on its own. Children know to lie about their age. A self-declared adult age also does not cancel actual knowledge you gain in other ways, such as a user mentioning their fifth-grade class in chat. If your platform attracts children and you're relying on users to truthfully report their age, you're exposed.

Age gates are a speed bump, not a barrier. Asking users to confirm they're 13+ before account creation provides minimal legal cover and no actual protection.

What actually works:

Option A: Default to children-first design. Implement COPPA's protections for all users. No personal information collection, no behavioral tracking, parental consent for account creation. This eliminates the classification problem at the cost of reduced functionality for adult users.

Option B: Age-neutral data collection. Don't collect personal information that would trigger COPPA from anyone. Increasingly common for smaller platforms.

Option C: Age verification gate with parental consent flow. Users who provide a birthdate indicating under-13 are routed to a parental consent flow before any personal information is collected. This works only if you treat any indication of under-13 status as triggering — including users who say they're 11 and then continue using the platform.
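Option C's key rule is that any indication of under-13 status triggers the consent flow, and a later claim cannot undo it. That rule can be sketched as a sticky flag. The names are hypothetical, and how you detect age indications after signup is left to your platform.

```python
from dataclasses import dataclass
from datetime import date

COPPA_AGE = 13


def age_on(birthdate: date, today: date) -> int:
    """Whole years of age on `today`."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1  # birthday hasn't happened yet this year
    return years


@dataclass
class AccountAgeState:
    declared_birthdate: date
    under13_indicated: bool = False  # sticky: never reset by a later age claim
    parental_consent_verified: bool = False

    def record_age_indication(self, indicated_age: int) -> None:
        """Call for any later age signal: profile edits, support tickets, self-reports."""
        if indicated_age < COPPA_AGE:
            self.under13_indicated = True


def route(state: AccountAgeState, today: date) -> str:
    if age_on(state.declared_birthdate, today) < COPPA_AGE:
        state.under13_indicated = True
    if state.under13_indicated and not state.parental_consent_verified:
        # Collect nothing beyond what the consent request itself needs
        return "parental_consent_flow"
    return "proceed"
```

The user who signs up as an adult and later says they're 11 stays routed to the consent flow, even if they claim 25 afterward.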

Verifiable parental consent: 6 FTC-approved methods

The FTC's COPPA Rule §312.5(b)(2) approves these consent mechanisms:

1. Signed consent form. Email the form, require a signed copy returned by mail, fax, or scan. Low conversion, high friction, legally bulletproof.

2. Credit or debit card verification. Use a card transaction with real-time notification to the cardholder. The assumption: only adults hold credit cards. Commonly implemented via Stripe with a $0.50 hold that's immediately refunded.

3. Toll-free number staffed by trained personnel. Parent calls, speaks with a human who verifies consent. High cost, doesn't scale, but explicitly FTC-approved.

4. Video conference. Live video session with trained staff. Same scaling constraints.

5. Government-issued photo ID with destruction guarantee. Parent submits ID, you verify age and destroy the image. High friction, significant data liability.

6. Knowledge-based authentication. The parent answers questions about their financial history or public records (similar to bank ID verification). Services like LexisNexis Risk Solutions provide these flows.

For internal operations only (no disclosure to third parties, no behavioral profiling), COPPA allows a simplified path: email plus confirmation. This is the lowest-friction compliant option for read-only or minimal-data platforms.

Practical recommendation for most indie platforms: implement credit card verification (method 2) as your primary path, with the signed form (method 1) as a fallback. This gives you defensible consent with reasonable UX friction.
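Consent has to be tracked as state, not as a one-time checkbox. Collection is allowed only after consent is verified, and withdrawal must trigger deletion. Below is a minimal sketch with hypothetical names. The method labels loosely mirror the list above, and the deletion callback stands in for your real deletion pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, Optional


class ConsentStatus(Enum):
    PENDING = "pending"
    VERIFIED = "verified"
    WITHDRAWN = "withdrawn"


class ConsentError(Exception):
    """Raised when a child's data would be collected without verified consent."""


@dataclass
class ParentalConsent:
    child_account_id: str
    status: ConsentStatus = ConsentStatus.PENDING
    method: Optional[str] = None  # e.g. "card_transaction", "signed_form" (hypothetical labels)
    history: list = field(default_factory=list)  # (UTC timestamp, event) audit trail

    def _log(self, event: str) -> None:
        self.history.append((datetime.now(timezone.utc), event))

    def verify(self, method: str) -> None:
        self.status, self.method = ConsentStatus.VERIFIED, method
        self._log(f"verified via {method}")

    def withdraw(self, delete_child_data: Callable[[str], None]) -> None:
        self.status = ConsentStatus.WITHDRAWN
        self._log("withdrawn")
        delete_child_data(self.child_account_id)  # deletion follows withdrawal


def require_consent(consent: ParentalConsent) -> None:
    """Guard to call before collecting any personal information from the child."""
    if consent.status is not ConsentStatus.VERIFIED:
        raise ConsentError(f"no verified consent for {consent.child_account_id}")
```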

Data minimization in practice

Separate your data pipelines. Under-13 users must have data flows that exclude analytics, behavioral tracking, advertising profiling, and any third-party data sharing. If you're using a third-party analytics SDK, that SDK must not receive under-13 users' data — which means instrumenting your analytics to exclude COPPA-protected users, not just abstractly promising not to share their data.
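In code, excluding COPPA-protected users from analytics can be a gate that every analytics call passes through. That way exclusion doesn't depend on each call site remembering to check. The sketch below assumes a hypothetical `AnalyticsSink` interface standing in for whatever third-party SDK you use.

```python
from typing import Callable, Optional, Protocol


class AnalyticsSink(Protocol):
    """Stand-in for a third-party analytics SDK client (hypothetical interface)."""

    def track(self, user_id: str, event: str, properties: dict) -> None: ...


class CoppaGatedAnalytics:
    """Single choke point: events for COPPA-protected users never reach the sink."""

    def __init__(self, sink: AnalyticsSink, is_coppa_protected: Callable[[str], bool]):
        self._sink = sink
        self._is_protected = is_coppa_protected

    def track(self, user_id: str, event: str, properties: Optional[dict] = None) -> bool:
        try:
            protected = self._is_protected(user_id)
        except Exception:
            protected = True  # fail closed if the age-status lookup errors
        if protected:
            return False
        self._sink.track(user_id, event, properties or {})
        return True
```

Route every call through one wrapper, and don't give application code a direct handle to the SDK. That is what makes the exclusion auditable.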

Persistent identifiers are personal information. Device IDs, advertising IDs, session tokens that persist across sessions — all are personal information under COPPA when associated with a known or suspected child user. Your under-13 user architecture may need different identifier strategies than your adult user architecture.

Session isolation. Conversation data, gameplay data, and behavioral signals from under-13 users have different retention obligations. Build these as separate data classes with separate deletion triggers from the start.

The deletion obligation engineering spec

The parental review and deletion right requires real infrastructure:

Parental account linking — a system that cryptographically links the parent's verified identity to the child's account.

Complete inventory — you must know everything you've collected. If a child's data is in your analytics platform, your moderation logs, your behavioral model training sets, your backups, your CDN cache, and your primary database — your deletion flow must reach all of it.

Deletion vs. anonymization — COPPA requires deletion of personal information, but allows retention of de-identified aggregate data. Build your architecture to support true deletion of PII while preserving anonymized behavioral aggregates that don't re-identify.

Documented retention schedule — some data (moderation records, abuse reports, NCMEC CyberTipline evidence packages) may have legal retention obligations that override COPPA's deletion right. Document these exceptions explicitly.
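A deletion flow that reaches every store is easiest to enforce with a registry. Every data store that can hold a child's data registers a deleter. A request then fans out to all of them while honoring documented legal holds. The sketch below is hypothetical; store names, hold reasons, and the receipt format are illustrative. Backups usually need a separate mechanism, such as expiry on a documented schedule, which is not shown.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class DeletionReceipt:
    child_id: str
    deleted: list = field(default_factory=list)  # stores that confirmed deletion
    held: dict = field(default_factory=dict)     # store -> documented legal reason
    failed: dict = field(default_factory=dict)   # store -> error, to retry
    completed_at: Optional[datetime] = None


class DeletionOrchestrator:
    def __init__(self) -> None:
        self._stores: dict = {}  # store name -> deleter(child_id)
        self._holds: dict = {}   # (child_id, store name) -> legal reason

    def register_store(self, name: str, deleter: Callable[[str], None]) -> None:
        self._stores[name] = deleter

    def place_hold(self, child_id: str, store: str, reason: str) -> None:
        self._holds[(child_id, store)] = reason

    def delete_child(self, child_id: str) -> DeletionReceipt:
        receipt = DeletionReceipt(child_id)
        for name, deleter in self._stores.items():
            reason = self._holds.get((child_id, name))
            if reason:
                receipt.held[name] = reason  # retained under a documented exception
                continue
            try:
                deleter(child_id)
                receipt.deleted.append(name)
            except Exception as exc:
                receipt.failed[name] = repr(exc)
        receipt.completed_at = datetime.now(timezone.utc)
        return receipt
```

The receipt doubles as documentation. It shows what was erased, what was retained and why, and what still needs a retry.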

The safety gap COPPA doesn't close

COPPA is a privacy and data handling law. It tells you what to collect, how to collect it, and how to delete it. It says almost nothing about what you must do to detect harm while that data is active.

This is the gap where children are most at risk. A grooming predator targeting a child on your platform:

Generates data you're permitted to collect under COPPA
Operates through conversations that appear benign in isolation
Takes weeks or months to escalate — well after any individual message has been reviewed and cleared

COPPA compliance is necessary. It is not sufficient for child safety.

The platforms taking this seriously are building behavioral detection infrastructure alongside COPPA compliance — using the data they're permitted to collect to identify escalation patterns, relationship dynamics, and temporal signals that individual message review misses entirely.

State law: the layer on top of COPPA

COPPA is the federal floor. State laws add to it:

California AADC (Age-Appropriate Design Code): Requires design choices that protect minors broadly (up to 18), including privacy by default, no profiling without opt-in, and accessible privacy controls.
Utah, Arkansas, Texas, Florida: All passed laws in 2023-2024 adding parental consent or age verification requirements, some extending to 13-17 users.
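UK Children's Code: Applies to online services likely to be accessed by children under 18 in the UK. Requires high privacy by default, profiling switched off by default, and no nudge techniques that push children to weaken their privacy protections.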

These don't replace COPPA — they layer on top. Build your architecture to support user classification by age and jurisdiction with different data handling rules per class.
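That classification can be sketched as a function from age and jurisdiction to one handling policy, where rules only ever tighten. The jurisdiction codes, age bands, and rule contents below are simplified placeholders to confirm with counsel, not a legal mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlingPolicy:
    parental_consent_required: bool = False
    third_party_analytics_allowed: bool = True
    profiling_on_by_default: bool = True
    high_privacy_defaults: bool = False


def policy_for(age: int, jurisdiction: str) -> HandlingPolicy:
    """Stack every rule set that applies; the strictest setting wins (illustrative rules)."""
    consent = no_analytics = no_profiling = high_privacy = False
    if age < 13:  # COPPA floor, applied everywhere here as a conservative simplification
        consent = no_analytics = no_profiling = high_privacy = True
    if jurisdiction == "US-CA" and age < 18:  # California AADC-style protections
        no_profiling = high_privacy = True
    if jurisdiction == "GB" and age < 18:  # UK Children's Code
        no_profiling = high_privacy = True
    return HandlingPolicy(
        parental_consent_required=consent,
        third_party_analytics_allowed=not no_analytics,
        profiling_on_by_default=not no_profiling,
        high_privacy_defaults=high_privacy,
    )
```

Adding a jurisdiction means adding one more rule that can only set protections to on, which keeps the layering property from the text.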

Implementation checklist

[ ] Age gate with routing: under-13 routes to parental consent flow; 13+ proceeds normally
[ ] Parental consent mechanism (credit card verification + signed form fallback)
[ ] Parental account linking with cryptographic verification
[ ] COPPA-segregated data pipelines (separate analytics, no third-party SDK data for under-13 users)
[ ] Parental dashboard: review, delete, withdraw consent
[ ] Documented deletion flow that reaches all data stores (primary DB, analytics, moderation logs, backups)
[ ] Data retention schedule with legal exceptions documented
[ ] NCMEC CyberTipline integration for mandatory reporting
[ ] Privacy policy meeting FTC COPPA requirements (not your general privacy policy)
[ ] Annual review cycle (requirements evolve; compliance must too)

SENTINEL provides reference implementations for several items on this list as open-source infrastructure: parental consent state tracking, COPPA-segregated data handling, GDPR/COPPA-compliant erasure that preserves audit log integrity, and NCMEC CyberTipline evidence package generation.

The behavioral detection layer — the part that identifies grooming patterns while COPPA-compliant data is active — is the other half of the picture.

GitHub: https://github.com/sentinel-safety/SENTINEL

Free for platforms under $100k annual revenue. Apache 2.0 in 2046.

Add Child Safety to Your Platform in 30 Minutes: A SENTINEL Integration Guide

sentinel-safety — Sun, 26 Apr 2026 10:27:46 +0000

If you're building a platform where users interact (a game, a community forum, a messaging app, an educational tool), child safety compliance is not optional. The EU DSA, the UK Online Safety Act, and U.S. mandatory reporting to NCMEC under 18 U.S.C. § 2258A apply based on the service you run and who uses it, not on your headcount. Some DSA and Online Safety Act duties scale with size, but the baseline obligations reach small services too.

SENTINEL (https://github.com/sentinel-safety/SENTINEL) is an open-source behavioral intelligence platform that handles the full compliance stack: behavioral detection, perceptual hash matching, evidence preservation, and CyberTipline report generation. This guide walks through getting it running on your infrastructure.

Prerequisites: Docker, Docker Compose, 4GB RAM, a Linux or macOS host. Everything runs locally — no data leaves your infrastructure.

Step 1: Clone and Configure (5 minutes)

git clone https://github.com/sentinel-safety/SENTINEL.git
cd SENTINEL
cp .env.example .env

Edit .env with your platform's configuration. The required fields at minimum:

# Platform identity (used in NCMEC reports)
PLATFORM_NAME="Your Platform Name"
PLATFORM_ESP_ID="your-esp-id"

# Database
DATABASE_URL=postgresql://sentinel:sentinel@postgres:5432/sentinel

# Redis (session state and message queuing)
REDIS_URL=redis://redis:6379

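# JWT for inter-service auth. .env files are not run through a shell, so
# $(...) is not expanded: generate a value with `openssl rand -hex 32`
# and paste the output here.
JWT_SECRET=replace-with-output-of-openssl-rand-hex-32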

The ESP ID is your platform's identifier in NCMEC's CyberTipline system. If you don't have one yet, you can register at https://www.missingkids.org/theissues/csam.

Step 2: Start the Services (3 minutes)

SENTINEL runs as 13 microservices. Docker Compose handles the orchestration:

docker-compose up -d

On first run, this pulls images and initializes the database schema. On subsequent starts, it's under 10 seconds.

Verify all services are healthy:

docker-compose ps

You should see all services in Up or Up (healthy) state. The key services:

gateway (port 8000): API entry point for your platform to send events
behavioral-analyzer: Session-level pattern detection
content-scanner: Perceptual hash matching
evidence-store: Forensically sound evidence packaging
report-generator: CyberTipline report construction
federation-hub: Cross-platform threat intelligence (optional)

Step 3: Send Your First Event (5 minutes)

SENTINEL's gateway accepts message events via REST. When a user sends a message on your platform, forward it:

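import os
from datetime import datetime

import httpx

# Read the API key from the environment rather than hardcoding it
SENTINEL_API_KEY = os.environ["SENTINEL_API_KEY"]

async def forward_to_sentinel(message: dict) -> dict:
    created_at: datetime = message["created_at"]  # expected to be a datetime
    # For high volume, reuse one AsyncClient instead of opening one per message
    async with httpx.AsyncClient(timeout=5.0) as client:
        response = await client.post(
            "http://localhost:8000/api/v1/events/message",
            json={
                "platform_user_id": message["sender_id"],
                "session_id": message["conversation_id"],
                "content": message["text"],
                "timestamp": created_at.isoformat(),
                "recipient_ids": [message["recipient_id"]],
                "metadata": {
                    "ip_address": message.get("sender_ip"),
                    "user_agent": message.get("user_agent"),
                },
            },
            headers={"Authorization": f"Bearer {SENTINEL_API_KEY}"},
        )
        # Raise on 4xx/5xx instead of treating an error body as a result
        response.raise_for_status()
    return response.json()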

For Node.js platforms:

const axios = require('axios');

async function forwardToSentinel(message) {
  const response = await axios.post(
    'http://localhost:8000/api/v1/events/message',
    {
      platform_user_id: message.senderId,
      session_id: message.conversationId,
      content: message.text,
      timestamp: new Date(message.createdAt).toISOString(),
      recipient_ids: [message.recipientId],
      metadata: {
        ip_address: message.senderIp,
        user_agent: message.userAgent,
      }
    },
    { headers: { Authorization: `Bearer ${process.env.SENTINEL_API_KEY}` } }
  );
  return response.data;
}

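The gateway returns immediately with an event ID, and analysis happens asynchronously. The HTTP call itself still costs a round trip, though. To leave message delivery latency unaffected, forward events off your delivery path, for example from a background task or queue.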
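Awaiting `forward_to_sentinel` before delivering each message adds a network round trip to every send. One way to keep it off the hot path is a bounded in-memory queue drained by a background task. The sketch below assumes a single asyncio process. The class name, queue size, and drop-when-full policy are assumptions. For a safety system, dropped or failed events are lost signals, so a durable queue that survives restarts is the better production choice.

```python
import asyncio
from typing import Awaitable, Callable


class BackgroundForwarder:
    """Keeps event forwarding off the message-delivery path (single process, in memory)."""

    def __init__(self, send: Callable[[dict], Awaitable[object]], maxsize: int = 10_000):
        self._send = send  # e.g. forward_to_sentinel
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0   # events lost because the queue was full: alert on this
        self.failed = 0    # events the send call raised on: log and retry in practice

    def enqueue(self, message: dict) -> None:
        """Non-blocking: safe to call from the delivery path."""
        try:
            self._queue.put_nowait(message)
        except asyncio.QueueFull:
            self.dropped += 1

    async def run(self) -> None:
        """Worker loop; start once with asyncio.create_task(forwarder.run())."""
        while True:
            message = await self._queue.get()
            try:
                await self._send(message)
            except Exception:
                self.failed += 1
            finally:
                self._queue.task_done()

    async def drain(self) -> None:
        """Wait until every queued event has been processed."""
        await self._queue.join()
```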

Step 4: Handle Alerts (10 minutes)

When SENTINEL detects a high-risk session, it fires a webhook to your platform. Configure your webhook endpoint:

# In .env
ALERT_WEBHOOK_URL=https://your-platform.com/webhooks/sentinel
ALERT_WEBHOOK_SECRET=your-webhook-secret

Your webhook handler receives:

{
  "alert_id": "alert_01HXYZ...",
  "session_id": "conv_123",
  "platform_user_id": "user_456",
  "risk_score": 0.87,
  "risk_level": "HIGH",
  "behavioral_signals": [
    "age_solicitation_detected",
    "trust_escalation_pattern",
    "platform_exit_pressure"
  ],
  "recommended_action": "REVIEW",
  "evidence_package_id": "evp_01HABC...",
  "created_at": "2026-04-26T14:23:00Z"
}

A minimal handler that queues for human review:

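import hashlib
import hmac
import json
import os

from fastapi import FastAPI, HTTPException, Request

# The same value you configured as ALERT_WEBHOOK_SECRET in SENTINEL's .env
WEBHOOK_SECRET = os.environ["ALERT_WEBHOOK_SECRET"]

app = FastAPI()


async def queue_for_trust_and_safety_review(alert: dict) -> None:
    """Hypothetical placeholder: replace with a push onto your moderation queue."""


@app.post("/webhooks/sentinel")
async def handle_sentinel_alert(request: Request):
    # Verify the HMAC-SHA256 signature over the raw body before parsing it
    body = await request.body()
    expected = hmac.new(WEBHOOK_SECRET.encode(), body, hashlib.sha256).hexdigest()
    received = request.headers.get("X-Sentinel-Signature", "")
    if not hmac.compare_digest(expected, received):
        raise HTTPException(status_code=401)

    alert = json.loads(body)

    if alert["risk_level"] in ("HIGH", "CRITICAL"):
        # Queue for human review; never auto-ban on the algorithm alone
        await queue_for_trust_and_safety_review(alert)

    return {"status": "received"}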

Critical: Always route HIGH and CRITICAL alerts to human review before taking account action. SENTINEL's fairness gate screens detection models for disparate false positive rates before they deploy, but human judgment is the final step on every alert. Automated bans without review create liability for wrongful account terminations.

Step 5: Understanding What Gets Flagged (5 minutes)

SENTINEL's behavioral analyzer tracks signals across a session, not individual messages. A single message asking someone's age is not flagged. The pattern that gets flagged looks like this:

Initial contact (age solicitation, shared interest framing)
Trust escalation over multiple sessions (increasing personal disclosure requests)
Platform exit pressure ("let's continue on Discord")
Image solicitation following the above sequence

No single step triggers an alert. The risk score accumulates across the behavioral trajectory. This is why behavioral detection catches grooming that content filters miss — the individual messages are often innocuous; only the pattern is not.
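A sketch of that accumulation follows. It is a hypothetical model. The weights, threshold, and ordering bonus are made up for illustration and are not the contents of SENTINEL's config/behavioral_rules.yml. What it demonstrates is that each stage alone stays under the alert threshold, while the ordered sequence crosses it.

```python
# Hypothetical weights: no single stage reaches ALERT_THRESHOLD on its own
STAGE_WEIGHTS = {
    "age_solicitation": 15,
    "trust_escalation": 20,
    "platform_exit_pressure": 25,
    "image_solicitation": 30,
}
STAGE_ORDER = list(STAGE_WEIGHTS)  # the escalation sequence, in order
ALERT_THRESHOLD = 60
SEQUENCE_BONUS = 1.5  # a stage that follows an earlier stage counts for more


def trajectory_score(observed_stages: list) -> float:
    """Accumulate risk over the stages observed in a session (illustrative)."""
    score = 0.0
    furthest = -1  # index of the furthest stage seen so far
    # dict.fromkeys drops repeats but keeps first-seen order
    for stage in dict.fromkeys(observed_stages):
        index = STAGE_ORDER.index(stage)
        weight = STAGE_WEIGHTS[stage]
        follows_earlier_stage = furthest >= 0 and index > furthest
        score += (weight * SEQUENCE_BONUS) if follows_earlier_stage else weight
        furthest = max(furthest, index)
    return score


print(trajectory_score(["age_solicitation"] * 5))  # 15.0: repeated age questions alone
print(trajectory_score(STAGE_ORDER))               # 127.5: full escalation sequence
```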

You can inspect any session's behavioral analysis:

curl -H "Authorization: Bearer $SENTINEL_API_KEY" \
  http://localhost:8000/api/v1/sessions/{session_id}/analysis

Step 6: Evidence and Reporting

When a session reaches CRITICAL risk level, or when your trust and safety team confirms a violation after review, you can generate an NCMEC CyberTipline report:

curl -X POST \
  -H "Authorization: Bearer $SENTINEL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "conv_123", "confirmed_by": "analyst_id_456"}' \
  http://localhost:8000/api/v1/reports/generate

This returns a report object pre-populated with all required and recommended CyberTipline fields, sourced from the evidence package that was built at the time of flagging. The report is ready for submission or human review before submission.

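Evidence is automatically preserved for 90 days by default (configurable for law enforcement extension requests) in isolated storage with full chain of custody. The REPORT Act of 2024 extended the § 2258A(h) preservation period to one year, so raise the retention setting to match.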

What You've Got

After these 30 minutes, your platform has:

Real-time behavioral analysis on all user messages
Perceptual hash scanning for known CSAM
An alert pipeline to your trust and safety workflow
An evidence preservation system you can configure to meet 18 U.S.C. § 2258A preservation requirements
A CyberTipline report generator that can produce reports well inside a 24-hour internal reporting target

This is the same infrastructure stack that compliance teams at large platforms build over 12-18 months. SENTINEL packages it as a deployable system because the platforms that need it most — indie games, small social networks, educational tools — are the ones least likely to have the resources to build it from scratch.

Next Steps

Content scanning: Configure the NCMEC hash database integration for PhotoDNA-compatible matching (requires NCMEC ESP registration)
Federation: Enable the federation hub to share anonymized threat intelligence with other SENTINEL-enabled platforms (cross-platform grooming patterns are common)
Monitoring: The /metrics endpoint exposes Prometheus-compatible metrics for your observability stack
Tuning: Adjust risk score thresholds and behavioral signal weights in config/behavioral_rules.yml to match your platform's demographics

Documentation, architecture details, and the full API reference are at the GitHub repo: https://github.com/sentinel-safety/SENTINEL

SENTINEL is in active v1 development. Production deployments should include a human review layer. The system is designed to surface risk for human judgment, not replace it.

NCMEC Mandatory Reporting for Online Platforms: What Developers Need to Know

sentinel-safety — Sun, 26 Apr 2026 10:24:09 +0000

Every online platform that allows user-generated content faces a legal reality that most engineering teams discover too late: if your users can send each other messages, you may be a mandatory reporter under federal law.

18 U.S.C. § 2258A requires Electronic Service Providers (ESPs) to report apparent child sexual exploitation material (CSEM) to the National Center for Missing & Exploited Children (NCMEC) CyberTipline. The statute covers any service that provides email, instant messaging, chat, cloud storage, or any capability for transmitting content. If you run a platform where users communicate, you are almost certainly covered.

This is not hypothetical risk. Providers that knowingly and willfully fail to report face federal fines under § 2258A(e). And "knowingly" has been interpreted broadly.

The 24-Hour Clock

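Once your platform obtains actual knowledge of apparent CSEM, the reporting clock starts. § 2258A(a)(1) requires a report "as soon as reasonably possible" but does not name a fixed number of hours. Treat 24 hours for an initial report, with supplemental information to follow, as a conservative internal target. That sounds manageable until you consider what the clock requires of your infrastructure: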

Detection must happen in near real-time, not on a nightly batch job
Your incident response pipeline must be able to generate a compliant CyberTipline report within hours of detection
Evidence must be preserved in a way that survives chain-of-custody scrutiny
The report must contain specific required fields, not just a notification that something happened

Most platforms that think they are compliant are not. They have content moderation workflows that flag content for human review, with reports generated manually days or weeks later. That gap is a legal exposure.
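The infrastructure requirement starts with a timer anchored to the moment of actual knowledge, not to when someone starts drafting. Below is a minimal sketch. It assumes a 24-hour internal target and a 12-hour escalation point; both values and all names are assumptions to set with counsel.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

REPORT_TARGET = timedelta(hours=24)   # internal target (assumption)
ESCALATE_AFTER = timedelta(hours=12)  # page the on-call analyst (assumption)


@dataclass
class ReportingClock:
    incident_id: str
    knowledge_at: datetime                # when the platform obtained actual knowledge
    reported_at: Optional[datetime] = None

    @property
    def deadline(self) -> datetime:
        return self.knowledge_at + REPORT_TARGET

    def status(self, now: datetime) -> str:
        if self.reported_at is not None:
            return "reported_on_time" if self.reported_at <= self.deadline else "reported_late"
        if now > self.deadline:
            return "overdue"
        if now - self.knowledge_at >= ESCALATE_AFTER:
            return "escalate"
        return "open"


start = datetime(2026, 4, 26, 10, 0, tzinfo=timezone.utc)
clock = ReportingClock("incident-1", knowledge_at=start)
print(clock.deadline.isoformat())                 # 2026-04-27T10:00:00+00:00
print(clock.status(start + timedelta(hours=13)))  # escalate
```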

What a CyberTipline Report Contains

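§ 2258A(b) lists what a report may include, largely at the provider's discretion. In practice, an actionable CyberTipline report includes:

Core (treat as required):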

The identity of the reporting ESP
A copy of each visual depiction reported
The electronic address used by the apparent violator (email, username, or IP address)
The geographic location of the apparent violator, if reasonably available
The time and date of the apparent violation

Recommended (strongly advisable):

Full IP address with timestamp (IPv4 and IPv6 where available)
Port number and protocol
User account information (creation date, email, phone)
Prior platform actions taken on the account
Session logs bracketing the incident

The recommended fields matter because NCMEC routes reports to law enforcement. Thin reports with only required fields are significantly less actionable. If your system cannot generate the recommended fields automatically, every report costs an analyst hours of manual data gathering.
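A minimal sketch of that automated check follows, assuming the evidence package arrives as a flat Python dict. The field names (`reporting_esp`, `violator_address`, and so on) are hypothetical labels for the two lists above, not the CyberTipline schema. `check_report` is an illustrative helper, not part of any NCMEC tooling.

```python
from typing import Any

# Hypothetical field names that mirror the two lists above.
# They are NOT the CyberTipline schema; map them to NCMEC's format separately.
REQUIRED = ("reporting_esp", "visual_depictions", "violator_address", "incident_time")
REQUIRED_IF_AVAILABLE = ("violator_location",)  # "if reasonably available"
RECOMMENDED = ("ip_address", "port_and_protocol", "account_info",
               "prior_actions", "session_logs")


def _is_empty(value: Any) -> bool:
    """Treat None, empty strings, and empty collections as missing."""
    return value is None or (hasattr(value, "__len__") and len(value) == 0)


def check_report(fields: dict[str, Any]) -> dict[str, list[str]]:
    """Split missing fields into gaps that block filing and gaps to fill afterwards."""
    return {
        "blocking": [f for f in REQUIRED if _is_empty(fields.get(f))],
        "follow_up": [f for f in REQUIRED_IF_AVAILABLE + RECOMMENDED
                      if _is_empty(fields.get(f))],
    }


draft = {
    "reporting_esp": "example-platform",
    "visual_depictions": ["evidence/0001"],
    "violator_address": "user:4821",
    "incident_time": "2026-04-25T14:03:00Z",
    "ip_address": "203.0.113.7",
}
gaps = check_report(draft)
print(gaps["blocking"])  # []
print(gaps["follow_up"])
```

In this sketch only the first group blocks filing. A missing recommended field therefore never delays the initial report, and the gap can be filled later as supplemental information.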

The Evidence Preservation Problem

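Reporting is only half the obligation. 18 U.S.C. § 2258A(h) requires providers to preserve the contents of each report, along with reasonably accessible files that give context about the reported material or person. The preservation period was 90 days until the REPORT Act of 2024 extended it to one year. Law enforcement can also seek longer preservation through a separate preservation request. This creates engineering requirements that most platforms have not addressed: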

Chain of custody. Preserved evidence must be demonstrably unaltered from the moment of detection. This requires cryptographic hashing (SHA-256 minimum) at the point of initial encounter, not after any processing. Perceptual hashing for CSAM detection itself is not sufficient for chain-of-custody purposes.

Retention isolation. Preserved evidence must be stored separately from your normal content lifecycle. If your platform auto-deletes content after 30 days, your retention pipeline must intercept flagged content before deletion and move it to isolated, access-controlled storage.

Access controls. Evidence access must be logged: who accessed what, when, and for what purpose must be auditable. For law enforcement cooperation, this is a requirement, not a nice-to-have.

Metadata completeness. Storage timestamps, access logs, and modification records must be preserved alongside content. A file hash is not enough; the entire forensic package must be intact.

Most platforms' object storage configurations, retention policies, and access control models were not designed with evidence preservation in mind. Retrofitting them is expensive and error-prone.
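The chain-of-custody and metadata points can be sketched with the standard library alone. This is a minimal illustration under assumed names: `capture_evidence`, `verify_evidence`, and the record layout are hypothetical. The SHA-256 digest is computed over the original bytes before any transcoding or thumbnailing, and a second digest seals the content hash together with its metadata. A digest stored next to the evidence only catches accidental or naive changes. For chain of custody, it also has to be recorded somewhere the evidence custodians cannot rewrite, such as write-once storage or a chained audit log.

```python
import hashlib
import json
from datetime import datetime, timezone


def _canonical(obj: dict) -> bytes:
    # Deterministic JSON so the same record always hashes to the same digest.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


def capture_evidence(raw: bytes, metadata: dict) -> dict:
    """Hash the original bytes at first encounter, before any processing step."""
    record = {
        "sha256": hashlib.sha256(raw).hexdigest(),
        "size_bytes": len(raw),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "metadata": metadata,  # e.g. uploader, storage key, storage timestamps
    }
    # Seal content digest and metadata together so neither can change silently.
    record["record_sha256"] = hashlib.sha256(_canonical(record)).hexdigest()
    return record


def verify_evidence(raw: bytes, record: dict) -> bool:
    """Recompute both digests; a change to the bytes or the metadata fails the check."""
    body = {k: v for k, v in record.items() if k != "record_sha256"}
    return (hashlib.sha256(raw).hexdigest() == record["sha256"]
            and hashlib.sha256(_canonical(body)).hexdigest() == record["record_sha256"])


original = b"original upload bytes"
rec = capture_evidence(original, {"uploader": "user:4821", "storage_key": "uploads/abc"})
print(verify_evidence(original, rec))       # True
print(verify_evidence(b"re-encoded", rec))  # False
```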

The Detection Gap Nobody Talks About

Here is where the regulatory framework runs into a practical wall: § 2258A creates reporting obligations, but it does not specify how you are supposed to detect reportable content in the first place. Section 2258A(f) goes further. It states that nothing in the section requires a provider to monitor users or their communications, or to affirmatively search, screen, or scan for reportable content. Whether to detect, and how, is left to the platform.

The industry standard has converged on PhotoDNA and similar perceptual hash matching against NCMEC's hash database. This approach is effective for known material. It fails completely for novel content.

Novel CSAM — content not previously indexed in any hash database — passes through PhotoDNA-based systems undetected. The behavioral context in which it appears (a sequence of conversations showing escalating trust-building, requests for images, and grooming language) often precedes production of novel material by days or weeks.

This means platforms relying solely on hash matching detect content only after it has already been identified, hashed, and added to a database. The grooming process that produced it went undetected.

Behavioral detection addresses this gap by analyzing communication patterns rather than content. A platform can identify high-risk interactions before content is produced, intervene earlier, and generate richer context for any eventual report.

A Compliance Architecture That Actually Works

A production-grade NCMEC reporting pipeline has seven components:

1. Real-time content scanning — hash matching against NCMEC database on upload or send, before content reaches the recipient. Results must be available within the message delivery latency window (typically under 500ms for an async side-channel check).

2. Behavioral pattern analysis — session-level analysis of communication patterns for grooming indicators: age solicitation, trust escalation sequences, platform-exit pressure, image requests following a grooming pattern. This runs independently of content scanning and flags interactions for review before content violations occur.

3. Evidence packaging — automated generation of a forensically sound evidence package at the moment of flagging: cryptographic hashes, original files, metadata, session context, account history. The package must be generated before any other system action on the content.

4. Retention isolation — automated transfer of flagged content and all associated data to immutable, access-controlled storage, with a retention period that meets the statutory minimum (one year since the REPORT Act of 2024, previously 90 days) and an alert system for law enforcement extension requests.

5. Report generation — automated construction of a CyberTipline-compliant report, pre-populated with all required and recommended fields from the evidence package. This should be reviewable by a human in under 5 minutes for a standard case.

6. Submission and tracking — CyberTipline API integration (or SFTP for high-volume reporters) with submission confirmation tracking, retry logic, and a permanent audit log of all submissions.

7. Fairness gate — before any account action (suspension, ban, content removal), a statistical verification step to confirm the detection confidence is above your defined threshold. False positives that result in wrongful account termination create liability and destroy user trust. An appeal pathway with human review is required.
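The ordering constraints in components 3 through 7 can be enforced in code rather than by convention. The sketch below is hypothetical: the step names, the `Case` class, and the 0.9 threshold are assumptions. Steps must run in order, starting with evidence packaging. An account action is allowed only after evidence exists, detection confidence clears the threshold, and a human has reviewed the case.

```python
from dataclasses import dataclass, field

# Hypothetical step names for components 3-6; the enforced order is the point.
PIPELINE = ("package_evidence", "isolate_retention", "draft_report", "submit_report")


@dataclass
class Case:
    case_id: str
    confidence: float              # detection confidence in [0, 1]
    human_reviewed: bool = False
    done: list[str] = field(default_factory=list)

    def advance(self, step: str) -> None:
        """Run the next pipeline step, refusing to skip or reorder steps."""
        expected = PIPELINE[len(self.done)] if len(self.done) < len(PIPELINE) else None
        if step != expected:
            raise RuntimeError(f"{self.case_id}: expected {expected!r}, got {step!r}")
        self.done.append(step)

    def may_take_account_action(self, threshold: float = 0.9) -> bool:
        """Component 7: evidence first, confidence above threshold, human sign-off."""
        return ("package_evidence" in self.done
                and self.confidence >= threshold
                and self.human_reviewed)


case = Case("case-001", confidence=0.95)
case.advance("package_evidence")
print(case.may_take_account_action())  # False: no human review yet
case.human_reviewed = True
print(case.may_take_account_action())  # True
```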

The Regulatory Overlap Problem

If you operate internationally, NCMEC reporting intersects with GDPR, COPPA, and the UK Online Safety Act in ways that create genuine compliance tension.

GDPR requires a lawful basis for processing personal data. Behavioral monitoring for child safety purposes generally qualifies under legitimate interests or a specific statutory obligation — but you must document this in your processing records.

COPPA applies to services directed to children under 13, and to general-audience services with actual knowledge that they collect personal information from children under 13. If you have COPPA obligations, your data minimization requirements interact with your NCMEC evidence preservation requirements. You may be legally required to preserve data for the full statutory period that your COPPA compliance program wants deleted immediately.

UK OSA creates a parallel obligation to report child sexual exploitation and abuse content to the National Crime Agency (NCA) and, for higher-risk services, mandatory risk assessments that include grooming detection capabilities. An OSA-compliant detection system is not identical to an NCMEC-compliant system, but they share significant infrastructure.

The cleanest approach is a unified evidence layer that satisfies all three frameworks simultaneously, with jurisdiction-aware retention policies that apply the longest applicable retention period.
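Applying the longest applicable period amounts to taking the maximum over the candidate end dates. Below is a minimal sketch, assuming hypothetical policy names and placeholder day counts that counsel would set. It is intended for the isolated evidence layer only; data outside that layer keeps following your minimization rules.

```python
from datetime import date, timedelta
from typing import Optional

# Placeholder periods in days, keyed by hypothetical policy names. Set real values
# with counsel; 365 reflects the one-year preservation period that the REPORT Act
# (2024) wrote into 18 U.S.C. § 2258A(h).
RETENTION_DAYS = {
    "us_2258a_preservation": 365,
    "platform_default": 30,
}


def retention_end(flagged_on: date, policies: list[str],
                  legal_hold_until: Optional[date] = None) -> date:
    """Keep an evidence record until the latest date any applicable policy or hold requires."""
    candidates = [flagged_on + timedelta(days=RETENTION_DAYS[p]) for p in policies]
    if legal_hold_until is not None:
        candidates.append(legal_hold_until)  # e.g. a law enforcement extension request
    if not candidates:
        raise ValueError("no retention policy applies; refusing to guess")
    return max(candidates)


print(retention_end(date(2026, 4, 25), ["us_2258a_preservation", "platform_default"]))
# 2027-04-25
```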

SENTINEL as Reference Implementation

SENTINEL (https://github.com/sentinel-safety/SENTINEL) is an open-source behavioral intelligence platform built specifically for this compliance stack. It implements all seven components above as independently deployable microservices:

PhotoDNA-compatible perceptual hash matching with NCMEC database integration
Behavioral pattern detection using multi-signal session analysis
Automated evidence packaging with SHA-256 chain of custody
Retention isolation with configurable jurisdiction-aware policies
CyberTipline report generation (NCMEC API v2 format)
Submission tracking with audit log
Statistical fairness gate with configurable thresholds

SENTINEL is designed for platforms that cannot afford a dedicated trust and safety team but need production-grade compliance infrastructure. It runs entirely on your infrastructure, with no third-party data sharing, using Docker Compose for deployment.

The project is in active development (v1), open source under a dual license (free for platforms under $100k ARR), and built to be the reference implementation for the child safety compliance stack that the regulatory frameworks require but do not specify.

Starting Checklist

If you are starting from scratch on NCMEC compliance:

Confirm your ESP status under § 2258A — if users can send content, you are almost certainly covered
Audit your current detection capabilities — hash matching only, or behavioral analysis too?
Map your evidence preservation infrastructure — where does flagged content go, and for how long?
Review your incident response timeline — can you generate a compliant report within 24 hours of detection?
Check your international overlap — GDPR, COPPA, and UK OSA interact with your NCMEC obligations

The liability for non-compliance is criminal, not civil. The statutory trigger is actual knowledge, so the safest engineering assumption is that the clock starts the moment your system encounters reportable content, not when a human reviews it.

Inside SENTINEL: How 13 Microservices Detect Child Grooming by Behavior, Not Keywords

sentinel-safety — Sat, 25 Apr 2026 16:02:22 +0000

This is a technical walkthrough of SENTINEL's architecture. If you want to understand how a behavioral child safety detection system actually works at the service level, this is for you.

SENTINEL is a 13-microservice platform. Each service is independently deployable. You can start with just the event ingestion and risk scoring services, add the compliance layer when needed, and opt into federation later. Here's what each service does and why it exists as a separate service.

Why microservices?

Content moderation systems get bolted into platform infrastructure and then never changed. A monolithic design locks you into the same detection logic, the same compliance reporting format, and the same infrastructure footprint — even as your platform scales and your regulatory obligations evolve.

SENTINEL's services are small, replaceable, and independently testable. A platform that wants to swap SENTINEL's linguistic model for their own detection model can do that without touching the audit log service or the NCMEC reporting pipeline. A platform that doesn't need the federation service doesn't deploy it.

The 13 services group into five layers: ingestion, analysis, risk scoring, infrastructure, and output.

Ingestion layer

Event API Service is the single entry point. Platforms send behavioral events over REST: message sent, session started, relationship formed, contact frequency change. The service validates the schema, assigns a platform-specific event ID, and queues the event for the analysis layer. Webhook callbacks are supported for real-time risk score delivery.
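A minimal sketch of the ingestion step, using only the standard library. The event types come from the paragraph above. The field names, the `ingest` function, and the ID format are assumptions, and `queue.Queue` stands in for the Redis-backed job queue the architecture uses.

```python
import queue
import uuid
from dataclasses import dataclass
from typing import Optional

# Event types from the paragraph above; field names are assumptions, not SENTINEL's schema.
EVENT_TYPES = {"message_sent", "session_started", "relationship_formed",
               "contact_frequency_change"}


@dataclass(frozen=True)
class Event:
    event_id: str
    platform_id: str
    event_type: str
    actor_id: str
    target_id: Optional[str]
    timestamp: float


# Stand-in for the analysis job queue (Redis in a real deployment).
analysis_queue: "queue.Queue[Event]" = queue.Queue()


def ingest(platform_id: str, payload: dict) -> str:
    """Validate a raw event, assign a platform-scoped ID, and queue it for analysis."""
    if payload.get("type") not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {payload.get('type')!r}")
    for key in ("actor_id", "timestamp"):
        if key not in payload:
            raise ValueError(f"missing required field: {key}")
    event = Event(
        event_id=f"{platform_id}:{uuid.uuid4()}",
        platform_id=platform_id,
        event_type=payload["type"],
        actor_id=str(payload["actor_id"]),
        target_id=payload.get("target_id"),
        timestamp=float(payload["timestamp"]),
    )
    analysis_queue.put(event)
    return event.event_id
```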

SDK layer is not a service itself, but the Python and Node.js SDKs abstract the API call. Most platforms integrate at the SDK level, not the raw API. The SDKs handle batching, retry logic, and async callback handling.
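The batching and retry behavior can be sketched as follows. `BatchingClient` is a hypothetical illustration, not the SENTINEL SDK's API. The `send` callable stands in for the real HTTP request. The backoff uses exponential delays with full jitter, so many clients retrying at once do not hit the API in lockstep.

```python
import random
import time
from typing import Callable, Sequence


class BatchingClient:
    """Hypothetical SDK-style client: buffers events and sends them in batches.

    `send` stands in for the HTTP call to the Event API; it should raise
    ConnectionError on a transient failure.
    """

    def __init__(self, send: Callable[[Sequence[dict]], None], batch_size: int = 50,
                 max_retries: int = 4, base_delay: float = 0.2) -> None:
        self.send = send
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.base_delay = base_delay
        self.buffer: list[dict] = []

    def track(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        for attempt in range(self.max_retries + 1):
            try:
                self.send(batch)
                return
            except ConnectionError:
                if attempt == self.max_retries:
                    self.buffer = batch + self.buffer  # keep events for the next flush
                    raise
                # Exponential backoff with full jitter: wait up to base_delay * 2**attempt.
                time.sleep(random.uniform(0, self.base_delay * 2 ** attempt))
```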

Analysis layer

These four services are the core of SENTINEL's behavioral detection. Each is independently scalable.

Linguistic Analysis Service builds a session-by-session profile of how a user's communication style changes over time. It is not a keyword scanner. It watches register shifts (vocabulary level, formality, pronoun use, topic focus) and compares them against session history to detect the style changes associated with manufactured intimacy. The model runs on behavioral metadata derived from language, not on a stored copy of message content.
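One way to measure a register shift against a user's own history is a per-feature z-score. The sketch assumes the style features are computed upstream for each session, so only numbers reach this step. The feature names (`formality`, `second_person_rate`, `vocab_level`) are hypothetical. This illustrates the idea; it is not SENTINEL's linguistic model.

```python
from statistics import mean, stdev

# Hypothetical per-session style features, each computed upstream from a session's messages.
FEATURES = ("formality", "second_person_rate", "vocab_level")


def register_shift(history: list[dict[str, float]], current: dict[str, float],
                   z_threshold: float = 2.5, min_sessions: int = 3) -> dict[str, float]:
    """Return the features on which the current session departs from the user's own baseline."""
    if len(history) < max(min_sessions, 2):  # stdev needs at least two sessions
        return {}
    shifted = {}
    for name in FEATURES:
        values = [session[name] for session in history]
        spread = stdev(values)
        if spread == 0:
            continue  # constant feature: z-score undefined
        z = (current[name] - mean(values)) / spread
        if abs(z) >= z_threshold:
            shifted[name] = round(z, 2)
    return shifted
```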

Graph Analysis Service maintains a social graph for each platform: who communicates with whom, at what frequency, and through which channel types. It detects coordinated targeting (multiple accounts approaching the same minor), asymmetric relationship formation (high contact frequency on one side), and escalation from group channels to private channels. Graph signals are some of the most reliable indicators of grooming intent — they are hard to game because they reflect structural behavior, not surface-level content choices.
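Two of these structural signals can be sketched over a list of `(sender, recipient)` pairs. The thresholds are hypothetical. Counting distinct senders per minor is only a naive proxy for coordinated targeting, since a real system would also need evidence that the accounts are linked. Escalation from group to private channels is not modeled here.

```python
from collections import Counter, defaultdict


def graph_signals(messages: list[tuple[str, str]], minors: set[str],
                  asymmetry_ratio: float = 3.0, min_messages: int = 10,
                  min_senders: int = 3):
    """Flag one-sided relationships and minors contacted by many distinct accounts.

    messages: (sender, recipient) pairs observed in one time window.
    """
    counts = Counter(messages)
    # Asymmetry: A messages B often, and far more often than B replies.
    # Counter returns 0 for missing keys without inserting them.
    asymmetric = [
        (a, b) for (a, b), n_ab in counts.items()
        if n_ab >= min_messages and n_ab >= asymmetry_ratio * max(counts[(b, a)], 1)
    ]
    senders = defaultdict(set)
    for a, b in counts:
        if b in minors:
            senders[b].add(a)
    targeted = {m: len(s) for m, s in senders.items() if len(s) >= min_senders}
    return asymmetric, targeted
```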

Temporal Analysis Service watches time-domain signals: contact frequency acceleration, unusual-hours patterns, cross-session escalation velocity. A user who contacts a minor three times in week one, eight times in week two, and several times a day by week three is exhibiting a velocity pattern. The temporal service tracks this trajectory across sessions and integrates with the risk score aggregator to weight recent escalation more heavily.
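Velocity can be expressed as week-over-week growth in contact counts, with recent weeks weighted more heavily. This is a minimal sketch under assumed choices (add-one smoothing and a geometric `decay` of 0.5), not the temporal service's actual formula. The usage line scores a user who contacts a minor three times in week one, eight in week two, and 21 in week three.

```python
def contact_velocity(weekly_counts: list[int], decay: float = 0.5) -> float:
    """Recency-weighted mean of week-over-week growth in contact counts.

    The latest ratio gets weight 1, the one before it `decay`, then decay**2, and so on.
    A value near 1.0 means steady contact; well above 1.0 means accelerating contact.
    """
    # Add-one smoothing keeps a zero-contact week from dividing by zero.
    ratios = [(b + 1) / (a + 1) for a, b in zip(weekly_counts, weekly_counts[1:])]
    if not ratios:
        return 1.0
    weights = [decay ** age for age in range(len(ratios) - 1, -1, -1)]
    return sum(w * r for w, r in zip(weights, ratios)) / sum(weights)


print(round(contact_velocity([3, 8, 21]), 2))  # 2.38
```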

Fairness Evaluation Service does not produce risk scores. It runs before any detection model deploys and computes demographic parity metrics across the user population. If the linguistic, graph, or temporal models produce false positive rates that differ significantly across demographic groups, this service blocks deployment. Once deployed, it runs periodic re-evaluation to catch drift.

Risk scoring layer

Risk Score Aggregator takes outputs from the linguistic, graph, and temporal services and combines them into a unified risk score between 0 and 100. The combination is not a simple average: each signal layer is independently weighted, and the aggregation logic is configurable per platform. The aggregator also produces the plain-language explanation that accompanies each score — synthesizing the specific signals that contributed, in a format that a human moderator can read and a court can understand.

The risk score aggregator assigns a tier label: trusted (0–29), watch (30–59), restrict (60–84), and critical (85–100). These thresholds are configurable.
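A weighted sum is the simplest combination that fits this description: each layer has its own weight, and ranking the contributions gives the explanation something to report. The weights below are hypothetical defaults, and SENTINEL's actual aggregation logic may be more elaborate. The tier floors are the ones listed above.

```python
from collections.abc import Mapping

# Hypothetical weights; the text says weighting is configurable per platform.
DEFAULT_WEIGHTS = {"linguistic": 0.3, "graph": 0.4, "temporal": 0.3}
# Tier floors from the text: trusted 0-29, watch 30-59, restrict 60-84, critical 85-100.
TIERS = ((85, "critical"), (60, "restrict"), (30, "watch"), (0, "trusted"))


def aggregate(signals: Mapping[str, float],
              weights: Mapping[str, float] = DEFAULT_WEIGHTS):
    """Combine per-layer scores in [0, 1] into a 0-100 score, a tier, and ranked drivers."""
    total = sum(weights.values())
    contributions = {
        name: w * min(max(signals.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        for name, w in weights.items()
    }
    score = round(100 * sum(contributions.values()) / total)
    tier = next(label for floor, label in TIERS if score >= floor)
    # Ranked contributors feed the plain-language explanation.
    drivers = sorted(contributions, key=contributions.get, reverse=True)
    return score, tier, drivers


print(aggregate({"linguistic": 0.2, "graph": 0.9, "temporal": 0.8}))
# (66, 'restrict', ['graph', 'temporal', 'linguistic'])
```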

Infrastructure layer

Audit Log Service maintains SENTINEL's tamper-evident audit chain. Every risk score, every model deployment decision, every fairness evaluation, and every compliance export is written to a cryptographically chained log. Records cannot be altered without detection. Retention is seven years by default, configurable for jurisdictions requiring longer retention. This is the primary documentation artifact for regulatory audit requests.
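The core of a cryptographically chained log fits in a few lines. Each record stores the hash of its predecessor, so editing or reordering any record breaks every later link. This is a minimal sketch with hypothetical names (`AuditLog`, `append`, `verify`). The chain makes tampering evident, not impossible: anyone who can rewrite the whole log can recompute every hash. Periodically anchoring the latest hash outside the log's own storage closes that gap and also catches deletion of the newest records.

```python
import hashlib
import json

GENESIS = "0" * 64  # conventional starting value for the first link


def _link(prev_hash: str, entry: dict) -> str:
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode()).hexdigest()


class AuditLog:
    """Append-only log in which every record commits to the hash of the one before it."""

    def __init__(self) -> None:
        self.records: list[dict] = []

    def append(self, entry: dict) -> str:
        prev = self.records[-1]["hash"] if self.records else GENESIS
        digest = _link(prev, entry)
        self.records.append({"entry": entry, "prev": prev, "hash": digest})
        return digest  # anchor this head hash outside the log's own storage

    def verify(self) -> bool:
        """Editing or reordering records, or deleting any but the last, breaks a link.

        Truncating the newest records is only caught by comparing with an anchored head.
        """
        prev = GENESIS
        for rec in self.records:
            if rec["prev"] != prev or _link(prev, rec["entry"]) != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```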

Federation Service manages the opt-in cross-platform threat intelligence network. When a platform confirms a grooming case (human-reviewed), the federation service generates a behavioral signature — a non-reversible vector representation of the behavioral pattern — and submits it to the federation pool. When analyzing new users, the service queries whether their behavioral profile matches any known signature. No user PII or message content crosses platform boundaries.
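The matching step can be sketched as cosine similarity between a new user's behavioral vector and the pooled signatures. This covers only the comparison. How signatures are built, and what makes them non-reversible, is not specified here and is not modeled. The 0.9 threshold is hypothetical, and a hit would reasonably go to human review rather than trigger action on its own.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("signature dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_signatures(profile: list[float], pool: dict[str, list[float]],
                     threshold: float = 0.9) -> list[tuple[str, float]]:
    """Return (signature_id, similarity) pairs at or above the threshold, best first."""
    hits = [(sid, cosine(profile, sig)) for sid, sig in pool.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: h[1], reverse=True)
```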

Data Retention and Erasure Service handles GDPR Article 17 erasure requests, COPPA deletion requirements, and jurisdiction-aware data retention policies. When a user deletion request arrives, this service coordinates with the other services to remove personal data while preserving the audit log integrity required for compliance. The audit log entries are pseudonymized rather than deleted, maintaining the evidentiary chain while honoring erasure obligations.
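If the audit log is hash-chained, pseudonymizing entries after the fact would itself alter chained records. One way to reconcile the two, sketched below under hypothetical names, is to write only keyed pseudonyms into the log from the start. Erasure then means deleting the per-user key, a variant of what is often called crypto-shredding. Pseudonymized data generally remains personal data under GDPR while re-identification is possible, so whether key deletion satisfies a given erasure request is a legal question as well as an engineering one.

```python
import hashlib
import hmac
import secrets


class PseudonymVault:
    """Hypothetical key store: audit entries hold keyed pseudonyms, never raw user IDs."""

    def __init__(self) -> None:
        self._keys: dict[str, bytes] = {}

    def _key(self, user_id: str) -> bytes:
        key = self._keys.get(user_id)
        if key is None:
            # A random per-user key: nobody can recompute the pseudonym without the vault.
            key = self._keys[user_id] = secrets.token_bytes(32)
        return key

    def pseudonym(self, user_id: str) -> str:
        return hmac.new(self._key(user_id), user_id.encode(), hashlib.sha256).hexdigest()

    def is_user(self, user_id: str, pseudonym: str) -> bool:
        key = self._keys.get(user_id)
        if key is None:
            return False
        expected = hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, pseudonym)

    def erase(self, user_id: str) -> None:
        # Deleting the key severs the link between user and pseudonym without
        # rewriting any hash-chained audit record.
        self._keys.pop(user_id, None)
```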

Output layer

NCMEC Reporting Service assembles CyberTipline evidence packages when behavioral indicators meet mandatory reporting thresholds. The package includes the structured event timeline, risk score history, platform context, and whatever user metadata is required for the report. Platform operators review and file; SENTINEL prepares the documentation. This service integrates with the audit log to ensure the evidence package and the audit record are consistent.

Moderation Dashboard Service presents the moderation queue to platform trust and safety teams. Flagged users appear with their risk scores, tier labels, and plain-language explanations. Moderators can review the behavioral signal history, take action, and record the outcome. The service feeds outcomes back into the audit log.

Compliance Export Service generates structured documentation for regulatory submissions: risk assessment records for DSA Article 28 compliance, transparency reports, and audit extracts. These are exportable in machine-readable formats compatible with the EU's Digital Services Act transparency database requirements.

How services communicate

Within a SENTINEL deployment, services communicate over a message queue (Redis by default) for asynchronous analysis jobs and over REST for synchronous queries. The event API places analysis jobs on the queue; each analysis service processes them and writes results to PostgreSQL. The risk score aggregator subscribes to completed analysis outputs and triggers score generation.

Federation queries are synchronous REST calls to the federation service (with caching for high-frequency platforms). Audit log writes are append-only over a dedicated internal API.

Starting small

You do not need to deploy all 13 services. The minimum viable deployment is the Event API, the three analysis services (linguistic, graph, temporal), and the risk score aggregator. This gives you behavioral risk scoring with plain-language explanations.

Add the audit log service for compliance infrastructure. Add the NCMEC reporting service when mandatory reporting becomes relevant. Add the federation service when your platform is large enough to benefit from cross-platform threat intelligence.

The Docker Compose configuration in the repository defines the full stack. Individual services can be commented out for minimal deployments.

SENTINEL is open source. Every service's code, model training scripts, and data handling policy is in the repository.

GitHub: https://github.com/sentinel-safety/SENTINEL

Free for platforms under $100k annual revenue and all non-commercial and research use.

Fairness in Child Safety AI: Why Demographic Parity Audits Are Not Optional

sentinel-safety — Sat, 25 Apr 2026 16:00:50 +0000

Most machine learning systems for content moderation are built, evaluated on accuracy metrics, and deployed. Fairness evaluation is treated as a nice-to-have, or skipped entirely.

In child safety specifically, this is a serious problem — and not just for ethical reasons. Systems that flag one demographic group disproportionately cause real harm to the falsely flagged users, create legal exposure for the platform, and undermine public trust in automated moderation. They also tend to miss threats in underrepresented groups.

SENTINEL treats fairness differently: demographic parity is a hard deployment constraint. No model ships if it fails. This post explains why, and how it works.

The specific failure mode

Content moderation datasets are biased. This is almost universally true, for several converging reasons:

Historical reports are not uniformly distributed. Platforms receive more reports from users who are most engaged with reporting tools, which skews toward certain demographics. Communities that distrust platforms report less. Communities that have historically been moderated more heavily are more represented in training labels.

Language patterns differ across demographic groups. Models trained to detect linguistic patterns associated with grooming may learn correlates that happen to be more common in speech patterns associated with certain ethnic, regional, or age groups, completely independent of actual risk.

Sampling bias in synthetic datasets. When real data is unavailable and researchers generate synthetic grooming datasets for training, the synthetic data reflects the assumptions of whoever wrote it.

The result: a model trained on historical moderation data may produce substantially different false positive rates across demographic groups. Applied to a production platform, this means some user populations are flagged at rates 2x, 3x, or higher than others — with no actual difference in risk.

Why this matters specifically for child safety

In most content moderation contexts, a false positive means an innocuous post is removed or a legitimate user is temporarily suspended. That's bad, but recoverable.

In child safety moderation, the stakes are higher on both sides. A false negative can leave a minor in an ongoing grooming interaction. A false positive does more than inconvenience a user: it can result in account termination and may even trigger law enforcement contact. The reputational, legal, and personal consequences of being incorrectly flagged as a potential predator are severe.

This creates a specific obligation: child safety AI needs to be demonstrably fair across demographic groups, not just accurate overall.

Regulators are arriving at the same conclusion. The EU DSA's systemic risk provisions (Articles 34-35) apply to very large online platforms and search engines. They require those services to assess and mitigate systemic risks arising from the design of their systems, including content moderation and other algorithmic systems, and risks to fundamental rights such as non-discrimination. A system that disproportionately flags users from minority groups creates exactly this kind of systemic risk.

Demographic parity as a deployment gate

Most AI fairness work happens after deployment: models are built, deployed, and then audited to see if they've produced disparate impact. By then, the harm is already in production.

SENTINEL takes a different approach: the fairness audit runs before deployment, and passing it is required.

Specifically, before any detection model is deployed on a tenant platform, SENTINEL runs a demographic parity evaluation across the platform's user population. The evaluation measures the false positive rate across demographic groups (age, gender, and any additional demographic signals available from the platform's user data).

If the false positive rate differs across groups by more than a configurable threshold (default: 10 percentage points), deployment is blocked. The model is not rolled out gradually, not deployed with a warning, and not deployed with a note in the audit log. It cannot ship.

The platform receives a fairness report explaining which demographic segment has the elevated false positive rate, the magnitude of the disparity, and recommendations for retraining or re-weighting the model.

Why a gate, not a dashboard

A common question: why not just show a fairness dashboard and let the platform decide?

Three reasons:

First, the decision should not be delegated to individual platform operators. A platform under regulatory scrutiny may face strong pressure to deploy quickly. A compliance gate removes the pressure. The system enforces the standard regardless of business timelines.

Second, fairness metrics are not intuitive, and disparate impact is easy to rationalize. "Our overall accuracy is 94% and the disparity is only 8 percentage points" sounds reasonable until you work through the base rates. If one group's false positive rate is 8%, an 8-point gap puts the other group at 16%, so it is incorrectly flagged at double the rate. At lower base rates the multiple is larger still. A gate makes the threshold explicit and enforceable.

Third, regulator expectations are moving toward architectural enforcement. The EU DSA and UK Online Safety Act both require risk mitigation measures, not just risk assessment. A deployment gate provides a documentable, auditable enforcement mechanism that a risk assessment dashboard does not.

Technical implementation

The fairness gate in SENTINEL works in three stages:

Pre-deployment evaluation: When a tenant installs a new detection model (or updates an existing one), SENTINEL runs the model against a balanced evaluation set drawn from the platform's historical behavioral data. The evaluation set is stratified by demographic group to ensure sufficient representation of each group for meaningful statistical comparison.

Disparity measurement: The gate computes the false positive rate for each demographic group, then the maximum pairwise disparity between groups. It also computes the false negative rate (missed true positives) across groups, because fairness cuts both ways: a model that misses threats in one demographic group while detecting them in others also fails the fairness criteria.

Pass/fail determination: If the maximum pairwise disparity in false positive rate or false negative rate exceeds the configured threshold, the model is marked as failed and cannot be deployed. The gate produces a detailed report: which groups were compared, what the measured rates were, and how far the model fell outside the threshold.
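The three stages can be sketched in plain Python, with two caveats about the assumptions. First, in fairness-literature terms, comparing false positive and false negative rates across groups is an error-rate parity check (close to equalized odds) rather than demographic parity, which compares positive prediction rates. Second, a gap measured in percentage points depends on the base rate. With the default 10-point threshold, rates of 1% and 9% would pass even though one group is flagged nine times as often, so the sketch also reports the ratio. Function names are hypothetical. A production gate would also account for sample size, for example with confidence intervals.

```python
from itertools import combinations


def error_rates(y_true, y_pred, groups) -> dict:
    """Per-group false positive rate (FPR) and false negative rate (FNR)."""
    counts: dict = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"fp": 0, "neg": 0, "fn": 0, "pos": 0})
        if truth:
            c["pos"] += 1
            c["fn"] += 0 if pred else 1
        else:
            c["neg"] += 1
            c["fp"] += 1 if pred else 0
    rates = {}
    for group, c in counts.items():
        # A stratified evaluation set should guarantee both classes in every group.
        if c["neg"] == 0 or c["pos"] == 0:
            raise ValueError(f"group {group!r} needs both positive and negative examples")
        rates[group] = {"fpr": c["fp"] / c["neg"], "fnr": c["fn"] / c["pos"]}
    return rates


def fairness_gate(y_true, y_pred, groups, max_gap_pp: float = 10.0) -> dict:
    """Fail when any pairwise FPR or FNR gap exceeds max_gap_pp percentage points."""
    rates = error_rates(y_true, y_pred, groups)
    worst = {"fpr": 0.0, "fnr": 0.0}
    for a, b in combinations(rates, 2):
        for metric in worst:
            gap = abs(rates[a][metric] - rates[b][metric]) * 100
            worst[metric] = max(worst[metric], gap)
    fprs = [r["fpr"] for r in rates.values()]
    lo, hi = min(fprs), max(fprs)
    # The ratio exposes gaps that look small in points but large in relative terms.
    fpr_ratio = 1.0 if hi == 0 else (hi / lo if lo > 0 else float("inf"))
    return {"passed": all(gap <= max_gap_pp for gap in worst.values()),
            "max_gap_pp": worst, "fpr_ratio": fpr_ratio, "rates": rates}
```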

What happens when a model fails

When a model fails the fairness gate, the platform receives a report and works to bring the model into compliance. The most common interventions are:

Reweighting the training data to correct for underrepresentation of particular groups.

Calibration adjustments to reduce systematic score inflation for specific groups.

Feature engineering: if specific features are driving disparate impact, those features may need to be removed or replaced.

In some cases, the training dataset is simply inadequate for producing a fair model, and the model needs to be retrained with better data. The fairness gate catches this before it becomes a production problem.
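The first intervention above, reweighting, can be as simple as inverse-frequency sample weights. The following is a minimal sketch with a hypothetical helper name:

```python
from collections import Counter


def inverse_frequency_weights(groups: list[str]) -> list[float]:
    """Per-example weights that give every group the same total weight.

    Weights average to 1.0, so the overall scale of the training loss is unchanged.
    """
    counts = Counter(groups)
    scale = len(groups) / len(counts)  # target total weight per group
    return [scale / counts[g] for g in groups]
```

Many scikit-learn estimators accept such weights through `fit(X, y, sample_weight=...)`. Balancing group weights does not guarantee balanced error rates, so the retrained model still has to pass the gate.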

The fairness-accuracy tradeoff

A frequent objection: doesn't imposing fairness constraints reduce overall accuracy?

In practice, for behavioral detection specifically, models that produce disparate impact are usually not more accurate overall; they are reflecting bias in the training data. Correcting for that bias tends to improve calibration across the board.

There is a theoretical tradeoff: in some scenarios, constrained optimization for fairness does reduce the optimized accuracy metric. SENTINEL's position is that this tradeoff is acceptable and, in the child safety context, required. A system with 93% accuracy and equitable false positive rates is better than a system with 95% accuracy that disproportionately flags one demographic group.

The regulatory and ethical case for accepting this tradeoff is strong. The legal case is becoming clearer as enforcement under DSA and OSA develops.

Connecting to audit infrastructure

The fairness gate doesn't operate in isolation. Every fairness evaluation run is logged in SENTINEL's tamper-evident audit log, including the model version, the evaluation dataset, the demographic groups evaluated, the measured disparity rates, and the pass/fail outcome.

This creates an auditable record that the platform took fairness evaluation seriously. When a regulator asks how the platform ensured its automated systems did not produce disparate impact, this log is the answer.

The fairness gate is part of SENTINEL's core platform. It applies to all detection models on all tenant platforms, with no opt-out.

SENTINEL is an open-source behavioral intelligence platform for child safety compliance. Free for platforms under $100k annual revenue.

GitHub: https://github.com/sentinel-safety/SENTINEL

What EU DSA and UK Online Safety Act require from your platform's child safety infrastructure

sentinel-safety — Sat, 25 Apr 2026 15:56:57 +0000

Building a platform where kids might be present? The regulatory landscape changed substantially in 2024 and 2025, and the compliance obligations are more specific than many developers realize.

This is a practical breakdown of what the EU Digital Services Act and UK Online Safety Act actually require at the technical level, and what compliant infrastructure looks like.

Are you in scope?

The EU Digital Services Act's child safety obligations (Article 28) apply to online platforms accessible to minors in the EU. "Accessible to minors" is the operative phrase: if children can access your service, you are likely in scope, whether or not you market to children. Under Article 19, platforms that qualify as micro or small enterprises are exempt from Article 28 unless they are designated as very large online platforms. The DSA came into full application in February 2024.

The UK Online Safety Act takes a similar approach: services "likely to be accessed by children" in the UK fall under child safety duties, and each service must determine this itself through a children's access assessment. Ofcom is publishing a categorization register in mid-2026. It will list the services that carry additional duties, not every service in scope.

The practical implication: any platform with social features, chat, or user-generated content that children might encounter is likely subject to at least some of these obligations. The "we're too small to worry about it" era is over.

What "proactive safety" actually means

Both the DSA and UK OSA require proactive rather than reactive child safety measures. This is a meaningful distinction.

Reactive safety means: a child reports something harmful, the platform reviews it and takes action. This is the baseline that most platforms operate at today.

Proactive safety means: the platform has systems in place to identify and intervene before harm occurs, based on risk assessment and systematic monitoring.

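Specifically, Article 28 of the DSA requires platforms accessible to minors to put in place appropriate and proportionate measures to ensure a high level of privacy, safety, and security for minors. Very large platforms must also assess and mitigate systemic risks to minors under Articles 34 and 35. The UK OSA requires services to be "safe by design," with proactive systems rather than purely response-based moderation.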

Keyword filters, even sophisticated ones, are primarily reactive. Predators have adapted to them. They avoid flagged terms, use coded language, and spend weeks or months establishing trust before anything overtly harmful appears in message content. By the time a keyword filter triggers, the grooming process has often already advanced significantly.

What satisfies proactive requirements is behavioral monitoring: watching how interactions evolve over time, identifying escalation patterns early, and surfacing risk before explicit content appears.

The audit trail requirement

Both regulations require platforms to demonstrate compliance, which means documentation and audit trails are mandatory, not optional.

Under the DSA, regulators can demand evidence of the measures a platform has taken under Article 28. Very large online platforms must also preserve the supporting documents for their risk assessments for at least three years (Article 34(3)).

The UK Online Safety Act requires similar audit readiness. Ofcom can impose fines of up to £18 million or 10% of qualifying worldwide revenue, whichever is greater, and audit evidence demonstrating proactive safety measures is central to establishing compliance.

For legal proceedings involving child exploitation or grooming, courts and law enforcement also require documentation: who was flagged, what behavioral evidence supported the flag, what action was taken, and when. This documentation needs to be tamper-evident, meaning the platform cannot alter records after the fact without detection.

Cryptographically chained audit logs with multi-year retention (SENTINEL defaults to seven years) are one way to satisfy both the regulatory audit requirements and the legal evidence standards.

The mandatory reporting infrastructure

Platforms operating in the US have mandatory reporting obligations under 18 U.S.C. § 2258A. If a platform obtains actual knowledge of apparent child sexual abuse material, it must report to the National Center for Missing and Exploited Children (NCMEC) CyberTipline. Knowingly and willfully failing to report is a criminal offense.

The NCMEC reporting process requires specific documentation: user information, timestamps, platform context, and the flagged content. Generating these evidence packages manually is error-prone and slow. Compliance infrastructure should automate this documentation so that when a platform files a report, the evidence package is ready.

The GDPR and COPPA intersection

Platforms serving users across jurisdictions face an intersection problem. COPPA (US) applies to platforms collecting personal information from children under 13. GDPR (EU) applies to personal data of EU residents, with heightened protections for children's data. The UK GDPR, retained after Brexit, maintains similar protections.

These frameworks have different requirements around data retention, parental consent, and erasure. A platform operating internationally needs to satisfy all of them simultaneously. The infrastructure for this includes jurisdiction-aware data retention policies, automated erasure workflows for deletion requests, parental consent mechanisms and records, and separation of data handling for minors versus adult users.

What compliant infrastructure actually needs

Pulling this together, a platform taking its child safety compliance obligations seriously needs:

Proactive behavioral detection that identifies escalation patterns before explicit harm occurs
Tamper-evident audit logs retained for at least seven years, cryptographically chained so records cannot be altered
Risk assessment documentation recording what the platform assessed and what mitigations were implemented
NCMEC CyberTipline evidence packages generated automatically when reportable content is identified
Jurisdiction-aware data handling covering GDPR, COPPA, and UK data protection requirements
Explainable moderation decisions so human moderators and regulators can understand why a user was flagged

The compliance gap

This infrastructure has historically been expensive to build and only accessible to large platforms. GDPR compliance consultants, behavioral detection systems, and audit infrastructure are not cheap.

This creates a genuine problem: the largest platforms have dedicated trust and safety teams and reasonable compliance budgets. Smaller platforms, often the ones children encounter in gaming, social, and creative communities, have almost nothing.

The DSA and UK OSA apply to smaller platforms too. The July 2026 Ofcom categorization register and continued DSA enforcement will make this increasingly difficult to ignore.

SENTINEL

We built SENTINEL as an open-source reference implementation for exactly this compliance stack. It ships with behavioral detection across four signal types (linguistic, graph, temporal, fairness), a demographic parity enforcement gate that blocks deployment if the detection model disproportionately flags any group, tamper-evident cryptographically chained audit logs with seven-year default retention, automated NCMEC CyberTipline evidence package generation, and jurisdiction-aware GDPR and COPPA data handling.

Every risk score comes with a plain-language explanation of the specific behavioral signals that triggered it, so moderators and regulators can understand and document the decision.

SENTINEL is free for platforms under $100k annual revenue and for all non-commercial and research use. It is fully open source.

GitHub: https://github.com/sentinel-safety/SENTINEL

Grooming operates over time. Here's how behavioral detection tracks it.

sentinel-safety — Sat, 25 Apr 2026 15:51:41 +0000

Every system designed to detect child grooming has the same problem: it's looking at the wrong unit of analysis.

Grooming doesn't happen in a message. It happens across weeks of messages — a slow accumulation of trust, a gradual shift in conversational register, an escalation in contact frequency that would look unremarkable if you sampled any individual session but reads clearly as a pattern when you step back and look at the whole trajectory.

When you build a detection system around message-level classification, you're designing for a problem that doesn't exist. Predators don't send a message that contains the whole grooming attempt. They send a hundred messages across a month, each one just slightly further than the last.

This post is about how temporal signal analysis changes the problem — and specifically, how SENTINEL's temporal layer works.

What keyword filters see

A keyword filter has a view like this:

[message] → [classifier] → flag / no flag

Each message is independent. The system has no memory. What happened in last Tuesday's session doesn't affect how it evaluates today's message.

This maps cleanly onto spam detection, where the signals that make a message spam are usually present in the message itself. It maps badly onto grooming, where the signal is the shape of behavior over time, not the content of individual messages.

A systematic review of the grooming detection literature by An et al. (arXiv:2503.05727, 2025) found that behavioral and temporal features are "consistently underexplored relative to linguistic features across the published literature" despite showing strong discriminative power in the studies that do use them. The architecture of most detection systems — trained on datasets of individual message excerpts — has driven the field toward a unit of analysis that the problem doesn't support.

What the behavioral evidence actually shows

Research on documented grooming cases consistently identifies a set of behavioral patterns that operate across sessions rather than within them:

Escalation velocity. Grooming tends to follow a measurable escalation trajectory: initial low-stakes contact, relationship development, increasing intimacy and exclusivity, then requests for personal information, image sharing, or off-platform contact. The rate at which this escalation moves is a signal. Fast escalation from a new contact is a very different pattern from a years-long friendship.

Contact frequency evolution. Early in grooming, contact is typically sporadic and positioned as casual. As trust develops, contact frequency increases and becomes more purposeful. The shift from irregular to regular to daily to multiple-times-daily contact, across sessions rather than within a single session, is a behavioral signature.

Session-bridging behavior. Predators often end sessions in ways that create continuity with the next one — leaving threads open, referencing the next time they'll talk, creating a sense of ongoing relationship rather than discrete conversations. This cross-session threading is observable as a temporal pattern.

Off-platform migration attempts. Requests to move a conversation from a platform to a private channel (WhatsApp, Signal, Snapchat) tend to cluster at a specific point in the grooming trajectory, after sufficient trust has been established but before the predator feels confident enough to escalate overtly on the monitored platform. The timing of this request, relative to the arc of the relationship, is a signal.

None of these patterns are visible in a message. They're only visible as trajectories.

How SENTINEL's temporal layer works

SENTINEL analyzes user behavior across four signal layers: linguistic, graph, temporal, and fairness. The temporal layer is specifically designed to capture the escalation patterns that cross-session behavioral analysis makes visible.

The core object is what we call the behavioral profile: a rolling window of signals accumulated across sessions for a given user-to-user relationship or a given user's behavior on the platform. This profile is updated with each new event and used to compute temporal features.
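The rolling-window profile can be sketched as a per-relationship event buffer that drops events once they age out of the window. This assumes events arrive in timestamp order and uses a hypothetical 28-day window; `BehavioralProfile` and `record` are illustrative names, not SENTINEL's API.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class BehavioralProfile:
    window: timedelta = timedelta(days=28)  # hypothetical rolling window
    events: deque = field(default_factory=deque)  # (timestamp, event_type)

    def update(self, ts: datetime, event_type: str) -> None:
        """Append an event and evict anything older than the window."""
        self.events.append((ts, event_type))
        cutoff = ts - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()


# One profile per directed (sender, recipient) relationship.
profiles: dict[tuple[str, str], BehavioralProfile] = {}


def record(sender: str, recipient: str, ts: datetime, event_type: str) -> BehavioralProfile:
    profile = profiles.setdefault((sender, recipient), BehavioralProfile())
    profile.update(ts, event_type)
    return profile
```

Temporal features are then computed from `profile.events` rather than from any single message.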

The key temporal signals SENTINEL tracks:

Escalation velocity. The rate at which the composite behavioral risk score is increasing over time. A user whose score has risen from 15 to 60 over three weeks looks very different from a user whose score reached 60 in a single session. The trajectory itself carries information.

Contact frequency gradient. How the rate of contact between two users has changed over time. In the first week, contact looks casual; by week four, there are multiple sessions per day. The gradient of this change is computed as a temporal signal.
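A contact frequency gradient can be approximated by bucketing a pair's contact events into weeks and measuring how the weekly count changes. The endpoint-difference gradient below is a deliberately simple assumption; a regression slope or a growth ratio would serve the same purpose. Function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEK_SECONDS = 7 * 24 * 3600


def weekly_contact_counts(timestamps: list[datetime], start: datetime, weeks: int) -> list[int]:
    """Bucket one pair's contact events into weekly counts starting at `start`."""
    counts = Counter(int((ts - start).total_seconds() // WEEK_SECONDS) for ts in timestamps)
    # Events before `start` or after the last week fall outside range(weeks) and are ignored.
    return [counts.get(week, 0) for week in range(weeks)]


def frequency_gradient(weekly: list[int]) -> float:
    """Average week-over-week change in contact count, from first to last week."""
    if len(weekly) < 2:
        return 0.0
    return (weekly[-1] - weekly[0]) / (len(weekly) - 1)
```

For weekly counts of 2, 4, 9, and 20, the gradient is 6 additional contacts per week, per week.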

Session boundary behavior. How sessions end and begin. Does the conversation pick up immediately where it left off? Are there explicit continuity markers? Does the session-ending message create an open loop that the next session closes?

Time-of-day pattern shifts. Contact shifting to unusual hours — late night, early morning — is a known escalation marker. SENTINEL tracks whether the distribution of contact times has changed over the observation window.
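One way to detect this shift is to compare hour-of-day histograms from an early and a recent window using total variation distance, which ranges from 0 (identical) to 1 (no overlap). The 22:00 to 05:59 "late" window and the function names are assumptions for illustration.

```python
from collections import Counter

# Assumption: 22:00-05:59 local time counts as "late night / early morning".
LATE_HOURS = set(range(22, 24)) | set(range(0, 6))


def hour_distribution(hours: list[int]) -> list[float]:
    """Fraction of contact events in each of the 24 hours of the day."""
    if not hours:
        raise ValueError("need at least one contact event")
    counts = Counter(hours)
    return [counts[h] / len(hours) for h in range(24)]


def time_of_day_shift(early_hours: list[int], recent_hours: list[int]) -> float:
    """Total variation distance between early and recent hour-of-day distributions."""
    p = hour_distribution(early_hours)
    q = hour_distribution(recent_hours)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def late_share(hours: list[int]) -> float:
    """Share of contact events falling in the late window."""
    if not hours:
        raise ValueError("need at least one contact event")
    return sum(h in LATE_HOURS for h in hours) / len(hours)
```

The distance says that the pattern moved; the late-hours share says which direction it moved in. Reporting both keeps the explanation specific.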

These signals are composited into a temporal risk contribution that's added to the overall behavioral risk score alongside the linguistic and graph signal contributions.
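A minimal sketch of how layer contributions might be combined, assuming each layer reports a contribution on a 0 to 100 scale. The weights are made up for illustration; SENTINEL's actual ensemble weights and combination rule are not described here.

```python
# Hypothetical weights for illustration only.
WEIGHTS = {"linguistic": 0.4, "graph": 0.3, "temporal": 0.3}


def composite_risk(contributions: dict[str, float]) -> float:
    """Weighted sum of per-layer contributions (each 0-100), clipped to 0-100.

    A missing layer contributes 0 rather than failing the whole score.
    """
    total = sum(weight * contributions.get(layer, 0.0) for layer, weight in WEIGHTS.items())
    return max(0.0, min(100.0, total))
```

With linguistic 50, graph 40, and temporal 80, this gives 56: a strong temporal signal can lift the score even when message content looks mild.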

The practical implication: why trajectory matters more than threshold

The classic approach to classification systems is to set a threshold: if the confidence score exceeds X, flag the content. For message-level classifiers, this makes sense — the score reflects confidence in the single message being malicious.

For temporal systems, the threshold intuition breaks down. The point is not whether today's message exceeds a threshold, but whether the shape of behavior over time matches known grooming trajectories.

SENTINEL scores users rather than messages. The risk score for a user reflects the accumulated weight of behavioral evidence across their entire history on the platform, with decay applied to older signals so that low-risk periods can recover a user's standing. A single suspicious message raises the score modestly. A sustained pattern of escalating contact, register shifts, and frequency increases over three weeks raises it substantially.
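The decay behavior can be sketched with exponential decay, where each signal's weight halves every `half_life` days. The 30-day half-life, the `Signal` record, and the weights in the usage note are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0  # hypothetical decay half-life


@dataclass
class Signal:
    age_days: float  # how long ago the signal was observed
    weight: float    # points this signal contributes when fresh


def user_risk_score(signals: list[Signal], half_life: float = HALF_LIFE_DAYS) -> float:
    """Accumulate evidence over a user's history, decaying older signals, capped at 100."""
    total = sum(s.weight * 0.5 ** (s.age_days / half_life) for s in signals)
    return min(100.0, total)
```

Under these assumptions, one fresh suspicious message worth 8 points yields a score of 8, while 21 daily signals worth 4 points each accumulate to roughly 67. A 40-point signal from 60 days ago has decayed to 10, which is how quiet periods let a user's standing recover.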

This means a moderator review queue populated by SENTINEL's scores looks different from a queue populated by per-message classification scores. The cases at the top of the queue are there because of a behavioral trajectory — because something has been building, not because one message happened to cross a threshold.

What explainability looks like for temporal signals

One of SENTINEL's design requirements is that every risk score comes with a structured plain-language explanation of the signals that contributed to it. For temporal signals, this looks like:

"Contact frequency between this user and [target] has increased 4.2x over the past 21 days. Time-of-day distribution has shifted toward late evening hours. Risk score increased 18 points this week driven primarily by contact frequency escalation."

This explanation structure matters for two reasons.

For human moderators: reviewing a case with this context is fundamentally different from reviewing a number. The moderator understands why the system flagged this user, can evaluate whether the behavioral trajectory matches their knowledge of the specific situation, and can make a better decision about what action, if any, is warranted.

For legal defensibility: if a moderation action is challenged — or if the platform needs to document its proactive detection methodology for DSA or UK Online Safety Act audit purposes — a structured explanation of the behavioral trajectory is far more useful than a classifier confidence score.

The data problem

The honest limitation of temporal detection is that it requires time-series data, which creates challenges that message-level systems don't face.

Most academic grooming detection datasets are collections of chat logs — often the PAN12 benchmark dataset — without full temporal context. Training and evaluating temporal detection systems requires longitudinal data: the full arc of a relationship over time, with session boundaries preserved. This data is scarce in research settings, because it requires either real platform data (which raises obvious consent and ethical issues) or synthetic data generation with careful attention to temporal realism.

SENTINEL ships with a synthetic research dataset of 50 annotated grooming conversations, designed for temporal analysis. It's a starting point; extending it is an explicit project goal. Academic researchers who want to build on this or contribute temporal datasets are specifically invited to engage — reach out at sentinel.childsafety@gmail.com.

Where this leaves detection systems

The practical implication is that building effective grooming detection requires choosing a unit of analysis that matches the phenomenon: not the message, not even the session, but the behavioral trajectory across sessions over time.

Systems that operate at the message level will always face the fundamental evasion problem: a predator who knows not to send any individual message that crosses a threshold can groom successfully while generating only normal-looking messages at each individual checkpoint. Systems that track behavioral trajectories can detect the escalation pattern even when no individual message is above threshold.

This is why SENTINEL's architecture is built around behavioral profiling rather than per-message classification — and why the temporal layer is central to the detection model rather than an add-on.

SENTINEL is open source and free for platforms under $100k annual revenue: https://github.com/sentinel-safety/SENTINEL

For questions, dataset contributions, or research collaboration: sentinel.childsafety@gmail.com

Inside SENTINEL: How 13 Microservices Detect Child Grooming by Behavior, Not Keywords

sentinel-safety — Sat, 25 Apr 2026 15:06:05 +0000

Keyword filters are a solved problem — solved by predators. They learned years ago to spell things differently, avoid flagged words, and simply groom slowly enough that no single message triggers a filter. The result: every major platform relying solely on keyword detection is running safety infrastructure that the most dangerous users have already mapped and bypassed.

SENTINEL takes a different approach. Instead of asking "does this message contain a bad word?", it asks "does this person's behavior, over time, resemble the trajectory of a predator approaching a minor?"

This post covers how that works at an engineering level.

The Four Signal Layers

SENTINEL's risk scoring is built on four independent signal layers feeding into a weighted ensemble:

1. Linguistic Analysis

NLP signals beyond keyword matching: sentiment trajectory across a conversation, escalation in intimacy markers, attempts to isolate the target from other users, and lexical similarity to known grooming conversation patterns. Models are trained on synthetic and research-derived datasets — never real user data.

2. Graph Analysis

Who is talking to whom, at what frequency, and with what structural characteristics. An account belonging to a 40-year-old with zero peer-age connections, making rapid friend requests to accounts flagged as likely minors, looks very different from an 18-year-old talking to their gaming friends. Graph signals detect coordinated targeting, unusual relationship formation rates, and network centrality anomalies.

3. Temporal Analysis

Grooming has a temporal signature. Conversation escalation follows recognizable progressions. Contact frequency patterns — how often someone messages a specific user, at what times, with what regularity — are informative signals independent of content. SENTINEL builds time-series models of behavioral escalation across sessions.

4. Fairness Audit Layer

Before any composite score is emitted, it passes through demographic parity checks. If the system would flag members of one demographic group at a materially different rate than another for identical behavior, the score is held until the discrepancy is resolved. This is enforced at runtime, not just during training.

The four layers produce a composite score from 0–100 with four tiers: trusted, watch, restrict, critical.
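The mapping from composite score to tier can be as simple as a sorted list of cutoffs. The cutoff values here (25, 50, 75) are assumptions, chosen only so that a score of 47 lands in "watch", consistent with the SDK example later in this post.

```python
from bisect import bisect_right

# Hypothetical lower bounds for watch, restrict, and critical.
TIER_CUTOFFS = [25, 50, 75]
TIERS = ["trusted", "watch", "restrict", "critical"]


def tier_for(score: float) -> str:
    """Map a 0-100 composite score to its tier; each cutoff belongs to the higher tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in [0, 100]")
    return TIERS[bisect_right(TIER_CUTOFFS, score)]
```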

The 13 Microservices

SENTINEL ships as a Docker Compose stack of 13 independent services. Each can be deployed incrementally — you do not need the full stack to get value.

Core Pipeline

1. event-ingestor — The entry point. Accepts raw events (messages, relationship changes, login events) via REST API or webhook. Normalizes, validates, and routes to the internal queue. Handles 10k+ events/second per instance.

2. nlp-scorer — Consumes events from the queue. Runs the linguistic analysis pipeline: tokenization, entity extraction, sentiment analysis, escalation detection. Emits linguistic signal scores to the aggregator.

3. graph-builder — Maintains the relationship graph in a vector database. On each new relationship event, updates edge weights, recalculates centrality, and flags anomalous graph formation. Uses incremental graph algorithms to avoid full recomputation.

4. temporal-tracker — Maintains per-user time-series of behavioral events. Computes rate-of-change signals, session frequency patterns, and contact escalation curves.

5. risk-aggregator — The ensemble. Pulls scores from the three signal services, applies the weighted ensemble model, runs the fairness gate, and writes the final risk score to the score store.

6. score-store — PostgreSQL-backed store for all risk scores with full history. Every score change is recorded with the contributing signals and their weights. The record contains not just "the score is 74" but which six signals contributed how much at what timestamp.

Compliance and Audit

7. audit-chain — Every moderator action, every automated action, every score change produces a cryptographically signed audit event. Events are chained (each includes the hash of the previous), making retroactive tampering detectable. Retained for 7 years, designed to serve as legal evidence.
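The chaining described for audit-chain can be sketched in a few lines: each entry stores the previous entry's hash, so editing any past event breaks verification from that point on. This sketch uses an HMAC with a secret key as a stand-in for signing; a production system would more likely use asymmetric signatures (for example Ed25519) so that verifiers do not need the signing key. `AuditChain` is a hypothetical name.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()


class AuditChain:
    def __init__(self, signing_key: bytes) -> None:
        self._key = signing_key
        self.entries: list[dict] = []

    def _sign(self, entry_hash: str) -> str:
        return hmac.new(self._key, entry_hash.encode(), hashlib.sha256).hexdigest()

    def append(self, event: dict) -> dict:
        """Link the event to the previous entry's hash and sign the result."""
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        entry_hash = _digest(prev, event)
        entry = {"event": event, "prev": prev, "hash": entry_hash, "sig": self._sign(entry_hash)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every link; any edited, reordered, or removed entry fails."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or _digest(prev, entry["event"]) != entry["hash"]:
                return False
            if not hmac.compare_digest(self._sign(entry["hash"]), entry["sig"]):
                return False
            prev = entry["hash"]
        return True
```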

8. compliance-engine — Per-tenant regulatory configuration. Handles GDPR right-to-erasure (soft-deletes with zero-knowledge proof of deletion), COPPA data retention limits, DSA reporting endpoint generation, and OSA audit export formatting.

9. alert-dispatcher — Watches the score store for threshold crossings. On critical tier transitions, fires webhook callbacks, generates moderator queue entries, and (if configured) prepares NCMEC CyberTipline-formatted evidence packages.

Federation Layer

10. federation-gateway — The privacy-preserving threat intelligence layer. When a user reaches critical tier, a cryptographic signal (not identifying data, not message content) is shared with opted-in peer platforms. Peers receive a risk signal for a pseudonymous identifier and can check for a matching user in their own system.

11. identity-resolver — Maps between external platform identifiers and SENTINEL's internal pseudonymous IDs. Raw platform user IDs never appear in logs, federation signals, or audit exports.
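A minimal sketch of keyed pseudonymization for identity-resolver, assuming a per-tenant secret and HMAC-SHA256. The 32-character truncation and the in-memory reverse map are illustrative choices; a real deployment would persist the mapping in access-controlled storage.

```python
import hashlib
import hmac


class IdentityResolver:
    """Maps platform user IDs to stable pseudonyms using a per-tenant secret."""

    def __init__(self, tenant_secret: bytes) -> None:
        self._secret = tenant_secret
        self._reverse: dict[str, str] = {}  # pseudonym -> platform ID, never exported

    def to_internal(self, platform_user_id: str) -> str:
        """Same input and secret always yield the same pseudonym."""
        digest = hmac.new(self._secret, platform_user_id.encode(), hashlib.sha256)
        pseudonym = digest.hexdigest()[:32]
        self._reverse[pseudonym] = platform_user_id
        return pseudonym

    def to_external(self, pseudonym: str) -> str:
        """Resolve back to the platform ID; raises KeyError for unknown pseudonyms."""
        return self._reverse[pseudonym]
```

Because the pseudonym is keyed, someone holding only logs or exports cannot recompute it from a known username without the tenant secret.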

Developer Interface

12. api-gateway — The external-facing REST API. Handles authentication, rate limiting, per-tenant routing, and SDK compatibility. The Python and Node.js SDKs talk exclusively to this service.

13. dashboard-service — The moderator web UI. Displays risk score queues, behavioral timelines, graph visualizations, and the human review workflow. Every score comes with a plain-language explanation of why, specifically to reduce moderator burnout from opaque black-box outputs.

How the Fairness Gate Works

Before any risk score leaves the risk-aggregator, it runs through the fairness gate:

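class FairnessViolation(Exception):
    """Raised when a score would breach the configured parity threshold."""

PARITY_THRESHOLD = 0.05  # default: 5% relative disparity, configurable per deployment

def fairness_gate(score, signals, demographic_proxy):
    # get_population_flag_rate and estimate_flag_rate are assumed to be defined in the risk-aggregator
    baseline_rate = get_population_flag_rate(demographic_proxy)
    predicted_rate = estimate_flag_rate(score, signals, demographic_proxy)

    if baseline_rate <= 0:
        # Relative disparity is undefined without a baseline; quarantine rather than divide by zero
        raise FairnessViolation("No baseline flag rate available for this group")

    disparity = abs(predicted_rate - baseline_rate) / baseline_rate

    if disparity > PARITY_THRESHOLD:
        raise FairnessViolation(
            f"Demographic parity violation: {disparity:.2%} disparity detected"
        )

    return score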

The threshold is configurable per deployment. When a FairnessViolation is raised, the score is quarantined and flagged for human review rather than propagated downstream. This is not a soft warning — it is a hard stop.

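The default threshold (5% disparity) is informed by NIST's AI Risk Management Framework, which calls for measuring and managing harmful bias; the framework itself does not prescribe a specific numeric cutoff.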

The Federation Protocol

The federation protocol is the most architecturally interesting piece. The goal: share threat intelligence across platforms without sharing any of the data that makes that intelligence sensitive.

The flow:

Platform A detects a critical-tier user. The federation-gateway generates a hashed, salted pseudonymous token from the user's behavioral signals.
The token is broadcast to opted-in peers via a gossip protocol over mutual TLS.
Platform B receives the token. Its identity-resolver checks whether any of its users produce a matching token under the shared salt.
If a match is found, Platform B's risk-aggregator applies a federation risk boost to that user's score.
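Steps 1, 3, and 4 can be sketched as follows (the gossip transport in step 2 is omitted). The source does not specify which behavioral signals feed the token or how they are canonicalized, so this example assumes an exact match on a sorted `key=value` string, keys it with HMAC-SHA256 using the shared salt, and applies a hypothetical 10-point boost.

```python
import hashlib
import hmac


def federation_token(features: dict[str, str], shared_salt: bytes) -> str:
    """Step 1: derive a keyed, non-reversible token from canonicalized behavioral features."""
    canonical = "|".join(f"{key}={features[key]}" for key in sorted(features))
    return hmac.new(shared_salt, canonical.encode(), hashlib.sha256).hexdigest()


def apply_federation_boost(
    local_users: dict[str, dict[str, str]],
    incoming_tokens: set[str],
    shared_salt: bytes,
    scores: dict[str, float],
    boost: float = 10.0,
) -> list[str]:
    """Steps 3 and 4 on the receiving platform: boost local users whose token matches.

    The matched IDs stay local; nothing is sent back to the originating platform.
    """
    matched = []
    for user_id, features in local_users.items():
        if federation_token(features, shared_salt) in incoming_tokens:
            scores[user_id] = min(100.0, scores[user_id] + boost)
            matched.append(user_id)
    return matched
```

Exact-hash matching only fires when both platforms derive identical feature values, so the choice and coarseness of those features largely determine how often a match occurs.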

No messages are shared. No usernames. No IPs. Platform A never learns which users on Platform B were matched. A predator banned on one platform gets flagged on another within minutes, with zero raw data crossing platform boundaries.

This is v1 of the federation protocol. The roadmap includes k-anonymity enhancements and a formal differential privacy layer.

Integration

The entire integration surface is the event ingestor API:

import hashlib

from sentinel_safety import SentinelClient

client = SentinelClient(api_key="your_key", tenant_id="your_tenant")

message_content = "example message text"  # raw text stays on your platform

# Send a message event
client.ingest_event({
    "event_type": "message",
    "sender_id": "user_abc",
    "recipient_id": "user_xyz",
    "platform_room_id": "room_123",
    "timestamp": "2026-04-25T12:00:00Z",
    # Content hash only: raw messages never leave your platform
    "content_hash": hashlib.sha256(message_content.encode()).hexdigest(),
})

# Get current risk score
score = client.get_risk_score("user_abc")
print(score.tier)       # e.g. "watch"
print(score.score)      # e.g. 47
print(score.reasoning)  # Plain-language explanation of contributing signals

Content is never sent to SENTINEL — only a hash, alongside behavioral metadata. NLP analysis runs client-side via the SDK; only extracted signal scores reach the ingestor. Raw messages never leave your platform.

Time to first integration: under an hour.

Tech Stack

Python 3.12, FastAPI for all internal services
PostgreSQL (score store, audit chain)
Redis (event queue, session state)
Qdrant (vector database for graph embeddings)
Docker Compose for local and self-hosted deployment
OpenTelemetry throughout for observability

No proprietary cloud services required. Deployable on any provider.

What Is Next

SENTINEL v1.0 is live: github.com/sentinel-safety/SENTINEL

The roadmap: federated learning enhancements (on-device model updates without data sharing), k-anonymity improvements to the federation protocol, expansion of the research dataset beyond the current v1 baseline, and formal academic publication of the behavioral detection methodology.

If you are building a platform where minors are present and have not yet implemented proactive safety measures, SENTINEL is designed so there is no excuse not to. Setup is a Docker Compose file and an API key. Compliance infrastructure is included. The audit trail is automatic.

Commercial licensing for platforms over $100k annual revenue: sentinel.childsafety@gmail.com

SENTINEL is built and maintained by the Sentinel Foundation. v1.0 released April 2026.

Fairness in Child Safety AI: Why Demographic Parity Audits Are Not Optional

sentinel-safety — Sat, 25 Apr 2026 12:36:48 +0000

There's a particular failure mode in content moderation AI that the industry doesn't talk about enough: the system works, on average, but it works badly for specific groups.

Keyword filters disproportionately flag African-American Vernacular English. Toxicity classifiers flag LGBTQ+ content at higher rates than equivalent heteronormative content. Spam detection penalizes non-native English speakers. These failures are documented, reproducible, and — when they happen in a child safety context — cause serious harm.

If your child safety detection system disproportionately flags minors from certain demographic groups as high-risk, you're not just making mistakes. You're making systematic mistakes that will expose specific communities to greater scrutiny, greater false suspicion, and potentially greater harm from over-moderation. At the same time, you may be under-flagging true positives in other demographic groups — leaving some children less protected.

This is why fairness enforcement in child safety AI is not optional. And it's why we built demographic parity audits as an architectural enforcement mechanism in SENTINEL — not a metric to monitor, but a gate that blocks deployment.

What Fairness Actually Means in Detection Systems

"Fairness" in ML has multiple mathematical definitions that are often in tension with each other. For a detection system, the most relevant concepts are:

Demographic parity (statistical parity): The system flags roughly equal proportions of each demographic group. If 5% of adult users overall are flagged as high-risk, demographic parity requires that roughly 5% of adult users from any given demographic group are also flagged.

Equal opportunity: The true positive rate is equal across groups. If the system correctly identifies 80% of genuine threats in one group, it should identify roughly 80% in all groups.

Equalized odds: Both true positive rate and false positive rate are equal across groups.

These three definitions often conflict. A system that achieves demographic parity may fail equal opportunity (if the base rate of actual threats differs across groups). A system optimized for equal opportunity may produce different false positive rates across groups.
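Formally, writing Ŷ for the flag decision, Y for the true label (1 = threat), and A for group membership, the three criteria can be stated as follows. This notation is introduced here for clarity and is not taken from SENTINEL's code.

```latex
\begin{aligned}
\text{Demographic parity:} \quad & P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b) \\
\text{Equal opportunity:} \quad & P(\hat{Y}=1 \mid Y=1, A=a) = P(\hat{Y}=1 \mid Y=1, A=b) \\
\text{Equalized odds:} \quad & P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=b), \quad y \in \{0, 1\}
\end{aligned}
```

A worked case shows the conflict. If 20% of group a and 10% of group b are genuine threats, a perfect classifier has a true positive rate of 1 and a false positive rate of 0 in both groups, so it satisfies equalized odds. It still flags 20% of group a and 10% of group b, which violates demographic parity.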

For SENTINEL, we selected demographic parity as the primary fairness gate, with supplementary monitoring of false positive parity. Here's the reasoning:

The false positive risk is the most immediately harmful. A false positive in a child safety context means a user who posed no threat is flagged, their account possibly restricted, and their behavior scrutinized. If false positive rates are higher for, say, Latino users than white users on the same platform, you've built a system that disproportionately harms a specific community. This is a direct civil rights issue.

The base rate problem is real but doesn't justify disparate impact. Some argue that demographic parity is too strict because different groups may have different base rates of predatory behavior. This argument is theoretically interesting and practically dangerous. Predatory behavior is a property of individuals, not groups. Any model that produces group-level predictions is producing biased predictions. Demographic parity is the correct standard.

What Fairness Failures Look Like in Practice

The research on algorithmic fairness in related domains gives us a detailed picture of how these failures happen:

Training data skew. If your training dataset of known grooming patterns was compiled primarily from English-language, North American platform data, your model has seen many examples of how grooming looks in that cultural-linguistic context. It has seen fewer examples of how it looks in other contexts. The result: lower true positive rates (worse recall) for grooming patterns from underrepresented communities, and potentially higher false positive rates as the model over-indexes on surface-level features that happen to correlate with certain communities.

Feature selection bias. If your linguistic signal layer uses n-gram or word embedding features trained on general-purpose English text, those features will not generalize equally across dialects, languages, and communication styles. A detection system trained to flag certain vocabulary patterns will flag non-standard English usage as anomalous — even when it's not anomalous for the users in question.

Label bias. If your training labels (confirmed grooming cases) were generated by a moderation team that itself had biased moderation practices, that bias propagates into the model. Garbage in, garbage out — but specifically, biased garbage in, systematically biased model out.

Feedback loops. A deployed model that produces disparate false positive rates creates its own future training data. More false positive labels from community X mean community X is more represented in the "flagged" training data, which reinforces the bias in the next model version.

How SENTINEL's Fairness Gate Works

SENTINEL implements fairness enforcement as a pre-deployment gate. Before any detection model — or update to an existing model — can be deployed, it must pass a demographic parity audit.

The audit process:

Step 1: Generate a fairness evaluation dataset.

This is a dataset of simulated or synthetic behavioral profiles representing a range of demographic groups, with ground-truth labels (threat / non-threat). The evaluation dataset is separate from the training data. It's designed to represent the demographic diversity of the platform's user base.

SENTINEL ships with a synthetic evaluation dataset. Platforms are encouraged to extend it with platform-specific data that represents their actual user demographics.

Step 2: Run the model against the evaluation dataset.

The model generates risk scores for all profiles in the evaluation set. Scores are recorded along with demographic labels.

Step 3: Compute parity metrics.

For each demographic group represented in the evaluation set, SENTINEL computes:

Flag rate (what percentage of profiles from this group are scored above the threshold)
False positive rate (among profiles labeled non-threat, what percentage are scored above threshold)
True positive rate (among profiles labeled threat, what percentage are scored above threshold)

Step 4: Apply parity thresholds.

SENTINEL's default thresholds: flag rate must be within ±20% of the overall flag rate for any group with sufficient representation. False positive rate must be within ±15% of the overall false positive rate.

These thresholds are configurable by platform. A platform may want stricter thresholds, or may have a different trade-off profile. The defaults are conservative.

Step 5: Gate or pass.

If any demographic group fails the parity threshold, the model cannot be deployed. This is enforced in the platform's model deployment pipeline — not a warning, not a recommendation, a hard block.

A fairness failure produces a detailed report: which group failed, what the actual vs. threshold disparity was, and what the model's overall performance metrics are. This report is included in the audit log.
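Steps 4 and 5 can be sketched as a check over per-group metrics. This assumes the ±20% and ±15% tolerances are relative to the overall rate (the text could also be read as percentage points) and uses a hypothetical minimum group size of 200 as the "sufficient representation" cutoff. Groups with no non-threat profiles skip the false positive check.

```python
FLAG_RATE_TOLERANCE = 0.20  # relative to the overall flag rate
FPR_TOLERANCE = 0.15        # relative to the overall false positive rate
MIN_GROUP_SIZE = 200        # hypothetical "sufficient representation" cutoff


class FairnessGateFailure(Exception):
    def __init__(self, failures: list[tuple[str, str, float]]) -> None:
        super().__init__(f"{len(failures)} parity check(s) failed: {failures}")
        self.failures = failures


def relative_gap(group_value: float, overall_value: float) -> float:
    if overall_value == 0:
        return 0.0 if group_value == 0 else float("inf")
    return abs(group_value - overall_value) / overall_value


def check_parity(per_group: dict, overall: dict, group_sizes: dict) -> list[tuple[str, str, float]]:
    """Return every (group, metric, gap) that exceeds its tolerance."""
    failures = []
    for group, m in per_group.items():
        if group_sizes.get(group, 0) < MIN_GROUP_SIZE:
            continue  # too few profiles to estimate rates reliably
        gap = relative_gap(m["flag_rate"], overall["flag_rate"])
        if gap > FLAG_RATE_TOLERANCE:
            failures.append((group, "flag_rate", gap))
        if m.get("fpr") is not None and overall.get("fpr") is not None:
            gap = relative_gap(m["fpr"], overall["fpr"])
            if gap > FPR_TOLERANCE:
                failures.append((group, "fpr", gap))
    return failures


def gate(per_group: dict, overall: dict, group_sizes: dict) -> None:
    """Hard block: raise if any sufficiently represented group fails."""
    failures = check_parity(per_group, overall, group_sizes)
    if failures:
        raise FairnessGateFailure(failures)
```

Collecting every failure, rather than stopping at the first, produces the per-group detail the audit report needs.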

Why It's Enforced, Not Monitored

An earlier iteration of SENTINEL had fairness metrics as a monitoring dashboard — visible, reported, but not blocking. This turned out to be insufficient.

The problem with monitoring-only approaches is that fairness failures in production are hard to detect and slow to surface. A 15% disparity in false positive rates between demographic groups might not be visible in aggregate moderation metrics. It won't be visible at all if the platform's reporting doesn't disaggregate by demographic group. And even if it's visible, the feedback loop from "we detected a fairness problem" to "we retrained and deployed a fixed model" is measured in weeks or months.

During that time, the biased model is flagging users at disparate rates. Real users are experiencing real harm.

Pre-deployment enforcement changes the dynamic entirely. A model that fails the fairness audit never reaches users. The harm never happens. The feedback loop is closed before deployment, not after.

This is the same logic as testing in software development. You can find bugs in production through monitoring, or you can find bugs before production through testing. Testing is better.

The Contribution Fairness Requirement

SENTINEL's fairness gate applies not just to the core platform, but to any behavioral detection model contributed to the project.

The CONTRIBUTING.md is explicit: any pull request that modifies detection logic must include a fairness analysis. This means contributors need to run the fairness evaluation suite on their modifications and include the results in their PR. PRs that improve detection performance at the cost of fairness parity will not be merged.

This creates a useful forcing function for contributors: if your modification to the linguistic signal layer improves detection accuracy overall but creates a 25% disparity in false positive rates for non-English speakers, you know before you submit the PR. You can iterate on the modification before it gets to review.

The Harder Questions

Demographic parity as a gate answers one question: is the model systematically unfair? But it doesn't answer harder questions that any mature child safety system will eventually confront:

What demographic categories should be measured? Race, ethnicity, gender, age, language, nationality? The choice of demographic categories is itself a value judgment, and not all categories are measurable from platform data. SENTINEL's default evaluation framework includes age (adult/minor), detected language, and account age as proxies. Platform-specific deployments can extend this with additional categories.

What if higher-risk groups produce legitimate base rate differences? This question is often raised as a challenge to demographic parity. Our answer: base rate differences in predatory behavior are not established empirically at the population level. They may be artifacts of over-policing — certain communities are more surveilled, so more of their bad actors are caught, so training data is skewed. Demographic parity is the correct standard precisely because we cannot trust historical label data to accurately represent true base rates.

What about intersectionality? A model might be fair when analyzed by race and fair when analyzed by gender, but systematically unfair for users who are both a particular race and a particular gender. Intersectional fairness analysis is computationally expensive but increasingly recognized as necessary. SENTINEL's roadmap includes intersectional parity analysis as a future enhancement.
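A small constructed example shows why single-attribute checks are not enough. With equal-sized cells and the made-up flag counts below, every single-attribute flag rate is 10%, yet two intersectional cells sit at 20% and two at 0%. The attribute names echo SENTINEL's default proxies (detected language, account age); the data is synthetic and `flag_rates_by` is a hypothetical helper.

```python
from collections import defaultdict


def flag_rates_by(rows, attrs: list[str]) -> dict[tuple, float]:
    """rows: list of (attributes dict, flagged bool). Flag rate per combination of attrs."""
    counts: dict[tuple, list[int]] = defaultdict(lambda: [0, 0])  # key -> [flagged, total]
    for attributes, flagged in rows:
        key = tuple(attributes[a] for a in attrs)
        counts[key][0] += flagged
        counts[key][1] += 1
    return {key: f / n for key, (f, n) in counts.items()}


def make_cell(language: str, account_age: str, flagged: int, total: int = 10):
    """Synthetic cell: `flagged` of `total` profiles are flagged."""
    attrs = {"language": language, "account_age": account_age}
    return [(attrs, i < flagged) for i in range(total)]


rows = (make_cell("en", "new", 2) + make_cell("en", "old", 0)
        + make_cell("es", "new", 0) + make_cell("es", "old", 2))
```

The cost grows with the number of cells: k attributes with c categories each yield c^k groups, and each group needs enough samples to estimate its rates.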

Why This Matters for Regulatory Compliance

Both the EU DSA and the UK Online Safety Act contain non-discrimination provisions. Under the DSA, algorithmic decision systems must be non-discriminatory. Under the Online Safety Act, Ofcom can require platforms to demonstrate that their proactive safety systems do not produce disparate impact.

These provisions are currently underspecified — regulators haven't yet issued detailed technical guidance on what fairness compliance looks like in practice. But the direction of travel is clear.

A platform that can show pre-deployment fairness audits, documented parity metrics, and a hard gate preventing deployment of biased models is in a significantly stronger compliance position than one that monitors disparate impact in production and responds reactively.

The best time to build fairness enforcement is before your platform is large enough to attract regulatory scrutiny. By then, you've already accumulated deployment history, training data, and potentially liability.

Building It Right From the Start

If you're building a new moderation system, or evaluating whether to integrate SENTINEL, the key takeaway is this: fairness enforcement is architecturally much easier when it's built in from the beginning.

Retrofitting demographic parity audits onto an existing system requires:

Auditing training data for demographic representation
Building fairness evaluation datasets you probably don't have
Modifying deployment pipelines to include fairness gates
Retraining models that may have been in production for years

If you start with a fairness-gate-enforced framework, you never accumulate this technical debt. Every model trained on your platform, from day one, has been evaluated for demographic parity. Every deployment decision has been documented.

For child safety specifically, this matters more than in almost any other domain. The population you're protecting — children — is exactly the population least able to advocate for themselves when they're being harmed by algorithmic bias. Building fair systems is an architectural decision, not an aspiration.

SENTINEL's fairness gate and demographic parity audit are open source and fully documented. GitHub: https://github.com/sentinel-safety/SENTINEL. The fairness evaluation framework is documented in CONTRIBUTING.md.

Privacy-Preserving Threat Federation: How Platforms Can Share Intelligence Without Sharing Data

sentinel-safety — Sat, 25 Apr 2026 10:26:56 +0000

Here's a problem that every trust and safety team eventually runs into: predators don't stay on one platform.

A person caught grooming children on Platform A gets banned and their account deleted. They simply open a new account on Platform B. Platform B has no way of knowing. Platform B's moderation team starts from zero. The predator has a clean slate.

This isn't a hypothetical. It's documented behavior. In the child safety space, researchers have consistently found that serial offenders operate across multiple platforms simultaneously, maintaining different personas for different targets. When one platform bans an account, another platform absorbs the risk.

The obvious solution is for platforms to share information. But sharing information between platforms creates serious privacy problems. How do you tell Platform B about a threat without giving Platform B access to Platform A's users' private communications?

This is the federation problem, and solving it correctly is genuinely difficult.

What Platforms Have Tried

The main existing approach to cross-platform threat sharing in the child safety space is perceptual hash matching, most famously via PhotoDNA (developed by Microsoft), matched against hash lists of known CSAM maintained by NCMEC, the IWF, and others.

The idea is elegant: take a known piece of CSAM, compute a hash that captures its visual "fingerprint," share that fingerprint without sharing the image. When another platform encounters a matching image, they can detect it without ever seeing the original.
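PhotoDNA's algorithm is proprietary and far more robust than anything shown here. As a stand-in, the sketch below uses the simple, publicly known average-hash (aHash) technique to show the core idea: an edited copy of an image lands within a small Hamming distance of the original's hash, while a cryptographic hash such as SHA-256 changes completely. The 8x8 input, the function names, and the distance threshold of 5 are illustrative assumptions.

```python
import hashlib

# Toy "average hash" (aHash). Assumes the image is already downscaled to 8x8 grayscale (0-255).
Image = list[list[int]]


def average_hash(pixels: Image) -> int:
    """64-bit perceptual hash: one bit per pixel, set when the pixel is at or above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def is_match(a: int, b: int, max_distance: int = 5) -> bool:
    """Perceptual hashes match when few bits differ, not only when they are identical."""
    return hamming_distance(a, b) <= max_distance


def sha256_hex(pixels: Image) -> str:
    """Cryptographic hash for contrast: any pixel change yields an unrelated digest."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]   # smooth gradient
brightened = [[p + 3 for p in row] for row in original]               # slightly edited copy
checkerboard = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]
```

Here the brightened copy still matches the original's perceptual hash while its SHA-256 digest does not, and the unrelated checkerboard is far away. That property is what lets platforms share fingerprints instead of images.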

This works extremely well for CSAM detection. It's been responsible for tens of millions of reports globally.

But hash matching has a hard limitation: it only works for content that has already been identified. It cannot detect new offenders. It cannot detect behavioral patterns. And it cannot detect grooming, which typically involves no CSAM at all in its early stages — just ordinary conversation.

For behavioral threat intelligence, no equivalent infrastructure exists. When Platform A bans a groomer after a three-month escalation pattern, Platform B learns nothing.

What "Federation" Could Mean

In the security world, threat intelligence federation is well established. Platforms and standards such as MISP and STIX/TAXII allow organizations to share indicators of compromise (IoCs), attack signatures, and threat actors' tactics, techniques, and procedures (TTPs). The question is whether this model can be adapted to behavioral threat intelligence for child safety in a privacy-preserving way.

The challenge is that behavioral threat intelligence is inherently more sensitive than, say, a malicious IP address. A behavioral threat record might contain:

Temporal patterns (when this account was active)
Linguistic pattern features (style of communication)
Relationship graph structure (how many connections, what frequency)
Account metadata (creation date, device fingerprints)

If transmitted in raw form, any of this is personal data subject to GDPR, CCPA, and other privacy regulations. Platform A cannot simply export these records and transmit them to Platform B.

The Cryptographic Signature Approach

SENTINEL's federation layer uses a different model: instead of sharing behavioral data, it shares cryptographic signatures derived from behavioral patterns.

Here's the key insight: you don't need to share the data to share the threat signal. You need to share a representation of the data that is:

Specific enough that Platform B can detect the same threat
Generic enough that it doesn't reveal personal information about the individual who generated it
Mathematically bound to the actual behavioral evidence, so it can't be fabricated

In practice, the flow works like this:

Platform A detects a confirmed grooming pattern. Their system generates a behavioral signature: a vector representation derived from the multi-dimensional behavioral profile of this pattern — the combination of linguistic drift, temporal escalation, graph structure, and contact dynamics that together characterized this threat. This vector is not reversible back to the original behavioral data. It's a representation, not a copy.

Platform A submits this signature (with no user PII attached) to the SENTINEL federation service. The federation service stores signatures only — it never receives raw behavioral data.

Platform B, when analyzing a new user's behavior, computes behavioral vectors from their own platform's data and queries the federation service: does this user's behavioral profile match any known threat signature?

If there's a match above a confidence threshold, Platform B gets an alert: "This account's behavioral pattern matches a confirmed threat signature from a federated platform." The alert contains no information about which platform generated the signature, who the original account was, or what specifically they did.

Platform B's moderators review the alert, examine the current platform's own behavioral data, and make an independent determination.
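A minimal sketch of this flow follows, under loud assumptions: the feature names, the shared seeded random projection, and the 0.9 cosine threshold are hypothetical, and the sketch omits the platform-specific entropy described under Risk 1 below. It shows the shape of the protocol: a reduced vector is stored with only a platform ID and a timestamp, and similarity queries return alerts that carry no provenance. Reducing dimensions is lossy, though lossiness alone is not a formal guarantee that a vector cannot be partially inverted.

```python
import math
import random
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical behavioral features, assumed to be z-scores relative to a platform baseline.
FEATURES = ["linguistic_drift", "temporal_escalation", "graph_fanout",
            "contact_frequency", "secrecy_requests", "platform_switch_requests"]
SIGNATURE_DIM = 4

# Assumption: every participant uses the same seeded projection (6 features -> 4 dims).
_rng = random.Random(2026)
PROJECTION = [[_rng.gauss(0.0, 1.0) for _ in FEATURES] for _ in range(SIGNATURE_DIM)]


def make_signature(profile: dict[str, float]) -> list[float]:
    """Project the profile to fewer dimensions (lossy) and normalize to unit length."""
    x = [profile[name] for name in FEATURES]
    v = [sum(w * xi for w, xi in zip(row, x)) for row in PROJECTION]
    norm = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / norm for c in v]


def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # both inputs are unit vectors


@dataclass
class SignatureRecord:
    vector: list[float]      # the signature itself
    submitted_by: str        # platform ID, kept for federation governance only
    submitted_at: datetime   # submission time
    # Deliberately no user ID, username, message content, or other PII.


@dataclass
class FederationService:
    records: list[SignatureRecord] = field(default_factory=list)

    def submit(self, vector: list[float], platform_id: str) -> None:
        self.records.append(SignatureRecord(vector, platform_id, datetime.now(timezone.utc)))

    def query(self, vector: list[float], threshold: float = 0.9) -> Optional[dict]:
        """Return a provenance-free alert if any stored signature is similar enough."""
        best = max((cosine(vector, r.vector) for r in self.records), default=-1.0)
        if best < threshold:
            return None
        return {"message": "This account's behavioral pattern matches a confirmed threat "
                           "signature from a federated platform.",
                "similarity": round(best, 3)}


# Usage: Platform A submits a confirmed case; Platform B queries with its own users' profiles.
service = FederationService()
profile_a = dict(zip(FEATURES, [2.1, 1.8, 0.9, 1.5, 2.4, 1.2]))
service.submit(make_signature(profile_a), platform_id="platform-a")
alert = service.query(make_signature({k: 1.1 * v for k, v in profile_a.items()}))  # same pattern, stronger
no_alert = service.query(make_signature({k: -v for k, v in profile_a.items()}))    # opposite pattern
```

The alert dictionary names no platform and no account, so Platform B's moderator has a reason to look, not a verdict.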

What the Federation Service Knows

The federation service in this architecture is privacy-minimal:

It stores behavioral signature vectors (not reversible to personal data)
It knows which platform submitted each signature (for federation governance)
It knows when each signature was submitted
It does not know: any platform's users, any user's identity, the content of any conversation, or any PII

This means a compromise of the federation service does not expose user data from any platform. An attacker who gains access to the signature database gets a set of high-dimensional vectors with no direct link to individuals.

The Trust Problem in Federation

Cryptographic privacy is necessary but not sufficient. There's also a trust problem: for federation to work, Platform B needs to be able to trust that signatures submitted by Platform A represent real, confirmed threats — not false positives, not fabricated data.

This is where federation governance matters.

SENTINEL's federation model is opt-in, and participation requires a signed federation agreement. Platforms that submit signatures to the federation service attest that the signature represents a confirmed grooming pattern — not just a flagged behavior, not an unreviewed algorithmic output, but a human-reviewed, confirmed case.

This creates accountability. A platform that submits low-quality signatures (high false positive rate, or — worse — deliberately weaponized signals targeting legitimate users) can be suspended from the federation.

The governance model draws on the Information Sharing and Analysis Center (ISAC) model used in cybersecurity. The key adaptations for the child safety context are:

Stricter confirmation requirements before signature submission (human review required, not just algorithmic flag)
Lower confidence threshold for alerts (a match is treated as a reason to investigate, not a reason to ban)
Right to appeal — users flagged via federated signatures have a clear process to challenge the match
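The submission and suspension rules above can be expressed as simple checks on the federation side. In the sketch below, the data shapes, the 20% overturn-rate limit, and the 20-review minimum are hypothetical; the actual criteria are whatever the signed federation agreement specifies. A match itself would still only route to a moderator queue.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    platform_id: str
    human_reviewed: bool       # a moderator looked at the case
    confirmed_grooming: bool   # and confirmed it, rather than relying on an algorithmic flag


@dataclass
class PlatformStanding:
    matches_reviewed: int = 0
    matches_overturned: int = 0   # downstream reviews or appeals that found no threat
    suspended: bool = False


def accept_submission(sub: Submission, standing: PlatformStanding) -> bool:
    """Accept only human-confirmed cases from platforms in good standing."""
    return sub.human_reviewed and sub.confirmed_grooming and not standing.suspended


def record_review(standing: PlatformStanding, overturned: bool,
                  max_overturn_rate: float = 0.2, min_reviews: int = 20) -> None:
    """Track how often a platform's signatures lead to overturned matches; suspend on poor quality."""
    standing.matches_reviewed += 1
    standing.matches_overturned += int(overturned)
    if standing.matches_reviewed >= min_reviews:
        rate = standing.matches_overturned / standing.matches_reviewed
        if rate > max_overturn_rate:
            standing.suspended = True  # reinstatement is a governance decision, not automatic
```

Tying suspension to overturned matches, including successful appeals, connects the right to appeal directly to the accountability of the submitting platform.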

Privacy Preservation in Practice

Three specific privacy risks need to be addressed in any federation system:

Risk 1: Signature linkage across platforms.

If behavioral signatures are deterministic — the same behavioral data always produces the same signature — then Platform A and Platform B could cross-reference their signature databases to identify users who have accounts on both platforms. This is a privacy violation even if neither platform has the other's data.

SENTINEL's signatures are non-deterministic: they incorporate platform-specific entropy, so the same underlying behavioral pattern produces different signatures on different platforms. Cross-platform account linkage is not possible from signatures alone.

Risk 2: Inference of sensitive attributes from behavioral patterns.

Behavioral patterns can inadvertently encode demographic information. A behavioral detection system trained on datasets that skew toward certain demographic groups might produce signatures that are statistically correlated with age, ethnicity, or gender. This is both a fairness problem and a privacy problem.

SENTINEL addresses this through the fairness gate: before any behavioral model is used to generate federation signatures, it must pass a demographic parity audit. If signature generation is found to be correlated with protected attributes, the model cannot be deployed.

Risk 3: Abuse of the federation alert mechanism.

If Platform B receives an alert saying "this user matches a known threat signature," that alert itself is sensitive. It needs to be treated as confidential information — not disclosed to the flagged user (which would tip off the threat) and not retained longer than necessary for the moderation review.

SENTINEL's federation alerts are ephemeral: they're generated on query, delivered to Platform B's moderation queue, and not stored by the federation service. Platform B's own retention policies apply to the alert records, subject to the same erasure handling as other behavioral data.
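On the receiving platform, alert handling could look like the sketch below: alerts live only in the platform's own queue, expire after the platform's retention period, and are removed by erasure requests. The class names and the 30-day retention used in the usage are hypothetical; actual retention is set by each platform's own policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueuedAlert:
    account_id: str        # platform-local ID; never sent to the federation service
    similarity: float
    received_at: datetime


class ModerationQueue:
    """Platform-side store for federation alerts; the federation service keeps no copy."""

    def __init__(self, retention: timedelta):
        self.retention = retention
        self.alerts: list[QueuedAlert] = []

    def add(self, alert: QueuedAlert) -> None:
        self.alerts.append(alert)

    def purge_expired(self, now: datetime) -> int:
        """Drop alerts older than the retention period; return how many were removed."""
        kept = [a for a in self.alerts if now - a.received_at < self.retention]
        removed = len(self.alerts) - len(kept)
        self.alerts = kept
        return removed

    def erase_account(self, account_id: str) -> None:
        """Apply an erasure request to alert records, like any other behavioral data."""
        self.alerts = [a for a in self.alerts if a.account_id != account_id]
```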

What This Looks Like for a Small Platform

If you're a small platform considering federation participation, the operational picture is:

You submit signatures only for confirmed grooming cases that have been reviewed by a human moderator.
Your federation queries run asynchronously in the background as part of normal behavioral analysis — you don't need to integrate a separate federation lookup step.
When you receive a federation match alert, it appears in your moderation queue alongside SENTINEL's own behavioral risk score for that user. The alert is one data point; your moderator reviews it alongside your platform's own evidence.
Beyond the signatures you submit and the behavioral vectors your queries send, nothing from your platform (no user identifiers, no conversation content, no PII) is ever transmitted to the federation service or to any other platform.

The federation participation agreement is included in SENTINEL's repository. It covers the confirmation requirements, dispute resolution, and grounds for suspension.

The Larger Picture

The fundamental problem — predators migrating between platforms — won't be solved by any single platform improving its own detection. It requires coordination.

The CSAM hash matching infrastructure (PhotoDNA / NCMEC / IWF) shows that privacy-preserving cross-platform coordination is achievable at scale. The same principle — share signatures, not content — can be extended to behavioral threat intelligence.

The infrastructure to do it exists. The open question is whether the industry will adopt it. That adoption requires trust, governance, and tooling that makes participation low-friction for small platforms that don't have dedicated T&S engineering teams.

That's what SENTINEL's federation layer is designed to be: production-grade behavioral threat federation that a small platform can deploy in an afternoon, participate in responsibly, and benefit from within days.

SENTINEL is an open-source behavioral intelligence platform for child safety compliance. The federation module is part of the core platform. Free for platforms under $100k revenue. GitHub: https://github.com/sentinel-safety/SENTINEL

Building Compliance-Native Child Safety: What DSA and UKOSA Actually Require

sentinel-safety — Sat, 25 Apr 2026 10:10:00 +0000

If you operate a platform where users under 18 might be present — a game, a community forum, a tutoring app, a messaging tool — there's a good chance you've heard that child safety regulations are getting stricter.

You may have heard "DSA" and "UK Online Safety Act" mentioned. You might have a vague sense that you're probably in scope for something. But the actual requirements are surprisingly opaque, especially for smaller teams who can't afford a compliance consultant.

This post walks through what DSA and UKOSA actually require, what counts as "reasonable" compliance for a small platform, and what you'd need to build (or deploy) to demonstrate it.

Two Laws. One Problem.

The EU Digital Services Act (DSA) became applicable to all in-scope platforms on 17 February 2024. It applies to any online intermediary offering services to users in the EU, regardless of where the platform is headquartered.

The UK Online Safety Act (UKOSA) is being implemented in phases: its core illegal-content and child-safety duties are already in force, and additional duties tied to platform categorization are expected from July 2026. It applies to platforms with UK users, again regardless of where you're based.

Both laws operate on a tiered system. The obligations on an indie gaming studio with 10,000 users are dramatically different from those on a Very Large Online Platform (VLOP) such as Facebook or Instagram. But here's what smaller teams often miss: the baseline obligations apply to everyone, including platforms that have never thought of themselves as being "in scope."

What Both Laws Actually Require (Baseline)

1. You must have a process for content moderation

Both laws require platforms to have documented, functioning processes for dealing with illegal content and harmful content involving minors. "We don't really have chat" is not a defense if your platform has any user-to-user communication feature.

What this means practically:

A written moderation policy that users can read
A mechanism for users to report content
A process for reviewing and acting on reports
Documentation that you actually follow the process

2. You must have a way to report CSAM to authorities

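If child sexual abuse material appears on your platform (or is generated or distributed through it), you are required to report it. In the US, this means reporting to the NCMEC CyberTipline. In the EU, the DSA (Article 18) requires hosting services to promptly notify law enforcement or judicial authorities when they become aware of a suspected criminal offence threatening a person's life or safety; a dedicated EU Centre for child sexual abuse reports has been proposed under separate legislation, not the DSA itself.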

What this means practically:

You need tooling that can generate evidence packages in the NCMEC reporting format (hash, timestamp, account information, content)
You need a documented retention policy for evidence that might be needed in legal proceedings
You need to know what your reporting obligations are in the jurisdictions you operate in
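The sketch below covers only the integrity part of an evidence package: a SHA-256 content hash, a UTC capture timestamp, the account identifier, and a retention date, so that stored copies can later be checked against what was captured. It is a hypothetical internal record, not the CyberTipline's actual submission format, which is defined by NCMEC's own documentation; the field names and the retention parameter are assumptions.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def build_evidence_manifest(content: bytes, account_id: str, captured_at: datetime,
                            retention: timedelta) -> dict:
    """Hypothetical internal evidence record; NOT the CyberTipline submission schema."""
    captured_utc = captured_at.astimezone(timezone.utc)
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
        "account_id": account_id,
        "captured_at": captured_utc.isoformat(),
        "retain_until": (captured_utc + retention).isoformat(),
    }


def verify_evidence(content: bytes, manifest: dict) -> bool:
    """Check that stored content still matches the hash taken at capture time."""
    return hashlib.sha256(content).hexdigest() == manifest["sha256"]
```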

3. You must implement child safety measures if minors are in your user base

This is where both laws get more specific. If you have users under 18 (or if you have any reason to believe you might), you're required to implement proportionate measures to prevent harmful contact with those users.

The key word is "proportionate." A platform with 500 users has different obligations than TikTok. But "proportionate" does not mean "none."

The July 2026 Ofcom Categorization Register

In July 2026, Ofcom will publish the UK's first Platform Categorization Register under UKOSA. This register will categorize platforms into tiers — and different tiers have different mandatory obligations.

Here's what this means for smaller platforms: many platforms that currently believe they're below the threshold will discover they're not.

The categorization criteria include:

Number of UK users
Whether the platform allows user-to-user communication
Whether users under 18 are present (or "likely to be present")
Whether the platform has content that is "regulated content" under UKOSA

If you run a gaming platform with voice or text chat, and you have any UK users, you should be planning now for what category you might fall into.

What "Proactive" Child Safety Looks Like

Both laws nudge platforms toward proactive (not just reactive) safety measures. Reactive safety is: someone reports abuse, you respond. Proactive safety is: you detect patterns of potential abuse before a report is filed.

For most platforms, proactive safety has historically meant one thing: keyword filtering. Block certain words and phrases, flag messages that contain them.

There are two problems with this approach, and regulators are increasingly aware of both:

Problem 1: Keyword filters don't catch grooming.

Grooming is a process that unfolds over weeks or months. It typically begins with entirely normal, benign conversation: building trust, establishing a relationship, escalating gradually. The vocabulary of early-stage grooming looks nothing like the terms on a typical keyword list. By the time a keyword triggers, significant harm has often already begun.

Problem 2: Keyword filters are a legal liability, not just a safety gap.

A keyword filter that misses a grooming pattern, when documented, looks like a system that was designed to fail. When a regulator or plaintiff examines your moderation logs, "we had a keyword filter" is not a strong defense. "We monitored behavioral patterns and escalated to human moderators when patterns suggested risk" is a much stronger one.

What Behavioral Detection Actually Requires

If you want to implement behavioral detection — the approach that actually works against grooming — here's what you need:

1. Multi-session context

A single-message classifier cannot detect grooming. You need a system that tracks how conversations evolve over time — across multiple sessions, over days or weeks. The risk signal comes from the trajectory, not any individual message.

2. Relationship graph tracking

Grooming often involves one adult establishing a relationship with one minor. Coordinated grooming (multiple accounts approaching the same minor) is also documented. You need to track who is talking to whom, with what frequency, and how those relationships develop.
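A minimal who-talks-to-whom structure is sketched below: a directed graph keyed by (sender, recipient) that records first-contact time and message count, plus one hypothetical check for coordinated approaches (several adult accounts making first contact with the same minor within a short window). The window, the three-account threshold, and how accounts come to be classified as adults are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

Edge = tuple[str, str]  # (sender, recipient)


class ContactGraph:
    """Directed contact graph with first-contact time and message count per edge."""

    def __init__(self) -> None:
        self.first_contact: dict[Edge, datetime] = {}
        self.message_count: dict[Edge, int] = defaultdict(int)

    def record_message(self, sender: str, recipient: str, sent_at: datetime) -> None:
        edge = (sender, recipient)
        self.first_contact.setdefault(edge, sent_at)  # first recorded time; assumes in-order arrival
        self.message_count[edge] += 1

    def new_contacts(self, recipient: str, since: datetime) -> set[str]:
        """Senders whose first message to `recipient` was at or after `since`."""
        return {s for (s, r), t in self.first_contact.items() if r == recipient and t >= since}


def coordinated_approach(graph: ContactGraph, minor_id: str, adult_ids: set[str], now: datetime,
                         window: timedelta = timedelta(days=7), min_accounts: int = 3) -> bool:
    """True if at least min_accounts adult accounts first contacted the minor within the window."""
    return len(graph.new_contacts(minor_id, now - window) & adult_ids) >= min_accounts
```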

3. Explainability for human moderators

Regulators in both the EU and UK have begun asking: when your system flags a user, what does your human moderator actually see? An opaque score from 0 to 100 is not sufficient. Moderators need to understand why a flag was triggered — both for accuracy (to make good decisions) and for accountability (to document that human review occurred).
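A minimal sketch of score-plus-explanation output follows, assuming four signals normalized to [0, 1] and named after the behavioral dimensions this series describes (linguistic drift, temporal escalation, graph structure, contact dynamics). The weights and explanation strings are hypothetical. The moderator sees which signals drove the score, in plain language, alongside the number.

```python
# Hypothetical weights (summing to 100) and plain-language descriptions per signal.
SIGNAL_WEIGHTS = {
    "linguistic_drift": 30.0,
    "temporal_escalation": 25.0,
    "graph_structure": 25.0,
    "contact_dynamics": 20.0,
}
DESCRIPTIONS = {
    "linguistic_drift": "conversation shifted toward secrecy and personal topics",
    "temporal_escalation": "contact frequency rose steadily across sessions",
    "graph_structure": "account contacts many minors who share no other connections",
    "contact_dynamics": "repeated requests to move the conversation elsewhere",
}


def explain(signals: dict[str, float], top_n: int = 2) -> tuple[int, list[str]]:
    """Return a 0-100 score and the top contributing signals as plain-language reasons."""
    contributions = {name: weight * signals.get(name, 0.0)
                     for name, weight in SIGNAL_WEIGHTS.items()}
    score = round(sum(contributions.values()))
    top = sorted(contributions, key=contributions.get, reverse=True)[:top_n]
    reasons = [f"{DESCRIPTIONS[name]} (+{contributions[name]:.0f} points)"
               for name in top if contributions[name] > 0]
    return score, reasons
```

A linear, additive score keeps each reason's point value honest: the contributions shown to the moderator sum to the score they see.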

4. Audit logs with forensic integrity

Both the DSA and UKOSA require that you be able to demonstrate your compliance process to regulators. This means tamper-evident audit logs (records in which any alteration after the fact can be detected) that show when a risk was detected, what action was taken, and by whom.

For legal proceedings (criminal cases, civil suits), chain-of-custody matters. Your audit log is evidence. It needs to be treated like evidence from the start.
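Hash chaining is one standard way to make a log tamper-evident. In the minimal sketch below, each entry stores the SHA-256 hash of its own contents plus the previous entry's hash, so editing any past entry breaks verification from that point on. The entry layout and event names are hypothetical. Chaining alone cannot stop someone who rewrites the entire chain and recomputes every hash, so deployments typically also anchor the latest hash somewhere the log's operator cannot alter.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(body: dict, prev_hash: str) -> str:
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, event: str, actor: str, detail: dict) -> dict:
        """Add an entry whose hash covers its contents and the previous entry's hash."""
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"event": event, "actor": actor, "detail": detail,
                "at": datetime.now(timezone.utc).isoformat()}
        entry = {"body": body, "prev": prev, "hash": _entry_hash(body, prev)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every link; any edited, removed, or reordered entry breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != _entry_hash(entry["body"], prev):
                return False
            prev = entry["hash"]
        return True
```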

5. Data handling compliance

You can't build a behavioral detection system without collecting and processing behavioral data. That data collection must comply with the GDPR (the EU GDPR for EU users, the UK GDPR for UK users), with COPPA (if you have US users under 13), and with your privacy policy.

This means:

A documented lawful basis for processing behavioral data for safety purposes
Erasure handling — when a user exercises their right to deletion, the audit log must be preserved for legal compliance but personal data must be removed
Data minimization — you should process the minimum necessary behavioral signals, not archive raw message content
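One common pattern for erasure handling is pseudonymization, sketched below as an assumption rather than as SENTINEL's actual mechanism. The audit log only ever records random tokens, while the mapping from token to user and the user's personal data live in a separate store. An erasure request deletes the mapping and the personal data, and the log entries (and any hash chain over them) stay byte-for-byte unchanged. The class and method names are hypothetical.

```python
import secrets
from typing import Optional


class PseudonymVault:
    """Keeps real identities and personal data apart from the audit log (hypothetical design)."""

    def __init__(self) -> None:
        self._token_for: dict[str, str] = {}       # user ID -> random pseudonym
        self._personal_data: dict[str, dict] = {}  # user ID -> personal fields

    def token(self, user_id: str) -> str:
        """Return the user's pseudonym, creating a random one on first use."""
        if user_id not in self._token_for:
            self._token_for[user_id] = secrets.token_hex(16)
        return self._token_for[user_id]

    def store(self, user_id: str, data: dict) -> None:
        self._personal_data[user_id] = data

    def resolve(self, token: str) -> Optional[str]:
        """Map a token back to a user ID, or None once the user has been erased."""
        return next((uid for uid, t in self._token_for.items() if t == token), None)

    def erase(self, user_id: str) -> None:
        """Right to erasure: drop the mapping and personal data; log entries keep only the token."""
        self._token_for.pop(user_id, None)
        self._personal_data.pop(user_id, None)
```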

The Compliance Burden on Small Platforms

Here's the frustrating reality: the compliance requirements above are legitimate and proportionate. They exist to protect children. But implementing all of them from scratch is expensive — easily $500K+ in engineering cost for a full custom implementation.

This is where the market has a gap. Large platforms (Meta, Discord, Roblox, TikTok) have entire trust and safety engineering teams. Small platforms — indie game studios, EdTech startups, community forums — have maybe one person who is also doing three other jobs.

Ofcom, the regulator responsible for UKOSA, has acknowledged this gap. Its guidance notes that smaller platforms can use third-party tooling to meet their obligations, provided that tooling is well documented and auditable. The regulation doesn't require you to build from scratch; it requires you to have a functioning, defensible compliance posture.

What This Looks Like in Practice

We built SENTINEL as an open-source answer to this gap. Here's what it covers:

Behavioral risk scoring: Four signal layers (linguistic, graph, temporal, and fairness) that monitor conversation patterns across sessions — not just individual messages. Each score comes with a plain-language explanation so moderators understand what triggered it.

Fairness gates: Before any detection model can be deployed, it must pass a demographic parity audit. If it disproportionately flags any demographic group, it cannot ship. This prevents the disparate-impact problems that have plagued algorithmic moderation systems.

Tamper-evident audit logs: 7-year retention with cryptographic chaining — every entry is a chain link that can be verified. Designed for legal proceedings, not just internal monitoring.

NCMEC CyberTipline reporting: Generates evidence packages in the required format. If you have a mandatory reporting obligation, the tooling to meet it is built in.

GDPR/COPPA erasure handling: When a deletion request comes in, personal data can be removed from behavioral records without destroying the audit log's forensic integrity.

Federation (opt-in): Platforms can share threat signatures without sharing raw messages. A predator banned on one platform gets flagged for moderator review on federated platforms, without any platform ever seeing another platform's user data.

It's free for platforms under $100k annual revenue. Most indie studios, most EdTech startups, most community forums qualify.

Where to Start

If you're a small platform trying to figure out your compliance posture:

Establish whether you're in scope. If you have users in the EU or UK and any user-to-user communication feature, you probably are. If you have users under 18 (or can't rule it out), the child safety provisions apply.
Document what you have. Even if it's just a keyword filter and a report-abuse button, document it. A documented process is a defense. An undocumented one is not.
Understand the July 2026 UKOSA deadline. If you operate a UK-facing platform, start tracking Ofcom's categorization register announcements now. The obligations for higher-tier platforms take effect in Q3 2026.
Look at open-source tooling. You don't need to build a moderation platform from scratch. SENTINEL (and other tools in the ROOST ecosystem) are specifically designed to give smaller platforms access to the same caliber of safety infrastructure that large platforms have built internally.

One More Thing

The regulatory environment is not going to get simpler. The EU's AI Act introduces additional requirements for AI-based content moderation systems. The UK is actively expanding UKOSA. US state laws are proliferating.

But the fundamental requirement is not that complex: you need to demonstrate that you took child safety seriously, that you had proportionate processes, and that you documented what you did. That's achievable for a small platform with the right tools.

SENTINEL is an open-source behavioral intelligence platform for child safety compliance. Free for platforms under $100k revenue. GitHub: https://github.com/sentinel-safety/SENTINEL