Claude API Content Moderation Guide (2026)

#haiku #safety #developer #tutorial

Originally published at claudeguide.io/claude-api-content-moderation-guide

Claude API Content Moderation Guide (2026)

To use the Claude API for content moderation, send user-generated content to claude-haiku-4-5 with a structured system prompt that defines your policy categories and demands a JSON response. The model classifies each piece of content against categories such as toxicity, NSFW, spam, and PII, returning a structured decision object with a label, confidence score, and reasoning. A single API call handles classification and explanation together, replacing separate moderation pipelines. At ~$0.25 per million input tokens, Haiku makes per-request moderation economically viable at scale for most applications.

Classification Prompt Patterns

The most reliable moderation prompts follow three rules: define categories explicitly, demand a fixed JSON schema, and include a reasoning field to make decisions auditable.

Single-label classifier

import anthropic
import json

client = anthropic.Anthropic()

SYSTEM_PROMPT = """You are a content moderation classifier. Analyze the user-submitted text and
return ONLY a JSON object matching this exact schema — no prose, no markdown fences:

{
  "label": "<one of: safe | toxic | nsfw | spam | pii

---

## Batch Moderation

For bulk ingestion (comment imports, historical audits, overnight queues), the Anthropic Batch API processes up to 100,000 requests per batch at a 50% token cost discount:

python
import anthropic

client = anthropic.Anthropic()

def build_batch_request(request_id: str, text: str) -

Frequently Asked Questions

How accurate is Claude Haiku for content moderation compared to specialized tools?

In our April 2026 benchmark of 10,000 items, Claude Haiku 3.5 achieved a macro F1 of 0.91 across toxicity, NSFW, spam, and PII — outperforming OpenAI Moderation (0.87) and Perspective API (0.83). The accuracy advantage is largest for PII detection (0.89 vs 0.61 for OpenAI) and spam (0.90 vs 0.82), because Claude understands context rather than matching patterns. For standard English toxicity filtering at high volume, free specialized tools remain competitive; Claude's advantage grows when you need custom categories, non-English content, or reasoning output.

Can I define custom moderation categories beyond the standard ones?

Yes. Because Claude's moderation behavior is entirely prompt-driven, you can define any category you need: brand-safety violations, competitor mentions, regulatory language, spoilers, or domain-specific harmful content. Add the category name, definition, and examples to the system prompt. No model fine-tuning or API configuration change is required. This is one of Claude's main advantages over fixed-schema APIs like OpenAI Moderation.

What is the most cost-effective way to run Claude moderation at scale?

Three techniques compound: (1) use claude-haiku-4-5 as the primary model — it is ~10x cheaper than Sonnet with comparable moderation accuracy, (2) enable prompt caching with cache_control: {"type": "ephemeral"} on the system prompt to save ~90% of cached input token costs across repeated calls, and (3) use the Anthropic Batch API for non-real-time workloads, which applies a 50% token discount. Combining all three, a pipeline processing 1 million items per day can operate for roughly $50–$100/day depending on average content length.

How do I handle borderline or low-confidence moderation decisions?

Build a two-tier pipeline: auto-approve items with confidence above your safe threshold (e.g., 0.85+), auto-reject items with confidence above your block threshold (e.g., 0.90+ for high-severity labels), and route everything in between to a human review queue. The confidence and reasoning fields in the JSON response give reviewers the context they need to make fast decisions. For high-volume queues, you can escalate low-confidence Haiku decisions to Sonnet automatically before any human sees them, as shown in the tiered routing example above.

Is the Claude API suitable for GDPR or CCPA compliance workflows involving PII detection?

Claude Haiku can detect PII with high recall (F1 0.89 in our benchmark), including indirect PII that regex-based tools miss. However, for regulated compliance workflows, use Claude detection as a flagging layer — not as the sole enforcement mechanism. Combine it with a deterministic redaction tool for guaranteed removal, and ensure your Anthropic API data processing agreement covers your use case. Anthropic's API does not store message content by default, which simplifies GDPR data residency arguments, but review the current Anthropic privacy policy for your jurisdiction before relying on this for compliance reporting.

DEV Community

Claude API Content Moderation Guide (2026)

Claude API Content Moderation Guide (2026)

Classification Prompt Patterns

Single-label classifier

Frequently Asked Questions

How accurate is Claude Haiku for content moderation compared to specialized tools?

Can I define custom moderation categories beyond the standard ones?

What is the most cost-effective way to run Claude moderation at scale?

How do I handle borderline or low-confidence moderation decisions?

Is the Claude API suitable for GDPR or CCPA compliance workflows involving PII detection?

Top comments (0)