Content Moderation System Design: Scaling Trust and Safety
Introduction#
Every platform that accepts user-generated content needs a content moderation system. At scale, this means processing billions of posts, images, and videos daily — combining ML classifiers, hash-based matching, human review, and policy enforcement into a coherent pipeline.
This guide covers end-to-end content moderation system design, from ingestion to appeals.
Functional Requirements#
- Moderate text, images, and video content in real time
- Flag or remove content that violates platform policies
- Route uncertain content to human reviewers
- Support an appeals process for incorrect decisions
- Maintain an audit trail of all moderation actions
- Allow policy updates without redeploying the system
Non-Functional Requirements#
- Latency: Real-time moderation for text (under 200ms), near-real-time for images (under 2 seconds)
- Throughput: Handle hundreds of thousands of content items per second
- Accuracy: Minimize false positives (wrongly removed content) and false negatives (missed violations)
- Scalability: Support growth from millions to billions of daily items
- Consistency: Same content should receive the same moderation decision globally
High-Level Architecture#
```
Content Upload → Pre-filter → ML Pipeline → Decision Engine → Action
                                                  ↓              ↓
                                       Human Review Queue    Audit Log
                                                  ↓
                                          Appeals Process
```
Content Ingestion and Pre-Filtering#
Before running expensive ML models, apply cheap pre-filters:
- Hash-based matching: Compare content hashes against known-bad databases
- Blocklist matching: Check text against keyword and regex blocklists
- Duplicate detection: Identify previously moderated content via perceptual hashing
- Rate limiting: Flag accounts uploading at abnormal rates
These filters catch a significant percentage of violations at minimal compute cost.
Hash-Based Matching#
PhotoDNA#
Microsoft's PhotoDNA generates a hash of an image that is robust to resizing, cropping, and color changes. Platforms compare uploaded images against databases of known illegal content (e.g., NCMEC database for CSAM).
Video Hashing#
Video is decomposed into keyframes, and each frame is hashed independently. Some systems also hash audio tracks to catch violations in spoken content.
Perceptual Hashing#
Unlike cryptographic hashes, perceptual hashes produce similar outputs for visually similar images. This catches minor modifications designed to evade exact-match detection.
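The idea can be illustrated with the simplest perceptual hash, the "average hash": threshold each pixel of a downscaled grayscale image against the mean brightness, and compare hashes by Hamming distance. This is a toy sketch assuming the image is already reduced to an 8x8 grid; real systems use more robust schemes such as pHash or PhotoDNA.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of an 8x8 grayscale grid (values 0-255).
    Each bit is 1 if the pixel is brighter than the mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance means visually similar."""
    return bin(a ^ b).count("1")

# A minor pixel edit barely moves the hash, unlike a cryptographic hash
# where any change flips roughly half the output bits.
img = [[10 * (r + c) for c in range(8)] for r in range(8)]
tweaked = [row[:] for row in img]
tweaked[0][0] += 5
```

Matching then becomes "Hamming distance below a threshold" rather than exact equality, which is what defeats minor evasion edits.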
ML Classification Pipeline#
Text Moderation#
Text classifiers analyze content for:
- Hate speech and slurs
- Harassment and bullying
- Spam and scam content
- Self-harm and violence
- Misinformation (more nuanced, often requires specialized models)
Modern systems use transformer-based models fine-tuned on platform-specific labeled data. Multilingual support requires separate models or multilingual architectures.
Image Moderation#
Image classifiers detect:
- Nudity and sexual content
- Violence and gore
- Drugs and weapons
- Text embedded in images (requires OCR followed by text classification)
CNNs and vision transformers are common choices. Models output a confidence score per violation category.
Video Moderation#
Video moderation combines:
- Frame sampling: Extract frames at regular intervals and run image classifiers
- Audio transcription: Convert speech to text and run text classifiers
- Temporal analysis: Some violations only become apparent across multiple frames
Video moderation is the most compute-intensive — large platforms process millions of hours of video daily.
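The frame-sampling approach above can be sketched as follows. The `extract_frame` and `classify_frame` callables are stand-ins for a real decoder and image classifier; taking the maximum score per category reflects that a violation visible in any single sampled frame should drive the decision.

```python
def sample_timestamps(duration_s: float, interval_s: float = 1.0) -> list[float]:
    """Timestamps (in seconds) at which to extract frames for classification."""
    return [i * interval_s for i in range(int(duration_s // interval_s) + 1)]

def moderate_video(duration_s: float, extract_frame, classify_frame) -> dict:
    """Keep the max per-category score across sampled frames."""
    scores: dict[str, float] = {}
    for t in sample_timestamps(duration_s):
        frame = extract_frame(t)
        for category, score in classify_frame(frame).items():
            scores[category] = max(scores.get(category, 0.0), score)
    return scores
```

The sampling interval is the main cost lever: denser sampling catches brief violations but multiplies classifier invocations, which is why long-form video often gets coarser sampling plus audio transcription as a second signal.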
Confidence Thresholds and the Decision Engine#
Each ML model outputs a confidence score between 0 and 1. The decision engine maps these scores to actions using configurable thresholds:
| Confidence Range | Action |
|---|---|
| 0.95 - 1.00 | Auto-remove, notify user |
| 0.70 - 0.95 | Send to human review queue |
| 0.30 - 0.70 | Reduce distribution (shadow restrict) |
| 0.00 - 0.30 | Allow |
These thresholds are tuned per category and per market. A platform may be more aggressive on CSAM (auto-remove at 0.80) and more conservative on satire (only auto-remove at 0.99).
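A minimal decision engine is a lookup of per-category thresholds followed by a cascade of comparisons. The values below mirror the table and the stricter CSAM example; the exact numbers and category names are illustrative.

```python
# Per-category thresholds (illustrative; tuned per category and per market).
THRESHOLDS = {
    "default": {"remove": 0.95, "review": 0.70, "restrict": 0.30},
    "csam":    {"remove": 0.80, "review": 0.50, "restrict": 0.20},
}

def decide(category: str, score: float) -> str:
    """Map a model confidence score to a moderation action."""
    t = THRESHOLDS.get(category, THRESHOLDS["default"])
    if score >= t["remove"]:
        return "auto_remove"
    if score >= t["review"]:
        return "human_review"
    if score >= t["restrict"]:
        return "reduce_distribution"
    return "allow"
```

Keeping the thresholds in data rather than code is what allows per-market tuning without a redeploy, which the policy engine below generalizes.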
The Policy Engine#
Moderation rules change frequently. A policy engine decouples rules from code:
- Policies are defined as configuration (JSON/YAML rules or a DSL)
- Rules reference model output labels and confidence scores
- Different policies apply to different regions, content types, or user tiers
- Policy changes take effect immediately without deployment
Example policy rule:
```yaml
rule: block_hate_speech
condition: hate_speech_score > 0.90 AND region IN [US, EU]
action: remove
notify: true
appeal_eligible: true
```
Human Review Queue#
When ML confidence is uncertain, content enters the human review queue:
Priority Ranking#
- Severity: Potential CSAM or imminent violence is reviewed first
- Reach: Content from accounts with large followings is prioritized
- Recency: Newer content is reviewed before older content
- Model confidence: Items closer to the decision boundary get reviewed sooner
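The four ranking signals above can be combined into a single sort key. This is a sketch with illustrative severity ranks and weights; severity dominates, then distance from the decision boundary, then reach and recency as tie-breakers.

```python
# Illustrative severity ordering: lower rank is reviewed first.
SEVERITY_RANK = {"csam": 0, "violence": 1, "hate_speech": 2, "spam": 3}

def review_priority(item: dict) -> tuple:
    """Sort key for the human review queue: lower tuples pop first."""
    # Midpoint of the 0.70-0.95 review band; items nearest it are most uncertain.
    boundary_dist = abs(item["score"] - 0.825)
    return (
        SEVERITY_RANK.get(item["category"], 9),
        round(boundary_dist, 3),
        -item["followers"],      # larger reach first
        -item["uploaded_at"],    # newer content first
    )

def next_to_review(queue: list[dict]) -> dict:
    return min(queue, key=review_priority)
```

In practice each signal would be weighted and the queue backed by a priority heap, but the tuple ordering captures the precedence the list above describes.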
Reviewer Workflow#
- Reviewer sees the content with ML predictions and relevant context
- Reviewer selects a violation category or marks as "no violation"
- Decision is recorded and fed back to the ML training pipeline
- Reviewer labels become ground truth for model improvement
Reviewer Wellbeing#
Content reviewers are exposed to disturbing material. Systems must:
- Blur graphic content by default, requiring explicit click to reveal
- Limit exposure time per session
- Provide mental health support
- Rotate reviewers across content categories
Appeals Process#
Users whose content is removed can appeal:
- User submits appeal with optional explanation
- Appeal is routed to a different reviewer (never the original)
- Reviewer re-evaluates with full context including the user's explanation
- Decision is final (or escalated to a senior review panel)
- Appeal outcomes feed back into model training
Tracking appeal overturn rates per category is a key quality metric. A high overturn rate signals that the ML model or thresholds need adjustment.
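The overturn-rate metric is a simple per-category ratio over appeal records. A minimal sketch, assuming each appeal record carries a category and an overturned flag:

```python
def overturn_rate(appeals: list[dict]) -> dict[str, float]:
    """Fraction of appeals that reversed the original decision, per category."""
    totals: dict[str, int] = {}
    overturned: dict[str, int] = {}
    for a in appeals:
        totals[a["category"]] = totals.get(a["category"], 0) + 1
        if a["overturned"]:
            overturned[a["category"]] = overturned.get(a["category"], 0) + 1
    return {c: overturned.get(c, 0) / n for c, n in totals.items()}
```

Slicing the same ratio by model version makes regressions visible after a model rollout.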
False Positive Handling#
False positives — legitimate content incorrectly flagged — erode user trust. Strategies to minimize them:
- Multi-model ensemble: Require agreement from multiple models before auto-removing
- Context awareness: A medical education video showing anatomy should not be flagged as nudity
- User reputation scoring: Established accounts with clean history get higher thresholds
- Gradual enforcement: Reduce distribution before outright removal
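Two of these strategies reduce to small, testable functions. The agreement count and the reputation bonus below are illustrative values, not recommendations:

```python
def ensemble_auto_remove(scores: list[float], threshold: float = 0.95,
                         min_agree: int = 2) -> bool:
    """Auto-remove only when at least min_agree models clear the threshold."""
    return sum(s >= threshold for s in scores) >= min_agree

def adjusted_threshold(base: float, account_age_days: int,
                       prior_violations: int) -> float:
    """Raise the auto-remove bar for long-standing accounts with a clean history."""
    if account_age_days > 365 and prior_violations == 0:
        return min(base + 0.03, 0.99)
    return base
```

Both mechanisms trade a little recall for precision on exactly the content where false positives are most costly: borderline items from established users.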
Real-Time vs Batch Moderation#
Real-Time#
Applied at upload time. Essential for:
- Live streams
- Chat messages
- Content that could go viral within minutes
Batch#
Applied retroactively. Useful for:
- Re-scanning existing content when new policies are introduced
- Running improved models against historical content
- Detecting coordinated campaigns that only become visible in aggregate
Most platforms use both. Real-time catches obvious violations; batch catches the rest.
Scaling Considerations#
Compute#
- Text classification is cheap — thousands of items per GPU per second
- Image classification is moderate — hundreds per GPU per second
- Video is expensive — may require dedicated GPU clusters
Storage#
- Store moderation decisions and audit logs indefinitely for compliance
- Cache hash databases in memory for fast lookup
- Use CDN-level integration to block removed content globally
Geographic Distribution#
- Deploy moderation services in multiple regions for latency
- Ensure compliance with local regulations (different rules per jurisdiction)
- Route human review to reviewers who speak the content's language
Metrics and Monitoring#
Key metrics to track:
- Precision and recall per violation category
- Human review queue depth and average review time
- Appeal overturn rate per category and per model version
- Time to action from upload to moderation decision
- False positive rate for high-confidence auto-removals
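Precision and recall per category are computed from audited samples where the true label is known. A minimal sketch, assuming each record is a (flagged, actually_violating) pair:

```python
def precision_recall(decisions: list[tuple[bool, bool]]) -> tuple[float, float]:
    """decisions: (flagged, actually_violating) pairs from an audited sample."""
    tp = sum(1 for f, v in decisions if f and v)        # correct removals
    fp = sum(1 for f, v in decisions if f and not v)    # false positives
    fn = sum(1 for f, v in decisions if not f and v)    # missed violations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision tracks false positives (wrongly removed content) and recall tracks false negatives (missed violations), so the pair maps directly onto the accuracy requirement stated at the top of this guide.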
Summary#
| Component | Role |
|---|---|
| Pre-filter | Hash matching, blocklists, deduplication |
| ML Pipeline | Text, image, and video classification |
| Decision Engine | Maps confidence scores to actions via thresholds |
| Policy Engine | Configurable rules decoupled from code |
| Human Review | Handles uncertain cases, feeds training data |
| Appeals | Allows users to contest decisions |
Content moderation at scale is a continuous balancing act between user safety, free expression, and operational cost. The best systems combine fast automated detection with thoughtful human oversight.
Article #204 · Codelit System Design Series