Content Moderation System Design: Scaling Trust and Safety
Introduction#
Every platform that accepts user-generated content needs a content moderation system. At scale, this means processing billions of posts, images, and videos daily — combining ML classifiers, hash-based matching, human review, and policy enforcement into a coherent pipeline.
This guide covers end-to-end content moderation system design, from ingestion to appeals.
Functional Requirements#
- Moderate text, images, and video content in real time
- Flag or remove content that violates platform policies
- Route uncertain content to human reviewers
- Support an appeals process for incorrect decisions
- Maintain an audit trail of all moderation actions
- Allow policy updates without redeploying the system
Non-Functional Requirements#
- Latency: Real-time moderation for text (under 200ms), near-real-time for images (under 2 seconds)
- Throughput: Handle hundreds of thousands of content items per second
- Accuracy: Minimize false positives (wrongly removed content) and false negatives (missed violations)
- Scalability: Support growth from millions to billions of daily items
- Consistency: Same content should receive the same moderation decision globally
High-Level Architecture#
```
Content Upload → Pre-filter → ML Pipeline → Decision Engine → Action
                                                  ↓              ↓
                                       Human Review Queue    Audit Log
                                                  ↓
                                          Appeals Process
```
Content Ingestion and Pre-Filtering#
Before running expensive ML models, apply cheap pre-filters:
- Hash-based matching: Compare content hashes against known-bad databases
- Blocklist matching: Check text against keyword and regex blocklists
- Duplicate detection: Identify previously moderated content via perceptual hashing
- Rate limiting: Flag accounts uploading at abnormal rates
These filters catch a significant percentage of violations at minimal compute cost.
Hash-Based Matching#
PhotoDNA#
Microsoft's PhotoDNA generates a hash of an image that is robust to resizing, cropping, and color changes. Platforms compare uploaded images against databases of known illegal content (e.g., NCMEC database for CSAM).
Video Hashing#
Video is decomposed into keyframes, and each frame is hashed independently. Some systems also hash audio tracks to catch violations in spoken content.
Perceptual Hashing#
Unlike cryptographic hashes, perceptual hashes produce similar outputs for visually similar images. This catches minor modifications designed to evade exact-match detection.
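The idea can be illustrated with the simplest perceptual hash, the "average hash": threshold each pixel of a downscaled grayscale image against the mean brightness, and compare hashes by Hamming distance. This is a toy sketch assuming the image is already reduced to an 8x8 grid; real systems use more robust schemes such as pHash or PhotoDNA.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash of an 8x8 grayscale grid (values 0-255).
    Each bit is 1 if the pixel is brighter than the mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance means visually similar."""
    return bin(a ^ b).count("1")

# A minor pixel edit barely moves the hash, unlike a cryptographic hash
# where any change flips roughly half the output bits.
img = [[10 * (r + c) for c in range(8)] for r in range(8)]
tweaked = [row[:] for row in img]
tweaked[0][0] += 5
```

Matching then becomes "Hamming distance below a threshold" rather than exact equality, which is what defeats minor evasion edits.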
ML Classification Pipeline#
Text Moderation#
Text classifiers analyze content for:
- Hate speech and slurs
- Harassment and bullying
- Spam and scam content
- Self-harm and violence
- Misinformation (more nuanced, often requires specialized models)
Modern systems use transformer-based models fine-tuned on platform-specific labeled data. Multilingual support requires separate models or multilingual architectures.
Image Moderation#
Image classifiers detect:
- Nudity and sexual content
- Violence and gore
- Drugs and weapons
- Text embedded in images (requires OCR followed by text classification)
CNNs and vision transformers are common choices. Models output a confidence score per violation category.
Video Moderation#
Video moderation combines:
- Frame sampling: Extract frames at regular intervals and run image classifiers
- Audio transcription: Convert speech to text and run text classifiers
- Temporal analysis: Some violations only become apparent across multiple frames
Video moderation is the most compute-intensive — large platforms process millions of hours of video daily.
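The frame-sampling approach above can be sketched as follows. The `extract_frame` and `classify_frame` callables are stand-ins for a real decoder and image classifier; taking the maximum score per category reflects that a violation visible in any single sampled frame should drive the decision.

```python
def sample_timestamps(duration_s: float, interval_s: float = 1.0) -> list[float]:
    """Timestamps (in seconds) at which to extract frames for classification."""
    return [i * interval_s for i in range(int(duration_s // interval_s) + 1)]

def moderate_video(duration_s: float, extract_frame, classify_frame) -> dict:
    """Keep the max per-category score across sampled frames."""
    scores: dict[str, float] = {}
    for t in sample_timestamps(duration_s):
        frame = extract_frame(t)
        for category, score in classify_frame(frame).items():
            scores[category] = max(scores.get(category, 0.0), score)
    return scores
```

The sampling interval is the main cost lever: denser sampling catches brief violations but multiplies classifier invocations, which is why long-form video often gets coarser sampling plus audio transcription as a second signal.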
Confidence Thresholds and the Decision Engine#
Each ML model outputs a confidence score between 0 and 1. The decision engine maps these scores to actions using configurable thresholds:
| Confidence Range | Action |
|---|---|
| 0.95 - 1.00 | Auto-remove, notify user |
| 0.70 - 0.95 | Send to human review queue |
| 0.30 - 0.70 | Reduce distribution (shadow restrict) |
| 0.00 - 0.30 | Allow |
These thresholds are tuned per category and per market. A platform may be more aggressive on CSAM (auto-remove at 0.80) and more conservative on satire (only auto-remove at 0.99).
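A minimal decision engine is a lookup of per-category thresholds followed by a cascade of comparisons. The values below mirror the table and the stricter CSAM example; the exact numbers and category names are illustrative.

```python
# Per-category thresholds (illustrative; tuned per category and per market).
THRESHOLDS = {
    "default": {"remove": 0.95, "review": 0.70, "restrict": 0.30},
    "csam":    {"remove": 0.80, "review": 0.50, "restrict": 0.20},
}

def decide(category: str, score: float) -> str:
    """Map a model confidence score to a moderation action."""
    t = THRESHOLDS.get(category, THRESHOLDS["default"])
    if score >= t["remove"]:
        return "auto_remove"
    if score >= t["review"]:
        return "human_review"
    if score >= t["restrict"]:
        return "reduce_distribution"
    return "allow"
```

Keeping the thresholds in data rather than code is what allows per-market tuning without a redeploy, which the policy engine below generalizes.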
The Policy Engine#
Moderation rules change frequently. A policy engine decouples rules from code:
- Policies are defined as configuration (JSON/YAML rules or a DSL)
- Rules reference model output labels and confidence scores
- Different policies apply to different regions, content types, or user tiers
- Policy changes take effect immediately without deployment
Example policy rule:
```yaml
rule: block_hate_speech
condition: hate_speech_score > 0.90 AND region IN [US, EU]
action: remove
notify: true
appeal_eligible: true
```
Human Review Queue#
When ML confidence is uncertain, content enters the human review queue:
Priority Ranking#
- Severity: Potential CSAM or imminent violence is reviewed first
- Reach: Content from accounts with large followings is prioritized
- Recency: Newer content is reviewed before older content
- Model confidence: Items closer to the decision boundary get reviewed sooner
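The four ranking signals above can be combined into a single sort key. This is a sketch with illustrative severity ranks and weights; severity dominates, then distance from the decision boundary, then reach and recency as tie-breakers.

```python
# Illustrative severity ordering: lower rank is reviewed first.
SEVERITY_RANK = {"csam": 0, "violence": 1, "hate_speech": 2, "spam": 3}

def review_priority(item: dict) -> tuple:
    """Sort key for the human review queue: lower tuples pop first."""
    # Midpoint of the 0.70-0.95 review band; items nearest it are most uncertain.
    boundary_dist = abs(item["score"] - 0.825)
    return (
        SEVERITY_RANK.get(item["category"], 9),
        round(boundary_dist, 3),
        -item["followers"],      # larger reach first
        -item["uploaded_at"],    # newer content first
    )

def next_to_review(queue: list[dict]) -> dict:
    return min(queue, key=review_priority)
```

In practice each signal would be weighted and the queue backed by a priority heap, but the tuple ordering captures the precedence the list above describes.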
Reviewer Workflow#
- Reviewer sees the content with ML predictions and relevant context
- Reviewer selects a violation category or marks as "no violation"
- Decision is recorded and fed back to the ML training pipeline
- Reviewer labels become ground truth for model improvement
Reviewer Wellbeing#
Content reviewers are exposed to disturbing material. Systems must:
- Blur graphic content by default, requiring explicit click to reveal
- Limit exposure time per session
- Provide mental health support
- Rotate reviewers across content categories
Appeals Process#
Users whose content is removed can appeal:
- User submits appeal with optional explanation
- Appeal is routed to a different reviewer (never the original)
- Reviewer re-evaluates with full context including the user's explanation
- Decision is final (or escalated to a senior review panel)
- Appeal outcomes feed back into model training
Tracking appeal overturn rates per category is a key quality metric. A high overturn rate signals that the ML model or thresholds need adjustment.
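The overturn-rate metric is a simple per-category ratio over appeal records. A minimal sketch, assuming each appeal record carries a category and an overturned flag:

```python
def overturn_rate(appeals: list[dict]) -> dict[str, float]:
    """Fraction of appeals that reversed the original decision, per category."""
    totals: dict[str, int] = {}
    overturned: dict[str, int] = {}
    for a in appeals:
        totals[a["category"]] = totals.get(a["category"], 0) + 1
        if a["overturned"]:
            overturned[a["category"]] = overturned.get(a["category"], 0) + 1
    return {c: overturned.get(c, 0) / n for c, n in totals.items()}
```

Slicing the same ratio by model version makes regressions visible after a model rollout.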
False Positive Handling#
False positives — legitimate content incorrectly flagged — erode user trust. Strategies to minimize them:
- Multi-model ensemble: Require agreement from multiple models before auto-removing
- Context awareness: A medical education video showing anatomy should not be flagged as nudity
- User reputation scoring: Established accounts with clean history get higher thresholds
- Gradual enforcement: Reduce distribution before outright removal
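Two of these strategies reduce to small, testable functions. The agreement count and the reputation bonus below are illustrative values, not recommendations:

```python
def ensemble_auto_remove(scores: list[float], threshold: float = 0.95,
                         min_agree: int = 2) -> bool:
    """Auto-remove only when at least min_agree models clear the threshold."""
    return sum(s >= threshold for s in scores) >= min_agree

def adjusted_threshold(base: float, account_age_days: int,
                       prior_violations: int) -> float:
    """Raise the auto-remove bar for long-standing accounts with a clean history."""
    if account_age_days > 365 and prior_violations == 0:
        return min(base + 0.03, 0.99)
    return base
```

Both mechanisms trade a little recall for precision on exactly the content where false positives are most costly: borderline items from established users.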
Real-Time vs Batch Moderation#
Real-Time#
Applied at upload time. Essential for:
- Live streams
- Chat messages
- Content that could go viral within minutes
Batch#
Applied retroactively. Useful for:
- Re-scanning existing content when new policies are introduced
- Running improved models against historical content
- Detecting coordinated campaigns that only become visible in aggregate
Most platforms use both. Real-time catches obvious violations; batch catches the rest.
Scaling Considerations#
Compute#
- Text classification is cheap — thousands of items per GPU per second
- Image classification is moderate — hundreds per GPU per second
- Video is expensive — may require dedicated GPU clusters
Storage#
- Store moderation decisions and audit logs indefinitely for compliance
- Cache hash databases in memory for fast lookup
- Use CDN-level integration to block removed content globally
Geographic Distribution#
- Deploy moderation services in multiple regions for latency
- Ensure compliance with local regulations (different rules per jurisdiction)
- Route human review to reviewers who speak the content's language
Metrics and Monitoring#
Key metrics to track:
- Precision and recall per violation category
- Human review queue depth and average review time
- Appeal overturn rate per category and per model version
- Time to action from upload to moderation decision
- False positive rate for high-confidence auto-removals
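Precision and recall per category are computed from audited samples where the true label is known. A minimal sketch, assuming each record is a (flagged, actually_violating) pair:

```python
def precision_recall(decisions: list[tuple[bool, bool]]) -> tuple[float, float]:
    """decisions: (flagged, actually_violating) pairs from an audited sample."""
    tp = sum(1 for f, v in decisions if f and v)        # correct removals
    fp = sum(1 for f, v in decisions if f and not v)    # false positives
    fn = sum(1 for f, v in decisions if not f and v)    # missed violations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision tracks false positives (wrongly removed content) and recall tracks false negatives (missed violations), so the pair maps directly onto the accuracy requirement stated at the top of this guide.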
Summary#
| Component | Role |
|---|---|
| Pre-filter | Hash matching, blocklists, deduplication |
| ML Pipeline | Text, image, and video classification |
| Decision Engine | Maps confidence scores to actions via thresholds |
| Policy Engine | Configurable rules decoupled from code |
| Human Review | Handles uncertain cases, feeds training data |
| Appeals | Allows users to contest decisions |
Content moderation at scale is a continuous balancing act between user safety, free expression, and operational cost. The best systems combine fast automated detection with thoughtful human oversight.
Article #204 · Codelit System Design Series