Overview
HateBench represents the most comprehensive evaluation of Large Language Models on Hindi hate speech detection to date. Over a two-week period, we benchmarked 47 models from 23 providers on a curated dataset of 150 Hindi social media posts, drawn partly from Instagram comments and partly from the Indo-HateSpeech dataset. Beyond accuracy, we measured cost efficiency and latency, the metrics that actually matter for production deployment.
Dataset Attribution
The Indo-HateSpeech dataset (Kaware, 2024) is a Hindi-English code-mixed dataset designed for identifying hate speech on social media platforms. Given the multilingual nature of Indian social media users, code-mixing—the blending of Hindi and English—is prevalent and poses unique challenges for content moderation.
The Bottom Line
Critical Implications:
- 71.3% accuracy means nearly 3 in 10 classifications are wrong—unacceptable for safety-critical moderation
- 30.4% false positive rate threatens free speech by over-flagging political/religious commentary
- 20-40% accuracy on pure Hindi script disadvantages rural and formal Hindi speakers
- 52.9% of positive or non-harmful content with emojis is misclassified as hate; models don't understand Indian social media tone
- Cost varies 423× for 0.7% accuracy difference—market inefficiency creates exploitable arbitrage
What We Tested
The benchmark poses a binary classification task, though in practice models emitted three kinds of output:
- Hate Speech: Toxic language, slurs, threats, dehumanization, religious attacks, gendered harassment (75 examples, 50%)
- Positive / Non-Harmful: Political commentary, religious expressions, banter, patriotism, gratitude, and supportive language (75 examples, 50%)
- Neutral Output: Not part of the benchmark labels, yet all 47 models still predicted it 9-27 times, revealing systematic confusion on the non-hate boundary
Content distribution: 44.7% contain emojis, 84% Romanized Hindi, 6.7% pure Devanagari, 4% mixed script. Models were tested on culturally-specific Indian social media content from Instagram comments and the Indo-HateSpeech dataset.
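Because "neutral" is not a benchmark label but models kept emitting it, any scoring harness has to decide how to treat out-of-scheme outputs. Below is a minimal sketch of one way to do this; the `normalize`/`score` helpers and the keyword-matching rules are illustrative assumptions, not the benchmark's actual scoring code.

```python
# Illustrative sketch of label normalization for scoring. "neutral" is not a
# benchmark label, so it is tracked explicitly rather than silently folded
# into either class. The keyword rules here are assumptions.
from collections import Counter

def normalize(raw: str) -> str:
    """Map a raw model response onto the benchmark's label space."""
    text = raw.strip().lower()
    if "hate" in text:
        return "hate"
    if "neutral" in text:
        return "neutral"   # out-of-scheme output, counted separately
    return "positive"

def score(predictions: list[str], gold: list[str]) -> dict:
    """Accuracy plus a tally of how often the model invented a third class."""
    preds = [normalize(p) for p in predictions]
    correct = sum(p == g for p, g in zip(preds, gold))
    counts = Counter(preds)
    return {
        "accuracy": correct / len(gold),
        "neutral_leakage": counts.get("neutral", 0),
    }
```

Tracking `neutral_leakage` as its own counter, rather than mapping neutral onto one of the two classes, is what makes the neutral-leakage finding below measurable at all.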
Evaluation Prompt
The same system prompt was sent to every model before each classification, which keeps the results directly comparable across providers.
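The exact prompt text is not reproduced in this section. As an illustration only, here is a hypothetical sketch of how a fixed system prompt is paired with each post in an OpenAI-style chat payload; `SYSTEM_PROMPT` and `build_request` are placeholders, not the benchmark's actual prompt or code.

```python
# Hypothetical sketch: wiring one fixed system prompt into every request.
# SYSTEM_PROMPT is a placeholder, NOT the benchmark's actual prompt text.
SYSTEM_PROMPT = "You are a Hindi hate speech classifier. Reply with one label."

def build_request(model: str, post: str) -> dict:
    """Build an OpenAI-style chat-completion payload for one classification."""
    return {
        "model": model,
        "temperature": 0,  # deterministic labels aid cross-provider comparison
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # identical for every model
            {"role": "user", "content": post},
        ],
    }
```

Holding the system message constant while only the user message varies is the standard way to keep a multi-provider benchmark apples-to-apples.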
Top Performers & Cost Analysis
| Model | Accuracy | Cost/150 ex | Cost/Correct | Latency | Notes |
|---|---|---|---|---|---|
| trinity-large-preview | 71.3% | $0.00 | $0.0000 | ~2.5s | Free tier, rate limits likely |
| gpt-5.2-pro | 70.7% | $0.2515 | $0.00249 | 2.8s | Premium tier, marginal gain |
| gemini-3-flash-preview | 70.0% | $0.0403 | $0.000385 | 1.8s | Best value for accuracy |
| gemma-3n-e4b (4B) | 70.0% | $0.0011 | $0.000011 | 1.49s | 423× cheaper than GPT-5.2 Pro! |
| gpt-5.2 | 67.3% | $0.1639 | $0.00249 | 2.4s | Same accuracy as Claude, 1/3 cost |
| claude-opus-4.6 | 67.3% | $0.4684 | $0.004637 | 3.6s | 12× more expensive than Gemini Flash |
| sarvam-105b | 60.7% | $0.00 | $0.0000 | ~2s | Indic model, 0% abbreviation detection |
| qwen3-235b | 57.3% | $0.08 | $0.00140 | ~2s | 235B params, below avg performance |
Costs are shown per 150 examples and per correct prediction. Trinity and Sarvam were free via the OpenRouter and Sarvam APIs at benchmark time; pricing is subject to change. Lower cost per correct prediction means better value.
Notable Underperformers
- Qwen3 235B: Massive 235B-parameter model achieves only 57.3% accuracy, below the 61.4% average. Scale alone doesn't help.
- Llama 4 Maverick: 56.0% false positive rate on non-harmful content, making it the worst offender for over-flagging.
- Phi-4: 66.7% false positive rate on political content alongside 0% internet slang detection.
- Ministral 14B: 49.3% FP rate on non-harmful content + 66.7% FP on political content.
Dataset Deep Dive
Before diving into model performance, it's crucial to understand what we tested against. The dataset composition reveals critical imbalances:
- 84% Romanized Hindi (Hinglish): Mixed Hindi-English script that dominates Indian social media
- 6.7% Pure Devanagari: Traditional Hindi script that models struggle with dramatically
- 4.0% Mixed script: Code-switched content combining both scripts
- 44.7% contain emojis: Critical context markers that models misinterpret
Critical Insights
1. The Accuracy Ceiling: 71.3% is the Hard Limit
Hindi hate speech detection maxes out at 71.3% accuracy (Trinity Large). The best models cluster at 70-71%, suggesting a natural ceiling for current architectures. Average across all 47 models: only 61.4%.
The Devanagari Problem: Models achieve 100% accuracy on Romanized Hindi but only 20-40% on pure Devanagari script. This looks less like semantic understanding than transliteration dependency: models appear to map Hindi → English → classify rather than understanding Hindi directly.
2. Cost-Performance Disconnect: Price ≠ Quality
The cost differences are staggering for marginal gains:
| Model | Accuracy | Cost/150 examples | Cost per correct |
|---|---|---|---|
| Gemma 3n E4B (4B) | 70.0% | $0.0011 | $0.000011 |
| Gemini 3 Flash | 70.0% | $0.0403 | $0.000385 |
| GPT-5.2 Pro | 70.7% | $0.2515 | $0.002490 |
| Claude Opus 4.6 | 67.3% | $0.4684 | $0.004637 |
The ROI paradox: Gemma 3n E4B is 423× more cost-efficient than GPT-5.2 Pro despite nearly identical accuracy (70.0% vs 70.7%). Google models average 5× cheaper than OpenAI with 2.3% higher accuracy.
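The cost-per-correct column is simple arithmetic: total spend divided by the expected number of correct predictions. A one-line sketch, using the Gemma 3n E4B figures from the table:

```python
def cost_per_correct(total_cost: float, accuracy: float, n_examples: int = 150) -> float:
    """Cost of one correct prediction: total spend over expected correct count."""
    return total_cost / (accuracy * n_examples)

# Gemma 3n E4B: $0.0011 across 150 examples at 70.0% accuracy
gemma = cost_per_correct(0.0011, 0.700)   # ~$0.0000105 per correct answer
```

This metric is what exposes the market inefficiency: two models with near-identical accuracy can differ by orders of magnitude once you normalize spend by useful output.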
3. The Script Barrier: Devanagari Blindness
All models exhibit catastrophic performance drops on pure Hindi script:
- Claude Opus 4.6: 100% on Hinglish → 20% on Devanagari (80% drop)
- Gemini 3 Flash: 100% on Hinglish → 40% on Devanagari (60% drop)
- Gemini 3 Pro: 100% on Hinglish → 20% on Devanagari (80% drop)
This reveals architectural bias toward Latin-script training data. Rural users, formal contexts, and regional Hindi speakers are systematically disadvantaged.
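Bucketing posts by script, as in the composition breakdown above, can be done directly from Unicode ranges: Devanagari occupies the block U+0900-U+097F. A minimal sketch (the function name and the all-or-nothing thresholds are assumptions):

```python
# Sketch of script bucketing via the Devanagari Unicode block (U+0900-U+097F).
# The "all letters" / "no letters" thresholds for pure vs mixed are assumptions.
def script_of(text: str) -> str:
    """Classify a post as romanized, devanagari, mixed, or other."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "other"
    deva = sum(1 for c in letters if "\u0900" <= c <= "\u097f")
    if deva == 0:
        return "romanized"    # Hinglish written in Latin script
    if deva == len(letters):
        return "devanagari"   # pure Hindi script
    return "mixed"            # code-switched across scripts

print(script_of("kya baat hai"))   # romanized
print(script_of("क्या बात है"))      # devanagari
```

Splitting evaluation results along exactly this axis is what surfaces the 80-point accuracy cliffs listed above.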
4. False Positive Crisis: 30.4% Over-Flagging
Models flag nearly a third of positive or non-harmful content as hate speech, with systematic biases:
- Religious content: 34.0% false positive rate
- Political commentary: 33.3% to 66.7% FP rate depending on model
- Worst offenders: Llama 4 Maverick (56.0%), Ministral 14B (49.3%), Gemma 3 27B (45.3%)
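The false positive rate quoted here is the share of genuinely non-hate examples that a model flags as hate. A short sketch of the computation (label strings are assumptions matching the benchmark's two classes):

```python
def false_positive_rate(preds: list[str], gold: list[str]) -> float:
    """Share of non-hate gold examples that were flagged as hate."""
    flagged = sum(1 for p, g in zip(preds, gold) if g != "hate" and p == "hate")
    non_hate = sum(1 for g in gold if g != "hate")
    return flagged / non_hate
```

Because the benchmark is balanced 75/75, a 30.4% FP rate means roughly 23 of the 75 positive/non-harmful posts get wrongly flagged per model.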
5. Emoji Context Collapse: 😂 Triggers False Positives
Laughing emojis cause catastrophic misclassification. Models interpret them as mockery markers rather than tone softeners:
- Positive / non-harmful + 😂 false positive rate: 52.9% (Claude Opus, Gemini Flash, Gemma 27B)
- Worst offender: Grok 4.1 Fast at 58.8% FP rate
- The paradox: Models over-index on 😂 as aggression when paired with criticism, but under-index when paired with actual slurs
Real-world failure: "BJP+400🚩🚩❤❤❤😈😈" (positive partisan slogan) produced a fragmented split across hate, positive, and neutral outputs instead of being recognized as clearly non-harmful.
6. Political Figures as Trip Wires
Mentioning prominent Indian political figures triggers hate classification even in non-harmful contexts:
- Phi-4 & Ministral 14B: 66.7% FP rate on non-harmful political posts
- Gemma 3 27B: 50.0% FP rate
- Average across models: ~40%
Root cause: Training data includes Western political hate speech mapped onto Indian figures, but Indian political discourse has different toxicity norms. Criticism ≠ hate in Indian culture, but models don't distinguish this.
7. The Parameter Paradox: Size Doesn't Scale
Model size shows zero correlation with accuracy. Larger models often dramatically underperform smaller ones:
| Small Model | Accuracy | Large Model | Accuracy | Gap |
|---|---|---|---|---|
| Gemma 3n E4B (4B) | 70.0% | Qwen3 235B (235B) | 57.3% | -12.7% |
| Gemma 3 12B (12B) | 66.7% | Sarvam-105B (105B) | 60.7% | -6.0% |
| Gemma 3 27B (27B) | 65.3% | Qwen3 235B (235B) | 57.3% | -8.0% |
Only Trinity Large (est. 400B+) benefits from scale at 71.3%. For production, deploy Gemma 3n E4B—save 99% on compute with better results than models 50× its size.
8. Indic Models Compete at Zero Cost
Sarvam models—specifically trained on Indic languages—deliver competitive results free:
- Sarvam-105B: 60.7% accuracy at $0.00 (free tier)
- Sarvam-30B: 58.7% accuracy at $0.00
- Gap vs. Western models: Only 0.8-2.8 points behind at zero cost
Critical vulnerability: Sarvam models detect 0% of abbreviated internet slurs while Western models detect 75-100%. A hybrid approach (Sarvam first pass + GPT for slang) could deliver 65%+ accuracy at near-zero cost.
9. Internet Slang Detection Gap: Binary Failure
Abbreviated internet slurs show bimodal detection—some models perfect (100%), others completely blind (0%), with no middle ground:
- Perfect detection (8 models): Gemini 3 Flash/Pro, GPT-5.2/5.2 Pro/5.4 Pro, GLM-4.7, Grok 4.1 Fast, Mimo V2 Pro
- Complete failure (5 models): Sarvam-30B, Sarvam-105B, Phi-4, MiniMax M2.1, Mercury 2
- Attack vector: Bad actors can evade Indic-model-based moderation simply by switching to abbreviated internet slurs, which those models never catch
10. The Neutral Leakage Problem
This benchmark was structured around hate versus positive, non-harmful content, but all 47 models still emitted a "neutral" label 9-27 times. That means models were not just struggling with toxicity; they were inventing an extra middle bucket and diluting classification quality on clearly safe inputs.
Examples on the positive side of the benchmark include devotional expressions such as "Jai shree ram," patriotic celebration such as "ilove my india," and simple gratitude such as "Thanku for this ❤️." The failure here is not that these are ambiguous; it is that many models still collapse part of this side into an unnecessary neutral class.
11. Confidence Calibration Crisis: Scores Are Unreliable
Models exhibit dangerous over-confidence, making confidence scores unusable for human triage:
- High-confidence errors: 916 predictions with >90% confidence that were WRONG
- Low-confidence successes: Only 98 predictions with <60% confidence that were correct
- 100% confidence errors: Gemini 3 Pro, Gemma 3 12B, Sarvam-105B all made mistakes with absolute certainty
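The two headline calibration figures come from a straightforward tally over per-prediction records. A sketch under an assumed record shape (`confidence` as a 0-1 float, `correct` as a bool); the field names are hypothetical:

```python
# Sketch of the calibration tally behind the 916 vs 98 figures.
# The record shape {"confidence": float, "correct": bool} is an assumption.
def calibration_buckets(records: list[dict]) -> dict:
    """Count overconfident errors and underconfident successes."""
    hi_conf_wrong = sum(1 for r in records if r["confidence"] > 0.90 and not r["correct"])
    lo_conf_right = sum(1 for r in records if r["confidence"] < 0.60 and r["correct"])
    return {
        "high_confidence_errors": hi_conf_wrong,
        "low_confidence_successes": lo_conf_right,
    }
```

When the first bucket dwarfs the second by nearly 10×, confidence thresholds cannot be used to route borderline cases to humans.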
12. Content Type Performance Gaps: Gendered Hate Blind Spot
Models excel at political/religious hate (100% detection) but systematically miss gendered harassment:
- Political hate: 100% detection (Claude Opus)
- Religious hate: 100% detection
- Female-directed abuse: Only 80% detection, with a 20% miss rate on slurs targeting women
Critical blind spot: Gendered and sexualized hate speech slips through 5× more often than political hate, despite being equally harmful. This likely reflects training data skewed toward public discourse rather than private harassment contexts.
13. Code-Switching Blind Spots
Models handle Hindi-English code-switching (71% accuracy) better than pure Hindi (29-57% accuracy), but struggle with language boundaries:
- Mixed Hindi-English: 71% average accuracy
- Pure Devanagari: 29-57% accuracy
- Pure English: 60-80% accuracy
Real-world impact: Rural/regional users using pure Hindi script, regional language mixes (Marathi + Hindi + English), and professional contexts using Devanagari are all disadvantaged.
14. Model Consensus Uncertainty: 17% Requires Human Review
26 examples (17.3% of dataset) show <60% model agreement. Even aggregated predictions are unreliable for culturally-complex content:
- Most contentious: "BJP+400🚩🚩❤❤❤😈😈" (positive partisan slogan) → 45% neutral, 30% hate, 25% positive split
- Perfect consensus wrong: 2 examples with 100% model agreement were both wrong
- Irreducible floor: 17% of content will always require human review regardless of model sophistication
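The <60% agreement routing rule can be sketched as a majority-vote check across the 47 model outputs; the function name and default threshold mirror the numbers above but are otherwise assumptions:

```python
# Sketch of the consensus check: flag an example for human review when no
# single label reaches the agreement threshold across model outputs.
from collections import Counter

def needs_human_review(model_labels: list[str], threshold: float = 0.60) -> bool:
    """True when the plurality label falls below `threshold` agreement."""
    label, votes = Counter(model_labels).most_common(1)[0]
    return votes / len(model_labels) < threshold
```

On a 47-model panel this flags exactly the contested cases: a 21/14/12 split across neutral/hate/positive routes to a human, while a 40/7 split does not.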
Production Recommendations
Model Selection Strategy
| Scenario | Recommended Model | Why |
|---|---|---|
| Best Overall | Gemini 3 Flash | 70.0% accuracy, $0.04/test, <2s latency, strong hate recall (89.3%) |
| Best Budget | Gemma 3n E4B | 70.0% accuracy, $0.0011/test (297× cheaper), 1.49s latency |
| Best Premium | Trinity Large | 71.3% accuracy (highest), $0.00 (free tier) |
| Hybrid Strategy | Sarvam-105B + GPT-5.2 | First pass free (60.7%), slang catch on uncertain cases → ~68% accuracy at <$0.05/1K examples |
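The hybrid row in the table can be sketched as a two-stage cascade: a free Indic first pass, with escalation to a paid model only when the first pass is uncertain. `classify_sarvam` and `classify_gpt` are hypothetical stand-ins for the two providers' calls, each assumed to return a `(label, confidence)` pair; the escalation threshold is an assumption.

```python
# Sketch of the hybrid cascade. classify_sarvam and classify_gpt are
# hypothetical stand-ins, each assumed to return (label, confidence).
def hybrid_classify(post, classify_sarvam, classify_gpt,
                    escalate_below: float = 0.75) -> str:
    """Free Indic first pass; escalate low-confidence cases to the paid model."""
    label, confidence = classify_sarvam(post)
    if confidence < escalate_below:
        label, _ = classify_gpt(post)   # paid model only on uncertain cases
    return label
```

Since escalation only fires on a minority of posts, the blended cost stays close to the free tier while the paid model covers the slang cases Sarvam misses entirely.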
Why This Matters
With 600+ million Hindi speakers worldwide and explosive growth in Indian social media usage, effective content moderation for Indic languages isn't optional—it's essential. Yet current solutions systematically fail this market.
The Real-World Impact
- Platform Safety: 20% of gendered hate speech slips through while 30% of political commentary is over-flagged. This creates both under-moderation (harassment goes unchecked) and over-moderation (legitimate speech suppressed).
- Digital Divide: Models perform 50-80% worse on pure Devanagari script, systematically disadvantaging rural users, older demographics, and formal/professional contexts. This is algorithmic bias against non-English-script users.
- Cultural Blindness: Models trained on Western data flag Indian political discourse as hate 40-67% of the time. What constitutes criticism vs. hate differs radically between cultures—models don't know this.
- Economic Exploitation: Bad actors can bypass detection by using abbreviated internet slurs against Indic models (0% detection) or mixing abuse with Devanagari to confuse transliteration layers.
The Technical Debt
- Confidence scores are broken: 916 high-confidence errors vs 98 low-confidence successes means human review triage is impossible without reviewing 60%+ of content.
- Emojis are misread: 😂 triggers hate classification 52.9% of the time in positive or non-harmful content; models fundamentally misunderstand Indian social media tone.
- Consensus is unreliable: Even aggregating 47 model outputs, 17% of content has <60% agreement. Some edge cases will always require human judgment.
What HateBench Provides
HateBench delivers the first empirical foundation for making informed moderation decisions in one of the world's largest language markets. With 47 models tested, critical insights identified, and cost/performance/latency metrics for each, this is the most comprehensive Hindi hate speech benchmark to date.
The path forward: Hybrid human-AI moderation with culturally-aware models, script-aware handling, and consensus-based routing. 71.3% accuracy isn't good enough; we need 90%+ for safety-critical applications. This benchmark shows us exactly where the gaps are.
Methodology & Limitations
Analysis based on: 47 models from 23 providers (Anthropic, OpenAI, Google, Sarvam AI, Meta, Mistral, etc.) tested on 150 examples from the Indo-HateSpeech dataset. Metrics include accuracy, precision, recall, F1, confusion matrices, cost (USD), and latency.
Limitations: Small dataset (150 examples), culturally ambiguous edge cases, free-tier models may have rate-limiting affecting scores, latency measurements include API overhead not just inference time.