Swiss-Bench
Which AI model works best for Switzerland?
11 models. 6 dimensions. 3 languages. Updated quarterly.
Last updated: Q1 2026
Leaderboard
Overall Model Rankings
| Rank | Model | Type | Overall | DE | FR | IT | Updated |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Closed Source | 65.9 | 82.4 | 82.6 | 81.0 | Q1 2026 |
| 2 | Kimi K2.5 | Open Source | 64.7 | 82.0 | 86.0 | 87.0 | Q1 2026 |
| 3 | Gemini 2.5 Pro | Closed Source | 63.8 | 85.0 | 87.0 | 85.0 | Q1 2026 |
| 4 | MiniMax M2.5 | Open Source | 62.9 | 72.0 | 82.0 | 78.0 | Q1 2026 |
| 5 | GPT-4o | Closed Source | 61.7 | 70.0 | 73.2 | 69.8 | Q1 2026 |
| 6 | Gemini 2.0 Flash | Closed Source | 59.3 | 75.2 | 74.2 | 77.0 | Q1 2026 |
| 7 | DeepSeek V3 | Open Source | 58.2 | 81.2 | 81.4 | 80.2 | Q1 2026 |
| 8 | Mistral Large 2 | Open Source | 58.2 | 70.6 | 63.8 | 68.0 | Q1 2026 |
| 9 | Llama 3.3 70B | Open Source | 56.7 | 64.6 | 68.2 | 63.6 | Q1 2026 |
| 10 | GPT-4o Mini | Closed Source | 55.3 | 57.2 | 62.2 | 57.2 | Q1 2026 |
| 11 | Qwen 2.5 72B | Open Source | 54.0 | 66.4 | 67.8 | 74.0 | Q1 2026 |
Swiss-Bench v1.0 — HAAS composite score across 6 dimensions: Performance, Robustness, Safety, Compliance, Swiss Language, Documentation. Overall = weighted average of all dimensions. DE/FR/IT = MMLU-ProX multilingual accuracy (10% weight in HAAS). 11 models, Q1 2026. Methodology →
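To make the composite concrete, here is a minimal sketch of how a HAAS-style weighted average could be computed. Only the 10% weight on the Swiss Language dimension (the mean of the DE/FR/IT MMLU-ProX accuracies) comes from the caption above; all other weights are illustrative placeholders, not the official Swiss-Bench weighting.

```python
# Minimal sketch of a HAAS-style composite. Only the 10% Swiss Language
# weight is stated by Swiss-Bench; every other weight below is a placeholder.

DIMENSION_WEIGHTS = {
    "performance":    0.25,  # placeholder
    "robustness":     0.20,  # placeholder
    "safety":         0.20,  # placeholder
    "compliance":     0.15,  # placeholder
    "swiss_language": 0.10,  # stated: DE/FR/IT MMLU-ProX carries 10% of HAAS
    "documentation":  0.10,  # placeholder
}

def swiss_language_score(de: float, fr: float, it: float) -> float:
    """Swiss Language dimension as the mean of the three MMLU-ProX accuracies."""
    return (de + fr + it) / 3

def haas_score(dimension_scores: dict[str, float]) -> float:
    """Overall score: weighted average of the six dimension scores (0-100 each)."""
    assert abs(sum(DIMENSION_WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * dimension_scores[dim] for dim, w in DIMENSION_WEIGHTS.items())
```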
Key Findings
Q1 2026 Highlights
Best Overall
Claude Opus 4.6
Highest HAAS composite (65.9), the weighted average over all 6 dimensions. Strong Swiss legal knowledge and EU AI Act compliance.
Tightest Race
Top 4 within 3 points
Kimi K2.5 (64.7), Gemini 2.5 Pro (63.8), and MiniMax M2.5 (62.9) trail the leader by just 1.2 to 3.0 points. The frontier is crowded.
Best Multilingual
Gemini 2.5 Pro
Top scores in German (85%) and French (87%) and the best average across all three Swiss languages; Kimi K2.5 leads in Italian (87%).
Benchmark results: Swiss-Bench v1.0 (March 2026). Updated quarterly.
Detailed Results
Domain-specific performance
Domain Breakdown
| Domain | Best Model | Score | Runner-up | Gap (pts) |
|---|---|---|---|---|
| Financial Services | Claude Opus 4.6 | 91.2 | GPT-4o | +2.4 |
| Legal (Federal) | GPT-4o | 89.7 | Claude Opus 4.6 | +1.1 |
| Legal (Cantonal) | Claude Opus 4.6 | 86.3 | Gemini 2.0 Flash | +3.8 |
| Healthcare | Gemini 2.0 Flash | 84.9 | Claude Opus 4.6 | +0.7 |
| Public Administration | Claude Opus 4.6 | 88.1 | GPT-4o | +1.9 |
| Insurance | GPT-4o | 87.4 | Claude Opus 4.6 | +2.2 |
Failure Mode Analysis
| Failure Mode | Claude Opus 4.6 | GPT-4o | Gemini 2.0 Flash | Llama 3.3 70B |
|---|---|---|---|---|
| Hallucination Rate | 2.1% | 3.4% | 2.8% | 6.7% |
| Jurisdiction Confusion | 1.3% | 1.8% | 2.4% | 5.1% |
| Temporal Decay | 4.2% | 3.9% | 5.1% | 7.3% |
| Language Mixing | 0.8% | 1.2% | 0.6% | 3.4% |
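As a rough illustration, the sketch below tallies per-model failure rates of the kind shown above from labeled evaluation transcripts. The record format (one set of reviewer-assigned failure-mode labels per response) is a hypothetical schema for illustration, not Swiss-Bench's actual data model.

```python
# Sketch: percentage of responses exhibiting each failure mode.
# The input format (one set of reviewer-assigned labels per response,
# empty set = clean response) is a hypothetical schema.

from collections import Counter

FAILURE_MODES = ("hallucination", "jurisdiction_confusion",
                 "temporal_decay", "language_mixing")

def failure_rates(labels: list[set[str]]) -> dict[str, float]:
    """Map each failure mode to the percentage of responses showing it."""
    if not labels:
        raise ValueError("no responses to score")
    counts = Counter(mode for response in labels for mode in response)
    return {mode: 100 * counts[mode] / len(labels) for mode in FAILURE_MODES}

# Example: 3 responses, one flagged as a hallucination -> 33.3% hallucination rate.
print(failure_rates([{"hallucination"}, set(), set()]))
```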
Cross-Lingual Consistency
| Model | DE↔FR | DE↔IT | FR↔IT | Avg. Consistency |
|---|---|---|---|---|
| Claude Opus 4.6 | 96.8% | 94.2% | 95.1% | 95.4% |
| GPT-4o | 95.3% | 92.7% | 93.4% | 93.8% |
| Gemini 2.0 Flash | 96.1% | 95.8% | 94.9% | 95.6% |
| Mistral Large 2 | 97.2% | 91.3% | 92.7% | 93.7% |
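For readers who want to reproduce the consistency columns, here is a minimal sketch of pairwise cross-lingual agreement, assuming each test item exists in parallel DE/FR/IT versions. Exact answer matching is a simplifying assumption; Swiss-Bench's actual matching criterion is documented on the Methodology page.

```python
# Sketch: cross-lingual consistency as the share of parallel items that a
# model answers identically in two languages, averaged over the three pairs.
# Exact string matching is a simplifying assumption.

from itertools import combinations

def pair_consistency(answers_a: list[str], answers_b: list[str]) -> float:
    """Percentage of parallel items answered identically in both languages."""
    assert len(answers_a) == len(answers_b), "need parallel item sets"
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return 100 * matches / len(answers_a)

def average_consistency(answers_by_lang: dict[str, list[str]]) -> float:
    """Mean consistency over the DE-FR, DE-IT, and FR-IT pairs."""
    scores = [pair_consistency(answers_by_lang[a], answers_by_lang[b])
              for a, b in combinations(sorted(answers_by_lang), 2)]
    return sum(scores) / len(scores)
```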
Swiss-Bench methodology and scoring criteria are documented on our Methodology page →
A scientific article describing our methodology, expert-verified ground truth, and statistical framework is currently being prepared for peer-reviewed publication.
Need scores for YOUR domain? Our AI Model Evaluation (from CHF 8,000) runs Swiss-Bench against your specific use case. 5-model comparison, domain-specific scenarios, actionable recommendation.
Contact
contact@ai-helvetic.ch
Ready for an independent evaluation?
Start with an AI Model Evaluation or a full SOTA Model Sweep. Within two weeks you'll know which model works best for your Swiss use case.