Swiss-Bench
Which AI model works best for Switzerland?
11 models. 6 dimensions. 3 languages. Updated quarterly.
Last updated: Q1 2026
Leaderboard
Overall Model Rankings
| Rank | Model | Type | Overall | DE | FR | IT | Updated |
|---|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Closed Source | 65.9 | 82.4 | 82.6 | 81.0 | Q1 2026 |
| 2 | Kimi K2.5 | Open Source | 64.7 | 82.0 | 86.0 | 87.0 | Q1 2026 |
| 3 | Gemini 2.5 Pro | Closed Source | 63.8 | 85.0 | 87.0 | 85.0 | Q1 2026 |
| 4 | MiniMax M2.5 | Open Source | 62.9 | 72.0 | 82.0 | 78.0 | Q1 2026 |
| 5 | GPT-4o | Closed Source | 61.7 | 70.0 | 73.2 | 69.8 | Q1 2026 |
| 6 | Gemini 2.0 Flash | Closed Source | 59.3 | 75.2 | 74.2 | 77.0 | Q1 2026 |
| 7 | DeepSeek V3 | Open Source | 58.2 | 81.2 | 81.4 | 80.2 | Q1 2026 |
| 8 | Mistral Large 2 | Open Source | 58.2 | 70.6 | 63.8 | 68.0 | Q1 2026 |
| 9 | Llama 3.3 70B | Open Source | 56.7 | 64.6 | 68.2 | 63.6 | Q1 2026 |
| 10 | GPT-4o Mini | Closed Source | 55.3 | 57.2 | 62.2 | 57.2 | Q1 2026 |
| 11 | Qwen 2.5 72B | Open Source | 54.0 | 66.4 | 67.8 | 74.0 | Q1 2026 |
Swiss-Bench v1.0 — HAAS composite score across 6 dimensions: Performance, Robustness, Safety, Compliance, Swiss Language, Documentation. Overall = weighted average of all dimensions. DE/FR/IT = MMLU-ProX multilingual accuracy (10% weight in HAAS). 11 models, Q1 2026. Methodology →
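To make the composite concrete, here is a minimal sketch of how a HAAS-style weighted average could be computed. Only the 10% weight on the Swiss Language dimension (the mean of the DE/FR/IT MMLU-ProX accuracies) comes from the caption above; all other weights are illustrative placeholders, not the official Swiss-Bench weighting.

```python
# Minimal sketch of a HAAS-style composite. Only the 10% Swiss Language
# weight is stated by Swiss-Bench; every other weight below is a placeholder.

DIMENSION_WEIGHTS = {
    "performance":    0.25,  # placeholder
    "robustness":     0.20,  # placeholder
    "safety":         0.20,  # placeholder
    "compliance":     0.15,  # placeholder
    "swiss_language": 0.10,  # stated: DE/FR/IT MMLU-ProX carries 10% of HAAS
    "documentation":  0.10,  # placeholder
}

def swiss_language_score(de: float, fr: float, it: float) -> float:
    """Swiss Language dimension as the mean of the three MMLU-ProX accuracies."""
    return (de + fr + it) / 3

def haas_score(dimension_scores: dict[str, float]) -> float:
    """Overall score: weighted average of the six dimension scores (0-100 each)."""
    assert abs(sum(DIMENSION_WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * dimension_scores[dim] for dim, w in DIMENSION_WEIGHTS.items())
```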
Key Findings
Q1 2026 Highlights
Best Overall
Claude Opus 4.6
Highest HAAS composite (65.9), the weighted average over all 6 dimensions. Strong Swiss legal knowledge and EU AI Act compliance.
Tightest Race
Top 4 within 3 points
Kimi K2.5 (64.7), Gemini 2.5 Pro (63.8), and MiniMax M2.5 (62.9) trail the leader by just 1.2 to 3.0 points. The frontier is crowded.
Best Multilingual
Gemini 2.5 Pro
Top scores in German (85%) and French (87%) and the best average across all three Swiss languages; Kimi K2.5 leads in Italian (87%).
Benchmark results: Swiss-Bench v1.0 (March 2026). Updated quarterly.
Detailed Results
Domain-specific performance
Domain Breakdown
| Domain | Best Model | Score | Runner-up | Gap (pts) |
|---|---|---|---|---|
| Financial Services | Claude Opus 4.6 | 91.2 | GPT-4o | +2.4 |
| Legal (Federal) | GPT-4o | 89.7 | Claude Opus 4.6 | +1.1 |
| Legal (Cantonal) | Claude Opus 4.6 | 86.3 | Gemini 2.0 Flash | +3.8 |
| Healthcare | Gemini 2.0 Flash | 84.9 | Claude Opus 4.6 | +0.7 |
| Public Administration | Claude Opus 4.6 | 88.1 | GPT-4o | +1.9 |
| Insurance | GPT-4o | 87.4 | Claude Opus 4.6 | +2.2 |
Failure Mode Analysis
| Failure Mode | Claude Opus 4.6 | GPT-4o | Gemini 2.0 Flash | Llama 3.3 70B |
|---|---|---|---|---|
| Hallucination Rate | 2.1% | 3.4% | 2.8% | 6.7% |
| Jurisdiction Confusion | 1.3% | 1.8% | 2.4% | 5.1% |
| Temporal Decay | 4.2% | 3.9% | 5.1% | 7.3% |
| Language Mixing | 0.8% | 1.2% | 0.6% | 3.4% |
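As a rough illustration, the sketch below tallies per-model failure rates of the kind shown above from labeled evaluation transcripts. The record format (one set of reviewer-assigned failure-mode labels per response) is a hypothetical schema for illustration, not Swiss-Bench's actual data model.

```python
# Sketch: percentage of responses exhibiting each failure mode.
# The input format (one set of reviewer-assigned labels per response,
# empty set = clean response) is a hypothetical schema.

from collections import Counter

FAILURE_MODES = ("hallucination", "jurisdiction_confusion",
                 "temporal_decay", "language_mixing")

def failure_rates(labels: list[set[str]]) -> dict[str, float]:
    """Map each failure mode to the percentage of responses showing it."""
    if not labels:
        raise ValueError("no responses to score")
    counts = Counter(mode for response in labels for mode in response)
    return {mode: 100 * counts[mode] / len(labels) for mode in FAILURE_MODES}

# Example: 3 responses, one flagged as a hallucination -> 33.3% hallucination rate.
print(failure_rates([{"hallucination"}, set(), set()]))
```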
Cross-Lingual Consistency
| Model | DE↔FR | DE↔IT | FR↔IT | Avg. Consistency |
|---|---|---|---|---|
| Claude Opus 4.6 | 96.8% | 94.2% | 95.1% | 95.4% |
| GPT-4o | 95.3% | 92.7% | 93.4% | 93.8% |
| Gemini 2.0 Flash | 96.1% | 95.8% | 94.9% | 95.6% |
| Mistral Large 2 | 97.2% | 91.3% | 92.7% | 93.7% |
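For readers who want to reproduce the consistency columns, here is a minimal sketch of pairwise cross-lingual agreement, assuming each test item exists in parallel DE/FR/IT versions. Exact answer matching is a simplifying assumption; Swiss-Bench's actual matching criterion is documented on the Methodology page.

```python
# Sketch: cross-lingual consistency as the share of parallel items that a
# model answers identically in two languages, averaged over the three pairs.
# Exact string matching is a simplifying assumption.

from itertools import combinations

def pair_consistency(answers_a: list[str], answers_b: list[str]) -> float:
    """Percentage of parallel items answered identically in both languages."""
    assert len(answers_a) == len(answers_b), "need parallel item sets"
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return 100 * matches / len(answers_a)

def average_consistency(answers_by_lang: dict[str, list[str]]) -> float:
    """Mean consistency over the DE-FR, DE-IT, and FR-IT pairs."""
    scores = [pair_consistency(answers_by_lang[a], answers_by_lang[b])
              for a, b in combinations(sorted(answers_by_lang), 2)]
    return sum(scores) / len(scores)
```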
Swiss-Bench methodology and scoring criteria are documented on our Methodology page →
A scientific article describing our methodology, expert-verified ground truth, and statistical framework is currently being prepared for peer-reviewed publication.
Need scores for YOUR domain? Our AI Model Evaluation (from CHF 8,000) runs Swiss-Bench against your specific use case. 5-model comparison, domain-specific scenarios, actionable recommendation.
Contact
contact@ai-helvetic.ch
Ready for an independent evaluation?
Start with an AI Model Evaluation or a full SOTA Model Sweep. Within two weeks you'll know which model works best for your Swiss use case.