The HAAS Score: 6 dimensions, fully reproducible
Every AI system we evaluate receives a Helvetic AI Assurance Score (HAAS) across 6 dimensions. Each dimension is scored 0–100 with confidence intervals. Our methodology is fully documented and every result is independently verifiable.
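The public description does not specify how the confidence intervals are constructed; as an illustration only, a dimension scored from pass/fail scenario outcomes could use a Wilson score interval (a common choice for binomial pass rates), rescaled to 0–100:

```python
import math

def dimension_score(passed: int, total: int, z: float = 1.96):
    """Score a dimension 0-100 with an approximate 95% Wilson interval.

    Illustrative sketch: the actual HAAS interval construction is not
    specified here. Wilson is a standard choice for pass/fail data.
    Returns (score, lower_bound, upper_bound), each on the 0-100 scale.
    """
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (
        round(100 * p, 1),
        round(100 * max(0.0, centre - half), 1),
        round(100 * min(1.0, centre + half), 1),
    )
```

For example, 87 passes out of 100 scenarios yields a score of 87.0 with an interval of roughly 79–92, which is why the number of scenarios per dimension matters as much as the score itself.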
6 evaluation dimensions
Performance (incl. Hallucination Rate)
Task completion, factual correctness, and hallucination rate. Domain-specific scenarios from Swiss-Bench test real-world performance, not generic benchmark recall.
Robustness
Adversarial inputs, prompt injection resistance, stress testing. How does the model perform under edge cases and adversarial conditions?
Safety
Hallucination detection, fabricated citation identification, harmful output avoidance. Tests whether models invent Swiss legal references or produce misleading regulatory guidance.
Compliance
EU AI Act technical compliance via 29 Compl-AI benchmarks (ETH Zurich). Automated scoring across applicable articles and technical requirements for AI system governance.
Swiss Language
Multilingual competence across German, French, and Italian. MMLU-ProX language-specific accuracy and Swiss translation quality. How well does the model handle Switzerland’s three official languages?
Documentation
Regulatory gap analysis quality: how well models identify differences between EU-wide regulations and Swiss-specific requirements (FINMA, nDPA). Tests structured reasoning about regulatory frameworks.
Three layers of evaluation technology
Inspect AI
An open-source evaluation framework from the UK AI Safety Institute, adopted by Anthropic and Google DeepMind. It provides the infrastructure for running reproducible model evaluations at scale, with over 100 built-in evaluation tasks and a proven architecture for systematic AI testing.
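The task/solver/scorer pattern such frameworks are built on can be sketched in plain Python. The names below are illustrative only, not Inspect AI's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    input: str    # prompt shown to the model
    target: str   # expected answer

@dataclass
class Task:
    dataset: list[Sample]
    solver: Callable[[str], str]        # the model (or pipeline) under test
    scorer: Callable[[str, str], bool]  # compares model output to target

def run(task: Task) -> float:
    """Run every sample through the solver and return the pass rate."""
    passed = sum(task.scorer(task.solver(s.input), s.target) for s in task.dataset)
    return passed / len(task.dataset)

# Toy usage: a stub "model", scored by exact match.
toy = Task(
    dataset=[Sample("2+2?", "4"), Sample("Capital of Switzerland?", "Bern")],
    solver=lambda prompt: "4" if "2+2" in prompt else "Bern",
    scorer=lambda output, target: output.strip() == target,
)
```

Because the dataset, solver, and scorer are declared separately, the same scenarios can be re-run against any model with the scoring held fixed, which is what makes results comparable across systems.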
Compl-AI
EU AI Act compliance scoring framework mapping regulatory principles to technical requirements. Published, peer-reviewed methodology from leading European AI safety researchers.
Swiss-Bench
436 Swiss-specific evaluation scenarios across 11 tasks. Tests German, French, and Italian comprehension on domain-specific tasks. Detects jurisdiction confusion, Verwaltungsdeutsch comprehension failures, temporal decay, and cross-lingual inconsistencies.
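To make "jurisdiction confusion" concrete, here is a hedged sketch of what such a check could look like. The scenario fields and marker lists are hypothetical, not the actual Swiss-Bench schema:

```python
# Hypothetical shape of a Swiss-Bench-style scenario (illustrative fields).
scenario = {
    "id": "swiss-bench-001",
    "language": "de",  # de / fr / it
    "task": "jurisdiction_confusion",
    "prompt": "Welche Datenschutzbehörde ist für ein Zürcher KMU zuständig?",
    "expected_jurisdiction": "CH",
}

# Illustrative marker lists: terms that signal EU vs. Swiss data protection law.
EU_MARKERS = ("GDPR", "DSGVO", "RGPD", "European Data Protection Board")
CH_MARKERS = ("nDSG", "revDSG", "EDÖB", "FDPIC")

def flags_jurisdiction_confusion(answer: str, expected: str) -> bool:
    """Flag answers that cite EU sources for a question governed by Swiss law."""
    cites_eu = any(m in answer for m in EU_MARKERS)
    cites_ch = any(m in answer for m in CH_MARKERS)
    return expected == "CH" and cites_eu and not cites_ch
```

An answer grounding a Swiss data protection question in the DSGVO alone would be flagged; one citing the EDÖB would pass this particular check.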
Peer-reviewed methodology
Every methodological choice in our evaluation system is grounded in peer-reviewed research. Our citation accuracy evaluation follows Asai et al. (Nature, 2026) — the same study that found GPT-4o hallucinates citations 78–90% of the time. This is why we evaluate legal citation correctness as a dedicated scoring dimension.
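A dedicated citation-correctness check can be as simple as verifying every cited provision against a registry of articles known to exist. This is a minimal sketch under stated assumptions: the registry, citation pattern, and statute abbreviations below are illustrative, not our production implementation:

```python
import re

# Illustrative registry of provisions known to exist (deliberately tiny;
# a real registry would cover the full consolidated legislation).
KNOWN_PROVISIONS = {("OR", 394), ("ZGB", 2), ("DSG", 19)}

# Matches citations like "Art. 394 OR" for a few statute abbreviations.
CITATION = re.compile(r"Art\.\s*(\d+)\s+(OR|ZGB|DSG)")

def fabricated_citations(text: str) -> list[str]:
    """Return cited provisions that do not match any known article."""
    return [
        f"Art. {num} {code}"
        for num, code in CITATION.findall(text)
        if (code, int(num)) not in KNOWN_PROVISIONS
    ]
```

Given "Gemäss Art. 394 OR und Art. 999 ZGB ...", only the second citation is flagged: the article number does not correspond to any registered provision.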
Our regulatory compliance mapping adapts the Compl-AI framework (ETH Zurich, arXiv:2410.07959), recognized by the OECD. Our holistic evaluation philosophy follows HELM (Stanford CRFM, peer-reviewed in TMLR). Swiss legal translation evaluation builds on methodology validated by Niklaus et al. (EMNLP 2023, ACL 2025) covering 180,000+ Swiss legal translation pairs.
We are currently preparing a scientific article for peer-reviewed publication that details our complete evaluation methodology, expert verification process, and statistical framework. Our ground truth verification follows MMLU-Redux (Gema et al., NAACL 2025), which found a 9% error rate in widely-used benchmarks. Our expert annotation protocol is modelled on CUAD (Hendrycks et al., NeurIPS 2021) and LegalBench (Guha et al., NeurIPS 2023). In total, our methodology draws on 40+ peer-reviewed publications.
Every result is reproducible
Every evaluation follows a documented, reproducible methodology. You receive detailed benchmark results, scoring breakdowns, and methodology documentation with every engagement — sufficient to verify and understand every finding.
This is not an opinion. It’s evidence.
No conflicts of interest
Helvetic AI has no commercial relationships with any AI model provider. No referral fees, no vendor partnerships, no pay-for-score agreements. Every model is evaluated with the same system, the same benchmarks, and the same scoring methodology.
Key publications
- Asai, A. et al. “Citation correctness in large language models.” Nature, 2026.
- Dobreva, R. et al. “Compl-AI: Compliance assessment of LLMs against EU AI Act requirements.” arXiv:2410.07959, 2024. (ETH Zürich / INSAIT)
- Liang, P. et al. “Holistic Evaluation of Language Models (HELM).” TMLR, 2023. (Stanford CRFM)
- UK AI Safety Institute. “Inspect AI: evaluation framework for AI systems.” Open-source software (MIT License), 2024.
- Niklaus, J. et al. “MultiLegalPile: a 689GB multilingual legal corpus.” EMNLP, 2023.
- Niklaus, J. et al. “Swiss legal translation evaluation: 180,000+ translation pairs.” ACL, 2025.
- Gema, A.P. et al. “MMLU-Redux: Fixing expert-written evaluation sets.” NAACL, 2025.
- Hendrycks, D. et al. “CUAD: An expert-annotated NLP dataset for legal contract review.” NeurIPS, 2021.
- Guha, N. et al. “LegalBench: A collaboratively built benchmark for measuring legal reasoning.” NeurIPS, 2023.
- OECD. “AI risk management and governance frameworks.” OECD AI Policy Observatory, 2024.
Questions about our methodology?
We're happy to discuss our evaluation approach in detail.