The HAAS Score: 6 dimensions, fully reproducible
Every AI system we evaluate receives a Helvetic AI Assurance Score (HAAS) across 6 dimensions. Each dimension is scored 0–100 with confidence intervals. Our methodology is fully documented and every result is independently verifiable.
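The public description does not specify how the confidence intervals are constructed; as an illustration only, a dimension scored from pass/fail scenario outcomes could use a Wilson score interval (a common choice for binomial pass rates), rescaled to 0–100:

```python
import math

def dimension_score(passed: int, total: int, z: float = 1.96):
    """Score a dimension 0-100 with an approximate 95% Wilson interval.

    Illustrative sketch: the actual HAAS interval construction is not
    specified here. Wilson is a standard choice for pass/fail data.
    Returns (score, lower_bound, upper_bound), each on the 0-100 scale.
    """
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (
        round(100 * p, 1),
        round(100 * max(0.0, centre - half), 1),
        round(100 * min(1.0, centre + half), 1),
    )
```

For example, 87 passes out of 100 scenarios yields a score of 87.0 with an interval of roughly 79–92, which is why the number of scenarios per dimension matters as much as the score itself.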
6 evaluation dimensions
Performance (incl. Hallucination Rate)
Task completion, factual correctness, and hallucination rate. Domain-specific scenarios from Swiss-Bench test real-world performance, not generic benchmark recall.
Robustness
Adversarial inputs, prompt injection resistance, stress testing. How does the model perform under edge cases and adversarial conditions?
Safety
Hallucination detection, fabricated citation identification, harmful output avoidance. Tests whether models invent Swiss legal references or produce misleading regulatory guidance.
Compliance
EU AI Act technical compliance via 29 Compl-AI benchmarks (ETH Zurich). Automated scoring across applicable articles and technical requirements for AI system governance.
Swiss Language
Multilingual competence across German, French, and Italian. MMLU-ProX language-specific accuracy and Swiss translation quality. How well does the model handle Switzerland’s three official languages?
Documentation
Regulatory gap analysis quality: how well models identify differences between EU-wide regulations and Swiss-specific requirements (FINMA, nDPA). Tests structured reasoning about regulatory frameworks.
Three layers of evaluation technology
Inspect AI
An open-source evaluation framework from the UK AI Safety Institute, adopted by Anthropic and Google DeepMind. It provides the infrastructure for running reproducible model evaluations at scale, with over 100 built-in evaluation tasks and a proven architecture for systematic AI testing.
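The task/solver/scorer pattern such frameworks are built on can be sketched in plain Python. The names below are illustrative only, not Inspect AI's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    input: str    # prompt shown to the model
    target: str   # expected answer

@dataclass
class Task:
    dataset: list[Sample]
    solver: Callable[[str], str]        # the model (or pipeline) under test
    scorer: Callable[[str, str], bool]  # compares model output to target

def run(task: Task) -> float:
    """Run every sample through the solver and return the pass rate."""
    passed = sum(task.scorer(task.solver(s.input), s.target) for s in task.dataset)
    return passed / len(task.dataset)

# Toy usage: a stub "model", scored by exact match.
toy = Task(
    dataset=[Sample("2+2?", "4"), Sample("Capital of Switzerland?", "Bern")],
    solver=lambda prompt: "4" if "2+2" in prompt else "Bern",
    scorer=lambda output, target: output.strip() == target,
)
```

Because the dataset, solver, and scorer are declared separately, the same scenarios can be re-run against any model with the scoring held fixed, which is what makes results comparable across systems.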
Compl-AI
EU AI Act compliance scoring framework mapping regulatory principles to technical requirements. Published, peer-reviewed methodology from leading European AI safety researchers.
Swiss-Bench
436 Swiss-specific evaluation scenarios across 11 tasks. Tests German, French, and Italian comprehension on domain-specific tasks. Detects jurisdiction confusion, Verwaltungsdeutsch comprehension failures, temporal decay, and cross-lingual inconsistencies.
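To make "jurisdiction confusion" concrete, here is a hedged sketch of what such a check could look like. The scenario fields and marker lists are hypothetical, not the actual Swiss-Bench schema:

```python
# Hypothetical shape of a Swiss-Bench-style scenario (illustrative fields).
scenario = {
    "id": "swiss-bench-001",
    "language": "de",  # de / fr / it
    "task": "jurisdiction_confusion",
    "prompt": "Welche Datenschutzbehörde ist für ein Zürcher KMU zuständig?",
    "expected_jurisdiction": "CH",
}

# Illustrative marker lists: terms that signal EU vs. Swiss data protection law.
EU_MARKERS = ("GDPR", "DSGVO", "RGPD", "European Data Protection Board")
CH_MARKERS = ("nDSG", "revDSG", "EDÖB", "FDPIC")

def flags_jurisdiction_confusion(answer: str, expected: str) -> bool:
    """Flag answers that cite EU sources for a question governed by Swiss law."""
    cites_eu = any(m in answer for m in EU_MARKERS)
    cites_ch = any(m in answer for m in CH_MARKERS)
    return expected == "CH" and cites_eu and not cites_ch
```

An answer grounding a Swiss data protection question in the DSGVO alone would be flagged; one citing the EDÖB would pass this particular check.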
Peer-reviewed methodology
Every methodological choice in our evaluation system is grounded in peer-reviewed research. Our citation accuracy evaluation follows Asai et al. (Nature, 2026) — the same study that found GPT-4o hallucinates citations 78–90% of the time. This is why we evaluate legal citation correctness as a dedicated scoring dimension.
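A dedicated citation-correctness check can be as simple as verifying every cited provision against a registry of articles known to exist. This is a minimal sketch under stated assumptions: the registry, citation pattern, and statute abbreviations below are illustrative, not our production implementation:

```python
import re

# Illustrative registry of provisions known to exist (deliberately tiny;
# a real registry would cover the full consolidated legislation).
KNOWN_PROVISIONS = {("OR", 394), ("ZGB", 2), ("DSG", 19)}

# Matches citations like "Art. 394 OR" for a few statute abbreviations.
CITATION = re.compile(r"Art\.\s*(\d+)\s+(OR|ZGB|DSG)")

def fabricated_citations(text: str) -> list[str]:
    """Return cited provisions that do not match any known article."""
    return [
        f"Art. {num} {code}"
        for num, code in CITATION.findall(text)
        if (code, int(num)) not in KNOWN_PROVISIONS
    ]
```

Given "Gemäss Art. 394 OR und Art. 999 ZGB ...", only the second citation is flagged: the article number does not correspond to any registered provision.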
Our regulatory compliance mapping adapts the Compl-AI framework (ETH Zurich, arXiv:2410.07959), recognized by the OECD. Our holistic evaluation philosophy follows HELM (Stanford CRFM, peer-reviewed in TMLR). Swiss legal translation evaluation builds on methodology validated by Niklaus et al. (EMNLP 2023, ACL 2025) covering 180,000+ Swiss legal translation pairs.
We are currently preparing a scientific article for peer-reviewed publication that details our complete evaluation methodology, expert verification process, and statistical framework. Our ground truth verification follows MMLU-Redux (Gema et al., NAACL 2025), which found a 9% error rate in widely-used benchmarks. Our expert annotation protocol is modelled on CUAD (Hendrycks et al., NeurIPS 2021) and LegalBench (Guha et al., NeurIPS 2023). In total, our methodology draws on 40+ peer-reviewed publications.
Every result is reproducible
Every evaluation follows a documented, reproducible methodology. You receive detailed benchmark results, scoring breakdowns, and methodology documentation with every engagement — sufficient to verify and understand every finding.
This is not an opinion. It’s evidence.
No conflicts of interest
Helvetic AI has no commercial relationships with any AI model provider. No referral fees, no vendor partnerships, no pay-for-score agreements. Every model is evaluated with the same system, the same benchmarks, and the same scoring methodology.
Key publications
- Asai, A. et al. “Citation correctness in large language models.” Nature, 2026.
- Dobreva, R. et al. “Compl-AI: Compliance assessment of LLMs against EU AI Act requirements.” arXiv:2410.07959, 2024. (ETH Zürich / INSAIT)
- Liang, P. et al. “Holistic Evaluation of Language Models (HELM).” TMLR, 2023. (Stanford CRFM)
- UK AI Safety Institute. “Inspect AI: evaluation framework for AI systems.” Open-source software (MIT License), 2024.
- Niklaus, J. et al. “MultiLegalPile: a 689GB multilingual legal corpus.” EMNLP, 2023.
- Niklaus, J. et al. “Swiss legal translation evaluation: 180,000+ translation pairs.” ACL, 2025.
- Gema, A.P. et al. “MMLU-Redux: Fixing expert-written evaluation sets.” NAACL, 2025.
- Hendrycks, D. et al. “CUAD: An expert-annotated NLP dataset for legal contract review.” NeurIPS, 2021.
- Guha, N. et al. “LegalBench: A collaboratively built benchmark for measuring legal reasoning.” NeurIPS, 2023.
- OECD. “AI risk management and governance frameworks.” OECD AI Policy Observatory, 2024.
Questions about our methodology?
We're happy to discuss our evaluation approach in detail.