ModelEvalPanel — LLM Eval Scorecard for Prompt Regressions

What it is

ModelEvalPanel follows the shape shared by LLM eval harnesses such as Promptfoo, LangSmith evaluations, OpenAI Evals, and Weights & Biases: a cross-model pass-rate matrix you can read in 5 seconds, plus a per-case drilldown.

What’s in it

36 eval cases across 6 risk dimensions, 6 cases each:
- Factuality — capital cities, treaty dates, half-lives, FDA dosing, GDPR articles, Booker Prize winners
- Refusal correctness — must refuse harmful (synth meth); must NOT over-refuse benign (CPR steps, suicide-crisis support, first aid)
- Jailbreak resistance — DAN, “creative writing” framing, translation-vector, timeshift, the grandma exploit, base64-encoded
- PII leakage — fake-profile generation, transcript handling, JFK SSN, address-to-ZIP, API-key leak in output, phishing-template refusal
- Instruction following — JSON-only, exact word counts, alphabetical lists, exact-phrase refusal, “respond with second word only”, ping/pong protocol
- Bias / fairness — gender-default nurse/doctor, 5-scientist representational gap, CEO stereotype, pain-bias, baby-name defaults, loan rating on name alone

4 models scored: gpt-4o, claude-3.5-sonnet, llama-3.1-70b, gemini-1.5-pro.
Cross-model matrix: pass rate by dimension × model, plus the current-vs-baseline delta in percentage points (pp), color-coded.
Regressions: each case is flagged if it regressed from the prior baseline. The team treats the regression count as more important than the absolute pass rate.
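
The matrix and regression flags described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the record shape (caseId, dimension, model, passed, baselinePassed), the function names, and the sample rows are all assumptions made for the example.

```javascript
// Hypothetical result shape: one record per (case, model) run.
// `passed` is the current run; `baselinePassed` is the prior baseline run.
const results = [
  { caseId: "fact-01", dimension: "Factuality", model: "gpt-4o", passed: true, baselinePassed: true },
  { caseId: "fact-02", dimension: "Factuality", model: "gpt-4o", passed: false, baselinePassed: true },
  { caseId: "jail-01", dimension: "Jailbreak resistance", model: "gpt-4o", passed: true, baselinePassed: false },
  { caseId: "fact-01", dimension: "Factuality", model: "llama-3.1-70b", passed: true, baselinePassed: true },
];

// Build matrix[dimension][model] = { n, passRate, baselineRate, deltaPp }.
function buildMatrix(rows) {
  const counts = {};
  for (const r of rows) {
    if (!counts[r.dimension]) counts[r.dimension] = {};
    const byModel = counts[r.dimension];
    if (!byModel[r.model]) byModel[r.model] = { n: 0, pass: 0, basePass: 0 };
    const cell = byModel[r.model];
    cell.n += 1;
    if (r.passed) cell.pass += 1;
    if (r.baselinePassed) cell.basePass += 1;
  }

  const matrix = {};
  for (const [dimension, byModel] of Object.entries(counts)) {
    matrix[dimension] = {};
    for (const [model, c] of Object.entries(byModel)) {
      const passRate = (100 * c.pass) / c.n;
      const baselineRate = (100 * c.basePass) / c.n;
      // deltaPp is a difference of percentages, i.e. percentage points.
      matrix[dimension][model] = { n: c.n, passRate, baselineRate, deltaPp: passRate - baselineRate };
    }
  }
  return matrix;
}

// A case counts as a regression if it passed at baseline and fails now.
function findRegressions(rows) {
  return rows.filter((r) => r.baselinePassed && !r.passed);
}

const matrix = buildMatrix(results);
const regressions = findRegressions(results);

console.log(matrix["Factuality"]["gpt-4o"].deltaPp); // -50
console.log(regressions.map((r) => `${r.caseId} @ ${r.model}`)); // [ 'fact-02 @ gpt-4o' ]
```

Keeping the regression check per case rather than per cell matters: a cell's pass rate can stay flat while one case regresses and another recovers, and only the per-case flag surfaces that.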

Why this shape

NIST AI RMF 1.0 (MEASURE 2.6, “ongoing testing of system performance”), EU AI Act Art. 15 (accuracy, robustness, and cybersecurity throughout the lifecycle), and ISO/IEC 42001 §8.4 all point to the same shape: continuous evaluation with baselines, regression tracking, and dimensional coverage. The OWASP LLM Top-10 risks (LLM01 prompt injection, LLM06 sensitive information disclosure) sit on the same matrix, through the jailbreak-resistance and PII-leakage dimensions.

How it ships

Single HTML file, ~21KB, zero dependencies. It holds 36 cases (6 dimensions × 6 cases each) scored against 4 models, for 144 case-model results, plus the cross-model matrix renderer, all in 200 lines of vanilla JavaScript.
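
A dependency-free matrix renderer of this kind can be quite small. The sketch below is one way it might look, not the shipped implementation: the input shape, the CSS class names (delta-up, delta-down, delta-flat), and the sample numbers are invented for illustration.

```javascript
// Hypothetical input: matrix[dimension][model] = { passRate, deltaPp }.
// The numbers below are made-up sample values, not real eval results.
const sampleMatrix = {
  "Factuality": {
    "gpt-4o": { passRate: 83.3, deltaPp: -16.7 },
    "claude-3.5-sonnet": { passRate: 100, deltaPp: 0 },
  },
  "PII leakage": {
    "gpt-4o": { passRate: 100, deltaPp: 16.7 },
    "claude-3.5-sonnet": { passRate: 66.7, deltaPp: -33.3 },
  },
};

// Map a percentage-point delta to a CSS class used for color coding.
function deltaClass(deltaPp) {
  if (deltaPp > 0) return "delta-up";
  if (deltaPp < 0) return "delta-down";
  return "delta-flat";
}

// Format a delta with an explicit "+" for improvements.
function formatDelta(deltaPp) {
  const sign = deltaPp > 0 ? "+" : "";
  return `${sign}${deltaPp.toFixed(1)} pp`;
}

// Escape text before placing it inside HTML.
function escapeHtml(text) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return String(text).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Render one row per dimension and one column per model.
function renderMatrix(matrix, models) {
  const head = `<tr><th>Dimension</th>${models.map((m) => `<th>${escapeHtml(m)}</th>`).join("")}</tr>`;
  const body = Object.entries(matrix)
    .map(([dimension, byModel]) => {
      const cells = models
        .map((m) => {
          const cell = byModel[m];
          if (!cell) return "<td>n/a</td>";
          return `<td class="${deltaClass(cell.deltaPp)}">${cell.passRate.toFixed(0)}% (${formatDelta(cell.deltaPp)})</td>`;
        })
        .join("");
      return `<tr><th>${escapeHtml(dimension)}</th>${cells}</tr>`;
    })
    .join("");
  return `<table>${head}${body}</table>`;
}

const html = renderMatrix(sampleMatrix, ["gpt-4o", "claude-3.5-sonnet"]);
console.log(html);
```

In a page, the returned string could be assigned to a container element's innerHTML, with three CSS rules supplying the colors for the delta classes.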
