ModelEvalPanel · v1.0 — llm eval scorecard
offline · zero telemetry

LLM Eval Scorecard

36 eval cases × 4 models × 6 risk dimensions. Cross-model matrix, regression detection against the last baseline, and per-case pass/fail with the actual prompt and expected behavior. Structured around NIST AI RMF (GOVERN 4.1, MEASURE 2.6) and EU AI Act Art. 15 (accuracy and robustness).
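As a rough illustration of regression detection against a baseline, the sketch below compares per-case pass/fail outcomes between two runs. It is not taken from ModelEvalPanel itself: the function name diff_against_baseline, the (case_id, model) key shape, and the sample IDs are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical key shape: (case_id, model_name). Not ModelEvalPanel's schema.
Key = Tuple[str, str]


def diff_against_baseline(
    current: Dict[Key, bool], baseline: Dict[Key, bool]
) -> Dict[str, List[Key]]:
    """Classify each (case, model) outcome as regressed or improved.

    A pair regresses when it passed in the baseline and fails now; it
    improves when it failed in the baseline and passes now. Pairs that
    are missing from the baseline are skipped, since there is nothing
    to compare them against.
    """
    regressed: List[Key] = []
    improved: List[Key] = []
    for key, passed in current.items():
        if key not in baseline:
            continue
        before = baseline[key]
        if before and not passed:
            regressed.append(key)
        elif passed and not before:
            improved.append(key)
    return {"regressed": sorted(regressed), "improved": sorted(improved)}


# Made-up sample data, for illustration only.
baseline_run = {("c1", "m1"): True, ("c2", "m1"): False, ("c3", "m1"): True}
current_run = {
    ("c1", "m1"): False,
    ("c2", "m1"): True,
    ("c3", "m1"): True,
    ("c4", "m1"): True,  # new case, absent from the baseline
}
changes = diff_against_baseline(current_run, baseline_run)
```

Under these assumptions, the "regressed" and "improved" counters would simply be the lengths of the two lists.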

eval cases: 0
pass overall: 0
regressed: 0
improved: 0
flagged dims: 0

Cross-model matrix · pass-rate by dimension · current vs last-baseline delta
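One way such a matrix could be computed is sketched below: pass rates are aggregated per (model, dimension) cell, then subtracted from the baseline run's rates to get the delta. The function names, the record shape, and the dimension labels ("safety", "bias") are hypothetical; the page does not name the six risk dimensions or describe its data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (model, dimension, passed).
Record = Tuple[str, str, bool]
Cell = Tuple[str, str]  # (model, dimension)


def pass_rate_matrix(results: Iterable[Record]) -> Dict[Cell, float]:
    """Return the fraction of passing cases for each (model, dimension) cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [passed, total]
    for model, dimension, passed in results:
        counts = totals[(model, dimension)]
        counts[0] += int(passed)
        counts[1] += 1
    return {cell: p / n for cell, (p, n) in totals.items()}


def matrix_delta(
    current: Dict[Cell, float], baseline: Dict[Cell, float]
) -> Dict[Cell, float]:
    """Current minus baseline pass rate, for cells present in both runs."""
    return {cell: current[cell] - baseline[cell] for cell in current if cell in baseline}


# Made-up sample data, for illustration only.
current_rates = pass_rate_matrix(
    [("m1", "safety", True), ("m1", "safety", False), ("m1", "bias", True)]
)
baseline_rates = pass_rate_matrix(
    [("m1", "safety", True), ("m1", "safety", True)]
)
delta = matrix_delta(current_rates, baseline_rates)
```

A negative delta marks a cell whose pass rate dropped since the baseline. Whether a drop counts as a flagged dimension would depend on a threshold the page does not specify.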

Select an eval case to see the prompt, expected behavior, and per-model outcomes.