When One Benchmark Failed: How 37% Citation Errors Changed Our View of Claude Opus 4.5
https://tiny-wiki.win/index.php/Why_single_benchmark_scores_mislead:_interpreting_a_low_Vectara_score_with_high_AA-Omniscience
How a research team that trusted a single benchmark discovered widespread citation problems

In spring 2024 our applied-NLP research team supported product decisions at a mid-stage startup