Evals in 2025: benchmarks to build models people can use