Hi @wlabchoi!
Thanks for trying out the benchmarks and for the detailed question!
We've migrated everything to our new GitHub repository.
For TeleMath specifically, the evaluation methodology is documented in the paper: *TeleMath: A Benchmark Dataset for Assessing Large Language Models Capability on Telecom Math*.
To answer your questions directly:
- Evaluation metrics: We use pass@1 (single-attempt accuracy) and cons@16 (majority voting over 16 samples) with temperature 0.6 and top_p 0.90 (see the sketch after this list)
- Answer validation: Numerical exact-match; answers are strictly numerical values with units either stated in the question or implied by context
- No post-processing: Answers are compared directly against ground-truth numerical values
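
For a concrete sense of the scoring, here's a minimal sketch of how pass@1 and cons@16 could be computed, illustrative only; the helper names `numeric_equal`, `pass_at_1`, and `cons_at_n` are hypothetical and not taken from our codebase:

```python
from collections import Counter

def numeric_equal(pred: str, target: str) -> bool:
    # Numerical exact-match: parse both sides as numbers and compare directly
    try:
        return float(pred) == float(target)
    except ValueError:
        return False

def pass_at_1(samples: list[str], target: str) -> bool:
    # pass@1: correctness of a single sampled answer
    return numeric_equal(samples[0], target)

def cons_at_n(samples: list[str], target: str) -> bool:
    # cons@N (e.g. N=16): majority vote over the sampled answers,
    # then exact-match the winning answer against the ground truth
    majority, _ = Counter(samples).most_common(1)[0]
    return numeric_equal(majority, target)
```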
You can also find more context on the overall benchmark methodology in our blog post.
For running evaluations locally, check the repo documentation:
- Getting Started
- Running Evaluations
- List of Evals
The framework uses Inspect AI, which should help with reproducibility.
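
If you want to wire up something similar locally, an Inspect AI task looks roughly like the sketch below; the task name, sample question, and scorer choice are placeholders for illustration, not the actual TeleMath implementation:

```python
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.model import GenerateConfig
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def telemath_example():
    # Placeholder sample; the real task loads the TeleMath dataset from the repo
    dataset = [
        Sample(
            input="A channel has 10 MHz bandwidth and a spectral efficiency of "
                  "2 bit/s/Hz. What is the throughput in Mbit/s?",
            target="20",
        )
    ]
    return Task(
        dataset=dataset,
        solver=generate(),
        scorer=match(numeric=True),  # numeric exact-match against the target
        config=GenerateConfig(temperature=0.6, top_p=0.9),
    )
```

You could then run it with `inspect eval` against the model of your choice.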
Thanks for using our benchmarks!