GSM8K Chat

Grade School Math 8K (GSM8K) adapted for chat-format evaluation. It features 8.5K high-quality, linguistically diverse grade school math word problems that require multi-step reasoning and elementary arithmetic.

GSM8K Chat Leaderboard

1 model

1. Llama 3.1 Nemotron 70B Instruct (NVIDIA), 70B parameters, score 0.819

FAQ

Common questions about GSM8K Chat

The paper behind GSM8K Chat, "Training Verifiers to Solve Math Word Problems", is available at https://arxiv.org/abs/2110.14168. It describes the benchmark methodology, dataset creation, and evaluation criteria.
The GSM8K Chat leaderboard currently ranks one AI model. Llama 3.1 Nemotron 70B Instruct by NVIDIA leads with a score of 0.819, which is therefore both the highest score and the average across all models.
One model has been evaluated on the GSM8K Chat benchmark. Its result is self-reported; there are no verified results.
GSM8K Chat is categorized under math and reasoning. The benchmark evaluates text models.
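
As a rough illustration of how a score such as 0.819 might be computed, the sketch below scores chat responses by exact match on the final number. The page does not state the leaderboard's scoring procedure, so this is an assumption. The helper names (`reference_answer`, `predicted_answer`, `accuracy`) and the "last number in the response" heuristic are hypothetical. The only detail taken from the GSM8K dataset itself is that each reference solution ends with `#### <answer>`.

```python
import re
from typing import Iterable, Optional

# Integers or decimals, with optional sign and thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer from a GSM8K reference solution ("... #### 18")."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(response: str) -> Optional[str]:
    """Hypothetical heuristic: treat the last number in the response as the answer."""
    matches = NUMBER.findall(response)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def accuracy(responses: Iterable[str], solutions: Iterable[str]) -> float:
    """Fraction of responses whose final number equals the reference answer."""
    pairs = list(zip(responses, solutions))
    correct = 0
    for response, solution in pairs:
        predicted = predicted_answer(response)
        if predicted is not None and float(predicted) == float(reference_answer(solution)):
            correct += 1
    return correct / len(pairs)


if __name__ == "__main__":
    responses = [
        "She sells 9 eggs at $2 each, so 9 * 2 = 18. The answer is 18.",
        "I am not sure how to solve this.",
    ]
    solutions = [
        "16 - 3 - 4 = 9\n9 * 2 = 18\n#### 18",
        "#### 5",
    ]
    print(accuracy(responses, solutions))  # prints 0.5
```

Taking the last number is a simple choice for free-form chat output, but it can misfire when a response keeps talking after stating its answer. A real harness would more likely ask for a fixed answer format.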