GSM8K Chat
Grade School Math 8K adapted for chat format evaluation, featuring 8.5K high-quality linguistically diverse grade school math word problems requiring multi-step reasoning and elementary arithmetic operations.
Progress Over Time
[Interactive timeline: model performance on GSM8K Chat over time. Legend: state-of-the-art frontier, open models, proprietary models.]
GSM8K Chat Leaderboard
1 model
| Rank | Model | Parameters | Score | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | Llama 3.1 Nemotron 70B Instruct (NVIDIA) | 70B | 0.819 | — | — | |
FAQ
Common questions about GSM8K Chat
The GSM8K Chat paper is available at https://arxiv.org/abs/2110.14168. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The GSM8K Chat leaderboard currently ranks 1 AI model. Llama 3.1 Nemotron 70B Instruct by NVIDIA leads with a score of 0.819. Because it is the only entry, the average score across all models is also 0.819.
The highest GSM8K Chat score is 0.819, achieved by Llama 3.1 Nemotron 70B Instruct from NVIDIA.
1 model has been evaluated on the GSM8K Chat benchmark, with 0 verified results and 1 self-reported result.
GSM8K Chat is categorized under math and reasoning. The benchmark evaluates text models.
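To make a score such as 0.819 concrete, here is a minimal sketch of how accuracy on GSM8K-style problems could be computed from chat replies. It relies on the fact that GSM8K reference solutions end with `#### <answer>`. The rest is an assumption for illustration only: the helper names (`reference_answer`, `predicted_answer`, `accuracy`) are hypothetical, and taking the last number in a reply as the model's answer is one common heuristic, not necessarily the rule used for this leaderboard.

```python
import re
from typing import Optional, Sequence

# Hypothetical helpers for illustration; not the leaderboard's actual scoring code.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Return the final answer of a GSM8K reference solution (the text after '####')."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(reply: str) -> Optional[str]:
    """Assumption: treat the last number in the chat reply as the model's answer."""
    matches = NUMBER.findall(reply)
    return matches[-1].replace(",", "") if matches else None


def accuracy(replies: Sequence[str], solutions: Sequence[str]) -> float:
    """Fraction of replies whose extracted number equals the reference answer."""
    correct = 0
    for reply, solution in zip(replies, solutions):
        pred = predicted_answer(reply)
        if pred is not None and float(pred) == float(reference_answer(solution)):
            correct += 1
    return correct / len(solutions)
```

Calling `accuracy(replies, solutions)` on a list of model replies and the matching reference solutions returns a value between 0 and 1, the same scale as the scores reported above. A stricter evaluation might require a specific answer format instead of taking the last number.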