HorizonMath

Progress Over Time

[Figure: interactive timeline of model performance on HorizonMath. Legend: state-of-the-art frontier, open models, proprietary models.]

HorizonMath Leaderboard

2 models, both from ByteDance and both ranked 1. Columns: Context, Cost, License.
About this benchmark

What is HorizonMath?

HorizonMath is an extremely difficult frontier mathematics benchmark designed to test the limits of mathematical reasoning on research-level and beyond-competition problems.

HorizonMath is a text benchmark evaluating models on math and reasoning tasks. LLM Stats tracks 2 models on this benchmark, scored on a 0–1 scale. The current average is 0.020, with the leader at 0.020.

Compare leaders on the best AI for math and best AI for reasoning leaderboards.

Current leaders

Seed 2.1 Pro from ByteDance is listed first on the HorizonMath leaderboard with a score of 0.020 (2.0%). Seed 2.1 Turbo, the only other evaluated model, has the same score.

1. Seed 2.1 Pro (ByteDance): 2.0%
1. Seed 2.1 Turbo (ByteDance): 2.0%
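
The figures on this page are related by simple arithmetic: a score of 0.020 on the 0–1 scale is the same as 2.0%, and the average is the mean of the listed scores. The following is a minimal sketch of that calculation, using only the two scores shown above. The `scores` dictionary and the `summarize` helper are illustrative names, not part of any LLM Stats API.

```python
# Scores as listed on the leaderboard, on the 0-1 scale.
# (Illustrative data structure; not an official API.)
scores = {
    "Seed 2.1 Pro": 0.020,
    "Seed 2.1 Turbo": 0.020,
}


def summarize(scores):
    """Return (average, top score, names of all models tied at the top)."""
    average = sum(scores.values()) / len(scores)
    top = max(scores.values())
    leaders = sorted(name for name, s in scores.items() if s == top)
    return average, top, leaders


average, top, leaders = summarize(scores)

# The ".1%" format multiplies by 100, so 0.020 prints as 2.0%.
print(f"average: {average:.3f} ({average:.1%})")
print(f"top: {top:.3f} ({top:.1%}), held by {', '.join(leaders)}")
```

Returning every model that matches the top score, rather than a single name, makes ties such as the one on this leaderboard visible.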

FAQ

Common questions about the HorizonMath benchmark and leaderboard.

What is the HorizonMath benchmark?

HorizonMath is an extremely difficult frontier mathematics benchmark designed to test the limits of mathematical reasoning on research-level and beyond-competition problems.

What is the HorizonMath leaderboard?

The HorizonMath leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.020. The average score across all models is 0.020.

What is the highest HorizonMath score?

The highest HorizonMath score is 0.020, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on HorizonMath?

Two models have been evaluated on the HorizonMath benchmark. Both results are self-reported, and none are verified.

What categories does HorizonMath cover?

HorizonMath is categorized under math and reasoning. The benchmark evaluates text models.

How recent are the HorizonMath leaderboard results?

The HorizonMath leaderboard was last updated in June 2026 and currently includes 2 evaluated models.