dawnmsg committed
Commit 1b9dbb7 · verified · 1 Parent(s): 6e3cdad

Update README.md

Files changed (1): README.md (+5 -5)
README.md CHANGED
@@ -66,7 +66,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 ## 3. Evaluation Results
 
 **Reasoning Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|:-------:|
 | **HLE (Text-only)** | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
 | | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
@@ -81,7 +81,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 | **GPQA** | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
 
 **General Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **MMLU-Pro** | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
 | **MMLU-Redux** | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
@@ -89,7 +89,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 | **HealthBench** | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |
 
 **Agentic Search Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **BrowseComp** | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
 | **BrowseComp-ZH** | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 |
@@ -98,7 +98,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 | **Frames** | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* |
 
 **Coding Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **SWE-bench Verified** | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
 | **SWE-bench Multilingual** | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 |
@@ -118,7 +118,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
   2.3. For AIME and HMMT (no tools), we report the average of 32 runs (avg@32). For AIME and HMMT (with Python), we report the average of 16 runs (avg@16). For IMO-AnswerBench, we report the average of 8 runs (avg@8).
 
 3. **Baselines**:
-  3.1 GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek-V3.2 results are quoted from the [GPT-5 post](https://openai.com/index/introducing-gpt-5/), the [GPT-5 for Developers post](https://openai.com/index/introducing-gpt-5-for-developers/), the [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), the [Claude Sonnet 4.5 post](https://www.anthropic.com/news/claude-sonnet-4-5), the [Grok-4 post](https://x.ai/news/grok-4), the [DeepSeek-V3.2 post](https://api-docs.deepseek.com/news/news250929), the [public Terminal-Bench leaderboard](https://www.tbench.ai/leaderboard) (Terminus-2), the [public Vals AI leaderboard](https://vals.ai/), and [Artificial Analysis](https://artificialanalysis.ai/). Benchmarks with no publicly available scores were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*).
+  3.1 GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek-V3.2 results are quoted from the [GPT-5 post](https://openai.com/index/introducing-gpt-5/), the [GPT-5 for Developers post](https://openai.com/index/introducing-gpt-5-for-developers/), the [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), the [Claude Sonnet 4.5 post](https://www.anthropic.com/news/claude-sonnet-4-5), the [Grok-4 post](https://x.ai/news/grok-4), the [DeepSeek-V3.2 post](https://api-docs.deepseek.com/news/news250929), the [public Terminal-Bench leaderboard](https://www.tbench.ai/leaderboard) (Terminus-2), the [public Vals AI leaderboard](https://vals.ai/), and [Artificial Analysis](https://artificialanalysis.ai/). Benchmarks with no publicly available scores were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*). For the GPT-5 tests, we set the reasoning effort to high.
   3.2 The GPT-5 and Grok-4 scores on the HLE full set with tools are 35.2 and 38.6, as reported in the official posts. In our internal evaluation on the HLE text-only subset, GPT-5 scores 41.7 and Grok-4 scores 38.6 (Grok-4's launch post cited 41.0 on the text-only subset). For GPT-5's HLE text-only score without tools, we use the score from <a href="https://scale.com/leaderboard/humanitys_last_exam_text_only" target="_blank">Scale.ai</a>. The official GPT-5 HLE full-set score without tools is 24.8.
   3.3 For <a href="https://aclanthology.org/2025.emnlp-main.1794.pdf" target="_blank">IMO-AnswerBench</a>: GPT-5 scored 65.6 in the benchmark paper. We re-evaluated GPT-5 with the official API and obtained a score of 76.
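
The avg@k protocol in note 2.3 amounts to scoring the full benchmark k times with independent samples and reporting the mean. A minimal sketch, assuming a caller-supplied `evaluate_once` function that performs one complete scoring pass over the question set (the name is hypothetical, not part of the actual evaluation harness):

```python
import statistics
from typing import Callable

def avg_at_k(evaluate_once: Callable[[], float], k: int) -> float:
    """avg@k: run the whole benchmark k times with fresh samples
    and report the mean score, as described in note 2.3.

    `evaluate_once` is a hypothetical stand-in for one complete
    scoring pass over the question set (e.g. one AIME run).
    """
    return statistics.mean(evaluate_once() for _ in range(k))
```

With k=32 this matches the avg@32 setting used for AIME and HMMT without tools; k=16 matches avg@16 (with Python), and k=8 matches avg@8 (IMO-AnswerBench).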
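
Note 3.1 says the GPT-5 baselines were run with reasoning effort set to high. The README does not include the evaluation code; a hedged sketch of how that option is typically passed via the OpenAI Python SDK's `reasoning_effort` parameter could look like the following, where the model name and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Sketch only: mirrors the "reasoning effort to high" setting from note 3.1.
# The model name and prompt are placeholders, not the actual harness.
response = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Solve: what is 17^2 - 13^2?"}],
)
print(response.choices[0].message.content)
```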
124