Update README.md
README.md (changed)
@@ -66,7 +66,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
## 3. Evaluation Results

**Reasoning Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|:-------:|
| **HLE (Text-only)** | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
| | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
@@ -81,7 +81,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
| **GPQA** | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |

**General Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| **MMLU-Pro** | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
| **MMLU-Redux** | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
@@ -89,7 +89,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
| **HealthBench** | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |

**Agentic Search Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| **BrowseComp** | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
| **BrowseComp-ZH** | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 |
@@ -98,7 +98,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
| **Frames** | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* |

**Coding Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
|:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
| **SWE-bench Verified** | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
| **SWE-bench Multilingual** | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 |
@@ -118,7 +118,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
2.3 For AIME and HMMT (no tools), we report the average of 32 runs (avg@32). For AIME and HMMT (with Python), we report the average of 16 runs (avg@16). For IMO-AnswerBench, we report the average of 8 runs (avg@8).

3. **Baselines**:
-3.1 GPT-5, Claude Sonnet 4.5, Grok-4 and DeepSeek-V3.2 results are quoted from the [GPT-5 post](https://openai.com/index/introducing-gpt-5/), [GPT-5 for Developers post](https://openai.com/index/introducing-gpt-5-for-developers/), [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), [claude-sonnet-4-5 post](https://www.anthropic.com/news/claude-sonnet-4-5), [grok-4 post](https://x.ai/news/grok-4), [deepseek-v3.2 post](https://api-docs.deepseek.com/news/news250929), the [public Terminal-Bench leaderboard](https://www.tbench.ai/leaderboard) (Terminus-2), the [public Vals AI leaderboard](https://vals.ai/) and [artificialanalysis](https://artificialanalysis.ai/). Benchmarks for which no public scores were available were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*).
+3.1 GPT-5, Claude Sonnet 4.5, Grok-4 and DeepSeek-V3.2 results are quoted from the [GPT-5 post](https://openai.com/index/introducing-gpt-5/), [GPT-5 for Developers post](https://openai.com/index/introducing-gpt-5-for-developers/), [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), [claude-sonnet-4-5 post](https://www.anthropic.com/news/claude-sonnet-4-5), [grok-4 post](https://x.ai/news/grok-4), [deepseek-v3.2 post](https://api-docs.deepseek.com/news/news250929), the [public Terminal-Bench leaderboard](https://www.tbench.ai/leaderboard) (Terminus-2), the [public Vals AI leaderboard](https://vals.ai/) and [artificialanalysis](https://artificialanalysis.ai/). Benchmarks for which no public scores were available were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*). For the GPT-5 tests, we set the reasoning effort to high.
3.2 The GPT-5 and Grok-4 scores on the HLE full set with tools are 35.2 and 38.6, respectively, from the official posts. In our internal evaluation on the HLE text-only subset, GPT-5 scores 41.7 and Grok-4 scores 38.6 (Grok-4's launch post cited 41.0 on the text-only subset). For GPT-5's HLE text-only score without tools, we use the score from <a href="https://scale.com/leaderboard/humanitys_last_exam_text_only" target="_blank">Scale.ai</a>. The official GPT-5 HLE full-set score without tools is 24.8.
3.3 For <a href="https://aclanthology.org/2025.emnlp-main.1794.pdf" target="_blank">IMO-AnswerBench</a>: GPT-5 scored 65.6 in the benchmark paper. We re-evaluated GPT-5 with the official API and obtained a score of 76.
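Note 2.3 reports avg@k scores. As a minimal illustrative sketch (the function and variable names below are assumptions for illustration, not part of the README or its evaluation harness), avg@k is simply the mean score over k independent runs of the same benchmark:

```python
# Minimal sketch: avg@k = mean score over k independent runs.
# Names (avg_at_k, run_scores) are illustrative assumptions.
from statistics import mean

def avg_at_k(run_scores: list[float], k: int) -> float:
    """Average benchmark score over exactly k repeated runs."""
    if len(run_scores) != k:
        raise ValueError(f"expected {k} runs, got {len(run_scores)}")
    return mean(run_scores)

# Example: 32 hypothetical per-run accuracies for AIME (no tools) -> avg@32
scores = [0.93, 0.90, 0.97, 0.93] * 8
print(f"avg@32 = {avg_at_k(scores, 32):.3f}")
```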