dawnmsg committed
Commit 1b9dbb7 · verified · 1 Parent(s): 6e3cdad

Update README.md

Files changed (1): README.md (+5 -5)
README.md CHANGED
@@ -66,7 +66,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 ## 3. Evaluation Results
 
 **Reasoning Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|:-------:|
 | **HLE (Text-only)** | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
 | | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
@@ -81,7 +81,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 | **GPQA** | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
 
 **General Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **MMLU-Pro** | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 |
 | **MMLU-Redux** | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 |
@@ -89,7 +89,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 | **HealthBench** | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 |
 
 **Agentic Search Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **BrowseComp** | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 |
 | **BrowseComp-ZH** | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 |
@@ -98,7 +98,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
 | **Frames** | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* |
 
 **Coding Tasks**
-| Benchmark | Setting | K2 Thinking | GPT-5 | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
+| Benchmark | Setting | K2 Thinking | GPT-5<br> (High) | Claude Sonnet 4.5<br> (Thinking) | K2 0905 | DeepSeek-V3.2 |
 |:----------:|:--------:|:------------:|:------:|:----------------------------:|:--------:|:--------------:|
 | **SWE-bench Verified** | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 |
 | **SWE-bench Multilingual** | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 |
@@ -118,7 +118,7 @@ Kimi K2 Thinking is the latest, most capable version of open-source thinking mod
   2.3. For AIME and HMMT (no tools), we report the average of 32 runs (avg@32). For AIME and HMMT (with Python), we report the average of 16 runs (avg@16). For IMO-AnswerBench, we report the average of 8 runs (avg@8).
 
 3. **Baselines**:
-  3.1 GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek-V3.2 results are quoted from the [GPT-5 post](https://openai.com/index/introducing-gpt-5/), the [GPT-5 for Developers post](https://openai.com/index/introducing-gpt-5-for-developers/), the [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), the [Claude Sonnet 4.5 post](https://www.anthropic.com/news/claude-sonnet-4-5), the [Grok-4 post](https://x.ai/news/grok-4), the [DeepSeek-V3.2 post](https://api-docs.deepseek.com/news/news250929), the [public Terminal-Bench leaderboard](https://www.tbench.ai/leaderboard) (Terminus-2), the [public Vals AI leaderboard](https://vals.ai/), and [Artificial Analysis](https://artificialanalysis.ai/). Benchmarks with no publicly available scores were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*).
+  3.1 GPT-5, Claude Sonnet 4.5, Grok-4, and DeepSeek-V3.2 results are quoted from the [GPT-5 post](https://openai.com/index/introducing-gpt-5/), the [GPT-5 for Developers post](https://openai.com/index/introducing-gpt-5-for-developers/), the [GPT-5 system card](https://openai.com/index/gpt-5-system-card/), the [Claude Sonnet 4.5 post](https://www.anthropic.com/news/claude-sonnet-4-5), the [Grok-4 post](https://x.ai/news/grok-4), the [DeepSeek-V3.2 post](https://api-docs.deepseek.com/news/news250929), the [public Terminal-Bench leaderboard](https://www.tbench.ai/leaderboard) (Terminus-2), the [public Vals AI leaderboard](https://vals.ai/), and [Artificial Analysis](https://artificialanalysis.ai/). Benchmarks with no publicly available scores were re-tested under the same conditions used for K2 Thinking and are marked with an asterisk (*). For the GPT-5 tests, we set the reasoning effort to high.
   3.2 The GPT-5 and Grok-4 scores on the HLE full set with tools are 35.2 and 38.6, as reported in the official posts. In our internal evaluation on the HLE text-only subset, GPT-5 scores 41.7 and Grok-4 scores 38.6 (Grok-4's launch post cited 41.0 on the text-only subset). For GPT-5's HLE text-only score without tools, we use the score from <a href="https://scale.com/leaderboard/humanitys_last_exam_text_only" target="_blank">Scale.ai</a>. The official GPT-5 HLE full-set score without tools is 24.8.
   3.3 For <a href="https://aclanthology.org/2025.emnlp-main.1794.pdf" target="_blank">IMO-AnswerBench</a>: GPT-5 scored 65.6 in the benchmark paper. We re-evaluated GPT-5 with the official API and obtained a score of 76.
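
The avg@k protocol in note 2.3 amounts to scoring the full benchmark k times with independent samples and reporting the mean. A minimal sketch, assuming a caller-supplied `evaluate_once` function that performs one complete scoring pass over the question set (the name is hypothetical, not part of the actual evaluation harness):

```python
import statistics
from typing import Callable

def avg_at_k(evaluate_once: Callable[[], float], k: int) -> float:
    """avg@k: run the whole benchmark k times with fresh samples
    and report the mean score, as described in note 2.3.

    `evaluate_once` is a hypothetical stand-in for one complete
    scoring pass over the question set (e.g. one AIME run).
    """
    return statistics.mean(evaluate_once() for _ in range(k))
```

With k=32 this matches the avg@32 setting used for AIME and HMMT without tools; k=16 matches avg@16 (with Python), and k=8 matches avg@8 (IMO-AnswerBench).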
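
Note 3.1 says the GPT-5 baselines were run with reasoning effort set to high. The README does not include the evaluation code; a hedged sketch of how that option is typically passed via the OpenAI Python SDK's `reasoning_effort` parameter could look like the following, where the model name and prompt are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Sketch only: mirrors the "reasoning effort to high" setting from note 3.1.
# The model name and prompt are placeholders, not the actual harness.
response = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Solve: what is 17^2 - 13^2?"}],
)
print(response.choices[0].message.content)
```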
124