UNO-Scorer: A Unified General Scoring Model for UNO-Bench
Introduction
UNO-Scorer is a lightweight, high-precision LLM-based evaluation model designed to automate the evaluation of Large Multimodal Models (LMMs) with minimal computational overhead.
Core Functionality:
- Input: Question + Reference Answer + Model Response
- Processing: Analyzes correctness by comparing each sub-question against the reference answer
- Output: Numerical score + Detailed evaluation reasoning for each sub-question
Built upon the powerful Qwen3-14B backbone, UNO-Scorer is fine-tuned on 13K high-quality in-house samples. It overcomes the limitations of traditional Overall Reward Models (ORMs) by supporting 6 distinct question types, with particular strength on Multi-Step Open-Ended Questions (MO).
Performance
UNO-Scorer demonstrates superior performance in automated evaluation, particularly in handling complex Multi-Step Open-Ended Questions. We compared the accuracy of our scorer against other advanced evaluators on our test set:
| Model | Accuracy |
|---|---|
| Seed-1.5-VL | 0.9118 |
| GPT-4.1 | 0.9457 |
| UNO-Scorer (Ours) | 0.9505 |
Experiments show that UNO-Scorer surpasses even proprietary frontier models such as GPT-4.1 on this evaluation task, at a lower cost.
Usage
Quick Start (HuggingFace Transformers)
Get started with UNO-Scorer in just a few lines of code:
pip install -U transformers
python3 test_scorer_hf.py --model-name /path/to/UNO-Scorer
Minimal Example:
⚠️ Critical: The prompt used in the minimal example below is simplified for illustration. Only the complete prompt template in test_scorer_hf.py will properly activate the model's fine-tuned scoring capabilities. Custom or simplified prompts will not achieve optimal results.
Complete prompt template (from test_scorer_hf.py):
def process_score_prompt(question, reference, response):
    prompt_template = """请先阅读问题信息，然后基于参考答案对模型回复的结果进行正确性打分。每道题可能包含多个小问，每个小问都已给出了相应的参考答案和分值，请逐小问校验模型回复是否正确，正确得对应分值，错误或漏答得0分，累计计分，按如下要求。
---
### 要求1：信息梳理
- 梳理出如下信息
- 问题内容
- 参考答案（可适度完善表达，但不改变核心内容）
- 模型回复（需要将模型回复中的指代关系与参考答案对齐）
- 分值
### 要求2：判断题型
- 明确该小问属于以下哪种题型之一，并基于该类型的打分标准进行打分，需要给出详细的比对过程。
- **数值型**：要求模型回复与标准答案的数值完全相同，不允许有误差。例：`问题：北京奥运会是哪一年？参考答案：2008；模型回复：2004；打分结果：错误。`
- **枚举型**：要求模型回复列举出参考答案的全部对象，缺一不可、错一不可，允许同义词等语义相近的表达，题中有顺序要求则必须按顺序枚举。例：`图中出现了哪些动物？参考答案：大熊猫、河马、长颈鹿；模型回复：河马、小熊猫、长颈鹿；打分结果：错误。` 注：“/”表示“或”，如：XXA/XXB，表示回答出任意一项即可。
- **选择题**：要求模型回复与参考答案相同的选项或选项内容。例：`问题：李白是哪个朝代的诗人？A. 唐朝 B. 宋朝 C. 元朝；模型回复：李白是唐朝诗人；打分结果：正确。`
- **判断题**：要求模型回复与参考答案的判断一致。例：`问题：图中鼠标是否放在了笔记本电脑左侧？参考答案：是；模型回复：图中鼠标在笔记本电脑的左侧。打分结果：正确。`
- **简答题**：要求模型回复包括与参考答案语义一致的短语或表达，允许表达方式不同。例：`问题：视频中最后放入锅中的食材是什么？参考答案：洋葱；模型回复：胡萝卜。打分结果：错误。`
- **论述题**：要求模型回复包含参考答案的核心观点。例：`问题：请简要论述为什么要保护生物多样性。参考答案：维持生态平衡；模型回复：保护生物多样性能够让生态系统保持稳定，促进人类社会的可持续发展。打分结果：正确。`
### 要求3：打分标准
- **完全正确**：得满分。
- **错误或漏答**：得0分。
- 如模型回复与参考答案大意相同但细节略有差别（非核心内容），视为正确，具体参考参考答案的详细要求。
- 若模型回复未直接给出答案，需主动归纳总结结论，只关注结论是否一致。
- 每小问独立打分，前序错误不影响后续小问的结果。
### 要求4：输出格式
- 逐小问列出得分说明。
- 所有小问得分相加，在<score></score>中给出总分，例如：<score>5</score>
---
## 问题信息
{{question}}
## 参考答案
{{reference}}
## 模型回复
{{response}}
## 逐小问打分"""
    prompt = prompt_template.replace("{{question}}", remove_thought_block(question.strip()))
    prompt = prompt.replace("{{reference}}", reference)
    prompt = prompt.replace("{{response}}", response)
    return prompt
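A quick usage sketch of this function (remove_thought_block is stubbed out here for illustration; the real helper is presumably defined alongside this function in test_scorer_hf.py):
# Hypothetical stub for illustration only; use the real helper from test_scorer_hf.py.
def remove_thought_block(text):
    return text

full_prompt = process_score_prompt(
    question="Which animal appears in the image?",
    reference="Sub-question 1: Elephant, total score 10 points",
    response="I see an elephant in the image.",
)
# full_prompt now contains the complete Chinese template with the three
# {{...}} placeholders filled in, ready to be sent as a user message.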
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
def extract_score(text):
    matches = re.findall(r'<score>([\d.]+)</score>', text)
    return float(matches[-1]) if matches else 0.0
tokenizer = AutoTokenizer.from_pretrained("meituan-longcat/UNO-Scorer-Qwen3-14B")
model = AutoModelForCausalLM.from_pretrained(
"meituan-longcat/UNO-Scorer-Qwen3-14B",
torch_dtype="auto",
device_map="auto"
)
# Prepare scoring prompt
question = "Which animal appears in the image?"
reference = "Sub-question 1: Elephant, total score 10 points"
response = "I see an elephant in the image."
# This prompt template is simplified for illustration.
prompt = f"""Please score the model's response based on the reference answer.
Question: {question}
Reference Answer: {reference}
Model Response: {response}
Provide a step-by-step analysis and output the total score in <score></score> tags."""
# Generate score
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=4096,
do_sample=False
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
result = tokenizer.decode(output_ids, skip_special_tokens=True)
print("Score response:\n", result)
score = extract_score(result)
print(f"Score: {score}")
How It Works
UNO-Scorer evaluates model responses through a structured process:
- Information Organization: Extracts question content, reference answer, model response, and scoring criteria
- Question Type Classification: Identifies the question type (multiple-choice, numerical, enumeration, yes/no, short-answer, or essay)
- Detailed Comparison: Compares model response against reference answer using type-specific criteria
- Score Extraction: Outputs the final score in <score>X</score> format (where X is 0-10)
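As a small sketch of the final step (a defensive variant of the extract_score helper from the Quick Start, not part of the repository), the parsed value can also be clamped to the documented 0-10 range:
import re

def extract_score_clamped(text):
    """Parse the last <score>...</score> tag and clamp the value to 0-10."""
    matches = re.findall(r'<score>([\d.]+)</score>', text)
    if not matches:
        return 0.0
    return min(max(float(matches[-1]), 0.0), 10.0)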
Input Format Requirements
The model expects three key inputs:
| Component | Description | Example |
|---|---|---|
| Question | The original question posed to the model | "Which animals appear in the image?" |
| Reference Answer | Ground truth answer with point allocation (sum to 10) | Sub-question 1: Elephant, total score 10 points |
| Model Response | The response from the model being evaluated | "I see an elephant in the image." |
Reference Answer Formatting (Critical!)
Since the model is trained primarily on Chinese corpora, formatting reference answers in Chinese yields significantly better results. However, English formatting is also supported.
For Single-Answer Questions:
1. {Answer}, total score 10 points, focus only on final answer correctness
1. {答案}，总分10分，无需关注推理过程，最终答案正确即可
For Multi-Part Questions:
1. {Sub-Answer A} ({X} points); 2. {Sub-Answer B} ({Y} points)
1. {子答案A}（{X}分）; 2. {子答案B}（{Y}分）
With Custom Scoring Criteria:
1. {Answer}, total score 10 points, scoring criteria: {detailed criteria}
1. {答案}，总分10分，评分标准：{详细标准}
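For programmatic use, a small hypothetical helper along these lines can assemble a multi-part reference answer in the Chinese format above and verify that the points sum to 10 (the function name and details are illustrative, not part of the repository):
def build_reference(sub_answers):
    """Format [(answer, points), ...] as '1. {answer}（{points}分）; 2. ...' and check the total."""
    total = sum(points for _, points in sub_answers)
    if total != 10:
        raise ValueError(f"Sub-question points must sum to 10, got {total}")
    return "; ".join(
        f"{i}. {answer}（{points}分）"
        for i, (answer, points) in enumerate(sub_answers, start=1)
    )

# Example: two sub-questions worth 6 and 4 points.
reference = build_reference([("大象", 6), ("两只", 4)])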
Output Format
The model returns:
- Detailed Evaluation: Step-by-step analysis for each sub-question
- Score Tag: <score>X</score>, where X ranges from 0 to 10
Example output:
Sub-question 1:
Question Content: How many apples are in the image?
Reference Answer: 2
Model Response: There are two apples.
Points: 10 points
Question Type: Numerical
Comparison Process: The reference answer is "2" and the model response is "two". The numerical values are completely identical, with only the expression format differing. This meets the scoring standard for numerical questions.
Scoring Explanation: Completely correct, awarded 10 points.
<score>10</score>
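Feeding output like this to the extract_score helper from the Quick Start recovers the numeric total:
# Reuses extract_score() defined in the Quick Start example above.
score = extract_score("Scoring Explanation: Completely correct, awarded 10 points.\n<score>10</score>")
print(score)  # 10.0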
Complete Evaluation Example
See test_scorer_hf.py for a full working example with multiple question types:
- Multiple-choice questions
- Yes/No questions
- Open-ended questions
- Multi-part questions
Run the example:
python3 test_scorer_hf.py --model-name /path/to/UNO-Scorer
Optimized Inference with vLLM (Recommended for Production)
For large-scale evaluation tasks, we strongly recommend using vLLM for significant performance improvements:
# 1. Clone the repository
git clone https://github.com/meituan-longcat/UNO-Bench.git
cd UNO-Bench/uno-eval
# 2. Install dependencies
pip install -r requirements.txt
# 3. Run vLLM-based inference
bash examples/test_scorer_vllm.sh
Why vLLM?
- 10-20x faster inference compared to standard HuggingFace Transformers generation
- Better batching support for multiple evaluation tasks
- Lower memory footprint
- Optimized for production deployments
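As a minimal sketch of this workflow (assumptions: vLLM is installed, prompts are built with the complete template via process_score_prompt, and evaluation_triples is your own list of (question, reference, response) tuples; examples/test_scorer_vllm.sh remains the authoritative entry point):
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meituan-longcat/UNO-Scorer-Qwen3-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)
params = SamplingParams(temperature=0.0, max_tokens=4096)

# Build one chat-formatted prompt per (question, reference, response) triple.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": process_score_prompt(q, ref, resp)}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q, ref, resp in evaluation_triples  # assumed: your own evaluation data
]

outputs = llm.generate(prompts, params)
scores = [extract_score(o.outputs[0].text) for o in outputs]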
⚠️ Important Notes
- Language: Chinese formatting in reference answers produces significantly better results due to the model's training data composition
- Point Allocation: Reference answers must have total points equal to 10
- Score Extraction: Always look for <score>X</score> in the output
- Batch Processing: Use vLLM for evaluating multiple responses efficiently
- Question Type Awareness: Ensure reference answers clearly specify the question type for optimal scoring
Supported Question Types
UNO-Scorer supports evaluation across 6 distinct question types:
| Question Type | Description | Scoring Rule |
|---|---|---|
| Multiple-Choice | Select correct option from given choices | Response must match the correct option exactly |
| Numerical | Provide specific numerical values | No tolerance for numerical errors |
| Enumeration | List all required items | Must include all items, no omissions or errors |
| Yes/No | Binary judgment questions | Response judgment must match reference answer |
| Short-Answer | Brief factual answers | Semantic equivalence acceptable, expression flexibility allowed |
| Essay | Longer analytical responses | Must contain core viewpoints from reference answer |
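For instance, reference answers for a few of these types (hypothetical values, formatted per the rules above) might look like:
# Illustrative reference answers; every total sums to 10 points.
example_references = {
    "Numerical": "1. 2008, total score 10 points",
    "Enumeration": "1. Giant panda, hippo, giraffe, total score 10 points",
    "Yes/No": "1. Yes, total score 10 points",
    "Short-Answer": "1. Onion, total score 10 points",
}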
Citation
If you find this model or the UNO-Bench useful for your research, please cite our paper:
@misc{chen2025unobench,
title={UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models},
author={Chen Chen and ZeYang Hu and Fengjiao Chen and Liya Ma and Jiaxing Liu and Xiaoyu Li and Ziwen Wang and Xuezhi Cao and Xunliang Cai},
year={2025},
eprint={2510.18915},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.18915},
}
License & Disclaimer
This model is released under the Apache 2.0 License. It is based on Qwen3-14B. Please strictly follow the license and usage policy of the original Qwen model series.
Disclaimer: This model is designed for research and evaluation purposes. Users are responsible for ensuring their use complies with applicable laws and regulations.
Contributing
We welcome contributions and feedback! Please feel free to:
- Report issues or bugs
- Suggest improvements
- Share your evaluation results
- Contribute enhancements
For more information, visit our GitHub repository.