GM-PRM: A Generative Multimodal Process Reward Model
Model weights for GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning (arXiv:2508.04088).
Accepted at the 4th Workshop on Advances in Language and Vision Research (ALVR), in conjunction with ACL 2026 (San Diego, California, July 2026).
Overview
GM-PRM transforms a multimodal Process Reward Model from a passive binary verifier into an active reasoning collaborator. Instead of emitting a scalar correct/incorrect score per step, it produces a fine-grained, interpretable analysis of each reasoning step along three dimensions:
- Step intent — what the step is trying to do
- Image alignment — whether the step is consistent with the image
- Reasoning logic — whether the logic and calculations are sound
Crucially, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This corrective ability powers our test-time inference strategy, Refined Best-of-N (Refined-BoN), which feeds the corrected step back to the policy model to steer it toward a better reasoning trajectory — improving both the diversity and correctness of the solution pool.
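The Refined-BoN loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `policy` and `critic` are hypothetical callables standing in for the policy model and GM-PRM, the critic is assumed to return a 0-based index of the first incorrect step, a corrected step, and a scalar solution score, and the single refinement round per candidate is an arbitrary default.

```python
from typing import Callable, List, Optional, Tuple

Steps = List[str]
# (index of first incorrect step or None, corrected step or None, solution score)
Critique = Tuple[Optional[int], Optional[str], float]


def refined_bon(
    problem: str,
    policy: Callable[[str, Steps], Steps],    # hypothetical: returns the remaining steps after a prefix
    critic: Callable[[str, Steps], Critique],  # hypothetical: wraps GM-PRM
    n: int = 8,
    max_refinements: int = 1,                  # assumption: one corrective round per candidate
) -> Steps:
    pool: List[Tuple[float, Steps]] = []
    for _ in range(n):
        # Sample a full solution from scratch and have the critic judge it.
        steps = policy(problem, [])
        first_bad, corrected, score = critic(problem, steps)
        pool.append((score, steps))

        rounds = 0
        while first_bad is not None and corrected and rounds < max_refinements:
            # Keep the steps before the error, swap in the corrected step,
            # and let the policy continue from that repaired prefix.
            prefix = steps[:first_bad] + [corrected]
            steps = prefix + policy(problem, prefix)
            first_bad, corrected, score = critic(problem, steps)
            pool.append((score, steps))
            rounds += 1

    # Select the highest-scoring solution from the enlarged pool.
    return max(pool, key=lambda item: item[0])[1]
```

Keeping the original candidates in the pool alongside the refined ones means refinement can only add options; the final choice still rests on the critic's score.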
Model details
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
- Training: full-parameter SFT (ViT encoder frozen), 2 epochs, lr 1e-5, bf16, DeepSpeed ZeRO-3
- Training data: zijinghuafen/GM-PRM-20K — 19,614 samples (plane geometry + functions), filtered by joint agreement of GPT-4o (LLM-as-a-judge) and Monte-Carlo estimation
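One plausible reading of the "joint agreement" filter is sketched below. It assumes each step has a boolean label from the LLM judge and a list of Monte-Carlo rollout outcomes, and that a step counts as correct under Monte-Carlo estimation when at least one rollout reaches the right answer. The function names, the threshold, and the per-sample granularity are all assumptions made for illustration; the paper defines the actual procedure.

```python
from typing import List


def mc_label(rollout_correct: List[bool], threshold: float = 0.0) -> bool:
    """Label a step correct if the share of successful rollouts exceeds the threshold.

    The default threshold of 0.0 (any successful rollout) is an assumption.
    """
    return sum(rollout_correct) / len(rollout_correct) > threshold


def keep_sample(judge_labels: List[bool], rollouts_per_step: List[List[bool]]) -> bool:
    """Keep a sample only when the judge and the Monte-Carlo labels agree on every step."""
    mc_labels = [mc_label(rollouts) for rollouts in rollouts_per_step]
    return judge_labels == mc_labels
```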
Results
Used as the critic in Refined-BoN (N=8), GM-PRM consistently improves policy-model accuracy across five multimodal math benchmarks (MathVista, MathVision, MathVerse, DynaMath, WeMath). Average gains: +5.9 (MiniCPM-V-2.6-8B), +4.5 (Llama-3.2-11B-Vision), +4.5 (Qwen2.5-VL-7B), +5.6 (InternVL3-8B). See the paper for full results.
Usage
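GM-PRM is a Qwen2.5-VL model; load it with transformers. The example below also uses process_vision_info from the qwen-vl-utils package. Give the model the problem image and the policy model's step-by-step solution, and it returns a per-step analysis, per-step judgements, and a corrected version of the first incorrect step.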
import re

import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Load the reward model and its processor from the Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "zijinghuafen/GM-PRM", torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained("zijinghuafen/GM-PRM")

image_path = "problem.png"
response = "<the policy model's step-by-step solution>"
# Treat each blank-line-separated paragraph of the solution as one step.
steps = re.split(r"\n\s*\n", response)
prompt = (
    "You are an expert in solving multimodal mathematical problems. You will be given:\n"
    "1. An image of a multimodal mathematical problem.\n2. A multi-step solution.\n\n"
    "**Task**:\nThe tasks you need to do are:\n"
    "1. Analyze the purpose of each step and what specific actions were taken.\n"
    "2. Analyze each step's correctness in terms of image alignment and reasoning logic.\n"
    "- Image alignment: Whether the information and reasoning used in the step are consistent with the content of the provided image.\n"
    "- Reasoning logic: Whether the reasoning is logically sound, calculations are correct, and information used matches that from previous steps and question.\n"
    "When outputting judgements, you must choose one output from \"Correct\" or \"Incorrect\".\n"
    "3. For the first incorrect step, correct it based on your analysis of its error and intent, and output the corrected step at the end of your output.\n\n"
    "**Output Format**:\nYou must output your content in the following format:\n"
    "### Step 1 ###\nStep intent analysis:[...]\nImage alignment analysis:[...]\n"
    "Judgement of image alignment:[Correct/Incorrect]\nReasoning logic analysis:[...]\n"
    "Judgement of reasoning logic:[Correct/Incorrect]\nFinal judgement of the current step:[Correct/Incorrect]\n\n"
    "### Step 2 ###\n...\n\n"
    "Corrected step of the first incorrect step:[If there are incorrect steps, the corrected step of the first incorrect step goes here. Otherwise, omit this line]\n\n"
    "**Problem**:\nThe image of problem is as follows:\n<image>\n\n"
    "**Solution Steps**:\nSteps you need to analyze and judge are as follows:\n"
)
# Append the numbered solution steps to the prompt.
for j, s in enumerate(steps):
    prompt += f"Step {j+1}: {s}\n\n"

messages = [{"role": "user", "content": [
    {"type": "image", "image": image_path},
    {"type": "text", "text": prompt},
]}]

# Build the chat-formatted text and the image tensors, then generate greedily.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=2048, do_sample=False)

# Decode only the newly generated tokens, dropping the prompt.
print(processor.batch_decode([out[0][inputs.input_ids.shape[1]:]], skip_special_tokens=True)[0])
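The reply follows the output format requested in the prompt, so it can be turned into structured data with a few regular expressions. The sketch below is illustrative only: `parse_critique` is a hypothetical helper, and it assumes the model reproduces the "### Step k ###", "Final judgement of the current step:" and "Corrected step of the first incorrect step:" markers verbatim, with or without the square brackets.

```python
import re

# One block per step, ending at the next step header, the corrected-step line, or end of text.
STEP_RE = re.compile(
    r"###\s*Step\s+(\d+)\s*###(.*?)"
    r"(?=###\s*Step\s+\d+\s*###|Corrected step of the first incorrect step:|\Z)",
    re.S,
)
FINAL_RE = re.compile(r"Final judgement of the current step:\s*\[?\s*(Correct|Incorrect)", re.I)
CORRECTED_RE = re.compile(r"Corrected step of the first incorrect step:\s*(.*)", re.S)


def parse_critique(text):
    """Return (judgements, first_bad_step, corrected_step) parsed from the model's reply.

    judgements maps the 1-based step number to True (Correct) or False (Incorrect).
    """
    judgements = {}
    for match in STEP_RE.finditer(text):
        verdict = FINAL_RE.search(match.group(2))
        if verdict:
            judgements[int(match.group(1))] = verdict.group(1).capitalize() == "Correct"

    bad_steps = sorted(k for k, ok in judgements.items() if not ok)
    first_bad = bad_steps[0] if bad_steps else None

    corrected = None
    found = CORRECTED_RE.search(text)
    if found:
        corrected = found.group(1).strip()
        # Drop the surrounding brackets only when they wrap the whole step.
        if corrected.startswith("[") and corrected.endswith("]"):
            corrected = corrected[1:-1].strip()
        corrected = corrected or None
    return judgements, first_bad, corrected
```

The step number returned here is 1-based, matching the "Step k" labels in the prompt; subtract one before indexing into the `steps` list.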
The probabilities of the generated Correct / Incorrect tokens can be used as step-level scores for Best-of-N selection.
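A minimal sketch of that scoring idea follows. It assumes you already have the generated tokens as decoded text pieces together with their log-probabilities; the commented lines at the end show one way to obtain them with `generate(..., output_scores=True)` and `compute_transition_scores`. The helper `step_scores` is hypothetical, and mapping an "Incorrect" verdict with probability p to a score of 1 - p is a rough proxy chosen for illustration, not necessarily the scoring rule used in the paper.

```python
import math
import re

VERDICT_RE = re.compile(r"Final judgement of the current step:\s*\[?\s*(Correct|Incorrect)")


def step_scores(pieces, logprobs):
    """Return one score in [0, 1] per step, read from the final-judgement tokens.

    pieces:   decoded text of each generated token, in order
    logprobs: log-probability the model assigned to each of those tokens
    """
    # Record the character offset at which each token piece starts in the joined text.
    text, starts = "", []
    for piece in pieces:
        starts.append(len(text))
        text += piece

    scores = []
    for match in VERDICT_RE.finditer(text):
        lo, hi = match.span(1)
        # The verdict word may span several tokens; sum the log-probs of all that overlap it.
        logp = sum(
            logprobs[i]
            for i, start in enumerate(starts)
            if start < hi and start + len(pieces[i]) > lo
        )
        p = math.exp(logp)
        scores.append(p if match.group(1) == "Correct" else 1.0 - p)
    return scores


# Obtaining the inputs from the model (not executed here):
# out = model.generate(**inputs, max_new_tokens=2048, do_sample=False,
#                      return_dict_in_generate=True, output_scores=True)
# trans = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
# gen_ids = out.sequences[0, inputs.input_ids.shape[1]:].tolist()
# pieces = [processor.tokenizer.decode([t]) for t in gen_ids]
# logprobs = trans[0].tolist()
```

To rank whole solutions for Best-of-N, the per-step scores still need to be aggregated, for example by taking their minimum or mean; which aggregation to use is left open here.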
Citation
@inproceedings{zhang2026gmprm,
  title={GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning},
  author={Zhang, Jianghangfan and Yan, Yibo and Zheng, Kening and Zou, Xin and Dai, Song and Hu, Xuming},
  booktitle={Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR), in conjunction with ACL 2026},
  month={July},
  year={2026},
  address={San Diego, California, USA},
  publisher={Association for Computational Linguistics}
}