Model Card for Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM

Introduction

Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM is a Process Reward Model (PRM) fine-tuned from Qwen2.5-Math-7B-Instruct. This model is specifically designed to evaluate the correctness of intermediate reasoning steps in mathematical problem-solving processes, enabling more reliable and interpretable mathematical reasoning.

The model has been trained on the SHARP-Math dataset using the Process Reward Model methodology, which provides step-by-step feedback on mathematical reasoning chains.

This model is part of the SHARP-PRM series.

Model Information

Base Model

  • Fine-tuned from: Qwen/Qwen2.5-Math-7B-Instruct

Training Details

  • Training Dataset: SHARP-Math (Process Reward Model dataset)
  • Training Method: Process Reward Model (PRM) as introduced in Uesato et al., 2022
  • Training Framework: TRL (Transformer Reinforcement Learning) v0.24.0
  • Task Type: Token Classification (binary classification: error/correct for each reasoning step)

PRM Evaluation

This model is designed to evaluate mathematical reasoning processes by:

  1. Step-level Evaluation: Classifying each step in a reasoning chain as either "correct" or "error"
  2. Process Feedback: Providing feedback on the reasoning process, not just the final answer
  3. Error Detection: Identifying where mistakes occur in multi-step mathematical solutions

Evaluation Metrics

The model is evaluated on the ProcessBench benchmark.

Key metrics include:

  • Error Accuracy: accuracy on samples that contain an error (the model must locate the earliest incorrect step)
  • Correct Accuracy: accuracy on fully correct samples (the model must flag no step as incorrect)
  • F1 Score: the harmonic mean of error and correct accuracy (see the sketch below)
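
For intuition, here is a minimal sketch of how these three metrics relate. It follows ProcessBench's convention of labeling the earliest erroneous step per sample; the helper name and its inputs are hypothetical, and this is not the official evaluation harness:

def processbench_style_f1(pred_errors, gold_errors):
    # pred_errors / gold_errors hold the predicted and labeled index of the
    # earliest incorrect step for each sample, with -1 meaning "no error"
    erroneous = [(p, g) for p, g in zip(pred_errors, gold_errors) if g != -1]
    correct = [(p, g) for p, g in zip(pred_errors, gold_errors) if g == -1]
    error_acc = sum(p == g for p, g in erroneous) / len(erroneous)
    correct_acc = sum(p == g for p, g in correct) / len(correct)
    f1 = 2 * error_acc * correct_acc / (error_acc + correct_acc)
    return error_acc, correct_acc, f1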

Quick Start

Installation

pip install transformers torch

Basic Usage

Using the Model for Step Classification

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "ZaandaTeika/Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example: Evaluate a mathematical reasoning chain
# Problem with steps (one correct, one incorrect)
problem = "Solve: 2x + 5 = 13"
steps = [
    "Subtract 5 from both sides: 2x = 8",  # Correct step
    "Divide by 2: x = 5"  # Incorrect step (should be x = 4)
]

# Format input with step separator
input_text = problem + "\n\n" + "\n\n".join(steps)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=8192)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [batch_size, sequence_length, num_labels]
    probabilities = F.softmax(logits, dim=-1)  # Convert to probabilities
    predictions = torch.argmax(logits, dim=-1)  # Get predicted class indices

# Aggregate predictions per step
# Simplified: token offsets are estimated by re-tokenizing each segment,
# so they are approximate near step boundaries; in real usage, map token
# positions to step boundaries exactly (e.g. with offset mappings)
labels = ["error", "correct"]
sep_len = len(tokenizer("\n\n")["input_ids"])
offset = len(tokenizer(problem + "\n\n")["input_ids"])
for i, step in enumerate(steps):
    step_len = len(tokenizer(step)["input_ids"])
    step_preds = predictions[0, offset:offset + step_len]
    if len(step_preds) > 0:
        pred_class = step_preds.mode().values.item()  # majority vote over the step's tokens
        step_label = labels[pred_class]
        # Mean probability of the predicted class over the step's tokens
        confidence = probabilities[0, offset:offset + step_len, pred_class].mean().item()
    else:
        step_label, confidence = "unknown", 0.0
    print(f"\nStep {i+1}: {step}")
    print(f"  Prediction: {step_label}")
    print(f"  Confidence: {confidence:.2%}")
    offset += step_len + sep_len  # advance past this step and its separator

# Illustrative output (confidence values are examples only):
# Step 1: Subtract 5 from both sides: 2x = 8
#   Prediction: correct
#   Confidence: 95.00%
#
# Step 2: Divide by 2: x = 5
#   Prediction: error
#   Confidence: 87.00%

Output Interpretation:

  • Logits: Raw scores from the model (before softmax). Higher values indicate stronger confidence.
  • Probabilities: Softmax-normalized scores between 0 and 1. Sum to 1 for each token.
  • Predictions: Class indices (0 = "error", 1 = "correct") for each token.

Using with Pipeline

import torch
from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="ZaandaTeika/Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM",
    device=0 if torch.cuda.is_available() else -1,
)

# Classify reasoning steps
result = classifier(problem + "\n\n" + "\n\n".join(steps))
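
The pipeline emits one record per token. A quick way to inspect them (the keys below are the standard transformers token-classification fields; without an id2label mapping, the entities print as LABEL_0/LABEL_1):

for token in result:
    print(token["word"], token["entity"], f"{token['score']:.2f}")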

Integration with Mathematical Reasoning

This PRM model can be used to:

  1. Filter incorrect reasoning paths in tree-of-thought or chain-of-thought generation (a best-of-N sketch follows this list)
  2. Provide feedback during step-by-step problem solving
  3. Evaluate solution quality before final answer generation
  4. Improve training by identifying problematic reasoning patterns
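
As a minimal sketch of use case 1, the snippet below ranks candidate reasoning chains by their weakest token and keeps the best one. It reuses the model, tokenizer, and imports from the Quick Start above; score_solution is a hypothetical helper, and scoring by the minimum per-token "correct" probability is one plausible aggregation, not the only one:

def score_solution(problem: str, steps: list[str]) -> float:
    # Score a candidate chain by its weakest token
    text = problem + "\n\n" + "\n\n".join(steps)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_correct = F.softmax(logits, dim=-1)[0, :, 1]  # per-token P("correct")
    return p_correct.min().item()

# Best-of-N filtering: keep the highest-scoring candidate chain
candidates = [
    ["Subtract 5 from both sides: 2x = 8", "Divide by 2: x = 5"],
    ["Subtract 5 from both sides: 2x = 8", "Divide by 2: x = 4"],
]
best = max(candidates, key=lambda s: score_solution("Solve: 2x + 5 = 13", s))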

Training Procedure

Training Configuration

  • Learning Rate: 2e-5
  • Batch Size: per-device batching with gradient accumulation (exact sizes not specified)
  • Epochs: Multiple epochs with early stopping
  • Optimizer: AdamW with cosine learning rate schedule
  • Warmup Ratio: 3%
  • Gradient Clipping: 5.0
  • Precision: bfloat16
  • Gradient Checkpointing: Enabled for memory efficiency (see the training sketch below)
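
To reproduce a comparable setup, here is a minimal training sketch using TRL's PRMTrainer. The hyperparameters mirror the list above; the output directory, batch size, accumulation steps, and dataset path are illustrative placeholders, not the exact values used for this model:

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

base = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(base)

args = PRMConfig(
    output_dir="sharp-math-prm",          # illustrative
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_grad_norm=5.0,                    # gradient clipping
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,        # illustrative
    gradient_accumulation_steps=8,        # illustrative
)
trainer = PRMTrainer(
    model=model,
    args=args,
    processing_class=tokenizer,
    train_dataset=load_dataset("path/to/SHARP-Math", split="train"),  # placeholder
)
trainer.train()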

Training Framework Versions

  • TRL: 0.24.0
  • Transformers: 4.56.2
  • PyTorch: 2.9.1
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Training Data

The model was trained on the SHARP-Math dataset, which contains:

  • Mathematical problems with step-by-step solutions
  • Labeled reasoning steps (correct/error)
  • Diverse mathematical domains and difficulty levels (see the example record below)
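
For reference, TRL's PRMTrainer consumes stepwise-supervision records, so a SHARP-Math example would look roughly like this (the exact field names in the released dataset are an assumption):

example = {
    "prompt": "Solve: 2x + 5 = 13",
    "completions": [
        "Subtract 5 from both sides: 2x = 8",
        "Divide by 2: x = 5",
    ],
    "labels": [True, False],  # per-step correctness
}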

Use Cases

1. Mathematical Reasoning Evaluation

  • Evaluate intermediate steps in mathematical problem-solving
  • Identify errors in multi-step calculations
  • Provide feedback on reasoning quality

2. Educational Applications

  • Automated grading of mathematical solutions
  • Step-by-step feedback for students
  • Identification of common error patterns

3. Research Applications

  • Training better mathematical reasoning models
  • Analyzing reasoning patterns
  • Improving chain-of-thought generation

Limitations and Considerations

  1. Domain Specificity: This model is specifically trained for mathematical reasoning and may not generalize well to other domains
  2. Step Length: The model is optimized for step-level evaluation with a 256-token context per step
  3. Language: The model is primarily trained on English mathematical content
  4. False Positives/Negatives: Like all classification models, it may misclassify some steps

Citation

If you use this model in your research, please cite:

@misc{qwen2.5-math-7b-instruct-sharp-math-prm,
  title={Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM: A Process Reward Model for Mathematical Reasoning},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/path/to/Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM}}
}

Model Card Version: 1.0
Last Updated: 2025-12-30
