Model Card for Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM

Introduction

Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM is a Process Reward Model (PRM) fine-tuned from Qwen2.5-Math-7B-Instruct. This model is specifically designed to evaluate the correctness of intermediate reasoning steps in mathematical problem-solving processes, enabling more reliable and interpretable mathematical reasoning.

The model has been trained on the SHARP-Math dataset using the Process Reward Model methodology, which provides step-by-step feedback on mathematical reasoning chains.

This model is part of the SHARP-PRM series.

Model Information

Base Model

  • Fine-tuned from: Qwen/Qwen2.5-Math-7B-Instruct

Training Details

  • Training Dataset: SHARP-Math (Process Reward Model dataset)
  • Training Method: Process Reward Model (PRM) as introduced in Uesato et al., 2022
  • Training Framework: TRL (Transformer Reinforcement Learning) v0.24.0
  • Task Type: Token Classification (binary classification: error/correct for each reasoning step)

PRM Evaluation

This model is designed to evaluate mathematical reasoning processes by:

  1. Step-level Evaluation: Classifying each step in a reasoning chain as either "correct" or "error"
  2. Process Feedback: Providing feedback on the reasoning process, not just the final answer
  3. Error Detection: Identifying where mistakes occur in multi-step mathematical solutions

Evaluation Metrics

The model is evaluated on the ProcessBench benchmark.

Key metrics include:

  • Error Accuracy: accuracy on samples that contain an error (the model must locate the earliest incorrect step)
  • Correct Accuracy: accuracy on fully correct samples (the model must flag no step as incorrect)
  • F1 Score: the harmonic mean of error and correct accuracy (see the sketch below)
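
For intuition, here is a minimal sketch of how these three metrics relate. It follows ProcessBench's convention of labeling the earliest erroneous step per sample; the helper name and its inputs are hypothetical, and this is not the official evaluation harness:

def processbench_style_f1(pred_errors, gold_errors):
    # pred_errors / gold_errors hold the predicted and labeled index of the
    # earliest incorrect step for each sample, with -1 meaning "no error"
    erroneous = [(p, g) for p, g in zip(pred_errors, gold_errors) if g != -1]
    correct = [(p, g) for p, g in zip(pred_errors, gold_errors) if g == -1]
    error_acc = sum(p == g for p, g in erroneous) / len(erroneous)
    correct_acc = sum(p == g for p, g in correct) / len(correct)
    f1 = 2 * error_acc * correct_acc / (error_acc + correct_acc)
    return error_acc, correct_acc, f1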

Quick Start

Installation

pip install transformers torch

Basic Usage

Using the Model for Step Classification

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "ZaandaTeika/Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example: Evaluate a mathematical reasoning chain
# Problem with steps (one correct, one incorrect)
problem = "Solve: 2x + 5 = 13"
steps = [
    "Subtract 5 from both sides: 2x = 8",  # Correct step
    "Divide by 2: x = 5"  # Incorrect step (should be x = 4)
]

# Format input with step separator
input_text = problem + "\n\n" + "\n\n".join(steps)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=8192)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [batch_size, sequence_length, num_labels]
    probabilities = F.softmax(logits, dim=-1)  # Convert to probabilities
    predictions = torch.argmax(logits, dim=-1)  # Get predicted class indices

# Aggregate predictions per step
# Simplified: token offsets are estimated by re-tokenizing each segment,
# so they are approximate near step boundaries; in real usage, map token
# positions to step boundaries exactly (e.g. with offset mappings)
labels = ["error", "correct"]
sep_len = len(tokenizer("\n\n")["input_ids"])
offset = len(tokenizer(problem + "\n\n")["input_ids"])
for i, step in enumerate(steps):
    step_len = len(tokenizer(step)["input_ids"])
    step_preds = predictions[0, offset:offset + step_len]
    if len(step_preds) > 0:
        pred_class = step_preds.mode().values.item()  # majority vote over the step's tokens
        step_label = labels[pred_class]
        # Mean probability of the predicted class over the step's tokens
        confidence = probabilities[0, offset:offset + step_len, pred_class].mean().item()
    else:
        step_label, confidence = "unknown", 0.0
    print(f"\nStep {i+1}: {step}")
    print(f"  Prediction: {step_label}")
    print(f"  Confidence: {confidence:.2%}")
    offset += step_len + sep_len  # advance past this step and its separator

# Illustrative output (confidence values are examples only):
# Step 1: Subtract 5 from both sides: 2x = 8
#   Prediction: correct
#   Confidence: 95.00%
#
# Step 2: Divide by 2: x = 5
#   Prediction: error
#   Confidence: 87.00%

Output Interpretation:

  • Logits: Raw scores from the model (before softmax). Higher values indicate stronger confidence.
  • Probabilities: Softmax-normalized scores between 0 and 1. Sum to 1 for each token.
  • Predictions: Class indices (0 = "error", 1 = "correct") for each token.

Using with Pipeline

import torch
from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="ZaandaTeika/Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM",
    device=0 if torch.cuda.is_available() else -1,
)

# Classify reasoning steps
result = classifier(problem + "\n\n" + "\n\n".join(steps))
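
The pipeline emits one record per token. A quick way to inspect them (the keys below are the standard transformers token-classification fields; without an id2label mapping, the entities print as LABEL_0/LABEL_1):

for token in result:
    print(token["word"], token["entity"], f"{token['score']:.2f}")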

Integration with Mathematical Reasoning

This PRM model can be used to:

  1. Filter incorrect reasoning paths in tree-of-thought or chain-of-thought generation (a best-of-N sketch follows this list)
  2. Provide feedback during step-by-step problem solving
  3. Evaluate solution quality before final answer generation
  4. Improve training by identifying problematic reasoning patterns
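
As a minimal sketch of use case 1, the snippet below ranks candidate reasoning chains by their weakest token and keeps the best one. It reuses the model, tokenizer, and imports from the Quick Start above; score_solution is a hypothetical helper, and scoring by the minimum per-token "correct" probability is one plausible aggregation, not the only one:

def score_solution(problem: str, steps: list[str]) -> float:
    # Score a candidate chain by its weakest token
    text = problem + "\n\n" + "\n\n".join(steps)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        logits = model(**inputs).logits
    p_correct = F.softmax(logits, dim=-1)[0, :, 1]  # per-token P("correct")
    return p_correct.min().item()

# Best-of-N filtering: keep the highest-scoring candidate chain
candidates = [
    ["Subtract 5 from both sides: 2x = 8", "Divide by 2: x = 5"],
    ["Subtract 5 from both sides: 2x = 8", "Divide by 2: x = 4"],
]
best = max(candidates, key=lambda s: score_solution("Solve: 2x + 5 = 13", s))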

Training Procedure

Training Configuration

  • Learning Rate: 2e-5
  • Batch Size: per-device batching with gradient accumulation (exact sizes not specified)
  • Epochs: Multiple epochs with early stopping
  • Optimizer: AdamW with cosine learning rate schedule
  • Warmup Ratio: 3%
  • Gradient Clipping: 5.0
  • Precision: bfloat16
  • Gradient Checkpointing: Enabled for memory efficiency (see the training sketch below)
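
To reproduce a comparable setup, here is a minimal training sketch using TRL's PRMTrainer. The hyperparameters mirror the list above; the output directory, batch size, accumulation steps, and dataset path are illustrative placeholders, not the exact values used for this model:

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

base = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForTokenClassification.from_pretrained(base, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(base)

args = PRMConfig(
    output_dir="sharp-math-prm",          # illustrative
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    max_grad_norm=5.0,                    # gradient clipping
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,        # illustrative
    gradient_accumulation_steps=8,        # illustrative
)
trainer = PRMTrainer(
    model=model,
    args=args,
    processing_class=tokenizer,
    train_dataset=load_dataset("path/to/SHARP-Math", split="train"),  # placeholder
)
trainer.train()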

Training Framework Versions

  • TRL: 0.24.0
  • Transformers: 4.56.2
  • PyTorch: 2.9.1
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Training Data

The model was trained on the SHARP-Math dataset, which contains:

  • Mathematical problems with step-by-step solutions
  • Labeled reasoning steps (correct/error)
  • Diverse mathematical domains and difficulty levels (see the example record below)
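
For reference, TRL's PRMTrainer consumes stepwise-supervision records, so a SHARP-Math example would look roughly like this (the exact field names in the released dataset are an assumption):

example = {
    "prompt": "Solve: 2x + 5 = 13",
    "completions": [
        "Subtract 5 from both sides: 2x = 8",
        "Divide by 2: x = 5",
    ],
    "labels": [True, False],  # per-step correctness
}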

Use Cases

1. Mathematical Reasoning Evaluation

  • Evaluate intermediate steps in mathematical problem-solving
  • Identify errors in multi-step calculations
  • Provide feedback on reasoning quality

2. Educational Applications

  • Automated grading of mathematical solutions
  • Step-by-step feedback for students
  • Identification of common error patterns

3. Research Applications

  • Training better mathematical reasoning models
  • Analyzing reasoning patterns
  • Improving chain-of-thought generation

Limitations and Considerations

  1. Domain Specificity: This model is specifically trained for mathematical reasoning and may not generalize well to other domains
  2. Step Length: The model is optimized for step-level evaluation with a 256-token context per step
  3. Language: The model is primarily trained on English mathematical content
  4. False Positives/Negatives: Like all classification models, it may misclassify some steps

Citation

If you use this model in your research, please cite:

@misc{qwen2.5-math-7b-instruct-sharp-math-prm,
  title={Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM: A Process Reward Model for Mathematical Reasoning},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/path/to/Qwen2.5-Math-7B-Instruct-SHARP-Math-PRM}}
}

Model Card Version: 1.0
Last Updated: 2025-12-30
