---
base_model: Qwen/Qwen2.5-Math-1.5B-Instruct
library_name: transformers
model_name: Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM
tags:
- generated_from_trainer
- prm
- trl
- math
- process-reward-model
- qwen2.5
- sharp
---

# Model Card for Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM

## Introduction

**Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM** is a Process Reward Model (PRM) fine-tuned from [Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct). It is designed to evaluate the correctness of intermediate reasoning steps in mathematical problem solving, enabling more reliable and interpretable mathematical reasoning.

The model was trained on the **SHARP-Math** dataset using the Process Reward Model methodology, which provides step-by-step feedback on mathematical reasoning chains. It is part of the SHARP-PRM series of process reward models.

## Model Information

### Base Model

- **Base Model**: [Qwen/Qwen2.5-Math-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct)
- **Architecture**: Qwen2ForTokenClassification
- **Parameters**: 1.5B

### Training Details

- **Training Dataset**: SHARP-Math (Process Reward Model dataset)
- **Training Method**: Process Reward Model (PRM) as introduced in [Uesato et al., 2022](https://huggingface.co/papers/2211.14275)
- **Training Framework**: [TRL (Transformer Reinforcement Learning)](https://github.com/huggingface/trl) v0.24.0
- **Task Type**: Token classification (binary: "error" vs. "correct" for each reasoning step)

## PRM Evaluation

This model evaluates mathematical reasoning processes by:

1. **Step-level Evaluation**: Classifying each step in a reasoning chain as either "correct" or "error"
2. **Process Feedback**: Providing feedback on the reasoning process, not just the final answer
3. **Error Detection**: Identifying where mistakes occur in multi-step mathematical solutions (see the sketch below)
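As an illustration of the error-detection use, ProcessBench-style evaluation is usually framed as locating the first step the model judges incorrect. The sketch below assumes you have already extracted one P("correct") per step (for example with the Quick Start code later in this card); the `step_correct_probs` values and the 0.5 threshold are illustrative assumptions, not outputs of this model.

```python
# Minimal sketch: find the first erroneous step from per-step probabilities.
# `step_correct_probs` holds one hypothetical P("correct") per reasoning step;
# the 0.5 decision threshold is an assumption, not a property of the model.
step_correct_probs = [0.95, 0.12, 0.64]  # illustrative values


def first_error_index(probs, threshold=0.5):
    """Return the index of the first step judged incorrect, or -1 if none."""
    for i, p in enumerate(probs):
        if p < threshold:
            return i
    return -1


print(first_error_index(step_correct_probs))  # -> 1 (the second step looks wrong)
```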
### Evaluation Metrics

The model is evaluated on the [ProcessBench](https://huggingface.co/datasets/Qwen/ProcessBench) benchmark. Key metrics include:

- **Error Accuracy**: Ability to correctly identify incorrect steps
- **Correct Accuracy**: Ability to correctly identify correct steps
- **F1 Score**: Balanced measure of error and correct step classification

## Quick Start

### Installation

```bash
pip install transformers torch
```

### Basic Usage

#### Using the Model for Step Classification

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
import torch.nn.functional as F

model_name = "path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Example: evaluate a mathematical reasoning chain
# Problem with steps (one correct, one incorrect)
problem = "Solve: 2x + 5 = 13"
steps = [
    "Subtract 5 from both sides: 2x = 8",  # Correct step
    "Divide by 2: x = 5",  # Incorrect step (should be x = 4)
]

# Format input with a step separator
input_text = problem + "\n\n" + "\n\n".join(steps)
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=8192)

# Get model predictions
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # Shape: [batch_size, sequence_length, num_labels]
    probabilities = F.softmax(logits, dim=-1)  # Convert to probabilities
    predictions = torch.argmax(logits, dim=-1)  # Predicted class index per token

# Aggregate predictions per step (simplified).
# In real usage, map token positions to step boundaries explicitly,
# e.g. with return_offsets_mapping=True.
labels = ["error", "correct"]
separator_len = len(tokenizer("\n\n", add_special_tokens=False)["input_ids"])
step_start = len(tokenizer(problem + "\n\n", add_special_tokens=False)["input_ids"])
for i, step in enumerate(steps):
    step_len = len(tokenizer(step, add_special_tokens=False)["input_ids"])
    step_preds = predictions[0, step_start:step_start + step_len]
    step_probs = probabilities[0, step_start:step_start + step_len]
    if len(step_preds) > 0:
        label_id = step_preds.mode().values.item()  # majority label over step tokens
        step_label = labels[label_id]
        confidence = step_probs[:, label_id].mean().item()  # mean prob of predicted label
    else:
        step_label, confidence = "unknown", 0.0
    print(f"\nStep {i+1}: {step}")
    print(f"  Prediction: {step_label}")
    print(f"  Confidence: {confidence:.2%}")
    # Advance past this step and the separator before the next step
    step_start += step_len + separator_len

# Example output (values are illustrative):
# Step 1: Subtract 5 from both sides: 2x = 8
#   Prediction: correct
#   Confidence: 95.00%
#
# Step 2: Divide by 2: x = 5
#   Prediction: error
#   Confidence: 87.00%
```

**Output Interpretation:**

- **Logits**: Raw scores from the model (before softmax); higher values indicate stronger confidence.
- **Probabilities**: Softmax-normalized scores between 0 and 1 that sum to 1 for each token.
- **Predictions**: Class indices (0 = "error", 1 = "correct") for each token.

#### Using with Pipeline

```python
import torch
from transformers import pipeline

classifier = pipeline(
    "token-classification",
    model="path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    tokenizer="path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM",
    device=0 if torch.cuda.is_available() else -1,
)

# Classify reasoning steps (`problem` and `steps` as defined in the example above)
result = classifier(problem + "\n\n" + "\n\n".join(steps))
```

### Integration with Mathematical Reasoning

This PRM model can be used to:

1. **Filter incorrect reasoning paths** in tree-of-thought or chain-of-thought generation (see the sketch below)
2. **Provide feedback** during step-by-step problem solving
3. **Evaluate solution quality** before final answer generation
4. **Improve training** by identifying problematic reasoning patterns
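As a sketch of the first item above, candidate reasoning chains can be ranked by an aggregate PRM score and weak chains discarded before a final answer is selected. `score_steps` below is a hypothetical helper that wraps the Quick Start code and returns one P("correct") per step; aggregating by the weakest step and the 0.5 cut-off are common choices, not requirements of this model.

```python
# Sketch: best-of-N style filtering of candidate reasoning chains with the PRM.
# `score_steps(problem, steps)` is a hypothetical helper wrapping the Quick Start
# code above; it should return one P("correct") per step.
from typing import Callable, List, Tuple


def filter_candidates(
    problem: str,
    candidates: List[List[str]],
    score_steps: Callable[[str, List[str]], List[float]],
    min_step_prob: float = 0.5,  # assumption: discard chains with any weak step
) -> List[Tuple[List[str], float]]:
    """Keep chains whose weakest step still clears the threshold,
    ranked by that minimum step score (a common PRM aggregation)."""
    kept = []
    for steps in candidates:
        probs = score_steps(problem, steps)
        chain_score = min(probs) if probs else 0.0
        if chain_score >= min_step_prob:
            kept.append((steps, chain_score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Higher-scoring chains can then be passed to the final-answer stage or used for best-of-N selection.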
## Training Procedure

### Training Configuration

- **Learning Rate**: 2e-5
- **Batch Size**: Per-device batch size (with gradient accumulation)
- **Epochs**: Multiple epochs with early stopping
- **Optimizer**: AdamW with cosine learning rate schedule
- **Warmup Ratio**: 3%
- **Gradient Clipping**: 5.0
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled for memory efficiency

### Training Framework Versions

- **TRL**: 0.24.0
- **Transformers**: 4.56.2
- **PyTorch**: 2.9.1
- **Datasets**: 4.4.1
- **Tokenizers**: 0.22.1

### Training Data

The model was trained on the **SHARP-Math** dataset, which contains:

- Mathematical problems with step-by-step solutions
- Labeled reasoning steps (correct/error)
- Diverse mathematical domains and difficulty levels

## Use Cases

### 1. Mathematical Reasoning Evaluation

- Evaluate intermediate steps in mathematical problem solving
- Identify errors in multi-step calculations
- Provide feedback on reasoning quality

### 2. Educational Applications

- Automated grading of mathematical solutions
- Step-by-step feedback for students
- Identification of common error patterns

### 3. Research Applications

- Training better mathematical reasoning models
- Analyzing reasoning patterns
- Improving chain-of-thought generation

## Limitations and Considerations

1. **Domain Specificity**: The model is trained specifically for mathematical reasoning and may not generalize well to other domains
2. **Step Length**: The model is optimized for step-level evaluation with a 256-token context per step
3. **Language**: The model is primarily trained on English mathematical content
4. **False Positives/Negatives**: Like all classification models, it may misclassify some steps

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{qwen2.5-math-1.5b-instruct-sharp-math-prm,
  title={Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM: A Process Reward Model for Mathematical Reasoning},
  author={Your Name/Organization},
  year={2025},
  howpublished={\url{https://huggingface.co/path/to/Qwen2.5-Math-1.5B-Instruct-SHARP-Math-PRM}}
}
```

**Model Card Version**: 1.0
**Last Updated**: 2025-12-30