Dependency Conflict Prediction Models
Model Description
This repository contains machine learning models for Python dependency conflict detection and package name validation. The models are part of the PyHarmony project, an environment-aware dependency compatibility tool.
Models Included
Conflict Prediction Model (conflict_predictor.pkl)
- Random Forest Classifier for predicting dependency conflicts
- Trained on synthetic dependency datasets
- Provides early warning of potential conflicts before detailed analysis
Package Embeddings (package_embeddings.json)
- Pre-computed semantic embeddings for 77 common Python packages
- Uses sentence-transformers (all-MiniLM-L6-v2)
- Enables intelligent spell-checking and package name suggestions
Embedding Metadata (embedding_info.json)
- Model configuration and package information
Intended Use
Primary Use Cases
- Dependency Conflict Prediction: Predict whether a set of Python dependencies will have conflicts
- Package Name Validation: Correct spelling mistakes in package names using semantic similarity
- Requirements.txt Analysis: Analyze and validate Python requirements files
Out-of-Scope Use Cases
- Security vulnerability detection
- Multi-language package management (Node.js, Java, etc.)
- Automatic dependency updates/fixes
Training Details
Training Data
- Dataset: Synthetic Requirements Dataset
- Size: 120 samples (60 valid, 60 invalid)
- Generation Method: Programmatically generated using rule-based conflict injection
- Conflict Patterns:
  - PyTorch/PyTorch Lightning version mismatches
  - FastAPI/Pydantic incompatibilities
  - TensorFlow/Keras conflicts
  - Duplicate package specifications
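The rule-based conflict injection described above might look roughly like the following sketch. It is not the project's actual generator: the package pool, the placeholder versions, and the names `make_valid_sample`, `inject_duplicate`, `inject_version_mismatch`, and `generate_dataset` are all assumptions made for illustration. Only the torch/pytorch-lightning pins are borrowed from the usage example in this card.

```python
import random

# Hypothetical pool of unpinned packages, assumed conflict-free for this sketch.
BASE_POOL = ["numpy", "pandas", "requests", "flask"]


def make_valid_sample(rng, size=3):
    """Pick `size` distinct packages; treated as a valid (label 0) sample."""
    return rng.sample(BASE_POOL, size)


def inject_duplicate(lines, rng):
    """Rule: specify one package twice with different (placeholder) pins."""
    name = rng.choice(lines)
    kept = [line for line in lines if line != name]
    return kept + [f"{name}==1.0.0", f"{name}==2.0.0"]


def inject_version_mismatch(lines, rng):
    """Rule: append the torch / pytorch-lightning pair from the usage example."""
    return lines + ["torch==1.8.0", "pytorch-lightning==2.2.0"]


RULES = [inject_duplicate, inject_version_mismatch]


def generate_dataset(n_valid, n_invalid, seed=0):
    """Return a shuffled list of (requirements_text, label) pairs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_valid):
        samples.append(("\n".join(make_valid_sample(rng)), 0))
    for _ in range(n_invalid):
        rule = rng.choice(RULES)
        samples.append(("\n".join(rule(make_valid_sample(rng), rng)), 1))
    rng.shuffle(samples)
    return samples


dataset = generate_dataset(60, 60)
print(len(dataset), sum(label for _, label in dataset))  # 120 60
```

Keeping each rule as a small function makes it straightforward to add further patterns, such as the FastAPI/Pydantic or TensorFlow/Keras cases, without touching the generator loop.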
Training Procedure
Conflict Prediction Model:
- Algorithm: Random Forest Classifier (scikit-learn)
- Features:
  - Package presence (binary features for 30 common packages)
  - Number of packages (normalized)
  - Version specificity (pinned vs unpinned)
  - Duplicate detection
  - Known conflict pattern indicators
- Hyperparameters:
  - n_estimators: 100
  - max_depth: 10
  - min_samples_split: 5
- Test Accuracy: 85-95% (depending on dataset split)
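As a rough illustration of how the feature groups listed above could be turned into a numeric vector, consider the sketch below. The vocabulary, the normalization constant `MAX_PACKAGES`, the pair-based conflict indicators, and the function names `parse` and `extract_features` are assumptions; the card does not list the 30 packages or describe the exact encoding, and a simple co-occurrence flag is weaker than a real version-mismatch check.

```python
import re

# Hypothetical vocabulary; the card mentions 30 common packages but does not list them.
VOCAB = ["numpy", "pandas", "torch", "pytorch-lightning", "fastapi", "pydantic"]
# Hypothetical co-occurrence pairs standing in for "known conflict pattern indicators".
CONFLICT_PAIRS = [
    ("torch", "pytorch-lightning"),
    ("fastapi", "pydantic"),
    ("tensorflow", "keras"),
]
MAX_PACKAGES = 50  # assumed normalization constant


def parse(requirements_text):
    """Return (name, is_pinned) for each non-empty, non-comment line."""
    parsed = []
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # The package name ends at the first specifier, extras, marker, or space.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0].lower()
        parsed.append((name, "==" in line))
    return parsed


def extract_features(requirements_text):
    """Build a flat feature vector from a requirements file's text."""
    parsed = parse(requirements_text)
    names = [name for name, _ in parsed]
    present = set(names)
    # Package presence: one binary feature per vocabulary entry.
    features = [1.0 if pkg in present else 0.0 for pkg in VOCAB]
    # Number of packages, clipped and scaled to [0, 1].
    features.append(min(len(names), MAX_PACKAGES) / MAX_PACKAGES)
    # Version specificity: fraction of lines pinned with "==".
    features.append(sum(pinned for _, pinned in parsed) / len(parsed) if parsed else 0.0)
    # Duplicate detection: any package name listed more than once.
    features.append(1.0 if len(names) != len(present) else 0.0)
    # Conflict pattern indicators: both members of a pair are present.
    features.extend(1.0 if a in present and b in present else 0.0 for a, b in CONFLICT_PAIRS)
    return features


vector = extract_features("torch==1.8.0\npytorch-lightning==2.2.0\nnumpy")
print(vector)
```

A vector of this shape could then be passed to a scikit-learn `RandomForestClassifier` configured with the hyperparameters listed above.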
Package Embeddings:
- Base Model: sentence-transformers/all-MiniLM-L6-v2
- Embedding Dimension: 384
- Number of Packages: 77
- Method: Pre-computed embeddings for common Python packages
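Once embeddings are pre-computed, suggesting a package name reduces to a nearest-neighbor lookup. The sketch below shows one common way to do this with cosine similarity. It is an assumption that the project ranks candidates this way, and the 3-dimensional vectors are invented toy values (the real embeddings are 384-dimensional). The names `cosine` and `rank_by_cosine` are hypothetical and are not part of the `ml_models` API.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "numpy": [0.9, 0.1, 0.0],
    "pandas": [0.7, 0.6, 0.1],
    "requests": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vector, embeddings, top_k=3):
    """Return the top_k (name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_vector, vec)) for name, vec in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Pretend this vector came from encoding the misspelled name "numpyy".
query = [0.85, 0.15, 0.05]
ranking = rank_by_cosine(query, TOY_EMBEDDINGS, top_k=2)
print(ranking[0][0])  # numpy
```

With only 77 packages, a linear scan like this is cheap, so an approximate nearest-neighbor index is unlikely to be needed.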
Training Scripts
Models can be retrained using:
- train_conflict_model.py: Trains the conflict prediction model
- generate_embeddings.py: Generates package embeddings
Evaluation
Metrics
- Accuracy: 85-95% on test set
- Precision: High (exact values depend on dataset)
- Recall: High (exact values depend on dataset)
- F1 Score: High (exact values depend on dataset)
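For readers who want to compute these metrics on their own split, the standard definitions can be sketched as follows. The labels below are made up purely to exercise the formulas and are not results from this model; `classification_metrics` is a hypothetical helper name, and label 1 is assumed to mean "conflict".

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = conflict)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```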
Evaluation Results
The models were evaluated on:
- Synthetic test set (20% of training data)
- 20 real-world requirements.txt files
Package identification and correction achieved 95%+ accuracy.
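To make the size of the 20% hold-out concrete, the sketch below splits 120 balanced labels into train and test indices. Stratifying by class and the seed value are assumptions; the card only states the 20% proportion. `stratified_split` is a hypothetical helper, not the project's training code.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split indices into train/test while keeping class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(test)


# 60 valid (0) and 60 invalid (1) samples, as in the training data description.
labels = [0] * 60 + [1] * 60
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 96 24
```

With 24 test samples, a single misclassification moves accuracy by roughly 4.2 percentage points, which may help explain why the reported figure varies with the split.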
Limitations and Bias
Known Limitations
- Synthetic Training Data: Model trained on synthetic data may not capture all real-world edge cases
- Limited Package Coverage: Embeddings cover 77 common packages; may not handle rare/private packages well
- Version Constraint Parsing: Complex version constraints may not be fully captured
- Conflict Patterns: Focuses on known compatibility patterns; may miss novel conflicts
Bias Considerations
- Training data focuses on common Python packages (data science, web frameworks, ML libraries)
- May perform better on packages similar to those in training set
- Synthetic data generation may introduce biases toward specific conflict patterns
How to Use
Loading the Models
```python
from ml_models import ConflictPredictor, PackageEmbeddings

# Contents of a requirements.txt file, as a single string
requirements_text = "torch==1.8.0\npytorch-lightning==2.2.0"

# Load conflict prediction model
predictor = ConflictPredictor(repo_id="ysakhale/dependency-conflict-models")
has_conflict, confidence = predictor.predict(requirements_text)

# Load package embeddings
embeddings = PackageEmbeddings(repo_id="ysakhale/dependency-conflict-models")
best_match = embeddings.get_best_match("numpyy")  # Returns: 'numpy'
```
Example Usage
```python
# Predict conflicts
requirements = "torch==1.8.0\npytorch-lightning==2.2.0"
has_conflict, confidence = predictor.predict(requirements)
if has_conflict:
    print(f"Conflict detected with {confidence:.1%} confidence")

# Find similar packages
similar = embeddings.find_similar("pandaz", top_k=3)
# Returns: [('pandas', 0.95), ('numpy', 0.72), ...]
```
Model Files
- conflict_predictor.pkl (~2-5 MB): Trained Random Forest model
- package_embeddings.json (~5-10 MB): Pre-computed package embeddings
- embedding_info.json (~1 KB): Embedding model metadata
Citation
If you use these models in your research, please cite:
```bibtex
@software{dependency_conflict_models,
  title={Dependency Conflict Prediction Models},
  author={Azam, Faiyaz and Sakhale, Yash and Lin, Yosen and Huang, Anyu},
  year={2025},
  url={https://huggingface.co/ysakhale/dependency-conflict-models}
}
```
License
MIT License - see LICENSE file for details
Contact
For questions or issues, please open an issue in the main repository or contact the maintainers.
Acknowledgments
- Built as part of the PyHarmony project
- Uses sentence-transformers for embeddings
- Trained with scikit-learn