Dependency Conflict Prediction Models

Model Description

This repository contains machine learning models for Python dependency conflict detection and package name validation. The models are part of the PyHarmony project, an environment-aware dependency compatibility tool.

Models Included

  1. Conflict Prediction Model (conflict_predictor.pkl)

    • Random Forest Classifier for predicting dependency conflicts
    • Trained on synthetic dependency datasets
    • Provides early warning of potential conflicts before detailed analysis
  2. Package Embeddings (package_embeddings.json)

    • Pre-computed semantic embeddings for 77 common Python packages
    • Uses sentence-transformers (all-MiniLM-L6-v2)
    • Enables intelligent spell-checking and package name suggestions
  3. Embedding Metadata (embedding_info.json)

    • Model configuration and package information

Intended Use

Primary Use Cases

  • Dependency Conflict Prediction: Predict whether a set of Python dependencies will have conflicts
  • Package Name Validation: Correct spelling mistakes in package names using semantic similarity
  • Requirements.txt Analysis: Analyze and validate Python requirements files (a combined sketch follows this list)
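
Taken together, these use cases form a single pipeline: validate package names first, then check for conflicts. A minimal sketch of that workflow, assuming the `ConflictPredictor` and `PackageEmbeddings` helpers documented under "How to Use" below (the name-parsing logic here is deliberately simplified):

```python
from ml_models import ConflictPredictor, PackageEmbeddings

REPO_ID = "ysakhale/dependency-conflict-models"

def analyze_requirements(path: str) -> None:
    """Validate package names, then predict conflicts for a requirements file."""
    predictor = ConflictPredictor(repo_id=REPO_ID)
    embeddings = PackageEmbeddings(repo_id=REPO_ID)

    with open(path) as f:
        text = f.read()

    # Package name validation: suggest a correction for each requirement line.
    # (Naive parsing; real requirement lines can carry extras, markers, etc.)
    for line in text.splitlines():
        name = line.split("==")[0].split(">=")[0].split("<=")[0].strip()
        if not name or name.startswith("#"):
            continue
        best = embeddings.get_best_match(name)
        if best != name:
            print(f"Did you mean '{best}' instead of '{name}'?")

    # Conflict prediction on the file as a whole
    has_conflict, confidence = predictor.predict(text)
    if has_conflict:
        print(f"Potential conflict detected ({confidence:.1%} confidence)")

analyze_requirements("requirements.txt")
```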

Out-of-Scope Use Cases

  • Security vulnerability detection
  • Multi-language package management (Node.js, Java, etc.)
  • Automatic dependency updates/fixes

Training Details

Training Data

  • Dataset: Synthetic Requirements Dataset
  • Size: 120 samples (60 valid, 60 invalid)
  • Generation Method: Programmatically generated using rule-based conflict injection (a minimal sketch follows this list)
  • Conflict Patterns:
    • PyTorch/PyTorch Lightning version mismatches
    • FastAPI/Pydantic incompatibilities
    • TensorFlow/Keras conflicts
    • Duplicate package specifications
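
A hedged sketch of what such rule-based conflict injection might look like; the pattern table and helper below are illustrative, not the project's actual `train_conflict_model.py`:

```python
import random

# Illustrative known-incompatible pairs, following the patterns listed above
CONFLICT_PATTERNS = [
    ("torch==1.8.0", "pytorch-lightning==2.2.0"),  # Lightning 2.x needs newer torch
    ("fastapi==0.60.0", "pydantic==2.0.0"),        # old FastAPI predates pydantic v2
    ("tensorflow==2.3.0", "keras==3.0.0"),         # Keras 3 requires recent TensorFlow
]

VALID_BASE = ["numpy==1.24.0", "requests==2.31.0", "flask==2.3.0", "pandas==2.0.0"]

def make_sample(inject_conflict: bool) -> tuple[str, int]:
    """Return (requirements_text, label); label 1 = has conflict."""
    lines = random.sample(VALID_BASE, k=random.randint(1, 3))
    if inject_conflict:
        if random.random() < 0.75:
            lines += random.choice(CONFLICT_PATTERNS)    # version-mismatch pattern
        else:
            dup = random.choice(VALID_BASE).split("==")[0]
            lines += [f"{dup}==1.0.0", f"{dup}==2.0.0"]  # duplicate specification
    return "\n".join(lines), int(inject_conflict)

# 120 samples: 60 valid, 60 invalid, matching the dataset description
dataset = [make_sample(i >= 60) for i in range(120)]
```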

Training Procedure

Conflict Prediction Model:

  • Algorithm: Random Forest Classifier (scikit-learn)
  • Features:
    • Package presence (binary features for 30 common packages)
    • Number of packages (normalized)
    • Version specificity (pinned vs unpinned)
    • Duplicate detection
    • Known conflict pattern indicators
  • Hyperparameters:
    • n_estimators: 100
    • max_depth: 10
    • min_samples_split: 5
  • Test Accuracy: 85-95% (depending on dataset split)
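
Under those settings, training reduces to standard scikit-learn usage. A sketch continuing from the generation example above; the `featurize` helper is a hypothetical stand-in for the feature extraction described in the list:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

COMMON_PACKAGES = ["numpy", "pandas", "torch", "tensorflow", "fastapi"]  # 30 in practice

def featurize(requirements_text: str) -> list[float]:
    """Hypothetical features: package presence, count, pinning, duplicates."""
    lines = [l.strip() for l in requirements_text.splitlines() if l.strip()]
    names = [l.split("==")[0].split(">=")[0].strip().lower() for l in lines]
    presence = [float(pkg in names) for pkg in COMMON_PACKAGES]
    n_packages = len(lines) / 20.0                               # normalized count
    pinned = sum("==" in l for l in lines) / max(len(lines), 1)  # version specificity
    has_duplicates = float(len(names) != len(set(names)))
    return presence + [n_packages, pinned, has_duplicates]

# `dataset` holds (requirements_text, label) pairs from the sketch above
X = np.array([featurize(text) for text, _ in dataset])
y = np.array([label for _, label in dataset])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_split=5)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2%}")
```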

Package Embeddings:

  • Base Model: sentence-transformers/all-MiniLM-L6-v2
  • Embedding Dimension: 384
  • Number of Packages: 77
  • Method: Pre-computed embeddings for common Python packages
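
Generating and querying such embeddings with sentence-transformers is straightforward. A minimal sketch; the package list and output schema here are assumptions, not necessarily what `generate_embeddings.py` produces:

```python
import json
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

packages = ["numpy", "pandas", "scipy", "torch", "tensorflow", "fastapi"]  # 77 in the released file
vectors = model.encode(packages)  # ndarray of shape (len(packages), 384)

# Persist one 384-dim vector per package name
with open("package_embeddings.json", "w") as f:
    json.dump({pkg: vec.tolist() for pkg, vec in zip(packages, vectors)}, f)

# Spell-check by cosine similarity against the pre-computed vectors
query = model.encode("numpyy")
scores = util.cos_sim(query, vectors)[0]
print(packages[int(scores.argmax())])  # -> 'numpy'
```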

Training Scripts

Models can be retrained using:

  • train_conflict_model.py - Trains the conflict prediction model
  • generate_embeddings.py - Generates package embeddings

Evaluation

Metrics

  • Accuracy: 85-95% on test set
  • Precision: High (exact values depend on dataset)
  • Recall: High (exact values depend on dataset)
  • F1 Score: High (exact values depend on dataset)

Evaluation Results

The models were evaluated on:

  • Synthetic test set (20% of training data)
  • 20 real-world requirements.txt files
  • Achieved 95%+ accuracy in package identification and correction
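
The headline metrics can be reproduced on the held-out split with scikit-learn; a sketch continuing from the training example above:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%}")
```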

Limitations and Bias

Known Limitations

  1. Synthetic Training Data: Model trained on synthetic data may not capture all real-world edge cases
  2. Limited Package Coverage: Embeddings cover 77 common packages; may not handle rare/private packages well
  3. Version Constraint Parsing: Complex version constraints may not be fully captured
  4. Conflict Patterns: Focuses on known compatibility patterns; may miss novel conflicts
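
To make limitation 3 concrete: real-world constraints go well beyond simple pins, and a binary pinned/unpinned feature cannot distinguish the cases below (parsed here with the standard `packaging` library for illustration):

```python
from packaging.requirements import Requirement

# All valid, but with very different compatibility semantics
for spec in [
    "torch==1.8.0",                                 # exact pin
    "torch>=1.8,<2.0,!=1.9.0",                      # range with an exclusion
    "torch~=1.8.1",                                 # compatible-release operator
    'torch==1.8.0; platform_system != "Windows"',   # environment marker
]:
    req = Requirement(spec)
    print(req.name, req.specifier, req.marker)
```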

Bias Considerations

  • Training data focuses on common Python packages (data science, web frameworks, ML libraries)
  • May perform better on packages similar to those in training set
  • Synthetic data generation may introduce biases toward specific conflict patterns

How to Use

Loading the Models

```python
from ml_models import ConflictPredictor, PackageEmbeddings

# Load conflict prediction model
predictor = ConflictPredictor(repo_id="ysakhale/dependency-conflict-models")
has_conflict, confidence = predictor.predict(requirements_text)

# Load package embeddings
embeddings = PackageEmbeddings(repo_id="ysakhale/dependency-conflict-models")
best_match = embeddings.get_best_match("numpyy")  # Returns: 'numpy'
```

Example Usage

```python
# Predict conflicts
requirements = "torch==1.8.0\npytorch-lightning==2.2.0"
has_conflict, confidence = predictor.predict(requirements)
if has_conflict:
    print(f"Conflict detected with {confidence:.1%} confidence")

# Find similar packages
similar = embeddings.find_similar("pandaz", top_k=3)
# Returns: [('pandas', 0.95), ('numpy', 0.72), ...]
```

Model Files

  • conflict_predictor.pkl (~2-5 MB): Trained Random Forest model
  • package_embeddings.json (~5-10 MB): Pre-computed package embeddings
  • embedding_info.json (~1 KB): Embedding model metadata

Citation

If you use these models in your research, please cite:

```bibtex
@software{dependency_conflict_models,
  title={Dependency Conflict Prediction Models},
  author={Azam, Faiyaz and Sakhale, Yash and Lin, Yosen and Huang, Anyu},
  year={2025},
  url={https://huggingface.co/ysakhale/dependency-conflict-models}
}
```

License

MIT License - see LICENSE file for details

Contact

For questions or issues, please open an issue in the main repository or contact the maintainers.
