🎗️ Breast Cancer Detection Ensemble Pipeline

A scikit-learn pipeline built around a Soft-Voting Ensemble Classifier. The model is trained on clinical data to distinguish malignant from benign tumors, with an emphasis on high sensitivity (recall) to minimize false negatives in diagnostic screening.

This repository follows the methodology discussed in "Comparison of ML Algorithms for Breast Cancer Prediction" (CTEMS 2018) and extends that baseline framework to a 5-model ensemble in which every estimator is scaled automatically inside its own pipeline.

📊 Model Description

The model uses a Soft-Voting architecture: it averages the class probabilities predicted by five different base estimators and selects the class with the highest mean probability. Each base estimator is wrapped in its own preprocessing pipeline with a StandardScaler step, so the scaling statistics are learned from the training data only, which prevents leakage from the test data.
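To make the soft-voting rule concrete, the sketch below averages per-model probabilities with NumPy. The `soft_vote` helper and the probability values are hypothetical and chosen only for illustration; they are not outputs of the published model.

```python
import numpy as np


def soft_vote(probas, weights=None):
    """Average class probabilities across models and pick the top class.

    probas: array-like of shape (n_models, n_samples, n_classes).
    weights: optional per-model weights of length n_models.
    """
    probas = np.asarray(probas, dtype=float)
    avg = np.average(probas, axis=0, weights=weights)  # (n_samples, n_classes)
    return avg, avg.argmax(axis=1)


# Hypothetical probabilities from five models for a single sample,
# with columns ordered as [class 0, class 1].
example = [
    [[0.55, 0.45]],
    [[0.52, 0.48]],
    [[0.51, 0.49]],
    [[0.10, 0.90]],
    [[0.20, 0.80]],
]

avg_proba, labels = soft_vote(example)
print(avg_proba[0], labels[0])
```

In this made-up case three of the five models lean slightly toward class 0, so a majority (hard) vote would return class 0, while the averaged probabilities favor class 1 because the two dissenting models are far more confident.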

Component Estimators

Random Forest Classifier
- 72 estimators
- Balanced class weights
k-Nearest Neighbors (kNN)
- Euclidean distance metric
- k = 5
Gaussian Naive Bayes
- Probabilistic baseline classifier
Support Vector Classifier (SVC)
- rbf kernel
- Probability estimation enabled
Logistic Regression
- Regularized linear classifier
- Balanced class weights
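The component list above could be assembled roughly as follows. This is a sketch rather than the author's training code: the `SEED` value, `max_iter`, the `scaled` helper, and the reading of the Logistic Regression entry as `class_weight="balanced"` are all assumptions. Only the settings named in the list (72 trees, k = 5, Euclidean distance, RBF kernel, probability estimates, soft voting) come from this card.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42  # assumption: the card does not state the seed that was used


def scaled(estimator):
    """Wrap an estimator so scaling is fitted together with the model."""
    return make_pipeline(StandardScaler(), estimator)


ensemble = VotingClassifier(
    estimators=[
        ("rf", scaled(RandomForestClassifier(
            n_estimators=72, class_weight="balanced", random_state=SEED))),
        ("knn", scaled(KNeighborsClassifier(n_neighbors=5, metric="euclidean"))),
        ("nb", scaled(GaussianNB())),
        ("svc", scaled(SVC(kernel="rbf", probability=True, random_state=SEED))),
        # class_weight and max_iter are assumptions, not documented settings
        ("lr", scaled(LogisticRegression(
            class_weight="balanced", max_iter=1000, random_state=SEED))),
    ],
    voting="soft",  # average predict_proba outputs instead of counting votes
)
```

Because each estimator carries its own scaler, calling `ensemble.fit(X_train, y_train)` learns the scaling statistics from the training rows only.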

📈 Dataset & Training Architecture

Dataset Source: Wisconsin Diagnostic Breast Cancer (WDBC), UCI Machine Learning Repository
Instances: 569 samples
- 357 Benign
- 212 Malignant
Features: 30 real-valued features computed from digitized images of fine needle aspirate (FNA) samples of breast masses
Split Strategy: Stratified train-test split (approximately 70/30)
- Training: 398 samples
- Testing: 171 samples
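A split with these sizes can be sketched as follows, assuming the copy of WDBC bundled with scikit-learn and `test_size=0.3`. The `random_state` is an assumption, so the exact rows in each partition may differ from the author's.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# scikit-learn's bundled WDBC copy encodes 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,    # 30% of the 569 samples held out for testing
    stratify=y,       # keep the benign/malignant ratio in both partitions
    random_state=42,  # assumption: the original seed is not documented
)

print(len(X_train), len(X_test))
print(f"benign share: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```

Stratifying on `y` keeps the benign share close to the dataset-wide 357/569 in both partitions.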

The pipeline uses:

StratifiedKFold cross-validation
Leakage-free preprocessing
Automated scaling pipelines
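The three points above fit together as in the sketch below. Passing a scaler-plus-model pipeline to `cross_val_score` refits the scaler inside every fold, so the held-out fold never influences the scaling statistics. The fold count, the seed, and the single Logistic Regression stand-in are assumptions made to keep the example short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 0 = malignant, 1 = benign

# Stand-in model; any estimator or ensemble could take its place.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed settings: 5 folds, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Score recall on the malignant class (label 0) explicitly.
malignant_recall = make_scorer(recall_score, pos_label=0)

scores = cross_val_score(model, X, y, cv=cv, scoring=malignant_recall)
print(f"malignant recall: {scores.mean():.4f} +/- {scores.std():.4f}")
```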

⚡ Performance Metrics

Evaluation prioritizes Recall (Sensitivity) to reduce false negatives while maintaining strong overall classification accuracy.

Model	Accuracy	Precision	Recall	F1-Score	ROC-AUC
Ensemble (Soft Voting)	0.9766	0.9725	0.9907	0.9815	0.9972
Random Forest	0.9649	0.9633	0.9813	0.9722	0.9936
kNN	0.9591	0.9545	0.9813	0.9677	0.9877
Support Vector Machine	0.9766	0.9725	0.9907	0.9815	0.9974
Logistic Regression	0.9766	0.9725	0.9907	0.9815	0.9969
Naive Bayes	0.9591	0.9545	0.9813	0.9677	0.9892

Note: Results may vary slightly depending on package versions and random seeds.
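When reading the precision and recall columns, it helps to know which class was treated as positive. scikit-learn's `precision_score` and `recall_score` default to `pos_label=1`, and the inference example in this card maps label 1 to benign. If the table was produced with those defaults (an assumption, since the evaluation code is not shown here), the columns describe the benign class. The sketch below uses small hypothetical label arrays, not the model's actual predictions, to show how to report sensitivity for the malignant class instead.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration only: 0 = malignant, 1 = benign.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Default pos_label=1: recall of the benign class.
benign_recall = recall_score(y_true, y_pred)

# pos_label=0: recall (sensitivity) of the malignant class.
malignant_recall = recall_score(y_true, y_pred, pos_label=0)

# Rows are true classes and columns are predicted classes, in label order.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
missed_malignant = cm[0, 1]  # malignant cases predicted as benign

print(benign_recall, malignant_recall, missed_malignant)
```

The same `pos_label=0` argument works with `precision_score` and `f1_score`.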

💻 Installation

Dependencies

scikit-learn>=1.0
numpy
pandas
joblib
huggingface_hub

Install dependencies:

pip install scikit-learn numpy pandas joblib huggingface_hub

🚀 Dynamic Inference Example

You can download the trained pipeline from the Hugging Face Hub and run it directly.

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download model pipeline
model_path = hf_hub_download(
    repo_id="NethranjaliSE/Breast-Cancer-detection-using-ML-Algorithm",
    filename="ensemble_soft_voting.pkl"
)

# Load pipeline
pipeline = joblib.load(model_path)

# Example sample input (30 WDBC features)
sample_data = [[
    14.12, 19.28, 91.96, 654.8, 0.096, 0.11, 0.08, 0.04, 0.18, 0.06,
    0.25, 0.89, 1.82, 24.3, 0.006, 0.02, 0.02, 0.01, 0.01, 0.003,
    16.26, 25.67, 107.26, 880.5, 0.132, 0.21, 0.19, 0.09, 0.28, 0.08
]]

# Reuse the training column names when the fitted pipeline exposes them;
# otherwise pandas falls back to integer column labels.
feature_names = getattr(pipeline, "feature_names_in_", None)

input_df = pd.DataFrame(sample_data, columns=feature_names)

# Predict (label encoding assumed by this example: 0 = malignant, 1 = benign)
prediction = pipeline.predict(input_df)
# Probabilities follow the order of pipeline.classes_
probabilities = pipeline.predict_proba(input_df)[0]

diagnosis = (
    "Benign (Low Risk)"
    if prediction[0] == 1
    else "Malignant (High Risk)"
)

print(f"Diagnostic Assessment: {diagnosis}")

print(
    "Class probabilities -> "
    f"Malignant: {probabilities[0]:.4f} | "
    f"Benign: {probabilities[1]:.4f}"
)

📂 Repository Structure

.
├── ensemble_soft_voting.pkl
├── training_pipeline.ipynb
├── requirements.txt
└── README.md

⚠️ Limitations & Intended Use

This model is developed strictly for:

Academic research
Educational purposes
Machine learning experimentation
Pipeline prototyping

It is NOT approved for:

Clinical deployment
Medical diagnosis
Real-world healthcare decision-making

All diagnostic decisions must be made by qualified medical professionals using certified medical systems.

📜 Citations

Research Reference

@article{street1993nuclear,
  title={Nuclear feature extraction for breast tumor diagnosis},
  author={Street, W.N. and Wolberg, W.H. and Mangasarian, O.L.},
  journal={IS\&T/SPIE Biomedical Imaging},
  year={1993}
}

Dataset Reference

UCI Machine Learning Repository
Breast Cancer Wisconsin (Diagnostic) Dataset

🤝 Acknowledgements

Special thanks to:

UCI Machine Learning Repository
Scikit-learn contributors
Hugging Face Hub
Open-source ML research community

🧠 Model Author

Sachini Praboda Nethranjali
Electronic and Computer Science Undergraduate
University of Kelaniya, Sri Lanka
