ESI Triage BERT v70b
Decision-support model for Emergency Severity Index (ESI 1-5) triage classification, trained on a blend of clinical ED notes spanning compact chief-complaint (CC) and full narrative dialects.
Not a substitute for clinician judgment. Intended as decision-support input, paired with a deterministic ESI handbook engine and per-dialect calibration.
Model overview
- Encoder: BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
- Architecture: Multi-head (esi + symptom + flag + vitals + safety + bucket heads)
- Primary head: 5-class ESI classifier
- Input: Up to 128 BERT tokens of triage note text + 12-dim engineered features
- Training: 3 epochs, bf16, batch 64, LR 3e-5, label smoothing 0.05
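The label-smoothing setting can be illustrated with a small sketch. It assumes the common formulation, also used by PyTorch's `CrossEntropyLoss(label_smoothing=...)`, in which the one-hot target is mixed with a uniform distribution over the 5 ESI classes. The training script is not reproduced in this card, so the function names below are hypothetical and the formulation is an assumption.

```python
import math

NUM_CLASSES = 5   # ESI 1-5
SMOOTHING = 0.05  # value listed under "Training"

def smoothed_target(true_class, num_classes=NUM_CLASSES, eps=SMOOTHING):
    """Mix a one-hot target with a uniform distribution (assumed formulation)."""
    uniform = eps / num_classes
    return [(1.0 - eps) + uniform if k == true_class else uniform
            for k in range(num_classes)]

def smoothed_cross_entropy(logits, true_class, eps=SMOOTHING):
    """Cross-entropy of softmax(logits) against the smoothed target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    target = smoothed_target(true_class, len(logits), eps)
    return -sum(t * lp for t, lp in zip(target, log_probs))

# Index 0 stands for ESI 1: the true class gets 0.96, every other class 0.01.
target = smoothed_target(0)
```

With a smoothing value of 0.05 the model is never asked to put all of its probability mass on one class, which tends to soften overconfident predictions on noisy labels.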
Ship metrics
Validation set (training holdout)
- val_esi_exact: 78.1% (+1.2pp vs v67)
- val_esi_adjacent: 98.0%
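As a rough sketch of how the two validation metrics could be computed: the card does not define "adjacent", so the code assumes it means a prediction within one ESI level of the label. The function name `esi_exact_and_adjacent` is hypothetical and the labels are toy values, not evaluation data.

```python
def esi_exact_and_adjacent(y_true, y_pred):
    """Return (exact, adjacent) agreement rates for ESI levels 1-5."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Assumption: "adjacent" counts predictions at most one level off.
    adjacent = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    return exact, adjacent

# Toy labels, not real evaluation data.
exact, adjacent = esi_exact_and_adjacent([1, 2, 3, 4, 5], [1, 3, 3, 2, 5])
```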
Production stack ESI 1 recall (safety floor 75%)
| Corpus | BERT only | Full stack |
|---|---|---|
| MCMED (Stanford compact CC) | 46.0% | 84.0% |
| MIETIC (UCSF narrative) | 80.0% | 96.7% |
| Lukina (Russian narrative) | 97.1% | 100.0% |
| MIMIC (Boston compact CC) | 71.8% | 87.6% |
| 3-corpus average | 69.5% | 88.1% |
All ship gates pass. Full-stack ESI 1 recall is 88.1%, up 1.4pp from the prior champion, v67.
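The recall figures and the 75% safety floor can be checked with a sketch like the one below. It assumes the floor applies to each corpus separately, which the card does not state explicitly; `esi1_recall` and `passes_gate` are hypothetical names.

```python
SAFETY_FLOOR = 0.75  # ESI 1 recall floor stated in this card

def esi1_recall(y_true, y_pred):
    """Fraction of true ESI 1 cases that were predicted as ESI 1."""
    preds_for_esi1 = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_esi1:
        raise ValueError("no ESI 1 cases in y_true")
    return sum(p == 1 for p in preds_for_esi1) / len(preds_for_esi1)

def passes_gate(recall_by_corpus, floor=SAFETY_FLOOR):
    """Assumed gate: each corpus must meet the recall floor on its own."""
    return {name: r >= floor for name, r in recall_by_corpus.items()}

# Full-stack ESI 1 recall values copied from the table above.
full_stack = {"MCMED": 0.840, "MIETIC": 0.967, "Lukina": 1.000, "MIMIC": 0.876}
gate = passes_gate(full_stack)
```

Under this reading every corpus clears the floor with the full stack, while the BERT-only MCMED figure (46.0%) would not.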
Robustness
- Counterfactual fairness: 100% paired invariance across demographic variants (with ensemble)
- DOA discrimination (active code vs pronounced): 0% → 100% (v66 e3 baseline)
- Adversarial robustness: BERT 96-100% adjacent on perturbations (case shuffle, abbrev shuffle, dialect mix, typos)
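A paired-invariance check of the kind reported for counterfactual fairness might look like the sketch below. It assumes each test case is a pair of notes that differ only in a demographic attribute, and that invariance means both notes receive the same ESI level. `toy_predict` is a hypothetical stand-in for the production stack, and the notes are made up for illustration.

```python
def paired_invariance(predict, pairs):
    """Share of (original, variant) note pairs with identical predictions."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    same = sum(predict(a) == predict(b) for a, b in pairs)
    return same / len(pairs)

# Hypothetical stand-in for the full stack: keys on a symptom keyword only.
def toy_predict(note):
    return 2 if "chest pain" in note.lower() else 4

# Illustrative pairs that differ only in the stated sex of the patient.
pairs = [
    ("54 y/o man with chest pain", "54 y/o woman with chest pain"),
    ("30 y/o woman, ankle sprain", "30 y/o man, ankle sprain"),
]
rate = paired_invariance(toy_predict, pairs)
```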
Production stack
The model is part of a 7-stage production pipeline:
- BERT esi_head argmax
- Per-dialect temperature scaling (T ≥ 1.0 constrained for safety)
- Per-dialect safety thresholds (Bayes-optimal from validation)
- Safety head OR rule (airway_p > 0.5 OR resus_p > 0.5 → ESI 1)
- Dialect-aware engine override (Step A/B1 only, plus Step B3 for Lukina)
- v70 compound rules (DOA / ICH-on-anticoag)
- Lukina soft ESI 2 threshold (p_ESI2 > 0.18)
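Two of the stages above are specified precisely enough to sketch: the safety head OR rule and the Lukina soft ESI 2 threshold. The cutoffs (0.5 and 0.18) come from the list. The function names are hypothetical, and the code assumes the Lukina rule only ever raises acuity, which the card does not state.

```python
def apply_safety_or_rule(esi, airway_p, resus_p, cutoff=0.5):
    """Escalate to ESI 1 when either safety-head probability exceeds the cutoff."""
    if airway_p > cutoff or resus_p > cutoff:
        return 1
    return esi

def apply_lukina_soft_esi2(esi, p_esi2, dialect, cutoff=0.18):
    """For Lukina notes, move less acute predictions to ESI 2 when p_ESI2 > cutoff.

    Assumption: the rule never downgrades an existing ESI 1 or ESI 2 prediction.
    """
    if dialect == "Lukina" and p_esi2 > cutoff and esi > 2:
        return 2
    return esi

# Illustrative values only.
escalated = apply_safety_or_rule(3, airway_p=0.62, resus_p=0.10)
softened = apply_lukina_soft_esi2(4, p_esi2=0.21, dialect="Lukina")
```

Both rules can only raise acuity in this sketch, so applying them after the BERT prediction cannot lower ESI 1 recall.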
Per-dialect calibration (T = 1.0 across all dialects in v70b):
- MCMED thresholds: [0.11, 0.40, 0.30, 0.40, 0.40]
- MIETIC thresholds: [0.20, 0.35, 0.20, 0.30, 0.20]
- Lukina thresholds: [0.30, 0.25, 0.40, 0.40, 0.20]
- MIMIC thresholds: [0.15, 0.50, 0.20, 0.30, 0.40]
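The calibration values above could be applied roughly as follows. Temperature scaling is the standard softmax over `logits / T`. How the five thresholds are combined into a decision is not described in this card, so the rule below (pick the most acute class whose probability meets its threshold, else fall back to argmax) is an assumption, as are the function names.

```python
import math

# Thresholds copied from the list above; index 0 = ESI 1 ... index 4 = ESI 5.
THRESHOLDS = {
    "MCMED":  [0.11, 0.40, 0.30, 0.40, 0.40],
    "MIETIC": [0.20, 0.35, 0.20, 0.30, 0.20],
    "Lukina": [0.30, 0.25, 0.40, 0.40, 0.20],
    "MIMIC":  [0.15, 0.50, 0.20, 0.30, 0.40],
}

def calibrated_probs(logits, temperature=1.0):
    """Softmax of logits / T, with T constrained to be >= 1.0."""
    if temperature < 1.0:
        raise ValueError("temperature must be >= 1.0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def thresholded_esi(logits, dialect, temperature=1.0):
    """Assumed rule: most acute class whose probability meets its threshold."""
    probs = calibrated_probs(logits, temperature)
    for level, (p, thr) in enumerate(zip(probs, THRESHOLDS[dialect]), start=1):
        if p >= thr:
            return level
    # Fallback when no threshold is met: plain argmax.
    return probs.index(max(probs)) + 1

# Illustrative logits: ESI 2 is the argmax, but ESI 1 holds about 21% of the mass.
logits = [1.0, 2.0, 0.0, 0.0, 0.0]
mcmed_level = thresholded_esi(logits, "MCMED")
lukina_level = thresholded_esi(logits, "Lukina")
```

Under this assumed rule the low MCMED ESI 1 threshold (0.11) escalates the example to ESI 1, while the higher Lukina threshold (0.30) leaves it at ESI 2.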
Known limitations
- Single-rater labels with kappa ~0.5-0.7 on most training data. Inter-rater variance is the noise floor on minority classes.
- MIMIC pyxis-truth labels carry ~12-15% label noise on the ESI 1 class (walk-in mild presentations labeled ESI 1 because pyxis recorded a resuscitation event later). Audit-229b downweight applies during training.
- Pediatric coverage: pediatric records are ~0.2% of training. The model has been validated on 200 pediatric eval records (95% ESI 1 recall, 85% exact) but production deployment should pair with pediatric-specific oversight.
- Dialect bias: MIETIC narrative format ("Chief Complaint: + History:") produces polarized predictions: severe scenarios over-triage toward ESI 1, mild scenarios under-triage toward ESI 4-5. Middle ESI 3 is under-represented in the training distribution and a v73+ corpus rebalance is planned.
- Geographic skew: training corpus is US-Boston (MIMIC) + Stanford (MC-MED) + UCSF (MIETIC + ER-REASON) + Russian (Lukina). Other geographies, languages, and EHR styles may behave differently.
- Excluded scope: not validated on psychiatric crisis assessment beyond imminent SI flag, not validated on obstetric emergencies beyond 30 ESI 2-3 cases, and not designed for autonomous triage decisions.
Recommended use
- As decision-support displayed alongside an ED nurse's own triage assessment
- Always paired with the production stack (temperature scaling + per-dialect thresholds + engine override)
- ESI 1 recall floor enforced at 75% via per-dialect threshold calibration
Not recommended
- Standalone autonomous triage
- Use outside the training-corpus geographic/dialect distribution without re-validation
- Replacing existing institutional triage protocols
Loading
This model is saved in PyTorch state_dict format (model.pt) and requires the multi-head architecture definition from the training repo to load correctly. The architecture is defined in scripts/train/train_bert_v44_bf16.py:V44MultiHeadBERT. Use eng_features_dim=12 when constructing the model.
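import torch
from transformers import AutoTokenizer

# V44MultiHeadBERT is defined in the training repo; this import assumes the repo root is on sys.path.
from scripts.train.train_bert_v44_bf16 import V44MultiHeadBERT

model = V44MultiHeadBERT("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext", dropout=0.1, eng_features_dim=12)
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
state = torch.load("model.pt", map_location="cpu", weights_only=False)
# Some checkpoints wrap the weights under "model_state_dict".
if isinstance(state, dict) and "model_state_dict" in state:
    state = state["model_state_dict"]
# strict=False tolerates key mismatches, so print them rather than ignoring them.
result = model.load_state_dict(state, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("vadimbelsky/esi-triage-bert-v70b")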
Citation
Internal model, v70b, Round 250 ship (June 2026).
License
MIT. Use responsibly; clinical applications require validation in your institutional setting.