ESI Triage BERT v70b

Decision-support model for Emergency Severity Index (ESI 1-5) triage classification, trained on a blend of clinical ED notes spanning two dialects: compact chief-complaint (CC) entries and full narrative notes.

Not a substitute for clinician judgment. Intended as decision-support input, paired with a deterministic ESI handbook engine and per-dialect calibration.

Model overview

  • Encoder: BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  • Architecture: Multi-head (esi + symptom + flag + vitals + safety + bucket heads)
  • Primary head: 5-class ESI classifier
  • Input: Up to 128 BERT tokens of triage note text + 12-dim engineered features
  • Training: 3 epochs, bf16, batch 64, LR 3e-5, label smoothing 0.05
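To make the label smoothing setting concrete, the sketch below builds the smoothed target for a single example. It assumes the common convention (the one PyTorch's `CrossEntropyLoss(label_smoothing=...)` follows) in which the smoothing mass is spread uniformly over all classes; the card does not state which variant was used. The helper name `smoothed_target` and the ESI-to-index mapping are illustrative, not taken from the training repo.

```python
def smoothed_target(true_class: int, num_classes: int = 5, eps: float = 0.05) -> list[float]:
    """Return the label-smoothed target distribution for one example."""
    uniform = eps / num_classes  # mass every class receives
    return [
        (1.0 - eps) + uniform if k == true_class else uniform
        for k in range(num_classes)
    ]


# Assumed mapping: ESI 1-5 correspond to class indices 0-4, so ESI 2 is index 1.
target = smoothed_target(true_class=1)
```

With 5 classes and a smoothing value of 0.05, the true class keeps 0.96 of the probability mass and each other class receives 0.01.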

Ship metrics

Validation set (training holdout)

  • val_esi_exact: 78.1% (+1.2pp vs v67)
  • val_esi_adjacent: 98.0%
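
The two validation metrics can be computed as in the sketch below. It assumes that "adjacent" means the predicted ESI level is within one level of the label, which is the usual reading but is not spelled out in this card. The function name and the toy labels are illustrative only.

```python
def esi_accuracy(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (exact, adjacent) accuracy for ESI levels 1-5."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    # Exact: prediction equals the label.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Adjacent (assumed definition): prediction is at most one level away.
    adjacent = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    return exact, adjacent


# Toy labels for illustration only; not data from the validation set.
y_true = [1, 2, 3, 3, 4, 5]
y_pred = [1, 3, 3, 2, 4, 3]
exact, adjacent = esi_accuracy(y_true, y_pred)
```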

Production stack ESI 1 recall (safety floor 75%)

Corpus                        BERT only   Full stack
MCMED (Stanford compact CC)   46.0%       84.0%
MIETIC (UCSF narrative)       80.0%       96.7%
Lukina (Russian narrative)    97.1%       100.0%
MIMIC (Boston compact CC)     71.8%       87.6%
3-corpus average              69.5%       88.1%

All ship gates pass. Full-stack ESI 1 recall is 88.1%, +1.4pp vs prior champion v67.
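
As a minimal sketch of how the ESI 1 recall gate could be checked, the code below computes class recall and compares the full-stack figures from the table above against the 75% floor. Whether the real ship gate is applied per corpus or on the average is not stated here, so the per-corpus check is an assumption; the function names are illustrative.

```python
SAFETY_FLOOR = 0.75  # ESI 1 recall floor stated in this card

# Full-stack ESI 1 recall per corpus, copied from the table above.
FULL_STACK_RECALL = {
    "MCMED": 0.840,
    "MIETIC": 0.967,
    "Lukina": 1.000,
    "MIMIC": 0.876,
}


def esi1_recall(y_true, y_pred):
    """Recall for the ESI 1 class: TP / (TP + FN)."""
    predictions_for_esi1 = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not predictions_for_esi1:
        raise ValueError("no ESI 1 cases in y_true")
    return sum(p == 1 for p in predictions_for_esi1) / len(predictions_for_esi1)


def gate_report(recalls, floor=SAFETY_FLOOR):
    """Map each corpus to True if its recall meets the floor (assumed per-corpus gate)."""
    return {corpus: recall >= floor for corpus, recall in recalls.items()}


report = gate_report(FULL_STACK_RECALL)
all_pass = all(report.values())
```

Under this reading, the BERT-only figures for MCMED (46.0%) and MIMIC (71.8%) would not clear the floor on their own, which is why the card reports the gate on the full stack.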

Robustness

  • Counterfactual fairness: 100% paired invariance across demographic variants (with ensemble)
  • DOA discrimination (active code vs pronounced): 0% → 100% (v66 e3 baseline)
  • Adversarial robustness: BERT 96-100% adjacent on perturbations (case shuffle, abbrev shuffle, dialect mix, typos)
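
Paired invariance for the counterfactual fairness check can be measured as sketched below: each note is scored twice, once as written and once with only the demographic descriptor changed, and the metric is the share of pairs whose predictions agree. This is an assumed definition; the card does not describe the evaluation harness, and the pairs shown are toy values.

```python
def paired_invariance(pairs):
    """Fraction of (original, counterfactual) prediction pairs that agree."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    return sum(original == variant for original, variant in pairs) / len(pairs)


# Toy ESI predictions for illustration; not results from this model.
pairs = [(2, 2), (3, 3), (1, 1), (4, 3)]
rate = paired_invariance(pairs)
```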

Production stack

The model is part of a 7-stage production pipeline:

  1. BERT esi_head argmax
  2. Per-dialect temperature scaling (T ≥ 1.0 constrained for safety)
  3. Per-dialect safety thresholds (Bayes-optimal from validation)
  4. Safety head OR rule (airway_p > 0.5 OR resus_p > 0.5 → ESI 1)
  5. Dialect-aware engine override (Step A/B1 only, plus Step B3 for Lukina)
  6. v70 compound rules (DOA / ICH-on-anticoag)
  7. Lukina soft ESI 2 threshold (p_ESI2 > 0.18)
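
A few of the stages above can be sketched in plain Python. The code covers stage 2 (temperature scaling), stage 1 (argmax, applied here to the calibrated probabilities, which leaves the argmax unchanged), stage 4 (safety head OR rule), and stage 7 (Lukina soft ESI 2 threshold). Stages 3, 5, and 6 are omitted because they depend on the engine and rule definitions in the training repo. The function names are hypothetical, and the reading of stage 7 as "upgrade to ESI 2 when the prediction is less acute and p_ESI2 > 0.18" is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Stage 2: divide logits by T before softmax; T >= 1.0 never sharpens."""
    if temperature < 1.0:
        raise ValueError("temperature is constrained to T >= 1.0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_safety_rules(probs, airway_p, resus_p, dialect):
    """Stages 1, 4, and 7 on calibrated probabilities; returns an ESI level 1-5."""
    esi = probs.index(max(probs)) + 1  # stage 1: argmax over the 5 classes
    if airway_p > 0.5 or resus_p > 0.5:  # stage 4: safety head OR rule
        return 1
    # Stage 7 (assumed semantics): soft upgrade to ESI 2 for the Lukina dialect.
    if dialect == "Lukina" and esi > 2 and probs[1] > 0.18:
        return 2
    return esi


# Toy logits for illustration only.
probs = softmax_with_temperature([0.2, 1.0, 1.5, 0.1, -1.0], temperature=1.0)
```

The safety rules only ever move a prediction toward a more acute level, so they can raise ESI 1 and ESI 2 recall at the cost of some over-triage.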

Per-dialect calibration (T = 1.0 across all dialects in v70b):

  • MCMED thresholds: [0.11, 0.40, 0.30, 0.40, 0.40]
  • MIETIC thresholds: [0.20, 0.35, 0.20, 0.30, 0.20]
  • Lukina thresholds: [0.30, 0.25, 0.40, 0.40, 0.20]
  • MIMIC thresholds: [0.15, 0.50, 0.20, 0.30, 0.40]
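
One way to hold these values in code is a small per-dialect configuration with a sanity check, sketched below. The dictionary layout and helper names are hypothetical, and the sketch assumes the five thresholds are ordered ESI 1 through ESI 5, which the card does not state. How the production stack combines the classes that clear their thresholds is also not described here, so the helper only reports them.

```python
DIALECT_CONFIG = {
    "MCMED": {"temperature": 1.0, "thresholds": [0.11, 0.40, 0.30, 0.40, 0.40]},
    "MIETIC": {"temperature": 1.0, "thresholds": [0.20, 0.35, 0.20, 0.30, 0.20]},
    "Lukina": {"temperature": 1.0, "thresholds": [0.30, 0.25, 0.40, 0.40, 0.20]},
    "MIMIC": {"temperature": 1.0, "thresholds": [0.15, 0.50, 0.20, 0.30, 0.40]},
}


def validate_config(config):
    """Check the T >= 1.0 constraint and that each dialect has 5 valid thresholds."""
    for dialect, cfg in config.items():
        if cfg["temperature"] < 1.0:
            raise ValueError(f"{dialect}: temperature must be >= 1.0")
        thresholds = cfg["thresholds"]
        if len(thresholds) != 5 or not all(0.0 <= t <= 1.0 for t in thresholds):
            raise ValueError(f"{dialect}: expected 5 thresholds in [0, 1]")


def classes_over_threshold(probs, dialect, config=DIALECT_CONFIG):
    """Return the ESI levels (1-5) whose probability exceeds the dialect threshold."""
    thresholds = config[dialect]["thresholds"]
    return [k + 1 for k, (p, t) in enumerate(zip(probs, thresholds)) if p > t]


validate_config(DIALECT_CONFIG)
# Toy probabilities for illustration only.
toy_probs = [0.12, 0.30, 0.40, 0.15, 0.03]
```

With these toy probabilities, the lower MCMED threshold for the first class (0.11) flags ESI 1 while the MIMIC threshold (0.15) does not, which illustrates why the thresholds are kept per dialect.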

Known limitations

  1. Single-rater labels with kappa ~0.5-0.7 on most training data. Inter-rater variance is the noise floor on minority classes.
  2. MIMIC pyxis-truth labels carry ~12-15% label noise on the ESI 1 class (walk-in mild presentations labeled ESI 1 because pyxis recorded a resuscitation event later). Audit-229b downweight applies during training.
  3. Pediatric coverage: pediatric records are ~0.2% of training. The model has been validated on 200 pediatric eval records (95% ESI 1 recall, 85% exact) but production deployment should pair with pediatric-specific oversight.
  4. Dialect bias: MIETIC narrative format ("Chief Complaint: + History:") produces polarized predictions: severe scenarios over-triage toward ESI 1, mild scenarios under-triage toward ESI 4-5. Middle ESI 3 is under-represented in the training distribution and a v73+ corpus rebalance is planned.
  5. Geographic skew: training corpus is US-Boston (MIMIC) + Stanford (MC-MED) + UCSF (MIETIC + ER-REASON) + Russian (Lukina). Other geographies, languages, and EHR styles may behave differently.
  6. Excluded scope: not validated on psychiatric crisis assessment beyond imminent SI flag, not validated on obstetric emergencies beyond 30 ESI 2-3 cases, and not designed for autonomous triage decisions.
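
For readers unfamiliar with the kappa figure in item 1, the sketch below computes unweighted Cohen's kappa between two raters. The card does not say which kappa variant or which rater pairs the ~0.5-0.7 estimate comes from, so this is only an illustration of the statistic, using toy ratings.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy ESI ratings for illustration only.
rater_a = [1, 2, 2, 3, 3, 3, 4, 5]
rater_b = [1, 2, 3, 3, 3, 4, 4, 5]
kappa = cohens_kappa(rater_a, rater_b)
```

Here the raters agree on 6 of 8 cases (75%), but after correcting for chance agreement the kappa is about 0.67, at the upper end of the range quoted above.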

Recommended use

  • As decision-support displayed alongside an ED nurse's own triage assessment
  • Always paired with the production stack (temperature scaling + per-dialect thresholds + engine override)
  • ESI 1 recall floor enforced at 75% via per-dialect threshold calibration
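
The idea of enforcing a recall floor through threshold choice can be illustrated as follows. This sketch picks the highest ESI 1 probability threshold that still keeps recall at or above the floor on a validation sample. It is a simplified stand-in, not the Bayes-optimal procedure mentioned for the production thresholds, and the function name and toy data are hypothetical.

```python
import math


def threshold_for_recall_floor(esi1_probs, is_esi1, floor=0.75):
    """Highest t such that flagging ESI 1 when p >= t keeps recall >= floor."""
    if not 0.0 < floor <= 1.0:
        raise ValueError("floor must be in (0, 1]")
    positives = sorted((p for p, y in zip(esi1_probs, is_esi1) if y), reverse=True)
    if not positives:
        raise ValueError("no ESI 1 cases in the validation sample")
    needed = math.ceil(floor * len(positives))  # true positives required
    # The needed-th highest positive score is the largest usable threshold.
    return positives[needed - 1]


# Toy validation scores for illustration only.
esi1_probs = [0.9, 0.6, 0.3, 0.1, 0.4, 0.05]
is_esi1 = [True, True, True, True, False, False]
t = threshold_for_recall_floor(esi1_probs, is_esi1, floor=0.75)
```

Lowering the threshold below the returned value raises recall further but flags more non-ESI 1 cases, so the choice is a trade-off between the safety floor and over-triage.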

Not recommended

  • Standalone autonomous triage
  • Use outside the training-corpus geographic/dialect distribution without re-validation
  • Replacing existing institutional triage protocols

Loading

This model is saved in PyTorch state_dict format (model.pt) and requires the multi-head architecture definition from the training repo to load correctly. The architecture is defined in scripts/train/train_bert_v44_bf16.py:V44MultiHeadBERT. Use eng_features_dim=12 when constructing the model.

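import torch
from transformers import AutoTokenizer
# V44MultiHeadBERT lives in the training repo; this import assumes the repo root is on sys.path.
from scripts.train.train_bert_v44_bf16 import V44MultiHeadBERT

model = V44MultiHeadBERT("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext", dropout=0.1, eng_features_dim=12)
# weights_only=False unpickles arbitrary objects; load only checkpoints you trust.
state = torch.load("model.pt", map_location="cpu", weights_only=False)
# Some checkpoints wrap the weights in a dict under "model_state_dict".
if isinstance(state, dict) and "model_state_dict" in state:
    state = state["model_state_dict"]
# strict=False tolerates key mismatches, so report them rather than ignore them.
result = model.load_state_dict(state, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("vadimbelsky/esi-triage-bert-v70b")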

Citation

Internal model, v70b, Round 250 ship (June 2026).

License

MIT. Use responsibly; clinical applications require validation in your institutional setting.
