ESI Triage BERT v70b

Decision-support model for Emergency Severity Index (ESI 1-5) triage classification, trained on a blend of clinical ED notes spanning two dialects: compact chief-complaint (CC) notes and full narratives.

Not a substitute for clinician judgment. Intended as decision-support input, paired with a deterministic ESI handbook engine and per-dialect calibration.

Model overview

Encoder: BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
Architecture: Multi-head (esi + symptom + flag + vitals + safety + bucket heads)
Primary head: 5-class ESI classifier
Input: Up to 128 BERT tokens of triage note text + 12-dim engineered features
Training: 3 epochs, bf16, batch 64, LR 3e-5, label smoothing 0.05
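The label smoothing value of 0.05 can be made concrete with a small sketch. It assumes the uniform-smoothing convention (the same one PyTorch's CrossEntropyLoss uses), where the target is (1 - eps) * one_hot + eps / num_classes. The function names are illustrative and are not taken from the training repo.

```python
import math

def smoothed_targets(true_class: int, num_classes: int = 5, eps: float = 0.05) -> list[float]:
    """Uniform label smoothing: (1 - eps) * one_hot + eps / num_classes."""
    base = eps / num_classes
    return [base + (1.0 - eps) if k == true_class else base for k in range(num_classes)]

def smoothed_cross_entropy(probs: list[float], true_class: int, eps: float = 0.05) -> float:
    """Cross-entropy between the smoothed target and the predicted probabilities."""
    targets = smoothed_targets(true_class, len(probs), eps)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

# ESI levels run 1-5, so index 0 stands for ESI 1.
targets = smoothed_targets(true_class=0)
loss = smoothed_cross_entropy([0.70, 0.15, 0.10, 0.03, 0.02], true_class=0)
print(targets, round(loss, 4))
```

With eps = 0.05 and five classes, the true class keeps a target mass of about 0.96 and each other class receives about 0.01, which mildly penalizes over-confident predictions on noisy labels.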

Ship metrics

Validation set (training holdout)

val_esi_exact: 78.1% (+1.2pp vs v67)
val_esi_adjacent: 98.0%
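As a rough illustration of the two validation metrics, the sketch below assumes that "exact" means the predicted ESI level equals the label and that "adjacent" means the prediction is within one ESI level of the label. The function name and the toy labels are hypothetical and do not come from the validation set.

```python
def esi_exact_and_adjacent(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (exact accuracy, within-one-level accuracy) for ESI labels 1-5."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    adjacent = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    return exact, adjacent

# Toy labels for illustration only.
y_true = [1, 2, 3, 3, 4, 5, 2, 3]
y_pred = [1, 2, 3, 2, 4, 3, 2, 3]
exact, adjacent = esi_exact_and_adjacent(y_true, y_pred)
print(exact, adjacent)  # 0.75 0.875
```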

Production stack ESI 1 recall (safety floor 75%)

Corpus	BERT only	Full stack
MCMED (Stanford compact CC)	46.0%	84.0%
MIETIC (UCSF narrative)	80.0%	96.7%
Lukina (Russian narrative)	97.1%	100.0%
MIMIC (Boston compact CC)	71.8%	87.6%
3-corpus average	69.5%	88.1%

All ship gates pass. Full-stack ESI 1 recall is 88.1%, +1.4pp vs prior champion v67.
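The safety-floor check can be sketched as follows. The recall figures are copied from the table above; the assumption that the 75% floor is checked per corpus, along with every function and variable name, is illustrative rather than a description of the actual gate code.

```python
ESI1_RECALL_FLOOR = 75.0  # percent, from the "safety floor 75%" heading

# (BERT only, full stack) ESI 1 recall in percent, copied from the table above.
ESI1_RECALL = {
    "MCMED": (46.0, 84.0),
    "MIETIC": (80.0, 96.7),
    "Lukina": (97.1, 100.0),
    "MIMIC": (71.8, 87.6),
}

def esi1_recall(y_true: list[int], y_pred: list[int]) -> float:
    """Percent of true ESI 1 cases that were predicted as ESI 1."""
    preds_for_esi1 = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_esi1:
        raise ValueError("no ESI 1 cases in y_true")
    return 100.0 * sum(p == 1 for p in preds_for_esi1) / len(preds_for_esi1)

def corpora_below_floor(column: int, floor: float = ESI1_RECALL_FLOOR) -> list[str]:
    """Corpora whose recall in the given column (0 = BERT only, 1 = full stack) misses the floor."""
    return [name for name, values in ESI1_RECALL.items() if values[column] < floor]

bert_only_failures = corpora_below_floor(0)
full_stack_failures = corpora_below_floor(1)
print(bert_only_failures, full_stack_failures)  # ['MCMED', 'MIMIC'] []
```

Under this reading, the encoder alone misses the floor on both compact-CC corpora, and the full stack clears it on all four.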

Robustness

Counterfactual fairness: 100% paired invariance across demographic variants (with ensemble)
DOA discrimination (active code vs pronounced): 0% at the v66 e3 baseline → 100%
Adversarial robustness: BERT-only adjacent accuracy stays at 96-100% under perturbations (case shuffle, abbrev shuffle, dialect mix, typos)
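One way to measure paired invariance is sketched below, assuming it is the fraction of (original, demographic variant) note pairs that receive the same ESI level. The toy_predict function is a hypothetical stand-in so the example runs without the model; the notes are invented.

```python
from typing import Callable

def paired_invariance(predict: Callable[[str], int], pairs: list[tuple[str, str]]) -> float:
    """Fraction of (original, variant) note pairs that receive the same ESI level."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    unchanged = sum(predict(original) == predict(variant) for original, variant in pairs)
    return unchanged / len(pairs)

def toy_predict(note: str) -> int:
    """Hypothetical stand-in classifier, not the real model."""
    return 1 if "unresponsive" in note.lower() else 3

# Each pair differs only in a demographic attribute.
pairs = [
    ("45 y/o male, unresponsive", "45 y/o female, unresponsive"),
    ("30 y/o woman, ankle pain", "30 y/o man, ankle pain"),
]
rate = paired_invariance(toy_predict, pairs)
print(rate)  # 1.0
```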

Production stack

The model is part of a 7-stage production pipeline:

BERT esi_head argmax
Per-dialect temperature scaling (T ≥ 1.0 constrained for safety)
Per-dialect safety thresholds (Bayes-optimal from validation)
Safety head OR rule (airway_p > 0.5 OR resus_p > 0.5 → ESI 1)
Dialect-aware engine override (Step A/B1 only, plus Step B3 for Lukina)
v70 compound rules (DOA / ICH-on-anticoag)
Lukina soft ESI 2 threshold (p_ESI2 > 0.18)
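Three of the stages listed above (temperature scaling, the safety head OR rule, and the Lukina soft ESI 2 threshold) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the production code: the function names are hypothetical, and the soft ESI 2 rule is assumed to promote only less severe predictions (ESI 3-5) to ESI 2.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax over logits / T; T is constrained to >= 1.0, so it can only soften."""
    if temperature < 1.0:
        raise ValueError("temperature is constrained to T >= 1.0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_safety_or_rule(esi: int, airway_p: float, resus_p: float, cutoff: float = 0.5) -> int:
    """Force ESI 1 when either safety-head probability exceeds the cutoff."""
    return 1 if (airway_p > cutoff or resus_p > cutoff) else esi

def apply_lukina_soft_esi2(esi: int, probs: list[float], dialect: str, cutoff: float = 0.18) -> int:
    """Assumed behavior: on Lukina notes, promote ESI 3-5 to ESI 2 when p_ESI2 > cutoff."""
    if dialect == "lukina" and esi > 2 and probs[1] > cutoff:
        return 2
    return esi

logits = [0.2, 1.0, 1.4, 0.1, -0.5]  # toy esi_head logits for ESI 1..5
probs = softmax_with_temperature(logits, temperature=1.0)
esi = probs.index(max(probs)) + 1  # argmax gives ESI 3 for these logits
esi_lukina = apply_lukina_soft_esi2(esi, probs, "lukina")  # promoted to ESI 2
esi_safe = apply_safety_or_rule(esi, airway_p=0.62, resus_p=0.10)  # forced to ESI 1
print(esi, esi_lukina, esi_safe)  # 3 2 1
```

Constraining T to at least 1.0 means calibration can only flatten the distribution, never sharpen it, which keeps minority-class probabilities such as ESI 1 from being pushed further below their thresholds.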

Per-dialect calibration (T = 1.0 across all dialects in v70b):

MCMED thresholds: [0.11, 0.40, 0.30, 0.40, 0.40]
MIETIC thresholds: [0.20, 0.35, 0.20, 0.30, 0.20]
Lukina thresholds: [0.30, 0.25, 0.40, 0.40, 0.20]
MIMIC thresholds: [0.15, 0.50, 0.20, 0.30, 0.40]
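The card does not say how the five per-class thresholds are applied. The sketch below shows one plausible reading, labeled as an assumption: scan from ESI 1 to ESI 5 and return the most severe level whose probability clears its threshold, falling back to argmax. The threshold values are copied from the lists above; the function name and the toy probabilities are hypothetical.

```python
# Per-dialect thresholds for ESI 1..5, copied from the lists above.
THRESHOLDS = {
    "mcmed": [0.11, 0.40, 0.30, 0.40, 0.40],
    "mietic": [0.20, 0.35, 0.20, 0.30, 0.20],
    "lukina": [0.30, 0.25, 0.40, 0.40, 0.20],
    "mimic": [0.15, 0.50, 0.20, 0.30, 0.40],
}

def threshold_decision(probs: list[float], dialect: str) -> int:
    """Assumed rule: most severe ESI level whose probability clears its threshold, else argmax."""
    for level, (p, threshold) in enumerate(zip(probs, THRESHOLDS[dialect]), start=1):
        if p >= threshold:
            return level
    return probs.index(max(probs)) + 1

# Toy calibrated probabilities for ESI 1..5; the plain argmax would be ESI 3.
probs = [0.13, 0.30, 0.37, 0.15, 0.05]
decisions = {dialect: threshold_decision(probs, dialect) for dialect in THRESHOLDS}
print(decisions)  # {'mcmed': 1, 'mietic': 3, 'lukina': 2, 'mimic': 3}
```

Under this reading, the low MCMED ESI 1 threshold (0.11) escalates a case that the other dialects leave at ESI 2 or 3, which is the kind of trade-off that raises ESI 1 recall at the cost of over-triage.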

Known limitations

Single-rater labels with kappa ~0.5-0.7 on most training data. Inter-rater variance is the noise floor on minority classes.
MIMIC pyxis-truth labels carry ~12-15% label noise on the ESI 1 class (walk-in mild presentations labeled ESI 1 because pyxis recorded a resuscitation event later). Audit-229b downweight applies during training.
Pediatric coverage: pediatric records are ~0.2% of training. The model has been validated on 200 pediatric eval records (95% ESI 1 recall, 85% exact) but production deployment should pair with pediatric-specific oversight.
Dialect bias: MIETIC narrative format ("Chief Complaint: + History:") produces polarized predictions: severe scenarios over-triage toward ESI 1, and mild scenarios under-triage toward ESI 4-5. ESI 3 is under-represented in the training distribution, and a v73+ corpus rebalance is planned.
Geographic skew: the training corpus covers Boston (MIMIC), Stanford (MC-MED), UCSF (MIETIC + ER-REASON), and Russian-language notes (Lukina). Other geographies, languages, and EHR styles may behave differently.
Excluded scope: not validated on psychiatric crisis assessment beyond the imminent suicidal ideation (SI) flag, not validated on obstetric emergencies beyond 30 ESI 2-3 cases, and not designed for autonomous triage decisions.
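For context on the kappa range quoted in the first limitation, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below uses invented toy ratings, not data from any training corpus.

```python
from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and the same length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal frequency, summed over labels.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Toy ESI ratings from two hypothetical triage nurses.
rater_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 3]
rater_b = [1, 2, 3, 3, 3, 2, 4, 3, 5, 3]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.583
```

Here the raters agree on 7 of 10 notes, yet kappa is only about 0.58 once chance agreement is removed, which is why labels in this range limit the achievable exact-match accuracy.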

Recommended use

As decision-support displayed alongside an ED nurse's own triage assessment
Always paired with the production stack (temperature scaling + per-dialect thresholds + engine override)
ESI 1 recall floor enforced at 75% via per-dialect threshold calibration

Not recommended

Standalone autonomous triage
Use outside the training-corpus geographic/dialect distribution without re-validation
Replacing existing institutional triage protocols

Loading

This model is saved in PyTorch state_dict format (model.pt) and requires the multi-head architecture definition from the training repo to load correctly. The architecture is defined in scripts/train/train_bert_v44_bf16.py:V44MultiHeadBERT. Use eng_features_dim=12 when constructing the model.

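import torch
from transformers import AutoTokenizer

# V44MultiHeadBERT is defined in the training repo; this import assumes the repo root is on PYTHONPATH.
from scripts.train.train_bert_v44_bf16 import V44MultiHeadBERT

model = V44MultiHeadBERT("microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext", dropout=0.1, eng_features_dim=12)
# weights_only=False unpickles arbitrary objects, so load only checkpoints you trust.
state = torch.load("model.pt", map_location="cpu", weights_only=False)
# Unwrap checkpoints that store the weights under "model_state_dict".
if isinstance(state, dict) and "model_state_dict" in state:
    state = state["model_state_dict"]
# strict=False tolerates key mismatches; print them rather than ignoring them silently.
missing, unexpected = model.load_state_dict(state, strict=False)
print("missing keys:", missing, "| unexpected keys:", unexpected)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("vadimbelsky/esi-triage-bert-v70b")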

Citation

Internal model, v70b, Round 250 ship (June 2026).

License

MIT. Use responsibly; clinical applications require validation in your institutional setting.
