BERT ESI Triage v50 (RESEARCH CHECKPOINT)
⚠️ RESEARCH USE ONLY: NOT FOR CLINICAL DECISIONS
NOT A MEDICAL DEVICE. This model is an INTERMEDIATE training checkpoint released for research transparency and version-iteration history. It is NOT approved for clinical use and contains DOCUMENTED REGRESSIONS vs the production model (v49). Use v49 if you need the higher-quality release.
Do not enter real patient data.
What this is
bert-esi-triage-v50 is a 110M-parameter BiomedBERT classifier fine-tuned
to assign Emergency Severity Index (ESI 1–5) acuity from triage-time chief
complaint, vitals, and demographics. It is an INTERMEDIATE checkpoint in
our v49 → v50 → v52 iteration history. v50 was trained on a corpus
addressing several v49 gaps (PMH dilution, ESI 3 boundary, gender atypical
MI) but introduced new regressions documented below.
For production use, please use v49: vadimbelsky/bert-esi-triage-v49.
Why this exists
The medical-triage research community benefits from seeing intermediate training checkpoints with HONEST scorecards. v50 represents:
- Validated improvements on narrative dialects (ER-REASON +10.5pp, PMH dilution +15.4pp, gender atypical MI +29pp)
- Validated regressions on compact-CC dialects (MC-MED ESI 1 R -24pp, adversarial ESI 1 R -27pp)
- A data point for understanding the trade-off between narrative generalization and compact-CC specialization
This release is intended for researchers studying:
- Training corpus impact on model generalization
- Fairness regression mechanisms (race-driven prediction flips doubled vs v49)
- Engine-ensemble mitigation strategies (engine cuts v50's 26% fairness swings back to 0%)
v50 vs v49 Honest Scorecard
Improvements
| Slice | v49 | v50 | Δ (pp) |
|---|---|---|---|
| MIMIC exact | 54.7% | 56.7% | +2.0 |
| MC-MED exact | 58.9% | 61.7% | +2.8 |
| OB emergency exact | 65.3% | 68.4% | +3.1 |
| OB emergency ESI 1 R | 85.0% | 91.7% | +6.7 |
| Anaphylaxis ESI 1 R | 94.1% | 97.1% | +3.0 |
| ER-REASON narrative | 44.5% | 55.0% | +10.5 |
| Gender atypical MI synth | 52.0% | 81.0% | +29.0 |
| PMH dilution synth | 23.1% | 38.5% | +15.4 |
| ESI 3 boundary synth | 44.7% | 61.7% | +17.0 |
Regressions
| Slice | v49 | v50 | Δ (pp) |
|---|---|---|---|
| MC-MED ESI 1 recall | 89.0% | 65.0% | -24.0 ⚠️ |
| MIMIC ESI 1 R | 84.6% | 75.6% | -9.1 |
| Adversarial ESI 1 R | 93.3% | 66.7% | -26.7 ⚠️ |
| Stroke exact | 58.8% | 50.5% | -8.3 |
| Sepsis exact | 62.4% | 55.9% | -6.5 |
| Sepsis ESI 1 R | 97.5% | 95.0% | -2.5 |
| Stroke ESI 1 R | 85.0% | 80.0% | -5.0 |
| Lukina exact | 49.3% | 46.3% | -3.0 |
| MIETIC exact | 83.5% | 80.0% | -3.5 |
| SES fairness slice | 55.0% | 26.0% | -29.0 |
| SCD fairness slice | 94.0% | 80.0% | -14.0 |
| Pediatric carved ESI 1 R | 100.0% | 87.5% | -12.5 |
| Gender MI fairness | 45.0% | 32.0% | -13.0 |
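The two metric families in the tables above, "exact" and "ESI 1 R", can be sketched as follows. This is a minimal illustration, assuming "exact" means the predicted level equals the reference level and "ESI 1 R" means recall on records whose reference level is ESI 1. The function names and toy labels are hypothetical and are not taken from the evaluation code.

```python
from typing import Sequence


def exact_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of records whose predicted ESI level equals the reference level."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def esi1_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Of the records whose reference level is ESI 1, the fraction predicted as ESI 1."""
    hits = [p == 1 for t, p in zip(y_true, y_pred) if t == 1]
    if not hits:
        raise ValueError("no ESI 1 records in the reference labels")
    return sum(hits) / len(hits)


# Toy labels invented for illustration only (not drawn from any evaluation slice).
y_true = [1, 1, 1, 1, 2, 3, 3, 4]
y_pred = [1, 1, 2, 1, 2, 3, 4, 4]

print(f"exact: {exact_accuracy(y_true, y_pred):.1%}")      # exact: 75.0%
print(f"ESI 1 recall: {esi1_recall(y_true, y_pred):.1%}")  # ESI 1 recall: 75.0%
```

The two numbers can move in opposite directions on the same slice, which is the pattern MC-MED shows: exact accuracy rises while ESI 1 recall falls.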
Counterfactual Fairness (Race-Demographic Invariance)
200 paired records (50 pair_ids × 4 demographic variants):
| Metric | v49 | v50 | Δ |
|---|---|---|---|
| Pair-mean max ESI Δ | 0.42 | 0.92 | +0.50 ⚠️ |
| Pairs perfectly invariant | 72% | 54% | -18pp |
| Pairs with ESI swing ≥ 2 | 12% | 26% | +14pp ⚠️ |
| Race-driven swing pairs | 8 | 15 | +7 |
| Race mean swing | 1.50 | 1.67 | +0.17 |
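A sketch of how paired-counterfactual metrics like those in the table could be computed. It assumes each pair's "swing" is the maximum minus the minimum predicted ESI across that pair's demographic variants, and that "Pair-mean max ESI Δ" is the mean of those swings. The record layout, the variant labels, and the `fairness_summary` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (pair_id, demographic_variant, predicted_esi).
records = [
    ("p01", "A", 2), ("p01", "B", 2), ("p01", "C", 2), ("p01", "D", 2),
    ("p02", "A", 1), ("p02", "B", 3), ("p02", "C", 2), ("p02", "D", 2),
    ("p03", "A", 3), ("p03", "B", 4), ("p03", "C", 3), ("p03", "D", 3),
    ("p04", "A", 4), ("p04", "B", 4), ("p04", "C", 4), ("p04", "D", 4),
]


def fairness_summary(records):
    """Summarize how much predictions move when only demographics change."""
    by_pair = defaultdict(list)
    for pair_id, _variant, esi in records:
        by_pair[pair_id].append(esi)

    # Swing per pair: spread of predicted ESI across its demographic variants.
    swings = [max(levels) - min(levels) for levels in by_pair.values()]
    n_pairs = len(swings)
    return {
        "mean_max_delta": mean(swings),
        "pct_invariant": 100 * sum(s == 0 for s in swings) / n_pairs,
        "pct_swing_ge_2": 100 * sum(s >= 2 for s in swings) / n_pairs,
    }


summary = fairness_summary(records)
print(summary)
```

Because the variants within a pair differ only in demographics, any nonzero swing is attributable to the demographic fields rather than to the clinical content.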
The engine ensemble (min(BERT, engine_v4)) cuts the share of pairs with an
ESI swing ≥ 2 from 26% to 0%. The engine is REQUIRED if v50 is used at all.
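The min rule can be sketched in a few lines. This is an illustration only: the internals of engine_v4 are not described here, so both predictions are represented as plain integers, and the `ensemble_esi` name is hypothetical. Since ESI 1 is the most acute level, taking the minimum keeps whichever prediction is more acute.

```python
def ensemble_esi(bert_esi: int, engine_esi: int) -> int:
    """Return the more acute (numerically lower) of two ESI predictions."""
    for level in (bert_esi, engine_esi):
        if level not in range(1, 6):
            raise ValueError(f"ESI level must be 1-5, got {level}")
    return min(bert_esi, engine_esi)


# Hypothetical pair: BERT's prediction drifts across four demographic variants,
# while the engine output (assumed here to ignore demographics) stays at ESI 2.
bert_preds = [2, 4, 2, 3]
engine_preds = [2, 2, 2, 2]

ensembled = [ensemble_esi(b, e) for b, e in zip(bert_preds, engine_preds)]
print(ensembled)  # [2, 2, 2, 2]
```

In this sketch the engine caps every case where BERT is less acute than the engine. A swing in which BERT is more acute than the engine would pass through the min unchanged.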
Aggregated head-to-head
- 11 improvements (>+1pp)
- 18 regressions (<-1pp)
- 5 stable (±1pp)
- NET: -2.5pp exact, -7.2pp ESI 1 recall vs v49
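The bucketing rule behind these counts can be sketched as below, assuming a delta of exactly ±1pp counts as stable. The first three rows reuse values from the scorecard tables. The last row is a made-up slice included only to exercise the stable branch, and `classify_delta` is a hypothetical helper name.

```python
from collections import Counter


def classify_delta(delta_pp: float, threshold_pp: float = 1.0) -> str:
    """Bucket a v50-minus-v49 delta (in percentage points)."""
    if delta_pp > threshold_pp:
        return "improvement"
    if delta_pp < -threshold_pp:
        return "regression"
    return "stable"


# (slice, v49 %, v50 %)
rows = [
    ("MIMIC exact", 54.7, 56.7),
    ("MC-MED ESI 1 recall", 89.0, 65.0),
    ("Sepsis ESI 1 R", 97.5, 95.0),
    ("hypothetical stable slice", 70.0, 70.5),  # invented for illustration
]

counts = Counter(classify_delta(round(v50 - v49, 1)) for _, v49, v50 in rows)
print(dict(counts))
```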
What this model is good for
- Research only: studying corpus-balance trade-offs
- Ablation reference: documenting what gains came from v50 corpus additions (gender atypical MI synth, PMH dilution synth, ESI 3 boundary)
- Engineering validation: testing that engine ensemble can rescue a fairness-regressed BERT
What this model is NOT good for
- Clinical decisions: UNSAFE due to MC-MED ESI 1 R regression
- Production triage: v49 outperforms it on safety metrics
- Stanford-style telegraphic input: 24pp ESI 1 catch loss
- Adversarial inputs: 27pp ESI 1 catch loss on adversarial probe
Architecture
Identical to v49: BiomedBERT-base + 21-head V44MultiHeadBERT.
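```python
import torch
from transformers import AutoTokenizer

# Load the raw weights onto the CPU; model.pt is a state dict, not a full model object
state = torch.load("model.pt", map_location="cpu")

# The architecture is V44MultiHeadBERT from train_bert_v44_bf16.py;
# instantiate that class before loading these weights into it
```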
See the v49 repository for full inference code:
vadimbelsky/bert-esi-triage-v49.
Training details
- Base: BiomedBERT-base (~110M params)
- Heads: 21 (esi_head + symptom_head + 19 perception/safety heads)
- Training: 3 epochs, bf16, batch 64, lr 3e-5
- Corpus: ~382K records (v50 corpus, predecessor of v52 tiered Pillar 4 corpus)
- Resume strategy: continued from v50 epoch 2 best (epoch 3 finalized symptom_f1_micro = 0.366)
- val esi_exact at end of training: 76.9%
- val esi_adjacent at end of training: 97.9%
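The gap between val esi_exact and val esi_adjacent can be illustrated with a short sketch. It assumes "adjacent" means the predicted level is within one ESI level of the reference, which is a common convention but is not confirmed by the training code. The function name and toy labels are hypothetical.

```python
from typing import Sequence


def adjacent_accuracy(y_true: Sequence[int], y_pred: Sequence[int], tolerance: int = 1) -> float:
    """Fraction of predictions within `tolerance` ESI levels of the reference."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    within = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return within / len(y_true)


# Toy labels invented for illustration only.
y_true = [1, 2, 3, 4, 5, 3]
y_pred = [1, 3, 3, 2, 5, 4]

print(f"exact:    {adjacent_accuracy(y_true, y_pred, tolerance=0):.1%}")  # exact:    50.0%
print(f"adjacent: {adjacent_accuracy(y_true, y_pred):.1%}")               # adjacent: 83.3%
```

A high adjacent score alongside a lower exact score indicates that most errors are off by a single level. It says nothing about whether those errors fall on the ESI 1 boundary, which is why ESI 1 recall is reported separately.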
What v50 taught us (used to design v52)
- Tiered Pillar 4 weighting outperforms uniform weighting, which indicates that 17% of records tagged "label_noise" were ambiguous, not noisy
- Symptom-head extraction must be re-run on all synth seeds
- ESI 1 archetype synth must NOT be added without engine extraction
- MC-MED ESI 1 corpus balance is critical: Pillar 1 fix in v52
- Fairness regression must be measured via paired counterfactual records
These lessons are baked into v52 (training now, ETA 2.5h from v50 launch).
Maintainer
Citation
```bibtex
@misc{belsky2026esi_v50,
  title={BERT ESI Triage v50: Intermediate Research Checkpoint},
  author={Belsky, Vadim},
  year={2026},
  publisher={HuggingFace},
  note={Research checkpoint with documented regressions vs v49 production}
}
```
Files in this repository
- model.pt: full model state dict (438 MB)
- README.md: this file