BERT ESI Triage: v50 (RESEARCH CHECKPOINT)

⚠️ RESEARCH USE ONLY: NOT FOR CLINICAL DECISIONS

NOT A MEDICAL DEVICE. This model is an INTERMEDIATE training checkpoint released for research transparency and version-iteration history. It is NOT approved for clinical use and contains DOCUMENTED REGRESSIONS vs the production model (v49). Use v49 if you need the higher-quality release.

Do not enter real patient data.

What this is

bert-esi-triage-v50 is a 110M-parameter BiomedBERT classifier fine-tuned to assign Emergency Severity Index (ESI 1–5) acuity from triage-time chief complaint, vitals, and demographics. It is an INTERMEDIATE checkpoint in our v49 → v50 → v52 iteration history. v50 was trained on a corpus addressing several v49 gaps (PMH dilution, ESI 3 boundary, gender atypical MI) but introduced new regressions documented below.

For production use, please use v49: vadimbelsky/bert-esi-triage-v49.

Why this exists

The medical-triage research community benefits from seeing intermediate training checkpoints with HONEST scorecards. v50 represents:

  • Validated improvements on narrative dialects (ER-REASON +10.5pp, PMH dilution +15.4pp, gender atypical MI +29pp)
  • Validated regressions on compact-CC dialects (MC-MED ESI 1 R -24pp, adversarial ESI 1 R -27pp)
  • A data point for understanding the trade-off between narrative generalization and compact-CC specialization

This release is intended for researchers studying:

  • Training corpus impact on model generalization
  • Fairness regression mechanisms (race-driven swing pairs nearly doubled vs v49, from 8 to 15)
  • Engine-ensemble mitigation strategies (the engine ensemble cuts v50's 26% of pairs with an ESI swing ≥ 2 back to 0%)

v50 vs v49 Honest Scorecard

Improvements

Slice                      v49      v50      Δ (pp)
MIMIC exact                54.7%    56.7%    +2.0
MC-MED exact               58.9%    61.7%    +2.8
OB emergency exact         65.3%    68.4%    +3.1
OB emergency ESI 1 R       85.0%    91.7%    +6.7
Anaphylaxis ESI 1 R        94.1%    97.1%    +3.0
ER-REASON narrative        44.5%    55.0%    +10.5
Gender atypical MI synth   52.0%    81.0%    +29.0
PMH dilution synth         23.1%    38.5%    +15.4
ESI 3 boundary synth       44.7%    61.7%    +17.0

Regressions

Slice                      v49      v50      Δ (pp)
MC-MED ESI 1 recall        89.0%    65.0%    -24.0 ⚠️
MIMIC ESI 1 R              84.6%    75.6%    -9.1
Adversarial ESI 1 R        93.3%    66.7%    -26.7 ⚠️
Stroke exact               58.8%    50.5%    -8.3
Sepsis exact               62.4%    55.9%    -6.5
Sepsis ESI 1 R             97.5%    95.0%    -2.5
Stroke ESI 1 R             85.0%    80.0%    -5.0
Lukina exact               49.3%    46.3%    -3.0
MIETIC exact               83.5%    80.0%    -3.5
SES fairness slice         55.0%    26.0%    -29.0
SCD fairness slice         94.0%    80.0%    -14.0
Pediatric carved ESI 1 R   100.0%   87.5%    -12.5
Gender MI fairness         45.0%    32.0%    -13.0
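
The "ESI 1 R" rows above report recall on the highest-acuity class. As a minimal sketch of that metric, assuming integer ESI labels from 1 to 5 and the standard definition of recall (the function name `esi1_recall` is hypothetical and not taken from the evaluation code):

```python
from typing import Sequence


def esi1_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true ESI 1 cases that the model also labels ESI 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions made for cases whose reference label is ESI 1.
    esi1_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not esi1_preds:
        raise ValueError("no ESI 1 cases in y_true")
    return sum(p == 1 for p in esi1_preds) / len(esi1_preds)


# Toy labels for illustration only (not drawn from any evaluation slice).
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 2, 1, 2, 3]
recall = esi1_recall(y_true, y_pred)  # 3 of the 4 ESI 1 cases are caught
```

A missed ESI 1 case is an under-triage of the most urgent patients, which is why these rows are treated as safety metrics in this card.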

Counterfactual Fairness (Race-Demographic Invariance)

200 paired records (50 pair_ids × 4 demographic variants):

Metric                      v49     v50     Δ
Pair-mean max ESI Δ         0.42    0.92    +0.50 ⚠️
Pairs perfectly invariant   72%     54%     -18pp
Pairs with ESI swing ≥ 2    12%     26%     +14pp ⚠️
Race-driven swing pairs     8       15      +7
Race mean swing             1.50    1.67    +0.17
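
The first three rows of this table can be derived from per-variant predictions grouped by pair_id. The sketch below is one plausible reading, not the evaluation script: it assumes a pair's "swing" is the maximum minus the minimum predicted ESI across its demographic variants, and the `(pair_id, predicted_esi)` record layout and the function name `fairness_summary` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fairness_summary(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """records: (pair_id, predicted_esi), one entry per demographic variant."""
    by_pair: Dict[str, List[int]] = defaultdict(list)
    for pair_id, esi in records:
        by_pair[pair_id].append(esi)
    if not by_pair:
        raise ValueError("no records supplied")
    # Assumed definition: swing = max - min predicted ESI within one pair.
    swings = [max(preds) - min(preds) for preds in by_pair.values()]
    n = len(swings)
    return {
        "pair_mean_max_delta": sum(swings) / n,
        "pct_invariant": 100.0 * sum(s == 0 for s in swings) / n,
        "pct_swing_ge_2": 100.0 * sum(s >= 2 for s in swings) / n,
    }


# Toy predictions: 4 pairs x 4 variants (illustrative, not real model output).
toy = (
    [("p1", 3)] * 4
    + [("p2", 2), ("p2", 4), ("p2", 3), ("p2", 2)]
    + [("p3", 2), ("p3", 3), ("p3", 2), ("p3", 2)]
    + [("p4", 1)] * 4
)
summary = fairness_summary(toy)
```

Because each pair differs only in demographics, any non-zero swing is a prediction change that the clinical content cannot explain.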

The engine ensemble, min(BERT, engine_v4), reduces the share of v50 pairs with an ESI swing ≥ 2 to 0%. The engine is REQUIRED if v50 is used at all.
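
Since a lower ESI number means higher acuity, taking the minimum keeps the more urgent of the two predictions. A minimal sketch of that rule, assuming both components emit an integer ESI level from 1 to 5 (the function name `ensemble_esi` is hypothetical, and the output format of engine_v4 is not documented in this card):

```python
def ensemble_esi(bert_esi: int, engine_esi: int) -> int:
    """Combine two ESI predictions by keeping the more urgent (lower) one."""
    for name, value in (("bert_esi", bert_esi), ("engine_esi", engine_esi)):
        if value not in range(1, 6):
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return min(bert_esi, engine_esi)


# The engine escalates a case that BERT under-triaged.
escalated = ensemble_esi(bert_esi=3, engine_esi=1)
# BERT's more urgent call is kept when the engine is less urgent.
kept = ensemble_esi(bert_esi=2, engine_esi=4)
```

By construction this rule can only raise acuity relative to BERT alone and never lower it.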

Aggregated head-to-head

  • 11 improvements (>+1pp)
  • 18 regressions (<-1pp)
  • 5 stable (±1pp)
  • NET: -2.5pp exact, -7.2pp ESI 1 recall vs v49
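
The bucketing above can be sketched as follows. This is an illustration under assumptions: the function name `bucket_deltas` is hypothetical, a delta of exactly ±1pp is counted as stable, and the sample uses only a handful of rows from the tables above rather than the full slice set behind the 11/18/5 counts.

```python
from typing import Dict


def bucket_deltas(deltas_pp: Dict[str, float], threshold_pp: float = 1.0) -> Dict[str, int]:
    """Count slices that improved, regressed, or stayed within the threshold."""
    counts = {"improvements": 0, "regressions": 0, "stable": 0}
    for delta in deltas_pp.values():
        if delta > threshold_pp:
            counts["improvements"] += 1
        elif delta < -threshold_pp:
            counts["regressions"] += 1
        else:
            counts["stable"] += 1
    return counts


# A few v50 - v49 deltas copied from the tables above (percentage points).
sample = {
    "MIMIC exact": 2.0,
    "MC-MED exact": 2.8,
    "MC-MED ESI 1 recall": -24.0,
    "Stroke exact": -8.3,
    "Sepsis ESI 1 R": -2.5,
}
counts = bucket_deltas(sample)
```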

What this model is good for

  • Research only: studying corpus-balance trade-offs
  • Ablation reference: documenting what gains came from v50 corpus additions (gender atypical MI synth, PMH dilution synth, ESI 3 boundary)
  • Engineering validation: testing that engine ensemble can rescue a fairness-regressed BERT

What this model is NOT good for

  • Clinical decisions: UNSAFE due to the MC-MED ESI 1 recall regression
  • Production triage: v49 outperforms it on safety metrics
  • Stanford-style telegraphic input: 24pp ESI 1 recall loss
  • Adversarial inputs: 27pp ESI 1 recall loss on the adversarial probe

Architecture

Identical to v49: BiomedBERT-base + 21-head V44MultiHeadBERT.

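import torch
from transformers import AutoTokenizer  # tokenizer setup is covered in the v49 inference code

# Load the raw weights on CPU. This returns a state dict, not a ready-to-use model.
state = torch.load("model.pt", map_location="cpu")
# The model architecture is V44MultiHeadBERT from train_bert_v44_bf16.py.
# Instantiate that class first, then load `state` into it before running inference.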
See the v49 repository for full inference code: vadimbelsky/bert-esi-triage-v49.

Training details

  • Base: BiomedBERT-base (~110M params)
  • Heads: 21 (esi_head + symptom_head + 19 perception/safety heads)
  • Training: 3 epochs, bf16, batch 64, lr 3e-5
  • Corpus: ~382K records (v50 corpus, predecessor of v52 tiered Pillar 4 corpus)
  • Resume strategy: continued from v50 epoch 2 best (epoch 3 finalized symptom_f1_micro = 0.366)
  • val esi_exact at end of training: 76.9%
  • val esi_adjacent at end of training: 97.9%
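
This card does not define the two validation metrics. A common reading, assumed here, is that esi_exact is the fraction of predictions equal to the reference ESI and esi_adjacent is the fraction within one level of it. A minimal sketch under that assumption (the function name is hypothetical):

```python
from typing import Sequence, Tuple


def esi_exact_and_adjacent(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (exact accuracy, within-one-level accuracy) for ESI labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Assumed definition: "adjacent" means |true - predicted| <= 1.
    adjacent = sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / n
    return exact, adjacent


# Toy labels for illustration only.
exact, adjacent = esi_exact_and_adjacent([1, 2, 3, 4, 5], [1, 3, 3, 2, 5])
```

Under this reading, adjacent accuracy is always at least as high as exact accuracy, and the gap between them measures how many errors are off by a single level.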

What v50 taught us (used to design v52)

  1. Tiered Pillar 4 weighting beats uniform weighting (it showed that 17% of "label_noise" records were ambiguous, not noisy)
  2. Symptom-head extraction must be re-run on all synth seeds
  3. ESI 1 archetype synth must NOT be added without engine extraction
  4. MC-MED ESI 1 corpus balance is critical (Pillar 1 fix in v52)
  5. Fairness regression must be measured via paired counterfactual records

These lessons are baked into v52 (training now, ETA 2.5h from v50 launch).

Maintainer

vadim.belsky@gmail.com

Citation

@misc{belsky2026esi_v50,
  title={BERT ESI Triage v50: Intermediate Research Checkpoint},
  author={Belsky, Vadim},
  year={2026},
  publisher={HuggingFace},
  note={Research checkpoint with documented regressions vs v49 production}
}

Files in this repository

  • model.pt: full model state dict (438 MB)
  • README.md: this file