BERT ESI Triage — v50 (RESEARCH CHECKPOINT)

⚠️ RESEARCH USE ONLY — NOT FOR CLINICAL DECISIONS

NOT A MEDICAL DEVICE. This model is an INTERMEDIATE training checkpoint released for research transparency and version-iteration history. It is NOT approved for clinical use and contains DOCUMENTED REGRESSIONS vs the production model (v49). Use v49 if you need the higher-quality release.

Do not enter real patient data.

What this is

bert-esi-triage-v50 is a 110M-parameter BiomedBERT classifier fine-tuned to assign Emergency Severity Index (ESI 1–5) acuity from triage-time chief complaint, vitals, and demographics. It is an INTERMEDIATE checkpoint in our v49 → v50 → v52 iteration history. v50 was trained on a corpus addressing several v49 gaps (PMH dilution, ESI 3 boundary, gender atypical MI) but introduced new regressions documented below.

For production use, please use v49: vadimbelsky/bert-esi-triage-v49.

Why this exists

The medical-triage research community benefits from seeing intermediate training checkpoints with HONEST scorecards. v50 represents:

Validated improvements on narrative dialects (ER-REASON +10.5pp, PMH dilution +15.4pp, gender atypical MI +29pp)
Validated regressions in ESI 1 recall (R) on compact-CC dialects (MC-MED -24pp, adversarial -27pp)
A data point for understanding the trade-off between narrative generalization and compact-CC specialization

This release is intended for researchers studying:

Training corpus impact on model generalization
Fairness regression mechanisms (race-driven swing pairs rose from 8 to 15 vs v49)
Engine-ensemble mitigation strategies (the engine ensemble cuts v50's 26% rate of ESI swings ≥ 2 to 0%)

v50 vs v49 Honest Scorecard

Improvements

Slice	v49	v50	Δ (pp)
MIMIC exact	54.7%	56.7%	+2.0
MC-MED exact	58.9%	61.7%	+2.8
OB emergency exact	65.3%	68.4%	+3.1
OB emergency ESI 1 R	85.0%	91.7%	+6.7
Anaphylaxis ESI 1 R	94.1%	97.1%	+3.0
ER-REASON narrative	44.5%	55.0%	+10.5
Gender atypical MI synth	52.0%	81.0%	+29.0
PMH dilution synth	23.1%	38.5%	+15.4
ESI 3 boundary synth	44.7%	61.7%	+17.0

Regressions

Slice	v49	v50	Δ (pp)
MC-MED ESI 1 recall	89.0%	65.0%	-24.0 ⚠️
MIMIC ESI 1 R	84.6%	75.6%	-9.1
Adversarial ESI 1 R	93.3%	66.7%	-26.7 ⚠️
Stroke exact	58.8%	50.5%	-8.3
Sepsis exact	62.4%	55.9%	-6.5
Sepsis ESI 1 R	97.5%	95.0%	-2.5
Stroke ESI 1 R	85.0%	80.0%	-5.0
Lukina exact	49.3%	46.3%	-3.0
MIETIC exact	83.5%	80.0%	-3.5
SES fairness slice	55.0%	26.0%	-29.0
SCD fairness slice	94.0%	80.0%	-14.0
Pediatric carved ESI 1 R	100.0%	87.5%	-12.5
Gender MI fairness	45.0%	32.0%	-13.0
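
The "ESI 1 R" rows above report recall on the most acute class. As a minimal sketch, assuming recall here means the fraction of reference ESI 1 cases that the model also labels ESI 1 (the card does not spell out the definition), it could be computed as follows. The function name `esi1_recall` and the toy labels are hypothetical.

```python
from typing import Sequence


def esi1_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true ESI 1 cases that were also predicted as ESI 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions for cases whose reference label is ESI 1.
    preds_for_esi1 = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not preds_for_esi1:
        raise ValueError("y_true contains no ESI 1 cases")
    return sum(p == 1 for p in preds_for_esi1) / len(preds_for_esi1)


# Toy labels, not taken from any benchmark in this card.
y_true = [1, 1, 3, 1, 2]
y_pred = [1, 2, 3, 1, 2]
recall = esi1_recall(y_true, y_pred)  # 2 of the 3 true ESI 1 cases are caught
print(f"ESI 1 recall: {recall:.1%}")
```

A missed ESI 1 case is an under-triage of the most urgent patients, which is why the card flags recall drops on this class separately from exact-match accuracy.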

Counterfactual Fairness (Race-Demographic Invariance)

200 paired records (50 pair_ids × 4 demographic variants):

Metric	v49	v50	Δ
Pair-mean max ESI Δ	0.42	0.92	+0.50 ⚠️
Pairs perfectly invariant	72%	54%	-18pp
Pairs with ESI swing ≥ 2	12%	26%	+14pp ⚠️
Race-driven swing pairs	8	15	+7
Race mean swing	1.50	1.67	+0.17
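
The pair-level metrics in this table can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluation script used for the card: it assumes the "max ESI Δ" of a pair is the spread (max minus min) of predicted ESI across that pair's demographic variants, and the function name `counterfactual_summary` and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def counterfactual_summary(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Summarize prediction swings across demographic variants.

    records: (pair_id, predicted_esi) for every variant of every pair.
    """
    by_pair: Dict[str, List[int]] = defaultdict(list)
    for pair_id, esi in records:
        by_pair[pair_id].append(esi)
    if not by_pair:
        raise ValueError("no records given")
    # Assumed definition: a pair's swing is max minus min predicted ESI.
    swings = [max(v) - min(v) for v in by_pair.values()]
    n = len(swings)
    return {
        "pair_mean_max_delta": sum(swings) / n,
        "pct_invariant": 100.0 * sum(s == 0 for s in swings) / n,
        "pct_swing_ge_2": 100.0 * sum(s >= 2 for s in swings) / n,
    }


# Two toy pairs with four variants each (made-up predictions).
toy = [("a", 2), ("a", 2), ("a", 2), ("a", 2),
       ("b", 1), ("b", 3), ("b", 2), ("b", 3)]
summary = counterfactual_summary(toy)
print(summary)
```

Because the variants of a pair differ only in demographics, any nonzero swing is attributable to the demographic fields rather than to clinical content.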

The engine ensemble, min(BERT, engine_v4), reduces the share of v50 pairs with an ESI swing ≥ 2 from 26% to 0%. The engine is REQUIRED if v50 is used at all.
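
The min(BERT, engine_v4) rule can be illustrated with a short sketch. Since ESI 1 is the most acute level, taking the minimum keeps whichever source assigns the more urgent level. The interface of engine_v4 is not described in this card, so the function below simply takes two integer ESI levels; `ensemble_esi` is a hypothetical name.

```python
def ensemble_esi(bert_esi: int, engine_esi: int) -> int:
    """Combine two ESI predictions by keeping the more urgent (lower) level."""
    for name, value in (("bert_esi", bert_esi), ("engine_esi", engine_esi)):
        if value not in (1, 2, 3, 4, 5):
            raise ValueError(f"{name} must be an ESI level from 1 to 5, got {value}")
    # ESI 1 is the most acute level, so min() never down-triages
    # relative to either input.
    return min(bert_esi, engine_esi)


# Made-up example: the engine flags a case that BERT scored as ESI 3.
combined = ensemble_esi(bert_esi=3, engine_esi=1)
print(combined)
```

The design trade-off is that min() can only raise urgency, so it protects against missed ESI 1 cases at the cost of possible over-triage when either source is too aggressive.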

Aggregated head-to-head

11 improvements (>+1pp)
18 regressions (<-1pp)
5 stable (±1pp)
NET: -2.5pp exact, -7.2pp ESI 1 recall vs v49
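
The three buckets above follow a ±1pp threshold. A minimal sketch of that bucketing is shown below, assuming a delta exactly at ±1pp counts as stable. The function name `tally_deltas` is hypothetical, and the input list is illustrative: +2.0 and -24.0 come from the tables above, while the remaining values are made up to exercise the boundaries.

```python
from typing import Dict, Iterable


def tally_deltas(deltas_pp: Iterable[float], threshold_pp: float = 1.0) -> Dict[str, int]:
    """Count slices as improved, regressed, or stable by their v50 - v49 delta."""
    counts = {"improved": 0, "regressed": 0, "stable": 0}
    for delta in deltas_pp:
        if delta > threshold_pp:
            counts["improved"] += 1
        elif delta < -threshold_pp:
            counts["regressed"] += 1
        else:
            # Anything within ±threshold_pp (inclusive) is treated as stable.
            counts["stable"] += 1
    return counts


# +2.0 (MIMIC exact) and -24.0 (MC-MED ESI 1 recall) are from this card;
# 0.4, -1.0, and 1.0 are made-up boundary cases.
counts = tally_deltas([2.0, -24.0, 0.4, -1.0, 1.0])
print(counts)
```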

What this model is good for

Research only: studying corpus-balance trade-offs
Ablation reference: documenting what gains came from v50 corpus additions (gender atypical MI synth, PMH dilution synth, ESI 3 boundary)
Engineering validation: testing that engine ensemble can rescue a fairness-regressed BERT

What this model is NOT good for

Clinical decisions — UNSAFE due to MC-MED ESI 1 R regression
Production triage — v49 outperforms it on safety metrics
Stanford-style telegraphic input — 24pp ESI 1 catch loss
Adversarial inputs — 27pp ESI 1 catch loss on adversarial probe

Architecture

Identical to v49: BiomedBERT-base + 21-head V44MultiHeadBERT.

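import torch
from transformers import AutoTokenizer

# Tokenizer: assumed to be the base model's (not confirmed for v50)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
)

# Load the weights; model.pt is a state dict, not a full model object
state = torch.load("model.pt", map_location="cpu")
# The architecture is V44MultiHeadBERT from train_bert_v44_bf16.py:
# instantiate it, then call model.load_state_dict(state)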

See the v49 repository for full inference code: vadimbelsky/bert-esi-triage-v49.
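
Without the V44MultiHeadBERT class, the checkpoint can still be inspected through its parameter names. The sketch below groups state-dict keys by their top-level module, which is one way to check how many heads a checkpoint carries. The helper name `top_level_modules` and the example keys are hypothetical; the real key names in model.pt may differ.

```python
from collections import Counter
from typing import Iterable


def top_level_modules(keys: Iterable[str]) -> Counter:
    """Count parameters per top-level module, e.g. 'esi_head.weight' -> 'esi_head'."""
    return Counter(key.split(".", 1)[0] for key in keys)


# Hypothetical key names, for illustration only.
example_keys = [
    "bert.embeddings.word_embeddings.weight",
    "esi_head.weight",
    "esi_head.bias",
]
modules = top_level_modules(example_keys)
print(modules.most_common())
```

With a loaded checkpoint, the same helper would be called as `top_level_modules(state.keys())`, where `state` is the dict returned by `torch.load`.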

Training details

Base: BiomedBERT-base (~110M params)
Heads: 21 (esi_head + symptom_head + 19 perception/safety heads)
Training: 3 epochs, bf16, batch 64, lr 3e-5
Corpus: ~382K records (v50 corpus, predecessor of v52 tiered Pillar 4 corpus)
Resume strategy: continued from v50 epoch 2 best (epoch 3 finalized symptom_f1_micro = 0.366)
val esi_exact at end of training: 76.9%
val esi_adjacent at end of training: 97.9%
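
The two validation metrics can be sketched as follows. This assumes esi_exact is plain exact-match accuracy and esi_adjacent counts a prediction as correct when it is within one ESI level of the reference; the card does not define either, so treat both definitions as assumptions. Function names and labels are hypothetical.

```python
from typing import Sequence


def exact_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions that match the reference ESI level exactly."""
    return adjacent_accuracy(y_true, y_pred, tolerance=0)


def adjacent_accuracy(y_true: Sequence[int], y_pred: Sequence[int],
                      tolerance: int = 1) -> float:
    """Share of predictions within `tolerance` ESI levels of the reference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(t - p) <= tolerance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy labels, not drawn from the validation set.
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 3, 3, 2, 5]
exact = exact_accuracy(y_true, y_pred)        # 3 of 5 match exactly
adjacent = adjacent_accuracy(y_true, y_pred)  # 4 of 5 are within one level
print(f"exact={exact:.1%} adjacent={adjacent:.1%}")
```

Adjacent accuracy is always at least as high as exact accuracy, and it hides the direction of the error, so it should be read alongside ESI 1 recall rather than in place of it.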

What v50 taught us (used to design v52)

Tiered Pillar 4 weighting beats uniform weighting (17% of records flagged "label_noise" proved ambiguous rather than noisy)
Symptom-head extraction must be re-run on all synth seeds
ESI 1 archetype synth must NOT be added without engine extraction
MC-MED ESI 1 corpus balance is critical — Pillar 1 fix in v52
Fairness regression must be measured via paired counterfactual records

These lessons are baked into v52 (training now, ETA 2.5h from v50 launch).

Maintainer

vadim.belsky@gmail.com

Citation

@misc{belsky2026esi_v50,
  title={BERT ESI Triage v50: Intermediate Research Checkpoint},
  author={Belsky, Vadim},
  year={2026},
  publisher={HuggingFace},
  note={Research checkpoint with documented regressions vs v49 production}
}

Files in this repository

model.pt — full model state dict (438 MB)
README.md — this file

Model tree for vadimbelsky/bert-esi-triage-v50

Base model

microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
