SMB-v1-1.7B-Structure
Documentation & Quickstart
For a comprehensive guide on getting started, architecture details, and advanced usage, please visit our official documentation: 📖 SMB-v1 Quickstart Guide
Model Details
- Model Name: SMB-v1-1.7B-Structure
- Organization: Standard Model Biomedicine
- Model Family: SMB-v1 (Biomedical Foundational Models)
- LLM Backbone: Qwen3-1.7B
- Training Method: SFT + JEPA Multi-objective
- License: Apache 2.0
Model Description
SMB-v1-1.7B-Structure is the initial release of the SMB-v1 family, specifically engineered to model the complex, time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.
Unlike general-purpose models, SMB-v1 is designed to ingest and synthesize diverse structured modalities across the patient journey, including:
- Temporal Physiological Signals: Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
- Clinical Events & Phenotypes: Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
- Therapeutic Interventions: Integrating complex treatment histories—including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions—to understand causal treatment-response dynamics.
- Molecular & Genomic Profiles: Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
- Oncologic Staging & Outcomes: Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.
Note: While the full SMB-v1 family will introduce unstructured modalities, this -Structure variant establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.
Intended Use Cases
This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:
- Predictive Risk Stratification: Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
- Treatment Response Modeling: Simulating potential patient outcomes under different therapeutic regimens.
- Patient Similarity Search: Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
- Clinical Trial Matching: Aligning complex patient states with structured eligibility criteria.
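As a rough illustration of the patient similarity use case, the sketch below ranks a cohort by cosine similarity to a query patient. It assumes each patient has already been reduced to one fixed-length vector. The function name `top_k_similar` and the random placeholder vectors are hypothetical and are not part of the model or of smb_biopan_utils.

```python
import numpy as np

def top_k_similar(query, cohort, k=3):
    """Return indices and cosine similarities of the k cohort rows closest to query."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    q = query / (np.linalg.norm(query) + eps)
    c = cohort / (np.linalg.norm(cohort, axis=1, keepdims=True) + eps)
    sims = c @ q                      # cosine similarity per cohort row
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]

# Placeholder data: 100 random "patient" vectors of dimension 16.
rng = np.random.default_rng(0)
cohort = rng.normal(size=(100, 16))
# A query that is a slightly perturbed copy of cohort row 42.
query = cohort[42] + 0.01 * rng.normal(size=16)

idx, scores = top_k_similar(query, cohort, k=3)
print(idx, scores)
```

Cosine similarity is only one possible choice; whether it reflects clinically meaningful similarity for these embeddings would need to be validated on real cohorts.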
Usage
To use this model effectively, your input data must be in the MEDS (Medical Event Data Standard) format and processed using the smb_biopan_utils package. This ensures that patient event timelines are correctly serialized into the structured text format the model expects.
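For orientation, here is a minimal sketch of what a MEDS-style event table might look like in pandas. The `subject_id`, `time`, `code`, and `numeric_value` columns follow the core MEDS schema; the `table` column is assumed from the comment in the inference example, and every code and value shown is a made-up placeholder. Check the smb_biopan_utils documentation for the exact columns it requires.

```python
import pandas as pd

nan = float("nan")  # events without a numeric measurement carry NaN

# Hypothetical events for one subject; codes and values are placeholders.
events = [
    {"subject_id": "patient_123", "time": "2023-03-01", "code": "ICD10CM//C34.90",
     "numeric_value": nan, "table": "condition"},
    {"subject_id": "patient_123", "time": "2023-03-05", "code": "LAB//HEMOGLOBIN",
     "numeric_value": 11.2, "table": "lab"},
    {"subject_id": "patient_123", "time": "2023-04-10", "code": "RXNORM//CARBOPLATIN",
     "numeric_value": nan, "table": "medication"},
    {"subject_id": "patient_123", "time": "2024-02-15", "code": "LAB//HEMOGLOBIN",
     "numeric_value": 10.4, "table": "lab"},
]

df_meds = pd.DataFrame(events)
df_meds["time"] = pd.to_datetime(df_meds["time"])
df_meds = df_meds.sort_values(["subject_id", "time"]).reset_index(drop=True)

# Keep only events at or before the prediction timepoint.
end_time = pd.Timestamp("2024-01-01")
history = df_meds[df_meds["time"] <= end_time]
print(history)
```

Note that the MEDS specification defines `subject_id` as an integer; the string ID here simply mirrors the `"patient_123"` value used in the inference example.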
1. Installation
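Install Hugging Face Transformers and pandas, then the data utility package. The inference example below also relies on PyTorch, on accelerate (required by device_map="auto"), and on a Parquet engine such as pyarrow (required by pd.read_parquet); install them if they are not already present in your environment: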
pip install transformers pandas
pip install git+https://github.com/standardmodelbio/smb-biopan-utils.git
2. Inference Example
The following example demonstrates how to load the model, process raw MEDS data using process_ehr_info, and generate a patient representation.
import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from smb_biopan_utils import process_ehr_info

# 1. Load Model and Tokenizer
model_id = "standardmodelbio/SMB-v1-1.7B-Structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto"
)

# 2. Load Patient Data (MEDS Format)
# Ensure your dataframe contains columns for 'time', 'code', 'table', etc.
df_meds = pd.read_parquet("path/to/patient_data.parquet")

# 3. Format Data for Inference
# This utility converts the DataFrame into the structured text format
# (e.g., <conditions>...</conditions>) expected by SMB-v1.
input_text = process_ehr_info(
    df=df_meds,
    subject_id="patient_123",  # Specify the subject to process
    end_time=pd.Timestamp("2024-01-01")  # Prediction timepoint
)

# 4. Generate Representation
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(
        input_ids=inputs.input_ids,
        output_hidden_states=True,
        return_dict=True
    )

# Extract the last hidden state as the patient representation.
# Shape: (batch_size, sequence_length, hidden_size), i.e. one vector per input token.
patient_embedding = outputs.hidden_states[-1]
print(f"Patient Representation Shape: {patient_embedding.shape}")
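The last hidden state holds one vector per token, so downstream tasks that need a single vector per patient usually apply some form of pooling. The sketch below shows masked mean pooling as one common option; it is not necessarily the pooling the authors use, and the helper name `mean_pool` is hypothetical. It uses NumPy with small stand-in arrays so that it runs without loading the model.

```python
import numpy as np

def mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden:         (batch, seq_len, hidden_size) float array
    attention_mask: (batch, seq_len) array of 0/1
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                   # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1.0, None)          # avoid division by zero
    return summed / counts

# Stand-in for model output: 2 sequences, 3 tokens each, hidden size 4.
hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
# The second sequence has one padded position at the end.
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = mean_pool(hidden, attention_mask)
print(pooled.shape)  # (2, 4)
```

To apply this to the inference example, the tensors can be converted first, for instance with `outputs.hidden_states[-1].detach().float().cpu().numpy()` and `inputs.attention_mask.cpu().numpy()`. Last-token pooling is another option often used with decoder-only backbones.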
Citation
If you use this model in your research or application, please cite:
@misc{biopan_omni,
  author = {standardmodelbio},
  title = {SMB-v1-1.7B-Structure},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure}
}