SMB-v1-1.7B-Structure

Documentation & Quickstart

For a comprehensive guide on getting started, architecture details, and advanced usage, please visit our official documentation: 📖 SMB-v1 Quickstart Guide

Model Details

Model Name: SMB-v1-1.7B-Structure
Organization: Standard Model Biomedicine
Model Family: SMB-v1 (Biomedical Foundational Models)
LLM Backbone: Qwen3-1.7B
Training Method: SFT + JEPA Multi-objective
License: Apache 2.0

Model Description

SMB-v1-1.7B-Structure is the initial release of the SMB-v1 family, specifically engineered to model the complex, time-varying dynamics of cancer biology through structured clinical signals. It treats structured clinical data as a multimodal environment, fusing heterogeneous data streams into a unified patient state representation.

Unlike general-purpose models, SMB-v1 is designed to ingest and synthesize diverse structured modalities across the patient journey, including:

Temporal Physiological Signals: Modeling continuous longitudinal trajectories of laboratory values, vital signs, and functional status markers to capture disease progression and physiological drift over time.
Clinical Events & Phenotypes: Encoding discrete, high-cardinality sequences of diagnosis codes (ICD), procedure events (CPT), and adverse events to reconstruct the semantic history of the patient's care.
Therapeutic Interventions: Integrating complex treatment histories—including systemic therapies (chemotherapy, immunotherapy), radiation dosing schedules, and surgical interventions—to understand causal treatment-response dynamics.
Molecular & Genomic Profiles: Embedding high-dimensional static and dynamic biomarker panels (somatic mutations, gene expression signatures, proteomic markers) directly alongside clinical phenotypes.
Oncologic Staging & Outcomes: Processing structured tumor staging (TNM), histology classifications, and survival endpoints to anchor representations in ground-truth biological states.

Note: While the full SMB-v1 family will introduce unstructured modalities, this -Structure variant establishes the foundation using the highest-fidelity structured signals available in modern oncology data warehouses.

Intended Use Cases

This model is optimized for downstream tasks requiring a deep understanding of longitudinal patient history:

Predictive Risk Stratification: Forecasting adverse events, toxicity, or rapid progression based on historical trajectories.
Treatment Response Modeling: Simulating potential patient outcomes under different therapeutic regimens.
Patient Similarity Search: Identifying cohorts with similar biological and clinical progressions for real-world evidence generation.
Clinical Trial Matching: Aligning complex patient states with structured eligibility criteria.
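As an illustration of the patient similarity use case, the sketch below ranks a cohort by cosine similarity to a query patient. It assumes each patient has already been reduced to one fixed-size embedding vector; this card does not specify how that vector should be obtained or which metric works best, so the `top_k_similar` helper, the choice of cosine similarity, and the toy 2-dimensional vectors are all hypothetical.

```python
import numpy as np

def top_k_similar(query, cohort, k=3):
    """Return (indices, scores) of the k cohort rows most cosine-similar to query."""
    q = query / np.linalg.norm(query)                            # unit-length query
    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)   # unit-length rows
    scores = c @ q                                               # cosine similarity per row
    order = np.argsort(-scores)[:k]                              # highest similarity first
    return order, scores[order]

# Toy embeddings standing in for real patient vectors (values are made up).
cohort = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.1])

idx, scores = top_k_similar(query, cohort, k=2)
print(idx, scores)
```

For cohorts too large to score exhaustively, an approximate nearest-neighbor index would usually replace the dense matrix product.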

Usage

To use this model effectively, your input data must be in the MEDS (Medical Event Data Standard) format and processed using the smb_biopan_utils package. This ensures that patient event timelines are correctly serialized into the structured text format the model expects.
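As a rough illustration of the input this step expects, the sketch below builds a few MEDS-style event rows and checks them before they would be handed to the formatter. The column names `subject_id`, `time`, `code`, and `numeric_value` follow the core MEDS schema, and the `table` column is taken from the comment in the inference example in this card. The codes, values, and the `check_meds_rows` helper are hypothetical and not part of smb_biopan_utils. The cutoff filter only mimics what a prediction timepoint implies; how `process_ehr_info` treats events at exactly `end_time` is not documented here.

```python
from datetime import datetime

# Columns assumed to be required; 'numeric_value' may be empty for non-numeric events.
REQUIRED_COLUMNS = ("subject_id", "time", "code", "table")

# Hypothetical events for one patient, deliberately out of order.
rows = [
    {"subject_id": "patient_123", "time": datetime(2023, 6, 15),
     "code": "LOINC/718-7", "numeric_value": 13.2, "table": "labs"},
    {"subject_id": "patient_123", "time": datetime(2023, 3, 1),
     "code": "ICD10CM/C34.90", "numeric_value": None, "table": "conditions"},
    {"subject_id": "patient_123", "time": datetime(2024, 2, 10),
     "code": "CPT/32480", "numeric_value": None, "table": "procedures"},
]

def check_meds_rows(rows, end_time):
    """Validate required columns, drop events at or after end_time, sort by time."""
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if row.get(c) is None]
        if missing:
            raise ValueError(f"row {i} is missing {missing}")
    kept = [r for r in rows if r["time"] < end_time]
    return sorted(kept, key=lambda r: r["time"])

history = check_meds_rows(rows, end_time=datetime(2024, 1, 1))
print([r["code"] for r in history])
```

Rows validated this way can be turned into the DataFrame the utility takes with `pd.DataFrame(history)`.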

1. Installation

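Install the modeling dependencies and the data utility package. Loading with device_map="auto" requires accelerate, and pd.read_parquet requires a Parquet engine such as pyarrow:

pip install transformers torch accelerate pandas pyarrow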
pip install git+https://github.com/standardmodelbio/smb-biopan-utils.git

2. Inference Example

The following example demonstrates how to load the model, process raw MEDS data using process_ehr_info, and generate a patient representation.

import pandas as pd
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from smb_biopan_utils import process_ehr_info

# 1. Load Model and Tokenizer
model_id = "standardmodelbio/SMB-v1-1.7B-Structure"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)
model.eval()  # Inference only: disable dropout

# 2. Load Patient Data (MEDS Format)
# Ensure your dataframe contains columns for 'time', 'code', 'table', etc.
df_meds = pd.read_parquet("path/to/patient_data.parquet")

# 3. Format Data for Inference
# This utility converts the DataFrame into the structured text format
# (e.g., <conditions>...</conditions>) expected by SMB-v1.
input_text = process_ehr_info(
    df=df_meds,
    subject_id="patient_123",  # Specify the subject to process
    end_time=pd.Timestamp("2024-01-01"),  # Prediction timepoint
)

# 4. Generate Representation
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Gradients are not needed for inference; **inputs also forwards the attention mask
with torch.no_grad():
    outputs = model(
        **inputs,
        output_hidden_states=True,
        return_dict=True,
    )

# Extract the last-layer hidden states as the patient representation.
# This holds one vector per input token: (batch_size, sequence_length, hidden_size)
patient_embedding = outputs.hidden_states[-1]
print(f"Patient Representation Shape: {patient_embedding.shape}")
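The last hidden state holds one vector per input token, so downstream tasks that need a single vector per patient have to pool over the sequence dimension. This card does not say which pooling the authors use; the sketch below shows masked mean pooling as one common option, written with NumPy on toy data. The `masked_mean_pool` helper is hypothetical, and taking the final token's vector is another reasonable choice for a causal model.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden: (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                    # (batch, hidden_size)
    counts = np.clip(mask.sum(axis=1), 1.0, None)           # guard against empty rows
    return summed / counts

# Toy stand-in for outputs.hidden_states[-1]: 2 sequences, 3 tokens, 4 dimensions.
hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])  # second sequence has one pad token

pooled = masked_mean_pool(hidden, attention_mask)
print(pooled.shape)
```

With the tensors from the example above, the two inputs could be obtained as `patient_embedding.detach().float().cpu().numpy()` and `inputs.attention_mask.cpu().numpy()`.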

Citation

If you use this model in your research or application, please cite:

@misc{biopan_omni,
  author = {standardmodelbio},
  title = {SMB-v1-1.7B-Structure},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure}
}
