AI & ML interests
Clinical Reasoning, Medical Diagnosis, Bayesian Networks, Healthcare AI, LLM Fine-tuning
Clinical Reasoning Hub
Clinical Reasoning Labs for Medical Diagnostic Accuracy
Advancing clinical reasoning in compact language models through structured diagnostic methodology and evidence-based training
Mission
Clinical Reasoning Hub develops specialized medical AI models that demonstrate how structured training methodology can dramatically improve diagnostic reasoning in parameter-efficient architectures. Our research focuses on a fundamental question:
Can an 8B-parameter model, trained with the right clinical reasoning framework, approach the diagnostic accuracy of models 10–80× its size?
Our results suggest yes: with the right approach, compact models can achieve clinically meaningful performance.
Research Approach
Our training methodology is built on three core pillars:
Structured Clinical Reasoning Chains: Models are trained on multi-phase diagnostic reasoning that mirrors real clinical decision-making, moving from gathering evidence to generating differentials, weighing likelihood ratios, arriving at a diagnosis, and self-correcting through verification. This is not simple question-answer memorization; it is structured thinking (a sketch of one such record follows this list).
Evidence-Grounded Training Curricula: Training data is curated from diverse medical knowledge sources, including USMLE-style reasoning, knowledge-graph-grounded clinical pathways, peer-reviewed evidence synthesis, clinical case discussions, and examination content spanning multiple international medical education systems.
Base Model Quality as a Multiplier: We validate our methodology across different base architectures to confirm that improvements are driven by training methodology, not base model artifacts. The same pipeline applied to stronger base models produces compounding gains, confirming genuine capability transfer.
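To make the first pillar concrete, the sketch below shows what a single multi-phase training record could look like. It is purely illustrative: the field names, the clinical content, and the `build_prompt` helper are assumptions, since the actual data schema is not published.

```python
# Hypothetical sketch of one multi-phase clinical reasoning training record.
# All field names and the build_prompt helper are illustrative assumptions;
# the actual data schema is not part of this release.
record = {
    "case": (
        "54-year-old with acute chest pain radiating to the left arm, "
        "diaphoresis, and ST-segment elevation in leads II, III, and aVF."
    ),
    "phases": {
        "evidence": ["acute chest pain", "diaphoresis", "inferior ST elevation"],
        "differentials": ["inferior STEMI", "pericarditis", "aortic dissection"],
        "weighing": (
            "ST elevation localized to the inferior leads with classic "
            "ischemic features favors STEMI over diffuse-elevation pericarditis."
        ),
        "diagnosis": "Inferior ST-elevation myocardial infarction",
        "verification": (
            "Re-check: aortic dissection typically presents with tearing pain "
            "and pulse deficits, which are absent here."
        ),
    },
}

def build_prompt(rec: dict) -> str:
    """Flatten one record into a single supervised fine-tuning text example."""
    steps = "\n".join(
        f"[{name.upper()}] {', '.join(v) if isinstance(v, list) else v}"
        for name, v in rec["phases"].items()
    )
    return f"Case: {rec['case']}\n{steps}"

print(build_prompt(record))
```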
Models
| Model | Base Architecture | Params | Medical Avg | Key Strength |
|---|---|---|---|---|
| Diagnostic-Reasoning-Q3X1 | Qwen3-8B | 8B | 76.4% | Strongest overall; 89.7% Professional Medicine |
| Diagnostic-Medicine-R1 | DeepSeek-R1-Distill-Llama-8B | 8B | 64.5% | Validated methodology on reasoning-distilled architecture |
Both models are fine-tuned using QLoRA (rank 128, alpha 256) on approximately 92K curated medical training examples, with a stabilization phase to prevent catastrophic forgetting.
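For orientation, these hyperparameters map onto the Hugging Face `peft`/`bitsandbytes` stack roughly as in the sketch below. Only the rank, alpha, and BF16 compute dtype come from this card; the dropout value, target modules, and use of the Qwen base checkpoint shown here are assumptions.

```python
# Minimal QLoRA setup sketch (rank 128, alpha 256) using the Hugging Face
# peft + transformers + bitsandbytes stack. Target modules, dropout, and the
# checkpoint name are assumptions for illustration, not the released config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the reported BF16 precision
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",                        # base architecture per the table above
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=128,                                  # rank reported in this card
    lora_alpha=256,                         # alpha reported in this card
    lora_dropout=0.05,                      # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```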
Benchmark Results
All evaluations were performed using lm-evaluation-harness v0.4.11 with zero-shot log-likelihood scoring on official test splits. No benchmark contamination: training data contains no benchmark test questions.
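For reproducibility, an equivalent run can be scripted through the harness's Python API, roughly as sketched below. The repository id is assumed from the organization name, and the task identifiers shown are common v0.4.x names; verify both against your installed harness version.

```python
# Sketch of reproducing the zero-shot log-likelihood evaluation with
# lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness).
# Task names vary across harness versions; check them with `lm-eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Assumed repo id, derived from the organization name.
    model_args="pretrained=Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1,dtype=bfloat16",
    tasks=["medqa_4options", "medmcqa", "pubmedqa", "mmlu_professional_medicine"],
    num_fewshot=0,          # zero-shot, as reported above
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```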
Diagnostic-Reasoning-Q3X1 (Flagship)
| Benchmark | Base Model | Q3X1 (Ours) | Δ Improvement |
|---|---|---|---|
| Professional Medicine | 58.9% | 89.7% | +30.8% |
| Medical Genetics | 58.6% | 88.0% | +29.4% |
| Clinical Knowledge | 61.0% | 86.4% | +25.4% |
| Anatomy | 57.6% | 79.3% | +21.7% |
| MedQA (USMLE-style) | 43.9% | 66.3% | +22.4% |
| PubMedQA | 48.3% | 66.6% | +18.3% |
| MedMCQA | 37.3% | 58.6% | +21.3% |
| Overall Average | 52.2% | 76.4% | +24.2% |
Diagnostic-Medicine-R1
| Benchmark | Base Model | R1 (Ours) | Δ Improvement |
|---|---|---|---|
| Professional Medicine | 58.9% | 73.5% | +14.6% |
| Anatomy | 57.6% | 73.3% | +15.7% |
| Clinical Knowledge | 61.0% | 71.3% | +10.3% |
| PubMedQA | 48.3% | 68.0% | +19.7% |
| Medical Genetics | 58.6% | 62.0% | +3.4% |
| MedQA (USMLE-style) | 43.9% | 56.8% | +12.9% |
| MedMCQA | 37.3% | 46.4% | +9.1% |
| Overall Average | 52.2% | 64.5% | +12.3% |
Cross-Version Progression
Our iterative development shows consistent, compounding improvements:
| Version | Base Architecture | Avg Accuracy | Δ vs Base |
|---|---|---|---|
| Base (untuned) | DeepSeek-R1-Distill-Llama-8B | 52.2% | – |
| V7 | DeepSeek-R1-Distill-Llama-8B | 57.6% | +5.4% |
| V8 (Diagnostic-Medicine-R1) | DeepSeek-R1-Distill-Llama-8B | 64.5% | +12.3% |
| V9 (Diagnostic-Reasoning-Q3X1) | Qwen3-8B | 76.4% | +24.2% |
Context: 8B-Class Comparison
| Model | MedQA | Prof. Medicine | Clinical Knowledge |
|---|---|---|---|
| Diagnostic-Reasoning-Q3X1 (Ours) | 66.3% | 89.7% | 86.4% |
| MedReason-8B (published) | 61.7% | – | – |
| Qwen3-8B (base, untuned) | 43.9% | 58.9% | 61.0% |
| DeepSeek-R1-Distill-Llama-8B (base) | ~43.9% | ~58.9% | ~61.0% |
Our Q3X1 model's MedQA score of 66.3% exceeds published 8B-class medical models such as MedReason-8B (61.7%) and approaches the performance of several 14B-class models.
Key Findings
Methodology transfers across architectures. The same training pipeline produces substantial gains on both DeepSeek-R1-Distill-Llama-8B (V8) and Qwen3-8B (V9), confirming our approach is not architecture-dependent.
Base model quality acts as a multiplier. Qwen3-8B's stronger pretraining foundation (36T tokens vs. 15T for the Llama 3.1 base) amplifies the effect of our clinical reasoning training, producing nearly double the improvement (+24.2% vs. +12.3%).
Structured reasoning training disproportionately helps clinical domains. The largest gains are in Professional Medicine (+30.8%) and Medical Genetics (+29.4%), the domains where step-by-step clinical reasoning is most valuable.
Compact models can punch far above their weight. An 8B model scoring 89.7% on Professional Medicine demonstrates that parameter count is not the limiting factor for clinical reasoning; training methodology is.
Training Overview
| Detail | Value |
|---|---|
| Method | QLoRA (rank 128, alpha 256) with cosine decay |
| Hardware | Single NVIDIA H100 80GB |
| Training Data | ~92K curated medical reasoning examples |
| Training Phases | 3 epochs main + 1 epoch stabilization |
| Format | Multi-phase structured clinical reasoning chains |
| Q3X1 Training Time | ~16.3 hours total |
| R1 Training Time | ~10.2 hours total |
| Precision | BF16 |
The training data spans USMLE-style reasoning, knowledge-graph-grounded clinical pathways, evidence synthesis from biomedical literature, clinical textbook cases, and international medical examination content with real explanations. The stabilization phase consolidates gains and prevents catastrophic forgetting of base model capabilities.
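Expressed with `transformers.TrainingArguments`, the reported schedule might look like the following sketch. The epoch counts, cosine scheduler, and BF16 come from the table above; the learning rates and batch sizes are assumptions, as they are not published.

```python
# Sketch of the two-phase schedule: 3 main epochs with cosine decay, then a
# 1-epoch stabilization pass at a lower learning rate. All numeric values
# other than epoch counts, BF16, and the cosine schedule are assumptions.
from transformers import TrainingArguments

main_phase = TrainingArguments(
    output_dir="out/main",
    num_train_epochs=3,                 # main phase per the table above
    lr_scheduler_type="cosine",         # cosine decay per the table above
    learning_rate=2e-4,                 # assumption: typical QLoRA value
    bf16=True,                          # BF16 precision per the table above
    per_device_train_batch_size=4,      # assumption
    gradient_accumulation_steps=8,      # assumption
)

stabilization_phase = TrainingArguments(
    output_dir="out/stabilization",
    num_train_epochs=1,                 # stabilization phase per the table above
    lr_scheduler_type="cosine",
    learning_rate=2e-5,                 # assumption: reduced rate to consolidate
    bf16=True,
    per_device_train_batch_size=4,      # assumption
)
```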
Intended Use
These models are released for research purposes in medical AI, clinical NLP, and healthcare reasoning evaluation. They are intended to advance the study of clinical reasoning in language models.
⚠️ These models are NOT intended for clinical decision-making, medical diagnosis, or patient care. They are research artifacts that demonstrate training methodology effectiveness. Any clinical application would require extensive additional validation, regulatory review, and institutional oversight.
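For research evaluation, the models should load through the standard `transformers` API. A minimal sketch follows; the repository id is assumed from the organization name and may differ from the published one.

```python
# Minimal research-use inference sketch via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

# Research prompt only; see the warning above regarding clinical use.
prompt = "A 61-year-old presents with progressive dyspnea on exertion. Reason step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```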
Limitations
These models have been evaluated primarily on English-language multiple-choice medical benchmarks and are not validated on open-ended clinical scenarios, multi-turn dialogue, or real patient data. Performance on rare diseases and edge cases has not been characterized. The models may exhibit hallucination or confident incorrect reasoning typical of language models. They are not a substitute for qualified medical professionals.
License
Diagnostic-Reasoning-Q3X1 (Qwen3-8B-based) is released under the Apache 2.0 License, subject to the base model's license terms. Diagnostic-Medicine-R1 (DeepSeek-R1-Distill-Llama-8B-based) is subject to the Meta Llama 3.1 Community License.
Both models permit commercial use under their respective base model license terms. Users are responsible for ensuring compliance with all applicable license conditions.
Citation
```bibtex
@misc{clinical-reasoning-hub-2026,
  title={Structured Clinical Reasoning Training for Compact Medical Language Models},
  author={Clinical Reasoning Hub},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Clinical-Reasoning-Hub}
}
```