AI & ML interests
Clinical Reasoning, Medical Diagnosis, Bayesian Networks, Healthcare AI, LLM Fine-tuning
Clinical Reasoning Hub
Clinical Reasoning Labs for Medical Diagnostic Accuracy
Advancing clinical reasoning in compact language models through structured diagnostic methodology and evidence-based training
Mission
Clinical Reasoning Hub develops specialized medical AI models that demonstrate how structured training methodology can dramatically improve diagnostic reasoning in parameter-efficient architectures. Our research focuses on a fundamental question:
Can an 8B-parameter model, trained with the right clinical reasoning framework, approach the diagnostic accuracy of models 10–80× its size?
Our results suggest yes: with the right approach, compact models can achieve clinically meaningful performance.
Research Approach
Our training methodology is built on three core pillars:
Structured Clinical Reasoning Chains: Models are trained on multi-phase diagnostic reasoning that mirrors real clinical decision-making, moving from gathering evidence to generating differentials, weighing likelihood ratios, arriving at a diagnosis, and self-correcting through verification. This is not simple question-answer memorization; it is structured thinking (a sketch of one such record follows this list).
Evidence-Grounded Training Curricula: Training data is curated from diverse medical knowledge sources, including USMLE-style reasoning, knowledge-graph-grounded clinical pathways, peer-reviewed evidence synthesis, clinical case discussions, and examination content spanning multiple international medical education systems.
Base Model Quality as a Multiplier: We validate our methodology across different base architectures to confirm that improvements are driven by training methodology, not base model artifacts. The same pipeline applied to stronger base models produces compounding gains, confirming genuine capability transfer.
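To make the first pillar concrete, the sketch below shows what a single multi-phase training record could look like. It is purely illustrative: the field names, the clinical content, and the `build_prompt` helper are assumptions, since the actual data schema is not published.

```python
# Hypothetical sketch of one multi-phase clinical reasoning training record.
# All field names and the build_prompt helper are illustrative assumptions;
# the actual data schema is not part of this release.
record = {
    "case": (
        "54-year-old with acute chest pain radiating to the left arm, "
        "diaphoresis, and ST-segment elevation in leads II, III, and aVF."
    ),
    "phases": {
        "evidence": ["acute chest pain", "diaphoresis", "inferior ST elevation"],
        "differentials": ["inferior STEMI", "pericarditis", "aortic dissection"],
        "weighing": (
            "ST elevation localized to the inferior leads with classic "
            "ischemic features favors STEMI over diffuse-elevation pericarditis."
        ),
        "diagnosis": "Inferior ST-elevation myocardial infarction",
        "verification": (
            "Re-check: aortic dissection typically presents with tearing pain "
            "and pulse deficits, which are absent here."
        ),
    },
}

def build_prompt(rec: dict) -> str:
    """Flatten one record into a single supervised fine-tuning text example."""
    steps = "\n".join(
        f"[{name.upper()}] {', '.join(v) if isinstance(v, list) else v}"
        for name, v in rec["phases"].items()
    )
    return f"Case: {rec['case']}\n{steps}"

print(build_prompt(record))
```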
Models
| Model | Base Architecture | Params | Medical Avg | Key Strength |
|---|---|---|---|---|
| Diagnostic-Reasoning-Q3X1 | Qwen3-8B | 8B | 76.4% | Strongest overall; 89.7% Professional Medicine |
| Diagnostic-Medicine-R1 | DeepSeek-R1-Distill-Llama-8B | 8B | 64.5% | Validated methodology on reasoning-distilled architecture |
Both models are fine-tuned using QLoRA (rank 128, alpha 256) on approximately 92K curated medical training examples, with a stabilization phase to prevent catastrophic forgetting.
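For orientation, these hyperparameters map onto the Hugging Face `peft`/`bitsandbytes` stack roughly as in the sketch below. Only the rank, alpha, and BF16 compute dtype come from this card; the dropout value, target modules, and use of the Qwen base checkpoint shown here are assumptions.

```python
# Minimal QLoRA setup sketch (rank 128, alpha 256) using the Hugging Face
# peft + transformers + bitsandbytes stack. Target modules, dropout, and the
# checkpoint name are assumptions for illustration, not the released config.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the reported BF16 precision
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",                        # base architecture per the table above
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=128,                                  # rank reported in this card
    lora_alpha=256,                         # alpha reported in this card
    lora_dropout=0.05,                      # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```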
Benchmark Results
All evaluations were performed using lm-evaluation-harness v0.4.11 with zero-shot log-likelihood scoring on official test splits. No benchmark contamination: training data contains no benchmark test questions.
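For reproducibility, an equivalent run can be scripted through the harness's Python API, roughly as sketched below. The repository id is assumed from the organization name, and the task identifiers shown are common v0.4.x names; verify both against your installed harness version.

```python
# Sketch of reproducing the zero-shot log-likelihood evaluation with
# lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness).
# Task names vary across harness versions; check them with `lm-eval --tasks list`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Assumed repo id, derived from the organization name.
    model_args="pretrained=Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1,dtype=bfloat16",
    tasks=["medqa_4options", "medmcqa", "pubmedqa", "mmlu_professional_medicine"],
    num_fewshot=0,          # zero-shot, as reported above
    batch_size="auto",
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```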
Diagnostic-Reasoning-Q3X1 (Flagship)
| Benchmark | Base Model | Q3X1 (Ours) | Δ Improvement |
|---|---|---|---|
| Professional Medicine | 58.9% | 89.7% | +30.8% |
| Medical Genetics | 58.6% | 88.0% | +29.4% |
| Clinical Knowledge | 61.0% | 86.4% | +25.4% |
| Anatomy | 57.6% | 79.3% | +21.7% |
| MedQA (USMLE-style) | 43.9% | 66.3% | +22.4% |
| PubMedQA | 48.3% | 66.6% | +18.3% |
| MedMCQA | 37.3% | 58.6% | +21.3% |
| Overall Average | 52.2% | 76.4% | +24.2% |
Diagnostic-Medicine-R1
| Benchmark | Base Model | R1 (Ours) | Δ Improvement |
|---|---|---|---|
| Professional Medicine | 58.9% | 73.5% | +14.6% |
| Anatomy | 57.6% | 73.3% | +15.7% |
| Clinical Knowledge | 61.0% | 71.3% | +10.3% |
| PubMedQA | 48.3% | 68.0% | +19.7% |
| Medical Genetics | 58.6% | 62.0% | +3.4% |
| MedQA (USMLE-style) | 43.9% | 56.8% | +12.9% |
| MedMCQA | 37.3% | 46.4% | +9.1% |
| Overall Average | 52.2% | 64.5% | +12.3% |
Cross-Version Progression
Our iterative development shows consistent, compounding improvements:
| Version | Base Architecture | Avg Accuracy | Δ vs Base |
|---|---|---|---|
| Base (untuned) | DeepSeek-R1-Distill-Llama-8B | 52.2% | – |
| V7 | DeepSeek-R1-Distill-Llama-8B | 57.6% | +5.4% |
| V8 (Diagnostic-Medicine-R1) | DeepSeek-R1-Distill-Llama-8B | 64.5% | +12.3% |
| V9 (Diagnostic-Reasoning-Q3X1) | Qwen3-8B | 76.4% | +24.2% |
Context: 8B-Class Comparison
| Model | MedQA | Prof. Medicine | Clinical Knowledge |
|---|---|---|---|
| Diagnostic-Reasoning-Q3X1 (Ours) | 66.3% | 89.7% | 86.4% |
| MedReason-8B (published) | 61.7% | – | – |
| Qwen3-8B (base, untuned) | 43.9% | 58.9% | 61.0% |
| DeepSeek-R1-Distill-Llama-8B (base) | ~43.9% | ~58.9% | ~61.0% |
Our Q3X1 model's MedQA score of 66.3% exceeds published 8B-class medical models such as MedReason-8B (61.7%) and approaches the performance of several 14B-class models.
Key Findings
Methodology transfers across architectures. The same training pipeline produces substantial gains on both DeepSeek-R1-Distill-Llama-8B (V8) and Qwen3-8B (V9), confirming our approach is not architecture-dependent.
Base model quality acts as a multiplier. Qwen3-8B's stronger pretraining foundation (36T tokens vs. 15T for the Llama 3.1 base) amplifies the effect of our clinical reasoning training, producing nearly double the improvement (+24.2% vs. +12.3%).
Structured reasoning training disproportionately helps clinical domains. The largest gains are in Professional Medicine (+30.8%) and Medical Genetics (+29.4%), the domains where step-by-step clinical reasoning is most valuable.
Compact models can punch far above their weight. An 8B model scoring 89.7% on Professional Medicine demonstrates that parameter count is not the limiting factor for clinical reasoning; training methodology is.
Training Overview
| Detail | Value |
|---|---|
| Method | QLoRA (rank 128, alpha 256) with cosine decay |
| Hardware | Single NVIDIA H100 80GB |
| Training Data | ~92K curated medical reasoning examples |
| Training Phases | 3 epochs main + 1 epoch stabilization |
| Format | Multi-phase structured clinical reasoning chains |
| Q3X1 Training Time | ~16.3 hours total |
| R1 Training Time | ~10.2 hours total |
| Precision | BF16 |
The training data spans USMLE-style reasoning, knowledge-graph-grounded clinical pathways, evidence synthesis from biomedical literature, clinical textbook cases, and international medical examination content with real explanations. The stabilization phase consolidates gains and prevents catastrophic forgetting of base model capabilities.
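Expressed with `transformers.TrainingArguments`, the reported schedule might look like the following sketch. The epoch counts, cosine scheduler, and BF16 come from the table above; the learning rates and batch sizes are assumptions, as they are not published.

```python
# Sketch of the two-phase schedule: 3 main epochs with cosine decay, then a
# 1-epoch stabilization pass at a lower learning rate. All numeric values
# other than epoch counts, BF16, and the cosine schedule are assumptions.
from transformers import TrainingArguments

main_phase = TrainingArguments(
    output_dir="out/main",
    num_train_epochs=3,                 # main phase per the table above
    lr_scheduler_type="cosine",         # cosine decay per the table above
    learning_rate=2e-4,                 # assumption: typical QLoRA value
    bf16=True,                          # BF16 precision per the table above
    per_device_train_batch_size=4,      # assumption
    gradient_accumulation_steps=8,      # assumption
)

stabilization_phase = TrainingArguments(
    output_dir="out/stabilization",
    num_train_epochs=1,                 # stabilization phase per the table above
    lr_scheduler_type="cosine",
    learning_rate=2e-5,                 # assumption: reduced rate to consolidate
    bf16=True,
    per_device_train_batch_size=4,      # assumption
)
```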
Intended Use
These models are released for research purposes in medical AI, clinical NLP, and healthcare reasoning evaluation. They are intended to advance the study of clinical reasoning in language models.
⚠️ These models are NOT intended for clinical decision-making, medical diagnosis, or patient care. They are research artifacts that demonstrate training methodology effectiveness. Any clinical application would require extensive additional validation, regulatory review, and institutional oversight.
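For research evaluation, the models should load through the standard `transformers` API. A minimal sketch follows; the repository id is assumed from the organization name and may differ from the published one.

```python
# Minimal research-use inference sketch via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Clinical-Reasoning-Hub/Diagnostic-Reasoning-Q3X1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

# Research prompt only; see the warning above regarding clinical use.
prompt = "A 61-year-old presents with progressive dyspnea on exertion. Reason step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```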
Limitations
These models have been evaluated primarily on English-language multiple-choice medical benchmarks and are not validated on open-ended clinical scenarios, multi-turn dialogue, or real patient data. Performance on rare diseases and edge cases has not been characterized. The models may exhibit hallucination or confident incorrect reasoning typical of language models. They are not a substitute for qualified medical professionals.
License
Diagnostic-Reasoning-Q3X1 (Qwen3-8B-based) is released under the Apache 2.0 License, subject to the base model's license terms. Diagnostic-Medicine-R1 (DeepSeek-R1-Distill-Llama-8B-based) is subject to the Meta Llama 3.1 Community License.
Both models permit commercial use under their respective base model license terms. Users are responsible for ensuring compliance with all applicable license conditions.
Citation
```bibtex
@misc{clinical-reasoning-hub-2026,
  title={Structured Clinical Reasoning Training for Compact Medical Language Models},
  author={Clinical Reasoning Hub},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Clinical-Reasoning-Hub}
}
```