Top Open Source Health Models

Hey All,

We just wrapped up a hands-on evaluation round with our HealthEval framework. Here is what I covered in the video, along with our current top open-source picks for health-advice models on Hugging Face:



Top 3 (with quick stats)

:1st_place_medal: prithivMLmods/Qwen-UMLS-7B-Instruct · Hugging Face
Score 7.44 | 6-metric profile (Evidence & Transparency, Clinical Safety, Empathy, Clarity, Plan Quality, Trust & Agency Support) | 7B params | UMLS-aligned | EN
:link: prithivMLmods/Qwen-UMLS-7B-Instruct · Hugging Face

:2nd_place_medal: microsoft/Phi-3-mini-4k-instruct · Hugging Face
Score 7.43 | 6-metric profile | 3.8B params | MIT license | EN
:link: microsoft/Phi-3-mini-4k-instruct · Hugging Face

:3rd_place_medal: m42-health/Llama3-Med42-8B · Hugging Face
Score 7.18 | 6-metric profile | 8B params | Llama 3 base | EN
:link: m42-health/Llama3-Med42-8B · Hugging Face


How we ranked them (HealthEval by BrainDrive)

HealthEval is our evaluation workflow for AI-generated medical and health advice.
We score models on 6 clinically grounded metrics: Evidence & Transparency, Clinical Safety, Empathy, Clarity, Plan Quality, and Trust & Agency Support.
Individual scores roll up into a weighted total, which determines ranking.
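
For anyone who wants to see the roll-up idea in code, here is a minimal sketch of a weighted total. Note that the weights below are hypothetical placeholders (equal weighting) and the function name is mine; the actual HealthEval weights and formula are in the docs linked below, so treat this as an illustration only.

```python
from typing import Dict

# The six HealthEval metrics named in this post.
METRICS = [
    "Evidence & Transparency",
    "Clinical Safety",
    "Empathy",
    "Clarity",
    "Plan Quality",
    "Trust & Agency Support",
]

# ASSUMPTION: equal weights, purely for illustration.
# The real HealthEval weights are defined in the project docs.
EXAMPLE_WEIGHTS: Dict[str, float] = {name: 1.0 for name in METRICS}


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-metric scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalizing by the weight sum keeps the total on the same scale as the inputs.
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


if __name__ == "__main__":
    example_scores = {name: 7.0 for name in METRICS}  # made-up scores
    print(round(weighted_total(example_scores, EXAMPLE_WEIGHTS), 2))
```

Dividing by the sum of the weights means the weights do not need to add up to 1, which makes it easier to experiment with different emphasis (for example, weighting Clinical Safety more heavily).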

:page_facing_up: Docs (scoring & math):
:backhand_index_pointing_right: https://github.com/BrainDriveAI/ModelMatch/tree/main/HealthEval/DOCS


Workflow we used

Model shortlist (20+ HF candidates) → Multi-domain health prompts (chronic care, prevention, patient guidance, treatment safety) → Responses per model → HealthEval scoring → Weighted aggregation → Ranking.
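
The workflow above can be sketched roughly as the loop below. This is not the ModelMatch code: `rank_models`, `generate`, and `score` are hypothetical names, the two callables stand in for model inference and HealthEval scoring, and averaging a model's scores across prompts is my assumption about how per-prompt results are combined.

```python
from statistics import mean
from typing import Callable, List, Tuple


def rank_models(
    models: List[str],
    prompts: List[str],
    generate: Callable[[str, str], str],   # hypothetical: (model_id, prompt) -> response
    score: Callable[[str, str], float],    # hypothetical: (prompt, response) -> weighted total
) -> List[Tuple[str, float]]:
    """Score every model on every prompt and return (model, mean score), best first."""
    results = []
    for model in models:
        # One response and one weighted score per prompt.
        per_prompt = [score(prompt, generate(model, prompt)) for prompt in prompts]
        # ASSUMPTION: a model's final score is the mean over all prompts.
        results.append((model, mean(per_prompt)))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Stub callables so the sketch runs without downloading any model.
    fake_generate = lambda model, prompt: f"{model}:{prompt}"
    fake_score = lambda prompt, response: float(len(response))
    for model, total in rank_models(["a", "bbb"], ["x", "yy"], fake_generate, fake_score):
        print(model, total)
```

Passing the generation and scoring steps in as callables keeps the ranking logic independent of any particular inference backend or scorer, so you can swap in your own.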


Try it yourself

:laptop: Code toolkit: https://github.com/BrainDriveAI/ModelMatch/tree/main/HealthEval
:desktop_computer: No-code evaluator: HealthEval - a Hugging Face Space by BrainDrive


About ModelMatch

ModelMatch helps you find the most suitable open-source model for your domain and task. We started with summarization, then expanded into therapy, email generation, finance evaluation, and now health evaluation.

If you test other models or get different results, ping us; happy to compare notes.

Regards,
Navaneeth
