Hey All,
We just wrapped a hands-on round with our CareLock evaluation. Here's what I covered in the video, along with my current top open-source picks for therapy and mental-health–focused models on Hugging Face:
Top 3 (with quick stats)
1. Llama3-Med42-8B (m42-health/Llama3-Med42-8B)
   Score 8.6 | 10-metric profile (Empathy, Emotional Relevance, Bias Safety, etc.) | ~8k ctx | Llama-3 Community | EN
2. Gemma-3 Medical (Fine-tune i1 GGUF) (mradermacher/gemma-3-medical-finetune-i1-GGUF)
   Score 8.55 | 10-metric profile | Quantized GGUF variants for efficiency | Apache-2.0 | EN
3. Josiefied-Health-Qwen3-8B-Abliterated-v1 (i1 GGUF) (mradermacher/Josiefied-Health-Qwen3-8B-abliterated-v1-i1-GGUF)
   Score 8.15 | 10-metric profile | 8B parameters | GGUF | Multilingual (EN focus)
How we ranked them (CareLock by BrainDrive)
CareLock is our evaluation workflow designed for therapy and health contexts. We score each model on 10 metrics, ranging from Empathy and Emotional Awareness to Bias/Toxicity control and User Safety. The per-metric scores roll up into a weighted total, and that total determines the ranking.
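For anyone who wants to see the roll-up idea concretely, here is a minimal Python sketch of a weighted total. It is only an illustration, not the CareLock implementation: the function name `weighted_total`, the four metric keys, the weights, and the example scores are all hypothetical placeholders, and the real metric list and weights live in the docs.

```python
from typing import Dict

# Hypothetical weights for illustration only; CareLock's real weights
# and full 10-metric list are not reproduced here.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "empathy": 0.4,
    "emotional_relevance": 0.3,
    "bias_safety": 0.2,
    "user_safety": 0.1,
}


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-metric scores.

    Dividing by the sum of the weights keeps the result on the same
    scale as the inputs even if the weights do not sum to 1.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for metrics: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[m] * w for m, w in weights.items()) / total_weight


# Made-up per-metric scores for a single model (not real results).
example_scores = {
    "empathy": 8.0,
    "emotional_relevance": 9.0,
    "bias_safety": 7.0,
    "user_safety": 10.0,
}
print(round(weighted_total(example_scores, EXAMPLE_WEIGHTS), 2))
```

Normalizing by the weight sum is a design choice in this sketch so that you can tweak one weight without rebalancing the others.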
Docs (scoring & math):
Workflow we used
Model shortlist (20+ HF candidates) → Therapy dialogue set (support, stress, crisis prompts) → Responses per model → CareLock scoring → Weighted aggregation → Leaderboard with pie visualization.
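The flow above can be sketched roughly as follows. This is a toy outline under stated assumptions, not the TherapyEval code: `build_leaderboard`, the stub "models", and the word-count scorer are hypothetical stand-ins, and averaging a model's score across prompts is my assumption about how per-response scores might be combined.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Generate = Callable[[str], str]       # prompt -> model response
Scorer = Callable[[str, str], float]  # (prompt, response) -> score


def build_leaderboard(
    models: Dict[str, Generate],
    prompts: List[str],
    score_response: Scorer,
) -> List[Tuple[str, float]]:
    """Run every model on every prompt, score each response,
    average per model, and sort best-first."""
    rows = []
    for name, generate in models.items():
        per_prompt = [score_response(p, generate(p)) for p in prompts]
        rows.append((name, mean(per_prompt)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Stub "models" standing in for real Hugging Face checkpoints (hypothetical).
stub_models: Dict[str, Generate] = {
    "terse-stub": lambda prompt: "ok",
    "supportive-stub": lambda prompt: "That sounds hard. Tell me more about: " + prompt,
}


def toy_scorer(prompt: str, response: str) -> float:
    """Toy scorer: longer replies score higher, capped at 10.
    A real scorer would apply the therapy metrics instead."""
    return float(min(10, len(response.split())))


dialogue_prompts = [
    "I feel overwhelmed at work.",
    "I can't sleep because of stress.",
]

for model_name, score in build_leaderboard(stub_models, dialogue_prompts, toy_scorer):
    print(f"{model_name}: {score:.2f}")
```

Passing the generator and scorer in as plain callables keeps the sketch independent of any particular inference backend or scoring method.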
Try it yourself
Code toolkit: ModelMatch/TherapyEval (main branch of BrainDriveAI/ModelMatch on GitHub)
No-code evaluator: Therapy Model Evaluator, a Hugging Face Space by BrainDrive
About ModelMatch
ModelMatch helps you find the most suitable open-source model for your domain and task. We started with summarization and have now extended it to therapy use cases.
If you test other models or get different results, ping us; happy to compare notes.
Regards,
Navaneeth