Hey All,
We just wrapped a hands-on round with our JudgeLock evaluation. Here's a recap of what I discussed in the video, plus my current top open-source picks for summarization on Hugging Face:
Top 3 (with quick stats)
- OpenHermes-2.5-Mistral-7B — https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B
  - Score 9.69 | C/A/H/R/B = 9/10/10/9/10 | ~8k ctx | Apache-2.0 | EN
- FLAN-T5-Base — https://huggingface.co/google/flan-t5-base
  - Score 9.15 | C/A/H/R/B = 8/9/9/9/10 | 512 ctx | Apache-2.0 | Multilingual
- SummLlama3.2-3B (DISLab) — https://huggingface.co/DISLab/SummLlama3.2-3B
  - Score 9.10 | C/A/H/R/B = 9/9/7/9/8 | ~8k ctx | Llama 3.2 community | 8 langs
How we ranked them (JudgeLock by BrainDrive)
Each model is scored on five practical signals: Coverage (C), Alignment (A), Hallucination (H), Relevance (R), and Bias/Toxicity (B).
Docs (scoring & math): https://github.com/BrainDriveAI/ModelMatch/tree/main/Summeval/Docs
Workflow we used
Model shortlist (30+ HF candidates) → Article set (Tech/Business/News/Science) → Summaries per model → JudgeLock scoring → Weighted aggregation → Leaderboard.
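The aggregation step above boils the five per-signal scores down to one leaderboard number. As a rough sketch only (the function name and the equal default weights are illustrative assumptions; the actual weighting is in the scoring docs linked above):

```python
def judgelock_score(coverage, alignment, hallucination, relevance, bias,
                    weights=(1, 1, 1, 1, 1)):
    """Aggregate the five JudgeLock signals (each 0-10) into one score.

    Illustrative sketch: equal weights by default; pass custom weights
    to mirror whatever weighting the real pipeline uses.
    """
    signals = (coverage, alignment, hallucination, relevance, bias)
    # Weighted average, normalized so the result stays on the 0-10 scale.
    return sum(w * s for w, s in zip(weights, signals)) / sum(weights)

# FLAN-T5-Base's per-signal scores from the table above:
print(judgelock_score(8, 9, 9, 9, 10))  # equal-weight mean: 9.0
```

With equal weights this gives 9.0 for FLAN-T5-Base rather than the 9.15 on the leaderboard, which is consistent with the real pipeline weighting some signals more heavily than others.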
Try it yourself
Code toolkit: https://github.com/BrainDriveAI/ModelMatch/tree/main/Summeval
No-code evaluator: https://huggingface.co/spaces/BrainDrive/Summary-Evaluator
ModelMatch is our way of helping you pick the right open-source model for real tasks. If you test other models or get different results, ping us; we're happy to compare notes.
Regards,
Navaneeth