BrainDrive ModelMatch Project Updates - AI Model Evaluation & Selection

Hi Guys,

@DJJones and I spent a good portion of the dev call today discussing the creation of a BrainDrive Model Guide (working title). Below is a link to the full video recording followed by an AI powered summary of our conversation.

Comments, questions, ideas etc welcome as always. Just hit the reply button.

Thanks!
Dave W.

:dart: The Problem

If you’ve spent any time in the local LLM space, you’ve probably seen the same question asked over and over:

“What’s the best model for [coding | therapy | content creation | research etc] that runs locally?”

And the answer is usually… “It depends.”

There’s no easy way to compare local models by use case, especially for non-technical owners like Katie Carter who want results—not benchmarks.

:bulb: The Vision

We want to make it dead simple to evaluate and compare models based on real-world tasks—not just abstract academic benchmarks.

  • What if you could install a plugin that runs a series of practical tests on a local model?
  • What if BrainDrive could act as your testing lab, and generate side-by-side evaluations?
  • What if we built a public directory of results, so you could search: “Best 8B model for content writing,” and instantly see real examples?

That’s the idea.

:mag: What We’re Exploring

Here’s what we’re thinking for v1:

  • Start with a set of the most common use cases (e.g. writing, therapy, summarization, research)
  • Define a handful of real-world tasks for each (e.g. write a blog intro, summarize an article, hold a 5-turn conversation)
  • Use BrainDrive plugins to run each model through these tasks
  • Use a larger state-of-the-art model (like GPT-4) to evaluate the quality of the results (see the sketch after this list)
  • Display the results in a public directory, possibly right inside BrainDrive
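
To make the plugin-plus-judge idea concrete, here's a minimal sketch of what a single task run could look like, assuming an Ollama-style local endpoint and the OpenAI Python client as the judge. The prompt wording, model names, and 1-10 scale are placeholders, not a spec:

```python
# Hypothetical sketch only: run a local model on a practical task, then ask a larger
# model to grade the result. Endpoint, model names, prompt, and scale are placeholders.
import requests
from openai import OpenAI

judge = OpenAI()  # judge client (GPT-4-class model)

def run_local_model(model: str, prompt: str) -> str:
    # Ollama-style local generation endpoint (assumption)
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False})
    return resp.json()["response"]

def judge_output(task: str, output: str) -> str:
    result = judge.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content":
                   f"Task: {task}\n\nModel output:\n{output}\n\n"
                   "Score this output from 1-10 for the task and explain briefly."}])
    return result.choices[0].message.content

task = "Write a two-sentence blog intro about home solar panels."
print(judge_output(task, run_local_model("llama3:8b", task)))
```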

We’d also tag each model with metadata (rough sketch below):

  • Model size (1B, 8B, 13B, etc.)
  • Hardware requirements (e.g. VRAM needed for 15–20 tokens/sec)
  • License, origin, and finetuning info
  • Community ratings and comments
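
And purely for illustration, one directory entry carrying that metadata might look like this; every field name and value below is invented:

```python
# Invented example of a directory entry; not an actual BrainDrive schema.
model_entry = {
    "name": "example-8b-instruct",        # placeholder model name
    "size": "8B",
    "hardware": {"min_vram_gb": 10, "target_tokens_per_sec": "15-20"},
    "license": "Apache-2.0",
    "origin": "community fine-tune of example-base-8b",
    "community": {"rating": 4.3, "comments": 27},
}
```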

Eventually, the goal is to make it easy to:

  • See how well a model performs before downloading it
  • Choose the best option for your hardware and task
  • Contribute your own tests, models, and feedback

:rocket: Why It Matters

This directly supports our mission to help you build, control, and benefit from your own AI system. Picking the right model is step one.

It also unlocks:

  • Better tools for our Katie and Adam personas
  • More visibility for independent model creators
  • A real-world benchmark alternative the community can evolve

:handshake: How You Can Help

We’re early in the planning phase, so we’d love your feedback:

  • What use cases should we include first?
  • What test prompts or tasks do you think we should use?
  • What evaluation methods would be helpful?
  • Would you be interested in contributing models, tests, or evaluations?

Let us know what you think below :point_down:
This is a community-powered initiative, and your input will shape where we go from here.

— Dave W.

Your AI. Your Rules.

Hi All,

We are making good progress on the Model Guide concept. First off, we’re spitballing a new name: BrainDrive ModelMatch. Let us know what you think.

Here’s the latest discussion between @DJJones and myself followed by an AI powered summary:

Let’s keep the momentum going!

Thanks,
Dave W.

:jigsaw: The Problem with Most AI Model Leaderboards

Most AI model benchmarks today aim to be purely mathematical and “objective.” They focus on metrics like MMLU scores or GLUE benchmarks. The problem? Those benchmarks are easy to game—and often don’t reflect real-world performance.

We’ve seen this again and again: a model crushes the benchmarks… and flops in actual usage. That’s because companies are fine-tuning to win the test, not to help real people solve real problems.

:white_check_mark: Our Take: Real-World Use Cases Over Raw Benchmarks

ModelMatch flips the script. We’re building a new kind of model evaluation system—one that’s grounded in how people actually use these models.

Instead of relying only on synthetic benchmarks, we run every model through curated case studies. For example, we might test summarization quality using the same set of 10 articles across all models. That way, you get an apples-to-apples comparison of how each model performs on the same real-world task.

We’ll still report standard benchmark scores, but we’ll show our work—and focus on what actually matters in practice.

:brain: Human Insight + Automation

We’re also designing ModelMatch to blend the best of both worlds:

  • Human judgment where it matters (subjective evaluation of output quality).
  • Automation where it’s reliable (e.g., checking if key facts were retained in a summary).

As the system matures, we’ll continue automating more of the evaluation pipeline using LLMs themselves—with humans reviewing where needed. That means faster coverage of new models without sacrificing quality.
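
As a tiny illustration of the "automation where it's reliable" piece, a fact-retention check can start out as simply as verifying which known key facts survive into the summary. This is a deliberately naive sketch (substring matching against a hand-written fact list), not the actual pipeline:

```python
# Naive fact-retention check: what fraction of known key facts appear in the summary?
# Illustrative only; a real check would use fuzzier matching or an LLM judge.
def fact_retention(summary: str, key_facts: list[str]) -> float:
    text = summary.lower()
    retained = [fact for fact in key_facts if fact.lower() in text]
    return len(retained) / len(key_facts) if key_facts else 0.0

facts = ["founded in 2019", "raised $12 million", "headquartered in Austin"]
summary = "The startup, founded in 2019 and headquartered in Austin, builds eval tools."
print(f"Retention: {fact_retention(summary, facts):.0%}")  # Retention: 67%
```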

:rocket: Built for Content, Community, and Continuous Improvement

We’re not just building a static directory. We’re creating a content engine:

  • Each model release becomes a content opportunity.
  • Our case study evals turn into videos, tutorials, and blog posts.
  • We “newsjack” model releases to be first with relevant, trustworthy info.

This strategy helps drive visibility for BrainDrive and creates value for the entire ecosystem.

:handshake: Transparent, Community-Driven, and Evolving

ModelMatch will be transparent about what we’re testing, how we’re testing it, and why it matters. But we’ll also evolve our test sets over time to avoid models gaming the system. If a model is fine-tuned specifically to ace our tests—that’s fine, as long as those tests represent actual, useful tasks.

Eventually, we’ll also:

  • Link to external community discussions (Reddit, Discord, X).
  • Ingest and summarize real-world chatter about model performance.
  • Let our community contribute benchmarks and feedback.

:compass: Why It Matters

We’re building this because finding the right AI model shouldn’t require a PhD or hours of Reddit diving.

With ModelMatch, you’ll be able to:

  • Input your hardware specs.
  • Pick a use case (e.g., summarization).
  • Get a shortlist of the best models for you—backed by real-world testing.

Think of it as “TripAdvisor for Open-Source Models” — only faster, more focused, and rooted in BrainDrive’s mission of owner-first AI systems.


:link: Let us know what you think. Feedback and ideas welcome as we build this in the open.

Your AI. Your Rules.
—Team BrainDrive

Hi All,

We’re continuing to make good progress on BrainDrive ModelMatch.

We have formalized the project into a project document which you can find here.

And we’ve contracted with two developers to help us build out the concept.

Victor is going to be building out the model selector tool. The first phase of this is to create a wizard that allows BrainDrive Owners to quickly and easily see the size of the models they can run on their specific computer.

We have the first prototype of this tool which we are in the process of testing now. You can see it here.

In addition to the basic functionality, Victor is working on a number of additional features for the tool, including specific model recommendations, which he’ll be adding in the future.

We’re in the early stages of development and looking for feedback, so if you try it out, hit the reply button here and let us know what you think.

We also now have Navaneeth, who is working on the model evaluation process, which will have both human and automated components. You can see his plan for this, the first phase of which he is working on, here.

Follow this thread for updates as we continue to build this out. Comments, questions, and concerns welcome as always.

Thanks!
Dave W.

Hi Victor, here is a recording of me going through the ModelMatch Wizard on my MacBook. Overall it looks like we are off to a good start:

Will try it on my Windows machines now. Any questions or issues, let me know.

Thanks
Dave W.

Hi Victor,

Here is a recording of me trying it on my Minisforum Windows mini PC with Chrome. The auto-detection didn’t work this time and I couldn’t find my processor.

Any questions or issues, let me know. Thanks, Dave W.
Here are the system details:

Hi Guys,

Below is a recording of my call with Navaneeth, the developer of our BrainDrive ModelMatch Evaluator software, which we are going to use to evaluate and choose the best models for specific use cases. The recording is followed by an AI powered summary of the items discussed.

Questions, comments, concerns, and ideas welcome as always. Just hit the reply button.

Thanks!
Dave W.

In this call, David (co-creator of BrainDrive) meets with the developer to review the prototype of the BrainDrive ModelMatch Evaluator—a tool designed to assess how well different AI models perform on specific tasks like article summarization. The goal is to help the BrainDrive community identify the best models for real-world use cases, starting with summarization and expanding into others like therapy and personal coaching.


:mag: Purpose of the Tool

ModelMatch lets you evaluate how accurately AI-generated summaries reflect source articles using multiple scoring metrics. You can paste an article, paste a summary, and select a judging method and models (e.g., GPT-4, Claude, DeepSeek) to analyze how well the summary performs.


:test_tube: Evaluation Methods: The Three “Variants”

The tool includes three evaluation variants, each with a unique approach:

  1. TwinLock

    • Extracts 6–7 key points from the article.
    • Evaluates the summary based on how well it covers those points.
    • Ideal for measuring coverage but weaker on detecting hallucinations or factual deviations.
  2. JudgeLock

    • Treats the LLM as a judge: it generates Q&A pairs from the article, then tests if the summary answers them accurately.
    • More robust against hallucinations and misalignment.
    • Better at ensuring factual consistency.
  3. Parallax-DJ

    • Combines scores from TwinLock and JudgeLock for a balanced evaluation.
    • Weighting is currently 50/50 but may become customizable later.
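
For anyone curious how the JudgeLock and Parallax-DJ ideas above could look in code, here is a rough sketch. The prompts, judge model name, JSON handling, and pass/fail scoring are assumptions for illustration, not the evaluator's actual implementation:

```python
# Rough sketch of the JudgeLock idea: ask a judge model for Q&A pairs grounded in the
# article, then check whether the summary answers each question. Prompts, model name,
# and scoring are illustrative assumptions only.
import json
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(model="gpt-4.1-mini",
                                          messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def judgelock(article: str, summary: str, n_questions: int = 5) -> float:
    # In practice the returned JSON would need validation/retries; this is a sketch.
    qa_pairs = json.loads(ask(
        f'Write {n_questions} factual question/answer pairs about this article as a JSON '
        f'list of objects with "q" and "a" keys. Article:\n{article}'))
    passed = sum(
        ask(f'Summary:\n{summary}\n\nDoes the summary answer "{p["q"]}" consistently '
            f'with "{p["a"]}"? Reply YES or NO.').strip().upper().startswith("YES")
        for p in qa_pairs)
    return passed / len(qa_pairs)

def parallax_dj(twinlock_score: float, judgelock_score: float) -> float:
    return 0.5 * twinlock_score + 0.5 * judgelock_score  # current 50/50 blend
```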

:bar_chart: Evaluation Metrics

All three variants measure performance across five core metrics:

  • Coverage
  • Alignment
  • Hallucination
  • Relevance
  • Bias/Toxicity

Scores are calculated using traditional ML techniques like precision, recall, and F1 score. Weightings can be adjusted with sliders.
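
As a small worked example of how those pieces fit together, here's an F1-style score built from precision and recall, followed by a slider-style weighted blend across the five metrics; all numbers and weights are invented:

```python
# Illustrative only: F1 from precision/recall for one metric, then a weighted blend
# across the five metrics. The weights stand in for the UI sliders; values are made up.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

scores = {"coverage": f1(0.9, 0.75), "alignment": 0.80, "hallucination": 0.95,
          "relevance": 0.85, "bias_toxicity": 1.00}
weights = {"coverage": 0.3, "alignment": 0.2, "hallucination": 0.2,
           "relevance": 0.2, "bias_toxicity": 0.1}  # slider values, sum to 1

print(round(sum(scores[m] * weights[m] for m in scores), 3))
```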


:robot: Supported Judge Models

The evaluator currently supports:

  • GPT-4.1 Mini (OpenAI)
  • Claude (Anthropic)
  • DeepSeek

You can run single or multi-model evaluations, and the system outputs:

  • Metric scores
  • Token usage
  • JSON-format explanations
  • Downloadable results
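
Purely as an illustration of those outputs, a single result could look something like the following; the tool's actual field names may differ:

```python
# Invented example of one evaluation result; not the tool's actual schema.
example_result = {
    "judge_model": "gpt-4.1-mini",
    "scores": {"coverage": 0.81, "alignment": 0.76, "hallucination": 0.92,
               "relevance": 0.84, "bias_toxicity": 0.99},
    "explanations": {"coverage": "Summary captures 5 of 6 extracted key points."},
    "token_usage": {"prompt_tokens": 1840, "completion_tokens": 412},
}
```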

:bulb: Real-Time Demo Highlights

  • Evaluations typically run in ~90–100 seconds depending on model selection.
  • The tool accurately identified mismatched summaries (e.g., one that sounded right but was from a different subdomain).
  • Claude-4 showed particularly strong performance in hallucination detection during tests.

:hammer_and_wrench: Status & Next Steps

  • TwinLock, JudgeLock, and Parallax-DJ are completed and functioning.
  • The developer is finalizing documentation and preparing the Hugging Face Space for public use.
  • Once tested by the team, the tool will be made freely accessible.

:crystal_ball: What’s Next: JudgeLock Ultra & Beyond

  • A more advanced evaluator variant (“JudgeLock Ultra”) is in the works, using reasoning-heavy models for deeper evaluation.
  • Future directions may include expanding beyond summarization to evaluate models for other use cases like therapy, coaching, and chatbot performance.

:compass: Strategic Direction

David suggests the focus should now shift from refining a single use case to expanding into multiple use cases, making BrainDrive ModelMatch a go-to tool for benchmarking models across various real-world applications.


Let us know what use cases you’d like to see evaluated next—or try out the prototype and share your feedback!


Hi Guys,

Here is a recording of my conversation with Navaneeth with the latest progress on the BrainDrive ModelMatch project followed by an AI powered summary of the convo.

The short version is that we now have our second use case live, which is helping choose the best model for therapy-related use cases.

Questions, comments, ideas etc welcome as always. Just hit the reply button.

Thanks
Dave

:brain: Therapy Evaluator: A New Tool for Testing Model Performance on Mental Health Use Cases

We’re excited to share a behind-the-scenes look at a new project: the Therapy Evaluator—a tool designed to assess how well AI models handle mental health-related conversations.

:dart: What It Does

This tool allows you to paste in any AI-human conversation—structured or unstructured—and get an evaluation based on 10 core metrics tailored to therapeutic interactions:

  • Empathy
  • Emotional Relevance
  • Tone
  • Boundary Awareness
  • Supportiveness
  • Ethical Safety
  • Clarity
  • Consistency
  • Self-Awareness
  • Adaptability

Under the hood, it runs the input through multiple models acting as evaluators (currently GPT-4.1 Mini, Claude 3.5 Sonnet, and DeepSeek), then returns scores (0.0–1.0), a summary, and 5 pros and cons per model.

:test_tube: How We Tested It

We evaluated 3 types of conversations:

  1. Good – empathetic, actionable, and supportive.
  2. Average – functional but missing professional guidance or emotional nuance.
  3. Poor – lacking empathy, inappropriate tone, or no referral to a professional.

In each case, the system returned helpful summaries. For example:

“Empathic response to trauma disclosure, but lacks safety assessment or professional referral options.”

It even performed well when fed unformatted, paragraph-style input—auto-formatting the content internally before evaluation.

:mag: Why This Matters

Mental health is a high-stakes use case for AI. Our evaluator gives owners a way to test how well their models perform before deploying them.

This has implications for:

  • Fine-tuning models for better therapeutic use
  • Writing articles or meta-analyses on model quality
  • Open-sourcing benchmark datasets
  • Building plugins to evaluate other use cases

And it’s not just about therapy. This evaluator lays the groundwork for other domains where communication quality matters—like education, coaching, or customer support.

:seedling: What’s Next?

  1. Open-Source Launch: We’re preparing the code and documentation for release on GitHub and Hugging Face Spaces.
  2. Community Testing: We’ll allow a limited number of people to try it out (10–30 evaluations per person) to gather early feedback.
  3. New Use Cases: Article summarization is next, and we’re considering general-purpose model evaluation down the line.
  4. Integration into BrainDrive: Eventually, this could become a BrainDrive plugin—letting owners test model performance across tasks directly in their own system.

:speech_balloon: We Need Your Feedback

This project is still early. Here’s how you can help:

  • Try it out (once live) and let us know what’s useful—or what’s broken.
  • Suggest additional use cases you’d want us to evaluate.
  • Tell us what you think about the evaluation methodology itself.
  • Fork it. Improve it. Make it your own.

We believe model evaluation should be open, transparent, and owned by the community—not locked inside proprietary benchmarks.

If you’re interested in testing it out or contributing, comment below.

Let’s build a better way to choose the right model for the job—together.

:rocket: Your AI. Your Rules.

— The BrainDrive Team

Hi All,

As a part of today’s weekly BrainDrive development update, @DJJones and I discussed and decided that we are going to open source ModelMatch.

Below is a recording of this part of the discussion followed by an AI powered overview for those who prefer to read instead of watch.

Questions, comments, concerns, and ideas welcome as always. Just hit the reply button.

Thanks!
Dave W.

:brain: Should BrainDrive Model Match Be Open Source? We’ve Decided.

We just wrapped a long discussion about whether or not to open source BrainDrive Model Match—the engine we’re building to help evaluate and compare AI models in a transparent and customizable way. It wasn’t a quick decision, but we’re excited to share: we’re open sourcing it.

Here’s the thinking that led us there:


:checkered_flag: The Case Against Open Sourcing

We’re not naive about the risks. In fact, this mirrors our early debates around open sourcing BrainDrive Core itself. Anyone can clone an idea. AI makes reverse engineering faster than ever. And there’s always the risk that someone will fork your work, slap a logo on it, and grab the spotlight—especially if they’ve got Big Tech backing.

If we kept Model Match closed-source, it would be harder for bad actors to game the system. We could reveal only the results—not the prompts or configurations—making it tougher to over-optimize for the leaderboard. We’d be playing things closer to the vest, to keep things honest.

But we don’t want to play defense.


:rocket: The Case For Open Sourcing

Transparency is a core value. Our mission is to make it easy to build, control, and benefit from your own AI system—not just for us, but for the whole community.

Open sourcing Model Match sends a clear message:

  • :white_check_mark: We aren’t gaming our own system.
  • :white_check_mark: We aren’t being paid to rank models higher.
  • :white_check_mark: We have nothing to hide.

It also invites the community to help improve it.

If someone wants to build a better therapy evaluator, or adapt it for drug addiction, or tweak it for education—they can. Post your settings in the forum. Submit a pull request. Use it for your own evaluations. Let’s build a library of evaluations that anyone can extend or remix.

That’s how we go from good to great.


:wrench: Where We’re Headed

Model Match won’t just be code. It’ll become an engine powered by community-built configurations. No coding required—just create a config file to define your own evaluation method.
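
To make that concrete, here is one guess at what such a config could contain, written as a plain Python dict because the real file format hasn't been decided; every field name and value is a placeholder (the education example echoes the use cases mentioned above):

```python
# Hypothetical evaluation config; the real Model Match format and fields are still TBD.
education_eval_config = {
    "use_case": "education-tutoring",
    "metrics": {                      # metric name -> weight (exposed as sliders elsewhere)
        "clarity": 0.3,
        "factual_accuracy": 0.4,
        "encouragement": 0.3,
    },
    "judge_models": ["gpt-4.1-mini", "claude-3-5-sonnet"],   # placeholder judge IDs
    "judge_prompt": "Score the tutor reply on each metric from 0 to 10 and explain why.",
    "samples": "datasets/tutoring_conversations.jsonl",      # made-up path
}
```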

Eventually, we’ll add:

  • A visual playground for testing new configs
  • A BrainDrive-powered AI assistant to help you build evaluations
  • A curated library of use case-specific evaluations on BrainDrive.ai

This approach gives you freedom and flexibility. You can create your own evaluator or use one from the community. And if you’re a builder, you can still stand out—by creating high-quality evaluations or value-added tools that sit on top of Model Match.


:closed_lock_with_key: Ownership & Licensing

Model Match will be released under the MIT license—just like BrainDrive Core.

It’s yours to run, fork, and extend. Just don’t pretend to be BrainDrive—we’ve got trademarks to protect the integrity of the brand.


:speech_balloon: We Want Your Feedback

This is a community-first decision. We’re doing this to build trust, expand reach, and invite collaboration.

So if you’ve got ideas for how to improve Model Match—or want to help us build the first wave of evaluation configs—drop them in the forum. Let’s shape this together.


:seedling: Final Thoughts

Open source isn’t just a license. It’s a philosophy. We’re not here to dominate—we’re here to participate in a decentralized AI future.

“No mo’ is the moat.”

We believe the best way to protect our mission is to make BrainDrive—and its tools—so good, so open, and so widely used that the only thing worth doing… is building with us.

Let’s build together.

—The BrainDrive Team

Hi All,

Here is the latest ModelMatch discussion between Navaneeth and me, followed by an AI Powered summary of what was discussed.

Questions, comments, and ideas welcome as always. Just hit the reply button.

Thanks,
Dave W.

Video:

AI Powered Summary:

Presentation Guidelines Alignment

  • Presenter confirmed they reviewed brand guidelines for YouTube presentation.
  • Guidelines include logo usage and possibly a specific color scheme.
  • Agreed to review a teammate’s prepared presentation for brand consistency.

Presentation Review

  • Presentation structure: intro page → top models overview → details per model (score, parameters, use cases, uniqueness) → evaluation framework.
  • Expected length: under 5 minutes.
  • Feedback: current design mixes dark and light mode; decision needed to stick with one theme.

Color Scheme Discussion

  • If light mode: background should be soft white (or steel blue), text/logo in deep blue, buttons in CTA black.
  • If dark mode: stick with dark backgrounds consistently.
  • Noted that “cards” in slides currently use dark mode, causing inconsistency.

Decision on Theme

  • Either light or dark mode is acceptable, but must commit to one for branding consistency.

Slide Content Feedback

  • Importance of score ranges and model sizes upfront.
  • Recommendation to keep slides as visual cues rather than dense text.
  • Avoid full sentences; use bullet points or visualizations (e.g., pie charts for metrics).
  • Suggestion to break detailed slides into separate slides (e.g., use cases vs. key stats).

Visual and Engagement Strategy

  • Use charts, logos, and minimal text to keep audience focused on narration.
  • Consider dataset source visualizations and metric weighting visuals.

Evaluation Process Coverage

  • Keep evaluation explanation short in video; link to full process in video description.
  • Full documentation and code to be hosted on GitHub (and linked in forum).
  • Make evaluation framework and tools publicly accessible for others to adapt.

Community and Promotion Approach

  • Build initial content before promotion.
  • Later, share in forums like Reddit to attract interest.
  • Plan to create 5–10 use case videos, then assign overall scores to models.
  • Introduced concept of “newsjacking” to ride trending AI topics for visibility.

Next Steps for Content

  • For now, focus on producing and refining videos before marketing push.
  • Video flow: top models → deep dive per model (with visuals) → brief evaluation overview → link to detailed resources.

Additional Project Task

  • Agreement to work on an “evaluation agent” for email model testing this week.
  • Target to discuss progress Friday at 11 a.m.

Closing

  • Plan: update slides per feedback, send for review, then proceed to recording two videos.
  • Next meeting set for Friday.
  • Meeting ended on a positive note.

hey all

today @davewaring and I focused on how we can make our evaluation videos and content clearer, shorter, and more helpful.

Video improvements

  • Showing all 10 metrics together on screen feels heavy and is hard to read, especially on mobile. we need a simpler layout that shows the key points first and lets people go deeper only if they want.
  • We agreed the full video should stay under 5 minutes. adding details like model creator info or background could make it too long, so we’ll be careful about what to include in the main cut.
  • Slides should act as visual support for the talk, not duplicate it. for example, if the talk explains scoring, slides can just show the ranking or chart.
  • I will also make sure to balance subjective parts (like my take or interpretation) with factual stats (like ranks, sizes, or scores). this keeps the talk personal but still evidence-based.
  • We’ll make sure top therapy models are highlighted in a clear and structured way. every model shown should have rank, name, creator, and size so the overview feels complete and professional.
  • After slides, we’ll share hugging face links for each model. that way, people who want to dive deeper can explore more details without the video getting crowded.

Content approach

  • Instead of just aiming for small steady improvements, david proposed creating clear checklists for every video. this means before publishing we’ll run through a list of “must-haves” like: clarity of slides, time length, top models highlighted, clear ranking, subjective + factual balance, and a call to action. this gives us a concrete standard for quality.
  • We still want each version to feel better than the last, but the checklist will make sure we don’t miss the basics.
  • Keeping a consistent posting schedule is important so people know when to expect new content. we talked about having a set rhythm across platforms instead of random drops.
  • Reddit came up as another good channel for sharing content. it can help us reach communities who are deeply into open-source models but might not be active on linkedin or youtube.
  • Before doing big promotions or outreach, we’ll expand model match to cover at least 10 different use cases. this way, when more people find us, they’ll see real depth and range.

Long term vision

  • The bigger goal is to make model match the main trusted place people go when they need to pick the right model.
  • We want this to feel useful not just for beginners but also for researchers and experts who want structured comparisons.
  • The way we’ll stand out is by being radically transparent: sharing how we test, what worked, what didn’t, and even our strategy. this openness will make people trust the process and see us as the go-to source for model evaluation.

Hi @davewaring

I suggest we fix Thursday as the day for the Email Eval release. A few key updates are still in progress, like research paper segregation, documentation, and preparing the open-source release on GitHub, so Thursday feels like the right time to align everything. As we’ve seen from the recent discussion and the changes made, this timing should also give us space to refine the release properly.

Would it be okay if we also reserve Thursdays moving forward for ModelMatch update videos on YouTube? Meanwhile, the team will put together a structured checklist so that by our next call, we can review everything clearly.

Best,
Navaneeth

sounds good @navaneeth

Thanks!

Dave W.

Hi All,

Below is the recording from my call with @navaneeth today discussing the latest recommended-models video, covering the best open-source AI models for email writing, which Navaneeth will be posting on the BrainDrive YouTube channel and here in the community shortly.

We also discussed the next use case we’ll be creating an evaluation framework and recommended-models video for, which is personal finance.

Questions, comments, concerns, and ideas welcome as always.

Thanks!
Dave W.

Recording:


Hi @davewaring

EmailEval’s YouTube video, documents, and tools on Hugging Face Spaces are now live and open to the public.

Community note: Top Open-Source Email Models

Best,
Navaneeth


Hi All,

Below is the recording and AI powered summary of my call with Navaneeth today on the ModelMatch project.

We’re making good progress towards a beta release which should happen in the next couple of weeks.

Questions, comments, ideas, welcome as always, just hit the reply button.

Thanks
Dave W.

Recording:

AI Powered Summary:

:white_check_mark: Finance Evaluation Complete

  • We’ve finished evaluating open-source models for personal finance use cases (e.g., budgeting, taxation, investing).
  • All models tested were large (~70B parameter) LLaMA-family models, as smaller finance-specific chat models either don’t exist or perform very poorly.
  • Even the best-performing model (LLaMA 3.1 Instruct) only scored ~6.2/10 in our evaluation framework.
  • Most finance models on Hugging Face are classification or embedding models, not suitable for chat-based interaction.
  • There is currently no strong, open-source fine-tuned chat model for personal finance available.

:turtle: Key Bottleneck: Model Inference Speed

  • Large open-source models are very slow to run, especially for conversation-based evaluations (some responses took 15+ minutes).

  • Currently using Google Colab Pro with limited GPU capabilities (A100), which is cheap but slow.

  • There’s a clear need to explore faster inference options like:

    • RunPod
    • AWS / Google Cloud (spot or on-demand)
    • Hugging Face Inference Endpoints
    • Startups offering cheaper GPU credits (e.g., Together AI, Hetzner)

:magnifying_glass_tilted_left: Strategy Going Forward

  • We’ll still publish the Finance Evaluation Video, even with low scores — it’s important to show what does and doesn’t work.

  • Long-term goal is to provide not just base model evaluations, but enhanced model kits inside BrainDrive:

    • :white_check_mark: Best base model
    • :white_check_mark: Optimized system prompt
    • :white_check_mark: (Optional) Domain-specific dataset or RAG component

:test_tube: Prompt Engineering & Enhancements

  • Base evaluations use a minimal system prompt to keep tests fair.

  • For actual usage in BrainDrive, we’ll build detailed system prompts and explore the following (illustrative sketch after this list):

    • Prompt tuning
    • Model temperature & token adjustments
    • Chain-of-thought (CoT) prompting
    • RAG (Retrieval-Augmented Generation) with financial knowledge bases
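
Here's a rough illustration of what the "detailed system prompts" and settings adjustments above might look like in practice against an Ollama-style chat endpoint; the prompt text, model tag, and option values are placeholders, not recommendations from the evaluation:

```python
# Illustrative sketch only: a tuned "personal finance" setup of the kind described above.
# The prompt, model tag, and option values are placeholders, not evaluated picks.
import requests

system_prompt = (
    "You are a cautious personal-finance assistant. Show your arithmetic step by step, "
    "state the assumptions behind any estimate, and remind the user you are not a "
    "licensed financial advisor."
)

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.1:70b",   # placeholder model tag
    "messages": [{"role": "system", "content": system_prompt},
                 {"role": "user", "content": "Help me plan a monthly budget on $4,200 take-home."}],
    "options": {"temperature": 0.3, "num_predict": 512},  # conservative sampling, capped length
    "stream": False,
})
print(resp.json()["message"]["content"])
```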

:stethoscope: Next Up: Health Use Case

  • Health evaluation begins next — early signs show more available models than finance.
  • Like finance, it will be a conversation-based evaluation.
  • We’ll make an early call if we need to constrain the use case (e.g., focus just on blood test analysis) to ensure it’s viable with smaller, faster models.

:light_bulb: Final Thoughts

Our goal isn’t just to rank models — it’s to help people get real tasks done using open-source AI. Whether it’s budgeting, therapy, or health tracking, we’re building plug-and-play kits for BrainDrive that will bring AI into specific daily workflows.

Thanks for following along — let us know if there are specific domains you’d like us to explore next! :raising_hands:


Hi All,

Please find a recording of the latest discussion between Navaneeth and me regarding ModelMatch progress and next steps.

Things are moving along very well and we are preparing for our beta release!

Questions, ideas, comments, and concerns welcome as always. Just hit the reply button.

Thanks
Dave W.

Recording:

AI Powered Summary:

Proposal: Centralized Results Access

  • Create a single static landing page to consolidate:
    • Results for all domains
    • Links to GitHub, HF Space, docs
    • Video(s) and explanations
  • Rationale: avoids making users scrub the video or hop between multiple links.

Community Feedback Plan

  • Goal: gather expert feedback on prompts and evaluation logic (“judge prompts”).
  • Channels:
    • Reddit (evals/open-source subreddits)
    • Direct outreach (X/Twitter DMs, known eval folks)
    • University contacts (professors, researchers)
    • (Possibly Discord communities)
  • Content for outreach:
    • Clear post about Model Match initiative
    • Short summary + links to in-depth docs
    • MIT-licensed framework; how it works; recommendations; research links
  • Target: ~a dozen expert feedback responses to start.

Parallel Track: Improving Existing Use Cases

  • After baseline (“vanilla”) evaluation, work on boosting scores:
    • Prompting: custom system prompts per domain/use case.
    • Model settings: temperature, top-k, etc.
    • Context window adjustments (if supported).
    • Future: custom datasets / fine-tuning (later phase).

Integration with BrainDrive

  • Default plugins envisioned: Ollama (local models), OpenRouter (API models), AI Chat, Page Builder plugin.
  • Add Personas built on the new custom prompts (health, therapy, email, etc.).
  • Provide ready-made pages/experiences users can try and remove easily.

Next Deliverables & Timeline (agreed)

  • This week / by Friday:
    • Draft the landing page that ties everything together (results, docs, links, community threads).
    • Share the landing page before recording the new explainer video.
  • Next video: a deeper dive into the evaluation framework (judge logic, methodology, why it’s credible).
  • Meeting: shift to Monday (not Friday); send the landing page ahead of time; quick call if needed.
  • After landing page/video: begin prompt & config experiments to improve scores.

Decisions

  • Proceed with a central landing page.
  • Start Reddit + direct outreach for feedback.
  • Record an in-depth framework video (after landing page review).
  • Begin prompting/model-config improvement once outreach assets are live.

Open Items

  • Choose specific subreddits and draft the initial post text.
  • Finalize landing page structure and hosting (e.g., modelmatch.braindrive.ai).
  • Outline of the framework explainer video (sections, examples, visuals).
  • Short DM template for expert outreach.

Hi All,

Below is the recording from today’s ModelMatch progress update call with Navaneeth, followed by an AI powered summary of what we discussed.

Questions, comments, ideas etc welcome as always. Just hit the reply button.

Thanks
Dave W.

Call Recording:

AI Powered Summary:

The call focused on revisions to the rough draft of a ModelMatch landing page that Navaneeth put together.

Key decisions

  • Hero: Replace “leaderboard” framing with a use-case promise.
    • Example: “Choose the best open-source model for your use case.”
    • Subhead: “Research-backed, transparent recommendations.”
    • Single CTA: “View our recommendations” (anchors down the page).
  • Above-the-fold trust: Keep 3–4 simple tiles (e.g., “5 domains,” “20+ runs,” “Open methods & code,” “Research-backed rubrics”). Must fit without scroll on mobile.
  • Recommendations section: Keep structure, simplify labels (“Top recommendations by domain,” “Overall score”), make cards obviously clickable.
  • Copy tone: Shift from academic to 6th-grade readability (“judge stack” → “How we score models”).
  • Methodology area (for 30-minute readers): Explain research sources, rubric design, judge models, scoring; include known limitations.
  • Open source: Highlight just below the fold with “Explore the repo” link.
  • Video: New video focuses on the evaluation process (transparency/rigor), not repeating model picks.

Deliverables

  • Updated landing page with the above changes.
  • Methodology walkthrough video (process, rubrics, judges, scoring, limitations, where to give feedback).

What to track post-launch

  • Hero CTA clicks, scroll to recommendations, repo clicks, time on “How we evaluate,” and video plays.

Hi All,

Please find the recording from today’s ModelMatch update call followed by an AI powered summary of the discussion below.

Questions, comments, ideas etc welcome as always. Just hit the reply button.

Thanks
Dave W.

Recording:

AI Powered Call Summary:

Model Match Meeting Summary — Point by Point

Video Status & Immediate Issues

  • Unedited Loom video shared; background noise present.
  • Zoom playback issues (lag, wrong screen shared); Loom playback OK.
  • UI nit: remove Loom’s control bar in future recordings.

Purpose of This Video (What It Should Be)

  • Audience: Practitioners & builders (IT leads, product folks, power users), not necessarily researchers.
  • Goal: Establish credibility and explain the overall evaluation system so viewers trust it and want to try/share it.

Scope Correction (What Not to Include)

  • Don’t deep-dive every domain’s evaluation (docs already cover).
  • Don’t teach full “how to run it” here—that’s a separate tutorial video.

Core Structure for the Reshoot (“What This Is & How It Works”)

  • Intro (~30s): Problem (too many models; subjective takes; benchmarks don’t capture use-case fit). Model Match = research-backed, use-case-driven, reproducible.

  • Building the Framework (common across domains):

    • Define use case & success criteria (minimal subjectivity).
    • Define rubric from literature (6–7 papers; derive metrics + evaluation plan).
    • Build evaluation prompts (rules/conditions; penalties/increments).
    • Run test evals with judge stack (multi-judge to reduce bias).
    • Aggregate to final scores (weighted averaging; weights documented).
  • Running the System (execution flow; number the steps):

    • (1) Information extraction (ingest input).
    • (2) NLP preprocessing (e.g., token counts, Jaccard/Cosine; structure transcripts).
    • (3) Prompt evaluation (rules/conditions applied to human/AI turns).
    • (4) Score normalization (map different scales to 0–10).
    • (5) Aggregation (apply weights → final score).
  • Tie to One Example (suggest: Therapy):

    • Show how criteria & rubric map to prompts/metrics (disclaimers, safety, clarity; turn-indexed evidence; tier quantization).
  • Where to Go Deep (1–2 min):

    • Briefly show forum/docs links for each use case (research, metrics, prompts). Don’t walk through each.
  • Quick Demo Snippet:

    • Run ONE evaluator (e.g., Therapy). Show inputs → processing → per-metric scores → final weighted score; show judge breakdown & reasoning.
    • Judges: currently OpenAI + Claude; DeepSeek added noise for some domains and may be omitted.

Methodology Highlights to Emphasize

  • Use-case grounded metrics (coverage, alignment, safety, etc.).
  • QAG (Question-Answer Generation) for summarization article–summary alignment.
  • Weighted averaging over metrics (not simple average); weights user-tweakable (sketch below).
  • Reproducibility: open source; inspect/modify code; reproduce results.
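
A minimal sketch of that weighted (not simple) averaging, together with the 0-10 normalization from step (4) of the execution flow above; metric names, scale bounds, and weights are invented for the example:

```python
# Illustrative only: normalize per-metric scores from different scales to 0-10,
# then combine with a weighted (not simple) average. Values are made up.
def normalize(score: float, lo: float, hi: float) -> float:
    return 10 * (score - lo) / (hi - lo)

raw = {"empathy": (4.2, 0, 5), "safety": (0.9, 0, 1), "clarity": (78, 0, 100)}  # (score, lo, hi)
weights = {"empathy": 0.5, "safety": 0.3, "clarity": 0.2}

normalized = {m: normalize(*v) for m, v in raw.items()}
final = sum(normalized[m] * weights[m] for m in weights)
print(normalized, round(final, 2))
```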

Per-Domain Notes Mentioned (Keep Brief in Reshoot)

  • Summarization: Dual input (article + model summary), keypoint extraction, QAG, F1/coverage, Jaccard overlap.
  • Therapy: Turn-indexed evidence, binary subjects, disclaimers/safety checks, tier quantization (0–5) + penalty caps.
  • Finance/Health: Similar structure; finance emphasizes math accuracy/reasoning/disclaimers; health references CONSORT/FIT/SMART plan scoring.
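
For reference, the Jaccard overlap mentioned in the summarization notes is just shared tokens over total unique tokens; a toy version (the real preprocessing presumably tokenizes more carefully):

```python
# Toy Jaccard word-overlap between two texts; illustrative, not the pipeline's tokenizer.
def jaccard(a: str, b: str) -> float:
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

print(round(jaccard("the model summarizes the article well",
                    "the article is summarized well by the model"), 2))  # 0.5
```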

Website Review (modelmatch.braindrive.ai)

  • Site structure/copy looks good.
  • Add explainer video (“Watch the walkthrough”) near methodology section.
  • Make leaderboard scores pop (change color) and match score color in right-hand list.
  • After videos are ready, link/embed them.

Repos, Access & Deployment

  • New repo: braindrive-modelmatch-website (separate from the core modelmatch framework).
  • Permissions granted; contributor will push website files there.
  • After push, deploy to modelmatch.braindrive.ai.

Deliverables & Timeline

  • Reshoot video: Overview (“what this is & how it works”).
  • Second video: “How to use it” (local + Hugging Face).
  • Target for both: next Monday.

Launch & Promotion Plan

  • Announce as public beta next week.
  • Emphasize: open source (MIT), research-backed, transparent, reproducible.
  • Invite feedback and contributions.
  • Encourage use for personal projects and enterprise evaluations.

Misc. UX/Tech Notes from the Session

  • When demoing, show weighted metric sliders; keep defaults for clarity.
  • Judge stack: simple average across judges; metric aggregation uses weights.
  • Point viewers to community.braindrive.ai for Q&A, results, contributions.

Decisions

  • Reshoot the current video with the tightened scope/structure above.
  • Produce a separate “How to use it” tutorial video.
  • Use Therapy as the illustrative example in the overview video.
  • Remove/avoid DeepSeek in the judge stack where it adds noise.

Outstanding To-Dos

  • Reshoot & edit overview video.
  • Record “How to use it” tutorial (local + Hugging Face Spaces).
  • Update website (score styling; embed/link videos).
  • Push site to braindrive-modelmatch-website; deploy.
  • Add relevant links (docs, HF Spaces, GitHub, community) in video descriptions and on the site.

Hi All,

Here is the update from my call with Navaneeth today. In today’s call we discussed next steps for ModelMatch: using the evaluation system to create a system prompt that improves the performance of our recommended therapy models. Once we have this, it can be used along with the recommended model in BrainDrive via your BrainDrive’s persona builder, which will be super cool.

Questions, comments, ideas and concerns welcome as always. Just hit the reply button.

Thanks!
Dave W.

FYI the new BrainDrive ModelMatch website is live at modelmatch.braindrive.ai! Check it out and let us know what you think.

Thanks
Dave W.