Hi everyone,
I’ve merged a fix for Ollama integration that addresses persona configuration issues.
The Problem
Personas were not applying correctly to Ollama models, and Ollama was using default context lengths (2048 or 4096 tokens, depending on the model) instead of the persona’s configured context window.
The root cause: Persona LLM config fields follow OpenAI-style formatting, but Ollama requires different field names and structure. Parameters need to be placed inside an options object rather than at the root level.
Parameter Mapping Example:
| Persona Config | Ollama Config (in `options`) |
|---|---|
| `context_window: 8192` | `num_ctx: 8192` |
| `temperature: 0.7` | `temperature: 0.7` |
| `top_p: 0.9` | `top_p: 0.9` |
| `top_k: 40` | `top_k: 40` |
| `frequency_penalty: 0.5` | `repeat_penalty: 1.25` |
| `presence_penalty: 0.5` | `repeat_penalty: 1.25` |
| `stop_sequences: [...]` | `stop: [...]` |
Note: `max_tokens` does not come from the persona config; it is set separately in the backend and mapped to Ollama's `num_predict`.
Without this mapping, Ollama was cutting tokens from the top of the context, including parts of the system prompt and relevant retrieved context.
Context Window Examples:
- gemma3:1b: 32,768 tokens (BrainDrive UI allows max 30,000)
- qwen3:8b: 32,768 tokens (BrainDrive UI allows max 30,000)
- llama3.2:3b: 128,000 tokens
Tip: Run ollama show <model_name> to check the context length for any model.
The Fix
The fix now correctly maps persona configuration to Ollama’s expected format. All parameters are properly translated and wrapped in the options object per Ollama’s API specification.
What’s Next: RAG Optimization Strategy
This fix revealed opportunities to optimize how we handle context. Here’s what I’m planning:
1. Chunk Size Optimization - BrainDriveAI/Document-Chat-Service
Reduce chunk sizes to suit models with an 8,000-token context window. Most modern Ollama models support at least 8K of context, with many supporting significantly higher limits.
2. Dynamic Chunk Retrieval - BrainDriveAI/Document-Chat-Service
Update the chat-with-documents backend to dynamically determine the number of chunks to return based on the selected LLM’s capabilities. Different models have different context capacities, and retrieval should adapt accordingly.
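As a rough sketch of what "adapt accordingly" could mean: derive the chunk count from the model's context window and the share of it reserved for retrieval. The function name, the 45% retrieval share, and the 512-token chunk size are assumptions for illustration, not settled values.

```python
# Illustrative sketch: scale the number of retrieved chunks to the model.
# The 45% share and 512-token chunk size are assumed example values.

def chunks_for_model(context_window: int,
                     chunk_tokens: int = 512,
                     retrieval_share: float = 0.45) -> int:
    """Spend ~retrieval_share of the context window on retrieved chunks."""
    budget = int(context_window * retrieval_share)
    return max(1, budget // chunk_tokens)
```

With these assumptions, an 8K-context model would get 7 chunks while a 32K-context model like qwen3:8b would get 28, instead of a fixed count for every model.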
3. Reverse Chunk Ordering - BrainDriveAI/Document-Chat-Service
Reverse the order of retrieved chunks so the most relevant content appears last in the context. Since Ollama may strip tokens from the top when approaching context limits, placing the most relevant information at the bottom ensures it’s preserved.
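Since retrievers typically return chunks most-relevant-first, the reordering itself is a one-liner; the sketch below just makes the intent explicit (names are illustrative):

```python
# Sketch: put the most relevant chunk last, closest to the user's question,
# so top-of-context truncation discards the least relevant chunks first.

def order_chunks_for_prompt(chunks_by_relevance: list[str]) -> list[str]:
    """Input is most-relevant-first; output is most-relevant-last."""
    return list(reversed(chunks_by_relevance))
```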
4. Dynamic Context Normalizer with Auto-Compact - BrainDriveAI/BrainDrive-Core
Implement a context manager that:
- Summarizes previous conversation history up to “n” messages
- Compresses context to ensure: system prompt + history + retrieved chunks + latest message fit within the context window
- Dynamically adjusts based on the selected model (initially optimized for 8000-token models)
Suggested Context Allocation Strategy (for 8K token models):
| Component | Percentage | Approx Tokens | Notes |
|---|---|---|---|
| System Prompt | 12.5% | ~1,000 tokens (max) | Protected, never truncated |
| Previous Conversation History | 20% | ~1,600 tokens | Summarized if needed |
| Retrieved Context | 45% | ~3,600 tokens | Most relevant chunks (reversed order) |
| User Query | 7.5% | ~600 tokens (max) | Current message |
| Buffer for LLM Response | 15% | ~1,200 tokens | Critical to prevent response cutoff |
| Total | 100% | ~8,000 tokens | Dynamically managed |
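The allocation table above can be expressed as a small budget function, which also shows how it would scale beyond 8K models. This is a sketch of the suggested strategy, not an implementation; the component names are mine.

```python
# Sketch of the suggested allocation strategy, scaled to any context window.
# Shares come straight from the table above; component names are illustrative.
ALLOCATION = {
    "system_prompt": 0.125,     # protected, never truncated
    "history": 0.20,            # summarized if needed
    "retrieved_context": 0.45,  # most relevant chunks, reversed order
    "user_query": 0.075,        # current message
    "response_buffer": 0.15,    # prevents response cutoff
}

def token_budgets(context_window: int = 8000) -> dict[str, int]:
    """Return a per-component token budget for the given context window."""
    return {name: int(context_window * share)
            for name, share in ALLOCATION.items()}
```

For an 8,000-token model this reproduces the table exactly (1,000 / 1,600 / 3,600 / 600 / 1,200); for a 32K model each component scales proportionally unless caps like the 1,000-token system-prompt maximum are applied on top.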
Implementation Plan
These RAG optimizations will be addressed in upcoming PRs. The dynamic context normalizer and allocation strategy will be the priority, followed by chunk optimization.
Related: Issue #192
Let me know your thoughts or suggestions on the context allocation strategy!