Under the Hood
Cohort's architecture is designed for one thing: getting multiple AI agents to produce better results than any single agent could alone.
Local by Default. Cloud by Choice. Safe Either Way.
Cohort adapts to your hardware, your budget, and your privacy requirements. No forced dependencies. No hidden API calls. You decide what runs where.
Do I need an API key? No. Cohort auto-detects your GPU, picks the right model for your VRAM, and runs all three tiers locally -- 104 tok/s on a consumer GPU. See response tiers ↓
Can I use Claude or GPT? Yes. Bring your own API key and override any tier. Cloud is always opt-in, per tier, with your key used directly. No platform markup. See pricing →
What stops runaway costs? Token budgets, rate limits, and loop prevention. When a task goes over budget, Cohort silently degrades to local models. No surprise bills. All limits are configurable. View safeguards ↓
The Pipeline
Define Your Task
Describe what you need -- a website, a code review, a content plan, a security audit. One sentence to get started. The agents will ask for what they need.
Cohort Selects Agents
The orchestrator identifies which specialists are needed and assembles the team. No manual configuration.
Agents Collaborate
Roundtable discussions, breakout groups, and parallel execution. Agents debate, disagree, and converge on the best approach.
You Get Results
A complete deliverable -- not a draft, not a suggestion. Working code, live websites, formatted reports.
Core Capabilities
Agent Roundtables
Assemble 3-8 specialist agents into a structured discussion. Each agent brings a distinct persona, domain expertise, and priorities. The conversation produces a synthesis that no single agent would have reached alone.
Cohort supports both sequential roundtables (each agent responds in turn) and compiled mode (all personas loaded into a single LLM call for 90% token reduction). Choose the mode that fits your latency and depth requirements.
From a real roundtable session. Each agent pushed back on the previous contribution. The CEO synthesized the final direction from all inputs.
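The difference between the two modes comes down to how many LLM calls a round takes. A minimal illustration -- the function names, prompt shapes, and the `llm` callable are all assumptions for the example, not Cohort's actual API:

```python
# Minimal illustration of the two roundtable modes. Function names, prompt
# shapes, and the `llm` callable are assumptions for the example.
def sequential_round(agents: list, topic: str, llm) -> list:
    """One LLM call per agent; each sees the transcript so far."""
    transcript = []
    for agent in agents:
        prompt = (
            f"You are {agent['name']}, {agent['persona']}.\n"
            f"Topic: {topic}\n" + "\n".join(transcript)
        )
        transcript.append(f"{agent['name']}: {llm(prompt)}")
    return transcript

def compiled_round(agents: list, topic: str, llm) -> str:
    """All personas in one call: one shared context instead of N,
    which is where the token savings come from."""
    roster = "\n".join(f"- {a['name']}: {a['persona']}" for a in agents)
    return llm(
        f"Simulate a roundtable between these specialists:\n{roster}\nTopic: {topic}"
    )
```

Sequential pays for the growing transcript on every call but gives each agent a real turn; compiled trades that depth for a single shared context.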
Contribution Scoring
Every potential response is scored across four weighted dimensions before an agent speaks. High-scoring agents speak next; low-scoring agents yield the floor. The conversation self-organizes toward the experts who actually have something to add.
This eliminates the most common failure mode in multi-agent systems: circular conversations where agents politely agree with each other until the token budget runs out.
    # Dimensions: novelty (35%), expertise (30%), ownership (20%), question (15%)

    # Novelty -- is this new information or a repeat?
    novelty = calculate_novelty(proposed_message, recent_messages[-5:])
    score += SCORING_WEIGHTS["novelty"] * novelty

    # Expertise -- does this agent's domain match the topic?
    score += SCORING_WEIGHTS["expertise"] * calculate_expertise_relevance(
        agent_config, topic_keywords
    )

    # Ownership -- is this agent a primary stakeholder?
    if agent_id in primary_stakeholders:
        score += SCORING_WEIGHTS["ownership"]

    # Direct question -- was this agent asked specifically?
    if is_directly_questioned(agent_id, recent_messages):
        score += SCORING_WEIGHTS["question"]
Novelty dominates at 35% -- repeating what someone already said tanks your score. Explicit @mentions bypass scoring entirely, so you can always summon a specific agent.
Zero API Costs
Cohort runs on Ollama and llama.cpp out of the box. A single consumer GPU runs the default model fast enough for real-time multi-agent discussions. Your data never leaves your machine.
Cloud APIs are supported as an optional escalation path, but the default pipeline is fully local. No API keys required to get started.
Benchmarked on consumer hardware: RTX 3080 Ti 12GB, running llama.cpp with qwen3.5:9b (Q4_K_M). No cloud GPUs, no API keys.
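Getting a local response needs nothing beyond Ollama's standard HTTP endpoint. A minimal sketch using only the Python standard library -- the model name mirrors the default tier from this page; adjust it to whatever you have pulled:

```python
import json
import urllib.request

# Ollama's default local endpoint; no API key involved.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "qwen3.5:9b") -> dict:
    """Assemble a request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local(prompt: str) -> str:
    """Send the prompt to the local Ollama daemon and return its response text."""
    data = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Everything stays on localhost: the prompt, the model, and the response never leave the machine.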
Response Tiers
Three inference tiers -- Smart, Smarter, and Smartest -- let you match model depth to task complexity. Cohort auto-detects your GPU VRAM at startup and assigns the right model to each tier. No manual configuration.
Smart uses a lean model with no thinking tokens for fast, low-cost tasks. Smarter enables extended thinking for deeper reasoning on the same hardware. Smartest escalates to the largest model available -- the 35B MoE on dual-GPU systems, a larger dense model on single-GPU systems. You can override any tier's primary or fallback model in Settings.
| Tier | Best For | 10-12GB GPU Default | Cost |
|---|---|---|---|
| [S] Smart | Short answers, quick lookups, summaries | qwen3.5:9b | Free |
| [S+] Smarter | Multi-step reasoning, code review, analysis | qwen3.5:9b (thinking) | Free |
| [S++] Smartest | Architecture decisions, complex debugging, long-form drafts | qwen3.5:9b (thinking) | Free |
| [S++] Smartest | Same tasks, cloud model quality | Cloud API (your key) | Per-token |
Smartest on 24GB+ dual-GPU systems uses the qwen3.5:35b-a3b MoE model at 124 tok/s -- fully local, zero per-token cost. Cloud API is an optional override, not a requirement.
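The auto-assignment described above can be approximated in a few lines. This is an illustrative sketch, not Cohort's detection code -- the thresholds and the sub-10GB fallback are assumptions; the model names come from the tier table:

```python
# Illustrative sketch of VRAM-based tier assignment. Thresholds and the
# sub-10GB fallback model are assumptions, not Cohort's actual logic.
def assign_tiers(vram_gb: float, gpu_count: int = 1) -> dict:
    if gpu_count >= 2 and vram_gb >= 24:
        # Dual-GPU systems: Smartest escalates to the 35B MoE.
        return {
            "smart": "qwen3.5:9b",
            "smarter": "qwen3.5:9b (thinking)",
            "smartest": "qwen3.5:35b-a3b",
        }
    if vram_gb >= 10:
        # The 10-12GB consumer-GPU defaults from the tier table.
        return {
            "smart": "qwen3.5:9b",
            "smarter": "qwen3.5:9b (thinking)",
            "smartest": "qwen3.5:9b (thinking)",
        }
    # Below 10GB: drop to a heavier quantization (hypothetical fallback).
    return {tier: "qwen3.5:9b (Q4_K_M)" for tier in ("smart", "smarter", "smartest")}
```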
Spending Safeguards
Multi-agent systems can burn through tokens fast. Cohort ships with built-in spending controls that prevent runaway usage -- no configuration required to get sane defaults. All limits are adjustable in Settings > Model Tiers.
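A guard with that behavior might look like the following sketch. The class and method names are hypothetical, but the core idea is the one described above: check the budget and rate limit before each call, and route to the free local tier instead of failing:

```python
import time

# Hypothetical sketch of the safeguard behavior: a token budget, a per-minute
# rate limit, and silent degradation to the free local tier. Names are assumed.
class SpendGuard:
    def __init__(self, token_budget: int, max_calls_per_min: int):
        self.token_budget = token_budget
        self.max_calls_per_min = max_calls_per_min
        self.spent = 0
        self.call_times = []

    def route(self, requested_tier: str, est_tokens: int) -> str:
        now = time.monotonic()
        # Keep only calls from the last 60 seconds for the rate limit.
        self.call_times = [t for t in self.call_times if now - t < 60]
        over_budget = self.spent + est_tokens > self.token_budget
        rate_limited = len(self.call_times) >= self.max_calls_per_min
        if over_budget or rate_limited:
            # Degrade silently to local inference instead of failing or billing.
            return "local"
        self.call_times.append(now)
        self.spent += est_tokens
        return requested_tier
```

Because the guard returns a tier rather than raising, the conversation keeps flowing when the budget runs out; it just stops costing money.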
Protocol-First Architecture
Agents, storage backends, and inference providers are all duck-typed protocols. Implement the interface, and Cohort uses it. No base classes to inherit, no vendor SDK to import. pip install cohort pulls only two lightweight dependencies (httpx + fastmcp).
Swap Ollama for llama.cpp. Replace SQLite with Postgres. Bring your own agents written in any framework. Cohort doesn't care -- it talks protocols, not implementations.
    from typing import Protocol, runtime_checkable

    @runtime_checkable
    class AgentProfile(Protocol):
        """Duck-typed interface for agent profiles.

        Any object with these attributes satisfies the protocol.
        No subclassing required.
        """

        name: str
        role: str
        capabilities: list[str]

        def relevance_score(self, topic: str) -> float: ...
        def can_contribute(self, context: dict) -> bool: ...
That's the complete agent contract. Three attributes, two methods. If your object has them, Cohort can use it -- no imports, no inheritance.
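As a concrete illustration, here is a plain class that satisfies a protocol of that shape -- the SecurityAuditor agent is invented for this example, and nothing about it imports or inherits from Cohort:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentProfile(Protocol):
    name: str
    role: str
    capabilities: list[str]
    def relevance_score(self, topic: str) -> float: ...
    def can_contribute(self, context: dict) -> bool: ...

# A plain class from any framework: no Cohort imports, no base class.
# SecurityAuditor is invented for this example.
class SecurityAuditor:
    name = "security_auditor"
    role = "Application security specialist"
    capabilities = ["threat modeling", "dependency audit"]

    def relevance_score(self, topic: str) -> float:
        return 1.0 if "security" in topic.lower() else 0.2

    def can_contribute(self, context: dict) -> bool:
        return bool(context)
```

Because the protocol is `@runtime_checkable`, `isinstance(SecurityAuditor(), AgentProfile)` passes purely on structure: the attributes and methods exist, so the contract is met.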
Executive Briefings
Auto-generated HTML reports show what your AI team did overnight: agent activity narratives, intel summaries, task execution metrics, and system health. Delivered to your inbox or served from a local dashboard.
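At its core, a briefing like that is string assembly over a night's worth of metrics. A toy sketch -- the function, sections, and field names are assumed for illustration:

```python
# Toy sketch of briefing assembly; section and field names are assumed.
def render_briefing(agent_activity: dict, health: str) -> str:
    rows = "\n".join(
        f"<tr><td>{agent}</td><td>{tasks}</td></tr>"
        for agent, tasks in agent_activity.items()
    )
    return (
        "<html><body>"
        "<h1>Overnight Briefing</h1>"
        f"<p>System health: {health}</p>"
        "<table><tr><th>Agent</th><th>Tasks completed</th></tr>"
        f"{rows}</table>"
        "</body></html>"
    )
```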
Persistent Memory
Every agent has a persistent memory store that accumulates domain knowledge across sessions. When a security agent learns about a new vulnerability class, that knowledge is available in every future audit -- not just the current conversation.
Overnight training pipelines research topics, curate materials from a document library, inject verified facts, and assess agent performance automatically. Agents don't just run tasks -- they develop expertise over time.
    // Training scores
    {
      "fact": "Completed training on parallel_change_pattern (score: 100.0%)",
      "learned_from": "teacher_agent",
      "confidence": "high"
    }

    // Assessment remediation -- learned from mistakes
    {
      "fact": "Always include concrete code snippets when explaining Python concepts - never describe without demonstrating",
      "learned_from": "assessment_remediation"
    }

    // Deep domain knowledge injected by training pipeline
    {
      "fact": "asyncio.TaskGroup (Python 3.11+) provides structured concurrency: all tasks awaited together, if any raises an exception all siblings are cancelled. Replaces gather().",
      "learned_from": "pattern_knowledge",
      "confidence": "medium"
    }
Real entries from the python_developer agent. Training scores, lessons from failed assessments, and domain knowledge -- all persisted across sessions and used in future tasks.
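A persistence layer behind entries like these could be sketched as an append-only JSONL store per agent. The class and file layout here are assumptions, not Cohort's actual schema:

```python
import json
import tempfile
from pathlib import Path

class AgentMemory:
    """Sketch of a per-agent persistent fact store (class name and
    JSONL layout are assumptions, not Cohort's schema)."""

    def __init__(self, agent_id: str, root: Path):
        self.path = root / f"{agent_id}.jsonl"

    def remember(self, fact: str, learned_from: str, confidence: str = "medium") -> None:
        entry = {"fact": fact, "learned_from": learned_from, "confidence": confidence}
        with self.path.open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def recall(self, keyword: str) -> list:
        if not self.path.exists():
            return []
        with self.path.open() as f:
            entries = [json.loads(line) for line in f]
        return [e for e in entries if keyword.lower() in e["fact"].lower()]

# Write a fact in one "session", recall it from a fresh instance -- knowledge
# survives because it lives on disk, not in the conversation context.
root = Path(tempfile.mkdtemp())
AgentMemory("python_developer", root).remember(
    "asyncio.TaskGroup (Python 3.11+) provides structured concurrency",
    learned_from="pattern_knowledge",
)
found = AgentMemory("python_developer", root).recall("TaskGroup")
```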