
Under the Hood

Cohort's architecture is designed for one thing: getting multiple AI agents to produce better results than any single agent could alone.

The Pipeline

1. Define Your Task

Describe what you need -- a website, a code review, a content plan, a security audit. One sentence to get started. The agents will ask for what they need.

2. Cohort Selects Agents

The orchestrator identifies which specialists are needed and assembles the team. No manual configuration.

3. Agents Collaborate

Roundtable discussions, breakout groups, and parallel execution. Agents debate, disagree, and converge on the best approach.

4. You Get Results

A complete deliverable -- not a draft, not a suggestion. Working code, live websites, formatted reports.
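The four steps above can be sketched end to end as plain functions. This is an illustrative toy, not Cohort's actual API: the roster, keyword matching, and function names are all assumptions made for the sketch.

```python
# Hypothetical sketch of the four-step pipeline -- names and logic
# are illustrative, not Cohort's real implementation.

AGENT_ROSTER = {
    "security_auditor": {"audit", "security", "vulnerability"},
    "web_developer": {"website", "frontend", "html"},
    "content_strategist": {"content", "plan", "narrative"},
}

def select_agents(task: str) -> list[str]:
    """Step 2: pick specialists whose keywords overlap the task."""
    words = set(task.lower().split())
    return [name for name, kw in AGENT_ROSTER.items() if words & kw]

def collaborate(agents: list[str], task: str) -> str:
    """Step 3: stand-in for the roundtable -- each agent contributes once."""
    return "\n".join(f"{a}: contribution on '{task}'" for a in agents)

def run(task: str) -> str:
    """Steps 1-4: task description in, synthesized deliverable out."""
    return collaborate(select_agents(task), task)

print(run("security audit of the website"))
```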

Core Capabilities

[RT]

Agent Roundtables

Assemble 3-8 specialist agents into a structured discussion. Each agent brings a distinct persona, domain expertise, and priorities. The conversation produces a synthesis that no single agent would have reached alone.

Cohort supports both sequential roundtables (each agent responds in turn) and compiled mode (all personas loaded into a single LLM call for 90% token reduction). Choose the mode that fits your latency and depth requirements.
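The difference between the two modes can be sketched with a stub in place of the model call. The call counts are the point: sequential mode makes one LLM call per agent, compiled mode makes exactly one. The stub and function names here are illustrative assumptions.

```python
# Minimal sketch of sequential vs compiled roundtables, with a stub
# standing in for the local model. Illustrative, not Cohort's code.

def llm(prompt: str) -> str:
    """Stub for a local model call."""
    return f"response to [{prompt[:30]}...]"

PERSONAS = ["BOSS", "Marketing", "Content Strategy", "Product Manager"]

def sequential_roundtable(topic: str) -> tuple[list[str], int]:
    """One LLM call per agent: deeper individual turns, N calls."""
    replies = [llm(f"As the {p} agent, respond to: {topic}") for p in PERSONAS]
    return replies, len(PERSONAS)

def compiled_roundtable(topic: str) -> tuple[str, int]:
    """All personas loaded into a single prompt: one call, far fewer tokens."""
    roster = ", ".join(PERSONAS)
    return llm(f"Simulate a roundtable between {roster} on: {topic}"), 1

_, seq_calls = sequential_roundtable("elevator pitch")
_, comp_calls = compiled_roundtable("elevator pitch")
print(seq_calls, comp_calls)  # 4 vs 1
```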

Real conversation -- #elevator-pitch-workshop, 7 agents
BOSS Agent
The CEO's draft needs sharper differentiation by highlighting the unique 76-agent team dynamic and strict process enforcement rather than generic features. The pitch must emphasize that BOSS ships finished work -- not suggestions.
Marketing Agent
The pitch must shift from describing features to reframing BOSS as a self-running company rather than an employee-like AI. Replace jargon with relatable analogies, lead with the founder's story.
Content Strategy Agent
The narrative structure is backwards. Every draft opens with what BOSS is. Non-technical audiences don't care what it is until they care why it exists. The pitch needs a story arc: tension, revelation, proof.
Product Manager Agent
Strong contributions across the board -- but one critical gap: who is this for? Every pitch describes what BOSS is and what it does. None answer the question: what job is the buyer hiring BOSS to do?
CEO Agent (synthesis)
Six agents, one prompt, and we've gone from a flat feature list to a category-creating narrative in under 10 minutes. Structure: Content Strategy's arc wins. Frame: Marketing's category-creation angle. Proof point: Sales's code queue story. One round of refinement, then we ship it.

From a real roundtable session. Each agent pushed back on the previous contribution. The CEO synthesized the final direction from all inputs.

[S]

Contribution Scoring

Every potential response is scored across 4 weighted dimensions before an agent speaks. High-scoring agents speak next. Low-scoring agents yield the floor. The conversation self-organizes toward the experts who actually have something to add.

This eliminates the most common failure mode in multi-agent systems: circular conversations where agents politely agree with each other until the token budget runs out.

From cohort/meeting.py -- calculate_contribution_score()
# Dimensions: novelty (35%), expertise (30%), ownership (20%), question (15%)
score = 0.0

# Novelty -- is this new information or a repeat?
novelty = calculate_novelty(proposed_message, recent_messages[-5:])
score += SCORING_WEIGHTS["novelty"] * novelty

# Expertise -- does this agent's domain match the topic?
score += SCORING_WEIGHTS["expertise"] * calculate_expertise_relevance(
    agent_config, topic_keywords
)

# Ownership -- is this agent a primary stakeholder?
if agent_id in primary_stakeholders:
    score += SCORING_WEIGHTS["ownership"]

# Direct question -- was this agent asked specifically?
if is_directly_questioned(agent_id, recent_messages):
    score += SCORING_WEIGHTS["question"]

Novelty dominates at 35% -- repeating what someone already said tanks your score. Explicit @mentions bypass scoring entirely, so you can always summon a specific agent.
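The `calculate_novelty` helper isn't shown in the excerpt. One plausible implementation (an assumption, not necessarily Cohort's) scores a message by its worst-case token overlap with the recent window:

```python
def calculate_novelty(proposed: str, recent: list[str]) -> float:
    """Illustrative novelty measure: 1 minus the highest Jaccard
    similarity between the proposed message's token set and any
    recent message's token set. An exact repeat scores 0.0."""
    tokens = set(proposed.lower().split())
    if not tokens or not recent:
        return 1.0
    best = 0.0
    for msg in recent:
        other = set(msg.lower().split())
        if other:
            best = max(best, len(tokens & other) / len(tokens | other))
    return 1.0 - best

print(calculate_novelty("ship it today", ["ship it today"]))   # 0.0
print(calculate_novelty("entirely new point", ["ship it today"]))  # 1.0
```

This is why "repeating what someone already said tanks your score": a verbatim repeat drives novelty to zero, wiping out the largest weighted dimension.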

[$0]

Zero API Costs

Cohort runs on Ollama and llama.cpp out of the box. A single consumer GPU runs the default model fast enough for real-time multi-agent discussions. Your data never leaves your machine.

Cloud APIs are supported as an optional escalation path, but the default pipeline is fully local. No API keys required to get started.
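Since Cohort's only runtime dependencies include httpx, talking to a local Ollama server looks roughly like this. Only the payload is built here; the commented-out httpx call is what would actually hit the server. The helper name is an assumption.

```python
# Sketch of a local Ollama chat request. Payload fields follow
# Ollama's /api/chat endpoint; the helper itself is illustrative.
OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_payload(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # one complete response instead of a token stream
    }

payload = build_chat_payload("qwen3.5:9b", "Summarize the roundtable.")
# import httpx
# reply = httpx.post(OLLAMA_URL, json=payload, timeout=120).json()
# print(reply["message"]["content"])
```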

104 tokens/sec generation -- qwen3.5:9b on RTX 3080 Ti 12GB
97.2% assessment pass rate -- 2,400 questions, 23 agents
$0 API costs -- all three tiers run locally

Benchmarked on consumer hardware: RTX 3080 Ti 12GB, running llama.cpp with qwen3.5:9b (Q4_K_M). No cloud GPUs, no API keys.

[T]

Response Tiers

Three inference tiers -- Smart, Smarter, and Smartest -- let you match model depth to task complexity. Cohort auto-detects your GPU VRAM at startup and assigns the right model to each tier. No manual configuration.

Smart uses a lean model with no thinking tokens for fast, low-cost tasks. Smarter enables extended thinking for deeper reasoning on the same hardware. Smartest escalates to the largest model available -- the 35B MoE on dual-GPU systems, a larger dense model on single-GPU. Users can override any tier's primary or fallback model in Settings.

Tier | Best For | 10-12GB GPU Default | Cost
[S] Smart | Short answers, quick lookups, summaries | qwen3.5:9b | Free
[S+] Smarter | Multi-step reasoning, code review, analysis | qwen3.5:9b (thinking) | Free
[S++] Smartest | Architecture decisions, complex debugging, long-form drafts | qwen3.5:9b (thinking) | Free
[S++] Smartest | Same tasks, cloud model quality | Cloud API (your key) | Per-token

Smartest on 24GB+ dual-GPU systems uses the qwen3.5:35b-a3b MoE model at 124 tok/s -- fully local, zero per-token cost. Cloud API is an optional override, not a requirement.
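The auto-detection described above might map detected VRAM to tier defaults roughly like this. The model names come from the table; the thresholds and function are illustrative assumptions, not Cohort's actual detection logic.

```python
def assign_tiers(vram_gb: float) -> dict[str, str]:
    """Illustrative VRAM -> tier mapping. Thresholds are assumptions;
    model names match the documented defaults."""
    if vram_gb >= 24:
        smartest = "qwen3.5:35b-a3b"        # MoE on dual-GPU systems
    else:
        smartest = "qwen3.5:9b (thinking)"  # single-GPU fallback
    return {
        "smart": "qwen3.5:9b",
        "smarter": "qwen3.5:9b (thinking)",
        "smartest": smartest,
    }

print(assign_tiers(12)["smartest"])  # qwen3.5:9b (thinking)
print(assign_tiers(24)["smartest"])  # qwen3.5:35b-a3b
```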

[G]

Spending Safeguards

Multi-agent systems can burn through tokens fast. Cohort ships with built-in spending controls that prevent runaway usage -- no configuration required to get sane defaults. All limits are adjustable in Settings > Model Tiers.

Daily token limit -- 500K tokens/day default. Over-budget requests degrade gracefully to the local model -- no error, no surprise cloud bill.
Monthly token limit -- 10M tokens/month default. Budget queries existing message metadata -- zero new infrastructure.
Escalation rate limit -- 30 calls/hour cap for large model escalations. Prevents runaway workflows from consuming the GPU.
Thinking drain detection -- auto-retries without thinking tokens if the model wastes its budget. Catches degenerate reasoning loops.
Agent loop prevention -- 5 responses/minute cap per agent, conversation depth limit of 5. Prevents recursive agent loops.
Graceful degradation -- over-budget requests fall back to the local model automatically. No error shown. Cloud limits never cause a failure.
[P]

Protocol-First Architecture

Agents, storage backends, and inference providers are all duck-typed protocols. Implement the interface, and Cohort uses it. No base classes to inherit, no vendor SDK to import. pip install cohort pulls only two lightweight dependencies (httpx + fastmcp).

Swap Ollama for llama.cpp. Replace SQLite with Postgres. Bring your own agents written in any framework. Cohort doesn't care -- it talks protocols, not implementations.

From cohort/registry.py -- the entire agent interface
@runtime_checkable
class AgentProfile(Protocol):
    """Duck-typed interface for agent profiles.
    Any object with these attributes satisfies the protocol.
    No subclassing required."""

    name: str
    role: str
    capabilities: list[str]

    def relevance_score(self, topic: str) -> float: ...
    def can_contribute(self, context: dict) -> bool: ...

That's the complete agent contract. Three attributes, two methods. If your object has them, Cohort can use it -- no imports, no inheritance.
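Because the protocol is `@runtime_checkable`, any plain object with the right shape passes an `isinstance` check. A minimal sketch (the protocol restated from above to keep the snippet self-contained; the `SecurityAuditor` class is a hypothetical example):

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable

@runtime_checkable
class AgentProfile(Protocol):
    name: str
    role: str
    capabilities: list[str]

    def relevance_score(self, topic: str) -> float: ...
    def can_contribute(self, context: dict) -> bool: ...

@dataclass
class SecurityAuditor:
    """No Cohort imports, no base class -- just the right shape."""
    name: str = "security_auditor"
    role: str = "Security Auditor"
    capabilities: list[str] = field(default_factory=lambda: ["audit", "threat-modeling"])

    def relevance_score(self, topic: str) -> float:
        return 1.0 if "security" in topic.lower() else 0.1

    def can_contribute(self, context: dict) -> bool:
        return bool(context.get("needs_review"))

agent = SecurityAuditor()
print(isinstance(agent, AgentProfile))  # True -- duck typing, checked at runtime
```

Note the asymmetry: `isinstance` verifies the attributes and methods exist, while a static type checker verifies their signatures. Together they make the contract enforceable without any shared base class.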

[B]

Executive Briefings

Auto-generated HTML reports show what your AI team did overnight: agent activity narratives, intel summaries, task execution metrics, and system health. Delivered to your inbox or served from a local dashboard.

See a real sample briefing →

Executive Briefing
Mon Mar 10, 2026
Agents Active
14
Intel Articles
23
System Health
OK
Agent narratives, task execution history, tech intel digest, and system health metrics -- generated automatically at 6am daily.
[M]

Persistent Memory

Every agent has a persistent memory store that accumulates domain knowledge across sessions. When a security agent learns about a new vulnerability class, that knowledge is available in every future audit -- not just the current conversation.

Overnight training pipelines research topics, curate materials from a document library, inject verified facts, and assess agent performance automatically. Agents don't just run tasks -- they develop expertise over time.

From python_developer/memory.json -- accumulated knowledge
// Training scores
{ "fact": "Completed training on parallel_change_pattern (score: 100.0%)",
  "learned_from": "teacher_agent", "confidence": "high" }

// Assessment remediation -- learned from mistakes
{ "fact": "Always include concrete code snippets when explaining Python concepts - never describe without demonstrating",
  "learned_from": "assessment_remediation" }

// Deep domain knowledge injected by training pipeline
{ "fact": "asyncio.TaskGroup (Python 3.11+) provides structured concurrency: all tasks awaited together, if any raises an exception all siblings are cancelled. Replaces gather().",
  "learned_from": "pattern_knowledge", "confidence": "medium" }

Real entries from the python_developer agent. Training scores, lessons from failed assessments, and domain knowledge -- all persisted across sessions and used in future tasks.
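Retrieving those accumulated facts at task time might look like this. The entry schema mirrors the excerpt above; the `recall` helper and its ranking are illustrative assumptions, not Cohort's API.

```python
# Illustrative retrieval over memory entries shaped like the excerpt.
MEMORY = [
    {"fact": "Completed training on parallel_change_pattern (score: 100.0%)",
     "learned_from": "teacher_agent", "confidence": "high"},
    {"fact": "asyncio.TaskGroup (Python 3.11+) provides structured concurrency",
     "learned_from": "pattern_knowledge", "confidence": "medium"},
]

def recall(keyword: str, min_confidence: str = "medium") -> list[str]:
    """Return stored facts matching a keyword at or above a confidence level."""
    rank = {"low": 0, "medium": 1, "high": 2}
    floor = rank[min_confidence]
    return [m["fact"] for m in MEMORY
            if keyword.lower() in m["fact"].lower()
            and rank.get(m["confidence"], 0) >= floor]

print(recall("taskgroup"))
```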

See It In Action