Skip to main content

Agent Benchmarks

We don't just claim our agents work. We prove it.

97.2% Overall Accuracy
2,400 Questions Tested
24 Specialist Agents
0 Agents Failed

Agent Scorecard

Every agent tested on 100 domain-specific questions spanning intermediate, advanced, and expert difficulty. Passing threshold: 70%.

Agent Score Categories Difficulty Spread
Analytics Agent
100%
7 Int 100% Adv 100% Exp 100%
Campaign Orchestrator
100%
10 Int 100% Adv 100% Exp 100%
Cohort Orchestrator
100%
12 Int 100% Adv 100% Exp 100%
Content Strategy Agent
100%
10 Int 100% Adv 100% Exp 100%
Documentation Agent
100%
10 Int 100% Adv 100% Exp 100%
Email Marketing Agent
100%
14 Int 100% Adv 100% Exp 100%
Marketing Strategist
100%
16 Int 100% Adv 100% Exp 100%
Reddit Specialist
100%
5 Int 100% Adv 100% Exp 100%
Coding Orchestrator
99%
9 Int 100% Adv 100% Exp 97%
Supervisor Agent
99%
10 Int 100% Adv 98% Exp 100%
Brand Design Agent
98%
7 Int 96% Adv 100% Exp 100%
Media Production Agent
98%
4 Int 96% Adv 100% Exp 100%
QA Agent
98%
10 Int 100% Adv 98% Exp 97%
Security Agent
98%
10 Int 95% Adv 98% Exp 100%
Setup Guide
98%
8 Int 100% Adv 96% Exp 100%
System Coder
97%
11 Int 95% Adv 96% Exp 100%
CEO Agent
96%
8 Int 100% Adv 96% Exp 93%
Python Developer
95%
8 Int 100% Adv 100% Exp 83%
Code Archaeologist
94%
9 Int 90% Adv 96% Exp 93%
Database Developer
94%
10 Int 100% Adv 100% Exp 80%
LinkedIn Specialist
94%
15 Int 88% Adv 100% Exp 100%
Web Developer
93%
7 Int 90% Adv 100% Exp 83%
JavaScript Developer
91%
8 Int 90% Adv 94% Exp 87%
Hardware Agent
90%
10 Int 80% Adv 88% Exp 100%

Inference Performance

Benchmarked on consumer GPUs using llama.cpp with qwen3.5:9b (Q4_K_M). No cloud GPUs. No API keys. These are the numbers you'll see on your hardware.

104 tok/s on RTX 3080 Ti
99 tok/s on RTX 3080
124 tok/s on dual GPU (35B)
262K context on single 12GB GPU
GPU VRAM Smart [S] Smarter [S+] Smartest [S++]
24GB+ (dual 12GB) qwen3.5:9b qwen3.5:9b (thinking) qwen3.5:35b-a3b (MoE)
10-12GB qwen3.5:9b qwen3.5:9b (thinking) qwen3.5:9b (thinking)
6-10GB qwen3.5:4b qwen3.5:4b (thinking) qwen3.5:9b
4-6GB qwen3.5:2b qwen3.5:4b (thinking) qwen3.5:4b
<4GB qwen3.5:2b qwen3.5:2b (thinking) qwen3.5:2b

Tier assignments are auto-detected at startup. Override any tier's model in Settings > Model Tiers.

Context Capacity
262K tokens
full context, single RTX 3080 12GB
40K tokens
interactive speed (under 50ms/token)
KV Cache Placement
99
tok/s (GPU cache)
34
tok/s (RAM cache)
KV cache on system RAM is 3x slower. Cohort ships defaults that keep the cache on GPU -- no configuration required.
On-Demand Escalation Server (dual-GPU)

On dual-GPU systems, the 35B model starts on a separate port, serves the Smartest request, and auto-stops after 30 seconds idle. The primary 9B server is never interrupted. Single-GPU users get automatic CPU offload with no configuration.

Category Breakdowns

Drill into each agent's domain expertise. Every category represents a distinct skill area tested independently.

Analytics Agent

100%
Statistical Analysis 27/27
Data Visualization 16/16
Product Analytics 13/13
Operational Analytics 13/13
Financial Metrics 11/11
Business Intelligence 10/10
Marketing Analytics 10/10

Campaign Orchestrator

100%
Attribution Modeling 13/13
A/B Testing 13/13
Budgeting Roi 13/13
Tracking Analytics 12/12
Platform Specific 12/12
Performance Analytics 11/11
Regional Compliance 8/8
Crisis Management 7/7
Campaign Strategy 6/6
Launch Sequencing 5/5

Cohort Orchestrator

100%
Distributed Systems 14/14
Multi Agent Coordination 11/11
Error Handling 11/11
Orchestration Patterns 10/10
Workflow State 10/10
Message Passing 9/9
Resource Allocation 8/8
Task Decomposition 6/6
Monitoring Observability 6/6
Rollback Compensation 6/6
Process Governance 5/5
Agent Assessment 4/4

Content Strategy Agent

100%
Content Strategy Planning 28/28
SEO 24/24
Content Analytics 14/14
Pool Industry 10/10
Multi Platform Strategy 6/6
Editorial Workflows 5/5
Copywriting 4/4
Content Creation 4/4
Audience Development 3/3
Distribution Channels 2/2

Documentation Agent

100%
API Docs 17/17
Docs As Code 15/15
Information Architecture 12/12
Writing Craft 11/11
Localization 9/9
Style Guides 8/8
Diagramming 8/8
Markdown 7/7
Metrics 7/7
Accessibility 6/6

Email Marketing Agent

100%
Email Deliverability 22/22
Email UX 13/13
Email Authentication 10/10
Email Analytics 10/10
Marketing Platforms 8/8
List Management 7/7
Testing Optimization 6/6
Copywriting 5/5
Lifecycle Email 5/5
Compliance 4/4
Pre Launch 3/3
Email Strategy 3/3
Crowdfunding 3/3
Email Types 1/1

Marketing Strategist

100%
Marketing Analytics 13/13
SEO Technical 9/9
PPC / Google Ads 9/9
Digital Marketing Strategy 8/8
FTC Compliance 7/7
Social Media Marketing 6/6
Trademark Usage 6/6
Customer Segmentation 5/5
Brand Development 5/5
Video Licensing 5/5
Crowdfunding Marketing 5/5
Pre Launch Audience 5/5
Content Marketing 5/5
SEO On-Page 5/5
Marketing Funnels 4/4
SEO Off-Page 3/3

Reddit Specialist

100%
Platform Mechanics 42/42
Community Engagement 18/18
Content Strategy 16/16
API Integration 13/13
Advertising 11/11

Coding Orchestrator

99%
Task Routing 17/17
Architecture 16/16
Cross Language 11/11
Code Review 10/10
Security 10/10
Testing 9/9
Cloud 8/8
Priority 5/5
CI/CD 13/14

Supervisor Agent

99%
Monitoring 16/16
Agent Performance 13/13
Incident 10/10
Chaos 9/9
Compliance 9/9
Quality 7/7
Process 7/7
Capacity 7/7
Risk 5/5
SRE 16/17

Brand Design Agent

98%
Color And Typography 25/25
Technical Production 18/18
Ai Image Generation 10/10
Accessibility 7/7
Logo Design 6/6
Brand Strategy 17/18
Design Fundamentals 15/16

Media Production Agent

98%
Post Production 44/44
Licensing 14/14
Video Production 31/32
Platform Optimization 9/10

QA Agent

98%
Test Strategy 13/13
Coverage Analysis 10/10
Test Process 10/10
Bug Reporting 9/9
Release Management 9/9
Performance Testing 7/7
Security Testing 7/7
ISTQB 5/5
Edge Case Analysis 16/17
API Testing 12/13

Security Agent

98%
Injection 20/20
Auth 15/15
Web Headers 11/11
OWASP Top 10 10/10
Supply Chain 8/8
Secrets Management 8/8
Threat Modeling 4/4
Infrastructure 4/4
Python Security 10/11
Cryptography 8/9

Setup Guide

98%
Hardware Vram 19/19
Troubleshooting 14/14
Model Selection 13/13
Onboarding 8/8
Communication 7/7
Cohort Ecosystem 7/7
Ollama Setup 21/22
Security Privacy 9/10

System Coder

97%
Docker 14/14
Kubernetes 12/12
Terraform 11/11
Linux 10/10
CI/CD 9/9
Powershell 7/7
Networking 4/4
Git 2/2
Bash 17/18
Ansible 7/8
Security 4/5

CEO Agent

96%
Performance 11/11
Resource Management 11/11
Communication 10/10
Product Strategy 9/9
Decision Frameworks 16/17
Leadership 14/15
Strategic Planning 13/14
Risk Management 12/13

Python Developer

95%
API Design 13/13
Security 11/11
ORM / Database 10/10
Data Processing 9/9
Core Python 20/21
Testing 13/14
Async 13/15
DevOps 6/7

Code Archaeologist

94%
Code Metrics 15/15
Static Analysis 12/12
Design Patterns 11/11
Architecture 9/9
Dependencies 8/8
Documentation 11/12
Diagramming 10/11
Tech Debt 7/8
Flow Analysis 11/14

Database Developer

94%
Transactions 12/12
SQL Queries 11/11
Indexing 10/10
Migrations 9/9
Security 7/7
Data Modeling 7/7
NoSQL 6/6
PostgreSQL 11/12
Schema Design 7/8
Optimization 14/18

LinkedIn Specialist

94%
Social Selling 8/8
Personal Branding 7/7
LinkedIn API 7/7
LinkedIn Analytics 6/6
Lead Generation 4/4
B2B Marketing 4/4
Company Page Optimization 3/3
LinkedIn Premium 3/3
Employee Advocacy 3/3
Events And Live 3/3
LinkedIn Newsletters 2/2
Thought Leadership 1/1
LinkedIn Advertising 20/21
LinkedIn Algorithm 12/13
Content Strategy 11/15

Web Developer

93%
Performance 19/19
Design Systems 9/9
HTML Semantics 7/7
Cross Browser 6/6
Responsive Design 10/11
CSS Architecture 19/21
Accessibility 23/27

JavaScript Developer

91%
Testing 10/10
State Management 9/9
Build Tools 8/8
Performance 5/5
React 16/17
Node.js 12/13
Typescript 19/22
Core Javascript 12/16

Hardware Agent

90%
Storage 18/18
Networking 9/9
Remote Management 6/6
Home Lab 4/4
Virtualization 10/11
Power Cooling 9/10
Cpu 13/15
Form Factors 5/6
Gpu 10/13
Memory 6/8

How We Test

Question Design

Each agent is tested on 100 domain-specific questions written to probe real-world competence, not trivia. Questions span three difficulty tiers with weighted representation.
  • Intermediate (20%) -- Core concepts every practitioner should know
  • Advanced (50%) -- Applied scenarios requiring synthesis across sub-domains
  • Expert (30%) -- Edge cases, tradeoffs, and production gotchas

Anti-Hallucination Controls

Default LLM temperature (0.8) causes agents to hallucinate confidently. Every Cohort agent runs at a tuned temperature calibrated through A/B testing to minimize fabrication while preserving useful reasoning.
  • Agent-specific temperatures (0.15 - 0.35) vs default 0.8
  • A/B tested: 3/3 agents hallucinated at 0.8, 0/3 at tuned temps
  • External grading -- agents don't score themselves

Infrastructure

All tests run on local hardware with local models -- no cloud API calls, no cherry-picked examples, no prompt engineering to boost scores. The same model and configuration used in production is what gets tested.
  • Model: qwen3.5:9b (dense 9B parameter, 6.6GB VRAM)
  • Hardware: Consumer GPU (RTX 3080 12GB)
  • Fully reproducible -- same model, same prompts, same temperatures

Build With Agents You Can Trust

Every agent ships with the same assessment pipeline. Run benchmarks on your own hardware.

Get Started