Agent Benchmarks

We don't just claim our agents work. We prove it.

97.2% Overall Accuracy

2,400 Questions Tested

24 Specialist Agents

0 Agents Failed

Agent Scorecard

Every agent is tested on 100 domain-specific questions spanning intermediate, advanced, and expert difficulty. Passing threshold: 70%.

Agent	Score	Categories	Difficulty Spread
Analytics Agent	100%	7	Int 100% Adv 100% Exp 100%
Campaign Orchestrator	100%	10	Int 100% Adv 100% Exp 100%
Cohort Orchestrator	100%	12	Int 100% Adv 100% Exp 100%
Content Strategy Agent	100%	10	Int 100% Adv 100% Exp 100%
Documentation Agent	100%	10	Int 100% Adv 100% Exp 100%
Email Marketing Agent	100%	14	Int 100% Adv 100% Exp 100%
Marketing Strategist	100%	16	Int 100% Adv 100% Exp 100%
Reddit Specialist	100%	5	Int 100% Adv 100% Exp 100%
Coding Orchestrator	99%	9	Int 100% Adv 100% Exp 97%
Supervisor Agent	99%	10	Int 100% Adv 98% Exp 100%
Brand Design Agent	98%	7	Int 96% Adv 100% Exp 100%
Media Production Agent	98%	4	Int 96% Adv 100% Exp 100%
QA Agent	98%	10	Int 100% Adv 98% Exp 97%
Security Agent	98%	10	Int 95% Adv 98% Exp 100%
Setup Guide	98%	8	Int 100% Adv 96% Exp 100%
System Coder	97%	11	Int 95% Adv 96% Exp 100%
CEO Agent	96%	8	Int 100% Adv 96% Exp 93%
Python Developer	95%	8	Int 100% Adv 100% Exp 83%
Code Archaeologist	94%	9	Int 90% Adv 96% Exp 93%
Database Developer	94%	10	Int 100% Adv 100% Exp 80%
LinkedIn Specialist	94%	15	Int 88% Adv 100% Exp 100%
Web Developer	93%	7	Int 90% Adv 100% Exp 83%
JavaScript Developer	91%	8	Int 90% Adv 94% Exp 87%
Hardware Agent	90%	10	Int 80% Adv 88% Exp 100%
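The headline figures can be rechecked from the scorecard itself. The sketch below is a minimal illustration, not the project's actual tooling: it copies the per-agent scores from the table and assumes each agent answered exactly 100 questions, as stated above. The function name `summarize` is made up for this example.

```python
PASS_THRESHOLD = 70  # percent, the passing threshold stated above

# Per-agent scores (percent of 100 questions), copied from the scorecard.
SCORES = {
    "Analytics Agent": 100, "Campaign Orchestrator": 100,
    "Cohort Orchestrator": 100, "Content Strategy Agent": 100,
    "Documentation Agent": 100, "Email Marketing Agent": 100,
    "Marketing Strategist": 100, "Reddit Specialist": 100,
    "Coding Orchestrator": 99, "Supervisor Agent": 99,
    "Brand Design Agent": 98, "Media Production Agent": 98,
    "QA Agent": 98, "Security Agent": 98, "Setup Guide": 98,
    "System Coder": 97, "CEO Agent": 96, "Python Developer": 95,
    "Code Archaeologist": 94, "Database Developer": 94,
    "LinkedIn Specialist": 94, "Web Developer": 93,
    "JavaScript Developer": 91, "Hardware Agent": 90,
}


def summarize(scores, threshold=PASS_THRESHOLD, questions_per_agent=100):
    """Recompute the headline numbers from per-agent percent scores."""
    total_questions = questions_per_agent * len(scores)
    # With 100 questions per agent, a percent score equals the correct count.
    total_correct = sum(s * questions_per_agent / 100 for s in scores.values())
    return {
        "agents": len(scores),
        "questions": total_questions,
        "accuracy": round(100 * total_correct / total_questions, 1),
        "failed": [name for name, s in scores.items() if s < threshold],
    }


summary = summarize(SCORES)
print(summary)
```

With the table's values this gives 24 agents, 2,400 questions, 97.2% accuracy, and an empty list of failed agents.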

Inference Performance

Benchmarked on consumer GPUs using llama.cpp with qwen3.5:9b (Q4_K_M). No cloud GPUs. No API keys. Expect similar numbers on comparable hardware.

104 tok/s on RTX 3080 Ti

99 tok/s on RTX 3080

124 tok/s on dual GPU (35B)

262K context on single 12GB GPU

GPU VRAM	Smart [S]	Smarter [S+]	Smartest [S++]
24GB+ (dual 12GB)	qwen3.5:9b	qwen3.5:9b (thinking)	qwen3.5:35b-a3b (MoE)
10-12GB	qwen3.5:9b	qwen3.5:9b (thinking)	qwen3.5:9b (thinking)
6-10GB	qwen3.5:4b	qwen3.5:4b (thinking)	qwen3.5:9b
4-6GB	qwen3.5:2b	qwen3.5:4b (thinking)	qwen3.5:4b
<4GB	qwen3.5:2b	qwen3.5:2b (thinking)	qwen3.5:2b

Tier assignments are auto-detected at startup. Override any tier's model in Settings > Model Tiers.
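As a rough illustration of how the table could be applied, the sketch below maps detected VRAM to the three tiers and then applies user overrides. It is not Cohort's actual detection code: the function name `pick_tiers`, the tier keys, and the handling of boundary values (exactly 4, 6, or 10 GB goes to the higher row; 12 to 24 GB uses the 10-12GB row) are assumptions, since the table's ranges overlap at their edges.

```python
# Rows from the tier table: (minimum VRAM in GB, Smart, Smarter, Smartest).
# The minimums are an assumed reading of the table's ranges.
TIER_TABLE = [
    (24, "qwen3.5:9b", "qwen3.5:9b (thinking)", "qwen3.5:35b-a3b (MoE)"),
    (10, "qwen3.5:9b", "qwen3.5:9b (thinking)", "qwen3.5:9b (thinking)"),
    (6, "qwen3.5:4b", "qwen3.5:4b (thinking)", "qwen3.5:9b"),
    (4, "qwen3.5:2b", "qwen3.5:4b (thinking)", "qwen3.5:4b"),
    (0, "qwen3.5:2b", "qwen3.5:2b (thinking)", "qwen3.5:2b"),
]


def pick_tiers(vram_gb, overrides=None):
    """Return the model for each tier, with optional per-tier overrides."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    for min_vram, smart, smarter, smartest in TIER_TABLE:
        if vram_gb >= min_vram:
            tiers = {"S": smart, "S+": smarter, "S++": smartest}
            break
    # Overrides stand in for the Settings > Model Tiers screen.
    tiers.update(overrides or {})
    return tiers


print(pick_tiers(12))
print(pick_tiers(8, overrides={"S++": "qwen3.5:4b"}))
```

Keeping the table as data rather than nested conditionals makes it easy to compare against the published rows.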

Context Capacity

262K tokens

full context, single RTX 3080 12GB

40K tokens

interactive speed (under 50ms/token)
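The 50ms/token budget and the tok/s figures describe the same quantity in different units: 50 ms per token corresponds to 20 tokens per second. A small conversion sketch, with helper names chosen only for this example:

```python
def ms_per_token(tokens_per_second):
    """Convert throughput (tok/s) into per-token latency (ms)."""
    return 1000.0 / tokens_per_second


def is_interactive(tokens_per_second, budget_ms=50.0):
    """True when per-token latency is strictly under the budget."""
    return ms_per_token(tokens_per_second) < budget_ms


print(round(ms_per_token(104), 1))  # latency at the RTX 3080 Ti figure
print(is_interactive(104))
```

At 104 tok/s the latency works out to about 9.6 ms per token, well inside the interactive budget.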

KV Cache Placement

tok/s (GPU cache)

tok/s (RAM cache)

Keeping the KV cache in system RAM is 3x slower than keeping it on the GPU. Cohort ships defaults that keep the cache on GPU -- no configuration required.

On-Demand Escalation Server (dual-GPU)

On dual-GPU systems, the 35B model starts on a separate port, serves the Smartest request, and auto-stops after 30 seconds idle. The primary 9B server is never interrupted. Single-GPU users get automatic CPU offload with no configuration.
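The start-on-demand, stop-when-idle behavior can be sketched as a small state tracker. This is an illustration of the idea only, not Cohort's implementation: the class name `OnDemandServer` and its methods are hypothetical, and the places where a real system would launch or terminate the 35B server process are marked with comments.

```python
import time


class OnDemandServer:
    """Tracks whether a hypothetical escalation server should be running."""

    def __init__(self, idle_timeout=30.0, clock=time.monotonic):
        self.idle_timeout = idle_timeout  # seconds of idleness before stopping
        self.clock = clock                # injectable so the logic can be tested
        self.running = False
        self.last_used = None

    def handle_request(self):
        """Start on first use and record when the request arrived."""
        if not self.running:
            self.running = True  # a real system would launch the process here
        self.last_used = self.clock()

    def tick(self):
        """Stop once the idle timeout has elapsed; return the running state."""
        if self.running and self.clock() - self.last_used >= self.idle_timeout:
            self.running = False  # a real system would terminate the process here
        return self.running


# Usage with a fake clock so the timeout can be exercised instantly.
now = [0.0]
server = OnDemandServer(clock=lambda: now[0])
server.handle_request()
now[0] = 29.0
print(server.tick())  # still within the idle window
now[0] = 30.0
print(server.tick())  # idle timeout reached
```

Calling `tick()` from a periodic timer keeps the stop logic separate from request handling, which is one way to avoid touching the primary server.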

Category Breakdowns

Drill into each agent's domain expertise. Every category represents a distinct skill area tested independently.

Analytics Agent

100%

Statistical Analysis 27/27

Data Visualization 16/16

Product Analytics 13/13

Operational Analytics 13/13

Financial Metrics 11/11

Business Intelligence 10/10

Marketing Analytics 10/10

Campaign Orchestrator

100%

Attribution Modeling 13/13

A/B Testing 13/13

Budgeting ROI 13/13

Tracking Analytics 12/12

Platform Specific 12/12

Performance Analytics 11/11

Regional Compliance 8/8

Crisis Management 7/7

Campaign Strategy 6/6

Launch Sequencing 5/5

Cohort Orchestrator

100%

Distributed Systems 14/14

Multi-Agent Coordination 11/11

Error Handling 11/11

Orchestration Patterns 10/10

Workflow State 10/10

Message Passing 9/9

Resource Allocation 8/8

Task Decomposition 6/6

Monitoring Observability 6/6

Rollback Compensation 6/6

Process Governance 5/5

Agent Assessment 4/4

Content Strategy Agent

100%

Content Strategy Planning 28/28

SEO 24/24

Content Analytics 14/14

Pool Industry 10/10

Multi-Platform Strategy 6/6

Editorial Workflows 5/5

Copywriting 4/4

Content Creation 4/4

Audience Development 3/3

Distribution Channels 2/2

Documentation Agent

100%

API Docs 17/17

Docs As Code 15/15

Information Architecture 12/12

Writing Craft 11/11

Localization 9/9

Style Guides 8/8

Diagramming 8/8

Markdown 7/7

Metrics 7/7

Accessibility 6/6

Email Marketing Agent

100%

Email Deliverability 22/22

Email UX 13/13

Email Authentication 10/10

Email Analytics 10/10

Marketing Platforms 8/8

List Management 7/7

Testing Optimization 6/6

Copywriting 5/5

Lifecycle Email 5/5

Compliance 4/4

Pre-Launch 3/3

Email Strategy 3/3

Crowdfunding 3/3

Email Types 1/1

Marketing Strategist

100%

Marketing Analytics 13/13

SEO Technical 9/9

PPC / Google Ads 9/9

Digital Marketing Strategy 8/8

FTC Compliance 7/7

Social Media Marketing 6/6

Trademark Usage 6/6

Customer Segmentation 5/5

Brand Development 5/5

Video Licensing 5/5

Crowdfunding Marketing 5/5

Pre-Launch Audience 5/5

Content Marketing 5/5

SEO On-Page 5/5

Marketing Funnels 4/4

SEO Off-Page 3/3

Reddit Specialist

100%

Platform Mechanics 42/42

Community Engagement 18/18

Content Strategy 16/16

API Integration 13/13

Advertising 11/11

Coding Orchestrator

99%

Task Routing 17/17

Architecture 16/16

Cross-Language 11/11

Code Review 10/10

Security 10/10

Testing 9/9

Cloud 8/8

Priority 5/5

CI/CD 13/14

Supervisor Agent

99%

Monitoring 16/16

Agent Performance 13/13

Incident 10/10

Chaos 9/9

Compliance 9/9

Quality 7/7

Process 7/7

Capacity 7/7

Risk 5/5

SRE 16/17

Brand Design Agent

98%

Color And Typography 25/25

Technical Production 18/18

AI Image Generation 10/10

Accessibility 7/7

Logo Design 6/6

Brand Strategy 17/18

Design Fundamentals 15/16

Media Production Agent

98%

Post Production 44/44

Licensing 14/14

Video Production 31/32

Platform Optimization 9/10

QA Agent

98%

Test Strategy 13/13

Coverage Analysis 10/10

Test Process 10/10

Bug Reporting 9/9

Release Management 9/9

Performance Testing 7/7

Security Testing 7/7

ISTQB 5/5

Edge Case Analysis 16/17

API Testing 12/13

Security Agent

98%

Injection 20/20

Auth 15/15

Web Headers 11/11

OWASP Top 10 10/10

Supply Chain 8/8

Secrets Management 8/8

Threat Modeling 4/4

Infrastructure 4/4

Python Security 10/11

Cryptography 8/9

Setup Guide

98%

Hardware VRAM 19/19

Troubleshooting 14/14

Model Selection 13/13

Onboarding 8/8

Communication 7/7

Cohort Ecosystem 7/7

Ollama Setup 21/22

Security Privacy 9/10

System Coder

97%

Docker 14/14

Kubernetes 12/12

Terraform 11/11

Linux 10/10

CI/CD 9/9

PowerShell 7/7

Networking 4/4

Git 2/2

Bash 17/18

Ansible 7/8

Security 4/5

CEO Agent

96%

Performance 11/11

Resource Management 11/11

Communication 10/10

Product Strategy 9/9

Decision Frameworks 16/17

Leadership 14/15

Strategic Planning 13/14

Risk Management 12/13

Python Developer

95%

API Design 13/13

Security 11/11

ORM / Database 10/10

Data Processing 9/9

Core Python 20/21

Testing 13/14

Async 13/15

DevOps 6/7

Code Archaeologist

94%

Code Metrics 15/15

Static Analysis 12/12

Design Patterns 11/11

Architecture 9/9

Dependencies 8/8

Documentation 11/12

Diagramming 10/11

Tech Debt 7/8

Flow Analysis 11/14

Database Developer

94%

Transactions 12/12

SQL Queries 11/11

Indexing 10/10

Migrations 9/9

Security 7/7

Data Modeling 7/7

NoSQL 6/6

PostgreSQL 11/12

Schema Design 7/8

Optimization 14/18

LinkedIn Specialist

94%

Social Selling 8/8

Personal Branding 7/7

LinkedIn API 7/7

LinkedIn Analytics 6/6

Lead Generation 4/4

B2B Marketing 4/4

Company Page Optimization 3/3

LinkedIn Premium 3/3

Employee Advocacy 3/3

Events And Live 3/3

LinkedIn Newsletters 2/2

Thought Leadership 1/1

LinkedIn Advertising 20/21

LinkedIn Algorithm 12/13

Content Strategy 11/15

Web Developer

93%

Performance 19/19

Design Systems 9/9

HTML Semantics 7/7

Cross-Browser 6/6

Responsive Design 10/11

CSS Architecture 19/21

Accessibility 23/27

JavaScript Developer

91%

Testing 10/10

State Management 9/9

Build Tools 8/8

Performance 5/5

React 16/17

Node.js 12/13

TypeScript 19/22

Core JavaScript 12/16

Hardware Agent

90%

Storage 18/18

Networking 9/9

Remote Management 6/6

Home Lab 4/4

Virtualization 10/11

Power Cooling 9/10

CPU 13/15

Form Factors 5/6

GPU 10/13

Memory 6/8
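The category tallies above roll up into each agent's headline score. A minimal sketch using the Hardware Agent's numbers, copied from its breakdown; the function names `agent_score` and `weakest` are made up for this example:

```python
# (correct, total) per category for the Hardware Agent, from its breakdown.
HARDWARE_AGENT = {
    "Storage": (18, 18),
    "Networking": (9, 9),
    "Remote Management": (6, 6),
    "Home Lab": (4, 4),
    "Virtualization": (10, 11),
    "Power Cooling": (9, 10),
    "CPU": (13, 15),
    "Form Factors": (5, 6),
    "GPU": (10, 13),
    "Memory": (6, 8),
}


def agent_score(categories):
    """Overall percent score: total correct over total questions."""
    correct = sum(c for c, _ in categories.values())
    total = sum(t for _, t in categories.values())
    return round(100 * correct / total)


def weakest(categories, n=3):
    """The n categories with the lowest accuracy, worst first."""
    return sorted(categories, key=lambda k: categories[k][0] / categories[k][1])[:n]


print(agent_score(HARDWARE_AGENT))
print(weakest(HARDWARE_AGENT))
```

Sorting by per-category accuracy rather than by missed-question count highlights small categories where a few misses matter most.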

How We Test

Question Design

Each agent is tested on 100 domain-specific questions written to probe real-world competence, not trivia. Questions span three difficulty tiers with weighted representation.

Intermediate (20%) -- Core concepts every practitioner should know
Advanced (50%) -- Applied scenarios requiring synthesis across sub-domains
Expert (30%) -- Edge cases, tradeoffs, and production gotchas
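The tier weights above determine how per-tier accuracy combines into an overall score. The sketch below assumes a question set with exactly 20 intermediate, 50 advanced, and 30 expert questions; per-agent tier counts are not published here, so that split may not hold exactly for every agent. The function name is hypothetical.

```python
# Assumed question counts per tier for a 100-question set (20% / 50% / 30%).
TIER_QUESTIONS = {"intermediate": 20, "advanced": 50, "expert": 30}


def overall_from_tiers(tier_accuracy, tier_questions=TIER_QUESTIONS):
    """Combine per-tier accuracy (percent) into an overall percent score."""
    total = sum(tier_questions.values())
    correct = sum(tier_accuracy[t] / 100 * n for t, n in tier_questions.items())
    return 100 * correct / total


# Database Developer row from the scorecard: Int 100%, Adv 100%, Exp 80%.
database_developer = {"intermediate": 100, "advanced": 100, "expert": 80}
print(round(overall_from_tiers(database_developer)))
```

Under that assumption the Database Developer row reproduces its published 94%: 20 + 50 + 24 correct answers out of 100.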

Anti-Hallucination Controls

At the default LLM temperature (0.8), the agents we tested hallucinated confidently. Every Cohort agent runs at a tuned temperature, calibrated through A/B testing to minimize fabrication while preserving useful reasoning.

Agent-specific temperatures (0.15 - 0.35) vs default 0.8
A/B tested: 3/3 agents hallucinated at 0.8, 0/3 at tuned temps
External grading -- agents don't score themselves
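One way to guard against an agent slipping back to the default temperature is a configuration check like the sketch below. The agent keys and their values are hypothetical placeholders chosen for illustration; only the 0.15 to 0.35 range and the 0.8 default come from the text above.

```python
TUNED_RANGE = (0.15, 0.35)  # tuned temperature range stated above


def out_of_range(temperatures, low=TUNED_RANGE[0], high=TUNED_RANGE[1]):
    """Return the agents whose temperature falls outside the tuned range."""
    return sorted(
        name for name, temp in temperatures.items() if not (low <= temp <= high)
    )


# Hypothetical configuration; the real per-agent values are not listed here.
example_config = {
    "agent_a": 0.15,
    "agent_b": 0.35,
    "agent_c": 0.8,  # left at the default, so it should be flagged
}
print(out_of_range(example_config))
```

A check like this could run at startup or in CI so that a missing temperature setting is caught before any agent answers a question.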

Infrastructure

All tests run on local hardware with local models -- no cloud API calls, no cherry-picked examples, no prompt engineering to boost scores. The model and configuration we test are the same ones used in production.

Model: qwen3.5:9b (dense, 9B parameters, 6.6GB VRAM)
Hardware: Consumer GPU (RTX 3080 12GB)
Fully reproducible -- same model, same prompts, same temperatures

Build With Agents You Can Trust

Every agent ships with the same assessment pipeline. Run benchmarks on your own hardware.

Get Started