Agent Benchmarks
We don't just claim our agents work. We prove it.
Agent Scorecard
Every agent is tested on 100 domain-specific questions spanning intermediate, advanced, and expert difficulty. The passing threshold is 70%.
| Agent | Score | Categories |
|---|---|---|
| Analytics Agent | 100% | 7 |
| Campaign Orchestrator | 100% | 10 |
| Cohort Orchestrator | 100% | 12 |
| Content Strategy Agent | 100% | 10 |
| Documentation Agent | 100% | 10 |
| Email Marketing Agent | 100% | 14 |
| Marketing Strategist | 100% | 16 |
| Reddit Specialist | 100% | 5 |
| Coding Orchestrator | 99% | 9 |
| Supervisor Agent | 99% | 10 |
| Brand Design Agent | 98% | 7 |
| Media Production Agent | 98% | 4 |
| QA Agent | 98% | 10 |
| Security Agent | 98% | 10 |
| Setup Guide | 98% | 8 |
| System Coder | 97% | 11 |
| CEO Agent | 96% | 8 |
| Python Developer | 95% | 8 |
| Code Archaeologist | 94% | 9 |
| Database Developer | 94% | 10 |
| LinkedIn Specialist | 94% | 15 |
| Web Developer | 93% | 7 |
| JavaScript Developer | 91% | 8 |
| Hardware Agent | 90% | 10 |
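To make the scorecard arithmetic concrete, here is a minimal sketch of how a score and a pass/fail verdict could be derived from a count of correct answers. The function names are hypothetical, and treating the 70% threshold as inclusive is an assumption; the page does not say how a score of exactly 70% is handled.

```python
PASS_THRESHOLD = 0.70       # passing threshold stated on the scorecard
QUESTIONS_PER_AGENT = 100   # questions per agent stated on the scorecard


def score(correct: int, total: int = QUESTIONS_PER_AGENT) -> float:
    """Return the fraction of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct / total


def passes(correct: int, total: int = QUESTIONS_PER_AGENT) -> bool:
    """Assumption: a score exactly at the threshold counts as a pass."""
    return score(correct, total) >= PASS_THRESHOLD


if __name__ == "__main__":
    # Hardware Agent: 90 of 100 correct, the lowest score in the table.
    print(f"{score(90):.0%} passes: {passes(90)}")
```

Under this reading, every agent in the table clears the threshold by at least 20 percentage points.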
Inference Performance
Benchmarked on consumer GPUs using llama.cpp with qwen3.5:9b (Q4_K_M). No cloud GPUs. No API keys. These are the numbers you'll see on your hardware.
| GPU VRAM | Smart [S] | Smarter [S+] | Smartest [S++] |
|---|---|---|---|
| 24GB+ (dual 12GB) | qwen3.5:9b | qwen3.5:9b (thinking) | qwen3.5:35b-a3b (MoE) |
| 10-12GB | qwen3.5:9b | qwen3.5:9b (thinking) | qwen3.5:9b (thinking) |
| 6-10GB | qwen3.5:4b | qwen3.5:4b (thinking) | qwen3.5:9b |
| 4-6GB | qwen3.5:2b | qwen3.5:4b (thinking) | qwen3.5:4b |
| <4GB | qwen3.5:2b | qwen3.5:2b (thinking) | qwen3.5:2b |
Tier assignments are auto-detected at startup. Override any tier's model in Settings > Model Tiers.
On dual-GPU systems, the 35B model starts on a separate port, serves the Smartest request, and auto-stops after 30 seconds idle. The primary 9B server is never interrupted. Single-GPU users get automatic CPU offload with no configuration.
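The tier table above can be read as a simple lookup keyed on available VRAM. The sketch below is one possible encoding, not the product's actual detection logic. It makes several assumptions that the page does not confirm: `select_tiers` and `TierModels` are hypothetical names, each row's lower bound is treated as inclusive, and systems between 12GB and 24GB fall back to the 10-12GB row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierModels:
    smart: str      # Smart [S]
    smarter: str    # Smarter [S+]
    smartest: str   # Smartest [S++]


# (minimum VRAM in GB, models), ordered from the largest row to the smallest.
# Assumption: lower bounds are inclusive; the table itself lists ranges.
TIER_TABLE = [
    (24.0, TierModels("qwen3.5:9b", "qwen3.5:9b (thinking)", "qwen3.5:35b-a3b (MoE)")),
    (10.0, TierModels("qwen3.5:9b", "qwen3.5:9b (thinking)", "qwen3.5:9b (thinking)")),
    (6.0, TierModels("qwen3.5:4b", "qwen3.5:4b (thinking)", "qwen3.5:9b")),
    (4.0, TierModels("qwen3.5:2b", "qwen3.5:4b (thinking)", "qwen3.5:4b")),
    (0.0, TierModels("qwen3.5:2b", "qwen3.5:2b (thinking)", "qwen3.5:2b")),
]


def select_tiers(vram_gb: float) -> TierModels:
    """Pick the first row whose minimum VRAM the system meets."""
    if vram_gb < 0:
        raise ValueError("vram_gb must be non-negative")
    for min_vram, models in TIER_TABLE:
        if vram_gb >= min_vram:
            return models
    raise AssertionError("unreachable: the last row has a minimum of 0")


if __name__ == "__main__":
    print(select_tiers(12.0))  # a single 12GB card lands in the 10-12GB row
```

Ordering the rows from largest to smallest keeps the lookup a single pass and makes a manual override as simple as replacing one entry.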
Category Breakdowns
Drill into each agent's domain expertise. Every category represents a distinct skill area tested independently.
- Analytics Agent: 100%
- Campaign Orchestrator: 100%
- Cohort Orchestrator: 100%
- Content Strategy Agent: 100%
- Documentation Agent: 100%
- Email Marketing Agent: 100%
- Marketing Strategist: 100%
- Reddit Specialist: 100%
- Coding Orchestrator: 99%
- Supervisor Agent: 99%
- Brand Design Agent: 98%
- Media Production Agent: 98%
- QA Agent: 98%
- Security Agent: 98%
- Setup Guide: 98%
- System Coder: 97%
- CEO Agent: 96%
- Python Developer: 95%
- Code Archaeologist: 94%
- Database Developer: 94%
- LinkedIn Specialist: 94%
- Web Developer: 93%
- JavaScript Developer: 91%
- Hardware Agent: 90%
How We Test
Question Design
- Intermediate (20%) -- Core concepts every practitioner should know
- Advanced (50%) -- Applied scenarios requiring synthesis across sub-domains
- Expert (30%) -- Edge cases, tradeoffs, and production gotchas
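As a quick worked example, the difficulty mix above translates into concrete question counts for a 100-question test. This is a small sketch with a hypothetical helper name; integer percentages are used to avoid floating-point rounding.

```python
# Difficulty shares from the list above, as integer percentages.
DIFFICULTY_MIX = {"intermediate": 20, "advanced": 50, "expert": 30}


def questions_per_level(total: int = 100) -> dict[str, int]:
    """Split a question budget across difficulty levels."""
    counts = {level: total * pct // 100 for level, pct in DIFFICULTY_MIX.items()}
    if sum(counts.values()) != total:
        # Happens when the total does not divide evenly across the shares.
        raise ValueError("total does not split evenly across the mix")
    return counts


if __name__ == "__main__":
    print(questions_per_level())
```

For the 100-question test described on this page, that works out to 20 intermediate, 50 advanced, and 30 expert questions per agent.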
Anti-Hallucination Controls
- Agent-specific temperatures (0.15 to 0.35) instead of the default 0.8
- A/B tested: 3 of 3 agents hallucinated at temperature 0.8; 0 of 3 did at their tuned temperatures
- External grading -- agents don't score themselves
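The temperature control above amounts to a per-agent override with a fallback to the default. The sketch below shows one way this could be configured. The individual agent values are hypothetical placeholders chosen inside the stated 0.15 to 0.35 range; the page does not publish the per-agent settings, and the function names are illustrative only.

```python
DEFAULT_TEMPERATURE = 0.8    # default stated above
TUNED_RANGE = (0.15, 0.35)   # tuned range stated above

# Hypothetical per-agent values; the real settings are not published here.
AGENT_TEMPERATURES = {
    "Security Agent": 0.15,
    "Content Strategy Agent": 0.35,
}


def temperature_for(agent: str) -> float:
    """Return the tuned temperature for an agent, or the default if untuned."""
    return AGENT_TEMPERATURES.get(agent, DEFAULT_TEMPERATURE)


def is_tuned(temperature: float) -> bool:
    """Check whether a temperature falls inside the tuned range."""
    low, high = TUNED_RANGE
    return low <= temperature <= high


if __name__ == "__main__":
    for name in ("Security Agent", "Unlisted Agent"):
        temp = temperature_for(name)
        print(f"{name}: {temp} (tuned: {is_tuned(temp)})")
```

Keeping the default as an explicit fallback makes it easy to spot any agent still running at 0.8.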
Infrastructure
- Model: qwen3.5:9b (dense, 9B parameters, 6.6GB VRAM)
- Hardware: Consumer GPU (RTX 3080 12GB)
- Fully reproducible -- same model, same prompts, same temperatures
Build With Agents You Can Trust
Every agent ships with the same assessment pipeline. Run benchmarks on your own hardware.