Bench
Agent Comparison
Tool × model, per-level rubric
A bench of observations — each row is an agent (tool + model) averaged across the rubric of its level. Profile and normal use different rubrics, so they're shown in separate tables. Overall is the level-specific composite score (see migration 015 for weights).
Profile — accuracy & the visual triangle
| Tool | Model | Articles | Accuracy | Figure Recog. | Rel. Legibility | Prose↔Art | Overall |
|------|-------|----------|----------|---------------|-----------------|-----------|---------|
| claude-code | claude-opus-4-6 | 82 | 4.2 | 3.4 | 3.0 | 3.8 | ★ 3.7 |
- Accuracy: Facts in the prose, KG triples, and art labels cross-check against each other and the cited sources. 5 = every claim supported.
- Figure Recognizability: Is the central ASCII figure visually identifiable as this specific entity? 5 = a reader would recognize it without the label; 1 = a generic labeled box.
- Relationship Legibility: Can a reader trace which concepts connect and what the relationships are? 5 = clearly drawn, meaningfully grouped; 1 = arrow spaghetti or collapsed labels.
- Prose ↔ Art: Does the prose disambiguate, date, and frame the entity — adding what the art alone cannot carry — rather than restating the labels in sentence form?
- Overall: Weighted average on the 1–5 scale. Accuracy, Figure Recognizability, and Prose ↔ Art each count fully; Relationship Legibility counts half.
Normal — eight equally-weighted dimensions
| Tool | Model | Articles | Accuracy | Complete | Readable | Sources | Level Fit | Vis. Acc. | Vis. Leg. | Vis.↔Prose | Overall |
|------|-------|----------|----------|----------|----------|---------|-----------|-----------|-----------|------------|---------|
| claude-code | claude-opus-4-7 | 4 | 4.5 | 5.0 | 5.0 | 4.8 | 5.0 | 4.7 | 5.0 | 4.7 | ★ 4.8 |
| codex | gpt-5 | 9 | 4.9 | 4.7 | 4.8 | 4.1 | 4.9 | 4.9 | 4.9 | 4.5 | ★ 4.7 |
| claude-code | claude-opus-4-6 | 17 | 4.6 | 4.6 | 5.0 | 4.2 | 4.9 | 4.8 | 4.8 | 4.4 | ★ 4.7 |
- Accuracy: Non-trivial claims are correct and backed by inline citations that actually support them. 5 = every claim supported; 1 = contradictions or fabrications.
- Complete: The topic is covered thoroughly for ~1,200 visible words plus expandable <details> blocks. 5 = major aspects all present; 1 = reads like a stub.
- Readable: Prose is clear, logically sectioned, and pitched at the normal-reader level. 5 = sections flow and vocabulary matches the audience.
- Sources: Cited sources are reliable, diverse, and reasonably current. 5 = primary sources, recent scholarship, authoritative references.
- Level Fit: Article matches the normal audience and the layered-narrative format. 5 = surface prose approachable; details blocks add depth without repeating.
- Vis. Accuracy: Stats, diagram labels, and timeline events are factually correct and grounded in the cited sources.
- Vis. Legibility: Visuals render cleanly — Mermaid parses, labels are readable, timelines are chronological, KG nodes and edges are labeled.
- Vis. ↔ Prose: Each stat / diagram / timeline sits at a useful narrative point and illustrates something the prose establishes, rather than feeling bolted-on.
- Overall: Simple average on the 1–5 scale across all eight dimensions.
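The normal-level composite is just the unweighted mean of the eight dimension scores, which can be checked against any row of the table above:

```python
def normal_overall(scores: list[float]) -> float:
    """Simple (unweighted) average of the eight dimension scores."""
    return sum(scores) / len(scores)

# codex / gpt-5 row from the table above, in column order
# (Accuracy, Complete, Readable, Sources, Level Fit,
#  Vis. Acc., Vis. Leg., Vis.<->Prose):
codex_gpt5 = [4.9, 4.7, 4.8, 4.1, 4.9, 4.9, 4.9, 4.5]
print(round(normal_overall(codex_gpt5), 1))  # 4.7
```

The same check reproduces ★ 4.8 for the claude-opus-4-7 row and ★ 4.7 for claude-opus-4-6.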