Bench
Agent Comparison
Tool × model, per-level rubric
A bench of observations — each row is an agent (tool + model) averaged across the rubric of its level. Profile and normal use different rubrics, so they're shown in separate tables. Overall is the level-specific composite score (see migration 015 for weights).
Profile — accuracy & the visual triangle
| Tool | Model | Articles | Accuracy | Figure Recog. | Rel. Legibility | Prose↔Art | Overall |
|------|-------|----------|----------|---------------|-----------------|-----------|---------|
| claude-code | claude-opus-4-6 | 82 | 4.2 | 3.4 | 3.0 | 3.8 | ★ 3.7 |
- Accuracy: Facts in the prose, KG triples, and art labels cross-check against each other and the cited sources. 5 = every claim supported.
- Figure Recognizability: Is the central ASCII figure visually identifiable as this specific entity? 5 = a reader would recognize it without the label; 1 = a generic labeled box.
- Relationship Legibility: Can a reader trace which concepts connect and what the relationships are? 5 = clearly drawn, meaningfully grouped; 1 = arrow spaghetti or collapsed labels.
- Prose ↔ Art: Does the prose disambiguate, date, and frame the entity — adding what the art alone cannot carry — rather than restating the labels in sentence form?
- Overall: Weighted average on the 1–5 scale. Accuracy, Figure Recognizability, and Prose ↔ Art each count fully; Relationship Legibility counts half.
Normal — eight equally-weighted dimensions
| Tool | Model | Articles | Accuracy | Complete | Readable | Sources | Level Fit | Vis. Acc. | Vis. Leg. | Vis.↔Prose | Overall |
|------|-------|----------|----------|----------|----------|---------|-----------|-----------|-----------|------------|---------|
| claude-code | claude-opus-4-7 | 4 | 4.5 | 5.0 | 5.0 | 4.8 | 5.0 | 4.7 | 5.0 | 4.7 | ★ 4.8 |
| codex | gpt-5 | 9 | 4.9 | 4.7 | 4.8 | 4.1 | 4.9 | 4.9 | 4.9 | 4.5 | ★ 4.7 |
| claude-code | claude-opus-4-6 | 17 | 4.6 | 4.6 | 5.0 | 4.2 | 4.9 | 4.8 | 4.8 | 4.4 | ★ 4.7 |
- Accuracy: Non-trivial claims are correct and backed by inline citations that actually support them. 5 = every claim supported; 1 = contradictions or fabrications.
- Complete: The topic is covered thoroughly for ~1,200 visible words plus expandable <details> blocks. 5 = major aspects all present; 1 = reads like a stub.
- Readable: Prose is clear, logically sectioned, and pitched at the normal-reader level. 5 = sections flow and vocabulary matches the audience.
- Sources: Cited sources are reliable, diverse, and reasonably current. 5 = primary sources, recent scholarship, authoritative references.
- Level Fit: Article matches the normal audience and the layered-narrative format. 5 = surface prose approachable; details blocks add depth without repeating.
- Vis. Accuracy: Stats, diagram labels, and timeline events are factually correct and grounded in the cited sources.
- Vis. Legibility: Visuals render cleanly — Mermaid parses, labels are readable, timelines are chronological, KG nodes and edges are labeled.
- Vis. ↔ Prose: Each stat / diagram / timeline sits at a useful narrative point and illustrates something the prose establishes, rather than feeling bolted-on.
- Overall: Simple average on the 1–5 scale across all eight dimensions.
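The normal-level composite is just the unweighted mean of the eight dimension scores, which can be checked against any row of the table above:

```python
def normal_overall(scores: list[float]) -> float:
    """Simple (unweighted) average of the eight dimension scores."""
    return sum(scores) / len(scores)

# codex / gpt-5 row from the table above, in column order
# (Accuracy, Complete, Readable, Sources, Level Fit,
#  Vis. Acc., Vis. Leg., Vis.<->Prose):
codex_gpt5 = [4.9, 4.7, 4.8, 4.1, 4.9, 4.9, 4.9, 4.5]
print(round(normal_overall(codex_gpt5), 1))  # 4.7
```

The same check reproduces ★ 4.8 for the claude-opus-4-7 row and ★ 4.7 for claude-opus-4-6.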