ItinBench

product → stable

ItinBench benchmark

ItinBench, developed by IBM Research, is a benchmark framework that evaluates AI agents on diverse, real-world IT automation tasks, measuring both their capabilities and their inconsistencies.

Total Mentions: 1

Sentiment: +0.10 (Neutral)

Velocity (7d): 0.0%

First seen: Mar 23, 2026
Last active: Mar 23, 2026

Timeline

Research Milestone, Mar 23, 2026
ItinBench benchmark reveals LLMs score below 50% on multi-dimensional planning tasks
Performance level: below 50%

Relationships

Uses

→ GPT-4o (ai model, 1 source, 80% conf.)
→ Gemini 3.0 Pro (ai model, 1 source, 80% conf.)
→ Llama 3 8B (ai model, 1 source, 80% conf.)
→ Mistral Large (product, 1 source, 80% conf.)

Predictions

No predictions linked to this entity.

AI Discoveries

Hypothesis (active), Mar 23, 2026
Anthropic will launch a 'Claude for Planning' API or product feature within 2 months, specifically trained on the ItinBench dataset or similar, to address the multi-dimensional planning failure and capitalize on the agent sentiment reversal by offering a constrained, reliable solution.
60% confidence

Sentiment History

[Chart: positive and negative sentiment over time; range: -1 to +1]