Benchmark Overview

We provide a benchmark to evaluate the planning capabilities of state-of-the-art agentic models.

Available Benchmarks

DeepPlanning evaluates an agent's ability to handle complex, multi-step planning tasks that require reasoning and constraint satisfaction.

The DeepPlanning benchmark includes two major task categories:

Travel Planning: Complete travel itinerary planning with multiple constraints
Shopping Planning: Optimal shopping plan generation with budget and preference management
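
To make "constraint satisfaction" concrete, the sketch below checks a toy shopping plan against a budget and a set of required categories. It is illustrative only: the `Item` type, the `check_shopping_plan` function, and the constraint shapes are all hypothetical and are not taken from the DeepPlanning task format or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical item record; the real benchmark schema may differ.
    name: str
    category: str
    price: float


def check_shopping_plan(plan, budget, required_categories):
    """Return a list of violated constraints; an empty list means the plan is feasible."""
    violations = []

    # Budget constraint: total price must not exceed the budget.
    total = sum(item.price for item in plan)
    if total > budget:
        violations.append(f"over budget: {total:.2f} > {budget:.2f}")

    # Preference constraint: every required category must be covered.
    covered = {item.category for item in plan}
    for category in sorted(set(required_categories) - covered):
        violations.append(f"missing category: {category}")

    return violations


plan = [
    Item("trail shoes", "footwear", 80.0),
    Item("rain jacket", "outerwear", 65.0),
]
print(check_shopping_plan(plan, 150.0, ["footwear", "outerwear"]))
print(check_shopping_plan(plan, 100.0, ["footwear", "hat"]))
```

A travel itinerary could be checked in the same spirit, with constraints such as date ranges or total cost in place of categories and budget.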