DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints


Abstract

While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.

[Figure: DeepPlanning framework overview]


📊 Benchmark Details

DeepPlanning features two realistic, long-horizon domains in which agents must navigate complex environments under strict, verifiable global constraints.

📉 Statistics at a Glance

| Metric | ✈️ Travel Planning | 🛒 Shopping Planning |
|---|---|---|
| Tasks | 120 (ZH) / 120 (EN) | 120 (EN) |
| Toolkits | 9 Specialized APIs | 15 Specialized APIs |
| Data Volume | 7,708 records / task | 171 records / task |
| Primary Goal | Minute-level itinerary | Optimized shopping list |
| Environment | Isolated Python Sandbox | Isolated Python Sandbox |

✈️ Domain 1: Travel Planning

Agents act as personal travel assistants to organize multi-day trips where time, location, and budget are tightly coupled.

  • Input: Natural language query (destination, dates, budget) and specific preferences (e.g., “3-star hotel with a dryer”).
  • Tools: 9 APIs for searching flights, trains, hotels, restaurants, and attractions.
  • Output: A structured planning report with itemized costs and a minute-by-minute schedule.
  • Core Skill: Spatio-temporal reasoning—ensuring flight times, attraction hours, and transit durations all align without overlaps or budget overruns.
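
The spatio-temporal checks described above can be illustrated with a minimal sketch. The `Activity` type, the `check_day` helper, and the sample itinerary are hypothetical and are not taken from the DeepPlanning code; the sketch only assumes that a plan lists activities with start and end times and a cost.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str
    start: int   # minutes since midnight
    end: int     # minutes since midnight
    cost: float


def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def check_day(activities, budget):
    """Return a list of violations; an empty list means the day is feasible."""
    violations = []
    ordered = sorted(activities, key=lambda a: a.start)
    # Each activity must end after it starts.
    for activity in ordered:
        if activity.end <= activity.start:
            violations.append(f"{activity.name}: ends before it starts")
    # Consecutive activities must not overlap in time.
    for earlier, later in zip(ordered, ordered[1:]):
        if later.start < earlier.end:
            violations.append(f"{earlier.name} overlaps {later.name}")
    # The itemized costs must stay within the budget.
    total = sum(a.cost for a in ordered)
    if total > budget:
        violations.append(f"total cost {total:.2f} exceeds budget {budget:.2f}")
    return violations


# Hypothetical itinerary: the museum visit starts before the flight lands.
day = [
    Activity("Flight", to_minutes("08:00"), to_minutes("10:30"), 120.0),
    Activity("Museum", to_minutes("10:00"), to_minutes("12:00"), 15.0),
    Activity("Lunch", to_minutes("12:15"), to_minutes("13:00"), 20.0),
]
print(check_day(day, budget=150.0))
```

A fuller checker would also account for attraction opening hours and transit durations between locations, which this sketch omits.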

🛒 Domain 2: Shopping Planning

Agents must solve a combinatorial optimization problem: choose products that meet every stated requirement while extracting the most value from the available coupons.

  • Input: Shopping lists with detailed attribute requirements and total budget limits.
  • Tools: 15 APIs for semantic search, multi-attribute filtering, and coupon management.
  • Output: A structured JSON cart containing the optimal set of products and applied coupons.
  • Core Skill: Combinatorial Optimization—calculating complex coupon stacking rules (e.g., cross-store vs. same-brand) to achieve the absolute lowest final price.
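
To make the coupon-stacking problem concrete, here is a small brute-force sketch. The cart, the coupons, and the stacking rule (at most one cross-store coupon and at most one coupon per brand, each with a minimum spend) are assumptions made for illustration; the benchmark's actual rules may differ.

```python
from itertools import combinations

# Hypothetical cart entries: (name, brand, price).
cart = [
    ("headphones", "Acme", 80.0),
    ("charger", "Acme", 30.0),
    ("mug", "Bolt", 15.0),
]

# Hypothetical coupons: (id, kind, brand or None, minimum spend, amount off).
coupons = [
    ("X10", "cross-store", None, 100.0, 10.0),
    ("X20", "cross-store", None, 120.0, 20.0),
    ("A15", "same-brand", "Acme", 100.0, 15.0),
]


def eligible_spend(coupon, cart):
    """Sum the prices of the items this coupon applies to."""
    _, kind, brand, _, _ = coupon
    return sum(price for _, b, price in cart if kind == "cross-store" or b == brand)


def is_valid(combo, cart):
    """Assumed rule: one cross-store coupon at most, one coupon per brand at most."""
    keys = [(kind, brand) for _, kind, brand, _, _ in combo]
    if len(keys) != len(set(keys)):
        return False
    # Every coupon in the combination must meet its minimum spend.
    return all(eligible_spend(c, cart) >= c[3] for c in combo)


def best_combo(cart, coupons):
    """Try every coupon subset and return (lowest total, coupon ids used)."""
    subtotal = sum(price for _, _, price in cart)
    best = (subtotal, ())
    for size in range(1, len(coupons) + 1):
        for combo in combinations(coupons, size):
            if is_valid(combo, cart):
                total = subtotal - sum(c[4] for c in combo)
                if total < best[0]:
                    best = (total, tuple(c[0] for c in combo))
    return best


print(best_combo(cart, coupons))
```

Exhaustive enumeration is exponential in the number of coupons, so it is only practical for small carts; it is used here because it makes the search space explicit.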

🧠 Core Planning Competencies

DeepPlanning evaluates three critical agentic abilities:

  1. Proactive Information Acquisition: Actively calling APIs to discover hidden environment states (e.g., checking if an attraction is closed or a product is in stock) instead of hallucinating facts.

  2. Local Constrained Reasoning: Satisfying step-level logic, such as matching specific brands, sizes, or hotel amenities requested by the user.

  3. Global Constrained Optimization: Managing holistic boundaries—like total budget caps and multi-day time feasibility—where a single local mistake invalidates the entire plan.
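
The point that a single mistake invalidates the whole plan can be shown by contrasting a partial score with an all-or-nothing verdict. The constraint names and the `score_plan` helper below are hypothetical, and the formulas are an illustration rather than the benchmark's official metric definitions.

```python
def score_plan(checks):
    """checks maps a constraint name to True (satisfied) or False (violated)."""
    passed = sum(checks.values())
    partial_score = 100.0 * passed / len(checks)  # share of constraints satisfied
    case_pass = all(checks.values())              # valid only if nothing is violated
    return partial_score, case_pass


# Hypothetical plan: three local constraints hold, but the budget cap is exceeded.
plan_checks = {
    "hotel_has_dryer": True,
    "attraction_open_at_visit_time": True,
    "no_schedule_overlap": True,
    "total_cost_within_budget": False,
}
partial, ok = score_plan(plan_checks)
print(partial, ok)
```

Under these assumptions the plan satisfies three of four constraints yet still fails as a case, which is one way a model could earn high per-constraint scores alongside low case-level accuracy.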


🏆 Leaderboard 🏆

Comprehensive evaluation results on DeepPlanning, averaged over four runs. The best result in every column belongs to OpenAI/GPT-5.2-high.

| Rank | Model | Avg Acc. | Travel CS Score | Travel PS Score | Travel Comp Score | Travel Case Acc. | Shopping Match Score | Shopping Case Acc. |
|---|---|---|---|---|---|---|---|---|
| 1 | OpenAI/GPT-5.2-high | 44.6 | 88.5 | 83.3 | 85.8 | 35.0 | 84.8 | 54.2 |
| 2 | Anthropic/Claude-4.5-Opus (w/ thinking) | 33.9 | 79.3 | 70.9 | 75.1 | 22.7 | 80.0 | 45.0 |
| 3 | OpenAI/GPT-5-high | 31.6 | 78.7 | 65.9 | 72.3 | 18.9 | 80.4 | 44.2 |
| 4 | Google/Gemini-3-Flash-Preview | 28.8 | 67.1 | 57.7 | 62.4 | 5.9 | 80.6 | 51.7 |
| 5 | Alibaba/Qwen3-Max (w/ thinking) | 28.7 | 64.0 | 61.7 | 62.8 | 13.8 | 82.6 | 43.5 |
| 6 | Anthropic/Claude-4.5-Opus (w/o thinking) | 26.3 | 67.5 | 58.8 | 63.1 | 6.7 | 82.2 | 45.8 |
| 7 | Anthropic/Claude-4.5-Sonnet (w/ thinking) | 25.5 | 65.2 | 58.4 | 61.8 | 7.6 | 80.0 | 43.3 |
| 8 | OpenAI/o3 | 24.9 | 76.5 | 55.6 | 66.1 | 11.3 | 76.9 | 38.5 |
| 9 | Google/Gemini-3-Pro-Preview | 23.2 | 58.4 | 25.1 | 41.8 | 0.7 | 78.0 | 45.8 |
| 10 | DeepSeek-AI/DeepSeek-V3.2 (w/ thinking) | 21.6 | 47.4 | 35.0 | 41.2 | 0.7 | 78.8 | 42.5 |
| 11 | ByteDance/Seed-1.8-high | 20.4 | 43.6 | 56.7 | 50.1 | 0.0 | 77.5 | 40.8 |
| 12 | xAI/Grok-4.1-fast (reasoning) | 17.2 | 57.1 | 37.7 | 47.4 | 2.7 | 74.0 | 31.7 |
| 13 | Anthropic/Claude-4.5-Sonnet (w/o thinking) | 17.2 | 53.4 | 42.8 | 48.1 | 1.1 | 75.8 | 33.3 |
| 14 | Alibaba/Qwen-Plus (w/ thinking) | 17.1 | 35.4 | 22.4 | 28.9 | 0.0 | 73.3 | 34.1 |
| 15 | Google/Gemini-2.5-Pro | 17.0 | 62.3 | 42.0 | 52.2 | 3.2 | 69.1 | 30.8 |
| 16 | Z.ai/GLM-4.7 (w/ thinking) | 14.0 | 44.0 | 44.6 | 44.3 | 0.4 | 72.5 | 27.5 |
| 17 | Alibaba/Qwen3-Max (w/o thinking) | 12.8 | 36.7 | 30.7 | 31.8 | 0.8 | 70.2 | 24.7 |
| 18 | OpenAI/o4-mini | 12.4 | 58.0 | 36.6 | 47.2 | 3.0 | 69.1 | 21.7 |
| 19 | Moonshot-AI/Kimi-K2-thinking | 12.1 | 45.2 | 32.5 | 38.9 | 0.0 | 65.8 | 24.2 |
| 20 | ByteDance/Seed-1.8-minimal | 11.3 | 43.0 | 47.5 | 45.3 | 0.0 | 68.1 | 22.5 |
| 21 | Alibaba/Qwen-Plus (w/o thinking) | 7.5 | 37.3 | 13.0 | 25.1 | 0.0 | 63.9 | 15.0 |
| 22 | Z.ai/GLM-4.7 (w/o thinking) | 7.1 | 38.9 | 22.5 | 30.7 | 0.0 | 61.2 | 14.2 |
| 23 | DeepSeek-AI/DeepSeek-V3.2 (w/o thinking) | 5.3 | 37.4 | 12.1 | 24.7 | 0.0 | 58.3 | 10.6 |
| 24 | OpenAI/GPT-5.2-none | 4.5 | 54.3 | 29.9 | 42.1 | 0.4 | 58.6 | 8.6 |
| 25 | xAI/Grok-4.1-fast (non-reasoning) | 3.0 | 39.6 | 19.7 | 29.6 | 0.0 | 50.1 | 5.9 |

CS Score = Commonsense Score | PS Score = Personalized Score | Comp Score = Composite Score | Case Acc. = Case Accuracy. The CS, PS, Comp, and first Case Acc. columns belong to Travel Planning; the Match Score and second Case Acc. columns belong to Shopping Planning.
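
The page does not define Avg Acc. In the leaderboard rows it agrees, to within rounding, with the mean of the Travel Planning and Shopping Planning Case Acc. values. The sketch below checks that reading on three rows copied from the leaderboard; the averaging formula itself is an inference from the numbers, not a stated definition.

```python
# (reported Avg Acc., Travel Case Acc., Shopping Case Acc.) copied from the leaderboard.
rows = {
    "OpenAI/GPT-5.2-high": (44.6, 35.0, 54.2),
    "Google/Gemini-3-Flash-Preview": (28.8, 5.9, 51.7),
    "OpenAI/o3": (24.9, 11.3, 38.5),
}


def mean_case_accuracy(travel, shopping):
    """Assumed definition: unweighted mean of the two domain case accuracies."""
    return (travel + shopping) / 2


for model, (reported, travel, shopping) in rows.items():
    recomputed = mean_case_accuracy(travel, shopping)
    print(f"{model}: reported {reported}, recomputed {recomputed:.2f}")
```

Because the published values are rounded to one decimal place, a recomputed mean can differ from the reported figure by a few hundredths.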


Acknowledgments

We thank Fliggy (飞猪) and Amap (高德) for their technical support.
