DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

Qwen Team

📄 Paper

🤗 Dataset

💻 Code

Abstract

While agent evaluation has shifted toward long-horizon tasks, most benchmarks still emphasize local, step-level reasoning rather than the global constrained optimization (e.g., time and financial budgets) that demands genuine planning ability. Meanwhile, existing LLM planning benchmarks underrepresent the active information gathering and fine-grained local constraints typical of real-world settings. To address this, we introduce DeepPlanning, a challenging benchmark for practical long-horizon agent planning. It features multi-day travel planning and multi-product shopping tasks that require proactive information acquisition, local constrained reasoning, and global constrained optimization. Evaluations on DeepPlanning show that even frontier agentic LLMs struggle with these problems, highlighting the importance of reliable explicit reasoning patterns and parallel tool use for achieving better effectiveness-efficiency trade-offs. Error analysis further points to promising directions for improving agentic LLMs over long planning horizons. We open-source the code and data to support future research.

DeepPlanning Framework Overview

📊 Benchmark Details

DeepPlanning features two realistic, long-horizon domains that require agents to navigate complex environments with strict Verifiable Global Constraints.

📉 Statistics at a Glance

Metric	✈️ Travel Planning	🛒 Shopping Planning
Tasks	120 (ZH) / 120 (EN)	120 (EN)
Toolkits	9 Specialized APIs	15 Specialized APIs
Data Volume	7,708 records / task	171 records / task
Primary Goal	Minute-level itinerary	Optimized shopping list
Environment	Isolated Python Sandbox	Isolated Python Sandbox

✈️ Domain 1: Travel Planning

Agents act as personal travel assistants to organize multi-day trips where time, location, and budget are tightly coupled.

Input: Natural language query (destination, dates, budget) and specific preferences (e.g., “3-star hotel with a dryer”).
Tools: 9 APIs for searching flights, trains, hotels, restaurants, and attractions.
Output: A structured planning report with itemized costs and a minute-by-minute schedule.
Core Skill: Spatio-temporal reasoning—ensuring flight times, attraction hours, and transit durations all align without overlaps or budget overruns.

🛒 Domain 2: Shopping Planning

Agents must solve a combinatorial optimization problem to find the best products while maximizing discount utility.

Input: Shopping lists with detailed attribute requirements and total budget limits.
Tools: 15 APIs for semantic search, multi-attribute filtering, and coupon management.
Output: A structured JSON cart containing the optimal set of products and applied coupons.
Core Skill: Combinatorial Optimization—calculating complex coupon stacking rules (e.g., cross-store vs. same-brand) to achieve the absolute lowest final price.

🧠 Core Planning Competencies

DeepPlanning evaluates three critical agentic abilities:

Proactive Information Acquisition: Actively calling APIs to discover hidden environment states (e.g., checking if an attraction is closed or a product is in stock) instead of hallucinating facts.
Local Constrained Reasoning: Satisfying step-level logic, such as matching specific brands, sizes, or hotel amenities requested by the user.
Global Constrained Optimization: Managing holistic boundaries—like total budget caps and multi-day time feasibility—where a single local mistake invalidates the entire plan.

📢 Change Log

v1.1 (2026-03-03)

Updated several tasks in the Shopping Planning benchmark and corrected erroneous answer annotations for a subset of questions. Dataset available at Qwen/DeepPlanning on Hugging Face.
Added new models to the leaderboard: Claude-4.6-Opus, Qwen-3.5-Plus, GLM-5, Seed-2.0-pro-high, Kimi-K2.5-thinking.

v1.0 (2026-01)

Initial release of the DeepPlanning benchmark with Travel Planning and Shopping Planning domains.

🏆 Leaderboard 🏆

Comprehensive evaluation results on DeepPlanning v1.1. Results are averaged over four runs. Bold indicates the best result.

Rank	Model	Avg Acc.	Travel Planning				Shopping Planning
Rank	Model	Avg Acc.	CS Score	PS Score	Comp Score	Case Acc.	Match Score	Case Acc.
1	Anthropic/Claude-4.6-Opus (max)	58.9	86.1	80.3	83.2	61.5	85.3	56.2
2	OpenAI/GPT-5.2-high	48.2	88.5	83.3	85.8	35.0	88.4	61.4
3	Alibaba/Qwen-3.5-Plus (w/o thinking)	37.6	83.6	79.9	81.6	26.3	82.4	48.9
4	Anthropic/Claude-4.5-Opus (w/ thinking)	37.0	79.3	70.9	75.1	22.7	83.7	51.4
5	Alibaba/Qwen-3.5-Plus (w/ thinking)	35.9	76.8	75.4	76.2	25.0	82.1	46.7
6	Google/Gemini-3-Flash-Preview	33.8	67.1	57.7	62.4	5.9	86.9	61.6
7	OpenAI/GPT-5-high	30.5	78.7	65.9	72.3	18.9	68.7	42.1
8	Alibaba/Qwen3-Max (w/ thinking)	29.7	64.0	61.7	62.8	13.8	80.6	45.6
9	Google/Gemini-3-Pro-Preview	27.4	58.4	25.1	41.8	0.7	83.4	54.0
10	DeepSeek-AI/DeepSeek-V3.2 (w/ thinking)	27.4	47.4	35.0	41.2	0.7	84.0	54.0
11	Anthropic/Claude-4.5-Sonnet (w/ thinking)	26.8	65.2	58.4	61.8	7.6	78.0	46.0
12	Anthropic/Claude-4.5-Opus (w/o thinking)	26.4	67.5	58.8	63.1	6.7	81.0	46.0
13	ByteDance/Seed-2.0-pro-high	21.6	56.0	60.6	58.3	2.1	76.7	41.0
14	xAI/Grok-4.1-fast (reasoning)	19.1	57.1	37.7	47.4	2.7	73.2	35.6
15	DeepSeek-AI/DeepSeek-V3.2 (w/o thinking)	19.0	37.4	12.1	24.7	0.0	76.0	38.0
16	Anthropic/Claude-4.5-Sonnet (w/o thinking)	16.1	53.4	42.8	48.1	1.1	71.0	31.0
17	Alibaba/Qwen3-Max (w/o thinking)	15.4	36.7	30.7	31.8	0.8	72.3	30.1
18	Z.ai/GLM-5 (w/ thinking)	14.6	44.3	42.3	43.3	0.4	72.2	28.7
19	Moonshot-AI/Kimi-K2.5 (w/ thinking)	14.3	47.8	43.7	45.8	0.4	71.9	28.3
20	OpenAI/o4-mini	13.7	58.0	36.6	47.2	3.0	62.5	24.3
21	OpenAI/GPT-5.2-none	6.8	54.3	29.9	42.1	0.4	59.4	13.1
22	xAI/Grok-4.1-fast (non-reasoning)	5.2	39.6	19.7	29.6	0.0	52.9	10.4

CS Score = Commonsense Score | PS Score = Personalized Score | Comp Score = Composite Score | Case Acc. = Case Accuracy | Match Score = Match Score. Bold values indicate best performance per category.

Acknowledgments

We thank Fliggy (飞猪) and Amap (高德) for their technical support.