Qwen2.5 VL! Qwen2.5 VL! Qwen2.5 VL!

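We have released Qwen2.5-VL, the flagship vision-language model of the Qwen family and a major leap over the previously released Qwen2-VL. To try it, visit Qwen Chat and select Qwen2.5-VL-72B-Instruct. We have also open-sourced the Base and Instruct models of Qwen2.5-VL on Hugging Face and ModelScope in three sizes: 3B, 7B, and 72B.

The key features of Qwen2.5-VL are:

Perceiving a richer world: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects; it can also analyze text, charts, icons, graphics, and layouts within images.
Agent: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically use tools, with initial capabilities for operating computers and phones.
Understanding long videos and capturing events: Qwen2.5-VL can understand videos longer than one hour, and it now has the new ability to capture events by pinpointing the relevant video segments.
Visual localization: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON output for coordinates and attributes.
Structured output: for data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of their contents, benefiting applications in finance, commerce, and other fields.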

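Model Performance

We evaluated our vision-language models comprehensively against SOTA models and the best-performing models of similar size. The flagship model, Qwen2.5-VL-72B-Instruct, performs strongly across a series of benchmarks covering multiple domains and tasks, including college-level problems, math, document understanding, visual question answering, video understanding, and visual agent tasks. Notably, Qwen2.5-VL has a significant advantage in understanding documents and charts, and it can operate as a visual agent without task-specific fine-tuning.

Among the smaller models, Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on a number of tasks, while Qwen2.5-VL-3B, a promising option for edge AI, even outperforms the 7B model of our previous version, Qwen2-VL.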

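Model Capability Examples

1. Universal Image Recognition

Qwen2.5-VL has significantly enhanced its general image recognition capabilities and greatly expanded the range of image categories it can recognize. These include not only plants, animals, and landmarks such as famous mountains and rivers, but also IPs from film and television, as well as a wide variety of products.

Example: Attractions Identification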

User

What are these attractions? Please give their names in Chinese and English.

Qwen2.5-VL

These attractions are:

Top left: The Great Pyramids of Giza (埃及的吉萨大金字塔)
Top right: The Great Wall of China (中国的长城)
Bottom left: The Statue of Liberty (美国的自由女神像)
Bottom right: The Terracotta Army (中国的秦始皇兵马俑)

Example: Birds Identification

User

What are these birds? Please give their names in Chinese and English.

Qwen2.5-VL

These birds are:

Top Left: 红翅黑鹂 (Red-winged Blackbird)
Top Right: 蓝翅八色鸫 (Blue-winged Pitta)
Bottom Left: 小斑鸬鹚 (Little Pied Cormorant)
Bottom Right: 金头扇尾莺 (Golden-headed Cisticola)

Example: Cars Identification

User

What are these cars? Please give their names in Chinese and English.

Qwen2.5-VL

These cars are:

Top left: 劳斯莱斯 (Rolls-Royce)
Top right: 梅赛德斯-奔驰GLE (Mercedes-Benz GLE)
Bottom left: 法拉利Portofino M (Ferrari Portofino M)
Bottom right: 保时捷911 (Porsche 911)

Example: Celebrity Recognition

User

Who are these in this picture? Please give their names in Chinese and English.

Qwen2.5-VL

The four celebrities in this picture are:

Top left: 迈克尔·杰克逊 (Michael Jackson)
Top right: 迈克尔·乔丹 (Michael Jordan)
Bottom left: 成龙 (Jackie Chan)
Bottom right: 唐纳德·特朗普 (Donald Trump)

Example: Foods Identification

User

What are these foods? Please give their names in Chinese and English.

Qwen2.5-VL

Sure! Here are the names of the foods in the images, both in Chinese and English:

Top Left Image:
- Chinese: 牛仔骨 (Niú zǎi gǔ)
- English: Beef Short Ribs
Top Right Image:
- Chinese: 蔬菜沙拉 (Shūcài shālā)
- English: Vegetable Salad
Bottom Left Image:
- Chinese: 红烧狮子头 (Hóngshāo shizi tóu)
- English: Braised Meatballs
Bottom Right Image:
- Chinese: 火锅 (Huǒguō)
- English: Hot Pot (with various ingredients like meat, vegetables, and seafood)

Example: Products Identification

User

What are these products? Please give their names in Chinese and English.

Qwen2.5-VL

Sure, here are the products and their Chinese and English names:

Top Left Image:
- English Name: Propolinse Mouthwash
- Chinese Name: 蜂胶漱口水
Top Right Image:
- English Name: George Caroll Moisturizing and Refreshing Shampoo
- Chinese Name: 乔治卡罗尔保湿清爽洗发水
Bottom Left Image:
- English Name: On The Body Sweet Love Perfume Water
- Chinese Name: 洋甘菊甜蜜爱香水
Bottom Right Image:
- English Name: Tiger Balm Mosquito Repellent
- Chinese Name: 虎标驱蚊水

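2. Precise Visual Grounding

Qwen2.5-VL localizes general objects in varied ways, using bounding boxes and points, and supports hierarchical localization and standardized JSON output. This enhanced localization capability provides a foundation for visual agents to perform understanding and reasoning tasks in complex scenes.

Example: Precise Object Grounding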

User

Detect all motorcyclists in the image and return their locations in the form of coordinates. The format of output should be like {“bbox_2d”: [x1, y1, x2, y2], “label”: “motorcyclist”, “sub_label”: “wearing helmat” # or “not wearing helmat”}.

Qwen2.5-VL

[
	{"bbox_2d": [341, 258, 397, 360], "label": "motorcyclist", "sub_label": "not wearing helmat "},
	{"bbox_2d": [212, 332, 274, 448], "label": "motorcyclist", "sub_label": "not wearing helmat "},
	{"bbox_2d": [66, 124, 112, 198], "label": "motorcyclist", "sub_label": "not wearing helmat "},
	{"bbox_2d": [5, 235, 63, 320], "label": "motorcyclist", "sub_label": "wearing helmat "}
]
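Because the reply is a plain JSON array, it can be consumed directly by downstream code. The following is a minimal sketch, assuming the reply text has already been obtained as a string; the helper name `parse_boxes` is hypothetical and not part of any Qwen tooling. It parses the boxes, rejects degenerate ones, and filters on the secondary label.

```python
import json

# Reply text copied from the example above. The "helmat" spelling and the
# trailing space in sub_label are echoed from the prompt, so we strip and
# match on a prefix rather than on an exact string.
reply = '''[
  {"bbox_2d": [341, 258, 397, 360], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [212, 332, 274, 448], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [66, 124, 112, 198], "label": "motorcyclist", "sub_label": "not wearing helmat "},
  {"bbox_2d": [5, 235, 63, 320], "label": "motorcyclist", "sub_label": "wearing helmat "}
]'''


def parse_boxes(text):
    """Return a list of (x1, y1, x2, y2, label, sub_label) tuples (hypothetical helper)."""
    boxes = []
    for item in json.loads(text):
        x1, y1, x2, y2 = item["bbox_2d"]
        if x2 <= x1 or y2 <= y1:
            raise ValueError(f"degenerate box: {item['bbox_2d']}")
        sub_label = item.get("sub_label", "").strip()
        boxes.append((x1, y1, x2, y2, item["label"], sub_label))
    return boxes


boxes = parse_boxes(reply)
without_helmet = [b for b in boxes if b[5].startswith("not wearing")]
print(len(boxes), "motorcyclists,", len(without_helmet), "without a helmet")
```

Matching on a prefix after stripping whitespace is a deliberate choice here: free-form labels produced by a model tend to carry small formatting differences that an exact comparison would miss.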

Example: Precise Object Grounding

User

Identify basketball players and detect the key points of their heads and hands, returning them in the form of points. The primary label is the player’s name, and the secondary labels include left hand, right hand, and head.

Qwen2.5-VL

[
	{"point_2d": ["394", "105"], "label": "LeBron James", "label2": "head"},
	{"point_2d": ["876", "131"], "label": "Stephen Curry", "label2": "head"},
	{"point_2d": ["100", "614"], "label": "LeBron James", "label2": "right hand"},
	{"point_2d": ["460", "507"], "label": "LeBron James", "label2": "left hand"},
	{"point_2d": ["784", "660"], "label": "Stephen Curry", "label2": "left hand"},
	{"point_2d": ["945", "507"], "label": "Stephen Curry", "label2": "right hand"}
]

Example: Precise Object Grounding

User

Count the number of birds in the figure, including those that are only showing their heads. To ensure accuracy, first detect their key points, then give the total number.

Qwen2.5-VL

[
	{"point_2d": ["38", "314"], "label": "birds"},
	{"point_2d": ["126", "309"], "label": "birds"},
	{"point_2d": ["195", "277"], "label": "birds"},
	{"point_2d": ["243, 291"], "label": "birds"},
	{"point_2d": ["285, 291"], "label": "birds"},
	{"point_2d": ["345, 291"], "label": "birds"},
	{"point_2d": ["400, 284"], "label": "birds"},
	{"point_2d": ["465, 298"], "label": "birds"},
	{"point_2d": ["525, 314"], "label": "birds"},
	{"point_2d": ["644, 314"], "label": "birds"},
	{"point_2d": ["695, 314"], "label": "birds"},
	{"point_2d": ["762, 304"], "label": "birds"},
	{"point_2d": ["834, 264"], "label": "birds"},
	{"point_2d": ["248, 638"], "label": "birds"}
]

There are a total of 14 birds in the figure.
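Note that the points in the reply above arrive in two shapes: two separate strings (`["38", "314"]`) and one comma-joined string (`["243, 291"]`). A consumer that wants numeric coordinates has to tolerate both. The sketch below is one way to do that; `parse_point` is a hypothetical helper, and the third input shape (plain integers) is an assumption added for robustness, not something shown in the example.

```python
import json


def parse_point(raw):
    """Normalise a point_2d value to an (x, y) tuple of ints (hypothetical helper).

    Accepts ["38", "314"] and ["243, 291"] as seen in the reply above,
    plus [38, 314] as an assumed third variant.
    """
    parts = []
    for piece in raw:
        parts.extend(str(piece).split(","))
    if len(parts) != 2:
        raise ValueError(f"expected two coordinates, got {raw!r}")
    return int(parts[0].strip()), int(parts[1].strip())


# Three entries, one per input shape; the first two are taken from the example.
reply = '''[
  {"point_2d": ["38", "314"], "label": "birds"},
  {"point_2d": ["243, 291"], "label": "birds"},
  {"point_2d": [248, 638], "label": "birds"}
]'''

points = [parse_point(item["point_2d"]) for item in json.loads(reply)]
print(len(points), points)
```

Counting the parsed points, rather than trusting the number stated in the prose answer, gives a simple consistency check on the reply.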

Example: Detecting before Counting

User

Please first output bbox coordinates and names of every item in this image in JSON format, and then answer how many items are there in the image.

Qwen2.5-VL

[
	{"bbox_2d": [19, 3, 84, 125], "label": "ice cream"},
	{"bbox_2d": [167, 0, 288, 134], "label": "flip flops"},
	{"bbox_2d": [349, 0, 457, 123], "label": "beach umbrella"},
	{"bbox_2d": [521, 4, 635, 123], "label": "starfish"},
	{"bbox_2d": [10, 180, 118, 294], "label": "cocktail drink"},
	{"bbox_2d": [150, 196, 308, 283], "label": "sunglasses"},
	{"bbox_2d": [354, 192, 467, 300], "label": "lifebuoy"},
	{"bbox_2d": [532, 202, 635, 290], "label": "watermelon slice"},
	{"bbox_2d": [2, 379, 134, 452], "label": "straw hat"},
	{"bbox_2d": [167, 335, 308, 456], "label": "palm tree"},
	{"bbox_2d": [360, 359, 468, 476], "label": "sun"},
	{"bbox_2d": [546, 338, 642, 452], "label": "martini glass"}
]

There are 12 items in the image.

Example: Grounding with Description

User

Locate every cake and describe their features, output the bbox coordinates in JSON format.

Qwen2.5-VL

[
	{"bbox_2d": [130, 725, 304, 906], "label": "cake with white frosting and eyes"},
	{"bbox_2d": [344, 641, 508, 800], "label": "cake with white frosting and eyes"},
	{"bbox_2d": [289, 530, 435, 690], "label": "cake with white frosting and colorful sprinkles"},
	{"bbox_2d": [40, 542, 197, 712], "label": "cake with white frosting and colorful sprinkles"},
	{"bbox_2d": [64, 386, 202, 566], "label": "cake with pink frosting and red sprinkles"},
	{"bbox_2d": [245, 375, 363, 529], "label": "cake with pink frosting and blue sprinkles"},
	{"bbox_2d": [389, 359, 492, 494], "label": "cake with pink frosting and blue sprinkles"},
	{"bbox_2d": [423, 436, 555, 587], "label": "cake with pink frosting and googly eyes"},
	{"bbox_2d": [518, 348, 642, 512], "label": "cake with pink frosting and blue sprinkles"},
	{"bbox_2d": [500, 549, 652, 698], "label": "cake with white frosting and googly eyes"},
	{"bbox_2d": [615, 465, 757, 626], "label": "cake with white frosting and colorful sprinkles"},
	{"bbox_2d": [723, 372, 849, 525], "label": "cake with chocolate frosting"},
	{"bbox_2d": [782, 497, 944, 676], "label": "cake with white frosting and colorful candies"}
]

Example: Pointing to target object

User

Locate the person who act bravely, report the bbox coordinates in JSON format.

Qwen2.5-VL

[
	{"bbox_2d": [74, 58, 526, 619], "label": "person who act bravely"}
]

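3. Comprehensive Text Recognition and Understanding

Qwen2.5-VL raises OCR to a new level, with enhanced text recognition and text localization across multiple scenarios, languages, and orientations. Its information extraction has also been substantially strengthened to meet the growing demand for digitalization and intelligent processing in areas such as qualification review and financial business.

Example: Vertical Text Recognition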

User

Read all texts in the image, output in lines.

Qwen2.5-VL

平安喜樂八方來財順風順水順財神暴富大吉諸事皆順福時來運轉發財如意有錢有勢有前程好運來

Example: Multilingual Text Recognition

User

Recognise all texts in the image, output in lines.

Qwen2.5-VL

SMK أبو منير لبيع وصيانة الروديترات روديترات ماء - مكيف - دفايات COOLING CAR SYSTEM أبو منير 052-204-5334 محمد أبو سيراج 059-831-0796 056-811-8256

Example: Text Spotting

User

Spotting all the text in the image with line-level, and output in JSON format.

Qwen2.5-VL

[
	{"bbox_2d": [108, 175, 496, 230], "text_content": "AuntieAnne's"},
	{"bbox_2d": [49, 429, 252, 450], "text_content": "CINNAMON SUGAR"},
	{"bbox_2d": [49, 463, 61, 483], "text_content": "1"},
	{"bbox_2d": [102, 463, 113, 483], "text_content": "x"},
	{"bbox_2d": [131, 463, 198, 483], "text_content": "17,000"},
	{"bbox_2d": [461, 463, 534, 488], "text_content": "17,000"},
	{"bbox_2d": [21, 510, 143, 531], "text_content": "SUB TOTAL"},
	{"bbox_2d": [461, 516, 534, 538], "text_content": "17,000"},
	{"bbox_2d": [21, 579, 176, 601], "text_content": "GRAND TOTAL"},
	{"bbox_2d": [461, 585, 534, 607], "text_content": "17,000"},
	{"bbox_2d": [21, 628, 126, 649], "text_content": "CASH IDR"},
	{"bbox_2d": [461, 634, 534, 656], "text_content": "20,000"},
	{"bbox_2d": [21, 676, 169, 698], "text_content": "CHANGE DUE"},
	{"bbox_2d": [471, 682, 529, 704], "text_content": "3,000"}
]
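Line-level boxes like these can be regrouped into receipt rows by comparing vertical centres. The sketch below is an illustration under stated assumptions: the helper names and the 12-pixel tolerance are hypothetical choices, and the input is a subset of the boxes from the reply above.

```python
# A subset of the text-spotting reply above, already decoded into Python objects.
items = [
    {"bbox_2d": [49, 463, 61, 483], "text_content": "1"},
    {"bbox_2d": [102, 463, 113, 483], "text_content": "x"},
    {"bbox_2d": [131, 463, 198, 483], "text_content": "17,000"},
    {"bbox_2d": [461, 463, 534, 488], "text_content": "17,000"},
    {"bbox_2d": [21, 510, 143, 531], "text_content": "SUB TOTAL"},
    {"bbox_2d": [461, 516, 534, 538], "text_content": "17,000"},
    {"bbox_2d": [21, 628, 126, 649], "text_content": "CASH IDR"},
    {"bbox_2d": [461, 634, 534, 656], "text_content": "20,000"},
    {"bbox_2d": [21, 676, 169, 698], "text_content": "CHANGE DUE"},
    {"bbox_2d": [471, 682, 529, 704], "text_content": "3,000"},
]


def center_y(item):
    """Vertical centre of a box given as [x1, y1, x2, y2]."""
    _, y1, _, y2 = item["bbox_2d"]
    return (y1 + y2) / 2


def group_rows(items, tolerance=12):
    """Group boxes into rows; `tolerance` (pixels) is an assumed, tunable value."""
    rows = []  # each entry: [centre of the row's first box, [member items]]
    for item in sorted(items, key=center_y):
        cy = center_y(item)
        if rows and abs(cy - rows[-1][0]) <= tolerance:
            rows[-1][1].append(item)
        else:
            rows.append([cy, [item]])
    lines = []
    for _, members in rows:
        members.sort(key=lambda it: it["bbox_2d"][0])  # left to right
        lines.append(" ".join(it["text_content"] for it in members))
    return lines


for line in group_rows(items):
    print(line)
```

A fixed pixel tolerance is the simplest possible rule; for images with strongly varying font sizes, a tolerance derived from box height would likely be more robust.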

Example: Key Information Extraction

User

Extract following information from the receipt：[‘发票代码’,‘发票号码’,‘到站’,‘燃油费’,‘票价’,‘乘车日期’,‘开车时间’,‘车次’,‘座号’]，output in JSON。

Qwen2.5-VL

{
    "发票代码": "221021325353",
    "发票号码": "10283819",
    "到站": "开发区",
    "燃油费": "2.0",
    "票价": "8.00<全>",
    "乘车日期": "2013-06-29",
    "开车时间": "流水",
    "车次": "040",
    "座号": "371"
}
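When extraction results feed an automated pipeline, it is worth checking that the reply contains exactly the fields that were requested. A minimal sketch follows, using the field list and reply from this example; `check_fields` is a hypothetical helper.

```python
import json

# Field names exactly as requested in the prompt above.
requested = ["发票代码", "发票号码", "到站", "燃油费", "票价",
             "乘车日期", "开车时间", "车次", "座号"]

# Reply text copied from the example above.
reply = '''{
    "发票代码": "221021325353",
    "发票号码": "10283819",
    "到站": "开发区",
    "燃油费": "2.0",
    "票价": "8.00<全>",
    "乘车日期": "2013-06-29",
    "开车时间": "流水",
    "车次": "040",
    "座号": "371"
}'''


def check_fields(reply_text, fields):
    """Return (missing, unexpected) field names for a JSON object reply."""
    data = json.loads(reply_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in fields if f not in data]
    unexpected = [k for k in data if k not in fields]
    return missing, unexpected


missing, unexpected = check_fields(reply, requested)
print("missing:", missing, "unexpected:", unexpected)
```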

Example: Key Information Extraction

User

You are a delivery bill verification expert who can identify house numbers and courier label information, and verify whether the address on the courier label matches the house number. Please identify the house number from the bill address and the house number on the door separately, and then verify if the door number is delivered correctly.

Qwen2.5-VL

To verify the delivery, let’s break down the information from both the courier label and the door.

Courier Label Information:

Address on the label: 1935 South Wabash Avenue Suite 1, Chicago, IL 60616
House number from the address: 1935

Door Information:

House number on the door: 1935

Verification:

The house number on the courier label is 1935.
The house number on the door is also 1935.

Since both the house numbers match, the delivery is verified to be correct. The package should indeed be delivered to this location.

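Example: Key Information Extraction

User

How much was spent on labor this month?

Qwen2.5-VL

According to the data in the table, the "本月合计" (monthly total) row for June 30, 2006 shows each expense for the month. The value in the "人工费" (labor cost) column is 45,560 yuan.

Therefore, labor costs this month were 45,560 yuan.

User

What were the material costs spent on?

Qwen2.5-VL

According to the table provided, material costs were spent on the following:

Main materials consumed: 114,240 yuan.
Structural components consumed: 138,350 yuan.
Other materials consumed: 14,160 yuan.
Amortized turnover materials: 7,130 yuan.

These costs total:

$$114{,}240 + 138{,}350 + 14{,}160 + 7{,}130 = 273{,}880 \text{ yuan}$$

This matches the material cost of 273,880 yuan in the "monthly total" row.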

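4. Qwen-style Document Parsing

In Qwen2.5-VL, we designed a more comprehensive document parsing format called the QwenVL HTML format. It both recognizes the text in a document precisely and extracts the position of document elements such as images and tables, so that the page layout can be restored accurately. Built on carefully constructed large-scale data, QwenVL HTML parses documents robustly across a wide range of scenarios, including magazines, papers, web pages, and even mobile screenshots.

Example: QwenVL HTML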

User

QwenVL HTML

Qwen2.5-VL

<html>
<body>
<h1 data-bbox="879 283 1605 348">Qwen2.5 Technical Report</h1>
<p data-bbox="1124 406 1360 456"></p>
<address class="author"><p>Qwen Team</p></address>
<div class="image" data-bbox="778 491 850 541"><img data-bbox="778 491 850 541"/></div>
<p data-bbox="885 491 1460 541">https://huggingface.co/Qwen</p>
<div class="image" data-bbox="778 541 850 589"><img data-bbox="778 541 850 589"/></div>
<p data-bbox="885 541 1708 589">https://modelscope.cn/organization/qwen</p>
<div class="image" data-bbox="792 589 850 639"><img data-bbox="792 589 850 639"/></div>
<p data-bbox="885 589 1584 639">https://github.com/QwenLM/Qwen2.5</p>
<h2 data-bbox="1143 681 1344 733">Abstract</h2>
<p data-bbox="434 785 2050 1252">In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning, including offline learning DPO and online learning GRPO. Post-training techniques significantly enhance human preference, and notably improve long text generation, structural data analysis, and instruction following.</p>
<p data-bbox="434 1262 2050 1587">To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich configurations. The open-weight offerings include base models and instruction-tuned models in sizes of $0.5 \mathrm{~B}, 1.5 \mathrm{~B}, 3 \mathrm{~B}, 7 \mathrm{~B}, 14 \mathrm{~B}, 32 \mathrm{~B}$, and $72 \mathrm{~B}$ parameters. Quantized versions of the instruction-tuned models are also provided. Over 100 models can be accessed from Hugging Face Hub, ModelScope, and Kaggle. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.</p>
<p data-bbox="434 1587 2050 2052">Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math (Yang et al., 2024b), Qwen2.5-Coder (Hui et al., 2024), QwQ (Qwen Team, 2024d), and multimodal models.</p>
<div class="image" data-bbox="408 2275 2086 2800"><img data-bbox="408 2275 2086 2800"/></div>
<p data-bbox="289 2864 2202 3058">Figure 1: In the iterative development of the Qwen series, data scaling has played a crucial role. Qwen 2.5, which leverages 18 trillion tokens for pre-training, has demonstrated the most advanced capabilities within the Qwen series, especially in terms of domain expertise, underscoring the importance of scale together with mixture in enhancing the model’s capabilities.</p>
</body>
</html>
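Since the layout information lives in `data-bbox` attributes, ordinary HTML tooling can recover it. The sketch below uses Python's standard `html.parser` on a few lines taken from the output above; the class name `BBoxCollector` is hypothetical, and reading the four numbers as `x1 y1 x2 y2` is an assumption based on the `bbox_2d` convention in the grounding examples.

```python
from html.parser import HTMLParser


class BBoxCollector(HTMLParser):
    """Collect (tag, class, bbox) for every element with a data-bbox attribute."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img .../> also reach this method, because
        # HTMLParser.handle_startendtag calls handle_starttag by default.
        attr = dict(attrs)
        if "data-bbox" in attr:
            bbox = tuple(int(v) for v in attr["data-bbox"].split())
            self.elements.append((tag, attr.get("class"), bbox))


# A few lines copied from the QwenVL HTML output above.
sample = '''<html><body>
<h1 data-bbox="879 283 1605 348">Qwen2.5 Technical Report</h1>
<div class="image" data-bbox="778 491 850 541"><img data-bbox="778 491 850 541"/></div>
<h2 data-bbox="1143 681 1344 733">Abstract</h2>
</body></html>'''

parser = BBoxCollector()
parser.feed(sample)
parser.close()

for tag, css_class, bbox in parser.elements:
    print(tag, css_class, bbox)
```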

Example: QwenVL HTML

User

QwenVL HTML

Qwen2.5-VL

<html><body>
<div class="formula" data-bbox="920 213 1345 326"><img data-bbox="920 213 1345 326"/><div>$$h=D \tan \theta \approx D \theta=D \frac{\lambda}{2 L}$$</div></div>
<p data-bbox="247 338 1458 392">测出玻璃片长度 $D$ 和条纹间距 $L$, 即可求出薄片的厚度或细丝直径。</p>
<p data-bbox="308 392 1894 458">(2) 利用劈尖的等厚条纹, 可检查工件的表面是否平整和有无缺陷, 如图 9-21 所示。</p>
<div class="image" data-bbox="420 508 894 832"><img data-bbox="420 508 894 832"/></div>
<p data-bbox="384 858 934 914">图 9-20 测量细丝或薄片的厚度</p>
<p></p>
<div class="image" data-bbox="1268 468 1788 832"><img data-bbox="1268 468 1788 832"/></div>
<p data-bbox="1138 858 1914 914">图 9-21 检查工件的表面是否平整和有无缺陷</p>
<h2 data-bbox="324 946 534 1000"> 2. 牛顿环</h2>
<p data-bbox="247 1020 2016 1278">如图 9-22 所示, 在一块光学平晶上, 放置一曲率半径很大的平凸透镜, 在透镜和平晶间便形成空气膜层。以单色平行光垂直照射, 经空气膜上、下表面反射的两束光发生干涉, 于是在空气膜的上表面出现一组干涉条纹。这也是一种等厚干涉条纹, 并且是以触点为圆心的一组同心圆环, 称为牛顿环。彩色牛顿环的投影见彩色插图 6 。</p>
<div class="image" data-bbox="600 1308 1034 1914"><img data-bbox="600 1308 1034 1914"/></div>
<p data-bbox="844 1950 900 2000">(a)</p>
<p></p>
<div class="image" data-bbox="1138 1288 1658 1942"><img data-bbox="1138 1288 1658 1942"/></div>
<p data-bbox="1418 1950 1508 2000">(b)</p>
<p data-bbox="988 2016 1272 2068">图 9-22 牛顿环</p>
<p data-bbox="336 2084 916 2144">(1) 牛顿环明纹与暗纹的位置</p>
<p data-bbox="247 2148 2016 2276">考虑到入射光在膜层的上表面的反射无半波损失, 但在下表面的反射有半波损失。设 $r$为牛顿环第 $k$ 级圆环的半径, 该环对应空气膜厚度为 $e$, 则明环条件为</p>
<div class="formula" data-bbox="778 2296 1524 2484"><img data-bbox="778 2296 1524 2484"/><div>$$\delta=2 e+\frac{\lambda}{2}=k \lambda \quad(k=1,2,3, \cdots)$$</div>暗环条件为 $\quad \delta=2 e+\frac{\lambda}{2}=(2 k+1) \frac{\lambda}{2} \quad(k=0,1,2, \cdots)$</div>
<p data-bbox="336 2516 742 2576">(2) 牛顿环半径公式</p>
<p data-bbox="336 2580 584 2642">在 $\triangle O O^{\prime} A$ 中</p>
<p></p>
<div class="formula" data-bbox="880 2572 1398 2648"><img data-bbox="880 2572 1398 2648"/><div>$$r^{2}=R^{2}-(R-e)^{2}=2 R e-e^{2}$$</div></div>
<p data-bbox="247 2648 1650 2708">因为 $R \gg e$, 略去 2 阶小项 $e^{2}$ 后, 得 $e=r^{2} / 2 R$, 代入式 (9-27), 则明环条件为</p>
<div class="formula" data-bbox="798 2708 1448 2822"><img data-bbox="798 2708 1448 2822"/><div>$$\delta=\frac{r^{2}}{R}+\frac{\lambda}{2}=k \lambda \quad(k=1,2,3, \cdots)$$</div></div>
</body></html>

Example: QwenVL HTML with image caption

User

QwenVL HTML with image caption

Qwen2.5-VL

<html>
<body>
<div class="image caption" data-bbox="30 164 1225 807">
    <img data-bbox="30 164 1225 807" />
    <p>This is an image of a magazine cover. It's got three pictures on it, and then the title underneath that says 2030 in white letters, and then next to that it has some Chinese writing. The picture on the left shows wind turbines up on top of a hillside. There's several of them. They're all white with blue bases. You can see mountains off in the distance behind this hillside. Then there's another picture right beside that one showing water coming out from a dam. And then the third picture shows solar panels lined up in rows as far as you can see.</p>
</div>
<p data-bbox="95 871 1120 925">这是科学家描绘的美好图景：到 2030 年，地球上消耗的所有能源，都将是清洁的可再生能源。要进入这个由风、水、太阳所主导的新能源时代，我们必须克服怎样的困难？</p>
<h2 data-bbox="89 953 451 995">24 可再生能源政策简史</h2>
<p data-bbox="134 1081 458 1116">产业圆桌 Industry Round Table</p>
<p data-bbox="89 1121 444 1163">26 2030 我们开什么车</p>
<p data-bbox="683 1078 1064 1121">44 完美主义 硬币的两面</p>
<p data-bbox="134 1280 486 1315">信息技术 Information Technology</p>
<p data-bbox="89 1320 384 1362">30 电视向网络投降</p>
<p data-bbox="683 1280 1064 1325">50 社会身份 健康晴雨表</p>
</body>
</html>

Example: Parsing Screenshot

User

Parse the image into HTML

Qwen2.5-VL

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>淘宝许愿淘彩头分25亿</title>
    <style>
        body {
            background-color: #4a0d6e;
            margin: 0;
            padding: 0;
            font-family: Arial, sans-serif;
            color: white;
            text-align: center;
        }
        .header {
            display: flex;
            justify-content: space-between;
            align-items: center;
            padding: 10px;
        }
        .header h1 {
            margin: 0;
            color: gold;
        }
        .content {
            padding: 20px;
        }
        .wish-options {
            display: flex;
            justify-content: center;
            gap: 10px;
            margin-top: 20px;
        }
        .wish-option {
            width: 100px;
            height: 200px;
            border-radius: 10px;
            padding: 10px;
            box-sizing: border-box;
            cursor: pointer;
        }
        .selected {
            background-color: gold;
        }
        .unselected {
            background-color: purple;
        }
        .footer {
            margin-top: 20px;
        }
        .button {
            background-color: gold;
            border: none;
            padding: 10px 20px;
            border-radius: 5px;
            cursor: pointer;
        }
    </style>
</head>
<body>
    <div class="header">
        <div><img src="back_arrow.png" alt="Back"></div>
        <h1>淘宝·许愿淘彩头分25亿</h1>
        <div><img src="more_options.png" alt="More Options"></div>
    </div>
    <div class="content">
        <p>许下新年心愿得红包</p>
        <img src="golden_lantern.png" alt="Golden Lantern">
        <div class="wish-options">
            <div class="wish-option selected">福寿康宁</div>
            <div class="wish-option unselected">我想免费周游世界</div>
            <div class="wish-option unselected">追随自己的热情</div>
        </div>
        <div class="footer">
            <button class="button">换一批 | 定制心愿</button>
            <p>许下你的定制心愿，更灵验哦</p>
            <button class="button">许愿拿红包 🎉</button>
        </div>
    </div>
</body>
</html>

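5. Enhanced Video Understanding

Qwen2.5-VL's video understanding has been comprehensively upgraded. For temporal processing, we introduced dynamic frame rate (FPS) training and absolute time encoding. As a result, the model supports understanding of hour-long videos and can localize events at second-level granularity. It can accurately comprehend long videos, search a video for specific events, and summarize key points from different time segments, helping users extract the key information in a video quickly and efficiently.

Example: Information Extraction from Videos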

User

Watch the video and list the paper titles in a table.

Qwen2.5-VL

Here is a table listing the paper titles from the video:

Paper Title
A New Sequential Prediction Framework with Spatial-temporal Embedding
NeW CRFs: Neural Window Fully-connected CRFs for Monocular Depth Estimation
Deep Unified Representation for Heterogeneous Recommendation
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
KoMen: Domain Knowledge Guided Interaction Recommendation for Emerging Scenarios Yiqing Xie
Vision-Language Pre-Training for Boosting Scene Text Detectors

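Example: Video Reasoning

User

What is the object in the picture? Please analyze it in detail.

Qwen2.5-VL

The object in the picture is a traditional Chinese lion dance prop; specifically, a small lion dance head that may be a decoration or a children's toy. The lion head is mainly red, symbolizing celebration and good fortune; in Chinese culture, red is believed to ward off evil. Its design is very vivid, with an exaggerated expression and a wide-open mouth showing white teeth and a red tongue, all intended to attract the audience's attention during a performance and convey a sense of might and vitality.

The lion head hangs from several thin strings, suggesting that it may be a controllable puppet or a decoration. In an actual lion dance performance, two people usually wear such a lion costume and dance, one controlling the head and the other the body and tail, imitating the lion's postures and behavior through coordinated movements.

Some home decorations, such as a vase and plants, are visible in the background, suggesting that the scene was shot in a home setting. Overall, this lion head is not only a culturally significant object but also a decoration full of fun and artistic value.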

Example: Long Video Understanding

User

Could you provide a comprehensive overview of the competition’s progress?

Qwen2.5-VL

The video captures an intense and highly competitive tennis match between Venus Williams and Serena Williams, two renowned professional tennis players, during the Australian Open in Melbourne. The match is set on a vibrant blue hard court, with the iconic “Melbourne” branding prominently displayed. The atmosphere is electric, with a packed stadium filled with enthusiastic spectators, many of whom are dressed in blue, adding to the visual spectacle.

Match Progress:

Initial Stages:
- The match begins with both players demonstrating their powerful serves and precise returns. The scoreboard shows the game progressing closely, with both sisters showing their competitive spirit.
- Venus Williams starts strong, securing a few early points with her aggressive play and strategic shots. Her focus and determination are evident as she moves swiftly across the court.
Mid-Game Dynamics:
- As the match progresses, Serena Williams begins to assert her dominance. She showcases her signature power and agility, hitting deep groundstrokes and executing well-placed volleys.
- The rally between the two becomes increasingly intense, with extended exchanges that highlight their exceptional athleticism and skill. The crowd erupts in cheers, responding to the high-quality tennis on display.
Key Moments:
- A notable moment occurs when Venus Williams hits a powerful serve, but Serena responds with a stunning return, sending the ball back with incredible speed and precision. This exchange leads to a prolonged rally, captivating the audience.
- Another significant point sees Serena Williams executing a drop shot, catching Venus off guard and winning the point. The crowd’s reaction underscores the brilliance of the shot.
Later Stages:
- The match continues with both players maintaining their high level of performance. The scoreboard reflects a close contest, with neither player gaining a substantial lead.
- Venus Williams demonstrates her resilience, fighting back from behind and keeping the pressure on Serena. Her ability to adapt and change tactics keeps the match unpredictable.
Final Stages:
- As the match nears its conclusion, the tension builds. Both players are visibly fatigued but remain committed to giving their best effort.
- In a crucial moment, Serena Williams secures a break point and capitalizes on it, taking the lead. Her composure and experience shine through as she manages to maintain this advantage.
- The final points are fiercely contested, with both players pushing themselves to the limit. Ultimately, Serena Williams prevails, clinching the match with a decisive shot.

Audience and Atmosphere:

The crowd plays a significant role throughout the match, providing constant support and encouragement. Their reactions range from excited cheers to moments of suspenseful silence, reflecting the intensity of the game.
The presence of prominent sponsors such as KIA, ANZ, and Emirates adds to the professional setting, emphasizing the importance of the event.

Conclusion:

The match between Venus Williams and Serena Williams at the Australian Open is a testament to their exceptional talent and rivalry. It showcases not only their individual skills but also the deep connection and mutual respect they share as siblings. The match concludes with Serena Williams emerging victorious, but the true winner is the sport of tennis, which benefits from such high-caliber performances. The audience leaves the stadium with lasting memories of an unforgettable encounter.

Example: Live chat

Live chat

Example: Temporal Video Grounding

User

Give the query: ’the user is experiencing the image generation feature’, when does the described content occur in the video? Use seconds for time format.

Qwen2.5-VL

The described content occurs from 28 seconds to 50 seconds in the video. During this segment, the user interacts with the image generation feature, requesting and receiving an artistic double scene painting of a mountain during day and night. The user then adds a bird to the generated image, demonstrating the functionality of the image generation tool.

Example: Structured Video Captioning

User

Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. Provide the result in json format with ‘mm:ss.ff’ format for time depiction.

Qwen2.5-VL

[
    {
        "start_time": "00:21.00",
        "end_time": "00:49.00",
        "description": "A person removes a piece of meat from its packaging and cuts off the fat."
    },
    {
        "start_time": "00:50.00",
        "end_time": "01:27.00",
        "description": "The person seasons the meat with salt and pepper on both sides."
    },
    {
        "start_time": "01:28.00",
        "end_time": "01:36.00",
        "description": "The seasoned meat is placed on a grill."
    },
    {
        "start_time": "01:37.00",
        "end_time": "02:08.00",
        "description": "Oil is poured into a pot followed by chopped onions, carrots, and bell peppers which are then stirred."
    },
    {
        "start_time": "02:09.00",
        "end_time": "02:26.00",
        "description": "A rack is placed inside the pot and the meat is set on top."
    },
    {
        "start_time": "02:27.00",
        "end_time": "02:34.00",
        "description": "Beer is poured into the pot."
    },
    {
        "start_time": "02:35.00",
        "end_time": "02:43.00",
        "description": "The lid of the pot is covered with charcoal briquettes."
    },
    {
        "start_time": "03:14.00",
        "end_time": "04:26.00",
        "description": "The meat is sliced into pieces."
    }
]
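Timestamps in the `mm:ss.ff` format are easy to convert to seconds, which makes it possible to compute durations or spot stretches of the video that no event covers. The sketch below uses the last three events from the reply above. The helper names are hypothetical, and treating `ff` as a decimal fraction of a second is an assumption; the prompt does not define it.

```python
# The last three events from the reply above (descriptions omitted).
events = [
    {"start_time": "02:27.00", "end_time": "02:34.00"},
    {"start_time": "02:35.00", "end_time": "02:43.00"},
    {"start_time": "03:14.00", "end_time": "04:26.00"},
]


def to_seconds(stamp):
    """Convert 'mm:ss.ff' to seconds, reading 'ff' as a decimal fraction (assumption)."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def find_gaps(events, min_gap=2.0):
    """Return (previous_end, next_start) pairs separated by at least min_gap seconds."""
    spans = sorted(
        (to_seconds(e["start_time"]), to_seconds(e["end_time"])) for e in events
    )
    gaps = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps


durations = [to_seconds(e["end_time"]) - to_seconds(e["start_time"]) for e in events]
print("durations (s):", durations)
print("uncovered gaps (s):", find_gaps(events))
```

On this input the only gap above the threshold lies between 02:43 and 03:14, the stretch of the video that the reply does not describe.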

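6. A Visual Agent That Can Operate Computers and Phones

By drawing on its intrinsic perception, parsing, and reasoning capabilities, Qwen2.5-VL demonstrates respectable device operation abilities. These include performing tasks on phones, web platforms, and computers, providing a valuable reference point for building true visual agents.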

Example: Mobile Use

Please help me book a one-way ticket in the bookings app. The starting point is Chongqing Jiangbei Airport and the ending point is Beijing Capital Airport on January 28th.

Example: Mobile Use

Send a Lunar New Year greeting to my QQ friend Zhang San.

Example: Computer Use

Find the monthly forecast for Manchester, GB for this month.

Example: Computer Use

Could you assist me in enhancing the color vibrancy of my photo?

Example: Computer Use

Please help me install an extension in VS Code from a local VSIX file "/home/user/eval.vsix"

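Model Updates

Compared with Qwen2-VL, Qwen2.5-VL strengthens the model's perception of temporal and spatial scale and further simplifies the network structure to improve efficiency.

Perception of Time and Image Size

In the spatial dimension, Qwen2.5-VL not only dynamically converts images of different sizes into token sequences of different lengths, but also represents coordinates such as detection boxes and points directly in terms of the image's actual dimensions, without the conventional coordinate normalization. This lets the model learn image scale directly. In the temporal dimension, dynamic FPS (frames per second) training and absolute time encoding are introduced, aligning the mRoPE ids directly with the rate at which time passes. This lets the model learn the pace of time from the intervals between temporal-dimension ids.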
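To make the idea of time-aligned ids concrete, here is a toy sketch. It is not the model's actual implementation: the function names and the `ids_per_second` constant are hypothetical, chosen only to show how ids tied to timestamps differ from ids tied to frame order.

```python
def index_ids(num_frames):
    """Baseline: one id per frame in order, regardless of how fast frames were sampled."""
    return list(range(num_frames))


def temporal_ids(num_frames, fps, ids_per_second=2):
    """Toy time-aligned ids: each frame's id is proportional to its timestamp.

    `ids_per_second` is an assumed constant for illustration only.
    """
    return [round(i / fps * ids_per_second) for i in range(num_frames)]


# Four frames sampled at 2 FPS span 1.5 s; four frames at 1 FPS span 3 s.
print(index_ids(4))               # identical for both sampling rates
print(temporal_ids(4, fps=2.0))   # consecutive ids are 1 apart
print(temporal_ids(4, fps=1.0))   # consecutive ids are 2 apart
```

With frame-order ids the two clips are indistinguishable, whereas with time-aligned ids the spacing between consecutive ids reflects how much time elapsed between frames, which is the property the paragraph above describes.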

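A More Concise and Efficient Visual Encoder

The visual encoder plays a crucial role in multimodal large models. We trained a native dynamic-resolution ViT from scratch, with stages including CLIP, vision-language model alignment, and end-to-end training. To address the ViT load imbalance between the training and testing phases of multimodal large models, we introduced window attention, which effectively reduces the computational load on the ViT side. In our ViT setup, only four layers use full attention; the remaining layers use window attention. The maximum window size is 8x8, and regions smaller than 8x8 are not padded but keep their original scale, so the model retains native resolution. In addition, to simplify the overall network structure, we made the ViT architecture more consistent with LLMs by adopting RMSNorm and SwiGLU.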
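The unpadded window partition can be illustrated with a small sketch. It assumes the 8x8 window is measured in patches on a 2D patch grid, which the text does not state explicitly, and the function names are hypothetical. The sketch splits the grid into windows of at most 8x8, leaves edge windows smaller instead of padding them, and compares the number of query-key pairs with full attention.

```python
def partition_windows(height, width, window=8):
    """Split a height x width patch grid into windows of at most window x window.

    Edge windows are simply smaller; nothing is padded.
    Returns (top, left, bottom, right) tuples with exclusive bottom/right.
    """
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            bottom = min(top + window, height)
            right = min(left + window, width)
            windows.append((top, left, bottom, right))
    return windows


def attention_pairs(windows):
    """Query-key pairs when each patch attends only within its own window."""
    return sum(((b - t) * (r - l)) ** 2 for t, l, b, r in windows)


grid_h, grid_w = 20, 12  # an arbitrary example grid
windows = partition_windows(grid_h, grid_w)
full_pairs = (grid_h * grid_w) ** 2
windowed_pairs = attention_pairs(windows)
print(len(windows), "windows;", windowed_pairs, "vs", full_pairs, "pairs")
```

Full attention grows quadratically with the total number of patches, while windowed attention grows roughly linearly once the grid is larger than one window, which is why keeping only a few full-attention layers reduces the encoder's cost.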

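What's Next

In the near future, we will further enhance the model's problem-solving and reasoning capabilities while integrating more modalities. This will make the model smarter and move us toward an integrated omni model that can handle multiple types of input and tasks.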
