We release Qwen2.5-VL, the new flagship vision-language model of Qwen and a significant leap from the previous Qwen2-VL. To try the latest model, feel free to visit Qwen Chat and choose Qwen2.5-VL-72B-Instruct. We also open-source both base and instruct models in three sizes (3B, 7B, and 72B) on both Hugging Face and ModelScope.
The key features include:
Understanding things visually: Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
Being agentic: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tools, enabling computer use and phone use.
Understanding long videos and capturing events: Qwen2.5-VL can comprehend videos of over 1 hour, and it now has the new ability of capturing events by pinpointing the relevant video segments.
Capable of visual localization in different formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it can provide stable JSON outputs for coordinates and attributes.
Generating structured outputs: for data such as scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting applications in finance, commerce, and beyond.
We evaluate our models against SOTA models as well as the best models of similar size. The flagship Qwen2.5-VL-72B-Instruct achieves competitive performance on a series of benchmarks covering domains and tasks including college-level problems, math, document understanding, general question answering, video understanding, and visual agent tasks. Notably, Qwen2.5-VL shows significant advantages in understanding documents and diagrams, and it can act as a visual agent without task-specific fine-tuning.
Among the smaller models, Qwen2.5-VL-7B-Instruct outperforms GPT-4o-mini on a number of tasks, and Qwen2.5-VL-3B, a solution for edge AI, even outperforms the 7B model of our previous Qwen2-VL.
Qwen2.5-VL has significantly enhanced its general image recognition capabilities, greatly expanding the range of categories it can recognize. These include not only plants, animals, and landmarks such as famous mountains and rivers, but also IPs from films and TV series, as well as a wide variety of products.
Qwen2.5-VL utilizes bounding boxes and point-based representations for grounding, enabling hierarchical positioning and standardized JSON output. This enhanced localization capability serves as a foundation for visual reasoning.
Detect all motorcyclists in the image and return their locations in the form of coordinates. The format of output should be like {"bbox_2d": [x1, y1, x2, y2], "label": "motorcyclist", "sub_label": "wearing helmet" # or "not wearing helmet"}.
Identify basketball players and detect the key points of their heads and hands, returning them in the form of points. The primary label is the player’s name, and the secondary labels include left hand, right hand, and head.
Count the number of birds in the figure, including those that are only showing their heads. To ensure accuracy, first detect their key points, then give the total number.
Locate every cake and describe their features, output the bbox coordinates in JSON format.
Qwen2.5-VL
[{"bbox_2d":[130,725,304,906],"label":"cake with white frosting and eyes"},{"bbox_2d":[344,641,508,800],"label":"cake with white frosting and eyes"},{"bbox_2d":[289,530,435,690],"label":"cake with white frosting and colorful sprinkles"},{"bbox_2d":[40,542,197,712],"label":"cake with white frosting and colorful sprinkles"},{"bbox_2d":[64,386,202,566],"label":"cake with pink frosting and red sprinkles"},{"bbox_2d":[245,375,363,529],"label":"cake with pink frosting and blue sprinkles"},{"bbox_2d":[389,359,492,494],"label":"cake with pink frosting and blue sprinkles"},{"bbox_2d":[423,436,555,587],"label":"cake with pink frosting and googly eyes"},{"bbox_2d":[518,348,642,512],"label":"cake with pink frosting and blue sprinkles"},{"bbox_2d":[500,549,652,698],"label":"cake with white frosting and googly eyes"},{"bbox_2d":[615,465,757,626],"label":"cake with white frosting and colorful sprinkles"},{"bbox_2d":[723,372,849,525],"label":"cake with chocolate frosting"},{"bbox_2d":[782,497,944,676],"label":"cake with white frosting and colorful candies"}]
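Because the grounding output is plain JSON, it can be consumed directly by downstream code. A minimal sketch of post-processing such an output (the fence-stripping and label-splitting logic here are our illustrative assumptions, not part of any official Qwen API):

```python
import json
from collections import Counter

def parse_grounding(text):
    """Parse a Qwen2.5-VL grounding response into a list of dicts.

    The model sometimes wraps JSON in a markdown code fence, so strip
    it first before calling json.loads (an assumption for robustness).
    """
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

# A small excerpt of the output shown above:
output = (
    '[{"bbox_2d":[130,725,304,906],"label":"cake with white frosting and eyes"},'
    '{"bbox_2d":[344,641,508,800],"label":"cake with white frosting and eyes"},'
    '{"bbox_2d":[723,372,849,525],"label":"cake with chocolate frosting"}]'
)
cakes = parse_grounding(output)

# Tally cakes by frosting color, relying on the "... with <color> frosting" phrasing.
by_frosting = Counter(
    c["label"].split(" frosting")[0].split("with ")[-1] for c in cakes
)
```

Each `bbox_2d` is `[x1, y1, x2, y2]` in absolute pixel coordinates, so the boxes can be drawn or cropped without any denormalization step.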
Qwen2.5-VL has upgraded its OCR capabilities to a new level, with enhanced multi-scenario, multi-language, and multi-orientation text recognition and text localization performance. Furthermore, its information extraction has been significantly enhanced to meet the growing demand for digitalization and intelligence in areas such as qualification review and financial business.
You are a delivery bill verification expert who can identify house numbers and courier label information, and verify whether the address on the courier label matches the house number. Please identify the house number from the bill address and the house number on the door separately, and then verify if the door number is delivered correctly.
Qwen2.5-VL
To verify the delivery, let’s break down the information from both the courier label and the door.
For Qwen2.5-VL, we have designed a unique document parsing format called QwenVL HTML, which extracts layout information based on HTML. QwenVL HTML can perform document parsing in various scenarios, such as magazines, research papers, web pages, and even mobile screenshots.
<html><body><h1 data-bbox="879 283 1605 348">Qwen2.5 Technical Report</h1><p data-bbox="1124 406 1360 456"></p><address class="author"><p>Qwen Team</p></address><div class="image" data-bbox="778 491 850 541"><img data-bbox="778 491 850 541"/></div><p data-bbox="885 491 1460 541">https://huggingface.co/Qwen</p><div class="image" data-bbox="778 541 850 589"><img data-bbox="778 541 850 589"/></div><p data-bbox="885 541 1708 589">https://modelscope.cn/organization/qwen</p><div class="image" data-bbox="792 589 850 639"><img data-bbox="792 589 850 639"/></div><p data-bbox="885 589 1584 639">https://github.com/QwenLM/Qwen2.5</p><h2 data-bbox="1143 681 1344 733">Abstract</h2><p data-bbox="434 785 2050 1252">In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning, including offline learning DPO and online learning GRPO. Post-training techniques significantly enhance human preference, and notably improve long text generation, structural data analysis, and instruction following.</p><p data-bbox="434 1262 2050 1587">To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich configurations. The open-weight offerings include base models and instruction-tuned models in sizes of $0.5 \mathrm{~B}, 1.5 \mathrm{~B}, 3 \mathrm{~B}, 7 \mathrm{~B}, 14 \mathrm{~B}, 32 \mathrm{~B}$, and $72 \mathrm{~B}$ parameters. Quantized versions of the instruction-tuned models are also provided.
Over 100 models can be accessed from Hugging Face Hub, ModelScope, and Kaggle. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.</p><p data-bbox="434 1587 2050 2052">Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math (Yang et al., 2024b), Qwen2.5-Coder (Hui et al., 2024), QwQ (Qwen Team, 2024d), and multimodal models.</p><div class="image" data-bbox="408 2275 2086 2800"><img data-bbox="408 2275 2086 2800"/></div><p data-bbox="289 2864 2202 3058">Figure 1: In the iterative development of the Qwen series, data scaling has played a crucial role. Qwen 2.5, which leverages 18 trillion tokens for pre-training, has demonstrated the most advanced capabilities within the Qwen series, especially in terms of domain expertise, underscoring the importance of scale together with mixture in enhancing the model’s capabilities.</p></body></html>
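Since QwenVL HTML is ordinary HTML whose elements carry `data-bbox` attributes, the layout can be recovered with a standard parser. A minimal sketch using Python's built-in `html.parser` (the extraction logic below is our own illustration, not an official Qwen tool):

```python
from html.parser import HTMLParser

class BboxExtractor(HTMLParser):
    """Collect (tag, bbox, text) triples from QwenVL HTML output."""

    def __init__(self):
        super().__init__()
        self.items = []   # finished (tag, bbox, text) triples
        self._stack = []  # open elements that carry a data-bbox attribute

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-bbox" in attrs:
            # data-bbox is "x1 y1 x2 y2" in absolute pixel coordinates
            bbox = tuple(int(v) for v in attrs["data-bbox"].split())
            self._stack.append([tag, bbox, []])

    def handle_data(self, data):
        if self._stack and data.strip():
            self._stack[-1][2].append(data.strip())

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            t, bbox, text = self._stack.pop()
            self.items.append((t, bbox, " ".join(text)))

parser = BboxExtractor()
parser.feed('<h1 data-bbox="879 283 1605 348">Qwen2.5 Technical Report</h1>')
```

Feeding the full document above would yield one triple per laid-out element, which is enough to rebuild reading order or overlay the layout on the page image.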
<html><body><div class="image caption" data-bbox="30 164 1225 807"><img data-bbox="30 164 1225 807"/><p>This is an image of a magazine cover. It's got three pictures on it, and then the title underneath that says 2030 in white letters, and then next to that it has some Chinese writing. The picture on the left shows wind turbines up on top of a hillside. There's several of them. They're all white with blue bases. You can see mountains off in the distance behind this hillside. Then there's another picture right beside that one showing water coming out from a dam. And then the third picture shows solar panels lined up in rows as far as you can see.</p></div><p data-bbox="95 871 1120 925">这是科学家描绘的美好图景:到 2030 年,地球上消耗的所有能源,都将是清洁的可再生能源。要进入这个由风、水、太阳所主导的新能源时代,我们必须克服怎样的困难?</p><h2 data-bbox="89 953 451 995">24 可再生能源政策简史</h2><p data-bbox="134 1081 458 1116">产业圆桌 Industry Round Table</p><p data-bbox="89 1121 444 1163">26 2030 我们开什么车</p><p data-bbox="683 1078 1064 1121">44 完美主义 硬币的两面</p><p data-bbox="134 1280 486 1315">信息技术 Information Technology</p><p data-bbox="89 1320 384 1362">30 电视向网络投降</p><p data-bbox="683 1280 1064 1325">50 社会身份 健康晴雨表</p></body></html>
Qwen2.5-VL’s video comprehension capabilities have been comprehensively upgraded. In terms of temporal processing, we have introduced dynamic frame rate (FPS) training and absolute time encoding. As a result, the model can not only understand ultra-long videos on an hourly scale but also achieve second-level event localization. It can accurately comprehend content from videos spanning hours, search for specific events within them, and summarize key points from different time segments, allowing users to quickly and efficiently extract the crucial information embedded in the videos.
Could you provide a comprehensive overview of the competition’s progress?
Qwen2.5-VL
The video captures an intense and highly competitive tennis match between Venus Williams and Serena Williams, two renowned professional tennis players, during the Australian Open in Melbourne. The match is set on a vibrant blue hard court, with the iconic “Melbourne” branding prominently displayed. The atmosphere is electric, with a packed stadium filled with enthusiastic spectators, many of whom are dressed in blue, adding to the visual spectacle.
The match begins with both players demonstrating their powerful serves and precise returns. The scoreboard shows the game progressing closely, with both sisters showing their competitive spirit.
Venus Williams starts strong, securing a few early points with her aggressive play and strategic shots. Her focus and determination are evident as she moves swiftly across the court.
Mid-Game Dynamics:
As the match progresses, Serena Williams begins to assert her dominance. She showcases her signature power and agility, hitting deep groundstrokes and executing well-placed volleys.
The rally between the two becomes increasingly intense, with extended exchanges that highlight their exceptional athleticism and skill. The crowd erupts in cheers, responding to the high-quality tennis on display.
Key Moments:
A notable moment occurs when Venus Williams hits a powerful serve, but Serena responds with a stunning return, sending the ball back with incredible speed and precision. This exchange leads to a prolonged rally, captivating the audience.
Another significant point sees Serena Williams executing a drop shot, catching Venus off guard and winning the point. The crowd’s reaction underscores the brilliance of the shot.
Later Stages:
The match continues with both players maintaining their high level of performance. The scoreboard reflects a close contest, with neither player gaining a substantial lead.
Venus Williams demonstrates her resilience, fighting back from behind and keeping the pressure on Serena. Her ability to adapt and change tactics keeps the match unpredictable.
Final Stages:
As the match nears its conclusion, the tension builds. Both players are visibly fatigued but remain committed to giving their best effort.
In a crucial moment, Serena Williams secures a break point and capitalizes on it, taking the lead. Her composure and experience shine through as she manages to maintain this advantage.
The final points are fiercely contested, with both players pushing themselves to the limit. Ultimately, Serena Williams prevails, clinching the match with a decisive shot.
The crowd plays a significant role throughout the match, providing constant support and encouragement. Their reactions range from excited cheers to moments of suspenseful silence, reflecting the intensity of the game.
The presence of prominent sponsors such as KIA, ANZ, and Emirates adds to the professional setting, emphasizing the importance of the event.
The match between Venus Williams and Serena Williams at the Australian Open is a testament to their exceptional talent and rivalry. It showcases not only their individual skills but also the deep connection and mutual respect they share as siblings. The match concludes with Serena Williams emerging victorious, but the true winner is the sport of tennis, which benefits from such high-caliber performances. The audience leaves the stadium with lasting memories of an unforgettable encounter.
Given the query "the user is experiencing the image generation feature", when does the described content occur in the video? Use seconds for the time format.
Qwen2.5-VL
The described content occurs from 28 seconds to 50 seconds in the video. During this segment, the user interacts with the image generation feature, requesting and receiving an artistic double scene painting of a mountain during day and night. The user then adds a bird to the generated image, demonstrating the functionality of the image generation tool.
Localize a series of activity events in the video, output the start and end timestamp for each event, and describe each event with sentences. Provide the result in json format with ‘mm:ss.ff’ format for time depiction.
Qwen2.5-VL
[{"start_time":"00:21.00","end_time":"00:49.00","description":"A person removes a piece of meat from its packaging and cuts off the fat."},{"start_time":"00:50.00","end_time":"01:27.00","description":"The person seasons the meat with salt and pepper on both sides."},{"start_time":"01:28.00","end_time":"01:36.00","description":"The seasoned meat is placed on a grill."},{"start_time":"01:37.00","end_time":"02:08.00","description":"Oil is poured into a pot followed by chopped onions, carrots, and bell peppers which are then stirred."},{"start_time":"02:09.00","end_time":"02:26.00","description":"A rack is placed inside the pot and the meat is set on top."},{"start_time":"02:27.00","end_time":"02:34.00","description":"Beer is poured into the pot."},{"start_time":"02:35.00","end_time":"02:43.00","description":"The lid of the pot is covered with charcoal briquettes."},{"start_time":"03:14.00","end_time":"04:26.00","description":"The meat is sliced into pieces."}]
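The 'mm:ss.ff' timestamps above convert to plain seconds with simple arithmetic, which is handy for seeking into the video or computing event durations. A small helper (our illustration, not part of the model's output contract):

```python
def to_seconds(ts):
    """Convert an 'mm:ss.ff' timestamp, as in the JSON above, to seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + float(seconds)

# Two of the localized events from the output above:
events = [
    {"start_time": "00:21.00", "end_time": "00:49.00"},  # unpacking the meat
    {"start_time": "03:14.00", "end_time": "04:26.00"},  # slicing the meat
]
durations = [to_seconds(e["end_time"]) - to_seconds(e["start_time"]) for e in events]
# durations -> [28.0, 72.0] seconds
```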
Please help me book a one-way ticket in the bookings app. The starting point is Chongqing Jiangbei Airport and the ending point is Beijing Capital Airport on January 28th
Compared with Qwen2-VL, Qwen2.5-VL enhances the model’s perception of temporal and spatial scales and further simplifies the network structure to improve efficiency.
Perception of Time and Image Size
In the spatial dimension, Qwen2.5-VL not only dynamically converts images of different sizes into token sequences of varying lengths, but also represents coordinates such as detection boxes and points directly at the actual pixel scale of the image, without traditional coordinate normalization. This allows the model to directly learn the scale of images. In the temporal dimension, dynamic FPS (frames per second) training and absolute time encoding have been introduced, aligning mRoPE ids directly with absolute time. This enables the model to learn the pace of time from the intervals between temporal-dimension ids.
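The effect of absolute time encoding can be sketched with a toy calculation. Here `ticks_per_second` is an illustrative assumption, not Qwen2.5-VL's actual configuration; the point is only that id spacing tracks real time rather than frame index:

```python
def temporal_ids(frame_times, ticks_per_second=4):
    """Map absolute frame timestamps (in seconds) to temporal position ids.

    Because ids track absolute time rather than frame index, the id gap
    between two frames reflects the real interval regardless of the
    sampling FPS. (ticks_per_second=4 is an illustrative choice.)
    """
    return [round(t * ticks_per_second) for t in frame_times]

# The same clip sampled at 2 FPS and at 0.5 FPS produces ids whose
# spacing is consistent with wall-clock time in both cases:
ids_dense  = temporal_ids([0.0, 0.5, 1.0, 1.5])  # -> [0, 2, 4, 6]
ids_sparse = temporal_ids([0.0, 2.0, 4.0])       # -> [0, 8, 16]
```

Under a naive frame-index encoding, both samplings would get ids `[0, 1, 2, ...]` and the model could not tell a 0.5 s gap from a 2 s gap; with absolute time encoding, the interval itself carries that information.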
More Concise and Efficient Visual Encoder
The visual encoder plays a crucial role in multimodal large models. We trained a native dynamic-resolution ViT from scratch, including CLIP pre-training, vision-language alignment, and end-to-end training stages. To address the load imbalance that ViT causes during the training and inference of multimodal large models, we introduced Window Attention to effectively reduce the computational load on the ViT side. In our ViT setup, only four layers use Full Attention, while the rest use Window Attention. The maximum window size is 8x8, and regions smaller than 8x8 are not padded; they retain their original scale, ensuring that the model operates at native resolution. Additionally, to simplify the overall network structure, we aligned the ViT architecture with LLMs by adopting RMSNorm and SwiGLU.
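The no-padding window partition can be sketched as follows. The helper below is purely illustrative and assumes only what is stated above: windows are at most 8x8 patches, and edge windows keep their original smaller size instead of being padded:

```python
def window_grid(h, w, win=8):
    """Split an h x w patch grid into attention windows of at most win x win.

    Edge windows keep their original (smaller) size instead of being padded,
    mirroring the no-padding, native-resolution design described above.
    Returns the (height, width) of each window in row-major order.
    """
    rows = [min(win, h - r) for r in range(0, h, win)]
    cols = [min(win, w - c) for c in range(0, w, win)]
    return [(rh, cw) for rh in rows for cw in cols]

# A 10x10 patch grid yields one full 8x8 window plus three smaller
# edge windows, with no padded positions:
windows = window_grid(10, 10)  # -> [(8, 8), (8, 2), (2, 8), (2, 2)]
```

Attention cost then scales with the largest window (at most 64 patches) instead of with the whole image, which is what makes Full Attention affordable in only four layers.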
In the near future, we will further enhance the model’s problem-solving and reasoning capabilities, while incorporating more modalities. This will make the model smarter and move us towards an integrated omni-model that can handle multiple types of input and tasks.