Alongside the rapid development of our large language model Qwen, we leveraged Qwen's capabilities and unified multimodal pretraining to address the generalization limitations of multimodal models, and we open-sourced the multimodal model Qwen-VL in September 2023. Recently, the Qwen-VL series has undergone a significant upgrade with the launch of two enhanced versions, Qwen-VL-Plus and Qwen-VL-Max. The key technical advancements in these versions include:

  • A substantial boost in image-related reasoning capabilities;
  • Considerable enhancement in recognizing, extracting, and analyzing details within images and texts contained therein;
  • Support for high-definition images with resolutions above one million pixels and images of various aspect ratios.
  • qwen-vl-plus: Qwen's Enhanced Large Visual Language Model. It significantly upgrades detail recognition and text recognition, and supports image inputs with resolutions of up to millions of pixels and arbitrary aspect ratios. It delivers strong performance across a broad range of visual tasks.
  • qwen-vl-max: Qwen's Most Capable Large Visual Language Model. Compared to the enhanced version, it further improves visual reasoning and instruction following, offering a higher level of visual perception and cognitive understanding. It delivers optimal performance on an even broader range of complex tasks.

Compared to the open-source version of Qwen-VL, these two models perform on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks, significantly surpassing the previous best results from open-source models.

Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google on tasks involving Chinese question answering and Chinese text comprehension. This breakthrough underscores the model's advanced capabilities and its potential to set new standards in the field of multimodal AI research and application.

| Model | DocVQA (Document understanding) | ChartQA (Chart understanding) | AI2D (Science diagrams) | TextVQA (Text reading) | MMMU (College-level problems) | MathVista (Mathematical reasoning) | MM-Bench-CN (Natural image QA in Chinese) |
|---|---|---|---|---|---|---|---|
| Other Best Open-source LVLM | 81.6% (CogAgent) | 68.4% (CogAgent) | 73.7% (Fuyu-Medium) | 76.1% (CogAgent) | 45.9% (Yi-VL-34B) | 36.7% (SPHINX-V2) | 72.4% (InternLM-XComposer-VL) |
| Gemini Pro | 88.1% | 74.1% | 73.9% | 74.6% | 47.9% | 45.2% | 74.3% |
| Gemini Ultra | 90.9% | 80.8% (1st) | 79.5% (1st) | 82.3% (1st) | 59.4% (1st) | 53.0% (1st) | - |
| GPT-4V | 88.4% | 78.5% | 78.2% | 78.0% | 56.8% | 49.9% | 73.9% |
| Qwen-VL-Plus | 91.4% | 78.1% | 75.9% | 78.9% | 45.2% | 43.3% | 68.0% |
| Qwen-VL-Max | 93.1% (1st) | 79.8% (2nd) | 79.3% (2nd) | 79.5% (2nd) | 51.4% (3rd) | 50.0% (2nd) | 75.1% (1st) |

(Ranks in parentheses indicate a model's position on that benchmark among the models listed.)

The new Qwen-VL-Plus and Qwen-VL-Max models not only exhibit exceptional benchmark performance but also demonstrate markedly improved problem-solving in real-world scenarios. These advanced models engage in dialogue with ease, identify celebrities and landmarks, generate text, and, notably, describe and interpret visual content far more accurately than before.

Here we present some practical examples:

1. Basic Recognition Capabilities

The latest Qwen-VL models are now more adept at accurately describing and identifying complex information within images, as well as at providing detailed background information and answering related questions. For instance, Qwen-VL can recognize not only common objects but also celebrities and landmarks, write poetry in various languages inspired by visuals, and analyze everyday screenshots.

Example: celebrity recognition
User
who are they? [image]
Qwen-VL

These are some of the most famous and highest-paid celebrities in the world. From left to right, top to bottom:

  • Justin Bieber (Canadian singer)
  • Lionel Messi (Argentine soccer player)
  • Cristiano Ronaldo (Portuguese soccer player)
  • Elon Musk (South African-born American entrepreneur and businessman)
  • Kylie Jenner (American media personality and socialite)
  • Taylor Swift (American singer-songwriter)
  • Robert Downey Jr. (American actor)
  • Scarlett Johansson (American actress)

It’s worth noting that this is just a small sample of the many talented and successful celebrities out there, and that fame and wealth are not the only measures of success or importance.

User
their chinese name?
Qwen-VL

Here are the Chinese names of the celebrities mentioned:

  • Justin Bieber: 贾斯汀·比伯
  • Lionel Messi: 莱昂内尔·梅西
  • Cristiano Ronaldo: 克里斯蒂亚诺·罗纳尔多
  • Elon Musk: 埃隆·马斯克
  • Kylie Jenner: 凯莉·詹娜
  • Taylor Swift: 泰勒·斯威夫特
  • Robert Downey Jr.: 小罗伯特·唐尼
  • Scarlett Johansson: 斯嘉丽·约翰逊

Note that these are just one possible rendering of each name in Chinese; other variations may be used in different contexts or regions.

2. Visual Agent Capability: The Essential Role of Localization

Beyond its fundamental capabilities in description and recognition, Qwen-VL also has impressive abilities to pinpoint and query specific elements. For instance, it can accurately highlight the black cars within an image. Moreover, Qwen-VL is also equipped to make judgments, deductions, and decisions based on the prevailing context of a scene.

Example: Locate Whatever You Desire
User
Locate the red car [image]
Qwen-VL
The red car is located in the bottom right corner of the image. [image]
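
To make a localization result like this actionable in an application, the returned coordinates can be drawn back onto the image. Below is a minimal sketch under an explicit assumption: the reply embeds coordinates in the <box>(x1,y1),(x2,y2)</box> convention used by the open-source Qwen-VL, normalized to a 0-1000 grid. The hosted Plus/Max API may expose localization results differently, so treat the parsing step purely as an illustration.

```python
import re
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, reply: str, out_path: str = "located.png") -> None:
    """Draw every <box>(x1,y1),(x2,y2)</box> found in `reply` onto the image."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for x1, y1, x2, y2 in re.findall(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>", reply):
        # Map the assumed 0-1000 normalized grid to pixel coordinates.
        box = (int(x1) * width / 1000, int(y1) * height / 1000,
               int(x2) * width / 1000, int(y2) * height / 1000)
        draw.rectangle(box, outline="red", width=3)
    image.save(out_path)

# Hypothetical usage with a made-up reply string:
# draw_boxes("street.jpg", "<ref>the red car</ref><box>(520,640),(870,940)</box>")
```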

3. Visual Reasoning Capability: To Solve Real Problems

One of the most notable advancements in the latest Qwen-VL is its capacity for complex reasoning based on visual inputs. This enhanced visual reasoning capability goes well beyond mere content description, extending to the comprehension and interpretation of intricate representations such as flowcharts, diagrams, and other symbolic systems. Qwen-VL-Plus/Max excels not only at mathematical problem solving and information organization but also at deeper interpretation and analysis of charts and graphs.

Example: Math
User
read the image and solve it step by step [image]
Qwen-VL

The image shows two geometric figures, a rectangular prism and a cube, with their dimensions labeled. The problem asks to calculate the surface area and volume of these figures.

  1. For the rectangular prism:
  • Length (l) = 4 cm
  • Width (w) = 2 cm
  • Height (h) = 3 cm

Surface Area (SA) of a rectangular prism = 2lw + 2lh + 2wh
Volume (V) of a rectangular prism = lwh

Substituting the given values:
SA = 2(4cm)(2cm) + 2(4cm)(3cm) + 2(2cm)(3cm)
SA = 16cm² + 24cm² + 12cm²
SA = 52cm²

V = (4cm)(2cm)(3cm)
V = 24cm³

  2. For the cube:
  • Side length (a) = 3 dm

Surface Area (SA) of a cube = 6a²
Volume (V) of a cube = a³

Substituting the given value:
SA = 6(3dm)²
SA = 6(9dm²)
SA = 54dm²

V = (3dm)³
V = 27dm³

So, the surface area and volume for each figure are as follows:

For the rectangular prism:

  • Surface Area = 52 cm²
  • Volume = 24 cm³

For the cube:

  • Surface Area = 54 dm²
  • Volume = 27 dm³
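
As a quick sanity check on the arithmetic above, the same formulas can be evaluated in a few lines of Python. The helper names here are ours, purely for illustration; they are not part of any Qwen API.

```python
def prism_surface_area(l, w, h):
    """Surface area of a rectangular prism: 2lw + 2lh + 2wh."""
    return 2 * (l * w + l * h + w * h)

def prism_volume(l, w, h):
    """Volume of a rectangular prism: lwh."""
    return l * w * h

print(prism_surface_area(4, 2, 3), prism_volume(4, 2, 3))  # 52 (cm²), 24 (cm³)

# A cube is the special case l = w = h = a, giving 6a² and a³.
print(prism_surface_area(3, 3, 3), prism_volume(3, 3, 3))  # 54 (dm²), 27 (dm³)
```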

4. Text Information Recognition & Processing

Text processing in images has also improved significantly, especially for recognizing Chinese and English text. Qwen-VL-Plus/Max can now efficiently extract information from tables and documents and reformat it to meet custom output requirements. It also has an effective mechanism for identifying and converting dense text, which makes it well suited to information-heavy documents, and it supports images with extreme aspect ratios, ensuring the flexibility to process diverse visual content.

Example: Information Retrieval
User
Please read the text in this image and return the information in the following JSON format (note xxx is placeholder, if the information is not available in the image, put "N/A" instead). {"class": xxx, "DLN": xxx, "DOB": xxx, "Name": xxx, "Address": xxx, "EXP": xxx, "ISS": xxx, "SEX": xxx, "HGT": xxx, "WGT": xxx, "EYES": xxx, "HAIR": xxx, "DONOR": xxx} [image]
Qwen-VL
{"class": "C", "DLN": "I1234568", "DOB": "08/31/1977", "Name": "IMA CARDHOLDER", "Address": "2570 24TH STREET ANYTOWN, CA 95818", "EXP": "08/31/2014", "ISS": "08/31/2009", "SEX": "F", "HGT": "5'-05\"", "WGT": "125 lb", "EYES": "BRN", "HAIR": "BRN", "DONOR": "YES"}
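
When using this kind of structured extraction programmatically, the reply can be parsed and validated downstream. The sketch below assumes the model returns plain JSON text as shown above; the key list and helper function are ours, for illustration only.

```python
import json

# Keys requested in the prompt above; missing fields should be reported as "N/A".
REQUIRED_KEYS = ["class", "DLN", "DOB", "Name", "Address", "EXP", "ISS",
                 "SEX", "HGT", "WGT", "EYES", "HAIR", "DONOR"]

def normalize_reply(reply_text: str) -> dict:
    """Parse the model's JSON reply and fill any missing keys with 'N/A'."""
    record = json.loads(reply_text)
    return {key: record.get(key, "N/A") for key in REQUIRED_KEYS}

# Usage with a truncated example reply:
print(normalize_reply('{"class": "C", "DLN": "I1234568", "DOB": "08/31/1977"}'))
```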

How to Use

Now you can access Qwen-VL-Plus and Qwen-VL-Max through Hugging Face Spaces, the Qwen website, and the DashScope API.
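
For reference, here is a minimal sketch of calling Qwen-VL-Plus through the DashScope Python SDK. The API key and image URL are placeholders, and the response fields shown are typical rather than authoritative; please consult the official DashScope documentation for the exact interface.

```python
# pip install dashscope
import dashscope
from dashscope import MultiModalConversation

dashscope.api_key = "YOUR_DASHSCOPE_API_KEY"  # placeholder: use your own key

messages = [{
    "role": "user",
    "content": [
        {"image": "https://example.com/sample.jpg"},  # placeholder image URL
        {"text": "Describe this image in detail."},
    ],
}]

response = MultiModalConversation.call(model="qwen-vl-plus", messages=messages)

# The reply content is typically a list of segments such as [{"text": "..."}];
# check the DashScope documentation for the exact response schema.
print(response.output.choices[0].message.content)
```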

Summary

Qwen-VL-Plus and Qwen-VL-Max make significant strides in high-resolution recognition, text analysis, and image reasoning. These models now match the performance of GPT-4V and Gemini, outperforming all other open-source and proprietary models on many tasks, such as MMMU, CMMMU, and MathVista. They achieve world-class results in document analysis (DocVQA) and Chinese-language image understanding (MM-Bench-CN).

Our objective is to continually tap into and elevate the potential of Qwen-VL, enabling it to make a difference across a broader range of applications. We can envision Qwen-VL as an assistant equipped with superhuman visual and linguistic comprehension skills that can provide robust support in everyday dialogues as well as complex scenarios like driving environments and programming contexts.

While there is still a long way to go, we are confident that, through continuous optimization and expansion, Qwen-VL will evolve to perceive and understand the world in a way akin to human cognition!