Alongside the rapid development of our large language model Qwen, we leveraged Qwen's capabilities and unified multimodal pretraining to address the generalization limitations of multimodal models, and we open-sourced the multimodal model Qwen-VL in September 2023. Recently, the Qwen-VL series has undergone a significant upgrade with the launch of two enhanced versions, Qwen-VL-Plus and Qwen-VL-Max. The key technical advancements in these versions include:

  • A substantial boost in image-related reasoning capabilities;
  • Considerable enhancement in recognizing, extracting, and analyzing details within images and texts contained therein;
  • Support for high-definition images with resolutions above one million pixels and images of various aspect ratios.
  • qwen-vl-plus: Qwen's Enhanced Large Visual Language Model. It significantly upgrades detail recognition and text recognition, and supports image inputs with resolutions of up to millions of pixels and arbitrary aspect ratios. It delivers strong performance across a broad range of visual tasks.
  • qwen-vl-max: Qwen's Most Capable Large Visual Language Model. Compared to the enhanced version, it further improves visual reasoning and instruction following, offering a higher level of visual perception and cognitive understanding. It delivers optimal performance on an even broader range of complex tasks.

Compared to the open-source version of Qwen-VL, these two models perform on par with Gemini Ultra and GPT-4V in multiple text-image multimodal tasks, significantly surpassing the previous best results from open-source models.

Notably, Qwen-VL-Max outperforms both GPT-4V from OpenAI and Gemini from Google on tasks involving Chinese question answering and Chinese text comprehension. This breakthrough underscores the model's advanced capabilities and its potential to set new standards in the field of multimodal AI research and application.

| Model | DocVQA (Document understanding) | ChartQA (Chart understanding) | AI2D (Science diagrams) | TextVQA (Text reading) | MMMU (College-level problems) | MathVista (Mathematical reasoning) | MM-Bench-CN (Natural image QA in Chinese) |
|---|---|---|---|---|---|---|---|
| Other Best Open-source LVLM | 81.6% (CogAgent) | 68.4% (CogAgent) | 73.7% (Fuyu-Medium) | 76.1% (CogAgent) | 45.9% (Yi-VL-34B) | 36.7% (SPHINX-V2) | 72.4% (InternLM-XComposer-VL) |
| Gemini Pro | 88.1% | 74.1% | 73.9% | 74.6% | 47.9% | 45.2% | 74.3% |
| Gemini Ultra | 90.9% | 80.8% (1st) | 79.5% (1st) | 82.3% (1st) | 59.4% (1st) | 53.0% (1st) | - |
| GPT-4V | 88.4% | 78.5% | 78.2% | 78.0% | 56.8% | 49.9% | 73.9% |
| Qwen-VL-Plus | 91.4% | 78.1% | 75.9% | 78.9% | 45.2% | 43.3% | 68.0% |
| Qwen-VL-Max | 93.1% (1st) | 79.8% (2nd) | 79.3% (2nd) | 79.5% (2nd) | 51.4% (3rd) | 50.0% (2nd) | 75.1% (1st) |

(Ranks in parentheses indicate a model's position on that benchmark among the models listed.)

The new Qwen-VL-Plus and Qwen-VL-Max models not only exhibit exceptional benchmark performance but also demonstrate markedly improved problem-solving in real-world scenarios. These advanced models engage in dialogue with ease, identify celebrities and landmarks, generate text, and, notably, describe and interpret visual content far more accurately than before.

Here we present some practical examples:

1. Basic Recognition Capabilities

The latest Qwen-VL models are now more adept at accurately describing and identifying complex information within images, as well as at providing detailed background information and answering related questions. For instance, Qwen-VL can recognize not only common objects but also celebrities and landmarks, write poetry in various languages inspired by visuals, and analyze everyday screenshots.

Example: celebrity recognition
User
who are they? [image]
Qwen-VL

These are some of the most famous and highest-paid celebrities in the world. From left to right, top to bottom:

  • Justin Bieber (Canadian singer)
  • Lionel Messi (Argentine soccer player)
  • Cristiano Ronaldo (Portuguese soccer player)
  • Elon Musk (South African-born American entrepreneur and businessman)
  • Kylie Jenner (American media personality and socialite)
  • Taylor Swift (American singer-songwriter)
  • Robert Downey Jr. (American actor)
  • Scarlett Johansson (American actress)

It’s worth noting that this is just a small sample of the many talented and successful celebrities out there, and that fame and wealth are not the only measures of success or importance.

User
their chinese name?
Qwen-VL

Here are the Chinese names of the celebrities mentioned:

  • Justin Bieber: 贾斯汀·比伯
  • Lionel Messi: 莱昂内尔·梅西
  • Cristiano Ronaldo: 克里斯蒂亚诺·罗纳尔多
  • Elon Musk: 埃隆·马斯克
  • Kylie Jenner: 凯莉·詹娜
  • Taylor Swift: 泰勒·斯威夫特
  • Robert Downey Jr.: 小罗伯特·唐尼
  • Scarlett Johansson: 斯嘉丽·约翰逊

Note that these are just one possible rendering of each name in Chinese; other variations may be used in different contexts or regions.

2. Visual Agent Capability: The Essential Role of Localization

Beyond its fundamental capabilities in description and recognition, Qwen-VL also has impressive abilities to pinpoint and query specific elements. For instance, it can accurately highlight the black cars within an image. Moreover, Qwen-VL is also equipped to make judgments, deductions, and decisions based on the prevailing context of a scene.

Example: Locate Whatever You Desire
User
Locate the red car [image]
Qwen-VL
The red car is located in the bottom right corner of the image. [image]
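
To make a localization result like this actionable in an application, the returned coordinates can be drawn back onto the image. Below is a minimal sketch under an explicit assumption: the reply embeds coordinates in the <box>(x1,y1),(x2,y2)</box> convention used by the open-source Qwen-VL, normalized to a 0-1000 grid. The hosted Plus/Max API may expose localization results differently, so treat the parsing step purely as an illustration.

```python
import re
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, reply: str, out_path: str = "located.png") -> None:
    """Draw every <box>(x1,y1),(x2,y2)</box> found in `reply` onto the image."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size
    for x1, y1, x2, y2 in re.findall(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>", reply):
        # Map the assumed 0-1000 normalized grid to pixel coordinates.
        box = (int(x1) * width / 1000, int(y1) * height / 1000,
               int(x2) * width / 1000, int(y2) * height / 1000)
        draw.rectangle(box, outline="red", width=3)
    image.save(out_path)

# Hypothetical usage with a made-up reply string:
# draw_boxes("street.jpg", "<ref>the red car</ref><box>(520,640),(870,940)</box>")
```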

3. Visual Reasoning Capability: To Solve Real Problems

One of the most notable advancements in the latest Qwen-VL is its capacity for complex reasoning based on visual inputs. This enhanced visual reasoning capability goes well beyond mere content description, extending to the comprehension and interpretation of intricate representations such as flowcharts, diagrams, and other symbolic systems. Qwen-VL-Plus/Max excels not only at mathematical problem solving and information organization but also at deeper interpretation and analysis of charts and graphs.

Example: Math
User
read the image and solve it step by step [image]
Qwen-VL

The image shows two geometric figures, a rectangular prism and a cube, with their dimensions labeled. The problem asks to calculate the surface area and volume of these figures.

  1. For the rectangular prism:
  • Length (l) = 4 cm
  • Width (w) = 2 cm
  • Height (h) = 3 cm

Surface Area (SA) of a rectangular prism = 2lw + 2lh + 2wh
Volume (V) of a rectangular prism = lwh

Substituting the given values:
SA = 2(4cm)(2cm) + 2(4cm)(3cm) + 2(2cm)(3cm)
SA = 16cm² + 24cm² + 12cm²
SA = 52cm²

V = (4cm)(2cm)(3cm)
V = 24cm³

  2. For the cube:
  • Side length (a) = 3 dm

Surface Area (SA) of a cube = 6a²
Volume (V) of a cube = a³

Substituting the given value:
SA = 6(3dm)²
SA = 6(9dm²)
SA = 54dm²

V = (3dm)³
V = 27dm³

So, the surface area and volume for each figure are as follows:

For the rectangular prism:

  • Surface Area = 52 cm²
  • Volume = 24 cm³

For the cube:

  • Surface Area = 54 dm²
  • Volume = 27 dm³
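
As a quick sanity check on the arithmetic above, the same formulas can be evaluated in a few lines of Python. The helper names here are ours, purely for illustration; they are not part of any Qwen API.

```python
def prism_surface_area(l, w, h):
    """Surface area of a rectangular prism: 2lw + 2lh + 2wh."""
    return 2 * (l * w + l * h + w * h)

def prism_volume(l, w, h):
    """Volume of a rectangular prism: lwh."""
    return l * w * h

print(prism_surface_area(4, 2, 3), prism_volume(4, 2, 3))  # 52 (cm²), 24 (cm³)

# A cube is the special case l = w = h = a, giving 6a² and a³.
print(prism_surface_area(3, 3, 3), prism_volume(3, 3, 3))  # 54 (dm²), 27 (dm³)
```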

4. Text Information Recognition & Processing

Text processing in images has also improved significantly, especially for recognizing Chinese and English text. Qwen-VL-Plus/Max can now efficiently extract information from tables and documents and reformat it to meet custom output requirements. It also has an effective mechanism for identifying and converting dense text, which makes it well suited to information-heavy documents, and it supports images with extreme aspect ratios, ensuring the flexibility to process diverse visual content.

Example: Information Retrieval
User
Please read the text in this image and return the information in the following JSON format (note xxx is placeholder, if the information is not available in the image, put "N/A" instead). {"class": xxx, "DLN": xxx, "DOB": xxx, "Name": xxx, "Address": xxx, "EXP": xxx, "ISS": xxx, "SEX": xxx, "HGT": xxx, "WGT": xxx, "EYES": xxx, "HAIR": xxx, "DONOR": xxx} [image]
Qwen-VL
{"class": "C", "DLN": "I1234568", "DOB": "08/31/1977", "Name": "IMA CARDHOLDER", "Address": "2570 24TH STREET ANYTOWN, CA 95818", "EXP": "08/31/2014", "ISS": "08/31/2009", "SEX": "F", "HGT": "5'-05\"", "WGT": "125 lb", "EYES": "BRN", "HAIR": "BRN", "DONOR": "YES"}
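
When using this kind of structured extraction programmatically, the reply can be parsed and validated downstream. The sketch below assumes the model returns plain JSON text as shown above; the key list and helper function are ours, for illustration only.

```python
import json

# Keys requested in the prompt above; missing fields should be reported as "N/A".
REQUIRED_KEYS = ["class", "DLN", "DOB", "Name", "Address", "EXP", "ISS",
                 "SEX", "HGT", "WGT", "EYES", "HAIR", "DONOR"]

def normalize_reply(reply_text: str) -> dict:
    """Parse the model's JSON reply and fill any missing keys with 'N/A'."""
    record = json.loads(reply_text)
    return {key: record.get(key, "N/A") for key in REQUIRED_KEYS}

# Usage with a truncated example reply:
print(normalize_reply('{"class": "C", "DLN": "I1234568", "DOB": "08/31/1977"}'))
```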

How to Use

Now you can access Qwen-VL-Plus and Qwen-VL-Max through Hugging Face Spaces, the Qwen website, and the DashScope API.
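
For reference, here is a minimal sketch of calling Qwen-VL-Plus through the DashScope Python SDK. The API key and image URL are placeholders, and the response fields shown are typical rather than authoritative; please consult the official DashScope documentation for the exact interface.

```python
# pip install dashscope
import dashscope
from dashscope import MultiModalConversation

dashscope.api_key = "YOUR_DASHSCOPE_API_KEY"  # placeholder: use your own key

messages = [{
    "role": "user",
    "content": [
        {"image": "https://example.com/sample.jpg"},  # placeholder image URL
        {"text": "Describe this image in detail."},
    ],
}]

response = MultiModalConversation.call(model="qwen-vl-plus", messages=messages)

# The reply content is typically a list of segments such as [{"text": "..."}];
# check the DashScope documentation for the exact response schema.
print(response.output.choices[0].message.content)
```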

Summary

Qwen-VL-Plus and Qwen-VL-Max make significant strides in high-resolution recognition, text analysis, and image reasoning. These models now match the performance of GPT-4V and Gemini, outperforming all other open-source and proprietary models on many tasks, such as MMMU, CMMMU, and MathVista. They achieve world-class results in document analysis (DocVQA) and Chinese-language image understanding (MM-Bench-CN).

Our objective is to continually tap into and elevate the potential of Qwen-VL, enabling it to make a difference across a broader range of applications. We can envision Qwen-VL as an assistant equipped with superhuman visual and linguistic comprehension skills that can provide robust support in everyday dialogues as well as complex scenarios like driving environments and programming contexts.

While there is still a long way to go, we are confident that, through continuous optimization and expansion, Qwen-VL will evolve to perceive and understand the world in a way akin to human cognition!