Qwen-VL全新升级！

我们在Qwen语言模型的基础上，结合此前我们提出的多模态多任务训练，以解决多模态模型在泛化能力上的局限性，并于2023年9月开源了多模态模型Qwen-VL。最近，Qwen-VL系列有了重大升级，推出了两个增强版本：Qwen-VL-Plus和Qwen-VL-Max。这两个版本的关键提升包括：

显著提升与图像相关的推理能力；
在识别、提取和分析图像及其内含文本中的细节方面有明显增强；
支持百万像素以上的高清晰度图像以及各种宽高比的图像。

Model Name	模型描述
qwen-vl-plus	^{Qwen的增强型大规模视觉语言模型。该模型针对细节识别能力和文本识别能力进行了显著升级，支持高达数百万像素的超高清分辨率及任意图像输入的宽高比。它在各类视觉任务上都展现出卓越的性能表现。}
qwen-vl-max	^{Qwen的最强大视觉语言模型。相较于增强版本，该模型在视觉推理和指令执行能力上做出了进一步提升，提供了更高级别的视觉感知与认知理解力,在更广泛复杂的任务上都能实现最优性能。}

相比于开源版本的Qwen-VL，这两个模型在多个文本-图像多模态任务中与Gemini Ultra和GPT-4V的表现相当，显著超越了之前开源模型的最佳结果。值得一提的是，Qwen-VL-Max在中文问题回答和中文文本理解任务上超越了OpenAI的GPT-4V以及谷歌的Gemini。下文展示了实验结果及真实用例。

Model	DocVQA ^{^{Document understanding}}	ChartQA ^{^{Chart understanding}}	AI2D ^{^{Science diagrams}}	TextVQA ^{^{Text reading}}	MMMU ^{^{College-level problems}}	MathVista ^{^{Mathematical reasoning}}	MM-Bench-CN ^{^{Natural image QA in Chinese}}
Other Best ^{Open-source LVLM}	81.6% ^{^(CogAgent)}	68.4% ^{^(CogAgent)}	73.7% ^{^{(Fuyu-Medium)}}	76.1% ^{^(CogAgent)}	45.9% ^{^(Yi-VL-34B)}	36.7% ^{^(SPHINX-V2)}	72.4% ^{^{(InternLM-XComposer-VL)}}
Gemini Pro	88.1%	74.1%	73.9%	74.6%	47.9%	45.2%	74.3%
Gemini Ultra	90.9%	80.8% ¹	79.5% ¹	82.3% ¹	59.4% ¹	53.0% ¹	-
GPT-4V	88.4%	78.5%	78.2%	78.0%	56.8%	49.9%	73.9%
Qwen-VL-Plus	91.4%	78.1%	75.9%	78.9%	45.2%	43.3%	68.0%
Qwen-VL-Max	93.1% ¹	79.8% ²	79.3% ²	79.5% ²	51.4% ³	51.0% ²	75.1% ¹

最新版本的Qwen-VL-Plus和Qwen-VL-Max模型不仅在基准测试性能上表现出色，而且在解决实际场景中的问题方面也展现了显著的进步。这些先进的模型能够轻松进行对话互动、识别名人和地标、生成文本，并且它们在描述和解读视觉内容的能力上有了显著提升。

以下是一些实际应用示例：

1. 基础识别能力

最新的Qwen-VL模型在准确描述和识别图像中的复杂信息以及提供详尽背景和回答相关问题方面表现出更高的能力。例如，Qwen-VL不仅能识别常见的物体，还能识别名人和地标。此外，Qwen-VL还能根据视觉输入创作各种语言的诗歌，并能对日常截图进行分析解读。

Example: celebrity recognition Next

User

who are they?

Qwen-VL

These are some of the most famous and highest-paid celebrities in the world. From left to right, top to bottom:

Justin Bieber (Canadian singer)
Lionel Messi (Argentine soccer player)
Cristiano Ronaldo (Portuguese soccer player)
Elon Musk (South African-born American entrepreneur and businessman)
Kylie Jenner (American media personality and socialite)
Taylor Swift (American singer-songwriter)
Robert Downey Jr. (American actor)
Scarlett Johansson (American actress)

It’s worth noting that this is just a small sample of the many talented and successful celebrities out there, and that fame and wealth are not the only measures of success or importance.

User

their chinese name?

Qwen-VL

Here are the Chinese names of the celebrities mentioned:

Justin Bieber: 贾斯汀·比伯
Lionel Messi: 莱昂内尔·梅西
Cristiano Ronaldo: 克里斯蒂亚诺·罗纳尔多
Elon Musk: 埃隆·马斯克
Kylie Jenner: 凯莉·詹娜
Taylor Swift: 泰勒·斯威夫特
Robert Downey Jr.: 小罗伯特·唐尼
Scarlett Johansson: 斯嘉丽·约翰逊

Note that these are just one possible translation of their names into Chinese, and there may be other variations used in different contexts or regions.

Example: landmark recognition Next

User

Where is it in the picture? How did you recognize it?

Qwen-VL

The picture shows a busy street at night in New York City, USA. The Empire State Building can be seen in the background, which is a distinctive landmark of New York City. Therefore, I recognized it as New York City.

Example: poem writing Next

User

Look at pictures and write a poem.

Qwen-VL

A rocky crag by the sea, A place of solitude and mystery. The water calm and blue, Reflecting the sky above, anew.

The mountain stands tall and proud, Its peak reaching towards the clouds. A symbol of strength and resilience, Enduring through time’s relentless persistence.

The world around may change and shift, But this mountain remains steadfast and uplifted. A reminder to us all, To stand firm in our own personal call.

So let us take a moment to pause, And appreciate this natural cause. For in its beauty and grandeur, We find inspiration and renewal, forever.

Example: mobile app screenshot understanding Next

User

Describe the image.

Qwen-VL

The image is a screenshot of an app displaying a map of Beijing, China. The map is centered on the Summer Palace and Yuyuantan Park. The app provides directions for driving from Yuyuantan Park to the Summer Palace. It offers two options: one takes 22 minutes and the other takes 25 minutes. The app also indicates that there are local plate restrictions on the route.

2. 视觉Agent能力

除了在描述和识别方面的基本功能外，Qwen-VL还具有强大的定位和检索特定元素的能力。例如，它能够精准地在图像中标识出黑色的汽车。此外，Qwen-VL还具备根据场景的当前上下文进行判断、推理和决策的能力。

Example: Locate Whatever You Desire Next

User

Locate the red car

Qwen-VL

The red car is located in the bottom right corner of the image.

Example: Contextually Anchored Image Description Generation Next

User

What does the pointed news talk about?

Qwen-VL

The pointed news is about OpenAI’s chatbot ChatGPT having personalized instructions to store your preferences.

Example: Autonomous Driving Next

User

The current vehicle speed is 150km/h. According to the picture, how should the vehicle operate? A. To accelerate B. Slow down C. Maintain the original speed D. Parking Choose an answer and give reasons.

Qwen-VL

B. Slow down

According to the traffic sign in the picture, the maximum speed limit for the current lane is 100 km/h. If the current vehicle speed is 150 km/h, it exceeds the speed limit and should slow down to comply with the traffic regulations.

3. 视觉推理能力

最新的Qwen-VL的一项重大突破在于其基于视觉输入进行复杂推理的能力。这项增强的视觉推理功能超越了简单的内容描述，能够理解和解读诸如流程图、图表和其他符号系统等复杂的表达形式。在问题解决和推理领域，Qwen-VL-Plus/Max不仅擅长数学问题求解和信息组织，还在对图表和图形进行深度诠释与分析方面表现出色。

Example: Math Next

User

read the image and solve it step by step

Qwen-VL

The image shows two geometric figures, a rectangular prism and a cube, with their dimensions labeled. The problem asks to calculate the surface area and volume of these figures.

For the rectangular prism:

Length (l) = 4 cm
Width (w) = 2 cm
Height (h) = 3 cm

Surface Area (SA) of a rectangular prism = 2lw + 2lh + 2wh Volume (V) of a rectangular prism = lwh

Substituting the given values: SA = 2(4cm)(2cm) + 2(4cm)(3cm) + 2(2cm)(3cm) SA = 16cm² + 24cm² + 12cm² SA = 52cm²

V = (4cm)(2cm)(3cm) V = 24cm³

For the cube:

Side length (a) = 3 dm

Surface Area (SA) of a cube = 6a² Volume (V) of a cube = a³

Substituting the given value: SA = 6(3dm)² SA = 6(9dm²) SA = 54dm²

V = (3dm)³ V = 27dm³

So, the surface area and volume for each figure are as follows:

For the rectangular prism:

Surface Area = 52 cm²
Volume = 24 cm³

For the cube:

Surface Area = 54 dm²
Volume = 27 dm³

Example: Code Next

User

convert the process into a python code program

Qwen-VL

The image shows a Scratch programming block, which can be converted into Python code as follows:

import random

# Set up the variables
a = random.randint(1, 50)
b = random.randint(1, 50)
answer = a + b

# Ask the user for their input
response = int(input(f"What is {a} + {b}? "))

# Check if the response matches the answer
if response == answer:
    print("Correct!")
else:
    print("Incorrect!")

This Python code generates two random numbers between 1 and 50, adds them together to get the answer, and then asks the user for their response. It then checks if the response matches the answer, and prints out a message accordingly.

Example: Table Next

User

Spot a data point that stands out in these charts and what that implicates. Then produce a detailed markdown table for all the data shown.

Qwen-VL

The most striking data point in these charts is the high percentage of plastic waste that is mismanaged in the United States, which stands at 4%. This implies that a significant portion of plastic waste in the US is not being properly disposed of or recycled, leading to environmental pollution and potential harm to ecosystems and human health.

Here’s a detailed markdown table for all the data shown:

Region	Landfilled (2019)	Mismanaged (2019)	Incinerated (2019)	Recycled (2019)
World	49%	22%	19%	9%
United States	73%	4%	19%	4%
Europe	44%	6%	38%	12%
Asia (excluding China and India)	39%	34%	19%	8%

Note: The data source for this information is the OECD (Organisation for Economic Co-operation and Development), and the percentages represent the share of plastic waste that is recycled, landfilled, incinerated, and mismanaged in each region in 2019.

Example: Reasoning Next

User

Look at this sequence of three shapes. What shape should come as the fourth shape? Explain your reasoning with detailed descriptions of the first shapes.

Qwen-VL

The sequence of shapes is a triangle, square, and pentagon, followed by a question mark. This pattern suggests that the next shape in the sequence should be a hexagon. The pattern appears to be based on the number of sides in each shape, with each subsequent shape having one more side than the previous one. Therefore, the fourth shape should have six sides, making it a hexagon.

User

翻译成中文

Qwen-VL

看这个三个形状的序列。下一个形状应该是什么形状？解释你的推理过程。形状序列是三角形、正方形和五边形，然后是一个问号。这表明下一个形状应该是六边形。这个序列似乎基于每个形状的边数，每个后续形状比前一个形状多一条边。因此，第四个形状应该有六条边，使它成为一个六边形。

Example: Model Flow Next

User

explain the picture

Qwen-VL

The image presents a three-stage training pipeline for the Qwen-VL series, a visual language model. Each stage is represented by a box labeled “QwenLM” and is accompanied by a smaller box indicating the type of data used in that stage.

In Stage 1, the model is pre-trained using image-text pairs. This stage involves learning from a large dataset of images and their corresponding text descriptions to understand the relationship between visual and linguistic information.

Stage 2 is multi-task pre-training with multi-task and interleaved VL data. Here, the model is trained on multiple tasks simultaneously, such as image captioning, visual question answering, and referring expression comprehension. The use of interleaved data helps the model learn to switch between tasks efficiently.

Finally, in Stage 3, the model undergoes supervised finetuning with chat-interleaved VL data. This stage involves refining the model’s performance on specific tasks using labeled data. The chat-interleaved data provides a more conversational context for the model to learn from, enhancing its ability to understand and respond to natural language queries.

Overall, this three-stage training pipeline aims to develop a robust visual language model capable of understanding and generating human-like responses to visual and linguistic inputs.

4. 文字识别及处理

在识别中文和英文文本方面。Qwen-VL-Plus/Max现在能够高效地从表格和文档中提取信息，并根据自定义输出需求重新格式化这些信息。此外，它还具备一种高效的密集文本识别与转换机制，在处理包含大量信息的文档时表现非常出色。该技术支持各种极端纵横比的图像，确保了对多样视觉内容进行处理的灵活性。

Example: Information Retrieval Next

User

Please read the text in this image and return the information in the following JSON format (note xxx is placeholder, if the information is not available in the image, put “N/A” instead). {“class”: xxx, “DLN”: xxx, “DOB”: xxx, “Name”: xxx, “Address”: xxx, “EXP”: xxx, “ISS”: xxX, “SEX”: xxx, “HGT”: xxx, “WGT”: xxx, “EYES”: xxx, “HAIR”: xxX, “DONOR”: xxx}

Qwen-VL

{“class”: “C”, “DLN”: “I1234568”, “DOB”: “08/31/1977”, “Name”: “IMA CARDHOLDER”, “Address”: “2570 24TH STREET ANYTOWN, CA 95818”, “EXP”: “08/31/2014”, “ISS”: “08/31/2009”, “SEX”: “F”, “HGT”: “5’-05"”, “WGT”: “125 lb”, “EYES”: “BRN”, “HAIR”: “BRN”, “DONOR”: “YES”}

Example: Text Information Recognition & Processing Next

User

规范格式输出图中所有文字

Qwen-VL

图片中的文字如下：

学习提示
郭沫若的诗集《女神》是中国新诗的代表作品，它以崭新的内容和形式，表达了“五四”时期狂飙突进的时代精神。《立在地球边上放号》是《女神》中富有代表性的一首诗。诗人设想站在地球边上全方位俯瞰地球，放声呼唤，纵情高歌，想象着那怒涌的白云、壮丽的北冰洋和狂暴的太平洋，意在赞美摧毁旧世界、创造新生活的雄强之力，体现了“五四”所焕发的自由宏阔、雄奇奔放的气概。阅读时要注意联系“五四”特定的时代氛围来理解这首诗的内涵与形式特征。
闻一多的《红烛》化用“蜡炬”这一古典意象，赋予它新的含义，赞美的红烛以“烧蜡成灰”来点亮世界的奉献精神。注意体会诗人如何借助与红烛的“对话”表达青春的困惑与希望，以及对理想的坚毅追求。洋溢在诗中的幻想和情绪渲染，叹词的回环使用，诗句长短错落形成的节奏美，也是欣赏时关注的。
昌耀的《峨日朵雪峰之侧》内蕴丰富，阅读时要注意发挥想象，体味“雪峰”、“太阳”、“蜘蛛”等意象营造出的凝重而又壮美的氛围，感受其中蕴含的谦卑而强劲的生命力量。
雪莱的《致云雀》意蕴深刻，比喻新颖，要注意体会诗人的情感，理解云雀的象征意义；把握作品节与节之间环环相扣的特点，欣赏诗人想象的奇特和运用修辞手法的大胆、灵活。
学习这组诗作，要多借助朗读，理清诗歌的抒情脉络，把握意象的内涵，感悟诗人抒发的情思，体会象征手法的运用。
第一单元
11

请注意，图片中的文字可能需要进一步的排版和格式化，以便于阅读和理解。如果您需要帮助，请随时告诉我。

Example: Text Information Recognition & Processing Next

User

A brief summary of the blog

Qwen-VL

The blog post announces several new updates and improvements to OpenAI’s platform, including:

GPT-4 Turbo with a 128K context window, which is more capable and supports a 128K context window.
New Assistants API that makes it easier for developers to build their own assistive AI apps and tools.
New multimodal capabilities in the platform, including text-to-speech (TTS).
Improved instruction following and JSON mode.
Reproducible outputs and log probabilities.
Updated GPT-3.5 Turbo with a 16K context window by default.

The post also mentions that these updates will be rolled out to all paying customers starting at 10am PT today.

如何使用

当前，用户可以通过Huggingface Spaces、通义千问官方网站以及Dashscope APIs来使用Qwen-VL-Plus和Qwen-VL-Max模型。

登录千问官网 https://tongyi.aliyun.com/qianwen，并使用图像理解模式来使用最新的Qwen-VL模型。
通过DashScope使用Qwen-VL-Plus和Qwen-VL-Max的API(DashScope).

总结

Qwen-VL-Plus和Qwen-VL-Max在提升高分辨率识别、文本分析和图像推理能力方面取得了重大突破。这些模型现在已经达到了GPT4-V和Gemini的性能水平，在诸如MMMU、CMMMU和MathVista等众多任务上超越了所有其他开源及专有模型。它们在文档分析（DocVQA）和中文相关的图像理解（MM-Bench-CN）等领域取得了领先的表现。

我们的目标是持续挖掘并提升Qwen-VL的潜力，使其能够在更广泛的应用场景中发挥作用。我们可以设想一个具备超人视觉和语言理解能力的Qwen-VL助手，无论是在日常对话中，还是在复杂的驾驶环境和编程场景下，都能提供强有力的支持。

尽管我们还有很长的路要走，但我们坚信通过持续优化和扩展，Qwen-VL将逐步进化到能够像人类认知一样感知和理解世界！

官网：https://tongyi.aliyun.com/qianwen
Github: https://github.com/QwenLM/Qwen-VL
Huggingface: http://huggingface.co/Qwen/Qwen-VL-Chat
ModelScope: https://modelscope.cn/studios/qwen/Qwen-VL-Chat-Demo
API: https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-qianwen-vl-plus-api
Discord: https://discord.gg/CV4E9rpNSD

We have a new blog!View this page at qwen.ai.

Qwen-VL全新升级！

1. 基础识别能力#

2. 视觉Agent能力#

3. 视觉推理能力#

4. 文字识别及处理#

如何使用#

总结#

We have a new blog!
View this page at qwen.ai.

1. 基础识别能力

2. 视觉Agent能力

3. 视觉推理能力

4. 文字识别及处理

如何使用

总结