DEMO GITHUB HUGGING FACE MODELSCOPE API DISCORD

After a year’s relentless efforts, today we are thrilled to release Qwen2-VL! Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model familities. Compared with Qwen-VL, Qwen2-VL has the capabilities of:

  • SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.

  • Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.

  • Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

  • Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

We opensource Qwen2-VL-2B and Qwen2-VL-7B with Apache 2.0 license, and we release the API of Qwen2-VL-72B! The opensource is integrated to Hugging Face Transformers, vLLM, and other third-party frameworks. Hope you enjoy!

Performance

We evaluate our model’s visual capabilities across six key dimensions: complex college-level problem-solving, mathematical abilities, document and table comprehension, multilingual text-image understanding, general scenario question-answering, video comprehension, and agent-based interactions. Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet. Notably, it demonstrates a significant edge in document understanding.

At the 7B scale, we’ve managed to retain support for image, multi-image, and video inputs, delivering competitive performance in a more cost-effective model size. Specifically, our model excels in document understanding tasks such as DocVQA and in multilingual text understanding from images, as assessed by MTVQA, establishing state-of-the-art performance.

Additionally, we’re excited to introduce a smaller 2B model, optimized for potential mobile deployment. Despite its compact size, this model boasts strong performance in image, video, and multilingual comprehension. It particularly shines in video-related tasks, document understanding, and general scenario question-answering when compared to other models of similar scale.

Model Capabilities

1. Enhanced Recognition Capabilities

Qwen2-VL now boasts improved object recognition, extending beyond plants and landmarks to comprehend complex relationships between multiple objects in a scene. We’ve also significantly boosted the model’s ability to recognize handwritten text and multiple languages within images, making it more accessible to users worldwide.

Example: Multi-object Recognition Next
User
Output the color and number of each box. image
Qwen2-VL

The image shows a stack of colorful blocks with numbers on them. Here is the color and number of each block from top to bottom:

  • Top row (single block): Blue with the number 9
  • Second row (two blocks): Light blue with the number 7, and green with the number 8
  • Third row (three blocks): Purple with the number 4, pink with the number 5, and light green with the number 6
  • Bottom row (four blocks): Red with the number 0, orange with the number 1, yellow with the number 2, and light green with the number 3

2. Visual Reasoning: Solving Real-World Problems

In this iteration, we have significantly enhanced Qwen2-VL’s mathematical and coding proficiencies. The model is not only capable of solving problems by analyzing pictures but can also interpret and solve complex mathematical problems through chart analysis. Extremely aspect-ratio-distorted images can also be correctly interpreted. Additionally, we have reinforced the model’s capability to extract information from real-world images and charts and improved its instruction-following skills. This fusion of visual perception and logical reasoning empowers the model to tackle practical issues, bridging the gap between abstract concepts and tangible solutions.

Example: Code Next
User
Solve the algorithmic problem in the web page. image
Qwen2-VL

To solve the “Snake in Matrix” problem, we need to implement a function that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. Here’s a step-by-step approach to solve this problem:

  1. Initialize Variables:

    • Initialize the current position of the snake as 0.
    • Initialize the direction of the snake as “RIGHT” (initially, the snake moves to the right).
  2. Define Movement Functions:

    • Create a function to move the snake based on the current direction and the size of the matrix n.
    • Update the position of the snake accordingly.
  3. Process Commands:

    • Iterate through each command in the list of commands.
    • For each command, update the direction and position of the snake.
  4. Return Final Position:

    • After processing all commands, return the final position of the snake.

Here’s the Python code to implement the solution:

class Solution:
    def finalPositionOfSnake(self, n: int, commands: List[str]) -> int:
        # Initialize variables
        position = 0
        direction = "RIGHT"
        
        # Define movement functions
        def move_right():
            nonlocal position
            position += 1
        
        def move_down():
            nonlocal position
            position += n
        
        def move_left():
            nonlocal position
            position -= 1
        
        def move_up():
            nonlocal position
            position -= n
        
        # Process commands
        for command in commands:
            if command == "RIGHT":
                move_right()
            elif command == "DOWN":
                move_down()
            elif command == "LEFT":
                move_left()
            elif command == "UP":
                move_up()
        
        # Return final position
        return position

This code defines a class Solution with a method finalPositionOfSnake that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. The movement functions (move_right, move_down, move_left, move_up) are used to update the position of the snake based on the current direction.

3. Video Understanding and Live Chat

Beyond static images, Qwen2-VL extends its prowess to video content analysis. It can summarize video content, answer questions related to it, and maintain a continuous flow of conversation in real-time, offering live chat support. This functionality allows it to act as a personal assistant, helping users by providing insights and information drawn directly from video content.

Example: Video Understanding Next
User
Please describe the video in detail.
Qwen2-VL
The video begins with a man speaking to the camera, followed by a group of people sitting in a control room. The camera then cuts to two men floating inside a space station, where they are seen speaking to the camera. The men appear to be astronauts, and they are wearing space suits. The space station is filled with various equipment and machinery, and the camera pans around to show the different areas of the station. The men continue to speak to the camera, and they appear to be discussing their mission and the various tasks they are performing. Overall, the video provides a fascinating glimpse into the world of space exploration and the daily lives of astronauts.
User
What is the color of the clothes the two astronauts wearing?
Qwen2-VL
The two astronauts are wearing blue and black clothes.
Example: Live chat Next
Live chat

4. Visual Agent Capabilities: Function Calling and Visual Interactions.

Qwen2-VL demonstrates strong potential as a visual agent, facilitating interactions similar to human perceptions of the world.

  • The model facilitates Function Calling, enabling it to harness external tools for real-time data retrieval – be it flight statuses, weather forecasts, or package tracking – by deciphering visual cues. This integration of visual interpretation with functional execution elevates its utility, making it a powerful tool for information management and decision-making.
Example: Function Calling Next
Ask about the weather
  • Visual Interactions represent a significant stride towards mimicking human perception. By allowing the model to engage with visual stimuli akin to human senses, we’re pushing the boundaries of AI’s ability to perceive and respond to its environment. This capability paves the way for more intuitive and immersive interactions, where Qwen2-VL acts not just as an observer, but an active participant in our visual experiences.
Example: UI Interactions Next
Operate a Mobile Phone

Certainly, the model is not perfect and has some limitations that I hope you can understand. For example, the model is unable to extract audio from videos, and its knowledge is only up to date as of June 2023. Additionally, the model cannot guarantee complete accuracy when processing complex instructions or scenarios, and it is relatively weak in tasks involving counting, character recognition, and 3D spatial awareness.

Model Architecture

Overall, we’ve continued with the Qwen-VL architecture, which leverages a Vision Transformer (ViT) model and Qwen2 language models. For all these variants, we utilized a ViT with approximately 600M parameters, designed to handle both image and video inputs seamlessly. To further enhance the model’s ability to effectively perceive and comprehend visual information in videos, we introduced several key upgrades:

  • A key architectural improvement in Qwen2-VL is the implementation of Naive Dynamic Resolution support. Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, thereby ensuring consistency between the model input and the inherent information in images. This approach more closely mimics human visual perception, allowing the model to process images of any clarity or size.
  • Another key architectural enhancement is the innovation of Multimodal Rotary Position Embedding (M-ROPE). By deconstructing the original rotary embedding into three parts representing temporal and spatial (height and width) information,M-ROPE enables LLM to concurrently capture and integrate 1D textual, 2D visual, and 3D video positional information.

Developing with Qwen2-VL

To use the largest Qwen2-VL model, Qwen2-VL-72B, you can access it through our official API (sign up the account and obtain the API key through DashScope) temporarily as demonstrated below:

from openai import OpenAI
import os
import base64


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Path to your image
image_path = "dog_and_girl.jpeg"

# Getting the base64 string
base64_image = encode_image(image_path)


def get_response():
    client = OpenAI(
        api_key=os.getenv("DASHSCOPE_API_KEY"),
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )
    completion = client.chat.completions.create(
        model="qwen-vl-max-0809",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this?"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
                        },
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        top_p=0.8,
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in completion:
        print(chunk.model_dump_json())


if __name__ == "__main__":
    get_response()

The 2B and 7B models of the Qwen2-VL series are open-sourced and accessible on Hugging Face and ModelScope. You can explore the model cards for detailed usage instructions, features, and performance metrics. Below we provide an example of the simplest usage with HF Transformers.

Make sure you install transformers from source by pip install git+https://github.com/huggingface/transformers as codes for Qwen2-VL were just merged into the main branch. If you didn’t install it from source, you may encounter the following error:

KeyError: 'qwen2_vl'

We offer a toolkit to help you handle various types of visual input more conveniently. It supports inputs including base64, URLs, and interleaved images and videos. You can install it using the following command:

pip install qwen-vl-utils

Here is a code snippet for demonstration. Specifically, we recommend using flash attention 2 if possible for the sake of acceleration and memory saving.

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

To facilitate seamless integration and use of our latest models, we support a range of tools and frameworks in the open-source ecosystem, including quantization (AutoGPTQ, AutoAWQ), deployment (vLLM), finetuning (Llama-Factory), etc.

License

Both the open-source Qwen2-VL-2B and Qwen2-VL-7B are under Apache 2.0.

What’s Next

We look forward to your feedback and the innovative applications you will build with Qwen2-VL. In the near future, we are going to build stronger vision language models upon our next-version language models and endeavor to integrate more modalities towards an omni model!