
Introduction

The evolution of multimodal large models is continually pushing the boundaries of what we believe technology can achieve. From the initial Qwen-VL to the latest Qwen2.5-VL, we have made steady progress in enhancing the model’s ability to understand image content. Today, we are excited to introduce Qwen VLo, a unified multimodal understanding and generation model. This newly upgraded model not only “understands” the world but also generates high-quality recreations based on that understanding, truly bridging the gap between perception and creation. Note that this is a preview version, accessible through Qwen Chat. You can send a prompt like “Generate a picture of a cute cat” to create an image, or upload an image of a cat and ask “Add a cap on the cat’s head” to modify it. The image generation process is shown below.
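The two interaction patterns above (text-to-image, and instruction-based editing of an uploaded image) can be sketched as request payloads. This is a hypothetical sketch using the common OpenAI-style multimodal message schema; Qwen VLo is currently available only in Qwen Chat, so the `qwen-vlo` model identifier here is an illustrative assumption, not an announced API.

```python
# Hypothetical request payloads for the two interaction patterns:
# pure text-to-image, and instruction-based editing of an uploaded image.
# The message schema follows the common OpenAI-style multimodal chat
# format; Qwen VLo itself is only available through Qwen Chat for now,
# so the model name below is an illustrative assumption.

def text_to_image_request(prompt: str) -> dict:
    """Payload for a prompt like 'Generate a picture of a cute cat'."""
    return {
        "model": "qwen-vlo",  # assumed identifier, not an announced API name
        "messages": [{"role": "user", "content": prompt}],
    }

def edit_image_request(image_url: str, instruction: str) -> dict:
    """Payload pairing an uploaded image with an editing instruction."""
    return {
        "model": "qwen-vlo",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": instruction},
            ],
        }],
    }

req = edit_image_request("https://example.com/cat.png",
                         "Add a cap on the cat's head")
print(req["messages"][0]["content"][1]["text"])
```

The same editing payload shape covers every demo below: one image (or several) plus a natural-language instruction in a single user turn.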

The Creative Process: Turn Your Imagination Into Reality

As demonstrated in the video showcasing the generative process, Qwen VLo employs a progressive generation method, gradually constructing the entire image from left to right and top to bottom. During this process, the model continuously refines and optimizes its predictions to ensure that the final result is coherent and harmonious. This generative mechanism not only enhances visual quality but also provides users with a more flexible and controllable creative experience.


From Understanding to Creation: Enhanced Multimodal Generation Capabilities

Qwen VLo has undergone a comprehensive upgrade in both its original multimodal understanding and generation capabilities. It significantly deepens its comprehension of image content and achieves more accurate and consistent generation results. Below are the core highlights of Qwen VLo:

  1. More Precise Content Understanding and Recreation

    Previous multimodal models often struggled with semantic inconsistencies during the generation process, such as misinterpreting a car as another object or failing to retain key structural features of the original image. Qwen VLo, equipped with enhanced detail-capturing abilities, maintains a high level of semantic consistency throughout the generation process. For instance, when a user inputs a photo of a car and requests a “color change,” Qwen VLo can accurately identify the car model, preserve its original structure, and naturally transform its color style. The generated result meets expectations while maintaining realism.

  2. Support for Open-Ended Instruction-Based Editing

    Users can provide creative instructions in natural language, such as “change this painting to a Van Gogh style,” “make this photo look like it’s from the 19th century,” or “add a sunny sky to this image.” Qwen VLo can flexibly respond to these open-ended commands and produce results that align with user expectations. Whether it’s artistic style transfer, scene reconstruction, or detailed touch-ups, the model handles them all with ease. Even traditional visual perception tasks, such as predicting depth maps, segmentation maps, detection maps, and edge information, can be accomplished through simple editing instructions. Furthermore, Qwen VLo can also seamlessly handle more complex instructions — such as modifying objects, editing text, and changing backgrounds — all within a single command.

  3. Multilingual Instruction Support

    Qwen VLo supports multiple languages, including Chinese and English, breaking down language barriers and providing a unified, convenient interaction experience for global users. Regardless of the language you use, simply describe your needs, and the model will quickly understand and deliver the desired output.


Demo Cases

Qwen VLo acts like a human artist, using its understanding to turn imagination into reality. Below are some examples for reference.

Qwen VLo is capable of directly generating images and modifying them by replacing backgrounds, adding subjects, performing style transfers, and even executing extensive modifications based on open-ended instructions, as well as handling detection and segmentation tasks.

A cute Shiba Inu
User

生成一个可爱的柴犬

Translation: Generate a cute Shiba Inu

Qwen-VLo
User

背景改成草原

Translation: Change the background to a grassland

Qwen-VLo
User

给它带上红色帽子和黑色透明墨镜,帽子上写着“QwenVLo”

Translation: Put a red hat and black transparent sunglasses on it, with ‘QwenVLo’ written on the hat

Qwen-VLo
User

变成吉卜力风格

Translation: Switch to Ghibli style

Qwen-VLo
User

变成3d Q版风格

Translation: Switch to 3D Q-version style

Qwen-VLo
User

把它放到水晶球里

Translation: Place it inside a crystal ball

Qwen-VLo
User

桌面上摆着这个水晶球,生成以一个人的第一视角在公园的圆形咖啡桌上在笔记本上画画

Translation: Place this crystal ball on a desk and generate an image from a first-person perspective of someone drawing on a notebook placed on a round coffee table in a park

Qwen-VLo
User

用蓝色的蒙版检测框框出图中的笔

Translation: Use a blue mask to detect and frame the pen in the picture

Qwen-VLo
User

用粉色的mask分割出图中的狗狗边缘

Translation: Use a pink mask to segment the edge of the dog in the picture

Qwen-VLo

Qwen VLo can reinterpret and recreate images based on its understanding, allowing greater flexibility in style transformations, such as turning cartoons into realistic photos or figures into balloons, among other creative outputs.

Style Conversion
User
变成真实照片
Translation: Turn into a real photo
Qwen-VLo
User
背景换成艾弗尔铁塔
Translation: Change the background to the Eiffel Tower
Qwen-VLo
User
变成气球飘到空中
Translation: Turn into a balloon floating in the air
Qwen-VLo
User
把西瓜换成榴莲
Translation: Replace the watermelon with durian
Qwen-VLo

The model’s advanced capabilities in image and instruction comprehension enable it to better interpret complex commands, incorporating multiple operations and modifications in a single instruction. This allows for the completion of multi-step tasks in one go, such as creating posters or combining objects.

Complex Image Prompt
User

Generate a photo in which a man wearing a black hat is reading a newspaper on the subway, next to him is a beautiful young woman wearing red sunglasses, and there’s also a Husky dog. The subway window shows the Statue of Liberty, and the subway station sign displays ‘Qwen VLo’.

(Chinese: 生成照片,图中男人带着黑帽子在地铁上看报纸,旁边是一个美丽的带红色墨镜年轻女性,还有一只哈士奇,地铁的窗外是自由女神像,地铁的站牌显示“Qwen VLo”)

Qwen-VLo

In addition to image editing and re-creation, Qwen VLo can also perform annotations on existing information, such as detection, segmentation, edge detection, and more.

Perception and Localization
User
生成摆满水果的桌面
Translation: Generate a table full of fruits
Qwen-VLo
User
预测边缘检测图
Translation: Predict an edge detection map
Qwen-VLo
User
用红色mask分割图中香蕉的边缘
Translation: Use a red mask to segment the edges of the banana in the image.
Qwen-VLo

Qwen VLo supports understanding and generation with multiple input images. (Multi-image input has not yet been officially launched; stay tuned.)

Multiple image input
User

把这些洗浴用品,放到这个红色的篮子里面

Translation: Put these bathing products into this red basket.

Qwen-VLo

Besides tasks that involve both text and image inputs, Qwen VLo also supports direct text-to-image generation, including general images as well as bilingual (Chinese and English) posters.

Text to Image
User
Epic anime artwork of a wizard atop a mountain at night casting a cosmic spell into the dark sky that says “Qwen VLo!” made out of colorful energy
Qwen-VLo

Qwen VLo supports image generation with dynamic aspect ratios and can easily handle elongated formats as extreme as 4:1 or 1:3. (The feature for generating images with extreme aspect ratios has not yet been officially launched; stay tuned for its release.)

Text to Image
User

动漫插画;水彩手绘;前景是草坡,草坡上有个人在奔跑,动态感,然后是厚重的白云;蓝色背景;颜色层次多渐变;过渡自然和谐

Translation: Anime illustration; watercolor hand-drawn style; the foreground features a grassy hill where a person is running, conveying a sense of motion. Behind are thick, billowing clouds; blue background; multi-layered color gradients with smooth and harmonious transitions.

Qwen-VLo

As a unified understanding and generative model, Qwen VLo can also reanalyze and understand the content it generates. For example, it can identify the breeds of dogs and cats within the generated images.

Generation and Understanding
User
Generate a puppy and a kitten.
Qwen-VLo
User
What breed of cat and dog is this?
Qwen-VLo

The image shows a puppy and a kitten interacting in a cozy indoor setting. Let’s identify the breeds:

Dog:

  • The puppy appears to be a Beagle or a Beagle mix. Key characteristics include:
    • Tri-color coat (brown, white, and black).
    • Floppy ears.
    • Compact body structure.
    • Expressive eyes and a friendly demeanor.

Cat:

  • The kitten appears to be a Tabby cat. Key characteristics include:
    • Striped or spotted fur pattern, which is typical of tabby cats.
    • A mix of brown, black, and white fur, which is common in tabby patterns.

Both animals are young and appear to be very playful and curious, which is typical behavior for puppies and kittens. The interaction between them suggests a friendly and affectionate relationship, which is often seen when different species bond early in life.

How to Use

Qwen VLo is trained with dynamic resolutions and supports dynamic-resolution generation. Both input and output allow images of arbitrary resolutions and aspect ratios, so users are no longer constrained by fixed formats and can generate images tailored to different scenarios, whether posters, illustrations, web banners, or social media covers.
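One way to think about dynamic aspect ratios is choosing output dimensions that match a requested ratio under a roughly fixed pixel budget. The helper below sketches that arithmetic; the pixel budget (1024²) and patch-alignment constant (32) are made-up illustrative values, not Qwen VLo parameters.

```python
import math

# Sketch of dynamic-aspect-ratio arithmetic: pick a width and height that
# (a) match the requested aspect ratio, (b) stay near a fixed pixel
# budget, and (c) round to a patch-aligned multiple. The budget (1024^2)
# and alignment (32) are illustrative values, not Qwen VLo's actual ones.

def dims_for_ratio(ratio_w, ratio_h, pixel_budget=1024 * 1024, align=32):
    # Scale factor so that (ratio_w * s) * (ratio_h * s) ~= pixel_budget.
    scale = math.sqrt(pixel_budget / (ratio_w * ratio_h))
    width = max(align, round(ratio_w * scale / align) * align)
    height = max(align, round(ratio_h * scale / align) * align)
    return width, height

print(dims_for_ratio(1, 1))   # square
print(dims_for_ratio(4, 1))   # elongated banner, 4:1
print(dims_for_ratio(1, 3))   # tall poster, 1:3
```

Under this scheme an extreme 4:1 banner and a 1:3 poster use roughly the same pixel count as a square image, which is why a fixed-budget model can serve arbitrary formats.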

Additionally, Qwen VLo introduces an innovative generative mechanism: a progressive left-to-right, top-to-bottom generation process.

This mechanism not only improves generation efficiency but is particularly suited for tasks requiring fine control, such as generating long paragraphs of text. For example, when designing advertisements or comic panels with extensive text, Qwen VLo generates content progressively, allowing users to observe and adjust the process in real-time for optimal creative results.

Limitations

Qwen VLo is still in the preview stage and has many shortcomings. During generation, there may be issues such as inaccuracies, inconsistencies with the original image, failure to follow instructions, and unstable recognition and understanding of the content of generated images. We appreciate your understanding and will continue to iterate to improve the stability and robustness of the model.


Next Steps: Express Ideas Through Images, Foster Understanding Through Generation

As multimodal large models increasingly gain the ability to handle bidirectional text and visual inputs and outputs, we are opening up new avenues for expression and interaction. In the future, models will not only answer questions with text but also convey ideas and meanings through images. For example, generating diagrams, adding auxiliary lines, or annotating key areas will provide users with more diverse communication tools.

Moreover, multimodal models with generative capabilities offer new ways to supervise and refine their understanding. By generating intermediate results like segmentation maps or detection maps, the model can verify its own comprehension and further improve its performance. This is a direction we will continue to explore and develop in the future.