
Introduction

After the release of Qwen2.5, we heard the community's demand for processing longer contexts. In recent months, we have made many optimizations to the model's capabilities and inference performance for extremely long contexts. Today, we are proud to introduce the new Qwen2.5-Turbo version, which features:

  1. Longer Context Support: We have extended the model's context length from 128k to 1M tokens, which is approximately 1 million English words or 1.5 million Chinese characters, equivalent to 10 full-length novels, 150 hours of speech transcripts, or 30,000 lines of code. The model achieves 100% accuracy on the 1M-token Passkey Retrieval task and scores 93.1 on the long-text evaluation benchmark RULER, surpassing GPT-4's 91.6 and GLM4-9B-1M's 89.9. Additionally, the model remains highly competitive on short sequences, on par with GPT-4o-mini.
  2. Faster Inference Speed: Using a sparse attention mechanism, we successfully reduced the time to first token for processing a 1M-token context from 4.9 minutes to 68 seconds, a 4.3x speedup (see the quick check after this list).
  3. Lower Cost: The price remains ¥0.3 / 1M tokens. At the same cost, Qwen2.5-Turbo can process 3.6 times as many tokens as GPT-4o-mini.
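
As a quick sanity check on the speedup quoted in item 2, the figure follows directly from the two durations given above; a minimal arithmetic sketch in Python:

before_s = 4.9 * 60   # time to first token before: 4.9 minutes, as quoted above
after_s = 68          # time to first token after: 68 seconds, as quoted above
print(f"TTFT speedup: {before_s / after_s:.1f}x")  # prints ~4.3x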

Now, you can use it through the API service of Alibaba Cloud Model Studio [Chinese], or through HuggingFace Demo or ModelScope Demo.

Demo

We have prepared several demos to showcase the new use cases that Qwen2.5-Turbo enables, covering deep understanding of long novels, repository-level code assistance, and reading multiple papers.

Example: Understanding Long Novels
User

[Upload the three Chinese novels of The Three-Body Problem, which contain 690k tokens.]

Provide a summary of the plots of these three novels in English

Qwen2.5-Turbo

The three novels you mentioned from the “Earth’s Past” series are interconnected and form a trilogy that delves into the complex interactions between Earth and an alien civilization known as the Trisolarans.

First Novel: “Earth’s Past” In the first novel, the narrative focuses on the development of a top-secret project called “Red Coast” (红岸) initiated by the Chinese military. The protagonist, Ye Wenjie, is a scientist who discovers that the Trisolarans have begun listening to the universe. Her actions, particularly her decision to respond to the Trisolarans, drastically alter the course of human history. She sets up a powerful signal to send to the Trisolarans using the sun as an amplifier. This action marks the beginning of a new era where humanity faces a significant existential threat.

Second Novel: “Dark Forest” The second novel, titled “Dark Forest,” delves into the concept of the “Dark Forest Theory.” According to this theory, the universe is a vast battlefield where civilizations must remain silent to avoid detection and subsequent destruction. The Trisolarans and humans engage in a complex strategic game, with the Trisolarans deploying advanced weapons known as “water drops” (水滴). These water drops are capable of destroying entire planets by disrupting their ecosystems. Meanwhile, humans develop their own strategies, including the “Wall-Bearing Plan” (面壁计划), which involves four individuals who devise plans to counter the Trisolaran threat without revealing their intentions to the outside world. The story follows one of these face-bearers, Luo Ji, as he struggles with the ethical implications of his mission.

Third Novel: “Death’s End” In “Death’s End,” Luo Ji becomes the ultimate guardian of the Dark Forest Theory. He holds the key to the survival of both human and Trisolaran civilizations. However, his position as the guardian (执剑人) brings him immense personal struggle and psychological turmoil. As Luo Ji’s time as the guardian nears its end, he must make a critical decision about the future of humanity. Meanwhile, the Trisolarans continue their efforts to colonize and assimilate other worlds. The novel explores the consequences of this decision and the broader impact on the universe. It also introduces a new element: the creation of micro-black holes (黑域) by humans, which could potentially isolate them from the rest of the universe. However, the effectiveness of this strategy is questioned when six of these micro-black holes begin to move and accelerate towards Earth. The trilogy as a whole explores themes of survival, technological advancement, and the ethical dilemmas faced by humanity in the face of extraterrestrial threats. Each novel builds upon the previous one, revealing more about the intricate relationship between the two civilizations and the universe they inhabit.

How to Use the API

The latest Qwen2.5-Turbo, which supports 1M tokens, is used in the same way as the standard Qwen API and is compatible with the OpenAI API. Below is a simple Python call example (note: you need to set the environment variable YOUR_API_KEY to your API key; for more details, please visit the Quick Start of Alibaba Cloud Model Studio [Chinese]):

import os

from openai import OpenAI

# Input a long text file
with open("example.txt", "r", encoding="utf-8") as f:
    text = f.read()
user_input = text + "\n\nSummarize the above text."

# Use the DashScope OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("YOUR_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-turbo-latest",
    messages=[
      {'role': 'system', 'content': 'You are a helpful assistant.'},
      {'role': 'user', 'content': user_input},
    ],
)

print(completion.choices[0].message)
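
For very long inputs, the first tokens may take a while to arrive; since the interface is OpenAI-compatible, you can also stream the response so output can be read as it is generated. A minimal sketch, reusing the client and user_input defined above (this follows the standard OpenAI client streaming behavior; check the Model Studio documentation for details):

stream = client.chat.completions.create(
    model="qwen-turbo-latest",
    messages=[
      {'role': 'system', 'content': 'You are a helpful assistant.'},
      {'role': 'user', 'content': user_input},
    ],
    stream=True,  # receive the response incrementally
)
for chunk in stream:
    # Each chunk carries an incremental piece of the assistant's message
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)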

Model Performance

In this section, we evaluate Qwen2.5-Turbo on a range of benchmarks and report its improvements in inference speed.

Passkey Retrieval

We first conducted experiments on the 1M-token Passkey Retrieval task. The results show that Qwen2.5-Turbo can perfectly retrieve all hidden numbers from 1M tokens of irrelevant text, demonstrating the model's ability to capture fine-grained details in ultra-long contexts.
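
For readers unfamiliar with the setup, a passkey-retrieval prompt is typically built by hiding a short random number inside a long stretch of filler text and asking the model to repeat it. Below is a toy, hypothetical prompt builder; the filler sentence and phrasing are illustrative and not the exact benchmark construction used here:

import random

def build_passkey_prompt(num_filler: int = 10000, seed: int = 0):
    # Bury a random passkey inside many repetitions of irrelevant filler text
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    filler = "The grass is green. The sky is blue. The sun shines brightly. "
    parts = [filler] * num_filler
    parts.insert(rng.randint(0, num_filler), f"The passkey is {passkey}. Remember it. ")
    prompt = "".join(parts) + "\nWhat is the passkey mentioned in the text above?"
    return prompt, passkey

prompt, passkey = build_passkey_prompt()
print(len(prompt), "characters; expected answer:", passkey)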

More Complex Long Text Tasks

We select several datasets of long text understanding to test the model, including:

  • RULER: An extended benchmark based on Needle in a Haystack; tasks include finding multiple “needles” in irrelevant contexts, answering multiple questions, or finding the most or least frequent words in the context. The maximum context length is 128K.
  • LV-Eval: A benchmark test requiring simultaneous understanding of numerous evidence fragments. We adjust the evaluation metrics in the original version of LV-Eval to avoid false negatives caused by overly strict matching rules. The maximum context length is 256K.
  • LongBench-Chat: A dataset evaluating human preference alignment on long-context tasks. The maximum context length is 100K.

The results show that Qwen2.5-Turbo has advantages across a variety of long-context tasks:

  • In the RULER benchmark, Qwen2.5-Turbo scores 93.1, surpassing GPT-4o-mini and even GPT-4, demonstrating its excellent ability to handle long-text tasks.
  • On further long-context understanding tasks such as LV-Eval and LongBench-Chat, Qwen2.5-Turbo surpasses GPT-4o-mini in most dimensions and can process tasks with contexts of over 128K tokens.

Short Text Tasks

In addition to the performance improvements on long-context tasks, we also care about the model's performance on short-context tasks. Existing context length extension methods often lead to significant performance degradation on short texts, so we paid special attention to this issue when building Qwen2.5-Turbo, ensuring that extending the context length has almost no impact on short-text capabilities.

Results on short-text benchmarks show that Qwen2.5-Turbo significantly surpasses previous open-source 1M-token context models in most tasks; compared to GPT-4o-mini and Qwen2.5-14B-Instruct, Qwen2.5-Turbo achieves similar performance on short-text tasks while supporting 8 times their context length.

Inference Speed

We tested the TTFT (time to first token) for inputs of different lengths. For 1M-token sequences, we use a sparse attention mechanism to compress the attention computation by about 12.5 times, achieving a speedup of 3.2x to 4.3x across different hardware configurations.
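
A rough way to reconcile the ~12.5x attention compression with the ~3.2-4.3x end-to-end speedup is an Amdahl-style estimate: only the attention portion of prefill gets cheaper, while the rest of the computation is unchanged. The attention fraction below is an assumed, illustrative value, not a figure reported in this post:

# Amdahl-style estimate of end-to-end TTFT speedup at 1M tokens
attention_fraction = 0.85      # assumed share of dense prefill time spent in attention
attention_compression = 12.5   # compression factor quoted above
speedup = 1 / ((1 - attention_fraction) + attention_fraction / attention_compression)
print(f"estimated end-to-end speedup: {speedup:.1f}x")  # ~4.6x under these assumptions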

What’s Next?

While we are pleased to finally extend the context of Qwen2.5-Turbo to 1M tokens, we also recognize that the current model does not always perform satisfactorily on long-sequence tasks in real applications. There are many unresolved challenges, such as the model's less stable performance on long-sequence tasks and the high inference cost that makes it difficult to use larger models. Going forward, we will actively explore further alignment with human preferences on long sequences, optimize inference efficiency to reduce computation time, and attempt to launch larger and stronger long-context models. We look forward to sharing new progress in long-context modeling with you soon, so stay tuned!