4 months after our first release of Qwen-7B, which marked the start of our opensource journey with large language models (LLMs), we now provide an introduction to the Qwen series to give you a whole picture of our work as well as our objectives. Below are important links to our opensource projects and community.

PAPER GITHUB HUGGING FACE MODELSCOPE DISCORD

Additionally, we have WeChat groups for chatting, and we invite you to join them through the link provided in our GitHub README.

Overview

In general, Qwen is more than a language model; it is a project towards AGI which, for now, consists of LLMs and LMMs. The following figure shows the main components of Qwen:

where Qwen refers to the base language model, while Qwen-Chat refers to the chat model trained with techniques like SFT and RLHF. We also have models specialized for domains and tasks, such as Code-Qwen for coding and Math-Qwen for mathematics. An LLM can be extended to multimodality through modality alignment, and thus we have the vision-language model Qwen-VL as well as the audio-language model Qwen-Audio. Note that this blog mainly serves to introduce the language models. As to the large multimodal models (LMM), such as Qwen-VL and Qwen-Audio, please refer to their respective blogs.

Base Model: A Good Starting Point for Alignment

The general procedure of building an assistant model includes pretraining and post-training, where the latter mostly consists of SFT and RLHF. As to pretraining, similar to previous LLMs such as GPT-3 and Llama, Qwen is a Transformer-based language model pretrained on the task of next token prediction. For simplicity and stability, we did not introduce additional training tasks but instead focused on scaling model size and data. For now, we have developed 5 models of different sizes, 4 of which are opensourced. Specifically, we have released Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B.
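
For readers less familiar with this objective, below is a minimal, illustrative sketch of the next-token-prediction loss in PyTorch; it is not our actual training code, and the tensor shapes are assumptions.

```python
# Illustrative sketch of the next-token-prediction objective (not the actual
# Qwen training code). Assumes a causal LM that returns logits of shape
# [batch, seq_len, vocab_size] for input_ids of shape [batch, seq_len].
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # Shift so that the prediction at position t is scored against token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```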

| Model | Release Date | Max Length | System Prompt Enhancement | # of Pretrained Tokens | Minimum GPU Memory Usage of Finetuning (Q-LoRA) | Minimum GPU Memory Usage of Generating 2048 Tokens (Int4) | Tool Usage |
|---|---|---|---|---|---|---|---|
| Qwen-1.8B | 23.11.30 | 32K | ✅ | 2.2T | 5.8GB | 2.9GB | ✅ |
| Qwen-7B | 23.08.03 | 32K | ❌ | 2.4T | 11.5GB | 8.2GB | ✅ |
| Qwen-14B | 23.09.25 | 8K | ❌ | 3.0T | 18.7GB | 13.0GB | ✅ |
| Qwen-72B | 23.11.30 | 32K | ✅ | 3.0T | 61.4GB | 48.9GB | ✅ |
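
As a quick reference for the Int4 column above, the quantized chat models can be loaded through Hugging Face Transformers roughly as follows. This is a minimal sketch; the exact requirements (e.g., the GPTQ dependencies) are documented in our GitHub README.

```python
# Minimal sketch of running an Int4 quantized chat model with Hugging Face
# Transformers (requires trust_remote_code and the GPTQ dependencies listed
# in our GitHub README, e.g., auto-gptq and optimum).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat-Int4"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
).eval()

# model.chat() is provided by the remote code shipped with the checkpoint.
response, history = model.chat(
    tokenizer,
    "Give me a short introduction to large language models.",
    history=None,
)
print(response)
```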

The models are sufficiently trained on 2-3 trillion tokens. The pretraining data are multilingual, and thus Qwen is essentially a multilingual model rather than a monolingual or bilingual one. Note that, due to the limitations of our pretraining data, the model is strongest in English and Chinese, while still capable in other languages such as Spanish, French, and Japanese. To support its multilingual capabilities, we use a tokenizer that encodes information from different languages efficiently. In comparison with other tokenizers, ours demonstrates a high compression rate across a series of languages.
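
A rough way to check compression yourself is to count tokens per UTF-8 byte across languages: fewer tokens per byte means higher compression. The snippet below is an illustrative sketch; the sample sentences are placeholders rather than our evaluation corpus.

```python
# Rough sketch of comparing tokenizer compression across languages.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

samples = {
    "en": "Large language models are trained on trillions of tokens.",
    "zh": "大型语言模型在数万亿个词元上进行训练。",
    "ja": "大規模言語モデルは数兆トークンで学習されます。",
}

for lang, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    n_bytes = len(text.encode("utf-8"))
    print(f"{lang}: {n_tokens} tokens / {n_bytes} bytes = {n_tokens / n_bytes:.3f} tokens per byte")
```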

Another focus of our pretraining is the extension of context length. We directly apply continual pretraining with a longer context length and a larger base value for RoPE. Additionally, we find that this method is also effective for extrapolation. Our opensourced models now mostly support a context length of 32K tokens, which we evaluated with L-Eval and the “Needle in a Haystack” test.

| Model | Input Length | Average | Coursera | GSM | QuALITY | TOEFL | CodeU | SFiction |
|---|---|---|---|---|---|---|---|---|
| ChatGPT-3.5-16k | 16K | 60.73 | 63.51 | 84.00 | 61.38 | 78.43 | 12.22 | 64.84 |
| Qwen-72B-Chat | 32K | 62.30 | 58.13 | 76.00 | 77.22 | 86.24 | 6.66 | 69.53 |
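
To make the RoPE adjustment mentioned above more concrete, here is an illustrative sketch of how enlarging the base value changes the rotary frequencies; the head dimension and base values here are assumptions for illustration, not our exact configuration.

```python
# Illustrative sketch: a larger RoPE base slows down the rotation of the
# higher-index dimensions, which helps cover longer contexts.
import torch

def rope_inv_freq(dim: int, base: float) -> torch.Tensor:
    # Standard RoPE inverse frequencies: 1 / base^(2i / dim) for i = 0 .. dim/2 - 1.
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

head_dim = 128  # assumed head dimension for illustration
print(rope_inv_freq(head_dim, base=10_000)[:4])     # original base
print(rope_inv_freq(head_dim, base=1_000_000)[:4])  # larger base -> slower rotation
```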

Benchmark evaluation shows that our largest opensourced model, Qwen-72B, as well as our largest proprietary model, performs competitively against Llama 2, GPT-3.5, and GPT-4.

Note that this is an evaluation of the base language models. It only indicates that we may have a good starting point for post-training, i.e., SFT and RLHF.

Alignment

We refer to both of these techniques collectively as “alignment” in post-training. Currently, there is a consensus that a chat model can be obtained with a relatively small amount of finetuning data. We focus on improving the diversity and complexity of the SFT data (following InsTag and Tulu 2) and strictly control its quality through manual checking and automatic evaluation.
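
For reference, the chat models consume conversations in the ChatML format; a single SFT sample looks roughly like the following sketch, where the conversation content is a placeholder rather than real training data.

```python
# Illustrative sketch of one SFT sample in the ChatML format used by the chat
# models. The conversation content is a placeholder.
sample = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "Explain RLHF in one sentence.<|im_end|>\n"
    "<|im_start|>assistant\n"
    "RLHF finetunes a model against a reward model learned from human "
    "preference comparisons.<|im_end|>"
)
print(sample)
```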

Based on a good SFT model, we can then explore the effects of RLHF. RLHF, specifically the PPO-based method, is difficult to train. Besides the training instabilities of PPO, another key to final performance is the quality of the reward model. We therefore spent effort building a reliable reward model by first pretraining it on large-scale comparison data and then finetuning it on carefully labeled, high-quality comparison data. Compared with the SFT model, we find that the RLHF model is more creative and follows instructions better, and its responses are thus preferred by human annotators.
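
As a rough illustration of how a reward model is trained on comparison data, the sketch below shows the standard pairwise ranking loss: the reward of the preferred response should exceed that of the rejected one. This is illustrative and not necessarily our exact training objective.

```python
# Sketch of the pairwise ranking loss commonly used for reward models.
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_rewards: torch.Tensor,
                        rejected_rewards: torch.Tensor) -> torch.Tensor:
    # chosen_rewards / rejected_rewards: shape [batch], one scalar reward per response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

loss = reward_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9]))
print(loss)
```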

Tool Use and Agent

One of the most impressive aspects of today’s LLMs is their capability for tool use and acting as agents. We directly label data in the ReAct format to endow the model with the abilities to generate thoughts and actions and to generate responses based on previous steps and observations. Thanks to in-context learning, the model can also use unseen tools by understanding instructions and demonstrations.
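
The following is an illustrative sketch of what a ReAct-style prompt and completion look like; the tool, question, and wording are placeholders rather than our actual labeled data.

```python
# Illustrative sketch of a ReAct-style trajectory for tool use.
react_example = """Answer the following question using the tools provided.

Tools: web_search: searches the web and returns short snippets.

Question: Who wrote "The Art of Computer Programming"?
Thought: I should look this up with the search tool.
Action: web_search
Action Input: {"query": "author of The Art of Computer Programming"}
Observation: Donald Knuth is the author of The Art of Computer Programming.
Thought: I now know the answer.
Final Answer: Donald Knuth."""
print(react_example)
```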

We currently support function calling, a code interpreter, and Hugging Face Agent, which respectively serve tool use, data analysis, and orchestrating AI models for different outputs, say image generation. Furthermore, based on our agent framework, we have built a project called AgentFabric, following GPTs, which allows you to build a specialized AI agent for yourself simply by chatting with our model to configure it.
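
As an illustration, function calling can be exercised against a locally served, OpenAI-compatible endpoint roughly as follows, using the legacy openai Python client (<1.0). The endpoint URL, model name, and function schema here are assumptions for illustration; the actual serving script is described in our GitHub README.

```python
# Sketch of function calling against an OpenAI-compatible Qwen-Chat endpoint.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed local OpenAI-compatible server
openai.api_key = "none"

functions = [{
    "name": "get_current_weather",  # hypothetical tool for illustration
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "What is the weather in Beijing?"}],
    functions=functions,
)
print(response.choices[0].message)
```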

Summary

We have released the Qwen series, and in this blog, we provide a brief introduction to the Qwen language models. For now, we still follow the recipes of pretraining, SFT, and RLHF, and we are figuring out a path towards scaling both models and data. We hope that our opensource work contributes to the research and application communities.