
Introduction

The open-source community has long sought a model that strikes an ideal balance between performance, efficiency, and memory footprint. Cutting-edge models like Qwen1.5-72B and DBRX have emerged, but they still face persistent challenges such as high memory consumption, slow inference, and substantial finetuning costs.

A growing consensus within the field now points to a model with approximately 30 billion parameters as the optimal “sweet spot” for achieving both strong performance and manageable resource requirements. In response to this trend, we are proud to unveil the latest additions to our Qwen1.5 language model series: Qwen1.5-32B and Qwen1.5-32B-Chat.

Over the past months, we have meticulously developed the Qwen1.5-32B base model, striving to match or even surpass the performance benchmarks set by state-of-the-art 30B models. Simultaneously, we have made advancements in our post-training techniques, particularly in RLHF, to elevate the conversational capabilities of Qwen1.5-32B-Chat.

Model Quality

Qwen1.5-32B is a new member of the Qwen1.5 language model series. Apart from its size, its architecture differs from the other Qwen1.5 models only in its use of grouped query attention (GQA), which shrinks the key/value cache and thus opens the door to more efficient inference in model serving.
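As a quick illustration, GQA is visible directly in the Hugging Face model configuration: several query heads share each key/value head, so the number of key/value heads is smaller than the number of attention heads. A minimal sketch for inspecting this, assuming the `Qwen/Qwen1.5-32B` checkpoint id on the Hub:

```python
from transformers import AutoConfig

# Load only the configuration; no model weights are downloaded.
cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-32B")

# With GQA, num_key_value_heads < num_attention_heads:
# groups of query heads attend through a shared key/value head.
print("query heads:    ", cfg.num_attention_heads)
print("key/value heads:", cfg.num_key_value_heads)
```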

Here we compare Qwen1.5-32B against state-of-the-art models of around 30B parameters or larger, covering base capability evaluation, chat evaluation, and multilingual evaluation. Below, we report the evaluation results of the base language models:

| Model        | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH  | CMMLU |
|--------------|------|--------|-------|------|-----------|------|------|-------|
| Llama2-34B   | 62.6 | -      | 42.2  | 6.2  | 22.6      | 33.0 | 44.1 | -     |
| Yi-34B       | 76.3 | 81.4   | 67.2  | 14.4 | 23.2      | 41.0 | 54.3 | 83.7  |
| Mixtral-8x7B | 70.6 | -      | 74.4  | 28.4 | 40.2      | 60.7 | -    | -     |
| Qwen1.5-72B  | 77.5 | 84.1   | 79.5  | 34.1 | 41.5      | 53.4 | 65.5 | 83.5  |
| Qwen1.5-32B  | 73.4 | 83.5   | 77.4  | 36.1 | 37.2      | 49.4 | 66.8 | 82.3  |

Our 32B model demonstrates competitive performance across a variety of tasks, including MMLU, GSM8K, HumanEval, and BBH. Compared with the 72B parameter model, Qwen1.5-32B exhibits a slight decrease in performance, yet it still outperforms other 30B models, such as Llama2-34B and Mixtral-8x7B, in most tasks.

For the chat models, we follow the Qwen1.5 evaluation recipe and test their performance on MT-Bench and AlpacaEval 2.0. The results are shown below:

| Models           | MT-Bench (Avg. Score) | AlpacaEval 2.0 (LC Win Rate) |
|------------------|-----------------------|------------------------------|
| Qwen1.5-72B-Chat | 8.61                  | 36.60                        |
| Qwen1.5-32B-Chat | 8.30                  | 27.49                        |

Notably, Qwen1.5-32B-Chat scores above 8 on MT-Bench, and the gap between Qwen1.5-32B-Chat and Qwen1.5-72B-Chat is relatively small. This indicates that the 32B model is a viable alternative for users who need a more efficient and cost-effective solution for chat applications.

We also test the multilingual capabilities of Qwen1.5-32B on a diverse set of 12 languages, including Arabic, Spanish, French, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Indonesian, covering domains such as exams, understanding, math, and translation. The detailed results are shown below:

| Models       | Exams | Understanding | Math  | Translation | Average |
|--------------|-------|---------------|-------|-------------|---------|
| Mixtral-8x7B | 56.08 | 70.70         | 45.00 | 29.78       | 50.39   |
| Qwen1.5-72B  | 66.35 | 78.16         | 61.67 | 35.57       | 60.44   |
| Qwen1.5-32B  | 61.57 | 76.48         | 56.13 | 33.46       | 56.91   |

Similar to the other Qwen1.5 models, the 32B model shows solid multilingual capabilities, again trailing the 72B model only slightly.

Finally, we look at its performance on the long-context evaluation Needle in a Haystack. We are happy to see that it achieves top-level performance on contexts of up to 32K tokens.

Develop with Qwen1.5-32B

We advise you to read our Qwen1.5 blog post to learn how to use the models with Transformers, vLLM, llama.cpp, Ollama, and more.
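For a quick start, here is a minimal sketch of chatting with Qwen1.5-32B-Chat via Hugging Face Transformers, assuming `transformers>=4.37` (which added Qwen1.5 support) and enough GPU memory for the 32B weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-32B-Chat"

# Load the tokenizer and model; device_map="auto" spreads the weights
# across available GPUs, torch_dtype="auto" picks the checkpoint dtype.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build the prompt with the model's built-in chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)

# Strip the prompt tokens and decode only the newly generated response.
response = tokenizer.decode(
    output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(response)
```

The same checkpoint can also be served with vLLM's OpenAI-compatible server, e.g. `python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen1.5-32B-Chat`.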

Conclusion

We release the medium-size model Qwen1.5-32B as well as its chat counterpart. The models have a much smaller memory footprint and run significantly faster than the 72B model. We hope this release helps our users find a better solution for their downstream applications, tackling both the limited capabilities of 14B models (especially in agent scenarios) and the high inference costs of 72B models.