
Introduction

Recently we have witnessed a burst of large-scale models with over 100 billion parameters in the open-source community. These models have demonstrated remarkable performance in both benchmark evaluations and the Chatbot Arena. Today, we release the first 100B+ model of the Qwen1.5 series, Qwen1.5-110B, which achieves performance comparable to Meta-Llama-3-70B in the base model evaluation and outstanding performance in the chat evaluations, including MT-Bench and AlpacaEval 2.0.

Model Features

Qwen1.5-110B is similar to the other Qwen1.5 models and is built on the same Transformer decoder architecture. It uses grouped query attention (GQA), which makes model serving more efficient. The model supports a context length of 32K tokens and remains multilingual, supporting a large number of languages including English, Chinese, French, Spanish, German, Russian, Korean, Japanese, Vietnamese, Arabic, and more.
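For reference, one way to check these architecture details is to inspect the model configuration with Transformers. The snippet below is a minimal sketch, not an official recipe; it assumes the model is published under the Hugging Face ID Qwen/Qwen1.5-110B and exposes the standard Qwen2-style config fields (num_attention_heads, num_key_value_heads, max_position_embeddings).

```python
from transformers import AutoConfig

# Load only the model configuration; no weights are downloaded.
# Assumed Hugging Face model ID.
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-110B")

# GQA: fewer key/value heads than query (attention) heads.
print("attention heads:    ", config.num_attention_heads)
print("key/value heads:    ", config.num_key_value_heads)

# Supported context length (32K tokens).
print("max context length: ", config.max_position_embeddings)
```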

Model Quality

We conduct a series of evaluations of the base language model and compare it with Meta-Llama-3-70B, the recent SOTA language model, as well as Mixtral-8x22B.

| Benchmark | Qwen1.5-110B | Qwen1.5-72B | Llama-3-70B | Mixtral-8x22B |
|-----------|--------------|-------------|-------------|---------------|
| MMLU      | 80.4         | 77.5        | 79.5        | 77.8          |
| TheoremQA | 34.9         | 29.3        | 32.0        | 35.9          |
| GPQA      | 35.9         | 36.3        | 36.4        | 34.3          |
| Hellaswag | 87.5         | 86.0        | 88.0        | 88.7          |
| BBH       | 74.8         | 65.5        | 76.6        | 69.2          |
| ARC-C     | 69.6         | 65.9        | 68.8        | 70.7          |
| GSM8K     | 85.4         | 79.5        | 79.2        | 78.6          |
| MATH      | 49.6         | 34.1        | 41.0        | 41.7          |
| HumanEval | 52.4         | 41.5        | 45.7        | 45.1          |
| MBPP      | 58.1         | 53.4        | 55.1        | 71.2          |

The above results show that the new 110B model is at least competitive with Llama-3-70B in terms of base capabilities. For this model, we did not change the pretraining and post-training recipes drastically, so we believe the performance improvement over the 72B model comes from the increased model size.

We also test the chat models on MT-Bench and AlpacaEval 2.0.

| Models | MT-Bench (Avg. Score) | AlpacaEval 2.0 (LC Win Rate) |
|--------|-----------------------|------------------------------|
| Llama-3-70B-Instruct | 8.85 | 34.40 |
| Qwen1.5-72B-Chat     | 8.61 | 36.60 |
| Qwen1.5-110B-Chat    | 8.88 | 43.90 |

Compared with the previously released 72B model, the 110B model performs significantly better on both chat benchmarks. The consistent improvement indicates that a stronger and larger base language model can lead to a better chat model, even without major changes to the post-training recipe.

Develop with Qwen1.5-110B

We advise you to read our Qwen1.5 blog post to learn how to use the model with Transformers, vLLM, llama.cpp, Ollama, LMStudio, SkyPilot, Axolotl, LLaMA-Factory, etc. A quick Transformers example is sketched below.
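As a quick illustration, here is a minimal sketch of chatting with the instruction-tuned model via Transformers. It assumes the Hugging Face model ID Qwen/Qwen1.5-110B-Chat and a recent transformers release (4.37 or later, which added Qwen2 architecture support); adjust dtype and device settings to your hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face model ID; requires transformers >= 4.37.
model_id = "Qwen/Qwen1.5-110B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick bfloat16/float16 automatically
    device_map="auto",    # shard across available GPUs
)

# Build the prompt with the model's built-in chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
outputs = model.generate(inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```

The same checkpoint can also be served with the other frameworks listed above; the blog post for Qwen1.5 walks through those options in more detail.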

Conclusion

Qwen1.5-110B is the largest model in the Qwen1.5 series and the first in the series with over 100 billion parameters. It demonstrates competitive performance against the very recently released SOTA model Llama-3-70B and is significantly better than our 72B model. This tells us that there is still considerable room to improve performance by scaling model size. While the release of Llama-3 highlights the significance of scaling data to an extremely large scale, we believe we can get the best of both worlds by scaling both data and model size in our future releases. Stay tuned for Qwen2!