
Introduction

After months of effort, we are pleased to announce the evolution from Qwen1.5 to Qwen2. This time, we bring you:

  • Pretrained and instruction-tuned models of 5 sizes, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B;
  • Training on data in 27 additional languages besides English and Chinese;
  • State-of-the-art performance in a large number of benchmark evaluations;
  • Significantly improved performance in coding and mathematics;
  • Extended context length support up to 128K tokens with Qwen2-7B-Instruct and Qwen2-72B-Instruct.

We have open-sourced the models on Hugging Face and ModelScope, and we look forward to hearing from you!

Model Information

The Qwen2 series includes base and instruction-tuned models in 5 sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B. Key information about the models is summarized in the following table:

| Models | Qwen2-0.5B | Qwen2-1.5B | Qwen2-7B | Qwen2-57B-A14B | Qwen2-72B |
|---|---|---|---|---|---|
| # Params | 0.49B | 1.54B | 7.07B | 57.41B | 72.71B |
| # Non-Emb Params | 0.35B | 1.31B | 5.98B | 56.32B | 70.21B |
| GQA | True | True | True | True | True |
| Tie Embedding | True | True | False | False | False |
| Context Length | 32K | 32K | 128K | 64K | 128K |

Specifically, in Qwen1.5 only Qwen1.5-32B and Qwen1.5-110B adopted Grouped Query Attention (GQA). This time, we apply GQA to all model sizes so that they benefit from faster inference and lower memory usage. For the small models, we prefer tied embeddings, since the large sparse embeddings account for a large proportion of the total model parameters.
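These architectural choices are visible in the released model configurations. A minimal inspection sketch using Hugging Face transformers (the attribute names follow the standard Qwen2 config; the printed interpretation is illustrative only):

```python
from transformers import AutoConfig

# GQA appears as fewer key/value heads than attention heads;
# tied embeddings appear as tie_word_embeddings=True (small models only).
for name in ["Qwen/Qwen2-0.5B", "Qwen/Qwen2-7B"]:
    cfg = AutoConfig.from_pretrained(name)
    print(
        name,
        "attn heads:", cfg.num_attention_heads,
        "kv heads:", cfg.num_key_value_heads,
        "tied embeddings:", cfg.tie_word_embeddings,
    )
```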

In terms of context length, all base language models have been pretrained on data with a context length of 32K tokens, and we observe satisfactory extrapolation capabilities up to 128K in PPL evaluation. For instruction-tuned models, however, we are not satisfied with PPL evaluation alone; we need the models to correctly understand long contexts and complete tasks. The table above lists the context length capabilities of the instruction-tuned models, as assessed on the Needle in a Haystack task. Notably, when augmented with YaRN, both Qwen2-7B-Instruct and Qwen2-72B-Instruct demonstrate an impressive capacity to handle context lengths of up to 128K tokens.
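For readers who want to try long-context inference themselves, one documented way to enable YaRN-style extrapolation is to add a rope_scaling block to the model's config.json before serving (for example with vLLM). A minimal sketch, assuming a local snapshot of Qwen2-7B-Instruct; the field names follow the convention shown on the model cards, so verify against the official documentation before use:

```python
import json
from pathlib import Path

# Add a YaRN rope_scaling block to a local copy of config.json before serving.
# factor=4.0 scales the native 32K training length to roughly 128K tokens.
cfg_path = Path("Qwen2-7B-Instruct/config.json")  # hypothetical local model directory
cfg = json.loads(cfg_path.read_text())
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
cfg_path.write_text(json.dumps(cfg, indent=2))
```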

Significant efforts were directed towards augmenting both the volume and quality of the pretraining and instruction-tuning datasets across a diverse linguistic spectrum beyond English and Chinese, in order to bolster the models' multilingual competencies. Although large language models have an inherent capacity to generalize to other languages, we explicitly highlight the inclusion of 27 additional languages in our training:

| Regions | Languages |
|---|---|
| Western Europe | German, French, Spanish, Portuguese, Italian, Dutch |
| Eastern & Central Europe | Russian, Czech, Polish |
| Middle East | Arabic, Persian, Hebrew, Turkish |
| Eastern Asia | Japanese, Korean |
| South-Eastern Asia | Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog |
| Southern Asia | Hindi, Bengali, Urdu |

Additionally, we have devoted significant effort to addressing code-switching, a frequent occurrence in multilingual evaluation. Consequently, our models' proficiency in handling this phenomenon has notably improved. Evaluations using prompts that typically induce code-switching across languages confirm a substantial reduction in associated issues.

Performance

Comparative assessments reveal substantial performance gains for large-scale models (70B+ parameters) relative to Qwen1.5. Here our evaluation centers on the large-size model Qwen2-72B. For base language models, Qwen2-72B and state-of-the-art open models are evaluated on a range of capabilities, including natural language understanding, knowledge acquisition, coding proficiency, mathematical skills, and multilingual abilities. Benefiting from meticulously curated datasets and optimized training methods, Qwen2-72B exhibits superior performance compared to leading models such as Llama-3-70B. Notably, it surpasses its predecessor, Qwen1.5-110B, despite having fewer parameters.

After extensive large-scale pretraining, we conduct post-training to further enhance Qwen's intelligence, bringing it closer to humans. This process further improves the model's capabilities in areas such as coding, mathematics, reasoning, instruction following, and multilingual understanding. Additionally, it aligns the model's output with human values, ensuring that it is helpful, honest, and harmless. Our post-training phase is designed around the principle of scalable training with minimal human annotation. Specifically, we investigate how to obtain high-quality, reliable, diverse, and creative demonstration and preference data with various automated alignment strategies, such as rejection sampling for math, execution feedback for coding and instruction following, back-translation for creative writing, and scalable oversight for role-play. For training, we apply a combination of supervised fine-tuning, reward model training, and online DPO training. We also employ a novel Online Merging Optimizer to minimize the alignment tax. These collective efforts have significantly boosted the capabilities and intelligence of our models, as reflected in the evaluations below.
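The exact online DPO recipe and the Online Merging Optimizer are beyond the scope of this post; purely as a point of reference, here is a minimal sketch of the standard DPO objective that such preference training builds on (plain PyTorch, operating on sequence-level log-probabilities; not the production implementation):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard Direct Preference Optimization loss over sequence log-probs.

    Illustrative only: an online variant additionally refreshes preference
    pairs with fresh model samples during training, which is not shown here.
    """
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage the policy to widen the gap between preferred and rejected responses.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```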

We comprehensively evaluate Qwen2-72B-Instruct on 16 benchmarks across various domains. Qwen2-72B-Instruct strikes a balance between stronger capabilities and alignment with human values. Specifically, it significantly surpasses Qwen1.5-72B-Chat across all benchmarks and is competitive with Llama-3-70B-Instruct.

In terms of smaller models, our Qwen2 models also outperform SOTA models of similar or even larger sizes. Compared with very recently released SOTA models, Qwen2-7B-Instruct still demonstrates advantages across benchmarks, with particularly strong performance on coding and Chinese-related metrics.

Highlights

Coding & Mathematics

We have persistently dedicated our efforts to enhancing the advanced capabilities of Qwen, particularly in coding and mathematics. In coding, we have successfully integrated the code training experience and data from CodeQwen1.5, resulting in significant improvements in Qwen2-72B-Instruct across various programming languages. In mathematics, by exploiting extensive, high-quality datasets, Qwen2-72B-Instruct exhibits stronger capabilities in solving mathematical problems.

Long Context Understanding

In Qwen2, all instruction-tuned models have been trained on contexts of 32K tokens and extrapolated to longer context lengths using techniques such as YaRN or Dual Chunk Attention.

In our Needle in a Haystack tests, Qwen2-72B-Instruct is capable of flawlessly handling information extraction tasks within a 128K context. Coupled with its inherently strong performance, it becomes the preferred choice for long-text tasks when resources are sufficient.

Additionally, it’s worth noting the impressive capabilities of the other models in the series: Qwen2-7B-Instruct nearly flawlessly handles contexts of up to 128K tokens, Qwen2-57B-A14B-Instruct manages contexts of up to 64K, and the two smaller models in the lineup support 32K contexts.
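As an illustration of how these long-context models might be used in practice, here is a minimal inference sketch with vLLM (the model name, context window, and sampling settings are illustrative; for the full 128K window, combine this with the rope_scaling change shown earlier):

```python
from vllm import LLM, SamplingParams

# Load Qwen2-7B-Instruct with an extended context window and answer a question
# about a long document placed in the prompt.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", max_model_len=65536)
params = SamplingParams(temperature=0.7, max_tokens=256)

long_document = "..."  # placeholder for a long input document
prompt = f"{long_document}\n\nQuestion: What fact is hidden in the document?\nAnswer:"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```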

Alongside the long-context models, we have also open-sourced an agent solution for efficiently processing documents containing up to 1 million tokens. For more details, see our dedicated blog post on this topic.

Safety and Responsibility

The table below presents the proportion of harmful responses generated by large models for four categories of multilingual unsafe queries (Illegal Activity, Fraud, Pornography, Privacy Violence). The test data was derived from Jailbreak and translated into multiple languages for evaluation. We find that Llama-3 does not handle multilingual prompts effectively, so it is not included in the comparison. Through significance testing (p-value), we found that Qwen2-72B-Instruct performs comparably to GPT-4 in terms of safety and significantly outperforms Mixtral-8x22B.

Each cell lists the harmful-response rates for GPT-4 / Mixtral-8x22B / Qwen2-72B-Instruct.

| Language | Illegal Activity | Fraud | Pornography | Privacy Violence |
|---|---|---|---|---|
| zh | 0% / 13% / 0% | 0% / 17% / 0% | 43% / 47% / 53% | 0% / 10% / 0% |
| en | 0% / 7% / 0% | 0% / 23% / 0% | 37% / 67% / 63% | 0% / 27% / 3% |
| ar | 0% / 13% / 0% | 0% / 7% / 0% | 15% / 26% / 15% | 3% / 13% / 0% |
| es | 0% / 7% / 0% | 3% / 0% / 0% | 48% / 64% / 50% | 3% / 7% / 3% |
| fr | 0% / 3% / 0% | 3% / 3% / 7% | 3% / 19% / 7% | 0% / 27% / 0% |
| ko | 0% / 4% / 0% | 3% / 8% / 4% | 17% / 29% / 10% | 0% / 26% / 4% |
| pt | 0% / 7% / 0% | 3% / 7% / 3% | 47% / 57% / 47% | 4% / 26% / 4% |
| th | 0% / 10% / 0% | 7% / 23% / 3% | 13% / 17% / 10% | 13% / 7% / 7% |
| vi | 0% / 4% / 0% | 4% / 11% / 0% | 22% / 26% / 22% | 0% / 0% / 0% |
| Average | 0% / 8% / 0% | 3% / 11% / 2% | 27% / 39% / 31% | 3% / 16% / 2% |
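The exact statistical procedure and sample sizes behind the reported p-values are not detailed here; purely as an illustration, a two-proportion z-test on harmful-response counts might look like the following (the counts and prompt numbers are hypothetical placeholders):

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical example: compare harmful-response counts for two models over the
# same set of prompts. A small p-value indicates a significant difference in rates.
harmful_counts = [2, 16]   # e.g. model A vs model B (placeholder counts)
num_prompts = [100, 100]   # prompts evaluated per model (placeholder)
z_stat, p_value = proportions_ztest(harmful_counts, num_prompts)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```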

Developing with Qwen2

All models have now been released on Hugging Face and ModelScope. Feel free to visit the model cards for detailed usage instructions and to learn more about each model, including its features and performance.
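As a quick start, a minimal chat-style example with Hugging Face transformers might look like this (the model choice and generation settings are illustrative; see the model cards for the recommended usage):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language models."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```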

Many friends have long supported the development of Qwen, including finetuning (Axolotl, Llama-Factory, Firefly, Swift, XTuner), quantization (AutoGPTQ, AutoAWQ, Neural Compressor), deployment (vLLM, SGL, SkyPilot, TensorRT-LLM, OpenVino, TGI), API platforms (Together, Fireworks, OpenRouter), local run (MLX, Llama.cpp, Ollama, LM Studio), agent and RAG frameworks (LlamaIndex, CrewAI, OpenDevin), evaluation (LMSys, OpenCompass, Open LLM Leaderboard), model training (Dolphin, Openbuddy), etc. For how to use Qwen2 with third-party frameworks, please refer to their respective documentation as well as our official documentation.

There are still many teams and individuals not mentioned here who have contributed to Qwen. We sincerely thank them for their support, and we hope our collaboration can boost research and development in the open-source AI community.

License

This time, we are changing the licenses of our models. While Qwen2-72B and its instruction-tuned models still use the original Qianwen License, all other models, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and Qwen2-57B-A14B, now adopt Apache 2.0! We believe that the enhanced openness of our models to the community can accelerate applications and commercial use of Qwen2 around the world.

What’s Next for Qwen2?

We are training larger Qwen2 models to further explore model scaling along with our recent data scaling. Additionally, we are extending the Qwen2 language models to be multimodal, capable of understanding both visual and audio information. In the near future, we will continue to open-source new models to accelerate open-source AI. Stay tuned!

Citation

We are going to release the technical report for Qwen2 very soon. Feel free to cite us!

@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}



Appendix

Base Language Model Evaluation

The evaluation of base models mainly focuses on model performance in natural language understanding, general question answering, coding, mathematics, scientific knowledge, reasoning, multilingual capability, etc.

The datasets for evaluation include:

English Tasks: MMLU (5-shot), MMLU-Pro (5-shot), GPQA (5-shot), Theorem QA (5-shot), BBH (3-shot), HellaSwag (10-shot), Winogrande (5-shot), TruthfulQA (0-shot), ARC-C (25-shot)

Coding Tasks: EvalPlus (0-shot) (HumanEval, MBPP, HumanEval+, MBPP+), MultiPL-E (0-shot) (Python, C++, Java, PHP, TypeScript, C#, Bash, JavaScript)

Math Tasks: GSM8K (4-shot), MATH (4-shot)

Chinese Tasks: C-Eval (5-shot), CMMLU (5-shot)

Multilingual Tasks: Multi-Exam (M3Exam 5-shot, IndoMMLU 3-shot, ruMMLU 5-shot, mMMLU 5-shot), Multi-Understanding (BELEBELE 5-shot, XCOPA 5-shot, XWinograd 5-shot, XStoryCloze 0-shot, PAWS-X 5-shot), Multi-Mathematics (MGSM 8-shot), Multi-Translation (Flores-101 5-shot)

Qwen2-72B performance

| Datasets | DeepSeek-V2 | Mixtral-8x22B | Llama-3-70B | Qwen1.5-72B | Qwen1.5-110B | Qwen2-72B |
|---|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | Dense | Dense |
| #Activated Params | 21B | 39B | 70B | 72B | 110B | 72B |
| #Params | 236B | 140B | 70B | 72B | 110B | 72B |
| English | | | | | | |
| MMLU | 78.5 | 77.8 | 79.5 | 77.5 | 80.4 | 84.2 |
| MMLU-Pro | - | 49.5 | 52.8 | 45.8 | 49.4 | 55.6 |
| GPQA | - | 34.3 | 36.3 | 36.3 | 35.9 | 37.9 |
| Theorem QA | - | 35.9 | 32.3 | 29.3 | 34.9 | 43.1 |
| BBH | 78.9 | 78.9 | 81.0 | 65.5 | 74.8 | 82.4 |
| HellaSwag | 87.8 | 88.7 | 88.0 | 86.0 | 87.5 | 87.6 |
| Winogrande | 84.8 | 85.0 | 85.3 | 83.0 | 83.5 | 85.1 |
| ARC-C | 70.0 | 70.7 | 68.8 | 65.9 | 69.6 | 68.9 |
| TruthfulQA | 42.2 | 51.0 | 45.6 | 59.6 | 49.6 | 54.8 |
| Coding | | | | | | |
| HumanEval | 45.7 | 46.3 | 48.2 | 46.3 | 54.3 | 64.6 |
| MBPP | 73.9 | 71.7 | 70.4 | 66.9 | 70.9 | 76.9 |
| EvalPlus | 55.0 | 54.1 | 54.8 | 52.9 | 57.7 | 65.4 |
| MultiPL-E | 44.4 | 46.7 | 46.3 | 41.8 | 52.7 | 59.6 |
| Mathematics | | | | | | |
| GSM8K | 79.2 | 83.7 | 83.0 | 79.5 | 85.4 | 89.5 |
| MATH | 43.6 | 41.7 | 42.5 | 34.1 | 49.6 | 51.1 |
| Chinese | | | | | | |
| C-Eval | 81.7 | 54.6 | 65.2 | 84.1 | 89.1 | 91.0 |
| CMMLU | 84.0 | 53.4 | 67.2 | 83.5 | 88.3 | 90.1 |
| Multilingual | | | | | | |
| Multi-Exam | 67.5 | 63.5 | 70.0 | 66.4 | 75.6 | 76.6 |
| Multi-Understanding | 77.0 | 77.7 | 79.9 | 78.2 | 78.2 | 80.7 |
| Multi-Mathematics | 58.8 | 62.9 | 67.1 | 61.7 | 64.4 | 76.0 |
| Multi-Translation | 36.0 | 23.3 | 38.0 | 35.6 | 36.2 | 37.8 |

Qwen2-57B-A14B

| Datasets | Jamba | Mixtral-8x7B | Yi-1.5-34B | Qwen1.5-32B | Qwen2-57B-A14B |
|---|---|---|---|---|---|
| Architecture | MoE | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 12B | 34B | 32B | 14B |
| #Params | 52B | 47B | 34B | 32B | 57B |
| English | | | | | |
| MMLU | 67.4 | 71.8 | 77.1 | 74.3 | 76.5 |
| MMLU-Pro | - | 41.0 | 48.3 | 44.0 | 43.0 |
| GPQA | - | 29.2 | - | 30.8 | 34.3 |
| Theorem QA | - | 23.2 | - | 28.8 | 33.5 |
| BBH | 45.4 | 50.3 | 76.4 | 66.8 | 67.0 |
| HellaSwag | 87.1 | 86.5 | 85.9 | 85.0 | 85.2 |
| Winogrande | 82.5 | 81.9 | 84.9 | 81.5 | 79.5 |
| ARC-C | 64.4 | 66.0 | 65.6 | 63.6 | 64.1 |
| TruthfulQA | 46.4 | 51.1 | 53.9 | 57.4 | 57.7 |
| Coding | | | | | |
| HumanEval | 29.3 | 37.2 | 46.3 | 43.3 | 53.0 |
| MBPP | - | 63.9 | 65.5 | 64.2 | 71.9 |
| EvalPlus | - | 46.4 | 51.9 | 50.4 | 57.2 |
| MultiPL-E | - | 39.0 | 39.5 | 38.5 | 49.8 |
| Mathematics | | | | | |
| GSM8K | 59.9 | 62.5 | 82.7 | 76.8 | 80.7 |
| MATH | - | 30.8 | 41.7 | 36.1 | 43.0 |
| Chinese | | | | | |
| C-Eval | - | - | - | 83.5 | 87.7 |
| CMMLU | - | - | 84.8 | 82.3 | 88.5 |
| Multilingual | | | | | |
| Multi-Exam | - | 56.1 | 58.3 | 61.6 | 65.5 |
| Multi-Understanding | - | 70.7 | 73.9 | 76.5 | 77.0 |
| Multi-Mathematics | - | 45.0 | 49.3 | 56.1 | 62.3 |
| Multi-Translation | - | 29.8 | 30.0 | 33.5 | 34.5 |

Qwen2-7B

| Datasets | Mistral-7B | Gemma-7B | Llama-3-8B | Qwen1.5-7B | Qwen2-7B |
|---|---|---|---|---|---|
| # Params | 7.2B | 8.5B | 8.0B | 7.7B | 7.6B |
| # Non-emb Params | 7.0B | 7.8B | 7.0B | 6.5B | 6.5B |
| English | | | | | |
| MMLU | 64.2 | 64.6 | 66.6 | 61.0 | 70.3 |
| MMLU-Pro | 30.9 | 33.7 | 35.4 | 29.9 | 40.0 |
| GPQA | 24.7 | 25.7 | 25.8 | 26.7 | 31.8 |
| Theorem QA | 19.2 | 21.5 | 22.1 | 14.2 | 31.1 |
| BBH | 56.1 | 55.1 | 57.7 | 40.2 | 62.6 |
| HellaSwag | 83.2 | 82.2 | 82.1 | 78.5 | 80.7 |
| Winogrande | 78.4 | 79.0 | 77.4 | 71.3 | 77.0 |
| ARC-C | 60.0 | 61.1 | 59.3 | 54.2 | 60.6 |
| TruthfulQA | 42.2 | 44.8 | 44.0 | 51.1 | 54.2 |
| Coding | | | | | |
| HumanEval | 29.3 | 37.2 | 33.5 | 36.0 | 51.2 |
| MBPP | 51.1 | 50.6 | 53.9 | 51.6 | 65.9 |
| EvalPlus | 36.4 | 39.6 | 40.3 | 40.0 | 54.2 |
| MultiPL-E | 29.4 | 29.7 | 22.6 | 28.1 | 46.3 |
| Mathematics | | | | | |
| GSM8K | 52.2 | 46.4 | 56.0 | 62.5 | 79.9 |
| MATH | 13.1 | 24.3 | 20.5 | 20.3 | 44.2 |
| Chinese | | | | | |
| C-Eval | 47.4 | 43.6 | 49.5 | 74.1 | 83.2 |
| CMMLU | - | - | 50.8 | 73.1 | 83.9 |
| Multilingual | | | | | |
| Multi-Exam | 47.1 | 42.7 | 52.3 | 47.7 | 59.2 |
| Multi-Understanding | 63.3 | 58.3 | 68.6 | 67.6 | 72.0 |
| Multi-Mathematics | 26.3 | 39.1 | 36.3 | 37.3 | 57.5 |
| Multi-Translation | 23.3 | 31.2 | 31.9 | 28.4 | 31.5 |

Qwen2-0.5B & Qwen2-1.5B

| Datasets | Phi-2 | Gemma-2B | MiniCPM | Qwen1.5-1.8B | Qwen2-0.5B | Qwen2-1.5B |
|---|---|---|---|---|---|---|
| #Non-Emb Params | 2.5B | 2.0B | 2.4B | 1.3B | 0.35B | 1.3B |
| MMLU | 52.7 | 42.3 | 53.5 | 46.8 | 45.4 | 56.5 |
| MMLU-Pro | - | 15.9 | - | - | 14.7 | 21.8 |
| Theorem QA | - | - | - | - | 8.9 | 15.0 |
| HumanEval | 47.6 | 22.0 | 50.0 | 20.1 | 22.0 | 31.1 |
| MBPP | 55.0 | 29.2 | 47.3 | 18.0 | 22.0 | 37.4 |
| GSM8K | 57.2 | 17.7 | 53.8 | 38.4 | 36.5 | 58.5 |
| MATH | 3.5 | 11.8 | 10.2 | 10.1 | 10.7 | 21.7 |
| BBH | 43.4 | 35.2 | 36.9 | 24.2 | 28.4 | 37.2 |
| HellaSwag | 73.1 | 71.4 | 68.3 | 61.4 | 49.3 | 66.6 |
| Winogrande | 74.4 | 66.8 | - | 60.3 | 56.8 | 66.2 |
| ARC-C | 61.1 | 48.5 | - | 37.9 | 31.5 | 43.9 |
| TruthfulQA | 44.5 | 33.1 | - | 39.4 | 39.7 | 45.9 |
| C-Eval | 23.4 | 28.0 | 51.1 | 59.7 | 58.2 | 70.6 |
| CMMLU | 24.2 | - | 51.1 | 57.8 | 55.1 | 70.3 |

Instruction-tuned Model Evaluation

Qwen2-72B-Instruct

| Datasets | Llama-3-70B-Instruct | Qwen1.5-72B-Chat | Qwen2-72B-Instruct |
|---|---|---|---|
| English | | | |
| MMLU | 82.0 | 75.6 | 82.3 |
| MMLU-Pro | 56.2 | 51.7 | 64.4 |
| GPQA | 41.9 | 39.4 | 42.4 |
| TheoremQA | 42.5 | 28.8 | 44.4 |
| MT-Bench | 8.95 | 8.61 | 9.12 |
| Arena-Hard | 41.1 | 36.1 | 48.1 |
| IFEval (Prompt Strict-Acc.) | 77.3 | 55.8 | 77.6 |
| Coding | | | |
| HumanEval | 81.7 | 71.3 | 86.0 |
| MBPP | 82.3 | 71.9 | 80.2 |
| MultiPL-E | 63.4 | 48.1 | 69.2 |
| EvalPlus | 75.2 | 66.9 | 79.0 |
| LiveCodeBench | 29.3 | 17.9 | 35.7 |
| Mathematics | | | |
| GSM8K | 93.0 | 82.7 | 91.1 |
| MATH | 50.4 | 42.5 | 59.7 |
| Chinese | | | |
| C-Eval | 61.6 | 76.1 | 83.8 |
| AlignBench | 7.42 | 7.28 | 8.27 |

Qwen2-57B-A14B-Instruct

| Datasets | Mixtral-8x7B-Instruct-v0.1 | Yi-1.5-34B-Chat | Qwen1.5-32B-Chat | Qwen2-57B-A14B-Instruct |
|---|---|---|---|---|
| Architecture | MoE | Dense | Dense | MoE |
| #Activated Params | 12B | 34B | 32B | 14B |
| #Params | 47B | 34B | 32B | 57B |
| English | | | | |
| MMLU | 71.4 | 76.8 | 74.8 | 75.4 |
| MMLU-Pro | 43.3 | 52.3 | 46.4 | 52.8 |
| GPQA | - | - | 30.8 | 34.3 |
| TheoremQA | - | - | 30.9 | 33.1 |
| MT-Bench | 8.30 | 8.50 | 8.30 | 8.55 |
| Coding | | | | |
| HumanEval | 45.1 | 75.2 | 68.3 | 79.9 |
| MBPP | 59.5 | 74.6 | 67.9 | 70.9 |
| MultiPL-E | - | - | 50.7 | 66.4 |
| EvalPlus | 48.5 | - | 63.6 | 71.6 |
| LiveCodeBench | 12.3 | - | 15.2 | 25.5 |
| Mathematics | | | | |
| GSM8K | 65.7 | 90.2 | 83.6 | 79.6 |
| MATH | 30.7 | 50.1 | 42.4 | 49.1 |
| Chinese | | | | |
| C-Eval | - | - | 76.7 | 80.5 |
| AlignBench | 5.70 | 7.20 | 7.19 | 7.36 |

Qwen2-7B-Instruct

| Datasets | Llama-3-8B-Instruct | Yi-1.5-9B-Chat | GLM-4-9B-Chat | Qwen1.5-7B-Chat | Qwen2-7B-Instruct |
|---|---|---|---|---|---|
| English | | | | | |
| MMLU | 68.4 | 69.5 | 72.4 | 59.5 | 70.5 |
| MMLU-Pro | 41.0 | - | - | 29.1 | 44.1 |
| GPQA | 34.2 | - | - | 27.8 | 25.3 |
| TheoremQA | 23.0 | - | - | 14.1 | 25.3 |
| MT-Bench | 8.05 | 8.20 | 8.35 | 7.60 | 8.41 |
| Coding | | | | | |
| HumanEval | 62.2 | 66.5 | 71.8 | 46.3 | 79.9 |
| MBPP | 67.9 | - | - | 48.9 | 67.2 |
| MultiPL-E | 48.5 | - | - | 27.2 | 59.1 |
| EvalPlus | 60.9 | - | - | 44.8 | 70.3 |
| LiveCodeBench | 17.3 | - | - | 6.0 | 26.6 |
| Mathematics | | | | | |
| GSM8K | 79.6 | 84.8 | 79.6 | 60.3 | 82.3 |
| MATH | 30.0 | 47.7 | 50.6 | 23.2 | 49.6 |
| Chinese | | | | | |
| C-Eval | 45.9 | - | 75.6 | 67.3 | 77.2 |
| AlignBench | 6.20 | 6.90 | 7.01 | 6.20 | 7.21 |

Qwen2-0.5B-Instruct & Qwen2-1.5B-Instruct

| Datasets | Qwen1.5-0.5B-Chat | Qwen2-0.5B-Instruct | Qwen1.5-1.8B-Chat | Qwen2-1.5B-Instruct |
|---|---|---|---|---|
| MMLU | 35.0 | 37.9 | 43.7 | 52.4 |
| HumanEval | 9.1 | 17.1 | 25.0 | 37.8 |
| GSM8K | 11.3 | 40.1 | 35.3 | 61.6 |
| C-Eval | 37.2 | 45.2 | 55.3 | 63.8 |
| IFEval (Prompt Strict-Acc.) | 14.6 | 20.0 | 16.8 | 29.0 |

Multilingual capability of instruction-tuned models

We compare Qwen2 instruction-tuned models with other recent LLMs on several cross-lingual open benchmarks as well as by human evaluation. For benchmarks, we show the results on 2 evaluation datasets:

  • M-MMLU from Okapi: multilingual commonsense evaluation (we evaluate with a subset on ar, de, es, fr, it, nl, ru, uk, vi, zh)
  • MGSM: math evaluation on languages including de, en, es, fr, ja, ru, th, zh and bn

The results are averaged over languages for each benchmark and shown as follows:

| Models | M-MMLU (5-shot) | MGSM (0-shot, CoT) |
|---|---|---|
| Proprietary LLMs | | |
| GPT-4-0613 | 78.0 | 87.0 |
| GPT-4-Turbo-0409 | 79.3 | 90.5 |
| GPT-4o-0513 | 83.2 | 89.6 |
| Claude-3-Opus-20240229 | 80.1 | 91.0 |
| Claude-3-Sonnet-20240229 | 71.0 | 85.6 |
| Open-source LLMs | | |
| command-r-plus-110b | 65.5 | 63.5 |
| Qwen1.5-7B-Chat | 50.0 | 37.0 |
| Qwen1.5-32B-Chat | 65.0 | 65.0 |
| Qwen1.5-72B-Chat | 68.4 | 71.7 |
| Qwen2-7B-Instruct | 60.0 | 57.0 |
| Qwen2-57B-A14B-Instruct | 68.0 | 74.0 |
| Qwen2-72B-Instruct | 78.0 | 86.6 |

For human evaluation, we compare Qwen2-72B-Instruct with GPT-3.5, GPT-4, and Claude-3-Opus using an in-house evaluation set covering 10 languages: ar, es, fr, ko, th, vi, pt, id, ja, and ru (scores range from 1 to 5):

| Models | ar | es | fr | ko | th | vi | pt | id | ja | ru | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-3-Opus-20240229 | 4.15 | 4.31 | 4.23 | 4.23 | 4.01 | 3.98 | 4.09 | 4.40 | 3.85 | 4.25 | 4.15 |
| GPT-4o-0513 | 3.55 | 4.26 | 4.16 | 4.40 | 4.09 | 4.14 | 3.89 | 4.39 | 3.72 | 4.32 | 4.09 |
| GPT-4-Turbo-0409 | 3.44 | 4.08 | 4.19 | 4.24 | 4.11 | 3.84 | 3.86 | 4.09 | 3.68 | 4.27 | 3.98 |
| Qwen2-72B-Instruct | 3.86 | 4.10 | 4.01 | 4.14 | 3.75 | 3.91 | 3.97 | 3.83 | 3.63 | 4.15 | 3.93 |
| GPT-4-0613 | 3.55 | 3.92 | 3.94 | 3.87 | 3.83 | 3.95 | 3.55 | 3.77 | 3.06 | 3.63 | 3.71 |
| GPT-3.5-Turbo-1106 | 2.52 | 4.07 | 3.47 | 2.37 | 3.38 | 2.90 | 3.37 | 3.56 | 2.75 | 3.24 | 3.16 |

Grouped by task types, the results are shown as follows:

| Models | Knowledge | Understanding | Creation | Math |
|---|---|---|---|---|
| Claude-3-Opus-20240229 | 3.64 | 4.45 | 4.42 | 3.81 |
| GPT-4o-0513 | 3.76 | 4.35 | 4.45 | 3.53 |
| GPT-4-Turbo-0409 | 3.42 | 4.29 | 4.35 | 3.58 |
| Qwen2-72B-Instruct | 3.41 | 4.07 | 4.36 | 3.61 |
| GPT-4-0613 | 3.42 | 4.09 | 4.10 | 3.32 |
| GPT-3.5-Turbo-1106 | 3.37 | 3.67 | 3.89 | 2.97 |

These results demonstrate the strong multilingual capabilities of Qwen2 instruction-tuned models.