
To move toward the objective of building an AGI system, a model should be capable of understanding information from different modalities. Thanks to the rapid development of large language models, LLMs are now capable of language understanding and reasoning. Previously, we took a step toward extending our LLM, Qwen, to more modalities, including vision and audio, and built Qwen-VL and Qwen-Audio. Today, we release Qwen2-Audio, the next version of Qwen-Audio, which accepts audio and text inputs and generates text outputs. Qwen2-Audio has the following features:

  • Voice Chat: for the first time, users can give spoken instructions to the audio-language model by voice alone, without any ASR module.

  • Audio Analysis: the model can analyze audio information, including speech, sounds, and music, following text instructions.

  • Multilingual: the model supports more than 8 languages and dialects, e.g., Chinese, English, Cantonese, French, Italian, Spanish, German, and Japanese.


We open-source the weights of Qwen2-Audio-7B and Qwen2-Audio-7B-Instruct on Hugging Face and ModelScope, and we have built a demo for users to interact with. Below are some examples that show the model's performance:

Voice Chat

Voice Chat: Identify speaker information and respond

User: (audio input)
Qwen2-Audio: Yes, the speaker is female and in her twenties.
User: (audio input)
Qwen2-Audio: I'm sorry to hear that you're feeling stressed about your exam. It's completely normal to feel anxious when you're preparing for an important test. Here are some tips that might help you:

1. Start studying early: Don't leave everything until the last minute. Break up your study sessions into smaller chunks of time, and try to stick to a regular schedule.
2. Find a quiet place to study: A quiet environment can help you concentrate better. If possible, find a library or a study room where you can work without distractions.
3. Take breaks: It's important to take breaks during your study sessions. This will give your brain a chance to rest and recharge. Try taking short breaks every hour or so.
4. Get enough sleep: Lack of sleep can make it harder to focus and retain information. Make sure you're getting enough sleep each night, ideally around 7-8 hours.
5. Practice relaxation techniques: Deep breathing exercises, meditation, or yoga can all help you relax and reduce stress.

Remember, it's okay if you're not feeling well-prepared. Just do your best, and trust yourself. Good luck with your exam!

Audio Analysis

Audio Analysis: Speech transcription and storytelling

User: (audio input)
Qwen2-Audio: The original content of this audio is: 'Stop all the clocks, cut off the telephone, prevent the dog from barking, silence the pianos and with muffled drum bring out the coffin, let the mourners come.'
User: No less than 50 words and no more than 200 words, add scenes such as rain and thunder to the poem. Finally, give the verses you created in lines.
Qwen2-Audio:

In the midst of an eerie stormy night,
With raindrops falling heavy and bright,
The world seemed to hush its voice,
As winds howled through the darkened space.

Stop all the clocks, the time stood still,
As the lightning flashed through the sky,
The telephone's ring was drowned out,
By the roaring thunder that shook the ground.

Prevent the dog from barking, its cries were silenced,
By the fear that gripped it in the night,
Silence the pianos, their keys were muted,
As the mournful sound of raindrops beat on wood.

With muffled drums, a slow march began,
A solemn procession through the storm,
The coffin was carried, heavy and solemn,
Let the mourners come, to pay their final respects.

As the storm slowly passed, the world emerged,
A new day dawned, but memories lingered,
Of the night when time stood still,
And the mourners came, to say goodbye.

Performance

We have conducted a series of experiments on benchmark datasets, including LibriSpeech, Common Voice 15, Fleurs, Aishell2, CoVoST2, MELD, VocalSound, and AIR-Bench, to evaluate the performance of Qwen2-Audio in comparison with our previously released Qwen-Audio as well as the state-of-the-art models on each task. The figure below shows how Qwen2-Audio performs against these competitors. On all the tasks, Qwen2-Audio significantly surpasses either the previous SOTA or Qwen-Audio.

More detailed results on these datasets are listed in the table below.

Architecture

Below is an illustration of the training process. Specifically, we start from the Qwen language model and an audio encoder as foundation models. We then sequentially apply multi-task pretraining for audio-language alignment, followed by supervised fine-tuning and direct preference optimization to equip the model with downstream-task capabilities and align it with human preferences.
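
To make the alignment idea more concrete, here is a minimal conceptual sketch, not the actual Qwen2-Audio implementation, of how an audio encoder's outputs can be projected into a language model's embedding space before decoding. The module names, dimensions, and the simple linear projector are illustrative assumptions:

import torch
import torch.nn as nn

class AudioLanguageSketch(nn.Module):
    """Illustrative sketch of audio-language fusion (not the real architecture)."""

    def __init__(self, audio_encoder: nn.Module, language_model: nn.Module,
                 audio_dim: int, text_dim: int):
        super().__init__()
        self.audio_encoder = audio_encoder               # assumed encoder producing frame-level audio features
        self.projector = nn.Linear(audio_dim, text_dim)  # maps audio features into the LLM embedding space
        self.language_model = language_model             # decoder-only LLM accepting inputs_embeds

    def forward(self, audio_inputs: torch.Tensor, text_embeds: torch.Tensor):
        # 1) Encode audio into a sequence of hidden states.
        audio_states = self.audio_encoder(audio_inputs)
        # 2) Project them into the same space as the text token embeddings.
        audio_embeds = self.projector(audio_states)
        # 3) Concatenate audio and text embeddings and let the LLM decode as usual.
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)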

How to Use

Qwen2-Audio is now officially supported by Hugging Face Transformers. We advise you to install the latest version of transformers from source:

pip install git+https://github.com/huggingface/transformers
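
The examples below also use librosa to decode the audio files; if it is not already in your environment, you can install it separately (this extra dependency note is ours, not part of the official instructions):

pip install librosa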

We demonstrate how to use Qwen2-Audio-7B-Instruct for voice chat and audio analysis. Here is an example of voice chat:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
    ]},
    {"role": "assistant", "content": "Yes, the speaker is female and in her twenties."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav"},
    ]},
]
# Render the conversation into the chat prompt; the audio itself is passed separately below.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Load every referenced audio clip at the sampling rate the feature extractor expects.
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(librosa.load(
                    BytesIO(urlopen(ele['audio_url']).read()),
                    sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

# Generate a reply, then strip the prompt tokens before decoding the new text.
generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
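
To inspect the answer, or to keep the dialogue going, you can print the decoded response and append it as an assistant turn before adding the next spoken user message (a small continuation sketch following the conversation format above):

print(response)

# To continue the voice chat, append the answer as an assistant turn; the next
# user turn would again contain only an {"type": "audio", ...} entry (sketch).
conversation.append({"role": "assistant", "content": response})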

In voice chat mode, the user input is simply audio without any text; the instructions from the user are inside the audio. Next is an example of audio analysis:

from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct", device_map="auto")

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'}, 
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()), 
                        sr=processor.feature_extractor.sampling_rate)[0]
                )

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs.input_ids = inputs.input_ids.to("cuda")

generate_ids = model.generate(**inputs, max_length=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

In comparison, audio analysis mode adds text instructions alongside the audio. Switching between the two modes only requires changing the user inputs; there is no need to adjust anything else, such as the system prompt, as the snippet below illustrates.
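
For instance, the only structural difference between the two modes is the user message itself; the two shapes below reuse the audio URLs and instructions from the examples above:

# Voice chat: the user turn carries only audio; the instruction is spoken.
voice_chat_turn = {"role": "user", "content": [
    {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
]}

# Audio analysis: the user turn pairs audio with a written instruction.
audio_analysis_turn = {"role": "user", "content": [
    {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
    {"type": "text", "text": "What does the person say?"},
]}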

Next Step

This time we bring a new audio-language model, Qwen2-Audio, which supports both voice chat and audio analysis and understands more than 8 languages and dialects. In the near future, we plan to train improved Qwen2-Audio models on larger pretraining datasets, enabling the model to support longer audio inputs (over 30 seconds). We also plan to build larger Qwen2-Audio models to explore the scaling laws of audio-language models.