Time to Speak Some Dialects, Qwen-TTS!

Introduction

Here we introduce the latest update of Qwen-TTS (qwen-tts-latest or qwen-tts-2025-05-22), available through the Qwen API. Trained on a large-scale dataset encompassing millions of hours of speech, Qwen-TTS achieves human-level naturalness and expressiveness. It automatically adjusts prosody, pacing, and emotional inflection in response to the input text. Notably, this update supports the generation of three Chinese dialects: Pekingese, Shanghainese, and Sichuanese.

Qwen-TTS currently supports seven Chinese-English bilingual voices: Cherry, Ethan, Chelsie, Serena, Dylan (Pekingese), Jada (Shanghainese), and Sunny (Sichuanese). More languages and stylistic options will be released in the near future.
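
The voice list above can be captured in a small lookup table so that a script can pick a voice by dialect. This is a minimal sketch: `VOICE_DIALECTS` and `voices_for` are hypothetical helper names, not part of the Qwen API. Only the voice names and dialects come from the list above.

```python
# Voice names and dialects as listed in this post; None marks the
# voices for which no dialect is given.
VOICE_DIALECTS = {
    "Cherry": None,
    "Ethan": None,
    "Chelsie": None,
    "Serena": None,
    "Dylan": "Pekingese",
    "Jada": "Shanghainese",
    "Sunny": "Sichuanese",
}


def voices_for(dialect=None):
    """Return the voice names listed for a dialect (None = no dialect listed)."""
    return [name for name, d in VOICE_DIALECTS.items() if d == dialect]


print(voices_for("Sichuanese"))  # ['Sunny']
print(voices_for())  # ['Cherry', 'Ethan', 'Chelsie', 'Serena']
```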

Samples of Chinese Dialects

Here are some samples that showcase Qwen-TTS’s ability to capture dialects and natural speech patterns.

Speaker	Dialect	Text
Dylan	Beijing	我们家那边后面有一个后山，就护城河那边，完了呢我们就在山上啊就其实也没什么，就是在土坡上跑来跑去，然后谁捡个那个嗯比较威风的棍，完了我们就呃得瞎打呃，要不就是什么掏个洞啊什么的。
Dylan	Beijing	得有自己的想法，别净跟着别人瞎起哄，多动动脑子，有点儿结构化的思维啥的。
Jada	Shanghai	侬只小赤佬，啊呀，数学句子错它八道题，还想吃肯德基啊！夜到麻将队三缺一啊，嘿嘿，叫阿三头来顶嘛！哦，提前上料这样产品，还要卖300块硬币啊。
Jada	Shanghai	侬来帮伊向暖吧，天光已经暗转亮哉。
Sunny	Sichuan	胖娃胖嘟嘟，骑马上成都，成都又好耍。胖娃骑白马，白马跳得高。胖娃耍关刀，关刀耍得圆。胖娃吃汤圆。
Sunny	Sichuan	他一辈子的使命就是不停地爬哟，爬到大海头上去，不管有好多远！

Additional Results

Qwen-TTS has demonstrated human-level performance. Its metrics on the SeedTTS-Eval benchmark are shown below:

Speaker	WER zh (↓)	WER en (↓)	WER hard (↓)	SIM zh (↑)	SIM en (↑)	SIM hard (↑)
Chelsie	1.256	2.004	6.171	0.658	0.473	0.662
Serena	1.495	2.206	7.394	0.804	0.508	0.803
Ethan	1.489	1.969	6.754	0.777	0.558	0.779
Cherry	1.209	1.967	6.069	0.799	0.664	0.801
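
For readers unfamiliar with the two metrics: WER (word error rate) is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length, and SIM is commonly a cosine similarity between speaker embeddings. The sketch below illustrates only these textbook definitions. It is an assumption that they correspond to the benchmark's pipeline (ASR model, text normalization, embedding model), which this post does not describe, and `word_error_rate` and `cosine_similarity` are illustrative names.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


# One deleted word out of five reference words.
print(word_error_rate("the brakes are not working", "the brakes are working"))  # 0.2
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0
```

For Chinese text, the same distance is usually computed over characters rather than whitespace-separated words.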

Here are some Chinese-English bilingual samples for these four speakers:

Speaker	Text
Cherry	对吧！我就特别喜欢这种超市，尤其是过年的时候，去逛超市就觉得超级超级开心，然后买点儿东西就要买好多好多东西，这个也想买那个也想买，然后买一堆东西带回去。
Cherry	Take a look at http://www.granite.ab.ca/access/email.
Ethan	啊？真的假的？他们俩拍吻戏。可是我觉得他们两个没有CP感欸。
Ethan	Jane's eyes wide with terror, she screamed, "The brakes aren't working! What do we do now? We're completely trapped, and we're heading straight for that wall, I can't stop it!" Then, a strange calm washed over her as she murmured, "Well, at least the view was nice. It's almost poetic, this beautiful scene for our grand finale, isn't it?"
Chelsie	哼！还让不让人好好减肥啦，不行，你要请我一顿好的赔偿我哦。
Chelsie	"Oh my gosh! Are we really going to the Maldives? That’s unbelievable!" Jennie squealed.
Serena	小狗，么么么么么，你快看它，它好可爱。哇，它倒立了是不是，太厉害了！好呀，我们要不要用狗粮把它拐回家，嗯？
Serena	You can call me directly at 4257037344 or my cell 4254447474 or send me a meeting request with all the appropriate information.

How to use

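Using Qwen-TTS through the Qwen API is simple. The code snippet below synthesizes a sentence and saves the audio to a local file. It requires the dashscope and requests packages, and reads the API key from the DASHSCOPE_API_KEY environment variable: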

import os
import requests
import dashscope


def get_api_key():
    api_key = os.getenv("DASHSCOPE_API_KEY")
    if not api_key:
        raise EnvironmentError("DASHSCOPE_API_KEY environment variable not set.")
    return api_key


def synthesize_speech(text, voice="Dylan", model="qwen-tts-latest"):
    api_key = get_api_key()
    try:
        response = dashscope.audio.qwen_tts.SpeechSynthesizer.call(
            model=model,
            api_key=api_key,
            text=text,
            voice=voice,
        )
        
        # Check if response is None
        if response is None:
            raise RuntimeError("API call returned None response")
        
        # Check if response.output is None
        if response.output is None:
            raise RuntimeError("API call failed: response.output is None")
        
        # Check if response.output.audio exists
        if not hasattr(response.output, 'audio') or response.output.audio is None:
            raise RuntimeError("API call failed: response.output.audio is None or missing")
        
        audio_url = response.output.audio["url"]
        return audio_url
    except Exception as e:
        # Chain the original exception so the traceback keeps the root cause.
        raise RuntimeError(f"Speech synthesis failed: {e}") from e


def download_audio(audio_url, save_path):
    try:
        resp = requests.get(audio_url, timeout=10)
        resp.raise_for_status()
        with open(save_path, 'wb') as f:
            f.write(resp.content)
        print(f"Audio file saved to: {save_path}")
    except Exception as e:
        # Chain the original exception so the traceback keeps the root cause.
        raise RuntimeError(f"Download failed: {e}") from e


def main():
    text = (
        """哟，您猜怎么着？今儿个我看NBA，库里投篮跟闹着玩似的，张手就来，篮筐都得喊他“亲爹”了"""
    )
    save_path = "downloaded_audio.wav"
    try:
        audio_url = synthesize_speech(text)
        download_audio(audio_url, save_path)
    except Exception as e:
        print(e)


if __name__ == "__main__":
    main()
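
To try all three dialect voices in one run, the two steps from the snippet above (synthesize, then download) can be looped over a list of voice/text pairs. The following is a sketch, not part of the official example: `synthesize_all` is a hypothetical helper that takes the synthesis and download functions as arguments, so it can be exercised with stand-ins before any API call is made. The texts are taken from the dialect samples earlier in this post.

```python
def synthesize_all(samples, synthesize, download, prefix="sample"):
    """Synthesize each (voice, text) pair and save it to its own file.

    `synthesize(text, voice=...)` must return an audio URL and
    `download(url, path)` must save it, matching the signatures of
    `synthesize_speech` and `download_audio` above.
    Returns the list of file paths passed to `download`.
    """
    saved = []
    for index, (voice, text) in enumerate(samples, start=1):
        path = f"{prefix}_{index}_{voice}.wav"
        url = synthesize(text, voice=voice)
        download(url, path)
        saved.append(path)
    return saved


DIALECT_SAMPLES = [
    ("Dylan", "得有自己的想法，别净跟着别人瞎起哄，多动动脑子，有点儿结构化的思维啥的。"),
    ("Jada", "侬来帮伊向暖吧，天光已经暗转亮哉。"),
    ("Sunny", "他一辈子的使命就是不停地爬哟，爬到大海头上去，不管有好多远！"),
]

# Dry run with stand-in functions; no network access is needed.
calls = []
paths = synthesize_all(
    DIALECT_SAMPLES,
    synthesize=lambda text, voice: f"https://example.invalid/{voice}.wav",
    download=lambda url, path: calls.append((url, path)),
)
print(paths)  # ['sample_1_Dylan.wav', 'sample_2_Jada.wav', 'sample_3_Sunny.wav']
```

With the functions from the snippet above in scope, the real run would be `synthesize_all(DIALECT_SAMPLES, synthesize_speech, download_audio)`.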

Summary

Qwen-TTS is a text-to-speech model that supports Chinese-English bilingual synthesis and several Chinese dialects. It aims to produce natural and expressive speech, and is available via API. While it shows promising results, we look forward to further improvements and broader language support in the future.
