Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Paper under double-blind review

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues in speech-text alignment modeling: 1) models without explicit speech-text alignment modeling are less robust, especially on hard sentences in practical applications; 2) models based on predefined alignments suffer from the naturalness constraints of forced alignments. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples can be found on this demo page.
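The 8-step generation relies on the rectified-flow family of samplers, which integrate a learned velocity field with a handful of Euler steps. Below is a minimal sketch under our own assumptions: a hypothetical velocity_model(x, t, cond) interface and uniform time steps; the paper's piecewise reflow schedule and exact conditioning are not reproduced here.

    import torch

    @torch.no_grad()
    def sample(velocity_model, cond, shape, num_steps=8):
        # Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data)
        # with plain Euler steps; a few steps suffice once the flow has
        # been (piecewise) rectified, i.e. is nearly straight per segment.
        x = torch.randn(shape)                       # start from Gaussian noise
        ts = torch.linspace(0.0, 1.0, num_steps + 1)
        for i in range(num_steps):
            t, dt = ts[i], ts[i + 1] - ts[i]
            v = velocity_model(x, t.expand(shape[0]), cond)
            x = x + dt * v                           # one Euler step along the flow
        return x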

Model Overview

(a) The WaveVAE model; (b) Overview of S-DiT. We insert sparse alignment anchors into the latent vector sequence to provide coarse alignment information; the transformer blocks in S-DiT then build fine-grained alignment paths automatically.
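As a rough illustration of the anchor mechanism, the sketch below places each phoneme's embedding at a small number of frames inside its coarse duration span and leaves the remaining frames unanchored, so the transformer is free to resolve the fine-grained path. All names and the additive conditioning are assumptions for illustration, not the released implementation.

    import torch

    def insert_sparse_anchors(latents, phoneme_emb, durations, anchors_per_phone=1):
        # latents: [T, D] noisy latent frames; phoneme_emb: [N, D] phoneme
        # embeddings; durations: [N] coarse frame counts per phoneme.
        T, D = latents.shape
        cond = torch.zeros(T, D)                 # unanchored frames carry no text info
        starts = torch.cumsum(durations, 0) - durations
        for i in range(phoneme_emb.shape[0]):
            s, d = int(starts[i]), int(durations[i])
            if d <= 0:
                continue
            # pick a few frames inside the phoneme's coarse span as anchors
            picks = s + torch.randint(d, (anchors_per_phone,))
            cond[picks.clamp(max=T - 1)] = phoneme_emb[i]
        return latents + cond                    # the model infers the fine-grained path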

Zero-Shot TTS

We list the speech examples on the LibriSpeech benchmark here.

Columns: Target Text | Prompt | S-DiT (0.3B) | F5-TTS | NaturalSpeech 3 | VoiceBox | CosyVoice | ARDiT | StyleTTS 2 | HierSpeech++

Target text: His death in this conjuncture was a public misfortune.
    Sim-O: 0.7777, 0.7727, 0.7577

Target text: For if he's anywhere on the farm, we can send for him in a minute.
    Sim-O: 0.8587, 0.7473, 0.6905

Target text: John Wesley Combash, Jacob Taylor, and Thomas Edward Skinner.
    Sim-O: 0.8291, 0.6916, 0.7662

Target text: The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three-wire system, and it gave an immediate impetus to incandescent lighting.
    Sim-O: 0.8702, 0.7334, 0.8103


Accent Intensity Control

Note that CTA-TTS only supports generating standard English and English with a Chinese accent (Chinglish).
The target text is "Unconsciously, our yells and exclamations yielded to this rhythm".
We use S-DiT (0.3B) in this study.
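Accent intensity is adjusted with the multi-condition classifier-free guidance mentioned in the abstract. Here is a minimal sketch of one plausible decomposition, assuming the denoiser accepts text and accent conditions that can be dropped independently; the guidance weights and the exact factorization are illustrative, not the paper's published formula.

    import torch

    @torch.no_grad()
    def multi_cond_cfg(model, x, t, text, accent, w_text=2.0, w_accent=1.0):
        v_uncond = model(x, t, text=None, accent=None)    # fully unconditional
        v_text   = model(x, t, text=text, accent=None)    # text condition only
        v_full   = model(x, t, text=text, accent=accent)  # text + accent
        # w_accent scales the accent direction: larger values strengthen
        # the accent, while values near 0 approach standard pronunciation.
        return (v_uncond
                + w_text * (v_text - v_uncond)
                + w_accent * (v_full - v_text))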

Columns: Accent Type | Prompt | S-DiT (Accented) | S-DiT (Standard) | CTA-TTS (Accented) | CTA-TTS (Standard)

Accent types: Mandarin, Vietnamese, Hindi. The CTA-TTS samples are unavailable ("-") for Vietnamese and Hindi.

Incredible Improvement Brought by Scaling

We compare S-DiT (0.3B) with S-DiT (1.5B) and S-DiT (7.0B) here. All three models are trained on the same 600k-hour internal dataset.

Columns: Capability | Target Text | Prompt | S-DiT (0.3B) | S-DiT (1.5B) | S-DiT (7.0B)

Paralinguistic: 老夫人疑惑,“娇娇,你今儿是怎么了?”平日里,她的娇娇儿和萧弈的关系也没这么好…… (The old madam wondered, “Jiaojiao, what has gotten into you today?” Ordinarily, her darling Jiaojiao was never this close with Xiao Yi...)

Expressiveness: Ultimate innovation and exceptional design, they lead the modern life styles and creating countless legends, this is, the Apple Inc.

Duration Control

For phoneme-level duration control, we only adjust the speaking rate of the span "about both the size of the perimeter".
We use S-DiT (0.3B) in this study.
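Phoneme-level control can be obtained by rescaling only the predicted durations of the selected span before the alignment anchors are laid out. Below is a sketch with hypothetical names; the span indices are assumed to cover the phonemes of the quoted words.

    def scale_span_durations(durations, span, rate):
        # durations: per-phoneme frame counts; span: (start, end) phoneme
        # indices. rate > 1 speeds the span up and rate < 1 slows it down,
        # so each duration inside the span is divided by the rate.
        out = list(durations)
        for i in range(span[0], span[1]):
            out[i] = max(1, round(out[i] / rate))
        return out

    # e.g. durations = scale_span_durations(durations, span=(5, 12), rate=0.9)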

Columns: Control Type | Target Text | Speed = 0.9 | Speed = 1.0 | Speed = 1.1

Sentence-Level: Notably, raising questions about both the size of the perimeter and efforts to sweep and secure.

Phoneme-Level: the same target text; only the marked span is rescaled.

Robustness

The first prompt is derived from MaskGCT's demo page; the second prompt is derived from Bailing-TTS's demo page.

Columns: Speaker | Target Text | Prompt | S-DiT (0.3B)

Trump: She sells sea shells by the seashore. The shells she sells are surely seashells.

Child (example from Bailing-TTS): 针蓝线蓝领子蓝,蓝针蓝线蓝领蓝。蓝针蓝线连蓝领,针蓝线蓝领子蓝。 (A Chinese tongue twister on blue needles, blue thread, and blue collars.)

Code-Switched Generation

The first and second prompts are derived from the public LibriSpeech dataset.
The third prompt is derived from Seed-TTS's demo page, and the fourth and fifth prompts are derived from online videos.

Columns: Speaker | Target Text | Prompt | S-DiT (0.3B)

Spk 1 from LibriSpeech: 你昨天的performance真是outstanding,完全展示了你的skills。 (Your performance yesterday was truly outstanding and fully showcased your skills.)

Spk 2 from LibriSpeech: 我觉得我们需要一个更clear的strategy来实现我们的goals。 (I think we need a clearer strategy to achieve our goals.)

容嚒嚒 (Meme Rong): 这次旅行的schedule有点tight,我们需要plan得更efficient一些。 (The schedule for this trip is a bit tight; we need to plan more efficiently.)

Elon Musk: 他今天的mood看起来不太好,可能需要一些space。 (His mood doesn't look great today; he may need some space.)

丁真 (Zhen Ding): 在这big城市,充满着layers的层次,时间仿佛freeze停滞,我只想回到Snow Leopard身边再来一次! (In this big city, full of layers upon layers, time seems to freeze; I just want to return to Snow Leopard's side and do it all again!)

Attention Matrices Visualization

We visualize the attention matrices of the S-DiT model and observe the following:

  • The function of each layer stays consistent across different timesteps;
  • The layers can be categorized into three types:
    • The bottom layers handle text and audio feature extraction;
    • The middle layers focus on speech-text alignment;
    • The top layers refine the target latent features.

Figure: attention matrices from (a) the bottom layers, (b) the middle layers, and (c) the top layers.
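Maps like these can be collected by hooking each transformer block's attention module and averaging the weights over heads. The sketch below assumes the maps have already been gathered into a dict and only handles the plotting; the interface is our own, not part of the S-DiT codebase.

    import matplotlib.pyplot as plt

    def plot_attention(attn_maps, layer_ids, path="attn.png"):
        # attn_maps: dict layer_id -> [heads, T_q, T_k] attention weights
        fig, axes = plt.subplots(1, len(layer_ids),
                                 figsize=(4 * len(layer_ids), 4), squeeze=False)
        for ax, lid in zip(axes[0], layer_ids):
            ax.imshow(attn_maps[lid].mean(0), aspect="auto", origin="lower")
            ax.set_title(f"layer {lid}")
            ax.set_xlabel("key position")
            ax.set_ylabel("query position")
        fig.savefig(path, dpi=150)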

Examples of Potential Failure Cases

We investigate the following four cases:
    1) only the left/right boundary frame is chosen as the anchor for sparse alignment;
    2) significant Gaussian noise is added to the predicted duration values
        (to simulate a poorly performing duration predictor; see the sketch after the conclusions below);
    3) extremely long or short phoneme durations;
    4) long text inputs.
We draw the following conclusions for our model with sparse alignment:
    1) it is robust to poorly chosen anchors;
    2) it is more robust to noisy durations (a poor duration predictor);
    3) it is robust to extremely long or short phoneme durations;
    4) it can generate speech from long text inputs.
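For reference, the Case 2 perturbation can be simulated as below; the multiplicative log-normal form and the noise scale are our assumptions for illustration.

    import torch

    def perturb_durations(durations, noise_std=0.3):
        # durations: [N] integer frame counts. Multiplicative exp-Gaussian
        # noise keeps durations positive while corrupting them substantially,
        # emulating a poorly performing duration predictor.
        noise = torch.randn(durations.shape) * noise_std
        noisy = durations.float() * torch.exp(noise)
        return noisy.round().clamp(min=1).long()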

Columns: Setting | Prompt | Ours w/ Sparse Alignment | Ours w/ Forced Alignment | Ours w/o Alignments

Settings: Case 1 (two baseline samples marked "-"), Case 2 (one baseline sample marked "-"), Case 3 - Short, Case 3 - Long, Case 4.

Robustness to Long Sequences Compared with Other AR Models

The target text is:
In recent years, large-scale language models and diffusion models have brought considerable advancements to the field of speech synthesis. Unlike traditional text-to-speech systems, these models are trained on large-scale, multi-domain speech corpora, which contributes to notable improvements in the naturalness and expressiveness of synthesized audio. Given only seconds of speech prompt, these models can synthesize identity-preserving speech in a zero-shot manner.

Columns: Prompt | S-DiT (0.3B) | VoiceCraft