Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Paper under double-blind review

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues in speech-text alignment modeling: 1) models without explicit speech-text alignment modeling are less robust, especially on hard sentences in practical applications; 2) models based on predefined alignments suffer from the naturalness constraints of forced alignments. This paper introduces S-DiT, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples can be found on this demo page.
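The 8-step generation relies on the rectified-flow family of samplers, which integrate a learned velocity field with a handful of Euler steps. Below is a minimal sketch under our own assumptions: a hypothetical velocity_model(x, t, cond) interface and uniform time steps; the paper's piecewise reflow schedule and exact conditioning are not reproduced here.

    import torch

    @torch.no_grad()
    def sample(velocity_model, cond, shape, num_steps=8):
        # Integrate dx/dt = v(x, t, cond) from t=0 (noise) to t=1 (data)
        # with plain Euler steps; a few steps suffice once the flow has
        # been (piecewise) rectified, i.e. is nearly straight per segment.
        x = torch.randn(shape)                       # start from Gaussian noise
        ts = torch.linspace(0.0, 1.0, num_steps + 1)
        for i in range(num_steps):
            t, dt = ts[i], ts[i + 1] - ts[i]
            v = velocity_model(x, t.expand(shape[0]), cond)
            x = x + dt * v                           # one Euler step along the flow
        return x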

Model Overview

(a) The WaveVAE model; (b) Overview of S-DiT. We insert sparse alignment anchors into the latent vector sequence to provide coarse alignment information; the transformer blocks in S-DiT then build fine-grained alignment paths automatically.
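As a rough illustration of the anchor mechanism, the sketch below places each phoneme's embedding at a small number of frames inside its coarse duration span and leaves the remaining frames unanchored, so the transformer is free to resolve the fine-grained path. All names and the additive conditioning are assumptions for illustration, not the released implementation.

    import torch

    def insert_sparse_anchors(latents, phoneme_emb, durations, anchors_per_phone=1):
        # latents: [T, D] noisy latent frames; phoneme_emb: [N, D] phoneme
        # embeddings; durations: [N] coarse frame counts per phoneme.
        T, D = latents.shape
        cond = torch.zeros(T, D)                 # unanchored frames carry no text info
        starts = torch.cumsum(durations, 0) - durations
        for i in range(phoneme_emb.shape[0]):
            s, d = int(starts[i]), int(durations[i])
            if d <= 0:
                continue
            # pick a few frames inside the phoneme's coarse span as anchors
            picks = s + torch.randint(d, (anchors_per_phone,))
            cond[picks.clamp(max=T - 1)] = phoneme_emb[i]
        return latents + cond                    # the model infers the fine-grained path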

Zero-Shot TTS

We list the speech examples on the LibriSpeech benchmark here.

Columns: Target Text | Prompt | S-DiT (0.3B) | F5-TTS | NaturalSpeech 3 | VoiceBox | CosyVoice | ARDiT | StyleTTS 2 | HierSpeech++

Target text: His death in this conjuncture was a public misfortune.
    Sim-O: 0.7777, 0.7727, 0.7577

Target text: For if he's anywhere on the farm, we can send for him in a minute.
    Sim-O: 0.8587, 0.7473, 0.6905

Target text: John Wesley Combash, Jacob Taylor, and Thomas Edward Skinner.
    Sim-O: 0.8291, 0.6916, 0.7662

Target text: The strong position held by the Edison system under the strenuous competition that was already springing up was enormously improved by the introduction of the three-wire system, and it gave an immediate impetus to incandescent lighting.
    Sim-O: 0.8702, 0.7334, 0.8103


Accent Intensity Control

Note that CTA-TTS only supports generating standard English and English with a Chinese accent (Chinglish).
The target text is "Unconsciously, our yells and exclamations yielded to this rhythm".
We use S-DiT (0.3B) in this study.
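Accent intensity is adjusted with the multi-condition classifier-free guidance mentioned in the abstract. Here is a minimal sketch of one plausible decomposition, assuming the denoiser accepts text and accent conditions that can be dropped independently; the guidance weights and the exact factorization are illustrative, not the paper's published formula.

    import torch

    @torch.no_grad()
    def multi_cond_cfg(model, x, t, text, accent, w_text=2.0, w_accent=1.0):
        v_uncond = model(x, t, text=None, accent=None)    # fully unconditional
        v_text   = model(x, t, text=text, accent=None)    # text condition only
        v_full   = model(x, t, text=text, accent=accent)  # text + accent
        # w_accent scales the accent direction: larger values strengthen
        # the accent, while values near 0 approach standard pronunciation.
        return (v_uncond
                + w_text * (v_text - v_uncond)
                + w_accent * (v_full - v_text))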

Columns: Accent Type | Prompt | S-DiT (Accented) | S-DiT (Standard) | CTA-TTS (Accented) | CTA-TTS (Standard)

Accent types: Mandarin, Vietnamese, Hindi. The CTA-TTS samples are unavailable ("-") for Vietnamese and Hindi.

Incredible Improvement Brought by Scaling

We compare S-DiT (0.3B) with S-DiT (1.5B) and S-DiT (7.0B) here. All three models are trained on the same 600k-hour internal dataset.

Columns: Capability | Target Text | Prompt | S-DiT (0.3B) | S-DiT (1.5B) | S-DiT (7.0B)

Paralinguistic: 老夫人疑惑,“娇娇,你今儿是怎么了?”平日里,她的娇娇儿和萧弈的关系也没这么好…… (The old madam wondered, “Jiaojiao, what has gotten into you today?” Ordinarily, her darling Jiaojiao was never this close with Xiao Yi...)

Expressiveness: Ultimate innovation and exceptional design, they lead the modern life styles and creating countless legends, this is, the Apple Inc.

Duration Control

For phoneme-level duration control, we only adjust the speaking rate of the span "about both the size of the perimeter".
We use S-DiT (0.3B) in this study.
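Phoneme-level control can be obtained by rescaling only the predicted durations of the selected span before the alignment anchors are laid out. Below is a sketch with hypothetical names; the span indices are assumed to cover the phonemes of the quoted words.

    def scale_span_durations(durations, span, rate):
        # durations: per-phoneme frame counts; span: (start, end) phoneme
        # indices. rate > 1 speeds the span up and rate < 1 slows it down,
        # so each duration inside the span is divided by the rate.
        out = list(durations)
        for i in range(span[0], span[1]):
            out[i] = max(1, round(out[i] / rate))
        return out

    # e.g. durations = scale_span_durations(durations, span=(5, 12), rate=0.9)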

Columns: Control Type | Target Text | Speed = 0.9 | Speed = 1.0 | Speed = 1.1

Sentence-Level: Notably, raising questions about both the size of the perimeter and efforts to sweep and secure.

Phoneme-Level: the same target text; only the marked span is rescaled.

Robustness

The first prompt is derived from MaskGCT's demo page; the second prompt is derived from Bailing-TTS's demo page.

Columns: Speaker | Target Text | Prompt | S-DiT (0.3B)

Trump: She sells sea shells by the seashore. The shells she sells are surely seashells.

Child (example from Bailing-TTS): 针蓝线蓝领子蓝,蓝针蓝线蓝领蓝。蓝针蓝线连蓝领,针蓝线蓝领子蓝。 (A Chinese tongue twister on blue needles, blue thread, and blue collars.)

Code-Switched Generation

The first and second prompts are derived from the public LibriSpeech dataset.
The third prompt is derived from Seed-TTS's demo page, and the fourth and fifth prompts are derived from online videos.

Columns: Speaker | Target Text | Prompt | S-DiT (0.3B)

Spk 1 from LibriSpeech: 你昨天的performance真是outstanding,完全展示了你的skills。 (Your performance yesterday was truly outstanding and fully showcased your skills.)

Spk 2 from LibriSpeech: 我觉得我们需要一个更clear的strategy来实现我们的goals。 (I think we need a clearer strategy to achieve our goals.)

容嚒嚒 (Meme Rong): 这次旅行的schedule有点tight,我们需要plan得更efficient一些。 (The schedule for this trip is a bit tight; we need to plan more efficiently.)

Elon Musk: 他今天的mood看起来不太好,可能需要一些space。 (His mood doesn't look great today; he may need some space.)

丁真 (Zhen Ding): 在这big城市,充满着layers的层次,时间仿佛freeze停滞,我只想回到Snow Leopard身边再来一次! (In this big city, full of layers upon layers, time seems to freeze; I just want to return to Snow Leopard's side and do it all again!)

Attention Matrices Visualization

We visualize the attention matrices of the S-DiT model and observe the following:

  • The function of each layer stays consistent across different timesteps;
  • The layers can be categorized into three types:
    • The bottom layers handle text and audio feature extraction;
    • The middle layers focus on speech-text alignment;
    • The top layers refine the target latent features.

Figure: attention matrices from (a) the bottom layers, (b) the middle layers, and (c) the top layers.
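Maps like these can be collected by hooking each transformer block's attention module and averaging the weights over heads. The sketch below assumes the maps have already been gathered into a dict and only handles the plotting; the interface is our own, not part of the S-DiT codebase.

    import matplotlib.pyplot as plt

    def plot_attention(attn_maps, layer_ids, path="attn.png"):
        # attn_maps: dict layer_id -> [heads, T_q, T_k] attention weights
        fig, axes = plt.subplots(1, len(layer_ids),
                                 figsize=(4 * len(layer_ids), 4), squeeze=False)
        for ax, lid in zip(axes[0], layer_ids):
            ax.imshow(attn_maps[lid].mean(0), aspect="auto", origin="lower")
            ax.set_title(f"layer {lid}")
            ax.set_xlabel("key position")
            ax.set_ylabel("query position")
        fig.savefig(path, dpi=150)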

Examples of Potential Failure Cases

We investigate the following four cases:
    1) only the left/right boundary frame is chosen as the anchor for sparse alignment;
    2) significant Gaussian noise is added to the predicted duration values
        (to simulate a poorly performing duration predictor; see the sketch after the conclusions below);
    3) extremely long or short phoneme durations;
    4) long text inputs.
We draw the following conclusions for our model with sparse alignment:
    1) it is robust to poorly chosen anchors;
    2) it is more robust to noisy durations (a poor duration predictor);
    3) it is robust to extremely long or short phoneme durations;
    4) it can generate speech from long text inputs.
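For reference, the Case 2 perturbation can be simulated as below; the multiplicative log-normal form and the noise scale are our assumptions for illustration.

    import torch

    def perturb_durations(durations, noise_std=0.3):
        # durations: [N] integer frame counts. Multiplicative exp-Gaussian
        # noise keeps durations positive while corrupting them substantially,
        # emulating a poorly performing duration predictor.
        noise = torch.randn(durations.shape) * noise_std
        noisy = durations.float() * torch.exp(noise)
        return noisy.round().clamp(min=1).long()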

Columns: Setting | Prompt | Ours w/ Sparse Alignment | Ours w/ Forced Alignment | Ours w/o Alignments

Settings: Case 1 (two baseline samples marked "-"), Case 2 (one baseline sample marked "-"), Case 3 - Short, Case 3 - Long, Case 4.

Robustness to Long Sequences Compared with Other AR Models

The target text is:
In recent years, large-scale language models and diffusion models have brought considerable advancements to the field of speech synthesis. Unlike traditional text-to-speech systems, these models are trained on large-scale, multi-domain speech corpora, which contributes to notable improvements in the naturalness and expressiveness of synthesized audio. Given only seconds of speech prompt, these models can synthesize identity-preserving speech in a zero-shot manner.

Columns: Prompt | S-DiT (0.3B) | VoiceCraft