UNet-VITS: Elevating Single-Stage TTS Quality with Spectral Restoration and Post-Processing Optimization

Min Zheng; Danqing Liu; Tengyue Yang; Haoyu Liu; Yanhui Guo

Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article


@INPROCEEDINGS{10.4108/eai.18-12-2025.2365268,
    author={Min Zheng and Danqing Liu and Tengyue Yang and Haoyu Liu and Yanhui Guo},
    title={UNet-VITS: Elevating Single-Stage TTS Quality with Spectral Restoration and Post-Processing Optimization},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={Speech Synthesis, U-Net, SnakeBeta Activation, End-to-End Text-to-Speech},
    doi={10.4108/eai.18-12-2025.2365268}
}


Min Zheng¹,*, Danqing Liu², Tengyue Yang¹, Haoyu Liu¹, Yanhui Guo³

1: College of Computer, Qinghai Normal University, Xining, China
2: College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
3: College of Artificial Intelligence, Shandong Women’s University, Shandong, China

*Contact email: zhengmin824@foxmail.com

Abstract

Recent advances in single-stage text-to-speech (TTS) synthesis have shown strong performance compared with conventional pipeline systems. However, state-of-the-art models such as VITS2 still suffer from insufficient naturalness, prosodic breaks in long utterances, limited spectral accuracy caused by the loss of high-frequency details, and strong dependency on seen speaking styles. This study proposes UNet-VITS, an enhanced VITS2-based architecture designed to improve single-stage TTS quality through three synergistic technical improvements. First, a multi-scale feature fusion mechanism is employed to enhance latent feature representation and capture both local phonetic details and global prosodic structures. Second, residual blocks with SnakeBeta activation are introduced to optimize gradient flow, increase model capacity, and improve the modeling of speech harmonic structures and F0 trajectories. Third, a U-Net-based post-processing module treats mel-spectrograms as acoustic images and uses multi-scale skip connections to refine F0-informed spectral details, achieving post-hoc enhancement of audio quality. Experiments on LJ Speech and VCTK demonstrate improvements in speech naturalness, spectral fidelity, pitch consistency, and speaker similarity while maintaining competitive training efficiency. This work provides a reusable full-chain optimization paradigm for single-stage TTS research.
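The abstract names the SnakeBeta activation without giving its formula. As a minimal sketch, assuming the form commonly used in neural vocoders, f(x) = x + (1/β)·sin²(αx), the function can be written as below. The name `snake_beta`, the default values, and the `eps` guard are illustrative assumptions, not details taken from the paper; in a trained model α and β would typically be learnable per-channel parameters.

```python
import math


def snake_beta(x, alpha=1.0, beta=1.0, eps=1e-9):
    """Assumed SnakeBeta form: x + (1 / beta) * sin^2(alpha * x).

    alpha sets the frequency of the periodic term and beta sets its
    magnitude. eps is a hypothetical guard against division by zero.
    """
    return x + (1.0 / (beta + eps)) * math.sin(alpha * x) ** 2


# Apply the activation element-wise to a small list of pre-activations.
pre_activations = [-1.0, 0.0, 0.5, 2.0]
activated = [snake_beta(v, alpha=2.0, beta=0.5) for v in pre_activations]
```

The identity term keeps the function monotone-friendly for gradient flow, while the sin² term adds a periodic bias, which is the usual motivation for applying this family of activations to harmonic signals such as speech.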

Keywords: Speech Synthesis, U-Net, SnakeBeta Activation, End-to-End Text-to-Speech

Published: 2026-06-17
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.18-12-2025.2365268
