
Research Article
UNet-VITS: Elevating Single-Stage TTS Quality with Spectral Restoration and Post-Processing Optimization
@INPROCEEDINGS{10.4108/eai.18-12-2025.2365268,
  author={Min Zheng and Danqing Liu and Tengyue Yang and Haoyu Liu and Yanhui Guo},
  title={UNet-VITS: Elevating Single-Stage TTS Quality with Spectral Restoration and Post-Processing Optimization},
  proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
  publisher={EAI},
  proceedings_a={IIKI},
  year={2026},
  month={6},
  keywords={Speech Synthesis, U-Net, SnakeBeta Activation, End-to-End Text-to-Speech},
  doi={10.4108/eai.18-12-2025.2365268}
}
Min Zheng
Danqing Liu
Tengyue Yang
Haoyu Liu
Yanhui Guo
Year: 2026
IIKI
EAI
DOI: 10.4108/eai.18-12-2025.2365268
Abstract
Recent advances in single-stage text-to-speech (TTS) synthesis have shown strong performance compared with conventional pipeline systems. However, state-of-the-art models such as VITS2 still suffer from insufficient naturalness, prosodic breaks in long utterances, limited spectral accuracy caused by the loss of high-frequency details, and strong dependency on seen speaking styles. This study proposes UNet-VITS, an enhanced VITS2-based architecture designed to improve single-stage TTS quality through three synergistic technical improvements. First, a multi-scale feature fusion mechanism is employed to enhance latent feature representation and capture both local phonetic details and global prosodic structures. Second, residual blocks with SnakeBeta activation are introduced to optimize gradient flow, increase model capacity, and improve the modeling of speech harmonic structures and F0 trajectories. Third, a U-Net-based post-processing module treats mel-spectrograms as acoustic images and uses multi-scale skip connections to refine F0-informed spectral details, achieving post-hoc enhancement of audio quality. Experiments on LJ Speech and VCTK demonstrate improvements in speech naturalness, spectral fidelity, pitch consistency, and speaker similarity while maintaining competitive training efficiency. This work provides a reusable full-chain optimization paradigm for single-stage TTS research.
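The abstract names SnakeBeta activation but does not give its formula. As a minimal sketch, assuming the form commonly used in neural vocoders, SnakeBeta(x) = x + (1/β)·sin²(αx), the function can be written as below. The names `snake_beta`, `alpha`, `beta`, and `eps` are illustrative assumptions rather than the authors' code. In a trained model, α and β would typically be learnable per-channel parameters and not fixed scalars.

```python
import math


def snake_beta(x, alpha=1.0, beta=1.0, eps=1e-9):
    """Assumed SnakeBeta form: x + (1 / beta) * sin^2(alpha * x).

    alpha scales the frequency of the periodic term, and beta scales its
    magnitude. eps guards against division by zero when beta is near 0.
    """
    return x + (1.0 / (beta + eps)) * math.sin(alpha * x) ** 2


def snake_beta_seq(xs, alpha=1.0, beta=1.0):
    """Apply the activation element-wise to a sequence of floats."""
    return [snake_beta(x, alpha, beta) for x in xs]


if __name__ == "__main__":
    samples = [-1.0, 0.0, 0.5, math.pi / 2]
    for x, y in zip(samples, snake_beta_seq(samples, alpha=2.0, beta=0.5)):
        print(f"x={x:+.3f} -> snake_beta={y:+.3f}")
```

The periodic sin² term is what makes this family of activations a plausible fit for harmonic structure, while the identity term x keeps the function monotonic enough for stable gradient flow.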


