Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

UNet-VITS: Elevating Single-Stage TTS Quality with Spectral Restoration and Post-Processing Optimization

Min Zheng¹,*, Danqing Liu², Tengyue Yang¹, Haoyu Liu¹, Yanhui Guo³
  • 1: College of Computer, Qinghai Normal University, Xining, China
  • 2: College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu, China
  • 3: College of Artificial Intelligence, Shandong Women’s University, Shandong, China
*Contact email: zhengmin824@foxmail.com

Abstract

Recent advances in single-stage text-to-speech (TTS) synthesis have shown strong performance compared with conventional pipeline systems. However, state-of-the-art models such as VITS2 still suffer from insufficient naturalness, prosodic breaks in long utterances, limited spectral accuracy caused by the loss of high-frequency details, and a strong dependence on speaking styles seen during training. This study proposes UNet-VITS, an enhanced VITS2-based architecture that improves single-stage TTS quality through three complementary changes. First, a multi-scale feature fusion mechanism enhances the latent feature representation and captures both local phonetic details and global prosodic structures. Second, residual blocks with SnakeBeta activation are introduced to improve gradient flow, increase model capacity, and better model speech harmonic structures and F0 trajectories. Third, a U-Net-based post-processing module treats mel-spectrograms as acoustic images and uses multi-scale skip connections to refine F0-informed spectral details, enhancing audio quality post hoc. Experiments on LJ Speech and VCTK demonstrate improvements in speech naturalness, spectral fidelity, pitch consistency, and speaker similarity while maintaining competitive training efficiency. This work provides a reusable full-chain optimization paradigm for single-stage TTS research.
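
The abstract does not spell out the SnakeBeta activation, so the following is only a minimal sketch of the form commonly used in neural vocoders such as BigVGAN, x + (1/β)·sin²(αx), where α sets the frequency of the periodic term and β its magnitude. The function names, the scalar parameters, and the `eps` guard are assumptions for illustration; in practice α and β are usually trainable per-channel tensors, and the paper's own parameterization may differ.

```python
import math


def snake_beta(x, alpha=1.0, beta=1.0, eps=1e-9):
    """Assumed SnakeBeta form: x + (1 / beta) * sin^2(alpha * x).

    alpha controls the frequency of the periodic component,
    beta controls its magnitude; eps avoids division by zero.
    """
    return x + (1.0 / (beta + eps)) * math.sin(alpha * x) ** 2


def apply_snake_beta(values, alpha=1.0, beta=1.0):
    """Apply the activation elementwise to a sequence of floats."""
    return [snake_beta(v, alpha, beta) for v in values]
```

The identity term keeps the function monotonic-friendly for gradient flow, while the sin² term adds a periodic inductive bias, which is one plausible reason such activations suit harmonic structure in speech.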

Keywords
Speech Synthesis, U-Net, SnakeBeta Activation, End-to-End Text-to-Speech
Published
2026-06-17
Publisher
EAI
http://dx.doi.org/10.4108/eai.18-12-2025.2365268
Copyright © 2025–2026 EAI