LSTM-based Speech Segmentation for TTS Synthesis

Hanzlíček, Zdeněk; Vít, Jakub; Tihelka, Daniel

Title:	LSTM-based Speech Segmentation for TTS Synthesis
Other Titles:	Segmentace řeči založená na LSTM pro TTS syntézu
Authors:	Hanzlíček, Zdeněk; Vít, Jakub; Tihelka, Daniel
Citation:	HANZLÍČEK, Z., VÍT, J., TIHELKA, D. LSTM-based Speech Segmentation for TTS Synthesis. In: Text, Speech, and Dialogue: 22nd International Conference, TSD 2019, Ljubljana, Slovenia, September 11-13, 2019, Proceedings. Cham: Springer, 2019. pp. 361-372. ISBN 978-3-030-27946-2, ISSN 0302-9743.
Issue Date:	2019
Publisher:	Springer
Document type:	conference paper (conferenceObject)
URI:	2-s2.0-85072850106 http://hdl.handle.net/11025/36611
ISBN:	978-3-030-27946-2
ISSN:	0302-9743
Keywords:	Speech segmentation, speech synthesis, LSTM neural networks
Abstract:	This paper describes experiments on speech segmentation for the purposes of text-to-speech synthesis. We used a bidirectional LSTM neural network for framewise phone classification and another bidirectional LSTM network for predicting the duration of particular phones. The proposed segmentation procedure combines both outputs and finds the optimal speech-phoneme alignment using dynamic programming. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on 2 professional voices and 2 amateur voices. A comparison with a reference HMM-based segmentation with additional manual corrections was performed. Preference listening tests showed that the reference and experimental segmentations are equivalent when used in a unit selection TTS system.
Rights:	Full text is not accessible. © Springer
Appears in Collections:	Konferenční příspěvky / Conference Papers (KKY) OBD
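
The alignment step summarized in the abstract (framewise phone scores combined with predicted phone durations, resolved by dynamic programming) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `align`, the log-ratio duration penalty, and the parameters `dur_weight` and `max_dur` are assumptions made for illustration only. The inputs stand in for the outputs of the two networks: `frame_logp[t][p]` is the log-probability of phone class `p` at frame `t`, and `pred_dur[n]` is the predicted length, in frames, of the n-th phone of the utterance.

```python
import math


def align(frame_logp, phones, pred_dur, dur_weight=1.0, max_dur=50):
    """Align a known phone sequence to frames by dynamic programming.

    frame_logp[t][p]: log-probability of phone class p at frame t.
    phones: phone class indices in utterance order.
    pred_dur: predicted duration in frames (> 0) for each entry of `phones`.
    dur_weight, max_dur: hypothetical tuning parameters (assumptions).
    Returns a list of (start, end) frame ranges, end exclusive.
    """
    T, N = len(frame_logp), len(phones)
    NEG = -math.inf

    # prefix[n][t] = sum of frame_logp[0..t-1][phones[n]], so that the
    # acoustic score of any segment is a difference of two prefix sums.
    prefix = []
    for p in phones:
        acc = [0.0]
        for t in range(T):
            acc.append(acc[-1] + frame_logp[t][p])
        prefix.append(acc)

    # best[n][t]: best score for aligning the first n phones to the first
    # t frames; back[n][t]: start frame of the n-th phone on that path.
    best = [[NEG] * (T + 1) for _ in range(N + 1)]
    back = [[0] * (T + 1) for _ in range(N + 1)]
    best[0][0] = 0.0

    for n in range(1, N + 1):
        for t in range(n, T + 1):
            for d in range(1, min(max_dur, t - (n - 1)) + 1):
                s = t - d
                prev = best[n - 1][s]
                if prev == NEG:
                    continue
                acoustic = prefix[n - 1][t] - prefix[n - 1][s]
                # Assumed penalty: absolute log-ratio between the candidate
                # duration and the predicted duration.
                penalty = dur_weight * abs(math.log(d / pred_dur[n - 1]))
                score = prev + acoustic - penalty
                if score > best[n][t]:
                    best[n][t] = score
                    back[n][t] = s

    if best[N][T] == NEG:
        raise ValueError("no valid alignment for the given constraints")

    # Backtrack from the final cell to recover the phone boundaries.
    segments = []
    t = T
    for n in range(N, 0, -1):
        s = back[n][t]
        segments.append((s, t))
        t = s
    segments.reverse()
    return segments


# Toy input: four frames favouring class 0, then two favouring class 1.
hi, lo = math.log(0.9), math.log(0.1)
frames = [[hi, lo]] * 4 + [[lo, hi]] * 2
print(align(frames, phones=[0, 1], pred_dur=[4, 2]))  # [(0, 4), (4, 6)]
```

The sketch scores every candidate segment explicitly, which costs on the order of N·T·max_dur operations; a practical system would likely restrict the search further.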

Files in This Item:

File	Size	Format
Hanzlicek2019_Chapter_LSTM-BasedSpeechSegmentationFo.pdf	422.4 kB	Adobe PDF


Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/36611
