This week I played with `text2wave`, a component of the Festival tool set used on Linux for speech synthesis. I am not very well versed in speech synthesis tools on Linux, but on my machine the HTS and CMU voices sounded better than espeak and mbrola, so I played with it a bit.
The voice produced by `text2wave` is a bit flat, so I searched for some way to annotate the text. I needed to be able:

- to adjust the rate (speed) of speech;
- to adjust the pitch (a higher or lower tone);
- to insert pauses at paragraph breaks;
- to change the voice (speaker) within the text;
- to save the spoken text to a file;
- to script the process so it runs automatically on many files.
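The last requirement, scripting the process over many files, is plain shell work. Here is a minimal sketch of the batch conversion I was after; it assumes `text2wave` is on the PATH and that `-o` names the output file (as its help text states):

```shell
# Convert every given .txt file to a .wav file next to it.
# Assumption: text2wave is installed and -o selects the output file.
convert_all() {
    for f in "$@"; do
        # strip the .txt suffix and append .wav for the output name
        text2wave -o "${f%.txt}.wav" "$f"
    done
}

# Usage: convert_all chapter-*.txt
```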
Promises
Festival promises much. It declares support for `sable` (XML) markup, for which I couldn't find any specification except in the Festival sources. The DTD of this XML declares that I will be able to do everything I need.
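For illustration, here is roughly what such markup looks like, reconstructed from the DTD shipped in the Festival sources. The element names (`SPEAKER`, `RATE`, `PITCH`, `BREAK`) come from that DTD; the exact attribute spellings and the voice name are my best guess, so treat this as a sketch, not gospel:

```shell
# Emit a sample sable document from shell so it stays easy to script.
# The attribute values ("-10%", "Large", "male1") are assumptions.
make_sable() {
    cat <<'EOF'
<?xml version="1.0"?>
<SABLE>
  <SPEAKER NAME="male1">
    <RATE SPEED="-10%">Spoken a bit slower.</RATE>
    <PITCH BASE="+20%">Spoken in a higher tone.</PITCH>
    <BREAK LEVEL="Large"/>
    And back to normal after a (promised) pause.
  </SPEAKER>
</SABLE>
EOF
}

# Usage: make_sable > demo.sable && text2wave -o demo.wav demo.sable
```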
If you actually try it out, however, and then look at the sources, you will notice that:

- The `DIV` element is ignored, and it is the only element that could introduce a pause. This problem is documented in the manual, but why is the element in the DTD then?
- The `BREAK` element is a sophisticated way of entering a space into the text. All parameters declared by the DTD are ignored.
- The `AUDIO` tag requires a URL as its argument instead of a file; it does not understand plain file paths at all. It does, of course, understand `file://` URLs, but that means the path to your audio file must be absolute. This simply rules out any redistribution of your text files.
- The `AUDIO` tag neither validates the sampling rate nor resamples the file. As a result you must guess what the output sampling rate will be and match your file to it. This is a pain in the behind, because `AUDIO` is the only way to inject a pause into the text.
- The `AUDIO` tag promises a "background" and an "insertion" mode but, of course, only "insertion" works. This is documented; the fact that it does not resample the file is not.
- The `EMPH` element in fact uses `RATE` to say a word more slowly. It ignores all its parameters, but at least that is documented in the manual. Again: why is it in the DTD?
- `PITCH` and `RATE` promise a change of speaking speed and tone. It is a promise only: they work only with the low-quality "diphone" voices. The high-quality CMU/HTS voices are not affected by these settings at all, and this is not documented. In fact, even internal Festival commands do not affect them.
- The ranges of attribute values and the meaning of the numbers are documented nowhere. You must guess: when `PITCH` takes n% as an argument, is 100% the "base" pitch or twice the base pitch?
- `VOLUME` does work, although you must guess how to specify the value and, of course, what the base value is and how large the margin is before the sound starts "clicking" due to clipping.
- The `SPEAKER` element, which changes the voice, works... except it doesn't. The voices I have all use different sample rates, while `SPEAKER` applies the first speaker's sampling rate to all voices. In my case the male diphone voice is sampled at a low rate, the HTS voice at the highest rate, and the CMU voices sit in the middle. As a result you cannot switch speakers unless you limit yourself to a very narrow range of voices. If you use `text2wave -F 32000`, however, and are lucky with the chosen number, the problem goes away.
- The `sable` XML parser follows the XML specification in the places it likes and ignores it in the places it doesn't. It collapses sequences of whitespace characters into a single space, as XML requires, but does not even try to understand the XML way of saying "there is a space here that I would like to keep". As a result there is no way to add a pause to the text without the `AUDIO` element.
- The text-to-speech process can take a very long time, yet the XML parser does not pre-parse the input to validate that the syntax is correct. Instead it can run for an hour and then crash in the middle because you messed up a tag.
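Since `AUDIO` is the only working pause mechanism, the practical workaround is an `AUDIO` element pointing at a pre-made silence file through an absolute `file://` URL. A small helper can at least build that tag; the file name `silence-500ms.wav` is my own invention, and you still have to create the file and match its sampling rate to the output yourself:

```shell
# Build an AUDIO pause tag with the absolute file:// URL festival insists on.
emit_pause_tag() {
    case "$1" in
        /*) abs="$1" ;;        # path is already absolute
        *)  abs="$PWD/$1" ;;   # make it absolute relative to the current dir
    esac
    printf '<AUDIO SRC="file://%s"/>\n' "$abs"
}

# Usage: emit_pause_tag silence-500ms.wav
```

Note that the absolute path is baked into the output, which is exactly why such annotated text files cannot be redistributed.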
Documentation
The worst possible ever. It reads like notes made by the person who wrote the program. For example:
```
text2wave [options] textfile
Convert a textfile to a waveform
Options
  -mode <string>   Explicit tts mode.
  -o ofile         File to save waveform (default is stdout).
  -otype <string>  Output waveform type: alaw, ulaw, snd, aiff, riff, nist etc.
                   (default is riff)
  -F <int>         Output frequency.
  -scale <float>   Volume factor
  -eval <string>   File or lisp s-expression to be evaluated before synthesis.
```
Now please tell me: what could the `<string>` in `-mode` be?

And finally... gosh... `riff, nist etc.` Etc.?! Really? What are we playing here, LOTTO? Guess the sequence of letters and win a prize?!
By the way, `-F` is unpredictable and produces errors, especially if you, by accident, set `-F` to match the default sampling rate of the voices.
The names of the voices are cryptic, and the content of the archives at festivox.org is not described. Guys, I wasted a few GB of your server bandwidth only to discover that what I downloaded was not what I was looking for.
Maintenance
The Festival `(manual nil)` command depends on Netscape. Yes, you read that correctly: not on the default OS web browser, but on Netscape. Amazing!
Even though the last big release (2.5) was in 2017, the "latest" link on the server points to revision 2.1.
There are plenty of dependencies on non-existent external documents. For example, the link to the `sable` XML format specification points into the void; you need to reverse-engineer the format from the DTD and the implementing code.
Voices
Voice compatibility is a total misery. The HTS voices, which are the best, can be obtained from the Nitech archive page. They are, however, not upwards compatible and must exactly match the Festival version you have. And you know what? Festival will not validate the compatibility. It will just crash with either a cryptic message or a core dump.
State faults
If one `(SayText "....")` command fails, subsequent calls will also fail. There is no cleanup, nothing. Zero fault tolerance.
Is it total crap then?
Well… both "yes" and "no". It is good scientific work but total crap when it comes to code quality and to what we may broadly describe as "end-user support". This is certainly not a program that should be used in any production environment.
It does, however, speak quite well.
Never ever use common noun as a program name!
I must say it even though some professors may feel offended. A "festival" is a f*ng event during which people dance and play! And a "rhino" is an animal. Using common words as a project name, especially words that have nothing in common with the project, is a strategy an army uses to hide a project from the enemy.

Are we users really your enemy?
Summary
Festival and `text2wave` are just one big disappointment, and one big lost chance. The HTS voices are really good and can compete with today's cloud TTS services. They run locally, however, so there are no security or privacy concerns: you can make Festival say "….enter some phrase which is illegal in your country in here…" and be sure the police will not intercept it.
It could be good. It could be fast. It could be easy to use.
But it is not.
What a shame.
Note: If you need a working, up-to-date and much better sounding TTS, look for a project called "piper". It sounds superior to Festival and is an order of magnitude more reliable. Unfortunately, the only thing you can change is the tempo at which the speaker talks. But at least it is well documented and doesn't make empty promises.
Luckily, I did manage to craft a small bash script which, although far from efficient, allows me to annotate the text in such a way that I can:

- give the "piper" voices friendly names;
- make one of them the 'narrator', who speaks normal text, and another the 'actor', who speaks text enclosed in "double quotes", and switch them on the fly (this option makes books read by TTS sound very attractive);
- change the tempo (with a "piper" parameter) on the fly;
- change the pitch and volume on the fly (with the "sox" project as a mid-processing pipeline; in fact all "sox" effects are possible);
- insert additional pauses automatically when more than three consecutive `\n` characters, dots or whitespace characters are found in the text;
- pack it all into an *.mp3 file with "lame" as a post-processing pipeline.
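As a reduced sketch of how that script hangs together: a sed pre-processor turns runs of three or more dots into a marker, and a later stage turns the marker into actual silence. The `#PAUSE#` marker is my own invention (piper knows nothing about it), and the piper/sox flags in the commented pipeline are from my setup and may differ on yours:

```shell
# Replace runs of three or more dots with a pause marker for a later stage.
# This handles only the "dots" case; the real script also handles blank lines.
insert_pause_markers() {
    sed -e 's/\.\.\.\.*/ #PAUSE# /g'
}

# Full chain, assuming the tools are installed (not run here):
#   insert_pause_markers < book.txt \
#     | piper --model voice.onnx --output-raw \
#     | sox -t raw -r 22050 -e signed -b 16 -c 1 - -t wav - \
#     | lame - book.mp3
```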
What I cannot do is:

- make the voices whisper;
- make the voices yell or scream;
- make the voices mumble, cry and so on;
- control the accent inside a word;
- extend the vocabulary so that words "piper" doesn't know are said phonetically. For example, "Hm" is read by piper in letter-by-letter mode, as if you were iterating over the alphabet, and it sounds like "Ejch am".
I think I will soon turn this script into a more powerful tool. I need to figure out how to manipulate piper at the source level and how to twist the "digital larynx" it uses to make it whisper. But for that I need to get my new PC running, because "piper" can starve my Q2400 quad-core Pentium machine to death and still fail to speak in real time. It is hard to experiment in such conditions.