Festival speech synthesis: worst coding ever?

This week I played with the text2wave component of the festival tool set, used on Linux to run speech synthesis. I am not very well versed in speech synthesis tools on Linux, but on my machine the HTS and CMU voices sounded better than espeak and mbrola.

So I played a bit with it.

The voice produced by text2wave is a bit flat, so I searched for some way to annotate the text. I need to be able to:

  • adjust the rate (speed) of talking;
  • adjust the pitch (a higher or lower tone);
  • insert pauses at paragraph breaks;
  • change the voice (speaker) within the text;
  • save the spoken text to a file;
  • script the process so it can run automatically on many files.

Promises

Festival promises much. It declares support for Sable (XML) markup, for which I couldn’t find any specs except in the festival sources.

The DTD of this XML declares that I will be able to do everything I need.
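For illustration, here is roughly what a Sable document looks like. This is reconstructed from the DTD and the examples shipped in the festival sources; the exact DOCTYPE line and the attribute values are my best guess, since there is no specification to check them against:

```xml
<?xml version="1.0"?>
<!DOCTYPE SABLE PUBLIC "-//SABLE//DTD SABLE speech mark up//EN"
  "Sable.v0_2.dtd" []>
<SABLE>
<SPEAKER NAME="male1">
Plain narrator text.
<BREAK LEVEL="large"/>
<RATE SPEED="-20%">This should be spoken more slowly,</RATE>
and <PITCH BASE="+10%">this with a higher pitch</PITCH> --
<EMPH>allegedly</EMPH>.
</SPEAKER>
</SABLE>
```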

If one however tries it out, and then looks at the source, one will notice that:

    1. The DIV element is ignored, and this is the only element which could introduce a pause. The limitation is documented in the manual, but why is it then in the DTD?
    2. The BREAK element is a sophisticated way of entering a space into the text. All parameters declared by the DTD are ignored.
    3. The AUDIO tag requires a URL as an argument instead of a file; it doesn’t understand a plain path to a file at all. It can of course understand a file:// URL, but that means the path to your audio file must be absolute, which simply rules out any redistribution of your text files.
    4. The AUDIO tag neither validates the sampling rate nor resamples the file. As a result you must guess what the output sampling rate will be and match your file to it. This is a pain in the behind, because using AUDIO is the only way to inject a pause into the text.
    5. The AUDIO tag promises a “background” and an “insertion” mode, but, of course, only “insertion” works. This is documented; the fact that it doesn’t resample the file is not.
    6. EMPH is in fact using RATE to say the word more slowly. It ignores all its parameters, which is documented in the manual. Again, why is it in the DTD?
    7. PITCH and RATE promise a change of speaking speed and tone. It is a promise only: they work solely with the low-quality “diphone” voices. The high-quality CMU/HTS voices are not affected by these settings at all, which is not documented. In fact even internal festival commands cannot affect them.
    8. The range of attribute values, the meaning of the numbers and so on is documented nowhere.
      When PITCH takes n% as an argument, you must guess whether 100% means the “base” pitch or twice the base pitch.
    9. VOLUME does work, although you must guess how to specify the value and, of course, what the base value is and how much margin there is before the sound starts “clicking” due to clipping.
    10. The SPEAKER element, which changes the voice, works… but in fact it doesn’t. The voices I have all use different sample rates, while SPEAKER applies the first speaker’s sampling rate to every voice. In my case the male “diphone” voice is sampled at a low rate, the HTS voice at the highest rate, and the CMU voices are in the middle. As a result you can’t switch speakers unless you limit yourself to a very narrow range of voices.
      If you however use text2wave -F 32000 and are lucky with the selected number, the problem goes away.
    11. The Sable XML parser follows the XML specs in the places it likes and ignores them in the places it doesn’t. It collapses a sequence of whitespace characters into one space, as XML requires, but does not even try to understand &#160;, which is the XML way to say “I have a space here which I would like to keep”. As a result there is no way to add a pause to the text without using the AUDIO element.
    12. The text-to-speech process can take a very long time, yet the XML parser does not pre-parse the input to validate that all the syntax is correct. Instead it can run for an hour and then crash in the middle because you messed up a tag.
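To make the AUDIO-based pause workaround concrete, here is a small sketch of how a silence file and its URL can be prepared. The 16000 Hz rate and the file names are just assumptions for the example; the rate must match whatever you pass to text2wave -F, since nothing will resample it for you:

```shell
#!/bin/sh
# Generate half a second of silence at the rate text2wave will output.
# sox may not be installed everywhere, hence the guard.
if command -v sox >/dev/null 2>&1; then
  sox -n -r 16000 -c 1 pause.wav trim 0.0 0.5
fi
# AUDIO accepts only URLs, so build an absolute file:// URL for it.
# Absolute means non-portable: this SRC is valid on this machine only.
printf 'file://%s/pause.wav\n' "$(pwd)" > pause.url
cat pause.url
```

The resulting URL then goes into the markup as the SRC attribute of an AUDIO element.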

Documentation

Worst possible ever. It looks like notes made by the person who wrote the program. For example:

text2wave [options] textfile
  Convert a textfile to a waveform
  Options
  -mode <string>  Explicit tts mode.
  -o ofile        File to save waveform (default is stdout).
  -otype <string> Output waveform type: alaw, ulaw, snd, aiff, riff, nist etc.
                  (default is riff)
  -F <int>        Output frequency.
  -scale <float>  Volume factor
  -eval <string>  File or lisp s-expression to be evaluated before
                  synthesis.

Now, please tell me: what could the <string> in -mode be?

And finally… gosh… “riff, nist etc.”. Etc.?! Really? What are we playing here, LOTTO? Guess the sequence of letters and win the prize?!

By the way, -F is unpredictable and produces errors, especially if you, by accident, set -F to match the default sampling rate of the voice.
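For reference, a complete invocation that corresponds to the help text above looks more or less like this (voice_kal_diphone is the stock diphone voice; which voices you actually have installed is another matter):

```shell
#!/bin/sh
# A full text2wave call: riff output at 16 kHz with the stock
# diphone voice. Guarded, since festival may not be installed.
echo "Hello from festival." > input.txt
if command -v text2wave >/dev/null 2>&1; then
  text2wave -o out.wav -otype riff -F 16000 \
            -eval '(voice_kal_diphone)' input.txt
fi
```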

The names of the voices are cryptic, and the content of the archives at festvox.org is not described. Guys, I wasted a few GB of your server bandwidth only to discover that what I downloaded was not what I was looking for.

Maintenance

The festival (manual nil) command depends on Netscape. Yes, you read that correctly: not on the default OS web browser, but on Netscape. Amazing!

Even though the last big release was in 2017 (2.5), the “latest” link on the server points to revision 2.1.

There are plenty of references to non-existent external documents. For example, the link to the Sable XML format specification points into the void. You need to reverse-engineer the format from the DTD and the implementing code.

Voices

Voice compatibility is a total misery. The HTS voices, which are the best, can be fetched from the Nitech archive page. They are however not upwards compatible and must exactly match the festival version you have. And you know what? Festival will not validate the compatibility. It will just crash with either a cryptic message or a core dump.

State faults

If one (SayText "....") command fails, all subsequent calls will also fail. There is no cleanup, nothing. Zero fault tolerance.

Is it total crap then?

Well… both “yes” and “no”. It is good scientific work but total crap when it comes to coding quality and to what we may broadly describe as “end-user support”. This is certainly not a program which should be used in any production environment.

It does, however, speak quite well.

Never ever use a common noun as a program name!

I must say it even though some professors may feel offended. A “festival” is a f*ng event during which people dance and play! And a “rhino” is an animal.

Using common words to name a project, especially words which have nothing to do with the project, is a strategy used by the army to hide projects from the enemy.

Are we, the users, really your enemy?

Summary

Festival and text2wave are just one big disappointment, and one big lost chance. The HTS voices are really good and can compete with today’s cloud TTS services. Yet they run locally, so there are no security or privacy concerns. You can make festival say “….enter some phrase which is illegal in Your country in here…” and be sure that your police will not intercept it.

It could be good. It could be fast. It could be easy to use.

But it is not.

What a shame.

Note: If you need a working, up-to-date and much better sounding TTS, look for a project called “piper”. It sounds superior to festival and is an order of magnitude more reliable. Unfortunately the only thing you can change is the tempo at which the speaker talks. But at least it is well documented and doesn’t make empty promises.

Happily, I did manage to craft a small bash script which, although far from efficient, lets me annotate the text so that I can:

    • give “piper” voices friendly names;
    • make one of them the ‘narrator’, who speaks normal text, and another the ‘actor’, who speaks text enclosed in “double quotes”, and switch them on the fly. This option makes books read by TTS sound very attractive;
    • change the tempo (via a “piper” parameter) on the fly;
    • change the pitch and volume on the fly (with the “sox” project as a mid-processing pipeline; in fact all “sox” effects are possible);
    • insert additional pauses automatically when more than three consecutive \n characters, dots or whitespaces are found in the text;
    • pack it all into an *.mp3 file with “lame” as a post-processing pipeline.
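A condensed sketch of that kind of pipeline follows. The model name, the 22050 Hz rate of the “medium” piper voices, the sox effect values and the pause-marker convention are all assumptions for this example, not the script itself:

```shell
#!/bin/sh
# Step 1, pure text processing: turn runs of four or more dots into an
# explicit pause marker which a later stage can replace with silence.
printf 'First part....... Second part.\n' |
  sed 's/\.\{4,\}/. [[PAUSE]]/g' > annotated.txt
cat annotated.txt
# Step 2, the audio pipeline: piper emits raw 16-bit mono samples,
# sox adjusts pitch and volume, lame packs the result into an mp3.
# Guarded: all three tools plus the voice model must be present.
if command -v piper >/dev/null 2>&1 && command -v sox >/dev/null 2>&1 \
   && command -v lame >/dev/null 2>&1 && [ -f en_US-lessac-medium.onnx ]
then
  piper --model en_US-lessac-medium.onnx --length_scale 1.1 --output-raw \
      < annotated.txt |
    sox -t raw -r 22050 -e signed -b 16 -c 1 - -t wav - pitch -100 vol 0.9 |
    lame - out.mp3
fi
```

Whether the tempo option is spelled --length_scale or --length-scale depends on the piper build, so check piper --help before relying on it.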

What I cannot do is:

    • make the voices whisper;
    • make the voices yell or scream;
    • make the voices mumble, cry and so on;
    • control the accent within a word;
    • extend the vocabulary so that words “piper” doesn’t know are spoken phonetically. For example, “Hm” is read by piper in ‘letter-by-letter’ mode, as if you were iterating over the alphabet, and it sounds like “Ejch am”.

I think I will soon turn this script into some more powerful tool. I need to figure out how to manipulate piper at the source level and how to twist the “digital larynx” it uses so as to make it whisper. But for that I need to get my new PC running, because “piper” can starve my Q2400 quad-core Pentium machine to death and still fail to speak in real time. It is hard to experiment in such conditions.
