Book People Archive

Creating Talking Books from E-texts

From: Nick Hodson <Nick_H@[redacted]>
Subject: Creating Talking Books from E-texts
Date: Sun, 4 Nov 2001 19:27:06 -0500

Herewith the gist of my research on this subject since my last e-mail on
the topic, to the BookPeople, October 2000. If this does not answer your
queries, please e-mail me directly. Nick Hodson
______________________________________________________

Creating Speaking Books using Text to Speech Program Fonix ISpeak

A year ago I wrote a memo and some follow-ups to the Book People, about
making Talking Books, using Fonix ISpeak. Since then I have changed several
of the views I held at that time, and have also improved the end product,
as will be shown. I would like to thank the many members of the Book People
who have corresponded with me on these topics during the intervening year.

Before going deeper into this subject, we must realise that there are a
number of text-to-speech programs around. In order to produce Talking Books
you need one that will directly create digital files, preferably much
faster than in real time. Fonix ISpeak can read the whole of a book in one
go, creating suitable markers for the ends of paragraphs and chapters, and
it can be made to do this in minutes, rather than in the twelve to twenty
hours it might take to "read" the book in real time.

At the time of my October 2000 papers, there was talk of placing the files
produced by Fonix ISpeak onto CDs, but it was not until a while later that
the Goodman CD MP3 player was available here in the UK. Hitherto I had
concentrated on thinking about how to get the speech files into the
smallest size, while still retaining speech quality. This pointed to DSP
files, which run at 1067 bytes per second, but unfortunately the 64 MB Rio
MP3 player I was experimenting with does not support them. I was not able
to obtain a codec for the compressed wave format used by Audible Books for
use on the Rio, though I did ask. I made some 16 kbps MP3 files, which
could get a short book onto the Rio, but that was all: their sound quality
was not as good as I would have liked.

Meanwhile other problems with Fonix ISpeak files were being addressed.
Initially I had simply concentrated on trying to make speech files from
e-text books, using ISpeak. There were some very obvious problems, but
these were all dealt with before my memo of October 2000. One of these was
the problem of the Fonix ISpeak program crashing when it encountered
certain combinations of words. Another was making sure that there were
markers left in the PCM WAV files that indicated the ends of chapters, and
also the ends of paragraphs. Certain operational difficulties with Fonix
ISpeak were noted, solutions found, and communicated to Fonix, so that they
could incorporate the solutions in their next release.

The speech files I produced at this time, and for some six months onward,
were not as good as I would have liked. Nevertheless, I put the nineteenth
century novels I have transcribed into e-text format onto CDs, and derived
great pleasure from listening to them in the car. I think of Talking Books
made without the refinements discovered during the year 2001 as "Level
One."

The most noticeable problem now was the mispronunciation of what Fonix
calls unusual words. They are not that unusual, because there would be
seven or eight hundred in a single book. I wrote software which
listed alphabetically all the words in a book. I created lists of words
which Fonix ISpeak pronounces correctly, and of words that I had "taught"
it to pronounce correctly. This left several hundred words, in some cases,
that it had to be taught. A list of words yet to be taught could then be
easily created. The technique for teaching ISpeak how to pronounce a word
is easily learnt, and is quite quick to use. Thus a personalised dictionary
for each book could be easily produced. Fonix ISpeak files produced with
this vastly improved pronunciation are really easy to listen to, whereas
previously they had at times been very difficult to follow. The Fonix
ISpeak logic for pronouncing words it does not know is its weakest point.
It likes to insert the names of American States at unexpected spots.
"George VI had a wash and sat down to lunch" becomes "George Virgin Islands
had a Washington and Saturday down to lunch." All this can be dealt with
easily enough, once the problem is admitted. An amusing example of the
ISpeak logic is that "Tortoiseshell" is rendered as "Tortoise's Hell".
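
The word-listing step described above can be sketched roughly as follows.
This is only an illustration in Python, not the software I actually used:
the function names, the word pattern and the two word lists are assumptions
made for the example.

```python
import re

# Assumed definition of a "word": letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def words_in_book(text):
    """Return every distinct word in the text, lower-cased, in alphabetical order."""
    return sorted({w.lower() for w in WORD_RE.findall(text)})

def words_to_teach(text, known_ok, already_taught):
    """Return the book's words that are in neither the 'pronounced correctly'
    list nor the 'already taught' list."""
    known = {w.lower() for w in known_ok} | {w.lower() for w in already_taught}
    return [w for w in words_in_book(text) if w not in known]

sample = "George VI had a wash and sat down to lunch."
todo = words_to_teach(
    sample,
    known_ok=["george", "had", "a", "and", "down", "to", "lunch"],
    already_taught=["vi"],
)
print(todo)  # ['sat', 'wash']
```

Each word left in the list is one that still has to be taught before the
book is spoken.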

Then again, once I found I could accept mono MP3 files of 32 kbps, or even
24 kbps, for use on my CD player, I was able to improve the quality of
speech from that direction as well. The Talking Books I have produced
recently, I think of as "Level Two."

Fonix ISpeak has some other faults. One of these concerns apostrophes, for
it can't handle more than one per word, so "O'Reilly's hat" has to be
presented to ISpeak as "O Reilly's hat." It likes to insert a pause in any
phrase that runs for twelve words or so between punctuation marks. It also
pauses on certain favourite words, such as "which". Thus, "...to the table
on which..." is rendered as "...to the table on, which..."

I have written a little program to indicate these danger points in a book,
which are often only six or seven in number. Often, all that is needed is
to insert appropriate punctuation, just to break up the phrase.
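
A check of this kind might look something like the sketch below. It is an
illustration only, not my own program: the twelve-word threshold is taken
from the rough figure given above, and the set of characters treated as
punctuation is an assumption.

```python
import re

MAX_WORDS = 12  # assumed threshold, from "twelve words or so"

def danger_points(text, max_words=MAX_WORDS):
    """Yield (kind, fragment) pairs for spots likely to trip up the speech engine."""
    # Words containing more than one apostrophe, such as O'Reilly's.
    for raw in text.split():
        word = raw.strip("'\".,;:!?()")  # ignore surrounding quotes and punctuation
        if word.count("'") > 1:
            yield ("apostrophes", word)
    # Runs of words with no punctuation between them that exceed max_words.
    for phrase in re.split(r'[.,;:!?()"]+|--', text):
        words = phrase.split()
        if len(words) > max_words:
            yield ("long phrase", " ".join(words))

text = ("O'Reilly's hat lay on the table. He went to the table on which the old "
        "brass lamp had been standing since the morning, and sat.")
for kind, fragment in danger_points(text):
    print(kind, "->", fragment)
```

The apostrophe case is reported first, then any over-long phrase, so that
each can be edited by hand in the e-text.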

One of the things ISpeak prides itself on is being able to decide whether a
word like "lead", with two pronunciations, is a verb or a noun, and hence
to be able to decide by context whether it is to be pronounced "leed" or
"led". In fact it is quite often wrong on this, and I am working on ways of
reducing the impact of this annoying fault. I would term a CD produced
clear of all these faults as "Level Three."

The remaining faults lie with French and other languages. As an Englishman
I have had to teach ISpeak to pronounce many words in the English manner.
This has led me to realise that there are sounds in English that do not
exist in American, and that consequently cannot be rendered exactly by
ISpeak. The converse is certainly true. Still more so is this true of
French, and of regional British accents. ISpeak does manage quite well with
dog-Latin, as in "Jacob Faithful". The approach with phrases ISpeak cannot
manage is to speak them yourself, if you are able to, or else get someone
else who can, and use a program like WaveLab to merge the phrases in where
necessary; but this is obviously a longish process, and it will not be
mentioned further here.

It may be interesting to have some very rough feel for size and duration.

Consider an average length book of 500K. This might be 300 pages, each
taking 2 minutes to read aloud. That is 600 minutes, or 10 hours, or
36,000 seconds. At 24 kbps (3,000 bytes per second), which I consider
adequate, this is 108
megabytes. Some books might be longer, some shorter, but these are
ball-park figures. You can put 2 or 3 books on a CD, but experience with
the Goodman, which has several problems of its own, such as no
book-marking, and a poor display, has led me to put only one book on each
CD. A recent e-mail to the BP indicated that it might be as well to make
CDs with each component file of perhaps 4 or 5 minutes duration. I knew
there was a problem here, with the variable length of chapters, and the
Goodman's inability to start conveniently anywhere but the start of one of
its component chapters. I am glad to know of this idea. "Percival Keene"
has a chapter which is over three hours in length. I shall shortly be
experimenting with this possible fix.
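
The ball-park arithmetic above can be written out as a short calculation.
The function names are made up for the example, and a megabyte is taken as
a million bytes.

```python
def audio_size_bytes(pages, minutes_per_page, kbps):
    """Size of the audio for a book, given reading time per page and bit rate."""
    seconds = pages * minutes_per_page * 60
    return seconds * kbps * 1000 // 8  # kilobits per second -> bytes

def pieces_needed(duration_seconds, piece_seconds):
    """Number of component files if a chapter is cut into pieces of a fixed length."""
    return -(-duration_seconds // piece_seconds)  # ceiling division

print(audio_size_bytes(300, 2, 24))      # 108000000, i.e. 108 megabytes
print(pieces_needed(3 * 3600, 5 * 60))   # 36 five-minute pieces in a three-hour chapter
```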

More recently a superb device has appeared on the UK market, and at a very
reasonable price too. This is the Creative D.A.P. (Digital Audio Player)
JukeBox, which is the same size as a CD player, but which has a hard disk
with 6 gigabytes of capacity. Some of this is taken up with software, but
even so, five gigabytes are available. This might well hold 30 books,
perhaps the entire literary output of a moderately productive author, such
as Marryat.

The Goodman CD player can be used in-car, as it has a cigar-lighter lead and a
gadget for transferring signals from the earphone socket to a dummy tape
cassette, but at the last reckoning Creative Labs had not brought out the
relatively simple cable from the cigar lighter of the car, through a
step-down voltage convertor, and into the oddly shaped power socket of the
Jukebox. I have not checked this for a month, and it may have been sorted
by now.

Finally, I must add that the E-Text file to be fed to Fonix ISpeak is not
exactly the same as the master file you use to create your HTML files of
the book you have been working on. Some of the differences have been
referred to above, and some are more fundamental. For example you need to
indicate the pauses to be made at the end of each paragraph, and at the end
of each chapter. I have found that ISpeak works better if you make each
paragraph of a book into a separate TXT file: there may well be upwards of
2,500 files making up the book. The list of these paragraph filenames goes
into one List-File (.LST). All this work is done automatically using my
software. It seems extraordinary, but this is the way I have found to give
the best results. These additional processes take a little time, but they
save time and improve the quality in the long run.
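
The paragraph-splitting step could be sketched along these lines. This is
not my own software, only an illustration under stated assumptions:
paragraphs are taken to be separated by blank lines, the file names are
invented, and the list file is assumed to hold one filename per line, which
may not match what ISpeak actually expects. The pause markers are left out,
as their form is not given here.

```python
import re
import tempfile
from pathlib import Path

def split_into_paragraph_files(book_text, out_dir, stem="para"):
    """Write each paragraph to its own TXT file, plus a list file naming them in order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Assumption: a blank line marks the end of a paragraph.
    paragraphs = [p for p in re.split(r"\n\s*\n", book_text) if p.strip()]
    names = []
    for number, para in enumerate(paragraphs, start=1):
        name = f"{stem}{number:05d}.txt"
        # Rejoin hard-wrapped lines so the paragraph is one run of text.
        (out / name).write_text(" ".join(para.split()) + "\n", encoding="utf-8")
        names.append(name)
    # Assumed list-file layout: one paragraph filename per line.
    (out / f"{stem}.lst").write_text("\n".join(names) + "\n", encoding="utf-8")
    return names

with tempfile.TemporaryDirectory() as tmp:
    print(split_into_paragraph_files("First paragraph,\nwrapped.\n\nSecond one.\n", tmp))
```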

To make ISpeak create the PCM WAV files of a book faster than real time,
rather than taking all day, literally, you need to engage your computer's
"Windows Media Player" in some other task. You could get it to play a
chapter or two of another book. ISpeak will then run as fast as your
computer's speed will allow.
             
Finally you need a process to convert your PCM WAV files initially to
MPEG Layer-3 WAV files, that is, WAV files whose audio is MP3-encoded.
There are several freeware programs around to do this. Knocking off the 70
byte header of these WAV files converts them to pukka MP3 files, but to be
really pukka they need an MP3 tag (such as ID3), which contains data such
as the name of the book, the author, and so forth. Here is another little
step to which I have also provided the answer.
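
By way of illustration, this last step might be sketched as below. It is
not my own program. Rather than assuming the header is always exactly 70
bytes, the sketch walks the RIFF chunks to find the audio data. For the
book details it uses an ID3v1 tag, which is an assumption on two counts:
the tag format is my choice for the example, and ID3v1 is appended to the
end of the file rather than placed at the start. The function names are
invented.

```python
import struct

def mp3_payload_from_wav(wav_bytes):
    """Return the contents of the 'data' chunk of a RIFF/WAVE file."""
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(wav_bytes):
        chunk_id = wav_bytes[pos:pos + 4]
        (size,) = struct.unpack("<I", wav_bytes[pos + 4:pos + 8])
        if chunk_id == b"data":
            return wav_bytes[pos + 8:pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length
    raise ValueError("no data chunk found")

def id3v1_tag(title, artist, album="", year="", comment="", genre=255):
    """Build a 128-byte ID3v1 tag; genre 255 means 'not set'."""
    def field(text, width):
        return text.encode("latin-1", "replace")[:width].ljust(width, b"\x00")
    return (b"TAG" + field(title, 30) + field(artist, 30) + field(album, 30)
            + field(year, 4) + field(comment, 30) + bytes([genre]))

def wav_to_mp3_bytes(wav_bytes, title, artist):
    """Strip the WAV wrapper and append a tag naming the book and its author."""
    return mp3_payload_from_wav(wav_bytes) + id3v1_tag(title, artist)
```

A call such as wav_to_mp3_bytes(data, "Percival Keene", "Frederick
Marryat") would then give the bytes to be written out as the MP3 file.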

Nick Hodson, 4th November 2001, London, England, UK