(For archival purposes; posted at arthur-chan.blogspot.com in 2009.)
Copyrighted by Arthur Chan, 2005. When reproduced, this document should be reproduced in full.
This is my own opinion only; it does not represent the standpoint of Carnegie Mellon University or the CMU Sphinx Group.
(Fourth Draft) Do we have a true open source dictation machine?
Fourth Draft by Arthur Chan at 20050725
Third Draft by Arthur Chan at 20050607
Second Draft by Arthur Chan at 20050606
First Draft by Arthur Chan at 20050525
Introduction
Is there an open source dictation machine in 2005? Why do I even ask? Didn't IBM just open source some speech components in 2004? Aren't open source speech recognition engines such as Sphinx, HTK, Julius, the ISIP speech recognizer, SPRACHcore, NICO and Intel AVSR dictation machines too? Aren't they already open source?
I guess there is still a big gap between reality and the public's conception. I also feel a sense of irony here, partly because, from a historical point of view, the public's misconceptions grew with the popularity of dictation machines such as Dragon NaturallySpeaking and IBM ViaVoice.
As a programmer and a sort of scientist, I just cannot bear wrong statements floating around in the field. Some of them even come from the mouths of "speech experts" in the on-line press. I am afraid to say this really makes me sick sometimes. That is why I woke up at midnight one day and wrote the following. I just hope that someday, if a crawler comes to my page or if people accidentally come to this page, they might be more informed about these issues.
Is there a true open source dictation system in 2005?
What is a speech recognizer? What is inside a speech recognizer?
Let us come to something more central: what a speech recognizer actually is. We could simply say it is a device that transforms human speech into text. To avoid discussing complicated issues such as how the brain encodes speech, it is even better to say that a speech recognizer transforms the acoustic waveform of speech into text.
Modern speech recognizers are usually statistical. That is to say (roughly, and loosely on purpose), a model is first estimated (or trained) from a large number of speakers. It is important to use a large number of speakers. Why? Because, to many people's surprise, speech varies widely even within the same language. That is to say, the "Hello" you say will be very different from the "Hello" I say. (This is further complicated by the fact that I am not a native speaker.)
The variety of speech across languages is understandably huge. However, a less well-known fact is that even within one language, say English, speech is very diverse. This is well known to linguists. The public would probably find it hard to swallow, perhaps because of the misconception of an "official" language. Such a language exists, but it is actually tough to define and changes over time. That perhaps explains why there is a need for grammar rule books for the SAT. This is another interesting phenomenon, but someone else is better placed than me to discuss it.
There is no way a single speaker's voice could represent the whole world, no matter how official their English is. As a result, a speech model is essentially an average of many voices. Here we come to a tricky part of building a speech recognizer: how to select the speakers used to train the models. The most important criterion is that a large number of speakers is required. An equally important criterion is that the speakers' accents, genders and phonetic content should be balanced in the training sets.
These data are then used to create a single model which, if you look at it, is just a bunch of numbers. So another way to say it is that your voice and my voice are both represented by one single set of numbers.
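To make "a bunch of numbers" a little more concrete, here is a toy sketch. It assumes each speaker's voice has already been reduced to a few one-dimensional feature values (real recognizers use vectors of acoustic features and far richer models), and it pools them into one mean and one variance. The speaker names, the values and the function name `train_pooled_gaussian` are all invented for illustration.

```python
from statistics import fmean, pvariance

# Hypothetical one-dimensional "features" for the same sound from three speakers.
# The numbers are invented; real features are multi-dimensional.
speaker_features = {
    "speaker_a": [1.0, 1.2, 0.8],
    "speaker_b": [2.0, 2.2, 1.8],
    "speaker_c": [3.0, 3.2, 2.8],
}

def train_pooled_gaussian(features_by_speaker):
    """Pool every speaker's samples and estimate one mean and one variance."""
    pooled = [x for samples in features_by_speaker.values() for x in samples]
    mean = fmean(pooled)
    return mean, pvariance(pooled, mu=mean)

# The whole "model" is just these two numbers.
model_mean, model_variance = train_pooled_gaussian(speaker_features)
```

Notice that the pooled mean sits in the middle of the three speakers and matches none of them exactly, which is the sense in which the model is an average of the voices.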
Let us just call the above statistical models "models" from now on. A good model is a very important part of a good speech recognizer. However, many people fail to understand this when they start to work on a speech recognizer. That, I think, is the most significant confusion of the public.
Open source speech recognition engines: what do they provide now?
There are actually many open source speech recognizers in the world. I will name HTK, the ISIP speech recognizer, Julius, SPRACHcore, NICO and AVSR. Of course, you already know Sphinx. When we talk about Sphinx, it is actually three different recognizers: Sphinx 2, Sphinx 3 and Sphinx 4. (I personally know HTK, Sphinx 2 and Sphinx 3. I will perhaps write another article to talk about them in detail separately.)
All of these recognizers have one point in common: they are research systems. Another thing they have in common is that you can download their source code and study how they work in general.
However, not all of them actually provide the models we mentioned. Sphinx does, the ISIP recognizer does, Julius does. Sphinx, the ISIP recognizer, HTK, SPRACHcore, NICO and AVSR provide tools for users to train a model. That is to say, where no model is provided, users need to train one before they actually have a model. This is not an easy deal because, as you will see, training is a hard problem.
Among all the recognizers, Julius is probably the closest to a dictation engine because it was truly built for that purpose. Sphinx's model was optimized for the broadcast news task, so by no means was its original intention to provide dictation for users. As a matter of fact, Sphinx and HTK are more like sets of handy tools for building speech application systems.
Speech decoder vs Speech models: Is the definition of speech recognition that simple?
The take-home message here: a speech decoder and a speech model are different things. Between the two, I will say the model is actually more important. A good model on a poor decoder can still give you OK results. A good decoder on a poor model, though, can result in a poor system.
A speech model is usually domain-specific. For example, on a telephone channel, you need a telephone model for the data. That model will be different from a desktop model. It will also be different from the model you use in your PDA, because the microphones are different.
It is also domain-specific in another sense. For example, if you know that the recognition only involves 5 words, the recognition rate can be much better if you constrain the speech recognition using this information. With the same decoder but another set of constraints, the performance could just be screwed up!
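Here is a toy illustration of the effect of such a constraint. It is only a sketch: the `decode` function and the scores are invented, and a real decoder searches over sequences of sounds and words rather than picking from a small table. The point is just that the same decoder and the same scores can give a different answer once the vocabulary is restricted.

```python
def decode(scores, vocabulary=None):
    """Return the best-scoring word, optionally restricted to a vocabulary."""
    if vocabulary is None:
        candidates = scores
    else:
        candidates = {word: s for word, s in scores.items() if word in vocabulary}
    if not candidates:
        raise ValueError("no candidate word is allowed by the vocabulary")
    return max(candidates, key=candidates.get)

# Hypothetical log-scores for one utterance in which the speaker said "five".
# A confusable word ("fine") happens to score slightly higher.
scores = {"five": -12.1, "fine": -11.8, "hive": -12.5, "four": -15.0}

# The application only ever expects these 5 words.
five_words = {"one", "two", "three", "four", "five"}

unconstrained = decode(scores)               # picks "fine"
constrained = decode(scores, five_words)     # picks "five"
```

The opposite also holds: had the speaker really said "fine", the five-word constraint would force a wrong answer, which is how the wrong constraints can ruin an otherwise good decoder.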
Now let us revisit the definition of speech recognition. The common and simple view of speech recognition of course has its own elegance in theory. However, in practice, it does not clearly separate a system whose model has a specific purpose from one whose model is general purpose. That is why I want to bring up another point: a dictation machine is not strictly equal to a speech recognition engine. That is what I will devote the next section to.
A speech recognition engine does not strictly equal a dictation machine
Equating the two is a misconception I cannot bear.
If you have read up to this point, perhaps you will start to see that the general notion of dictation is quite confusing. At least for users, it is.
Systems such as Sphinx, the ISIP speech recognizer and HTK could be a very important component of a dictation engine, namely the decoder of the dictation system.
But by themselves, they are just tools for building speech systems. Some of them provide a default model but, as I explained, it does not necessarily fit every situation. That sort of explains why many users of speech recognizers failed in building their systems: they tried to use the default model to do everything. I can only hope that their systems are not used by many.....
As the maintainer of Sphinx, I can probably speak for it: as of Sphinx 3.5, we only provide a set of APIs for people to build a speech recognition system. The default model (trained on Broadcast News) is there, but it is definitely not meant for every speech application. It is certainly not for the telephony channel because it is a broadband model (16k). So please don't try to use it for telephony applications. (Yes, Asterisk folks, I am talking about you.)
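A cheap sanity check before feeding audio to a model is to compare the sampling rate of the recording with the rate the model was trained on. The sketch below uses only Python's standard `wave` module and has nothing to do with the Sphinx API. The helper names are invented, and the 8000 Hz figure is my assumption of a typical telephone recording.

```python
import os
import tempfile
import wave

def write_silence(path, rate_hz, seconds=1):
    """Write a mono 16-bit WAV file of silence (only used for the demo below)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate_hz)
        wav_file.writeframes(b"\x00\x00" * rate_hz * seconds)

def sample_rate_matches(path, model_rate_hz):
    """Return True if the WAV file's sampling rate equals the model's rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == model_rate_hz

with tempfile.TemporaryDirectory() as tmp:
    telephone = os.path.join(tmp, "telephone.wav")
    desktop = os.path.join(tmp, "desktop.wav")
    write_silence(telephone, 8000)    # assumed telephone-style recording
    write_silence(desktop, 16000)     # broadband recording
    telephone_ok = sample_rate_matches(telephone, 16000)
    desktop_ok = sample_rate_matches(desktop, 16000)
```

A matching sampling rate is necessary but not sufficient: a model can still be a poor fit for the microphone, the speaking style or the vocabulary.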
Let me try to expand a little bit on what it takes to build a dictation system.
A modern desktop dictation system has many special things. Let me list five of them.
1, It has an acoustic model that is trained on at least 200 hours of read speech on a desktop channel.
2, It should have a language model trained on a corpus of at least a million words.
3, It can adapt to different speakers' speech and use of language.
4, Usually, it has a graphical user interface so that users can configure it. It can also communicate with other applications.
5, Of course it needs to have a speech decoder. But as I repeat again and again, that is just one component of the system.
As far as I can see, a system that works out all these aspects is a true dictation system. And as far as I can see, sorry, there is no serious open source effort on this yet.
So how come my commercial recognizer could do it then?
Simple: because they are good. When the dictation models were first trained, many hours of speech (more than 100 hours) had already been used. Several advanced techniques were used to refine the models. Simply speaking, the five things I mentioned have already been done by these famous dictation engines. Plus something I don't know.
Digression: What does "speech component" mean?
I will sidetrack a little to talk about IBM's well-known "speech components". Are they speech recognizers?
Hmm. If you have read to this point, you will know that a system based on speech recognition actually consists of many things. The speech recognizer is one; the model is another. Now, another part of a modern speech system starts to appear once developers have a working recognizer with its models: the modules, or wrappers, around the speech recognizer.
Many people find that programming with just the speech recognizer, or some APIs provided by it, is simply too hard. So people started to introduce another layer for developers. For example, a designer will wrap up the flow of the interaction between computer and human for the inquiry of digit strings into one module. So when IBM released its open source speech components, essentially we are talking about these wrappers.
If you check the specifications of Nuance and ScanSoft (oops, they are the same company now), you will probably see similar things. These modules are undoubtedly very important for developers because they save a lot of programming time. If we set controversy aside, we should be happy whenever a company open sources these components. However, these speech components should not be confused with the speech recognition engine or the model. The latter two are indeed more important.
So most of the folks who called the above systems poor dictation machines might just be making another bogus statement.
So, why is there no true open source dictation system?
I believe I have already made my case for why there is no true open source dictation system yet. I want to ask another important question: why doesn't such a system exist yet?
From my observation of discussion lists, forums and the internals of CMU, I would conclude that there is a huge gap of knowledge between experts in speech recognition and users. Understanding these two perspectives is very important for understanding why there are such deep misconceptions about speech recognition.
Experts' Perspective
Experts, of course, means people who know everything. They know how to write a decoder, how to train a model and also how to use a recognizer correctly. Another thing we need to point out is that experts in speech recognition are usually fairly intelligent people. (Alright, let's say except me.) They might like intellectual excitement. Some might like to use their intelligence to make money, because most of the time they are pretty poor (compared to other clever people :-)).
Now, here comes an interesting point: if building a dictation system is such an intellectual process, and intelligent people love excitement, why is no one motivated to build one now?
Well, do you know how long it takes to build an acoustic model? Hmm. I think you have no idea. So why don't we imagine for a while that we are workers in speech recognition?
First of all, you need to clean up more than 80 hours of speech and its transcription (written text describing what was said). Then you have got to make sure your acoustic model training script is correct. (If it is wrong, you might waste 3 days.....)
Some 1 to 2 weeks later, you get your final model, which is usually not the greatest. Then you need to think about how to improve it. The above cycle will then repeat and repeat until you have a model that you cannot improve any more. Look at the calendar again: half a year has passed!
Then you find that the bottleneck of the system is that it was not trained on enough data! So you have got to start the above process again.
Now, I hope that you will at least be impressed by how valuable an acoustic model is (a language model follows a very similar process). And as a worker in the field, I will say the true magicians in the game are the people who train the models. Most of the time, in one laboratory, there are only one or two persons who can do it correctly. That's why an acoustic model is a very valuable asset to a speech recognition research laboratory or a person.
That's why it is very reasonable for experts not to open up their models: it takes their blood to train them.
Users' Perspective
A user is usually not an expert. So obviously he doesn't share the above point of view. It sounds like we have already explained the problem then, doesn't it?
Sort of, let me add a couple of things here.
First, the marketing of the big speech recognition companies played a very important part in misleading the public.
Second, the fact that open source software is so dominant these days makes many people believe that a lot of software can be obtained easily for free. They also believe that any creative work can be obtained for free. Naturally, the fact that most speech recognizers only provide the decoder, without the model, will surprise them unpleasantly.
What should we do?
As with many problems, I believe the first thing we should do is to understand the problem. That's why I wrote this article. I hope that readers can see the importance of the model in a recognizer. Or at least, have a correct expectation of open source speech recognition.
If we can see that this is a problem, the rest is just to find the right path and try to solve it. I am pretty optimistic about it, mainly because building a dictation engine will be at least as challenging as building a compiler or writing a kernel. It requires another type of sophistication. Training a model, writing a powerful recognizer: neither of them is trivial. Considering that this work could be shared with the world and could benefit the world, I believe such merit is still a great temptation for clever people to work on speech recognition.
That's the reason I am working on speech recognition now and probably not doing a lot of the things my boss told me to do (he is a nice guy though). I believe that many people think the same as me. We really need one model at the beginning. Sphinx has provided an English one. Julius has provided a Japanese one. Could we have more? This really relies on one expert to start. Could you help?
After we have these models, there are many ways to make use of them and create even better models. For example, one could use them to bootstrap an even better model (allow me not to explain the technical details). Or we could use speaker adaptation to change a model's characteristics a little bit. Without the model, though, nothing can be done.
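To give a rough feel for what speaker adaptation does, here is a minimal sketch of one common formulation: a MAP-style update that pulls a single model mean toward a new speaker's data. It is not how Sphinx or any particular recognizer does it. The function name, the `prior_weight` parameter and the numbers are all invented for illustration.

```python
def adapt_mean(prior_mean, speaker_samples, prior_weight=5.0):
    """Shift a model mean toward one speaker's data (MAP-style interpolation).

    prior_weight is a hypothetical tuning knob: the larger it is, the more
    the adapted mean trusts the original speaker-independent model.
    """
    n = len(speaker_samples)
    return (prior_weight * prior_mean + sum(speaker_samples)) / (prior_weight + n)

# Invented numbers: the general model says 2.0, the new speaker sits near 3.0.
adapted = adapt_mean(2.0, [3.0, 3.0, 3.0, 3.0, 3.0])
```

With five samples and a prior weight of 5, the adapted mean lands halfway between the model and the speaker. The starting model is an input to the update, so without one there is nothing to adapt.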
Apart from enthusiasm, we still need to take care of many technical issues. For example, how do we make the training time shorter? One idea is to build software that could be installed on average users' PCs and allow speech researchers to borrow computation from these users. There are a lot of practical issues I believe no one has really thought about. Hopefully, this could become an academic topic that more people discuss in the future.