Monday, April 27, 2009

A new kind of communication enhanced by search engines

Google provides a new way to communicate.

1, You type a message on some random crawlable webpage with an interesting, distinguishable keyword.
2, Ask the other party to search for the keyword a couple of days later.

Here you go! The key here is to make your message as plain as possible but your keyword as strange as possible. That's what I did in this message. I plan to implant the word "arthurchanlialastica" here. (As I just did.) Since this is a new word (searching for it in Google returns no results), its IDF should be fairly high, causing the retrieval engine to rank it high. I expect that when I search for "arthurchanlialastica", this message will come up.

I also expect this will be "quite anonymous". Other than "arthurchanlialastica", all the information is quite plain, so parties without a search engine will not be able to spot my message. What happens here is that "arthurchanlialastica" becomes a "radio frequency" I could use to communicate. Sounds like an interesting idea.
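A minimal sketch of the IDF intuition, in Python. The corpus is a made-up toy and the formula is the standard smoothed inverse document frequency; this is only an illustration of why a unique keyword scores high, not how Google actually ranks pages:

```python
import math

# Toy corpus: the rarer a term is across documents, the higher its IDF,
# and the easier it is for a search engine to pin down the exact page.
corpus = [
    "hello world how are you",
    "the weather is nice today",
    "hello again my friend",
    "a plain message with the keyword arthurchanlialastica",
]

def idf(term, docs):
    # Smoothed inverse document frequency: log(N / (1 + df)).
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / (1 + df))

print(idf("hello", corpus))                 # common term: low IDF
print(idf("arthurchanlialastica", corpus))  # unique term: high IDF
```

The invented word appears in one document out of four, so its IDF is log(2), while "hello" (in two documents) scores only log(4/3).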

Arthur

Sunday, April 26, 2009

Do we have a true open source dictation machine?

(For archival purposes; originally posted at arthur-chan.blogspot.com in 2009.)

Copyrighted by Arthur Chan in 2005. When reproduced, this document should be reproduced in full.

This is my own opinion only and does not represent the standpoints of Carnegie Mellon University or the CMU Sphinx Group.

(Fourth Draft) Do we have a true open source dictation machine?

Fourth Draft by Arthur Chan at 20050725

Third Draft by Arthur Chan at 20050607

Second Draft by Arthur Chan at 20050606

First Draft by Arthur Chan at 20050525

Introduction

Is there an open source dictation machine in 2005? Why do I even ask? Didn't IBM just open-source some speech components in 2004? Aren't open source speech recognition engines such as Sphinx, HTK, Julius, the ISIP speech recognizer, Sphrachcore, NICO and Intel AVSR dictation machines also? Aren't they already open sourced?
I guess there is still a big gap between reality and the public's conception. I also feel a sense of irony here, partially because, from a historical point of view, the public's misconceptions grew with the popularity of dictation machines such as Dragon NaturallySpeaking and IBM ViaVoice.
As a programmer and a sort of scientist, I just cannot bear wrong statements floating around the field. Some of them even came from the mouths of "speech experts" in the on-line press. I am afraid to say this really makes me sick sometimes. That is why I woke up at midnight one day and wrote the following. I just hope that someday, if a crawler comes to my page or if people accidentally come to this page, they might be more informed about these issues.

Is there a true open source dictation system in 2005?

What is a speech recognizer? What is inside a speech recognizer?

Let us come to something more central: what a speech recognizer actually means. We could probably simply say it is a device that transforms human speech into text. To avoid discussion of complicated issues such as how the brain encodes speech, it is even better to say a speech recognizer transforms the acoustic waveform of speech into text.
Modern day speech recognizers are usually statistical. That is to say (roughly and loosely, on purpose), a model is first estimated (or trained) from a large number of speakers. It is important to use a large number of speakers. Why? Because, to many people's surprise, speech, even within the same language, has wide variety. That is to say, the "Hello" you say will be very different from the "Hello" I say. (This is further complicated by the fact that I am not a native speaker.)
The variety of speech among languages is understandably huge. However, another not-so-well-known fact is that even within one language, say English, the variety of speech is diverse and huge. This is a fact well known to many linguists. The public would probably find it hard to accept, perhaps because of the misconception of an "official" language. Such a language exists, but it is actually tough to define and can change over time. That perhaps explains why there is a need for an SAT grammar rule book. This is another interesting phenomenon, but I will say someone else is better placed than me to discuss it.
There is no way a single speaker's voice could represent the whole world, no matter how official their English is. As a result, a speech model is essentially an average of voices. Here we come to a tricky part of building a speech recognizer: how to select the speakers to train the models. The most important criterion is that a large number of speakers is required. An equally important criterion is that the speakers' accents, genders and phonetic content should be balanced in the training set.
These data are then used to create a single model which, if you look at it, is just a bunch of numbers. So, another way to say it is that your voice and my voice are essentially represented by one single set of numbers there.
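To make the "bunch of numbers" point concrete, here is a deliberately over-simplified sketch in Python: several fake "speakers" contribute one-dimensional features, and the "model" is just their pooled mean and variance. Real acoustic models are HMMs over mixture densities trained on real feature vectors; this only captures the averaging intuition.

```python
import random

random.seed(0)

# Grossly simplified stand-in for acoustic model training: each fake
# "speaker" contributes feature values (here, single numbers drawn
# around a per-speaker offset), and the "model" is just the pooled
# mean and variance -- literally a small set of numbers averaging
# everyone's voice.
def fake_speaker_features(offset, n=100):
    return [offset + random.gauss(0, 1) for _ in range(n)]

speakers = [fake_speaker_features(o) for o in (-0.5, 0.0, 0.4, 0.1)]
pooled = [x for feats in speakers for x in feats]

mean = sum(pooled) / len(pooled)
var = sum((x - mean) ** 2 for x in pooled) / len(pooled)
model = (mean, var)  # the whole "model": two numbers
print(model)
```

A voice that is very different from the pooled average (say, an accent absent from the training set) is modeled badly, which is exactly why balancing speakers matters.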
Let us just call the above statistical models "models" from now on. A good model is actually a very important part of a good speech recognizer. However, this fact is not understood by many people. Many people, when they start to work on a speech recognizer, fail to understand it. That is, I think, the most significant confusion of the public.

Open source speech recognition engines: what do they provide now?

There are actually many open source speech recognizers in the world. I will name HTK, the ISIP speech recognizer, Julius, Sphrachcore, NICO and AVSR. Of course, you already know Sphinx. When we talk about Sphinx, it is actually 3 different recognizers: Sphinx 2, Sphinx 3 and Sphinx 4. (I personally know HTK, Sphinx 2 and Sphinx 3. I will perhaps write another article to talk about them in detail separately.)
All of these recognizers have one point in common: they are research systems. Another thing they have in common is that you can download their source code and study how they work in general.
However, not all of them actually provide the models we mentioned. Sphinx does, the ISIP recognizer does, Julius does. Sphinx, the ISIP recognizer, HTK, Sphrachcore, NICO and AVSR provide tools for users to train a model. That is to say, the users need to do the training before they actually have a model. This is not an easy deal because, as you will see, training is a hard problem.
Among all these recognizers, Julius is probably the one closest to a dictation engine because it was truly built for that particular purpose. Sphinx's model was optimized for the broadcast news task, so by no means was its original intention to provide dictation for users. As a matter of fact, Sphinx and HTK are more a set of handy tools for building speech application systems.

Speech decoder vs Speech models: Is the definition of speech recognition that simple?

A take-home message here: a speech decoder and a speech model are different things. Between the two, I will say the model is actually more important. A good model on a poor decoder can still give you OK results. A good decoder on a poor model, though, can result in a poor system.
A speech model is usually domain-specific. For example, for the telephone channel, you need a telephone model for the data. That model will be different from a desktop model, which will also be different from the model you use on your PDA, because the microphones are different.
It is also domain-specific in another sense. For example, if you know that the recognition only involves 5 words, the recognition rate could be much better if you constrain the speech recognizer using this information. With the same decoder but the wrong constraints, the performance could just be screwed up!
Now let us revisit the definition of speech recognition. The common and simple point of view of speech recognition of course has its own elegance in theory. However, in practice, it doesn't clearly separate a system whose model has a specific purpose from one whose model is general purpose. That's why I want to bring up another point: a dictation machine is not strictly equal to a speech recognition engine. That is what I will devote the next paragraph to.

A speech recognition engine is not strictly equal to a dictation machine

It is a misconception I cannot bear.
When you read up to this point, perhaps you will start to understand that the general notion of dictation is quite confusing. At least for the users, it is.
Systems such as Sphinx, the ISIP speech recognizer and HTK could be a very good component of a dictation engine, namely the decoder of the dictation system.
But by themselves, they are just tools for building speech systems. Some of them provide a default model; as I explained though, it does not necessarily fit every situation. That sort of explains why many users of speech recognizers failed in building their systems: they tried to use the default model to do everything. I can only hope that their systems are not used by many.....
As the maintainer of Sphinx, I can probably speak for it: as of Sphinx 3.5, we only provide a set of APIs for people to build a speech recognition system. The default model (trained on Broadcast News) is there, but it is definitely not for use in every speech application. It is certainly not for the telephony channel because it is a broadband model (16k). So please don't try to use it for telephony applications. (Yes, Asterisk folks, I am talking about you.)
Let me try to expand a little bit on what it takes to build a dictation system.
A modern desktop dictation system has many special things. Let me list out five of these points.
1, It has an acoustic model that is trained on at least 200 hours of read speech on a desktop channel.
2, It should have a language model trained on a corpus of at least a million words.
3, It can adapt to different speakers' speech and use of language.
4, Usually, it has a graphical user interface so that users can configure it. It also has the capability to communicate with other applications.
5, Of course it needs to have a speech decoder. But as I repeat and repeat, that is just one component of the system.
As far as I can see, a system that works out all these aspects is a true dictation system. And as far as I can see, sorry, there is no serious effort on this yet.
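To give a feel for the language-model part of the list above, here is a toy bigram model in Python. Real dictation LMs are trained on millions of words and use smoothing and back-off; this shows only the raw counting step on a made-up sentence:

```python
from collections import Counter

# A toy bigram language model: count word pairs, then estimate
# P(next word | current word) by maximum likelihood.
text = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(text, text[1:]))
unigrams = Counter(text)

def bigram_prob(w1, w2):
    # ML estimate P(w2 | w1); real systems smooth this so that
    # unseen pairs don't get probability zero.
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2 of the 3 occurrences of "the" are followed by "cat"
```

Even this tiny example hints at the data problem: most word pairs in a real language never appear in a small corpus, which is why the million-word figure above is a minimum.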

So how come my commercial recognizer could do it then?

Simple: because they are good. When the dictation models were first trained, many hours of speech were already used (more than 100 hours). Several advanced techniques were used to refine the models. Simply speaking, the five things I mentioned were already done by these famous dictation engines. Plus something I don't know.

Digression: What does "speech component" mean?

I will sidetrack a little bit to the well-known IBM "speech components". Are they speech recognizers?
Hmm. If you have read to this point, you will know that a system based on speech recognition actually consists of many things. The speech recognizer is one, the model is one. Now, another part of a modern speech system starts to appear when developers have a working recognizer with its models: the modules or wrappers around the speech recognizer.
Many people find that programming by just using the speech recognizer, or some APIs provided by the speech recognizer, is simply too hard. So people started to introduce another layer for developers. For example, some designer will wrap up the flow of the computer-human interaction for an inquiry of digit strings as one module. So when IBM released the open source speech components, essentially we are talking about these wrappers.
If you check out the specifications of Nuance and ScanSoft (oops, they are the same company now), you will probably see similar things. These modules are undoubtedly something very important for developers because they save a lot of programming time. If we don't take a conspiracy-theoretic point of view, we should feel happy about any company that open-sources these components. However, these speech components should not be confused with the speech recognition engine or the model. The latter two are indeed more important.
So most of the folks who said the above systems are poor dictation machines might just be making another bogus statement.

So, why there is no true open source dictation system?

I believe I have already made my case on why there is no true open source dictation system yet. I want to ask another important question: why doesn't such a system exist yet?
From my observation of discussion lists, forums and the internals of CMU, I would conclude that there is a huge gap of knowledge between experts in speech recognition and its users. Understanding these two perspectives is very important to understanding why there are such deep misconceptions about speech recognition.

Experts' Perspective

"Experts", of course, means people who know everything. They know how to write a decoder, how to train a model and also how to use a recognizer correctly. Another thing we need to point out is that experts in speech recognition are usually fairly intelligent people. (Alright, let's say except me.) They might like to have intellectual excitement. Some might like to use their intelligence to make money, because most of the time they are pretty poor (compared to other clever people :-)).
Now, here comes an interesting point: if building a dictation system is such an intelligent process, and intelligent people love excitement, why is no one motivated to build it now?
Well, do you know how long it takes to build an acoustic model? Hmm. I think you have no idea. So why don't we imagine we are speech recognition workers for a while?
First of all, you need to clean up more than 80 hours of speech and its transcription (written text describing what the speech means). Then you have to make sure your acoustic model training script is correct. (If it is wrong, you might probably waste 3 days.....)
Like 1 to 2 weeks later, you get your final model, which is usually not the greatest. Then you need to think about how to improve it. The above cycle will then repeat and repeat until you get a model you cannot improve any more. Look at the calendar again: half a year has passed!
Then you find that the bottleneck of the system is that it is not trained on enough data! So you have to start the above process again.
Now, I hope that you will at least be impressed by how valuable an acoustic model is (a language model follows a very similar process). And as a worker in the field, I will say the true magicians in the game are the people who do the training of the model. Most of the time, in one laboratory, there are only one or two persons who can do it correctly. That's why an acoustic model is a very valuable asset to a speech recognition research laboratory or a person.
That's why it is very reasonable for experts not to open the model: it took their blood to train it.

Users' Perspective

A user is usually not an expert, so obviously he doesn't have the above point of view. It sounds like we have already explained the problem then, haven't we?
Sort of, let me add a couple of things here.
First, the marketing of big speech recognition companies has played a very important part in misleading the public.
Second, the fact that open source software is so dominant these days makes many people believe that much software can be obtained easily for free. They also believe that any creative work can be obtained for free. Apparently, the fact that most open source speech recognizers only provide the decoder, without the model, will unpleasantly surprise them.

What should we do?

As with many problems, I believe the first thing we should do is understand the problem. That's why I wrote this article. I hope that readers can see the importance of a model in a recognizer, or at least have correct expectations of open source speech recognition.
If we can see that this is a problem, the rest is just to find the right path and try to solve it. I am pretty optimistic about it, mainly because building a dictation engine will be at least as challenging as building a compiler or writing a kernel. It requires another type of sophistication. Training a model, writing a powerful recognizer: none of them is trivial. Thinking that this work could be shared with the world and benefit the world, I believe such merit is still a great temptation for clever people to work on speech recognition.
That's the reason I am working on speech recognition now and probably not doing a lot of the things my boss told me to do (he is a nice guy though). I believe that many people think the same as me. We really need one model at the beginning. Sphinx has provided an English one. Julius has provided a Japanese one. Could we have more? This really relies on one expert to start. Could you help?
After we have these models, there are many ways to make use of them and create even better models. For example, one could use them to bootstrap an even better model (allow me not to explain the technicalities). Or we could use speaker adaptation to change their characteristics a little bit. Without the models, though, nothing can be done.
Apart from enthusiasm, we still need to take care of many technical issues. For example, how do we make the training time shorter? One idea is to build a program that could be installed on average users' PCs so that speech researchers can borrow computation from these users. There are a lot of practical issues I believe no one has really thought about. Hopefully, this could become an academic topic that more people discuss in future.

Conclusion

This article points out several misconceptions of users and developers of speech recognition. I myself am a big believer in speech recognition and many language technologies. I believe that correcting and clarifying these misconceptions will allow developers to realize the obstacles of open source speech recognition development and allow continuous development of the field.

Testing FB

Just to see if my post is published in FB as well.

My New Blog

Hi Guys,

I have closed "the Grand Janitor's Blog" for good, partially because I am no longer with CMU Sphinx and partially because I need a new start in blogging.

I am lucky enough to get the arthur-chan domain. This should be easy enough to remember. In this blog, I will only blog my personal observations about the world. Occasionally they are interesting. If you are directed here for some reason, welcome; you might be reading something which can't be found anywhere else.

Arthur

波﹐ Ball, Bo, ...... and the difficulty of Chinese Romanization

(2nd Edition at 20090424)
When you are chatting with your high school friend, it helps to be less professional and just occasionally goof a bit.

For example, a friend of mine was chatting on-line and jokingly accused me (correctly) of romanizing his Chinese name 波 (which literally means "Ball", "Wave" or "Radio Wave" in Chinese) as "Bo". In Hong Kong, people used to call him (in Jyutping with tones) "bo1 zai2". To make it sound English-like, it would be "b ao z ay" (if I use cmudict and my imagination). In Jyutping, 波 is indeed romanized as "Bo". Since I am familiar with Jyutping, "Bo" became my first choice of romanization.

What my friend expected me to write is "Ball". The early sound translation of "Ball" happens to be, guess what, 波. Some people will simply call him "Ball zai" on-line (or in mixed Jyutping-English, "Bo1 l zai2"). So most of the time, he preferred that people call him "Ball zai".

When I considered his side of the argument, I thought something was wrong. A large percentage of the population in Hong Kong pronounces the word "Ball" without the "l" sound (i.e. even closer to Cantonese 波). So I even felt my romanization was legitimate. My first thought was: "How can "Ball" be a romanization? What we are looking for is "b ao"! "Bo" should be the answer."

Nevertheless, my argument has a serious flaw. "Bo" in American English is usually pronounced "b ow". E.g. President Obama's dog is called "Bo".

So this is a total mistake! My friend's name would become 煲仔 in Cantonese (which literally means hot pot) under such a romanization scheme! That's pretty bad.

I started to spend some of my quality time thinking about this problem (to my BBN colleagues: that's when I can't work on work stuff anymore) of what the best romanization is.

In my mind, the goals of romanization should be
1, mimic a language/dialect with English phonemes,
2, convert it back to a legitimate word form,
3, make the romanized form sound correct both in the target language and in English.

There are probably hundreds of romanizations for the Chinese language. Even for Cantonese, there are about 5 (Yale, Lau and Jyutping are what I remember.....). Many educated people in Hong Kong try to communicate a sound in romanization. They might say "Chinese character X sounds like romanization R". Such translation can be very tough at times. 波 happens to be a very good example.

Let's try other romanizations. How about "baw" (rhymes with "jaw")? Here we get a sound which is closer to how Hong Kongers speak the character 波。 Unfortunately, it still requires articulating "w" at the end. This is not exactly what Hong Kongers do.

How about "boo"? Nope, usually this word is pronounced as "b uw" in English so this is even farther from "baw".

I tried out more words in my mind. Then I gave up shooting in the air and looked up cmudict, which contains around 100k entries of English pronunciations. I found that there is only one word in English which totally fits 波. (Looking up Merriam-Webster would probably give us a more definitive answer; cmudict is more of a speech recognition dictionary.) The word is "Baugh". It is an English surname; you can't find its meaning in Merriam-Webster. Although it solves the problem, it loses the literal meaning of "Ball". I feel lost by phonetics. But if this is the answer, maybe I should start to call my friend "Baugh", if I insist romanization is the Excalibur.
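For the curious, the cmudict lookup can be sketched in Python. The excerpt below is a tiny hand-made dictionary in cmudict's format (word mapped to a list of ARPAbet phones with stress marks); the real file has around 100k entries:

```python
# Searching a cmudict-style dictionary for a word pronounced exactly
# "B AO1" (the closest English rendering of Cantonese 波). This uses a
# tiny hand-made excerpt; the full cmudict is distributed with the
# CMU Sphinx project.
excerpt = {
    "BO":    ["B", "OW1"],
    "BOO":   ["B", "UW1"],
    "BALL":  ["B", "AO1", "L"],
    "BAUGH": ["B", "AO1"],
    "JAW":   ["JH", "AO1"],
}

target = ["B", "AO1"]
matches = [word for word, phones in excerpt.items() if phones == target]
print(matches)  # -> ['BAUGH']
```

The same loop over the full dictionary is how you would confirm that "Baugh" is the only English entry with exactly that pronunciation.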

After this analysis, I started to realize the difficulty of inventing a romanization scheme. I also realized what the collective wisdom of Hong Kong offers --- it doesn't attempt a romanization per se. What it does is a *literal translation*. That's why 波 was eventually "romanized" as "Ball", even though in Hong Kong the "l" is not usually pronounced.

This probably doesn't only happen between English and Cantonese. For other pairs of languages, the situation can be exactly the opposite. For example, if Japanese speakers want to "katakanize" 波, they could write ボ. But when they make a literal translation, it becomes ボール。 That explains why Japanese friends will occasionally lengthen the word "Ball" in English.

To all my friends whose name is 波 in Hong Kong: let me say that from now on I will also call you "Ball zai" on-line. But I will still keep calling you "Bo1 zai2" in Hong Kong. It's a tough compromise, but well, I will live with it.

Arthur

Probability of Car Crashes

(Originally written at Thursday, January 8, 2009 at 2:11pm)

Here is one simple thing in life that you want to note down. It's not such a common event in my life. When it happened, though, it naturally kept me in deep thought.

There was an electrical outage in the 3-5 blocks where I live. My place is very close to Mass. Ave. For non-Bostonians, it's a main avenue that connects several residential regions of Cambridge, Massachusetts.

The outage caused me to do something a bit different from my normal schedule: I went to a cafe to read, stayed there until 10:10 p.m. and decided to walk back. The road was slippery. I nearly fell down a couple of times. But well, no big deal; I just found the snowiest spots to put my steps on. Things were OK.

Then I heard a "bam" from behind me and took a look. 6-7 feet behind me, it turned out a van had lost control and hit a building. Part of the wall had a 1 foot x 1 foot x 1 foot hole. Part of the building was made of steel; there was a dent. Luckily the car wasn't going too fast, so it just stopped after it hit the building.

I am fair in my faculty of imagination. "What if I had been hit?" Allow me to skip the description of my imagination because it doesn't look good.

I naturally wanted to leave that spot immediately. At the same time, I wanted to keep an eye on the driver and see if he was hurt. The driver gave me an "ok" gesture. Then he seemed to get a grip on his wheel again. So my conclusion is that the road might have been too slippery and caused the whole thing.

For me, living in the U.S. in general means a lot of thinking and philosophizing. Since I live with books and computers too much, I seldom come that close to a near-death experience. But I guess for many, any car accident in their lives will make them think.

Some might decide that's the moment to take up a belief. You will hear a lot of such convictions from Bible study groups or as evidence for religion A, B, C. In my case though, I was thinking about the probability of how often this would happen to me.

According to the car accident statistics at http://www.car-accidents.com/pages/stats.html, around 2.8 million people get hurt and 40k people die in car accidents annually. If we assume a U.S. population of 330 million, we are talking about odds of about 117:1 and 8250:1.

Now, what are the odds of being hit and dying if you are just walking on the road (i.e. you are not a driver or passenger)? According to the statistics, only ~100 persons die like this, so the odds are on the order of a million to one against it happening.

So 117:1 against getting hurt makes me feel relatively happy, because 117:1 is more like getting drawn out twice in a row by a 4-outer in Texas Hold'em (~1/11 each time). That happens often when you have played 100k hands of hold'em.

What about the other numbers? Well, it sounds like I am naturally protected from dying anyway. At 8250:1, we are talking about someone getting drawn out 4 times in a row by a 3-outer. In the million-to-one case, that's so unlikely that lightning turns out to be more threatening to my life.
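For the record, here is the back-of-envelope arithmetic in Python, using the rough figures quoted above (all of them are the post's approximations, not authoritative statistics):

```python
# Back-of-envelope odds from the rough figures cited in the post.
population = 330e6
injured_per_year = 2.8e6
killed_per_year = 40e3
pedestrian_deaths = 100  # the post's rough figure for this exact scenario

print(int(population / injured_per_year))    # -> 117, odds against being hurt
print(int(population / killed_per_year))     # -> 8250, odds against dying
print(int(population / pedestrian_deaths))   # -> 3300000, pedestrian case
```

Note that the pedestrian figure actually works out to about 3.3 million to one, so "a million to one" is if anything an overestimate of the risk.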

Now, what about the fact that there was an electrical outage and the road was slippery: will that increase the chance of an accident? Absolutely. It probably doubles the chance.

So what did I learn from this? Walking was statistically safer than driving. This doesn't take a PhD to figure out. Well, but the fact that a lot of people change their life perspectives just because of some unusual life event is astounding. After some semi-careful analysis, it is not such a mystery at all. These are just usual facts of life.

The event does urge me to search for a girlfriend/companion. (Umm, I don't know the probability of *that*.) Well, after all, if I live till 60, probability will dictate that there will be 1 or 2 days I get hurt in a car accident. Before that happens, let's make sure I have a family first. :)

Arthur

Statistically Insignificant Me

(Originally written at Wednesday, November 07, 2007)

Slightly related to my last post. It relates to an interesting issue of whether we should share the bookshelf in the first place.

Why is it an issue? Well, privacy. Suppose someone is malicious and try to figure you out. The best way is to try to gather all information about you and work against you.

Another concern of mine is rather interesting and absolutely speculative: what if the information I read affects my thoughts, and what if people could reconstruct those thoughts just from the information I read? That would open up a lot of interesting applications. E.g. we might be able to better predict what a person will do.

Just like other time-series problems such as speech recognition and quantitative analysis, a human life could simply be defined as a series of time events. Some (I forget the quote) believe that one human life could be stored on a hard disk, and some have started to collect human lives to see whether they can be modeled.

Information about what you read can tell a lot about who you are. Do you read Arthur C. Clarke? Do you read Jane Austen? Do you read Stephen King? Do you read Lora Roberts? From that information, one could build a machine learner to reverse-map to who you are and how you make decisions. We might just call this a kind of personality modeling.

It seems to me these are entirely possible from the standpoint of what we know. Yet I still decided to share my bookshelf. Why?

Well, this was a crystal-clear moment for me (and perhaps for you as well) which helped me make the decision. Very simple: *I* am statistically insignificant.

If you happen to come to this web page, the only reason you came is because you are connected to me. How likely is that to happen?

I know about 150 persons in my life. The world has about 6 billion. So the chance of me being discovered is around 150/6,000,000,000, or 2.5 x 10^-8. That is already pretty low.

Now, when other people know me and recommend me to someone else, this probability will be boosted up, because 1) my PageRank will increase, and 2) people who follow my links deep enough will eventually discover my bookshelves.

Yet, if I try to stay low-profile (say, not do SEO and not recommend any friends to go to my page), then it is reasonable to expect the boosting factor to be smaller than 1.

Further, 2.5 x 10^-8 is an upper bound as an estimate because
1, Not all my friends are interested in me (discounting factor: 0.6, a conservative one; the actual number is probably lower but I just don't want to face it. ;) )
2, My friends who are interested in me might not follow my links (discounting factor: 0.01)

So we are talking about an event with probability as low as 10^-9 or 10^-10 here. That seems to me close to a cheap cryptographic algorithm.
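The whole estimate, as a few lines of Python (all of the input numbers are, of course, hand-waved guesses from the paragraphs above):

```python
# The discovery-probability estimate from the post, as plain arithmetic.
people_i_know = 150
world_population = 6e9
interested_fraction = 0.6    # friends actually interested in me
follow_link_fraction = 0.01  # of those, the ones who follow my links this deep

base = people_i_know / world_population
p_discovered = base * interested_fraction * follow_link_fraction
print(base)          # ~2.5e-8
print(p_discovered)  # ~1.5e-10
```

Multiplying the discounting factors through lands the estimate at roughly 10^-10, which is the low end of the range quoted above.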

But notice here: my security does not come from hiding or cryptography. My security merely comes from my statistical insignificance. In plain English, I am very open, but no one cares. And I am still a happy treebear. ;)

That's why you see my bookshelf. Long story for a simple decision. If you happen to read this, I hope you enjoy it.

-a

Visual Bookshelf

(Originally written at Wednesday, November 07, 2007)

I love to read and like to write a review for every book I read. None of them will change the world, but I still love to do it. That's why, by definition, I'm a bookworm. I don't even feel shy about it. ;)

I went quite far: I tried to record every book I read and started to put them in a blog called "ContentGeek". Luckily, I hadn't gone very far, because once I discovered Visual Bookshelf, there was no need for me to do it at all.

Visual Bookshelf allows users to look up a book from Amazon, add comments and store it in a database. It also shows the covers of the books. What more could I want?

So anyway, this is the link of my visual bookshelves:

http://www.cs.cmu.edu/~archan/personal/bookshelf.html

Enjoy.

-a