U Me N CS: Frequently Raised Questions (FRQ) of Speech Recognition

Wednesday, July 23, 2014

Posted by umencs

What is Speech recognition?

Speech is one of the oldest ways for humans to exchange information. Speech recognition (SR) converts an acoustic signal, captured through a microphone or telephone (ideally with little background noise), into a sequence of words using a pronunciation dictionary and a language grammar model.

Speech Recognition Engine

What are the applications of speech Recognition?

-Telephone and mobile IVR (Interactive Voice Response) applications.
-Computer applications for people who cannot read or write.
-Multimedia search by voice.
-Language translators.
-Voice-based desktop control applications for people with disabilities.
-Dictation software.
-Voice-based interactive medical advisory systems.
-Voice-based form-filling applications, etc.

What to consider when developing SR applications?

The design of an SR application is largely determined by the underlying Speech Recognition Engine (SRE). Design decisions depend on what type of input (continuous speech or isolated words) the SRE can handle, and when it can handle it. There are mainly two types of SRE on the market: 1) DVI (Direct Voice Input) and 2) LVCSR (Large Vocabulary Continuous Speech Recognition). DVI is the better fit for command-and-control applications, while LVCSR is the better fit for dictation-style applications.

What are the different speech recognition techniques used by SRE?

Template-based approach: the unknown utterance is compared against a set of pre-recorded words (templates) to find the best match.
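As a rough illustration of template matching, the sketch below compares an unknown utterance against stored templates using dynamic time warping (DTW), a classic way to align sequences spoken at different speeds. DTW is an assumption here, since the post does not name a specific matching method, and the one-dimensional "feature vectors", the template words and the function names are all made up for the example.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j frames of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between two frames
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(unknown, templates):
    """Return the template word whose recording is closest to the unknown utterance."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))

# Hypothetical templates: one short sequence of 1-D features per word.
templates = {
    "yes": [(1.0,), (3.0,), (2.0,)],
    "no": [(5.0,), (5.0,), (1.0,)],
}
# A slower rendition of something close to the "yes" template.
unknown = [(1.0,), (1.1,), (3.2,), (2.1,)]
best = recognize(unknown, templates)
```

A real engine would use multi-dimensional acoustic features (for example MFCCs) per frame instead of single numbers, but the alignment logic stays the same.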

Knowledge-based approach: expert knowledge about variations in speech is hand-coded into the system. Unfortunately, such expert knowledge is difficult to obtain and to use successfully.

Statistical approach: variations in speech are modelled statistically using a statistical learning procedure, typically the Hidden Markov Model (HMM). The main disadvantage of statistical models is that they must make a priori modelling assumptions.
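To make the statistical idea concrete, here is a minimal sketch of isolated-word recognition with discrete HMMs: each word has its own model, and the forward algorithm scores how likely the observed symbol sequence is under each one. The two word models, their probabilities and the quantized observation symbols (0 and 1) are invented for illustration and do not come from any trained system.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM, computed with the forward algorithm."""
    states = range(len(start))
    # Initialise with the first observation.
    alpha = [start[s] * emit[s][obs[0]] for s in states]
    # Propagate through the remaining observations.
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in states) * emit[s][o]
            for s in states
        ]
    return sum(alpha)

# Hypothetical left-to-right transition structure shared by both word models.
TRANS = [[0.5, 0.5],
         [0.0, 1.0]]

# Hypothetical word models: (start probabilities, transitions, emissions).
models = {
    "yes": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.2, 0.8]]),
    "no":  ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.8, 0.2]]),
}

observations = [0, 0, 1, 1]  # quantized acoustic symbols (assumed)
scores = {word: forward_likelihood(observations, *m) for word, m in models.items()}
best = max(scores, key=scores.get)
```

The fixed transition and emission tables are exactly the kind of a priori modelling assumption mentioned above: the number of states and the allowed transitions are chosen before any data is seen.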

Learning-based approach: to overcome this disadvantage of HMMs, machine learning methods such as neural networks and genetic algorithms were introduced.

What are the challenges in developing SR Applications?

-An SR application must cope with speech variability across users, such as differences in age, gender and accent.
-Increasing the vocabulary size decreases recognition accuracy.
-The complexity of the language for which the SR application is being developed.
-Environmental conditions such as noise and distorted signals.

What is acoustic model in SRE (Speech Recognition Engine)?

An acoustic model is created by taking audio recordings of speech together with their text transcriptions, and using software to create statistical representations of the sounds that make up each word. Tools such as HTK (Hidden Markov Model Toolkit) and its label editor HSLab are commonly used to create acoustic models.

Acoustic Model Labelling With HTK

Acoustic Model labelling With HSLAB

What is pronunciation dictionary in SRE (Speech Recognition Engine)?

A pronunciation dictionary is a simple text file that defines how each word is pronounced, typically as a sequence of phonemes. The following picture shows a sample pronunciation dictionary file.

Pronunciation Dictionary
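For readers who want to experiment, the sketch below loads a small pronunciation dictionary into a Python mapping. It assumes a simple "WORD PHONE PHONE ..." layout with one entry per line, similar to the CMU Pronouncing Dictionary. The sample entries and the function name are illustrative only, and the file format of a particular engine may differ.

```python
from collections import defaultdict

# Illustrative entries in an assumed "WORD PHONE PHONE ..." layout.
SAMPLE_DICT = """\
# word    phonemes
HELLO    HH AH L OW
GOOD     G UH D
MORNING  M AO R N IH NG
"""

def load_pronunciations(text):
    """Map each lower-cased word to a list of pronunciations (lists of phonemes)."""
    lexicon = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        lexicon[word.lower()].append(phones)
    return dict(lexicon)

lexicon = load_pronunciations(SAMPLE_DICT)
```

Each word maps to a list of pronunciations so that a word listed more than once (for example with regional variants) keeps every variant.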

What is Grammar Model in SRE (Speech Recognition Engine)?

A speech recognition grammar model is a set of word patterns that tells the speech recognition system what to expect a human to say. The following picture shows a JSGF (Java Speech Grammar Format) file for recognizing "Hello Reniguntla", "Hello Sambaiah", "Good morning Reniguntla" and "Good morning Sambaiah".

JSGF Grammar Example
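Since the picture may not be visible, here is one way such a grammar could be written, together with a small Python check of which utterances it accepts. The JSGF text is a reconstruction from the four phrases listed above, not the author's original file, and the grammar and rule names are made up. The Python code does not parse JSGF. It simply enumerates the same alternatives by hand.

```python
from itertools import product

# Reconstructed JSGF for reference only (grammar and rule names are assumptions).
JSGF_TEXT = """\
#JSGF V1.0;
grammar greetings;
public <greeting> = (hello | good morning) (reniguntla | sambaiah);
"""

# The same alternatives, written out by hand for the check below.
GREETINGS = ["hello", "good morning"]
NAMES = ["reniguntla", "sambaiah"]

def accepted_sentences():
    """All word sequences the grammar above allows."""
    return [f"{greeting} {name}" for greeting, name in product(GREETINGS, NAMES)]

def is_accepted(utterance):
    """Case-insensitive check with whitespace normalised."""
    normalized = " ".join(utterance.lower().split())
    return normalized in accepted_sentences()
```

For example, `is_accepted("Good Morning Sambaiah")` is true, while `is_accepted("Hello world")` is false because "world" is not one of the names in the grammar.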

How exactly does a Speech Recognition Engine work?

The following picture depicts the complete workflow of a Speech Recognition Engine.

Speech Recognition Engine Working
