Man made machines to help him perform some of his dull and difficult work. When the machines worked wonders and provided solutions to many of his complicated tasks, he shifted his energy to building more powerful ones that would bring in more prosperity easily, without the need for labour. And thus the science of Artificial Intelligence emerged.
Man has taught the machine to do a lot of things, e.g. play chess with him. But can he make a machine talk? Currently, a great deal of research is going on in this area, known as speech recognition and synthesis.
Speech recognition is, essentially, understanding what someone speaks to a computer: asking the computer to translate speech into its corresponding textual message. In speech synthesis, by contrast, the computer reads out a stored text message.
Speech understanding is much more difficult than natural language understanding. Words are often spoken in a run-on manner, making it difficult to recognize where one word ends and another begins. Great variations in pronunciation exist among people: regional accents can make extreme differences in pronunciation; women generally have higher-pitched voices than men; stress and intonation patterns vary; some people stammer and some drop the endings of words; and even the same person's speech may vary from day to day as a result of stress, mood or something else.
Humans are able to perform speech recognition with amazing competence, even under extremely adverse circumstances. Even the best recognition systems of today are unable to come anywhere near this level of performance. How is this skill acquired? How did it develop in evolution? It is rewarding to study how cognitive skills develop in humans or, better still, how they develop in a child. This will prove useful when implementing such speech recognition strategies in a computer.
It would seem desirable to divide the speech signal into segments of reasonable length, in order to simplify recognition. The problem of segmentation, however, is not a simple one. It would seem attractive to choose long segments. This would de-emphasize the problem of contextual effects, because such effects would become relevant only in the neighborhood of segment boundaries. If the length of the segment is significantly greater than the average range of contextual influence in speech, this problem can essentially be ignored. The difficulty with long segments, however, is that the number of labels (the vocabulary size) increases exponentially. The vocabulary size can be kept small if short segments are used; this, however, aggravates the problem of context effects. To deal effectively with such context influences, it becomes necessary to incorporate knowledge about them into the recognition strategy in some form.
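The exponential growth mentioned above is easy to make concrete. As a rough sketch, assume a hypothetical inventory of 40 phoneme-like labels: a segment spanning k such units can in principle carry any of 40^k label sequences, so doubling the segment length squares the label vocabulary.

```python
# Sketch of the segment-length trade-off described above.
# The inventory size of 40 is a hypothetical stand-in for the number of
# distinct phoneme-like units in a language.

NUM_PHONEMES = 40  # hypothetical inventory size

def segment_vocabulary_size(segment_length: int) -> int:
    """Number of possible labels for a segment covering `segment_length` units."""
    return NUM_PHONEMES ** segment_length

for k in (1, 2, 3, 4):
    print(f"segment length {k}: {segment_vocabulary_size(k)} possible labels")
```

Short segments keep this number manageable (40 labels for single units) at the cost of strong context effects at every boundary.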
Word-based recognition systems operate without explicitly incorporating techniques to deal with inter-element context effects; in such systems, context influences must be taken care of either implicitly or explicitly. The performance of word-based systems is quite satisfactory for small vocabularies. As the vocabulary size increases, confusion between similar words causes recognition accuracy to fall below acceptable limits.
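A word-based recognizer of the kind described here can be caricatured as template matching: each vocabulary word is stored as a reference feature vector, and an input utterance is labelled with the nearest template. The vocabulary, features and distances below are all hypothetical; real systems compare variable-length feature sequences (for instance with dynamic time warping) rather than fixed-length vectors.

```python
import math

# Toy word-based ("template") recognition: label an input utterance with
# the vocabulary word whose stored feature vector lies closest to it.
# The 2-D "features" are hypothetical stand-ins for real acoustic features.

templates = {
    "yes":  (0.9, 0.1),
    "no":   (0.1, 0.8),
    "stop": (0.5, 0.5),
}

def recognize(features):
    """Return the vocabulary word whose template is nearest to `features`."""
    return min(templates, key=lambda word: math.dist(templates[word], features))

print(recognize((0.85, 0.15)))  # nearest template is "yes"
```

The failure mode described in the text shows up directly here: as more words are added, templates crowd together in feature space, and similar-sounding words become easy to confuse.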
A speech understanding system must be able to separate linguistically significant variations in speech signals from insignificant ones, such as differences in word pronunciation. Background noise can contribute to aural confusion and to the likelihood of misclassification of speech sounds.
Speech and voice are seen in a multi-disciplinary framework: electrical engineering (computer science, signal processing and acoustics), linguistics (phonetics, phonology and syntax) and psychology (psychoacoustics and cognition). Analysis and synthesis are intimately tied together.
Speech remains the most natural form of human communication, and some early studies have shown how speech can be the most efficient form of communication for handling many problems. For example, one experiment showed that speech was the single most effective medium for explaining to someone how to assemble a bicycle.
Speech has important advantages as an input medium. Casual users with relatively little training can access a computer capable of receiving speech input. Speech is our fastest mode of communication (about twice as fast as the average typist). And speech input leaves users' hands free for pointing, manipulating a display, flying an airplane, or making a repair.
The issues involved in speech recognition are more complex than those in speech synthesis. Whether in India or abroad, recognizers equipped with large and accurate vocabularies are still confined to the laboratory, even though unlimited-vocabulary speech synthesizers hit the market at least a decade ago.
Initially, speech synthesizers were considered for the exclusive use of the handicapped. For example, Professor Stephen Hawking speaks through a synthesizer. But rapid advances in computer and communication technology, coupled with a growing need for information, have increased the importance of speech technology for all. Some immediate applications could be access to railway reservation status, flight schedules and the latest share prices over the telephone and mobile, reading of e-mail over the phone, etc.
Of course, some of these systems are already available. But speech is language-dependent: each tongue has its own rhythm and stress patterns, and speech synthesizers available in the West may not appeal to Indians. So it is important that we develop our own synthesizers, so that we can comprehend the computer when it talks.
India has made good progress in speech synthesis. Several Indian institutes are dedicated to speech synthesis research in Indian languages. The Indian Statistical Institute, Calcutta, has developed a talking dictionary-cum-spellchecker. Deccan College, Pune, has developed a text-to-speech synthesizer. The biggest success has been for the team at the Tata Institute of Fundamental Research (TIFR), Mumbai, which has developed a continuous speech synthesizer using the formant synthesis technique. Some other Indian institutes, such as the Central Electronics Engineering Research Institute at New Delhi, are also engaged in research in this field.
Three persons from TIFR have been seriously involved in speech research for the last thirty years: Prof. Aniruddha Sen, Prof. Xavier Furtado and Prof. Saugata Sanyal. The activities picked up from the mid-80s, following renewed interest shown by the Government in promoting Indian languages on computers, and the availability of super-fast computers for implementing speech recognition and synthesis systems, which need a lot of computing power. These researchers observed that it is easier to make a machine do an expert's work (like CAD/CAM) than to make it mimic common-sense activities like reasoning, vision, speech and the understanding of language. A possible explanation is that expert knowledge has evolved over a few centuries, whereas common sense has been prevalent since pre-historic times.
The TIFR systems, built mostly with inexpensive converters, speech boards, telephone interfaces and filters, work on the following mechanism. The user connects to the information base by dialling into the computer, which prompts him to speak a keyword from among the choices given. When it receives input from the user, the speech is conveyed to the computer in digital form. The speech recognizer then processes it and passes the textual query it interprets to the query handler, which accesses the information from the information database. The user may be asked to specify his requirements in greater detail. The prompt is actually passed on to a text-to-speech synthesis system, which generates the equivalent speech. However, the trio says that considerably more work on phonology and prosody is needed to make the pronunciations more authentic and the synthesized speech more natural sounding.
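The recognize-query-synthesize loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the function names, the one-entry "information base" and the perfect recognizer are illustrative only, while the real TIFR system involves telephony hardware, an actual speech recognizer and a formant-based synthesizer.

```python
# Minimal sketch of the dial-in query pipeline described above.
# All names and data are hypothetical stand-ins.

TIMETABLE = {"rajdhani express": "departs 16:00, arrives 08:35"}  # stand-in database

def recognize_keyword(audio: bytes) -> str:
    """Stand-in for the speech recognizer: audio in, keyword text out."""
    return audio.decode()  # pretend recognition is perfect

def query_handler(keyword: str) -> str:
    """Look up the recognized keyword in the information base."""
    return TIMETABLE.get(keyword, "no information found")

def synthesize(text: str) -> bytes:
    """Stand-in for the text-to-speech synthesizer: text in, audio out."""
    return text.encode()

def handle_call(audio: bytes) -> bytes:
    """Recognize, query, then synthesize, as in the pipeline above."""
    return synthesize(query_handler(recognize_keyword(audio)))

print(handle_call(b"rajdhani express"))
```

The design point is that recognition, querying and synthesis are independent stages: the recognizer never touches the database, and the synthesizer only ever sees plain text.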
'Natural-sounding' speech hasn't been achieved, though. Decodable speech in four tones, high and low for both males and females, has been successfully synthesized. However, the ideal solution is still elusive, because reading texts with correct pronunciation and stress habits is a complex task, with intricacies involving linguistic and literary knowledge and intuition.
Efforts on text-to-speech (TTS) synthesis concern several main aspects. The conversion rules have to be improved, particularly for difficult cases such as acronyms, proper names and abbreviations. A fast and robust syntactic parser suited to the needs of TTS also has to be developed.
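To illustrate what such conversion rules look like, here is a minimal text-normalization sketch: abbreviations are expanded from a table and all-capital acronyms are spelled out letter by letter before the text would be converted to phonemes. The rules and examples are hypothetical, not the actual rule set of any TTS system mentioned here.

```python
import re

# Sketch of one TTS preprocessing step: expand abbreviations and spell
# out acronyms before phonetic conversion. Rules are hypothetical examples.

ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint", "etc.": "et cetera"}

def normalize(text: str) -> str:
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Spell out runs of two or more capitals letter by letter, e.g. TIFR -> T I F R.
    return re.sub(r"\b([A-Z]{2,})\b", lambda m: " ".join(m.group(1)), text)

print(normalize("Dr. Sen works at TIFR."))
```

Even this toy version shows why the problem is hard: "St." may mean "Saint" or "Street", and a fixed table cannot decide without context, which is exactly why the conversion rules need improving.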
As of now, the best commercial speech recognition product is possibly Dragon NaturallySpeaking, which can transcribe 160 words per minute of your speech into the computer. Another good product is IBM's ViaVoice. Microsoft also has a good speech recognition product, MSDictaphone.
Besides the IEEE journals on acoustics and speech processing, several websites on speech are useful resources.
Speech recognition and synthesis have found applications not only in robotics but also in human systems, such as security systems, announcing systems, etc. Moreover, their usefulness in imparting knowledge to the handicapped is being increasingly felt. There are a number of software packages in the USA which scan text and convert it into audio using a voice synthesizer. 'Jaws' is one such popular package, and it is increasingly being used by blind Indian students to become computer-literate and more self-reliant. There is no technology in India to match 'Jaws'. However, its heavily accented American English is proving difficult for some students to follow. Most importantly, 'Jaws' costs around Rs 40,000, a sum few of the students can afford.
In conclusion, speech recognition and synthesis have wide applications in the Indian context. Industry is interacting with speech laboratories in a positive way, and very soon the benefits of this technology will be made available to us, because researchers are spending sleepless nights working to make this innovative technology a reality. That day is not very far off.