Talking Computers
Man made machines to help him perform some of his dull and
difficult work. When the machines worked wonders and provided
solutions to many of his complicated tasks, he shifted his
energy to building more powerful ones that would bring in more
prosperity without the necessity of putting in labor. And thus
the science of Artificial Intelligence emerged.
Man has taught the machine to do a lot of things, e.g. play
chess with him. But can he make a machine talk? Currently, a
lot of research is going on in this area, known as Speech
Recognition and Synthesis.
Speech recognition means understanding what someone says to a
computer: the computer translates speech into its
corresponding textual message. In speech synthesis, by
contrast, a computer reads out a stored text message.
Speech understanding is much more difficult than natural
language understanding. Words are often spoken in a run-on
manner that makes it difficult to recognize where one word
ends and another begins. Great variations in pronunciation
exist among people: regional accents can make extreme
differences in pronunciation; women generally have higher
pitched voices than men; stress and intonation patterns vary;
some people stammer and some drop endings of words; and even
the same person's speech may vary from day to day as a result
of stress, mood or something else.
Humans are able to perform speech recognition with amazing
competence, even under extremely adverse circumstances. Even
the best recognition systems of today are unable to come
anywhere near this level of performance. How is this skill
acquired? How did it develop in evolution? It is rewarding to
study how cognitive skills develop in humans or, better
still, how they develop in a child. This will prove useful
while implementing such speech recognition strategies in a
computer.
It would seem desirable to divide the speech signal into
segments of reasonable length in order to simplify
recognition. The problem of segmentation is, however, not a
simple one. It would seem attractive to choose long segments.
This would de-emphasize the problem of contextual effects,
because such effects would become relevant only in the
neighborhood of segment boundaries. If the length of the
segment is significantly greater than the average range of
contextual influence in speech, this problem can essentially
be ignored. The difficulty with long segments, however, is that
the number of labels (the vocabulary size) increases
exponentially with segment length. Vocabulary size can be kept
small if short segments are used; this, however, increases the
problem of context effects. To deal effectively with such
context influences, it becomes necessary to incorporate
knowledge about them into the recognition strategy in some
form.
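The growth in the number of labels can be illustrated with a
rough count. The sketch below assumes a hypothetical inventory
of 40 basic sound units and treats every possible sequence of
units as a distinct label. Real languages permit far fewer
combinations, so this is only an upper bound meant to show the
trend, not a measured figure.

```python
def label_count(num_units: int, segment_length: int) -> int:
    """Upper bound on the number of distinct labels when each
    segment spans `segment_length` basic units, each drawn from
    an inventory of `num_units` units."""
    return num_units ** segment_length


# ASSUMPTION: 40 basic sound units; the number is illustrative only.
NUM_UNITS = 40

for length in range(1, 5):
    print(f"segment length {length}: up to {label_count(NUM_UNITS, length)} labels")
```

Under these assumptions a segment of one unit needs at most 40
labels, while a segment of three units already allows up to
64,000.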
Word-based recognition systems operate without explicitly
incorporating techniques to deal with inter-element context
effects. In recognition systems built on shorter segments, it
becomes necessary to take care of context influences, either
implicitly or explicitly. The performance of word-based
systems is quite satisfactory for small vocabularies. As
vocabulary size increases, confusion between similar words
causes recognition accuracy to fall below acceptable limits.
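The effect of vocabulary growth on confusability can be
sketched with a toy example. The code below uses spelling edit
distance as a crude stand-in for acoustic similarity, which is
an assumption made purely for illustration; a real recognizer
compares acoustic features, not letters. The word lists are
also hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def confusable_pairs(vocabulary, max_distance=1):
    """Return word pairs that differ by at most `max_distance` edits."""
    return [(a, b) for a, b in combinations(vocabulary, 2)
            if edit_distance(a, b) <= max_distance]


# ASSUMPTION: hypothetical vocabularies chosen for illustration.
small_vocabulary = ["yes", "no", "stop", "help"]
large_vocabulary = small_vocabulary + ["shop", "held", "yet", "go"]

print(confusable_pairs(small_vocabulary))
print(confusable_pairs(large_vocabulary))
```

With the four-word list no pair is within one edit of another;
doubling the list introduces four such pairs, which is the kind
of crowding that makes larger vocabularies harder to separate.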
A speech understanding system must be able to separate
linguistically significant variations in speech signals from
insignificant variations such as variations in word
pronunciations. Background noise can contribute to aural
confusion and to the likelihood of misclassification of speech
sounds.
Speech and voice are seen in a multi-disciplinary framework:
electrical engineering (computer science, signal processing
and acoustics), linguistics (phonetics, phonology and syntax)
and psychology (psychoacoustics and cognition). Analysis
and synthesis are intimately tied.
Speech remains the most natural form of human communication,
and some early studies have shown that speech can be the most
efficient form of communication for handling many problems.
For example, one experiment showed that speech was the single
most effective medium to use in explaining to someone how to
assemble a bicycle.
Speech has an important advantage as an input medium. Casual
users with relatively little training could access a computer
that was capable of receiving speech input. Speech is our
fastest mode of communication (about twice as fast as the
average typist). Speech input also leaves users' hands free
for pointing, manipulating a display, flying an airplane, or
making a repair.
The issues involved in Speech Recognition are more complex
than those in Speech Synthesis. Whether in India or abroad,
synthesizers equipped with a large and accurate vocabulary are
still confined to the laboratory, even though unlimited speech
synthesizers hit the market at least a decade ago.
Initially, speech synthesizers were considered for the
exclusive use of the handicapped. For example, Professor
Stephen Hawking speaks through a synthesizer. But rapid
advances in computer and communication technology, coupled
with a growing need for information, have increased the
importance of speech technology for all. Some immediate
applications could be access to railway reservation status,
flight schedules and the latest share prices over the
telephone and mobile, and reading of e-mail over the phone.
Of course, some of these systems are already available. But
speech is language dependent: each tongue has its own rhythm
and stress pattern. Speech synthesizers available in the West
may not appeal to Indians. So it is important that we develop
our own synthesizers, so that we can comprehend the computer
when it talks.
India has made good progress in speech synthesis. Several
Indian institutes are dedicated to speech synthesis research
in Indian languages. The Indian Statistical Institute,
Calcutta, has developed a talking dictionary-cum-spellchecker.
Deccan College, Pune, has developed a text-to-speech
synthesizer. The biggest success has come to the team at the
Tata Institute of Fundamental Research (TIFR), Mumbai. They
have developed a continuous speech synthesizer using the
formant synthesis technique. Other Indian institutes, such as
the Central Electronics Engineering Research Institute at New
Delhi, are also engaged in research in this field.
Three people from TIFR have been seriously involved in speech
research for the last thirty years. They are Prof. Aniruddha
Sen, Prof. Xavier Furtado and Prof. Saugata Sanyal. Activity
picked up from the mid-80s, following renewed interest shown
by the Govt. in promoting Indian languages on computers and
the availability of super-fast computers for implementing
speech recognition and synthesis systems, which need a lot of
computing power. These researchers observed that it is easier
to make a machine do an expert's work (like CAD/CAM) than to
make it mimic common sense activities like reasoning, vision,
speech and understanding of language. A possible explanation
is that expert knowledge has evolved over a few centuries,
whereas common sense has been prevalent since prehistoric
times.
The TIFR systems, built mostly with inexpensive converters,
speech boards, telephone interfaces and filters, work on the
following mechanism. The user connects to the information base
by dialing into the computer, which prompts him to speak a
keyword from among the choices given. When the user responds,
the speech is conveyed to the computer in digital form. The
speech recognizer then processes it and passes the textual
query it interprets to the query handler, which retrieves the
information from the information database. The user may be
asked to specify his requirements in greater detail. Each
prompt is passed on to a Text-to-Speech synthesis system,
which generates the equivalent speech.
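The flow described above (recognizer, query handler,
information database, Text-to-Speech) can be sketched as a
small pipeline. This is only an illustration of how the stages
connect, not the TIFR implementation: every function name and
database entry below is hypothetical, the recognizer is a
stand-in that works on already-transcribed text, and the
synthesizer merely tags its output instead of producing audio.

```python
from typing import Optional

# ASSUMPTION: hypothetical keywords and placeholder answers.
INFO_DATABASE = {
    "railway": "Your reservation status is confirmed.",
    "flight": "The next flight departs at 18:30.",
    "shares": "Share prices are updated after the market closes.",
}


def recognize_keyword(utterance: str) -> Optional[str]:
    """Stand-in for the speech recognizer: return the first known
    keyword in the (already transcribed) utterance, or None."""
    for word in utterance.lower().split():
        if word in INFO_DATABASE:
            return word
    return None


def handle_query(keyword: Optional[str]) -> str:
    """Query handler: look up the answer, or build a prompt asking
    the user to choose one of the available keywords."""
    if keyword is None:
        return "Please say one of: " + ", ".join(sorted(INFO_DATABASE)) + "."
    return INFO_DATABASE[keyword]


def synthesize(text: str) -> str:
    """Stand-in for Text-to-Speech: a real system would emit audio."""
    return "[spoken] " + text


def handle_call(utterance: str) -> str:
    """Run one turn of the dialogue through all three stages."""
    return synthesize(handle_query(recognize_keyword(utterance)))


print(handle_call("I want the flight schedule"))
print(handle_call("hello"))
```

Keeping the three stages as separate functions mirrors the
description: the recognizer and synthesizer can be replaced by
real components without touching the query handler.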
However, the trio says that considerably more work is needed
on phonology and prosody to make the pronunciations more
authentic and the synthesized speech more natural sounding.
'Natural sounding' speech hasn't been achieved, though.
Decodable speech in four tones, High/Low for both males and
females, has been successfully synthesized. However, the ideal
solution is still elusive, because reading texts with correct
pronunciation and stress is a complex task whose intricacies
call for linguistic and literary knowledge and intuition.
Efforts on Text-to-Speech (TTS) synthesis concern several main
aspects. The conversion rules have to be improved,
particularly for difficult cases, such as acronyms, proper
names and abbreviations. A fast and robust syntactic parser
suited to the needs of TTS has to be developed.
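A minimal sketch of what such conversion rules might look like
for abbreviations and acronyms follows. The abbreviation table
and the rule "spell out any all-capital token letter by letter"
are assumptions chosen for illustration; a practical TTS front
end needs far richer rules, including handling of punctuation
attached to tokens and of proper names.

```python
import re

# ASSUMPTION: a tiny hypothetical abbreviation table.
ABBREVIATIONS = {
    "Prof.": "Professor",
    "Govt.": "Government",
    "etc.": "et cetera",
}

# Tokens made only of two or more capital letters are treated as acronyms.
ACRONYM = re.compile(r"[A-Z]{2,}")


def normalize(text: str) -> str:
    """Expand known abbreviations and spell acronyms letter by letter."""
    spoken = []
    for token in text.split():
        if token in ABBREVIATIONS:
            spoken.append(ABBREVIATIONS[token])
        elif ACRONYM.fullmatch(token):
            spoken.append(" ".join(token))  # "TIFR" -> "T I F R"
        else:
            spoken.append(token)
    return " ".join(spoken)


print(normalize("Prof. Sen works at TIFR in Mumbai"))
```

The difficult cases mentioned above are exactly where such
simple rules break down: an acronym that is read as a word, or
an abbreviation with more than one expansion, needs context
that a lookup table cannot supply.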
As of now, the best commercial speech recognition product is
possibly Dragon NaturallySpeaking. The software can transcribe
160 words per minute of your speech into the computer. Another
good product is ViaPhone. Microsoft has a good speech
recognition software called MSDictaphone.
Besides the IEEE journals on acoustics and speech processing,
the following is a useful site on speech:
http://www.research.att.com/
Speech recognition and synthesis have found applications not
only in robotics but also in human systems, such as security
systems and announcing systems. Moreover, their usefulness in
imparting knowledge to the handicapped is being increasingly
felt. A number of software packages in the USA scan text and
convert it into audio using a voice synthesizer. 'Jaws' is one
such popular package, and it is increasingly being used by
Indian blind students to become computer literate and more
self-reliant. There is no technology in India to match 'Jaws'.
However, its heavily accented American English is proving
difficult for some students to follow. Most importantly, 'Jaws'
costs around Rs 40,000, a sum few of the students can afford.
In conclusion, speech recognition and synthesis have wide
applications in the Indian context. Industry is interacting
with speech laboratories in a positive way, and researchers
are spending sleepless nights working their way towards making
this innovative technology a reality. The day when its
benefits become available to us is not very far off.
– Subhajit Ghosh
June 15, 2000