When we talk about speech technology, the late physicist Stephen Hawking may come to mind. His speech synthesizer, which tracked his eye movements, picked up letters one by one and read out, rather mechanically, the words and sentences thus formed. Prof. Helen Meng of the Faculty of Engineering spoke on ‘Artificial Intelligence in Speaking and Listening for Learning and Well-Being’ in the fifth lecture of ‘The Pursuit of Wisdom’ Public Lecture Series on 3 June. She shared with the 200 audience members present how artificial intelligence may be applied to enhance speech technology in aid of communication and language learning.
The speech recognition technology developed by Microsoft can transcribe human speech with a word error rate of about 5%, comparable to human performance. When it is unsure about a certain sound, it takes the context of the utterance into account to make better guesses, as humans do. Applying speech recognition technology, Professor Meng developed a language learning platform which not only identifies words accurately, but also detects mispronunciations and performs diagnosis. ‘Take, for example, the interdental fricative sounds, which are absent in Cantonese and Putonghua. Speakers whose mother tongue is Cantonese may mispronounce “thick [θɪk]” as [fɪk], while Putonghua speakers may mispronounce the same word as “sick [sɪk]”,’ said Professor Meng. The platform detects these discrepancies and generates corrective feedback. In addition to giving the correct pronunciation, it uses animation to illustrate how the sound is articulated.
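The diagnosis step described above can be pictured as aligning the learner's recognized phonemes against the canonical pronunciation and flagging substitutions. A minimal sketch, assuming the recognizer has already produced a phoneme sequence (the `diagnose` helper and the phoneme strings are illustrative, not the platform's actual implementation):

```python
# Toy phoneme-level mispronunciation diagnosis: align the canonical and the
# spoken phoneme sequences, then report any substituted segments.
from difflib import SequenceMatcher

def diagnose(canonical, spoken):
    """Return a list of (expected, actual) phoneme substitutions."""
    errors = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=canonical, b=spoken).get_opcodes():
        if tag == "replace":  # a segment was pronounced as something else
            errors.append((canonical[i1:i2], spoken[j1:j2]))
    return errors

# "thick" /θ ɪ k/ rendered as [f ɪ k] by a Cantonese speaker
print(diagnose(["θ", "ɪ", "k"], ["f", "ɪ", "k"]))  # [(['θ'], ['f'])]
```

A real system would align in the acoustic domain and score confidence per phoneme, but the substitution report is the kind of corrective feedback the platform generates.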
The challenges are amplified in moving from single words to sentences. The meaning of a sentence derives not only from the order of its words but also from whether the speaker has enunciated some part of it in a special way to achieve a certain purpose. AI-synthesized speech has to mimic this if it is to sound convincingly human. By studying how words are realized in their different manifestations in natural speech, Professor Meng’s team is able to deliver accurate and full meanings in synthesized speech. Further, voice conversion technology can carry the characteristics of a human voice in one language over into another. If the characteristics of Hawking’s voice in his spoken English are captured and analysed, it is possible to re-create his voice speaking Chinese to a Chinese audience.
In Hong Kong, around 50,000 people suffer from speech impairment, and 40% of them are unable to communicate orally. The Hospital Authority has published a special book that enables those with speech impairment to communicate by pointing at images. Professor Meng has gone a step further and developed a customizable version of the book, the e-Commu-Book. When the user clicks an icon, the e-Commu-Book reads out the corresponding word or phrase. The user can also edit the content by, say, adding the picture and name of a family member; the e-Commu-Book then converts the text into speech. Collaborating with Microsoft, Professor Meng’s team has developed the e-Commu-Book in 13 languages covering over 20 vernaculars. In recent years, Professor Meng has been dedicated to developing Cantonese smart speech systems for stroke patients and people with cerebral palsy.
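At its core, the e-Commu-Book workflow is a customizable mapping from icons to phrases, with each click handed to a text-to-speech engine. A minimal sketch of that design, assuming a stubbed `speak()` in place of a real synthesis service (the icon names and phrases are invented for illustration):

```python
# Toy model of the e-Commu-Book: icons map to phrases, clicks are spoken,
# and users can add their own entries (e.g. a family member's name).
icons = {
    "water": "I would like some water, please.",
    "pain": "I am in pain.",
}

def speak(text):
    # Stand-in for a real text-to-speech call; returns the rendered string.
    return f"[TTS] {text}"

def add_icon(name, phrase):
    """Customization: attach a new icon to a phrase of the user's choosing."""
    icons[name] = phrase

def click(name):
    """Clicking an icon sends its phrase to the speech synthesizer."""
    return speak(icons[name])

add_icon("daughter", "Please call my daughter.")
print(click("daughter"))  # [TTS] Please call my daughter.
```

The separation between the editable mapping and the synthesis call is what makes the book customizable per user without touching the speech engine.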
Like other emerging technologies, speech technology brings security concerns. ‘Some security systems use speech for identification, making speech synthesis a convenient tool for sabotage. Our attention is also turned to creating “shields” to keep synthesized speech distinct and distinguishable from human speech,’ said Professor Meng.
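One way to picture such a ‘shield’ is as a binary classifier that separates human from synthesized speech on some acoustic feature. A toy sketch only: the feature values and the midpoint-threshold rule are invented for illustration, whereas real anti-spoofing systems use rich spectral features and learned models.

```python
# Toy anti-spoofing detector: learn a threshold on a single acoustic feature
# from labeled examples, then classify new utterances against it.
def fit_threshold(human_scores, synth_scores):
    """Place the decision boundary midway between the two class means."""
    mean_h = sum(human_scores) / len(human_scores)
    mean_s = sum(synth_scores) / len(synth_scores)
    return (mean_h + mean_s) / 2

def is_synthesized(score, threshold):
    # Assumes synthesized speech scores higher on this feature (by construction).
    return score > threshold

# Made-up feature values for labeled training utterances
human = [0.2, 0.3, 0.25]
synth = [0.8, 0.7, 0.75]
t = fit_threshold(human, synth)
print(is_synthesized(0.9, t))  # True
```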
This article was originally published in No. 540, Newsletter in June 2019.