Revolutionizing your business with Speech Recognition

Saturday, May 17, 2014

Publicizing our blog and learnings on the way .......

Hola!

This is our last blog and we wanted to sign off by sharing some of the techniques we have used to publicize our blog and its effectiveness in general. We also wanted to share some of our learning’s from this exercise in order to help students like us be more effective in promoting blogs for free.

To be honest it was not an easy task to get people to our blog. We started off by using the most common social media platforms such as Facebook, Twitter and Google + to attract visitors. However, with a topic specific about a recent technological innovation, we found little interest for our blog. With so much content on the Internet awaiting people’s attention, it was hard to even get noticed from our friends and families. Luckily, things improved half way through the term when we sought feedback from our Technology Professor and friends with previous experience with blogs. One message that consistently resonated across all feedbacks was the lack of a unified message in our content. While we focused mostly on diverse applications, we did not have a story connecting the posts or any structure that covered all aspects of the technology. So the message became clear to us – we needed to improve our content first and then focus on publicity.

We therefore took a step back and posted a message that attempted to connect our previous postings with a message around why people should care about this technology. Overall, this exercise renewed our motivations and we publicized the blog using direct communication channels such as email and word of mouth in addition to Facebook. We particularly targeted the influencers among our friends and families – individuals who spent lots of time on social media and had a big following as a result. Throughout the period we constantly monitored the free statistical analysis provided by Blogger. The direct marketing through email made it easier for people to view our blog on mobile devices such as iPad and iPhone. Additionally, our international backgrounds helped us generate interest from across the world.

Lastly, we wanted to leave you with some of the techniques that we learnt along the way and which are particularly applicable to technology related blogs.

Ø Having a rough idea of the broader story or picture you want to share with your blog is critical. Thinking about this right from the beginning can make it easier for your publicity campaign. Connecting the technology with a story can also attract a broader audience base that is less familiar with technical jargons and terminology. We learnt this later but it made a huge difference to our page visits!

Ø While everyone uses Twitter, Facebook, Google+ and LinkedIn to promote blogs for their popularity, we believed that in the future we could get better attention on smaller but lesser-known networks such as Quora.com and Empire Avenue. This becomes even more relevant for a niche blog like ours seeking out focused audiences interested in emerging technologies.

Ø Although it’s a known fact that adding graphics, photos and illustrations improves site visits, we wanted to emphasize this one as our posts with pictures definitely got more clicks that the ones without.

Ø Finally, an interesting way to publicize a well-written post is to directly post the link to that specific post instead of providing a link to the blog home page. This technique of sharing the link of your best post will not only maximize the chances of the audience reading the entire post, but also the positive experience can increase the likelihood of other posts being read in the blog. We could measure the effect of this easily as the post we shared emerged as the one with the highest number of views and helped us increase the overall number of page views for our blog.

Gracias y Saludos,

N2 Team B

Speech Recognition Opportunities

Thanks for always coming back to our blog!

Speech Recognition is a subject we are very passionate about and want to spread the importance and news of it through our blog.

We have been discussing the challenges that Speech Recognition has been facing in many applications. However true and applicable those challenges may be, ongoing research is being conducted to solve these issues.

I was reading a very interesting research [i] the other day about some of the opportunities and challenges in Speech Recognition and wanted to share with you some interesting facts.

How can Automatic Speech Recognition be improved? (ASR)

Three main challenges have been identified: accuracy, throughput and latency.

1. In order to improve accuracy, the application needs to account for noisy environments in which current systems don’t perform well. This will increase the efficiency of the technology.

In many circumstances, speech recognition lacks recognition accuracy. This is mainly due to disturbing noises or variability speakers.

A past approach to effectively deal with this issue is the so called multi stream approach which incorporates multiple features of sets that help to improve performance for both small and large ASR tasks.

However, a more recent approach is to generate many feature streams with different spectro-temporal properties. The reason for that is that some streams might be more sensitive to speeches that vary at a slower rate and others might vary at a higher rate.

2. In order to improve throughput, the application should allow batch processing of the speech recognition task to execute as efficiently as possible which will therefore increase the utility for multimedia search.

More recently, a data-parallel automatic speech recognition inference engine was implemented on the graphics processing unit achieving a higher speed. With substantially lower overhead costs the solution promises a better throughput.

3. In order to improve latency, the next step would be to allow speech-based applications such as speech-to-speech translation to achieve real time performance.

The main issue with latency is to recognize “who is speaking when” which is a process called “speaker diarization”.

A current approach to online diarization consisted of a training step and an online recognition step. Basically the first 1000 seconds of the input are taken and performed offline speaker diarization. Then speaker models are trained and a speech/non-speech model are taken from the output of the system.

Further research is being conducted to improve these challenges regarding ASR. I strongly believe that there will come a day when this technology will be flawless. Until then, keep visiting our blog for more news and updates on Speech Recognition!

[i] Makhijani R, Shrawankar U, Thakare V, “Opportunities and Challenges in Automatic Speech Recognition”

Friday, May 16, 2014

Speech recognition challenges: Effects of daily use on the economy

The major challenges of speech recognition we've covered so far are the different dialects and speech variability in user as well as the low quality of input devices for speech. weren't But what if These problems? speech recognition Would be ready for everyday use ?

A few more issues Arise When These problems are corrected. First is how to Distinguish Between the speaker and background noise. How can the program know to translate your speech instead of the guy behind you in line at Starbucks? Unless the device is tucked nicely next to your mouth, how does it know who to translate?

The other issue, and May be more important, is how many jobs would be lost if the software can be perfected? Can you think of any Jobs that rely on translating speech to text?

Pretty much any type of reporter Could Potentially be eliminated, such as a court reporter.

It could take over machines drive-thru or for that matter any servers where it's not important to the business (fast-food). The only employees needed would be the cooks.

Customer service employees would be obsolete. Even though this is the trend now anyways, it Could Become completely useless in the future.

In essence, although speech recognition That is an amazing technology has many different uses, we can not completely ignore the negative effects it May have on people or the economy as a whole. We Should keep in mind These next time we use the Siri!

Wednesday, May 14, 2014

Speech Recognition - Other challenges

As we've seen on our last post, one of the major challenges of Speech Recognition is Speaker Variability.

But there are others...

And the fact that there are still so many big challenges in this theme, is what drives our interest for it. Speech Recognition is, on the one hand, a technology with huge potential for growth due to its wide possibility of use and, on the other hand, something about which that is a lot to learn and investigate.

So, what are these other main challenges?

> Quality of inputs devices (typically microphones): the microphone's too low or too high sensitivity may origin problems to the software deciphering the message. A microphone too sensitive, for example, may capture unintended sounds (other people speaking in the same room, for example) and thus prejudice the whole Speech Recognition process. Related to this, today makes sense to all of us the possibility of controlling our TV by voice instructions. "incresase volume", "next channel", "shut down TV" are basic instructions that we can imagine ourselves giving to the TV in our living room. But let's think: will the TV respond only to pre-programed "instructors"? Will it then recognize our voice over some other noise existing in the same division? Will everyone be able to give instructions? To make it work, will the ASR device have to be in the control, rather than on the TV set? These are all questions whose answers are still under a lot of research.

> Meaning and context of what is being said: although this topic has to do with the sound itself, is not so much related to the variability of the voice, but more related to the intended meaning of whoever is speaking. Words like "there" and "their”, “whole" and "hole" and "leave" and "live" are pronounced the same way but have totally different meanings. No Speech Recognition technology is yet able to take this step to identify one's meaning, and our research tells us that this is a big step that won’t probably be taken soon.

To emphasize the idea again, some of the people involved in studies in this field question if - given that there are so many problems associated to it, all the research should continue. The majority though see it like we do... the potential of Speech Recognition Technology and the early stage where we still are, makes us believe that this will be the solution to many problems and, therefore, research should and will continue.

Sunday, May 4, 2014

Speaker Variability – one of the biggest challenges to Speech Recognition

All speakers have their special voices, due to their unique physical body and personality. The same factors discussed below that make human speech unique become a challenge for Automatic Speech Recognition (ASR) to work effectively.

Realization

The realization of speech changes over time. Even if the speaker tries to sound exactly the same, there will always be some small diﬀerences in the acoustic wave we produce.

Speaking style

Speaking is a way of expressing our personality and we communicate our emotions via speech. We speak differently when we are happy, sad, frustrated, stressed, disappointed, or defensive. Our speaking styles also vary in different situations and depending on whether we are speaking with our parents, or with our friends.

The sex and age of the speaker

Men and women have different voices, and the main reason to this is that women have in general shorter vocal tract than men. Likewise, the anatomy of the vocal tract changes over time depending on the health or the age of the speaker.

Speed of speech

We speak in different modes of speed, at different times. If we are stressed, we tend to speak faster, and if we are tired, the speed tends to decrease. We also speak in different speeds if we talk about something known or something unknown.

Regional and social dialects

Regional dialects involve features of pronunciation, vocabulary and grammars, which differ according to the geographical area the speaker, come from. Social dialects are distinguished by features of pronunciation, vocabulary and grammar according to the social group of the speaker.

The long list of variations does not mean that we give up on ASR. It may seem quite unlikely that we will ever succeed to do perfect ASR, but there is definitely potential for improvement. One thing that we can consider is if humans should speak differently to computers. For instance, we could strive to be unambiguous and speak in a hypercorrect style to get the computer to understand us perfectly. Although this could simplify ASR, not all variations discussed above can be addressed. Our goal with ASR should therefore not be to have ’natural’ verbal communication with machines but rather seek efficient user interfaces.

Monday, April 28, 2014

WHY SHOULD YOU USE SPEECH RECOGNITION? – Seeing the bigger picture.

Hello friends,

Thank you for your continued interest in reading our blog on Speech Recognition. So far we have seen how this technology works and specifically how recent innovations have made it explode through multiple industries. Speech Recognition has touched everything from the daily convenience of Siri to futuristic applications such as digital dictation and real-time language translators. One important distinction we want to make here is that the depending upon the industry the need for achieving accurate and reliable recognition could vary significantly. Even though the technology is slowly making its way into our phones, tablets and personal computers, it'll still be some time before keyboards disappear from our digital lives altogether. Additionally, for the most promising applications in healthcare (documenting using deferred speech recognition) and military (automatic flight control, ATC control etc.), the variation of human speech in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed has proved to be a major roadblock. While we will address the challenges faced by Speech Recognition in detail in our future blogs, the focus today is to provide some of the benefits of the technology including the primary business value propositions. In other words – ‘WHY SHOULD YOU USE SPEECH RECOGNITION?’

Everyone from physicians to the disabled is making use of speech recognition. For physicians, any time saved from taking notes and filling out reports means more time to help patients. Likewise, for a disabled individual, regardless of their physical shortcomings, speech recognition can in many cases provide a significant level of self-sufficiency. Therefore one the biggest advantages of this technology lie in its ability to improve efficiency and provide a level of independence. Initially, speech recognition could only be used with the help of a specific program or a exact piece of equipment, however with advances in technology it is now possible for this software to be made available in not only almost any program, but also in a significantly large amount of computer systems. This means that the technology today is within reach of every individual with access to a digital device.

The ability of voice activated programs to improve human lifestyle and give you more time for more pressing tasks can simply be explained through statistics. For example, the average office worker is able to type anywhere between 50-70 words per minute. However, with the use of speech recognition programs the average typing speed can be increased to 120 words per minute at an astounding accuracy rate of 98%! With suitable training, you can also be able to leave the editing portion of the task to be completely handled through the use of voice-activated programming. With all these advancements in voice recognition software it can definitely be said that vast improvements in the lives of many and the productivity levels will surely increase.

Talking computers has been nothing more than a science fiction dream, however the real change in our lives will come when computers can listen and understand.

References:

1. http://www.gizmag.com/speech-recognition-ntnu/23870/

2. http://mirskdigital.blogspot.com.es/2013/06/800x600-normal-0-21-false-false-false.html

Saturday, April 26, 2014

Simultaneous translation by computer is getting closer

IN “STAR TREK”, a television series of the 1960s, no matter how far across the universe the Starship Enterprise travelled, any aliens it encountered would converse in fluent Californian English. It was explained that Captain Kirk and his crew wore tiny, computerised Universal Translators that could scan alien brainwaves and simultaneously convert their concepts into appropriate English words.

Science fiction, of course. But the best sci-fi has a habit of presaging fact. Many believe the flip-open communicators also seen in that first “Star Trek” series inspired the design of clamshell mobile phones. And, on a more sinister note, several armies and military-equipment firms are working on high-energy laser weapons that bear a striking resemblance to phasers. How long, then, before automatic simultaneous translation becomes the norm, and all those tedious language lessons at school are declared redundant?

Not, perhaps, as long as language teachers, interpreters and others who make their living from mutual incomprehension might like. A series of announcements over the past few months from sources as varied as mighty Microsoft and string-and-sealing-wax private inventors suggest that workable, if not yet perfect, simultaneous-translation devices are now close at hand.

Over the summer, Will Powell, an inventor in London, demonstrated a system that translates both sides of a conversation between English and Spanish speakers—if they are patient, and speak slowly. Each interlocutor wears a hands-free headset linked to a mobile phone, and sports special goggles that display the translated text like subtitles in a foreign film.

In November, NTT DoCoMo, the largest mobile-phone operator in Japan, introduced a service that translates phone calls between Japanese and English, Chinese or Korean. Each party speaks consecutively, with the firm’s computers eavesdropping and translating his words in a matter of seconds. The result is then spoken in a man’s or woman’s voice, as appropriate.

Microsoft’s contribution is perhaps the most beguiling. When Rick Rashid, the firm’s chief research officer, spoke in English at a conference in Tianjin in October, his peroration was translated live into Mandarin, appearing first as subtitles on overhead video screens, and then as a computer-generated voice. Remarkably, the Chinese version of Mr Rashid’s speech shared the characteristic tones and inflections of his own voice.

Que?

Though the three systems are quite different, each faces the same problems. The first challenge is to recognise and digitise speech. In the past, speech-recognition software has parsed what is being said into its constituent sounds, known as phonemes. There are around 25 of these in Mandarin, 40 in English and over 100 in some African languages. Statistical speech models and a probabilistic technique called Gaussian mixture modelling are then used to identify each phoneme, before reconstructing the original word. This is the technology most commonly found in the irritating voice-mail jails of companies’ telephone-answering systems. It works acceptably with a restricted vocabulary, but try anything more free-range and it mistakes at least one word in four.

The translator Mr Rashid demonstrated employs several improvements. For a start, it aims to identify not single phonemes but sequential triplets of them, known as senones. English has more than 9,000 of these. If they can be recognised, though, working out which words they are part of is far easier than would be the case starting with phonemes alone.

Microsoft’s senone identifier relies on deep neural networks, a mathematical technique inspired by the human brain. Such artificial networks are pieces of software composed of virtual neurons. Each neuron weighs the strengths of incoming signals from its neighbours and send outputs based on those to other neighbours, which then do the same thing. Such a network can be trained to match an input to an output by varying the strengths of the links between its component neurons.

One thing known for sure about real brains is that their neurons are arranged in layers. A deep neural network copies this arrangement. Microsoft’s has nine layers. The bottom one learns features of the processed sound waves of speech. The next layer learns combinations of those features, and so on up the stack, with more sophisticated correlations gradually emerging. The top layer makes a guess about which senone it thinks the system has heard. By using recorded libraries of speech with each senone tagged, the correct result can be fed back into the network, in order to improve its performance.

Microsoft’s researchers claim that their deep-neural-network translator makes at least a third fewer errors than traditional systems and in some cases mistakes as few as one word in eight. Google has also started using deep neural networks for speech recognition (although not yet translation) on its Android smartphones, and claims they have reduced errors by over 20%. Nuance, another provider of speech-recognition services, reports similar improvements. Deep neural networks can be computationally demanding, so most speech-recognition and translation software (including that from Microsoft, Google and Nuance) runs in the cloud, on powerful online servers accessible in turn by smartphones or home computers.

Quoi?

Recognising speech is, however, only the first part of translation. Just as important is converting what has been learned not only into foreign words (hard enough, given the ambiguities of meaning which all languages display, and the fact that some concepts are simply untranslatable), but into foreign sentences. These often have different grammatical rules, and thus different conventional word orders. So even when the English words in a sentence are known for certain, computerised language services may produce stilted or humorously inaccurate translations.

Google’s solution for its Translate smartphone app and web service is crowd-sourcing. It compares the text to be translated with millions of sentences that have passed through its software, and selects the most appropriate. Jibbigo, whose translator app for travellers was spun out from research at Carnegie Mellon University, works in a similar way but also pays users in developing countries to correct their mother-tongue translations. Even so, the ultimate elusiveness of language can cause machine-translation specialists to feel a touch of Weltschmerz.

For example, although the NTT DoCoMo phone-call translator is fast and easy to use, it struggles—even though it, too, uses a neural network—with anything more demanding than pleasantries. Sentences must be kept short to maintain accuracy, and even so words often get jumbled.

A universal translator that works only in conference halls, however, would be of limited use to travellers, whether intergalactic or merely intercontinental. Mr Powell’s conversation translator will work anywhere that there is a mobile-phone signal. Speech picked up by the headsets is fed into speech-recognition software on a nearby laptop, and the resulting text is sent over the mobile-phone network to Microsoft’s translation engine online.

One big difficulty when translating conversations is determining who is speaking at any moment. Mr Powell’s system does this not by attempting to recognise voices directly, but rather by running all the speech it hears through two translation engines simultaneously: English to Spanish, and Spanish to English. Since only one of the outputs is likely to make any sense, the system can thus decide who is speaking. That done, it displays the translation in the other person’s goggles.

At the moment, the need for the headsets, cloud services and intervening laptop means Mr Powell’s simultaneous system is still very much a prototype. Consecutive, single-speaker translation is more advanced. The most sophisticated technology currently belongs to Jibbigo, which has managed to squeeze speech recognition and a 40,000-word vocabulary for ten languages into an app that runs on today’s smartphones without needing an internet connection at all.

Nani?

Some problems remain. In the real world, people talk over one another, use slang or chat on noisy streets, all of which can foil even the best translation system. But though it may be a few more years before “Star Trek” style conversations become commonplace, universal translators still look set to beat phasers, transporter beams and warp drives in moving from science fiction into reality.

From the print edition: Science and technology