Saturday, May 17, 2014

Speech Recognition Opportunities

Thanks for always coming back to our blog!

Speech Recognition is a subject we are very passionate about, and we use this blog to share news about it and why it matters.
We have been discussing the challenges that Speech Recognition faces in many applications. As real as those challenges are, ongoing research is working to solve them.

I was reading a very interesting research paper [i] the other day about some of the opportunities and challenges in Speech Recognition, and I wanted to share a few of its interesting points with you.

How can Automatic Speech Recognition (ASR) be improved?
Three main challenges have been identified: accuracy, throughput and latency.

1.   To improve accuracy, the application needs to cope with noisy environments, in which current systems perform poorly; handling noise well would make the technology far more effective.
In many circumstances, speech recognition lacks accuracy, mainly because of background noise or speaker variability.
One established way to deal with this issue is the so-called multi-stream approach, which combines multiple feature sets and improves performance on both small and large ASR tasks.
A more recent variant generates many feature streams with different spectro-temporal properties, since some streams are more sensitive to speech that changes slowly while others pick up faster variations.
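To make this concrete, here is a minimal sketch of a multi-stream front end in Python. It assumes NumPy/SciPy for feature extraction and treats the per-stream acoustic models as hypothetical callables: two log-spectrogram streams are computed with different window lengths (a short window favoring temporal detail, a long window favoring spectral detail), and their frame-level posteriors are fused by simple averaging. This is just one way to realize the idea, not the specific system studied in the paper.

```python
# Minimal multi-stream sketch. NumPy/SciPy are assumed; the per-stream acoustic
# models are hypothetical callables that map features to frame posteriors.
import numpy as np
from scipy.signal import stft

def log_spectrogram(signal, sample_rate, window_ms, hop_ms=10):
    """One feature stream: log-magnitude spectrogram at a given window length."""
    nperseg = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    _, _, spec = stft(signal, fs=sample_rate, nperseg=nperseg,
                      noverlap=nperseg - hop)
    return np.log(np.abs(spec) + 1e-10).T          # shape: (frames, freq_bins)

def multi_stream_posteriors(signal, sample_rate, stream_models):
    """Score each stream with its own model, then fuse by averaging posteriors.

    stream_models: list of (window_ms, model) pairs, where model(features)
    returns an array of shape (frames, classes).
    """
    per_stream = []
    for window_ms, model in stream_models:
        features = log_spectrogram(signal, sample_rate, window_ms)
        per_stream.append(model(features))
    # Streams may differ by a frame or two, so align on the shortest one.
    n_frames = min(p.shape[0] for p in per_stream)
    # Simple fusion rule; real systems often weight streams by reliability.
    return np.mean([p[:n_frames] for p in per_stream], axis=0)
```

In practice the fusion rule is often more sophisticated, for example weighting each stream by an estimate of how reliable it is under the current acoustic conditions.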

2.    To improve throughput, the application should support batch processing so that recognition tasks execute as efficiently as possible, which in turn increases ASR's utility for multimedia search.
More recently, a data-parallel speech recognition inference engine was implemented on a graphics processing unit (GPU), achieving substantially higher speed; with much lower overhead per utterance, this approach promises better throughput.
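The batching idea itself can be sketched in a few lines of Python. The recognize_batch function below is a hypothetical placeholder for a GPU (or any data-parallel) inference engine; the point is simply that grouping utterances into batches pays per-call overhead, such as host-to-device transfers, once per batch instead of once per utterance.

```python
# A sketch of batched decoding to improve throughput. `recognize_batch` is a
# hypothetical stand-in for a data-parallel (e.g. GPU-based) inference engine
# that takes a list of utterances and returns a list of transcripts.
def decode_in_batches(utterances, recognize_batch, batch_size=32):
    """Group utterances into batches so per-call overhead (kernel launches,
    host-to-device transfers, model setup) is paid once per batch rather than
    once per utterance."""
    transcripts, batch = [], []
    for utterance in utterances:
        batch.append(utterance)
        if len(batch) == batch_size:
            transcripts.extend(recognize_batch(batch))
            batch = []
    if batch:                       # flush the final, possibly partial, batch
        transcripts.extend(recognize_batch(batch))
    return transcripts
```

The batch size would normally be tuned to the accelerator's memory and to how long the application can afford to wait before results come back.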

3.    To improve latency, the next step is to let speech-based applications such as speech-to-speech translation achieve real-time performance.
A central problem here is recognizing “who is speaking when”, a process called “speaker diarization”.
One current approach to online diarization consists of a training step followed by an online recognition step: offline speaker diarization is run on the first 1000 seconds of the input, and speaker models together with a speech/non-speech model are then trained from that output.
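A minimal Python sketch of that two-step scheme is below. The functions offline_diarize, train_speaker_model, and train_speech_model are hypothetical placeholders for whatever components the real system uses (for example, GMM speaker models and a speech/non-speech detector); the sketch only shows the control flow of bootstrapping offline and then labeling segments online.

```python
# Sketch of the train-then-recognize diarization scheme described above.
# `offline_diarize`, `train_speaker_model`, and `train_speech_model` are
# hypothetical placeholders, not a specific library API.

BOOTSTRAP_SECONDS = 1000   # portion of the input used for the offline training step

def bootstrap_models(audio, sample_rate, offline_diarize,
                     train_speaker_model, train_speech_model):
    """Run offline diarization on the first 1000 s and fit models from its output."""
    head = audio[: BOOTSTRAP_SECONDS * sample_rate]
    segments = offline_diarize(head, sample_rate)    # [(start, end, speaker), ...]
    speakers = {spk for _, _, spk in segments}
    speaker_models = {
        spk: train_speaker_model([s for s in segments if s[2] == spk],
                                 head, sample_rate)
        for spk in speakers
    }
    speech_model = train_speech_model(head, segments)
    return speaker_models, speech_model

def label_segment(segment_features, speaker_models, speech_model):
    """Online step: decide 'who is speaking when' for one incoming segment."""
    if not speech_model.is_speech(segment_features):
        return "non-speech"
    scores = {spk: model.score(segment_features)
              for spk, model in speaker_models.items()}
    return max(scores, key=scores.get)               # best-matching known speaker
```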

Further research is being conducted to address these ASR challenges. I strongly believe that a day will come when this technology is flawless. Until then, keep visiting our blog for more news and updates on Speech Recognition!





[i] Makhijani, R., Shrawankar, U., and Thakare, V., "Opportunities and Challenges in Automatic Speech Recognition."
