Thanks for always coming back to our blog! Speech recognition is a subject we are very passionate about, and we want to spread news about it and its importance through our blog.
We have been discussing the challenges that speech recognition faces in many applications. Real as those challenges are, ongoing research is being conducted to solve them.
I was reading a very interesting research paper [i] the other day about some of the opportunities and challenges in speech recognition, and I wanted to share some of its interesting findings with you.
How can Automatic Speech Recognition (ASR) be improved?
Three main challenges have been identified: accuracy, throughput and latency.
1. To improve accuracy, the application needs to account for noisy environments, in which current systems do not perform well. Handling noise well makes the technology far more useful in practice.
In many circumstances, speech recognition lacks accuracy, mainly because of background noise or speaker variability.
A past approach that deals effectively with this issue is the so-called multi-stream approach, which incorporates multiple feature sets to improve performance on both small and large ASR tasks. A more recent approach is to generate many feature streams with different spectro-temporal properties: some streams are more sensitive to speech that varies at a slower rate, while others respond to faster variation.
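As a rough illustration (not taken from the paper), a two-stream front end can be sketched in a few lines of NumPy. The `stream_features` helper, the window lengths, and the random "audio" are all hypothetical choices: the point is only that a short analysis window tracks fast spectral change while a long window tracks slow change, giving two streams with different spectro-temporal sensitivity that can be combined frame by frame.

```python
import numpy as np

def stream_features(signal, win, hop):
    """Frame the signal and take the log-magnitude spectrum of each window.

    A short window is sensitive to fast spectral changes; a long window
    to slow ones, so the two streams emphasise different temporal rates.
    """
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
    spectra = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    return np.log(spectra + 1e-8)

rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)                       # 1 s of fake audio at 16 kHz

fast_stream = stream_features(audio, win=400, hop=160)   # 25 ms windows
slow_stream = stream_features(audio, win=1600, hop=160)  # 100 ms windows

# Combine the streams frame by frame (truncate to the shorter stream).
n = min(len(fast_stream), len(slow_stream))
combined = np.hstack([fast_stream[:n], slow_stream[:n]])
print(combined.shape)
```

A real system would of course use proper acoustic features (e.g. MFCCs or Gabor filter outputs) rather than raw log spectra, but the stream-combination idea is the same.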
2. To improve throughput, the application should allow batch processing of speech recognition tasks so they execute as efficiently as possible, which in turn increases their utility for multimedia search.

More recently, a data-parallel automatic speech recognition inference engine was implemented on a graphics processing unit (GPU), achieving higher speed. With substantially lower overhead, this solution promises better throughput.
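To make the batching idea concrete, here is a minimal sketch (my own, not the paper's engine) of how variable-length utterances can be padded into one array and scored in a single vectorized pass, instead of one utterance at a time. The `batch_pad` helper, the toy lengths, and the single linear "acoustic model" are all illustrative assumptions.

```python
import numpy as np

def batch_pad(utterances):
    """Pad variable-length feature sequences into one (B, T, D) array
    plus a mask, so a whole batch can be scored in one vectorized pass."""
    max_len = max(len(u) for u in utterances)
    dim = utterances[0].shape[1]
    batch = np.zeros((len(utterances), max_len, dim))
    mask = np.zeros((len(utterances), max_len), dtype=bool)
    for i, u in enumerate(utterances):
        batch[i, : len(u)] = u
        mask[i, : len(u)] = True
    return batch, mask

rng = np.random.default_rng(1)
utts = [rng.standard_normal((t, 13)) for t in (50, 80, 65)]  # 13-dim features

batch, mask = batch_pad(utts)

# Toy "acoustic model": one linear layer applied to the whole batch at once.
weights = rng.standard_normal((13, 4))   # 4 output classes, hypothetical
scores = batch @ weights                 # scores for all utterances in one pass
scores[~mask] = -np.inf                  # ignore padded frames
print(batch.shape)
```

On a GPU (e.g. via CuPy or a deep-learning framework), this single matrix multiply over the padded batch is exactly the kind of data-parallel work that yields the throughput gains described above.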
3. To improve latency, the next step would be to allow speech-based applications, such as speech-to-speech translation, to achieve real-time performance.

A central issue for latency is recognizing "who is speaking when", a process called "speaker diarization".
One current approach to online diarization consists of a training step and an online recognition step. Offline speaker diarization is performed on the first 1000 seconds of the input; speaker models are then trained, and a speech/non-speech model is taken from the system's output.
Further research is being conducted to address these ASR challenges. I strongly believe a day will come when this technology is flawless. Until then, keep visiting our blog for more news and updates on speech recognition!
[i] Makhijani R., Shrawankar U., Thakare V., "Opportunities and Challenges in Automatic Speech Recognition"