This is a guest post from Quenton Hall, AI System Architect for Industrial, Vision, Healthcare and Sciences Markets.
You may not be intimately familiar with Baidu's DeepSpeech2 Automatic Speech Recognition model (Amodei et al., 2015), but I am willing to bet that if you are reading this, speech recognition is now part of your daily life.
The roots of ASR technology date back to the late 1940s and early 1950s. In 1952, Bell Labs (Davis, Biddulph, and Balashek) designed "AUDREY", an "Automatic Digit REcognition" device that could recognize the digits 0–9. The system could be trained (tuned, actually) per user and achieved accuracies beyond 90% for speaker-dependent recognition and roughly 50–60% for speaker-independent recognition.
Over the next six decades, many different techniques were developed for ASR, often leveraging atomic, phoneme-level, hand-crafted feature extraction, and eventually deep neural networks trained for phoneme and, later, word recognition. Such systems were often proprietary, difficult to train, and suffered from limited accuracy and limited vocabulary. They largely remained "toys".
In 2015, Baidu, often known as the "Google of China", released DeepSpeech2. Andrew Ng (yes, that Andrew Ng), Baidu's Chief Scientist, who led the development of this novel end-to-end ASR model, announced: "In the future, I would love for us to be able to talk to all of our devices and have them understand us. I hope to someday have grandchildren who are mystified at how, back in 2016, if you were to say 'Hi' to your microwave oven, it would rudely sit there and ignore you." While I haven't spoken to my microwave lately, it is very clear that the day is coming when I will (my wife would tell you that I do already, but as Paul Sutherland used to say on that Hammy Hamster show, "...that's another story...").
End-to-end ASR systems have the advantage that they do not rely on hand-crafted features, nor are they trained on phonemes or words, but rather on large corpora of labeled speech data. DeepSpeech2 leveraged a combination of convolutional, fully connected, and bidirectional LSTM layers and was trained to recognize both Mandarin and English. The model was trained on datasets incorporating challenging accents and background noise. The net result was a model that could outperform human transcription. Transcriptions of accented and noisy speech samples resulted in higher WERs (word error rates) than those of human transcribers, but it is anticipated that this was largely due to the limited size (accent) and synthetic (noise) nature of the related training data.
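For readers unfamiliar with the WER metric used throughout ASR benchmarking, here is a minimal sketch of how it is typically computed: the word-level Levenshtein (edit) distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and sample sentences below are illustrative only, not taken from Baidu's or Xilinx's implementations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    i.e. word-level Levenshtein distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the") over
# a 6-word reference gives a WER of 2/6, about 0.33.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why values are usually reported as percentages rather than bounded scores.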
Xilinx customers who are leveraging Xilinx low-cost MPSoC devices in embedded applications will soon have the opportunity to leverage DeepSpeech2 in their platforms. The skilled Xilinx design team in India has implemented a complete model in C/C++, trained on the LibriSpeech dataset, supporting English ASR.
Because the model is coded in C/C++, there is no need for runtime interpretation or inference frameworks, resulting in optimal performance.
The Xilinx implementation is a dense (not pruned) model that is deployed without quantization and leverages a 2-layer CNN and a 5-layer bidirectional LSTM (672, 800). The model achieves a WER of 10.357 while using only 250MB of PS-DDR memory.
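A quick back-of-envelope calculation suggests how an unquantized model of this shape fits in a 250MB budget. The post does not publish the exact layer dimensions, so the sketch below rests on assumptions: "(672, 800)" is read as 672 input features and a hidden size of 800, the forward and backward outputs of each bidirectional layer are summed (as in the DeepSpeech2 paper) so later layers see 800 inputs, weights are 32-bit floats, and the CNN and output-layer parameters are ignored.

```python
def lstm_params(input_size: int, hidden_size: int) -> int:
    # An LSTM has 4 gates, each with an input weight matrix,
    # a recurrent weight matrix, and a bias vector.
    return 4 * hidden_size * (input_size + hidden_size + 1)

HIDDEN = 800
# Layer 1 sees the 672 assumed input features; bidirectional = two directions.
layer1 = 2 * lstm_params(672, HIDDEN)
# Four more bidirectional layers, each fed the (summed) 800-wide output.
later_layers = 4 * 2 * lstm_params(HIDDEN, HIDDEN)

total_params = layer1 + later_layers
total_mb = total_params * 4 / (1024 ** 2)  # fp32: 4 bytes per weight
print(f"{total_params:,} LSTM parameters, ~{total_mb:.0f} MB in fp32")
```

Under these assumptions the LSTM stack alone comes to roughly 50 million parameters, or on the order of 200MB in fp32, which is at least consistent with the 250MB figure quoted above once activations and the remaining layers are added.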
If you are interested in ASR at the edge, reach out to your local Xilinx FAE or sales team who can put you in the queue for a live demonstration!