E2E Architectures for ASR systems | Verbio Technologies

End-to-end (E2E) deep learning architectures are currently one of the most discussed topics in the speech community. They are being applied to many tasks, such as automatic speaker verification (ASV), text-to-speech (TTS), and automatic speech recognition (ASR). In this post we analyze the advantages of these architectures for ASR systems and explain why Verbio is so interested in them from an industry-focused point of view.

To understand the relevance of E2E architectures for ASR, we first have to look at one of the approaches used so far: hybrid models. Hybrid systems are ASR architectures composed of three models: acoustic, lexicon, and language. The acoustic model extracts phoneme-level information from the acoustic signal. The lexicon model then maps phonemes into words. Finally, the language model builds coherent sentences from the word information produced by the lexicon model. These three models are trained individually, with similar or different technologies, and then connected into a single hybrid system.
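The three-stage flow described above can be sketched with a toy example. This is only an illustration, not how a production hybrid system is built: the phoneme sequences, the `LEXICON` dictionary, and the `BIGRAM_SCORE` table are all made-up data, and a real decoder searches over weighted hypotheses rather than exact dictionary lookups.

```python
from itertools import product

# Hypothetical acoustic-model output: one phoneme sequence per spoken word.
phoneme_chunks = [("HH", "AH", "L", "OW"), ("W", "ER", "L", "D")]

# Hypothetical pronunciation lexicon: phoneme sequence -> candidate words.
LEXICON = {
    ("HH", "AH", "L", "OW"): ["hello", "hallo"],
    ("W", "ER", "L", "D"): ["world", "whirled"],
}

# Hypothetical bigram language-model scores (higher is better).
BIGRAM_SCORE = {
    ("<s>", "hello"): 0.6, ("<s>", "hallo"): 0.1,
    ("hello", "world"): 0.7, ("hello", "whirled"): 0.05,
    ("hallo", "world"): 0.2, ("hallo", "whirled"): 0.01,
}


def lexicon_lookup(chunks):
    """Lexicon stage: map each phoneme sequence to its candidate words."""
    return [LEXICON[chunk] for chunk in chunks]


def sentence_score(words):
    """Language-model stage: multiply bigram scores along the sentence."""
    total, prev = 1.0, "<s>"
    for word in words:
        total *= BIGRAM_SCORE.get((prev, word), 1e-6)  # tiny score if unseen
        prev = word
    return total


def best_sentence(candidates):
    """Pick the word sequence the language model scores highest."""
    return max(product(*candidates), key=sentence_score)


decoded = best_sentence(lexicon_lookup(phoneme_chunks))
print(" ".join(decoded))
```

Even in this toy form, the lexicon is a hand-written resource and each stage has its own data and scoring, which is where the integration and maintenance costs discussed next come from.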

Although hybrid models have been successfully applied to ASR and have shown good performance, they have three main disadvantages:

– The acoustic, lexicon, and language models are tuned individually, often with different technologies, before being made to work together. Significant effort is therefore needed to integrate the models and make the hybrid system work, since it has not been trained as a single module. Training the models separately and then integrating them takes considerably more effort than training one joint model from the start.

– A lexicon dictionary, i.e. the knowledge needed to map phonemes into words, is hard and expensive to acquire, create, and maintain for each language. A linguist is usually needed to create and maintain these dictionaries.

– Acoustic models are expensive to build with deep neural networks (DNNs) because they need a slow audio-to-phoneme alignment step based on hidden Markov models (HMMs). This alignment requires a lot of compute time and introduces some error into the acoustic training.

End-to-end (E2E) models were created to deal with these issues. These architectures consist of a single block that jointly optimizes the acoustic, lexicon, and language models. Compared to hybrid models, they have important advantages:

– Only one model needs to be trained. The E2E block captures acoustic, lexicon, and language information jointly within the same architecture.

– E2E does not require lexicon dictionaries. An E2E model can be trained using only speech signals and their transcriptions. Therefore, any language for which these resources exist can be trained without requiring linguists or dictionaries.

– There is no word-alignment dependency. Current E2E architectures based on CTC (connectionist temporal classification), transducers, or attention-based seq2seq models do not require word-alignment resources for training. This avoids the need to create word alignments automatically.
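To make the last point more concrete, here is a minimal sketch of the CTC collapsing rule, which is one reason a CTC-trained model needs no pre-computed alignment. Many frame-level label paths map to the same transcription, and training sums over all of them. The `BLANK` symbol, the function names, and the brute-force enumeration are assumptions chosen for illustration; real implementations compute this sum with dynamic programming over model probabilities.

```python
from itertools import product

BLANK = "_"  # symbol used here for the CTC blank label (an arbitrary choice)


def ctc_collapse(path):
    """Merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)


def alignments_for(target, num_frames, alphabet):
    """Enumerate every frame-level path that collapses to `target`.

    Brute force over all paths, so only usable for tiny examples.
    """
    symbols = list(alphabet) + [BLANK]
    return [
        "".join(path)
        for path in product(symbols, repeat=num_frames)
        if ctc_collapse(path) == target
    ]


# All ways to spread the transcription "hi" over 3 audio frames.
paths = alignments_for("hi", num_frames=3, alphabet="hi")
print(paths)  # ['hhi', 'hii', 'hi_', 'h_i', '_hi']
```

Because every one of these paths counts as a valid explanation of the transcription, the training data only needs the audio and the text "hi", not the frame at which each label starts.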

From a product point of view, these technical improvements make training new languages easier and faster. Firstly, E2E architectures might make it possible to train some languages that could not be trained before due to a lack of resources (lexicon dictionaries and linguists for that language). Secondly, E2E will substantially reduce the time needed to train these languages in comparison with hybrid-based models. Finally, because this approach is based entirely on the same DNN algorithm, it is much easier to optimize in software or hardware when going into production. In conclusion, for Verbio, E2E models are a significant step forward in the development of ASR systems.