People like to find ways to simplify their work. It is no surprise, then, that as artificial intelligence technology continues to improve, we are witnessing a revolution in the use of speech recognition technology to assist us in our daily tasks. By learning from real-world data, speech recognition systems such as ours here at Verbio Technologies are able to grow and improve, in turn contributing to this technological revolution.
Given that voice recognition systems base their learning on real-world speech examples, let us take a moment to reflect on the significance of the data itself and the extent to which the data can influence the resulting technology. As with any system that learns by example, it is clear that the more data it has to learn from, the more robust a system becomes. However, it is also important to take into consideration the very high cost of time that database collection and cleaning require. In the end, every training dataset can only be a finite sample of the real world. So how exactly does the data we select affect the resulting voice recognition system?
A clear consequence of working with datasets limited in volume is that the resulting system may achieve a lower word recognition success rate than originally expected. This increase in error stems from the underrepresentation in the dataset of certain speech examples that do occur in real-world interactions with the system. However, there are also less obvious consequences of using finite data samples to represent the real world.
As Dr. Joan Bajorek [HBR] explains, a speech recognition system may perform worse for a particular gender or group of people if they are not well represented in the training dataset. According to the article, women, and especially minority women, tend to experience lower-quality speech recognition than men in commonly used products such as mobile phones and smart devices.
Here at Verbio, eliminating performance biases such as these is one of the reasons we strive for our automatic voice transcription to be the very best, and aim to make our users the center of our work. We have a passion for voice and we don’t want to miss a single detail of it, regardless of our users’ gender, accent, or dialect. We love everything that makes a voice unique. That passion drives us to strive for perfection with our technology and inspires us to continuously improve our systems to achieve excellent results for all our speech recognition users.
How do we make our speech recognition all-inclusive?
There are three main ways to solve this speech recognition problem:
1) Adding training data to cover particular “misrecognized” groups
2) Manually adding specific underperforming examples to the dataset
3) Designing algorithms that are more robust and attentive to all users’ differing speech characteristics
Modifying data or adding examples
At Verbio we make sure that our training data includes a balanced set of examples covering the features that can affect speech recognition, such as gender, age, acoustic environment, microphone quality, and dialect or speaking style. We ensure that our system learns from approximately equal amounts of female and male voice recordings. We also ensure that different age ranges are well represented, as the voices of children are quite different from those of adults or the elderly. Finally, we strive to include recordings of speakers whose dialect or speaking style may affect recognition. By continuously monitoring and evaluating our speech recognition systems, we are able to pinpoint less common regional accents with lower recognition rates and feed our models real-world speech examples from those areas, building more robust and versatile systems that give every user the best possible recognition experience.
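As a concrete illustration of this kind of balancing (a minimal sketch, not Verbio’s production tooling — the metadata fields and helper function here are hypothetical), one simple strategy is to downsample each demographic group in a recording manifest to the size of the smallest group:

```python
import random
from collections import Counter

# Hypothetical recording metadata; a real pipeline would read this
# from a dataset manifest rather than build it in memory.
recordings = (
    [{"id": f"f{i}", "gender": "female"} for i in range(1200)]
    + [{"id": f"m{i}", "gender": "male"} for i in range(1800)]
)

def balance_by(recordings, key, seed=0):
    """Downsample every group under `key` to the size of the smallest group."""
    groups = {}
    for rec in recordings:
        groups.setdefault(rec[key], []).append(rec)
    target = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, target))
    return balanced

balanced = balance_by(recordings, "gender")
print(Counter(rec["gender"] for rec in balanced))  # equal counts per gender
```

The same function could balance on any metadata field — age range, microphone type, or dialect — at the cost of discarding data from the majority group; in practice one can also oversample or weight the minority group instead.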
Another way to overcome the bias issue is by adding knowledge manually. Generally, speech recognition systems strive to be as generic as possible, yielding acceptable results in most environments and use cases. However, to achieve truly excellent results, we can adapt a system to specific use cases and particular conversational features of a speaker or regional accent. We do this at Verbio by creating multiple language models of the same language, which we can then modify to understand multiple regional dialects and accents within one language.
Speech recognition systems often use two types of language models in combination: one based on rules written by a linguist, and another learned automatically from text.
By modifying the rules to take into account regional expressions or vocabulary that is less common in the generic language, we are able to develop language models that are aware of both generic language patterns and the language structures that make certain regions unique. Examples range from programming slang words and distinct phrases into language models to expanding our list of alternative dictionary pronunciations to cover variations from region to region. These methods of including specific regional phrases and pronunciation alternatives have the advantage of being very quick to implement. And as the generic language itself evolves, we can also add new phrases in bulk to the datasets we use to train our statistical language models, ensuring that our systems never become outdated.
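To make the pronunciation-variant idea concrete, here is a minimal sketch of a toy pronunciation lexicon. The phone symbols and helper function are purely illustrative assumptions, not Verbio’s actual lexicon format:

```python
# Toy pronunciation lexicon: word -> list of phoneme strings.
# The phone symbols below are illustrative, not a real phone inventory.
lexicon = {
    "water": ["W AO T ER"],  # generic pronunciation
}

def add_variant(lexicon, word, pronunciation):
    """Register an alternative regional pronunciation if it is new."""
    pronunciations = lexicon.setdefault(word, [])
    if pronunciation not in pronunciations:
        pronunciations.append(pronunciation)

# A hypothetical non-rhotic regional variant added alongside the generic one;
# the recognizer can then accept either pronunciation for the same word.
add_variant(lexicon, "water", "W AO T AH")
print(lexicon["water"])
```

Because the lexicon is just a word-to-pronunciations mapping, regional variants can be added without retraining the acoustic model, which is one reason this kind of adaptation is so quick to deploy.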
In addition to adding new training examples or introducing a particular set of linguistic rules, the third way to reduce bias in a speech recognition system is to design algorithms that are in their nature robust and sensitive to variation in language. At Verbio this is an ongoing task.
Historically, we have used signal processing that attempts to remove speech characteristics not required for transcription. That is, before transcribing a message, our algorithms strive to neutralize and discard the biometric information about the speaker’s voice that is not needed to transcribe the message into text. Mel-frequency cepstral coefficients (MFCCs) and convolutional neural networks (CNNs) are both methods used in sound processing to accomplish this. These systems are not perfect, as separating a person’s biometric information from the rest of their utterance has proven to be a challenging task.
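For readers curious about the mel-frequency part, the mapping at the heart of an MFCC front end can be sketched with the standard mel-scale formula. This is the widely used generic formula, not a description of Verbio’s internal pipeline; the filterbank parameters below are arbitrary illustrative choices:

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale mapping used in MFCC front ends."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place triangular mel filter centers."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Center frequencies of a small mel filterbank between 0 Hz and 8 kHz.
# Spacing the centers evenly in mels (not hertz) mimics how human hearing
# resolves low frequencies more finely than high ones.
n_filters = 10
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
centers_hz = [
    mel_to_hz(lo + i * (hi - lo) / (n_filters + 1))
    for i in range(1, n_filters + 1)
]
```

Stacking such filters, taking log energies, and applying a discrete cosine transform yields the cepstral coefficients; which of those coefficients to keep or discard is one of the levers for suppressing speaker-specific detail.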
As a result, we are ceaselessly working to make our speech processing algorithms immune to possible data bias, continuing to monitor our speech recognition performance in the real world and ensuring that our users are not only satisfied but impressed with our system’s capabilities. Every day, as we evaluate our own performance, we focus on targeting and improving the cases where we have failed, because to us every interaction is important, and we believe that personalization is the key to making our system truly flourish.
Of course, because languages are so varied and fluid, it is no simple task to pinpoint every person’s distinctive way of speaking. At the end of the day, we recognize that even humans struggle to understand one another from time to time. But at Verbio we feel driven to create a personalized experience that is equally responsive to every user’s voice. There is no doubt that we have a long way to go to achieve perfection, but we believe that the more we strive for it, the closer we get, and that is why we are already on our way there.