Automatic speech recognition (ASR) is the task of automatically extracting the textual information carried by a raw audio signal, essentially converting speech to text. With the technological advances of the last few years, and especially the advent of deep learning, machines now achieve remarkable accuracy on this task, particularly under relatively clean acoustic conditions. Moreover, while ASR has traditionally relied on complex architectures involving several different modules and components, end-to-end solutions are now successfully employed.

Automatic speech recognition revolutionizes human-computer interaction.
Photo by Malte Helmhold on Unsplash.
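
To make the notion of end-to-end ASR concrete, here is a minimal sketch that transcribes a short audio file with a publicly available pretrained model (Wav2Vec2 via Hugging Face Transformers). The model choice, the placeholder file name `speech.wav`, and the 16 kHz sampling rate are illustrative assumptions and are unrelated to the production system discussed later; the point is that a single neural network plus greedy CTC decoding takes you from waveform to text.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Publicly available end-to-end model (illustrative choice, not the system discussed here).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# "speech.wav" is a placeholder path; the model expects 16 kHz mono audio.
waveform, sample_rate = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits  # frame-level character scores

# Greedy CTC decoding: pick the best character per frame, then collapse repeats and blanks.
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```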

Despite these advances in both performance and simplicity, there is still room for improvement. One aspect that holds promise for improved ASR performance is speaker (or, more broadly, domain/environment) adaptation and personalization. Since ASR systems depend on the data they are trained on, their performance can degrade when the conditions of use differ substantially from the training conditions. Adaptation tries to compensate for this potential mismatch between training and testing data.

One direction for personalization is adapting ASR models to a particular speaker's acoustic characteristics. This is precisely the area I worked on during my 2021 summer internship at Apple, as a member of the Siri ASR R&D team. Since I returned to Apple as a full-time employee, the team has been trying to tackle another problem that also falls under the umbrella of personalization: the recognition of speaker-specific entities, such as contact names, entries of media libraries, and so on. To address this problem, here we propose a computationally efficient algorithm to accurately retrieve relevant entries within the framework of neural contextual biasing.
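
As a rough illustration of what neural contextual biasing typically looks like, the sketch below embeds a list of speaker-specific entities and lets the ASR decoder attend over them, adding the attended entity context back into the decoder state. This is a generic, textbook-style formulation with assumed module names and dimensions; it is not the algorithm proposed here.

```python
import torch
import torch.nn as nn

class ContextualBiasingAttention(nn.Module):
    """Cross-attention over embedded contextual entities (contact names, media titles, ...).

    All dimensions are hypothetical; this sketches the general biasing idea only.
    """
    def __init__(self, decoder_dim: int = 512, bias_dim: int = 256):
        super().__init__()
        self.query_proj = nn.Linear(decoder_dim, bias_dim)
        self.key_proj = nn.Linear(bias_dim, bias_dim)
        self.value_proj = nn.Linear(bias_dim, bias_dim)
        self.out_proj = nn.Linear(bias_dim, decoder_dim)

    def forward(self, decoder_state, entity_embeddings):
        # decoder_state: (batch, decoder_dim); entity_embeddings: (batch, num_entities, bias_dim)
        q = self.query_proj(decoder_state).unsqueeze(1)       # (batch, 1, bias_dim)
        k = self.key_proj(entity_embeddings)                  # (batch, N, bias_dim)
        v = self.value_proj(entity_embeddings)
        scores = torch.matmul(q, k.transpose(1, 2)) / k.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=-1)               # relevance of each entity
        context = torch.matmul(weights, v).squeeze(1)         # (batch, bias_dim)
        # Biased decoder state: original state plus attended entity context.
        return decoder_state + self.out_proj(context)

# Toy usage with random tensors standing in for real decoder states and entity embeddings.
layer = ContextualBiasingAttention()
state = torch.randn(2, 512)              # two decoding steps in a batch
entities = torch.randn(2, 100, 256)      # 100 embedded entities, e.g., contact names
biased_state = layer(state, entities)    # same shape as the decoder state
```

In this view, the attention weights act as relevance scores over the entity list, which is why efficiently retrieving the relevant entries matters more and more as that list grows.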