Google has thoroughly explained how one of its best creations to date works.
It arrived in 2019 alongside the Google Pixel 4, and since then it has become a fundamental piece of software on Pixel devices. The Recorder app seems like a simple tool, but Google has made it a showcase of its advances in artificial intelligence, machine learning, and voice recognition.
Recently, Google added to this app an option that almost seems like magic: it automatically detects whether there are multiple speakers in a conversation and labels each person's turns, assigning tags in the transcription of the recording (the user can later change these tags to the speakers' names). All of this happens in real time and on device, with no Internet connection.
Though its operation seems simple, behind this feature lies very advanced technology, which Google has chosen to explain in great detail.
The Tensor processor brings to life one of the best features of the Google Pixel
In its blog post on advances related to artificial intelligence, Google explains that a large part of the speaker-labeling system runs on the CPU block of Tensor, the processor integrated into Google Pixel devices since the Pixel 6. In the future, however, Google intends to delegate some of these tasks to the Tensor Processing Unit (TPU) to reduce power consumption.
The feature is based on a speaker diarization system called "Turn-to-Diarize". Its task is to use optimized machine-learning models to segment hours of audio recordings by speaker in real time, using the technical resources available on the Google Pixel.
Google has combined several techniques to make this system work effectively. On the one hand, it detects every change of speaker in the recording through an encoder model responsible for extracting each person's voice characteristics.
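One way to picture this change-detection idea is to compare the voice embeddings of consecutive audio windows and flag a speaker change when they diverge. The sketch below is only an illustration of the concept — the vectors, window granularity, and threshold are made up, not Google's actual model:

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def detect_speaker_changes(embeddings, threshold=0.5):
    """Return the window indices where a new speaker seems to start.

    `embeddings` is a list of per-window voice-embedding vectors; a
    cosine distance above `threshold` between adjacent windows is
    treated as a change of speaker.
    """
    return [i for i in range(1, len(embeddings))
            if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold]

# Two "voices": the first three windows point one way, the rest another.
windows = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05],
           [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
print(detect_speaker_changes(windows))  # → [3]
```

Real voice embeddings have hundreds of dimensions, but the geometry is the same: same speaker, nearby vectors; different speaker, distant vectors.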
On the other hand, a clustering algorithm is in charge of assigning labels to each of the people who participate in the recording.
Once the audio recording has been segmented into homogeneous speaker turns, we use a speaker encoder model to extract an embedding vector (i.e., a d-vector) representing the vocal characteristics of each speaker turn.
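Putting the two steps together — one embedding per speaker turn, then grouping — can be sketched with a toy greedy clustering. Google's actual clustering algorithm is more sophisticated; this stand-in merely shows how per-turn vectors become "Speaker 1 / Speaker 2" labels:

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _cos_sim(a, b):
    return sum(x * y for x, y in zip(_normalize(a), _normalize(b)))

def cluster_turns(turn_embeddings, threshold=0.4):
    """Greedily group per-turn embedding vectors into speaker labels.

    Each turn joins the existing speaker whose centroid is most similar
    (cosine distance below `threshold`); otherwise it opens a new
    speaker label. Threshold and vectors are illustrative only.
    """
    centroids, labels = [], []
    for emb in turn_embeddings:
        emb = _normalize(emb)
        if centroids:
            sims = [_cos_sim(emb, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
            if 1.0 - sims[best] < threshold:
                labels.append(best)
                # Fold the new turn into that speaker's running centroid.
                centroids[best] = [a + b for a, b in zip(centroids[best], emb)]
                continue
        centroids.append(list(emb))
        labels.append(len(centroids) - 1)
    return labels  # 0 → "Speaker 1", 1 → "Speaker 2", ...

# Four turns alternating between two voices:
turns = [[1.0, 0.0], [0.0, 1.0], [0.95, 0.05], [0.05, 0.95]]
print(cluster_turns(turns))  # → [0, 1, 0, 1]
```

The key property the real system shares with this toy is that the number of speakers is not known in advance: a new label is created only when a turn fits no existing cluster.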
One of the most striking aspects of this feature is that it learns from its mistakes over time. Google explains that as the model analyzes more and more audio, it assigns labels more accurately, and can even correct previously assigned tags.
In our real-time speaker diarization system, as the model consumes more audio input, it accumulates confidence in the predicted speaker labels and may occasionally make corrections to previously low-confidence predicted speaker labels. The Recorder app automatically updates the speaker labels on the screen during recording to reflect the latest and most accurate predictions.
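The correction behaviour the quote describes can be illustrated with a toy re-assignment step: every past turn is re-scored against the current speaker centroids, so as a centroid is refined by new audio, an earlier borderline turn can flip to a better label. All vectors here are invented for the example:

```python
import math

def _cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_labels(turn_embeddings, centroids):
    """Assign every past turn to its most similar speaker centroid.

    Re-running this after each new turn mimics the on-screen updates:
    better centroids can retroactively change a past turn's label.
    """
    return [max(range(len(centroids)), key=lambda k: _cos_sim(e, centroids[k]))
            for e in turn_embeddings]

# A borderline turn, closer to Speaker 1 under the early, noisy estimate:
turns = [[0.8, 0.6]]
early_centroids = [[1.0, 0.0], [0.0, 1.0]]
later_centroids = [[1.0, 0.0], [0.7, 0.71]]  # Speaker 2 refined by more audio

print(assign_labels(turns, early_centroids))  # → [0]
print(assign_labels(turns, later_centroids))  # → [1]  (label corrected)
```

This is only a sketch of the idea; the production system tracks per-prediction confidence and updates labels selectively rather than re-scoring everything.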
It is quite remarkable that this entire process can run on a smartphone in real time, without any kind of connection to a server. And although automatic tagging is currently only available in English, the feature is expected to support multiple languages in the future.