An analysis of the voiceprint recognition technology behind the Super Brain man-machine battle with Sun Yiting

The man-machine challenge in Super Brain is to identify people by their voices, and the technology behind it is voiceprint recognition. Voiceprint recognition is a behavioral biometric technology: it measures the waveform and variation of a captured voice and matches them against registered voice templates. The technology was first developed at Bell Laboratories in the late 1940s and was initially used mainly for military intelligence. As the technology matured, it gradually spread to forensic identification, court evidence, and other fields.

Theoretical basis of voiceprint recognition

Every voice carries unique features, and through these features different people's voices can be effectively distinguished.

This uniqueness is mainly determined by two factors. The first is the geometry of the vocal tract, including the throat, nasal cavity, and oral cavity. The shape, size, and position of these organs determine the tension of the vocal cords and the range of frequencies the voice can produce. The second factor is how the vocal organs are manipulated: it is their interaction that produces intelligible speech. While learning to speak, people gradually form their own voiceprint characteristics by imitating the speaking styles of the different people around them.

In theory, voiceprints are like fingerprints: hardly any two people share the same voiceprint characteristics.

Analysis of Xiaodu's voiceprint recognition technology

The voiceprint recognition used by the Xiaodu robot in Super Brain is, in essence, real-time dynamic voice detection technology, and it also includes VAD, noise reduction, and dereverberation (VAD detects whether a signal is human speech; noise reduction and dereverberation remove environmental interference).
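As a toy illustration of the VAD step (not Xiaodu's actual implementation, whose details are not public), the sketch below flags speech frames by short-time energy; the frame length, hop size, and threshold are assumptions chosen for illustration:

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, rel_threshold_db=-35.0):
    """Mark each frame as speech (True) or silence (False) by comparing its
    short-time energy, in dB, against a threshold relative to the loudest frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    energy = np.array([
        np.sum(signal[i * hop_len:i * hop_len + frame_len] ** 2)
        for i in range(n_frames)
    ])
    energy_db = 10.0 * np.log10(energy + 1e-12)
    # Keep frames within rel_threshold_db of the loudest frame.
    return energy_db > energy_db.max() + rel_threshold_db
```

Frames flagged False would simply be dropped before feature extraction; production systems typically use statistical or learned VAD models rather than a fixed energy threshold.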

Considering that the challenge is to pick out a particular voice from a choir, the difficulty lies in how to extract and represent the speaker-related information in the speech signal, and how to distinguish the subtle differences between similar voices. In general, the speaker-related features of speech are extracted through roughly the following pipeline:

The collected speech first goes through voice activity detection (VAD), which removes the invalid (non-speech) portions; acoustic features are then extracted from what remains. Because a speech signal is a short-time non-stationary signal of variable length, features are generally extracted with a sliding window, frame by frame. Commonly used acoustic features include the classic Mel-frequency cepstral coefficients (MFCC), perceptual linear prediction coefficients (PLP), and, most recently, deep features learned with deep learning.

Once acoustic features are obtained, the next step is to extract the speaker information from them. The modeling methods used here are mainly the i-vector algorithm and deep convolutional neural networks with residual connections. Such models express the speech at a deeper level and thus better expose the speaker-related information: the final model transforms the frame-level features from the feature-extraction stage into a representation that characterizes the speaker.
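As a minimal sketch of the frame-based feature extraction described above (using the open-source librosa library rather than whatever toolkit the competition system used; the 16 kHz rate, 25 ms window, and 10 ms hop are common defaults, not confirmed settings):

```python
import librosa

def extract_mfcc(wav_path, n_mfcc=20):
    # Load and resample the recording to 16 kHz mono.
    y, sr = librosa.load(wav_path, sr=16000)
    # Windowed, frame-by-frame MFCCs: 25 ms frames, 10 ms hop.
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
    )
    return mfcc.T  # one row of coefficients per frame
```

The resulting frame-level matrix is what an i-vector extractor or a deep network would consume to produce a single fixed-length speaker representation.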

In this way, the speech of a specific speaker can be fully converted into a model that characterizes that speaker. (In the actual competition, with 21 choir members singing, we fed each member's singing into the model separately and obtained 21 templates, one representing each choir member; a sketch of this enrollment step follows.)
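A minimal enrollment sketch under stated assumptions: `extract_embedding` is a hypothetical stand-in for the i-vector or deep-network front end and maps one recording to a fixed-length vector; each member's template is the normalized mean of their clip embeddings.

```python
import numpy as np

def enroll_choir(member_clips, extract_embedding):
    """member_clips: {member_id: [wav_path, ...]}.
    Returns one unit-length template vector per choir member."""
    templates = {}
    for member_id, wav_paths in member_clips.items():
        # Average the embeddings of all clips from this member.
        embeddings = np.stack([extract_embedding(p) for p in wav_paths])
        mean = embeddings.mean(axis=0)
        templates[member_id] = mean / np.linalg.norm(mean)
    return templates
```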

The identification and matching stage is relatively easy to understand. After the test speech is collected, the same feature extraction is applied; the similarity distance to every template sample in the template library is then computed, and the closest template is chosen as the final answer. (In the actual competition this amounted to three tests: in each test, the clue voice was fed into the model, its features were extracted and compared against the 21 templates, and the highest-scoring member was the one the machine judged most likely. The whole process is sketched below.)
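A sketch of the matching step, assuming cosine similarity as the scoring measure (a standard choice for speaker embeddings, though the actual scoring rule used on the show is not documented):

```python
import numpy as np

def identify(test_embedding, templates):
    """Score a test utterance against every enrolled template and
    return the best-matching member plus all similarity scores."""
    test = test_embedding / np.linalg.norm(test_embedding)
    scores = {member: float(np.dot(test, template))
              for member, template in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

With 21 templates enrolled, three such calls would correspond to the three on-air tests.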

The difficulties of this voiceprint recognition challenge

Perhaps what everyone is most curious about is why the strongest AI, Xiaodu, and our young player Sun Yiting each answered only one of the three questions correctly. Here I will briefly discuss the factors that affected both contestants:

1. Noise

2. Many people singing at once

3. Fading memory of the voices

4. Feature transfer

The first problem is noise, including live ambient noise and the music itself. Its impact here was greater than in the face-recognition challenge (which was only slightly affected), and the music also interfered with the judgments of both the machine and the human player. The second is many people singing at once: voiceprint recognition relies mainly on spectral features, and when many voices overlap, their spectra blend together, making the features hard to separate and identify. The third mainly affected the human player: temporal sequences are harder for ordinary people to memorize than spatial ones, especially after memorizing three voice sequences in a row, which is why Dr. Wei repeatedly asked to listen to them several more times. Finally, there is feature transfer: the challenge required memorizing spoken voices and then recognizing singing voices, but a person's voiceprint often differs between speech and singing. This transfer problem is why our two contestants needed a certain amount of inductive reasoning ability.

These four factors kept the final result from being perfect, but it is precisely such imperfections that push us to keep improving the technology and to surpass our past selves.