Cross Audio-to-Visual Speaker Identification

Speaker recognition technology has achieved strong performance in some real-world applications. However, its performance still degrades substantially in noisy environments. One approach to improving speech recognition and speaker identification is to combine video and audio sources, linking the visual features of lip motion with vocal features; the two modalities are correlated and convey complementary information.

In this project, we are interested in identifying which face in a coupled video/audio clip of several individuals belongs to the person speaking, based on data collected in an unconstrained ("in the wild") environment. For this effort, we propose to use visual lip-motion features for a face in a video clip, together with features of the co-recorded audio signal from several speakers, to identify the individual who uttered the audio recorded along with the video. To solve this problem, we propose an auto-associative deep neural network architecture, a data-driven model that does not explicitly model phonemes or visemes (the visual equivalent of phonemes).
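As a rough illustration of such a data-driven, phoneme-free mapping, the sketch below (in PyTorch) shows a small network that regresses visual lip features from a window of speech features. The layer sizes, feature dimensions, and class names are illustrative assumptions, not the project's actual architecture.

```python
# Minimal sketch of a cross-modal, data-driven speech-to-lip mapping:
# a network that regresses visual lip features from speech features,
# with no phoneme/viseme modelling. All sizes here are assumptions.
import torch
import torch.nn as nn

AUDIO_DIM = 39   # assumed: 13 MFCCs + deltas + delta-deltas per frame
LIP_DIM = 40     # assumed dimensionality of the visual lip-feature vector

class SpeechToLipNet(nn.Module):
    """Maps a window of speech-feature frames to one lip-feature vector."""
    def __init__(self, context_frames: int = 11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM * context_frames, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, LIP_DIM),   # regressed lip features
        )

    def forward(self, audio_window: torch.Tensor) -> torch.Tensor:
        # audio_window: (batch, context_frames * AUDIO_DIM)
        return self.net(audio_window)

model = SpeechToLipNet()
criterion = nn.MSELoss()  # reconstruction loss between predicted and true lip features
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```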

A speech-to-video auto-associative deep network will be used, where the network has learned to reconstruct visual lip features given only speech features as input. At test time, the visual lip-feature vector generated by our deep network for an input speech signal will be compared against a gallery of individual visual lip features for speaker identification. The proposed speech-to-video deep network will be trained on our current WVU voice and video training dataset, using the corresponding audio and video features from individuals as inputs to the network.
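A minimal sketch of the gallery-comparison step is shown below; the function name, the gallery structure, and the cosine-similarity matching rule are hypothetical choices for illustration, not the project's specified method.

```python
# Hypothetical identification step, assuming a trained speech-to-lip network
# and a gallery of per-speaker lip-feature templates (shapes are assumptions).
import torch
import torch.nn.functional as F

def identify_speaker(predicted_lip: torch.Tensor,
                     gallery: dict[str, torch.Tensor]) -> str:
    """Return the gallery identity whose lip-feature template is closest
    (by cosine similarity) to the lip features predicted from the audio."""
    best_id, best_score = None, float("-inf")
    for speaker_id, template in gallery.items():
        score = F.cosine_similarity(predicted_lip, template, dim=0).item()
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id
```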

For the audio signal, we will use Mel-frequency cepstral coefficients (MFCCs); for the video, we will extract static and temporal visual features of the lip motion.
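As a sketch of how these two feature streams could be computed, the example below uses librosa for the MFCCs (with delta and delta-delta coefficients added as an assumed temporal complement) and plain NumPy for a simple static-plus-temporal lip representation built from assumed lip-landmark input; the exact feature definitions are illustrative rather than the project's specification.

```python
# Sketch of the two feature streams: MFCCs for audio, static + temporal
# lip features for video. Lip landmarks and feature choices are assumptions.
import librosa
import numpy as np

def audio_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """13 MFCCs per frame plus first- and second-order deltas."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T   # shape: (frames, 39)

def lip_features(lip_landmarks: np.ndarray) -> np.ndarray:
    """lip_landmarks: (frames, points, 2) lip-contour coordinates per video frame.
    Static part: flattened landmark positions; temporal part: frame-to-frame motion."""
    static = lip_landmarks.reshape(len(lip_landmarks), -1)
    temporal = np.vstack([np.zeros((1, static.shape[1])),
                          np.diff(static, axis=0)])
    return np.hstack([static, temporal])        # shape: (frames, 2 * points * 2)
```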