Technical report | Face and Voice Fusion for Human Recognition in Non-controlled Environments

Abstract

The individual performance of biometric technologies such as speaker recognition (SR) and face recognition (FR) has enabled their prolific use in applications worldwide (e.g. FR at airports and SR for access to telephone banking and taxation purposes). However, in challenging environments (e.g. CCTV videos), where the data is of low quality, establishing the identity of non-cooperative individuals is still a difficult task.

This paper documents the verification performance gains possible when fusing low quality face and voice samples at the matching score level. Three normalisation and five classifier-based fusion techniques were evaluated on a real life audio-video dataset ('Mobio'). When compared to the performance of the individual biometrics, all fused results showed a notable improvement.

Executive Summary

The individual performance of biometric technologies such as speaker recognition (SR) and face recognition (FR) has enabled their prolific use in applications worldwide (e.g. FR at airports and SR for access to telephone banking and taxation purposes). However, in challenging environments (e.g. surveillance and online videos), where the data is of low quality, establishing the identity of non-cooperative individuals is still a difficult task.

A wide range of approaches for combining biometric samples have been published to overcome some of the practical problems of using only a single biometric trait, demonstrating the benefits of combining different biometric features using fusion algorithms. The Defence Science and Technology (DST) Group has also developed a score-level fusion method (hereinafter referred to as the DST-developed canonical method) [1], which was shown to improve the quality of FR algorithms.

This paper uses a real life non-controlled scenario to examine the verification performance gains possible when fusing low quality face and voice samples at the matching score level. Three normalisation methods (z-score, min-max and DST-developed canonical) and five fusion techniques (weighted sum, Support Vector Machines (SVM) with linear or quadratic kernels, and Boosting using the AdaBoost or RUSBoost algorithms) were evaluated. The audio-video dataset ('Mobio') used in this evaluation contains both face and voice samples of a person using a mobile phone, which mimics a real life authentication scenario.
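The two simplest techniques named above can be illustrated concretely. The sketch below (with hypothetical scores and a hypothetical equal-ish weighting; the report does not publish its weights) shows z-score and min-max normalisation bringing face and voice matcher scores onto a common scale, followed by weighted-sum fusion:

```python
# Hedged sketch: score normalisation and weighted-sum fusion of face and
# voice matching scores. All score values and the weight are hypothetical.

def z_score(scores):
    """Z-score normalisation: subtract the mean, divide by the std. dev."""
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / std for s in scores]

def min_max(scores):
    """Min-max normalisation: map scores linearly into [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def weighted_sum(face, voice, w_face=0.5):
    """Fuse two normalised score lists with a convex combination."""
    return [w_face * f + (1 - w_face) * v for f, v in zip(face, voice)]

# Hypothetical matcher outputs on very different scales:
face_scores = [0.2, 0.8, 0.5, 0.9]
voice_scores = [10.0, 42.0, 25.0, 38.0]

fused = weighted_sum(min_max(face_scores), min_max(voice_scores), w_face=0.6)
```

Normalisation matters here because the two matchers emit scores on incompatible scales; without it, the matcher with the larger numeric range would dominate the sum regardless of the chosen weight.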

When compared to the performance of the individual biometrics, the fused results showed a notable improvement. For the Mobio dataset (676 genuine and 48,594 impostor comparisons), at a false match rate (FMR) of 0.1%, SR alone achieves a false non-match rate (FNMR) of 35.1%, while FR alone achieves a FNMR of 19.3%. When these two modalities are normalised and fused using the methods listed above, the FNMR is reduced to between 7.49% and 13.9%, depending on the method.
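The evaluation metric used above, FNMR at a fixed FMR, can be computed from genuine and impostor score sets by choosing the acceptance threshold that yields the target FMR on the impostor scores. A minimal sketch (the scores and helper name are illustrative, not from the report):

```python
# Hedged sketch: compute FNMR at a target FMR from raw score lists.
# The function name and all score values are hypothetical.

def fnmr_at_fmr(genuine, impostor, target_fmr=0.001):
    """Pick the threshold that admits at most target_fmr of impostor
    scores (accept when score > threshold), then report the fraction of
    genuine scores rejected at that threshold (the FNMR)."""
    imp_sorted = sorted(impostor, reverse=True)
    k = int(target_fmr * len(impostor))   # impostor accepts allowed
    threshold = imp_sorted[min(k, len(imp_sorted) - 1)]
    fnmr = sum(1 for g in genuine if g <= threshold) / len(genuine)
    return threshold, fnmr

# Hypothetical data: 1000 impostor scores, 3 genuine scores.
impostor = [float(i) for i in range(1000)]
genuine = [999.5, 950.0, 999.2]

threshold, fnmr = fnmr_at_fmr(genuine, impostor, target_fmr=0.001)
```

With these hypothetical inputs, one impostor score (out of 1000) exceeds the threshold, matching the 0.1% FMR operating point used in the report, and one of the three genuine scores falls below it.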

Future research from this evaluation could include:

  1. Verifying the significance of the current fusion performance using a different database that contains poorer quality face and voice samples than Mobio.
  2. Expanding the current face and voice fusion work to include other emergent biometrics, such as 3D face and body part measurements.
  3. Exploring the use of other fusion techniques such as sensor level or feature level fusion.

Key information

Authors

Sau Yee Yiu, Dmitri Kamenetsky, Jason Littlefield and Jonathan Willmore

Publication number

DST-Group-TR-3426

Publication type

Technical report

Publication date

November 2017

Classification

Unclassified - public release

Keywords

Face Recognition, Speaker Recognition, Biometric Fusion, Classification