Exploring Self-supervised Embeddings and Synthetic Data Augmentation for Robust Audio Deepfake Detection

Authors:
  1. Martín-Doñas, Juan M.
  2. Álvarez, Aitor
  3. Rosello, Eros
  4. Gomez, Angel M.
  5. Peinado, Antonio M.
Proceedings:
Interspeech 2024

ISSN: 2958-1796

Year of publication: 2024

Pages: 2085-2089

Congress: Interspeech 2024, 1-5 September 2024, Kos, Greece

Type: Conference paper

DOI: 10.21437/INTERSPEECH.2024-942


Abstract

This work explores the performance of large speech self-supervised models as robust audio deepfake detectors. Despite the current trend of fine-tuning the upstream network, in this paper we revisit the use of pre-trained models as feature extractors to adapt specialized downstream audio deepfake classifiers. The goal is to preserve the general knowledge of the audio foundation model and extract discriminative features to feed a simplified deepfake classifier. In addition, the generalization capabilities of the system are improved by augmenting the training corpora with additional synthetic data from different vocoder algorithms. This strategy is complemented by various data augmentations covering challenging acoustic conditions. Our proposal is evaluated on different benchmark datasets for audio deepfake and anti-spoofing tasks, showing state-of-the-art performance. Furthermore, we analyze which parts of the downstream classifier are most relevant to achieving a robust system.
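The pipeline the abstract describes — a frozen pre-trained encoder used purely as a feature extractor, whose frame-level embeddings are pooled and fed to a lightweight classifier — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fixed random projection stands in for the frozen self-supervised upstream (e.g. a wav2vec 2.0-style model), and the single logistic layer stands in for the simplified downstream classifier; frame size, embedding dimension, and all weights here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen self-supervised encoder: it maps a raw
# waveform to a sequence of frame-level embeddings. In the paper the upstream
# model is pre-trained and kept frozen; here a fixed random projection plays
# that role purely for illustration.
FRAME_LEN = 160   # 10 ms frames at 16 kHz (illustrative choice)
EMB_DIM = 8       # embedding dimension (illustrative choice)
W_frozen = rng.standard_normal((FRAME_LEN, EMB_DIM))

def extract_embeddings(waveform: np.ndarray) -> np.ndarray:
    """Frozen feature extraction: (n_samples,) -> (n_frames, EMB_DIM)."""
    n_frames = len(waveform) // FRAME_LEN
    frames = waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    return frames @ W_frozen

def pool(embeddings: np.ndarray) -> np.ndarray:
    """Temporal mean pooling collapses the frame axis for the classifier."""
    return embeddings.mean(axis=0)

# Simplified downstream classifier: one logistic layer on pooled features.
# Only these weights would be trained; the extractor above stays frozen.
w_clf = rng.standard_normal(EMB_DIM)

def spoof_score(waveform: np.ndarray) -> float:
    """Probability-like score that the input audio is a deepfake."""
    z = pool(extract_embeddings(waveform)) @ w_clf
    return float(1.0 / (1.0 + np.exp(-z)))

audio = rng.standard_normal(16000)  # 1 s of synthetic "audio" at 16 kHz
score = spoof_score(audio)
```

Keeping the extractor frozen means only the small classifier's parameters are updated during training, which is what lets the foundation model's general knowledge carry over unchanged to the deepfake task.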