The Doctoral School in Science and Engineering is happy to invite you to Van Dat NGUYEN’s defence entitled
Modeling and Exploiting Vulnerabilities for Deepfake Detection
Supervisor: Assoc. Prof. Djamila AOUADA
We are witnessing a rapid rise in the automatic generation of highly realistic facial forgeries, commonly known as deepfakes. These forgeries pose serious threats to privacy, security, and public trust. While deep learning-based detectors have achieved impressive results on individual deepfake benchmarks, their effectiveness remains insufficient for real-world deployment: they often fail on unseen domains, struggle with high-quality manipulations, and are evaluated with metrics that do not fully reflect real-world requirements. This thesis addresses these issues by developing vulnerability-aware deepfake detection frameworks for both images and videos, and by proposing a more realistic and fair evaluation protocol for deepfake detection.
This thesis introduces a family of vulnerability-aware detectors at both the image and video levels. The core idea is to constrain the model to focus on small vulnerable regions, defined as the zones that are most likely to carry artifacts.

At the image level, LAA-Net is first proposed as a fine-grained CNN-based method that combines an explicit multi-task attention mechanism with a newly designed Enhanced Feature Pyramid Network (E-FPN). Through a classification branch and two auxiliary branches for heatmap and self-consistency regression, trained on real images and pseudo-fakes, LAA-Net learns to attend to vulnerable pixels, while E-FPN injects non-redundant multi-scale low-level features into the final representation. This design yields a model that is more robust to high-quality and unseen manipulations while allowing precise localization of forgery cues.

Beyond this CNN-based approach, the thesis further proposes transformer-based detectors, namely LAA-Former and LAA-Swin, equipped with a lightweight Learning-based Local Attention (L2-Att) module that extends the notion of vulnerable pixels to vulnerable patches. By guiding Vision Transformer backbones to explicitly attend to artifact-prone patches, these models consistently improve over their plain counterparts and achieve state-of-the-art generalization with a reduced model size and computational cost.

At the video level, we propose FakeSTormer, a vulnerability-aware spatio-temporal framework for deepfake video detection built upon a revisited TimeSformer backbone. Using a Self-Blended Video (SBV) synthesis strategy and dedicated temporal and spatial vulnerability heads, FakeSTormer captures subtle, intertwined artifacts in space and time, leading to improved robustness to high-quality and unseen manipulations.
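To make the pseudo-fake and vulnerability-heatmap idea concrete, here is a minimal, purely illustrative sketch (not the thesis's actual SBI/SBV pipeline; the function name, mask shape, augmentation, and heatmap formula are all assumptions for illustration): a slightly augmented copy of a real image is blended back onto itself inside a soft mask, so the only "artifact" is the blending boundary, and the supervision heatmap peaks exactly where blending is partial.

```python
import numpy as np

def make_pseudo_fake(img, rng):
    """Illustrative self-blended pseudo-fake: blend a mildly augmented
    copy of `img` onto itself inside a feathered mask, and return the
    blended image plus a vulnerability heatmap that peaks on the
    blending boundary (where the mask is ~0.5)."""
    h, w, _ = img.shape
    # hard rectangular blending region in the image centre
    mask = np.zeros((h, w), dtype=np.float32)
    mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = 1.0
    # feather the mask with a simple 5x5 box blur so the edge is soft
    k = 5
    pad = np.pad(mask, k // 2, mode="edge")
    mask = np.mean([pad[i:i + h, j:j + w]
                    for i in range(k) for j in range(k)], axis=0)
    # mild per-channel colour augmentation of the source copy
    aug = np.clip(img * (1.0 + 0.1 * rng.standard_normal(3)), 0.0, 1.0)
    # convex blend: artifact-free except along the soft boundary
    pseudo = mask[..., None] * aug + (1.0 - mask[..., None]) * img
    # vulnerability heatmap: 4*m*(1-m) is 0 inside/outside, 1 at m=0.5
    heatmap = 4.0 * mask * (1.0 - mask)
    return pseudo, heatmap
```

A detector's auxiliary heatmap branch would then regress `heatmap`, pushing attention toward the artifact-prone boundary pixels rather than the whole face.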
Lastly, we revisit the traditional protocol used for assessing the generalization capabilities of deepfake detectors. In particular, we introduce Cross-AUC, a metric that combines average AUC with a polarization term based on Wasserstein distances between the positive and negative score distributions. Cross-AUC better approximates performance on mixed-domain data and reveals hidden weaknesses of existing detectors. Together, these contributions advance the design, analysis, and evaluation of deepfake detectors toward more generalizable, robust, and interpretable tools.
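The two ingredients of Cross-AUC can be sketched in a few lines of standard-library Python. This is only an illustration of the components named in the abstract (per-domain AUC via the Mann-Whitney estimate, and a 1-D Wasserstein distance between pooled positive and negative score distributions); how the thesis actually combines them into a single score is not specified here, so the sketch returns both quantities separately, and the function names are hypothetical.

```python
import bisect

def auc(pos, neg):
    """Mann-Whitney estimate of AUC from positive/negative scores."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def wasserstein_1d(u, v):
    """W1 distance between two empirical 1-D distributions,
    computed as the integral of |CDF_u - CDF_v|."""
    u, v = sorted(u), sorted(v)
    pts = sorted(u + v)
    cdf = lambda s, x: bisect.bisect_right(s, x) / len(s)
    return sum(abs(cdf(u, a) - cdf(v, a)) * (b - a)
               for a, b in zip(pts, pts[1:]))

def cross_auc_components(per_domain_scores):
    """Return the average per-domain AUC and a polarization term
    (separation of the pooled positive and negative score
    distributions). Combining them is left to the metric's
    actual definition in the thesis."""
    aucs = [auc(p, n) for p, n in per_domain_scores]
    all_pos = [s for p, _ in per_domain_scores for s in p]
    all_neg = [s for _, n in per_domain_scores for s in n]
    return {"avg_auc": sum(aucs) / len(aucs),
            "polarization": wasserstein_1d(all_pos, all_neg)}
```

The intuition the sketch captures: a detector can score a perfect AUC within each domain while placing its decision thresholds at incompatible points across domains; pooling the scores and measuring how cleanly positives and negatives separate exposes exactly that hidden weakness on mixed-domain data.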