Unveiling Subtle Yet Impactful AI Modifications in Authentic Video Content
April 11, 2025
StevenWalker
In 2019, a deceptive video of Nancy Pelosi, then Speaker of the US House of Representatives, circulated widely. The video, which was edited to make her appear intoxicated, was a stark reminder of how easily manipulated media can mislead the public. Despite its simplicity, this incident highlighted the potential damage of even basic audio-visual edits.
At the time, the deepfake landscape was largely dominated by autoencoder-based face-replacement technologies, which had been around since late 2017. These early systems struggled to make the nuanced changes seen in the Pelosi video, focusing instead on more overt face swaps.
The 2022 ‘Neural Emotion Director' framework changes the mood of a famous face. Source: https://www.youtube.com/watch?v=Li6W8pRDMJQ
Fast forward to today, and the film and TV industry is increasingly exploring AI-driven post-production edits. This trend has sparked both interest and criticism, as AI enables a level of perfectionism that was previously unattainable. In response, the research community has developed various projects focused on 'local edits' of facial captures, such as Diffusion Video Autoencoders, Stitch it in Time, ChatFace, MagicFace, and DISCO.
Expression-editing with the January 2025 project MagicFace. Source: https://arxiv.org/pdf/2501.02260
New Faces, New Wrinkles
However, the technology for creating these subtle edits is advancing much faster than our ability to detect them. Most deepfake detection methods are outdated, focusing on older techniques and datasets. That is, until a recent breakthrough from researchers in India.
Detection of Subtle Local Edits in Deepfakes: A real video is altered to produce fakes with nuanced changes such as raised eyebrows, modified gender traits, and shifts in expression toward disgust (illustrated here with a single frame). Source: https://arxiv.org/pdf/2503.22121
This new research targets the detection of subtle, localized facial manipulations, a type of forgery often overlooked. Instead of looking for broad inconsistencies or identity mismatches, the method zeroes in on fine details like slight expression shifts or minor edits to specific facial features. It leverages the Facial Action Coding System (FACS), which breaks down facial expressions into 64 mutable areas.
Some of the 64 constituent expression parts in FACS. Source: https://www.cs.cmu.edu/~face/facs.htm
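To make the coding scheme more concrete, here is a small illustrative subset of well-known action units and the facial movements they describe. The selection and the dictionary mapping are for illustration only and are not drawn from the paper.

```python
# A few well-known FACS action units (AUs) and the facial movements they encode.
# Illustrative subset only -- the full coding system covers many more units.
FACS_ACTION_UNITS = {
    1:  "Inner brow raiser",
    2:  "Outer brow raiser",
    4:  "Brow lowerer",
    6:  "Cheek raiser",
    9:  "Nose wrinkler",       # prominent in expressions of disgust
    12: "Lip corner puller",   # the core movement of a smile
    15: "Lip corner depressor",
}

# A localized deepfake edit (a raised eyebrow, say) typically perturbs only one
# or two such units, which is why AU-level features are a useful detection signal.
for au, description in FACS_ACTION_UNITS.items():
    print(f"AU{au:02d}: {description}")
```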
The researchers tested their approach against various recent editing methods and found it consistently outperformed existing solutions, even with older datasets and newer attack vectors.
‘By using AU-based features to guide video representations learned through Masked Autoencoders (MAE), our method effectively captures localized changes crucial for detecting subtle facial edits.
‘This approach enables us to construct a unified latent representation that encodes both localized edits and broader alterations in face-centered videos, providing a comprehensive and adaptable solution for deepfake detection.'
The paper, titled Detecting Localized Deepfake Manipulations Using Action Unit-Guided Video Representations, was authored by researchers at the Indian Institute of Technology Madras.
Method
The method starts by detecting faces in a video and sampling evenly spaced frames centered on these faces. These frames are then broken down into small 3D patches, capturing local spatial and temporal details.
Schema for the new method. The input video is processed with face detection to extract evenly spaced, face-centered frames, which are then divided into ‘tubular' patches and passed through an encoder that fuses latent representations from two pretrained pretext tasks. The resulting vector is then used by a classifier to determine whether the video is real or fake.
Each patch contains a small window of pixels from a few successive frames, allowing the model to learn short-term motion and expression changes. These patches are embedded and positionally encoded before being fed into an encoder designed to distinguish real from fake videos.
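As a rough sketch of this stage, the snippet below cuts a 16-frame face clip into non-overlapping 3D ('tubelet') patches and adds a learned positional encoding. The patch size, embedding width, and module name are illustrative assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Cuts a clip of face-centered frames into small 3D (space-time) patches
    and projects each patch to a token vector. Patch size and embedding width
    are illustrative choices, not the paper's reported settings."""
    def __init__(self, frames=16, img_size=224, patch=16, tubelet=2, dim=768):
        super().__init__()
        # A 3D convolution with stride equal to its kernel size carves the clip
        # into non-overlapping tubelets and embeds each one in a single step.
        self.proj = nn.Conv3d(3, dim, kernel_size=(tubelet, patch, patch),
                              stride=(tubelet, patch, patch))
        num_tokens = (frames // tubelet) * (img_size // patch) ** 2
        # Learned positional encoding so the encoder knows where (and when)
        # each patch came from.
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, clip):           # clip: (batch, 3, frames, H, W)
        tokens = self.proj(clip)       # (batch, dim, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)
        return tokens + self.pos

# Usage: a batch containing one 16-frame, 224x224 face-crop clip.
clip = torch.randn(1, 3, 16, 224, 224)
tokens = TubeletEmbedding()(clip)
print(tokens.shape)   # torch.Size([1, 1568, 768])
```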
The challenge of detecting subtle manipulations is addressed by using an encoder that combines two types of learned representations through a cross-attention mechanism, aiming to create a more sensitive and generalizable feature space.
Pretext Tasks
The first representation comes from an encoder trained with a masked autoencoding task. By hiding most of the video's 3D patches, the encoder learns to reconstruct the missing parts, capturing important spatiotemporal patterns like facial motion.
Pretext task training involves masking parts of the video input and using an encoder-decoder setup to reconstruct either the original frames or per-frame action unit maps, depending on the task.
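A minimal sketch of the masking side of this pretext task is shown below: a fraction of the patch tokens is hidden at random, and the hidden patches are later reconstructed under an L1 loss. The 50 percent ratio follows the implementation notes later in this article; the function names are placeholders, and VideoMAE-style setups often mask far more aggressively.

```python
import torch
import torch.nn.functional as F

def random_mask(tokens, mask_ratio=0.5):
    """Keep a random subset of patch tokens; return the kept tokens plus the
    indices of kept and hidden patches. tokens: (batch, num_tokens, dim)."""
    b, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    perm = torch.rand(b, n).argsort(dim=1)            # a random order per sample
    keep_idx, hide_idx = perm[:, :num_keep], perm[:, num_keep:]
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx, hide_idx

def reconstruction_loss(predicted_patches, target_patches):
    """L1 loss between the decoder's output for the hidden patches and the
    originals, matching the article's description of both pretext objectives."""
    return F.l1_loss(predicted_patches, target_patches)

# Illustrative use on tokens shaped like those from the embedding sketch above.
tokens = torch.randn(1, 1568, 768)
kept, keep_idx, hide_idx = random_mask(tokens)
print(kept.shape, hide_idx.shape)   # torch.Size([1, 784, 768]) torch.Size([1, 784])
```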
However, this alone isn't enough to detect fine-grained edits. The researchers introduced a second encoder trained to detect facial action units (AUs), encouraging it to focus on localized muscle activity where subtle deepfake edits often occur.
Further examples of Facial Action Units (FAUs, or AUs). Source: https://www.eiagroup.com/the-facial-action-coding-system/
After pretraining, the outputs of both encoders are combined using cross-attention, with the AU-based features guiding the attention over the spatial-temporal features. This results in a fused latent representation that captures both broader motion context and localized expression details, used for the final classification task.
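The fusion step can be sketched as a single cross-attention block in which AU-derived tokens act as queries over the masked-autoencoder tokens, followed by pooling into one vector for the classifier. This is one plausible reading of 'AU features guiding the attention', not the authors' exact wiring, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class AUGuidedFusion(nn.Module):
    """Fuses the two pretrained representations with cross-attention.
    Here the AU-based tokens act as queries that attend over the masked-
    autoencoder (spatiotemporal) tokens -- an assumption for illustration,
    not the paper's exact design."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, au_tokens, mae_tokens):
        # Query: AU features; Key/Value: spatiotemporal MAE features.
        fused, _ = self.attn(query=au_tokens, key=mae_tokens, value=mae_tokens)
        fused = self.norm(fused + au_tokens)      # residual connection
        return fused.mean(dim=1)                  # pooled vector for the classifier

# Illustrative use: two token sets of matching width, pooled into one vector
# that a small real/fake classification head would then consume.
au_tokens  = torch.randn(1, 1568, 768)
mae_tokens = torch.randn(1, 1568, 768)
video_vector = AUGuidedFusion()(au_tokens, mae_tokens)
print(video_vector.shape)   # torch.Size([1, 768])
```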
Data and Tests
Implementation
The system was implemented using the FaceXZoo PyTorch-based face detection framework, extracting 16 face-centered frames from each video clip. The pretext tasks were trained on the CelebV-HQ dataset, which includes 35,000 high-quality facial videos.
From the source paper, examples from the CelebV-HQ dataset used in the new project. Source: https://arxiv.org/pdf/2207.12393
Half of the data was masked to prevent overfitting. For the masked-frame reconstruction task, the model was trained to predict the missing regions using an L1 loss; for the second task, it was trained to generate maps for 16 facial action units, also supervised with an L1 loss.
After pretraining, the encoders were fused and fine-tuned for deepfake detection using the FaceForensics++ dataset, which includes both real and manipulated videos.
The FaceForensics++ dataset has been a cornerstone of deepfake detection research since its release, though it is now considerably out of date with regard to the latest facial synthesis techniques. Source: https://www.youtube.com/watch?v=x2g48Q2I2ZQ
To address class imbalance, the authors used Focal Loss, which emphasizes the more challenging examples during training. All training was conducted on a single RTX 4090 GPU with 24GB of VRAM, using pretrained checkpoints from VideoMAE.
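Focal Loss down-weights examples the model already classifies confidently, so the training gradient is dominated by the harder cases. A minimal binary version is sketched below; the alpha and gamma values are common defaults, not the settings reported by the authors.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss for real (0) / fake (1) classification.
    alpha balances the two classes; gamma shrinks the loss for examples the
    model already gets right, focusing training on the harder ones.
    The hyperparameter values are common defaults, not the paper's."""
    probs = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = probs * targets + (1 - probs) * (1 - targets)      # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: the confidently wrong prediction (logit 2.5 for a real video would be
# a miss; here it is a correct fake) contributes least, hard cases dominate.
logits  = torch.tensor([2.5, -1.0, 0.1])
targets = torch.tensor([1.0,  1.0, 0.0])
print(binary_focal_loss(logits, targets))
```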
Tests
The method was evaluated against various deepfake detection techniques, with a focus on locally edited deepfakes. The tests covered a range of recent editing methods as well as older deepfake datasets, using metrics such as Area Under the Curve (AUC), Average Precision (AP), and mean F1 score.
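For reference, all three metrics can be computed directly from per-video fake scores, for example with scikit-learn; the labels and scores below are placeholders rather than results from the paper.

```python
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

# Placeholder per-video fake probabilities and ground-truth labels (1 = fake).
labels = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.35, 0.80, 0.65, 0.92, 0.20]

auc = roc_auc_score(labels, scores)                      # Area Under the ROC Curve
ap  = average_precision_score(labels, scores)            # Average Precision
f1  = f1_score(labels, [s >= 0.5 for s in scores])       # F1 at a 0.5 threshold

print(f"AUC: {auc:.3f}  AP: {ap:.3f}  F1: {f1:.3f}")
```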
From the paper: comparison on recent localized deepfakes shows that the proposed method outperformed all others, with a 15 to 20 percent gain in both AUC and average precision over the next-best approach.
The authors provided visual comparisons of locally manipulated videos, showing their method's superior sensitivity to subtle edits.
A real video was altered using three different localized manipulations to produce fakes that remained visually similar to the original. Shown here are representative frames along with the average fake detection scores for each method. While existing detectors struggled with these subtle edits, the proposed model consistently assigned high fake probabilities, indicating greater sensitivity to localized changes.
The researchers noted that existing state-of-the-art detection methods struggled with the latest deepfake generation techniques, while their method showed robust generalization, achieving high AUC and average precision scores.
Performance on traditional deepfake datasets shows that the proposed method remained competitive with leading approaches, indicating strong generalization across a range of manipulation types.
The authors also tested the model's reliability under real-world conditions, finding it resilient to common video distortions like saturation adjustments, Gaussian blur, and pixelation.
An illustration of how detection accuracy changes under different video distortions. The new method remained resilient in most cases, with only a small decline in AUC. The most significant drop occurred when Gaussian noise was introduced.
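Such a robustness check amounts to corrupting the evaluation videos and re-scoring them. The sketch below applies the four perturbation types mentioned above using torchvision; the strength of each perturbation is an illustrative choice, not the paper's evaluation protocol.

```python
import torch
import torchvision.transforms.functional as TF

def perturb(frame, kind):
    """Apply one common corruption used to probe detector robustness.
    frame: float tensor (3, H, W) in [0, 1]. Parameter values are illustrative."""
    if kind == "saturation":
        return TF.adjust_saturation(frame, saturation_factor=1.5)
    if kind == "gaussian_blur":
        return TF.gaussian_blur(frame, kernel_size=7)
    if kind == "pixelation":
        h, w = frame.shape[1:]
        small = TF.resize(frame, [h // 4, w // 4])       # downsample...
        return TF.resize(small, [h, w])                  # ...then upsample back
    if kind == "gaussian_noise":
        return (frame + 0.1 * torch.randn_like(frame)).clamp(0, 1)
    return frame

frame = torch.rand(3, 224, 224)
for kind in ["saturation", "gaussian_blur", "pixelation", "gaussian_noise"]:
    print(kind, perturb(frame, kind).shape)
```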
Conclusion
While the public often thinks of deepfakes as identity swaps, the reality of AI manipulation is more nuanced and potentially more insidious. The kind of local editing discussed in this new research might not capture public attention until another high-profile incident occurs. Yet, as actor Nic Cage has pointed out, the potential for post-production edits to alter performances is a concern we should all be aware of. We're naturally sensitive to even the slightest changes in facial expressions, and context can dramatically alter their impact.
First published Wednesday, April 2, 2025
Comments (25)
KevinAnderson
April 13, 2025 at 4:16:26 PM GMT
The Nancy Pelosi video was a wake-up call! It's scary how easily AI can manipulate videos. I appreciate the app for showing how subtle changes can have big impacts. But it's also a bit unsettling; makes you question what's real. Needs more transparency, I think.
0
RogerMartinez
April 13, 2025 at 12:33:37 AM GMT
The Nancy Pelosi video was a wake-up call on how AI can subtly change videos to mislead us. It's scary how simple it was to make her look intoxicated. This app really shows the power of AI in media manipulation. Needs to be more accessible though, so more people can understand the risks!
0