
50 layers of autoencoders Part 1

reuglewicz jean-edouard

--

Or how to detect deepfakes with autoencoders

A deepfake consists of swapping one person’s face in a video or picture for someone else’s face in the most convincing way possible.
Advanced techniques based on deep learning, and specifically autoencoders, combined with easy access to computing power and image data, enable the creation of very powerful models whose fake pictures are hard to tell apart from real ones.
In this article, I’ll present some of my experiments with autoencoders for deepfake detection.

Setting the scene

I’ll be using the following dataset https://www.kaggle.com/tatatatata/deepfake/#data

Exploring the dataset

Distribution of the data

The dataset being very unbalanced, my approach will be to use an autoencoder trained only on the FAKE data.

First, I want to create a dataset of facial pictures to train my model on.
I extract the first frame of each video, then detect the faces and zoom in on them.
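
A minimal sketch of the frame-extraction step with OpenCV; the directory names and the file extension are placeholders, not the actual dataset layout:

import glob
import cv2  # OpenCV

# Grab the first frame of every video and save it as an image.
# "videos/" and "frames/" are placeholder directories.
for video_path in glob.glob("videos/*.mp4"):
    capture = cv2.VideoCapture(video_path)
    success, frame = capture.read()  # read only the first frame
    if success:
        out_path = "frames/" + video_path.split("/")[-1] + ".png"
        cv2.imwrite(out_path, frame)
    capture.release()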

My first approach was to extract faces using the frontal and profile face haar cascades, but the algorithm is not very robust and sometimes fails to detect faces correctly.

Failure to detect faces
Too many faces detected

About the haar cascade:
It relies on local brightness gradients, in the spirit of the Histogram of Oriented Gradients (HOG) algorithm.
First, the image is converted to grayscale, then the detector measures how dark each pixel is compared to its neighbors. The direction in which the brightness changes is used to build vectors of oriented gradients.
The image is then broken into smaller squares, and the most common brightness direction in each square is used as the overall direction of that square. This aggregated representation, mapped back onto the source image, extracts the basic structure of a face.
Finally, this high-level representation is compared to the reference cascade (frontal or profile), and if it is close enough a face is detected.
But because of this reliance on light variation in a grayscale image, poor lighting, darker skin or rounder faces are more difficult to detect. Moreover, depending on the angle, the distance to the faces encoded in the cascade may change a lot.
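
For reference, a minimal sketch of this detection step with OpenCV’s pre-trained frontal haar cascade; the image path is a placeholder:

import cv2

# Load the pre-trained frontal face cascade shipped with OpenCV.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("frames/example.png")  # placeholder path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Returns a list of (x, y, w, h) bounding boxes; it may be empty or contain
# false positives, which is exactly the robustness issue described above.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    crop = image[y:y + h, x:x + w]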

So I switched to dlib for better face extraction.

About dlib:
Dlib uses a convolutional neural network (CNN) for face extraction. It is therefore not limited by the angle, the shape of the face or the color of the skin, as the dataset it was trained on is much bigger and much more general.
But it is much more GPU-intensive than the HOG and haar cascade detectors; depending on your specific infrastructure, the added detection capability may not be worth it.
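
A comparable sketch with dlib’s CNN detector, assuming the pre-trained mmod_human_face_detector.dat weights file has been downloaded locally:

import cv2
import dlib

# CNN-based detector; requires dlib's pre-trained weights file.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

image = cv2.imread("frames/example.png")      # placeholder path
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # dlib expects RGB

# The second argument is the number of upsampling passes (more passes find
# smaller faces but cost more). Each detection carries a rectangle and a score.
detections = detector(rgb, 1)
for det in detections:
    r = det.rect
    crop = image[r.top():r.bottom(), r.left():r.right()]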

Approach

  • Select all fake videos
  • Extract the first frame of each video
  • Correct the light balance (a sketch follows this list)
  • Extract the face from the picture
  • Apply a Laplacian filter to the picture
  • Extract the spectrum of the filtered picture
  • Train an autoencoder on the spectra
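
The light-balance correction can be done in several ways; purely as an illustration, here is a sketch using CLAHE (adaptive histogram equalization) on the lightness channel, a choice that is my assumption rather than the method actually used:

import cv2

def correct_light_balance(image):
    # Illustrative choice: CLAHE on the lightness channel of the LAB color
    # space, then conversion back to BGR.
    lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)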

I’ll assume that creating a deepfake image introduces a lot of artifacts in the output image. Applying a Laplacian filter should therefore amplify this noise and make those deepfake-specific features stand out even more.

Laplacian-filtered picture

About the Laplacian:
The Laplacian of an image highlights regions of rapid intensity change and is therefore often used for edge detection. The Laplacian is often applied to an image that has first been smoothed with something approximating a Gaussian smoothing filter in order to reduce its sensitivity to noise. In our case, I won’t smooth it because on the contrary, I wish to highlight the noise in order to extract features that are specific to a deepfake image.
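
A minimal sketch of this step with OpenCV, deliberately skipping the Gaussian pre-smoothing so the noise stays visible; the path is a placeholder:

import cv2

face = cv2.imread("faces/example.png", cv2.IMREAD_GRAYSCALE)  # placeholder

# Laplacian of the raw (unsmoothed) face crop; CV_64F keeps negative values.
laplacian = cv2.Laplacian(face, cv2.CV_64F)

# Back to a displayable 8-bit image.
laplacian = cv2.convertScaleAbs(laplacian)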

Then, I’ll run a spectrum analysis on the Laplacian-filtered picture to extract the spectrum of the fake data. I can’t use the Laplacian image directly because it still contains the person’s face, and I don’t want my model to recognize people; I want it to learn what the effect of a deepfake on an image is.

Spectrum of the Laplacian-filtered image

About the spectrum:
The spectrum of the Laplacian shows at which frequencies the variations are strong and at which frequencies they are weak.
Therefore, if the artifacts produced by a deepfake and highlighted by the Laplacian follow a specific pattern, they should appear in some given range of frequencies, and the spectrum will capture those frequencies. The study is then no longer about faces but about the effect of a deepfake on the nature of the picture.
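
A sketch of the spectrum extraction with NumPy’s 2D FFT; the log-magnitude choice is an assumption made for readability:

import numpy as np

def magnitude_spectrum(laplacian_image):
    # 2D Fourier transform of the Laplacian-filtered face.
    f = np.fft.fft2(laplacian_image)
    # Shift the zero-frequency component to the center of the spectrum.
    f_shifted = np.fft.fftshift(f)
    # Log-magnitude compresses the dynamic range (the +1 avoids log(0)).
    return np.log(np.abs(f_shifted) + 1)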

Finally, I’ll train an autoencoder on the spectra. The goal is to train a model that is very good at reconstructing fake spectra, measure the reconstruction error, and use this error to set a threshold that discriminates between fake data and real data.

autoencoder

About the autoencoder:
“Autoencoding” is a data compression algorithm where the compression and decompression functions are 1) data-specific, 2) lossy, and 3) learned automatically from examples rather than engineered by a human. Additionally, in almost all contexts where the term “autoencoder” is used, the compression and decompression functions are implemented with neural networks.

1) Autoencoders are data-specific, which means that they will only be able to compress data similar to what they have been trained on. This is different from, say, the MPEG-2 Audio Layer III (MP3) compression algorithm, which only holds assumptions about “sound” in general, but not about specific types of sounds. An autoencoder trained on pictures of faces would do a rather poor job of compressing pictures of trees, because the features it would learn would be face-specific.

2) Autoencoders are lossy, which means that the decompressed outputs will be degraded compared to the original inputs (similar to MP3 or JPEG compression). This differs from lossless arithmetic compression.

3) Autoencoders are learned automatically from data examples, which is a useful property: it means that it is easy to train specialized instances of the algorithm that will perform well on a specific type of input. It doesn’t require any new engineering, just appropriate training data.

To build an autoencoder, you need three things: an encoding function, a decoding function, and a distance function measuring the information loss between the compressed representation of your data and the decompressed representation (i.e. a “loss” function). The encoder and decoder will be chosen to be parametric functions (typically neural networks) that are differentiable with respect to the distance function, so the parameters of the encoding/decoding functions can be optimized to minimize the reconstruction loss, using Stochastic Gradient Descent.

The created model is as follows:

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_4 (Dense) (None, 100) 4915300
_________________________________________________________________
dense_5 (Dense) (None, 50) 5050
_________________________________________________________________
dense_6 (Dense) (None, 100) 5100
_________________________________________________________________
dense_7 (Dense) (None, 49152) 4964352
=================================================================
Total params: 9,889,802
Trainable params: 9,889,802
Non-trainable params: 0
_________________________________________________________________
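
Below is a sketch of how this architecture could be reproduced in Keras. The layer sizes follow the summary above (the 49152-dimensional input matches a flattened 128x128x3 picture); the activations, optimizer, loss and training parameters are assumptions on my part:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

input_dim = 49152  # 128 * 128 * 3, flattened spectrum image

autoencoder = Sequential([
    Dense(100, activation="relu", input_dim=input_dim),  # encoder
    Dense(50, activation="relu"),                        # bottleneck
    Dense(100, activation="relu"),                       # decoder
    Dense(input_dim, activation="sigmoid"),              # reconstruction
])

# MSE between input and reconstruction; the optimizer choice is an assumption.
autoencoder.compile(optimizer="adam", loss="mse")

# x_fake: flattened spectra of the FAKE training images, scaled to [0, 1].
# autoencoder.fit(x_fake, x_fake, epochs=50, batch_size=32, validation_split=0.1)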

Resulting in the following

Reconstruction of fake data (left: source image, right: output of the autoencoder)
Reconstruction of the Real data (left: source image, right: output of the autoencoder)

The output of the autoencoder indeed seems to be of lower quality for the real data of the test set than for the fake data of the test set.

Analyzing the results

After training, I tested the model on some fake data and some real data and measured the reconstruction error to find a threshold.

Reconstruction error Fake vs True

It seems that for the fake pictures, which the autoencoder was trained on, the maximum reconstruction error is around 0.04. For real images, which the model has never seen before and does not know how to reconstruct, the error ranges from 0.04 to 0.12. A threshold of 0.04 therefore looks like a good limit to discriminate between real and fake videos, with a decent sensitivity vs. specificity trade-off.
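
As a sketch of how the threshold could be applied, reusing the autoencoder from above and assuming x_test holds the flattened spectra of a held-out set (the variable names are placeholders):

import numpy as np

def reconstruction_error(model, x):
    # Per-sample mean squared error between input and reconstruction.
    reconstructed = model.predict(x)
    return np.mean(np.square(x - reconstructed), axis=1)

errors = reconstruction_error(autoencoder, x_test)

# Below the threshold: looks like the FAKE data the model was trained on.
threshold = 0.04
predicted_fake = errors < threshold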

With an Area Under the Curve (AUC) close to 1, the model’s discriminating power is very high.

Recall being the true positive rate and precision the proportion of predicted positives that are correct, the model is very good at recovering true positives and also produces a low number of false positives.

These different graphs, especially precision as a function of the threshold and recall as a function of the threshold, seem to confirm that a threshold around 0.04 or 0.045 could be a good discriminant, as it gives the highest precision (proportion of correct predictions over all predicted positives) together with the highest true positive rate.
On the other hand, the AUC looks very high to me and could be the result of overfitting.
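
These curves can be computed with scikit-learn; a sketch, assuming errors are the reconstruction errors from above and y_true are hypothetical labels marking the real images as the positive class:

import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve

# y_true: 1 for real images, 0 for fakes (hypothetical labels). A real image
# should have a high reconstruction error, so the error itself is the score.
auc = roc_auc_score(y_true, errors)

# Precision and recall for every candidate threshold on the error.
precision, recall, thresholds = precision_recall_curve(y_true, errors)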

Disclaimer

The confusion matrix based on the threshold of 0.045 gives some great results, but the dataset is not very diverse, as the input data are often variations of the same fake, so the predictions may be biased.

Disclaimer

After running the model on new data to classify, a lot of pictures were classified as real images, which makes me doubt the real capabilities of the model and suspect that it overfits on the training data.

Conclusion

Using the reconstruction error of an autoencoder (trained on either real or fake data) on the spectrum of the Laplacian-filtered image seems to be a promising approach. I suppose the model could perform better given a more diverse dataset.


reuglewicz jean-edouard

Engineer passionate about technology, data processing and AI at large, doing my best to help in the machine uprising https://elbichon.github.io/