This standalone blog post describes my exploration of two highly relevant Machine Learning questions:

  1. How big of a dataset do I need for training?
  2. How long do I have to train a classifier (number of epochs)?

The project and its main learnings were shared in the two previous companion blog posts: Using ML to detect fake face images created by AI, which describes how I made an ML-classifier production ready by experimenting with different training datasets, and Using ML to understand the behavior of an AI, which focuses on understanding the features of AI-generated face images and their implications. A bonus treat in this blog post is the declaration of a higher final accuracy for the deployed ML-classifier than the one shown in the previous blog posts.

To try out the ML-classifier in action, head to this super simple web app. The ML-model will present its prediction for any face image you upload for classification.


TL;DR (or Spoiler Alert if you will)

  1. Feeding a Deep Learning neural net with many images for training takes lots of GPU power and runs for hours, so the monetary cost mounts up. This blog post offers a discussion on how to know when to stop the training and call it good enough. For the particular ML-classifier discussed, 10 epochs (runs) were sufficient to reach an accuracy of 99.74%.
  2. While in the exploratory phase, trying to understand the problem at hand, you want things to run as quickly as possible to minimize your idle time while waiting for the model to finish training and produce a result. The blog post discusses how big a subset of the full dataset you need to work with in order to understand how you are doing in the ML-classifier improvement process.
  3. The final accuracy for the ML-classifier, discussed in the two previous companion blog posts describing the actual project to classify AI-generated fake face images, has improved from 99.21% and is hereby declared to be 99.75%!

Size of dataset and the training time/cost relationship

I had initially acquired a total of 54 000 images as my Training set to work with. Waiting many hours for each iteration to test a new experimental hypothesis wasn't very appealing.

The first thing I wanted to know was whether I could work with a subset of the training data, to speed the experimentation up, and still get results I could trust. A small random sample of images was chosen as a subset of the dataset. Below you can see the result for 3 000 images picked out for training, reaching 100% accuracy in 10 minutes.
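Drawing such a subset can be sketched in plain Python; the stand-in file names, the seed, and the sample size of 3 000 are illustrative, not the project's actual code:

```python
import random

def sample_subset(image_paths, n, seed=42):
    # Draw a reproducible random subset of n images from the full set
    rng = random.Random(seed)
    return rng.sample(image_paths, n)

# Stand-in "paths" for the 54 000-image training set
full_dataset = [f"img_{i:05d}.jpg" for i in range(54_000)]
subset = sample_subset(full_dataset, 3_000)
```

Fixing the seed means the same subset is drawn every time, so results from different experiments remain comparable.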

To know how well this version of the model was holding up, it needed to be tested. So, the full original dataset of 54 000 images was run as a Test set. Technically, to measure the error rate correctly, the 3 000 training images should have been removed, but I didn't bother. I just wanted a quick indication of how well it went, nothing precise and exact. The error rate largely reflected the predictions on the 51 000 images the model had never seen.

The error rate, as seen in the Confusion Matrix below, came out at approximately 3%. From the first run I knew that training with the full dataset, taking 5.5 hours instead of 10 minutes, would bring that error rate down to close to 0%.
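The arithmetic for reading an error rate off a confusion matrix can be sketched like this; the counts below are made up to mirror roughly 3% errors on 54 000 images, not the actual matrix:

```python
def error_rate(confusion):
    # Error rate from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]
    (tn, fp), (fn, tp) = confusion
    return (fp + fn) / (tn + fp + fn + tp)

# Made-up counts: 54 000 test images, 1 620 of them misclassified
cm = [[31_850, 1_150], [470, 20_530]]
print(f"error rate: {error_rate(cm):.1%}")  # → error rate: 3.0%
```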

For the most part, I went with this smaller Training set while experimenting, to test new ideas and verify assumptions while improving the model step by step. Only at the end did I scale up to the full dataset, to train the final ML-classifier for production.

As far as I could tell, the finalized ML-classifier showed no signs that would invalidate the assumptions tested on the smaller Training set. It is reassuring to know from experimentation that training with as few images as roughly 5% of the full dataset was robust enough, with no surprises at the end.

This result is of course only fully valid for this particular problem and these particular datasets. Still, it indicates that in the experimentation phase of practicing Machine Learning, very small datasets can be used to speed up testing and to find out which training is useful.

Number of epochs in training and the time/cost relationship

Training on the relatively large accumulated dataset takes a couple of hours per run. By the end of the experimentation for this project, before finalizing the ML-classifier, the full training dataset contained 76 000 images (20% of which were used for validation).

Here is a very brief introduction to how Deep Learning neural nets operate, to understand their eventual ability of pattern recognition. To run an epoch, in the language of ML, means passing the entire dataset through a mathematical function and updating the function's parameters, so that the resulting model fits the data more accurately with as little loss (also a function) as possible. To let the model learn more from the data, more epochs are run, with the images shuffled into a random order for each epoch. The finalized ML-model is trained for many epochs and gains its insights by looking at every image in the full dataset exactly once per epoch. During every epoch, the ML-model improves by updating its knowledge: the model parameters, also known as the weights or function/matrix coefficients.
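The epoch loop described above can be sketched in miniature with a single-parameter model, so the mechanics are visible. This is illustrative Python, not the actual fastai training code; the toy data and learning rate are made up:

```python
import random

def train(data, epochs, lr=0.1, seed=0):
    # Each epoch: shuffle the dataset, then visit every example exactly
    # once, nudging the single parameter w down the gradient of the
    # squared-error loss (w * x - y) ** 2.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)          # new random order every epoch
        for x, y in data:
            err = w * x - y        # prediction error on this example
            w -= lr * 2 * err * x  # one gradient step
    return w

# Toy data drawn from y = 3x; w should settle near 3 after a few epochs
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w = train(data, epochs=10)
```

A real neural net does the same thing with millions of parameters and mini-batches of images, but the pattern (shuffle, one full pass, update) is identical.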

Overfitting is when you cross the fine line where you have overtrained: the model is no longer at its best at generalizing when presented with unseen data, but instead only solves your training data very well. I use the loss to find out whether I am still on the right track. As long as the training set has a lower loss than the validation set, the model is not overfitting. While not overfitting, the rule of thumb is that you are good to go ahead and continue training by running more epochs.
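This rule of thumb can be expressed as a small check over the per-epoch losses; the loss curves below are illustrative numbers, not the project's actual values:

```python
def not_overfitting(train_losses, valid_losses):
    # The rule of thumb from the text: keep training as long as the
    # training loss stays below the validation loss at every epoch so far
    return all(t < v for t, v in zip(train_losses, valid_losses))

# Illustrative per-epoch loss curves
train_losses = [0.40, 0.20, 0.10, 0.05]
valid_losses = [0.45, 0.25, 0.15, 0.09]
print(not_overfitting(train_losses, valid_losses))  # → True
```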

While a few more epochs are possible, they will still take hour after hour to complete, and they will cost GPU time on your cloud service of choice. The question is always how many epochs we should train before stopping, before we feel satisfied to let the ML-classifier meet and greet the world in production. Which actually translates into the question: what is good enough for the specific production purpose?

Look at the accuracy chart above. It seems to flatline around the 10th epoch of training. That means the model has learned everything it can possibly learn from the available training data. To improve the model further, more images with different features would be needed, or some other tweak to the training.
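One way to make "it flatlines around the 10th epoch" concrete is a simple plateau check over the accuracy series. The numbers and tolerance below are illustrative, not read off the actual chart:

```python
def plateau_epoch(accuracies, tol=0.0002):
    # First epoch (1-based) after which no later epoch improves the
    # accuracy by more than tol: a simple way to spot the flatline
    for i, acc in enumerate(accuracies):
        if all(a - acc <= tol for a in accuracies[i + 1:]):
            return i + 1
    return len(accuracies)

# Illustrative accuracy per epoch, flattening out around epoch 10
accs = [0.9500, 0.9800, 0.9900, 0.9930, 0.9950, 0.9960,
        0.9966, 0.9970, 0.9972, 0.9974, 0.9974, 0.9975]
print(plateau_epoch(accs))  # → 10
```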

Practically, do we really need to improve the model from here? The answer should come from the customer's requirement specification, or from looking at it from a security perspective. The underlying concern is what consequences a wrong answer could have, and what kind of outcome you are willing to accept.

For this particular project, one can base the answer to what is good enough on the graph data and draw the following conclusions:

  • The highest accuracy of 0.997511 was reached after 19 epochs.
    Out of 1 000 images, this classifier will predict the wrong answer 2.5 times, or 0.25% of the time.
  • An accuracy of 0.997446 was reached after 10 epochs.
    Out of 1 000 images, an error will be made 2.6 times, or 0.26% of the time.
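The arithmetic behind these two bullet points is simply one minus the accuracy, scaled to 1 000 images:

```python
def errors_per_thousand(accuracy):
    # Expected number of wrong predictions per 1 000 images
    return (1.0 - accuracy) * 1_000

print(round(errors_per_thousand(0.997511), 1))  # → 2.5
print(round(errors_per_thousand(0.997446), 1))  # → 2.6
```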

In most use cases, it is not worth pursuing that last 0.01% of accuracy. The cost comes in the form of doubling the number of epochs to run, which translates into having the model train for another 18 hours on the Google Cloud Platform (GCP), on the relatively fast GPU I had chosen, doubling the cost in $$$.

Final accuracy benchmark for the trained ML-classifier

From the first blog post describing the results of this project, you might recall that 99.21% was my declared accuracy for the trained ML-model. As seen in the graph above, or in the metrics below, that means I stopped training after the 4th epoch, which took 8 hours to complete.

To put this ML-classifier into production, it makes more sense to stop at the 10th epoch, where the improvement more or less flatlines. The accuracy at the end of that training is 99.74%. Substantially better than the previous 99.21%, and worth pursuing.

Now that the model has been trained for 19 epochs, because I wanted to experiment and see what would happen if I just let it run a few more runs, and a few more, and… the final accuracy is hereby declared to be 99.75%!

This version of this ML-classifier is deployed for you to play around with. Head to this super simple web app and upload a face image of your choice to get back a probability prediction of real or fake.

Signing off here,
for the Jayway Blog,

Silvia Man,
Senior software engineer


The classifier I built is a variation of the fastai lab presented in Lesson 2 of the course Deep Learning for Coders, which I expanded quite a bit beyond its original scope into the experimentation described in this blog post.

Face datasets for training (and validation)

AI-generated – a total of 33 000 images

  1. The 1 million Fake Faces dataset – used 30 000 images
  2. The Generated Data Dataset – used 3 000 images

Real-world in the wild – total 43 000 images / 54 000 images

  1. The UTKFace dataset – used 24 000 images
  2. The LFW dataset – used 13 000 images
  3. The Large Age Gap Face Verification Dataset – used 3 800 images
  4. The Real and Fake Face Detection Dataset – used 2 x 1 000 images, both sets with real-world faces in the wild, but one of the sets was Photoshopped
  5. Flickr-Faces-HQ Dataset – used 11 000 images. This dataset was used by NVIDIA when training the GAN that became the AI-generator behind The 1 million Fake Faces dataset

Face dataset for test (indication of production quality)

AI-generated – a total of 30 000 images

  1. The 1 million Fake Faces dataset – used (another set of) 30 000 images

Real-world in the wild – a total of 202 000 images

  1. The CelebFaces Attributes dataset – used 202 000 images


  1. The 1 million Fake Faces dataset
    StyleGAN algorithm and model by NVIDIA under CC BY-NC 4.0
  2. The Generated Data Dataset
    Photo by Generated Photos
  3. The UTKFace dataset
    The UTKFace dataset is available for non-commercial research purposes only. The copyright belongs to the original owners.
  4. The LFW dataset
    Labeled Faces in the Wild is a public benchmark for face verification.
  5. The Large Age Gap Face Verification Dataset
    Bianco, Simone (2017). Large Age-Gap Face Verification by Feature Injection in Deep Networks. Pattern Recognition Letters, 90, 36–42. doi:10.1016/j.patrec.2017.03.006
  6. The Real and Fake Face Detection Dataset
    Available at Kaggle, License unknown, Visibility public
  7. Flickr-Faces-HQ Dataset
    The individual images were published in Flickr by their respective authors under either Creative Commons BY 2.0, Creative Commons BY-NC 2.0, Public Domain Mark 1.0, Public Domain CC0 1.0, or U.S. Government Works license. All of these licenses allow free use, redistribution, and adaptation for non-commercial purposes. However, some of them require giving appropriate credit to the original author, as well as indicating any changes that were made to the images. The license and original author of each image are indicated in the metadata.
  8. The CelebFaces Attributes dataset
    Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou (2015). Deep Learning Face Attributes in the Wild. Proceedings of the International Conference on Computer Vision (ICCV), December 2015.
