note
These notes come from High Quality Face Recognition with Deep Metric Learning
note
The documentation for dlib resides at http://dlib.net/term_index.html
Network Architecture
The Architecture
The model is a ResNet with 29 convolutional layers. It is a modified version of the 34-layer ResNet from Deep Residual Learning for Image Recognition by He, Zhang et al.
note
A few layers were removed from the 34-layer network, and the number of filters per layer was reduced by half.
Detail about Training and Loss Function
The network was trained starting from randomly initialized weights (how?), using a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6.
The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level.
The loss is described in the loss_metric_ documentation; however, D.King does not have a reference paper for it.
Loss Layer
The Object
This loss layer allows the network to learn to map objects into a vector space where objects sharing the same class label are close to each other. Specifically, it considers all pairs of objects in a mini-batch and computes a different loss depending on their respective class labels.
Loss Functions
So if objects A1 and A2 in a mini-batch share the same class label then their contribution to the loss is:
max(0, length(A1 - A2) - get_distance_threshold() + get_margin())
While if A1 and B1 have different class labels then their contribution to the loss function is:
max(0, get_distance_threshold() - length(A1 - A2) + get_margin())
For example, D.King trained the network using:
get_distance_threshold() = 0.6.
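The two hinge terms above can be sketched in plain Python. The 0.04 margin below is an illustrative assumption (get_margin() is configurable in dlib); the 0.6 threshold is the value from these notes.

```python
import math

def euclidean(a, b):
    """length(A1 - A2): Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_loss(a, b, same_label, distance_threshold=0.6, margin=0.04):
    """Hinge loss contribution of a single pair, as described above.

    Matching pairs are pulled closer than threshold - margin;
    non-matching pairs are pushed farther than threshold + margin.
    """
    d = euclidean(a, b)
    if same_label:
        return max(0.0, d - distance_threshold + margin)
    return max(0.0, distance_threshold - d + margin)

# Two toy descriptors that are 0.5 apart:
a1 = [0.0, 0.0]
a2 = [0.3, 0.4]
print(pair_loss(a1, a2, same_label=True))   # 0.0: already inside the ball
print(pair_loss(a1, a2, same_label=False))  # ~0.14: too close for different IDs
```

A pair only stops contributing once it is on the correct side of the threshold by at least the margin, which is what drives identities into separated balls.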
How to Balance the Loss
The loss balances the number of negative pairs relative to the number of positive pairs.
For example, if there are N pairs that share the same identity in a mini-batch, then the algorithm will only include the N worst non-matching pairs in the loss.
That is, the algorithm performs hard negative mining on the non-matching pairs. This is important since there are far more non-matching pairs than matching pairs, so this kind of hard negative mining helps avoid imbalance in the loss.
note
Refer to
dlib/dlib/dnn/loss.h
and dlib/dlib/dnn/loss_abstract.h
for details of loss layer class and its functions.
Training Datasets
The network was trained from scratch on a dataset of about 3 million faces.
Datasets Used
- the FaceScrub dataset
- the VGG dataset
- a large number of images scraped from the internet
note
We might also be interested in the Microsoft MS-Celeb-1M dataset, which is bigger than the dataset used here.
Dataset Cleanup
note
The datasets needed cleanup to remove labeling errors, which meant filtering out a lot of data from VGG.
Davis King (the author of dlib) did the cleanup by repeatedly training a face recognition model, then using graph clustering methods and a lot of manual review.
In the end, about half the images are from VGG and FaceScrub (LOL).
Dataset Cleanup Algorithm (Clustering)
Chinese Whispers (the paper should be in the dlib documentation).
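A minimal sketch of the Chinese Whispers algorithm, assuming the graph's edges connect descriptors whose distance fell under the matching threshold (the edge-building step is not shown; dlib ships its own implementation, `chinese_whispers`, in C++):

```python
import random
from collections import defaultdict

def chinese_whispers(edges, num_nodes, iterations=20, seed=0):
    """Chinese Whispers graph clustering (sketch).

    Each node starts in its own cluster; on every pass, visited in
    random order, each node adopts the most common label among its
    neighbours. Connected groups quickly converge to a shared label.
    """
    rng = random.Random(seed)
    neighbours = defaultdict(list)
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    labels = list(range(num_nodes))
    order = list(range(num_nodes))
    for _ in range(iterations):
        rng.shuffle(order)
        for node in order:
            if not neighbours[node]:
                continue  # isolated node keeps its own label
            counts = defaultdict(int)
            for n in neighbours[node]:
                counts[labels[n]] += 1
            labels[node] = max(counts, key=counts.get)
    return labels

# Two obvious groups: the triangle {0, 1, 2} and the pair {3, 4}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
labels = chinese_whispers(edges, 5)
print(labels)
```

The number of clusters is not fixed in advance, which is what makes the algorithm convenient for grouping an unknown number of identities during dataset cleanup.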
Result
The total number of identities in the dataset is 7,485.
note
D.King also made sure to avoid overlap with identities in LFW so the LFW evaluation is valid.
Existing Problems about dlib_face_recognition_resnet_model_v1
Problem 1
The model seems to perform poorly for African Americans.
The training dataset, along with LFW, is definitely biased towards white people in the sense that they are overrepresented in the data, although D.King did augment the training data with random color shifts.
note LFW is heavily biased towards white adult American public figures.
LFW Testing Script
The entire program that runs the LFW test can be found here.
Interesting Questions
Question 1
A user is trying to track unique faces in a video by checking that the computed Euclidean distance between face descriptors is within 0.6 (or some other threshold). When a new face appears, the descriptors collected for the previous person are collated.
The descriptor for a unique person would then be an "averaged descriptor", taken over, for example, 50 frames.
Would that approach be more accurate? Is averaging the embeddings of a person a good idea?
D.King Answered
Trying to average is not usually going to work very well.
To be specific, suppose you have two sets of points in 128 dimensions, call them A and B, such that all the points in A are within 0.6 distance of each other, and the same scenario for B. Moreover, suppose that none of the points in A are within 0.6 distance of any point in B.
It is surprising but true that it is quite likely that the distance between the centroid (the average of all points) of A and the centroid of B is less than 0.6.
This kind of thing can happen in low dimensions as well, but it becomes increasingly likely as the dimension goes up. (WHY?)
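A quick numerical illustration of the mechanism (not King's exact 0.6-ball construction; the cluster offset and noise scale below are made-up illustrative numbers): in 128 dimensions, per-coordinate noise adds a large, nearly constant amount to every point-to-point distance, but averaging 50 points cancels that noise, so the centroid gap collapses back to the small gap between the cluster centres.

```python
import math
import random

def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n, dim = len(points), len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dim)]

random.seed(0)
DIM, N = 128, 50
OFFSET, SIGMA = 0.55, 0.025  # illustrative numbers, not from dlib

# Two "identities": cluster centres OFFSET apart along the first axis,
# small Gaussian noise on every coordinate around each centre.
A = [[random.gauss(0.0, SIGMA) for _ in range(DIM)] for _ in range(N)]
B = [[(OFFSET if i == 0 else 0.0) + random.gauss(0.0, SIGMA)
      for i in range(DIM)] for _ in range(N)]

inter = [math.dist(a, b) for a in A for b in B]
mean_inter = sum(inter) / len(inter)
cent_gap = math.dist(centroid(A), centroid(B))

# Typical A-to-B distances exceed 0.6, yet the two centroids sit
# closer than 0.6: averaging cancels the noise component that kept
# the raw descriptors apart.
print(round(mean_inter, 3), round(cent_gap, 3))
```

The centroid gap is always at most the average pairwise gap (Jensen's inequality), and the shortfall grows with dimension because the noise contributes roughly sigma * sqrt(2 * DIM) to each raw distance but almost nothing to the distance between averages.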
You should use a k-nearest-neighbor type of algorithm.
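A minimal k-NN matcher over stored descriptors might look like this (toy 2-D descriptors for readability; real dlib descriptors are 128-D, and the gallery layout and `None`-for-unknown convention are assumptions of this sketch):

```python
import math
from collections import Counter

def classify(descriptor, gallery, k=3, threshold=0.6):
    """Label a face by majority vote among its k nearest neighbours.

    gallery: list of (descriptor, person_id) pairs already collected.
    Returns None when even the nearest neighbour is farther than the
    matching threshold, i.e. this looks like an unseen person.
    """
    ranked = sorted((math.dist(descriptor, d), pid) for d, pid in gallery)
    if not ranked or ranked[0][0] > threshold:
        return None
    votes = Counter(pid for _, pid in ranked[:k])
    return votes.most_common(1)[0][0]

gallery = [([0.0, 0.0], "alice"), ([0.1, 0.0], "alice"),
           ([1.0, 1.0], "bob"), ([1.0, 1.1], "bob")]
print(classify([0.05, 0.0], gallery))  # → alice
print(classify([5.0, 5.0], gallery))   # → None
```

Unlike averaging, this keeps every stored descriptor, so each comparison stays within the 0.6-ball geometry the network was trained for.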
Question 2
The general workflow of face recognition.
D.King Answered
See the C++ example "examples/dnn_face_recognition_ex.cpp"
Question 3
General questions about poor results when testing on random faces.
D.King Answered
Even on well-behaved data like LFW, 99.38% accuracy does not mean there will be no mistakes (not to mention that LFW is heavily biased towards white male Americans).
For instance, if you have 30 images and want to compare them all to each other, then 435 comparisons are needed, because that is how many pairs there are. Therefore, 1 - 0.9938^435 is the probability that at least one of those comparisons makes a mistake. That is about 0.933 -- very likely.
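The arithmetic checks out directly (assuming the 435 comparisons err independently, which is a simplification):

```python
import math

pairs = math.comb(30, 2)            # distinct pairs among 30 images
p_any_error = 1 - 0.9938 ** pairs   # chance at least one pair is misjudged
print(pairs, round(p_any_error, 3))
```

Per-comparison accuracy compounds quickly: even a 0.62% error rate becomes a near-certain mistake once hundreds of comparisons are involved.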
So you have to think carefully about how to use a system like this if you want to get good results. Be aware of the details.