note

These notes come from High Quality Face Recognition with Deep Metric Learning

note

The documentation for dlib resides at http://dlib.net/term_index.html

Network Architecture

The Architecture

The model is a ResNet network with 29 convolutional layers. It is a modified version of the 34-layer ResNet from Deep Residual Learning for Image Recognition by He et al.

note

A few layers were removed from the 34-layer network, and the number of filters per layer was reduced by half.

Detail about Training and Loss Function

The network was trained from randomly initialized weights (how?) using a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6.

The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level.

The loss is described in the loss_metric documentation; however, D.King does not have a reference paper for it.

Loss Layer

The Object

The loss lets the network learn to map objects into a vector space where objects sharing the same class label are close to each other. Specifically, it considers all pairs of objects in a mini-batch and computes a different loss depending on their respective class labels.

Loss Functions

If objects A1 and A2 in a mini-batch share the same class label, then their contribution to the loss is:

max(0, length(A1 - A2) - get_distance_threshold() + get_margin())

While if A1 and B1 have different class labels, then their contribution to the loss is:

max(0, get_distance_threshold() - length(A1 - A2) + get_margin())

For example, D.King trained the network by using:

get_distance_threshold() = 0.6.
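The two per-pair terms can be sketched in plain Python. This is an illustration, not dlib's actual implementation (which is the C++ loss_metric layer); the margin value below is a hypothetical choice, only the 0.6 distance threshold comes from the notes above.

```python
import numpy as np

# Sketch of the two per-pair hinge-loss terms described above.
DISTANCE_THRESHOLD = 0.6   # get_distance_threshold() used for this model
MARGIN = 0.04              # hypothetical value of get_margin()

def pair_loss(x1, x2, same_label):
    """Hinge-loss contribution of one pair of embedding vectors."""
    dist = np.linalg.norm(x1 - x2)
    if same_label:
        # Matching pairs incur loss until they are closer than
        # threshold - margin.
        return max(0.0, dist - DISTANCE_THRESHOLD + MARGIN)
    # Non-matching pairs incur loss until they are farther apart than
    # threshold + margin.
    return max(0.0, DISTANCE_THRESHOLD - dist + MARGIN)

a1 = np.zeros(128)
a2 = np.full(128, 0.07)  # about 0.79 away from a1 in Euclidean distance
print(pair_loss(a1, a2, same_label=True))   # positive: same label but too far
print(pair_loss(a1, a2, same_label=False))  # 0.0: already far enough apart
```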

How to Balance the Loss

The loss balances the number of negative pairs relative to the number of positive pairs.

For example, if there are N pairs that share the same identity in a mini-batch, then the algorithm will only include the N worst non-matching pairs in the loss.

That is, the algorithm performs hard negative mining on the non-matching pairs. This is important since there are far more non-matching pairs than matching pairs, so this kind of hard negative mining keeps the loss balanced.
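The mini-batch balancing step can be sketched as follows. This is standalone illustrative Python with hypothetical helper names, not dlib's actual code:

```python
import numpy as np

# With N matching pairs in the mini-batch, keep only the N hardest
# (i.e. closest) non-matching pairs for the loss.
def mined_pairs(embeddings, labels):
    n = len(labels)
    pos, neg = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            (pos if labels[i] == labels[j] else neg).append((d, i, j))
    # Hard negative mining: the smallest-distance negatives violate the
    # margin the most, so they are the "worst" non-matching pairs.
    neg.sort(key=lambda t: t[0])
    return pos, neg[:len(pos)]

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 128))       # 6 embeddings, 3 identities
labels = [0, 0, 1, 1, 2, 2]
pos, neg = mined_pairs(emb, labels)
print(len(pos), len(neg))  # 3 3 -- balanced, despite 12 negative pairs
```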

note

Refer to dlib/dlib/dnn/loss.h and dlib/dlib/dnn/loss_abstract.h for details of the loss layer class and its functions.

Training Datasets

The network was trained from scratch on a dataset of about 3 million faces.

Datasets Used

  • the FaceScrub dataset
  • the VGG Face dataset
  • a large number of images scraped from the internet

note

We might also be interested in the Microsoft MS-Celeb-1M dataset, which is bigger than the dataset used here.

Dataset Cleanup

note

The datasets need clean-up to remove labeling errors, which meant filtering out a lot of images from VGG.

Davis King (the author of dlib) did the cleanup by repeatedly training a face recognition model and then using graph clustering methods and a lot of manual review.

In the end, about half of the images are from VGG and FaceScrub (LOL).

Dataset Cleanup Algorithm (Clustering)

Chinese Whispers. (The paper should be in the dlib documentation.)

Result

The total number of identities in the dataset is 7,485.

note

D.King also made sure to avoid overlap with identities in LFW so the LFW evaluation is valid.

Existing Problems about dlib_face_recognition_resnet_model_v1

Problem 1

The model seems to perform poorly on African Americans.

The dataset, along with LFW, is definitely biased towards white guys in the sense that they are overrepresented in the data.

D.King did, however, augment the training data with random color shifts.

note

LFW is heavily biased towards white adult American public figures.

LFW Testing Script

The entire program that runs the LFW test can be found here.

Interesting Questions

Question 1

A dude is trying to track unique faces in a video by ensuring the computed Euclidean distance between faces is within 0.6 (or some other threshold). When a new face appears, the descriptors for the previous person are collated.

The descriptor for a unique person would then be an "averaged descriptor" computed over, say, 50 frames.

Would that approach be more accurate? Is averaging embeddings of a person a good idea?

D.King Answered

Trying to average is not usually going to work very well.

To be specific, suppose you have two sets of points in 128 dimensions, call them A and B, such that all the points in A are within 0.6 distance of each other, and the same scenario for B. Moreover, suppose that none of the points in A are within 0.6 distance of any point in B.

It is surprising but true that the distance between the centroid (average of all points) of A and the centroid of B is quite likely to be less than 0.6.

This kind of thing can happen in low dimensions as well, but it becomes increasingly likely as the dimension goes up. (WHY?)
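A constructed numerical example (NumPy, synthetic points rather than real face descriptors) shows the effect in 128 dimensions: two clusters whose within-cluster distances are all under 0.6 and whose cross-cluster distances are all over 0.6, yet whose centroids are much closer than 0.6.

```python
import numpy as np

# Cluster A: points offset by r along axes 1..60 from the origin.
# Cluster B: same construction, shifted 0.3 along axis 0 and using
# axes 61..120, so the offsets of A and B are mutually orthogonal.
dim, k, r = 128, 60, 0.37
eye = np.eye(dim)
A = r * eye[1:1 + k]
B = 0.3 * eye[0] + r * eye[1 + k:1 + 2 * k]

within = max(np.linalg.norm(p - q) for i, p in enumerate(A) for q in A[i + 1:])
cross = min(np.linalg.norm(p - q) for p in A for q in B)
centroid_gap = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

print(f"max within-cluster distance: {within:.3f}")        # 0.523 (< 0.6)
print(f"min cross-cluster distance:  {cross:.3f}")         # 0.603 (> 0.6)
print(f"distance between centroids:  {centroid_gap:.3f}")  # 0.308 (< 0.6)
```

The orthogonal offsets average out in the centroid, so the two averages collapse toward the cluster centers, which are only 0.3 apart. With more dimensions there is more room for such mutually orthogonal spread, which is why the failure becomes more likely as the dimension grows.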

You should use a k-nearest-neighbor type of algorithm instead.
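A minimal sketch of such a nearest-neighbor matcher, using hypothetical helper names and synthetic descriptors (the clusters here are exaggerated for clarity; real face descriptors come from the network):

```python
import numpy as np

def knn_identify(query, descriptors, labels, k=5, threshold=0.6):
    """Label a query descriptor by majority vote among its k nearest
    stored descriptors, ignoring neighbors beyond the threshold."""
    dists = np.linalg.norm(descriptors - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest if dists[i] < threshold]
    if not votes:
        return None  # treat as a new, unknown person
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(1)
alice = rng.normal(0.0, 0.02, size=(10, 128))        # tight cluster at origin
bob = rng.normal(0.0, 0.02, size=(10, 128)) + 1.0    # far-away cluster
descs = np.vstack([alice, bob])
labels = ["alice"] * 10 + ["bob"] * 10
print(knn_identify(alice[0] + 0.01, descs, labels))  # alice
```

Keeping all per-frame descriptors and voting over neighbors avoids the centroid collapse described above, because every comparison is against an actual observed point.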

Question 2

The general workflow of face recognition.

D.King Answered

See the C++ example "examples/dnn_face_recognition_ex.cpp"

Question 3

General questions about poor results when testing on random faces.

D.King Answered

Even on well-behaved data like LFW, 99.38% accuracy does not mean there will be no mistakes (not to mention that LFW is heavily biased towards white male Americans).

For instance, if you have 30 images and want to compare them all to each other, then 435 comparisons are needed, because that is how many pairs there are. Therefore, 1 - 0.9938^435 is the probability that at least one of those comparisons makes a mistake. That is about 0.933 -- very likely.
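The arithmetic can be checked directly:

```python
import math

# 30 images give C(30, 2) pairwise comparisons; at 99.38% per-pair
# accuracy, compute the chance that at least one comparison errs.
pairs = math.comb(30, 2)
p_all_correct = 0.9938 ** pairs
p_at_least_one_error = 1 - p_all_correct
print(pairs)                           # 435
print(round(p_at_least_one_error, 3))  # 0.933
```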

So you have to think carefully about how to use a system like this if you want to get good results. Be aware of the details.
