
DeepFace for Advanced Facial Recognition


Facial recognition has been a trending field in AI and ML for several years, and its cultural and social implications are far-reaching. Nevertheless, a performance gap between human visual systems and machines currently limits the applications of facial recognition.

To close that performance gap and deliver human-level accuracy, Meta introduced DeepFace, a facial recognition framework. The DeepFace model is trained on a large facial dataset that differs significantly from the datasets used to construct the evaluation benchmarks, and it has the potential to outperform existing frameworks with minimal adaptation. Moreover, the DeepFace framework produces compact face representations, in contrast to other systems that produce hundreds of facial appearance features.

The proposed DeepFace framework uses deep learning to train on a large dataset consisting of different forms of data, including images, videos, and graphics. The DeepFace network architecture assumes that once alignment is complete, the location of each facial region is fixed at the pixel level. It is therefore possible to use the raw pixel RGB values directly, without stacking multiple layers of convolutions as other frameworks do.

The standard pipeline of modern facial recognition frameworks comprises four stages: Detection, Alignment, Representation, and Classification; a skeletal code sketch of this pipeline follows the list below. The DeepFace framework employs explicit 3D face modeling to apply a piecewise affine transformation, and uses a nine-layer deep neural network to derive a facial representation. The DeepFace framework attempts to make the following contributions:

  1. Develop an effective deep neural network (DNN) architecture that can leverage a large dataset to create a facial representation that generalizes to other datasets. 
  2. Use explicit 3D modeling to develop an effective facial alignment system. 
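
Before diving into each stage, here is the promised skeletal sketch of how the four stages compose. This is not DeepFace's implementation; every function below is a hypothetical placeholder whose role the following sections describe in detail.

```python
import numpy as np

# Skeletal outline of the four-stage recognition pipeline. Each stage is
# a hypothetical placeholder, not DeepFace's actual code.

def detect(image: np.ndarray) -> np.ndarray:
    """Detection: find the face and return a crop around it."""
    raise NotImplementedError  # e.g. any standard face detector

def align(crop: np.ndarray) -> np.ndarray:
    """Alignment: warp the crop to a canonical frontal pose."""
    raise NotImplementedError  # 2D similarity warp + 3D frontalization

def represent(aligned: np.ndarray) -> np.ndarray:
    """Representation: map the aligned face to a compact feature vector."""
    raise NotImplementedError  # the deep network's F7 output (see below)

def classify(features: np.ndarray) -> int:
    """Classification: assign an identity to the feature vector."""
    raise NotImplementedError  # K-way softmax, or pairwise verification

def recognize(image: np.ndarray) -> int:
    # The four stages compose into a single recognition function.
    return classify(represent(align(detect(image))))
```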

Understanding the Working of the DeepFace Model

Face Alignment

Face alignment is a technique that rotates the image of an individual according to the angle of the eyes. Face alignment is a popular way to preprocess data for facial recognition, and facially aligned datasets help improve the accuracy of recognition algorithms by providing a normalized input. However, aligning faces in an unconstrained setting can be a difficult task because of the many factors involved, such as non-rigid expressions, body poses, and more. Several sophisticated alignment techniques, such as using an analytical 3D model of the face or searching for fiducial points in an external dataset, may allow developers to overcome these challenges. 

Although alignment is the preferred method for dealing with unconstrained face verification and recognition, there is no perfect solution at the moment. 3D models are also used, but their popularity has declined significantly in the past few years, especially in unconstrained environments. However, because human faces are 3D objects, 3D modeling may well be the right approach if applied correctly. The DeepFace model uses a system of fiducial points to create an analytical 3D model of the face. This 3D model is then used to warp a facial crop into a 3D frontal mode. 

Moreover, like most alignment practices, DeepFace alignment uses fiducial point detectors to direct the alignment process. Although the DeepFace model uses a simple point detector, it applies the detector over several iterations to refine the output. At each iteration, a Support Vector Regressor (SVR), trained to predict point configurations, extracts the fiducial points from an image descriptor. DeepFace's image descriptor is based on LBP histograms, although other features are considered as well. 
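
A minimal sketch of this iterative refinement loop is shown below, assuming a regressor already trained to map LBP-histogram descriptors to point corrections. The patch size, LBP parameters, and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def lbp_descriptor(gray: np.ndarray, points: np.ndarray, patch: int = 16) -> np.ndarray:
    """Concatenate LBP histograms of patches around each estimated point."""
    feats = []
    for x, y in points.astype(int):
        window = gray[max(y - patch, 0): y + patch, max(x - patch, 0): x + patch]
        lbp = local_binary_pattern(window, P=8, R=1, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        feats.append(hist)
    return np.concatenate(feats)

def refine_points(gray: np.ndarray, initial_points: np.ndarray,
                  svr: MultiOutputRegressor, n_iters: int = 4) -> np.ndarray:
    """Repeatedly predict (dx, dy) corrections and nudge the points."""
    points = initial_points.astype(float)
    for _ in range(n_iters):
        delta = svr.predict(lbp_descriptor(gray, points)[None, :])[0]
        points = points + delta.reshape(-1, 2)
    return points

# Training would fit the regressor on (descriptor, offset) pairs, e.g.:
#   svr = MultiOutputRegressor(SVR()).fit(X_descriptors, Y_offsets)
```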

2D Alignment

The DeepFace model initiates the alignment process by detecting six fiducial points within the detection crop, centered at the centers of the eyes, the tip of the nose, and the mouth locations. These points are used to rotate, scale, and translate the image onto six anchor locations, and the model iterates on the warped image until there is no visible change. The aggregated transformation then generates a 2D-aligned crop. The alignment method is quite similar to the one used in LFW-a, which has been used over the years in an attempt to boost model accuracy. 
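
The similarity warp at the heart of this step can be sketched with OpenCV; the anchor coordinates and crop size below are illustrative placeholders, not the values DeepFace uses.

```python
import cv2
import numpy as np

# Six illustrative anchor locations in a canonical crop: eye centers,
# nose tip, and mouth points. These are made-up coordinates.
ANCHORS = np.float32([
    [70, 100], [130, 100],               # eye centers
    [100, 140],                          # nose tip
    [75, 180], [100, 185], [125, 180],   # mouth locations
])

def align_2d(image: np.ndarray, fiducials: np.ndarray,
             size: tuple = (200, 240)) -> np.ndarray:
    """Fit a similarity transform (rotation, scale, translation) mapping
    the six detected fiducials onto the anchors, then warp the image."""
    M, _ = cv2.estimateAffinePartial2D(fiducials.astype(np.float32), ANCHORS)
    return cv2.warpAffine(image, M, size)

# In the full pipeline this is iterated: fiducials are re-detected on the
# warped image and the transform re-fitted until nothing visibly changes.
```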

3D Alignment

To align faces with out-of-plane rotations, the DeepFace framework uses a generic 3D shape model and registers a 3D camera that can be used to warp the 2D-aligned crop onto the 3D shape in its image plane. As a result, the model generates the 3D-aligned version of the crop, which is achieved by localizing an additional 67 fiducial points in the 2D-aligned crop using a second Support Vector Regressor (SVR). 

The model then manually places the 67 anchor points on the 3D shape, and is thereby able to achieve full correspondence between the 3D references and their corresponding fiducial points. In the next step, a 3D-to-2D affine camera is fitted using a generalized least-squares solution to the linear system, with a known covariance matrix, that minimizes the error between the projected 3D reference points and the detected fiducial points. 
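
The camera fitting reduces to a linear least-squares problem. For clarity, the sketch below uses ordinary least squares, which is the special case of the generalized solution with an identity covariance matrix.

```python
import numpy as np

def fit_affine_camera(ref_3d: np.ndarray, fid_2d: np.ndarray) -> np.ndarray:
    """Fit the 2x4 affine camera P minimizing ||X_h @ P.T - fid_2d||^2,
    where ref_3d (67, 3) holds the reference points on the 3D shape and
    fid_2d (67, 2) the detected fiducials in the 2D-aligned crop."""
    n = ref_3d.shape[0]
    X_h = np.hstack([ref_3d, np.ones((n, 1))])   # homogeneous coords, (67, 4)
    P_t, *_ = np.linalg.lstsq(X_h, fid_2d, rcond=None)
    return P_t.T                                  # (2, 4) affine camera

def project(P: np.ndarray, pts_3d: np.ndarray) -> np.ndarray:
    """Project 3D points to the image plane with the fitted camera."""
    n = pts_3d.shape[0]
    return np.hstack([pts_3d, np.ones((n, 1))]) @ P.T
```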

Frontalization

Since non-rigid deformations and full perspective projections are not modeled, the fitted 3D-to-2D camera serves only as an approximation. In an attempt to reduce the corruption of important identity-bearing factors in the final warp, the DeepFace model adds the corresponding residuals to the x-y components of each reference fiducial point. Such relaxation, for the purpose of warping the 2D image with less distortion to the identity, is plausible; without it, the faces would all be warped into the same shape in 3D, losing important discriminative factors in the process. 

Finally, the model achieves frontalization by using a piecewise affine transformation directed by the Delaunay triangulation derived from the 67 fiducial points. The full alignment pipeline proceeds as follows:

  1. Detected face with six fiducial points. 
  2. Induced 2D-aligned crop. 
  3. 67 fiducial points on the 2D-aligned crop. 
  4. Reference 3D shape transformed to the 2D-aligned crop image. 
  5. Triangle visibility with respect to the fitted 3D-2D camera. 
  6. 67 fiducial points induced by the 3D model. 
  7. 3D-aligned version of the final crop. 
  8. New view generated by the 3D model. 
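
The final warp (step 7 above) can be sketched with scikit-image, whose PiecewiseAffineTransform implements exactly this Delaunay-triangulated, per-triangle affine mapping; the source and target point arrays are assumed to come from the preceding steps.

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def frontalize(crop: np.ndarray, src_pts: np.ndarray,
               dst_pts: np.ndarray) -> np.ndarray:
    """Warp each Delaunay triangle spanned by the 67 detected points
    (src_pts) onto its counterpart among the frontal target positions
    (dst_pts) induced by the 3D model."""
    tform = PiecewiseAffineTransform()
    # warp() expects the inverse mapping (output coords -> input coords),
    # so the transform is estimated from destination to source points.
    tform.estimate(dst_pts, src_pts)
    return warp(crop, tform, output_shape=crop.shape[:2])
```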

Representation

As the amount of training data increases, learning-based methods have proven more efficient and accurate than engineered features, primarily because learning-based methods can discover and optimize features for a specific task. 

DNN Architecture and Training

The DeepFace DNN is trained on a multi-class facial recognition task that classifies the identity of a face image. 

The above figure represents the general architecture of the DeepFace model. The model has a convolutional layer (C1) with 32 filters of size 11×11×3 that is fed a 3D-aligned, 3-channel RGB image of size 152×152 pixels, and it produces 32 feature maps. These feature maps are then fed to a max pooling layer (M2) that takes the maximum over 3×3 spatial neighborhoods with a stride of 2, separately for each channel. This is followed by another convolutional layer (C3) that comprises 16 filters, each of size 9×9×16. The primary purpose of these layers is to extract low-level features such as texture and simple edges. The advantage of max pooling layers is that they make the output of the convolutional layers more robust to local translations, and when applied to aligned face images, they make the network much more robust to small-scale registration errors. 

Multiple levels of pooling do make the network more robust to certain situations, but they also cause the network to lose information about the precise position of micro-textures and detailed facial structures. To avoid losing this information, the DeepFace model applies max pooling only after the first convolutional layer. The model treats these front layers as a front-end adaptive pre-processing step: although they do most of the computation, they have few parameters of their own, and they merely expand the input into a set of simple local features. 

The subsequent layers L4, L5, and L6 are locally connected: like a convolutional layer they apply a filter bank, but every location in the feature map learns a different set of filters. Since different regions of an aligned image have different local statistics, the spatial stationarity assumption of convolution cannot hold. For instance, the area between the eyes and the eyebrows has a much higher discrimination ability than the area between the nose and the mouth. Using locally connected layers affects the number of parameters subject to training, but does not affect the computational burden during feature extraction. 

The DeepFace model can afford three such layers only because it has a large amount of well-labeled training data. The use of locally connected layers can be justified further because each output unit of a locally connected layer is affected by a very large patch of the input. 

Finally, the top two layers are fully connected, with each output unit connected to all inputs. These two layers can capture correlations between features captured in distant parts of the face image, such as the position and shape of the mouth and the position and shape of the eyes. The output of the first fully connected layer (F7) is used by the network as its raw face representation feature vector. The model then feeds the output of the last fully connected layer (F8) to a K-way softmax that produces a distribution over class labels. 
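
The stack described above can be sketched in PyTorch. Note that PyTorch ships no locally connected (untied-weight) layer, so L4-L6 are approximated below with ordinary convolutions of the same filter sizes; the strides and resulting spatial sizes are properties of this sketch, not exact reproductions of the paper's feature-map dimensions.

```python
import torch
import torch.nn as nn

class DeepFaceSketch(nn.Module):
    """Approximate sketch of the nine-layer architecture described above."""

    def __init__(self, n_identities: int = 4030):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=11), nn.ReLU(),  # C1: 32 x 11x11x3
            nn.MaxPool2d(kernel_size=3, stride=2),        # M2: 3x3, stride 2
            nn.Conv2d(32, 16, kernel_size=9), nn.ReLU(),  # C3: 16 x 9x9
        )
        # L4-L6: convolutional stand-ins for the locally connected layers.
        self.local = nn.Sequential(
            nn.Conv2d(16, 16, kernel_size=9), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5), nn.ReLU(),
        )
        self.f7 = nn.Linear(16 * 20 * 20, 4096)   # spatial size implied by this sketch
        self.f8 = nn.Linear(4096, n_identities)   # logits for the K-way softmax

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, 152, 152) batch of 3D-aligned RGB crops
        x = self.local(self.frontend(x))
        rep = torch.relu(self.f7(x.flatten(1)))   # F7: raw face representation
        return self.f8(rep)                        # F8: fed to softmax in training
```

At verification time the F7 activations (`rep`) serve as the face descriptor, while F8 and the softmax are only needed for the multi-class training objective.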

Datasets

The DeepFace model uses a combination of datasets, with the Social Face Classification (SFC) dataset being the primary one. The DeepFace model also uses the LFW and YTF datasets. 

SFC Dataset

The SFC dataset is collected from a set of Facebook images, and it consists of 4.4 million labeled images of 4,030 individuals, with 800 to 1,200 faces each. The most recent 5% of each identity's face images in the SFC dataset are left out for testing purposes.

LFW Dataset

The LFW dataset consists of 13,323 photos of over five thousand celebrities, which are divided into 6,000 face pairs across 10 splits. 

YTF Dataset

The YTF dataset consists of 3,425 videos of 1,595 subjects, a subset of the celebrities in the LFW dataset. 

Results

Without frontalization, using only the 2D alignment, the model achieves an accuracy score of only about 94.3%. When the model uses the center crop of the face detection, it does not use any alignment, and in this case the model returns an accuracy score of 87.9%, because some parts of the facial region may fall outside the center crop. To evaluate the discriminative capability of the face representation in isolation, the model follows the unsupervised learning setting and compares the inner product of normalized features. This boosts the mean accuracy of the model to 95.92%. 

The figure above compares the performance of the DeepFace model with that of other state-of-the-art facial recognition models. 

The above picture depicts the ROC curves on the dataset. 
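
The unsupervised verification setting mentioned above reduces to a thresholded inner product of L2-normalized representations. A minimal sketch follows; the threshold value is illustrative, since in practice it would be tuned on held-out pairs.

```python
import numpy as np

def cosine_similarity(f1: np.ndarray, f2: np.ndarray) -> float:
    """Inner product of the L2-normalized face representations."""
    return float((f1 / np.linalg.norm(f1)) @ (f2 / np.linalg.norm(f2)))

def same_identity(f1: np.ndarray, f2: np.ndarray,
                  threshold: float = 0.65) -> bool:
    """Declare a match when the similarity clears the (illustrative) threshold."""
    return cosine_similarity(f1, f2) >= threshold
```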

Conclusion

Ideally, a face classifier would recognize faces with the accuracy of a human, and it would return high accuracy regardless of image quality, pose, expression, or illumination. Furthermore, an ideal facial recognition framework would be applicable to a variety of applications with little to no modification. Although DeepFace is one of the most advanced and efficient facial recognition frameworks available, it is not perfect, and it may not deliver accurate results in certain situations. Still, the DeepFace framework is a significant milestone in the facial recognition industry, closing the performance gap by making use of a powerful metric learning technique, and it should continue to become more efficient over time. 
