Principles of construction and advantages of 3D face recognition systems
Recognition of objects falling into the field of view of video surveillance systems is an important and technically complex task. Many developers, including the Vokord company, have been successfully working on this issue for a long time.
In particular, recognition systems based on the analysis of a two-dimensional image have long been used in practice. The next step is the development of a more promising technology that recognizes objects from their three-dimensional models (3D recognition).
Advantages of 3D technologies
Currently, recognition systems are usually divided into two categories – two-dimensional (based on flat, or two-dimensional, images, 2D) and three-dimensional (recognition is carried out using reconstructed three-dimensional images, 3D).
Recognition systems based on 2D images have a number of significant drawbacks. For example, 2D recognition systems are very sensitive to lighting conditions: when the face is unevenly illuminated, the reliability of 2D recognition drops significantly. For 3D recognition systems, by contrast, changes in illumination affect only the texture of the face, while the reconstructed surface of the face remains unchanged.
Another important difference between 3D and 2D recognition technologies is robustness to changes in head pose. To compensate for pose changes, 2D recognition transforms the image to a canonical (frontal) position. However, the effectiveness of this approach depends on how accurately the anthropometric points are placed on the face, and it does not work well when the angle deviates strongly from the frontal view. The situation is further aggravated by the fact that, even with ideally accurate placement of anthropometric points, the problem of bringing the image to a canonical view has no strict mathematical solution because of the properties of perspective projection. As a result, even for the best 2D recognition systems the permissible deviation from the frontal position is 15 degrees vertically and horizontally.
In 3D recognition, the permissible angle of head deviation from the frontal view can reach 45 degrees. If the reconstructed model and its reference image stored in the database are obtained in different views, then the model can be rotated using software. In addition, the object can be rotated and brought to the front view for subsequent recognition using standard two-dimensional algorithms.
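The software rotation mentioned above can be illustrated with a minimal sketch. It assumes the head turn is a pure rotation about the vertical axis and that the angle is already known; the function name `rotate_yaw` and the landmark coordinates are hypothetical and are not taken from any particular system.

```python
import math

def rotate_yaw(points, angle_deg):
    """Rotate 3D points (x, y, z) about the vertical y axis by angle_deg degrees."""
    angle = math.radians(angle_deg)
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

# Hypothetical landmarks in mm for a frontal face: two eye centers and the nose tip.
frontal = [(-32.0, 0.0, 0.0), (32.0, 0.0, 0.0), (0.0, -40.0, 30.0)]

# Simulate a capture with the head turned by 45 degrees, then undo the turn in software.
captured = rotate_yaw(frontal, 45.0)
restored = rotate_yaw(captured, -45.0)
```

Because a rotation is a rigid transformation, the distances between landmarks are the same in `captured` and `restored`; only the viewpoint changes.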
Facial recognition systems use stable anthropometric points, the location of which characterizes the individual features of the face. On 3D models, anthropometric points are determined with greater accuracy than on 2D images. In addition, points on 3D models have three coordinates and, accordingly, provide more information than the same points on a 2D image. Figure 1 shows an example of automatic placement of 68 anthropometric points.
Figure 1. Anthropometric points connected into triangles
Another important advantage of 3D recognition systems is the ability to use absolute distances between biometric points, while 2D recognition systems can only work with relative sizes.
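The difference between absolute and relative sizes can be shown with a short sketch under a simple pinhole camera assumption. The focal length `FOCAL_PX`, the eye positions and the function names are hypothetical values chosen only for illustration.

```python
import math

FOCAL_PX = 2000.0  # assumed focal length in pixels (hypothetical)

def project(point, focal_px):
    """Pinhole projection of a 3D point (mm) onto the image plane (pixels)."""
    x, y, z = point
    return (focal_px * x / z, focal_px * y / z)

def eye_distances(depth_mm):
    """Return (3D distance in mm, 2D distance in pixels) between two eye centers."""
    left = (-32.0, 0.0, depth_mm)
    right = (32.0, 0.0, depth_mm)
    d3 = math.dist(left, right)
    d2 = math.dist(project(left, FOCAL_PX), project(right, FOCAL_PX))
    return d3, d2

near = eye_distances(1000.0)  # face 1 m from the camera
far = eye_distances(2000.0)   # same face 2 m from the camera
```

In this sketch the 3D distance is the same at both depths, while the pixel distance halves when the face moves twice as far away, which is why a 2D system can rely only on ratios of sizes.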
Traditional 2D recognition systems use areas of the image with high contrast, such as the eyes, mouth, nose and facial boundaries, and make poor use of information in low-contrast areas such as the cheeks, forehead and chin. Unlike 2D recognition, 3D recognition also analyzes information from low-contrast areas. Moreover, the shape of the forehead and other weakly deformable areas of the face changes little with facial expression (a smile, for example), which 3D recognition also exploits.
However, 3D recognition is not perfect either. For example, lighting is not a problem at the 3D recognition stage, but it can negatively affect the result of 3D reconstruction of the face shape. Depending on the reconstruction algorithm, some parts of the face (e.g. overexposed areas or areas with very low contrast) may appear as gaps or outliers (artifacts) on the reconstruction surface.
Another disadvantage of 3D recognition is the high cost of the equipment used, since a 3D recognition system requires much more computing resources than 2D recognition systems.
Until recently, the insufficient implementation of 3D systems was probably due to the lack of high-resolution video sensors on the market. Research by leading developers in the field of 3D recognition, as well as the emergence of commercially available video cameras, should, in my opinion, stimulate the development of 3D recognition systems.
Directions of 3D recognition
Among the various approaches to 3D recognition, three main ones can be distinguished: analysis of the shape of the 3D surface of the face, statistical approaches, and the use of a parametric face model.
Methods based on the analysis of the shape of a 3D face image use the geometry of the surface that describes the face. These approaches fall into three groups: those using local or global properties of the surface (for example, curvature), those using line profiles, and those using metrics of the distance between two surfaces.
Surface curvature can be used to segment the face surface into features that are then used to compare surfaces. Another approach is based on 3D descriptors of the face surface in terms of mean and Gaussian curvature, or in terms of distances and angle ratios between characteristic points of the surfaces. Another locally oriented method uses point signatures: a description of a selected surface point is formed from the points in its neighborhood, and these point signatures are then used to compare surfaces.
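As an illustration of curvature descriptors, the following sketch estimates the mean and Gaussian curvature of a surface given as a depth function z(x, y), using the standard formulas for a graph surface and central finite differences. The function name `curvatures` and the step size are assumptions made for this example.

```python
import math

def curvatures(z, x, y, h=0.01):
    """Estimate (mean, Gaussian) curvature of the surface z(x, y) at one point."""
    # First and second partial derivatives by central differences.
    zx = (z(x + h, y) - z(x - h, y)) / (2 * h)
    zy = (z(x, y + h) - z(x, y - h)) / (2 * h)
    zxx = (z(x + h, y) - 2 * z(x, y) + z(x - h, y)) / h ** 2
    zyy = (z(x, y + h) - 2 * z(x, y) + z(x, y - h)) / h ** 2
    zxy = (z(x + h, y + h) - z(x + h, y - h)
           - z(x - h, y + h) + z(x - h, y - h)) / (4 * h ** 2)
    g = 1 + zx ** 2 + zy ** 2
    gaussian = (zxx * zyy - zxy ** 2) / g ** 2
    mean = ((1 + zy ** 2) * zxx - 2 * zx * zy * zxy
            + (1 + zx ** 2) * zyy) / (2 * g ** 1.5)
    return mean, gaussian

# Check on a spherical cap of radius 80 mm, where both curvatures are constant.
RADIUS = 80.0

def sphere_cap(x, y):
    return math.sqrt(RADIUS ** 2 - x ** 2 - y ** 2)

mean_curv, gauss_curv = curvatures(sphere_cap, 10.0, -5.0)
```

On a sphere the Gaussian curvature is 1/R² everywhere and the magnitude of the mean curvature is 1/R, so a sphere is a convenient sanity check before applying the same estimate to a face depth map.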
To improve the efficiency of recognition algorithms, a method is used in which the parts of the surface that change with facial expression are excluded from consideration, so that only the rigid parts of the face serve as the basic information for recognition. In addition to 3D information, texture information from facial areas is also used.
There are also hybrid methods based on combining local surface information in the form of local moments with a global 3D grid describing the entire face surface.
In one of these methods, the function Z(x,y), which describes the "face depth map" in an aligned coordinate system, is decomposed into Fourier components. Decomposing the function into moments (basis functions) makes it possible to smooth out small high-frequency "face noise" and random outliers.
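The smoothing effect of a truncated Fourier decomposition can be sketched as follows. This is a plain discrete Fourier transform on a small grid, written for clarity rather than speed; the names `dft`, `dft2` and `lowpass_depth_map` and the cutoff rule are assumptions for this example and do not describe any specific published method.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive 1D discrete Fourier transform (or its inverse)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        out.append(total / n if inverse else total)
    return out

def dft2(grid, inverse=False):
    """2D transform: apply the 1D transform to every row, then to every column."""
    rows = [dft(row, inverse) for row in grid]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def lowpass_depth_map(depth, cutoff):
    """Keep only Fourier components whose frequency index is <= cutoff on both axes."""
    n_rows, n_cols = len(depth), len(depth[0])
    spectrum = dft2(depth)
    for ky in range(n_rows):
        for kx in range(n_cols):
            # Indices above n/2 represent negative frequencies.
            fy = min(ky, n_rows - ky)
            fx = min(kx, n_cols - kx)
            if fx > cutoff or fy > cutoff:
                spectrum[ky][kx] = 0
    smooth = dft2(spectrum, inverse=True)
    return [[value.real for value in row] for row in smooth]

# Synthetic 8 x 8 depth map: a smooth shape plus a high-frequency ripple.
N = 8
clean = [[10 * math.cos(2 * math.pi * x / N) + 5 * math.cos(2 * math.pi * y / N)
          for x in range(N)] for y in range(N)]
noisy = [[clean[y][x]
          + 0.3 * math.cos(2 * math.pi * 3 * x / N) * math.cos(2 * math.pi * 3 * y / N)
          for x in range(N)] for y in range(N)]
smoothed = lowpass_depth_map(noisy, cutoff=1)
```

In this synthetic case the ripple lives entirely in the discarded components, so `smoothed` matches `clean`; on real data the cutoff is a trade-off between noise suppression and loss of fine detail.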
In addition to the Fourier decomposition, other basis functions are also used: power series, Legendre polynomials, and Zernike moments.
Global methods use all the information about the 3D image of the entire face as input to the recognition system. For example, a face model is aligned based on its mirror symmetry, after which the face profiles along the alignment plane are extracted and compared. A method for comparing face models based on the maximum and minimum values and the direction of profile curvature is also used.
Another approach compares surfaces using the distances between them. Some methods calculate metrics of the smallest distances between the model surfaces; others measure not only the distance between the surfaces but also the difference in texture on them. However, a significant limitation of these methods is that they assume the face does not deform, that is, that its surface is rigid.
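One simple instance of a smallest-distance metric is the mean closest-point distance between two point sets sampled from the surfaces. The sketch below uses brute-force search and assumes the surfaces are already aligned; the function names are hypothetical.

```python
import math

def mean_closest_distance(source, target):
    """Average distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

def symmetric_surface_distance(a, b):
    """Symmetrized version, so that the result does not depend on argument order."""
    return 0.5 * (mean_closest_distance(a, b) + mean_closest_distance(b, a))

# Two synthetic surface patches: a flat 5 x 5 grid and a copy shifted by 2 mm in depth.
patch_a = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
patch_b = [(x, y, z + 2.0) for x, y, z in patch_a]

distance = symmetric_surface_distance(patch_a, patch_b)
```

For real models the brute-force search would normally be replaced by a spatial index such as a k-d tree, since the cost here grows with the product of the two point counts.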
The third approach is based on the extraction and analysis of three-dimensional profiles and contours selected on the face.
Statistical methods, in particular the Principal Component Analysis (PCA), were previously widely used in 2D recognition. The PCA method has also been implemented for 3D recognition and has been extended to a combination of depth and color maps. An alternative to PCA is the linear discriminant analysis method, in which, unlike PCA, one object (a given person) is defined not by one face, but by a set of models (3D faces).
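The core step of PCA, finding the direction of largest variance in a set of feature vectors and projecting onto it, can be sketched with power iteration. This is a minimal illustration that assumes each face is already flattened into a numeric vector (for example, a depth map) and extracts only the first component; the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def first_principal_component(samples, iterations=200):
    """Return (mean vector, unit direction of largest variance) via power iteration."""
    n, dim = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    centered = [[s[i] - mean[i] for i in range(dim)] for s in samples]
    w = [1.0] * dim  # arbitrary start; assumed not orthogonal to the answer
    for _ in range(iterations):
        # Multiply w by the (unnormalized) covariance matrix without forming it.
        new_w = [0.0] * dim
        for row in centered:
            weight = dot(row, w)
            for i in range(dim):
                new_w[i] += weight * row[i]
        norm = math.sqrt(dot(new_w, new_w))
        w = [value / norm for value in new_w]
    return mean, w

def project_onto(sample, mean, w):
    """Coefficient of a sample along the principal direction."""
    return dot([s - m for s, m in zip(sample, mean)], w)

# Toy data: five 2D samples lying on a line with direction (0.6, 0.8).
samples = [[1 + 0.6 * t, 2 + 0.8 * t] for t in (-2, -1, 0, 1, 2)]
mean, component = first_principal_component(samples)
coefficient = project_onto([1.6, 2.8], mean, component)
```

A recognition system would keep several components and compare faces by their coefficient vectors; the single-component version is shown only to keep the example short.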
The statistical methods described so far do not take into account changes in the shape of the face surface caused by facial expressions. To minimize this effect, approaches based on isometric transformations were developed. Such transformations model expression-induced changes in face shape without changing the distance, measured along the surface, between any two given points on the face. For example, a transformation of the face shape to a canonical form is used.
These methods used the PCA algorithm at the final stage of recognition, which was applied to the canonical shape of the face.
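The invariance idea can be illustrated on a one-dimensional profile, under the assumption that the preserved quantity is the distance measured along the surface (the geodesic distance). In this sketch a flat strip is bent into a circular arc without stretching; the names `bend_profile` and `path_length` and the numeric values are hypothetical.

```python
import math

def bend_profile(arc_lengths, radius):
    """Bend a flat profile into a circular arc, keeping arc length unchanged."""
    return [(radius * math.sin(s / radius), radius * (1 - math.cos(s / radius)))
            for s in arc_lengths]

def path_length(points):
    """Length measured along the profile (a discrete geodesic distance)."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# A 100 mm profile sampled every 0.5 mm, flat and bent to a 60 mm radius.
arc_lengths = [0.5 * i for i in range(201)]
flat = [(s, 0.0) for s in arc_lengths]
bent = bend_profile(arc_lengths, 60.0)

along_flat = path_length(flat)
along_bent = path_length(bent)
straight_flat = math.dist(flat[0], flat[-1])
straight_bent = math.dist(bent[0], bent[-1])
```

The length along the profile stays practically the same after bending, while the straight-line distance between the end points shrinks noticeably. Features built on along-the-surface distances are therefore less affected by such deformations than features built on straight-line distances.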
There are also recognition methods based on parametric models of the face. Model-based recognition relies on so-called parametric 3D models, in which the shape of the face is controlled by a set of model parameters (coefficients). These coefficients describe the 3D shape of the face and can also specify the color (texture) on its surface. The model created in this way is then projected onto 2D images, from which the model parameters for the given image are determined.
The disadvantages of the method are its high computational complexity and its sensitivity to the initialization of the model parameters. To overcome these difficulties, models consisting of independent sections were developed. One of the methods uses a three-dimensional surface of the average face, which is deformed to fit a given three-dimensional surface using anatomical anthropometric points on the face. The deformation parameters, calculated during 3D reconstruction using an elastic model, serve as the distinctive features of a given face. The initial data is an unordered point cloud obtained from 3D reconstruction of the face area, and a polygonal flexible 3D face model is fitted to this point cloud (Figure 2).
The fitting of a polygonal 3D flexible face model is based on a physical analogy: like an elastic flexible mask pulled over a face, a generalized model under the action of external forces (attraction to a 3D point cloud) and internal forces (tension, elasticity) takes the shape of a specific person's face. The following operations are performed:
Primary alignment. Using the Iterative Closest Point (ICP) algorithm, the flexible model is aligned to the point cloud without deformation. The initial approximation is roughly specified by the eye centers, the tip of the nose, and the center of the mouth.
Deformation of the model with the purpose of attraction to the point cloud. In the numerical solution of the problem, each face of the flexible model is considered as a curvilinear finite element.
To improve the accuracy of the approximation of the model, the subdivision surfaces method is used, in which each finite element is approximated by the sum of triangles.
A system of linear equations based on the Lagrange equations of motion of the physical model is solved approximately:
$$M \ddot{P} + D \dot{P} + K P = f_p$$
where M is the mass matrix of the flexible model, D is the damping matrix, K is the elasticity matrix, f_p are the external forces, and P are the generalized coordinates of the elastic model.
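A minimal sketch of how such a system can be relaxed numerically is given below. It assumes the usual second-order form M P'' + D P' + K P = f_p, takes the mass and damping matrices as scalar multiples of the identity, and uses a simple semi-implicit Euler time step. The function names, the tiny 2 x 2 elasticity matrix and all numeric values are hypothetical and chosen only so that the result is easy to check.

```python
def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def relax(K, f, mass=1.0, damping=2.0, dt=0.01, steps=5000):
    """Integrate m*P'' + d*P' + K*P = f from rest until the motion dies out."""
    n = len(f)
    p = [0.0] * n  # generalized coordinates
    v = [0.0] * n  # their velocities
    for _ in range(steps):
        elastic = mat_vec(K, p)
        for i in range(n):
            acceleration = (f[i] - damping * v[i] - elastic[i]) / mass
            v[i] += dt * acceleration
        for i in range(n):
            p[i] += dt * v[i]  # uses the updated velocity (semi-implicit Euler)
    return p

# Two coupled degrees of freedom pulled by a constant external force on the first one.
K = [[2.0, -1.0], [-1.0, 2.0]]
f = [1.0, 0.0]
equilibrium = relax(K, f)
```

Once the damping has removed the kinetic energy, the coordinates settle where the elastic forces balance the external ones, that is, at the solution of K P = f. In the fitting task the external forces would come from attraction to the point cloud rather than being constant.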
When the elastic model is deformed, the correspondence of its vertices to the anthropometric points of the face is preserved: with a correct fit, for example, a specific vertex of the model will always correspond to the tip of the nose. This is ensured by the fact that the initial dimensions of the flexible model are based on statistical data from hundreds of faces, and by the action of internal elastic forces that depend on the distances between anthropometric points. In particular, the statistical constraint does not allow the nose to become unrealistically wide, since the elastic force pulls it toward the width averaged over the population.
Thus, among modern methods of 3D biometric human identification, global methods achieve a recognition probability of 90-96%, statistical methods 93-100%, and parametric methods about 88-96%.
Image quality
The main guarantee of success for a 3D recognition system, as for 2D recognition systems, is the quality of the captured image. High-resolution image sensors are required: cameras with a sensor of 1 to 5 megapixels, a frame rate of up to 200 frames/s, a dynamic range of up to 70 dB and a signal-to-noise ratio of about 60 dB.
For effective recognition, it is necessary that the image transmitted to the recognition server has the highest quality. Compression of the transmitted image at the first stage is unacceptable, since it worsens the image quality, and as a result, the accuracy of the reconstruction worsens.
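To give a feel for the sensor figures quoted above, the following sketch converts decibel values into plain ratios and estimates the uncompressed data rate of one camera. It assumes the 20·log10 amplitude convention for decibels and 8 bits per pixel; both are assumptions of this example rather than stated specifications, and the function names are hypothetical.

```python
def db_to_amplitude_ratio(db):
    """Convert decibels to a signal ratio, assuming the 20*log10 convention."""
    return 10 ** (db / 20.0)

def raw_data_rate_bytes(megapixels, fps, bits_per_pixel=8):
    """Uncompressed data rate of one camera in bytes per second."""
    return megapixels * 1e6 * fps * bits_per_pixel / 8

dynamic_range = db_to_amplitude_ratio(70)   # brightest to darkest usable signal
signal_to_noise = db_to_amplitude_ratio(60)
peak_rate = raw_data_rate_bytes(5, 200)     # 5 megapixels at 200 frames/s
```

Under these assumptions a single 5-megapixel camera at 200 frames/s produces on the order of a gigabyte of raw data per second, which shows why transmitting uncompressed images places high demands on the link to the recognition server.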
For the same reason, it is extremely important to use lenses with high optical resolution (about 100 line pairs per mm) and low aberrations (geometric and chromatic aberrations, distortion).
Synchronization of images
Today, there are three main classes of systems that allow obtaining 3D models of real-world objects:
laser scanners;
systems with structured illumination;
systems based on stereo cameras.
Only the last two are suitable for reconstructing dynamically changing objects. Moreover, stereo cameras are subject to a strict requirement: the error in synchronizing the cameras must be at least 100 times less than the characteristic time of the object change.
The cameras are connected by a special cable that carries sync pulses. As a result, all cameras looking at one control zone capture their frames synchronously. One camera acts as the master and the others as slaves. What matters here is a high degree of synchronization between shots and a guarantee that a moving object does not shift by more than the width of one pixel between them. Only then will no additional pixel or "blur" appear in the image, which is very important for subsequent image processing.
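The two timing requirements above reduce to simple arithmetic, sketched below. The walking speed, the size of one pixel on the object and the characteristic time are hypothetical example values, as are the function names.

```python
def max_sync_error_s(characteristic_time_s, safety_factor=100.0):
    """Largest allowed sync error: the characteristic time divided by the factor."""
    return characteristic_time_s / safety_factor

def motion_in_pixels(speed_mm_s, mm_per_pixel, interval_s):
    """How many pixels the object shifts in the image during the given interval."""
    return speed_mm_s * interval_s / mm_per_pixel

# Assumed scenario: a person walking at 1.5 m/s, one pixel covering 0.5 mm of the face,
# and a characteristic time of object change of 10 ms.
sync_budget = max_sync_error_s(0.01)
shift = motion_in_pixels(1500.0, 0.5, sync_budget)
```

With these example numbers the allowed sync error is 0.1 ms and the face shifts by well under one pixel during that time, so the one-pixel condition holds; a faster object or a finer pixel footprint would tighten the budget proportionally.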
Reconstruction of 3D models
The high frame rate used in camera systems provides a new unique opportunity: in conditions of non-cooperative behavior of objects, it is possible to reliably detect them, track them interframe, and obtain a continuous sequence of stereo images of these objects. From such a sequence, images with the best image quality are selected, which are most suitable for reconstructing 3D models of objects. The presence of several stereo images of an object from different shooting angles allows for increasing the accuracy of reconstruction.
The new system makes it possible to implement effective methods and algorithms for compensating for lighting inhomogeneity using a series of stereo images in which the object is captured from different angles.
As a result, the obtained images are well suited for 3D reconstruction of objects. Figure 3 shows the result of 3D reconstruction of an ideal sphere using the specified algorithm as a reference example. The resolution of stereo images is 2048 x 1536 pixels, the radius of the sphere is 80 mm. Subpixel resolution of ¼ pixel on stereo images is used. As a result, the standard deviation of 3D coordinates of reconstructed points from an ideal sphere was 0.12 mm.
Figure 3. Reconstruction of an ideal sphere. (a) – stereo pair of sphere images. (b) 3D reconstruction. Reconstructed points located on the sphere from the side of the cameras are shown in green
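The accuracy figure for the sphere test is a deviation of reconstructed points from an ideal sphere. A minimal sketch of how such a figure can be computed is shown below, assuming the sphere center and radius are known and the deviation is measured along the radius. The synthetic points and the function names are hypothetical and merely stand in for real reconstruction output.

```python
import math

def radial_rms_deviation(points, center, radius):
    """Root-mean-square distance of 3D points from the surface of a sphere."""
    residuals = [math.dist(p, center) - radius for p in points]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def synthetic_sphere_points(radius, offset, count=200):
    """Points spread over a sphere, alternately pushed out and in by `offset`."""
    points = []
    for i in range(count):
        r = radius + (offset if i % 2 == 0 else -offset)
        theta = math.pi * (i + 0.5) / count  # polar angle
        phi = 2.399963 * i                   # golden-angle spacing in azimuth
        points.append((r * math.sin(theta) * math.cos(phi),
                       r * math.sin(theta) * math.sin(phi),
                       r * math.cos(theta)))
    return points

# An 80 mm sphere with an artificial 0.1 mm radial error on every point.
cloud = synthetic_sphere_points(80.0, 0.1)
deviation = radial_rms_deviation(cloud, (0.0, 0.0, 0.0), 80.0)
```

For real data the center and radius would usually be estimated by fitting a sphere to the point cloud first; that step is omitted here.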
Figure 4 shows the result of 3D reconstruction and the mask of a human face constructed from it, which is used for recognition.
Figure 4. Result of constructing a 3D face mask with texture
To improve performance, the CUDA computing platform is used on NVIDIA graphics cards, resulting in a 3D reconstruction speed of 5–10 frames/sec for typical values of the parameters of the 3D human face recognition task.
Big City Safety
Thus, 3D recognition allows monitoring the flow of people, building three-dimensional models of their faces "on the fly" and comparing them with reference values stored in the database. In addition, it is possible to track the movement of individuals around the city who are not yet included in any databases, without identifying them, but with the purpose of analyzing their behavior and identifying suspicious signs in this behavior.
The technologies listed above are already used in various systems.
Figure 5. Result of constructing an elastic model for a 3D surface. (a) – result of matching with 3D model, (b) – constructed elastic model described by set of main coefficients of model
Figure 6. Triangulated face mask used in 3D recognition algorithm
Figure 7. Anaglyph image of a face mask produced by the 3D reconstruction algorithm. To view it in 3D, put on anaglyph glasses with the red filter on the right and the cyan filter on the left