Palm print recognition system using siamese network and transfer learning

This paper proposes a palmprint authentication approach using a one-shot learning technique based on similarity rather than classification, which most other proposals use. The one-shot learning technique uses a siamese network architecture built on top of the pre-trained VGG16 to efficiently reduce the cost and time of training the siamese network. This technique allows user registration with only one palmprint image and then performs authentication through a siamese similarity measure instead of classification techniques. The proposed model achieved high accuracies of 97% and 96.7% on the Tongji datasets, 92.3% and 91.9% on the PolyU-IITD datasets, 90.9% and 88.3% on the CASIA datasets, and 95.5% on the COEP dataset. These performances were measured on a testing dataset of unseen persons, while the siamese network was trained on different persons. The proposed model uses the pre-trained part of VGG16 for feature extraction, then feeds the generated feature vector into a Euclidean distance layer that is trained in conjunction with a sigmoid layer to output the final similarity decision. Compared to other models, the proposed model achieved a high average accuracy of 93.2% and an EER of 0.19 over the four available palm print datasets, showing that it generalizes better than prior proposals. All code is open-source and available online at https://github.com/ProjectsRebository/PalmPrint-recognition-using-Transfer-Learning


Introduction
In the last 40 years, automatic computer-based biometric identification has emerged as a strong technique for recognizing an individual's identity. Biometric characteristics derived from human biological organs, such as the iris, retina, face, and various hand patterns (including the fingerprint, finger knuckle pattern, hand geometry, and palm print), are grouped as physiological characteristics, while others, such as gait, voice, eye blinking (Saied et al., 2020), lip movement, signature, and gesture, are called behavioral characteristics (Wang and Geng, 2009; Kumar and Srinivasan, 2012; Abdelwhab and Viriri, 2018; Saqib and Kazmi, 2018). Token- and/or password-based methods have historically been used for personal authentication.
However, their limitations hinder their use: tokens can be stolen or misplaced (a student ID card), and information or passwords (an ATM PIN or mail ID) can similarly be guessed or forgotten. On the other hand, many biometric systems based on biological features, such as the iris (Liu et al., 2017), retina, face (Moridani et al., 2020), voice, character, fingerprint (Rivaldería et al., 2017), and DNA, have been successfully developed for many commercial applications due to rapid growth in hardware technology in terms of computing speed and high-resolution capture devices (Kumar and Srinivasan, 2012; Liu et al., 2017). Samsung and numerous other smartphone companies have recently released mobile variants with biometric protection systems based on the face and iris. In addition, over the last 4-5 years, fingerprint, face, and iris technologies have also been used in laptops as a security feature. However, there are still a few constraints on biometric systems today, limiting their use and accuracy in civilian and forensic applications. Iris and retina scan systems are costly and extremely sensitive to the slightest body movement. Since the palm print texture is more complex and more resistant to damage and debris, it is tougher than a fingerprint. In addition, a palm print meets the basic requirements (Dian and Dongmei, 2016) for personal authentication as a universal, unique, and permanent biometric pattern, because the palm features, such as palm lines, creases, ridges, minutiae, and delta points, are stable and remain unchanged throughout an individual's life (Zhang et al., 2012; Kong et al., 2009; Zhang et al., 2017).

Literature survey
Palm print has attracted many researchers, and work on CNN-based palm print feature extraction approaches can be divided into three categories, as tabulated in Table 1: 1. Using pre-trained models (on ImageNet), where the network's output is the extracted feature; a classifier, such as an SVM, is then applied. 2. Networks of filters that are optimized using different approaches. 3. Training DNNs from scratch (or using transfer learning) to learn an embedding that minimizes intra-class distance and maximizes inter-class distance. Our proposed research presents a system that can directly produce the similarity of two input palmprint images using a siamese network (Melekhov et al., 2016). We use the pre-trained VGG16 model (Rezende et al., 2018) for feature extraction. Ramachandra et al. (2018) use transfer learning (AlexNet (Krizhevsky et al., 2017)) to match palm prints acquired from infants; the class decision is obtained by a fusion rule that considers both the SVM prediction and the network's softmax prediction.
(Reported EER: CASIA 0.12%, HKPU-MS 0.0%.) Genovese et al. (2019) extended the PCANet strategy to include convolutions in the second layer with fixed-size and variable-size Gabor filters. The defined 'PalmNet' architecture determines the Gabor filters with the highest response, followed by a binarization layer. An alternative architecture, entitled 'PalmNetGaborPCA', is also considered, where the first-layer filters are tuned using the PCA-based tuning method used in PCANet, while the second-layer kernels are configured using the Gabor-based tuning protocol. For classification, a simple KNN classifier is used.
(Reported EER: CASIA 0.72%, IITD 0.52%, REST (Charfi et al., 2016) 4.50%; evaluated on Tongji (Zhang et al., 2017).) Another approach uses a siamese architecture in which two MobileNets output feature vectors that are then fed to an intra-class probability sub-network (0 for inter-class and 1 for intra-class, with 0.5 as the decision threshold). However, it is not clear which loss function was used (most likely a contrastive loss).

Datasets
1. PolyU-IITD includes left- and right-hand images from more than 230 subjects, with at least 5 hand image samples for each hand contributed by every subject. Fig. 1 shows samples of this database. 2. The CASIA Palmprint Image Database includes 5,502 palmprint images taken from 312 subjects, as shown in Fig. 1. For each subject, palmprint images were collected from both the left and right palms. All palmprint images are 8-bit gray-level JPEG files. 3. The COEP database consists of 8 separate palm images per person, containing a total of 1,344 images belonging to 168 individuals. Fig. 1 shows samples of this dataset.
4. Tongji: images of 300 volunteers, including 192 males and 108 females, were collected in two different sessions. In each session, the subject was asked to provide 10 images of each palm; therefore, 40 images were collected from the 2 palms of each subject. In total, the collection contains 12,000 images taken from 600 different palms. Samples of this dataset are shown in Fig. 1.

Transfer learning
Pre-trained models are used in transfer learning as the starting point for computer vision and natural language processing tasks, given the vast computational and time resources needed to develop neural network models for these problems and the enormous skill jumps they provide on related problems. In transfer learning, a base network is first trained on a base dataset and task, and the learned features are then repurposed, or transferred, to a second target network to be trained on a target dataset and task. This process tends to work if the features are general, meaning that they are suitable for both base and target tasks rather than specific to the base task. This method of transfer learning used in deep learning is known as inductive transfer, whereby the scope of possible models (model bias) is narrowed in a useful way by using a model suited to a different but related task. Transfer learning offers three potential advantages: 1. Higher start: the initial skill of the source model (before any fine-tuning) is higher than it would otherwise be. 2. Higher slope: the rate of skill improvement during training is steeper than it would otherwise be. 3. Higher asymptote: the converged skill of the trained model is higher than it would otherwise be.
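The "frozen base plus trainable head" idea described above can be sketched in a few lines of numpy. This is a conceptual illustration only: the base here is a random fixed projection standing in for transferred pre-trained layers, the head is a logistic-regression layer, and all sizes and the synthetic data are illustrative assumptions, not part of the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)

# Frozen "base" network: stands in for pre-trained layers whose weights
# are transferred and never updated (sizes are illustrative).
W_base = rng.standard_normal((20, 8)) * 0.3
base = lambda x: np.maximum(x @ W_base, 0.0)   # frozen feature extractor

# Trainable "head": a logistic-regression layer fit on the target task.
w, b = np.zeros(8), 0.0

# Tiny synthetic target task: two classes separated in input space.
X = np.vstack([rng.normal(+1, 1, (30, 20)), rng.normal(-1, 1, (30, 20))])
y = np.array([1] * 30 + [0] * 30)

F = base(X)                      # features computed once by the frozen base
for _ in range(300):             # gradient descent on the head only
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    w -= 0.5 * (F.T @ (p - y)) / len(y)
    b -= 0.5 * float(np.mean(p - y))

acc = float(np.mean((1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5) == y))
print(acc)   # the cheap-to-train head alone separates the classes
```

Only the small head is optimized, which is why transfer learning is cheap: the expensive base forward pass runs once, and the bulk of the parameters stay fixed.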

Proposed model
We propose the network below, which measures similarity using one-shot learning; thus, we say that the network predicts the score in one shot, which is the most challenging part of our network. As in one-shot classification, we require only one training example per class, which is a great advantage over traditional classification. Our network consists of three main parts, as shown in Fig. 2.

Feature extraction layer: Pre-trained VGG16
For feature extraction, we used the VGG16 pre-trained model; its architecture is shown in Fig. 3. The pre-trained model was previously trained on ImageNet, a dataset of 14 million images, to classify 1,000 object types. The model has a large number of trained parameters, and training such a network is generally time- and resource-consuming (Rivaldería et al., 2017; Jain et al., 2004). The VGG network is distinguished by its simplicity, using only 3×3 convolutional layers stacked on top of each other in increasing depth, with max-pooling layers reducing the volume size. These are followed by two fully connected layers, each with 4,096 nodes, and a softmax classifier (Simonyan and Zisserman, 2015). In our experiment, we remove the last classification layer (i.e., the softmax layer) and take the output from the last feature-vector layer (of size 4,096), which yields the features extracted from the image.
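The "drop the softmax, keep the penultimate layer" operation can be sketched with a toy network. This is not VGG16 (which would typically be loaded from a deep-learning framework such as Keras); the layer sizes and random weights below are stand-ins chosen purely to show how composing all layers except the classifier yields a feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(w, b, act):
    """A fully connected layer as a closure: x -> act(x @ w + b)."""
    return lambda x: act(x @ w + b)

relu = lambda x: np.maximum(x, 0.0)
def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy stand-in for VGG16's head: fc1 -> fc2 (feature layer) -> softmax.
# Real VGG16 uses 4096-unit fc layers and 1000 classes; sizes here are toy.
layers = [
    dense(rng.standard_normal((512, 64)) * 0.05, np.zeros(64), relu),    # "fc1"
    dense(rng.standard_normal((64, 64)) * 0.05, np.zeros(64), relu),     # "fc2": feature layer
    dense(rng.standard_normal((64, 10)) * 0.05, np.zeros(10), softmax),  # classifier (dropped)
]

def forward(x, layer_list):
    for layer in layer_list:
        x = layer(x)
    return x

x = rng.standard_normal((1, 512))      # stand-in for a flattened image
features = forward(x, layers[:-1])     # drop the softmax layer -> feature vector
print(features.shape)                  # (1, 64)
```

In the actual system, the same slicing idea applies: the pre-trained weights are kept frozen and the 4,096-dimensional activation of the last fully connected layer becomes the palm print descriptor.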

Similarity measures layer: Using Euclidian distance
After generating the features of the two images, we measure the similarity between them by taking the difference between the two output vectors using the Euclidean distance (Dokmanic et al., 2015), as shown in Eq. 1.

Decision layer: Using a sigmoid activation function
As shown in Fig. 4, this layer is trained to decide whether the two images belong to the same person or to different persons. It produces a similarity score denoting the probability that the two input images belong to the same person. Using the sigmoid function shown in Eq. 2, the similarity score is squashed between 0 and 1, where 0 denotes no similarity and 1 denotes complete similarity; any number between 0 and 1 is interpreted accordingly. Our model learns a similarity function that takes two images as input and expresses how similar they are (Ramachandran et al., 2017). The sigmoid function is σ(z) = 1 / (1 + e^(−z)), where z is calculated using Eq. 3.
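A minimal sketch of the decision layer follows. The affine form z = w·d + b, along with the parameter values, is an assumption for illustration (the paper defines z in Eq. 3); in the real model, w and b are learned during training of the decision network.

```python
import math

def sigmoid(z):
    """Eq. 2: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned parameters: z = w * d + b, where d is the
# Euclidean distance between the two feature vectors. Larger distance
# should map to lower similarity, hence the negative weight.
w, b = -2.0, 1.0

def similarity_score(d):
    return sigmoid(w * d + b)

print(round(similarity_score(0.0), 3))  # 0.731 -> likely the same person
print(round(similarity_score(3.0), 3))  # 0.007 -> likely different persons
```

Thresholding this score (e.g., at 0.5) yields the final same-person / different-person decision.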

Traditional classification vs one-shot learning
Using traditional classification, the input image is fed into a series of layers, and finally we produce a probability distribution over all classes at the output (typically using a softmax). Assume that we want to develop a palm recognition system for a small organization with only 10 employees, using a traditional classification approach on dataset samples. There are two main issues with building such a worker-identification model: a) First, training such a system needs many different photos of each of the 10 individuals, which may not be feasible when a company has thousands of workers. b) What if a new worker is hired or an existing worker leaves the company? The model has to be re-trained with every change that increases or decreases the number of output classes. Particularly for large organizations, where recruiting and attrition occur almost every week, re-training is not feasible.

Using one-shot-learning
Recently, one-shot learning has found successful applications, including facial recognition and ID checks. Instead of treating the problem as a classification problem, one-shot learning turns it into a similarity problem. It solves both issues of traditional classification, as it does not need many instances of a class; only a few are enough to build a good model. One-shot learning uses an architecture called the "siamese network." It takes two images as input and encodes their features into a set of numbers. The siamese network is trained to measure the distance between the features of the two input images, as shown in Fig. 5. The siamese network is trained so that the feature encodings of the anchor (of person A) and the positive image (of person A) are very close, while that of the negative image (of person B) is very different. The training process uses only a few images per person, while the traditional approach needs thousands of images per person to train the model. A limitation of the siamese network is its sensitivity to variations, which degrades accuracy in face recognition if the person in one of the images is wearing a hat or glasses; this is not the case in palm recognition.
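Once an encoder with the anchor/positive/negative property exists, recognition reduces to nearest-template matching against a gallery holding a single enrolled encoding per person, which is exactly what makes one-image registration possible. The sketch below assumes precomputed feature encodings (the vectors, names, and threshold are illustrative, not from the paper).

```python
import numpy as np

def dist(a, b):
    """Euclidean distance between two feature encodings."""
    return float(np.linalg.norm(a - b))

# One-shot enrollment: one stored encoding per registered person
# (stand-ins for siamese network outputs; values are illustrative).
gallery = {
    "alice": np.array([0.9, 0.1, 0.3]),
    "bob":   np.array([0.1, 0.9, 0.7]),
}

def identify(query, gallery, threshold=0.5):
    """Match a query encoding to the closest enrolled template;
    reject it if even the best match is too far away."""
    name, d = min(((n, dist(query, t)) for n, t in gallery.items()),
                  key=lambda x: x[1])
    return name if d < threshold else None

query = np.array([0.85, 0.15, 0.3])   # new sample of Alice's palm
print(identify(query, gallery))        # alice
```

Adding a new person only requires inserting one encoding into the gallery; no re-training is needed, in contrast to the classification approach above.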

Performance measures
Using the palm for person recognition, we need different measures to evaluate model performance. Most researchers use five measures: FAR (false acceptance rate), FRR (false rejection rate), GAR (genuine acceptance rate), accuracy, and EER (equal error rate). FAR denotes the situation in which an impostor is marked as genuine and allowed to pass, as shown in Eq. 4. FRR denotes the situation in which a genuine user is rejected, as shown in Eq. 5. GAR denotes the situation in which a genuine user is accepted, as shown in Eq. 6. The goal is to minimize both FAR and FRR, which the EER captures.
The EER, calculated using Eq. 7, is the point where the FAR and FRR curves cross; it is the best threshold to pick (Ali et al., 2017) and provides the best accuracy.
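These measures can be computed from similarity scores by sweeping a decision threshold. The sketch below approximates the EER as the crossing point of FAR and FRR; the toy scores are illustrative, not results from the paper.

```python
import numpy as np

def far_frr(genuine, impostor, t):
    """FAR: fraction of impostor scores accepted; FRR: fraction of
    genuine scores rejected (accept when similarity score >= t)."""
    far = float(np.mean(np.asarray(impostor) >= t))
    frr = float(np.mean(np.asarray(genuine) < t))
    return far, frr

def eer(genuine, impostor, thresholds=np.linspace(0, 1, 1001)):
    """Approximate the EER as the rate where |FAR - FRR| is smallest."""
    rates = [far_frr(genuine, impostor, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0

# Toy similarity scores for same-person and different-person pairs.
genuine  = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6]
impostor = [0.2, 0.3, 0.1, 0.4, 0.25, 0.65]
print(eer(genuine, impostor))
```

With these toy scores, one genuine score (0.6) lies below one impostor score (0.65), so both error rates equal 1/6 at the crossing threshold.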

Results and discussion
Our model contains two main parts: the first is VGG16, and the second is the decision network. In our system, we need to train only the decision network; we do not need to train VGG16. We remove its top layer and use it as a feature extractor, then use these features to train the decision network. When training the decision network, we need to generate pairs (anchor: positive and anchor: negative) of palms to train the network on. For example, if a user has 8 images, we create 28 pairs from that same person. Table 2 reports the results of the model on the different datasets. In the training process, we use early stopping on validation data to get the best model and prevent overfitting. Fig. 6 shows the accuracy vs. epoch graph of the Tongji right and left hands; it can be observed that the accuracy was not rugged but steady, with a small difference: for the right hand we achieved 97.9%, and for the left hand 96.9%. For the PolyU-IITD right and left hands, the accuracy was nearly steady, possibly due to image resolution, with a small difference: 92.3% for the right hand and 91.9% for the left hand. For the CASIA right and left hands, the accuracy was steady with a small difference: 90% for the right hand and 88.3% for the left hand; the accuracy is lower than for the previous datasets, possibly because this dataset consists of grayscale images. For COEP, the accuracy was nearly steady and reaches 95.5%. As shown in Fig. 7 and Table 2, as the number of false acceptances (FAR) decreases, the number of false rejections (FRR) increases, and vice versa.
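The pair generation can be sketched as follows: C(8, 2) = 28 positive pairs come from each 8-image user, matching the count above. How negative pairs are sampled across persons is not specified in the paper, so the all-cross-pairs strategy below is an assumption for illustration.

```python
from itertools import combinations

def make_pairs(images_by_person):
    """Build (anchor, positive) pairs within each person and
    (anchor, negative) pairs across persons."""
    positives, negatives = [], []
    for person, imgs in images_by_person.items():
        # All unordered same-person pairs: C(n, 2) for n images.
        positives += list(combinations(imgs, 2))
        for other, other_imgs in images_by_person.items():
            if other > person:  # count each cross-person pair once
                negatives += [(a, b) for a in imgs for b in other_imgs]
    return positives, negatives

# 8 images per person -> C(8, 2) = 28 positive pairs per person.
data = {"p1": [f"p1_{i}" for i in range(8)],
        "p2": [f"p2_{i}" for i in range(8)]}
pos, neg = make_pairs(data)
print(len(pos), len(neg))  # 56 64
```

Each positive pair gets label 1 and each negative pair label 0 when training the decision network.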
The point where the two lines intersect also has a name: the equal error rate (EER). This is where the percentages of false rejections and false acceptances are the same. The lowest EER is for the Tongji left hand, at 0.0264, and the highest is for the PolyU-IITD left hand, at 0.4771.
A plot of the genuine accept rate (GAR) as a function of FAR is known as the ROC curve. Given the classifier scores for each palm pair, we compute the GAR and FAR for different values of the threshold t (each setting of t gives one (GAR, FAR) pair and thus corresponds to one point on the ROC curve), as shown in Fig. 8 for the four datasets. Compared to other models that use a similar approach, our accuracy is higher, as shown in Table 3: the first uses a siamese network with transfer learning (MobileNet) and reaches 89.91% accuracy, and the other trains VGG16 from scratch, which is time- and resource-consuming. Our proposed model also achieved a high average accuracy of 93.2% and an EER of 0.19 over the four available palm print datasets.
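The threshold sweep that produces the ROC points can be sketched directly; the scores and thresholds below are toy values for illustration.

```python
import numpy as np

def roc_points(genuine, impostor, thresholds):
    """One (FAR, GAR) point per threshold t (accept when score >= t)."""
    pts = []
    for t in thresholds:
        far = float(np.mean(np.asarray(impostor) >= t))  # impostors accepted
        gar = float(np.mean(np.asarray(genuine) >= t))   # genuine accepted
        pts.append((far, gar))
    return pts

genuine  = [0.9, 0.8, 0.85, 0.7]   # same-person pair scores (toy)
impostor = [0.2, 0.3, 0.1, 0.75]   # different-person pair scores (toy)
for far, gar in roc_points(genuine, impostor, [0.0, 0.5, 0.78, 1.0]):
    print(far, gar)
```

Sweeping t from 0 to 1 traces the curve from (1, 1) down to (0, 0); a good verifier keeps GAR high while FAR drops.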

Conclusion
Using transfer learning and a siamese network, we proposed a palm print recognition technique that can directly compute the similarity of two input palm prints. Our proposed model consists of three parts: 1. Feature extraction layer, using pre-trained VGG16. 2. Similarity measure layer, using the Euclidean distance. 3. Decision layer, using a sigmoid activation function. Our proposed model does not need many instances of a class; only a few are enough to build a good model. When a new person is to be added, we need only a single image of that person to be stored in the database to recognize them, which is the great benefit of our model. Our proposed model achieved a high average accuracy of 93.2% and an EER of 0.19 over the four available palm print datasets.

Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.