Recognition of static gestures using correlation and cross-correlation

Sign language recognition has been an active area of research for around two decades, and numerous sign languages have been studied extensively in order to design reliable recognition systems. Pakistan Sign Language (PSL) is used as a case study here. A comprehensive database of static images depicting the signs for the letters of the Urdu alphabet serves as a reference, and input images are compared against it to perform PSL alphabet recognition. Normalized correlation is used for image registration between the input image and the database images to find the closest match. The purpose of the research is to identify static gestures of any sign language, which can ultimately lead to the identification of words and sentences. The process consists of image acquisition, image preprocessing, correlation, and labeling of the identified symbol. This paper includes experiments on 37 static hand gestures corresponding to PSL alphabets. The training dataset consists of 10 samples of each PSL symbol, captured under different lighting conditions and with different hand sizes and shapes from 5 different signers. The gesture recognition system can identify one-handed static gestures against any complex background with a “minimum-possible constraints” approach. A comparison is also drawn between normalized correlation and normalized cross-correlation. Compared to other techniques, this technique can work with a small dataset. The technique is based on unsupervised learning.


Introduction
In our day-to-day life we deal with huge masses of digital visual information, and the need to analyze this embedded visual information is growing. Image analysis techniques are a major requisite of any image processing system. Gestures provide an attractive, user-friendly alternative to interface devices such as a keyboard, mouse, or joystick in human-computer interaction (HCI). Gesture recognition research aims to build systems that can automatically identify the information passed through human gestures and use it to convey information (i.e., communicative use, as in sign language) or to control devices (i.e., manipulative use, as in controlling robots without any physical contact). Unfortunately, in this country people who have an in-depth knowledge of sign language are very rare, which has led to the social seclusion of the deaf community. There are two kinds of gestures: static gestures and dynamic gestures, which include hand, body, and face movements (Saqib and Kazmi, 2017). A static gesture is a pose that is held fixed within a particular time frame. For dynamic gestures, a series of finger and hand poses is recognized and analyzed. This paper is divided into six sections. The Introduction is followed by Section II, which provides a detailed overview of related work on systems dealing with sign language recognition. Section III focuses on the proposed approach: it outlines the salient features of the algorithm used to distinguish between different gestures and provides a step-by-step procedure for the proposed algorithm. Section IV describes the experimental setup, Section V discusses the results, and Section VI summarizes the major findings of this paper and provides guidelines for future work.

Literature review
Within the past two decades, a comprehensive literature has accumulated on the numerous global and local sign languages as well as on the different methods employed to address sign language detection through image-based models. Jie et al. (2016) performed real-time continuous gesture recognition of sign language using a DataGlove™. Hidden Markov models (HMMs) are used to identify 51 fundamental postures, 6 orientations, and 8 motion primitives, with an average recognition rate of 80.4%. Alvi et al. (2007) presented recognition of PSL gestures by statistical template matching, using a Data Glove developed by 5DT, in their project Boltay Haath. They note that camera-based recognition is affected by environmental changes and does not give good results. The system recognizes one-handed alphabet signs from PSL. Nagi et al. (2008) presented a new technique for human face recognition. The technique removes redundant details from face images and reduces their size using the 2D discrete cosine transform (2D-DCT). It uses skin color and DCT coefficients, and applies an unsupervised learning technique to classify DCT-based feature vectors into groups in order to decide whether the input image matches an image in the dataset.
Kausar et al. (2008) used a fuzzy classifier to recognize the alphabets of Pakistani sign language; however, their technique depends on the use of colored gloves. The colored gloves help to identify each fingertip and joint, and the angles between the fingers are used to identify the alphabet of the sign language. The results were highly accurate: all but 2 of the 37 signs were recognized correctly. Sarkalehl et al. (2009) suggested the use of the discrete wavelet transform (DWT) followed by a multilayer perceptron (MLP) neural network (NN) to classify the selected images. This technique does not use any gloves or visual marking systems. It gives an accuracy of 98.75% when the network is trained using the MATLAB NN Toolbox. Tauseef et al. (2009) developed a scheme for static gestures and achieved an accuracy rate of 97.4%. The approach is affected by the number of regions produced during color segmentation of the image, which may decrease the accuracy rate in some situations. Shahzad et al. (2009) presented an online system for recognizing isolated, hand-sketched Urdu characters drawn on a Tablet PC. Attributes of Urdu characters are analyzed to define a set of features, which are then trained and classified using a weighted linear classifier. Moghaddam et al. (2011) applied the kernel-based feature extraction methods kernel principal component analysis (KPCA) and kernel discriminant analysis (KDA) to Persian sign language (PSL) postures. To compare the impact of the features on the sign recognition rate, classifiers such as minimum distance, support vector machine (SVM), and neural network (NN) are used. Jalilian and Chalechale (2014) made use of feed-forward neural networks and recurrent neural networks in different architectures, namely partially and fully recurrent networks. The system works well for static gesture recognition, with an accuracy of 95%. Parveen et al. (2011) used conditional random fields (CRFs) to classify the beginning and ending of clause boundaries and to detect the type of subordinate clause. A limitation of CRFs is that they are highly dependent on linguistic rules; missing rules may lead to wrongly classified data. Bansal et al. (2011) used the hidden Markov model as an indispensable tool for the recognition of dynamic gestures in real time. They suggested standardizing the axis through the centroid, thus greatly reducing the database size.
Albelwi and Alginahi (2012) presented a vision-based automatic sign language recognition system for Arabic letters using Haar classifiers. Skin detection is done in the HSV color space. Transforming the images into the frequency domain using the Fourier transform reduces the entire sign database to a single vector based on FDs, which eases the matching process and gives up to 90.55% recognition accuracy in real time. Naoum et al. (2012) presented a new Arabic sign language recognition method using the K-nearest neighbor algorithm. The K-nearest neighbor algorithm and feature extraction are the basis of the recognition system, because a hand gesture is treated as a block of curves to be extracted and fitted as closely as possible to a predefined character set in the knowledge base. Sasirekha and Chandra (2012) focused on PDF images, where variations in text style, font, size, orientation, and alignment make the problem an uphill task; their paper suggests two techniques under block-based classification. Pansare et al. (2012) presented the use of filters and cropping, with a feature vector consisting of the centroid and the area of edge, which was compared with the feature vectors of a training dataset of gestures using Euclidean distance in the fourth stage. The least Euclidean distance identifies the best-matching gesture, which is used to display the ASL alphabet and meaningful words using file handling. Abdalla and Hemayed (2013) suggested extracting the hand blob using the YCbCr color space to detect the skin color of the hand. The system classifies the input pattern using a correlation-coefficient matching technique. The experimental results show a gesture recognition rate of 85.67% for 20 different signs performed by 8 different signers. Nachamai (2013) approached the problem in a simple but efficient manner using the basic SIFT algorithm for recognition. The efficacy of the approach is demonstrated by the results obtained on both datasets. Ali (2013) has worked on Urdu sign language.
The system performs text-to-sign and sign-to-text conversion. In the first mode, it takes input text from a text box and shows the corresponding image on the output display. In the second mode, it takes a sign through a webcam and outputs the equivalent text. Singha and Das (2013) made use of eigenvalue-weighted Euclidean distance to classify various sign languages of India. Skin filtering, hand cropping, feature extraction, and classification are the four stages of the process. For 24 signs with 10 samples each, the recognition rate obtained was 97%. Szegedy et al. (2013) performed object detection using DNNs. The approach formulates object detection as a regression problem to object bounding box masks. A multi-scale inference procedure is used to produce high-resolution object detections at low cost with a few network applications. Sykora et al. (2014) suggested two feature extraction methods, SIFT and SURF. They were applied to a set of depth map images of 10 left-hand gestures captured with the Microsoft Kinect camera. A support vector machine was used for image classification. The experimental results are the prediction accuracies of the SVM method on test set images for each descriptor. Sahoo et al. (2014) used two datasets, containing 2600 images of single-handed characters and 2340 gestures of double-handed characters (A-Z). The structural features, local histogram features, and direct pixel values of grayscale images extracted from these gestures are used as input to the recognition system. After feature extraction, a kNN classifier and a neural network classifier are used to classify the gestures. A recognition rate of 95.30% is achieved on the single-handed dataset and 96.37% on the double-handed dataset. Mohandes et al. (2014) combined enhanced skin, motion, and depth features in a particle filter model, so that the performing hand can be well localized and tracked in every frame.
A shape-order context descriptor is then proposed for gesture sequence matching in the temporal-spatial domain. Such a rich descriptor can greatly improve the gesture recognition rate and is invariant to translation and scale of the gesture. Pandey and Jain (2015) used the cross-correlation technique for image registration between the input image and images from the database to find the closest match. The tolerance level ensures a trade-off between computational complexity and accuracy of the match between the sets of images. Tests of the algorithm on images have been around 75% successful, and attempts are being made to achieve more efficient and robust performance. Bhuyan et al. (2014) used gesture spotting to distinguish meaningful gestures from unintentional movements. To avoid the effects of variations in a gesture's motion chain code (MCC), the orientation and length of an ellipse least-squares fitted to the motion-trajectory points and the position of the hand are used. With the proposed features, the recognition rate is almost 96.0% for static gestures and 88.9% for continuous gestures. Ohn-Bar and Trivedi (2014) developed a vision-based system that employs a combined RGB and depth descriptor to classify hand gestures. The feasibility of the system is demonstrated using a challenging RGB hand gesture dataset collected under settings of common illumination variation and occlusion. Fernandez et al. (2015) proposed and investigated a solution in the form of ZACFs and RACFs that completely removes circular correlation effects from CF designs. To address the computational challenges caused by the ZA constraints, they introduced the RACF designs as an approximate solution, as well as proximal gradient descent based algorithms for solving exactly for the various ZACFs. Zhou et al. (2014) built on large datasets like ImageNet and the rise of convolutional neural networks (CNNs) for learning high-level features.
Performance at scene recognition, however, has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive enough for such tasks. They introduce new methods to compare the density and diversity of image datasets and show that Places is as dense as other scene datasets and has more diversity. Shurong et al. (2015) proposed that dynamic sign language can be described by a sequence of key frames and then recognized from these key frames. This method focuses on key frame extraction and minimizes the restrictions on users and the equipment requirements, which makes the interaction between human and computer more natural and enables broader application of sign language recognition. Abdo et al. (2015) introduced Arabic Alphabet and Numbers Sign Language Recognition (ArANSLR). The technique converts the alphabet and number signs of Arabic sign language to text or speech. Experiments on real-world datasets showed that the proposed algorithm is suitable and reliable compared with other competitive algorithms. Tehsin et al. (2013) compared different shape descriptors. Two groups of signs were analyzed using central moments and Hu moments, whereas Fourier descriptors gave very encouraging results for both groups. Raees et al. (2016) suggested a deep pixel-based analysis for the recognition of fingers (from index to small finger), while the thumb position is determined through template matching. The system achieved a satisfactory accuracy of 84.2% when evaluated with signs comprising 180 digits and 240 alphabets. Darwish et al. (2016) presented several results concerning static hand gesture recognition using an algorithm based on the Type-2 Fuzzy HMM (T2FHMM).
The features used as observables in the training and recognition phases are based on singular value decomposition (SVD), which optimally exposes the geometric structure of a matrix. SVD extends eigendecomposition to non-square matrices and is used to reduce multi-attribute hand gesture data to feature vectors. The recognition rate of the proposed system is 100% for uniform hand images and 95.5% for cluttered hand images. Jain (2016) designed and developed an application called Zauber. This Windows-based application recognizes gestures and acts as a remote control for Windows Media Player at a high rate, using Haar-like features for gesture detection. Singha and Laskar (2017) introduced a two-level speed normalization procedure using DTW and Euclidean distance-based techniques. Three features, 'orientation between consecutive points', 'speed', and 'orientation between first and every trajectory points', were used for the speed normalization. An accuracy of 94.78% was achieved using the classifier fusion technique, compared to the baseline CRF (85.07%) and HCRF (89.91%) models. Zhang et al. (2017) used a fully end-to-end approach for visual tracking in videos that learns to predict the bounding box locations of a target object at every frame. They formulated their model as a recurrent convolutional neural network agent that interacts with a video over time; the model can be trained with reinforcement learning (RL) algorithms to learn good tracking policies that pay attention to continuous inter-frame correlation and maximize tracking performance in the long run.

Proposed approach
In an image recognition system, the process starts with image acquisition and ends with recognition of the image. A large number of techniques exist, which are grouped into supervised learning, unsupervised learning, reinforcement learning, and deep learning. Recognition techniques usually use algorithms such as logistic regression, decision trees, SVM, naive Bayes, KNN, K-means, random forests, dimensionality reduction algorithms, and gradient boosting algorithms. This paper suggests the use of correlation in combination with filters and morphological operations.

Difficulties in object recognition under varied circumstances
Any sign recognition system faces the following difficulties:

Lighting
Lighting conditions may differ during the course of the day and from region to region. A change in weather also changes the lighting in an image. Indoor and outdoor images of the same object can have different lighting conditions, and shadows can affect the light in the image. Whatever the lighting may be, the system must be able to recognize the object in any of the images. Fig. 1 shows different problems encountered in a gesture recognition system.

Positioning
The position of the object in the image can change. If template matching is used, the system must handle such images uniformly.

Rotation
The image can be rotated, and the system must be able to handle this. As shown in Fig. 2, the character "ālf" can appear in any of these forms, but the orientation of the letter must not affect the recognition of the character or of any image.

Mirroring
Any good recognition system must also take mirror images into account.

Obstacle
This is the situation in which the object is not fully visible because it is obscured by some other object. An object recognition system must handle this condition and still output the correct result.

Scale
The system should be able to handle any variation in the size of the object.
Any object recognition scheme must handle all of the above problems. An efficient and robust object detection system can be developed by overcoming these difficulties.
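As one illustration of handling a difficulty listed above, mirroring can be addressed by scoring both the input image and its horizontal flip against a template and keeping the better score. The following is a minimal sketch under stated assumptions, not the method used in this paper: it assumes NumPy, grayscale images of equal size, and the Pearson correlation coefficient as the similarity measure. The helper name `mirror_tolerant_score` is hypothetical.

```python
import numpy as np

def mirror_tolerant_score(image, template):
    """Return the better of two similarity scores: the image as given
    and the image flipped left-to-right (hypothetical helper)."""
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)

    def score(a, b):
        # Pearson correlation coefficient of the flattened pixel values.
        return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

    direct = score(image, template)
    mirrored = score(np.fliplr(image), template)
    return max(direct, mirrored)
```

The cost of this choice is one extra comparison per template; the same idea could be applied on the dataset side by storing flipped copies of each reference image.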

Template matching
Template matching is a technique for finding small parts of an image that match a template image. Here, the input image is matched against the images in the dataset. Templates help in identifying characters, numbers, objects, etc. Template matching can be performed on either color or gray-level images, and can be either pixel-to-pixel or feature-based.

Color based
Another approach compares images by color, using histograms. These techniques incorporate color into object detection under the criteria mentioned above and demonstrate the cost advantages of combining color with shape. Encouraging results are obtained for gesture detection even on challenging datasets. Color models respond well to object occlusion and clutter and are robust to noise in the images.
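To make the histogram idea concrete, the sketch below compares two RGB images by histogram intersection. It is an illustrative assumption rather than a procedure taken from this paper: the bin count of 8, the per-channel layout, and the function names `color_histogram` and `histogram_intersection` are all hypothetical choices.

```python
import numpy as np

def color_histogram(image_rgb, bins=8):
    """Concatenate per-channel histograms and normalize them to sum to 1.
    Assumes 8-bit channel values in the range 0..255."""
    image_rgb = np.asarray(image_rgb)
    parts = []
    for channel in range(3):
        counts, _ = np.histogram(image_rgb[..., channel],
                                 bins=bins, range=(0, 256))
        parts.append(counts)
    hist = np.concatenate(parts).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for normalized histograms: 1 means identical
    color distributions, 0 means no overlap."""
    return float(np.minimum(h1, h2).sum())
```

Because a histogram discards pixel positions, this measure is insensitive to where the hand appears in the frame, which is also why it is usually combined with a shape cue.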

Shape based
Newer techniques in object recognition use shape-based features to identify objects. Shape features are often used as a replacement for, or a complement to, local features.

Suggested solution
This paper presents a very simple technique for object recognition based on correlation.
The approach suggests the use of cross-correlation to detect and recognize images. Fig. 3 shows the steps involved in the image recognition system.
Correlation is used to test the relationship between two things. Normalized correlation is usually used in place of simple correlation. The normalized value varies between -1 and 1. A value of 0 means no relation at all, while a positive or negative value measures the way in which the two things are related.

Fig. 3: Process of gesture recognition
When applied to images, correlation gives the similarity between the two. It is essentially the sum of products of the corresponding elements of the two vectors being compared (Eq. 1). Unlike this raw sum, normalized correlation gives a value between -1 and 1.
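The two quantities described above can be sketched as follows. This is a minimal illustration assuming NumPy and the mean-subtracted (Pearson) form of normalization; the paper's own Eq. 1 is not reproduced in this text, so the exact normalization used by the authors is an assumption, and the function names are hypothetical.

```python
import numpy as np

def correlation(a, b):
    """Plain correlation: sum of element-wise products of two
    equally sized images (flattened to vectors)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.dot(a, b))

def normalized_correlation(a, b):
    """Normalized correlation in [-1, 1], assuming the mean-subtracted
    form. Returns 0.0 when either image is constant (zero variance)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)
```

With this normalization, a brightness offset or a positive contrast gain applied to one image leaves the score unchanged, whereas the plain sum of products grows with overall brightness.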

Experimental setup
The experimental setup starts with capturing an image using a laptop camera or mobile camera. After the preprocessing phase, the input image is compared with the images in the dataset. The closest match is taken as the recognized sign for the input image.

Preprocessing
Image preprocessing in any image recognition system comprises the operations that improve the image data by suppressing undesired distortions or enhancing image features important for further processing. The hand region is extracted from the image using image processing techniques such as cropping, background subtraction, and noise removal with a median or Gaussian filter. The image is structured as a flat rectangular shape. The algorithm for the entire process is as follows:
1. Input the image A.
2. Reduce the resolution of A by a factor of 10 to increase the speed of subsequent calculations. Remove noise using a box filter or Gaussian filter.
3. Convert A from RGB to grayscale, giving Agrey.
4. Compare the input image Agrey with the images in the dataset one by one using normalized correlation.
5. Find the highest or lowest value of the normalized correlation r. If |r| > 0.75, the image is declared recognized.
Table 1 shows that for 37 alphabets, 370 images were tested, of which 40 gave an incorrect result. This gives an accuracy of 89.2%. Figs. 4, 5, and 6 give the correlation of "ālf", "ṭ", and "ḍ" with the other Urdu alphabets. "ālf" and "ṭ" are recognized correctly, but "ḍ" is not.
However, the system works well only when there is a single signer; with more than one signer, correlation fails to give high recognition accuracy. There are many environments in which a device needs to serve only one signer, and in those scenarios correlation can be used for image recognition.
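The five steps listed above can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: it assumes NumPy, subsampling as the resolution reduction, a 3x3 box filter, standard luminance weights for the grayscale conversion, and the mean-subtracted form of normalized correlation. The names `preprocess` and `recognize` and the dataset layout (a dictionary from label to a list of preprocessed templates) are hypothetical.

```python
import numpy as np

def preprocess(image_rgb, factor=10, size=3):
    """Steps 2-3: downsample, convert to grayscale, smooth with a box
    filter. The filter is applied after the grayscale conversion for
    brevity."""
    small = np.asarray(image_rgb, dtype=float)[::factor, ::factor]
    gray = small[..., :3] @ np.array([0.299, 0.587, 0.114])
    pad = size // 2
    padded = np.pad(gray, pad, mode="edge")
    out = np.zeros_like(gray, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return out / (size * size)

def normalized_correlation(a, b):
    """Mean-subtracted normalized correlation in [-1, 1] (assumed form)."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

def recognize(image_rgb, dataset, threshold=0.75):
    """Steps 4-5: compare against every template and accept the best
    match only if |r| exceeds the threshold (0.75 in the text)."""
    query = preprocess(image_rgb)
    best_label, best_r = None, 0.0
    for label, templates in dataset.items():
        for template in templates:
            r = normalized_correlation(query, template)
            if abs(r) > abs(best_r):
                best_label, best_r = label, r
    if abs(best_r) > threshold:
        return best_label, best_r
    return None, best_r
```

A usage example would build `dataset` once by running `preprocess` on every reference image, so that the query and the templates share the same size and filtering.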

Comparison with cross-correlation
Normalized cross-correlation (NCC) is used to compare an input image with a bigger image; it is usually used to find one image inside another. It measures the degree to which two signals resemble each other and has commonly been used as a metric to evaluate the degree of similarity (or dissimilarity) between two compared images. It is also affected by nonuniform illumination and shadows. The main advantage of normalized cross-correlation over ordinary cross-correlation is that it is less sensitive to linear changes in the amplitude of illumination in the two compared images. Like simple normalized correlation, it returns a value in the range between -1 and 1. Eq. 6 gives the coefficient of normalized cross-correlation, and Eq. 7 gives the mean value of f(x,y).
The same data was used with cross-correlation, and image recognition for the same signer was more effective than with simple normalized correlation. However, as is evident from the equations above, normalized cross-correlation requires more calculations in terms of divisions, square roots, etc.
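Eqs. 6 and 7 are not reproduced in this text. For orientation only, a widely used form of the normalized cross-correlation coefficient and of the local image mean is given below; the symbols are assumptions: f is the larger image, t is the template of size N_x by N_y, and (u, v) is the offset of the template within f.

```latex
\gamma(u,v) =
\frac{\sum_{x,y}\bigl[f(x,y)-\bar{f}_{u,v}\bigr]\bigl[t(x-u,\,y-v)-\bar{t}\bigr]}
     {\sqrt{\sum_{x,y}\bigl[f(x,y)-\bar{f}_{u,v}\bigr]^{2}\;
            \sum_{x,y}\bigl[t(x-u,\,y-v)-\bar{t}\bigr]^{2}}}

\bar{f}_{u,v} = \frac{1}{N_x N_y}
\sum_{x=u}^{u+N_x-1}\;\sum_{y=v}^{v+N_y-1} f(x,y)
```

Here the sums run over the region of f covered by the template at offset (u, v), and \bar{t} is the mean of the template.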
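The extra cost mentioned above becomes visible in a direct implementation, where every candidate position needs its own mean, square root, and division. The sketch below is a brute-force illustration assuming NumPy and 2-D grayscale arrays; it is not the authors' code, and the names `ncc_map` and `best_match` are hypothetical.

```python
import numpy as np

def ncc_map(image, template):
    """Normalized cross-correlation of a template with every same-sized
    window of a larger image. Entry [v, u] is the score for the window
    whose top-left corner is at row v, column u."""
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    ih, iw = image.shape
    t_zero = template - template.mean()
    t_norm = np.sqrt((t_zero ** 2).sum())
    scores = np.zeros((ih - th + 1, iw - tw + 1))
    for v in range(scores.shape[0]):
        for u in range(scores.shape[1]):
            window = image[v:v + th, u:u + tw]
            w_zero = window - window.mean()          # local mean per window
            denom = np.sqrt((w_zero ** 2).sum()) * t_norm
            if denom > 0:                            # flat windows score 0
                scores[v, u] = (w_zero * t_zero).sum() / denom
    return scores

def best_match(image, template):
    """Return ((row, col), score) of the highest-scoring window."""
    scores = ncc_map(image, template)
    v, u = np.unravel_index(np.argmax(scores), scores.shape)
    return (int(v), int(u)), float(scores[v, u])
```

The double loop makes the cost proportional to the number of windows times the template area; library routines typically reduce this with FFT-based or running-sum methods.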

Results and discussion
In this section, the performance of the real-time static hand gesture recognition system is evaluated. For the 37 hand gestures, 370 samples are stored in the dataset, and the input image is used for the performance evaluation. The suggested solution faces the following problems:
Rotation: To address rotation invariance, the position of the wrist is detected and taken as a reference point. Fig. 8 shows detection of the wrist as the starting point for image recognition.
Scaling: The handling of scaling is left to the next phase of this research. The images to be compared are brought to the same size using the morphological operation of dilation.
Nonuniform illumination: The solution to nonuniform illumination is removal of the background; the image obtained after subtracting the background is free from the problem of nonuniform illumination. Fig. 9 shows how an image changes once its background is removed.
Shadow: Shadow is another problem for image recognition systems. For the time being, it is ensured that images, especially those in the dataset, are free of shadow. Moreover, a skin detection technique is used while segmenting the image, which reduces the effect of shadow in the image matching process.
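Background removal of this kind can be sketched as a simple image difference against a reference background frame. This is an illustrative assumption, not the procedure used by the authors: it assumes NumPy, a background image captured from the same viewpoint, and a difference threshold of 30 gray levels, which is an arbitrary value chosen for the example. The function name `remove_background` is hypothetical.

```python
import numpy as np

def remove_background(image, background, threshold=30):
    """Keep pixels that differ from the reference background by more
    than `threshold`; set all other pixels to 0. Works for grayscale
    (H, W) and color (H, W, C) arrays of the same shape."""
    image = np.asarray(image, dtype=float)
    background = np.asarray(background, dtype=float)
    diff = np.abs(image - background)
    if diff.ndim == 3:
        diff = diff.max(axis=2)        # largest change over the channels
    mask = diff > threshold
    if image.ndim == 3:
        foreground = np.where(mask[..., None], image, 0.0)
    else:
        foreground = np.where(mask, image, 0.0)
    return foreground, mask
```

The returned mask can also be used to crop the hand region before the correlation step.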

Conclusion and future work
This paper is an attempt to solve the problem of image recognition in an unsupervised setting. The methodology is based on normalized correlation and normalized cross-correlation. The suggested solution suits situations in which there is no more than one signer, e.g., on a mobile device. The limitation of the approach is that extra measures must be taken to make it scale invariant, rotation invariant, and background independent.
Future work involves improved solutions for shadow removal and scale invariance. Combining correlation with the entropy of the image is another direction that may improve efficiency.

[Figure: Correlation of ḍ with other alphabets]

Fig. 9: Background removal (panels: background image, original image, background removed)