A comparative study on mixture of Gaussians for object segmentation

Segmentation is a fundamental step for feature extraction in most digital image processing and computer vision applications. The purpose of segmentation is to partition an image into foreground and background. Numerous segmentation algorithms have been proposed over the last four decades for inputs ranging from degraded images and high- and low-contrast images to indoor and outdoor videos with static or dynamic backgrounds. This paper presents an evaluation and comparison of two segmentation techniques for real-time moving objects, one using a static and the other an adaptive number of Gaussians. The techniques are tested on both indoor and outdoor scenes. The comparison is based on qualitative results and computational complexity.


Introduction
Segmentation is one of the fundamental steps in many computer vision applications, where it separates foreground from background. The extracted foreground can be used in a variety of applications related to surveillance, object detection and classification. A number of segmentation techniques have been proposed by different researchers, including histogram methods, thresholding, compression-based methods, split-and-merge methods, region growing methods, speaker's vocal identification, multi-resolution imaging segmentation and clustering methods (Balafar, 2014; Douglas, 2015). Each technique offers advantages and disadvantages in terms of results and efficiency.
When segmentation is done through a mixture of Gaussians, it may use the k-means algorithm to cluster the mixture components into foreground and background. It may also use the expectation maximization algorithm to classify foreground and background pixels, but this is not cost effective when applied to each pixel of an image. This paper presents a comparative study of two segmentation techniques that use a mixture of Gaussians (Atev et al., 2004; Carlos et al., 2008). Both techniques are based on the adaptive multi-modal background subtraction method (Stauffer and Grimson, 1999).
One of the techniques focuses on the Gaussian parameters used in Stauffer and Grimson (1999) in order to improve the results, while the other focuses on the computational cost of Stauffer and Grimson (1999) and presents a method to reduce it. In this paper two aspects of both techniques are compared: segmentation evaluation and computational cost.
This study is organized as follows: section two reviews related work and describes in detail the two segmentation techniques based on a mixture of Gaussians and their implementations, section three compares both techniques in terms of results and computational cost, and section four presents the conclusion drawn from the analysis and experimental results.

Segmentation based on mixture of Gaussians
KaewTraKulPong and Bowden (2002) proposed a technique for image segmentation in which the edges of a color image are found with an isotropic edge detector and a seeded region growing algorithm is then applied. The technique may be used for automatic face detection and content-based multimedia applications. Stauffer and Grimson (1999) presented a segmentation technique based on a mixture of Gaussians. It adapts robustly to illumination changes, repetitive motions of scene elements, tracking through cluttered regions, slow-moving objects, and the introduction or removal of objects from the scene. Atev et al. (2004) proposed a technique based on Stauffer and Grimson (1999) that performs segmentation by continuously monitoring the brightness of pixels over time. This technique has been used in many applications related to pattern recognition (Yogameena et al., 2009; Yogameena et al., 2010). Carlos et al. (2008) introduced an efficient segmentation strategy that reduces the computational cost of the technique of Stauffer and Grimson (1999) by using a dynamic number of Gaussians per pixel. Demirci and Karaguuml (2010) introduced a segmentation technique for medical images that uses the phenomenon of light refraction. It is based on the region growing segmentation method and is computationally more efficient than existing methods because it does not need to know the number of regions in the image. Zezhi and Ellis (2014) used a Gaussian mixture model to model the background of a scene and thereby perform segmentation. They use adaptive rather than static learning to anticipate sharp changes in the scene, including illumination changes. Srikanth et al. (2016) used a Gaussian Mixture Model (GMM) to segment cervical cells, a basic step in diagnosing cervical cancer from smear images.
Yan and Shui (2015) used a GMM with spatial and color information provided by the user as a marker, and formulated the problem as an iterative energy minimization problem. Zeng et al. (2014) presented a three-step methodology for image segmentation: a GMM partitions the image into small groups, the distance between GMM components is calculated through the Kullback-Leibler (KL) divergence, and similar GMM components are finally merged through spectral clustering. Shah and Chauhan (2015) applied Hidden Markov Models (HMM) that used a GMM to segment brain tumors from MRI images. They employed expectation maximization to achieve their objective and provided a comparison with fuzzy c-means based segmentation. Khatoon et al. (2012) used a GMM for object segmentation where the primary task was counting humans in crowd scenes.
In the mixture of Gaussians approach, a mixture of K Gaussian distributions is applied at each pixel location to model luminance changes, rather than explicitly modeling the values of all pixels with one particular type of distribution. To classify a pixel as background or foreground, it is analyzed on the basis of three parameters: mean, variance and weight. Both techniques presented below are based on the adaptive multi-modal background subtraction method of Stauffer and Grimson (1999), which deals robustly with lighting changes, repetitive motions of elements in the scene, tracking through cluttered regions, slow-moving objects, and the introduction or removal of scene objects. Because the color of slow-moving objects has a larger variance than the background, slow-moving objects remain in the background for a long time. Expectation maximization (EM) is the standard method for maximizing the likelihood of the observed data (KaewTraKulPong and Bowden, 2002). Instead of the EM algorithm, a K-means approximation is used, because there is a mixture model for every pixel in the image and applying the exact EM algorithm to each pixel would be expensive.
In both techniques under study, a sequence of n frames, each of dimension W×H, is taken as input. In each frame I, the pixel value (pv) at each position is represented by I(x, y), (0 < x < W and 0 < y < H). The probability of each color value of frame I is represented by Eq. 1:
$$P(pv) = \sum_{j=1}^{K} w_j \, \eta(pv;\ \mu_j, \Sigma_j) \qquad (1)$$
where $w_j$ is the probability (weight) of the j-th mixture, $\mu_j$ is the mean of the j-th mixture, $\Sigma_j$ is the d×d covariance matrix of the j-th mixture, and $\eta$ is the Gaussian density.
The techniques presented by both Atev et al. (2004) and Carlos et al. (2008) are based on the adaptive multi-modal background subtraction method (Stauffer and Grimson, 1999), which adapts itself to deal robustly with illumination changes, tracking through cluttered regions, repetitive motions of scene elements, the introduction or removal of objects from the scene, and slow-moving objects.
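The per-pixel mixture model of Eq. 1 can be illustrated with a minimal sketch. It assumes a single grayscale channel (d = 1), so each covariance matrix reduces to a scalar variance; the function names and the three-component example values are hypothetical and not taken from either technique.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_probability(x, weights, means, variances):
    """Weighted sum of K Gaussian densities for one pixel value x."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical three-component model for a single grayscale pixel.
weights = [0.6, 0.3, 0.1]
means = [100.0, 150.0, 30.0]
variances = [25.0, 100.0, 400.0]

# A value close to the dominant mode is far more probable than an outlier.
p_near_mode = mixture_probability(101.0, weights, means, variances)
p_far = mixture_probability(220.0, weights, means, variances)
print(p_near_mode > p_far)
```

For color frames the same structure would apply with a d-dimensional mean and a d×d covariance matrix per component.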
The static number of Gaussians (Atev et al., 2004) is updated using Eqs. 1-3, while the update for the dynamic number of Gaussians (Carlos et al., 2008) is done through Eqs. 5-7. The values of the parameters used in Eqs. 1-9 are shown in Table 1. Instead of explicitly modeling the values of all pixels with one particular type of distribution, the mixture of Gaussians approach uses a mixture of K Gaussian distributions to model the luminance change at each pixel location. Each pixel is analyzed on its mean, variance and weight to classify it as foreground or background.
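As a rough illustration of how such a per-pixel update works, the sketch below follows the general online update described by Stauffer and Grimson (1999), on which both techniques are based. It is not the exact update of Atev et al. (2004) or Carlos et al. (2008). The 1-D intensity model, the function name `update_pixel_model`, the choice to test components in list order, and the parameter values (`alpha`, `match_sigmas`, `init_var`, `init_weight`) are assumptions made for illustration.

```python
import math

def update_pixel_model(x, weights, means, variances,
                       alpha=0.01, match_sigmas=2.5,
                       init_var=900.0, init_weight=0.05):
    """One online update of a 1-D per-pixel mixture (lists are modified in place).

    Returns the index of the matched component, or None if nothing matched.
    """
    # A value matches a component if it lies within match_sigmas standard deviations.
    matched = None
    for j, (m, v) in enumerate(zip(means, variances)):
        if abs(x - m) <= match_sigmas * math.sqrt(v):
            matched = j
            break

    # Weights: the matched component grows, all others decay.
    for j in range(len(weights)):
        hit = 1.0 if j == matched else 0.0
        weights[j] = (1.0 - alpha) * weights[j] + alpha * hit

    if matched is not None:
        m, v = means[matched], variances[matched]
        density = math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        rho = alpha * density  # second learning rate, scaled by the match likelihood
        means[matched] = (1.0 - rho) * m + rho * x
        variances[matched] = (1.0 - rho) * v + rho * (x - means[matched]) ** 2
    else:
        # No match: replace the least probable component with a wide, low-weight one.
        k = min(range(len(weights)), key=lambda j: weights[j])
        means[k], variances[k], weights[k] = x, init_var, init_weight

    # Keep the weights normalized.
    total = sum(weights)
    for j in range(len(weights)):
        weights[j] /= total
    return matched

# Hypothetical two-component model for one pixel.
w, mu, var = [0.7, 0.3], [100.0, 200.0], [25.0, 25.0]
first = update_pixel_model(102.0, w, mu, var)   # close to the first component
second = update_pixel_model(50.0, w, mu, var)   # matches nothing
```

In a fixed-size model the list length never changes, whereas a dynamic scheme would add or drop components per pixel; that part is deliberately left out of this sketch.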
In the fixed Gaussian case (Atev et al., 2004), once a pixel value matches some j-th mixture, it is classified as foreground or background using Eq. 8. If relation (8) holds, the pixel at (x, y) is declared background; otherwise it is declared foreground. In the adaptive case (Carlos et al., 2008), every pixel at (x, y) in the n-th frame must be evaluated to decide whether it belongs to the foreground or the background. If Eq. 9 holds, the pixel at (x, y) is considered background; otherwise it is assumed to belong to the foreground. Here NOG represents the number of Gaussians.
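To make the idea of classifying a matched component concrete, the sketch below uses the background-selection rule of Stauffer and Grimson (1999): components are ranked by weight divided by standard deviation, and the top-ranked components whose cumulative weight first exceeds a threshold are treated as background. This does not reproduce Eqs. 8 and 9 of the two techniques; the function names and the threshold value of 0.7 are assumptions.

```python
def background_components(weights, variances, threshold=0.7):
    """Indices of the components treated as background.

    Components are ranked by weight / standard deviation; the top-ranked ones
    are kept until their cumulative weight first exceeds the threshold.
    """
    order = sorted(range(len(weights)),
                   key=lambda j: weights[j] / variances[j] ** 0.5,
                   reverse=True)
    selected, cumulative = [], 0.0
    for j in order:
        selected.append(j)
        cumulative += weights[j]
        if cumulative > threshold:
            break
    return selected

def is_background(matched_index, weights, variances, threshold=0.7):
    """A pixel is background only if it matched one of the background components."""
    if matched_index is None:
        return False
    return matched_index in background_components(weights, variances, threshold)

# Hypothetical model: two stable components and one weak, wide component.
weights = [0.6, 0.3, 0.1]
variances = [25.0, 100.0, 400.0]
print(background_components(weights, variances))
```

A pixel that matches the weak third component, or no component at all, would be reported as foreground under this rule.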

Evaluation and comparison
In this section details of experimental setup and experimental results are presented.

Experimental setup
The experimental setup defines the dataset, system platform, parameter selection and evaluation metrics (Eqs. 10 and 11).
The total number of positives is calculated from the ground truths. Ground truths of the input frames are created both manually and by automatic computation. For video frames where the background is available, ground truths are created through background subtraction followed by post-processing. For video frames where the background is not available, ground truths are created manually.
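Assuming the evaluation metrics are the standard precision and recall reported in the results, they can be written in terms of the quantities used in this section. The symbol names below are chosen for illustration: $TP_{det}$ is the number of true positives detected, $P_{det}$ is the total number of positives detected by a technique, and $P_{gt}$ is the total number of positives in the ground truth.

```latex
\mathrm{Precision} = \frac{TP_{det}}{P_{det}}, \qquad
\mathrm{Recall} = \frac{TP_{det}}{P_{gt}}
```

Under these definitions, precision drops when many background pixels are wrongly segmented as foreground, and recall drops when foreground pixels are missed.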

Experimental results
The average number of true positives detected and the total number of positives detected by the techniques of Atev et al. (2004) and Carlos et al. (2008) for each scenario are shown in Table 2.
The overall precision and recall of the techniques of Atev et al. (2004) and Carlos et al. (2008) for each scenario are given in Table 3.
Scenario 1: Segmentation is applied with both techniques to a video sequence in which one person is walking in a corridor. The precision and recall for 27 frames of the video sequence are given in Fig. 1.
The precision of the technique by Atev et al. (2004) is shown with a green line and that of the technique by Carlos et al. (2008) with a red line in Fig. 1 (b). The precision of Atev et al. (2004) is almost the same in every frame but is low, averaging 0.5666 as shown in Table 3. The reason is that the false segmentation rate is high: the number of positives detected is larger than the number of true positives detected. The technique by Carlos et al. (2008) shows both high and low peaks in the precision graph. The high peaks occur in frames where almost all segmented pixels are true positives, especially frame 3, whereas the low peaks are due to a high rate of false segmentation, as in frames 18 and 24. The average precision of the technique by Carlos et al. (2008) is therefore 0.4702 (Table 3), and on average the total number of positives detected is much greater than the number of true positives detected.
Recall of the technique by Atev et al. (2004) is shown with green line and technique by Carlos et al. (2008) is shown with red color in Fig. 1 (c). According to the graph recall of the technique by Atev et al. (2004) Table 3.
Scenario 2: Both techniques of segmentation are applied on a video sequence in which one person is walking in corridor having large shadow. The result of precision and recall for 25 frames of the video sequence is given in Fig. 2.
The average precision of the technique by Atev et al. (2004), which is 0.4033, is lower than that of Carlos et al. (2008). It is low because the number of true positives detected is smaller than the total number of positives detected in almost all frames.
For the technique of Carlos et al. (2008), in contrast, the precision shows either high peaks equal to 1 or low peaks equal to zero. The low peaks occur because the technique segments only background and leaves out the foreground. This behavior appears in frames where no moving pixels were present. The overall precision of the technique by Carlos et al. (2008) is 0.8400.
The recall graph for Carlos et al. (2008) contains most of the drops, which shows that these frames are not correctly segmented. The recall for Atev et al. (2004) is higher than that of Carlos et al. (2008) because the former segments most of the pixels that should be segmented as moving pixels. The overall recall of Atev et al. (2004) is 0.7808 and that of Carlos et al. (2008) is 0.4703, as shown in Table 3.
The precision and recall results show that for scenes in which some frames contain very slow-moving objects, or objects that remain static for a short time, the technique of Atev et al. (2004) outperforms the technique of Carlos et al. (2008) because it keeps a higher number of Gaussians for each pixel than the latter.
Fig. 1: (a) Images from scenario 1 (b) precision of Atev et al. (2004) and Carlos et al. (2008) (c) recall of Atev et al. (2004) and Carlos et al. (2008)
Scenario 3: Both segmentation techniques are applied to a video sequence in which one person is walking in an outdoor scene with flickering leaves in the background. The precision and recall for 18 frames of the video sequence are given in Fig. 3. The precision of the technique by Atev et al. (2004) is 1 for the first four frames and then changes abruptly. The abrupt change is due to pixels segmented as foreground that actually belong to the background. The technique by Carlos et al. (2008) has zero precision for the first frame because it segments the first frames entirely as background with no foreground pixels. The overall precision is 0.4982 for Atev et al. (2004) and 0.6185 for Carlos et al. (2008), as given in Table 3.
The recall of the technique by Atev et al. (2004) is low because it wrongly segments the flickering leaves in the background, whereas the technique of Carlos et al. (2008) copes better with such scenarios, so its recall is high.
The precision and recall results show that for such dynamic background scenes the technique of Carlos et al. (2008) outperforms that of Atev et al. (2004).

Discussion
In the test images shown in Fig. 4, frames 3 and 4 are constant, which means that there is no motion between these frames. The results obtained from the technique by Atev et al. (2004) cope with this condition, whereas the technique by Carlos et al. (2008) does not segment the human when no motion is detected in the frame. Secondly, the human blobs are complete in the second row whereas parts of the human blobs are missing in the third row, so some human detection techniques cannot recognize such incomplete blobs as human. This difference arises because the technique of Atev et al. (2004) uses the same number of Gaussians for each pixel, which keep the status of the pixel from the previous frame; these values are used for the next frame if pixels are detected as static. The technique of Carlos et al. (2008) assigns Gaussians dynamically for each frame, so it loses the previous information about pixels.
In Fig. 5 the test images contain frames of human interaction in which humans are standing and moving backwards quickly due to the action of kicking. The technique of Carlos et al. (2008) evaluates the foreground pixels better, as shown in row 3, than the technique of Atev et al. (2004), as shown in row 2.
The reason is that the technique of Atev et al. (2004) uses a fixed number of Gaussians while the technique of Carlos et al. (2008) uses a dynamic number of Gaussians, so the latter does not remember the previous state of a pixel and does not evaluate a pixel as foreground or background on the basis of its state in the previous image. The technique of Carlos et al. (2008) therefore gives better segmentation results for actions such as fighting and kicking, which produce abrupt changes in each frame of a video sequence.
Fig. 6 shows examples of segmentation on two different images, taken from two different video sequences. For each frame, the original image is shown along with its ground truth and the segmentation results obtained with Atev et al. (2004) and Carlos et al. (2008). The ground truth for Frame A was created manually, while for Frame B it was generated through background subtraction because the background frames of its video sequence were known. The true positives for Frame A, calculated from its ground truth, number 3273. The dynamic Gaussian technique (Carlos et al., 2008) segments 4811 pixels as foreground, of which 3273 are true positives and 1538 are false positives, while the static Gaussian technique (Atev et al., 2004) detects 5479 pixels as foreground, of which 3273 are true positives and 2206 are wrongly detected as positives.

Fig. 5: First row contains the tested frames, second row contains the result of segmentation obtained from Atev et al. (2004) and third row contains the result of segmentation obtained from Carlos et al. (2008)
For the frame B true positives obtained through its ground truth are 24616. The technique (Carlos et al., 2008) detects 37758 pixels as foreground out of which 18290 are true positives while 19468 are false positives. The methodology (Atev et al., 2004) segments 35407 pixels as foreground from which 18671 are truly classified while 16736 are misclassified as foreground.
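The pixel counts reported for the two example frames can be turned into precision and recall with a short worked example. It assumes the standard definitions (precision = true positives / positives detected, recall = true positives / ground-truth positives) and that the 3273 ground-truth positives belong to Frame A; the function name and data layout are illustrative only.

```python
def precision_recall(true_positives, detected_positives, ground_truth_positives):
    """Return (precision, recall) from raw pixel counts."""
    precision = true_positives / detected_positives if detected_positives else 0.0
    recall = true_positives / ground_truth_positives if ground_truth_positives else 0.0
    return precision, recall

# (true positives detected, positives detected, ground-truth positives), from the text.
counts = {
    ("A", "Carlos et al. (2008)"): (3273, 4811, 3273),
    ("A", "Atev et al. (2004)"): (3273, 5479, 3273),
    ("B", "Carlos et al. (2008)"): (18290, 37758, 24616),
    ("B", "Atev et al. (2004)"): (18671, 35407, 24616),
}

results = {key: precision_recall(*values) for key, values in counts.items()}
for (frame, technique), (precision, recall) in results.items():
    print(f"Frame {frame}, {technique}: precision={precision:.2f}, recall={recall:.2f}")
```

With these counts, both techniques reach a recall of 1.0 on Frame A, with precision of about 0.68 for Carlos et al. (2008) and 0.60 for Atev et al. (2004). On Frame B the precision is about 0.48 and 0.53 and the recall about 0.74 and 0.76, respectively.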

Computational cost
The computational cost of both techniques, as can be seen in Table 4, depends on the number of Gaussians used to model each pixel of an image and is directly proportional to that number. The generalized computational costs for the techniques of Atev et al. (2004) and Carlos et al. (2008) are given in Eq. 12 and Eq. 13, respectively:
$$n\,[(w \times h \times 4) + 12] \qquad (12)$$
$$n\,[(w \times h \times c) + c(c - 1)] \qquad (13)$$
where n is the number of images, w and h are the width and height of the corresponding image, and c is the number of Gaussians, while the term c(c − 1) is the cost of sorting the component values for each pixel of an image. Four Gaussians are used for each pixel in Atev et al. (2004), so the best, average and worst cases for modeling a pixel as foreground or background are the same. A variable number of Gaussians is used for each pixel in Carlos et al. (2008); here c varies from one to six, with c = 1 in the best case, c = 2 in the average case and c = 6 in the worst case.
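The two cost expressions can be compared numerically with a small sketch. It assumes Eq. 12 reads n[(w·h·4) + 12] and Eq. 13 reads n[(w·h·c) + c(c − 1)], a reading under which the constant 12 in Eq. 12 equals c(c − 1) for c = 4. The 320×240 frame size and the function names are hypothetical.

```python
def cost_static(n, w, h):
    """Assumed form of Eq. 12: four Gaussians per pixel plus a sorting term of 12."""
    return n * ((w * h * 4) + 12)

def cost_dynamic(n, w, h, c):
    """Assumed form of Eq. 13: c Gaussians per pixel plus a sorting term c(c - 1)."""
    return n * ((w * h * c) + c * (c - 1))

# Hypothetical single 320x240 frame.
n, w, h = 1, 320, 240
print("static (4 Gaussians):", cost_static(n, w, h))
for label, c in (("best", 1), ("average", 2), ("worst", 6)):
    print(f"dynamic {label} case (c={c}):", cost_dynamic(n, w, h, c))
```

Under these assumptions the dynamic scheme costs about a quarter of the static one in its best case and about half in its average case, while its worst case with six Gaussians is more expensive than the fixed four.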

Conclusion
The evaluation and comparison show that the results of the technique proposed by Atev et al. (2004) are better than those of the technique proposed by Carlos et al. (2008), but the computational cost of Carlos et al. (2008) is lower. Each technique is therefore preferable in one respect, either segmentation results or computational cost. Future work could combine both techniques to obtain the advantages of each, giving segmentation with good results and a low computational cost.