Improved minimum-minimum roughness algorithm for clustering categorical data

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with fewer values as the splitting category and the leaf node with more objects for further splitting, leading to undesirable clustering results. To overcome these shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


Introduction
Clustering is a fundamental technique in data mining and machine learning. It is the task of finding groups of objects such that objects in the same group have high similarity, while those in different groups have low similarity (Han and Kamber, 2006). Clustering has been widely deployed in several fields such as data mining, machine learning, pattern recognition, bioinformatics, etc. Many clustering techniques have been proposed, and they are generally classified into two types: partitional and hierarchical. Most clustering techniques focus on numeric data sets, where each category describing the objects has a value domain that is a continuous interval of real values, and each data object is treated as a point in a multidimensional metric space with a metric that measures distances between objects, such as the Euclidean metric or the Mahalanobis metric. However, practical applications often encounter data sets described by categorical attributes, whose values are finite and unordered (for example, hair color or nationality) and for which a distance function cannot be defined naturally.
Recently, clustering categorical data has attracted special attention from several researchers in the data mining area, and several clustering algorithms have been proposed (Khandelwal and Sharma, 2015; Cao et al., 2009; Guha et al., 2000; Gibson et al., 2000; Huang, 1998; Kim et al., 2004; Mesakar and Chaudhari, 2012). Although these algorithms make important contributions to the problem of clustering categorical data, they still fail to handle uncertainty in the clustering process. Handling uncertainty during clustering is an important issue because in many practical applications there are often no clear boundaries between clusters. To handle the uncertainty in the clustering of categorical data, Huang (1998) and Kim et al. (2004) proposed two algorithms that apply fuzzy set theory. However, these algorithms require many runs to establish a stable value for the parameter used to control the degree of fuzzy membership. A popular approach to dealing with uncertainty is to use Rough Set Theory (RST), proposed by Pawlak (1991). RST is an effective tool for machine learning and data mining from information systems with categorical values (Pawlak, 1991). It has been successfully applied in many fields (Zhang et al., 2016; Bello and Falcon, 2017) because it can deal effectively with data while requiring neither thresholds nor domain-specific expertise (Jensen and Shen, 2008).
Recently, some authors have proposed a new approach to the problem of clustering categorical data using RST and a divisive technique (Hassanein and Elmelegy, 2014; Herawan et al., 2010; Jyoti, 2013; Mazlack et al., 2000; Parmar et al., 2007). Its key principle is to choose the best category from many candidate categories to gradually divide the objects into clusters at each run. Specifically, Mazlack et al. (2000) proposed an algorithm that uses the Total Roughness (TR) index in RST to determine the clustering quality of the selected category, i.e., a larger TR is better. Herawan et al. (2010) proposed another technique called Maximum Dependency Attributes (MDA), which uses the dependency between categories in RST. Qin et al. (2012) argued that TR and MDA values are both determined mainly by the number of elements in the lower approximation of a category for other categories; consequently, they often choose the same category as the clustering category in most cases.
One of the most successful and pioneering clustering algorithms based on RST is the Minimum-Minimum Roughness algorithm (MMR) proposed by Parmar et al. (2007). MMR, a top-down hierarchical clustering algorithm, uses Min-Roughness as the criterion to determine the clustering category at each iteration step. MMR allows handling uncertainty during the clustering of categorical data. However, MMR tends to choose the category with fewer values (Qin et al., 2012), i.e., if a category has only a single value, it is selected as a clustering category, resulting in the termination of clustering. Moreover, MMR chooses a leaf node that has more objects to split further, thus producing undesirable clustering results.
In this paper, we propose an improved MMR algorithm called IMMR (Improved Minimum-Minimum Roughness) to overcome the mentioned shortcomings. Besides retaining the advantages of MMR, our proposed IMMR algorithm not only ignores all single-valued categories but also determines the next node to split by considering the sum of the entropies of all categories on each node. Experimental results on actual data sets taken from the UCI repository show that the IMMR algorithm can be used successfully in clustering analysis of categorical data, with better clustering results.

Related concepts
A categorical data set can be represented as a table, where each row represents an object, case, or event, and each column represents a category, property, or a scale to be measured on each object. In RST, such a data table is called an information system. Formally, an information system is defined as follows.
Definition 1: An information system is a quadruple S = (U, A, V, f), where U is a non-empty finite set of objects, A is a non-empty finite set of categories, V = ⋃a∈A Va, where Va is the set of all values of category a, and f: U × A → V is a function, called the information function, that assigns a value f(x, a) ∈ Va to every (x, a) ∈ U × A.
Definition 2: Let S = (U, A, V, f) be an information system, B ⊆ A. Two elements x, y ∈ U are said to be indiscernible by B if and only if f(x, a) = f(y, a) for every a ∈ B.
We denote the indiscernibility relation induced by the set of categories B by IND(B). Obviously, IND(B) is an equivalence relation, and it induces a unique partition (clustering) of U. The partition of U induced by IND(B) in S = (U, A, V, f) is denoted by U/B, and the equivalence class in this partition containing x ∈ U is denoted by [x]B.
Definition 3: Let S = (U, A, V, f) be an information system, B ⊆ A, and X ⊆ U. The B-lower approximation of X, denoted by B̲(X), and the B-upper approximation of X, denoted by B̄(X), are respectively defined by:
B̲(X) = {x ∈ U | [x]B ⊆ X}, B̄(X) = {x ∈ U | [x]B ∩ X ≠ ∅} (1)
These definitions state that an object x ∈ B̲(X) certainly belongs to X, whereas an object x ∈ B̄(X) could belong to X. Obviously, B̲(X) ⊆ X ⊆ B̄(X), and X is said to be definable if B̲(X) = B̄(X). Otherwise, X is said to be rough with B-boundary BNB(X) = B̄(X) − B̲(X).
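As a concrete illustration (a sketch, not code from the paper), both approximations can be computed directly from the equivalence classes of IND(B). The column-wise table layout and the function names below are our own assumptions:

```python
from collections import defaultdict

def partition(table, B):
    """Equivalence classes of IND(B): rows grouped by their value
    tuple on the categories in B. `table` maps category -> column list."""
    classes = defaultdict(set)
    n_rows = len(next(iter(table.values())))
    for i in range(n_rows):
        classes[tuple(table[a][i] for a in B)].add(i)
    return list(classes.values())

def approximations(table, B, X):
    """B-lower and B-upper approximations of a set X of row indices."""
    lower, upper = set(), set()
    for cls in partition(table, B):
        if cls <= X:   # class fully contained in X -> certainly in X
            lower |= cls
        if cls & X:    # class intersects X -> possibly in X
            upper |= cls
    return lower, upper
```

For instance, with two categories stored as Python lists, `approximations(table, ["a"], X)` returns the pair (B̲(X), B̄(X)) for B = {a}.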
Definition 4: Let S = (U, A, V, f) be an information system, B ⊆ A, and X ⊆ U. The accuracy of approximation of X with respect to B is defined as:
αB(X) = |B̲(X)| / |B̄(X)| (2)
Throughout the paper, |X| denotes the cardinality of X.
Obviously, 0 ≤ αB(X) ≤ 1. If αB(X) = 1, then B̲(X) = B̄(X), the B-boundary of X is empty, and X is crisp with respect to B. If αB(X) < 1, then B̲(X) ⊂ B̄(X), the B-boundary of X is not empty, and X is rough with respect to B.
Definition 5: Let S = (U, A, V, f) be an information system, B ⊆ A, and X ⊆ U. The roughness of X with respect to B is defined as:
RB(X) = 1 − αB(X) = 1 − |B̲(X)| / |B̄(X)| (3)
Definition 6: Let S = (U, A, V, f) be an information system. For B, D ⊆ A, it is said that D depends on B in a degree k (0 ≤ k ≤ 1), denoted by B ⟹k D, if:
k = γB(D) = Σ X∈U/D |B̲(X)| / |U|
Definition 7: Let S = (U, A, V, f) be an information system, a ∈ A, and U/{a} = {X1, X2, …, Xm}. The entropy of the partition U/{a} is defined as:
H(a) = − Σ i=1..m Pr(Xi) log2 Pr(Xi)
where Pr(Xi) = |Xi| / |U|, and we define 0 log2 0 = 0. Entropy is a measure of the degree of confusion (uncertainty) about the value of a category in an information system S. The smallest possible value of the entropy is 0, which occurs when all the components of the column vector corresponding to the category a in S are the same, i.e., Pr(a = v) = 1 for one value v and Pr(a = u) = 0 for all u ≠ v; in other words, there is no disorder in this column vector. The larger the value of the entropy, the more disordered the column vector associated with a. The maximum possible value of the entropy is log2 |Va|, which is obtained when Pr is uniformly distributed, i.e., Pr(a = v) = 1/|Va| for all v ∈ Va. Entropy depends only on the probabilities and not on the specific values of a.
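Definition 7 amounts to a few lines of code (a sketch; the function name and list-based column representation are illustrative):

```python
import math
from collections import Counter

def category_entropy(column):
    """Shannon entropy of the partition induced by one category:
    H(a) = -sum Pr(Xi) * log2 Pr(Xi), with the convention 0*log2(0) = 0
    (values with zero count simply never appear in the Counter)."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
```

A constant column gives entropy 0, and a column with |Va| equally frequent values gives the maximum log2 |Va|, matching the bounds stated above.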
For this reason, entropy has been used by many authors to determine how good a clustering operation is (Ienco et al., 2012; Jyoti, 2013; Parmar et al., 2007). The smaller the entropy of a cluster, the smaller the disorder within it, i.e., the more uniform the cluster. However, McCaffrey (2013) argued that it is not straightforward to extend the entropy definition from a vector to a cluster and to a clustering result (essentially a set of tables or matrices). To evaluate the quality of a clustering, McCaffrey (2013) used the following definition.
Definition 8: Given the clustered data set in the form of an information system S = (U, A, V, f) and a clustering C = {C1, C2, …, Ck} of the objects contained in U. The entropy of a cluster Ci is determined by the sum of the entropies of all the categories restricted to Ci. The entropy of the clustering C is defined as the weighted sum of the entropies of its clusters, where the weight of each cluster Ci is its probability Pr(Ci) = |Ci| / |U|.
The lower the entropy of the clustering, the higher the clustering quality, in the sense that the similarity of objects in the same cluster is high and the similarity between clusters is low.
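A sketch of Definition 8, assuming categories are stored column-wise as Python lists and clusters are sets of row indices (names are illustrative):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of one list of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def clustering_entropy(table, clusters):
    """Weighted sum of cluster entropies: each cluster's entropy is the
    sum of its per-category entropies, weighted by Pr(Ci) = |Ci| / |U|."""
    n = len(next(iter(table.values())))
    total = 0.0
    for ci in clusters:
        h_ci = sum(entropy([table[a][i] for i in ci]) for a in table)
        total += (len(ci) / n) * h_ci
    return total
```

Splitting a table along its value boundaries yields pure (zero-entropy) clusters, while the trivial one-cluster partition keeps all the disorder, so lower values indicate better clusterings.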

MMR algorithm
MMR is a top-down hierarchical clustering algorithm (Parmar et al., 2007). It is an iterative, non-backtracking process that progressively dichotomizes the original set U of objects with the goal of achieving a better clustering result. The algorithm takes a predetermined number of clusters k as input and terminates when this number of clusters is reached. At each iteration, the two basic tasks that a top-down hierarchical clustering algorithm must perform are: (1) Choose the best category from all candidate categories to partition the node to be split into equivalence classes.
(2) Among the obtained equivalence classes, determine a class that becomes a cluster (leaf node), and merge all remaining classes into a node for bifurcation in the next step.
To perform the above two tasks, the MMR algorithm uses the roughness concept in the RST as presented in the following definitions.
Definition 9 (Mean-Roughness): Given the clustered data set in the form of an information system S = (U, A, V, f) and two categories ai and aj of A, ai ≠ aj. The mean roughness of category ai with respect to category aj, denoted by Rough_aj(ai), is defined as follows (Parmar et al., 2007):
Rough_aj(ai) = (Σ X∈U/{ai} R_aj(X)) / |V_ai|
where the roughness R_aj(X) of each equivalence class X ∈ U/{ai} with respect to aj is determined by formula (3). The smaller the value of Rough_aj(ai), the higher the similarity of the values of aj among the objects in each class generated by ai.
Definition 10 (Min-Mean-Roughness): Given the clustered data set in the form of an information system S = (U, A, V, f) and a category ai ∈ A. The minimum mean roughness of the category ai over all categories aj ∈ A, aj ≠ ai, denoted by MR(ai), is determined by (Parmar et al., 2007):
MR(ai) = min {Rough_aj(ai) | aj ∈ A, aj ≠ ai}
Definition 11 (Min-Min-Mean-Roughness): Given the clustered data set in the form of an information system S = (U, A, V, f). The minimum value of the minimum mean roughness, denoted by MMR, is defined as follows (Bello and Falcon, 2017):
MMR = min {MR(ai) | ai ∈ A}
In each iteration, the MMR algorithm chooses the category a ∈ A with the smallest MR as the partition category, specifically,

a = argmin ai∈A {MR(ai)}
After the partition category a is determined, the node X is then dichotomized as follows:
- Identify the partition of X on a, U/{a} = {X1, …, Xm}.
- For each equivalence class Xi, sum the roughness over all categories aj ∈ A, aj ≠ a: Σ aj≠a R_aj(Xi).
- Take the class with the smallest value as a cluster (leaf) and the union of the remaining classes as the node for bifurcation in the next step.
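The category-selection step can be sketched as follows (a simplified illustration, not the authors' code; the column-wise table layout and names are assumptions): mean roughness is averaged over the value classes of ai, MR takes the minimum over the other categories, and the splitting category minimizes MR.

```python
from collections import defaultdict

def value_classes(column, rows):
    """Equivalence classes of the given rows under one category."""
    groups = defaultdict(set)
    for i in rows:
        groups[column[i]].add(i)
    return list(groups.values())

def mean_roughness(table, rows, ai, aj):
    """Rough_aj(ai): average roughness of ai's value classes w.r.t. aj."""
    classes_j = value_classes(table[aj], rows)
    r = []
    for x in value_classes(table[ai], rows):
        lower = sum(len(c) for c in classes_j if c <= x)
        upper = sum(len(c) for c in classes_j if c & x)
        r.append(1 - lower / upper)   # formula (3) per class
    return sum(r) / len(r)

def splitting_category(table, rows):
    """Category with the minimum Min-Roughness (the MMR criterion)."""
    cats = list(table)
    return min(cats, key=lambda ai: min(
        mean_roughness(table, rows, ai, aj) for aj in cats if aj != ai))
```

On a toy two-category table this picks the category whose value classes are best approximated by the other category's classes.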
Though MMR is considered one of the successful and pioneering RST-based clustering algorithms, it still has certain shortcomings, as mentioned in the previous section: (1) MMR tends to choose the category with fewer values (Qin et al., 2012), and (2) MMR chooses a leaf node that has more objects to split further, thus producing undesirable clustering results. To overcome these limitations, the algorithm can be improved as follows.

Improved algorithm IMMR
To overcome the first limitation, at each step of the iterative process, before performing the computations to determine the best dichotomizing category, we remove all the single-valued categories, i.e., all categories whose partition of the node to be split consists of only a single class. To deal with the second limitation, we determine which node is to be further dichotomized by considering the sum of the entropies of all the categories on each node, as presented in Definition 8. The following example serves as an illustration. Consider the data set given in Table 1, which we need to group into 3 clusters (k = 3). In the first step, both the MMR and IMMR algorithms take the whole set U of objects as the node to be dichotomized and determine the best partition category as the one that gives the smallest MR value (Definition 11); we have: MR(a1) = 1; Rough_a1(a2) = 4/5, …
In the second step, the selected partition category is a1. The final clustering result of MMR can be represented as a tree, as in Fig. 1. Based on the clustering entropy values (Definition 8), we can conclude that the clustering of IMMR is better than that of MMR.
Generally, the proposed IMMR algorithm proceeds as described above: at each iteration it removes the single-valued categories, selects the splitting category by the Min-Min-Roughness criterion, and chooses the next node to split by the total category entropy. Assuming that the given data set has n objects and m categories, k is the assigned number of clusters, and l is the maximum size of the category value domains, then, to group the objects into k clusters, the algorithm needs to perform k−1 iterations. At each iteration, the time to find the partitions of the categories is mn, the time to calculate the mean roughness is m²l, the time to calculate MR and MMR is 2m, and the time to compute the entropies of the categories is m. Therefore, the time complexity of IMMR is polynomial, namely O(knm + km²l).
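The procedure can be sketched compactly as follows. This is an illustration under simplifying assumptions, not the authors' listing: single-valued categories are dropped at each node, the splitting category minimizes Min-Roughness, the next node to split is the one with the largest total category entropy, and, as a simplification, the smaller class is kept as the leaf instead of the paper's summed-roughness rule.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def value_classes(column, rows):
    groups = defaultdict(set)
    for i in rows:
        groups[column[i]].add(i)
    return list(groups.values())

def mean_roughness(table, rows, ai, aj):
    classes_j = value_classes(table[aj], rows)
    r = []
    for x in value_classes(table[ai], rows):
        lower = sum(len(c) for c in classes_j if c <= x)
        upper = sum(len(c) for c in classes_j if c & x)
        r.append(1 - lower / upper)
    return sum(r) / len(r)

def node_entropy(table, rows):
    """Sum of category entropies on a node (used to pick the next split)."""
    return sum(entropy([table[a][i] for i in rows]) for a in table)

def immr(table, k):
    n = len(next(iter(table.values())))
    clusters = [set(range(n))]
    while len(clusters) < k:
        # split the node with the largest total category entropy
        node = max(clusters, key=lambda r: node_entropy(table, r))
        # drop single-valued categories on this node
        cats = [a for a in table if len({table[a][i] for i in node}) > 1]
        if len(cats) < 2:
            break
        a = min(cats, key=lambda ai: min(
            mean_roughness(table, node, ai, aj) for aj in cats if aj != ai))
        parts = value_classes(table[a], node)
        leaf = min(parts, key=len)   # simplification: smallest class as leaf
        clusters.remove(node)
        clusters += [leaf, node - leaf]
    return clusters
```

Each iteration produces one leaf and one remainder node, so k clusters are reached after k−1 splits, matching the complexity argument above.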

Performance evaluation
Evaluation of clustering quality is often a difficult and subjective task (Ienco et al., 2012; Parmar et al., 2007). In this paper, we use the index called Overall purity proposed by Parmar et al. (2007), as it is an external, simple, and easily accepted evaluation criterion (Ienco et al., 2012). It evaluates the quality of a clustering result against the actual data set, in which each object is pre-assigned a particular class label. Using the information about the actual class labels and the cluster labels assigned to the objects by the algorithm, it evaluates how well the clustering results match the initially given classes.
Assume that a data set includes n objects that are actually classified into k classes {c1, c2, …, ck}, and that the algorithm groups them into k clusters {ω1, ω2, …, ωk}. Let ni denote the number of objects that have been grouped into cluster ωi, and let nij denote the number of objects belonging to cluster ωi that carry class label cj in the set of known class labels.
The purity of a cluster ωi is defined as the ratio between the number of objects in ωi carrying the dominant class label and the number of objects in ωi:
purity(ωi) = (1/ni) max j nij
Overall purity is defined as the proportion of properly classified objects among all the objects present in the data set, i.e.,
Overall purity = (1/n) Σ i max j nij
The Overall purity has a range of [0, 1]. The higher the Overall purity, the better the quality of the clustering result. A perfect clustering gives an Overall purity value of 1. The Overall purity increases as the number of clusters increases; in particular, the Overall purity is 1 if each cluster consists of only one object.
To calculate the Overall purity, we first create a confusion matrix, as shown in Table 2, by iterating through each cluster ωi and counting how many of its objects belong to each class cj. Then, from the row of each cluster ωi, we select the maximum value, sum these maxima, and finally divide the total by the number of all objects in the data set.
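This computation amounts to a few lines (a sketch with illustrative names): group objects by cluster label, count the dominant class in each cluster, and divide the total by n.

```python
from collections import Counter, defaultdict

def overall_purity(cluster_labels, class_labels):
    """Fraction of objects carrying their cluster's dominant class label."""
    by_cluster = defaultdict(list)
    for w, c in zip(cluster_labels, class_labels):
        by_cluster[w].append(c)
    # per-cluster maximum of the confusion-matrix row, summed over clusters
    dominant = sum(Counter(members).most_common(1)[0][1]
                   for members in by_cluster.values())
    return dominant / len(class_labels)
```

For example, a clustering that puts one "b"-labeled object among three "a"-labeled objects in the same cluster scores 3/4 = 0.75.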

Computational environment
All necessary experimental calculations were performed on a computer with an Intel Core 2 Quad @ 2.4 GHz, 2 GB RAM, and a 160 GB HDD. The IMMR and MMR algorithms were developed in the R environment with the support of the RoughSets package.

Experimental data sets and calculation results
We tested IMMR on real data sets, including Zoo, Mushroom, and Car Evaluation, taken from the UCI machine learning repository (UCI, 2013), and compared the clustering results obtained against those given by MMR. Information about these data sets and the calculation results is as follows:

Zoo dataset
The Zoo data set contains 101 objects; each object is an animal, described by 18 taxonomic categories. The objects were pre-classified into seven classes (mammals, birds, etc.). Since each animal belongs to one of the seven classes, Parmar et al. (2007) tested MMR with the number of clusters k = 7. The clustering results given by MMR on the Zoo data set are summarized in Table 3. Out of 101 objects, 3+39+1+13+10+6+20 = 92 objects are clustered with the majority class labels, so the Overall purity of the clustering result given by the MMR algorithm is 92/101 = 91%. For the same Zoo data set, the clustering result of our proposed IMMR is shown in Table 4. With IMMR, out of 101 objects, 41+20+8+7+13+4+1 = 94 objects are clustered with the majority class labels, so the Overall purity of the clustering result given by IMMR is 94/101 = 93%. Thus, the Overall purity of IMMR is 2% higher than that of MMR.

Mushrooms dataset
The Mushroom data set contains 8124 objects, where each object contains information about a mushroom. Mushroom has 22 taxonomic categories, each corresponding to a physical characteristic of the fungus. Each object belongs to one of two types of mushrooms: edible (4208 objects) or poisonous (3916 objects). Parmar et al. (2007) tested the MMR algorithm on the Mushroom data set with 20 clusters (k = 20). Their test resulted in an Overall purity of 84%. Table 5 briefly shows the clustering results of our proposed IMMR.
Out of 8124 objects, 7386 objects belong to the majority class labels. Therefore, the Overall purity of the clustering by IMMR is 7386/8124 = 91%, indicating that the Overall purity of IMMR is 7% higher than that of MMR.

Car evaluation dataset
The Car Evaluation data set has 1728 objects. Each object is described by 6 categorical categories and can belong to one of four classes: unacc (1210 objects), acc (384 objects), good (69 objects), and v-good (65 objects). The MMR algorithm results in an Overall purity of 70%, whereas our proposed IMMR results in an Overall purity of 72%, as shown in Table 6.
The experimental results on the above actual data sets show that the IMMR algorithm gives better clustering results than the MMR algorithm.

Conclusion
Most algorithms for clustering categorical data fail to handle the uncertainty in the data sets. To overcome this shortcoming, we propose an improved version of the MMR algorithm that removes all the single-valued categories before clustering and considers the sum of the entropies of all the categories on each node to determine which node needs to be further dichotomized. The experimental results on actual data sets show that our proposed IMMR algorithm gives better clustering results than the MMR algorithm, indicating that IMMR can be used successfully in the clustering of categorical data.

Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.