A novel hybrid support vector machine with decision tree for data classification

Article history: Received 18 March 2017; received in revised form 20 May 2017; accepted 5 July 2017.

The purpose of this paper is to increase the classification accuracy of the support vector machine (SVM) by means of a hybrid model that combines SVM with the ID3 decision tree. The hybrid approach is evaluated with a focus on the impact of ID3 on SVM performance. The evaluation was carried out on the widely used Adult reference dataset from the KEEL dataset repository. The results show that the proposed model achieves a higher classification accuracy (0.9125) than SVM and ID3 alone.


Introduction
Machine learning focuses on how to write a program that is capable of improving its performance through training and learning. Its major objective is to choose a model for the available data that can estimate the features of the data and make inferences from previously observed samples using statistical principles and rules. A machine that is capable of inference may be built through supervised or unsupervised learning (Deniz et al., 2017). Supervised learning is a process that is overseen by a supervisor. Such systems are trained on labeled data in which, for a two-class problem, the positive and negative samples have been identified by the supervisor, and they are then required to recover the labels of unseen data during the testing phase. In unsupervised learning, the labels are unknown and there is no supervisor, so the system has to identify categories or classes among the data on the basis of the features of the available samples (Begum et al., 2015).
Classification can be carried out by various approaches, including Fisher's Linear Discriminant Analysis (FLDA) (Giron-Sierra, 2017), Weighted Voting (Sun et al., 2005), Naive Bayes (NB) (Giron-Sierra, 2017), Neural Networks (Giron-Sierra, 2017), Decision Tree (Giron-Sierra, 2017), Nearest Neighbor (NN) (Giron-Sierra, 2017), SVM, and Boosting. Each approach has both advantages and drawbacks, and its accuracy and precision vary with the attributes of the dataset. The decision tree approach covers a wide range of applications, including pattern identification and pattern classification (Song et al., 2013). The major advantage of the decision tree method lies in identifying solutions (Mehenni, 2015). When the sample space is large, this approach makes data preparation easier and the results more understandable for users without technical knowledge than the other methods do (Mehenni, 2015).
SVM has proved successful in several real-life applications such as handwritten digit recognition, particle identification, face recognition, and bioinformatics (Vijayan et al., 2016). In these cases SVM has yielded higher levels of generalization than other techniques. In the testing phase, however, an SVM model is considerably slower than other methods (Cortes and Vapnik, 1995). The reason is that the computational complexity of the SVM decision function depends on the number of support vectors. Thus, as the number of support vectors increases, SVM spends more time classifying new data points. Approaches such as feature selection and learning algorithms are used to improve SVM performance during the testing phase (Cortes and Vapnik, 1995). The major purpose of reducing the number of support vectors is to shorten the testing time. The current study proposes approximating the SVM decision boundary with a decision tree in order to accelerate SVM in the testing phase. The suggested approach combines SVM and a decision tree (DT) with a focus on achieving fast classification without sacrificing precision or accuracy.
Deepajothi and Selvarajan (2012) classified the Adult data with several data analysis techniques, namely Naive Bayesian, K Star, Random Forest, and Zero R. Preprocessing and size reduction were also applied in order to maximize classification accuracy. Evaluated in WEKA, the Naive Bayesian model reached a precision of 84.37. Mohana et al. (2016) proposed PSO and K-Anonymity models for classifying the Adult data. The K-Anonymity model is usually used for discovering hidden patterns within very large volumes of data. Their results indicate that the K-Anonymity model identifies classes with higher precision than PSO.
The paper is organized as follows. Section 2 discusses two data analysis methods, SVM and decision tree. Section 3 describes the proposed model in more detail. Section 4 explains the evaluation of the model and its results. Finally, Section 5 presents conclusions and potential future work.

Data mining
Data mining is used for analyzing large datasets to find relationships among the data (Prakash et al., 2015). Databases contain many relationships among the stored data that cannot be discovered without data mining techniques. Classification algorithms are among the most important data mining techniques; two of them, SVM and DT, are explained in this section.

Support vector machine
SVM (Cortes and Vapnik, 1995) is a binary classifier that separates two classes by a boundary. Using various transformations, SVM is capable of simultaneously reducing the empirical classification error and increasing class separability. This ability gives SVM superior performance on large datasets and on classes distributed in a multi-dimensional space. Linear separation of data aims at finding a function that determines the hyperplane with the widest margin. When the margin reaches its maximum, the separation between classes is maximized. Assume that $\{X_i, y_i\}$ is a sample set containing two classes $y_i = \pm 1$, where each $X_i$ is described by attributes $x_j$, $j = 1, \dots, n$. Fig. 1 illustrates a separating hyperplane in SVM.
As can be seen from Fig. 1, the line $w \cdot X + b = 0$, with weight vector $w$ and bias $b$, separates the data into the two classes $\pm 1$. This line is called the separating hyperplane. The two lines $w \cdot X + b = +1$ and $w \cdot X + b = -1$ bound the sets $y = +1$ and $y = -1$ respectively. The distance of each data point $X_i$ from the hyperplane is calculated by Eq. 1.

$$d(X_i) = \frac{|w \cdot X_i + b|}{\|w\|} \quad (1)$$

Eq. 2 is used when the data are not linearly separable. In Eq. 2, $y$ and $y_i$ represent the output of the equation and the class of sample $X_i$ respectively. The vector $X = (x_1, x_2, \dots, x_n)$ is an input data point, and $X_i$, $i = 1, 2, \dots, N$, are the support vectors.

$$y = \operatorname{sign}\left(\sum_{i=1}^{N} \alpha_i y_i K(X, X_i) + b\right) \quad (2)$$

Here $\alpha_i$ are the coefficients of the support vectors obtained in training. $K(X, X_i)$ is the kernel function, which computes inner products in a transformed feature space and thereby produces different types of nonlinear machines in the data space. The radial kernel function usually shows better prediction performance. It is defined by Eq. 3, in which $\|X - X_i\|$ is the distance between the input $X$ and the support vector $X_i$, and $\gamma$ is a user-defined parameter for the kernel width.

$$K(X, X_i) = \exp\left(-\gamma \|X - X_i\|^2\right) \quad (3)$$
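The dependence of testing time on the number of support vectors can be illustrated with a minimal sketch. It assumes the standard radial kernel exp(-γ‖X − Xi‖²) and a decision function of the form sign(Σ αi yi K(X, Xi) + b); the function names, coefficients, and toy values below are hypothetical and are not taken from the paper's MATLAB implementation.

```python
import math

def rbf_kernel(x, x_i, gamma):
    """Radial kernel exp(-gamma * ||x - x_i||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_i))
    return math.exp(-gamma * squared_distance)

def svm_decision(x, support_vectors, labels, alphas, bias, gamma):
    """Return (score, predicted_class) for the input x.

    One kernel evaluation is needed per support vector, so the cost of
    classifying a single point grows linearly with their number.
    """
    score = bias
    for x_i, y_i, alpha_i in zip(support_vectors, labels, alphas):
        score += alpha_i * y_i * rbf_kernel(x, x_i, gamma)
    return score, (1 if score >= 0 else -1)

# Toy, hand-picked values (assumption: not fitted to any dataset).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]

score, predicted = svm_decision((1.8, 1.9), support_vectors, labels, alphas,
                                bias=0.0, gamma=0.5)
print(score, predicted)
```

In this sketch a point close to the support vector of class +1 receives a positive score. The loop in `svm_decision` is the part that a decision tree approximation would aim to shorten.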

Decision tree
Classification with a decision tree is one of the well-recognized approaches to this task; it requires neither parameters set in advance nor prior knowledge about the data. It is a supervised method: a tree, called the decision tree, is built from the training data provided and is then able to label test data on the basis of their individual attributes. A decision tree can be used to design rules for an inference system and to label unlabeled data (Nasridinov et al., 2013). The results of the classification are presented as a flow chart with a tree-like structure. In this tree, each node represents a test on a feature value, each branch represents an outcome of the test, and the leaves represent classes. The complexity of a decision tree increases with the number of features, although in certain situations only a few features determine the class of an object while the remaining features have weak or no effect (Nowak et al., 2013).
The decision tree is a nonparametric classification method, and trees fall into two groups according to the type of dependent variable: classification trees for discrete and categorical variables, and regression trees for continuous variables. Tree classification is comparable to methods such as discriminant function analysis and logistic regression. It involves a set of logical conditions arranged as an algorithm with a tree-like structure that is used for classification or prediction (Salleh, 2014). Generally, the data used to build a tree differ from the data used to evaluate it, and the number of errors in identifying the class of the data is a criterion for assessing the suitability of the algorithm. Most decision tree learning algorithms perform a top-down greedy search through the space of possible trees.
ID3 is an algorithm for building decision trees (Quinlan, 1986). In ID3, the disorder associated with each feature is first measured through entropy, and the information gain of every feature is then estimated from the obtained values. Entropy is a mathematical measure of randomness. If the set $S$ includes both positive and negative samples of a dataset, the entropy of $S$ with respect to this Boolean classification is defined by Eq. 4 (Quinlan, 1986).

$$\mathrm{Entropy}(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus} \quad (4)$$

In Eq. 4, $p_{\oplus}$ is the ratio of positive samples to all samples, and $p_{\ominus}$ is the ratio of negative samples to all samples. The choice of the feature placed at the root of the tree depends on the information gain of each feature, which is calculated by Eq. 5 (Quinlan, 1986).

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \quad (5)$$

In Eq. 5, $\mathrm{Values}(A)$ is the set of values of feature $A$, and $S_v$ is the subset of $S$ for which $A$ has the value $v$. This algorithm can only classify data whose features take a limited, discrete range of values, so it cannot be applied to noisy and distorted data (Quinlan, 1986). The C4.5 algorithm is an extended form of ID3 that is also capable of classifying continuous and noisy data. To do this, the values are first sorted, the information gain is then computed for every possible split of the sorted values, and finally the split with the largest gain is chosen as the separator.
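The entropy and information-gain computations described above can be sketched in a few lines. This is an illustrative sketch that assumes the standard definitions from Quinlan (1986); the function names and the four-row toy dataset are hypothetical and are not taken from the Adult data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy of S minus the size-weighted entropy of each subset S_v."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Index of the feature with the largest information gain (root choice)."""
    return max(range(len(rows[0])),
               key=lambda i: information_gain(rows, labels, i))

# Hypothetical toy data: two discrete features per sample, two classes.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["+", "+", "-", "-"]

print(entropy(labels))
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))
print(best_feature(rows, labels))
```

In this toy set the first feature separates the classes perfectly while the second carries no information, so the first feature would be placed at the root.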

The proposed model
When a dataset is large, the complexity of the training phase and the memory required to store the data increase accordingly. Therefore, a model that is able to reduce this complexity is needed. The proposed model combines SVM and ID3 into an efficient integrated procedure for classifying the Adult data from the KEEL dataset repository (sci2s.ugr.es/keel/datasets.php), which contain 48842 samples and 14 features. Fig. 2 shows the flow chart of the suggested model.
Phase 1: Data reading
Phase 2: Preprocessing of data (including normalization and removal of outliers)
Phase 3: Applying the support vector machine to the dataset
Phase 4: Calculating the distance of each sample from the support vectors of each class
Phase 5: The predicted label and the obtained distance, which expresses the membership of the sample in a class, are recorded together with the actual class label of each sample.
Phase 6: The results of Phase 5 are stored in a new dataset.
Phase 7: The training data are classified by the decision tree using the new dataset.
Phase 8: Data testing
Phase 9: Evaluating the outcomes of the proposed model
The proposed method randomly divides all data into training and test sets in a proportion of 70 to 30. The training data are then fed into a standard SVM and the model is estimated. The data are classified again by the SVM using the obtained coefficients, and the estimated class is called the new target. As the next step, the distances between each data point and the support vectors corresponding to its estimated class are computed, and their average value is calculated.
The estimated class and the calculated distance of each training sample form the feature vector, which is fed into the decision tree classifier together with the actual class of the sample so that the classification is recalculated.
The above steps are repeated during the test phase. Each test sample is first classified by the SVM model, and its estimated class is taken as the new target. Next, the distances between the test sample and the support vectors corresponding to the new target are averaged, and the resulting value is taken as the second feature. The two features obtained (estimated class and distance) are fed into the previously trained decision tree classifier in order to verify the class of the test sample.
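The construction of the two features passed from the SVM to the decision tree can be sketched as follows. This is only an illustration under stated assumptions: Euclidean distance is assumed as the distance measure, the SVM prediction is taken as given, and the names `split_70_30`, `mean_distance`, and `hybrid_features` are hypothetical rather than part of the original MATLAB code.

```python
import math
import random

def split_70_30(samples, seed=0):
    """Randomly split samples into 70% training and 30% test portions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean_distance(x, vectors):
    """Average distance from x to a list of vectors."""
    return sum(euclidean(x, v) for v in vectors) / len(vectors)

def hybrid_features(x, predicted_class, support_vectors_by_class):
    """Two-element feature vector for the decision tree:
    (SVM-predicted class, mean distance to that class's support vectors)."""
    vectors = support_vectors_by_class[predicted_class]
    return (predicted_class, mean_distance(x, vectors))

# Hypothetical support vectors grouped by class label.
support_vectors_by_class = {1: [(2.0, 2.0), (3.0, 2.0)], -1: [(0.0, 0.0)]}

train, test = split_70_30(range(10))
print(len(train), len(test))
print(hybrid_features((2.0, 2.0), 1, support_vectors_by_class))
print(hybrid_features((3.0, 4.0), -1, support_vectors_by_class))
```

In the training phase each such feature vector would be paired with the actual class of the sample and used to fit the decision tree; in the test phase the same function would produce the input to the fitted tree.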

Evaluation and results
The proposed model was implemented in MATLAB 2016b, and the evaluation was carried out on the Adult dataset containing 48842 samples and 14 features. Various criteria (Bratko et al., 1999) were used in the evaluation in order to demonstrate the improvement of the suggested model over SVM and ID3.
In the equations defining these criteria, TN (true negatives) denotes the records that actually belong to the negative class and that the classifier has correctly detected as negative. TP (true positives) denotes the records that actually belong to the positive class and that the classifier has correctly detected as positive. FP (false positives) denotes the records that actually belong to the negative class but that the classifier has mistakenly detected as positive. FN (false negatives) denotes the records that actually belong to the positive class but that the classifier has mistakenly detected as negative.
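For illustration, the evaluation criteria named in this section (Precision, Recall, F-Measure, Accuracy, and Error Rate) can be computed from the four counts as sketched below. The sketch assumes the standard definitions of these measures; the function name and the counts in the example are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }

# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```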
Table 1 gives the precision of the classification models on the Adult data with 100 iteration cycles of the program. As can be seen, the precision of the proposed model is higher than that of SVM and ID3. Moreover, the precision of ID3 is higher than that of SVM. Fig. 3 compares the models on the Adult dataset with 100 iteration cycles in terms of Precision, Recall, F-Measure, and Accuracy.
Table 2 summarizes the classification precision of the models on the Adult dataset with 200 iteration cycles. As can be seen from Table 2, the precision of the proposed model is higher than with 100 iteration cycles. This can be attributed to the fact that the suggested model is able to explore more of the search space and to generate more precise rules by making use of ID3. Fig. 4 compares the models on the Adult dataset with 200 iteration cycles in terms of Precision, Recall, F-Measure, and Accuracy. Fig. 5 compares the models on the Adult dataset in terms of the Error Rate criterion over the iterations.

Conclusion and future works
Classification is a data analysis and modeling operation whose major focus is describing significant classes of data. For the purposes of classification, the similarity within each class is estimated on the basis of the differences among predefined data. Classification is a form of supervised learning in which the classes are specified in advance. Its main objective is to evaluate the features of the samples in a dataset and assign them to a set of classes. The current paper suggests a combined model integrating SVM and the ID3 tree in order to classify the Adult data. The ID3 tree was used to increase the classification precision of the SVM model. The results indicate that the suggested model has better detection accuracy than SVM and ID3 because it verifies the classifications acquired from SVM using the ID3 tree.
Furthermore, ID3 provides better accuracy and lower error than SVM.