Data mining classification algorithms: An overview

Data mining is the process of analyzing a quantity of data (usually a large amount) to find logical relationships that summarize the data in a new way that is understandable and useful to the owner of the data. This paper examines the various types of classification algorithms in data mining and their applications, and states the strengths and limitations of each type. The weaknesses found in each algorithm show that tasks cannot be performed well when only one type of algorithm is applied. For this reason, it is the view of the writer that further research is needed to explore the potential of combining several of these algorithms to solve machine learning problems.


Introduction
Over the years, the use of technology has evolved from simple ideas to concrete realities. This has opened up opportunities for its application in a wide range of disciplines. Classification algorithms are used today in important sectors such as health and business. Awad and Khanna (2015) outlined the extensive applications of machine learning, which include web searches, stock market predictions, gene sequence analysis, weather forecasting, and drug development. These broad uses of machine learning owe much to advances in classification algorithms. Data mining is a process of searching data for knowledge without prior assumptions about what this knowledge might be. It is also defined as the process of analyzing a quantity of data (usually a large amount) to find logical relationships that summarize the data in a new way that is understandable and useful to the owner of the data.

Definition of classification
Data mining is a technical process that uses algorithms to analyze data from multiple perspectives and extract meaningful patterns that can be used to predict users' future behavior. The market basket analysis system that Amazon.com uses to recommend new products to its customers based on their past purchases is a widely known example of how data mining can be used in marketing. Han et al. (2011) defined classification as a method of finding a model or pattern that describes and distinguishes data classes or concepts. Such a model, according to them, is derived from the analysis of a set of data items whose class labels are known. The model is then used to predict the class of items whose labels are unknown. Neelamegam and Ramaraj (2013) described classification as a machine learning method that is applied to predict the groups to which data instances belong.

Types of classification algorithms in machine learning
According to Samuel (2000), machine learning is the study of how computers are made to learn automatically. Over time, different types of machine learning processes have been developed, and machine learning systems can be grouped into categories according to the expected results of the algorithm. Classification itself is a two-step process. The first step is supervised learning on a training data set whose class labels are predefined. The second step is the evaluation of classification accuracy (Deng et al., 2010).
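The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values, the function names, and the midpoint-threshold "model" are hypothetical stand-ins for a real learner and are not taken from Deng et al. (2010).

```python
from collections import defaultdict

# Hypothetical labeled data: (feature value, class label) pairs.
training_set = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test_set = [(1.2, "low"), (8.5, "high"), (6.0, "low")]


def train_threshold_classifier(examples):
    """Step 1: learn from examples whose class labels are known.

    The model is the midpoint between the two class means. It assumes
    exactly two classes and is a deliberately simple stand-in.
    """
    values_by_label = defaultdict(list)
    for value, label in examples:
        values_by_label[label].append(value)
    means = {label: sum(v) / len(v) for label, v in values_by_label.items()}
    low_label, high_label = sorted(means, key=means.get)
    threshold = (means[low_label] + means[high_label]) / 2
    return threshold, low_label, high_label


def predict(model, value):
    threshold, low_label, high_label = model
    return low_label if value < threshold else high_label


def accuracy(model, examples):
    """Step 2: evaluate classification accuracy on held-out labeled examples."""
    correct = sum(1 for value, label in examples if predict(model, value) == label)
    return correct / len(examples)


model = train_threshold_classifier(training_set)
print(model[0], accuracy(model, test_set))
```

Keeping the test examples separate from the training examples is what makes the accuracy figure an estimate of performance on unseen data.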

Supervised learning
Under supervised learning, the algorithm works with a set of examples whose labels are known. It is mostly applied in situations in which historical data is used to predict likely future events. Supervised learning applications fall into two key groups: classification and regression. Deng et al. (2010) explained that the labels are nominal values in a classification task and numerical values in a regression task. In other words, when a computer is given inputs and the expected outputs, it must work out the relationship, if any, or the general rule that connects the two. Nevala (2017) contended that some practical applications of supervised learning are fraud detection, risk evaluation, customer segmentation, and image, speech, and text recognition.

Decision tree
A decision tree is commonly used for inductive inference. It approximates discrete-valued functions, is robust to noisy data, and is capable of learning disjunctive expressions. Kesavaraj and Sukumaran (2013) contended that it is a model that estimates the value of a target variable from multiple input variables. Patil et al. (2010) mentioned that decision tree construction occurs in two main stages: tree growth, in which the tree is built from a training set, and tree pruning, in which the size of the tree is reduced for easier understanding. Some types of decision tree algorithms are ID3, C4.5, and ASSISTANT. A decision tree classifies instances by sorting them down the tree from the root to some leaf node. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one possible value of the attribute. Classification starts at the root node of the tree, tests the attribute specified by that node, and then moves down the branch that matches the instance's value for the attribute. This process is repeated for the subtree rooted at the new node.
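The root-to-leaf procedure can be sketched as follows. The tree, the attribute names, and the loan scenario are hypothetical and hand-built for illustration; no tree is learned from data here.

```python
# Hypothetical hand-built tree for a loan decision. Each internal node tests
# one attribute, each branch is one possible value of that attribute, and
# each leaf (a plain string) is a class label.
tree = {
    "attribute": "income",
    "branches": {
        "high": "approve",
        "low": {
            "attribute": "has_collateral",
            "branches": {"yes": "approve", "no": "reject"},
        },
    },
}


def classify(node, instance):
    """Walk from the root to a leaf, following at each node the branch that
    matches the instance's value for the attribute tested there."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node


print(classify(tree, {"income": "low", "has_collateral": "no"}))
print(classify(tree, {"income": "high", "has_collateral": "no"}))
```

An instance whose attribute value has no matching branch raises a `KeyError` in this sketch, which is one concrete form of the missing-value problem noted for decision trees.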
ID3, also known as Iterative Dichotomiser 3, is a common algorithm used in machine learning. One of its advantages is that it is easy to understand. This is noted by Quinlan (1986), who posited that its popularity is due to its efficiency and ease of use. On the other hand, it cannot handle missing values, and it performs neither backtracking search nor global optimization.
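ID3 chooses the attribute to test at each node by information gain, the reduction in label entropy obtained by splitting on that attribute (Quinlan, 1986). A minimal sketch of that selection step follows; the toy records and the function names are hypothetical, and the recursive tree construction is omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """The attribute ID3 would test next: the one with the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
print(best_attribute(rows, ["outlook", "windy"], "play"))
```

Because each choice is made greedily and never revisited, the sketch also shows why ID3 has no backtracking and no guarantee of a globally optimal tree.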
C4.5, an extended version of the ID3 algorithm, reduces the weaknesses found in ID3. Bhukya and Ramachandram (2010) agreed that C4.5 goes back through the tree after it has been generated and, in the pruning stage, replaces unwanted branches with leaf nodes. Adhatrao et al. (2013) felt that C4.5 has the advantages of handling missing feature values, providing both pre-pruning and post-pruning, and handling both discrete and continuous features. It also has a good processing time compared with other decision tree algorithms.
Generally, a decision tree can be used in classification problems such as categorizing diseases, diagnosing equipment malfunctions, and assessing loan applications. Quinlan (1993) recognized that ID3, or a decision tree algorithm more generally, could be used to solve some problems in which the final results are carried over to C4.5.
Some problems with decision trees are deciding how deeply the tree should be grown, selecting the best attribute, handling missing values, and managing differing costs. Another flaw is its unsuitability for small data sets.

Bayesian network
Heckerman et al. (1995) claimed that over the past years, the Bayesian network had become a common way of encoding uncertain expert knowledge in expert systems. A Bayesian network is a graphical model of the probability relationships among a set of variables. Phyu (2009) reported that it is a graphical model for reasoning under uncertainty in which the nodes represent variables (discrete or continuous) and the arcs represent direct connections between them. Kotsiantis et al. (2007) established that the structure S of a Bayesian network is a directed acyclic graph (DAG). The nodes in S are in one-to-one correspondence with the features X. The arcs represent causal influences among the features, while the absence of possible arcs in S encodes conditional independencies. Learning a Bayesian network divides into two subtasks: (1) parameter determination and (2) learning the DAG structure of the network.
Yang and Webb (2009) pointed out that one challenge faced by the Bayesian network classifier is that it needs its continuous features to be discretized. Discretization converts continuous attributes into discrete ones, which can introduce classification issues. According to Friedman and Goldszmidt (1996), another problem arises when a continuous attribute is not discretized. They explain that the estimation of the attribute's conditional density then plays an active role in the process. To address conditional density estimation, according to Friedman and Goldszmidt (1996), a Gaussian kernel function with fixed parameters can be used to estimate the attribute density. In their investigation, handling continuous attributes with the Gaussian kernel function gave better classification accuracy than other methods in Bayesian network classifiers.
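As a rough sketch of how a Bayesian network's DAG and its parameters fit together, the example below factorizes a joint probability as the product of each variable's probability given its parents. The three variables, the arcs, and every probability value are hypothetical, and the structure is fixed by hand rather than learned.

```python
from itertools import product

# Hypothetical DAG: each variable maps to the tuple of its parents.
# There is no arc between "rain" and "sprinkler", which encodes that
# the two are independent in this toy model.
parents = {"rain": (), "sprinkler": (), "wet_grass": ("rain", "sprinkler")}

# Hypothetical parameters: P(variable is True | values of its parents).
cpt = {
    "rain": {(): 0.2},
    "sprinkler": {(): 0.1},
    "wet_grass": {
        (True, True): 0.99,
        (True, False): 0.9,
        (False, True): 0.8,
        (False, False): 0.0,
    },
}


def joint_probability(assignment):
    """P(assignment) as the product of P(variable | parents) over all nodes."""
    probability = 1.0
    for variable, variable_parents in parents.items():
        parent_values = tuple(assignment[p] for p in variable_parents)
        p_true = cpt[variable][parent_values]
        probability *= p_true if assignment[variable] else 1.0 - p_true
    return probability


def marginal(variable):
    """P(variable is True), summing the joint over every full assignment."""
    names = list(parents)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[variable]:
            total += joint_probability(assignment)
    return total


print(joint_probability({"rain": True, "sprinkler": False, "wet_grass": True}))
print(marginal("wet_grass"))
```

All variables here are discrete, which is why the tables suffice; a continuous attribute would first need discretization or a density estimate, as discussed above.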

K-nearest neighbor (KNN)
This is a familiar classification technique that is used when there is little or no prior knowledge about the distribution of the data. It is a powerful nonparametric method that avoids the problem of estimating probability densities. The K-nearest neighbor rule classifies x by assigning it the label most frequently represented among the K nearest samples. The technique was developed to perform discriminant analysis in cases where reliable parametric estimates of probability densities are difficult to determine. Fig. 1 shows the KNN model.

Fig. 1: KNN model
In the diagram, we need to determine which class the new example should be grouped with. With K=3, the new example is identified as a Class B member. However, with K=7, the new example belongs to Class A (Fig. 1). Cover and Hart (1967) contended that the technique is parameterized by K, the number of nearest neighbors that must be examined in order to label the sample data point. Wu et al. (2008) divided KNN into two main groups: structure-based KNN and structureless KNN. The former works with the underlying structure of the data, in which fewer mechanisms are associated with the training samples. In the latter, the data are divided into sample data points and training data, and distances are calculated from the sample point to the training points.
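The effect of K can be sketched with a small majority-vote implementation. The coordinates below are hypothetical, chosen only to reproduce the situation described for Fig. 1 (Class B at K=3, Class A at K=7); they are not the figure's actual data, and Euclidean distance is an assumption.

```python
import math
from collections import Counter

# Hypothetical labeled training points: ((x, y), class label).
training = [
    ((0.5, 0.0), "B"), ((0.0, 0.6), "B"), ((2.5, 0.0), "B"), ((5.0, 0.0), "B"),
    ((0.0, -0.7), "A"), ((1.0, 1.0), "A"), ((-1.5, 0.0), "A"),
    ((0.0, -2.0), "A"), ((4.0, 4.0), "A"),
]


def knn_classify(training_points, query, k):
    """Label the query with the class that occurs most often among its
    k nearest training points (Euclidean distance)."""
    neighbors = sorted(training_points, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


new_example = (0.0, 0.0)
print(knn_classify(training, new_example, 3))
print(knn_classify(training, new_example, 7))
```

Every query sorts the whole training set, which makes the space and classification-time costs of plain KNN visible even at this scale.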
One of the advantages of the K-Nearest Neighbor Mean Classifier (k-NNMC), a variant of KNN, is its high accuracy and efficiency. It is also transparent, simple to use, and able to tolerate noisy training data. Viswanath and Sarma (2011) identified two main problems with the use of KNN: its space requirement and its classification time. To resolve these difficulties, procedures such as k-NNMC can be used. It searches for the k nearest neighbors within each class of training patterns and computes the mean of those k neighbors for each class. Its behavior has been shown on several standard data sets. Bhatia (2010) also argued that computational complexity, memory limitations, low run-time performance on a large training set, and sensitivity to irrelevant attributes are some drawbacks of KNN.
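The class-wise mean step can be sketched as below. Two points are assumptions rather than statements from the text above: the final decision rule (assign the query to the class whose local mean is nearest) and the use of Euclidean distance. The data and function name are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical labeled training patterns: ((x, y), class label).
patterns = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
]


def knn_mean_classify(training, query, k):
    """For each class, average the k training patterns nearest to the query,
    then return the class whose local mean is closest (assumed decision rule)."""
    by_class = defaultdict(list)
    for point, label in training:
        by_class[label].append(point)

    best_label, best_distance = None, math.inf
    for label, points in by_class.items():
        nearest = sorted(points, key=lambda p: math.dist(p, query))[:k]
        mean = tuple(sum(coords) / len(nearest) for coords in zip(*nearest))
        distance = math.dist(mean, query)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


print(knn_mean_classify(patterns, (4.8, 5.2), 2))
print(knn_mean_classify(patterns, (1.2, 1.2), 2))
```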

Linear classifiers (support vector machines)
These are similar to classical multilayer perceptron neural networks. In the support vector machine, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. Feature selection refers to the task of choosing the most appropriate representation. A vector is the set of features that describes one row of predictor values. Fig. 2 shows a support vector machine example.

Fig. 2: Support vector machine example
The yellow broken line is the ideal hyperplane, which separates the two classes with the maximal margin. The other two lines separate the classes with only small margins (Fig. 2). Nizar et al. (2008) suggested that the support vector machine (SVM) is the most popular and most suitable procedure for resolving issues in data classification, learning, and estimation. In addition, Wu et al. (2008) accepted that the origin of SVM is the maximal margin classifier, which addresses the simplest classification problem: linearly separable training data with binary classes. The main benefit of SVM is that it can solve difficult classification problems, such as high-dimensional and non-linearly separable ones. However, one drawback of support vector machines is that key parameters must be set well to obtain the expected results.
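The idea of preferring the separating line with the largest margin can be sketched as follows. This does not train an SVM; it only compares a few hand-picked candidate lines by their geometric margin. The points, labels, and candidate lines are hypothetical and are not the data behind Fig. 2.

```python
import math

# Hypothetical linearly separable points with labels +1 and -1.
points = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 4.0), 1)]


def geometric_margin(w, b, data):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.
    The result is positive only if every point is on its correct side."""
    norm = math.hypot(*w)
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x, y in data)


# Hypothetical candidate separating lines, each written as (w, b).
candidates = {
    "2*x1 + 3*x2 = 13.5": ((2.0, 3.0), -13.5),
    "x1 = 3": ((1.0, 0.0), -3.0),
    "x2 = 1.5": ((0.0, 1.0), -1.5),
}

for name, (w, b) in candidates.items():
    print(name, round(geometric_margin(w, b, points), 3))

best = max(candidates, key=lambda name: geometric_margin(*candidates[name], points))
print("largest margin:", best)
```

All three lines separate the two classes, but they differ in how much room they leave around the closest points, which is the quantity a maximal margin classifier optimizes.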

Application of classification algorithms
Currently, classification is used worldwide in different fields, such as data mining, science, industry, and law. Machine learning has become very important and useful for technology. Face recognition, chatbots for online customer service, support for radiologists looking for early signs of cancer, and self-driving cars are a few examples of machine learning usage. According to Smola and Vishwanathan (2008), machine learning helps in the classification and verification of faces. Many security applications, such as access control and face recognition, are other examples of its application. Nevala (2017) clarified that machine learning could also be used in situations where: 1. The relevant rules cannot easily be expressed as straightforward logical rules. 2. Possible outputs are not known before the event. 3. Precision is more essential than interpretability. 4. The data is not sound and has the potential to cause problems for long-established analytic methods.

Problems of classification in machine learning
Classification is beneficial in many fields, but it also has many shortcomings. One is handling missing data. Data may be missing because it was not considered relevant at the time of entry, because it was removed for being inconsistent with other recorded data, because of equipment breakdown, or because a misunderstanding led to a record not being entered. This can be solved by replacing all missing values with a known global constant. Another way to solve this problem is to replace missing values with the feature mean for the given class. Deng et al. (2010) posited that data miners could also ignore the examples that have missing values or manually add a probable value.
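The two replacement strategies can be sketched as follows. The records, the feature and label names, and the constant `-1.0` are hypothetical; `None` is assumed to mark a missing value.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing "income" value.
records = [
    {"income": 30.0, "label": "reject"},
    {"income": None, "label": "reject"},
    {"income": 50.0, "label": "reject"},
    {"income": 90.0, "label": "approve"},
    {"income": None, "label": "approve"},
    {"income": 110.0, "label": "approve"},
]


def fill_with_constant(rows, feature, constant):
    """Replace every missing value of the feature with one global constant."""
    return [
        {**row, feature: constant if row[feature] is None else row[feature]}
        for row in rows
    ]


def fill_with_class_mean(rows, feature, target):
    """Replace a missing value with the mean of the observed values of the
    same feature among rows of the same class. Each class needs at least
    one observed value, otherwise the lookup fails."""
    observed = defaultdict(list)
    for row in rows:
        if row[feature] is not None:
            observed[row[target]].append(row[feature])
    means = {label: sum(values) / len(values) for label, values in observed.items()}
    return [
        {**row, feature: means[row[target]] if row[feature] is None else row[feature]}
        for row in rows
    ]


print(fill_with_constant(records, "income", -1.0)[1])
print(fill_with_class_mean(records, "income", "label")[1])
```

Both functions return new lists and leave the original records untouched, so the two strategies can be compared on the same data.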
Some establishments using machine learning face problems such as biases in the data, the cleanliness of the data, loss of control, and low quality of work. In the case of loss of control, establishments using machine learning cannot control the pace at which results are produced. For businesses, careful evaluation and readiness to use the process are important. Funding the process is also of utmost importance.
Another problem arises in spam filtering. Here, we concentrate on a yes/no answer as to whether an e-mail contains relevant information or not. To address this problem, we need to build a system that is able to learn how to classify new e-mails. Krishna et al. (2017) observed that hackers could tamper with algorithms and could also make a forged input look genuine. They give the example of the introduction of the iPhone X, in which hackers with forged biometric data can pretend to be the rightful owners or users.
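A system that learns a yes/no decision for new e-mails can be sketched with a small word-count classifier. The choice of naive Bayes with add-one smoothing is an assumption made for illustration, since the text above does not name a method, and the training messages are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled e-mails: (text, True if spam).
training_mail = [
    ("win money now", True),
    ("cheap money offer", True),
    ("meeting agenda attached", False),
    ("project meeting tomorrow", False),
]


def train(examples):
    """Count words per class and messages per class."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for text, spam_flag in examples:
        class_counts[spam_flag] += 1
        word_counts[spam_flag].update(text.lower().split())
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocabulary


def is_spam(model, text):
    """Return True if the spam class has the higher naive Bayes score."""
    word_counts, class_counts, vocabulary = model
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Log prior of the class.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing keeps unseen class/word pairs above zero.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return scores[True] > scores[False]


spam_model = train(training_mail)
print(is_spam(spam_model, "cheap money"))
print(is_spam(spam_model, "meeting tomorrow"))
```

A classifier this simple is easy to fool with messages that avoid known spam words, which illustrates the concern about forged inputs looking genuine.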

Conclusion
Classification is a method that sorts data into groups based on certain important characteristics or features. This paper has outlined the meaning of classification algorithms, their applications, their types, their weaknesses or challenges, and solutions to tackle those challenges. Classification is very helpful in both data mining and machine learning. The merits and demerits of each technique in its applications were also explained. By building on the strengths and resolving the weaknesses, the application of these methods can be enhanced. It is important to note that the method chosen is determined by the problem at hand. For this reason, no single method is seen as better than the others; rather, the methods can complement one another.

Conflict of interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.