Review of feature extraction approaches on biomedical text classification

The overwhelming volume of online biomedical literature causes data congestion and makes it difficult to organize documents and to retrieve the required ones from databases, especially the Medline database. One solution to this overload is to apply classification. However, each document must first be represented by a set of terms or feature vectors. Identifying terms or features in biomedical literature is one of the most important and challenging tasks in text classification, because a large number of new features and entities appear in the biomedical domain. In addition, combining sets of features from different terminological resources leads to naming conflicts such as homonymous use of names and terminological ambiguities. Therefore, the purpose of this research is to investigate and evaluate effective ways of extracting relevant and meaningful features in order to increase classification accuracy and improve the performance of web searches. To this end, we conduct several classification experiments to evaluate and compare the effectiveness of feature extraction approaches for extracting relevant and informative features from the biomedical literature. In our experiments, we use two different sets of features: a set extracted using the Genia Tagger tool and a set extracted by medical experts from Pusat Perubatan Universiti Kebangsaan Malaysia (PPUKM). The results show that, when a feature selection method is applied, classification using the features extracted by medical experts outperforms classification using the features extracted by the Genia Tagger tool.

Keywords:
Biomedical literature; Feature extraction; Feature selection; Text classification; Text mining

Introduction
The volume and growth rate of online biomedical literature create new challenges for researchers. MEDLINE (Sampson et al., 2016) is the primary source of medical literature; it consists of over 23 million online entries and grows by over 800 thousand new citations every year. MEDLINE documents are manually categorized under 22,568 MeSH category names by experts from the National Library of Medicine (NLM). The accessibility of such extensive online biomedical collections makes it challenging to organize and retrieve the relevant documents from MEDLINE.
Therefore, text classification could be one of the solutions to overcome these problems.
Text classification is a challenging research topic, driven by the need to organize and categorize the growing number of electronic documents worldwide. Text classification can help users to effectively handle and exploit useful information hidden in large-scale document collections (Wang et al., 2016). In addition, Cohen (2006) mentioned that automated document classification could be a valuable tool for biomedical tasks that involve large amounts of text. Text classification has been successfully applied to various tasks such as topic detection, spam e-mail filtering, SMS spam filtering (El-Alfy and AlHasan, 2016), web page classification (Sabbah et al., 2016; Selamat and Omatu, 2004) and author identification.
A conventional text classification framework consists of preprocessing, feature extraction, feature selection, and classification stages. The preprocessing stage usually comprises tasks such as stemming, stop word removal, tokenization, and lowercase conversion (Uysal and Gunal, 2014). The feature extraction stage generally utilizes the vector space model (Salton et al., 1975), which makes use of the bag-of-words approach (Joachims, 1997). Finally, the feature selection stage typically uses a filter method such as document frequency (Azam and Yao, 2012; Yang and Pedersen, 1997), mutual information (Tang et al., 2019; Al-Angari et al., 2016; Liu et al., 2009), information gain (Mendez et al., 2019; Lee and Lee, 2006), chi-square (Asdaghi and Soleimani, 2019; Chen and Chen, 2011) and Odds Ratio (Raza and Qamar, 2016; Feng et al., 2015).
Many text classification approaches (Saqib et al., 2019; Wang et al., 2016; Thaoroijam, 2014; Aljaber et al., 2011; Fang et al., 2011; Hliaoutakis et al., 2009; Zhang et al., 2008; Li et al., 2007; Chen et al., 2006; Cohen, 2006; Couto et al., 2004; Kamruzzaman et al., 2005) have been proposed to improve classification accuracy and to retrieve the relevant documents from databases. However, the main problem in text classification is managing high-dimensional data (Rizaldy and Santoso, 2017; Chandrashekar and Sahin, 2014; Khalid et al., 2014; Javed et al., 2012; Maji and Paul, 2011; Wei and Billings, 2007). According to Javed et al. (2012), high-dimensional data may contain a large number of redundant and irrelevant words or features that worsen the performance of a learning algorithm. One effective way to reduce the high dimensionality of data is to perform feature selection.
Feature selection has become an essential and challenging task for analyzing and selecting useful knowledge about a given domain. Traditionally, feature selection research has focused on removing as many irrelevant and redundant features as possible (Mirończuk and Protasiewicz, 2018; Maji and Paul, 2011). Recently, some researchers have focused on methods for effectively handling high-dimensional datasets (Vinh et al., 2016; Chandrashekar and Sahin, 2014; Khalid et al., 2014). In addition, Vinh et al. (2016) stated that effective feature selection could improve performance while reducing the computational cost of the learning system, and Dadaneh et al. (2016) mentioned that feature selection is one of the most important fields in pattern recognition, which aims to pick a subset of relevant and informative features from an original feature set. Many other researchers have studied feature selection approaches for handling the high dimensionality problem (Ghareb et al., 2016; Hernández-Pereira et al., 2016; Tutkan et al., 2016; Vinh et al., 2016; Feng et al., 2015; Pinheiro et al., 2015; Chandrashekar and Sahin, 2014; Inbarani et al., 2015; Khalid et al., 2014; Rehman et al., 2015; Javed et al., 2012; Maldonado and Weber, 2009; Wei and Billings, 2007). Although many feature selection approaches have been proposed and employed in various domains, some issues remain, especially in retrieving the relevant documents.
Therefore, in this research, we investigate several feature selection methods that could be employed for classification. Most of the studies that used Odds Ratio reported better results. For example, Raza and Qamar (2016) compared feature selection methods on large datasets such as Gisette, Isolate, Musk-2, UjlindoorLoc, Egg-Eye-style and Internet advertisement for classification purposes. They found that using Odds Ratio as the feature selection method produced higher accuracy than the other feature selection methods. In other research, Ding et al. (2016) proposed a classification method for predicting PH proteins and their distribution in a host cell, and the use of a feature selection method improved their results.
Feng et al. (2015) compared several feature selection methods, such as Information Gain, Chi-square and Odds Ratio, for classifying the MPH-20 and 20 Newsgroups datasets. Their results show that Odds Ratio and Information Gain outperformed Chi-square on the 20 Newsgroups dataset. In similar research, Tutkan et al. (2016) proposed a new feature selection method named Meaning Based Feature Selection (MBFS) and compared its performance with other methods such as Information Gain, Chi-square and Odds Ratio. They found that Odds Ratio outperformed the other feature selection methods.
Banerjee and Biswas (2012) compared the Mantel-Haenszel estimator and the profile maximum likelihood estimator (PMLE) for estimating the common Odds Ratio. Both estimators converge to the true value of the common Odds Ratio, and the result shows that Odds Ratio leads to better performance. In addition, Gregory et al. (2008) used Odds Ratio to calculate cancer incidence from the AIC-minimizing, and their result shows that the value selected using Odds Ratio leads to the highest performance.
Based on these findings, Odds Ratio is used in this research as the feature selection technique for evaluating the effectiveness of biomedical text classification. The research therefore focuses on how to apply Odds Ratio to select the relevant and informative features from two sets of candidate features: those extracted using the Genia Tagger tool and those extracted by medical experts from Pusat Perubatan Universiti Kebangsaan Malaysia (PPUKM). This paper is divided into five sections. Section 2 describes the details of our methodology. Section 3 contains the experiments, and Section 4 discusses the classification results. Finally, Section 5 concludes the paper.

Methodology
This research follows the methodology shown in Fig. 1, which is explained as follows. Abstracts and other texts usually hold a large number of unwanted, noisy, and uninformative parts such as scripts, HTML tags and stop words. Keeping these unwanted parts adds to the complexity of the problem: classification becomes more complex and challenging, since each word in the text is connected to the others. Eliminating the noisy data solves the problem of the data being improperly preprocessed.
Typically, text preprocessing involves several steps such as tokenization, stop word elimination, abbreviation expansion, stemming and finally feature selection (Al-Angari et al., 2016). Uysal and Gunal (2014), however, describe a standard text classification framework as consisting of preprocessing, feature extraction and classification stages. In this research, text preprocessing consists of performing feature extraction, eliminating stop words and general terms, creating a vocabulary, and applying a feature selection method.

Perform feature extraction
The purpose of feature extraction is to produce a list of unique features from the dataset. In this research, we perform feature extraction using the GENIA tagger tool. GENIA tagger analyzes English sentences and outputs the base forms, part-of-speech (POS) tags, phrase chunks and named entity tags. It can detect entity types such as DNA, RNA and protein names. In addition, this tagger is specifically tuned for biomedical text such as the MEDLINE dataset. In this research, the sentences in each abstract are tagged with chunk types such as noun phrase, verb phrase, adjective and conjunction. Fig. 2 shows an example of biomedical text abstracts from the Ohsumed (2005) dataset, and Fig. 3 illustrates an example of the output after the POS tagging and phrase chunking processes, which still contains a few general terms and stop words such as adjectives, verbs and conjunctions.
As a contrasting approach, feature extraction is also performed manually by cardiologists from Pusat Perubatan Universiti Kebangsaan Malaysia (PPUKM). The cardiologists identify and extract all the significant medical terms related to heart disease. For both feature extraction approaches, all stop words and general terms are removed.

Eliminate stop words and general terms
In this phase, all stop words (such as adjectives and conjunctions) and general terms or features are removed from the training and testing documents, and only noun and verb phrases are kept from each abstract. The primary purpose of removing stop words and general terms is to eliminate noisy data.
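
As an illustration of this filtering step, the sketch below keeps only tokens that sit inside noun or verb phrases and drops stop words. It assumes the tagger output is tab-separated with five columns (word, base form, POS tag, chunk tag, named-entity tag) and IOB chunk labels such as B-NP and I-VP. The stop-word list and the function name `filter_tagged_lines` are hypothetical and are not taken from the original experiments.

```python
# Hypothetical, abbreviated stop-word list; the list used in the paper is not given.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "be", "to", "with"}
KEPT_CHUNKS = {"NP", "VP"}  # noun phrases and verb phrases


def filter_tagged_lines(lines):
    """Return lower-cased base forms of tokens that sit in NP/VP chunks."""
    kept = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 5:
            continue  # skip blank sentence separators or malformed rows
        word, base, pos, chunk, entity = parts
        # Chunk tags look like "B-NP" or "I-VP"; "O" means outside any chunk.
        chunk_type = chunk.split("-", 1)[1] if "-" in chunk else chunk
        term = base.lower()
        if chunk_type in KEPT_CHUNKS and term not in STOP_WORDS:
            kept.append(term)
    return kept


# Made-up tagged sentence, assuming the five-column layout described above.
sample = [
    "The\tThe\tDT\tB-NP\tO",
    "patients\tpatient\tNNS\tI-NP\tO",
    "received\treceive\tVBD\tB-VP\tO",
    "aspirin\taspirin\tNN\tB-NP\tO",
    "in\tin\tIN\tB-PP\tO",
    "hospital\thospital\tNN\tB-NP\tO",
]
print(filter_tagged_lines(sample))
```

Working on base forms rather than surface words merges inflected variants ("patients", "patient") into one candidate feature, which keeps the later vocabulary smaller.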

Create a vocabulary list
Creating a vocabulary is the process of gathering all the terms and words from all documents. Through this compilation, we can see the number of times each term is repeated in the dataset. The vocabulary list is required to perform the feature selection process. In our experiments, each document must be represented by a set of feature vectors. For that purpose, we create a list of unique

Perform feature selection
Feature selection is one of the most feasible solutions for reducing the dimensionality of datasets: it selects the most informative features while still retaining sufficient information for the classification task. Feature selection has many advantages, such as avoiding over-fitting, facilitating data visualization, reducing storage requirements and reducing training time. In this research, the purpose of applying the feature selection method is to reduce the dimensionality of the data, because not all features are informative and the uninformative ones would affect the classification performance.

Fig. 3: An example of output after POS tagging and phrase chunking processes
Feature selection research has focused on eliminating as many redundant and irrelevant features as possible. Irrelevant features supply no useful information in any context, and redundant features are those that provide no more information than the currently selected features. In this research, we select Odds Ratio as the feature selection technique. Odds Ratio evaluates whether the odds of a specific event or outcome are the same for two groups. It is computed with a simple equation:
$$\mathrm{OR} = \frac{A/B}{C/D} = \frac{A \cdot D}{B \cdot C} \qquad (1)$$
where A is the number of exposed cases, B is the number of exposed non-cases, C is the number of unexposed cases, and D is the number of unexposed non-cases. The list of terms or features in the vocabulary file is sorted and categorized into ranges based on their frequency. The frequency of the terms is used to calculate the standard error, the 95% confidence interval and the Odds Ratio obtained using Equation (1). In this research, the accepted Odds Ratio values range from 0.6 to 2.1; thus, all terms in the abstracts whose Odds Ratio falls within this range are selected.
As a result, for the features extracted using the Genia Tagger tool, 1,061 of 25,837 terms are selected, which covers 4.11% of the terms in the whole set of abstracts. For the features extracted by medical experts, 958 of 30,028 terms are selected, which is 3.19%.
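
The Odds Ratio selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the experiments: it assumes that "exposed" means the term occurs in a document and "case" means the document belongs to the target category, and it uses the common log-scale standard error and 95% confidence interval for a 2x2 table. The counts and the function names `odds_ratio_stats` and `select_terms` are hypothetical.

```python
import math


def odds_ratio_stats(a, b, c, d, z=1.96):
    """Odds ratio, standard error of ln(OR), and 95% CI for a 2x2 table.

    a: exposed cases, b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error on the log scale (assumed; the paper does not give its formula).
    std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = math.log(odds_ratio)
    ci = (math.exp(log_or - z * std_error), math.exp(log_or + z * std_error))
    return odds_ratio, std_error, ci


def select_terms(tables, low=0.6, high=2.1):
    """Keep terms whose odds ratio falls inside [low, high]."""
    selected = []
    for term, cells in tables.items():
        odds_ratio, _, _ = odds_ratio_stats(*cells)
        if low <= odds_ratio <= high:
            selected.append(term)
    return sorted(selected)


# Hypothetical counts (a, b, c, d) per term, for illustration only.
tables = {
    "angina": (30, 20, 70, 80),   # OR = 2400 / 1400, inside the range
    "patient": (90, 10, 10, 90),  # OR = 81, above the range
    "study": (10, 40, 90, 60),    # OR = 600 / 3600, below the range
}
print(select_terms(tables))
```

The confidence interval is returned but not used for selection here, because the text states only the accepted Odds Ratio range of 0.6 to 2.1 as the selection criterion.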

Calculate feature weighting
We compute the feature weights for each training and testing document. For training, we use 4,643 documents that contain the selected features and their frequencies; for testing, we use 1,217 documents with the selected features and their frequencies. We calculate each feature weight for both training and testing documents using the Term Frequency-Inverse Document Frequency (TF-IDF) equation, which in its standard form is:
$$w_{i,d} = tf_{i,d} \times \log\left(\frac{n}{df_i}\right)$$
where the term frequency $tf_{i,d}$ is the number of times term i occurs in document d (d = 1, …, m), the document frequency $df_i$ is the total number of documents that contain term i, and n is the total number of documents.
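
To make the weighting step concrete, the sketch below computes TF-IDF weights for a few tokenized documents. It assumes the common form tf × log(n / df) with a natural logarithm; the exact variant (log base, smoothing, normalization) used in the experiments is not stated. The toy documents and the function name `tf_idf_vectors` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Map each tokenized document to a {term: tf * log(n / df)} dictionary."""
    n = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)  # raw term frequency within this document
        vectors.append(
            {term: count * math.log(n / df[term]) for term, count in tf.items()}
        )
    return vectors


# Toy documents already reduced to selected features (illustrative only).
docs = [
    ["angina", "aspirin", "angina"],
    ["aspirin", "stroke"],
    ["stroke", "infarction"],
]
for vector in tf_idf_vectors(docs):
    print(vector)
```

A term that appears in every document receives a weight of zero under this form, which is the intended effect: such a term does not help to separate categories.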

Perform text classification
For our experiments, we perform text classification using LIBSVM, a library for support vector machines. We conduct several experiments for text classification using the set of features extracted using the Genia Tagger tool and the set of features extracted by medical experts from PPUKM. Then, we compare the performance of both sets of features based on the precision, recall, and F-measure produced in the experiments. In addition, we also conduct experiments using all features, without performing the feature selection process, for both sets of features.
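
LIBSVM reads training and testing data in a sparse text format with one document per line, `<label> <index>:<value> ...`, where feature indices are integers in ascending order. The sketch below shows one way weighted documents could be serialized into that format. The vocabulary mapping, the weights and the function name `to_libsvm_line` are hypothetical; the paper does not describe how its input files were produced.

```python
def to_libsvm_line(label, weights, term_index):
    """Format one document as '<label> <index>:<value> ...'."""
    # Keep only known terms with non-zero weight, ordered by feature index.
    pairs = sorted(
        (term_index[term], value)
        for term, value in weights.items()
        if term in term_index and value != 0.0
    )
    features = " ".join(f"{index}:{value:.6g}" for index, value in pairs)
    return f"{label} {features}".rstrip()


# Hypothetical vocabulary: selected terms mapped to 1-based feature indices.
term_index = {"angina": 1, "aspirin": 2, "stroke": 3}
# Hypothetical feature weights for one document; "unknown" is not in the vocabulary.
weights = {"stroke": 0.75, "angina": 2.5, "unknown": 1.0}
print(to_libsvm_line(1, weights, term_index))
```

Terms outside the selected vocabulary are silently dropped, which mirrors the effect of feature selection on unseen testing documents.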
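The performance of text classification is measured using the standard information retrieval measures of precision, recall, and F-measure, whose standard definitions are:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a category.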
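
As a small worked example of these measures, the sketch below computes precision, recall and F-measure for a single category from true and predicted labels, using the usual definitions based on true positives, false positives and false negatives. The labels and the function name `precision_recall_f` are hypothetical, and how the per-category values are averaged in the experiments is not specified in the text.

```python
def precision_recall_f(true_labels, predicted_labels, positive):
    """Precision, recall, and F-measure for one category."""
    tp = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1  # correctly assigned to the category
        elif guess == positive:
            fp += 1  # assigned to the category but belongs elsewhere
        elif truth == positive:
            fn += 1  # belongs to the category but was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Made-up labels for five documents.
truth = ["heart", "heart", "other", "heart", "other"]
guess = ["heart", "other", "other", "heart", "heart"]
print(precision_recall_f(truth, guess, "heart"))
```

In this example there are two true positives, one false positive and one false negative, so all three measures equal 2/3.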

Experiments
In this research, we perform classification using the LIBSVM tool. We conduct several experiments to compare the classification performance between the features extracted using the Genia Tagger tool and those extracted by medical experts. In addition, for both sets of features, we compare the classification performance with and without a feature selection method. Table 1 and Table 2 present the results of the classification experiments using the two sets of features extracted by Genia Tagger and by medical experts, respectively. Finally, we compare the classification performance of the experiments with Odds Ratio as the feature selection method against the experiments without Odds Ratio.
For the experiments that employ a feature selection method, we use 958 features extracted by medical experts and 1,061 features extracted by Genia Tagger. For the experiments without a feature selection method, we use 30,028 features extracted by medical experts and 25,837 features extracted by Genia Tagger.

Results and discussion
In this section, we discuss the classification performance obtained with the two sets of features, extracted by Genia Tagger and by medical experts. From the classification experiments, we compare the performance of each set of features with and without a feature selection method. Overall, the results differ both between the two sets of features and between the experiments with and without feature selection. In addition, we compare our experimental results with results published by other researchers.
Generally, for the experiments with the feature selection method, the results using the set of features extracted by medical experts from PPUKM outperform the results using the set of features extracted by Genia Tagger. Table 1 and Table 2 show the results produced in our experiments. For the features extracted by medical experts, the average precision, recall, and F-measure are 65.38%, 42.37% and 39.57%, respectively, while for the features extracted using Genia Tagger they are 61.14%, 40.49% and 35.42%, respectively. In comparison, similar work by Gong (2018), which used version 3.02 of the Genia corpus, produced an average precision of 69.29%, an average recall of 56.92% and an average F-measure of 62.31%.
From the results obtained in the experiments, we found that the proposed approach of extracting relevant and informative features from biomedical literature such as the Ohsumed (2005) dataset, using the Genia Tagger tool and medical experts such as cardiologists, works well within its limitations. Even though only 958 features are selected using Odds Ratio from the 30,028 features extracted by medical experts, these experiments produce quite good results. This performance might be due to the use of Odds Ratio as the feature selection method, which eliminates most of the general medical terms or features from the original dataset. In addition, the selected features are most probably meaningful and informative features that influence the classification performance.
Subsequently, we compare the classification performance without a feature selection method for both sets of features. In these experiments, the set of features extracted by Genia Tagger shows higher precision than the set extracted by medical experts. However, the recall and F-measure values for the set of features extracted by medical experts are better than those for Genia Tagger. These results indicate that the higher number of features extracted by medical experts (30,028 features) than by Genia Tagger (25,837 features) causes misclassification during the classification process.

Conclusion
The excessive amount of biomedical literature in digital form causes difficulties in organizing and retrieving relevant information from the web. Several solutions have been proposed to solve this problem, especially in the areas of data mining, information retrieval, text mining, text classification, and machine learning. In addition, many researchers are studying the classification of biomedical literature in order to handle the problem of organizing and navigating websites and to improve the accuracy of web searches. However, one of the problems raised in the text classification approach is high dimensionality. One effective way to reduce the high dimensionality of data is to perform feature selection, and employing an effective and efficient feature selection method could improve the performance of classification.
In this paper, we explore the effectiveness of feature selection methods for reducing the high dimensionality of features for text classification. To that end, we conduct several classification experiments to evaluate the effectiveness of the feature selection method. Generally, we conclude that employing the feature selection method for text classification could reduce the high dimensionality of features in biomedical literature and improve classification accuracy. For future research, we are interested in increasing the number of documents used from the Ohsumed (2005) dataset and in performing text classification with different feature selection methods, in order to reduce the high dimensionality of features and increase the performance of text classification, especially in the biomedical literature area.