Semi-supervised method for sensitivity based documents’ classification for online service providers

Article history: Received 4 November 2019; received in revised form 5 February 2020; accepted 7 February 2020

In today’s digital era, many service-providing companies operate on the web, where a service is the logical product of a company that can be used over the Internet. Providers offer services such as online counselling, online doctor consultation, cloud services, and web hosting to their customers. When customers face problems, they may send a text message to their provider. One option is for providers to resolve these issues on a First-Come-First-Serve basis, but there should also be a way to detect sensitive issues that need to be solved first. How can this sensitivity be determined? A great deal of research already classifies text by polarity (positive or negative), and other classifications have also been investigated, such as aspect versus non-aspect, subjective versus objective, and spam versus not spam. Classifying text as sensitive or not sensitive, however, has not yet been considered for service providers. This paper presents a strategy for sensitivity-based classification using Latent Semantic Indexing (LSI). The purpose of LSI is to rank documents with respect to a given query. In this study, a mechanism is provided to generate the query automatically by matching general sensitive words against the words of all documents. The approach is semi-supervised because 4782 sensitive words were labeled from various sources and are then used in an unsupervised manner to detect the sensitivity of each document. The lists of documents sorted by the LSI scores generated by the sensitive query were checked manually and found to be highly satisfactory. The topmost document in the list was the most sensitive, and the last document was the least sensitive.


Introduction
Nowadays, the Internet is widely used through smartphones. This has paved the way for many services to reach customers over the Internet. Some of these services are provided for entertainment or comfort, while others can be life-saving. Online applications that provide services to the public include online career counseling, online doctor consultation, internet service providers, FTP providers, and cloud software providers (Asfoura et al., 2018). Each of these providers may have many customers, and each customer may face different problems daily. It is very important to handle critical problems first. How can critical problems be determined? A customer's problem is written as text, and much analysis of text already exists. This paper provides a method to identify the most critical problems among all customers by assigning an LSI score to each customer query. The LSI (Latent Semantic Indexing) method is well suited to searching. It has been used for the clustering of documents and for concept representations with keywords and key sentences (Ahmad et al., 2017). LSI has also been used to determine the most positive and most negative reviews of a product based on an automatically generated query. Researchers have also worked on the detection of spam opinions (Teli and Biradar, 2014), co-reference resolution (Ding and Liu, 2010), detection of the sense of ambiguous words (Saqib et al., 2018), and aspect grouping, i.e., different words belonging to the same aspect (Saqib et al., 2019). Although text has been classified along these different dimensions for customers who purchase a physical product, there is still a need to classify text by sensitivity for customers who purchase a service from a provider.
Some service providers let the customer submit an issue together with a priority chosen from a predefined list, such as high, medium, or low. The customer's choice among these options may not reflect the actual urgency. Instead of relying on manual selection, there should be a method that selects one of these options automatically, based on the sensitivity of the written text. The proposed method has been investigated with this type of application in mind. This study made the following key contributions:
• A method is proposed for generating an Automatic Query (AQ) from sensitive and critical words. The AQ is the query required by the Latent Semantic Indexing (LSI) technique, so there is no need to provide queries (priority labels) as input.
• A ranked list of documents is generated from highest to lowest sensitivity based on the LSI scores. The highest-scored document has the highest priority, and the lowest-scored document has the lowest priority.

Literature review
How can a company improve the quality of a product, place, etc. from a huge number of reviews? Many studies have been carried out on sentiment analysis, which is about determining the sentiment orientation of a review or comment (Chen et al., 2017). Sentiment orientation means that an opinion is classified as exactly positive or exactly negative (Liu, 2012). The view, assessment, or feeling of a person towards a product (Jin et al., 2016), aspect (Shu et al., 2017), or service is known as a sentiment (Khan et al., 2009; Asghar et al., 2014). Such a feeling, which is either positive or negative, can be assigned a score. Most of the work in sentiment analysis is based on binary classification, which means that reviews or blogs are divided into "positive" and "negative" classes (Wang et al., 2009). Text sentiments can be classified in two ways, i.e., through machine learning and through score-based approaches (Wang et al., 2011; Chen et al., 2011). Machine learning uses training data (Hameed et al., 2018), while the score-based method uses several attributes of an entity to determine the scores. In the score-based approach, opinions can be oriented as positive or negative (Kundi et al., 2014b; Saqib and Kundi, 2016). Kundi et al. (2014a) used a combined approach of SentiWordNet and lexical resources to determine the scores for slang. A lexicon-based approach for extracting the sentiment orientations of opinions has also been used for scoring. Gupta and Ekbal (2014) used lists of positive and negative words to determine the polarity of a sentence by creating a training matrix and a random forest classifier based on supervised learning. Sentiment analysis can be performed using different methods (Rosenthal et al., 2017), each reported to improve on the accuracy of the previous one.
Much work on sentiment orientation relies on adjectives; frequent nouns and noun phrases; sentiment shifters; the handling of 'but' clauses; decreased and increased quantity of an opinionated item; high, low, increased, and decreased quantity of a positive or negative potential item; desirable or undesirable facts; deviations from the norm or a desired value range; and the production and consumption of resources and waste. These features are very important for determining the polarity of a document or sentence (Htay and Lynn, 2013). However, a large amount of online data is generated every day with unprecedented speed and size. Most of the information available on the Internet is in unstructured text form, i.e., online reviews, blogs, chats, and news. An aspect-based sentiment analysis, which considers only particular aspects (Gojali and Khodra, 2016), requires less effort than a sentiment analysis of an object with respect to all aspects. Reviews are rated according to an object, so there should be a direct method to determine whether a review is positive or negative. LSI (Latent Semantic Indexing) is well suited to such a purpose (Ahmad et al., 2017). LSI (Huang et al., 2009) has been used for the clustering of documents and for concept representations. An extended method based on LSI is able to filter unwanted emails in Chinese and English (Yang and Li, 2005). Altaher (2017) proposed a hybrid approach for the sentiment analysis of Arabic tweets that works in two stages. First, pre-processing methods such as stop-word removal, tokenization, and stemming are applied, and then two feature-weighting algorithms (information gain and chi-square) are used to assign high weights to the most significant features of the Arabic tweets. Second, a deep learning technique is employed to classify the Arabic tweets as either positive or negative.
To improve the accuracy of sentiment analysis, a lot of work has also been done on word sense disambiguation (WSD) (Rios et al., 2017; Swathy, 2017). Machine learning approaches, also called corpus-based approaches, do not make use of any knowledge resources for disambiguation (Raganato et al., 2017). The most accurate WSD systems to date exploit supervised methods, which automatically learn cues useful for disambiguation from manually sense-annotated data. All of the above analyses are useful both for companies that sell physical products and for online service providers. Online service providers also have many customers whose issues and problems must be handled. There should be a method that can detect the most critical issues so that they can be handled first.

Applications of proposed work
This methodology is suitable for all online applications that provide services to many customers and deal with issues daily. A few of them are described below.

Online counseling service
Online therapy, also known as e-therapy, e-counseling, teletherapy, or cyber-counseling, is a relatively new development in mental health in which a therapist or counselor provides psychological advice and support over the internet. This can occur through email, chat, messaging, or internet phone (Mallen and Vogel, 2005).

Online doctor consultation
In 2000, many people came to treat the internet as a first, or at least a major, source of information and communication. Health advice is now the most popular topic. In developed countries, many online doctors prescribe so-called 'lifestyle drugs', such as those for weight loss, hair loss, or erectile dysfunction (Glover-Thomas and Fanning, 2010).

Cloud service provider
A cloud service is any service made available to users on-demand via the Internet from a cloud computing provider's servers as opposed to being provided from a company's on-premises servers. Cloud services are designed to provide easy, scalable access to applications, resources, and services, and are fully managed by a cloud services provider (Saqib et al., 2011).

Web hosting service
Hosting (also known as website hosting or Web hosting) is the business of housing, serving, and maintaining files for one or more websites. In a sense, the customer rents space on a computer to hold a website made by a web designer (Bazsova, 2019).

Classification of document based on the sensitivity
The Latent Semantic Indexing method determines which texts are closest to the Automated Query (AQ). The AQ contains the critical words found across all texts. In this method, the first step is to build the AQ, and the second is to compute the score of each document.

Automatically generated query (AQ)
Everyday experience with customer issues suggests that sensitive issues contain words that are mostly negative. Therefore, a list of 4782 negative words, named SenWord, was collected (Liu et al., 2005). The flow chart for generating the Automatic Query is given in Fig. 1.
In Fig. 1, C denotes the chunks, i.e., all words of a given text document D; FC denotes the filtered chunks, i.e., the words that remain after stop words are removed; and SFC contains the words common to FC and SenWord (the sensitive words). These words are added to AQ. In the end, AQ contains all sensitive words from the whole set of documents, and the automatic query has thus been generated. Eq. 1 shows all the documents, and Eq. 2 depicts the tokens. Eq. 3 is used for filtering the tokens, i.e., removing all the stop words.

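$$D = \bigcup_{x=1}^{n} d_x \qquad (1)$$
$$T(x) = \mathrm{tokens}(d_x) \qquad (2)$$
$$FT(x) = T(x) \setminus SW \qquad (3)$$
where x = 1, 2, 3, …, n; d_x denotes the x th document; SW means stop words; D represents the set of all n documents; T(x) represents the tokens of the x th review, and FT(x) represents the filtered tokens of the x th review. Now the Automatic Query (AQ) can be generated from FT(x) and the list of 4782 sensitive words, SenWord, as calculated in Eq. 4:
$$AQ = \bigcup_{x=1}^{n} \{\, FT_i(x) : FT_i(x) \in SenWord \,\} \qquad (4)$$
where x = 1, 2, 3, …, n and FT_i(x) means the i th chunk of the x th document.
AQ contains those words from all the documents that belong to SenWord.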
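The query-generation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `tokenize`, `filter_tokens`, and `build_automatic_query` are hypothetical, the regular-expression tokenizer is an assumed choice, and `STOP_WORDS` and `SEN_WORD` are tiny stand-ins for a full stop-word list and the 4782-word SenWord list.

```python
import re

# Hypothetical, tiny stand-ins: the paper uses a full stop-word list
# and the 4782-word SenWord list.
STOP_WORDS = {"the", "is", "a", "and", "was", "to", "it", "of", "in"}
SEN_WORD = {"noisy", "dirty", "poor", "problem", "disappointed"}


def tokenize(text):
    """T(x): lowercase word tokens of one document (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """FT(x): the tokens that remain after stop words are removed."""
    return [t for t in tokens if t not in stop_words]


def build_automatic_query(documents, sen_word=SEN_WORD, stop_words=STOP_WORDS):
    """AQ: every sensitive word that occurs in at least one document."""
    aq = set()
    for doc in documents:
        ft = filter_tokens(tokenize(doc), stop_words)
        aq.update(t for t in ft if t in sen_word)
    return sorted(aq)


docs = [
    "The room was noisy and the breakfast was poor.",
    "It was not filthy dirty, but it could have been cleaner.",
    "Great location and friendly staff.",
]
aq = build_automatic_query(docs)
print(aq)
```

Because AQ is built as a set, a sensitive word that appears in several documents enters the query only once; whether the original method keeps repeated words is not stated in the text, so this is an assumption of the sketch.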

Scoring each document with LSI
LSI is an efficient information retrieval algorithm (Phadnis and Gadge, 2014). In LSI, a cosine similarity measure is computed between the coordinates of a document vector and the coordinates of a query vector. A value of 1 means that the document is 100% close to the query, a value of 0.9 means that it is 90% close, a value of 0.5 means that it is 50% close, and so forth. The significant point now is finding the coordinates of each document and of the query. A Singular Value Decomposition (SVD) can determine these coordinates. Through the SVD, three matrices, S, V, and U, can be obtained and used for further processing. To determine these matrices, the SVD requires a matrix as input. Such a matrix consists of rows and columns containing integers, but the inputs under consideration are text documents. A feature matrix can be obtained by calculating the frequency of each word in each document. This means that first, a feature matrix is created from all the documents, and then, the SVD is calculated. The supporting matrices S, V, and U are calculated by using NumPy (Numeric Python). The coordinates of all the documents are determined from these matrices, and the query is combined with them to obtain the query coordinates. Finally, a cosine similarity function is applied to these coordinates to find the documents that are closest to the query. A document with the highest score is therefore very close to the sensitive words, which means that this document is the most sensitive, and vice versa. Fig. 2 shows the flow chart for scoring each document with AQ.
In Fig. 2, LSI (AQ, D) will calculate the score of each document with AQ as defined in Eq. 5:
$$\mathrm{Score}(x) = \mathrm{LSI}(AQ, FT(x)) \qquad (5)$$
In Eq. 5, FT(x) are the filtered tokens of the x th document. The score Score(x) of each document based on LSI can then be found through the automated query, AQ.
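The scoring procedure described in this section can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function name `lsi_scores`, the reduced dimension `k=2`, and the toy inputs are hypothetical, and the query is projected with the commonly used LSI folding-in formula $q_k = q^{T} U_k S_k^{-1}$, which the text does not spell out. Document coordinates are taken as the rows of $V_k$.

```python
import numpy as np


def lsi_scores(filtered_docs, query_terms, k=2):
    """Cosine similarity between each document and the query in a k-dim LSI space.

    Assumes at least one document is non-empty; k is an assumed parameter.
    """
    vocab = sorted({t for doc in filtered_docs for t in doc})
    row = {t: i for i, t in enumerate(vocab)}

    # Term-by-document frequency (feature) matrix A.
    A = np.zeros((len(vocab), len(filtered_docs)))
    for j, doc in enumerate(filtered_docs):
        for t in doc:
            A[row[t], j] += 1.0

    # Query vector over the same vocabulary; unseen query words are ignored.
    q = np.zeros(len(vocab))
    for t in query_terms:
        if t in row:
            q[row[t]] += 1.0

    # SVD: A = U * diag(s) * Vt.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(s > 1e-12))
    k = max(1, min(k, rank))
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

    # Document coordinates are the rows of Vk; fold the query into the same space.
    q_k = (q @ Uk) / sk

    # Cosine similarity; documents with zero-length coordinates get a score of 0.
    denom = np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_k)
    return np.divide(Vk @ q_k, denom, out=np.zeros_like(denom), where=denom > 0)


filtered_docs = [
    ["room", "noisy", "breakfast", "poor"],
    ["room", "dirty", "noisy"],
    ["great", "location", "friendly", "staff"],
]
query = ["dirty", "noisy", "poor"]  # stand-in for the automatic query AQ

scores = lsi_scores(filtered_docs, query, k=2)
ranking = np.argsort(-scores)  # document indices, most sensitive first
print(scores, ranking)
```

In this toy input the two documents that share words with the query score close to 1, while the third document, which shares none, scores close to 0. The choice of `k` controls how much of the term-document structure is retained and would need tuning on real data.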

Result and discussion
The dataset is a list of 1,000 hotels and their reviews provided by Datafiniti's Business Database. It includes hotel location, name, rating, review data, title, username, and more (data.world, 2018). This study selected 225 sensitive hotel reviews. These reviews were labeled from S [1] to S [225], and different attempts were made to determine the maturity level of this work. A sample listing of the dataset is presented in Table 1.

S [1]
"I chose the Cellini based on all the wonderful reviews I had read on Tripadvisor. I was extremely disappointed and agreed with everything said in the honest negative reviews of the hotel. This place is nothing special and needs to do a lot more to regain its #1 rating on Tripadvisor! I stayed in a Junior Suite with my family. The room is not a Suite; it is a room with a king-size bed and two single beds! It is a family room, and nothing more! It was not filthy dirty, but it could have been cleaner, particularly in the bathroom. The hotel staff did no more than they should have done for us. The breakfast was poor, and the cold meat and cheese looked as if it had been there for months. I would not stay there again. I really cannot understand all the positive reviews for this hotel on Tripadvisor. This hotel is far too expensive for what you get! I have traveled all over Italy and have stayed in Rome many times, try the Hotel Sant Anna in Borgo Pio, near the Vatican same price but 100% better!"

S [2]
"This place is too noisy. You can hear the person using the bathroom in the next room and hear every little conversation!! The doors do not shut all the way. You have to give your key when you go and come so anyone can get into your room. You can tell them a room number, and they just give you a key. I spoke with the manager about a problem that i encountered with the receptionist. When all is said and done, they did nothing and said it was a misunderstanding!!! We talked with another couple staying here, and they said the same thing, very noisy!! Avoid this place if you can. Not worth the money!!"

S [3]
"Hotel is kind of a misnomer. The reason there isn't a picture is that the "hotel" takes two buzzers to get in, and you are inside a large, nondescript building. If you weren't looking for it, you'd never find it. I found the price, relative to the reviews, exorbitant. I expected more of a hotel. As an example, every time you open your door, the owner, kind of peeks her head out to see what it is you want, nice, but annoying in the sense of, if I wanted to go to get a toothbrush or get some air outside, you have to go through the buzzer routine to get in and out of the hotel. There are a great many hotels, closer to the Termini Station that are traditional hotels and charge far less money. The hotel would be worth 50-60 euro's, not the ridiculous 120 that we paid."

First, all the known sensitive documents and the automated query AQ were passed through the proposed algorithm to find the score of each document. The LSI scores of the first ten and last ten documents are given in Table 2 and Table 3 (see also Table 4).
The first ten and last ten sensitive documents are shown in Table 5 and Table 6, respectively. The sorted lists of documents based on the LSI scores generated by the automated query (AQ) were checked manually and found to be highly satisfactory.

Statistical results
A confusion matrix is formed from the four outcomes of a binary classification. A binary classifier predicts every instance of a test dataset as either positive or negative. Here, this prediction produces four outcomes: true sensitive (TS), true not-sensitive (TNS), false sensitive (FS), and false not-sensitive (FNS). We used 225 negative hotel reviews as Sensitive and 225 positive hotel reviews as Not Sensitive, i.e., 50% Sensitive and 50% Not Sensitive. In the experiments, at an LSI score greater than 0.7, the recall with respect to Not Sensitive was 40% and with respect to Sensitive was 60%, and only 10% of the Not Sensitive reviews were classified as Sensitive. The results were obtained by comparing the actual Sensitive and Not Sensitive labels with the predicted ones. Some of the samples are given in Table 7.
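The way the four outcomes and the per-class recall follow from thresholded LSI scores can be sketched as below. This is an illustrative sketch only: the function names `confusion_counts` and `recall` are hypothetical, the threshold of 0.7 mirrors the value quoted above, and the six scores are invented toy numbers, not results from this study.

```python
def confusion_counts(scores, actual_sensitive, threshold=0.7):
    """Count TS, TNS, FS, and FNS when a score above the threshold means Sensitive."""
    counts = {"TS": 0, "TNS": 0, "FS": 0, "FNS": 0}
    for score, is_sensitive in zip(scores, actual_sensitive):
        predicted_sensitive = score > threshold
        if predicted_sensitive and is_sensitive:
            counts["TS"] += 1
        elif predicted_sensitive and not is_sensitive:
            counts["FS"] += 1
        elif is_sensitive:
            counts["FNS"] += 1
        else:
            counts["TNS"] += 1
    return counts


def recall(correct, missed):
    """Share of a class that was recovered; 0.0 when the class is empty."""
    total = correct + missed
    return correct / total if total else 0.0


# Invented toy scores and labels, for illustration only.
toy_scores = [0.92, 0.81, 0.55, 0.74, 0.30, 0.10]
toy_actual = [True, True, True, False, False, False]

counts = confusion_counts(toy_scores, toy_actual)
sensitive_recall = recall(counts["TS"], counts["FNS"])
not_sensitive_recall = recall(counts["TNS"], counts["FS"])
print(counts, sensitive_recall, not_sensitive_recall)
```

Recall for the Sensitive class is TS / (TS + FNS), and recall for the Not Sensitive class is TNS / (TNS + FS); raising the threshold trades the first for the second.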

Conclusion and future work
The major purpose of the proposed work is to separate sensitive text from not sensitive text. Its major contribution is to rank text from most sensitive to least sensitive. This can be very beneficial in environments where online suggestions or solutions are provided based on the critical condition of the customer. In the experiments, which used the reviews from the hotel dataset and an automatic query as input to LSI, the detection of sensitive documents or reviews was observed to be very satisfactory. We selected 450 hotel reviews (225 positive, i.e., Not Sensitive, and 225 negative, i.e., Sensitive) and 4782 sensitive words. A wider range of datasets and different sizes of the sensitive word list could also be analyzed. The sensitive words were selected from different sources according to their negative or critical meanings. Generating a lexicon of sensitive words, by using synonyms of a critical or sensitive word and their path distance specified in SentiWordNet, can also be considered as future work.