Automatic classification of product reviews into interrogative and noninterrogative: Generating real time answer

Article history: Received 12 March 2019; received in revised form 3 June 2019; accepted 4 June 2019.

Reviews posted on a product's webpage not only motivate the company to enhance quality but also help users decide for or against purchasing the product. Researchers have classified such reviews on a subjectivity, entity, or aspect basis to find their polarity, using supervised or unsupervised techniques. However, classification into interrogatives and non-interrogatives has not yet been addressed. Existing work on interrogatives analyzes datasets that already consist of questions, for example identifying answer-seeking questions in Arabic tweets, distinguishing questions that convey information from those that do not, and detecting rhetorical questions. Classifying sentences into interrogatives and non-interrogatives is the preliminary step, and it is the core contribution of the proposed work. If detected questions are answered in real time, this could not only motivate a user to buy the product but also give users the feeling of full-duplex communication. In this work, we formulate this problem by proposing linguistic and heuristic rules that automatically detect an interrogative and answer it promptly based on the aspect it mentions. If the question contains no aspect, LSI (Latent Semantic Indexing) generates the answer from the classified non-interrogatives. LSI is an efficient information retrieval algorithm that finds the document closest to a given query. Experimental results on two publicly available datasets show precision of 95% and 96%, a 10% improvement over the alternative machine learning methods Meta Filtered Classifier and Naive Bayes.


Introduction
Sentiment analysis, also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. It represents a large problem space. Opinion extraction, sentiment mining, opinion mining, subjectivity analysis, emotion analysis, affect analysis, and review mining are among the different names and tasks involved. However, they now all fall under the umbrella of sentiment analysis or opinion mining. In industry, the term sentiment analysis is more commonly used, whereas in academia both sentiment analysis and opinion mining are frequently employed.
Pervasive real-life applications are only part of the reason why sentiment analysis is a popular research problem. It is also highly challenging as an NLP research topic and covers many novel sub-problems. There was little research in NLP or linguistics before 2000 because little opinion text was available in digital form at that time. After 2000, the field became one of the most active research areas in NLP, with extensions into data mining and web mining. In fact, it has spread from computer science to the management sciences (Liu, 2012; Archak et al., 2007; Chen and Xie, 2008). Useful information is derived from data, and a very large amount of data is known as big data. Data mining is the extraction of meaningful information from data, and text mining is a subfield of data mining. Mining the web, a major source of data, is known as web mining, and opinion mining belongs to web content mining (Rubini and Chezian, 2014). Work on sentiment classification of opinions includes subjectivity classification, implicit aspect mapping, co-reference resolution, grouping of aspects into categories, and handling of sarcastic sentences. Some work has also been done on comparative opinions to generate contrastive view summaries (Lerman and McDonald, 2009). Although reviews have been classified along these different dimensions, there is still a need to classify reviews into interrogatives and non-interrogatives. In this regard, prior studies have examined questions that do and do not convey information (Zhao and Mei, 2013), identification of answer-seeking questions in Arabic tweets (Hasanain et al., 2014), extraction of subjective, objective, and rhetorical questions (Hasanain et al., 2014; Liu and Jansen, 2015; Ranganath et al., 2016; Liu and Jansen, 2016), questions asked by Arab journalists (Hasanain et al., 2016), and detection of the user intent behind a question (Kharche and Mante, 2017).
However, all of these studies used data that were already available in the form of interrogatives, whereas classifying sentences into interrogatives and non-interrogatives is the preliminary step. Li et al. (2011a) coined the notion of Qtweets, that is, tweets that contain interrogative information that must be answered. To find Qtweets, they detect a sentence that ends with a question mark or starts with a 5W1H word. As a result, a sentence like "what a great job", which would become interrogative only if reworded as "what is a great job", is detected as a question although it is not an interrogative sentence in the real sense. We not only address these deficiencies but also improve on this work. This study makes the following key contributions:
• We propose heuristic rules for classifying subjective reviews into interrogatives and non-interrogatives.
• We propose the generation of real-time answers to interrogatives based on extracted aspects, with a user-friendly interface.
• We propose the generation of real-time answers using Latent Semantic Indexing (LSI) scores when no aspect is present.

Literature review
To attract customers, to motivate them to buy a product, and to obtain the feedback needed to assess and improve the product, companies have created opinion pages on their websites. To spare users the lengthy process of searching for a product, finding it, and evaluating it, researchers have applied different techniques to separate, sift, and distinguish opinions by product. To separate the opinions, it is necessary to know which particular product an opinion is about. This process is termed 'Entity Extraction'. Before entity extraction can be carried out, it must be determined whether a given comment is an opinion or not. Objective opinions convey factual information, while subjective ones are users' personal opinions (Liu, 2015); the latter can be considered for further processing because subjective opinions may carry sentiment. Users express their views either implicitly or explicitly: explicit opinions can be detected easily, while implicit opinions are very hard to detect (Zhang and Liu, 2011; Greene and Resnik, 2009). Entities have been detected in opinions using different methods. Greene and Resnik (2009) used two sets, a set of seed entities Q of a class C and a set of candidate entities D, to determine which of the entities in D belong to C. For entity extraction, Pantel et al. (2009) and Lee (1999) used distributional similarity, comparing the surrounding words of each candidate entity with those of the seed entities and then ranking the candidate entities by their similarity values. Topic modeling has also been used for entity extraction. Topic modeling is an unsupervised learning method that assumes each document consists of a mixture of topics and each topic is a probability distribution over words (Liu, 2012).
Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) have been used to detect topics in one or more documents (Blei et al., 2003; Griffiths and Steyvers, 2003). Searching for a required document among a huge number of published articles is time-consuming and laborious; topic modeling offers a computational tool for finding relevant topics by capturing meaningful structure in collections of documents. To obtain fine-grained sentiment analysis, researchers have worked on aspects. Long et al. (2010) extracted nouns as aspects based on frequency and information distance. Noun phrases that occur in sentiment-bearing sentences can be considered aspects (Blair-Goldensohn et al., 2008). Logic programming, particularly Answer Set Programming (ASP), has been used to elegantly and efficiently implement the key components of syntax-based aspect extraction. Logic programming provides a convenient and effective tool to encode and thus test the knowledge needed to improve aspect extraction methods, so that researchers can focus on identifying and discovering new knowledge to improve aspect extraction (Saqib et al., 2016b; Liu et al., 2013). A supervised learning algorithm has also been used to extract aspects from product opinions, in a system that implements aspect extraction using frequent itemsets at the phrase level (Jeyapriya and Selvi, 2015; Ahmad et al., 2017). After the extraction of aspects, sentiment analysis can be performed. Sentiment analysis determines whether an opinion is positive or negative. It can be performed on all the opinions about all the products of a company; on a particular product (entity extraction) (Batra and Rao, 2010; Engonopoulos et al., 2011); or on a particular feature of a product (aspect extraction) (Kirange et al., 2014; Brun et al., 2014; Pontiki et al., 2016; Alghunaim, 2015).
To improve the precision and accuracy of each of the above-mentioned techniques for sentiment analysis, researchers have also worked on the detection of spam opinions (Saqib et al., 2018b; Li et al., 2011b; Ott et al., 2011; Teli and Biradar, 2014), co-reference resolution (Ding and Liu, 2010), detection of the sense of ambiguous words (Saqib et al., 2018a), and aspect grouping, in which different words belong to the same aspect (Saqib et al., 2019; García-Pablos et al., 2014). After sentiment analysis of direct opinions, there is a need to handle comparative sentences, interrogative sentences, and so on. Kwong and Yorke-Smith (2012) detected question-answer pairs in email threads to construct summaries, and Li and Zhao (2013) detected imperative sentences with an interrogative mood. All the above-mentioned works concern the opinions that end users post on the relevant webpages. These opinions are important for new users because they help them evaluate the product. At the same time, the questions that users themselves ask about the product are even more significant than the general opinions of other people. Users can obtain results through a sentiment analysis of the opinions, but it is even more desirable, useful, and effective if users get answers to their questions automatically. For this purpose, interrogative sentences need to be separated from opinions.

Proposed framework
Customers today are concerned about the efficiency of the product they are going to buy. That is why they like to learn about the product from the comments on its webpage before they actually buy it. On the comments page, the end user can ask any kind of question about the product, but other users can answer that question only when they visit that page. There has to be an automatic and, at the same time, effective system that is able to sense and understand an interrogative sentence and to answer it on the spot.

Modules in proposed work
The proposed methodology consists of four modules: Classification Based on Interrogative and Non-Interrogative, Extraction of Aspects, Generation of Answers Based on Extracted Aspects, and Answer through LSI (Latent Semantic Indexing) without Aspects, as shown in Fig. 1.

Fig. 1: Proposed framework
The following algorithm in Table 1 shows the entire process of the proposed methodology, where the 'getData' list contains the specifications of a product; the 'IsInterrogative' function will check whether the given comments are interrogative or not; 'getAspect' will determine the aspects from the interrogatives; and 'answerSummary' will generate an answer based on the extracted aspects.
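The control flow described above can be sketched as follows. This is a minimal illustration under assumptions, not the algorithm of Table 1 itself: the function `process_comments` and the `lsi_answer` parameter are hypothetical names, and the four callables stand in for 'getData', 'IsInterrogative', 'getAspect', and 'answerSummary'.

```python
def process_comments(comments, get_data, is_interrogative, get_aspect,
                     answer_summary, lsi_answer):
    """Route each comment through the four modules (illustrative sketch)."""
    specifications = get_data()  # product specifications ('getData')

    # Module 1: split comments into interrogatives and non-interrogatives.
    interrogatives, non_interrogatives = [], []
    for comment in comments:
        if is_interrogative(comment):
            interrogatives.append(comment)
        else:
            non_interrogatives.append(comment)

    answers = {}
    for question in interrogatives:
        aspects = get_aspect(question)  # Module 2: aspect extraction
        if aspects:
            # Module 3: answer from the specifications via the aspects.
            answers[question] = answer_summary(aspects, specifications)
        else:
            # Module 4: no aspect found, fall back to the closest review.
            answers[question] = lsi_answer(question, non_interrogatives)
    return answers
```

Passing the modules in as callables keeps the sketch independent of any particular rule set or retrieval method.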

Classification based on interrogative and non-interrogative
Interrogatives can be identified through the following steps:

By "?" symbol
If a sentence ends with a "?" symbol, it is an interrogative sentence and is sent directly to the next step (i.e., Extract Aspect). Rule-1 can be written as:
"?" ∈ Sentence => Interrogative (Rule-1)

By helping verb
Interrogative sentences often start with a helping verb, and a list of helping verbs (HV) can be easily generated. If the starting word (SW) of a sentence belongs to the list of helping verbs (HV), the sentence is interrogative, as in Rule-2.
[SW] ∈ [HV] => Interrogative (Rule-2)

By W-family words
If the first word of the sentence belongs to the W-Family (What, Where, Why, When, Who, Whom, Whose, How), the sentence is interrogative, as defined in Rule-3.
[SW] ∈ [W − Family] => Interrogative (Rule-3)
However, "What a great job" starts with a W-Family word but is not an interrogative sentence. Rule-4 therefore additionally requires the second word (SecW) to be a helping verb.
[SW] ∈ [W − Family] and [SecW] ∈ [HV] => Interrogative (Rule-4)
Rule-4 holds for "How are you?" and "How can I delete this file?" but fails for "How this process will work?". Rule-5 is therefore considered: if any subsequent word (ANW) belongs to HV, the sentence is considered interrogative.
[SW] ∈ [W − Family] and [ANW] ∈ [HV] => Interrogative (Rule-5)
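The rules above can be combined into a single check, sketched below. The helping-verb list is an assumed, partial example, "how" is included in the W-Family because the examples rely on it, and the function name `is_interrogative` is hypothetical. Rule-3 alone is not applied, since Rule-4 and Rule-5 refine it.

```python
import re

# Assumed, partial list of helping verbs (HV); extend as needed.
HELPING_VERBS = {
    "is", "are", "am", "was", "were", "do", "does", "did", "has", "have",
    "had", "can", "could", "will", "would", "shall", "should", "may",
    "might", "must",
}
# W-Family words; "how" is an assumption based on the examples in the text.
W_FAMILY = {"what", "where", "why", "when", "who", "whom", "whose", "how"}


def is_interrogative(sentence):
    """Return True if the sentence is interrogative under Rules 1, 2, 4, 5."""
    text = sentence.strip()
    if text.endswith("?"):                      # Rule-1: ends with "?"
        return True
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    first, rest = words[0], words[1:]
    if first in HELPING_VERBS:                  # Rule-2: SW in HV
        return True
    if first in W_FAMILY:                       # Rule-4/5: a later word in HV
        return any(word in HELPING_VERBS for word in rest)
    return False
```

Because Rule-5 accepts a helping verb at any later position, it subsumes Rule-4, so one `any(...)` test covers both.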

Extraction of aspects
Aspects can be extracted using nouns and adjectives, which are easily identified with the part-of-speech tags of NLTK. After the removal of stop words (e.g., "is", "to", "the", "on"), a single-line question may contain only nouns or adjectives. These nouns (NN) and adjectives (JJ) are regarded as aspects. The algorithm for extracting aspects from interrogatives is shown in Table 3.
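A minimal sketch of this filtering step is given below. It is not the algorithm of Table 3: the function name `extract_aspects` and the stop-word list are hypothetical, and matching tags by the prefixes NN and JJ (so that NNS or JJR also count) is an assumption. The input is a list of (word, tag) pairs such as those produced by `nltk.pos_tag(nltk.word_tokenize(question))`, which keeps the sketch runnable without NLTK data.

```python
# Assumed, partial stop-word list for illustration only.
STOP_WORDS = {"is", "to", "the", "on", "a", "an", "of", "it", "this", "in"}
# Penn Treebank tag prefixes treated as aspects (assumption: prefix match).
ASPECT_TAG_PREFIXES = ("NN", "JJ")


def extract_aspects(tagged_tokens, stop_words=STOP_WORDS):
    """Keep nouns and adjectives that are not stop words, in order."""
    aspects = []
    for word, tag in tagged_tokens:
        token = word.lower()
        if token in stop_words or not token.isalpha():
            continue  # drop stop words and punctuation
        if tag.startswith(ASPECT_TAG_PREFIXES) and token not in aspects:
            aspects.append(token)
    return aspects
```

For example, a question tagged as What/WP is/VBZ the/DT battery/NN capacity/NN ?/. yields the aspects "battery" and "capacity".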

Generation of answers based on extracted aspects
The aspects can be compared with the specifications of the product on the relevant webpage. The algorithm for generating answers from the extracted aspects is shown in Table 4.
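One simple way to perform this comparison is sketched below. It is an illustration under assumptions rather than the algorithm of Table 4: the specifications are assumed to be a mapping from specification name to value, the matching is plain word overlap, and the function name `answer_from_aspects` and the sample values are made up.

```python
def answer_from_aspects(aspects, specifications):
    """Return specification entries whose name contains an extracted aspect."""
    matches = []
    for aspect in aspects:
        for spec_name, spec_value in specifications.items():
            if aspect.lower() in spec_name.lower().split():
                matches.append(f"{spec_name}: {spec_value}")
    unique = list(dict.fromkeys(matches))  # remove repeats, keep order
    return "; ".join(unique) if unique else None


# Made-up specification values, for illustration only.
example_specs = {"Battery capacity": "3000 mAh", "Display size": "5.5 inches"}
example_answer = answer_from_aspects(["battery", "capacity"], example_specs)
```

A return value of `None` signals that no specification matched, which is the case in which an answer would have to come from another source.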

Answer through LSI (Latent semantic indexing) without aspects
Latent Semantic Indexing (LSI), proposed by Deerwester et al. (1990) as cited in Blei et al. (2003), is an efficient information retrieval algorithm. LSI computes the cosine similarity between the coordinates of a document vector and those of a query vector in a reduced space obtained through Singular Value Decomposition, computed here with the NumPy package for Python. A value of 1 means the document is the closest possible match to the query, while values such as 0.9 and 0.5 indicate correspondingly weaker matches (Grossman and Frieder, 2012; Saqib et al., 2016a). If an interrogative has no aspects, LSI determines the closest review for user satisfaction. The algorithm in Table 5 is used to generate an answer for an interrogative with no aspect; it uses the question as the query and all remaining comments as the dataset.
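The retrieval step can be sketched with NumPy as follows. This is a minimal illustration, not the algorithm of Table 5: it assumes raw term counts, a rank of 2, and the common query folding q^T U_k S_k^{-1}, and the function name `lsi_closest` is hypothetical.

```python
import re

import numpy as np


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def lsi_closest(query, documents, rank=2):
    """Return the document closest to the query and all cosine scores."""
    if not documents:
        return None, []
    vocabulary = sorted({w for doc in documents for w in tokenize(doc)})
    index = {word: i for i, word in enumerate(vocabulary)}

    def count_vector(text):
        vector = np.zeros(len(vocabulary))
        for word in tokenize(text):
            if word in index:  # words outside the vocabulary are ignored
                vector[index[word]] += 1.0
        return vector

    # Term-document matrix: one column of term counts per document.
    term_doc = np.column_stack([count_vector(doc) for doc in documents])
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)

    # Keep at most `rank` non-zero singular values.
    k = min(rank, int(np.sum(s > 1e-10)))
    u_k, s_k, doc_coords = u[:, :k], s[:k], vt[:k, :].T

    # Fold the query into the reduced space: q^T U_k S_k^{-1}.
    query_coords = (count_vector(query) @ u_k) / s_k

    scores = []
    for coords in doc_coords:
        norm = np.linalg.norm(query_coords) * np.linalg.norm(coords)
        scores.append(float(query_coords @ coords / norm) if norm > 0 else 0.0)
    return documents[int(np.argmax(scores))], scores
```

In this setting the query would be the aspect-free question and `documents` the classified non-interrogative reviews.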

Mathematical model of proposed work
The whole classification process is described by Eq-1 to Eq-7, where x = 1, 2, 3, …, n; SW denotes the stop words; R represents the total number of reviews; T(x) represents the tokens of the x-th review; and FT(x) represents the filtered tokens of the x-th review.
where x = 1, 2, 3, …, n, and Cla(x) classifies the x-th review as interrogative or not. $w_1$ and $w_m$ are the first and last words of a review, $I_x$ means the x-th review is interrogative, and $\sim I_x$ means the x-th review is not interrogative. LR and HR are the linguistic rules and heuristic rules, respectively. The resultant interrogatives and non-interrogatives can be grouped as follows:
$$I = \bigcup_{x=1}^{n} I_x \qquad (6)$$
$$\sim I = \bigcup_{x=1}^{n} \sim I_x \qquad (7)$$

Experimental results
We have tested this model on Dataset-1 and Dataset-2.
Dataset-1: 1031 questions (as interrogatives) and 1031 answers (as non-interrogatives) downloaded from http://www.cs.cmu.edu/~ark/QA-data/. This page provides a link to a corpus of Wikipedia articles, manually generated factoid questions from them, and manually generated answers to these questions, for use in academic research. These data were collected by Noah Smith, Michael Heilman, Rebecca Hwa, Shay Cohen, Kevin Gimpel, and many students at Carnegie Mellon University and the University of Pittsburgh between 2008 and 2010. Version 1.2, released August 23, 2013, has manually generated factoid question/answer pairs with difficulty ratings from Wikipedia articles. The dataset includes articles, questions, and answers.
Dataset-2: product reviews of the Samsung Galaxy J7 from http://www.gsmarena.com/samsung_galaxy_j7-7185.php (visited on 31 August 2017), used to check the performance of the proposed work. This page has more than 5000 reviews from different users of the product.
Both datasets were manually arranged into interrogatives and non-interrogatives to create an 'arff' file for the machine learning algorithms and a text file for the proposed work.

Statistical measures
A confusion matrix is formed from the four outcomes of binary classification. A binary classifier predicts every data instance of a test dataset as either positive or negative. This classification (or prediction) produces four outcomes: true positive (TP), a correct positive prediction; false positive (FP), an incorrect positive prediction; true negative (TN), a correct negative prediction; and false negative (FN), an incorrect negative prediction. Various measures can be derived from a confusion matrix, as shown in Table 6.
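The standard measures derived from these four counts can be computed as in the sketch below. The selection of precision, recall, accuracy, and F1 is an assumption about which measures are relevant here, the function name `confusion_metrics` is hypothetical, and the counts in the example are made up.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute common measures from the four confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against division by zero when a class is never predicted.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)             # TP / (TP + FP)
    recall = ratio(tp, tp + fn)                # TP / (TP + FN)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Made-up counts, for illustration only.
example = confusion_metrics(tp=90, fp=10, tn=85, fn=15)
```

With these made-up counts, precision is 90/100 and accuracy is 175/200.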
We compare our proposed method with the following machine learning methods for sentence classification:
i) Method-1 (Naive Bayes): In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used because it often outperforms more sophisticated classification methods.
ii) Method-2 (Meta Filtered Classifier): This class is used for running an arbitrary classifier on data that has been passed through an arbitrary filter. As with the classifier, the structure of the filter is based exclusively on the training data, and test instances are processed by the filter without changing their structure (Vijayarani and Muthulakshmi, 2013).
iii) Proposed Model: As shown in Table 1, Table 2, Table 3, and Table 4, the proposed method achieved more robust and more accurate results by using five rules to classify sentences into interrogatives and non-interrogatives. In the experiments on Dataset-1, all answers were classified as non-interrogatives because no rule mistakenly treated them as questions. All questions in the dataset end with "?"; we removed this symbol before detection because users sometimes do not put "?" at the end of a question on a review page. This agent achieved an average precision of 95% and an accuracy of 94% on Dataset-1, and an average precision of 96% and an accuracy of 94% on Dataset-2, in detecting interrogatives (I) and non-interrogatives (NI), as shown in Table 7.

Conclusion and future work
This work proposed a classification of product reviews into interrogatives and non-interrogatives. We obtained classification results with improved precision (0.95) compared to the alternative machine learning methods, as shown in Fig. 2.
The proposed method is quite general and can classify reviews of any product, Android applications, and so on.
The proposed method has several limitations that should be borne in mind when interpreting the findings. They are discussed below using different sample questions, some of which are shown in Table 8.
In the above samples, S. No. 1 and S. No. 2 are classified as non-interrogative because they start with neither a W-Family word nor a helping verb. S. No. 3 ("It is dual sim?") would not have been detected as a question if there had been no question mark (?). S. No. 4 ("what bands it supports") is actually a question, but the agent did not detect it: although it starts with a W-Family word, there is no helping verb in the entire sentence. The rule could not succeed here, so the sentence was considered "Not Interrogative". If the user had written "What bands it can support", the agent would have detected it even without the question mark, because of the W-Family word "what" and the helping verb "can". All the possible patterns and modes of questions that begin with W-Family words or helping verbs were incorporated, and utmost care was taken to ensure that all such sentence templates were covered. The authors are confident that this sample is sufficient to serve as substantial evidence for the arguments established in this study. Table 9 shows some samples of detected interrogatives whose answers were generated from aspects. The answer to S. No. 4 was generated through the LSI score, as shown in Table 10, because it contains no aspect.