Analysis of latent Dirichlet allocation and non-negative matrix factorization using latent semantic indexing

The word is a major attribute in the field of opinion/text mining. Based on this attribute, it is decided whether a word is a keyword, aspect, feature, entity, title, or topic. Much work has been done to detect such targets using both supervised and unsupervised approaches. These targets can be used in further processing such as text analytics, sentiment analysis, information retrieval, and search. Latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF) are the major models used for detecting topics. Understanding the depth and details of these algorithms is necessary for those who want to extend these models, whereas the opinion/text mining research community uses them as a black box. However, there is a question about which model is the most accurate for detecting topics. Latent semantic indexing (LSI) is a well-established approach for finding the documents that best match a given query. In this study, we analyzed the LDA and NMF models using LSI to determine the best model for opinion/text mining and found that both are very good, but NMF is slightly better than LDA.


Introduction
In the field of opinion mining, topic modeling plays a vital role in aspect extraction, keyword extraction, and entity extraction (Zhao et al., 2010). Researchers model these using multiple techniques with different accuracy rates. The most commonly used models in topic detection are latent Dirichlet allocation (LDA) and non-negative matrix factorization (NMF). There is a lot of confusion, however, regarding which model is most suitable for topic detection. Agrawal et al. (2018) said that "LDA methods are more suited in domains where data is in semantic units like words, and NMF methods are more suited to domains where data has the so-called semantic gap." Stevens et al. (2012) stated that "when a descriptive topic is required, LDA is the best choice", and Xue et al. (2014) stated that "NMF is more appropriate when dealing with visual ambiguities", while Chen et al. (2017) and Taniguchi et al. (2018) specified that LDA also shows better adaptability and robustness with clustered visual data. Opinion/text mining researchers, however, have only textual data, so these answers leave them uncertain. In blogs about LDA and NMF, the most frequently asked questions show confusion about which model is best (for example, "What is a good way to perform topic modeling on short text?"). Moreover, validation of topic detection in both models can be very challenging (Suri and Roy, 2017).
LDA and NMF algorithm details are necessary for the statistical research community, as well as for those researchers who want to extend these algorithms or make changes to existing algorithms, such as supervised latent Dirichlet allocation (SLDA) (Blei and McAuliffe, 2010), LDA for multiple languages (MLSLDA) (Boyd-Graber and Resnik, 2010), Constrained-LDA (Zhai et al., 2011), constrained symmetric nonnegative matrix factorization (CSNMF) (Peng and Park, 2011), constrained NMF (Liu and Wu, 2010), and semi-supervised NMF (Chen et al., 2008). The opinion/text mining research community uses LDA and NMF as a black box, where an output is produced based on given inputs. For this community, there is no need to understand the depth of each model. They are, however, concerned with the question of which model is best for topic detection, because they use the detected topic for further processing, i.e., aspect extraction, keyword extraction, grouping aspects into categories, spam detection, opinion topics, and finding a common semantic space (He et al., 2011; Gao and Li, 2011; Li et al., 2010). The proposed methodology is based on the issues related to this community. The aims of this study are as follows:
• We propose to develop a strategy which can determine the best model for opinion/text mining (LDA or NMF) using latent semantic indexing (LSI).
• We propose a method to find out which document is closest to or farthest from the detected topic (LDA-topic and NMF-topic).
• We propose a method to find out which detected topic (LDA-topic and NMF-topic) is closest to the documents.
• We propose to generate an order of documents from the detected topic (LDA-topic and NMF-topic) using LSI, and to make a final decision based on these orders.
We used documents from three domains (medicine, politics, and sports) to apply the proposed methodology and find which topic model is closest to the documents.

Related work
Topic modeling is an unsupervised learning approach to clustering documents in order to discover topics based on their contents. It is very similar to how the k-means and expectation-maximization algorithms work. Because we are clustering documents, we have to process the individual words in each document to discover topics and assign values to each based on the distribution of these words. This increases the amount of data we are working with, so to handle the large amount of processing required for clustering documents, we have to utilize efficient sparse data structures.
Topic modeling is concerned with aspect extraction, entity extraction, keyword extraction, etc. Keyword extraction has been used in a variety of natural language processing applications, such as information retrieval systems, digital library searching, web content management, document clustering, and text summarization (Rose et al., 2010). Topic detection enables the automatic identification of semantic content and the assignment of a topic label to a given document. Although these approaches are highly useful for a large spectrum of applications, only a limited number of documents with keywords are available online (El-Fishawy, 2014). Keyword extraction is also a process of identifying a short list of words or noun phrases that capture the most important ideas or topics covered in a document (Awajan, 2014). In Rammal et al. (2015), the aim was to apply local grammar (LG) to develop an indexing system that automatically extracts keywords from titles of Lebanese official journals. Topic modeling offers a computational tool to find relevant topics by capturing meaningful structure among the collections of documents (Wang et al., 2016). For entity extraction, Pantel et al. (2009) used a distributional similarity method, comparing the similarity of the surrounding words of each candidate entity with those of the seed entities and then ranking the candidate entities based on the similarity values.
When determining the summary of a document or the sentiment of an opinion, it is important to find out whether the selected document contains the required keywords, aspects, or entities (Chinsha and Joseph, 2015; Qi and Chen, 2011; Thakur and Singh, 2015). Recent studies have proposed a novel, rule-based method for extracting an aspect from reviews of products using an unsupervised approach to uncover the polarity of an aspect in different domains (Gindl et al., 2013; Hu and Liu, 2004). Machine learning and NLP-based rules can also provide better solutions for identifying the aspects, topics, and keywords of a paragraph with less effort (Gupta and Ekbal, 2014). One approach uses a classifier trained for each distinct word on a corpus of manually sense-annotated examples, while an entirely unsupervised method clusters the occurrences of words (Raganato et al., 2017). An aspect-based sentiment analysis, which can be carried out by using only particular aspects (Jeyapriya and Selvi, 2015; Gamon et al., 2005; Zhuang et al., 2006; Gojali and Khodra, 2016), requires less effort compared to a sentiment analysis of an object with respect to all aspects. Keyword and topic extraction are not only used in research on the English language, but also in research on other languages, such as Arabic, French, German, Spanish, Chinese, Greek, and Japanese (Pang and Lee, 2008; Tumasjan et al., 2010; Alshammari, 2018). The most important methods used for topic modeling are LDA and NMF (Leek et al., 2000; MacMillan and Wilson, 2017).
In a joint model for sentiment analysis, an aspect-sentiment mixture model was built, based on an aspect (topic) model using LDA and extended LDA (Mei et al., 2007; Lin and He, 2009; Jo and Oh, 2011). A joint model was also proposed in Sauper et al. (2011), which worked only on short snippets already extracted from reviews. Another extension of the joint model is the semi-supervised joint model, where some topics and aspects are detected by providing some seed aspect terms (Mukherjee and Liu, 2012). A method based on probabilistic latent semantic analysis (PLSA) produced a rated aspect summarization of short comments from eBay.com (Lu et al., 2009).
An interdependent LDA (ILDA) has been used to find group aspects and to derive their ratings (Moghaddam and Ester, 2011). Regarding this extension of LDA, Guo et al. (2009) state that "it is a type of multilevel latent semantic association, where at the first level, all the words in aspect expressions (each aspect expression can have more than one word) are grouped into a set of concepts or topics using LDA". There are also studies in which manifold learning methods such as NMF are used to model a robot's multimodal information, where the multimodal information is an observation of the model represented by low-dimensional hidden parameters (Mangin et al., 2015; Chen and Filliat, 2015). Reviews are rated according to an object, so there should be a direct method to determine whether a review is positive or negative. Latent semantic indexing (LSI) is better for such a purpose (Saqib et al., 2016). LSI (Huang et al., 2009) has been used for the clustering of documents and for concept representations. An extended method based on LSI can filter unwanted emails in Chinese and English (Yang and Li, 2005). There are many questions about the LDA and NMF models in the opinion mining research community. Various works have tested the accuracy of LDA and NMF based on the nature of the data. In text/opinion mining, only the topic of a document, which can be determined by LDA or NMF, is used for further processing. These researchers use LDA and NMF as a black-box tool.

Analysis of LDA and NMF using LSI
Using this methodology, we generated a topic from LDA (the LDA-topic) and a topic from NMF (the NMF-topic). We then used LSI by providing the topic as a query and the documents as a list. This method determines the score of each document for the LDA-topic and the NMF-topic. After this, a decision is made by comparing the average LSI score of all documents for the LDA-topic with the average LSI score of all documents for the NMF-topic. The model with the greater average score is considered the better model. This method also generates two lists of document LSI scores in descending order. The first order is based on the LSI scores of each document for the LDA-topic, and the second order is based on the LSI scores of each document for the NMF-topic. In the first order, the topmost score belongs to the document closest to the LDA-topic, and the last score belongs to the document farthest from the LDA-topic. In the second order, the topmost score belongs to the document closest to the NMF-topic, and the last score belongs to the document farthest from the NMF-topic. The whole process is depicted in Fig. 1.
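The decision rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `compare_topic_models` and the four scores per model are hypothetical and are not taken from the study.

```python
from statistics import mean


def compare_topic_models(lda_scores, nmf_scores):
    """Compare two lists of per-document LSI scores (one score per document)."""
    if len(lda_scores) != len(nmf_scores):
        raise ValueError("both score lists must cover the same documents")
    averages = {"LDA": mean(lda_scores), "NMF": mean(nmf_scores)}
    # Document indices sorted from closest (highest score) to farthest.
    order = {
        "LDA": sorted(range(len(lda_scores)), key=lambda i: lda_scores[i], reverse=True),
        "NMF": sorted(range(len(nmf_scores)), key=lambda i: nmf_scores[i], reverse=True),
    }
    # The model with the greater average score is taken as the better one.
    best = max(averages, key=averages.get)
    return averages, order, best


# Hypothetical scores for four documents (illustrative only).
lda = [0.70, 0.90, 0.40, 0.80]
nmf = [0.75, 0.95, 0.35, 0.85]
averages, order, best = compare_topic_models(lda, nmf)
print(best, order["LDA"], order["NMF"])
```

The first index in each order is the document closest to the corresponding topic, and the last index is the farthest one.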

LDA
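LDA, or latent Dirichlet allocation, is a probabilistic model. To obtain cluster assignments, it uses two probability values: P(word | topic) and P(topic | document). These values are calculated based on an initial random assignment, after which the calculation is repeated for each word in each document to decide its topic assignment. In an iterative procedure, these probabilities are calculated multiple times, until the convergence of the algorithm (Chawla, 2017). The algorithm is available in the Advanced Machine Learning course under the topic "Topic Modeling: Latent Dirichlet Allocation". We can describe LDA more formally with the following notation. The topics are $\beta_{1:K}$, where each $\beta_k$ is a distribution over the vocabulary. The topic proportions for document $d$ are $\theta_d$. The topic assignments for document $d$ are $z_d$, where $z_{d,n}$ is the topic assignment for the $n$-th word in document $d$. The observed words for document $d$ are $w_d$, where $w_{d,n}$ is the $n$-th word in document $d$, which is an element from the fixed vocabulary. With this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables in Eq. 1:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \quad (1)$$

The algorithm based on the above equation was implemented in Python using the TfidfVectorizer and CountVectorizer classes of sklearn.feature_extraction.text and the LatentDirichletAllocation class of sklearn.decomposition.

[Fig. 1. Overview of the whole process. Recoverable labels: "Average of All Scores" (once per topic model); "Greatest Average Score is Best Model".]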

NMF
Non-negative matrix factorization is a linear-algebraic model that factors high-dimensional vectors into a low-dimensional representation. Like principal component analysis (PCA), NMF reduces dimensionality; unlike PCA, it takes advantage of the fact that the vectors are non-negative. By factoring them into the lower-dimensional form, NMF forces the coefficients to also be non-negative. Its algorithm is also implemented in "Topic Modelling with LDA and NMF on the ABC News Headlines dataset" by Chawla (2017). NMF is useful in settings where the domain of the data is inherently non-negative and where parts-based decompositions are desired. In general, "NMF seeks a $n \times d$ non-negative matrix $W$ and a $d \times t$ non-negative matrix $H$ so that $V \approx WH$". The matrices $W$ and $H$ are estimated by minimizing the objective function in Eq. 2 (MacMillan and Wilson, 2017):

$$\min_{W \ge 0,\; H \ge 0} \; \lVert V - WH \rVert_F^2 \quad (2)$$

where $\lVert \cdot \rVert_F$ is the Frobenius norm. In topic modeling, $W$ and $H$ have a special interpretation: $W_{ij}$ quantifies the relevance of topic $j$ in document $i$, and $H_{ij}$ quantifies the relevance of term $j$ in topic $i$.

Latent semantic indexing for LDA and NMF
LSI, which was proposed by Deerwester et al. (1990) and Blei et al. (2003), is an efficient information retrieval algorithm (Phadnis and Gadge, 2014). Basically, LSI measures the cosine similarity between the coordinates of a document vector and the coordinates of a query vector. If this value is 1, the document matches the query 100%; if it is 0.9, the document matches the query 90%; and if it is 0.5, the document matches the query 50%. The important step is therefore finding the coordinates of each document and of the query. A singular value decomposition (SVD) can determine these coordinates. The SVD of a matrix yields three matrices, S, V, and U, which are used for further processing. The input matrix consists of rows and columns containing integers and is built from the different text documents: a feature matrix is obtained by calculating the frequency of each word. This means that first, a feature matrix is created from all the documents, and then the SVD is calculated, giving S, V, and U using NumPy (Numeric Python). The coordinates of all the documents are determined from V, and the query coordinates are obtained by combining the query vector with U and the inverse of S. Finally, a cosine similarity function is applied to these coordinates to find the documents that best match the query. The whole process can be expressed using the following equations. Eq. 3 determines the topics of the given documents (DOC) with LDA and NMF.

Eq. 4 determines the LSI scores of each document with the LDA-topic, and Eq. 5 determines the LSI scores of each document with the NMF-topic.
The coordinates of the documents and the coordinates of the topic are now determined by setting term weights and constructing the term-document matrix $DOCS_{mat}$ from all documents and the topic matrix $Topic_{mat}$ from the topic (LDA and NMF). The matrix $DOCS_{mat}$ is decomposed into the matrices $U$, $S$, and $V$ using the singular value decomposition (svd) method of the Numerical Python (numpy) package, as in Eq. 6:

$$U, S, V^{T} = \mathrm{svd}(DOCS_{mat}) \quad (6)$$
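The construction of the term-document matrix and its decomposition can be sketched with NumPy as follows. The helper `term_document_matrix`, the whitespace tokenization, the raw-count term weights, and the three toy documents are assumptions made for illustration; the study does not specify its exact term weighting.

```python
import numpy as np


def term_document_matrix(documents, vocabulary):
    """Rows are terms, columns are documents; entries are raw term counts."""
    return np.array(
        [[doc.lower().split().count(term) for doc in documents] for term in vocabulary],
        dtype=float,
    )


# Toy documents (illustrative only).
docs = [
    "gold silver truck",
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
docs_mat = term_document_matrix(docs, vocabulary)

# numpy.linalg.svd returns U, the singular values as a 1-D array, and V transposed.
U, s, Vt = np.linalg.svd(docs_mat, full_matrices=False)

# Keep the first two dimensions, as in the two-dimensional LSI space.
k = 2
U_k = U[:, :k]
S_k = np.diag(s[:k])
Vt_k = Vt[:k, :]  # each column holds the 2-D coordinates of one document

print(docs_mat.shape, U_k.shape, S_k.shape, Vt_k.shape)
```

Multiplying the full factors back together recovers the original matrix, which is a quick sanity check before truncating to two dimensions.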
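$V_k$ is the matrix formed by extracting the first two columns of $V$. Its transpose, $V_k^{T}$, gives the coordinates of the documents, named $DOCS_{coor}$. The coordinates of a topic can be determined by the product of the transpose of $Topic_{mat}$, $U_k$, and $S_k^{-1}$, as depicted in Eq. 7:

$$Topic_{coor} = Topic_{mat}^{T} \, U_k \, S_k^{-1} \quad (7)$$

where $U_k$ is the matrix formed by extracting the first two columns of $U$, and $S_k^{-1}$ is the inverse of the matrix formed by the first two rows and columns of $S$. Applying Eq. 7 to LDA and NMF separately, we can find the coordinates of the LDA-topic and the NMF-topic using Eq. 8 and Eq. 9:

$$\mathrm{LDA\text{-}Topic}_{coor} = \mathrm{LDA\text{-}Topic}_{mat}^{T} \, U_k \, S_k^{-1} \quad (8)$$

$$\mathrm{NMF\text{-}Topic}_{coor} = \mathrm{NMF\text{-}Topic}_{mat}^{T} \, U_k \, S_k^{-1} \quad (9)$$

The cosine similarity method then finds the score of each document's coordinates $DOCS_{coor}^{(i)}$ with respect to $\mathrm{LDA\text{-}Topic}_{coor}$ in Eq. 10 and with respect to $\mathrm{NMF\text{-}Topic}_{coor}$ in Eq. 11:

$$\mathrm{LSI}_{LDA}(d_i) = \frac{DOCS_{coor}^{(i)} \cdot \mathrm{LDA\text{-}Topic}_{coor}}{\lVert DOCS_{coor}^{(i)} \rVert \, \lVert \mathrm{LDA\text{-}Topic}_{coor} \rVert} \quad (10)$$

$$\mathrm{LSI}_{NMF}(d_i) = \frac{DOCS_{coor}^{(i)} \cdot \mathrm{NMF\text{-}Topic}_{coor}}{\lVert DOCS_{coor}^{(i)} \rVert \, \lVert \mathrm{NMF\text{-}Topic}_{coor} \rVert} \quad (11)$$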
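The projection of a topic into the two-dimensional space and the cosine scoring of each document can be sketched with NumPy as below. The function name `lsi_scores`, the small 5-term by 3-document matrix, and the topic vector are hypothetical. The topic vector is deliberately set equal to the first document's column, so that document should obtain a score of 1.

```python
import numpy as np


def lsi_scores(docs_mat, topic_vec, k=2):
    """Cosine score of every document against a topic in the k-dimensional LSI space."""
    U, s, Vt = np.linalg.svd(docs_mat, full_matrices=False)
    U_k = U[:, :k]
    S_k_inv = np.diag(1.0 / s[:k])
    docs_coor = Vt[:k, :].T                   # one row of coordinates per document
    topic_coor = topic_vec @ U_k @ S_k_inv    # Topic^T U_k S_k^{-1}
    norms = np.linalg.norm(docs_coor, axis=1) * np.linalg.norm(topic_coor)
    return docs_coor @ topic_coor / norms


# Hypothetical term-document matrix: 5 terms (rows) x 3 documents (columns).
docs_mat = np.array(
    [
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
        [0.0, 1.0, 1.0],
        [0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0],
    ]
)
# Hypothetical topic vector over the same 5 terms; it equals document 0's column.
topic_vec = np.array([1.0, 1.0, 0.0, 0.0, 1.0])

scores = lsi_scores(docs_mat, topic_vec)
ranking = np.argsort(scores)[::-1]  # closest document first
print(np.round(scores, 4), ranking)
```

Running the same function once with the LDA-topic vector and once with the NMF-topic vector would give the two score lists that are averaged and ranked in this study.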

Results
We took 10 documents each from the political, sports, and medicine domains, each consisting of approximately 50 words. After applying LDA and NMF to these documents, the topics generated (as shown in Table 1) were similar, but there were some differences. LDA and NMF generate more than one topic, but we only considered the first topic in our analysis. Each topic was no more than eight words.

Experimental results
To check the accuracy of LDA and NMF, we applied LSI to the documents from each domain. Table 2 shows the LSI scores of the LDA-topic and NMF-topic for each medical document. The average NMF score (0.725605836) is greater than the average LDA score (0.716558265). This means that the NMF-topic is a better match across all documents than the LDA-topic. The difference between the average scores is 0.0090, which we consider significant. Table 3 shows the LSI scores of the LDA-topic and the NMF-topic for each political document. The average NMF score (0.898202252) is greater than the average LDA score (0.886923704). This means that the NMF-topic is a better match across all documents than the LDA-topic. The difference between the average scores is 0.0113, which we consider significant. Table 4 shows the LSI scores of the LDA-topic and NMF-topic for each sports document. The average NMF score (0.778685937) is lower than the average LDA score (0.779071217). This means that the LDA-topic is a better match across all documents than the NMF-topic. But the difference is only 0.00039, which we do not consider significant. Fig. 2 illustrates the comparison of the LDA and NMF methods on documents from the different domains. Fig. 2 clearly shows that the lines for the NMF and LDA methods are very close to each other, with only a small difference.
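The differences between the reported averages can be checked with a few lines of Python. The averages below are the ones reported in this section; the dictionary layout and variable names are just an illustrative choice.

```python
# Reported average LSI scores per domain: (LDA average, NMF average).
reported = {
    "medicine": (0.716558265, 0.725605836),
    "politics": (0.886923704, 0.898202252),
    "sports": (0.779071217, 0.778685937),
}

differences = {}
for domain, (lda_avg, nmf_avg) in reported.items():
    diff = nmf_avg - lda_avg  # positive means NMF has the greater average
    differences[domain] = diff
    better = "NMF" if diff > 0 else "LDA"
    print(f"{domain}: NMF - LDA = {diff:+.5f} -> {better}")
```

A positive difference favors NMF and a negative one favors LDA, which matches the per-domain reading given above.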

Conclusion and discussion
From the above analysis, it is clear that both methods work well for topic detection, but NMF-generated topics are slightly closer to the documents. After arranging these scores in descending order, the documents are arranged in almost the same order for both topics. In the medical domain, d[6] was very close to both the LDA-topic and the NMF-topic, while d[2] was far from both, as shown in Table 5. In the political domain, d[5] was far from both, as shown in Table 6. In this domain, only three documents were in a different order; the rest of the documents were in the same order. In the sports domain, d[5] was very close to both the LDA-topic and the NMF-topic, while d[1] was far from both, as shown in Table 7. From Table 5, Table 6, and Table 7, it is clear that both the LDA and NMF topics had the same relevancy for each document, but as a whole, the average LSI score of NMF was greater than that of LDA. After the analysis of both models based on scores generated using LSI, it is very hard to determine the best model for topic detection in text mining, but NMF can be considered slightly better than LDA, as depicted in Fig. 3.

Limitations and future work
We used datasets from only three domains; further domains could also be analyzed. We also limited each document to 50 words and each topic to 8 words, which could be increased or decreased. More than one topic can be generated using both LDA and NMF; in this study, we used only the first topic. Furthermore, the other topics themselves could also be considered for analysis. We used well-written documents without noise, whereas if we want to detect topics from user reviews, there is still a need for further study, as user reviews contain mistakes, omitted words, incomplete sentences, and misspellings.