Text mining: A survey of Arabic root extraction algorithms

Arabic is the official spoken and written language of all Arab countries and is one of the oldest known languages. This paper explains and discusses the work done on Arabic root extraction and stemming algorithms. Text mining has attracted the interest of scientists, researchers, and users because big data and deep learning algorithms make it possible to analyze very large sets of unstructured data. Stemming algorithms are basic components of text extraction and classification, information retrieval systems, and indexers, and they have been developed to extract word roots in many natural languages. Many papers and articles address extracting the root from an Arabic word. This paper gives a brief background and a comprehensive presentation of a number of algorithms that extract the root or stem of an Arabic word, covering light, heavy, hybrid, novel, and Markovian stemmers. It then compares and discusses selected algorithms in terms of accuracy, dataset, and stemming method, with attention to their strengths and weaknesses.


The Arabic language is a complex language based on a root-and-pattern system. Natural language processing covers many applications, such as text processing, part-of-speech tagging, information retrieval, and machine translation; the most important topic in this context is root extraction and stemming.
Stemming algorithms are used in text mining, text classification, information retrieval systems, and indexers, and many have been built for different natural languages. We give an overview of several algorithms in this field, including light, heavy, hybrid, novel, and Markovian Arabic stemmers. Many of the surveyed papers compare their own stemmer with the Khoja stemmer (Khoja and Garside, 1999), which is the standard Arabic stemmer.
Many Arabic light stemmers have been developed for Information Retrieval (IR) systems and are used in multiple applications and projects. Researchers choose which stemmer to use in a project after evaluating and comparing the available stemmers.

Related works
In this section, we present and compare multiple papers and studies in data mining, text mining, Arabic morphological analysis, and Arabic root extraction and stemming.
A literature survey lets researchers and readers investigate a specific subject by selecting high-quality articles or studies that are relevant and significant and summarizing them in one complete report. It also gives a good starting point for researchers entering a new area, because it requires them to summarize, assess, and compare the original research in that area, and it helps ensure that they do not duplicate work that has already been done.
It can indicate where future research is heading or recommend areas on which to focus, and it provides a useful analysis of the methods and approaches of other researchers. The related works are discussed below.
Boudlal et al. (2011) presented an Arabic morphological analysis system that assigns to each word of an unvoweled Arabic sentence a unique root depending on the context. The system consists of two modules. The first performs an out-of-context analysis: it segments each word of the sentence into its elementary morphological units (prefix, stem, and suffix) to identify its possible roots. The second module uses the context to identify the correct root among all the possible roots of the word. For this purpose, they use a Hidden Markov Model approach in which the observations are the words and the possible roots are the hidden states. They validated the approach on the NEMLAR Arabic Written Corpus, which consists of 500,000 words.
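As a rough illustration of the second module, the sketch below runs Viterbi decoding over the candidate roots of each word, with words as observations and roots as hidden states. It is not the implementation of Boudlal et al. (2011): the function name `viterbi_roots`, the transliterated toy words, and all probabilities are assumptions made for this example, and unseen events simply receive a small floor probability.

```python
import math


def viterbi_roots(words, candidates, start_p, trans_p, emit_p, floor=1e-6):
    """Pick one root per word so that the whole root sequence is most probable.

    candidates[word] lists the possible roots found by an out-of-context analysis.
    start_p[root], trans_p[(prev_root, root)] and emit_p[(root, word)] are
    probabilities; missing entries fall back to `floor` (an assumed smoothing).
    """
    if not words:
        return []

    def lp(table, key):
        return math.log(table.get(key, floor))

    # Each layer maps a root to (best log-probability, best previous root).
    layers = [{r: (lp(start_p, r) + lp(emit_p, (r, words[0])), None)
               for r in candidates[words[0]]}]
    for w in words[1:]:
        prev = layers[-1]
        layer = {}
        for r in candidates[w]:
            best_prev = max(prev, key=lambda q: prev[q][0] + lp(trans_p, (q, r)))
            score = prev[best_prev][0] + lp(trans_p, (best_prev, r)) + lp(emit_p, (r, w))
            layer[r] = (score, best_prev)
        layers.append(layer)

    # Backtrack from the best final root.
    root = max(layers[-1], key=lambda r: layers[-1][r][0])
    path = [root]
    for layer in reversed(layers[1:]):
        root = layer[root][1]
        path.append(root)
    return path[::-1]


# Toy data (hypothetical): "qAl" is ambiguous between the roots "qwl" and "qyl".
WORDS = ["qAl", "AlkAtb"]
CANDIDATES = {"qAl": ["qwl", "qyl"], "AlkAtb": ["ktb"]}
START = {"qwl": 0.6, "qyl": 0.4}
EMIT = {("qwl", "qAl"): 0.7, ("qyl", "qAl"): 0.3, ("ktb", "AlkAtb"): 0.9}
TRANS = {("qwl", "ktb"): 0.5, ("qyl", "ktb"): 0.1}

print(viterbi_roots(WORDS, CANDIDATES, START, TRANS, EMIT))
```

With these toy numbers the call returns `['qwl', 'ktb']`; changing the transition probabilities can flip the choice for the first word, which is the sense in which the context decides the root.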

Root extraction algorithm
The system of Boudlal et al. (2011) selects the correct root for more than 98% of the words in the training set and for almost 94% of the words in the testing set, so it generally performs very well at choosing the correct root of a word.
The results of the tests carried out on the two parts of the system are encouraging. They can be improved by further study of hamzated words in the out-of-context analysis and by using a larger corpus in the Markovian approach. The authors extend the work of the first module so that it produces other tags for the words (noun, verb, particle, adjective, adverb, and possible vowelizations).
They use a suitable variant of the Markovian approach to identify the best vowelization of the word in context.
Al-Kabi et al. (2015) presented a novel light and heavy Arabic stemmer and compared it with two well-known Arabic stemmers. The results showed that the accuracy of the proposed stemmer is slightly higher than that of those two stemmers: tests on the novel stemmer yield an accuracy of 75.03%, while the two other Arabic stemmers yield lower accuracy.
They proposed, implemented, and evaluated the new stemmer. Preliminary experimental results showed satisfactory accuracy for root prediction. They compared their stemmer with two Arabic stemmers by applying all three to the same dataset to produce Arabic roots from words. The stemmer has three stages: stage 1 removes prefixes and suffixes, stage 2 compares the output with standard word sources or patterns, and stage 3 corrects the extracted root.
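As an illustration only, the three stages described above can be sketched in Python. The affix lists, the pattern list, and the correction table below are small hypothetical samples chosen for this sketch, not the resources used by Al-Kabi et al. (2015). The letters ف, ع, and ل in a pattern mark the root-letter slots, following the usual convention of Arabic morphology.

```python
# Hypothetical sample resources; a real stemmer would use much larger lists.
PREFIXES = ["ال", "و"]
SUFFIXES = ["ون", "ات", "ة"]
PATTERNS = ["فاعل", "مفعول", "فعال", "فعل"]
ROOT_SLOTS = "فعل"            # pattern letters that stand for root letters
CORRECTIONS = {"قال": "قول"}  # assumed stage-3 fix for one weak (hollow) root


def strip_affixes(word):
    """Stage 1: remove at most one listed prefix and one listed suffix,
    keeping at least three letters."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word


def match_pattern(word, pattern):
    """Stage 2: return the letters found in the root slots, or None if the
    literal (non-slot) letters of the pattern do not match the word."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in ROOT_SLOTS:
            root.append(w)
        elif w != p:
            return None
    return "".join(root)


def extract_root(word):
    """Run the three stages in order; return None if no pattern matches."""
    stem = strip_affixes(word)
    for pattern in PATTERNS:
        root = match_pattern(stem, pattern)
        if root is not None:
            return CORRECTIONS.get(root, root)  # Stage 3: correct the root
    return None


print(extract_root("الكاتب"), extract_root("مكتوب"), extract_root("قال"))
```

In this sketch the quality of the result depends entirely on the coverage of the pattern list and the correction table, which is consistent with the survey's later remark that adding patterns improves accuracy.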
The results showed that their algorithm is more accurate than the other two Arabic stemmers in most cases across different word lengths.
Alkhatib et al. (2017) proposed a novel methodology for building an Al-Hadith Al-Shareef linguistic WordNet that serves various Arabic natural language processing tasks. In particular, they build semantic relations between words in order to achieve a good understanding of the meanings of the words in Al-Hadith. Their method uses the ontology of Al-Hadith and traditional Arabic lexicons.
The capability of this approach was demonstrated with a classifier that they built to evaluate the method. The classifier was applied to around 8500 synsets that include 6126 nominal, 310 adjectival, 1990 verbal, and 71 adverbial expressions.
Taghva et al. (2005) implemented an Arabic stemmer for root extraction that is similar to the Khoja stemmer but does not use a root dictionary. It is a light stemmer and was applied to the Arabic TREC-2001 collection.
The test results showed that an Arabic stemmer does not need stem lists. Larkey et al. (2002) found that, in general, a light stemmer that removes affixes without pattern checking, such as the proposed one, outperforms more complex stemmers and the Khoja stemmer, which uses stem lists and pattern checking.
Kreaa et al. (2014) proposed a new stemming system (the AKK stemmer) for Arabic words that combines light stemming with table lookup to handle the problem of broken plurals, which are the irregular plural nouns of Arabic.
The AKK stemmer gives accurate results in comparison with other algorithms. In their comparison of algorithms, they note that there are two general techniques: extracting the root of the word, as the Khoja stemmer does, or truncating affixes, as a light stemmer does. The first technique has several problems; mainly, the root dictionary requires maintenance.
Ababneh et al. (2012) presented a new rule-based light stemmer. Their algorithm uses a set of rules to decide whether a particular sequence of characters is part of the original word or not, which helps resolve some ambiguity problems. They also presented a way of handling most broken plural forms and reducing them to their singular forms, so that words of the same meaning are grouped under a common form.
Al-Omari et al. (2013) introduced an algorithm that finds the Arabic word root. It applies a set of mathematical rules and relations between letters in Arabic light stemming and is called the Arabic Rule-Based Light Stemmer (ARBLS). It was tested and compared with the Khoja stemmer, and the results showed a large margin in favor of ARBLS. However, it still needs improvement before it can correctly extract the roots of a large number of Arabic words.
Otair (2013) explained the definitions of hybrid stemming approaches, analyzed them, and then summarized the main characteristics of the Arabic language.
That paper compares most of the commonly used light stemmers in terms of affix lists, algorithms, main ideas, and information retrieval performance. The results show that the Light 10 stemmer outperformed the other stemmers.
Al-Lahham et al. (2018) examined the Light 10 stemmer, which is the best among a group of light stemmers. It defines a table of prefixes and suffixes and gives good retrieval performance. Light 10 places no restrictions on affix removal, so two distinct terms with different meanings can be reduced to the same stem. The authors propose adding more affixes to the table and imposing conditions on removing them. Tests of the proposed method show higher quality than the Light 10 stemmer.
The proposed stemmer removes affixes only if they satisfy one or more of a set of proposed conditions. Using the proposed light stemmer shows that adding a few conditions to light stemmers improves retrieval at the lower recall levels.
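The idea of removing an affix only when a condition holds can be sketched as follows. The affix lists are the ones commonly quoted for Light 10 and should be treated as an assumption here, and the single condition used (a minimum remaining length, `min_len`) is only an illustrative stand-in for the conditions proposed by Al-Lahham et al. (2018), which the survey does not list.

```python
# Affix lists commonly quoted for Light 10 (assumed here, longest prefixes first).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip one prefix and then any number of suffixes, but only while the
    remaining string keeps at least `min_len` letters (illustrative condition)."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break

    stripped = True
    while stripped:
        stripped = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) - len(s) >= min_len:
                word = word[:-len(s)]
                stripped = True
                break
    return word


for w in ["والكتاب", "المعلمون", "وهم"]:
    print(w, "->", light_stem(w))
```

The last word shows the effect of the condition: the leading و of "وهم" is kept because removing it would leave only two letters. Stricter or word-class-specific conditions would slot into the same two `if` tests.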
El-Defrawy et al. (2015) evaluated different Arabic stemmers through a series of comparisons on a manually annotated dataset, which shows the performance of Arabic stemmers and points out potential enhancements to existing stemmers. They also presented improved root extractors that use light stemmers as a preprocessing stage.
The results showed a relationship between linguistic accuracy and the other measures: when linguistic accuracy increases, the other related measures increase as well. The more Arabic stemmers exist, the richer stemming analysis becomes. Every stemmer has its own strengths and weaknesses, and the weaknesses can be reduced by combining several stemmers in effective ways.
Yaseen and Hmeidi (2014) presented a new algorithm called WSS. It does not remove any affixes. Instead, it creates the set of all substrings of an Arabic word and applies a set of rules to extract the root using these substrings, a file of Arabic roots, and a file of Arabic patterns. The algorithm was tested on the Holy Quran, and its accuracy is 83.9%. The stemmer is considered competitive, and the accuracy can be improved by up to 9.9% after further tests. In most cases, the WSS algorithm retrieved two candidates for the correct root.
Almusaddar (2014) focused on improving Arabic information retrieval by improving the light stemming and preprocessing stages, contributing to the open-source community, and building rules for Arabic normalization and stop-word removal. To achieve these goals, he built a GUI toolkit that performs the preprocessing stage that is essential for information retrieval. One of these steps is normalization, which he improved by presenting a set of rules that other researchers can apply and extend.
The next preprocessing step that he improved is stop-word removal. He presented two different stop-word lists: the first is an intensive stop-word list for reducing the size of the index and removing ambiguous words, and the second is a light stop-word list that gives better recall in information retrieval applications. He introduced the use of Arabized words: 100 manually collected words that should not follow the stemming rules, because they came into Arabic from other languages. He showed how this improves the results compared with two well-known stemming algorithms, the Khoja and Larkey stemmers. The proposed toolkit was integrated with a popular IR platform, the Terrier IR platform. He used the TF-IDF scoring model from Terrier and tested the results on the OSAC datasets. In this way he took an existing open-source application that already supports other languages and added Arabic language support to it. He improved the preprocessing step, which affects the results of any IR system. The proposed GUI toolkit has many options, including reading and writing dataset files, showing output in tables, and generating statistics about the preprocessing steps. This toolkit might be regarded as a first step toward a standard and could be widely adopted in Arabic language preprocessing and Arabic IR systems. Using UTF-8 is an important option, especially for communicating with other systems. The light stop-word list contains 119 words, and the intensive stop-word list, which was manually collected and merged from three other stop-word lists, contains 13957 stop-words. These lists give researchers more options to test and improve the effect of stop-word removal on different applications such as text classification (TC) or IR.
He introduced the use of Arabized words and made clear that these words must not be subjected to any Arabic stemming rules, since they are not Arabic words. He collected 100 Arabized words and incorporated them into the improved light stemming algorithm so that they are left unstemmed.
He compared his approach with two well-known Arabic stemmers on a dataset containing Arabized words: the Larkey stemmer failed on 32 Arabized words and the Khoja stemmer failed on 5. The proposed toolkit improves preprocessing for IR systems and makes it easy to build and combine different components. Terrier IR now supports the Arabic language through the proposed toolkit and offers a wide range of options for preprocessing data before indexing it.
Larkey et al. (2002) presented several light stemmers, as well as a stemmer based on co-occurrence analysis, for Arabic retrieval. The retrieval effectiveness of the proposed stemmers and of a morphological analyzer was compared on the TREC-2001 data. The best light stemmer was more effective for cross-language retrieval than a morphological stemmer that tried to find the root of each word. A repartitioning process consisting of vowel removal followed by clustering using co-occurrence analysis produced stem classes that were better than no stemming or very light stemming, but still inferior to good light stemming or morphological analysis. The study shows improvements of around 100% in average precision due to stemming and related processes, and an even larger effect for dictionary-based cross-language retrieval. The dictionary was derived from an online dictionary and therefore contained far fewer distinct word forms; without stemming, the dictionary translations of query terms were unlikely to match the forms found in documents.
Boudchiche and Mazroui (2018) described an Arabic root extraction system that gives the root of each word of a given sentence. Root extraction is an important tool for many natural language processing applications, such as search engines, text classification, and information retrieval. The extraction method used in this work runs in two steps.
The first step finds all the possible roots of each word using the morphological analyzer Alkhalil Morpho Sys 2. The second step applies a disambiguation approach based on continuous quadratic splines to select from these roots the one that matches the context of the word. They obtained encouraging results, with an accuracy of 96%.
Sameer (2016) proposed a modified light stemming algorithm for the Arabic language that relies on an understanding of Arabic morphology. According to the tests, it is reliable and accurate for words of different lengths and with different affixes, but it wrongly stems proper names and foreign words.
Boudad et al. (2018) surveyed the major works on sentiment analysis in Arabic. This review showed that Arabic sentiment analysis has become a research area that has drawn the attention of many researchers. Analysis of these works showed that three kinds of approaches, namely supervised, unsupervised, and hybrid, were used to handle a variety of sentiment analysis tasks.
Alhanini and Aziz (2011) studied how to improve finding the stem of Arabic words by combining a light stemmer with a dictionary. The enhanced stemmer also handles named entity recognition and word expressions. They used an Arabic corpus of ten documents to evaluate the enhanced stemmer, and they reported the accuracy of the enhanced stemmer, the light stemmer, and the dictionary-based stemmer on each document. The average accuracy of the enhanced stemmer is 96.29%. The tests show that the enhanced stemmer achieves the highest accuracy values and is superior to both the dictionary-based stemmer and the light stemmer.
Albogamy and Ramsay (2016) presented a light stemmer for Arabic tweets. The results improve on the performance of some well-known Arabic stemmers. The new stemmer does not depend on any root dictionary, which is very important for stemming Arabic tweets because they have a very open vocabulary. It has two stages: stage 1 generates a list of all possible stems using a grammar, and stage 2 selects the shortest stem as the correct stem. They compared the new stemmer with three Arabic stemmers, one of which uses almost the same approach as the new stemmer. The results showed that its accuracy is better than that of the other three stemmers.
Momani and Faraj (2007) proposed a novel algorithm for extracting Arabic trilateral roots. Words that have no root are filtered out. The algorithm then removes the suffixes and prefixes and, after sorting the letters of the term, removes any repeated letters that are found in the Arabic word "sāltmwnyhā" (a mnemonic for the Arabic affix letters). Letter removal continues until three letters remain. Finally, the remaining letters are arranged according to their order in the original word. Two Arabic text documents were chosen for a performance test. In a test on 1500 words, the algorithm produced the correct root with an accuracy of 73%.
Al-Kabi (2013) described the standard Arabic stemmer, the Khoja stemmer, and explained its deficiencies. He compared his study with various other studies and found that the Khoja stemmer is better than the other stemmers assessed in his study. His stemmer and the Khoja stemmer are both based on patterns (also called forms or weights). Adding more patterns increases the accuracy by 5%.
Alshalabi (2005) presented a technique for extracting the triliteral Arabic root from an unvocalized Arabic corpus. It provides an efficient way to remove suffixes and prefixes from inflected words. It then matches the resulting word against the available patterns to find the suitable one and extracts the three letters of the root by removing all infixes in that pattern. This technique does not use any dictionary to check the resulting stem. Several rules are described that help decide whether letters belong to the root or not. The algorithm was tested on a corpus of 72 abstracts (10582 words) from the Saudi Arabian National Computer Conference, and its accuracy is approximately 92%.
Khafajeh et al. (2018) developed a hybrid method for extracting Arabic word roots. The proposed method relies on an optimization function, which improves the n-gram method by applying a set of non-morphological rules. The method was tested on a dataset containing more than 6000 known words belonging to 141 different roots. The results show a marked improvement after applying the hybrid method: it correctly extracts approximately 99% of triliteral strong roots and almost 86% of triliteral roots containing vowel letters.
The proposed method uses a multi-objective function with a statistical algorithm for finding Arabic roots. The multi-objective function is used to avoid getting trapped in the same roots by finding the best-quality solution, calculated using the newly proposed constraints.
Hawas (2013) showed how to assign a unique root without relying on a database of word roots, a list of all the prefixes and suffixes of Arabic words, or a list of word patterns. The approach tries to predict the positions of the root letters one by one, based on a few rules and relations among the letters of the word and their positions within the word. The paper focuses on two parts of the approach. The approach was evaluated on the words of the Holy Quran, and the results indicate a promising root extraction algorithm.
This approach consists of two stages. The first stage shows the ability to discover relations between a letter and its position within the word. The evaluation of this stage indicates a promising root extraction algorithm. The second stage is an evaluation of the classification of Arabic letters: it tests the classification of the letters of Arabic words. A comparison is made between the root letters produced for each word and the ones stored in the roots file, taking into account that the system is in its first stage. If a full match or a partial match is found, the root analysis is considered correct. If at least one letter produced by analyzing the tested word is wrong, the root analysis is considered incorrect.
AbuSafiya (2017) described a new method based on two stages. The first is a generation stage, which produces an initial set of candidate roots. The second is a filtering stage, in which the root set produced in the first stage is filtered to remove wrong roots. The main strength of this approach is that it handles the various irregularities of Arabic morphology by generating all possible roots, including those with removed, flipped, or repeated letters. Another advantage is that it is easy to implement and can be extended to deal with new words of new derivation forms. This can be done by adjusting the generator to include new roots in case the correct root is not produced in the generation stage, and by adding new filters to exclude wrong roots. This makes the system simple to develop and improve.
Elazhary and Khodeir (2017) proposed a new approach called Art (Arabic word Root Extraction Tutor). It is a cognitive tutor intended to teach students the production rules required for Arabic word root extraction. It works in two modes, active and passive, and combines several techniques for improved teaching.
It gives positive feedback for a correct answer and negative feedback otherwise.
Hajjar and Zreik (2010) presented a new system that evaluates the performance of several Arabic root extraction algorithms. The methods used in this system were chosen according to a previous ranking in which these methods are classified into five groups, and one method was chosen from each group. These methods are: Arabic stemming without a root dictionary, light stemmer, N-gram based on a dissimilarity coefficient, MT-based Arabic stemmer, and N-gram based on a similarity coefficient. The evaluation was conducted on the same terms, a corpus of two thousand words and their roots taken from the Arabic lexicon "Lesan Al-Arab." The system works in two modes: normal and automatic.
The idea of this system is to apply these methods in cascade on the same input in order to investigate the effectiveness of each combination of methods and to compare the performance of this new approach with the combinations of methods already developed. In addition, the system can be published as a web site to allow everyone to choose their own corpus and to produce a joint evaluation of these methods.
Al-Kabi et al. (2011) presented an evaluation of four heavy, root-based Arabic stemmers. The evaluation showed the strength and accuracy of these stemmers.
Jaafar et al. (2017) presented a new Arabic stemmer that provides solutions to some incorrect results. They also evaluated and compared Arabic stemmers using metrics related to the accuracy of the results as well as to the execution time of the stemmers. The results show that the new stemmer achieves the highest accuracy rate, 33.7%, and takes second position in terms of the Gs-Score metric, with 0.1.
Nehar et al. (2016) presented a new Arabic root extraction approach for text classification that uses transducers and rational kernels. It proposes using Arabic pattern-based stemmer patterns: transducers are used to model these patterns, and root extraction was carried out on three word collections without using a dictionary, with an accuracy of 75.6%. Classification experiments were carried out on the Saudi Press Agency dataset, and N-gram kernels were tried with different values of N. The reported accuracy and F1 are 90.79% and 62.93%, respectively. These results indicate that this approach achieves higher accuracy and F1 than other approaches.
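To make the notion of an N-gram kernel concrete, the sketch below computes it by plain counting: the kernel value of two documents is the sum, over all shared N-grams, of the product of their counts. This is a simplified stand-in for illustration and not the transducer-based rational-kernel implementation of Nehar et al. (2016); the function names and the transliterated root sequences are assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_kernel(doc_a, doc_b, n):
    """N-gram kernel: sum over shared n-grams of count_a * count_b."""
    counts_a = ngram_counts(doc_a, n)
    counts_b = ngram_counts(doc_b, n)
    return sum(counts_a[g] * counts_b[g] for g in counts_a.keys() & counts_b.keys())


# Documents represented as sequences of extracted roots (toy, transliterated).
doc_a = ["ktb", "drs", "ktb", "drs"]
doc_b = ["ktb", "drs", "elm"]

for n in (1, 2):
    print(n, ngram_kernel(doc_a, doc_b, n))
```

Representing each document by its extracted roots rather than its surface words is what lets differently inflected forms of the same root count as the same N-gram; the resulting kernel values could then be passed to any kernel-based classifier.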

Related papers in text mining and data mining
Bharati and Ramageri (2010) examined some data mining techniques and algorithms, as well as some organizations that have adopted data mining technology to improve their businesses and obtained good results. They explain techniques such as artificial intelligence, clustering, regression, classification, neural networks, association rules, genetic algorithms, decision trees, and the nearest neighbor method, which are used for knowledge discovery from databases. Many of these organizations combine data mining with statistics, pattern recognition, and other important tools. Data mining can be used to discover patterns and connections that would otherwise be difficult to find. The technology is popular with many businesses because it allows them to learn more about their customers and make smart marketing decisions. FBTO Dutch Insurance Company, Provident Financial's Home Credit Division in the United Kingdom, and Standard Life Mutual Financial Services Companies are some of the companies presented in that paper as using data mining technology to find solutions to business problems.
Gridach and Chenfour (2011) described an approach to Arabic morphological analysis called the Arabic Morphological Automaton (AMAUT). They evaluated the approach against the Xerox Arabic Morphological Analyzer and the Arabic Morphological Analyzer by Otakar Smrz, because these are considered the most referenced approaches to Arabic morphological analysis and are available for research and evaluation. Using the Arabic morphological automaton has several advantages: it makes the system portable and reusable, since it is developed using the Java language and XML technology, and it makes the morphological analyzer efficient and very fast.
For the development of the lexicon, they used the XMODEL language for representing, designing, and implementing the lexical resource.
Alsaad and Abbod (2014) presented an improved root extraction algorithm for Arabic words that is based on morphological analysis and linguistic constraints. The algorithm removes prefixes and suffixes and then checks the word against a predefined list of patterns. In addition, several root extraction problems are handled by identifying linguistically based rules to replace, eliminate, or duplicate certain letters where required. The authors collected Arabic words from an online Arabic corpus and then ran experiments and tests on them. The results and accuracy of the algorithm were assessed by human judgment.
Saad and Ashour (2010) described and surveyed the existing common Arabic stemming and light stemming algorithms. They implemented and integrated Arabic morphological analysis tools into the leading open-source machine learning and data mining tools, Weka and RapidMiner.
Salloum et al. (2018) presented a broad survey of several studies related to Arabic text mining, with a focus on the Holy Quran, sentiment analysis, and web documents, and their implementation. The survey discusses recent developments in the field of intelligent computing and gives a complete summary of the existing text mining methods that can be used to extract logical patterns from grammatically incorrect and unstructured textual data.

Discussion
In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base, or root. Many algorithms for finding the stem or root of an Arabic word were described above, and the techniques used to extract the root can be classified as follows:
1. Root extraction using light stemmers.
2. Root extraction using light and heavy stemmers.
3. Root extraction using rules and based on a dictionary.
4. Root extraction using rules without a root dictionary.
5. A Markovian approach for Arabic root extraction.
6. Pattern-based stemmer for finding Arabic roots.
7. Root extraction without removing affixes.
8. Root extraction using rational kernels and text classification.
We notice that the stemming algorithms are built on different techniques and that some of them depend on the Khoja stemmer, which is widely known and used. Table 1 shows the accuracy of the stemming algorithms.
Table 1 compares eleven Arabic root extraction algorithms in terms of accuracy, technique used, and type of dataset. A short description of these algorithms was given in section two. The algorithms were tested on words of the Holy Quran, Arabic newspapers, websites, the Arabic TREC-2001 corpus, and the NEMLAR Arabic Written Corpus as datasets.
The results obtained by the root extraction algorithm proposed by Alsaad and Abbod (2014) are promising, and the algorithm is worth applying in different Arabic language processing programs. The WSS approach takes third position among the algorithms, so its accuracy is competitive. Ghawanmeh's stemmer (Ghawanmeh et al., 2009) has an accuracy of 95% and takes first position, whereas the stemmer of Taghva et al. (2005) has an accuracy of 38%, which is the lowest. The stemmer of Boudlal et al. (2011) achieves 93.81% on the testing set and 98% on the training set.
The Khoja stemmer begins by removing diacritics, punctuation, and non-letter characters from the input word. It then defines a set of paths based on word length and lets each word follow the corresponding path. Next, it identifies and removes prefixes and suffixes and applies a set of linguistic rules. Finally, it validates the extracted root against a dictionary of roots and stops if the root is correct. If the extracted root is incorrect, the stemmer continues searching for other root possibilities. If an exhaustive search does not find the root, the word is marked as unstemmed. The Khoja stemmer does not explore all linguistic possibilities, and some patterns are missing, which affects its accuracy.
The algorithms of Al-Kabi (2013), Taghva et al. (2005), and Yaseen and Hmeidi (2014) normalize words by removing affixes. This is an extra process that leads to a major drawback, namely producing the wrong root, because the stemmer cannot distinguish between extra (non-root) letters and root letters. Another drawback is the overhead incurred when extracting the root. The WSS-based algorithm reduces root extraction time because it does not remove affixes.
Al-Sarhan's stemmer (Al-Sarhan, 2003) assigns weights (real numbers between 0 and 5) and ranks (the order of a letter in a word) to the letters of words. The weights are determined by tests on Arabic text; the rank of each letter is then multiplied by its weight, and the three letters with the smallest products are chosen as the three-letter root. This stemmer depends on the correctness of the letter weights and of the formula it uses, which could be tested by running evaluations on a good dataset.
The algorithm of Alshalabi (2005) shows an accuracy of about 92%. It normalizes the corpus by removing stop words, determiners, prefixes, and suffixes and then reduces the inflected word. Some rules are added for removing suffixes, and some constraints are defined, but these rules and constraints cannot hold for all words.
The Markovian approach of Boudlal et al. (2011) deals with unvoweled words only. It gives the correct root for more than 98% of the training set and about 94% of the testing set.
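The weight-and-rank selection attributed above to Al-Sarhan (2003) can be sketched in a few lines. The weight table below is entirely hypothetical (the survey gives no values), the default weight for unlisted letters is an assumption, and returning the chosen letters in their original order is also an assumption of this sketch.

```python
# Hypothetical letter weights in the range 0-5: letters that often act as
# affixes or infixes get high weights, so they are less likely to be kept.
WEIGHTS = {"ا": 5.0, "و": 5.0, "ي": 3.5, "ة": 5.0}
DEFAULT_WEIGHT = 1.0  # assumed weight for letters not listed above


def weighted_root(word, weights=WEIGHTS, default=DEFAULT_WEIGHT):
    """Keep the three letters with the smallest rank * weight products.

    The rank is the 1-based position of the letter in the word. The chosen
    letters are returned in their original order (an assumption).
    """
    scored = [(rank * weights.get(ch, default), rank, ch)
              for rank, ch in enumerate(word, start=1)]
    chosen = sorted(scored)[:3]          # three smallest products
    chosen.sort(key=lambda item: item[1])  # restore original letter order
    return "".join(ch for _, _, ch in chosen)


print(weighted_root("كاتب"))
```

Because the rank multiplies the weight, letters late in a long word are penalized even when their weight is low, so the result is only as good as the weight table and the formula, which matches the dependence noted in the discussion above.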
Many algorithms can be improved by adding extra rules and by combining the strengths of two or more strong stemmers, both to handle special cases that do not follow the rules and to find the correct roots that weaker stemmers extract incorrectly.

Conclusion
The growth of Arabic internet content in recent years has raised the need for effective stemming techniques for the Arabic language. Arabic stemming algorithms can be classified into three categories: the root-based approach (e.g., Khoja), the stem-based approach (e.g., Larkey), and the statistical approach (e.g., N-gram).
In this paper, we have presented and discussed many papers and articles on Arabic root extraction, data mining, and text mining and explained their strengths and weaknesses.
A survey of related work serves many purposes, several of which relate directly to reviewing: the person handling a submission uses the referenced papers to identify good reviewers; reviewers look at the references to confirm that the submission cites the appropriate work; everyone uses the section to understand the paper's contributions given the state of existing research; and future researchers look to the Related Work section to identify other papers they should read.