International Journal of Advanced and Applied Sciences
Int. j. adv. appl. sci.
EISSN: 2313-3724
Print ISSN: 2313-626X
Volume 4, Issue 8 (August 2017), Pages: 112-122
Title: SVM significant role selection method for improving semantic text plagiarism detection
Author(s): Ahmed Hamza Osman *, Omar M. Barukab
Affiliation(s):
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21911, Saudi Arabia
https://doi.org/10.21833/ijaas.2017.08.016
Abstract:
This research introduces an approach for predicting and detecting plagiarized text based on Semantic Role Labeling (SRL) and a Support Vector Machine (SVM). The method analyses text according to the semantic position of each term within it, and captures the underlying meaning of the source text by considering the relationships between its terms through SRL. SRL offers a significant advantage in generating semantic roles from text. A key component of the proposed system is the use of the SVM to assess each generated role and predict which roles are significant. Only the important roles selected by the SVM are used in the similarity calculation. The proposed method was evaluated on the PAN-PC-10 dataset. The results show that it improved performance on the evaluation measures compared with other plagiarism detection methods.
© 2017 The Authors. Published by IASE.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: Plagiarism detection, Semantic similarity, Semantic role, SVM classifier, NLP
Article History: Received 21 April 2017, Received in revised form 13 July 2017, Accepted 14 July 2017
Citation:
Osman AH and Barukab OM (2017). SVM significant role selection method for improving semantic text plagiarism detection. International Journal of Advanced and Applied Sciences, 4(8): 112-122
http://www.science-gate.com/IJAAS/V4I8/Osman.html