An academic Arabic corpus for plagiarism detection: design, construction and experimentation

Al-Thwaib, Eman; Hammo, Bassam H.; Yagi, Sane

doi:10.1186/s41239-019-0174-x

Research article
Open access
Published: 16 January 2020

International Journal of Educational Technology in Higher Education, volume 17, Article number: 1 (2020)

Abstract

Advances in information technology have produced a massive body of textual material that is open to appropriation. In response to researchers’ misconduct, a plethora of plagiarism detection (PD) systems have been developed. However, most PD systems on the market do not support the Arabic language. In this paper, we discuss the design and construction of an Arabic PD reference corpus that is dedicated to academic language. It consists of (2312) dissertations that were defended by postgraduate students at the University of Jordan (JU) between 2001 and 2016. This Academic Jordan University Plagiarism Detection corpus, henceforth JUPlag, is structured according to the Dewey decimal classification (DDC). The goal of the corpus is twofold. Firstly, it is a database for the detection of plagiarism in student assignments, reports, and dissertations. Secondly, the n-gram structure of the corpus provides a knowledge base for linguistic analysis, language teaching, and the learning of plagiarism-free writing. The PD system is guided by the JU Library’s metadata for retrieval and discovery of plagiarism. To test JUPlag, we injected an unseen dissertation with multiple instances of plagiarism-simulated paragraphs and sentences. Experimentation with the system using different verbatim n-gram segments is promising. The preliminary results encourage us to seek permission to enrich this corpus with all the theses in the Thesis Repository of the Union of Arab Universities. The JUPlag corpus is intended to function as an indispensable source for testing and evaluating plagiarism detection techniques. Since the University of Jordan is a non-profit organization that seeks to become a center for plagiarism detection in Arabic content, it will charge a nominal fee for the use of JUPlag to finance the maintenance and development of the corpus.

Introduction

Plagiarism is simply defined as appropriating others’ words, thoughts, or intellectual property without providing proper citation or giving credit to them as the original source. The Oxford Dictionary^{Footnote 1} defines plagiarism as “The practice of taking someone else’s work or ideas and passing them off as one’s own”. With the exceptionally large volume of articles, reports and books available on the Internet, plagiarism in academic writing is a major concern that has become the matter of the moment.

Plagiarism can be either intentional or unintentional (DeVoss & Rosati, 2002). It is intentional when one copies or modifies someone else’s words without providing proper citation to the original source. It is unintentional when one copies from others without knowing the rules and regulations for academic writing. However, ignorance should not be an excuse. For instance, the latest scandal of alleged plagiarism involved a respectable lecturer at an Ivy League university who was once the executive editor of a major newspaper. It cast doubt on the integrity and reputation of an otherwise highly respectable academic and public figure. This academic had properly credited alleged instances of plagiarism to their sources, sometimes repeatedly, but occasionally failed to do so. This ‘unintentional plagiarism’ is a form of academic dishonesty.

Advancement in technology both facilitates plagiarism and prevents it. At the click of a mouse, paper mill websites help students and researchers to copy or buy research papers. Yet, plagiarism detection systems deter the appropriation of others’ intellectual property. Plenty of websites nowadays offer tools for plagiarism detection. Some are commercial and only a few are free. Turnitin and PlagScan, for instance, are very popular commercial tools that are used worldwide for the detection of text plagiarism. They are capable of detecting different forms of plagiarism that range from simple copy-paste plagiarism to word switching, sentence and paragraph paraphrasing, etc. However, these tools do not prevent plagiarism but catch it after it has occurred (Beute, Van Aswegen, & Winberg, 2008).

Misconduct in Arabic research is no exception. Unfortunately, however, most plagiarism detection tools act on ASCII (American Standard Code for Information Interchange) data and very few support Unicode data for plagiarism comparisons. Plagiarism detection for scholarly research written in the Arabic language is not well supported. The scarcity of Arabic literature and resources on the Internet, as well as the limited commitment to research in Arabic NLP (Natural Language Processing), are the main reasons behind the absence of efficient plagiarism tools that support a language spoken and written by around 423 million people.

The contribution of this ongoing project is twofold. The first, at this preliminary stage, is to construct a plagiarism corpus from defended dissertations in the thesis repository at the library of the University of Jordan. The second is to develop a plagiarism detection system dedicated to the Arabic language that is capable of detecting verbatim plagiarism and some forms of intelligent plagiarism, including word order changes, paraphrasing and synonym replacement. Hereafter, we refer to the corpus as JUPlag and to the plagiarism detection system as the PD system.

The remainder of the paper is organized as follows. Section 2 provides background and discusses related literature. Section 3 introduces the research methodology. Section 4 discusses the experiments and findings. Finally, Section 5 presents the conclusion of this paper and future work.

Background and literature review

Plagiarism

The lack of fundamental research skills could be the common reason why university students/researchers plagiarize (Devlin & Gray, 2007). However, academic writing is not an easy task. It requires clarity, conciseness, focus, structure, and evidence. It requires a lot of reading, appropriate usage of words and grammar, and learning how to express ideas and thoughts. Several studies pointed to other reasons for plagiarism: lack of author confidence, shortage of time, fear of failure, pressure of parents and scholarship committees to maintain high grades, lack of punishment by the institution, ease of appropriation, and absence of good plagiarism detection systems (Devlin & Gray, 2007; Eret & Ok, 2014; Franklin-Stokes & Newstead, 1995).

From a legal point of view, the act of plagiarism is not considered a crime (Frye, 2016). However, plagiarism during university years is highly condemned by the academic community and it may leave a significant impact on one’s career beyond academia. “Consequences range from loss of reputation to economic fines and ruined careers. Students are expelled from their schools, and faculty fired... Doctoral degrees can be revoked and plagiarizing publications are retracted and cursed” (Satija & Martínez-Ávila, 2019, p. 90). A case in point is the disgrace of politicians (cf. Ruipérez & García-Cabrero, 2016).

Plagiarism is of seven types: paraphrasing a text without proper citation, mosaic plagiarism where text from different sources is combined into one, copy and paste without due citation, incorrect citation, arrogating someone else’s entire work, self-plagiarism where one submits his/her published work as though it were new, and citing a non-existing work (Vij, Soni, & Makhdumi, 2009).

Plagiarism prevention methods have a long-term positive effect, but, unfortunately, their implementation is usually time-consuming (Lukashenko, Graudina, & Grundspenkis, 2007). Relying on such methods alone to maintain academic integrity, however, will not be enough to stop researchers from plagiarizing. In the words of Bolkan (2006), “Many educators blame the internet for what they perceive as the rise of plagiarism. Although the Internet certainly enables more efficient plagiarism, blaming it for widespread copying is akin to blaming a bank robbery on the presence of cash in the building … Efforts must be directed at prevention as well as detection and punishment” (p. 4).

Plagiarism detection software (PDS) can be content-based (extrinsic) or stylometry-based (intrinsic) (Rahman, 2015). Extrinsic plagiarism detection (EPD) discovers instances of appropriation by comparing a suspicious document with reference documents (a database or a corpus). Intrinsic plagiarism detection (IPD), on the other hand, discovers instances of appropriation in the suspicious document without using any reference corpus. Figure 1 depicts the common types of text plagiarism and the classification of plagiarism detection software tools.

A plagiarism detection system should ideally handle most types of plagiarism, including text modification by word-shifting, translation, and summarization, which bypass string-matching tools. At this preliminary stage, our present work handles string-matching-based plagiarism detection. We plan to enhance it with such NLP techniques as stemming and part-of-speech tagging, and with such lexical resources as a more extensive wordnet for Arabic (Baras, Sawalha, and Yagi, submitted), Arabic-WordNet,^{Footnote 2} dictionaries, and thesauri.

Related work

Plagiarism is an old topic and it has been well studied in the literature. In this section, we only focus on the recent work on Arabic text plagiarism detection. However, for further reading on the topic of plagiarism, we refer the reader to Maurer, Kappe, and Zaka (2006). In addition, the following is a sample of scholarly work that exemplifies plagiarism types with reference to Fig. 1. For intrinsic plagiarism, we refer the reader to the work of AlSallal, Iqbal, Palade, Amin, and Chang (2019), Polydouri, Siolas, and Stafylopatis (2017), Tschuggnall and Specht (2012), Zu Eissen and Stein (2006); for string-based extrinsic plagiarism detection, refer to Baba, Nakatoh, and Minami (2017), Leonardo and Hansun (2017), Nakatoh, Baba, Yamada, and Ikeda (2011), Wise (1996); for vector-space-based plagiarism detection, see Kong, Zhao, Lu, Qi, and Zhao (2016), Meuschke, Siebeck, Schubotz, and Gipp (2017), Paul and Jamal (2015); for syntax-based plagiarism detection refer to Si, Leong, and Lau (1997), Vani and Gupta (2017); and for citation-based detection see Gipp and Beel (2010), Gipp and Meuschke (2011); and Meuschke, Gipp, Breitinger, and Berkeley (2012).

The first shared task to address plagiarism detection in Arabic texts was “AraPlagDet” (Arabic Plagiarism Detection), introduced in the PAN@Fire2015 competition; it has since become an annual event involving extrinsic and intrinsic plagiarism detection (Bensalem et al., 2015). Researchers in Arabic NLP adopted shared tasks to raise awareness of plagiarism problems and to develop solutions to them.

The majority of works on Arabic plagiarism detection involves preprocessing, segmenting documents into chunks of sentences of variable sizes (n-grams), tokenization, removing diacritics and non-alphanumeric characters, normalizing some letters (for example “أ،إ،آ” get normalized into “ا”), stemming, lemmatization, part-of-speech tagging, and synonym replacement.
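The preprocessing steps listed above can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names, the exact set of diacritics removed, and the decision to leave other letters (for example the alef maqsura) untouched are assumptions made for illustration only. Stemming, lemmatization, part-of-speech tagging, and synonym replacement are omitted because they require external resources.

```python
import re

# Assumed diacritic set: fathatan..sukun (U+064B-U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Alef with hamza above, hamza below, and madda, all mapped to bare alef (U+0627).
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")
# Anything that is neither a word character nor whitespace (punctuation, symbols).
NON_ALNUM = re.compile(r"[^\w\s]")


def normalize(text):
    """Strip diacritics, unify alef variants, and blank out punctuation."""
    text = DIACRITICS.sub("", text)       # must run first so words are not split
    text = ALEF_VARIANTS.sub("\u0627", text)
    return NON_ALNUM.sub(" ", text)


def tokenize(text):
    """Whitespace tokenization of the normalized text."""
    return normalize(text).split()


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams as tuples (empty if too few tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Under these assumptions, a sentence is first normalized and tokenized, and the resulting token list is then cut into overlapping n-grams that can be compared against n-grams from a reference document.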

Zaher, Shehab, Elhoseny, and Osman (2017) developed a web-based plagiarism detection system for Arabic documents, called APDS. The system operated in three phases: preparation, preprocessing, and similarity detection. After preprocessing, the query document was presented as n-gram chunks for similarity detection. The proposed system was tested on a dataset of 10 Arabic documents and evaluated in terms of precision and recall. The authors claimed an average precision of 82% and an average recall of 92.5%. However, the paper does not state what kind of plagiarism was detected, how the documents were presented, or how the precision and recall measures were obtained.
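For readers unfamiliar with the two measures, the following sketch shows one common way precision and recall are computed for a detector. It assumes that detections and true plagiarism cases are represented as sets of segment identifiers; this representation and the function name are illustrative assumptions and do not describe how Zaher et al. obtained their figures.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected segment ids against the true ones."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    # Precision: share of reported segments that are really plagiarized.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of plagiarized segments that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

A detector that flags four segments, two of which are among three truly plagiarized ones, would score a precision of 0.5 and a recall of about 0.67.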

Mahmoud and Zrigui (2017) proposed a system for detecting semantic plagiarism in Arabic documents that benefited from machine learning technology. In the preprocessing phase, the suspicious and source documents were split into sentences then into words without removing stopwords. In the feature extraction phase, the TF*IDF (Term Frequency-Inverse Document Frequency) measure was calculated for weighting words in terms of importance. Then the word2vec algorithm was used for learning word embeddings, and the skip-gram model was employed for predicting the context of words given a current word vector. For similarity calculation, they used cosine and the Euclidean distance measures. The degrees of similarity between sentences were compared to a predefined threshold. Experiments were conducted on an open source Arabic corpus and they claimed a precision rate of (85%) and a recall rate of (84%).
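The weighting and similarity steps described above can be sketched in a few lines of Python. This is a simplified illustration only: it uses raw term counts and a smoothed IDF of the form log(N / df) + 1, both of which are assumptions, and it leaves out the word2vec embeddings and the Euclidean distance that Mahmoud and Zrigui also used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts (assumed TF variant)
        vectors.append({t: c * (math.log(n_docs / df[t]) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In such a setup, each sentence of the suspicious document would be vectorized together with the candidate source sentences, and pairs whose cosine similarity exceeds a chosen threshold would be reported.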

Mahmoud, Zrigui, and Zrigui (2017) used a Convolutional Neural Network (CNN) approach for detecting paraphrasing plagiarism in Arabic documents. This method is said to detect paraphrasing plagiarism through the measurement of semantic relatedness between the suspicious and the original documents. Their approach has three phases: preprocessing, feature extraction, and paraphrase detection. After preprocessing, the feature extraction phase employed a skip-gram model for word-to-vector representation, where each document is represented by a vector in a multidimensional space. The paraphrase detection phase applied the cosine similarity measure on the vectors of both the suspicious and the original documents to reduce dimensionality. Finally, a mathematical function called Softmax was used for paraphrase detection according to some predefined threshold. Experiments showed a precision rate of (88%).

However, Mahmoud et al. (2017) and Mahmoud and Zrigui (2017) conducted their experimentation on an open source Arabic corpus, named OSAC (Saad & Ashour, 2010). The corpus was organized in ten different categories collected from multiple websites. The sources of the articles were news channels and social and commercial websites, which clearly makes it inappropriate for academic plagiarism detection. Specialized content is what the PD corpus ought to consist of, because academics do not normally plagiarize the news or social media.

Abdelrahman, Khalid, and Osman (2017) presented a framework for content-based PD in Arabic documents. Their framework has two phases: preprocessing and document representation. They used a tree-structure model with the document at the root of the tree, the paragraphs at the second level, and the sentences at the third level of the tree. A Longest Common Substring (LCS) matching algorithm was used for comparing hashed text chunks (i.e. words in their case). No experiments were conducted to evaluate the system or show its effectiveness, and consequently no plagiarism detection corpus was involved.
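A word-level Longest Common Substring comparison of the kind mentioned above can be sketched with standard dynamic programming. The sketch below is not the authors' implementation: it compares the words directly, whereas comparing their hash values would behave the same up to hash collisions, and the function name is hypothetical.

```python
def longest_common_run(a, b):
    """Return the longest contiguous word sequence shared by token lists a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # lengths of common runs ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word in both lists.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

The length of the returned run, relative to the sentence length, could then serve as a simple verbatim-overlap score for a pair of sentences.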

Ghanem, Arafeh, Rosso, and Sánchez-Vega (2018) presented a system for detecting extrinsic plagiarism in Arabic texts. Their system, Hybrid Plagiarism (HYPLAG), followed a hybrid detection approach. They adopted corpus-based and knowledge-based approaches for the detection of both the verbatim and rephrasing types of plagiarism. The system was compared to other systems that participated in the Arabic Plagiarism Detection PAN-Forum for Information Retrieval Evaluation (AraPlagDet PAN@FIRE) competition and was tested on a corpus called External Arabic Plagiarism Detection (ExAraPlagDet-2015). The authors reported that HYPLAG outperformed others with a success rate of (89%). They chunked the query (suspicious) document and the source documents into n-term sentences. Then the synonyms of the query document were extracted from the Arabic-WordNet. The original sentences were ranked with respect to the suspicious sentences and the ones with the highest scores were extracted as potentially plagiarized sentences. Finally, the candidate sentences and suspicious sentences were compared for similarity using the vector space model and the TF*IDF weighting measure. A similarity value that exceeded a predefined maximum threshold indicated plagiarism, while a similarity value between minimum and maximum thresholds required a call for the next phase of feature-based semantic similarity measurement based on the synonyms extracted from the Arabic-WordNet.
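The two-threshold decision described at the end of this paragraph can be written as a small function. The threshold values below are placeholders chosen for illustration; the source does not report the values used in HYPLAG, and the label names are likewise hypothetical.

```python
def classify(similarity, low=0.4, high=0.8):
    """Route a TF*IDF similarity score using assumed min/max thresholds."""
    if similarity > high:
        return "plagiarized"           # above the maximum threshold
    if similarity >= low:
        return "needs_semantic_check"  # between thresholds: run the synonym-based phase
    return "not_plagiarized"
```

Only the middle band triggers the more expensive semantic comparison, which keeps the synonym lookup away from pairs that are clearly similar or clearly unrelated.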

Khorsi, Cherroun, and Schwab (2018) used a Two-Level Plagiarism Detection System (2L-APD), which is said to detect different plagiarism cases, including verbatim and paraphrasing. Their system consisted of two consecutive modules: fingerprinting and word embedding detection. The first module is responsible for preprocessing and segmenting the suspicious document into sentences. When sentences exceeded some threshold value, they were passed on to the second module to test for paraphrasing and synonym replacement. The fingerprinting was applied by chunking the text documents into n-grams and then selecting the least frequent ones. Finally, they used a function called Brian Kernighan and Dennis Ritchie (BKDR) for hashing the selected n-grams. The first module applied the Jaccard similarity measure, whilst the second module used the cosine similarity measure. Important words were picked on the basis of their IDF value and their part-of-speech tags. To test their approach, Khorsi et al. (2018) used the ExAraPlagDet-2015 corpus. Experimental results showed an overall precision rate of (85%) and a recall rate of (87%).
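The fingerprinting module can be approximated by the sketch below, which hashes the rarest word n-grams with a BKDR-style hash and compares two fingerprints with the Jaccard coefficient. Several details are assumptions made for illustration: the seed of 131, hashing over UTF-8 bytes, counting n-gram frequency within the document itself, and the `n` and `keep` defaults are not taken from Khorsi et al.

```python
from collections import Counter


def bkdr_hash(text, seed=131):
    """BKDR string hash over UTF-8 bytes, kept within 31 bits."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0xFFFFFFFF  # emulate 32-bit unsigned overflow
    return h & 0x7FFFFFFF


def fingerprint(tokens, n=3, keep=50):
    """Hash the `keep` least frequent word n-grams of a token list."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    # Rarest first; ties are broken alphabetically to keep the result deterministic.
    rarest = sorted(counts, key=lambda g: (counts[g], g))[:keep]
    return {bkdr_hash(g) for g in rarest}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Two documents that share many rare n-grams yield fingerprints with a high Jaccard score, which is the signal a first-level verbatim check would rely on before any embedding-based comparison.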

Although the works of Ghanem et al. (2018) and Khorsi et al. (2018) seem promising, both have been tested on the ExAraPlagDet-2015 corpus, which is an Arabic corpus made of short sentences constructed for the PAN@FIRE plagiarism detection competition. We suspect this corpus might not be suitable for academic plagiarism detection, as it is not a well-organized academic corpus, nor is it annotated for discourse structure.

Clearly, there is a need for a corpus dedicated to plagiarism detection that is authentic, big, versatile, and richly annotated. The JUPlag corpus is intended to meet this need and to function as a test bed for the evaluation of plagiarism detection techniques.

Corpus design methodology

The JUPlag corpus was guided by the following design objectives:

1) To compile academic texts for the purpose of training and testing the Arabic plagiarism detection system that is to be developed.
2) To devise a mechanism for organizing the texts and indexing them.
3) To annotate the texts using a stemmer and a part-of-speech tagger.
4) To construct an Arabic thesaurus database that can be used for detecting synonym replacements.

Source data collection

Data collection is a fundamental success factor in plagiarism detection. PD systems need to access multitudes of data sources to detect potential plagiarism. This includes accessing local databases as well as online data available on the internet. Due to the scarcity of scholarly Arabic literature in digitized form, it was deemed necessary to build a resource that would contain a collection of academic texts and that may be used for the detection of plagiarism in dissertations before a defense is scheduled. Postgraduate students usually sign an affidavit stating that they observed the code of ethics in the compilation of their theses, that they accepted all legal repercussions of plagiarism including the revocation of their degrees, and that they agreed that the Deans Council’s revocation decision would be final.

With the necessary legal provisions, the Library of the University of Jordan graciously gave us permission to access their copyrighted repository of dissertations. The University requires postgraduate students to transfer their copyrights to it and to sign an authorization form that permits the University of Jordan “to supply copies of [their] Thesis/Dissertation to libraries or establishments or individuals on request, according to the University of Jordan regulations”. We obtained the permission of the University administration and of the Director of the University Library to access the dissertation repository for the specific purpose of developing the JUPlag corpus and experimenting with the repository.

We had access to (2312) dissertations that were defended by University of Jordan postgraduate students between 2001 and 2016. Table 1 shows the number of collected dissertations per year. Notice the significant increase in the number of collected dissertations in 2006 and beyond; this is due to the School of Graduate Studies’ drive to boost the number of master’s, doctoral and high specialization programs. As JU sought to become a pioneer in postgraduate programs, it widened its program offerings, so that by 2012 it offered (105) master’s programs, (34) doctoral programs, and (16) high specialization programs in Medicine. As of today, the Graduate School offers (123) master’s programs, (38) doctoral programs, (16) high specialization programs in Medicine, and (1) high specialization program in Dentistry.

Table 1 Per year distribution of the collected dissertations
