Most previous research on testing text-matching tools has focused on coverage. The problem with most of these studies is that they approach coverage from only a single perspective, aiming solely at measuring the overall coverage performance of the detection tools. The present study approaches coverage from four perspectives: language-based coverage, language subgroup-based coverage, source-based coverage, and disguising technique-based coverage. It also includes a usability evaluation.
It must be noted that both the coverage and the usability scores are based on work that was done with potentially older versions of the systems. Many companies have since responded that they are now able to deal with various issues. This is good, but we can only report on what we observed when we evaluated the systems. If any part of the evaluation were to be repeated, it would have to be repeated for all systems. It should be noted that vendors have given similar responses to all of Weber-Wulff’s tests, such as (Weber-Wulff et al., 2013).
It must also be noted that the selection of usability criteria and their weights reflects the personal experience of the project team. We are fully aware that different institutions may have different priorities. To mitigate this limitation, we have published all usability scores, allowing for calculations using individual weights.
Language-based coverage
For language-based coverage, the performance of the tools was evaluated for eight languages in order to determine which tools yield the best results for each particular language. The best-performing tools with respect to coverage alone were (three systems tied for Italian):
- PlagAware for German,
- PlagScan for English and Italian,
- PlagiarismCheck.org for Italian and Latvian,
- StrikePlagiarism.com for Czech and Italian, and
- Urkund for Slovak, Spanish, and Turkish.
It is worth noting that, overall, the text-matching tools tested yield better results for widely spoken languages. In the literature, language-based similarity detection mainly revolves around identifying plagiarism among documents in different languages; to our knowledge, no study has been conducted specifically on the coverage of multiple languages. In this respect, these findings offer valuable insights to readers. As for the language subgroups, the tested text-matching tools work best for Germanic and Romance languages, while the results are not satisfactory for Slavic languages.
Source-based coverage testing
Source-based coverage testing was performed using four types of sources: Wikipedia, open-access papers, a student thesis, and online articles. For many students, Wikipedia is the starting point for research (Howard & Davies, 2009), and it can thus be regarded as one of the primary sources for plagiarists. Since the Wikipedia database is freely available, it is to be expected that Wikipedia texts should be easily identifiable. Testing the tools with Wikipedia texts thus demonstrates their fundamental ability to catch text matches.
For each of the eight languages, three articles were created, each using a different disguising technique (copy & paste, synonym replacement, and manual paraphrase). The best-performing tools for the sources tested over all languages were:
- PlagiarismCheck.org for online articles,
- StrikePlagiarism.com for the student thesis (although this may be because the student thesis was in Czech),
- Turnitin for open-access papers, and
- Urkund for Wikipedia texts.
Since Wikipedia is assumed to be a widely used source, it was worth investigating the Wikipedia texts in more depth. The results revealed that the majority of tools are successful at detecting similarity in copy & paste texts from Wikipedia, with the exception of intihal.net, DPV, and Dupli Checker. However, a considerable drop was observed for synonym replacement texts in all systems except Urkund, PlagiarismCheck.org, and Turnitin, which yielded promising results. This replicates the result of the study of Weber-Wulff et al. (2013), in which Urkund and Turnitin were found to have the best results among 16 tools.
As for the paraphrased texts, all systems fell short of catching similarity at a satisfactory level. PlagiarismCheck.org was the best-performing tool for paraphrased texts compiled from Wikipedia. Overall, Urkund was the best-performing tool at catching similarity in Wikipedia texts created with all three disguising techniques.
One aspect of Wikipedia sources that is not adequately addressed by the text-matching software systems is the proliferation of Wikipedia copies on the internet. As discussed in Weber-Wulff et al. (2013), this can lead to the appearance of many smallish text matches instead of one large one. In particular, this can happen if the copy of the ever-changing Wikipedia in the database of the software system is relatively old and the copies on the internet are from newer versions. A careless teacher may draw false conclusions if they focus only on the quantity of Wikipedia similarities in the report.
Disguising technique-based coverage
The next dimension of coverage testing is disguising technique-based coverage. In this phase, documents were created using copy & paste, synonym replacement, paraphrase, and translation techniques. For copy & paste documents, all systems achieved acceptable results except DPV, intihal.net, and Dupli Checker. Urkund was the best tool at catching similarity in copy & paste texts. The success of some of the tested tools in catching similarity in copy & paste texts has also been validated by other studies, for example Turnitin (Bull et al., 2001; Kakkonen & Mozgovoy, 2010; Maurer et al., 2006; Vani & Gupta, 2016) and Docol©c (Maurer et al., 2006).
For synonym replacement texts, the best-performing tools from the copy & paste texts continued their success with a slight decline in scores, except for PlagiarismCheck.org, which yielded better results for synonym replacement texts than for copy & paste texts. Plagiarism Software and Viper showed the sharpest decline in their scores for synonym replacement. Urkund and PlagiarismCheck.org were the best tools in this category.
For paraphrased texts, none of the systems was able to provide satisfactory results. However, PlagiarismCheck.org, Urkund, PlagScan and Turnitin scored somewhat better than the other systems. PlagScan (Křížková et al., 2016) and Turnitin (Bull et al., 2001) also scored well in paraphrased texts in some studies.
For translated texts, none of the systems was able to detect translation plagiarism, with the exception of Akademia, which offers users an option to check for potential translation plagiarism. The systems detected translation plagiarism mainly in the references, not in the body of the texts. This is similar to previous research findings, and the situation has not improved since then. For example, Turnitin and Docol©c have previously been shown not to be efficient in detecting translation plagiarism (Maurer et al., 2006). To increase the chances of detecting translation plagiarism, paying extra attention to matches within the reference entries should be encouraged, since matches from the same source can be a significant indicator of translation plagiarism. However, it should be noted that some systems may omit matches within the reference entries by default.
Multi-source coverage testing
In the last phase of coverage testing, we tested the ability of the systems to detect similarity in documents compiled from multiple sources. It is assumed that plagiarised articles contain text taken from multiple sources (Sorokina, Gehrke, Warner, & Ginsparg, 2006). This type of plagiarism requires additional effort to identify. If a system is able to find all similarity in documents compiled from multiple sources, this is a significant indicator of its coverage performance.
The multi-source results show that Urkund, the best-performing system for single-source documents, shares the top score with PlagAware for multi-source documents, while Dupli Checker, DPV, and intihal.net yielded very unsatisfactory results. Surprisingly, only two systems (Akademia and Unicheck) demonstrated a sharp decline in performance on multi-source documents, whereas the performance of ten systems actually improved. This shows that the systems are better at catching short fragments in a multi-source text than a whole document taken from a single source.
As for the general testing, the results are highly consistent with the Wikipedia results, which contributes to the validity of the single-source and multi-source testing. Again, Urkund obtained the highest score for single-source documents, while PlagAware was the best-performing system for multi-source documents. Dupli Checker, DPV, and intihal.net obtained the lowest scores in both categories. Most of the systems demonstrated better performance for multi-source documents than for single-source ones. This is most probably explained by the chance a system had of having access to a source: if the single source was missing from a tool’s database, the tool had no chance to identify the text match, whereas the use of multiple sources gave the tools multiple chances of identifying at least one of them. This points out quite clearly the issue of false negatives: even if a text-matching tool does not identify a source, the text can still be plagiarized.
Overall coverage performance
Based on the total coverage performance, calculated as the average of the scores for all testing documents, we can divide the systems into four categories (sorted alphabetically within each category) according to their overall placement on a scale from 0 (worst) to 5 (best); a sketch of this banding follows the list:
- Useful systems (overall score in [3.75–5.0]): there were no systems in this category.
- Partially useful systems (overall score in [2.5–3.75)): PlagAware, PlagScan, StrikePlagiarism.com, Turnitin, Urkund.
- Marginally useful systems (overall score in [1.25–2.5)): Akademia, Copyscape, Docol©c, PlagiarismCheck.org, Plagiarism Software, Unicheck, Viper.
- Unsuited for academic institutions (overall score in [0–1.25)): Dupli Checker, DPV, intihal.net.
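For concreteness, the following minimal sketch (in Python) shows how an average per-document score on the 0–5 scale maps to these four bands. The function name and structure are illustrative assumptions, not part of the study's actual tooling.

```python
# Minimal sketch of the coverage banding described above
# (names are illustrative, not part of the study's tooling).

def coverage_category(score: float) -> str:
    """Map an overall coverage score on the 0-5 scale to a category.

    The overall score is the average of the per-document scores;
    the band boundaries are quarters of the 0-5 scale.
    """
    if not 0.0 <= score <= 5.0:
        raise ValueError("score must be within [0, 5]")
    if score >= 3.75:
        return "useful"
    if score >= 2.5:
        return "partially useful"
    if score >= 1.25:
        return "marginally useful"
    return "unsuited for academic institutions"

# Example: a system averaging 2.8 over all testing documents
# lands in the "partially useful" band.
assert coverage_category(2.8) == "partially useful"
```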
Usability
The second evaluation focus of the present study is usability. The results can be interpreted in two ways, either from a system-based perspective or from a feature-based one, since some users may prioritize a particular feature over others. In the system-based usability evaluation, Docol©c, DPV, PlagScan, Unicheck, and Urkund were able to meet all of the specified criteria. PlagiarismCheck.org, Turnitin, and Viper missed only one criterion each (PlagiarismCheck.org dropped the original file names, and both Turnitin and Viper insisted on extensive metadata being filled in).
From the feature-based perspective, the ability to process large documents, the absence of word limits, and the ability to use the system entirely in the chosen language were the features most widely supported by the systems. Unfortunately, uploading multiple documents at the same time was the least supported feature. This is odd, because it is an essential feature for academic institutions.
A similar usability evaluation was conducted by Weber-Wulff et al. (2013). In that study, a 27-item usability checklist was created and used to evaluate the usability of 16 systems. Their checklist includes criteria similar to those of the present study, such as storing reports, side-by-side views, or an effective support service. The two studies have eight systems in common. In the study of Weber-Wulff et al. (2013), the top three systems were Turnitin, PlagAware, and StrikePlagiarism.com, while in the present study Urkund, StrikePlagiarism.com, and Turnitin are the best scorers. Copyscape, Dupli Checker, and Docol©c were the worst-scoring systems in both studies.
Another similar study (Bull et al., 2001) addressed the usability of five systems, including Turnitin. For usability, the researchers defined a set of criteria and rated the systems against them by assigning up to five stars. As a result of the evaluation, Turnitin was given five stars for the clarity of its reports, five stars for user-friendliness, five stars for the layout of its reports, and four stars for ease of interpretation.
The similarity reports are the end products of the testing process and serve as crucial evidence for decision makers such as honour boards or disciplinary committees. Since affected students may decide to ask courts to review a decision, the evidence needs to be clear, with the offending text and a potential source presented in a synoptic (side-by-side) style and with metadata such as page numbers included to ease verification. Thus, the similarity reports generated were the focus of the usability evaluation.
However, none of the systems managed to meet all of the stated criteria. PlagScan (no side-by-side layout in the offline report) and Urkund (did not keep the document formatting) scored seven out of eight points. They were closely followed by Turnitin and Unicheck, which each missed two criteria (no side-by-side layout in either the online or the offline reports).
The features supported most were downloadable reports and some sort of highlighting of the text match in the online reports. Two systems, Dupli Checker and Copyscape, do not provide downloadable reports to the users. The side-by-side layout was the least supported feature. While four systems offer side-by-side evidence in their online reports, only one system (Urkund) supports this feature in the offline report. It can be argued that the side-by-side layout is an effective way to make a contrastive analysis in deciding whether a text match can be considered plagiarism or not, but this feature is not supported by most of the systems.
Along with the uploading process and the understandability of reports, we also aimed to address certain features that would be useful in academia. Eight criteria were included in this area:
- clearly stated costs,
- the offer of a free trial,
- integration with an LMS (Learning Management System) via API,
- Moodle integration (as this is a very popular LMS),
- availability of support by telephone during normal European working hours (9–15),
- availability of support by telephone in English,
- proper English usage on the website and in the reports, and
- no advertisements for other products or companies.
The qualitative analysis in this area showed that only PlagiarismCheck.org and Unicheck were able to achieve a top score. PlagScan scored seven points out of eight and was followed by PlagAware (6.5 points), StrikePlagiarism.com (6.5 points), Docol©c and Urkund (6 points). Akademia (2 points), DPV (2 points), Dupli Checker (3 points), intihal.net (3 points) and Viper (3 points) did not obtain satisfactory results.
Proper English usage was the most supported feature in this category, followed by the absence of external advertisements. The least supported feature was clearly stated system costs; only six systems fulfilled this criterion. While it is understandable that a company wants to be able to charge as much as it can get from a customer, it is in the interest of the customer to be able to compare the total cost of use per year up front, before diving into extensive tests.
In order to calculate the overall usability score, the categories were ranked based on their impact on usability. In this respect, the interpretation of the reports was considered to have the most impact on usability, since similarity reports can be highly misleading (also noted by Razı, 2015) when they are not clear enough or lack adequate features. Thus, the scores from this category were weighted threefold, the workflow process criteria were weighted twofold, and the other criteria were given a weight of one. The maximum weighted score was thus 47. Based on these numbers, we classified the systems into four categories (the boundaries between the categories were 35, 23, and 11; see the sketch after the list):
- Useful systems: Docol©c, PlagScan, Turnitin, Unicheck, Urkund;
- Partially useful systems: DPV, PlagAware, PlagiarismCheck.org, StrikePlagiarism.com, Viper;
- Marginally useful systems: Akademia, Copyscape, Dupli Checker, intihal.net, Plagiarism Software;
- Unsuited for academic institutions: none.
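The weighting and banding can be made explicit with the following minimal sketch. Only the weights (3, 2, 1), the maximum weighted score of 47, and the boundaries (35, 23, 11) come from the text above; the per-category point values in the example are hypothetical.

```python
# Sketch of the usability weighting and banding described above.
# The example point values below are hypothetical.

def weighted_usability(report_pts: float, workflow_pts: float,
                       other_pts: float) -> float:
    """Weight report criteria threefold, workflow twofold, the rest by one."""
    return 3 * report_pts + 2 * workflow_pts + 1 * other_pts

def usability_category(score: float) -> str:
    """Band a weighted usability score (maximum 47)."""
    if score >= 35:
        return "useful"
    if score >= 23:
        return "partially useful"
    if score >= 11:
        return "marginally useful"
    return "unsuited for academic institutions"

# Hypothetical example: 7 report points, 5 workflow points, and 6 other
# points give 3*7 + 2*5 + 6 = 37, i.e. a "useful" system.
assert usability_category(weighted_usability(7, 5, 6)) == "useful"
```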
Please note that these categories are quite subjective, as both our evaluation criteria and their weightings are subjective. For other use cases, the criteria might be different.
Combined coverage and usability
If the results for coverage and usability are combined in a two-dimensional graph, Fig. 1 emerges. In this section, the combined coverage and usability results are discussed.
Coverage is the primary limitation of a web-based text-matching tool (McKeever, 2006), and the usability of such a system has a decisive influence on its users (Liu, Lo, & Wang, 2013). Therefore, Fig. 1 presents a clear portrayal of the overall effectiveness of the systems. Having determined their own criteria related to the coverage and usability of a web-based text-matching tool, clients can decide which system works best in their setting. Vendors are given an idea of the overall effectiveness of their systems among the tools tested; the diagram thus offers them an initial blueprint for improving their systems and indicates the direction such improvement could take.
One important result that can be seen in Fig. 1 is that the usability performance of the systems is relatively better than their coverage performance. As for coverage, the systems demonstrated at best only average performance. Thus, it has been shown that the systems tested fall short of meeting coverage expectations. They are useful in the sense that they find some text similarity that can be considered plagiarism, but they do not find all such text similarity, and they also suffer from false positives.
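To illustrate how such a two-dimensional comparison can be produced, the following sketch plots coverage against usability in the style of Fig. 1. The system names and score values are placeholders, not the study's actual results.

```python
# Illustrative sketch of a coverage-vs-usability scatter plot in the
# style of Fig. 1. All system names and score values are placeholders.
import matplotlib.pyplot as plt

systems = {
    # "system": (overall coverage 0-5, weighted usability 0-47)
    "System A": (3.1, 40.0),
    "System B": (2.6, 28.0),
    "System C": (0.9, 12.0),
}

fig, ax = plt.subplots()
for name, (coverage, usability) in systems.items():
    ax.scatter(coverage, usability)
    ax.annotate(name, (coverage, usability),
                textcoords="offset points", xytext=(5, 3))

ax.set_xlabel("Overall coverage score (0-5)")
ax.set_ylabel("Weighted usability score (max 47)")
ax.set_xlim(0, 5)
ax.set_ylim(0, 47)
plt.show()
```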