An academic Arabic corpus for plagiarism detection: design, construction and experimentation

Table 8 Characteristics of the test dataset

Segment	Untampered Test Dataset		Test Dataset with Plagiarized Paragraphs		Test Dataset with Plagiarized Sentences
	Count	Unique count	Count	Unique count	Count	Unique count
unigram	632	413	735	487	678	441
bigram	631	586	734	682	677	626
trigram	630	618	733	718	676	662
4-gram	629	624	732	725	675	672
5-gram	628	627	732	730	674	673
6-gram	627	627	731	731	673	673
7-gram	626	626	729	729	672	672
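
The two columns reported for each dataset (count and unique count) can be illustrated with a minimal sketch. It assumes the segments are word n-grams taken with a sliding window over one whitespace-tokenized sequence; the function name `ngram_counts` and the toy input are hypothetical and are not taken from the paper. Under that assumption a sequence of N tokens yields N - n + 1 n-grams, which agrees with the untampered column decreasing by one per row (632, 631, 630, ...).

```python
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Tuple[int, int]:
    """Return (count, unique count) of word n-grams in a token sequence.

    Assumption: n-grams are contiguous windows of n tokens, so a sequence
    of N tokens yields N - n + 1 of them (none if N < n).
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(grams), len(set(grams))


# Toy input (hypothetical); real text would need a proper Arabic tokenizer.
tokens = "a b a b c".split()
for n in range(1, 4):
    count, unique = ngram_counts(tokens, n)
    print(n, count, unique)
```

As n grows, repeated segments become rarer, so the unique count approaches the total count, which is the trend visible in the table.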