Determination of writing styles to detect similarities in digital documents

Abstract

Any product of human intellect is at risk of being plagiarised: scientific and literary works such as articles, theses, audiovisual works, plans, projects and computer programs. This article focuses on plagiarism in written works, specifically digital documents written in natural or programming languages. The objective of the research is to develop and apply a mathematical model that determines the writing style used in drafting a text. The results of applying the procedure are intended to serve as a basis for reducing the number of documents that must be compared when analysing and detecting similarities among them. The procedure was applied experimentally to a set of articles classified by topic and author that differ in the writing style used to draft them.
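The abstract does not give the model itself, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a writing style can be summarised by three common stylometric measures (mean sentence length, type-token ratio and Yule's K), and that two documents are worth comparing in detail only when those measures are close. The function names, the choice of features and the `tolerance` threshold are all hypothetical.

```python
import re
from collections import Counter


def style_vector(text):
    """Return (mean sentence length, type-token ratio, Yule's K) for a text.

    These three features are an assumption made for illustration; the
    article's actual model is not described in the abstract.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-záéíóúüñ']+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")

    n = len(words)
    freqs = Counter(words)
    # Frequency spectrum: how many distinct words occur exactly i times.
    spectrum = Counter(freqs.values())
    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    return (n / len(sentences), len(freqs) / n, yule_k)


def similar_style(a, b, tolerance=0.25):
    """True when every feature differs by at most `tolerance`,
    measured relative to the larger of the two values."""
    for x, y in zip(a, b):
        largest = max(abs(x), abs(y))
        if largest > 0 and abs(x - y) / largest > tolerance:
            return False
    return True


def candidate_pairs(documents, tolerance=0.25):
    """Return the document pairs whose style vectors are close enough
    to justify a full similarity comparison."""
    vectors = {name: style_vector(text) for name, text in documents.items()}
    names = sorted(vectors)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if similar_style(vectors[p], vectors[q], tolerance)
    ]
```

Under these assumptions, a detailed comparison would run only on the pairs returned by `candidate_pairs`, which is how a style filter could reduce the number of comparisons. A real system would need features and a threshold validated on a labelled corpus.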

Author information

Corresponding author

Correspondence to Yohandri Ril Gil.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Ril Gil, Y., Toll Palma, Y.d.C. & Fonseca Lahens, E. Determination of writing styles to detect similarities in digital documents. Int J Educ Technol High Educ 11, 128–141 (2014). https://doi.org/10.7238/rusc.v11i1.1783

Keywords

  • writing style
  • digital documents
  • plagiarism
  • procedure
