Handwriting match and AI content detection

Machine-generated text poses a potential threat not only to the public sphere but also to education, where convincing synthetic text undermines the authenticity of students' work. There are also concerns about the spread of academic misconduct, particularly direct copying among students. In response to these challenges, this paper introduces the Handwriting Match and Artificial Intelligence (AI) Content Detection System (HMAC). HMAC uses optical character recognition (OCR) to convert handwritten and typed content from single-page Portable Document Format (PDF) files into machine-readable text for further analysis. Drawing on recent advances in natural language understanding, HMAC aims to preserve the educational value of assignments by detecting AI-generated content. In addition, HMAC includes a plagiarism detection component that compares student submissions within a particular academic field. This paper describes HMAC’s architecture, methodology, and results, emphasizing its key contributions: improved extraction of handwritten content with OCR and improved identification of AI-generated content. The study addresses the research question of how HMAC improves the identification of AI-generated content and supports academic integrity compared with other methodologies.
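The comparison of student submissions can be illustrated with a minimal sketch. The abstract does not state which similarity measure HMAC uses, so the word-shingle Jaccard similarity below is an assumption chosen for illustration, and the function names, the shingle size, and the threshold are all hypothetical. The input is assumed to be text already extracted by OCR.

```python
from __future__ import annotations

from itertools import combinations


def word_shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < n:
        # Very short texts are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_similar(
    submissions: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Compare every pair of submissions; return pairs scoring at or above threshold.

    The threshold is an illustrative value, not one reported by the paper.
    """
    shingles = {name: word_shingles(text) for name, text in submissions.items()}
    flagged = []
    for a, b in combinations(sorted(shingles), 2):
        score = jaccard(shingles[a], shingles[b])
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged
```

Calling `flag_similar` on a dictionary that maps student identifiers to extracted text returns the pairs whose overlap meets the threshold. Any pair flagged this way would still need human review before it is treated as misconduct.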