Code switching: exploring perplexity and coherence metrics for optimizing topic models of historical documents

Muhammad Abdullah Yusof; Suhaila Saee

doi:10.6977/IJoSI.202412_8(4).0007

Authors

Muhammad Abdullah Yusof
Suhaila Saee

DOI:

https://doi.org/10.6977/IJoSI.202412_8(4).0007

Keywords:

Topic modelling, Latent Dirichlet Allocation, hyperparameter, perplexity, topic coherence

Abstract

Latent Dirichlet Allocation (LDA) has two important hyperparameters: alpha (α), which controls the per-document topic distribution, and beta (β), which controls the per-topic word distribution. Suitable values for both hyperparameters are needed to obtain accurate topic clusters. Given the size and complexity of the data, a single evaluation method is insufficient to determine optimal and efficient hyperparameter values. An experiment was therefore conducted to study the relationship between the hyperparameters and the perplexity and coherence scores. The experiment is needed to establish a proper baseline for further topic modelling studies. It is the first study to focus on multiple languages in the Sarawak Gazette data for topic modelling. The study used the LDA implementation from the Gensim package. The results show that although perplexity is a good indicator of the model's ability to predict new or hidden data, the word cluster within a topic does not necessarily reflect the similarity or relation between words, which compromises the interpretation of topics. Perplexity was lowest when alpha was set to 5 and beta to 0.4. Meanwhile, the coherence evaluation reflects the best number of topics for each hyperparameter value, although the status of hidden words is unknown. The coherence score was best when the number of topics was 5 and 4. In conclusion, perplexity is a good indicator of word prediction for each hyperparameter setting, while coherence captures a suitable number of topics for these hyperparameters to produce highly coherent word clusters within a topic. Combining both evaluation methods ensures an optimal result with interpretable output.
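The two metrics contrasted in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' code and does not call Gensim: `perplexity` assumes per-token log-probabilities for held-out text are already available from some fitted model, and `umass_coherence` uses a UMass-style document co-occurrence score, which may differ from the coherence measure used in the study. The function names and the toy documents are hypothetical and exist only for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability per held-out token)."""
    if not token_log_probs:
        raise ValueError("need at least one held-out token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic's top words (assumed measure).

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs with l < m,
    where D counts the reference documents containing all given words.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word never occurs in the reference documents
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / d_l)
    return score


# Hypothetical toy documents, not taken from the Sarawak Gazette.
toy_docs = [
    ["rice", "price", "market"],
    ["rice", "market", "trade"],
    ["river", "boat", "trade"],
]

# A model that assigns probability 0.25 to every held-out token.
uniform_perplexity = perplexity([math.log(0.25)] * 4)

# Words that co-occur score higher than words that never appear together.
coherent_topic = umass_coherence(["rice", "market"], toy_docs)
mixed_topic = umass_coherence(["rice", "boat"], toy_docs)
```

Lower perplexity means the held-out tokens were better predicted, while a higher coherence score means the top words of a topic tend to appear in the same documents. The two quantities are computed from different information, which is consistent with the abstract's observation that a low-perplexity model need not yield interpretable topics.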

Published

2024-12-30

Issue

Vol. 8 No. 4 (2024): International Journal of Systematic Innovation

Section

Regular full papers

License

Copyright in a work is a bundle of rights. IJoSI's copyright covers what may be done with the work in terms of making copies, making derivative works, abstracting parts of it for citation or quotation elsewhere, and so on. IJoSI requires authors to sign over rights when their article is ready for publication so that the publisher from then on owns the work. Until that point, all rights belong to the creator(s) of the work. The format of the IJoSI copyright form can be found on the IJoSI website.
The issues of International Journal of Systematic Innovation (IJoSI) are published in electronic format and in print. Our website, journal papers, and manuscripts etc. are stored on one server. Readers can have free online access to our journal papers. Authors transfer copyright to the publisher as part of a journal publishing agreement, but have the right to:
1.   Share their article for personal use, internal institutional use and scholarly sharing purposes, with a DOI link to the version of record on our server.
2.   Retain patent, trademark and other intellectual property rights (including research data).
3.   Receive proper attribution and credit for the published work.