Code switching: exploring perplexity and coherence metrics for optimizing topic models of historical documents

Authors

  • Muhammad Abdullah Yusof
  • Suhaila Saee

DOI:

https://doi.org/10.6977/IJoSI.202412_8(4).0007

Keywords:

Topic modelling, Latent Dirichlet Allocation, hyperparameter, perplexity, topic coherence

Abstract

The Latent Dirichlet Allocation (LDA) model has two important hyperparameters: alpha (α), which controls the per-document topic distribution, and beta (β), which controls the per-topic word distribution. Finding suitable values for both hyperparameters is important for obtaining accurate topic clusters. Given the size and complexity of the data, a single evaluation method is insufficient for determining optimal and efficient hyperparameter values. An experiment was therefore conducted to study the relationship between the hyperparameters and the perplexity and coherence scores. The experiment is needed to establish a proper baseline for further topic modelling studies. It is the first study to focus on multiple languages in Sarawak Gazette data for topic modelling. The study used the LDA implementation from the Gensim package. The results show that, while perplexity is a good indicator of the model's ability to predict new or hidden data, the word cluster within a topic does not necessarily reflect similarity or relatedness between words, which compromises the interpretation of topics. Perplexity was lowest when alpha was set to 5 and beta to 0.4. Meanwhile, the coherence evaluation reflects the best number of topics for each hyperparameter value, although the status of hidden words is unknown. The coherence score was optimal when the number of topics was 5 and 4. In conclusion, perplexity is a good indicator of word prediction for each hyperparameter setting, while coherence captures a suitable number of topics for those settings to produce highly coherent word clusters within a topic. Combining both evaluation methods ensures an optimal result with interpretable output.
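
To make the perplexity measure concrete, the following minimal sketch computes per-word perplexity as the exponential of the negative average log-likelihood, given fixed document-topic and topic-word distributions. It is an illustration only, not the authors' experimental code: the function name `perplexity`, the toy matrices `theta` and `phi`, and the two toy documents are all assumptions introduced here, and the study itself used Gensim's LDA implementation rather than this hand-written routine.

```python
import math

def perplexity(docs, theta, phi):
    """Per-word perplexity of documents under fixed LDA-style parameters.

    docs  : list of documents, each a list of word ids (hypothetical toy input)
    theta : theta[d][k] = P(topic k | document d)
    phi   : phi[k][w]   = P(word w | topic k)
    """
    log_lik = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginalise over topics: P(w | d) = sum_k theta[d][k] * phi[k][w]
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p_w)
            n_words += 1
    # Lower perplexity means the model predicts the words better
    return math.exp(-log_lik / n_words)

# Toy data (assumed for illustration): two topics, four-word vocabulary
theta = [[0.9, 0.1], [0.1, 0.9]]
docs = [[0, 0, 1, 0], [3, 3, 2, 3]]

phi_uniform = [[0.25] * 4, [0.25] * 4]                       # uninformative topics
phi_peaked = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]    # topics matching the docs

ppl_uniform = perplexity(docs, theta, phi_uniform)
ppl_peaked = perplexity(docs, theta, phi_peaked)
print(ppl_uniform, ppl_peaked)
```

With uninformative topics the perplexity equals the vocabulary size, and it drops when the topics match the documents. The sketch also shows why a low perplexity says nothing directly about whether the top words of a topic are semantically related, which is the gap that coherence scoring is meant to cover.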

Published

2024-12-30