A Text Analytics Approach to Study Python Questions Posted on Stack Overflow

Authors

  • Lee Yong Meng Universiti Sains Malaysia
  • Soo Yin Yi Universiti Sains Malaysia
  • Keng Hoon Gan Universiti Sains Malaysia
  • Nur-Hana Samsudin Universiti Sains Malaysia

DOI:

https://doi.org/10.6977.IJoSI.202109_6(5).0006

Abstract

Stack Overflow (SO) is one of the largest discussion platforms where programmers from different technical backgrounds exchange ideas on a wide range of topics, including but not limited to software development and data analysis. Many programmers actively contribute to the platform and discuss the Python programming language, one of the most popular languages for data analysis. To study the topics of Python questions posted on the platform, we propose a text analytics approach that combines text preprocessing steps with the Latent Dirichlet Allocation (LDA) topic modelling algorithm, and we apply it to Python questions posted on SO from 2008 to 2016. The study has two main objectives: (1) to discover and analyze the topics of Python questions posted on SO from 2008 to 2016, so that the topics discussed in each year can be identified and compared; and (2) to analyze highly voted Python questions posted on SO over the same period using a topic model with a suitable number of topics. We find that the topics of Python questions posted on SO gradually shifted towards data modelling and analysis between 2008 and 2016. The study also shows that choosing a suitable number of topics yields a high coherence score for the topic model, which is important for extracting more meaningful topics from the collection of Python questions. For highly voted Python questions posted on SO from 2008 to 2016, a topic model with 8 topics extracts more meaningful topics.
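The pipeline outlined in the abstract (preprocess the questions, fit LDA, then choose the number of topics by coherence) can be illustrated with a minimal, standard-library sketch. The abstract does not state which implementation or coherence measure the authors used, so everything here is an assumption made for illustration: the toy questions, the `STOP_WORDS` list, the collapsed Gibbs sampler in `fit_lda`, the UMass-style score in `umass_coherence`, and the helper names themselves are all hypothetical.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stop-word list; the study's actual preprocessing is not specified here.
STOP_WORDS = {"a", "an", "the", "to", "in", "of", "how", "do", "i", "is", "with", "and", "for", "my"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling and return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic from the conditional.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word

def top_words(topic_word, n=3):
    """Return up to n most frequent words per topic, ignoring zero counts."""
    return [[w for w, c in counts.most_common() if c > 0][:n] for counts in topic_word]

def umass_coherence(topics, docs):
    """Mean UMass-style coherence over topics that have at least two top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for words in topics:
        pair_scores = [
            math.log((doc_freq(words[m], words[l]) + 1) / doc_freq(words[l]))
            for m in range(1, len(words))
            for l in range(m)
        ]
        if pair_scores:
            scores.append(sum(pair_scores) / len(pair_scores))
    return sum(scores) / len(scores) if scores else float("-inf")

def select_num_topics(docs, candidates):
    """Fit one model per candidate and return the count with the highest coherence."""
    scores = {k: umass_coherence(top_words(fit_lda(docs, k)), docs) for k in candidates}
    return max(scores, key=scores.get), scores

# Toy stand-ins for Python questions; these are not data from the study.
questions = [
    "How do I read a CSV file with pandas?",
    "Plot a pandas dataframe column with matplotlib",
    "How to merge two pandas dataframes",
    "Django model form validation error",
    "Django url routing in views",
    "Django template inheritance and static files",
    "How to sort a list of tuples",
    "Remove duplicates in a list of strings",
]
docs = [preprocess(q) for q in questions]
best_k, scores = select_num_topics(docs, [2, 3, 4])
print(best_k, scores)
```

On a real corpus, the same selection loop would run over a wider range of candidate topic counts, and a library implementation of LDA and coherence would normally replace the hand-written sampler. The per-year comparison described in the first objective would amount to grouping questions by posting year before calling `fit_lda` on each group.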

Published

2021-10-03

Issue

Section

Special Issue on Info. Tech. (2021)