Comparison of Support vector machine and Naïve Bayes Classification Algorithms Using VADER and Lexicon based Labelling on Indonesian and English Tweets

Ponco Sunarko, Arif Bijaksana Putra Negara, Rina Septiriana

Abstract


Sentiment analysis is an essential task in natural language processing that helps reveal public opinion from text, especially on social media. This research compares the effectiveness of the Naive Bayes and Support Vector Machine (SVM) algorithms for sentiment classification of tweets labelled automatically with VADER and Lexicon-based methods. The data consists of Indonesian and English tweets collected through scraping. The methodology comprises business understanding, data understanding, data preparation, modelling, evaluation, and deployment stages. In the preprocessing stage, the data is cleaned and split: 300 sentences in Indonesian and English serve as test data and are labelled manually, while 3762 Indonesian sentences and 4308 English sentences are used as training data. Measured against the manual labels, Lexicon-based labelling gave the highest automatic-labelling accuracy: 66% for Indonesian and 55% for English. Text features were extracted using TF-IDF, and the models were trained and tested with the labelled data. SVM with Lexicon-based auto-labelling performed best, with an accuracy of 44% for Indonesian and 57% for English. The combined accuracy of automatic labelling and classification was 29% for Indonesian and 31% for English. Factors such as tweet length, dictionary limitations, and the use of slang affected accuracy. The analysis also revealed biases in the data and in the auto-labelling results.
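The auto-labelling check described in the abstract can be illustrated with a minimal sketch. It assumes a toy word-score dictionary, a three-class scheme (positive, negative, neutral), and whitespace tokenisation. None of these are specified in the abstract, and the names `LEXICON`, `lexicon_label`, and `accuracy` are hypothetical. The last lines note that the reported combined figures are numerically consistent with the product of the labelling and classification accuracies, although the abstract does not state how they were computed.

```python
# Hypothetical toy lexicon; the dictionaries used in the paper are not given in the abstract.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}


def lexicon_label(text, lexicon=LEXICON):
    """Sum the scores of known words and map the total to a sentiment label."""
    score = sum(lexicon.get(token, 0) for token in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def accuracy(predicted, reference):
    """Fraction of positions where the two label lists agree."""
    if len(predicted) != len(reference):
        raise ValueError("label lists must have the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up example tweets with made-up manual labels (not data from the paper).
tweets = [
    "I love this phone",
    "terrible service today",
    "the bus arrived",
    "not good at all",  # negation is invisible to a plain word-sum lexicon
]
manual = ["positive", "negative", "neutral", "negative"]

auto = [lexicon_label(t) for t in tweets]
agreement = accuracy(auto, manual)

# Assumption: treating the combined accuracy as labelling accuracy times
# classification accuracy reproduces the reported 29% and 31%.
combined_indonesian = round(0.66 * 0.44, 2)
combined_english = round(0.55 * 0.57, 2)
```

In this toy run the last tweet is mislabelled because the lexicon has no handling for negation, a small instance of the dictionary limitations the abstract mentions.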




DOI: https://doi.org/10.26418/juara.v3i1.86468



Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

