Part-of-Speech (POS) Tagger for Malay Language using Naïve Bayes and K-Nearest Neighbor Model
Shamsan Gaber, Mohd Zakree Ahmad Nazri, Nazlia Omar, Salwani Abdullah
Effective part-of-speech (POS) tagging is essential in the era of the Fourth Industrial Revolution, as high-technology machines such as cars and smart homes can be controlled by human voice commands. A POS tagger is important in many domains, including information retrieval. POS tags such as verb or noun can, in turn, be used as features for higher-level natural language processing (NLP) tasks such as named entity recognition, sentiment analysis, and question-answering chatbots. However, research on developing an effective POS tagger for the Malay language is still in its infancy. Many existing methods that have been tested on English have not been tested on the Malay language. This study presents an experiment to tag Malay words using the supervised machine learning (ML) approach. The purpose of this work is to investigate the performance of supervised ML approaches in tagging Malay words and the effectiveness of affix-based feature patterns. The Naïve Bayes and k-nearest neighbor models were used to assign a specific tag to each word. A corpus obtained from Dewan Bahasa dan Pustaka (DBP) was used in this experiment. DBP has defined a tagset of 21 tags (categories) for the corpus. We used corpora of two sizes in the tests: 20,000 tokens and 40,000 tokens. Moreover, affix-based feature patterns were extracted from the corpora to improve the tagging process.
Natural Language Processing, Machine Learning, Part-of-speech Tagging, Malay Language
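
To make the approach described in the abstract concrete, the following is a minimal sketch of affix-based features fed to a Naïve Bayes tagger and a k-nearest neighbor tagger. It is illustrative only: the affix lists, the two tag names (`VERB`, `NOUN`), the toy training words, and all function and class names are assumptions introduced here, not the DBP tagset, the DBP corpus, or the feature patterns used in the study.

```python
import math
from collections import Counter, defaultdict

# Hypothetical affix lists for illustration; not the feature patterns of the study.
# Longer prefixes come first so that "meng" is tried before "me".
PREFIXES = ("meng", "mem", "men", "me", "ber", "di", "ter", "pe")
SUFFIXES = ("kan", "an", "i", "nya")


def affix_features(word):
    """Return two string features: the matched prefix and the matched suffix."""
    w = word.lower()
    # Require at least three remaining characters so short words are not split.
    prefix = next((p for p in PREFIXES if w.startswith(p) and len(w) > len(p) + 2), "none")
    suffix = next((s for s in SUFFIXES if w.endswith(s) and len(w) > len(s) + 2), "none")
    return [f"prefix={prefix}", f"suffix={suffix}"]


class NaiveBayesTagger:
    """Naive Bayes over affix features with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.tag_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, tagged_words):
        for word, tag in tagged_words:
            self.tag_counts[tag] += 1
            for feat in affix_features(word):
                self.feature_counts[tag][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, word):
        total = sum(self.tag_counts.values())
        feats = affix_features(word)
        best_tag, best_score = None, -math.inf
        for tag, count in self.tag_counts.items():
            # log P(tag) + sum of log P(feature | tag), smoothed.
            score = math.log(count / total)
            denom = sum(self.feature_counts[tag].values()) + self.alpha * len(self.vocab)
            for feat in feats:
                score += math.log((self.feature_counts[tag][feat] + self.alpha) / denom)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag


def knn_predict(tagged_words, word, k=3):
    """Majority vote among the k training words with the fewest differing features."""
    target = affix_features(word)

    def distance(other):
        return sum(a != b for a, b in zip(target, affix_features(other)))

    nearest = sorted(tagged_words, key=lambda pair: distance(pair[0]))[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]


# Toy training data, made up for this sketch (not drawn from the DBP corpus).
TRAIN = [
    ("membaca", "VERB"), ("menulis", "VERB"), ("berjalan", "VERB"), ("dibeli", "VERB"),
    ("makanan", "NOUN"), ("minuman", "NOUN"), ("pakaian", "NOUN"), ("bacaan", "NOUN"),
]

tagger = NaiveBayesTagger().fit(TRAIN)
for unseen in ("menari", "mainan"):
    print(unseen, tagger.predict(unseen), knn_predict(TRAIN, unseen))
```

On this toy data both models tag `menari` as `VERB` and `mainan` as `NOUN`, because the prefix `men` appears only among the verbs and the prefix-less, `an`-suffixed pattern dominates the nouns. A realistic tagger would also need context features and a far larger training set, such as the 20,000- and 40,000-token corpora mentioned above.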