Part-of-Speech (POS) Tagger for Malay Language using Naïve Bayes and K-Nearest Neighbor Model
DOI:
https://doi.org/10.61841/98h6jw82
Keywords:
Natural Language Processing, Machine Learning, Part-of-Speech Tagging, Malay Language
Abstract
Effective part-of-speech (POS) tagging is essential in the era of the fourth industrial revolution, as high-technology machines such as cars and smart homes can be controlled by human voice commands. POS taggers are important in many domains, including information retrieval. POS tags such as verb or noun can, in turn, be used as features for higher-level natural language processing (NLP) tasks such as named entity recognition, sentiment analysis, and question-answering chatbots. However, research on developing an effective POS tagger for the Malay language is still in its infancy. Many existing methods that have been tested on English have not been tested on Malay. This study presents an experiment to tag Malay words using the supervised machine learning (ML) approach. The purpose of this work is to investigate the performance of supervised ML approaches in tagging Malay words and the effectiveness of affix-based feature patterns. The Naïve Bayes and k-nearest neighbor models were used to assign a specific tag to each word. A corpus obtained from Dewan Bahasa dan Pustaka (DBP) was used in this experiment. DBP has defined a tagset of 21 tags (categories) for the corpus. We used corpora of two sizes for the tests, containing 20,000 and 40,000 tokens. Moreover, affix-based feature patterns were extracted from the corpora to improve the tagging process.
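As a rough illustration of the approach the abstract describes, the following minimal sketch tags words from affix-based features with a Naïve Bayes model and a k-nearest neighbor model. It is not the authors' implementation: the affix inventory, the two-tag label set (VERB/NOUN), the eight training words, and all function names are assumptions made for the example, and the DBP corpus and its 21-tag tagset are not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Assumed, illustrative subset of Malay affixes (not the paper's feature set).
PREFIXES = ("meng", "mem", "men", "ber", "ter", "me", "pe", "di", "ke")
SUFFIXES = ("kan", "nya", "an", "i")

def affix_features(word):
    """Return the longest matching prefix and suffix, or '-' if none matches."""
    w = word.lower()
    def pick(affixes, matches):
        for a in sorted(affixes, key=len, reverse=True):
            # Require at least two characters to remain besides the affix.
            if matches(a) and len(w) > len(a) + 1:
                return a
        return "-"
    return {"prefix": pick(PREFIXES, w.startswith),
            "suffix": pick(SUFFIXES, w.endswith)}

def train_nb(samples):
    """Count tags and (tag, feature) -> value occurrences for Naive Bayes."""
    tag_counts = Counter(tag for _, tag in samples)
    feat_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen per feature name
    for word, tag in samples:
        for name, value in affix_features(word).items():
            feat_counts[(tag, name)][value] += 1
            values[name].add(value)
    return tag_counts, feat_counts, values

def predict_nb(model, word):
    """Pick the tag with the highest log-posterior, using add-one smoothing."""
    tag_counts, feat_counts, values = model
    total = sum(tag_counts.values())
    feats = affix_features(word)
    def log_score(tag):
        score = math.log(tag_counts[tag] / total)
        for name, value in feats.items():
            num = feat_counts[(tag, name)][value] + 1
            den = tag_counts[tag] + len(values[name])
            score += math.log(num / den)
        return score
    return max(tag_counts, key=log_score)

def predict_knn(samples, word, k=3):
    """Majority tag among the k samples with the fewest mismatched features."""
    target = affix_features(word)
    def distance(sample):
        feats = affix_features(sample[0])
        return sum(feats[name] != target[name] for name in target)
    nearest = sorted(samples, key=distance)[:k]
    return Counter(tag for _, tag in nearest).most_common(1)[0][0]

# Toy training data with hypothetical tags; a real run would use a tagged corpus.
TRAIN = [("membaca", "VERB"), ("menulis", "VERB"), ("berjalan", "VERB"),
         ("mengambil", "VERB"), ("makanan", "NOUN"), ("minuman", "NOUN"),
         ("tulisan", "NOUN"), ("pakaian", "NOUN")]

model = train_nb(TRAIN)
for w in ("memasak", "bacaan"):
    print(w, predict_nb(model, w), predict_knn(TRAIN, w))
```

On this toy data both models tag "memasak" as VERB and "bacaan" as NOUN, which only shows that affixes can carry a usable signal; it says nothing about the accuracy reported in the study.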
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt: remix, transform, and build upon the material for any purpose, even commercially.
The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions: You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.