Part-of-Speech (POS) Tagger for Malay Language using Naïve Bayes and K-Nearest Neighbor Model

Authors

  • Shamsan Gaber, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia

DOI:

https://doi.org/10.61841/98h6jw82

Keywords:

Natural Language Processing, Machine Learning, Part-of-Speech Tagging, Malay Language

Abstract

Effective part-of-speech (POS) tagging is essential in the era of the Fourth Industrial Revolution, as high-technology machines such as cars and smart homes can be controlled using human voice commands. A POS tagger is important in many domains, including information retrieval. POS tags such as verb or noun can, in turn, be used as features for higher-level natural language processing (NLP) tasks such as named entity recognition, sentiment analysis, and question-answering chatbots. However, research on developing an effective POS tagger for the Malay language is still in its infancy, and many methods that have been evaluated on English have not yet been evaluated on Malay. This study presents an experiment in tagging Malay words using a supervised machine learning (ML) approach. The purpose of this work is to investigate the performance of supervised ML approaches in tagging Malay words and the effectiveness of affix-based feature patterns. Naïve Bayes and k-nearest neighbor models were used to assign a specific tag to each word. A corpus obtained from Dewan Bahasa dan Pustaka (DBP) was used in this experiment; DBP defines a tagset of 21 tags (categories) for the corpus. Two corpus sizes were used in the tests: 20,000 tokens and 40,000 tokens. In addition, affix-based feature patterns were extracted from the corpora to improve tagging.
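The approach described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the feature scheme (prefixes and suffixes of up to three characters), the Laplace smoothing, the Jaccard similarity used for the neighbor search, the value k = 3, the eight training words, and the two tags `KN` (kata nama, noun) and `KK` (kata kerja, verb) are all assumptions chosen for illustration, and the tags do not reproduce the 21-tag DBP tagset.

```python
import math
from collections import Counter, defaultdict


def affix_features(word, max_len=3):
    """Prefix and suffix strings of length 1..max_len, e.g. 'pre2=me', 'suf2=an'."""
    w = word.lower()
    feats = []
    for n in range(1, max_len + 1):
        if len(w) > n:
            feats.append(f"pre{n}={w[:n]}")
            feats.append(f"suf{n}={w[-n:]}")
    return feats


class NaiveBayesTagger:
    """Multinomial Naive Bayes over affix features with add-one smoothing."""

    def fit(self, words, tags):
        self.tag_counts = Counter(tags)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for w, t in zip(words, tags):
            for f in affix_features(w):
                self.feat_counts[t][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, word):
        total = sum(self.tag_counts.values())
        # Features never seen in training are ignored.
        feats = [f for f in affix_features(word) if f in self.vocab]
        best_tag, best_score = None, -math.inf
        for t, c in self.tag_counts.items():
            denom = sum(self.feat_counts[t].values()) + len(self.vocab)
            score = math.log(c / total)
            for f in feats:
                score += math.log((self.feat_counts[t][f] + 1) / denom)
            if score > best_score:
                best_tag, best_score = t, score
        return best_tag


class KNNTagger:
    """k-nearest neighbors using Jaccard similarity between affix feature sets."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, words, tags):
        self.examples = [(set(affix_features(w)), t) for w, t in zip(words, tags)]
        return self

    def predict(self, word):
        feats = set(affix_features(word))
        ranked = sorted(
            self.examples,
            key=lambda ex: len(feats & ex[0]) / len(feats | ex[0]),
            reverse=True,
        )
        votes = Counter(t for _, t in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical toy training data; not drawn from the DBP corpus.
train = [
    ("makanan", "KN"), ("minuman", "KN"), ("pakaian", "KN"), ("mainan", "KN"),
    ("membaca", "KK"), ("menulis", "KK"), ("berjalan", "KK"), ("bermain", "KK"),
]
words, tags = zip(*train)
nb = NaiveBayesTagger().fit(words, tags)
knn = KNNTagger(k=3).fit(words, tags)

for w in ("menari", "bacaan"):
    print(w, nb.predict(w), knn.predict(w))
```

In this toy setting the verbal prefix `me-` and the nominal suffix `-an` carry most of the signal, which is the intuition behind affix-based features for a morphologically rich language such as Malay. A realistic tagger would also need context features and a far larger training set.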

 

Published

30.06.2020

How to Cite

Gaber, S. (2020). Part-of-Speech (POS) Tagger for Malay Language using Naïve Bayes and K-Nearest Neighbor Model. International Journal of Psychosocial Rehabilitation, 24(6), 5468-5476. https://doi.org/10.61841/98h6jw82