Text Automation Classification Using the Lexical Feature Representation Method

Murnawan, R.A.E. Virgana, Sri Lestari

Abstract:

Automated text classification plays an important role in organizing text documents and in determining the characteristics of a document. Uncovering characteristics or information hidden in a large dataset is essential because unstructured documents carry many different meanings and purposes. A dedicated method is therefore needed to surface the important information contained in a text document. The feature representation methods used in this research are N-grams and bag of concepts, an extension of bag of words that reduces the computation needed to form feature representations. The purpose of this research is to design a feature that automatically categorizes uncategorized questions contained in a text document by applying lexical feature representation concepts such as bag of concepts, bag of words, and N-grams. Data models were built in WEKA from combinations of lexical features (unigram, bigram, trigram, and the keywords of each category) and evaluated with 10-fold cross-validation. The experiments show that the combination of bigram, trigram, and keyword features yields the highest percentage of correctly classified instances among the feature combinations tested: 96.5% with the J48 tree classifier.
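
As an illustration of the lexical features named in the abstract, the following minimal Python sketch builds bigram, trigram, and keyword features for a single question. It is not the authors' WEKA pipeline: the function names `word_ngrams` and `lexical_features`, the whitespace tokenization, the feature-name prefixes, and the sample keywords are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined, lowercased strings."""
    tokens = text.lower().split()  # assumed: simple whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_features(text, keywords, orders=(2, 3)):
    """Count n-grams of the given orders and flag category keywords present."""
    features = Counter()
    for n in orders:
        for gram in word_ngrams(text, n):
            features[f"{n}gram={gram}"] += 1
    tokens = set(text.lower().split())
    for kw in keywords:
        if kw in tokens:
            features[f"kw={kw}"] = 1  # binary keyword indicator
    return features


# Hypothetical question and hypothetical category keywords.
question = "how do I reset my account password"
feats = lexical_features(question, keywords={"password", "invoice"})
```

Each question then becomes a sparse feature vector that a classifier such as a decision tree could consume; passing `orders=(1, 2, 3)` would add unigrams to the combination.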

Keywords:

text classification, feature representation, bag of concepts, n-gram

Paper Details
Month: 5
Year: 2020
Volume: 24
Issue: 1
Pages: 2497-2506