Text Mining Based on Tax Comments as Big Data Analysis Using XGBOOST and Feature Selection

Authors

  • G. VISWANATH, Associate Professor, Department of CSE, St. Johns College of Engineering and Technology, Yemmiganur, Kurnool (Dist.)
  • T. ABDUL RAHEEM, Associate Professor, Department of CSE, St. Johns College of Engineering and Technology, Yemmiganur, Kurnool (Dist.)

DOI:

https://doi.org/10.61841/5ehw1236

Keywords:

XGBoost, software program, support vector machines, Python, data mining, decision tree, random forest, correlation mining, KNN

Abstract

With the rapid development of the Internet, big data is being applied in a great many domains. However, high-dimensional data often contains redundant or irrelevant features, so feature selection is particularly important: subsets are built from new features and then consumed by machine learning algorithms such as XGBoost. Applying big data theories, frameworks, models, and methods together with machine learning techniques to obtain reliable, real-time early-warning information is the inevitable trend. This study proposes rapid feature selection using the XGBoost model in distributed environments, which improves model training efficiency under distributed conditions. The GBT model, based on gradient-optimized decision trees, outperformed the other two models in both accuracy and real-time performance, meeting the requirements of the big data setting. XGBoost runs on a single machine as well as on the distributed processing frameworks Apache Hadoop and Apache Spark. Gradient descent can be used to fit the gradient boosting model: in the case of a regression tree, each leaf node produces the average gradient of the samples with similar features that reach it. Feature selection is a fundamental step in data preprocessing and an important research topic in data mining and machine learning tasks such as classification.
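To make the selection step concrete, the following is a minimal sketch of XGBoost-driven feature selection on comment text, assuming the xgboost and scikit-learn packages are available; the sample comments, labels, and importance threshold are hypothetical placeholders, not the paper's dataset.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

# Hypothetical tax-comment snippets and labels (1 = complaint, 0 = other).
comments = [
    "refund was processed quickly",
    "the tax portal keeps crashing",
    "payment deadline was clearly communicated",
    "penalty notice arrived with calculation errors",
]
labels = [0, 1, 0, 1]

# Turn raw text into a high-dimensional sparse feature matrix; many of
# these term features will be redundant or irrelevant.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

# Fit a gradient-boosted tree model; its per-feature importance scores
# are what the feature-selection step ranks features by.
model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, labels)

# Keep only the features whose importance exceeds the mean importance.
selector = SelectFromModel(model, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)

The reduced matrix would then feed the training step; the same ranking idea can precede distributed training on Hadoop or Spark, which the paper targets.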

References

[1] R. Bekkerman. The present and the future of the KDD Cup competition: an outsider's perspective.

[2] R. Bekkerman, M. Bilenko, and J. Langford. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, New York, NY, USA, 2011.

[3] J. Bennett and S. Lanning. The Netflix Prize. In Proceedings of the KDD Cup Workshop 2007, pages 3–6, New York, Aug. 2007.

[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.

[5] C. Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11:23–581, 2010.

[6] O. Chapelle and Y. Chang. Yahoo! Learning to Rank Challenge overview. Journal of Machine Learning Research - W & CP, 14:1–24, 2011.

[7] T. Chen, H. Li, Q. Yang, and Y. Yu. General functional matrix factorization using gradient boosting. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), volume 1, pages 436–444, 2013.

[8] T. Chen, S. Singh, B. Taskar, and C. Guestrin. Efficient second-order gradient boosting for conditional random fields. In Proceedings of the 18th Artificial Intelligence and Statistics Conference (AISTATS'15), volume 1, 2015.

[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[10] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189–1232, 2001.

[11] J. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.

[12] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.

[13] J. H. Friedman and B. E. Popescu. Importance sampled learning ensembles, 2003.

[14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 58–66, 2001.

[15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Quiñonero Candela. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD'14, 2014.

[16] P. Li. Robust LogitBoost and adaptive base class (ABC) LogitBoost. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI'10), pages 302–311, 2010.

[17] P. Li, Q. Wu, and C. J. Burges. McRank: Learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems 20, pages 897–904, 2008.

[18] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1–7, 2016.

[19] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. Proceedings of the VLDB Endowment, 2(2):1426–1437, Aug. 2009.

[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[21] G. Ridgeway. Generalized Boosted Models: A guide to the gbm package.

[22] S. Tyree, K. Weinberger, K. Agrawal, and J. Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th International Conference on World Wide Web, pages 387–396. ACM, 2011.

[23] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, 2009.

[24] Q. Zhang and W. Wang. A fast algorithm for approximate quantiles in high speed data streams. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management, 2007.

[25] T. Zhang and R. Johnson. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5), 2014.

[26] Bo Pang and Lillian Lee. "A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts." In ACL '04: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 2004, Article No. 271.

[27] Xu, Shuo; Li, Yan; Wang, Zheng. Bayesian Gaussian Naïve Bayes Classifier to Text Classification. 2017, pages 347–352. doi:10.1007/978-981-10-5041-1_57.

[28] https://www.datacamp.com/community/tutorials/random-forests-classifier-python

[29] Ben-Hur, Asa, and Jason Weston. "A user's guide to support vector machines."

[30] Louppe, Gilles. "Understanding random forests: From theory to practice." arXiv preprint arXiv:1407.7502, 2014.

[31] Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. arXiv 2016, arXiv:1603.02754.

[32] Phoboo, A. E. Machine Learning wins the Higgs Challenge. ATLAS News, 20 November 2014.

Published

30.06.2020

How to Cite

VISWANATH, G., & RAHEEM, T. (2020). Text Mining Based on Tax Comments as Big Data Analysis Using XGBOOST and Feature Selection. International Journal of Psychosocial Rehabilitation, 24(6), 18898-18906. https://doi.org/10.61841/5ehw1236