Text Mining Based on Tax Comments as Big Data Analysis Using XGBoost and Feature Selection
DOI:
https://doi.org/10.61841/5ehw1236
Keywords:
XGBoost, software program, support vector machines, Python, data mining, decision tree, random forest, correlation mining, KNN
Abstract
With the rapid development of the Internet, big data is being applied in many domains. High-dimensional data, however, often contains redundant or irrelevant features, which makes feature selection particularly important: subsets of informative features are constructed and fed to machine learning algorithms such as XGBoost. Applying big-data theory, frameworks, models, and methods together with machine learning techniques to obtain reliable, real-time early-warning information is an inevitable future trend. This study proposes rapid feature selection using an XGBoost model in a distributed environment, which improves model-training efficiency under distributed conditions. The gradient-boosted tree (GBT) model, based on gradient-optimised decision trees, outperformed the two comparison models in both accuracy and real-time performance, meeting the requirements of a big-data setting. XGBoost runs on a single machine as well as on the distributed frameworks Apache Hadoop and Apache Spark. Gradient descent can be used for the gradient-boosting model: in a regression tree, each leaf node produces an average gradient among the samples with similar features that it contains. Feature selection is a basic step in data preprocessing and an important research topic in data mining and machine-learning tasks such as classification.
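The gradient-boosting mechanics the abstract describes, where each regression-tree leaf outputs the average gradient (under squared error, the average residual) of the samples routed to it, can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the study uses the XGBoost library, and all function names and the toy data below are invented for the example. Counting how often each feature is chosen for a split gives a crude importance signal of the kind feature selection builds on.

```python
# Pure-Python sketch of gradient boosting with regression stumps.
# Squared-error loss, so the negative gradient is simply the residual,
# and each leaf predicts the mean residual of the samples it contains.

def fit_stump(X, residuals):
    """Pick the (feature, threshold) split that minimises squared error;
    each of the two leaves predicts the mean residual of its samples."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    _, j, t, lm, rm = best
    return (lambda row: lm if row[j] <= t else rm), j

def predict(row, base, stumps, lr=0.3):
    return base + lr * sum(s(row) for s in stumps)

def gradient_boost(X, y, n_rounds=20, lr=0.3):
    """Each round fits a stump to the current residuals; counting how
    often each feature is chosen gives a crude importance score."""
    base = sum(y) / len(y)
    stumps, split_counts = [], [0] * len(X[0])
    for _ in range(n_rounds):
        preds = [predict(row, base, stumps, lr) for row in X]
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump, j = fit_stump(X, residuals)
        stumps.append(stump)
        split_counts[j] += 1
    return base, stumps, split_counts

# Toy data: the target depends only on the first feature, so the split
# counts should flag it as the more informative feature.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 9.0], [4.0, 2.0], [5.0, 7.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
base, stumps, counts = gradient_boost(X, y)
preds = [round(predict(row, base, stumps), 2) for row in X]
```

XGBoost refines this basic loop with second-order gradient information, regularised split scoring, and distributed training (e.g. on Apache Hadoop or Apache Spark), which is what makes the approach practical at big-data scale.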
International Journal of Psychosocial Rehabilitation, Vol. 24, Issue 06, 2020
ISSN: 1475-7192
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format for any purpose, even commercially.
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
- The licensor cannot revoke these freedoms as long as you follow the license terms.
Under the following terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.
Notices:
You do not have to comply with the license for elements of the material in the public domain or where your use is permitted by an applicable exception or limitation.
No warranties are given. The license may not give you all of the permissions necessary for your intended use. For example, other rights such as publicity, privacy, or moral rights may limit how you use the material.