Proceedings of the International Conference on Information Economy, Data Modeling and Cloud Computing, ICIDC 2022, 17-19 June 2022, Qingdao, China

Research Article

Research on Income Forecasting based on Machine Learning Methods and the Importance of Features

  • @INPROCEEDINGS{10.4108/eai.17-6-2022.2322745,
        author={Jinglin Wang},
        title={Research on Income Forecasting based on Machine Learning Methods and the Importance of Features},
        proceedings={Proceedings of the International Conference on Information Economy, Data Modeling and Cloud Computing, ICIDC 2022, 17-19 June 2022, Qingdao, China},
        publisher={EAI},
        proceedings_a={ICIDC},
        year={2022},
        month={10},
        keywords={income; classification; gini importance; random forest; knn},
        doi={10.4108/eai.17-6-2022.2322745}
    }
    
  • Jinglin Wang
    Year: 2022
    Research on Income Forecasting based on Machine Learning Methods and the Importance of Features
    ICIDC
    EAI
    DOI: 10.4108/eai.17-6-2022.2322745
Jinglin Wang1,*
  • 1: Foreign Language School attached to Guangxi Normal University
*Contact email: Axon.Wang@outlook.com

Abstract

In modern society, age has a significant impact on the income distribution of employees. However, little research has focused on the precise impact of the different factors that determine income, or on their application to predicting a person’s income. Using census income data for 48,842 individuals from the Adult Data Set, this study aims to predict an individual’s annual income level with machine learning approaches based on 13 attributes of the person (age, workclass, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week and native-country) and to determine the key factors in the prediction. For income prediction, 32,561 randomly selected individuals are used to train the classification models; the Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Logistic Regression (LR) and Naïve Bayes (NB) algorithms are adopted. Since the accuracy of RF is greater than 0.9 in this task, Gini importance is used to measure the relevance of each feature to the predicted income. Among these five methods, the RF and KNN models perform relatively well, with accuracies of 0.97973 and 0.8976 respectively. The age of the employee shows the highest relevance to his or her likely income, with an importance of 0.225.
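To make the notion of Gini importance more concrete, the following is a minimal sketch of the quantity it is built from: the drop in Gini impurity produced by a single split on one feature. In a random forest, a feature's Gini importance is usually obtained by accumulating such weighted drops over every split that uses the feature, averaging over the trees and normalizing. The function names, the thresholds and the eight-row toy sample are assumptions made purely for illustration; they are not the paper's code, and the values are not taken from the Adult Data Set.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(values, labels, threshold):
    """Gini impurity of the parent node minus the size-weighted
    impurity of the two children created by `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Toy, made-up sample (hypothetical values, NOT from the Adult Data Set).
ages = [22, 25, 31, 38, 45, 52, 58, 63]
hours = [40, 20, 45, 40, 50, 40, 35, 40]
income = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K", "<=50K"]

# Hypothetical split thresholds chosen by hand for the illustration.
age_gain = impurity_decrease(ages, income, threshold=35)
hours_gain = impurity_decrease(hours, income, threshold=40)

print(f"impurity decrease, age <= 35:   {age_gain:.3f}")
print(f"impurity decrease, hours <= 40: {hours_gain:.3f}")
```

On this toy sample the split on age separates the two income classes far better than the split on hours, so it yields the larger impurity decrease. A real forest searches over thresholds and features rather than using hand-picked ones, so this only illustrates the bookkeeping behind the reported importance scores.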