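Big Data Analysis - Salary Level Analysis of Data Professionals
I. Background
In recent years, "data" has become an increasingly popular term. Across the industry, companies are paying more and more attention to the value that data brings and are gradually building data operations into their business plans. As companies place more weight on data, demand for data-related positions has risen, with explosive growth over the past few years.
In today's job market, data analyst is the most prominent of the data-related positions. Data analysis has also entered university curricula as a discipline in its own right and has become a highly sought-after field.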
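II. Big Data Analysis Design
1. Data content and feature analysis of the data set
This case study analyzes workers' salary levels based on their age, work class, education, marital status, sex, weekly working hours, country, race, state, city, and family relationship.
Field name | Field type | Description |
---|---|---|
Age | Numeric | Age |
workclass | String | Work class |
education | String | Education |
marital-status | String | Marital status |
sex | String | Sex |
hours-per-week | Numeric | Hours worked per week |
salary | Numeric | Salary |
COUNTRY | String | Country |
CITY | String | City |
STATE | String | State |
race | String | Race |
relationship | String | Family relationship |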
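2. Overview of the analysis plan
(1) First apply the required preprocessing to the data set and compute the correlation between each field and salary level.
(2) Visualize the relationship between each field and salary with Python, so that the information in the data is easier to read.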
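As a minimal sketch of step (1), the correlation of each numeric field with a binary salary label can be read off one column of the correlation matrix. The toy records below are made up for illustration and are not taken from the data set; `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Toy records standing in for the real data set (values are invented for illustration).
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "hours-per-week": [40, 45, 50, 60, 40, 35],
    "salary": [0, 0, 1, 1, 1, 0],  # 1 = above 50k, 0 = otherwise
})

# Pearson correlation of every numeric column with the salary label,
# sorted from the strongest positive correlation downwards.
corr_with_salary = (
    toy.corr(numeric_only=True)["salary"]
    .drop("salary")
    .sort_values(ascending=False)
)
print(corr_with_salary)
```

With a 0/1 label, the Pearson coefficient is the point-biserial correlation, so it measures linear association only and says nothing about causation.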
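III. Data Analysis Steps
1. Data source
The data set comes from Kaggle.
URL: https://www.kaggle.com/datasets/iamsouravbanerjee/analytics-industry-salaries-2022-india
2. Data cleaning
Import the data
import pandas as pd
df = pd.read_csv('./salary.csv')
df.head()
Output
Check the number of rows and columns
df.shape
Output
Check for missing values
df.isnull().sum()
Output
Count duplicate rows
df.duplicated().sum()
Output
Remove duplicate rows
# Remove duplicate rows, keeping the first occurrence
df.drop_duplicates(keep='first', inplace=True)
View data set information
df.info()
Output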
Work class counts
# Drop the columns that are not used in this analysis
df.drop(columns=['fnlwgt', 'education-num', 'relationship', 'capital-loss'], inplace=True)
df['workclass'].value_counts()
Output
Education counts
df['education'].value_counts()
Output
Marital status counts
df['marital-status'].value_counts()
Output
Occupation counts
df['occupation'].value_counts()
Output
Native country counts
df['native-country'].value_counts()
Output
Race counts
df['race'].value_counts()
Output
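Data cleaning plan
- Merge the education values "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th" and "12th" into a single "school" category
- Merge "Assoc-voc", "Assoc-acdm", "Prof-school" and "Some-college" into a single "higher" category
- Merge the marital statuses "Married-civ-spouse" and "Married-AF-spouse" into "married" and treat all remaining statuses as "other"
- Replace "?" with the mode of its column
- In the country column most values are United-States, so all other values are treated as one category
- In the race column, keep "White" and group all other values as "Others"
- Most records belong to the "Private" work class, so keep it and group the rest as "Others"
- Encode the salary and sex columns as 1 and 0 (the models also require this)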
# Step 1: merge education levels (regex=True also matches values that carry a leading space)
df['education'] = df['education'].replace(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th'], 'school', regex=True)
df['education'] = df['education'].replace(['Assoc-voc', 'Assoc-acdm', 'Prof-school', 'Some-college'], 'higher', regex=True)
# Step 2: merge marital statuses into "married" and "other"
df['marital-status'] = df['marital-status'].replace(['Married-civ-spouse', 'Married-AF-spouse'], 'married', regex=True)
df['marital-status'] = df['marital-status'].replace(['Divorced', 'Separated', 'Widowed', 'Married-spouse-absent', 'Never-married'], 'other', regex=True)
# Step 3: replace "?" with the most frequent value of each column
# regex=False makes "?" a literal character instead of a regex quantifier
df['workclass'] = df['workclass'].str.replace('?', 'Private', regex=False)
df['occupation'] = df['occupation'].str.replace('?', 'Prof-specialty', regex=False)
df['native-country'] = df['native-country'].str.replace('?', 'United-States', regex=False)
# Step 4: keep United-States and group every other country as "Others"
df['native-country'] = df['native-country'].where(df['native-country'].str.strip() == 'United-States', 'Others')
# Step 5: keep White and group every other race as "Others"
df['race'] = df['race'].where(df['race'].str.strip() == 'White', 'Others')
# Step 6: keep Private and group every other work class as "Others"
df['workclass'] = df['workclass'].where(df['workclass'].str.strip() == 'Private', 'Others')
# Step 7: encode salary and sex as 0/1
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['salary'] = encoder.fit_transform(df['salary'])
df['sex'] = encoder.fit_transform(df['sex'])
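Step 3 writes the most frequent value of each column in by hand. As a sketch of the "replace ? with the mode" rule from the cleaning plan, the mode can instead be computed from the column itself. The helper name `fill_unknown_with_mode` and the toy values are hypothetical, and unlike the code above this version also strips surrounding whitespace.

```python
import pandas as pd

def fill_unknown_with_mode(series: pd.Series, unknown: str = "?") -> pd.Series:
    """Replace the placeholder `unknown` with the most frequent known value."""
    cleaned = series.str.strip()            # drop leading/trailing spaces
    known = cleaned[cleaned != unknown]     # ignore the placeholder when counting
    most_common = known.mode().iloc[0]      # first mode if several values tie
    return cleaned.where(cleaned != unknown, most_common)

# Hypothetical sample of a workclass-like column.
toy = pd.Series([" Private", " ?", " Private", " State-gov", " ?"])
filled = fill_unknown_with_mode(toy)
print(filled.tolist())
```

Computing the mode this way keeps the imputation correct if the data set changes, at the cost of one extra pass over the column.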
View the data
df.head()
Output
3. Data visualization
Plot the data
import matplotlib.pyplot as plt
import seaborn as sns
df.info()
Output
Plot the share of salaries above and below 50k
plt.pie(df['salary'].value_counts(), labels=['0', '1'], autopct='%0.2f')
plt.show()
Output
The chart shows that nearly 76% of workers earn less than 50k and only about 24% earn more than 50k.
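Plot the distribution of workers by country
plt.pie(df['native-country'].value_counts(), labels=['US', 'Others'], autopct='%0.2f')
plt.show()
Output
The chart shows that most workers in the data set are from the United States.
Pie chart of marital status
# Take the labels from value_counts() so that each slice is labeled with its own category
marital_counts = df['marital-status'].value_counts()
plt.pie(marital_counts, labels=marital_counts.index, autopct='%0.2f')
plt.show()
Output
The chart shows that married and unmarried workers are roughly equal in number in this data set.
Relationship between marital status and salary
sns.histplot(df[df['salary'] == 0]['marital-status'])
sns.histplot(df[df['salary'] == 1]['marital-status'], color='red')
Output
The chart shows that married workers tend to have higher salaries.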
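Age versus salary
sns.histplot(df[df['salary'] == 0]['age'])
sns.histplot(df[df['salary'] == 1]['age'], color='red')
Output
The chart shows directly that more people earn above 50k in the 38 to 48 age range than in other age ranges.
Education versus salary
sns.histplot(df[df['salary'] == 0]['education'])
sns.histplot(df[df['salary'] == 1]['education'], color='red')
Output
The chart shows that salary rises with education level; in the selected data, workers with a doctorate are predominantly above 50k.
Occupation versus salary level
sns.histplot(df[df['salary'] == 0]['occupation'])
sns.histplot(df[df['salary'] == 1]['occupation'], color='red')
plt.xticks(rotation='vertical')
plt.show()
Output
The relationship is not very pronounced, but Exec-managerial stands out with a clearly higher share of high earners, followed by Prof-specialty.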
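Overlaid histograms show counts, so a "higher share of high earners" is easier to check numerically. A minimal sketch, assuming the salary column is already encoded as 0/1: the mean of that column within each category is the share of workers above 50k. The helper name `high_salary_rate` and the toy rows are hypothetical.

```python
import pandas as pd

def high_salary_rate(df: pd.DataFrame, column: str, target: str = "salary") -> pd.Series:
    """Share of rows with target == 1 inside each category of `column`."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "occupation": ["Exec-managerial", "Exec-managerial", "Sales",
                   "Sales", "Sales", "Adm-clerical"],
    "salary": [1, 0, 0, 0, 1, 0],
})
rates = high_salary_rate(toy, "occupation")
print(rates)
```

The same call works for any categorical column, for example `high_salary_rate(df, 'education')` on the cleaned data.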
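Race distribution in the data set
plt.pie(df['race'].value_counts(), labels=['white', 'others'], autopct='%0.2f')
plt.show()
Output
The chart shows that the vast majority of records come from White workers. This limits the analysis, because occupations are distributed differently across races.
Work class distribution (Private versus others)
df['workclass'].unique()
plt.pie(df['workclass'].value_counts(), labels=['private', 'others'], autopct='%0.2f')
plt.show()
Output
The chart shows that about 75% of workers belong to the Private work class.
Weekly working hours versus salary
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df[df['salary'] == 0]['hours-per-week'], kde=True, stat='density')
sns.histplot(df[df['salary'] == 1]['hours-per-week'], kde=True, stat='density', color='red')
plt.xticks(rotation='vertical')
plt.show()
Output
The chart shows that the relationship between weekly working hours and salary is subtle: longer hours do not simply mean higher pay, and workers who work 40 to 60 hours a week are the most likely to earn more.
Relationship between salary and age, sex and working hours
df_heat = df[df['capital-gain'] < 6000]
# numeric_only=True skips the string columns, which corr() cannot handle (pandas 1.5 or later)
sns.heatmap(df_heat.corr(numeric_only=True), annot=True)
Output
The heatmap shows that age, sex and working hours are all correlated with salary.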
4. Analysis process and algorithms used
df.head()
Output
Standardize the numeric features so that differences in scale do not affect the models
from sklearn.preprocessing import StandardScaler
y = df['salary']
df.drop('salary', axis=1, inplace=True)
num_cols = [x for x in df.columns if df[x].dtype != 'object']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
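As a small worked example of what standardization does, `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`). The four hours-per-week values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One hypothetical numeric column, shaped (n_samples, 1) as scikit-learn expects.
hours = np.array([[20.0], [40.0], [40.0], [60.0]])

# Manual z-score: (x - mean) / std, using the population standard deviation.
manual = (hours - hours.mean(axis=0)) / hours.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(hours)

print(manual.ravel())
print(scaled.ravel())
```

After scaling, the column has mean 0 and standard deviation 1, which keeps features with large ranges from dominating distance-based or gradient-based models such as SVC and logistic regression.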
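Convert the categorical columns to one-hot variables, so that the models can use string-valued fields and no artificial ordering is imposed on the categories
df['native-country'] = df['native-country'].apply(lambda x: x.strip())
cat_col = [x for x in df.columns if df[x].dtype == 'object']
df = pd.get_dummies(df, columns=cat_col, drop_first=True)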
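To illustrate what `drop_first=True` does, the sketch below encodes a toy education column (values chosen for illustration). The first category in sorted order is dropped and serves as the baseline, so a row of all zeros means that category.

```python
import pandas as pd

toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "HS-grad"]})

# Three categories become two indicator columns; "Bachelors" is the dropped baseline.
encoded = pd.get_dummies(toy, columns=["education"], drop_first=True)
print(list(encoded.columns))
print(encoded)
```

Dropping one level avoids perfectly collinear indicator columns, which matters mainly for linear models such as logistic regression.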
Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, shuffle=True, random_state=41)
print('Shape of training feature:', X_train.shape)
print('Shape of testing feature:', X_test.shape)
print('Shape of training label:', y_train.shape)
print('Shape of testing label:', y_test.shape)
Output
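The split above shuffles the rows but does not stratify them. Since only about 24% of the labels are positive, one option (not used in the original analysis) is to pass `stratify=y`, which keeps the class ratio nearly identical in both parts. A sketch on synthetic data with the same 76/24 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 76 negatives and 24 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 76 + [1] * 24)

# stratify=y preserves the class proportions in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=41
)
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```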
Import different machine learning models and train them with their default parameters
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
df
Output
Compare the algorithms
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(LogisticRegression(random_state=random_state))
# 5-fold cross-validated accuracy for every classifier
cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_train, y_train, scoring="accuracy", cv=5, n_jobs=4))
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
cv_res = pd.DataFrame({"CrossValMeans": cv_means, "CrossValerrors": cv_std, "Algorithm": ["SVC", "DecisionTree", "AdaBoost", "RandomForest", "ExtraTrees", "GradientBoosting", "Logist"]})
# x and y must be passed as keyword arguments in recent seaborn versions
g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", xerr=cv_std)
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
Output
GBDT, SVC, logistic regression and random forest turn out to perform relatively better.
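The comparison rests on the mean and standard deviation of the fold scores. As a self-contained sketch of that computation, the example below runs 5-fold cross-validation on synthetic data from `make_classification` (an assumption for illustration, not the salary data) and passes an explicit `StratifiedKFold` so that each fold keeps the class ratio.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (roughly 76% / 24%).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.76, 0.24], random_state=2
)

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, scoring="accuracy", cv=cv
)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Two models whose mean scores differ by less than about one standard deviation are hard to rank from this comparison alone.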
Fine-tune the four best-performing classifiers with grid search
GBC = GradientBoostingClassifier()
# The 'loss' entry is omitted: the default loss is the one the original grid selected
# (named "deviance" in older scikit-learn releases and "log_loss" in newer ones)
gb_param_grid = {'n_estimators': [100, 200, 300],
                 'learning_rate': [0.1, 0.05, 0.01],
                 'max_depth': [4, 8],
                 'min_samples_leaf': [100, 150],
                 'max_features': [0.3, 0.1]}
gsGBC = GridSearchCV(GBC, param_grid=gb_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gsGBC.fit(X_train, y_train)
GBC_best = gsGBC.best_estimator_
gsGBC.best_score_
Output
svc = SVC()
# No trailing comma here: the grid must be a dict, not a tuple containing a dict
svc_param_grid = {'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}
gsmvc = GridSearchCV(svc, param_grid=svc_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gsmvc.fit(X_train, y_train)
mvc_best = gsmvc.best_estimator_
gsmvc.best_score_
Output
RFC = RandomForestClassifier()
rf_param_grid = {"max_depth": [None],
                 "max_features": [1, 3, 10],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "bootstrap": [False],
                 "n_estimators": [100, 300],
                 "criterion": ["gini"]}
gsRFC = GridSearchCV(RFC, param_grid=rf_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gsRFC.fit(X_train, y_train)
RFC_best = gsRFC.best_estimator_
gsRFC.best_score_
Output
# liblinear supports both the l1 and l2 penalties; the dual formulation is only valid with l2
logC = LogisticRegression(solver='liblinear')
log_param_grid = [{'penalty': ['l1', 'l2'], 'dual': [False], 'C': [0.01, 0.1, 1, 10]},
                  {'penalty': ['l2'], 'dual': [True], 'C': [0.01, 0.1, 1, 10]}]
gslogC = GridSearchCV(logC, param_grid=log_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gslogC.fit(X_train, y_train)
logC_best = gslogC.best_estimator_
gslogC.best_score_
Output
chosen_classifiers = [GBC_best, logC_best, mvc_best, RFC_best]
# Model evaluation
from sklearn import metrics
def evaluate_model(model, x_test, y_test):
    # Predict on the test data
    y_pred = model.predict(x_test)
    # Calculate accuracy, precision, recall and F1 score
    acc = metrics.accuracy_score(y_test, y_pred)
    prec = metrics.precision_score(y_test, y_pred)
    rec = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    # Compute the confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    return {'acc': acc, 'prec': prec, 'rec': rec, 'f1': f1, 'cm': cm}
GBC_eval = evaluate_model(GBC_best, X_test, y_test)
print('Accuracy:', GBC_eval['acc'])
print('Precision:', GBC_eval['prec'])
print('Recall:', GBC_eval['rec'])
print('F1 Score:', GBC_eval['f1'])
print('Confusion Matrix:\n', GBC_eval['cm'])
Output
logC_best_eval = evaluate_model(logC_best, X_test, y_test)
print('Accuracy:', logC_best_eval['acc'])
print('Precision:', logC_best_eval['prec'])
print('Recall:', logC_best_eval['rec'])
print('F1 Score:', logC_best_eval['f1'])
print('Confusion Matrix:\n', logC_best_eval['cm'])
Output
mvc_best_eval = evaluate_model(mvc_best, X_test, y_test)
print('Accuracy:', mvc_best_eval['acc'])
print('Precision:', mvc_best_eval['prec'])
print('Recall:', mvc_best_eval['rec'])
print('F1 Score:', mvc_best_eval['f1'])
print('Confusion Matrix:\n', mvc_best_eval['cm'])
Output
RFC_best_eval = evaluate_model(RFC_best, X_test, y_test)
print('Accuracy:', RFC_best_eval['acc'])
print('Precision:', RFC_best_eval['prec'])
print('Recall:', RFC_best_eval['rec'])
print('F1 Score:', RFC_best_eval['f1'])
print('Confusion Matrix:\n', RFC_best_eval['cm'])
Output
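To make the printed metrics easier to interpret, the sketch below derives accuracy, precision, recall and F1 by hand from the four cells of a binary confusion matrix. The counts are hypothetical and are not results from this analysis; the helper name `metrics_from_confusion` is also an assumption.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Binary classification metrics computed from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # found among actual positives
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # harmonic mean of the two
    return {"acc": accuracy, "prec": precision, "rec": recall, "f1": f1}

# Hypothetical counts: 1000 test rows, 240 of them actually above 50k.
example = metrics_from_confusion(tn=700, fp=60, fn=100, tp=140)
print(example)
```

In this made-up case the accuracy is 0.84 while the recall is only about 0.58, which shows how a model can look good on accuracy and still miss many high earners.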
import numpy as np
def plot_learning_curve(models, X, y):
    for model in models:
        train_sizes, train_scores, test_scores = learning_curve(model, X, y, n_jobs=-1)
        # Average the scores over the cross-validation folds
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
        plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
        plt.xlabel('Training set size')
        plt.ylabel('Accuracy')
        plt.legend()
        # Use the class name as the title instead of the full estimator repr
        plt.title(type(model).__name__)
        plt.show()
plot_learning_curve(chosen_classifiers, X_train, y_train)
Output
The learning curves visualize the tuned models, all of which perform reasonably well. Adding more data would likely improve accuracy further.
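A learning curve can also be read numerically: the gap between the mean training score and the mean cross-validation score at each training size is a rough overfitting indicator. The sketch below uses an unpruned decision tree on synthetic data (both are assumptions for illustration, not the tuned models from this analysis).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise so that memorizing the training set does not generalize.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=5, train_sizes=np.linspace(0.2, 1.0, 5), scoring="accuracy",
)

# Mean over folds, then the train/validation gap per training size.
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for n, g in zip(sizes, gap):
    print(f"n={n:4d}  gap={g:.3f}")
```

A gap that stays large as the training size grows suggests the model is too flexible for the data; a gap that shrinks suggests that more data helps.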
5. Full code
import numpy as np
import pandas as pd

# Data cleaning
df = pd.read_csv('./salary.csv')
df.head()
# Check the number of rows and columns
df.shape
# Check for missing values
df.isnull().sum()
# Count duplicate rows
df.duplicated().sum()
# Remove duplicate rows
df.drop_duplicates(keep='first', inplace=True)
# View data set information
df.info()
df.drop(columns=['fnlwgt', 'education-num', 'relationship', 'capital-loss'], inplace=True)
# Work class counts
df['workclass'].value_counts()
# Education counts
df['education'].value_counts()
# Marital status counts
df['marital-status'].value_counts()
# Occupation counts
df['occupation'].value_counts()
# Native country counts
df['native-country'].value_counts()
# Race counts
df['race'].value_counts()
# Step 1: merge education levels
df['education'] = df['education'].replace(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th'], 'school', regex=True)
df['education'] = df['education'].replace(['Assoc-voc', 'Assoc-acdm', 'Prof-school', 'Some-college'], 'higher', regex=True)
# Step 2: merge marital statuses
df['marital-status'] = df['marital-status'].replace(['Married-civ-spouse', 'Married-AF-spouse'], 'married', regex=True)
df['marital-status'] = df['marital-status'].replace(['Divorced', 'Separated', 'Widowed', 'Married-spouse-absent', 'Never-married'], 'other', regex=True)
# Step 3: replace "?" with the most frequent value (regex=False treats "?" literally)
df['workclass'] = df['workclass'].str.replace('?', 'Private', regex=False)
df['occupation'] = df['occupation'].str.replace('?', 'Prof-specialty', regex=False)
df['native-country'] = df['native-country'].str.replace('?', 'United-States', regex=False)
# Step 4: group all countries other than United-States
df['native-country'] = df['native-country'].where(df['native-country'].str.strip() == 'United-States', 'Others')
# Step 5: group all races other than White
df['race'] = df['race'].where(df['race'].str.strip() == 'White', 'Others')
# Step 6: group all work classes other than Private
df['workclass'] = df['workclass'].where(df['workclass'].str.strip() == 'Private', 'Others')
# Step 7: encode salary and sex as 0/1
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['salary'] = encoder.fit_transform(df['salary'])
df['sex'] = encoder.fit_transform(df['sex'])
df.head()

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
df.info()
# Salary level
plt.pie(df['salary'].value_counts(), labels=['0', '1'], autopct='%0.2f')
plt.show()
# Country
plt.pie(df['native-country'].value_counts(), labels=['US', 'Others'], autopct='%0.2f')
plt.show()
# Marital status (labels taken from value_counts() so each slice gets its own category)
marital_counts = df['marital-status'].value_counts()
plt.pie(marital_counts, labels=marital_counts.index, autopct='%0.2f')
plt.show()
sns.histplot(df[df['salary'] == 0]['marital-status'])
sns.histplot(df[df['salary'] == 1]['marital-status'], color='red')
# Age
sns.histplot(df[df['salary'] == 0]['age'])
sns.histplot(df[df['salary'] == 1]['age'], color='red')
# Education
sns.histplot(df[df['salary'] == 0]['education'])
sns.histplot(df[df['salary'] == 1]['education'], color='red')
# Occupation versus salary
sns.histplot(df[df['salary'] == 0]['occupation'])
sns.histplot(df[df['salary'] == 1]['occupation'], color='red')
plt.xticks(rotation='vertical')
plt.show()
# Race
plt.pie(df['race'].value_counts(), labels=['white', 'others'], autopct='%0.2f')
plt.show()
df['workclass'].unique()
# Work class
plt.pie(df['workclass'].value_counts(), labels=['private', 'others'], autopct='%0.2f')
plt.show()
# Weekly working hours versus salary (histplot replaces the deprecated distplot)
sns.histplot(df[df['salary'] == 0]['hours-per-week'], kde=True, stat='density')
sns.histplot(df[df['salary'] == 1]['hours-per-week'], kde=True, stat='density', color='red')
plt.xticks(rotation='vertical')
plt.show()
# Capital gain versus salary
sns.histplot(df[df['salary'] == 0]['capital-gain'], kde=True, stat='density')
sns.histplot(df[df['salary'] == 1]['capital-gain'], kde=True, stat='density', color='red')
plt.xticks(rotation='vertical')
plt.show()
# Combined view of salary against age, sex and working hours
df_heat = df[df['capital-gain'] < 6000]
sns.heatmap(df_heat.corr(numeric_only=True), annot=True)
# Age distribution
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.histplot(df["age"], kde=True, ax=ax[0])
sns.boxplot(x=df["age"], ax=ax[1])
# Outlier detection with the IQR rule
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outside_bounds = (df["age"] > upper_bound) | (df["age"] < lower_bound)
outlier_count = (outside_bounds | (df["age"] <= 0)).sum()
print("{} has {} outliers".format("age", outlier_count))
# Replace values outside the bounds with the median
mn = int(df["age"].median())
df.loc[outside_bounds, "age"] = mn

# Standardization and one-hot encoding
from sklearn.preprocessing import StandardScaler
y = df['salary']
df.drop('salary', axis=1, inplace=True)
num_cols = [x for x in df.columns if df[x].dtype != 'object']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
df['native-country'] = df['native-country'].apply(lambda x: x.strip())
cat_col = [x for x in df.columns if df[x].dtype == 'object']
df = pd.get_dummies(df, columns=cat_col, drop_first=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, shuffle=True, random_state=41)
print('Shape of training feature:', X_train.shape)
print('Shape of testing feature:', X_test.shape)
print('Shape of training label:', y_train.shape)
print('Shape of testing label:', y_test.shape)

# Import different machine learning models
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
df
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(LogisticRegression(random_state=random_state))
cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_train, y_train, scoring="accuracy", cv=5, n_jobs=4))
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
cv_res = pd.DataFrame({"CrossValMeans": cv_means, "CrossValerrors": cv_std, "Algorithm": ["SVC", "DecisionTree", "AdaBoost", "RandomForest", "ExtraTrees", "GradientBoosting", "Logist"]})
g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", xerr=cv_std)
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")

# Fine-tune GBDT, SVC, logistic regression and random forest
GBC = GradientBoostingClassifier()
# The default loss is used ("deviance" in older scikit-learn releases, "log_loss" in newer ones)
gb_param_grid = {'n_estimators': [100, 200, 300],
                 'learning_rate': [0.1, 0.05, 0.01],
                 'max_depth': [4, 8],
                 'min_samples_leaf': [100, 150],
                 'max_features': [0.3, 0.1]}
gsGBC = GridSearchCV(GBC, param_grid=gb_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gsGBC.fit(X_train, y_train)
GBC_best = gsGBC.best_estimator_
gsGBC.best_score_
svc = SVC()
svc_param_grid = {'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]}
gsmvc = GridSearchCV(svc, param_grid=svc_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gsmvc.fit(X_train, y_train)
mvc_best = gsmvc.best_estimator_
gsmvc.best_score_
RFC = RandomForestClassifier()
rf_param_grid = {"max_depth": [None],
                 "max_features": [1, 3, 10],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "bootstrap": [False],
                 "n_estimators": [100, 300],
                 "criterion": ["gini"]}
gsRFC = GridSearchCV(RFC, param_grid=rf_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gsRFC.fit(X_train, y_train)
RFC_best = gsRFC.best_estimator_
gsRFC.best_score_
# liblinear supports l1 and l2; the dual formulation is only valid with l2
logC = LogisticRegression(solver='liblinear')
log_param_grid = [{'penalty': ['l1', 'l2'], 'dual': [False], 'C': [0.01, 0.1, 1, 10]},
                  {'penalty': ['l2'], 'dual': [True], 'C': [0.01, 0.1, 1, 10]}]
gslogC = GridSearchCV(logC, param_grid=log_param_grid, cv=5, scoring="accuracy", n_jobs=4, verbose=1)
gslogC.fit(X_train, y_train)
logC_best = gslogC.best_estimator_
gslogC.best_score_
chosen_classifiers = [GBC_best, logC_best, mvc_best, RFC_best]

# Model evaluation
from sklearn import metrics
def evaluate_model(model, x_test, y_test):
    # Predict on the test data
    y_pred = model.predict(x_test)
    # Calculate accuracy, precision, recall and F1 score
    acc = metrics.accuracy_score(y_test, y_pred)
    prec = metrics.precision_score(y_test, y_pred)
    rec = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    # Compute the confusion matrix
    cm = metrics.confusion_matrix(y_test, y_pred)
    return {'acc': acc, 'prec': prec, 'rec': rec, 'f1': f1, 'cm': cm}
GBC_eval = evaluate_model(GBC_best, X_test, y_test)
print('Accuracy:', GBC_eval['acc'])
print('Precision:', GBC_eval['prec'])
print('Recall:', GBC_eval['rec'])
print('F1 Score:', GBC_eval['f1'])
print('Confusion Matrix:\n', GBC_eval['cm'])
logC_best_eval = evaluate_model(logC_best, X_test, y_test)
print('Accuracy:', logC_best_eval['acc'])
print('Precision:', logC_best_eval['prec'])
print('Recall:', logC_best_eval['rec'])
print('F1 Score:', logC_best_eval['f1'])
print('Confusion Matrix:\n', logC_best_eval['cm'])
mvc_best_eval = evaluate_model(mvc_best, X_test, y_test)
print('Accuracy:', mvc_best_eval['acc'])
print('Precision:', mvc_best_eval['prec'])
print('Recall:', mvc_best_eval['rec'])
print('F1 Score:', mvc_best_eval['f1'])
print('Confusion Matrix:\n', mvc_best_eval['cm'])
RFC_best_eval = evaluate_model(RFC_best, X_test, y_test)
print('Accuracy:', RFC_best_eval['acc'])
print('Precision:', RFC_best_eval['prec'])
print('Recall:', RFC_best_eval['rec'])
print('F1 Score:', RFC_best_eval['f1'])
print('Confusion Matrix:\n', RFC_best_eval['cm'])

# Learning curves
def plot_learning_curve(models, X, y):
    for model in models:
        train_sizes, train_scores, test_scores = learning_curve(model, X, y, n_jobs=-1)
        train_scores_mean = np.mean(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
        plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
        plt.xlabel('Training set size')
        plt.ylabel('Accuracy')
        plt.legend()
        plt.title(type(model).__name__)
        plt.show()
plot_learning_curve(chosen_classifiers, X_train, y_train)
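IV. Summary
1. Through analysis and mining of the data we reached our intended goal: salary level is closely related to education, age, occupation, sex, working hours and region. The data show directly that higher education goes with a higher salary level. The analysis of weekly working hours shows that longer hours do not simply mean higher pay; workers who work 40 to 60 hours a week are the most likely to earn more.
2. The experiment followed three stages: data preprocessing, data visualization, and model selection with parameter tuning. For tuning, we first selected the models that scored best with their default parameters, since they are the most likely to reach higher scores after tuning, and then tuned them with grid search. We found that several learners overfit quite severely, as the large gap between the training score and the cross-validation score shows; random forest is the most obvious case, followed by GBDT. In terms of accuracy all of the learners do very well, but on recall and F1 score every model still has considerable room for improvement. This is probably related to the amount of data and to the class imbalance of the data set, and a larger data set would be needed to confirm it.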
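The contrast between accuracy and recall on an imbalanced data set can be illustrated with a trivial baseline. In this sketch (synthetic labels with the same 76/24 split as the salary classes; the features are dummies), a classifier that always predicts the majority class reaches 76% accuracy while finding none of the high earners.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 76 workers at or below 50k (0) and 24 above 50k (1).
y = np.array([0] * 76 + [1] * 24)
X = np.zeros((100, 1))  # features are irrelevant for this baseline

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

acc = accuracy_score(y, pred)
rec = recall_score(y, pred)
print("accuracy:", acc)
print("recall:", rec)
```

Any model's accuracy is therefore best judged against this 76% floor, and recall or F1 on the positive class is the more informative measure here.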