Otto Product Classification (Part 2) -- Hyperparameter Tuning for Logistic Regression
This demo uses the data from the Otto Group Product Classification Challenge hosted on Kaggle in 2015. It first fits LogisticRegression with default parameters, then tunes hyperparameters with LogisticRegression + GridSearchCV (LogisticRegressionCV can be used instead).
The Otto dataset is a multi-class product classification problem provided by the well-known e-commerce company Otto: there are 9 classes, and each sample has 93 numeric features (integer counts of certain events, already anonymized).
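Since LogisticRegressionCV is mentioned as a substitute, here is a minimal sketch of that route; the Cs list below is an illustrative assumption, and note that LogisticRegressionCV searches only over C, not over the penalty type:

# Sketch: LogisticRegressionCV tunes C internally by cross-validation,
# so no separate GridSearchCV wrapper is needed (candidate Cs assumed here)
from sklearn.linear_model import LogisticRegressionCV

lr_cv = LogisticRegressionCV(Cs=[0.1, 1, 10, 100, 1000],  # assumed candidates
                             cv=3, scoring='neg_log_loss',
                             solver='liblinear', penalty='l1')
# lr_cv.fit(X_train, y_train)
# print(lr_cv.C_)  # best C per class (one-vs-rest)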
*** Training ***
1. Read the data
# Read the data
# Try the log(x+1) and tf-idf features as well, and compare the results
import pandas as pd

dpath = './data/'
train = pd.read_csv(dpath + 'Otto_FE_train_org.csv')
print(train.head())
2. Prepare the data
y_train = train['target']
X_train = train.drop(['id', 'target'], axis=1)

# Save the feature names for later use (visualization)
feat_names = X_train.columns

# Most sklearn estimators accept sparse-matrix input, which makes training much faster
# To check whether an estimator supports sparse data, see whether its fit function
# accepts X: {array-like, sparse matrix}
# You can use timeit to compare training time on dense vs. sparse data
from scipy.sparse import csr_matrix
X_train = csr_matrix(X_train)
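As the comment suggests, timeit can compare dense and sparse training; a possible sketch, timing a single fit of each (purely illustrative):

# Sketch: time one fit on sparse vs. dense input
import timeit
from sklearn.linear_model import LogisticRegression

X_dense = X_train.toarray()  # back to a dense ndarray for the comparison
t_sparse = timeit.timeit(lambda: LogisticRegression().fit(X_train, y_train), number=1)
t_dense = timeit.timeit(lambda: LogisticRegression().fit(X_dense, y_train), number=1)
print('sparse fit: %.1fs  dense fit: %.1fs' % (t_sparse, t_dense))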
3. Logistic Regression with default parameters
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Cross-validation is used to evaluate model performance and to tune parameters (model selection)
# For classification tasks, cross-validation defaults to StratifiedKFold
# The dataset is fairly large, so use 3-fold cross-validation
from sklearn.model_selection import cross_val_score
loss = cross_val_score(lr, X_train, y_train, cv=3, scoring='neg_log_loss')
# %timeit loss_sparse = cross_val_score(lr, X_train_sparse, y_train, cv=3, scoring='neg_log_loss')
print('cv logloss per fold:', -loss)
print('mean cv logloss:', -loss.mean())
cv logloss per fold: [0.79764036 0.79738583 0.79737361]
mean cv logloss: 0.7974666008423363
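Since classification defaults to StratifiedKFold, cv=3 above is shorthand for passing the splitter explicitly; spelling it out is useful when shuffling or a fixed random_state is needed:

# Sketch: the explicit equivalent of cv=3 for a classifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=3)  # each fold keeps the 9 classes in proportion
loss = cross_val_score(lr, X_train, y_train, cv=skf, scoring='neg_log_loss')
print('mean cv logloss:', -loss.mean())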
4. Logistic Regression + GridSearchCV
For logistic regression the hyperparameters to tune are C (the regularization coefficient; its candidate values are usually spaced uniformly on a log scale) and the penalty type (L1/L2). The objective function is J = C * Σ_i logloss(f(x_i), y_i) + penalty.
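Spelled out in the binary case (liblinear fits the 9-class problem one-vs-rest), sklearn's objective has the same shape as the J above; note that C scales the loss term, so larger C means weaker regularization:

% Objective of sklearn's LogisticRegression, binary case, y_i in {-1, +1}
\min_{w,b}\; C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (w^{\top} x_i + b)}\right)
  + \begin{cases} \lVert w \rVert_1 & \text{(L1 penalty)} \\ \tfrac{1}{2} \lVert w \rVert_2^2 & \text{(L2 penalty)} \end{cases}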
Within the sklearn framework, the parameter-tuning steps are the same for every estimator:
1. Set the parameter search range
2. Create an instance of the estimator (with its parameters set)
3. Create a GridSearchCV instance (with its parameters set)
4. Call GridSearchCV's fit method
Hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Parameters to tune
# Try tuning the L1 and L2 penalties separately, each paired with a suitable solver
# tuned_parameters = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Parameter search ranges
penaltys = ['l1', 'l2']
Cs = [0.1, 1, 10, 100, 1000]

# The parameter grid is 2x5; every point on the grid is evaluated
tuned_parameters = dict(penalty=penaltys, C=Cs)

lr_penalty = LogisticRegression(solver='liblinear')
grid = GridSearchCV(lr_penalty, tuned_parameters, cv=3, scoring='neg_log_loss',
                    n_jobs=4, return_train_score=True)  # train scores are needed for the plot below
grid.fit(X_train, y_train)
The tuned_parameters grid is written the same way for other estimators; for example, with an SVC:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

parameters = {'kernel': ['linear'], 'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(), param_grid=parameters, cv=5)
grid.fit(X_train_part, y_train_part)
y_predict = grid.predict(X_val)
accuracy = accuracy_score(y_val, y_predict)
print("accuracy={}".format(accuracy))
print("params={} scores={}".format(grid.best_params_, grid.best_score_))
Retrieve the tuned parameters; note that best_score_ here is the negative log loss:
# Examine the best model
print(-grid.best_score_)
print(grid.best_params_)
Output:
best_score: 0.6728475285576403
best_params: {'C': 100, 'penalty': 'l1'}
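The full 2x5 grid can also be inspected as a table before plotting; a quick sketch:

# Sketch: view all grid points and their CV scores as a DataFrame
import pandas as pd

cv_df = pd.DataFrame(grid.cv_results_)
print(cv_df[['param_penalty', 'param_C', 'mean_test_score', 'std_test_score']])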
Plot the training and validation curves:
# Plot the CV error curves
import numpy as np
import matplotlib.pyplot as plt

test_means = grid.cv_results_['mean_test_score']
test_stds = grid.cv_results_['std_test_score']
train_means = grid.cv_results_['mean_train_score']
train_stds = grid.cv_results_['std_train_score']

# Plot results
n_Cs = len(Cs)
number_penaltys = len(penaltys)
test_scores = np.array(test_means).reshape(n_Cs, number_penaltys)
train_scores = np.array(train_means).reshape(n_Cs, number_penaltys)
test_stds = np.array(test_stds).reshape(n_Cs, number_penaltys)
train_stds = np.array(train_stds).reshape(n_Cs, number_penaltys)

X_axis = np.log10(Cs)
for i, value in enumerate(penaltys):
    # scores are neg_log_loss, so negate them to plot logloss
    plt.errorbar(X_axis, -test_scores[:, i], yerr=test_stds[:, i],
                 label=penaltys[i] + ' Test')
    plt.errorbar(X_axis, -train_scores[:, i], yerr=train_stds[:, i],
                 label=penaltys[i] + ' Train')
plt.legend()
plt.xlabel('log(C)')
plt.ylabel('logloss')
plt.savefig('LogisticGridSearchCV_c.png')
plt.show()
The figure shows the logloss of the L1- and L2-penalized models on the training and test sets across the different values of C. On the training set, larger C (less regularization) gives better performance; on the test set, performance is best at C=100 with the L1 penalty.
5. Save the model for later testing
import pickle  # cPickle is the Python 2 name; use pickle in Python 3
pickle.dump(grid.best_estimator_, open('Otto_L1_org.pkl', 'wb'))
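joblib is another common way to persist sklearn models (an alternative sketch, not what this demo uses; the .joblib filename is an assumption):

# Sketch: joblib handles the large numpy arrays inside sklearn models efficiently
import joblib
joblib.dump(grid.best_estimator_, 'Otto_L1_org.joblib')  # assumed filename
# lr_best = joblib.load('Otto_L1_org.joblib')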
*** Testing ***
1. Read the data
# Read the data
# Try the log(x+1) and tf-idf features as well, and compare the results
# Models trained on these different feature encodings can be combined via stacking
dpath = './data/'
test1 = pd.read_csv(dpath + 'Otto_FE_test_org.csv')
test2 = pd.read_csv(dpath + 'Otto_FE_test_tfidf.csv')

# Drop the redundant id column
test2 = test2.drop(['id'], axis=1)
test = pd.concat([test1, test2], axis=1, ignore_index=False)
print(test.head())
2. Prepare the data
test_id = test['id']
X_test = test.drop(['id'], axis=1)

# Save the feature names for later use (visualization)
feat_names = X_test.columns

# Most sklearn estimators accept sparse input, which makes prediction much faster
from scipy.sparse import csr_matrix
X_test = csr_matrix(X_test)

# Load the trained model
import pickle
lr_best = pickle.load(open('Otto_Lr_org_tfidf.pkl', 'rb'))

# Output the per-class probabilities
y_test_pred = lr_best.predict_proba(X_test)
print(y_test_pred.shape)
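A quick sanity check, not in the original, that the test matrix matches what the loaded model expects:

# Sketch: coef_ has shape (n_classes, n_features), so the column counts must agree
print(lr_best.coef_.shape)  # expected: (9, n_features)
assert X_test.shape[1] == lr_best.coef_.shape[1], 'train/test feature mismatch'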
3. Generate the predictions
# Generate the prediction file for submission
import numpy as np

out_df = pd.DataFrame(y_test_pred)
columns = np.empty(9, dtype=object)
for i in range(9):
    columns[i] = 'Class_' + str(i + 1)
out_df.columns = columns
out_df = pd.concat([test_id, out_df], axis=1)
out_df.to_csv('LR_org_tfidf.csv', index=False)
With the original feature encoding, the score on Kaggle's Private Leaderboard is *****