LASSO-Logistic模型--基于R语⾔glmnet包
R语⾔中glmnet包是⽐较重要且流⾏的包之⼀,曾被誉为“三驾马车”之⼀。从包名就可以⼤致推测出,glmnet主要是使⽤Elastic-Net来实现GLM,⼴⼤的user可以通过该包使⽤Lasso 、 Elastic-Net 等Regularized⽅式来完成Linear Regression、 Logistic 、Multinomial Regression 等模型的构建。本⼈学习了CRAN上Glmnet_Vignette.pdf⽂档,有⼀些体会,⾸先是Linear Regression,然后是Logistic Regression(Binomial Models),还是直接上代码吧。
>>>>>>>>>>>>>>>>>>>>#  #glmnet学习#
#原理⼤致如下:The elastic-net penalty is controlled by a(alpha⽤英⽂字母a代替), and bridges the
#gap between lasso (a= 1, the default) and ridge (a = 0). The tuning parameter Lambda controls the
#overall strength of the penalty
#LASSO回归与Ridge回归同属于⼀个被称为Elastic Net的⼴义线性模型家族。 这⼀家族的模型除了相同作⽤的参数λ之外,
#还有另⼀个参数α来控制应对⾼相关性(highly correlated)数据时模型的性状。
#LASSO回归α=1,Ridge回归α=0,⼀般Elastic Net模型0<α<1。
#以下的数据和脚本来⾃Glmnet_Vignette.pdf,说明为本⼈理解。
#Introduction  Glmnet_Vignette.pdf中对Glmnet的介绍
#Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization
#path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter
#lambda. The algorithm is extremely fast, and can exploit sparsity in the input matrix x. It fits linear, logistic
#and multinomial, poisson, and Cox regression models. A variety of predictions can be made from the fitted
#models. It can also fit multi-response linear regression.
#The authors of glmnet are Jerome Friedman, Trevor Hastie, Rob Tibshirani and Noah Simon, and the R
#package is maintained by Trevor Hastie. The matlab version of glmnet is maintained by Junyang Qian.
>>>>>>>>>>>>>>>>>>>>## #⼀、Linear Regression
library(Matrix)
library(foreach)
library(glmnet)
data(QuickStartExample)
#使⽤data()可以发现x、y是glmnet包中的2个数据集
#说明1:nlambda是Lambda的个数,weights是每个观测的权重
fit = glmnet(x, y, alpha = 0.2, weights = c(rep(1,50),rep(2,50)), nlambda = 20)
fit
#说明2:Df (the number of nonzero coefficients), %dev (the percent deviance explained) and Lambda (the corresponding value of Lambda).
#Df  %Dev  Lambda
#[1,]  0 0.0000 7.939000
#[2,]  4 0.1789 4.889000
#[3,]  7 0.4445 3.011000
#[4,]  7 0.6567 1.854000
#[5,]  8 0.7850 1.142000
#[6,]  9 0.8539 0.703300
#[7,] 10 0.8867 0.433100
#说明2.1:LASSO回归复杂度调整的程度由参数λ来控制,λ越⼤对变量较多的线性模型的惩罚⼒度就越⼤,从⽽最终获得⼀个变量较少的模型。
#参照上⾯Lambda和Df的值也可以发现此规律
#说明3:⽂档中提到梯度下降的计算有2种停⽌⽅式:According to the default internal settings, the computations stop if either the fractional
#change in deviance down the path is less than 10-5 or the fraction of explained deviance reaches 0.999.
#说明4:关于选择⾮零参数的个数,可以通过列表的⽅式(fit,print(fit)),也可以画图的⽅式,画图横坐标有3中不同的参数
#Users can decide what is on the X-axis. xvar allows three measures: “norm” for the L1-norm of the coefficients
#(default), “lambda” for the log-lambda value and “dev” for %deviance explained.
#说明5:label = TRUE是在图上标明变量的序号(顺序),Users can also label the curves with variable sequence numbers simply by setting
#label = TRUE.
plot(fit, xvar = "lambda", label = TRUE);plot(fit, xvar = "lambda")
#说明6:顶端的横坐标应该是当前Lambda下⾮零变量的个数:The axis above indicates the number of nonzero coefficients at the current Lambda, #which is the effective degrees of freedom (df ) for the lasso
cvfit = cv.glmnet(x, y, asure = "mse", nfolds = 20)
#说明7:glmnet返回的是⼀系列不同Lambda对应的值(⼀组模型),需要user来选择⼀个Lambda,交叉验证是最常⽤挑选Lambda的⽅法
variable used in lambda#The function glmnet returns a sequence of models for the users to choose from. In many cases, users may
#prefer the software to select one of them. Cross-validation is perhaps the simplest and most widely used
#method for that task.cv.glmnet is the main function to do cross-validation here, along with various supporting methods such as
#plotting and prediction.
#关于lambda.1se、lambda.min的⼀种解释
#lambda.min is the value of Lambda that gives minimum mean cross-validated error. The other Lambda saved is lambda.1se,
#which gives the most regularized model such that error is within one standard error of the minimum. To use
#that, we only need to replace lambda.min with lambda.1se above.
#关于lambda.1se、lambda.min的另⼀种解释
#Functions coef and predict on cv.glmnet object are similar to those for a glmnet object, except that two
#special strings are also supported by s (the values of Lambda requested): * “lambda.1se”: the largest Lambda at which
#the MSE is within one standard error of the minimal MSE. “lambda.min”: the Lambda at which the minimal MSE is achieved.
coef(cvfit, s = "lambda.min");as.matrix(coef(cvfit, s = "lambda.min"))
plot(cvfit)
#说明7.1:lambda.min(误差最⼩)、lambda.1se(误差最⼩⼀个标准差内,模型最简单),对应图中两根竖线的地⽅
#It includes the cross-validation curve (red dotted line), and upper and lower standard deviation(标准差) curves along
#the lambda sequence (error bars). Two selected lambda’s are indicated by the vertical dotted lines (see below).
#说明8:结果是个稀疏矩阵,⽤as.matirx后就可以变成正常矩阵
predict(cvfit, newx = x[1:5,], s = 'lambda.min')
predict(cvfit, newx = x[1:5,], s = c(0.1,0.2))
predict(cvfit, newx = x[1:5,], s= c("lambda.1se","lambda.min")) #"lambda.1se","lambda.min"调换顺序后程序报错,不知道为什么
#说明9:如上是⽤来做预测的,如果s是向量(多个值)则输出结果为矩阵
#说明10:前⾯部分都是在将如何选择合适的lambda,其实也可设置不同的alpha
#Users can control the folds used. Here we use the same folds so we can also select a value for alpha.
foldid=sample(1:10,size=length(y),replace=TRUE)
cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)
cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)
cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)
#说明10.1:标准的函数不能画到1个图中
#There are no built-in plot functions to put them all on the same plot, so we are on our own here:
par(mfrow=c(2,2))
plot(cv1);plot(cv.5);plot(cv0)
plot(log(cv1$lambda),cv1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=cv1$name)
points(log(cv.5$lambda),cv.5$cvm,pch=19,col="grey")
points(log(cv0$lambda),cv0$cvm,pch=19,col="blue")
legend("topleft",legend=c("alpha= 1","alpha= .5","alpha 0"),pch=19,col=c("red","grey","blue"))
#说明11:应该是因为lasso模型满⾜如下条件:误差很⼩,同时变量也较少,所以说是最棒的模型
#We see that lasso (alpha=1) does about the best here. We also see that the range of lambdas used differs
#with alpha.
#说明12:也可以限定参数的范围,⽐如要求参数在如下区间(-0.7,0.5)
tfit=glmnet(x,y,lower=-.7,upper=.5)
plot(tfit)
#说明13:可以强制要求某些变量留在模型中,Penalty factors
#This is very useful when people have prior knowledge or preference over the variables. In many cases, some
#variables may be so important that one wants to keep them all the time, which can be achieved by setting
#corresponding penalty factors to 0:
p.fac = rep(1, 20)
p.fac[c(5, 10, 15)] = 0
pfit = glmnet(x, y, penalty.factor = p.fac)
plot(pfit, label = TRUE)
#说明13.1:结果解读
#We see from the labels that the three variables with 0 penalty factors always stay in the model, while the
#others follow typical regularization paths and shrunken to 0 eventually.
#Some other useful arguments. exclude allows one to block certain variables from being the model at all. Of
#course, one could simply subset these out of x, but sometimes exclude is more useful, since it returns a full
#vector of coefficients, just with the excluded ones set to zero. There is also an intercept argument which
#defaults to TRUE; if FALSE the intercept is forced to be zero.
#说明14:进⾏交叉验证,⽆论是线性回归还是logistic回归都可以使⽤并⾏计算
#Parallel computing is also supported by cv.glmnet. To make it work, users must register parallel beforehand.
#We give a simple example of comparison here.
#但是doMC这个包我没有安装成功,查看cran显⽰该包的状态为not available
>>>>>>>>>>>>>>>>>>>### #Linear Regression中仍然不明⽩的问题,主要alpha的选择
#1、glmnet⽅法的参数中有alpha,默认值为1(侧⾯说明该包还是偏向LASSO),为何到了cv.glmnet⽅法其参数没有了alpha,这是为什么?
#2、说明10,标明user也可以⾃⼰强制设置alpha,难道在实际操作中就是这样处理的吗?由⽤户⾃⼰设置不同的alpha、lambda组合,
#按照这种评价⽅式来到最合适的alpha和Lambda?
TO CSDN的博友,可在评论中说说你们的理解,谢谢!
#3、⽹上⼀篇⽂章提到:glmnet只能接受数值矩阵作为模型输⼊,如果⾃变量中有离散变量的话,需要把这⼀列离散变量
#转化为⼏列只含有0和1的向量,这个过程叫做One Hot Encoding。我个⼈认为是这样的,既然已经是矩阵,可以看看Introduction
#肯定是需要对离散变量进⾏处理的。
#4、看看这篇⽂章:blog.csdn/qiao1245/article/details/53021465
>>>>>>>>>>>>>>>>>>>### #⼆、Logistic Regression(Binomial Models)
#载⼊数据,使⽤data()可以发现x、y是glmnet包中的2个数据集
data(BinomialExample)
head(x) #数据的维度都是⽐较⾼的,都是⽐较wide的
head(y)
fit = glmnet(x, y, family = "binomial")
plot(fit, xvar = "dev", label = TRUE)
cvfit = cv.glmnet(x, y, family = "binomial", asure = "class")
cvfit1 = cv.glmnet(x, y, family = "binomial", asure = "response",asure="auc")
#说明1:试了⼀下response确实是不能⽤的
#Prediction is a little different for logistic from Gaussian, mainly in the option type. “link” and “response” are never
#equivalent and “class” is only available for logistic regression. In summary, * “link” gives the linear predictors
plot(cvfit)
cvfit$lambda.min
cvfit$lambda.1se
#如下是变量的系数
coef(cvfit, s = "lambda.min")
predict(cvfit, newx = x[1:10,], s = "lambda.min", type = "class")

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系QQ:729038198,我们将在24小时内删除。

发表评论