LASSO-Logistic Models with the R Package glmnet

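glmnet is one of the more important and popular R packages and has been called one of the "troika" of R packages. As the name suggests, glmnet fits GLMs with Elastic-Net regularization: users can build Linear Regression, Logistic Regression, Multinomial Regression and other models with Lasso, Elastic-Net and other regularized methods. I worked through the Glmnet_Vignette.pdf document on CRAN and took some notes, first on Linear Regression and then on Logistic Regression (Binomial Models). Let's go straight to the code.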

>>>>>>>>>>>>>>>>>>>># #glmnet study notes#

#The principle is roughly as follows: the elastic-net penalty is controlled by alpha, and bridges the

#gap between lasso (alpha = 1, the default) and ridge (alpha = 0). The tuning parameter lambda controls the

#overall strength of the penalty

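#LASSO regression and ridge regression both belong to the Elastic Net family of penalized generalized linear models. Besides the shared parameter λ,

#models in this family have a second parameter α that controls how the model behaves on highly correlated data.

#LASSO regression has α=1, ridge regression has α=0, and a general Elastic Net model has 0<α<1.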
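To make the roles of α and λ concrete, here is a minimal base-R sketch of the penalty term itself. It assumes the form λ[(1-α)/2·‖β‖₂² + α‖β‖₁] given in the glmnet documentation; the helper name `enet_penalty` and the coefficient vector are made up for illustration and are not part of glmnet.

```r
# Elastic-net penalty: lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta)))
# enet_penalty is a hypothetical helper, not a glmnet function.
enet_penalty <- function(beta, lambda, alpha) {
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)   # squared L2 part, weighted by (1 - alpha)
  lasso_part <- alpha * sum(abs(beta))          # L1 part, weighted by alpha
  lambda * (ridge_part + lasso_part)
}

beta <- c(1, -2, 0)                                       # illustrative coefficients
pen_lasso <- enet_penalty(beta, lambda = 1, alpha = 1)    # pure L1 (lasso)
pen_ridge <- enet_penalty(beta, lambda = 1, alpha = 0)    # pure squared L2 (ridge)
pen_mix   <- enet_penalty(beta, lambda = 1, alpha = 0.5)  # blend of the two
print(c(lasso = pen_lasso, ridge = pen_ridge, mix = pen_mix))
```

Doubling `lambda` doubles every value, while moving `alpha` only changes how the total is split between the L1 and L2 parts.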

#The data and scripts below come from Glmnet_Vignette.pdf; the notes are my own understanding.

#Introduction: how Glmnet_Vignette.pdf describes glmnet

#Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization

#path is computed for the lasso or elasticnet penalty at a grid of values for the regularization parameter

#lambda. The algorithm is extremely fast, and can exploit sparsity in the input matrix x. It fits linear, logistic

#and multinomial, poisson, and Cox regression models. A variety of predictions can be made from the fitted

#models. It can also fit multi-response linear regression.

#The authors of glmnet are Jerome Friedman, Trevor Hastie, Rob Tibshirani and Noah Simon, and the R

#package is maintained by Trevor Hastie. The matlab version of glmnet is maintained by Junyang Qian.

>>>>>>>>>>>>>>>>>>>>## #1. Linear Regression

library(Matrix)

library(foreach)

library(glmnet)

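data(QuickStartExample)

#Calling data() shows that x and y come from datasets shipped with glmnet. In recent glmnet releases QuickStartExample loads as a list instead; in that case use x <- QuickStartExample$x and y <- QuickStartExample$y

#Note 1: nlambda is the number of lambda values; weights gives the weight of each observation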

fit = glmnet(x, y, alpha = 0.2, weights = c(rep(1,50),rep(2,50)), nlambda = 20)

fit

#Note 2: Df (the number of nonzero coefficients), %dev (the percent deviance explained) and Lambda (the corresponding value of Lambda).

#Df %Dev Lambda

#[1,] 0 0.0000 7.939000

#[2,] 4 0.1789 4.889000

#[3,] 7 0.4445 3.011000

#[4,] 7 0.6567 1.854000

#[5,] 8 0.7850 1.142000

#[6,] 9 0.8539 0.703300

#[7,] 10 0.8867 0.433100

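#Note 2.1: in LASSO regression the amount of complexity adjustment is controlled by λ; the larger λ is, the more heavily a linear model with many variables is penalized, so the final model keeps fewer variables.

#The Lambda and Df values above show the same pattern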
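The link between a larger λ and a smaller Df can be sketched without glmnet. The example below assumes the simplified textbook case of orthonormal predictors, where each lasso coefficient is the soft-thresholded least-squares coefficient; the coefficient values and the lambda grid are made up, and this is not how glmnet computes its path in general.

```r
# Soft-thresholding operator: shrinks z toward 0 and sets small values exactly to 0.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

ols_coef <- c(3.0, -1.8, 0.9, 0.4, -0.1)   # made-up least-squares coefficients
lambdas  <- c(2, 1, 0.5, 0.05)             # decreasing, like a glmnet lambda path

# Count the nonzero coefficients (the Df column) at each lambda.
df_path <- sapply(lambdas, function(l) sum(soft_threshold(ols_coef, l) != 0))
print(data.frame(Lambda = lambdas, Df = df_path))
```

As lambda decreases from 2 to 0.05, more coefficients survive the threshold, which mirrors the growing Df column in the printed glmnet output.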

#Note 3: the vignette describes two conditions under which the computation along the lambda path stops early: According to the default internal settings, the computations stop if either the fractional

#change in deviance down the path is less than 10^-5 or the fraction of explained deviance reaches 0.999.

#Note 4: the number of nonzero coefficients can be read from the printed table (fit or print(fit)) or from a plot, whose x-axis can show one of three quantities

#Users can decide what is on the X-axis. xvar allows three measures: “norm” for the L1-norm of the coefficients

#(default), “lambda” for the log-lambda value and “dev” for %deviance explained.

#Note 5: label = TRUE marks each curve with the sequence number of its variable. Users can also label the curves with variable sequence numbers simply by setting

#label = TRUE.

plot(fit, xvar = "lambda", label = TRUE);plot(fit, xvar = "lambda")

#Note 6: the axis along the top should be the number of nonzero coefficients at the current lambda: The axis above indicates the number of nonzero coefficients at the current Lambda, which is the effective degrees of freedom (df) for the lasso

cvfit = cv.glmnet(x, y, type.measure = "mse", nfolds = 20)

#Note 7: glmnet returns a sequence of models, one per lambda value, so the user has to choose a lambda; cross-validation is the most common way to pick it

#The function glmnet returns a sequence of models for the users to choose from. In many cases, users may

#prefer the software to select one of them. Cross-validation is perhaps the simplest and most widely used

#method for that task. cv.glmnet is the main function to do cross-validation here, along with various supporting methods such as

#plotting and prediction.

#One explanation of lambda.1se and lambda.min

#lambda.min is the value of Lambda that gives minimum mean cross-validated error. The other Lambda saved is lambda.1se,

#which gives the most regularized model such that error is within one standard error of the minimum. To use

#that, we only need to replace lambda.min with lambda.1se above.

#Another explanation of lambda.1se and lambda.min

#Functions coef and predict on cv.glmnet object are similar to those for a glmnet object, except that two

#special strings are also supported by s (the values of Lambda requested): * “lambda.1se”: the largest Lambda at which

#the MSE is within one standard error of the minimal MSE. “lambda.min”: the Lambda at which the minimal MSE is achieved.
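The two definitions above can be written out as a few lines of base R. This is a sketch of the rule as described, not the package's own code; `pick_lambda` is a hypothetical helper and the cross-validation numbers are made up. The argument names mirror the `lambda`, `cvm` and `cvsd` components of a cv.glmnet object.

```r
# pick_lambda is a hypothetical helper that applies the two rules described above.
pick_lambda <- function(lambda, cvm, cvsd) {
  i_min <- which.min(cvm)                        # index of the minimal CV error
  lambda_min <- lambda[i_min]
  threshold <- cvm[i_min] + cvsd[i_min]          # minimum error plus one standard error
  lambda_1se <- max(lambda[cvm <= threshold])    # largest lambda still within that bound
  list(lambda.min = lambda_min, lambda.1se = lambda_1se)
}

# Made-up cross-validation summary, ordered from large to small lambda.
lambda <- c(1.0, 0.5, 0.25, 0.125, 0.0625)
cvm    <- c(5.0, 2.0, 1.2, 1.0, 1.1)     # mean cross-validated error
cvsd   <- c(0.5, 0.3, 0.25, 0.25, 0.25)  # its estimated standard error

picked <- pick_lambda(lambda, cvm, cvsd)
print(picked)
```

With a fitted object, `pick_lambda(cvfit$lambda, cvfit$cvm, cvfit$cvsd)` should reproduce `cvfit$lambda.min` and `cvfit$lambda.1se`, assuming the package applies the same rule.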

coef(cvfit, s = "lambda.min");as.matrix(coef(cvfit, s = "lambda.min"))

plot(cvfit)

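#Note 7.1: lambda.min (smallest error) and lambda.1se (simplest model whose error is within one standard error of the minimum) correspond to the two vertical lines in the plot

#It includes the cross-validation curve (red dotted line), and upper and lower standard deviation curves along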

#the lambda sequence (error bars). Two selected lambda’s are indicated by the vertical dotted lines (see below).

#Note 8: coef() returns a sparse matrix; as.matrix() converts it to an ordinary matrix

predict(cvfit, newx = x[1:5,], s = 'lambda.min')

predict(cvfit, newx = x[1:5,], s = c(0.1,0.2))

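predict(cvfit, newx = x[1:5,], s= c("lambda.1se","lambda.min")) #swapping "lambda.1se" and "lambda.min" raises an error; most likely s accepts only one of these strings at a time, and this exact pair is treated as the default, which resolves to "lambda.1se" alone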
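A likely explanation for the order-dependent error is base R's `match.arg()`. The sketch below assumes that the predict method for cv.glmnet objects declares `s = c("lambda.1se", "lambda.min")` as its default and resolves character input with `match.arg()`; `toy_predict` is a hypothetical stand-in that only mimics that behavior and does no prediction.

```r
# toy_predict is a hypothetical stand-in: it only shows how a default of
# c("lambda.1se", "lambda.min") combined with match.arg() treats its input.
toy_predict <- function(s = c("lambda.1se", "lambda.min")) match.arg(s)

# Identical to the default vector: match.arg() silently keeps the first choice.
same_order <- toy_predict(c("lambda.1se", "lambda.min"))

# Any other vector of length 2 is rejected, so the swapped order fails.
swapped <- try(toy_predict(c("lambda.min", "lambda.1se")), silent = TRUE)

print(same_order)
print(inherits(swapped, "try-error"))
```

Under that assumption, the call with both strings returns predictions for lambda.1se only; to compare the two, call predict once per string or pass the numeric values `c(cvfit$lambda.1se, cvfit$lambda.min)`.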

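#Note 9: the calls above produce predictions; if s is a vector (several values), the result is a matrix

#Note 10: everything so far has been about choosing a suitable lambda, but different alpha values can be tried as well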

#Users can control the folds used. Here we use the same folds so we can also select a value for alpha.

foldid=sample(1:10,size=length(y),replace=TRUE)

cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)

cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)

cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)

#Note 10.1: the standard plot functions cannot draw them all in one figure

#There are no built-in plot functions to put them all on the same plot, so we are on our own here:

par(mfrow=c(2,2))

plot(cv1);plot(cv.5);plot(cv0)

plot(log(cv1$lambda),cv1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=cv1$name)

points(log(cv.5$lambda),cv.5$cvm,pch=19,col="grey")

points(log(cv0$lambda),cv0$cvm,pch=19,col="blue")

legend("topleft",legend=c("alpha= 1","alpha= .5","alpha= 0"),pch=19,col=c("red","grey","blue"))

#Note 11: presumably lasso counts as the best model here because its error is low while it also keeps few variables

#We see that lasso (alpha=1) does about the best here. We also see that the range of lambdas used differs

#with alpha.
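The visual comparison can also be done numerically. The sketch below runs cv.glmnet over a small grid of alpha values with one shared foldid and keeps the alpha with the lowest cross-validated error; cv.glmnet forwards extra arguments such as alpha to glmnet. It requires the glmnet package, and the simulated data, the alpha grid and the names `x_sim`, `y_sim`, `res` and `best` are assumptions made for illustration, not part of the vignette.

```r
library(glmnet)

set.seed(1)                                   # simulated data, not the vignette's
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
y_sim <- drop(x_sim[, 1:3] %*% c(2, -1.5, 1) + rnorm(n))

# One fold assignment shared by every alpha, so the CV errors are comparable.
foldid <- sample(rep(1:10, length.out = n))
alphas <- c(0, 0.25, 0.5, 0.75, 1)

# cv.glmnet passes alpha on to glmnet through its ... argument.
cv_list <- lapply(alphas, function(a) cv.glmnet(x_sim, y_sim, foldid = foldid, alpha = a))

res <- data.frame(
  alpha      = alphas,
  lambda.min = sapply(cv_list, function(cv) cv$lambda.min),
  cvm.min    = sapply(cv_list, function(cv) min(cv$cvm))
)
best <- res[which.min(res$cvm.min), ]         # (alpha, lambda) pair with the lowest CV error
print(res)
print(best)
```

Differences in `cvm.min` between neighboring alpha values are often smaller than the cross-validation noise, so the selected alpha is better read as a rough guide than as a precise optimum.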

#Note 12: the coefficients can also be constrained to a range, for example the interval [-0.7, 0.5]

tfit=glmnet(x,y,lower.limits=-.7,upper.limits=.5)

plot(tfit)

#Note 13: some variables can be forced to stay in the model by means of penalty factors

#This is very useful when people have prior knowledge or preference over the variables. In many cases, some

#variables may be so important that one wants to keep them all the time, which can be achieved by setting

#corresponding penalty factors to 0:

p.fac = rep(1, 20)

p.fac[c(5, 10, 15)] = 0

pfit = glmnet(x, y, penalty.factor = p.fac)

plot(pfit, label = TRUE)

#Note 13.1: interpreting the result

#We see from the labels that the three variables with 0 penalty factors always stay in the model, while the

#others follow typical regularization paths and shrunken to 0 eventually.

#Some other useful arguments. exclude allows one to block certain variables from being the model at all. Of

#course, one could simply subset these out of x, but sometimes exclude is more useful, since it returns a full

#vector of coefficients, just with the excluded ones set to zero. There is also an intercept argument which

#defaults to TRUE; if FALSE the intercept is forced to be zero.

#Note 14: cross-validation can run in parallel for both linear regression and logistic regression

#Parallel computing is also supported by cv.glmnet. To make it work, users must register parallel beforehand.

#We give a simple example of comparison here.

#However, I could not install the doMC package; CRAN shows its status as not available

>>>>>>>>>>>>>>>>>>>### #Open questions about Linear Regression, mainly the choice of alpha

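#1. glmnet() has an alpha argument that defaults to 1 (which suggests the package leans toward LASSO), yet alpha does not appear among the arguments of cv.glmnet(). Why is that?

#2. Note 10 shows that users can also set alpha themselves. Is that how it is done in practice: the user tries different combinations of alpha and lambda

#and uses this kind of evaluation to find the most suitable alpha and lambda?

To fellow CSDN bloggers: please share your understanding in the comments. Thanks!

#3. An article online states that glmnet only accepts a numeric matrix as model input; if the predictors include a categorical variable, that column has to be

#converted into several columns containing only 0 and 1, a process called One Hot Encoding. I believe this is right: since the input is already a matrix (see the Introduction),

#categorical variables certainly need to be preprocessed.

#4. See this article: blog.csdn/qiao1245/article/details/53021465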
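For question 3, base R's `model.matrix()` is one common way to turn a data frame with a factor column into the numeric matrix that glmnet expects. The data frame below is made up for illustration. Note that with an intercept in the formula this produces dummy coding with k-1 indicator columns per factor (the first level is the baseline), which is slightly different from a full one-hot encoding with k columns.

```r
# Made-up data frame with one numeric and one categorical predictor.
df <- data.frame(
  age  = c(23, 35, 41, 52),
  city = factor(c("A", "B", "C", "A"))
)

# Expand the factor into 0/1 indicator columns, then drop the intercept column,
# because glmnet adds its own intercept by default.
x_mat <- model.matrix(~ age + city, data = df)[, -1]
print(x_mat)
```

The result has the columns `age`, `cityB` and `cityC` and could be passed as the `x` argument of glmnet or cv.glmnet.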

>>>>>>>>>>>>>>>>>>>### #2. Logistic Regression (Binomial Models)

#Load the data; calling data() shows that x and y come from datasets shipped with glmnet

data(BinomialExample)

head(x) #the data are fairly high-dimensional, i.e. wide

head(y)

fit = glmnet(x, y, family = "binomial")

plot(fit, xvar = "dev", label = TRUE)

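cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")

cvfit1 = cv.glmnet(x, y, family = "binomial", type.measure = "auc")

#Note 1: I tried "response" as the measure and it indeed cannot be used; "response" is a value for the type argument of predict(), not for type.measure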
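With type.measure = "auc", each fold is scored by the area under the ROC curve instead of the misclassification rate. As a reminder of what that number measures, here is a small base-R sketch that computes AUC through the rank-sum (Mann-Whitney) identity; `auc_rank` is a hypothetical helper, the scores and labels are made up, and glmnet's internal computation may differ in detail.

```r
# auc_rank is a hypothetical helper: AUC via the rank-sum (Mann-Whitney) identity.
auc_rank <- function(score, label) {
  r <- rank(score)                  # average ranks, so ties are handled
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.1, 0.4, 0.35, 0.8)     # made-up predicted probabilities
label <- c(0, 0, 1, 1)              # made-up true classes
auc_value <- auc_rank(score, label)
print(auc_value)
```

The value is the fraction of (positive, negative) pairs in which the positive case receives the higher score, so 0.5 corresponds to random ranking and 1 to perfect separation.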

#Prediction is a little different for logistic from Gaussian, mainly in the option type. “link” and “response” are never

#equivalent and “class” is only available for logistic regression. In summary, * “link” gives the linear predictors
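The relation between the three prediction types can be sketched in base R. The linear predictor values below are made up, and the 0/1 class labels with a 0.5 cutoff are an assumption for illustration; glmnet reports the factor levels of y instead.

```r
eta  <- c(-2, -0.5, 0, 0.5, 2)        # made-up linear predictors, as from type = "link"
prob <- plogis(eta)                   # inverse logit: probabilities, as from type = "response"
cls  <- ifelse(prob > 0.5, 1, 0)      # thresholded labels, as from type = "class"
print(data.frame(link = eta, response = round(prob, 3), class = cls))
```

Because the logit transform is nonlinear, "link" and "response" values coincide for no observation, which is why the vignette stresses that the two are never equivalent for logistic models.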

plot(cvfit)

cvfit$lambda.min

cvfit$lambda.1se

#The variable coefficients are shown below

coef(cvfit, s = "lambda.min")

predict(cvfit, newx = x[1:10,], s = "lambda.min", type = "class")
