Using L2 regularization in TensorFlow to correct overfitting (688IT编程网)

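How L2 regularization works:

How overfitting arises: while the loss is decreasing and the model is fitting the data (the sloped line), different batches of samples make the red curve fluctuate strongly. The low points in the figure are where overfitting happens: the red curve drops below the true black line, which means worse generalization.

So reducing overfitting means reducing this fluctuation, and shrinking the values of w achieves that.

How training with L2 regularization works: the sum of squares of the parameters w (multiplied by a coefficient λ) is added to the loss, so training suppresses the values of w. When the (absolute) values of w are small, model complexity is low, the curve is smooth, and overfitting is reduced (Occam's razor). See the formula in the figure below: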

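(Regularization does not stop you from fitting the curve, and not every parameter is suppressed indiscriminately. It is a dynamic process, a tug-of-war between the loss (cross_entropy) and the L2 loss. Training tries to fit a reasonable w, while regularization pushes back against changes in w. The two terms offset each other: irrelevant wi keep shrinking but stay slightly above zero (that small remainder is still better than nothing, and it is the trade-off L2 makes), while useful wi are retained in a moderate range, so the model generalizes better on top of fitting. Further theory and derivations are omitted here.)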
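The tug-of-war between the data loss and the L2 term can be sketched in a few lines of plain Python. This is only an illustration under assumed values, not the article's TensorFlow code: the toy data loss (w1 - 1)**2, the learning rate, lam, and the starting weights are all made up for the demo. w1 matters to the data loss, w2 does not.

```python
# Toy sketch (assumed setup): total loss = (w1 - 1)**2 + lam * (w1**2 + w2**2).
# w1 is "useful" (the data loss depends on it); w2 is "irrelevant".

def total_loss(w1, w2, lam):
    data_loss = (w1 - 1.0) ** 2             # stands in for cross_entropy
    l2_penalty = lam * (w1 ** 2 + w2 ** 2)  # lam plays the role of lambda / wd
    return data_loss + l2_penalty

def train(w1, w2, lam, lr=0.1, steps=100):
    # Plain gradient descent on total_loss.
    for _ in range(steps):
        grad_w1 = 2.0 * (w1 - 1.0) + 2.0 * lam * w1  # data term + L2 term
        grad_w2 = 2.0 * lam * w2                     # L2 term only
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2

w1, w2 = train(w1=0.5, w2=0.5, lam=0.1)
print('useful w1:', round(w1, 4))
print('irrelevant w2:', round(w2, 4))
```

In this toy setup w1 settles at 1 / (1 + lam), slightly below the unregularized optimum of 1, while w2 shrinks geometrically toward zero without ever reaching it exactly.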

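So why can't L1 do the same? Mainly because L1 has a side effect that makes it a poor fit for this scenario.

L1 replaces the square of wi in the L2 formula with the absolute value of wi. Mathematically, this shrinks the wi unevenly: some wi stay large while others become very small, which yields a sparse solution and amounts to feature selection. Why L1 decay is less even than L2 decay is intuitive. Under L1, lowering w1 from 0.1 to 0 and lowering w2 from 1.0 to 0.9 reduce the penalty by the same amount, so the optimizer and the loss see no difference. With squares, however, the former saves 0.01 - 0 = 0.01 and the latter saves 1 - 0.81 = 0.19, so reducing w2 is clearly the better deal. The figure below illustrates this best: the axes are w1 and w2, and the contour lines are values of the loss. In the left panel the intersection is at w1 = 0, w2 = max(w2), a typical sparse solution that discards w1, whereas the right panel strikes a balance between w1 and w2. In other words, where the model could have produced a curve, with w1 gone it produces a straight line: overfitting is reduced, but so is fitting (expressive) capacity.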
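The arithmetic in the paragraph above can be checked directly. The helper names below are hypothetical; they just compute how much penalty is saved by each of the two moves under an absolute-value (L1) and a squared (L2) penalty.

```python
# Penalty saved by shrinking a single weight, under L1 (|w|) and L2 (w**2).

def l1_saving(w_before, w_after):
    return abs(w_before) - abs(w_after)

def l2_saving(w_before, w_after):
    return w_before ** 2 - w_after ** 2

moves = {'w1: 0.1 -> 0.0': (0.1, 0.0), 'w2: 1.0 -> 0.9': (1.0, 0.9)}
for name, (before, after) in moves.items():
    print(name,
          'L1 saving = %.2f' % l1_saving(before, after),
          'L2 saving = %.2f' % l2_saving(before, after))
# w1: 0.1 -> 0.0 L1 saving = 0.10 L2 saving = 0.01
# w2: 1.0 -> 0.9 L1 saving = 0.10 L2 saving = 0.19
```

Under L1 the two moves are worth the same, so nothing stops the optimizer from driving the small weight all the way to zero; under L2 the large weight is the more rewarding one to shrink.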

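L1 and L2 also go by the names Lasso and ridge, which are easy to mix up: one might think ridge regression is L1 because a ridge sounds "sharp". In fact ridge corresponds to a picture like this one, and "mountain ridge" may be the better translation: a ridge is a curve that descends slowly and continuously.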

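Training

We train an MNIST classifier and compare cross_entropy alone against total_loss with the L2 term added.

Because MNIST is not a hard dataset, putting too many CONV layers in front of the FC layers makes the results too good to show a gap. To demonstrate the effect of L2 regularization I keep only one CONV layer (note that the input of FC1 is h_pool1, which bypasses conv2); the two-conv version can serve as a control group.

The first 1000 training samples are used as the validation set and the first 1000 test samples as the test set.

About the code: it is a basic CONV+FC network that predicts a label for each image, measures performance with cross_entropy, and trains on it.

Weights that need regularization are passed directly through l2_loss,

and both cross_entropy and the L2 loss are put into the collection 'losses'.

wd is the λ in the formula. The larger wd is, the stronger the penalty: overfitting decreases, but so does the ability to fit, so wd must be neither too large nor too small. Many people default to 0.004, which is usually fine since it comes from earlier practitioners' experience. In my own experience, though, this value is not fixed, especially when you customize the loss function. If your weighted cross-entropy becomes ten times its previous value while wd stays the same, wd is effectively the old 0.0004! Just as the gradient should use reduce_sum if the loss uses reduce_sum, many things have to change in step.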
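The point about wd depending on the scale of the loss can be made concrete with a one-parameter toy problem. This is a sketch under assumptions: the data loss (w - 1)**2, the scale factor, and the function name are hypothetical and only serve to show the ratio effect.

```python
# Toy sketch (assumed setup): minimize scale * (w - 1)**2 + wd * w**2.
# Setting the derivative to zero gives the optimum w* = scale / (scale + wd).

def regularized_optimum(scale, wd):
    return scale / (scale + wd)

baseline = regularized_optimum(scale=1.0, wd=0.004)
scaled_loss_same_wd = regularized_optimum(scale=10.0, wd=0.004)
baseline_smaller_wd = regularized_optimum(scale=1.0, wd=0.0004)

print('loss x1,  wd=0.004 :', baseline)
print('loss x10, wd=0.004 :', scaled_loss_same_wd)
print('loss x1,  wd=0.0004:', baseline_smaller_wd)
```

The last two optima coincide: multiplying the data loss by ten while keeping wd = 0.004 pulls the weight exactly as weakly as wd = 0.0004 did on the original loss, so only the ratio between the two terms matters.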

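weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')  # var is the weight tf.Variable
tf.add_to_collection('losses', weight_decay)
tf.add_to_collection('losses', cross_entropy)
total_loss = tf.add_n(tf.get_collection('losses'))

tf.add_n(tf.get_collection('losses')) gathers every loss in the collection and sums them; training on total_loss then implements the formula in the first figure. The complete code follows: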
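The collection pattern itself is simple bookkeeping. The following plain-Python stand-in mimics it without TensorFlow; the function names mirror the TF1 calls only for readability, and the weights and the cross_entropy value are hypothetical numbers.

```python
from collections import defaultdict

# Plain-Python stand-in for a graph collection: a named list of loss terms.
_collections = defaultdict(list)

def add_to_collection(name, value):
    _collections[name].append(value)

def get_collection(name):
    return list(_collections[name])

def l2_loss(weights):
    # Same convention as tf.nn.l2_loss: sum(w ** 2) / 2.
    return sum(w * w for w in weights) / 2.0

wd = 0.004
weights = [0.5, -1.0, 2.0]   # hypothetical weight values
cross_entropy = 0.3          # hypothetical data loss

add_to_collection('losses', wd * l2_loss(weights))  # weight-decay term
add_to_collection('losses', cross_entropy)          # data term
total_loss = sum(get_collection('losses'))          # plays the role of tf.add_n
print('total_loss:', total_loss)
```

Each regularized layer only has to register its own penalty; the final sum picks up however many terms were added.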

from __future__ import print_function
import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.examples.tutorials.mnist import input_data

# MNIST digits 0-9 with one-hot labels
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

def compute_accuracy(v_xs, v_ys):
    # Accuracy of the current model on images v_xs with one-hot labels v_ys.
    global prediction
    y_pre = sess.run(prediction, feed_dict={xs: v_xs, keep_prob: 1})
    correct_prediction = tf.equal(tf.argmax(y_pre,1), tf.argmax(v_ys,1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    #result = sess.run(accuracy, feed_dict={xs: v_xs, ys: v_ys, keep_prob: 1})
    # y_pre and v_ys are already NumPy arrays, so no placeholder needs feeding
    result = sess.run(accuracy, feed_dict={})
    return result

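def weight_variable(shape, wd):
    initial = tf.truncated_normal(shape, stddev=0.1)
    var = tf.Variable(initial)
    if wd is not None:
        print('wd is not none')
        # Penalize the Variable itself (not the `initial` tensor) so the
        # gradient of the L2 term reaches the trainable weights.
        weight_decay = tf.multiply(tf.nn.l2_loss(var), wd, name='weight_loss')
        tf.add_to_collection('losses', weight_decay)
    return var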

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # stride [1, x_movement, y_movement, 1]
    # Must have strides[0] = strides[3] = 1
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    # stride [1, x_movement, y_movement, 1]
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding='SAME')

# define placeholder for inputs to network
xs = tf.placeholder(tf.float32, [None, 784])/255. # 28x28
ys = tf.placeholder(tf.float32, [None, 10])
keep_prob = tf.placeholder(tf.float32)
x_image = tf.reshape(xs, [-1, 28, 28, 1])
# print(x_image.shape) # [n_samples, 28,28,1]

## conv1 layer ##
W_conv1 = weight_variable([5,5, 1,32], 0.) # patch 5x5, in size 1, out size 32
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1) # output size 28x28x32
h_pool1 = max_pool_2x2(h_conv1) # output size 14x14x32

## conv2 layer ##
W_conv2 = weight_variable([5,5, 32, 64], 0.) # patch 5x5, in size 32, out size 64
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2) # output size 14x14x64
h_pool2 = max_pool_2x2(h_conv2) # output size 7x7x64

## >>>>>>>>>>>>>>> ##
## fc1 layer ##
W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)#do not use conv2
#W_fc1 = weight_variable([7*7*64, 1024], wd = 0.00)#use conv2
b_fc1 = bias_variable([1024])
# [n_samples, 14, 14, 32] ->> [n_samples, 14*14*32]
h_pool2_flat = tf.reshape(h_pool1, [-1, 14*14*32])#do not use conv2
#h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])#use conv2
## >>>>>>>>>>>>>>> ##
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

## fc2 layer ##
W_fc2 = weight_variable([1024, 10], wd = 0.)
b_fc2 = bias_variable([10])
prediction = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

# the error between prediction and real data
cross_entropy = tf.reduce_mean(-tf.reduce_sum(ys * tf.log(prediction),
                                              reduction_indices=[1])) # loss
tf.add_to_collection('losses', cross_entropy)
# sum cross_entropy and every weight-decay term stored in 'losses'
total_loss = tf.add_n(tf.get_collection('losses'))
print(total_loss)

train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
train_op_with_l2_norm = tf.train.AdamOptimizer(1e-4).minimize(total_loss)

sess = tf.Session()
# important step
# tf.initialize_all_variables() no longer valid from
# 2017-03-02 if using tensorflow >= 0.12
if int((tf.__version__).split('.')[1]) < 12 and int((tf.__version__).split('.')[0]) < 1:
    init = tf.initialize_all_variables()
else:
    init = tf.global_variables_initializer()
sess.run(init)

for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op_with_l2_norm, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 1})
    # sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout
    if i % 100 == 0:
        # arguments reconstructed from the text above: first 1000 train / test samples
        print('train accuracy',compute_accuracy(
            mnist.train.images[:1000], mnist.train.labels[:1000]))
        print('test accuracy',compute_accuracy(
            mnist.test.images[:1000], mnist.test.labels[:1000]))

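The training runs follow.

No dropout, no L2 regularization, 1000 training steps:

weight_variable([1024, 10], wd = 0.)

Train accuracy is above test accuracy at nearly every checkpoint (often by about 0.01), which indicates overfitting!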

train accuracy 0.094

test accuracy 0.089

train accuracy 0.892

test accuracy 0.874

train accuracy 0.91

test accuracy 0.893

train accuracy 0.925

test accuracy 0.925

train accuracy 0.945

test accuracy 0.935

train accuracy 0.954

test accuracy 0.944

train accuracy 0.961

test accuracy 0.951

train accuracy 0.965

test accuracy 0.955

train accuracy 0.964

test accuracy 0.959

train accuracy 0.962

test accuracy 0.956

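No dropout, L2 regularization on the FC layer, weight decay factor set to 0.004, 1000 training steps:

weight_variable([1024, 10], wd = 0.004)

Overfitting is noticeably reduced, and the test set sometimes even beats the training set (given the small evaluation sets, this only shows the rough effect).

train accuracy 0.107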

test accuracy 0.145

train accuracy 0.876

test accuracy 0.861

train accuracy 0.91

test accuracy 0.909

train accuracy 0.923

test accuracy 0.919

train accuracy 0.931

test accuracy 0.927

train accuracy 0.936

test accuracy 0.939

train accuracy 0.956

test accuracy 0.949

train accuracy 0.958

test accuracy 0.954

train accuracy 0.947

test accuracy 0.95

train accuracy 0.947

test accuracy 0.953

Control group: no L2 regularization, dropout only. Overfitting is reduced.

W_fc1 = weight_variable([14*14*32, 1024], wd = 0.)

W_fc2 = weight_variable([1024, 10], wd = 0.)

sess.run(train_op, feed_dict={xs: batch_xs, ys: batch_ys, keep_prob: 0.5})#dropout

train accuracy 0.132

test accuracy 0.104

train accuracy 0.869

test accuracy 0.859

train accuracy 0.898

test accuracy 0.889

train accuracy 0.917

test accuracy 0.906

train accuracy 0.923

test accuracy 0.917

train accuracy 0.928

test accuracy 0.925

train accuracy 0.938

test accuracy 0.94

train accuracy 0.94

test accuracy 0.942

train accuracy 0.947

test accuracy 0.941

train accuracy 0.944

test accuracy 0.947

Control group: two conv layers. Overfitting is not pronounced to begin with, so the results are omitted.
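One rough way to compare the three logged runs is the average train-minus-test gap. The sketch below copies the accuracy pairs from the logs above; dropping the first checkpoint (step 0, essentially an untrained model) is my own choice, and the run labels are just names for this example. These are single runs on 1000 samples each, so the numbers are only indicative.

```python
# (train, test) accuracy pairs copied from the three logs above,
# skipping the first checkpoint (step 0, before any real training).
runs = {
    'no regularization': [
        (0.892, 0.874), (0.91, 0.893), (0.925, 0.925), (0.945, 0.935),
        (0.954, 0.944), (0.961, 0.951), (0.965, 0.955), (0.964, 0.959),
        (0.962, 0.956)],
    'L2, wd = 0.004': [
        (0.876, 0.861), (0.91, 0.909), (0.923, 0.919), (0.931, 0.927),
        (0.936, 0.939), (0.956, 0.949), (0.958, 0.954), (0.947, 0.95),
        (0.947, 0.953)],
    'dropout only': [
        (0.869, 0.859), (0.898, 0.889), (0.917, 0.906), (0.923, 0.917),
        (0.928, 0.925), (0.938, 0.94), (0.94, 0.942), (0.947, 0.941),
        (0.944, 0.947)],
}

def mean_gap(pairs):
    # Average of (train accuracy - test accuracy); larger means more overfitting.
    return sum(train - test for train, test in pairs) / len(pairs)

gaps = {name: mean_gap(pairs) for name, pairs in runs.items()}
for name, gap in gaps.items():
    print('%-18s mean train-test gap: %.4f' % (name, gap))
```

On these logs the unregularized run has the largest average gap, and both the L2 run and the dropout run narrow it.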

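Second style: everything in one expression

There is no essential difference; it just skips the collection step, at the cost of a longer, less readable expression.

# lam stands for λ (lambda is a reserved word in Python)
loss = tf.reduce_mean(tf.square(y_ - y)) + tf.contrib.layers.l2_regularizer(lam)(w1) + tf.contrib.layers.l2_regularizer(lam)(w2) + ...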

A quick test of the regularization ops run on their own (the code that adds them to the loss is too long to list again; just substitute them into the code above):

import tensorflow as tf

CONST_SCALE = 0.5
w = tf.constant([[5.0, -2.0], [-3.0, 1.0]])
with tf.Session() as sess:
    print(sess.run(tf.abs(w)))
    print('preprocessing:', sess.run(tf.reduce_sum(tf.abs(w))))
    print('manual computation:', sess.run(tf.reduce_sum(tf.abs(w)) * CONST_SCALE))
    print('l1_regularizer:', sess.run(tf.contrib.layers.l1_regularizer(CONST_SCALE)(w))) #11 * CONST_SCALE
    print(sess.run(w**2))
    print(sess.run(tf.reduce_sum(w**2)))
    print('preprocessing:', sess.run(tf.reduce_sum(w**2) / 2))#default
    print('manual computation:', sess.run(tf.reduce_sum(w**2) / 2 * CONST_SCALE))
    print('l2_regularizer:', sess.run(tf.contrib.layers.l2_regularizer(CONST_SCALE)(w))) #19.5 * CONST_SCALE

-------------------------------

[[5. 2.]
 [3. 1.]]
preprocessing: 11.0
manual computation: 5.5
l1_regularizer: 5.5
[[25. 4.]
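The same arithmetic can be reproduced without TensorFlow. The two factory functions below are plain-Python stand-ins written for this example (they are not the tf.contrib implementation); they follow the formulas used in the test above, scale * sum(|w|) for L1 and scale * sum(w**2) / 2 for L2.

```python
# Plain-Python stand-ins (assumptions, not the TensorFlow implementation):
# a regularizer factory takes a scale and returns a function of the weights.

def l1_regularizer(scale):
    def apply(weights):
        return scale * sum(abs(x) for row in weights for x in row)
    return apply

def l2_regularizer(scale):
    def apply(weights):
        return scale * sum(x * x for row in weights for x in row) / 2.0
    return apply

CONST_SCALE = 0.5
w = [[5.0, -2.0], [-3.0, 1.0]]

print('l1_regularizer:', l1_regularizer(CONST_SCALE)(w))  # 5.5  (11 * CONST_SCALE)
print('l2_regularizer:', l2_regularizer(CONST_SCALE)(w))  # 9.75 (19.5 * CONST_SCALE)
```

The values match the comments in the TensorFlow snippet: 11 * CONST_SCALE for L1 and 19.5 * CONST_SCALE for L2.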
