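Adam, AdamW, LAMB Optimizers: Principles and Code
Preface
When optimizers come up, the first that come to mind are probably the established ones: Stochastic Gradient Descent (SGD), Adaptive Gradient (AdaGrad), Root Mean Square prop (RMSprop), and Adaptive Moment estimation (Adam). Today, however, most NLP pre-trained models no longer use these methods. They use Adam Weight Decay Regularization (AdamW) and Layer-wise Adaptive Moments optimizer for Batch training (LAMB), which first appeared in 2019. What are the advantages of these newer optimizers, and why are they so popular? Plenty of analyses and explanations are already available online, so they are not repeated here. This article focuses on the update formulas and code implementations of Adam, AdamW, and LAMB.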
1 Adam
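A fixed learning rate in gradient descent (GD) makes different parameters converge at inconsistent speeds. AdaGrad and RMSprop were created to address this by giving each parameter its own learning rate: after the gradient is computed, parameters with larger gradients receive a lower learning rate, and vice versa. In addition, computing every update from the current gradient alone makes the update direction change constantly. To avoid this, Momentum adds the previous update into the current one, combining the two as a weighted sum to obtain the update applied for the current batch. Adam combines both ideas: it sets an adaptive learning rate for every floating-point parameter and also takes the gradient history into account. Its update proceeds as follows: compute the first and second moment estimates, apply bias correction, and finally update the parameters, with $g_t$ denoting the gradient at step $t$.
The bias-correction factors $1/(1-\beta_1^t)$ and $1/(1-\beta_2^t)$ are largest in the first steps and tend to 1 as $t$ grows, so the learning rate and the updates are relatively aggressive early in training and level off later. Adam keeps two extra values ($m$ and $v$) per parameter, so it needs additional memory equal to twice the size of the parameters, but the training gains it brings have kept this efficient method in wide use.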
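The effect of bias correction can be checked numerically. The following is a minimal sketch, not part of the original article; the helper names `bias_correction_factors` and `first_adam_step` are illustrative assumptions. It shows how the correction factors shrink toward 1 and that the very first Adam step has a magnitude close to the learning rate regardless of the gradient's scale.

```python
def bias_correction_factors(beta, steps):
    """Return 1 / (1 - beta**t) for each step t in `steps`."""
    return [1.0 / (1.0 - beta ** t) for t in steps]


def first_adam_step(g, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1), starting from m = v = 0."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)  # bias correction at t = 1 recovers g
    v_hat = v / (1 - beta2)  # bias correction at t = 1 recovers g * g
    return lr * m_hat / (v_hat ** 0.5 + eps)


steps = [1, 10, 100, 1000]
beta1_factors = bias_correction_factors(0.9, steps)
beta2_factors = bias_correction_factors(0.999, steps)
for t, f1, f2 in zip(steps, beta1_factors, beta2_factors):
    print(f"t={t:5d}  1/(1-beta1^t)={f1:8.3f}  1/(1-beta2^t)={f2:9.3f}")

# Gradients of very different scale give a first step of almost the same size.
print(first_adam_step(5.0), first_adam_step(0.05))
```

Without the correction, $m$ and $v$ would start near zero and bias the early updates, which is why both estimates are rescaled before use.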
The implementation is straightforward and follows the formulas directly:
import autograd.numpy as np
from autograd import grad


class Adam:
    def __init__(self, loss, weights, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.loss = loss
        self.theta = weights
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        # autograd: gradient of `loss` with respect to its first argument
        self.get_gradient = grad(loss)
        self.m = 0  # first moment estimate
        self.v = 0  # second moment estimate
        self.t = 0  # step counter

    def minimize_raw(self):
        self.t += 1
        # Evaluate the gradient at the current parameters.
        g = self.get_gradient(self.theta)
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * (g * g)
        # Bias-corrected moment estimates.
        self.m_hat = self.m / (1 - self.beta1 ** self.t)
        self.v_hat = self.v / (1 - self.beta2 ** self.t)
        self.theta = self.theta - self.lr * self.m_hat / (self.v_hat ** 0.5 + self.epsilon)
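As a quick sanity check of the update loop, here is a small sketch that applies the same steps to a one-dimensional quadratic. It is an illustration under stated assumptions rather than the article's code: the function `adam_minimize` is hypothetical, and an analytic gradient replaces `autograd` so that the example needs only the standard library.

```python
def adam_minimize(grad_fn, theta, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a scalar parameter and return the result."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta


def quadratic_grad(theta):
    """Gradient of f(theta) = (theta - 3)^2, whose minimum is at theta = 3."""
    return 2.0 * (theta - 3.0)


theta_one_step = adam_minimize(quadratic_grad, theta=0.0, steps=1)
theta_final = adam_minimize(quadratic_grad, theta=0.0, steps=1000)
print(theta_one_step, theta_final)
```

The first step moves the parameter by roughly the learning rate toward the minimum, and after enough steps the parameter settles near 3.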
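For reference, the Adam update rule implemented above, with learning rate $\alpha$ and gradient $g_t$ at step $t$:
$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$
2 AdamW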
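Adam converges quickly, but it does not solve the problem of overfitting. Many remedies have been discussed, including adding an L2 regularization term on the parameters to the loss function. That approach may work with other optimizers, but it loses its effect with Adam: the penalty's gradient passes through the same adaptive learning-rate scaling as the loss gradient, so it no longer acts as a uniform decay. The fastai article on this topic gives a detailed analysis. AdamW was introduced to solve this problem while still pushing the parameters toward 0. Concretely, it adds the parameter itself, scaled by $\lambda$, to the final parameter update.
$\lambda$ is the weight decay factor; common settings are 0.005 and 0.01. This strategy is now widely used in pre-trained language models.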
from autograd import grad


class AdamW:
    def __init__(self, loss, weights, lambda1, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.loss = loss
        self.theta = weights
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.lambda1 = lambda1  # weight decay factor
        # autograd: gradient of `loss` with respect to its first argument
        self.get_gradient = grad(loss)
        self.m = 0
        self.v = 0
        self.t = 0

    def minimize_raw(self):
        self.t += 1
        # Evaluate the gradient at the current parameters.
        g = self.get_gradient(self.theta)
        self.m = self.beta1 * self.m + (1 - self.beta1) * g
        self.v = self.beta2 * self.v + (1 - self.beta2) * (g * g)
        self.m_hat = self.m / (1 - self.beta1 ** self.t)
        self.v_hat = self.v / (1 - self.beta2 ** self.t)
        # Decoupled weight decay: lambda1 * theta is added outside the adaptive ratio.
        self.theta = self.theta - self.lr * (
            self.m_hat / (self.v_hat ** 0.5 + self.epsilon) + self.lambda1 * self.theta)
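The difference between an L2 penalty inside the loss and the decoupled term used by AdamW shows up already in the first update. The sketch below is an illustration under assumptions: both helper functions are hypothetical, operate on a single scalar parameter, and start from $m = v = 0$. With the L2 penalty, the decay strength is absorbed by Adam's normalization; with AdamW, the decay scales with $\lambda\theta$.

```python
def first_step_adam_l2(theta, g, lam, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step when an L2 penalty adds lam * theta to the gradient."""
    g_total = g + lam * theta  # the penalty enters through the gradient
    m_hat = (1 - beta1) * g_total / (1 - beta1)
    v_hat = (1 - beta2) * g_total ** 2 / (1 - beta2)
    return theta - lr * m_hat / (v_hat ** 0.5 + eps)


def first_step_adamw(theta, g, lam, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """First AdamW step: the decay term stays outside the adaptive ratio."""
    m_hat = (1 - beta1) * g / (1 - beta1)
    v_hat = (1 - beta2) * g ** 2 / (1 - beta2)
    return theta - lr * (m_hat / (v_hat ** 0.5 + eps) + lam * theta)


theta0, g0 = 10.0, 0.5
l2_small = first_step_adam_l2(theta0, g0, lam=0.01)
l2_large = first_step_adam_l2(theta0, g0, lam=0.1)
w_small = first_step_adamw(theta0, g0, lam=0.01)
w_large = first_step_adamw(theta0, g0, lam=0.1)
print("Adam + L2:", l2_small, l2_large)  # nearly identical despite 10x lambda
print("AdamW:    ", w_small, w_large)    # larger lambda shrinks theta more
```

In this example a tenfold increase of $\lambda$ leaves the Adam plus L2 step essentially unchanged, while the AdamW step grows by exactly the extra decay $\alpha \cdot \Delta\lambda \cdot \theta$.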
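For reference, the AdamW update rule implemented above, where $\lambda$ is the weight decay factor:
$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
$$
3 LAMB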
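The LAMB optimizer appeared in 2019 and cut BERT pre-training time from 3 days to 76 minutes. LAMB was designed to speed up pre-training, and it has become one of the NLP community's major contributions to machine learning at large. With optimizers such as Adam and AdamW, a major problem is that the batch size has an implicit upper limit. Once that limit is exceeded, extreme values in the gradient updates make convergence very difficult after the adaptive learning-rate adjustment, so the speed-up promised by a larger batch size is lost. LAMB keeps the gradient updates accurate when the model is trained with large batches. Specifically, LAMB supports adaptive element-wise updating and accurate layer-wise correction. LAMB can scale the BERT pre-training batch size to 64K without loss of accuracy. BERT pre-training consists of two stages: 1) the first 9/10 of the training epochs use a sequence length of 128, and 2) the last 1/10 of the training epochs use a sequence length of 512. The LAMB algorithm is as follows: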
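$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
r_t &= \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \\
\theta_t &= \theta_{t-1} - \alpha \, \frac{\phi(\lVert \theta_{t-1} \rVert)}{\lVert r_t + \lambda\, \theta_{t-1} \rVert} \left( r_t + \lambda\, \theta_{t-1} \right)
\end{aligned}
$$
Here $\phi$ is an optional mapping function. One choice is the identity $\phi(z) = z$; the other, which acts as a normalization, is $\phi(z) = \min(\max(z, \gamma_l), \gamma_u)$, where $\gamma_l$ and $\gamma_u$ are preset hyperparameters giving the lower and upper bounds of the adjustment. The practical effect of this simple adjustment is remarkable. With AdamW, a batch size above 512 causes model quality to drop sharply, whereas with LAMB the batch size can be raised directly to 32,000 without loss of accuracy.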
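The layer-wise scaling can be sketched for a single layer in plain Python. This is a minimal illustration under assumptions, not a reference implementation: `lamb_layer_update`, `phi`, and `l2_norm` are hypothetical helpers, the layer is a flat list of floats, and the Adam ratio $r_t$ is passed in directly. The fallback to a ratio of 1.0 when either norm is zero mirrors the `ratio` expression quoted in the comments of the TensorFlow listing.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in values))


def phi(z, gamma_l=None, gamma_u=None):
    """Identity when no bounds are given, otherwise clip z to [gamma_l, gamma_u]."""
    if gamma_l is None or gamma_u is None:
        return z
    return min(max(z, gamma_l), gamma_u)


def lamb_layer_update(theta, r, lr, lam, gamma_l=None, gamma_u=None):
    """Apply one LAMB step to one layer; return (new_theta, trust_ratio)."""
    update = [r_i + lam * t_i for r_i, t_i in zip(r, theta)]
    w_norm = l2_norm(theta)
    u_norm = l2_norm(update)
    if w_norm > 0 and u_norm > 0:
        trust = phi(w_norm, gamma_l, gamma_u) / u_norm
    else:
        trust = 1.0  # fall back to a plain step when a norm is zero
    new_theta = [t_i - lr * trust * u_i for t_i, u_i in zip(theta, update)]
    return new_theta, trust


theta_new, trust = lamb_layer_update([3.0, 4.0], [0.6, 0.8], lr=0.1, lam=0.0)
_, trust_clipped = lamb_layer_update([3.0, 4.0], [0.6, 0.8], lr=0.1, lam=0.0,
                                     gamma_l=0.0, gamma_u=2.0)
_, trust_zero = lamb_layer_update([0.0, 0.0], [1.0, 0.0], lr=0.1, lam=0.0)
print(theta_new, trust, trust_clipped, trust_zero)
```

Because the step is scaled by the ratio of the weight norm to the update norm, a layer with large weights takes a proportionally larger step than a layer with small weights, independent of the raw gradient scale.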
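Below is TensorFlow 1.x code for the LAMB optimizer. It is provided as a reference for understanding the algorithm; its exact origin can no longer be traced.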
import re

import tensorflow as tf  # TensorFlow 1.x API


class LAMBOptimizer(tf.train.Optimizer):
    '''
    LAMBOptimizer optimizer.
    # Important Note
    - This is NOT an official implementation.
    - LAMB optimizer is changed from arXiv v1 ~ v3.
    - We implement v3 version (which is the latest version on June, 2019.).
    - Our implementation is based on `AdamWeightDecayOptimizer` in BERT (provided by Google).
    # References
    - LAMB optimizer: github/ymcui/LAMB_Optimizer_TF
    - Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. /abs/1904.00962v3
    - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. /abs/1810.04805
    # Parameters
    - There is nothing special, just the same as `AdamWeightDecayOptimizer`.
    '''

    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.01,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="LAMBOptimizer"):
        """Constructs a LAMBOptimizer."""
        super(LAMBOptimizer, self).__init__(False, name)
        self.learning_rate = learning_rate
        self.weight_decay_rate = weight_decay_rate
        self.beta_1 = beta_1
        self.beta_2 = beta_2
        self.epsilon = epsilon
        self.exclude_from_weight_decay = exclude_from_weight_decay

    def apply_gradients(self, grads_and_vars, global_step=None, name=None):
        """See base class."""
        assignments = []
        # Per-variable LAMB update, in the notation of the formulas:
        #   r_t     = m_hat_t / (sqrt(v_hat_t) + eps)
        #   theta_t = theta_{t-1} - lr * phi(||theta_{t-1}||) / ||r_t + lambda * theta_{t-1}||
        #             * (r_t + lambda * theta_{t-1})
        # phi is an optional mapping: the identity phi(z) = z, or the normalizing
        # phi(z) = min(max(z, gamma_l), gamma_u).
        # This listing uses m and v directly (no bias correction).
        for (grad, param) in grads_and_vars:
            if grad is None or param is None:
                continue

            param_name = self._get_variable_name(param.name)

            m = tf.get_variable(
                name=param_name + "/lamb_m",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())
            v = tf.get_variable(
                name=param_name + "/lamb_v",
                shape=param.shape.as_list(),
                dtype=tf.float32,
                trainable=False,
                initializer=tf.zeros_initializer())

            # Standard Adam update.
            next_m = (
                tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
            next_v = (
                tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                          tf.square(grad)))

            update = next_m / (tf.sqrt(next_v) + self.epsilon)

            # Just adding the square of the weights to the loss function is *not*
            # the correct way of using L2 regularization/weight decay with Adam,
            # since that will interact with the m and v parameters in strange ways.
            #
            # Instead we want to decay the weights in a manner that doesn't interact
            # with the m/v parameters. This is equivalent to adding the square
            # of the weights to the loss with plain (non-momentum) SGD.
            if self._do_use_weight_decay(param_name):
                update += self.weight_decay_rate * param

            ############## BELOW ARE THE SPECIFIC PARTS FOR LAMB ##############

            # Note: Here are two choices for scaling function \phi(z)
            # minmax:   \phi(z) = min(max(z, \gamma_l), \gamma_u)
            # identity: \phi(z) = z
            # The authors does not mention what is \gamma_l and \gamma_u
            # UPDATE: after asking authors, they provide me the code below.
            # ratio = array_ops.where(math_ops.greater(w_norm, 0), array_ops.where(
            #     math_ops.greater(g_norm, 0), (w_norm / g_norm), 1.0), 1.0)

            r1 = tf.sqrt(tf.reduce_sum(tf.square(param)))   # weight norm
            r2 = tf.sqrt(tf.reduce_sum(tf.square(update)))  # update norm

            # Trust ratio r1 / r2; falls back to 1.0 when either norm is zero.
            r = tf.where(tf.greater(r1, 0.0),
                         tf.where(tf.greater(r2, 0.0),
                                  r1 / r2,
                                  1.0),
                         1.0)

            eta = self.learning_rate * r

            update_with_lr = eta * update

            next_param = param - update_with_lr

            assignments.extend(
                [param.assign(next_param),
                 m.assign(next_m),
                 v.assign(next_v)])
        return tf.group(*assignments, name=name)

    def _do_use_weight_decay(self, param_name):
        """Whether to use L2 weight decay for `param_name`."""
        if not self.weight_decay_rate:
            return False
        if self.exclude_from_weight_decay:
            for r in self.exclude_from_weight_decay:
                if re.search(r, param_name) is not None:
                    return False
        return True

    def _get_variable_name(self, param_name):
        """Get the variable name from the tensor name."""
        m = re.match("^(.*):\\d+$", param_name)
        if m is not None:
            param_name = m.group(1)
        return param_name
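The two helper methods at the end of the listing decide which variables receive weight decay. The standalone sketch below reproduces their logic with only the `re` module so it can be run without TensorFlow. The function names, the variable names, and the exclusion patterns are illustrative assumptions, not values taken from the article.

```python
import re


def get_variable_name(tensor_name):
    """Strip a trailing ':<index>' suffix from a tensor name."""
    m = re.match(r"^(.*):\d+$", tensor_name)
    return m.group(1) if m is not None else tensor_name


def do_use_weight_decay(param_name, weight_decay_rate, exclude_from_weight_decay):
    """Return False if decay is disabled or any exclusion pattern matches."""
    if not weight_decay_rate:
        return False
    if exclude_from_weight_decay:
        for pattern in exclude_from_weight_decay:
            if re.search(pattern, param_name) is not None:
                return False
    return True


# Hypothetical exclusion patterns and variable names, for illustration only.
patterns = ["LayerNorm", "layer_norm", "bias"]
tensor_names = [
    "encoder/layer_0/attention/query/kernel:0",
    "encoder/layer_0/output/LayerNorm/gamma:0",
    "pooler/dense/bias:0",
]
decisions = {}
for tensor_name in tensor_names:
    name = get_variable_name(tensor_name)
    decisions[name] = do_use_weight_decay(name, 0.01, patterns)
    print(name, "->", decisions[name])
```

Patterns are matched with `re.search`, so a pattern hits anywhere in the variable name; a variable that matches no pattern receives the decay term `weight_decay_rate * param`.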