Recommender Systems (16): Attention Mechanisms in Recommender Systems
Motivation
Recommender systems contain all kinds of sequence features: in each scenario of a mobile app there are the user's click sequence, exposure sequence, add-to-cart sequence, and so on. How to model the relationship between these sequences and the target_item has recently become a question everyone is studying. The traditional sequence models, RNN and LSTM, are comparatively slow and are gradually becoming a poor fit for recommender systems with ever-tighter latency budgets. The recent rise of the attention mechanism happens to solve the latency problem while also performing well, so this post writes down my own understanding of attention mechanisms in recommender systems; please point out anything that is wrong.
target-attention
Principle
The first widespread application of the attention mechanism in recommender systems was arguably DIN. That model uses attention to build the relationship between the user's historical click sequence and the candidate item being scored, and it solved the modeling of sequence features fairly successfully. Since the overall idea is simple, I will not go into much detail here.
Code implementation
The code released by the DIN authors is not especially easy to follow, so a teammate reworked the target-attention implementation according to his own understanding, bringing it closer to the conventional attention formulation over (Q, K, V). The code is shown below. As you can see, the difference between target-attention and conventional attention is the extra matrix W, which transforms the original K into K_transform; the benefit is that the model does not depend directly on K, but on a higher-dimensional abstract representation of K. An implementation of conventional attention appears later in the self-attention section.
# coding=utf-8
import numpy as np
import tensorflow as tf


def target_attention(Q, K, V):
    """target_attention implementation
    :param Q: target item embedding, shape (batch_size, embedding_size)
    :param K: behavior sequence embeddings, shape (batch_size, seq_len, embedding_size)
    :param V: behavior sequence embeddings, shape (batch_size, seq_len, embedding_size)
    :return: target_attention tensor, shape (batch_size, embedding_size)
    """
    k1, k2 = K.get_shape().as_list()[-1], Q.get_shape().as_list()[-1]
    # learnable projection that maps K into the same space as Q
    W = tf.get_variable("w", shape=[k1, k2], initializer=tf.keras.initializers.he_normal())
    K_transform = tf.tensordot(K, W, axes=1)  # (batch_size, seq_len, k2)
    d_k = tf.cast((k1 + k2) / 2, dtype=tf.float32)
    # scaled dot-product between each sequence element and the target item
    logit = tf.matmul(K_transform, tf.expand_dims(Q, axis=-1)) / tf.sqrt(d_k)  # (batch_size, seq_len, 1)
    weight = tf.nn.softmax(tf.squeeze(logit, axis=-1), axis=-1)  # (batch_size, seq_len)
    attention = tf.matmul(tf.expand_dims(weight, axis=1), V)  # (batch_size, 1, embedding_size)
    return tf.squeeze(attention, axis=1)


if __name__ == '__main__':
    seq_len = 3
    embedding_size = 4
    seq_tensor = tf.placeholder(dtype=tf.float32, shape=(None, seq_len, embedding_size))  # sequence feature
    target_tensor = tf.placeholder(dtype=tf.float32, shape=(None, embedding_size))  # target_item
    t_attn = target_attention(target_tensor, seq_tensor, seq_tensor)
    feed_dict = {
        target_tensor: np.array([
            [3.0, 4.0, 5.0, 6.0]
        ]),
        seq_tensor: np.array([[
            [1.0, 2.0, 3.0, 4.0],
            [5.0, 6.0, 7.0, 8.0],
            [9.0, 10.0, 11.0, 12.0]
        ]])
    }
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        t_attn_out = sess.run(t_attn, feed_dict=feed_dict)
        print(t_attn_out)
Usually, the K and V in recommender-system attention are both the user's click sequence, made up of item_ids, and Q is the target_item_id. There are many variants, though: Q may be item_id + item_side_info, in the hope that the extra side_info makes the model more stable; or Q may be the mean of the click sequence while K and V are the exposed-but-unclicked sequence, in the hope of using the click sequence to pick out the items in the exposed-but-unclicked sequence that the user is actually interested in.
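As a concrete illustration of the last variant, here is a minimal sketch that reuses the target_attention function above; the tensor names, sequence lengths, and the variable scope are hypothetical, not part of the original code.

# Variant sketch (hypothetical tensors): Q = mean of the click sequence,
# K = V = the exposed-but-unclicked sequence, reusing target_attention from above.
click_seq = tf.placeholder(dtype=tf.float32, shape=(None, 10, 4))     # click sequence
expo_seq = tf.placeholder(dtype=tf.float32, shape=(None, 20, 4))      # exposed-but-unclicked sequence
click_mean = tf.reduce_mean(click_seq, axis=1)                        # (batch_size, 4)
with tf.variable_scope("expo_attn"):                                  # avoid clashing with the "w" variable above
    expo_interest = target_attention(click_mean, expo_seq, expo_seq)  # (batch_size, 4)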
For target_attention in recommender systems, debugging and case analysis have always been hard problems. My suggestion is to print the weights computed from Q and K (the weight in the code above) and look at their mean, variance, and histogram; that alone surfaces problems to a fair degree. For example, if the mean of every weight is roughly the same, the attention is not doing anything at all: either the sequence features are broken or the embedding built for the target_item is. Reasoning backwards from effect to cause like this locates the issue fairly quickly.
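For example, those statistics could be logged as follows; this is a minimal sketch that assumes target_attention is modified to also return its internal weight tensor, and the summary names are hypothetical.

# Debugging sketch: assumes target_attention is changed to `return attention, weight`,
# where `weight` is the softmax tensor of shape (batch_size, seq_len).
with tf.variable_scope("debug_attn"):
    _, weight = target_attention(target_tensor, seq_tensor, seq_tensor)
weight_mean, weight_var = tf.nn.moments(weight, axes=[0, 1])    # scalar mean / variance
tf.summary.scalar("target_attn/weight_mean", weight_mean)       # hypothetical summary names
tf.summary.scalar("target_attn/weight_var", weight_var)
tf.summary.histogram("target_attn/weight_hist", weight)         # weight distribution histogram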
dual-attention
Dual-attention is closely tied to recommender systems; its core idea is a two-stage target-attention. The two stages are as follows:
Stage 1
Run attention with the user-profile features U and context features C over the user's click sequence F from other scenarios, producing a user intent vector V_intent. The formula is as follows, where f is the attention-score function:

V_{intent} = \sum_{i=1}^{N} f(F_i, C, U) \cdot F_i
Stage 2
Run attention with V_intent over the user's click sequence I in the current scenario, producing the final target-attention representation V. The formula is as follows, where g is the attention-score function:

V = \sum_{j=1}^{M} g(V_{intent}, I_j) \cdot I_j
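A minimal sketch of this two-stage scheme, reusing the target_attention function defined above, might look like the following; combining U and C by element-wise addition, the function name, and the variable scopes are assumptions, not the original formulation.

def dual_attention(U, C, F, I):
    """Two-stage target-attention sketch (hypothetical; adding U and C assumes they share dimension d).
    :param U: user profile embedding, (batch_size, d)
    :param C: context embedding, (batch_size, d)
    :param F: click sequence from other scenarios, (batch_size, N, d)
    :param I: click sequence in the current scenario, (batch_size, M, d)
    :return: final representation, (batch_size, d)
    """
    # Stage 1: a query built from user profile + context attends over the cross-scenario sequence F.
    with tf.variable_scope("stage1"):
        v_intent = target_attention(U + C, F, F)   # user intent vector, (batch_size, d)
    # Stage 2: the intent vector attends over the current-scenario sequence I.
    with tf.variable_scope("stage2"):
        return target_attention(v_intent, I, I)    # (batch_size, d)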
self-attention
Self-attention, as the name suggests, means the sequence learns attention weights over itself. It aims to strengthen the connections among the elements within a sequence and behaves more like a learnable normalization operation. The code is as follows:
import tensorflow as tf


def attention(Q, K, scaled_=True):
    """attention implementation
    :param Q: query tensor
    :param K: key tensor
    :param scaled_: whether to scale the logits by sqrt{dim of K}
    :return: attention weights
    """
    logit = tf.matmul(Q, K, transpose_b=True)  # [batch_size, sequence_length, sequence_length]
    if scaled_:
        d_k = tf.cast(tf.shape(K)[-1], dtype=tf.float32)
        logit = tf.divide(logit, tf.sqrt(d_k))  # [batch_size, sequence_length, sequence_length]
    weight = tf.nn.softmax(logit, axis=-1)  # [batch_size, sequence_length, sequence_length]
    return weight


def self_attention(Q, K, V):
    """self_attention implementation
    :param Q: query tensor
    :param K: key tensor
    :param V: value tensor
    :return: self attention result
    """
    weight = attention(Q, K)  # [batch_size, sequence_length, sequence_length]
    s_attn = tf.matmul(weight, V)  # [batch_size, sequence_length, n_classes]
    return s_attn
The way I would troubleshoot self-attention problems is, in my view, much like target-attention: look at the distribution of weight and reason backwards from effect to cause.
multi-head attention
Multi-head attention is a further extension of self-attention; its principle is illustrated in the figure below.
The reference code is shown below.
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth) and
        transpose the result to shape (batch_size, num_heads, seq_len, depth)."""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q):
        batch_size = tf.shape(q)[0]
        # Step 1: linear projections
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        # Step 2: split into heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
        # Step 3: scaled dot-product attention per head
        scaled_attention = self_attention(q, k, v)  # (batch_size, num_heads, seq_len_q, depth)
        # Step 4: transpose back and merge the heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        # Step 5: final linear layer
        m_attention = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        return m_attention
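For reference, a quick usage sketch follows; the layer sizes and tensors here are hypothetical.

# Usage sketch (hypothetical sizes): 2 heads over a length-3 sequence with d_model = 8.
mha = MultiHeadAttention(d_model=8, num_heads=2)
seq = tf.placeholder(dtype=tf.float32, shape=(None, 3, 8))  # (batch_size, seq_len, d_model)
m_attn = mha(seq, seq, seq)                                 # self-attention usage -> (batch_size, 3, 8)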
transformer
The overall structure of the transformer is shown in the figure below.
The overall pipeline splits into two parts: data and model.
On the data side, the input is the positional encoding, whose core idea is to re-encode the embeddings; I will not go into the details here. The model splits into an encoder and a decoder, each described below.
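Although the details are skipped here, a minimal sketch of the usual sinusoidal positional encoding follows, assuming the sin/cos scheme from "Attention Is All You Need" is the variant intended; the function name is hypothetical.

import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding sketch: even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(max_len)[:, np.newaxis]                               # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                                 # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    angle[:, 0::2] = np.sin(angle[:, 0::2])                               # even indices -> sin
    angle[:, 1::2] = np.cos(angle[:, 1::2])                               # odd indices -> cos
    return tf.cast(angle[np.newaxis, ...], dtype=tf.float32)              # (1, max_len, d_model)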
Encoder
The encoder's job is to transform the input positional encoding into a higher-level representation. It is composed of N EncodeLayers; the diagram and code are shown below and will be walked through step by step.
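As a reference point, here is a minimal sketch of one such EncodeLayer under the standard Transformer encoder structure (multi-head attention followed by a position-wise feed-forward network, each wrapped in a residual connection and layer normalization); the dff parameter and the availability of tf.keras.layers.LayerNormalization are assumptions, and this is not the author's original code.

class EncodeLayer(tf.keras.layers.Layer):
    """Sketch of a single encoder layer: multi-head attention + position-wise
    feed-forward network, each followed by a residual connection and layer norm."""
    def __init__(self, d_model, num_heads, dff):
        super(EncodeLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        # assumes a TF version where tf.keras.layers.LayerNormalization is available
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x):
        attn_output = self.mha(x, x, x)            # self-attention over the input sequence
        out1 = self.layernorm1(x + attn_output)    # residual connection + layer norm
        ffn_output = self.ffn(out1)                # position-wise feed-forward network
        return self.layernorm2(out1 + ffn_output)  # residual connection + layer norm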