Knowledge Distillation: Fundamentals and Implementation Libraries
1 Introduction
Knowledge distillation aims to let a small model learn the knowledge of a large model; in plain terms, to make the student model's output approximate (fit) the teacher model's output. The crux of knowledge distillation is therefore this fitting: we need to define a way to measure how close the student model is to the teacher model, which is simply a loss function.
Why do we need knowledge distillation? Because large models are slow at inference and hard to deploy in industry, while training a small model directly tends to give poor results.
Below I introduce four popular distillation papers. I have hands-on experience with all four and hope this helps.
2 The Seminal Work on Knowledge Distillation
Hinton proposed knowledge distillation in the paper Distilling the Knowledge in a Neural Network. There is plenty of material about it online, so I will just give a brief summary.
The loss function: $$Loss = \alpha L_{soft} + (1-\alpha)L_{hard}$$
where L_{soft} is the cross-entropy between the StudentModel's output and the TeacherModel's output, and L_{hard} is the cross-entropy between the StudentModel's output and the ground-truth labels.
A bit more on L_{soft}. The TeacherModel's output goes through a softmax, whose exponential widens the gaps between classes, so the final output looks very much like a one-hot vector; that is not helpful for the StudentModel to learn from, so we want a softer output. We therefore modify the softmax function:
$$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$$
(q_i is the softened probability for class i, z_i the corresponding logit, and T the temperature.)
Clearly, the larger T is, the softer the output. With this change, compared with the original softmax, the gradients are effectively scaled by 1/T^2, so L_{soft} needs to be multiplied by T^2 to stay on the same order of magnitude as L_{hard}.
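As a minimal sketch of this combined loss (function and tensor names are mine, not Hinton's; the soft term is written with KL divergence, which differs from the cross-entropy above only by a constant and therefore gives the same gradients):
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # soft loss: match the teacher's temperature-softened distribution,
    # scaled by T^2 so its gradients stay on the same order as the hard loss
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         F.softmax(teacher_logits / T, dim=-1),
                         reduction="batchmean") * (T * T)
    # hard loss: ordinary cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss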
The overall framework of the algorithm is shown in the figure below (image from blog.csdn/nature553863/article/details/80568658).
3 TinyBert
3.1 Basic Idea
When it comes to distilling BERT, the first approach that comes to mind is to use a fine-tuned BERT as the TeacherModel to train a StudentModel, and that is exactly what TinyBERT does. The next question is which model to use as the StudentModel. There have already been several attempts; some people use a BiLSTM, but most stick with BERT, just a smaller one than the original. In TinyBERT the StudentModel is a small BERT with reduced embedding size, hidden size, and number of hidden layers.
How should the StudentModel be initialized? The simplest option is random initialization, but a better one is to start from a pretrained model, which means we need a pretrained StudentModel. TinyBERT's approach is to distill a pretrained StudentModel from a pretrained BERT.
OK, that is essentially TinyBERT. To summarize, it consists of two steps:
1. Distill a pretrained TinyBERT from a pretrained BERT.
2. Distill a fine-tuned TinyBERT from a fine-tuned BERT (initialized with the pretrained TinyBERT from step 1).
3.2 Loss Function
Now let's look at TinyBERT's loss function.
The formula is as follows:
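(The original post displays the formula as an image; the version below is reconstructed from the TinyBERT paper using the symbol definitions that follow.)
$$L_{model} = \sum_{m=0}^{M+1} \lambda_m L_{layer}\big(S_m, T_{g(m)}\big)$$
where L_{layer} is L_{embd} when m = 0, L_{hidden} + L_{attn} when 0 < m <= M, and L_{pred} when m = M+1, and \lambda_m is the weight assigned to layer m.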
Explanation of the formula:
m: an integer indexing the StudentModel's layers, ranging from 0 to M+1
S_m: the output of the StudentModel's m-th layer
g(m): the layer-mapping function; in practice it means the StudentModel's m-th layer learns from the output of the TeacherModel's g(m)-th layer
T_{g(m)}: the output of the TeacherModel's g(m)-th layer
M: the number of hidden layers of the StudentModel, so layer M+1 of the StudentModel is the output of the prediction layer (the logits)
L_{embd}(S_0, T_0): the loss on the word-embedding layer, which is MSE
L_{hidden} and L_{attn}: the losses on the hidden and attention layers, both MSE
L_{pred}: the prediction-layer loss, which is cross-entropy; some other papers use KL divergence instead, which amounts to the same thing during backpropagation
One more note: during distillation, the hidden layers are distilled first (i.e., m <= M), and only afterwards is the m = M+1 step performed.
To sum up, which should help with understanding: when distilling, TinyBERT makes the StudentModel learn not only the last layer's output but also the outputs of several intermediate layers. In other words, each chosen hidden layer of the StudentModel learns from the output of one of the TeacherModel's hidden layers.
The distillation granularity feels quite fine; I would call it layer-based distillation.
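To make the layer-based idea concrete, here is a minimal sketch of the per-layer losses. It assumes hidden_states and attentions tuples like those returned by transformers' BertModel (with output_hidden_states=True and output_attentions=True), the same number of attention heads in student and teacher, a uniform mapping g(m) = m * N / M, and a learned linear projection proj for the case where the student's hidden size differs from the teacher's; all names are mine, not TinyBERT's released code:
import torch.nn.functional as F

def layer_based_loss(student_hiddens, teacher_hiddens,
                     student_attentions, teacher_attentions, proj):
    # M student transformer layers, N teacher transformer layers, N divisible by M
    M, N = len(student_attentions), len(teacher_attentions)
    step = N // M
    # m = 0: embedding-layer MSE (hiddens[0] is the embedding output)
    loss = F.mse_loss(proj(student_hiddens[0]), teacher_hiddens[0])
    # 0 < m <= M: hidden-state MSE plus attention MSE against teacher layer g(m)
    for m in range(1, M + 1):
        g_m = m * step
        loss = loss + F.mse_loss(proj(student_hiddens[m]), teacher_hiddens[g_m])
        loss = loss + F.mse_loss(student_attentions[m - 1], teacher_attentions[g_m - 1])
    return loss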
3.3 Practical Tips
1. With limited hardware and data it is hard to distill a pretrained model ourselves, but we can borrow TinyBERT's idea and do task-specific distillation directly. As for how to initialize the student, I have two suggestions: either initialize it directly from the original teacher model, or initialize it from a small pretrained model released by someone else. I initialized directly from the RBT3 model and it worked very well.
2. After distillation, always evaluate the StudentModel's generalization ability.
3. Be flexible: there is no single unified recipe for distillation yet, and there are many places you can tweak and experiment with.
4 DistilBert
4.1 Basic Idea
Having covered TinyBERT, I also want to talk about DistilBERT. DistilBERT is considerably simpler than TinyBERT, so I will keep this short. DistilBERT uses a pretrained BERT as the TeacherModel to train a StudentModel, where the StudentModel is simply a BERT with fewer layers. Note that the resulting DistilBERT is still, in essence, a pretrained model, so for a specific downstream task you still need to fine-tune it on task data; this is plain fine-tuning, with no further distillation involved. HuggingFace already provides several distilled pretrained models, which you can use exactly as you would use BERT.
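For example, with the transformers library a distilled checkpoint can be loaded and used just like BERT (the checkpoint name below is the common English one; substitute whichever distilled model suits your task):
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("knowledge distillation makes BERT smaller", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs[0]  # works with both tuple and ModelOutput returns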
4.2 Loss Function
DistilBERT's loss function: L_{ce} + L_{mlm} + L_{cos}.
L_{ce}: the cross-entropy between the StudentModel's and TeacherModel's logits
L_{mlm}: the StudentModel's masked-language-model loss
L_{cos}: the cosine loss between the StudentModel's and TeacherModel's hidden states; note that TinyBERT uses MSE here instead, and additionally applies MSE to the attentions.
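A minimal sketch of the three terms (names and the equal weighting are mine; DistilBERT's actual training weights each term and restricts the soft loss to the masked positions):
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          mlm_labels, student_hidden, teacher_hidden, T=2.0):
    # L_ce: match the teacher's temperature-softened output distribution
    l_ce = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # L_mlm: the usual masked language modelling cross-entropy (-100 = not masked)
    l_mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                            mlm_labels.view(-1), ignore_index=-100)
    # L_cos: pull student hidden states toward the direction of the teacher's
    flat_s = student_hidden.view(-1, student_hidden.size(-1))
    flat_t = teacher_hidden.view(-1, teacher_hidden.size(-1))
    target = torch.ones(flat_s.size(0), device=flat_s.device)
    l_cos = F.cosine_embedding_loss(flat_s, flat_t, target)
    return l_ce + l_mlm + l_cos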
5 BERT-of-Theseus
Strictly speaking this is not knowledge distillation, but it does reduce model size, and the idea is similar in spirit to TinyBERT and DistilBERT, so I cover it here. The idea is very elegant: training proceeds by randomly replacing groups of layers in the large model with single layers of a small model. An example: suppose the large model is input->tfc1->tfc2->tfc3->tfc4->tfc5->tfc6->output, and we define a small model input->sfc1->sfc2->sfc3->output. During training we still train the large model, but at each step the pairs (tfc1,tfc2), (tfc3,tfc4), (tfc5,tfc6) are randomly replaced by sfc1, sfc2, sfc3 respectively, and the replacement probability keeps growing as training proceeds, so by the end we are effectively training a small model.
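A minimal sketch of the replacement idea (class and variable names are mine; the released implementation also handles the replacement-probability schedule and freezing of the predecessor modules, which I omit):
import torch
import torch.nn as nn

class TheseusEncoder(nn.Module):
    # teacher_blocks: e.g. [tfc1..tfc6]; student_blocks: [sfc1..sfc3]
    # each student block is paired with a group of teacher blocks and, with
    # probability replace_prob, replaces that group in the forward pass
    def __init__(self, teacher_blocks, student_blocks, replace_prob=0.3):
        super().__init__()
        assert len(teacher_blocks) % len(student_blocks) == 0
        self.teacher_blocks = nn.ModuleList(teacher_blocks)
        self.student_blocks = nn.ModuleList(student_blocks)
        self.group = len(teacher_blocks) // len(student_blocks)
        self.replace_prob = replace_prob  # scheduled to grow during training

    def forward(self, x):
        for i, student_block in enumerate(self.student_blocks):
            if self.training and torch.rand(1).item() < self.replace_prob:
                x = student_block(x)  # successor module
            else:
                # predecessor modules for this group
                for teacher_block in self.teacher_blocks[i * self.group:(i + 1) * self.group]:
                    x = teacher_block(x)
        return x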
Here is a figure to aid understanding.
The approach is elegant, and the authors have released the source code; I strongly recommend giving it a try.
6 MiniLM
A newly published paper, also on BERT distillation. Let me briefly summarize its three innovations:
1. First distill a medium-sized model from the TeacherModel, then distill the smaller StudentModel from that medium model. This is only done when the StudentModel is very small.
2. Only the last hidden layer is distilled. The authors argue this gives the StudentModel more freedom and loosens the requirements on the StudentModel's architecture.
3. For that last layer, the student mainly learns the attention weights; see the paper for details (a minimal sketch follows below).
Here is a figure to aid understanding:
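A minimal sketch of the attention part (the paper also transfers value relations, which I omit here; tensor names are mine, and the same number of attention heads in student and teacher is assumed):
import torch

def attention_transfer_loss(student_attn, teacher_attn, eps=1e-12):
    # student_attn / teacher_attn: last-layer self-attention distributions,
    # shape [batch, heads, seq_len, seq_len]
    # KL(teacher || student), averaged over batch, heads and query positions
    kl = teacher_attn * (torch.log(teacher_attn + eps) - torch.log(student_attn + eps))
    return kl.sum(dim=-1).mean()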
7 General Knowledge Distillation Frameworks
7.1 The KnowledgeDistillation Library
I have implemented a PyTorch-based knowledge distillation framework; anyone interested is welcome to try it. The framework abstracts the distillation of multi-layer models as far as possible and can implement algorithms such as TinyBERT and DistilBERT. While maintaining it, however, I found that knowledge distillation is not yet mature: new distillation algorithms keep appearing, and there is no way to build one unified framework that covers them all. I have therefore adjusted the library slightly and split it into two parts:
1. A distillation framework for multi-layer models: good for beginners to read the source and get started (no longer maintained).
2. examples: sample code for various new distillation algorithms (actively maintained).
Everyone is welcome to contribute sample code for new distillation algorithms. Samples should be as concise, readable, and easy to run as possible, ideally for algorithms whose authors did not release source code. Project links:
Pypi:
Github:
Here is a sample that uses the multi-layer distillation framework to distill a 3-layer BERT from a 12-layer BERT with TinyBERT's loss. The code is complete and can be run directly, with no external data required:
# import packages
import torch
import logging
import numpy as np
from transformers import BertModel, BertConfig
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from knowledge_distillation import KnowledgeDistiller, MultiLayerBasedDistillationLoss
from knowledge_distillation import MultiLayerBasedDistillationEvaluator
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
# Some global variables
train_batch_size = 40
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
learning_rate = 1e-5
num_epoch = 10
# define student and teacher model
# Teacher Model
bert_config = BertConfig(num_hidden_layers=12, hidden_size=60, intermediate_size=60, output_hidden_states=True,
output_attentions=True)
teacher_model = BertModel(bert_config)
# Student Model
bert_config = BertConfig(num_hidden_layers=3, hidden_size=60, intermediate_size=60, output_hidden_states=True,
output_attentions=True)
student_model = BertModel(bert_config)
### Train data loader
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 50)))
attention_mask = torch.LongTensor(np.ones((100000, 50)))
token_type_ids = torch.LongTensor(np.zeros((100000, 50)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)
### Train data adaptor
### It is a function that turns batch_data (from train_dataloader) into the inputs of teacher_model and student_model
### You can define your own train_data_adaptor. Its arguments must be device and batch_data.
### The output is either a dict or a tuple, but it must be consistent with your model's input
def train_data_adaptor(device, batch_data):
    batch_data = tuple(t.to(device) for t in batch_data)
    batch_data_dict = {"input_ids": batch_data[0],
                       "attention_mask": batch_data[1],
                       "token_type_ids": batch_data[2], }
    # In this case, the teacher and student use the same input
    return batch_data_dict, batch_data_dict
### The loss model is the key part of this framework.
### We have already provided a general loss model for distilling multiple bert layers.
### In most cases, you can use this model directly.
#### First, we should define a distill_config which indicates how to compute the loss between teacher and student.
#### distill_config is a list; each item indicates how to calculate one loss term.
#### It also defines which output of which layer is used to calculate that loss.
#### It should be consistent with your output_adaptor.
distill_config = [
# compute a loss between the teacher's and the student's embedding_layer embeddings
{"teacher_layer_name": "embedding_layer", "teacher_layer_output_name": "embedding",
"student_layer_name": "embedding_layer", "student_layer_output_name": "embedding",
"loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
},
# compute a loss between the teacher's bert_layer12 hidden_states and the student's bert_layer3 hidden_states
{"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "hidden_states",
"student_layer_name": "bert_layer3", "student_layer_output_name": "hidden_states",
"loss": {"loss_function": "mse_with_mask", "args": {}}, "weight": 1.0
},
{"teacher_layer_name": "bert_layer12", "teacher_layer_output_name": "attention",
"student_layer_name": "bert_layer3", "student_layer_output_name": "attention",
"loss": {"loss_function": "attention_mse_with_mask", "args": {}}, "weight": 1.0
},
{"teacher_layer_name": "pred_layer", "teacher_layer_output_name": "pooler_output",
"student_layer_name": "pred_layer", "student_layer_output_name": "pooler_output",
"loss": {"loss_function": "mse", "args": {}}, "weight": 1.0
},
]
### teacher_output_adaptor and student_output_adaptor
### In most cases a model's output is a tuple, but our package needs it as a dict,
### like: { "layer_name": {"output_name": value}, ... }
### Hence the output adaptor turns your model's output into a dict-style output
### In my case, the teacher and the student can share one adaptor
def output_adaptor(model_output):
    last_hidden_state, pooler_output, hidden_states, attentions = model_output
    output = {"embedding_layer": {"embedding": hidden_states[0]}}
    for idx in range(len(attentions)):
        output["bert_layer" + str(idx + 1)] = {"hidden_states": hidden_states[idx + 1],
                                               "attention": attentions[idx]}
    output["pred_layer"] = {"pooler_output": pooler_output}
    return output
# loss_model
loss_model = MultiLayerBasedDistillationLoss(distill_config=distill_config,
teacher_output_adaptor=output_adaptor,
student_output_adaptor=output_adaptor)
# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=learning_rate)
# evaluator
# this is a basic evaluator; it can report loss values and save models
# You can define your own evaluator class that implements the interface IEvaluator
evaluator = MultiLayerBasedDistillationEvaluator(save_dir="save_model", save_step=1000, print_loss_step=20)
# Get a KnowledgeDistiller
distiller = KnowledgeDistiller(teacher_model=teacher_model, student_model=student_model,
train_dataloader=train_dataloader, dev_dataloader=None,
train_data_adaptor=train_data_adaptor, dev_data_adaptor=None,
device=device, loss_model=loss_model, optimizer=optimizer,
evaluator=evaluator, num_epoch=num_epoch)
# start distillation
distiller.distillate()
7.2 The TextBrewer Library
Let me also introduce another knowledge distillation library, TextBrewer, developed by Harbin Institute of Technology. Compared with my library it implements more algorithms and runs more stably, and I recommend it. GitHub:
Here, likewise, is a complete, runnable example that needs no external data:
import torch
import numpy as np
import pickle
import textbrewer
from textbrewer import GeneralDistiller
from textbrewer import TrainingConfig, DistillationConfig
from transformers import BertConfig, BertModel
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
## define the models
bert_config = BertConfig(num_hidden_layers=12, output_hidden_states=True, output_attentions=True)
teacher_model = BertModel(bert_config).to(device)
bert_config = BertConfig(num_hidden_layers=3, output_hidden_states=True, output_attentions=True)
student_model = BertModel(bert_config).to(device)
# optimizer
param_optimizer = list(student_model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(params=optimizer_grouped_parameters, lr=2e-5)
### data
input_ids = torch.LongTensor(np.random.randint(100, 1000, (100000, 64)))
attention_mask = torch.LongTensor(np.ones((100000, 64)))
token_type_ids = torch.LongTensor(np.zeros((100000, 64)))
train_data = TensorDataset(input_ids, attention_mask, token_type_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=16)
# Define an adaptor for translating the model inputs and outputs
# pack everything into the data format the distiller expects
# the dict keys must be the fixed names expected by TextBrewer
def bert_adaptor(batch, model_outputs):
    last_hidden_state, pooler_output, hidden_states, attentions = model_outputs
    hidden_states = list(hidden_states)
    hidden_states.append(pooler_output)
    output = {"inputs_mask": batch[1],
              "attention": attentions,
              "hidden": hidden_states}
    return output
# Training configuration
train_config = TrainingConfig(gradient_accumulation_steps=1,
ckpt_frequency=10,
ckpt_epoch_frequency=1,
log_dir='logs',
output_dir='saved_models',
device='cuda')
# Distillation configuration
# Matching different layers of the student and the teacher
# Important: this defines how the distillation is carried out
# Custom loss functions are not supported
# CLS loss is not supported either, but it can be forced in as a hidden loss
distill_config = DistillationConfig(
intermediate_matches=[
{'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}, # embedding loss
{'layer_T': 4, 'layer_S': 1, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1}, # hidden loss
{'layer_T': 8, 'layer_S': 2, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
{'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},
{'layer_T': 3, 'layer_S': 0, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1}, # attention loss
{'layer_T': 7, 'layer_S': 1, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
{'layer_T': 11, 'layer_S': 2, 'feature': 'attention', 'loss': 'attention_mse', 'weight': 1},
{'layer_T': 12, 'layer_S': 3, 'feature': 'hidden', 'loss': 'hidden_mse', 'weight': 1},  # effectively the CLS loss
]
)
# Build distiller
distiller = GeneralDistiller(
train_config=train_config, distill_config=distill_config,
model_T=teacher_model, model_S=student_model,
adaptor_T=bert_adaptor, adaptor_S=bert_adaptor)
# Start!
# a callback can be used to evaluate on a dev set
# note that what gets saved is the state_dict
with distiller:
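    # A minimal completion of the training call; treat it as a sketch, since the
    # argument names and order can differ across TextBrewer versions.
    distiller.train(optimizer, train_dataloader, num_epochs=10, callback=None)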
8 Other Ways to Speed Up BERT
There are many other ways to speed up BERT. I won't go into detail, but those interested can look into:
1. Better hardware; currently the RTX 30-series GPUs look like the best value for money.
2. Upgrading your deep learning framework also speeds up training and inference; for example, newer versions of TensorFlow support MKL-DNN and the AVX instruction set.
3. ONNX Runtime (mainly useful for inference).
4. Quantization of BERT (a minimal sketch follows this list).
5. StructuredDropout (LayerDrop) is worth a look, though it works best during pretraining, otherwise the results are not great; see the ICLR 2020 paper Reducing Transformer Depth on Demand with Structured Dropout.
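As an illustration of point 4, PyTorch's dynamic quantization can shrink a BERT's linear layers to int8 for CPU inference. This is a generic sketch, not tied to any specific result in this post:
import torch
from transformers import BertConfig, BertModel

model = BertModel(BertConfig(num_hidden_layers=3))
model.eval()
# replace nn.Linear weights with int8 versions; activations are quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
print(quantized_model)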