AllenNLP Framework Study Notes (Getting Started)
I recently came across a great natural language processing tool, AllenNLP, which solves a lot of the pain points you hit while doing NLP. It is developed by the famous AI2 (Allen Institute for AI). I went and read the docs they share on GitHub (being an English-challenged reader, I would never have switched into "Baidu on the left, Google on the right" mode if there were more Chinese material), and found this framework really damn good! Later, while putting together my own toolbox, I grew even more convinced that AllenNLP's engineering mindset is worth studying for every NLPer, so I decided to read its elegant code and official docs in depth and write these notes.
Why It's Worth Studying
As the saying goes, to do a good job one must first sharpen one's tools. In NLP there is a joke everyone knows: "three hours preparing the corpus, three minutes training the model." A typical pipeline for an NLP task goes like this:
1. Get the raw text
2. Preprocess: split the text into words or characters
3. Convert tokens into indices
4. Vectorize the text
5. Encode distributed representations
6. Decode the final output
7. Train the model
Throughout such a pipeline we repeat a lot of work across tasks: reading and caching data, preprocessing loops, padding, and loading various pretrained embeddings. And even though neural networks have brought striking performance gains on all kinds of tasks, tuning a new model or reproducing existing results can still be hard: models take a while to train and are sensitive to initialization and hyperparameter settings. I suspect plenty of readers have had to compare and analyze results after each parameter tweak; back when I trained a classification task with fasttext, I kept a plain Excel sheet recording how every parameter change affected classification quality.
AllenNLP exists so that NLPers can validate their ideas faster and free their hands from this repetitive labor. Its advantages:

- A modular framework that ships with many common data-processing and algorithm modules, while staying extensible
- JSON configuration files that make experiments easier to adjust (see the sketch after this list)
- A flexible data API that performs smart batching and padding
- High-level abstractions for operations commonly used in text processing
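As a taste of the config-driven workflow: an experiment is usually described in a Jsonnet/JSON file and launched with `allennlp train`. The sketch below is illustrative only; the `classification-tsv` and `simple_classifier` type names assume the demo classes later in this post have been registered with `@DatasetReader.register(...)` / `@Model.register(...)` and made importable via `--include-package`:

```jsonnet
// my_experiment.jsonnet -- hypothetical config; run with:
//   allennlp train my_experiment.jsonnet -s /tmp/serialization_dir --include-package my_package
{
  "dataset_reader": {"type": "classification-tsv"},
  "train_data_path": "data/train.tsv",
  "validation_data_path": "data/dev.tsv",
  "model": {
    "type": "simple_classifier",
    "embedder": {
      "token_embedders": {"tokens": {"type": "embedding", "embedding_dim": 10}}
    },
    "encoder": {"type": "bag_of_embeddings", "embedding_dim": 10}
  },
  "data_loader": {"batch_size": 8, "shuffle": true},
  "trainer": {"optimizer": "adam", "num_epochs": 5}
}
```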
Module Overview
| Module | Purpose |
| --- | --- |
| commands | implements the command-line functionality |
| common | general-purpose utilities, e.g., parameter parsing and class registration (registration is sketched just below the table) |
| data | data processing |
| interpret | interpreting model predictions, including saliency maps and adversarial attacks |
| models | abstract model base classes that act as templates, plus two concrete models for classification and tagging |
| modules | a collection of PyTorch modules for text processing, nicely encapsulated and used as components in models |
| nn | tensor utility functions, e.g., initializers and activation functions |
| predictors | model prediction |
| tools | handy scripts |
| training | model training |
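The "registration" mentioned under `common` is AllenNLP's `Registrable` mechanism: a subclass registers itself under a string name, which is what lets JSON configs refer to components by `type`. A minimal sketch (the class name here is hypothetical):

```python
from allennlp.data import DatasetReader

# After this, a config file can say {"dataset_reader": {"type": "my-reader"}}.
@DatasetReader.register("my-reader")
class MyReader(DatasetReader):
    ...
```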
Framework Overview
AllenNLP is built on PyTorch. Its basic pipeline consists of the following components:
DatasetReader: the data-reading component. It extracts the necessary information from different kinds of files; you implement dataset reading by overriding the `_read` method, which produces a collection of Instances. An Instance consists of one or more fields, each representing a piece of data the model uses as input or output; fields are converted into tensors and fed into the model. Take classification: its inputs and outputs are very simple, so its Instance consists of a TextField (the input text) and a LabelField (the output label).
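To make that concrete, here is a minimal sketch (mine, not from the original post) of assembling a classification Instance by hand:

```python
from allennlp.data import Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WhitespaceTokenizer

tokenizer = WhitespaceTokenizer()
token_indexers = {"tokens": SingleIdTokenIndexer()}

# TextField holds the tokenized input; LabelField holds the gold label.
text_field = TextField(tokenizer.tokenize("AllenNLP is amazing"), token_indexers)
instance = Instance({"text": text_field, "label": LabelField("positive")})
print(instance)
```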
Model: the abstract model component, representing the model to be trained. The model consumes a batch of Instances, predicts outputs, and computes the loss. For classification, the model needs to: represent each word as a feature vector, combine the word-level vectors into a document-level feature vector, and classify that document-level vector. In the classifier's constructor you need to provide a Vocabulary, which manages the mapping between vocabulary items (e.g., words and labels) and their integer IDs; an embedder such as a TextFieldEmbedder, which produces the initial word vectors with output shape (batch_size, num_tokens, embedding_dim); an encoder such as a Seq2VecEncoder, which compresses the sequence of token vectors into a single vector of shape (batch_size, encoding_dim); and finally a classification layer, which turns the encoder output into logits, one score per candidate label (think of them as unnormalized probabilities). These are later converted into a probability distribution and used to compute the loss. You then write the forward() method, which takes the inputs, produces predictions, and computes the loss.
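The shape contract is easy to check in isolation. This tiny standalone snippet (values made up) just demonstrates the (batch_size, num_tokens, embedding_dim) to (batch_size, encoding_dim) compression:

```python
import torch
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder

encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
embedded = torch.randn(2, 7, 10)            # (batch_size=2, num_tokens=7, embedding_dim=10)
mask = torch.ones(2, 7, dtype=torch.bool)   # (batch_size, num_tokens)
encoded = encoder(embedded, mask)
print(encoded.shape)                        # torch.Size([2, 10]), i.e. (batch_size, encoding_dim)
```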
Trainer: handles the training loop and metric logging. It wires together the necessary components (model, optimizer, instances, data loader, and so on) and runs the training loop. Set serialization_dir to save the model and logs.
Predictor: produces predictions from raw text. The main flow: take the JSON representation of an Instance, convert it into an Instance, feed it to the model, and return the predictions in a JSON-serializable format.
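The demo code below imports `Predictor` but never subclasses it, so here is a hedged sketch of what a predictor for the demo classifier might look like (the class name and the "sentence" JSON key are my choices, not from the original post):

```python
from allennlp.common import JsonDict
from allennlp.data import Instance
from allennlp.predictors import Predictor

class SentenceClassifierPredictor(Predictor):
    def predict(self, sentence: str) -> JsonDict:
        # Wrap the raw text as JSON and run the full json -> instance -> model pipeline.
        return self.predict_json({"sentence": sentence})

    def _json_to_instance(self, json_dict: JsonDict) -> Instance:
        # Reuse the dataset reader's text_to_instance (no label at prediction time).
        return self._dataset_reader.text_to_instance(json_dict["sentence"])
```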
Installation
These notes target allennlp 1.2.2. Install as follows:
```bash
# First install the torch environment (via the Tsinghua mirror; torchvision 0.7.0 is the release that matches torch 1.6.0)
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch==1.6.0 torchvision==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
# Install allennlp
pip install allennlp==1.2.2
# If you hit "Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.",
# just install VC_redist.x64, available at: https://aka.ms/vs/16/release/vc_redist.x64.exe
```
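A quick way to confirm the install worked:

```python
import allennlp
print(allennlp.__version__)  # expect 1.2.2
```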
Trying Out the Official Getting-Started Demo
First prepare some sample data. Each line is a piece of text and its label separated by a tab (the reader below splits on '\t'):
```
I like this movie a lot!	positive
This was a monstrous waste of time	negative
AllenNLP is amazing	positive
Why does this have to be so complicated?	negative
This sentence expresses no sentiment	positive
```
The code is as follows:
```python
import tempfile
from typing import Dict, Iterable, List, Tuple

import torch

import allennlp
from allennlp.common import JsonDict
from allennlp.data import DataLoader, PyTorchDataLoader, DatasetReader, Instance
from allennlp.data import Vocabulary
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token, Tokenizer, WhitespaceTokenizer
from allennlp.models import Model
from allennlp.modules import TextFieldEmbedder, Seq2VecEncoder
from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2vec_encoders import BagOfEmbeddingsEncoder
from allennlp.nn import util
from allennlp.predictors import Predictor
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.training.trainer import Trainer, GradientDescentTrainer
from allennlp.training.optimizers import AdamOptimizer

class ClassificationTsvReader(DatasetReader):
    def __init__(self,
                 lazy: bool = False,
                 tokenizer: Tokenizer = None,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 max_tokens: int = None):
        super().__init__(lazy)
        # Default to whitespace tokenization and single-id token indexing.
        self.tokenizer = tokenizer or WhitespaceTokenizer()
        self.token_indexers = token_indexers or {'tokens': SingleIdTokenIndexer()}
        self.max_tokens = max_tokens

    def text_to_instance(self, text: str, label: str = None) -> Instance:
        tokens = self.tokenizer.tokenize(text)
        if self.max_tokens:
            tokens = tokens[:self.max_tokens]
        text_field = TextField(tokens, self.token_indexers)
        fields = {'text': text_field}
        if label:
            fields['label'] = LabelField(label)
        return Instance(fields)

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, 'r') as lines:
            for line in lines:
                text, sentiment = line.strip().split('\t')
                yield self.text_to_instance(text, sentiment)

class SimpleClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 embedder: TextFieldEmbedder,
                 encoder: Seq2VecEncoder):
        super().__init__(vocab)
        self.embedder = embedder
        self.encoder = encoder
        num_labels = vocab.get_vocab_size("labels")
        self.classifier = torch.nn.Linear(encoder.get_output_dim(), num_labels)
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                text: Dict[str, torch.Tensor],
                label: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, num_tokens)
        mask = util.get_text_field_mask(text)
        # Shape: (batch_size, encoding_dim)
        encoded_text = self.encoder(embedded_text, mask)
        # Shape: (batch_size, num_labels)
        logits = self.classifier(encoded_text)
        # Shape: (batch_size, num_labels)
        probs = torch.nn.functional.softmax(logits, dim=-1)
        output = {'probs': probs}
        if label is not None:
            self.accuracy(logits, label)
            # Shape: (1,)
            output['loss'] = torch.nn.functional.cross_entropy(logits, label)
        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

def build_dataset_reader() -> DatasetReader:
    return ClassificationTsvReader()

def read_data(
    reader: DatasetReader
) -> Tuple[Iterable[Instance], Iterable[Instance]]:
    print("Reading data")
    training_data = reader.read("data/train.tsv")
    validation_data = reader.read("data/dev.tsv")
    return training_data, validation_data

def build_vocab(instances: Iterable[Instance]) -> Vocabulary:
    print("Building the vocabulary")
    return Vocabulary.from_instances(instances)

def build_model(vocab: Vocabulary) -> Model:
    print("Building the model")
    vocab_size = vocab.get_vocab_size("tokens")
    embedder = BasicTextFieldEmbedder(
        {"tokens": Embedding(embedding_dim=10, num_embeddings=vocab_size)})
    encoder = BagOfEmbeddingsEncoder(embedding_dim=10)
    return SimpleClassifier(vocab, embedder, encoder)

def build_data_loaders(
    train_data: torch.utils.data.Dataset,
    dev_data: torch.utils.data.Dataset,
) -> Tuple[allennlp.data.DataLoader, allennlp.data.DataLoader]:
    # Note that DataLoader is imported from allennlp above, *not* torch.
    # We need to get the allennlp-specific collate function, which is
    # what actually does indexing and batching.
    train_loader = PyTorchDataLoader(train_data, batch_size=1, shuffle=True)
    dev_loader = PyTorchDataLoader(dev_data, batch_size=1, shuffle=False)
    return train_loader, dev_loader

def build_trainer(
    model: Model,
    serialization_dir: str,
    train_loader: DataLoader,
    dev_loader: DataLoader
) -> Trainer:
    # AdamOptimizer expects (name, parameter) pairs for the trainable parameters.
    parameters = [
        [n, p]
        for n, p in model.named_parameters() if p.requires_grad
    ]
    optimizer = AdamOptimizer(parameters)
    trainer = GradientDescentTrainer(
        model=model,
        serialization_dir=serialization_dir,
        data_loader=train_loader,
        validation_data_loader=dev_loader,
        num_epochs=5,
        optimizer=optimizer,
    )
    return trainer

def run_training_loop():
    dataset_reader = build_dataset_reader()
    # These are a subclass of pytorch Datasets, with some allennlp-specific
    # functionality added.
    train_data, dev_data = read_data(dataset_reader)
    vocab = build_vocab(train_data + dev_data)
    model = build_model(vocab)
    # This is the allennlp-specific functionality in the Dataset object;
    # we need to be able to convert strings in the data to integers, and this
    # is how we do it.
    train_data.index_with(vocab)
    dev_data.index_with(vocab)
    # These are again a subclass of pytorch DataLoaders, with an
    # allennlp-specific collate function, that runs our indexing and
    # batching code.
    train_loader, dev_loader = build_data_loaders(train_data, dev_data)
    # You obviously won't want to create a temporary file for your training
    # results, but for execution in binder for this guide, we need to do this.
    with tempfile.TemporaryDirectory() as serialization_dir:
        trainer = build_trainer(model, serialization_dir, train_loader, dev_loader)
        trainer.train()
    return model, dataset_reader
```
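With all the pieces in place, the whole demo runs from a single entry point (the `__main__` guard is my addition, not part of the original listing):

```python
if __name__ == "__main__":
    model, dataset_reader = run_training_loop()
```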