LearnNLPwithTransformer（Chapter10）

10. Machine Translation

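Personal summary: machine translation follows the same pattern as the other tasks and consists of three steps: loading the data, preprocessing the data, and fine-tuning a pretrained model.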

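We will show how to use a model from the Transformers library to solve a translation task in natural language processing. We will use the WMT dataset, one of the most commonly used datasets for translation.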

An example is shown below:

For the translation task, we show how to load the dataset in a simple way and how to fine-tune a model on the task with the Trainer API from Transformers.

# Choose a model checkpoint
model_checkpoint = "Helsinki-NLP/opus-mt-en-ro"

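As long as a pretrained transformer model includes a head with a seq2seq structure, this notebook should in principle work with a wide range of transformer models and translation tasks.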

Here we use the already trained Helsinki-NLP/opus-mt-en-ro checkpoint for the translation task.

10.1 Loading the data

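We use the Datasets library to load the data and the matching evaluation metric; one call to load_dataset and one call to load_metric are all that is needed. We use the English/Romanian language pair from the WMT dataset.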

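from datasets import load_dataset, load_metric

# Note: load_metric is deprecated in newer Datasets releases in favor of the separate evaluate library (evaluate.load).
raw_datasets = load_dataset("wmt16", "ro-en")
metric = load_metric("sacrebleu")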

Downloading: 2.81kB [00:00, 523kB/s]

Downloading: 3.19kB [00:00, 758kB/s]

Downloading: 41.0kB [00:00, 11.0MB/s]

Downloading and preparing dataset wmt16/ro-en (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size ) to /Users/niepig/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/

Downloading: 100%|██████████| 225M/225M [00:18<00:00, 12.2MB/s]

Downloading: 100%|██████████| 23.5M/23.5M [00:16<00:00, 1.44MB/s]

Downloading: 100%|██████████| 38.7M/38.7M [00:03<00:00, 9.82MB/s]

Dataset wmt16 downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/0d9fb3e814712c785176ad8cdb9f465fbe6479000ee6546725db30ad8a8b5f8a. Subsequent calls will reuse this data.

Downloading: 5.40kB [00:00, 2.08MB/s]

The datasets object is a DatasetDict. The training, validation, and test sets are accessed with the corresponding key (train, validation, test).

raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 610320
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
})

Given a split key (train, validation, or test) and an index, we can look at a single example.

# Each English sentence (en) is paired with one Romanian sentence (ro)
raw_datasets["train"][0]

{'translation': {'en': 'Membership of Parliament: see Minutes',
                 'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
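The difference between indexing one row and slicing several rows matters later in the chapter, when the preprocessing function is called on raw_datasets['train'][:2]. The sketch below is a plain-Python stand-in, not the Datasets implementation: ToyDataset is a hypothetical class that mimics the behaviour where an integer index returns one record and a slice returns each column as a list. The two rows are sentences reused from this chapter, in arbitrary order.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics row vs. slice indexing of a dataset split."""

    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per example

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, key):
        if isinstance(key, slice):
            selected = self.rows[key]
            columns = self.rows[0].keys() if self.rows else []
            # A slice is returned column-wise: one list per feature.
            return {col: [row[col] for row in selected] for col in columns}
        # An integer index returns a single record.
        return self.rows[key]


toy_train = ToyDataset([
    {"translation": {"en": "Membership of Parliament: see Minutes",
                     "ro": "Componenţa Parlamentului: a se vedea procesul-verbal"}},
    {"translation": {"en": "I do not believe that this is the right course.",
                     "ro": "Nu cred că acesta este varianta corectă."}},
])

single = toy_train[0]   # one record: {'translation': {...}}
batch = toy_train[:2]   # columns: {'translation': [{...}, {...}]}
english = [pair["en"] for pair in batch["translation"]]
print(single["translation"]["en"])
print(english)
```

This column-wise layout is why a batch function iterates over examples["translation"] rather than over the rows themselves.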

To get a better feel for what the data looks like, the function below picks a few random examples from the dataset and displays them.

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw distinct random row indices.
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    # Show class labels by name rather than by integer id.
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])

   translation
0  {'en': 'I do not believe that this is the right course.', 'ro': 'Nu cred că acesta este varianta corectă.'}
1  {'en': 'A total of 104 new jobs were created at the European Chemicals Agency, which mainly supervises our REACH projects.', 'ro': 'Un total de 104 noi locuri de muncă au fost create la Agenția Europeană pentru Produse Chimice, care, în special, supraveghează proiectele noastre REACH.'}
2  {'en': 'In view of the above, will the Council say what stage discussions for Turkish participation in joint Frontex operations have reached?', 'ro': 'Care este stadiul negocierilor referitoare la participarea Turciei la operațiunile comune din cadrul Frontex?'}
3  {'en': 'We now fear that if the scope of this directive is expanded, the directive will suffer exactly the same fate as the last attempt at introducing 'Made in' origin marking - in other words, that it will once again be blocked by the Council.', 'ro': 'Acum ne temem că, dacă sfera de aplicare a directivei va fi extinsă, aceasta va avea exact aceeaşi soartă ca ultima încercare de introducere a marcajului de origine "Made in”, cu alte cuvinte, că va fi din nou blocată la Consiliu.'}
4  {'en': 'The country dropped nine slots to 85th, with a score of 6.58.', 'ro': 'Ţara a coborât nouă poziţii, pe locul 85, cu un scor de 6,58.'}
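The retry loop in show_random_elements keeps drawing indices until it has enough distinct ones. As an alternative sketch, not the tutorial's own code, random.sample from the standard library draws distinct indices in a single call. pick_indices is a hypothetical helper name, and the seed parameter is added here only to make runs repeatable.

```python
import random

def pick_indices(dataset_size, num_examples=5, seed=None):
    """Return num_examples distinct row indices in [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a private generator leaves the global random state untouched
    return rng.sample(range(dataset_size), num_examples)

# 1999 is the size of the validation split shown earlier.
picks = pick_indices(1999, num_examples=5, seed=0)
print(picks)
```

The returned list can be passed to the dataset in the same way as the picks list in the original function.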

metric is an instance of the datasets.Metric class. Inspecting it shows its description and a usage example:

metric

Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: The system stream (a sequence of segments)
    references: A list of one or more reference streams (each a sequence of segments)
    smooth: The smoothing method to use
    smooth_value: For 'floor' smoothing, the floor to use
    force: Ignore data that looks already tokenized
    lowercase: Lowercase the data
    tokenize: The tokenizer to use

Returns:
    'score': BLEU score,
    'counts': Counts,
    'totals': Totals,
    'precisions': Precisions,
    'bp': Brevity penalty,
    'sys_len': predictions length,
    'ref_len': reference length,

Examples:
    >>> predictions = ["hello there general kenobi", "foo bar foobar"]
    >>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
    >>> sacrebleu = datasets.load_metric("sacrebleu")
    >>> results = sacrebleu.compute(predictions=predictions, references=references)
    >>> print(list(results.keys()))
    ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
    >>> print(round(results["score"], 1))
    100.0
""", stored examples: 0)

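We call the compute method to compare predictions with labels and obtain the score. Both must be lists: predictions is a list of strings, and references holds a list of reference strings for each prediction. The exact format is shown in the example below: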

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}
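A score of 0.0 for two perfect predictions can look surprising. The counts and totals explain it: each sentence has only two words, so there are no 3-grams or 4-grams to match. The sketch below recomputes counts and totals with plain whitespace splitting. It is a simplified illustration and not the sacreBLEU implementation, which applies its own tokenization and smoothing; ngram_counter and ngram_stats are hypothetical helpers.

```python
from collections import Counter

def ngram_counter(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_stats(predictions, references, max_order=4):
    """Return (counts, totals): clipped matches and candidate n-grams per order."""
    counts = [0] * max_order
    totals = [0] * max_order
    for prediction, refs in zip(predictions, references):
        pred_tokens = prediction.split()
        for n in range(1, max_order + 1):
            pred_ngrams = ngram_counter(pred_tokens, n)
            # For each n-gram, the most occurrences found in any single reference.
            best_ref = Counter()
            for ref in refs:
                for gram, count in ngram_counter(ref.split(), n).items():
                    best_ref[gram] = max(best_ref[gram], count)
            # Clip each candidate count by the reference count before summing.
            counts[n - 1] += sum(min(count, best_ref[gram]) for gram, count in pred_ngrams.items())
            totals[n - 1] += sum(pred_ngrams.values())
    return counts, totals

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
counts, totals = ngram_stats(fake_preds, fake_labels)
print(counts, totals)
```

With nothing to match at orders 3 and 4, an unsmoothed geometric mean over the four precisions is zero, which is consistent with the reported score. Predictions of four or more words avoid this effect.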

10.2 Preprocessing the data

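Before feeding the data to the model, we need to preprocess it. The preprocessing tool is called a Tokenizer. The Tokenizer first tokenizes the input, then converts the tokens to the token IDs expected by the pretrained model, and finally puts them into the input format the model requires.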
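The three stages can be illustrated with a toy tokenizer. This is a sketch under strong simplifications: it splits on whitespace instead of using a learned subword model, and the ToyTokenizer class, its vocabulary, and its ids are all made up for illustration.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer; real tokenizers use learned subword vocabularies."""

    def __init__(self, vocab, unk_token="<unk>", eos_token="</s>"):
        self.vocab = vocab
        self.unk_id = vocab[unk_token]
        self.eos_id = vocab[eos_token]

    def tokenize(self, text):
        # Stage 1: split the raw text into tokens.
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        # Stage 2: map each token to its id, falling back to the unknown id.
        return [self.vocab.get(token, self.unk_id) for token in tokens]

    def __call__(self, text):
        # Stage 3: build the model input, appending the end-of-sequence id.
        input_ids = self.convert_tokens_to_ids(self.tokenize(text)) + [self.eos_id]
        return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


toy_vocab = {"</s>": 0, "<unk>": 1, "hello": 2, "this": 3, "is": 4, "one": 5, "sentence": 6}
toy_tokenizer = ToyTokenizer(toy_vocab)
encoded = toy_tokenizer("Hello this is one sentence")
print(encoded)
```

A different vocabulary would assign different ids to the same text, which is why a tokenizer has to be paired with the model it was built for.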

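To preprocess the data, we instantiate our tokenizer with the AutoTokenizer.from_pretrained method, which ensures that:

We get a tokenizer that matches the pretrained model exactly.

When we use the tokenizer of the given model checkpoint, the vocabulary the model needs (more precisely, the token vocabulary) is downloaded as well. The downloaded vocabulary is cached, so it is not downloaded again the next time it is used.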

from transformers import AutoTokenizer

# Requires `sentencepiece`: pip install sentencepiece
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading: 100%|██████████| 1.13k/1.13k [00:00<00:00, 466kB/s]

Downloading: 100%|██████████| 789k/789k [00:00<00:00, 882kB/s]

Downloading: 100%|██████████| 817k/817k [00:00<00:00, 902kB/s]

Downloading: 100%|██████████| 1.39M/1.39M [00:01<00:00, 1.24MB/s]

Downloading: 100%|██████████| 42.0/42.0 [00:00<00:00, 14.6kB/s]

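For mBART models, the source and target languages must be set correctly. The Helsinki-NLP checkpoint chosen above is not an mBART model, so the check below leaves its tokenizer unchanged. If you translate a different language pair, look up the corresponding language codes. The source and target languages are set as follows: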

if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "ro-RO"

The tokenizer can preprocess a single text or a pair of texts; its output matches the input format the pretrained model expects.

tokenizer("Hello, this one sentence!")

{'input_ids': [125, 778, 3, 63, 141, 9191, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

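The token IDs shown above (input_ids) generally differ from one pretrained model to another, because different pretrained models fixed different rules during pretraining. As long as the tokenizer and the model are loaded from the same checkpoint name, the tokenizer output matches what the model requires. For more on preprocessing, see the preprocessing tutorial in the Transformers documentation.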

Besides a single sentence, we can also tokenize a list of sentences.

tokenizer(["Hello, this one sentence!","This is another sentence."])

{'input_ids': [[125, 778, 3, 63, 141, 9191, 23, 0], [187, 32, 716, 9191, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

Note: to prepare the translation targets for the model, we use as_target_tokenizer, which controls the special tokens used for the targets:

# Note: newer Transformers releases deprecate as_target_tokenizer in favor of the text_target argument.
with tokenizer.as_target_tokenizer():
    print(tokenizer("Hello, this one sentence!"))
    model_input = tokenizer("Hello, this one sentence!")
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    # Print the tokens to see the special tokens
    print('tokens: {}'.format(tokens))

{'input_ids': [10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens: ['▁Hel', 'lo', ',', '▁', 'this', '▁o', 'ne', '▁se', 'nten', 'ce', '!', '</s>']

If you use one of the T5 pretrained checkpoints, the special prefix needs to be checked. T5 uses a task prefix to tell the model which task to perform, for example:

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""
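To see what the prefix does to the model inputs, the sketch below applies the same rule to a checkpoint name and prepends the result to a sentence. add_prefix and T5_CHECKPOINTS are hypothetical names used for illustration and are not part of Transformers.

```python
T5_CHECKPOINTS = ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]

def add_prefix(sentences, checkpoint):
    """Prepend the T5 task prefix only when the checkpoint is a T5 model."""
    prefix = "translate English to Romanian: " if checkpoint in T5_CHECKPOINTS else ""
    return [prefix + sentence for sentence in sentences]

sentence = "Membership of Parliament: see Minutes"
t5_inputs = add_prefix([sentence], "t5-small")
marian_inputs = add_prefix([sentence], "Helsinki-NLP/opus-mt-en-ro")
print(t5_inputs)
print(marian_inputs)
```

For the checkpoint used in this chapter the prefix is empty, so the sentences reach the tokenizer unchanged.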

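Now we can put everything together into our preprocessing function. When preprocessing the samples, we also pass truncation=True so that overly long sentences are truncated to the maximum length. Shorter sentences are not padded at this stage, as the differing lengths in the output further down show; padding is typically left to a data collator such as the DataCollatorForSeq2Seq imported in section 10.3, which pads each batch to its longest example.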

max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    # Collect the source sentences (with the optional task prefix) and the targets.
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    # The token ids of the target sentences become the training labels.
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

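The preprocessing function above is written for batches of samples (examples), as produced by slicing the dataset or by map with batched=True. For each key it returns a list with one preprocessed entry per sample.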

preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[393, 4462, 14, 1137, 53, 216, 28636, 0], [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0], [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]]}
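The two samples above have different lengths, so they cannot be stacked into one rectangular batch as they are. The sketch below shows, in plain Python, the idea behind truncation and per-batch padding. It is an illustration and not the Transformers implementation: truncate and pad_batch are hypothetical helpers, PLACEHOLDER_PAD_ID is an arbitrary stand-in for the tokenizer's real pad id, and -100 is the label padding value that DataCollatorForSeq2Seq uses by default so that padded positions are ignored by the loss.

```python
PLACEHOLDER_PAD_ID = 99999  # placeholder; the real value would be tokenizer.pad_token_id

def truncate(ids, max_length):
    """Keep at most max_length ids (simplified: special tokens get no special handling)."""
    return ids[:max_length]

def pad_batch(features, pad_token_id=PLACEHOLDER_PAD_ID, label_pad_token_id=-100):
    """Pad every example to the longest input and the longest label in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # The mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_label - len(f["labels"])))
    return batch

# Input ids and labels copied from the preprocessing output above.
features = [
    {"input_ids": [393, 4462, 14, 1137, 53, 216, 28636, 0],
     "labels": [42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0]},
    {"input_ids": [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0],
     "labels": [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
print(batch["labels"][0])
print(truncate(features[1]["labels"], 4))
```

Padding per batch rather than to a fixed global length keeps short batches short, which is the usual reason for leaving padding to the collator.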

Next we preprocess every sample in the datasets object by using the map method to apply the preprocessing function preprocess_function to all samples.

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

100%|██████████| 611/611 [02:32<00:00, 3.99ba/s]

100%|██████████| 2/2 [00:00<00:00, 3.76ba/s]

100%|██████████| 2/2 [00:00<00:00, 3.89ba/s]

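Better still, the results are cached automatically, so they are not recomputed the next time (but beware that a stale cache can get in the way when the input changes). The Datasets library checks whether the function passed to map has changed: if not, it reuses the cached data; if so, it processes the data again. If the arguments stay the same but you still want the data reprocessed, bypass the cache by passing load_from_cache_file=False to map.

In addition, batched=True makes map pass the samples to the function in batches, which lets the fast tokenizer process the texts of a batch in parallel using multiple threads.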
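The progress bars above count batches rather than samples. Assuming the default batch size of 1000 used by map, 610320 training rows give 611 batches and 1999 validation or test rows give 2, which matches the 611/611 and 2/2 shown. The sketch below mimics this batching in plain Python. batched_map and count_words are hypothetical stand-ins: there is no caching or parallelism, and unlike the real map only the new columns are returned.

```python
import math

def batched_map(rows, function, batch_size=1000):
    """Apply function to column-wise batches and collect the results row by row."""
    outputs = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # The function receives one dict of lists per batch, not one row at a time.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = function(batch)
        outputs.extend({key: values[i] for key, values in result.items()}
                       for i in range(len(chunk)))
    return outputs

def count_words(batch):
    # Toy batch function: one output value per input sentence.
    return {"num_words": [len(pair["en"].split()) for pair in batch["translation"]]}

rows = [{"translation": {"en": "Membership of Parliament: see Minutes"}}] * 5
mapped = batched_map(rows, count_words, batch_size=2)
num_batches = (math.ceil(610320 / 1000), math.ceil(1999 / 1000))
print(mapped)
print(num_batches)
```

Five rows with a batch size of 2 are processed as batches of 2, 2, and 1, and the results are flattened back into one entry per row.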

10.3 Fine-tuning the transformer model

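Now that the data is ready, we need to download and load our pretrained model and then fine-tune it. Since this is a seq2seq task, we need a model class suited to it: AutoModelForSeq2SeqLM. As with the tokenizer, the from_pretrained method downloads and loads the model and caches it, so it is not downloaded again.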

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading: 100%|██████████| 301M/301M [00:19<00:00, 15.1MB/s]

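Because our fine-tuning task is machine translation and the checkpoint we load is already a pretrained seq2seq model, there is no warning about mismatched parameters being discarded when the model is loaded (for example, a warning that the head of a pretrained language model was dropped and a randomly initialized machine translation head was added).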

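To build a Seq2SeqTrainer we need three more ingredients, the most important of which is the training configuration (Seq2SeqTrainingArguments, imported above). It holds all the attributes that define the training process.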
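As a sketch of what such a configuration might contain, the dictionary below lists a few keyword arguments that Seq2SeqTrainingArguments accepts. The values and the output directory name are illustrative assumptions, not settings taken from this chapter. The commented-out lines show how the dictionary would be handed to the class when Transformers is installed.

```python
# Illustrative values only; tune them for your own hardware and data.
training_kwargs = {
    "output_dir": "opus-mt-en-ro-finetuned",  # hypothetical directory name
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "weight_decay": 0.01,
    "save_total_limit": 3,          # keep at most three checkpoints on disk
    "num_train_epochs": 1,
    "predict_with_generate": True,  # evaluate with generate() so BLEU sees decoded text
}

# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(**training_kwargs)

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

Keeping the settings in a plain dictionary makes them easy to log or override before the arguments object is built.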
