Torchtext Tutorial: Text Data Processing
Torchtext
A text-data preprocessing toolkit
Field
Defines how the data is processed and converts raw text into Tensors
Using Field
from torchtext import data

tokenize = lambda x: x.split()
TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
LABEL = data.Field(sequential=False, use_vocab=False)
Field Parameters
sequential (default: True): whether the data represents a sequence; if not, no tokenization is applied
use_vocab (default: True): whether to use a Vocab object; if not, the raw data must already be numeric
init_token (default: None): a token that will be prepended to every example using this field, or None for no initial token
eos_token (default: None): a token that will be appended to every example using this field, or None for no end-of-sentence token
fix_length (default: None): pad or truncate every sequence to this fixed length, e.g. 100
dtype (default: torch.long): the torch.dtype class that represents a batch of examples of this kind of data
preprocessing (default: None): the Pipeline applied to examples using this field after tokenizing but before numericalizing; many Datasets replace this attribute with a custom preprocessor
postprocessing (default: None): a Pipeline applied to examples using this field after numericalizing but before the numbers are turned into a Tensor; the pipeline function takes the batch as a list and the field's Vocab
lower (default: False): whether to lowercase the text
tokenize (default: string.split): the function used to tokenize the raw strings, e.g. tokenize = lambda x: x.split()
tokenizer_language: the language of the tokenizer to be constructed; various languages are currently supported only in SpaCy
include_lengths (default: False): whether to return a tuple of a padded minibatch and a list containing the length of each example, or just a padded minibatch
batch_first (default: False): whether to produce tensors with the batch dimension first
pad_token (default: "<pad>"): the string token used as padding
unk_token (default: "<unk>"): the string token used to represent out-of-vocabulary (OOV) words
pad_first (default: False): pad the sequence at the beginning instead of the end
truncate_first (default: False): truncate the sequence at the beginning instead of the end
stop_words (default: None): tokens to discard during the preprocessing step
is_target (default: False): whether this field is a target variable; affects iteration over batches
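To illustrate a few more of these options, here is a small sketch of a source/target field pair using init_token, eos_token, include_lengths, batch_first and is_target (the names SRC/TRG and the concrete values are illustrative assumptions, not part of the original tutorial):

from torchtext import data

# Source text: tokenized, lowercased, wrapped in <sos>/<eos> markers,
# batch dimension first, and the true lengths returned alongside each padded batch
SRC = data.Field(sequential=True, tokenize=lambda x: x.split(), lower=True,
                 init_token='<sos>', eos_token='<eos>',
                 batch_first=True, include_lengths=True)

# Numeric class label: no tokenization, no vocabulary, marked as the prediction target
TRG = data.Field(sequential=False, use_vocab=False, is_target=True)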
Dataset
Uses Fields to define how the data is organized and produces the dataset
Using Dataset
A custom Dataset class
from torchtext import data
from tqdm import tqdm
import pandas as pd
import random
import numpy as np

class MyDataset(data.Dataset):
    def __init__(self, csv_path, text_field, label_field, test=False, aug=False, **kwargs):
        csv_data = pd.read_csv(csv_path)
        # Declare how each column is processed; the id column is skipped (field = None)
        fields = [("id", None), ("text", text_field), ("label", label_field)]
        examples = []
        if test:
            # For the test set, do not load labels
            for text in tqdm(csv_data['text']):
                examples.append(data.Example.fromlist([None, text, None], fields))
        else:
            for text, label in tqdm(zip(csv_data['text'], csv_data['label'])):
                # Data augmentation
                if aug:
                    rate = random.random()
                    if rate > 0.5:
                        text = self.dropout(text)
                    else:
                        text = self.shuffle(text)
                examples.append(data.Example.fromlist([None, text, label], fields))
        # The code above is preprocessing; call the parent constructor to produce a standard Dataset
        # super(MyDataset, self).__init__(examples, fields, **kwargs)
        super(MyDataset, self).__init__(examples, fields)

    def shuffle(self, text):
        # Randomly permute the order of the tokens
        text = np.random.permutation(text.strip().split())
        return ' '.join(text)

    def dropout(self, text, p=0.5):
        # Randomly blank out a fraction p of the tokens
        text = text.strip().split()
        len_ = len(text)
        indexs = np.random.choice(len_, int(len_ * p))
        for i in indexs:
            text[i] = ''
        return ' '.join(text)
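A quick sanity check of the custom class might look like the following sketch ('train.csv' is a placeholder path; the CSV is assumed to have text and label columns as in the code above):

train_set = MyDataset('train.csv', text_field=TEXT, label_field=LABEL, test=False, aug=True)
print(len(train_set))       # number of examples
print(vars(train_set[0]))   # e.g. {'text': ['some', 'tokens', ...], 'label': 1}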
Iterator
Iterators: Iterator / BucketIterator
Iterator: builds batches while keeping the samples in their original order.
BucketIterator: automatically groups samples of similar length into the same batch, minimizing the amount of padding needed.
from torchtext import data

def data_iter(train_path, valid_path, test_path, TEXT, LABEL):
    train = MyDataset(train_path, text_field=TEXT, label_field=LABEL, test=False, aug=1)
    valid = MyDataset(valid_path, text_field=TEXT, label_field=LABEL, test=False, aug=1)
    test = MyDataset(test_path, text_field=TEXT, label_field=None, test=True, aug=1)
    # Pass in the dataset used to build the vocabulary
    # TEXT = data.Field(sequential=True, tokenize=tokenize, lower=True, fix_length=200)
    TEXT.build_vocab(train)
    # weight_matrix stays None unless pretrained vectors are passed to build_vocab (see Word Embedding below)
    weight_matrix = TEXT.vocab.vectors
    # Build an iterator for the training set only:
    # train_iter = data.BucketIterator(dataset=train, batch_size=8, shuffle=True, sort_within_batch=False, repeat=False)
    # Build iterators for the training and validation sets at the same time
    train_iter, val_iter = data.BucketIterator.splits(
        (train, valid),
        batch_sizes=(8, 8),
        # If using a GPU, replace -1 with the GPU id
        device=-1,
        # The key used to sort examples, here the text length
        sort_key=lambda x: len(x.text),
        sort_within_batch=False,
        repeat=False
    )
    test_iter = data.Iterator(test, batch_size=8, device=-1, sort=False, sort_within_batch=False, repeat=False)
    return train_iter, val_iter, test_iter, weight_matrix
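A minimal sketch of consuming the iterators returned above (the file paths are placeholders; the attribute names on each batch follow the field names "text" and "label"):

train_iter, val_iter, test_iter, weight_matrix = data_iter(
    'train.csv', 'valid.csv', 'test.csv', TEXT, LABEL)

for batch in train_iter:
    x = batch.text    # token-id LongTensor, shape (fix_length, batch_size) with the default batch_first=False
    y = batch.label   # label LongTensor, shape (batch_size,)
    # the model's forward/backward pass would go here
    break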
Word Embedding
When tackling NLP tasks with a neural-network framework such as PyTorch or TensorFlow, word vectors are handled through the corresponding Embedding layer. Using pretrained word vectors usually gives better performance. The following shows how to load pretrained word vectors in torchtext and pass them on to the neural-network model for training.
Pretrained word vectors supported by torchtext out of the box
The corresponding pretrained vector file is downloaded automatically into the .vector_cache directory under the current folder; .vector_cache is the default directory for word-vector files and cache files.
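The names accepted by the vectors argument can be listed programmatically; a small sketch, assuming a legacy torchtext release that exposes pretrained_aliases in torchtext.vocab:

from torchtext import vocab

# Print the supported aliases, e.g. "glove.6B.300d", "glove.840B.300d", "fasttext.en.300d", ...
print(list(vocab.pretrained_aliases.keys()))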
from torchtext.vocab import GloVe
from torchtext import data

TEXT = data.Field(sequential=True)
# The following two ways of specifying pretrained word vectors are equivalent
# TEXT.build_vocab(train, vectors="glove.6B.300d")
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
# In this case glove.6B.zip is downloaded automatically and unpacked into
# glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt and glove.6B.300d.txt
Externally pretrained word vectors
Specify the pretrained vector file with the name argument and the directory containing it with the cache argument.
from torchtext.vocab import Vectors

cache = '.vector_cache'
vectors = Vectors(name='myvector/glove/glove.', cache=cache)
TEXT.build_vocab(train, vectors=vectors)
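By default, words that appear in the training vocabulary but are missing from the pretrained file are initialized to zero vectors. A sketch of overriding this via the unk_init argument (torch.Tensor.normal_ is just one possible in-place initializer):

import torch

# Out-of-vocabulary words get random normal vectors instead of zeros
TEXT.build_vocab(train, vectors=vectors, unk_init=torch.Tensor.normal_)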
Setting the Embedding layer weights in the model
import torch.nn as nn

# Embedding layer created with PyTorch (input_dim = vocabulary size, hidden_dim = embedding dimension)
embedding = nn.Embedding(input_dim, hidden_dim)
# The pretrained weights are stored in the vocabulary's vectors attribute
weight_matrix = TEXT.vocab.vectors
# Initialize the embedding matrix with the pretrained weights
embedding.weight.data.copy_(weight_matrix)
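Alternatively, newer PyTorch versions can build the layer directly from the weight matrix; a sketch using nn.Embedding.from_pretrained, where freeze controls whether the embeddings keep being updated during training:

import torch.nn as nn

# Build the Embedding layer directly from the pretrained weights; freeze=False keeps them trainable
embedding = nn.Embedding.from_pretrained(weight_matrix, freeze=False)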