BertForSequenceClassification in the PyTorch BERT test code... Judging from the source pasted at the end, at test time the input is just three tensors: 1. input_ids, 2. token_type_ids, 3. attention_mask. For training, labels are needed as well.
Code for building the three features: the parameter example is a list of texts, [sentence, ...]. Here each sentence is text_a + text_b concatenated, so there is no sentence-pair handling. max_seq_length is the maximum sequence length. tokenizer comes from tokenizer = BertTokenizer.from_pretrained(WORK_DIR), where WORK_DIR is the directory holding the BERT model files and is used to load the tokenizer.
import numpy as np
from tqdm import tqdm

def convert_lines(example, max_seq_length, tokenizer):
    max_seq_length -= 2  # leave room for [CLS] and [SEP]
    all_tokens = []
    all_segments = []
    all_masks = []
    longer = 0  # counts how many texts had to be truncated
    for text in tqdm(example):
        tokens_a = tokenizer.tokenize(text)
        if len(tokens_a) > max_seq_length:
            tokens_a = tokens_a[:max_seq_length]
            longer += 1
        one_token = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens_a + ["[SEP]"]) + [0] * (
            max_seq_length - len(tokens_a))
        one_segment = [0] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
        one_mask = [1] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
        all_tokens.append(one_token)
        all_segments.append(one_segment)
        all_masks.append(one_mask)
    print(longer)
    return np.array(all_tokens), np.array(all_segments), np.array(all_masks)
This version of the code handles many rows of data, and the function returns three NumPy arrays. all_tokens holds the input_ids of every sentence and is a 2-D array with n rows and max_seq_length columns; the other two arrays have the same shape.
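As a quick sanity check, here is a minimal usage sketch; the two sentences are made up, and tokenizer is the one loaded from WORK_DIR as described above:
# Hypothetical usage: two made-up sentences, tokenizer loaded from WORK_DIR as above.
sentences = ["first title plus abstract", "second title plus abstract"]
tokens, segments, masks = convert_lines(sentences, 512, tokenizer)
print(tokens.shape, segments.shape, masks.shape)  # each is (2, 512)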
Part of the main function:
if __name__ == '__main__':
    # assumed imports: pandas as pd, numpy as np, torch, plus BertTokenizer,
    # BertConfig and BertForSequenceClassification from the installed BERT package
    test = pd.read_csv("./dataset/3_abstracts.csv", encoding='utf-8')
    test['NAME'] = test['NAME'].fillna("无")
    test['CONTENT'] = test['CONTENT'].fillna("无")
    test['title_content'] = test['NAME'] + test['CONTENT']
    seed_everything()
    ## config
    device = torch.device('cuda')
    WORK_DIR = "./bert_pretrain/"
    # I use three classes here
    bert_config = BertConfig.from_pretrained(WORK_DIR + 'bert_config.json', num_labels=3)
    tokenizer = BertTokenizer.from_pretrained(WORK_DIR)
    MAX_SEQUENCE_LENGTH = 512
    test_tokens, test_segments, test_masks = convert_lines(test["title_content"], MAX_SEQUENCE_LENGTH, tokenizer)
    # wrap the three 2-D arrays into tensors (each looks like tensor([[id, id, ...], ...]))
    # and collect the three tensors in test_features
    test_features = [
        torch.tensor(test_tokens, dtype=torch.long),
        torch.tensor(test_segments, dtype=torch.long),
        torch.tensor(test_masks, dtype=torch.long),
    ]
    # standard PyTorch usage: TensorDataset packs the tensors into one dataset for the loader later on
    test_dataset = torch.utils.data.TensorDataset(*test_features)
    # call my prediction function to predict the labels
    test_preds = test_model(test_dataset)
The prediction function:
def test_model(test_dataset):
    # assumes: from torch.utils.data import DataLoader, SequentialSampler; from tqdm import tqdm_notebook
    WORK_DIR = "./bert_pretrain/"
    output_model_file = WORK_DIR + '423_model.bin'  # the model I fine-tuned myself
    model = BertForSequenceClassification.from_pretrained(WORK_DIR, config=bert_config)
    model.load_state_dict(torch.load(output_model_file))
    model.to(device)
    model.eval()
    # for param in model.parameters():
    #     param.requires_grad = False
    test_preds = np.zeros((len(test_dataset), 3))
    # SequentialSampler walks the test data in order; RandomSampler would sample it randomly
    test_sampler = SequentialSampler(test_dataset)
    # build the loader with batch_size=4 (any value works): each step pulls four rows
    # from each of the three tensors prepared above
    test_loader = DataLoader(test_dataset, sampler=test_sampler, batch_size=4)
    # the progress bar runs for (total rows / 4) steps
    tk0 = tqdm_notebook(test_loader)
    # each x_batch is a 2-D tensor like tensor([[id, id, ...], ...]); this also shows that the BERT
    # model expects 2-D tensor inputs (the source even has a line like num_choices = input_ids.shape…)
    for i, (x_batch1, x_batch2, x_batch3) in enumerate(tk0):
        # the batches must be moved to the GPU here, otherwise the forward pass raises an error
        pred = model(x_batch1.to(device), token_type_ids=x_batch2.to(device),
                     attention_mask=x_batch3.to(device))
        test_preds[i * 4:(i + 1) * 4] = pred[0].detach().cpu().numpy()
    return test_preds
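The two commented-out requires_grad lines hint at a further inference-time saving: wrapping the loop in torch.no_grad() keeps autograd from building a graph at all. A sketch of the same loop with that change, everything else as in test_model:
# Sketch only: the same loop as in test_model, wrapped in no_grad so no gradient
# bookkeeping is done during inference.
with torch.no_grad():
    for i, (x_batch1, x_batch2, x_batch3) in enumerate(tk0):
        pred = model(x_batch1.to(device),
                     token_type_ids=x_batch2.to(device),
                     attention_mask=x_batch3.to(device))
        test_preds[i * 4:(i + 1) * 4] = pred[0].cpu().numpy()  # detach() is unnecessary under no_grad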
The prediction comes back as a tuple; the second element is some CUDA-related thing or other, so just use the first element, pred[0], and convert it to a CPU NumPy array. Since I have three classes, the result looks like [[float, float, float], [...], ..., [...]]. A for loop that takes the index of the largest of the three floats in each row gives the final label:
predict = []
for prediction in test_preds:  # one row of class scores per sample
    pred_label = np.argmax(prediction)
    predict.append(pred_label)
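The same labels can also be obtained with a single vectorized call, which is equivalent to the loop above:
# Equivalent to the loop: argmax over the class axis of every row at once.
predict = np.argmax(test_preds, axis=1).tolist()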
Now for what has to change when only a single piece of text is fed in, which is what you need when deploying an inference API: feature extraction only has to produce the three lists for that one text, each a plain Python list, e.g. one_token = [id, id, ...], one_segment = [seg, seg, ...], one_mask = [1, 1, ..., 0, 0].
def convert_lines(example, max_seq_length, tokenizer):
    max_seq_length -= 2
    longer = 0
    tokens_a = tokenizer.tokenize(example)
    if len(tokens_a) > max_seq_length:
        tokens_a = tokens_a[:max_seq_length]
        longer += 1
    one_token = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens_a + ["[SEP]"]) + \
                [0] * (max_seq_length - len(tokens_a))
    one_segment = [0] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
    one_mask = [1] * (len(tokens_a) + 2) + [0] * (max_seq_length - len(tokens_a))
    return one_token, one_segment, one_mask
Because the model needs a 2-D tensor, add an extra pair of brackets when converting each list to a tensor; unsqueeze(0) works just as well (a small sketch of that variant follows the three lines below).
test_token = torch.tensor([test_token])
test_segment = torch.tensor([test_segment])
test_mask = torch.tensor([test_mask])
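For reference, the unsqueeze(0) variant mentioned above looks like this; one_token, one_segment and one_mask stand for the raw lists returned by the single-input convert_lines:
# Same effect as the extra brackets: build a 1-D tensor, then add a batch dimension.
test_token = torch.tensor(one_token).unsqueeze(0)      # shape (1, max_seq_length)
test_segment = torch.tensor(one_segment).unsqueeze(0)
test_mask = torch.tensor(one_mask).unsqueeze(0)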
Then at test time the result is obtained like this (not written out in full; loading the model and so on is much the same as above):
pred = model(test_token.to(device), token_type_ids=test_segment.to(device), attention_mask=test_mask.to(device))
predic = pred[0].detach().cpu().numpy()
res = np.argmax(predic)
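Putting the single-input pieces together, a compact end-to-end sketch of a deployment-style helper could look like the following; WORK_DIR, the 423_model.bin weights and the three-class setup are taken from the code above, while the predict_one name and the overall structure are my own sketch rather than the exact deployed code:
# Sketch of a single-text prediction helper, assuming the same WORK_DIR layout,
# fine-tuned weights file and 3-way classification as in the code above.
device = torch.device('cuda')
WORK_DIR = "./bert_pretrain/"
bert_config = BertConfig.from_pretrained(WORK_DIR + 'bert_config.json', num_labels=3)
tokenizer = BertTokenizer.from_pretrained(WORK_DIR)
model = BertForSequenceClassification.from_pretrained(WORK_DIR, config=bert_config)
model.load_state_dict(torch.load(WORK_DIR + '423_model.bin'))
model.to(device)
model.eval()

def predict_one(text, max_seq_length=512):
    # reuse the single-input convert_lines defined above
    one_token, one_segment, one_mask = convert_lines(text, max_seq_length, tokenizer)
    test_token = torch.tensor([one_token]).to(device)
    test_segment = torch.tensor([one_segment]).to(device)
    test_mask = torch.tensor([one_mask]).to(device)
    with torch.no_grad():
        pred = model(test_token, token_type_ids=test_segment, attention_mask=test_mask)
    return int(np.argmax(pred[0].cpu().numpy()))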
The Hugging Face source (it seems to have been updated since and is wrapped more cleanly now):
@add_start_docstrings("""Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of
    the pooled output) e.g. for GLUE tasks. """,
    BERT_START_DOCSTRING, BERT_INPUTS_DOCSTRING)
class BertForSequenceClassification(BertPreTrainedModel):
    r"""
        **labels**: (`optional`) ``torch.LongTensor`` of shape ``(batch_size,)``:
            Labels for computing the sequence classification/regression loss.
            Indices should be in ``[0, ..., config.num_labels - 1]``.
            If ``config.num_labels == 1`` a regression loss is computed (Mean-Square loss),
            If ``config.num_labels > 1`` a classification loss is computed (Cross-Entropy).

    Outputs: `Tuple` comprising various elements depending on the configuration (config) and inputs:
        **loss**: (`optional`, returned when ``labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Classification (or regression if config.num_labels==1) loss.
        **logits**: ``torch.FloatTensor`` of shape ``(batch_size, config.num_labels)``
            Classification (or regression if config.num_labels==1) scores (before SoftMax).
        **hidden_states**: (`optional`, returned when ``config.output_hidden_states=True``)
            list of ``torch.FloatTensor`` (one for the output of each layer + the output of the embeddings)
            of shape ``(batch_size, sequence_length, hidden_size)``:
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        **attentions**: (`optional`, returned when ``config.output_attentions=True``)
            list of ``torch.FloatTensor`` (one for each layer) of shape ``(batch_size, num_heads, sequence_length, sequence_length)``:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

    Examples:

        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
        labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
        outputs = model(input_ids, labels=labels)
        loss, logits = outputs[:2]

    """
    def __init__(self, config):
        super(BertForSequenceClassification, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, self.config.num_labels)

        self.init_weights()

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,
                position_ids=None, head_mask=None):
        outputs = self.bert(input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
                            attention_mask=attention_mask, head_mask=head_mask)
        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)
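Since the docstring notes that the logits are scores before SoftMax, taking argmax over them, as done earlier, already gives the predicted class. If actual probabilities are wanted, say to return a confidence value from the API, a softmax can be applied first; a small sketch:
# Sketch: turn the (batch_size, num_labels) logits into probabilities before taking argmax.
import torch.nn.functional as F

probs = F.softmax(pred[0], dim=-1)          # same shape as the logits, each row sums to 1
pred_labels = torch.argmax(probs, dim=-1)   # identical labels to argmax over the raw logits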