A Jay Chou (周杰伦) Lyrics Generator in 100 Lines of Python
Data

- Grab a complete Jay Chou lyrics collection (周杰伦歌词大全.txt) from the internet.
Model

Since the goal is 100 lines, we will use the simplest possible RNN model for generation.

The idea behind generating text with an RNN is simple: for each input sequence, shift the token at time t to time t+1, which gives us input-target pairs. When the input is "T" the model should output "e"; when the input is "(T, e)" it should output "n", and so on.

This gives us a recurrent network structure. If the sequences are very long, vanishing gradients become a problem, but here we are working with lyrics, and a lyric line is usually only a dozen or so characters, so vanishing gradients will not be an issue.
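As a toy illustration of that shift (a hypothetical snippet, not part of the generator code), using the string "Tensor": the input is "Tenso" and the target is "ensor".

chunk = list("Tensor")
input_text, target_text = chunk[:-1], chunk[1:]
print(''.join(input_text))   # Tenso
print(''.join(target_text))  # ensor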
Code
import numpy as np
import tensorflow as tf  # using TF2 here
import os
Load and preprocess the data
read_file = '周杰伦歌词大全.txt'  # path to your file
text = open(read_file, 'rb').read().decode(encoding='utf-8')

# Strip special symbols from the text
def clean_text(text):
    cleaned = text.strip().replace(' ', '').replace('\u3000', '').replace('\ufeff', '').replace('(', '').replace(')', '')
    cleaned = cleaned.replace('\r', '')
    cleaned = cleaned.replace(':', '')
    return cleaned

after_clean = clean_text(text)
vocab = sorted(set(after_clean))
# The whole text has 33042 characters;
# the number of distinct characters (vocab size) is 2422

# char <-> idx mappings
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in after_clean])  # shape (33042,)
seq_length = 20  # max input length 20
examples_per_epoch = len(after_clean) // seq_length  # 33042//20 = 1652 sequences; splitting by sentence and padding would be an improvement

char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)  # +1 so each chunk splits into an input and a target of length seq_length
Create input and target
def split_input_target(chunk):  # shift time t to t+1
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
Inspect the training data
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))
Input data:  '跨时代专辑名称跨时代唱片公司杰威尔专辑语'
Target data: '时代专辑名称跨时代唱片公司杰威尔专辑语种'
Hyperparameters
BATCH_SIZE = 64  # 1652//64, so about 25 training steps per epoch
BUFFER_SIZE = 2000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
vocab_size = len(vocab)  # embedding input size
embedding_dim = 300
rnn_units = 1024
dataset
<BatchDataset shapes: ((64, 20), (64, 20)), types: (tf.int64, tf.int64)>
Build the model
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
model.summary()
Total params: 7,282,622
Trainable params: 7,282,622
Non-trainable params: 0
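As a sanity check (arithmetic only; this assumes TF2's default GRU with reset_after=True, which carries two bias vectors, hence the "+ 2"), the total can be reproduced by hand:

emb = 2422 * 300                    # Embedding: 726,600
gru = 3 * 1024 * (300 + 1024 + 2)   # GRU: 4,073,472
dense = 1024 * 2422 + 2422          # Dense (kernel + bias): 2,482,550
print(emb + gru + dense)            # 7,282,622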
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss)
# Take one batch of training data and inspect it
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()
example_batch_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss: ", example_batch_loss.numpy().mean())
Prediction shape: (64, 20, 2422) # (batch_size, sequence_length, vocab_size)
scalar_loss: 7.7921686
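That initial loss is itself a useful sanity check: an untrained model spreads probability roughly uniformly over the vocabulary, so the expected cross-entropy is ln(vocab_size):

print(np.log(2422))  # ≈ 7.79, matching the scalar_loss above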
Save the model

# Directory where the checkpoints will be saved
checkpoint_dir = 'training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
Training
EPOCHS=20
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
Epoch 14/20
24/24 [==============================] - 3s 109ms/step - loss: 2.2302
Epoch 15/20
24/24 [==============================] - 3s 142ms/step - loss: 1.9284
Epoch 16/20
24/24 [==============================] - 3s 105ms/step - loss: 1.6621
Epoch 17/20
24/24 [==============================] - 3s 115ms/step - loss: 1.4117
Epoch 18/20
24/24 [==============================] - 3s 124ms/step - loss: 1.2068
Epoch 19/20
24/24 [==============================] - 2s 100ms/step - loss: 1.0317
Epoch 20/20
24/24 [==============================] - 3s 120ms/step - loss: 0.8887

Load the model and predict
# Load the weights into a batch-1 copy of the model for prediction
# (the stateful GRU fixes the batch size at build time, so we rebuild with batch_size=1)
weight = tf.train.latest_checkpoint(checkpoint_dir)
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(weight)
model.build(tf.TensorShape([1, None]))
def generate_text(model, start_string):
    # number of characters to generate
    num_generate = 19
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # empty list to collect the results
    text_generated = []
    temperature = 1
    # reset the stateful GRU so earlier generations don't leak into this one
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        predictions = tf.squeeze(predictions, 0)  # drop the batch dimension
        predictions = predictions / temperature
        # sample the next character from the distribution at the last timestep
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        # feed the predicted character (together with the hidden state) back in as the next input
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))
Results
Lyrics data usually contains no punctuation; if it did, the generated text would probably read a little more naturally.
print(generate_text(model, start_string=u'烟雨'))
烟雨而弥补多久不回事难过我一天但愿心碎面现
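Sampling temperature is another knob worth playing with. Below is a hypothetical variant of generate_text that takes temperature as an argument (lower values give safer, more repetitive lines; higher values give wilder, less coherent ones):

def generate_with_temperature(model, start_string, temperature=1.0, num_generate=19):
    input_eval = tf.expand_dims([char2idx[s] for s in start_string], 0)
    text_generated = []
    model.reset_states()
    for _ in range(num_generate):
        # divide the logits by the temperature before sampling
        predictions = tf.squeeze(model(input_eval), 0) / temperature
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return start_string + ''.join(text_generated)

for t in [0.5, 1.0, 1.5]:
    print(t, generate_with_temperature(model, u'烟雨', temperature=t))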