My First Kaggle Competition (3) - English Language Learning: Baseline Design and Training

How well a model trains depends not only on the model itself but also, to a large extent, on the training resources available: the data and the GPU.

Baseline design and code

Given the trade-offs above, the baseline is roberta-base plus two fully connected layers. The first vector of roberta's last hidden layer (i.e., the CLS embedding) is passed through a fully connected layer, a batchnorm layer, a ReLU, a second fully connected layer, and finally a sigmoid scaled to (0, 6) as the output. The loss function is MSE.

Batchnorm is used here as a regularizer in place of dropout, though whether it belongs in this spot is debatable. In my tests roberta takes roughly 1700 MB of GPU memory; with the AdamW optimizer and the data processing from the previous post (each essay split into at most 5 sub-sentences, so a batch holds at most batch_size * 5 sequences), that just barely supports batch_size = 4. This is also why I did not pick deberta: more parameters, slower training, and an even smaller feasible batch size. The obvious downside is that batchnorm behaves poorly on such small batches.
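
If you want to reproduce that memory figure, a rough check is to query PyTorch's allocator after one forward/backward pass of the backbone. This is only a sketch (the variable names are just for this check; optimizer state and the head are not included, and the 1700 MB above was measured on my own pipeline):

# Rough VRAM check for roberta-base alone: load the backbone, run one
# forward/backward pass on a dummy batch, then read the allocator peak.
import torch
from transformers import RobertaTokenizer, RobertaModel

check_device = "cuda"
check_model = RobertaModel.from_pretrained('../input/roberta-base').to(check_device)
check_tok = RobertaTokenizer.from_pretrained('../input/roberta-base')

dummy_batch = check_tok(["a dummy sub-sentence"] * 4, return_tensors="pt", padding=True).to(check_device)
check_model(**dummy_batch)['last_hidden_state'].sum().backward()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024 ** 2:.0f} MB")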

The initialization code:

path = '../input/feedback-prize-english-language-learning/train.csv'
import pandas as pd
data = pd.read_csv(path)
data['full_text'] = data['full_text'].apply(lambda x: x.strip())

import torch
from torch import nn
import torch.nn.functional as F
import numpy as np
import random

def init_seeds(seed=7):
    random.seed(seed)                 # seed for module random
    np.random.seed(seed)              # seed for numpy
    torch.manual_seed(seed)           # seed for PyTorch CPU
    torch.cuda.manual_seed(seed)      # seed for current PyTorch GPU
    torch.cuda.manual_seed_all(seed)  # seed for all PyTorch GPUs
    if seed == 0:
        # if True, causes cuDNN to only use deterministic convolution algorithms.
        torch.backends.cudnn.deterministic = True
        # if True, causes cuDNN to benchmark multiple convolution algorithms and select the fastest.
        torch.backends.cudnn.benchmark = False

device = "cuda" if torch.cuda.is_available() else "cpu"

init_seeds(42)

from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('../input/roberta-base')
model = RobertaModel.from_pretrained('../input/roberta-base').to(device)
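
As a quick sanity check (not part of the pipeline), the backbone's last_hidden_state has shape (num_sequences, seq_len, 768), and the CLS (<s>) embedding used below is simply position 0 along the sequence dimension:

# Check the shape of the backbone output and where the CLS embedding lives.
sample = tokenizer(["This is a sample sub-sentence."], return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
with torch.no_grad():
    sample_out = model(sample['input_ids'].to(device))
print(sample_out['last_hidden_state'].shape)        # torch.Size([1, seq_len, 768])
print(sample_out['last_hidden_state'][:, 0].shape)  # torch.Size([1, 768]): the CLS vector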

Network design

class Class_Pool_Net(nn.Module):
    def __init__(self, batch_size, pretrained_model, device):
        super(Class_Pool_Net, self).__init__()
        self.batch_size = batch_size
        self.model = pretrained_model
        self.linear1 = nn.Linear(768, 256).to(device)
        self.batchnorm = nn.BatchNorm1d(256).to(device)
        self.linear2 = nn.Linear(256, 6).to(device)

    def forward(self, x):
        # CLS embedding of every sub-sentence, regrouped per essay: (batch_size, n_sub, 768)
        output_embedding = self.model(x)['last_hidden_state'][:, 0].reshape([self.batch_size, -1, 768])
        y1 = F.adaptive_max_pool2d(input=output_embedding, output_size=(1, 768)).squeeze(1)  # max-pool over the sub-sentence embeddings first
        y2 = self.linear1(y1)
        y3 = self.batchnorm(y2)
        y4 = F.relu(y3)
        y5 = self.linear2(y4)
        y6 = torch.sigmoid(y5) * 6  # map the scores to (0, 6)
        return y6

    def change_batch_size(self, size):
        self.batch_size = size
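
A small smoke test of the network (random token ids standing in for the real collate output from the previous post): two essays with three sub-sentences each should come out as a (2, 6) tensor with values in (0, 6).

# Smoke test: dummy token ids shaped (batch_size * n_sub, seq_len).
test_net = Class_Pool_Net(batch_size=2, pretrained_model=model, device=device)
dummy_ids = torch.randint(low=5, high=1000, size=(2 * 3, 16)).to(device)
test_net.eval()  # keep BatchNorm on running stats for this tiny batch
with torch.no_grad():
    scores = test_net(dummy_ids)
print(scores.shape, scores.min().item(), scores.max().item())  # torch.Size([2, 6]), values within (0, 6)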

Evaluation function

For the evaluation function I borrowed someone else's code, because at first I had misread the metric: I thought the MSE was taken row-wise (per essay) and I also left out the square root, so my scores looked terrible for a long time. It turned out to be column-wise (per target dimension).

from sklearn.metrics import mean_squared_error

def evaluate_function(y_preds, y_trues):
    scores = []
    y_preds = y_preds.cpu()
    y_trues = y_trues.cpu()
    idxes = y_trues.shape[1]
    for i in range(idxes):
        y_true = y_trues[:, i]
        y_pred = y_preds[:, i]
        score = mean_squared_error(y_true, y_pred, squared=False)  # RMSE of one target column
        scores.append(score)
    mcrmse_score = np.mean(scores)
    return mcrmse_score
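
A tiny made-up example to show what "column-wise" means: RMSE is computed per target dimension and then averaged across the dimensions (MCRMSE), not per essay.

# Made-up numbers: 3 essays, 2 target dimensions (instead of 6) to keep it short.
toy_preds = torch.tensor([[3.0, 2.5], [4.0, 3.0], [2.0, 4.5]])
toy_trues = torch.tensor([[3.5, 2.5], [4.0, 2.0], [2.0, 4.0]])
print(evaluate_function(toy_preds, toy_trues))
# column 0: RMSE = sqrt(0.25 / 3) ~ 0.289; column 1: RMSE = sqrt(1.25 / 3) ~ 0.645; mean ~ 0.467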

Training function

Learning rate setup

There is a pitfall here. At first I set the learning rate of both the pretrained layers and the fully connected layers to 1e-5, and the results were very poor. Only after reading a few posts did I realize the two do not have to match: the freshly initialized layers should get a larger learning rate. The function below assigns a learning rate per parameter group: no weight decay for bias and Norm parameters, and the fully connected layers' learning rate raised to 1e-2.

def get_group_parameters(model):
    params = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm', 'batchnorm']
    other = ['linear1', 'linear2']
    no_main = no_decay + other

    param_group = [
        {'params': [p for n, p in params if not any(nd in n for nd in no_main)], 'weight_decay': 1e-2, 'lr': 1e-5},
        {'params': [p for n, p in params if not any(nd in n for nd in other) and any(nd in n for nd in no_decay)], 'weight_decay': 0, 'lr': 1e-5},
        {'params': [p for n, p in params if any(nd in n for nd in other) and any(nd in n for nd in no_decay)], 'weight_decay': 0, 'lr': 1e-2},
        {'params': [p for n, p in params if any(nd in n for nd in other) and not any(nd in n for nd in no_decay)], 'weight_decay': 1e-2, 'lr': 1e-2},
    ]
    return param_group
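
To double-check where each parameter ends up (for instance, with this grouping the head's batchnorm weights fall into the lr=1e-5 no-decay group rather than the 1e-2 head group, which may or may not be intended), you can print the groups once the network in the training code below has been built:

# Sketch, assuming `net` is the Class_Pool_Net instance created in the training code below.
for idx, group in enumerate(get_group_parameters(net)):
    print(f"group {idx}: {len(group['params'])} tensors, lr={group['lr']}, weight_decay={group['weight_decay']}")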

Gradient accumulation

I had heard this term before but never paid attention to it, until the batch size here became truly tiny (on my lab's 10 GB card the batch size could only be 2, and I had to use SGD instead of AdamW; even on Kaggle's 16 GB card it was only 4). The drawback of a small batch is instability: each backward pass only sees a tiny batch, so the gradient direction jumps back and forth. Gradient accumulation exists exactly for this: instead of stepping the optimizer after every backward pass, you accumulate gradients for a number of steps and only then apply the averaged gradient, which is roughly equivalent to training with a larger batch. Of course it cannot fix every small-batch problem; batchnorm, for example, still only sees the small batch.
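
The pattern in isolation looks like this (a toy example with a throwaway linear model; the real training loop below does the same thing with accumulate_steps = 80):

# Toy gradient accumulation: 8 micro-batches of size 2 behave like one batch of 16.
toy_net = nn.Linear(10, 1)
toy_opt = torch.optim.SGD(toy_net.parameters(), lr=0.1)
toy_loss = nn.MSELoss()
accumulate = 8

toy_opt.zero_grad()
for step in range(accumulate):
    Xb, yb = torch.randn(2, 10), torch.randn(2, 1)
    (toy_loss(toy_net(Xb), yb) / accumulate).backward()  # gradients add up in .grad
toy_opt.step()   # one update using the averaged gradient
toy_opt.zero_grad()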

Training loop

I can no longer find the earlier version, so here is a single-fold one.

from torch.utils.data import DataLoader  # writing_dataset and collate come from the data-processing code in the previous post

batch_size = 4
epoch_num = 10
lr = 1e-5
loss = nn.MSELoss()

accumulate_steps = 80

train_dataset = writing_dataset()
val_dataset = writing_dataset(data=data, typ='val')

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, collate_fn=collate, drop_last=True)
val_loader = DataLoader(dataset=val_dataset, batch_size=1, collate_fn=collate)

net = Class_Pool_Net(batch_size, model, device)
param = get_group_parameters(net)
optimizer = torch.optim.AdamW(param, lr=lr)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2, 4, 6, 8], gamma=0.4)  # learning-rate schedule; some say AdamW does not need manual scheduling
lo = 0
min_lo = 2e5
min_mse = 200
for epoch in range(epoch_num):
    lo = 0
    net.train()
    net.batch_size = batch_size
    for i, (X, y) in enumerate(train_loader):
        X = X.to(device)
        y = y.to(device)
        y_hat = net(X)
        l = loss(y_hat, y)
        l = l / accumulate_steps
        lo += l.item()
        l.backward()  # backward only computes and accumulates gradients
        if (i + 1) % accumulate_steps == 0 or (i + 1) == len(train_loader):  # gradient accumulation: step() is what applies the gradients to the parameters
            optimizer.step()
            optimizer.zero_grad()
        if (i + 1) % (accumulate_steps * 8) == 0 or (i + 1) == len(train_loader):  # useful when the dataset is large or heavily augmented: validate without waiting for the epoch to finish
            print(f'the {epoch}th: {100 * (i + 1) / len(train_loader)} %')
            mse = 0
            net.eval()
            net.batch_size = 1
            ypred = None
            ytrue = None
            for j, (X_val, y_val) in enumerate(val_loader):  # separate loop variable so the training index i is not shadowed
                with torch.no_grad():
                    X_val = X_val.to(device)
                    y_val = y_val.to(device)
                    y_hat_val = net(X_val)
                    if j == 0:
                        ypred = y_hat_val
                        ytrue = y_val
                    else:
                        ypred = torch.cat((ypred, y_hat_val))
                        ytrue = torch.cat((ytrue, y_val))
            mse = evaluate_function(ypred, ytrue)
            print(f'mse: {mse}.')
            if mse < min_mse:
                torch.save({'model': net.state_dict()}, f'./minmse_fold.pth')
                min_mse = mse
            net.train()
            net.batch_size = batch_size
    scheduler.step()

    if lo < min_lo:
        torch.save({'model': net.state_dict()}, f'./minloss_fold.pth')
        min_lo = lo
    print(f'{epoch}th epoch: last loss: {lo * accumulate_steps}.')
    if mse < min_mse:
        torch.save({'model': net.state_dict()}, f'./minmse_fold.pth')
        min_mse = mse
    # print(f'{epoch}th epoch: last loss: {lo * accumulate_steps} and mse: {mse}.')

torch.save({'model': net.state_dict()}, f'./last_fold.pth')
print(f'End.')
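
For inference afterwards (the submission notebook is not shown here), the best checkpoint can be restored with a few lines like these:

# Sketch: restore the best-MCRMSE weights saved above.
ckpt = torch.load('./minmse_fold.pth', map_location=device)
net.load_state_dict(ckpt['model'])
net.eval()
net.batch_size = 1  # inference feeds one essay at a time, as in validation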

Training results

The original logs are lost, but the model scored 0.47 on the 26% public test set, which placed around 200th on the day of submission; it has since slipped to 300-something.

As for where 0.47 sits on the CPU (efficiency) leaderboard, it peaked inside the top 5, haha. I wonder whether that track hands out medals, or whether only the top three get prize money qaq.

Shortcomings and outlook

Shortcomings

1. Too little data.

2. Training all six target dimensions together is unlikely to ever score really well.

3. Using roberta: deberta-large kept running out of GPU memory for me, and half-precision training gave mediocre results.

4. I have not had time for 10-fold cross-validation, nor for runs with several different seeds.

5. Whether batchnorm actually helps is hard to say.

Outlook

The CPU and GPU tracks clearly call for different directions.

For CPU, point 2 cannot really be improved, so I plan to start with data augmentation.

For GPU, the focus should be on improving point 2.

One option may be to first train six separate, stronger networks on GPU, and then use pseudo-label training to feed the CPU network.

