5  Baseline

For Task 1 of the challenge, we can also apply various deep learning methods to solve this binary classification problem.

DataWhale provides a baseline; our task is to get it running end to end and understand each step.

5.1 Environment setup

  • Hardware
    • My machine is a MacBook Air M1 (8 GB memory + 512 GB storage). It can just barely run this baseline.
  • Software
    • PyTorch 2.0 supports the GPU on Apple Silicon chips well through the MPS (Metal Performance Shaders) backend.
  • Code
    • We need to set
      • import torch.mps
      • device = torch.device("mps")
    • Other adjustments include
      • torch.backends.mps.is_available()
      • torch.mps.empty_cache()
      • In short, every place the code mentions cuda needs to be checked; see the sketch below.
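
The following sketch consolidates that pattern (the helper name free_cache is my own, not part of the baseline): select the device once, then route any backend-specific call such as empty_cache() through a dispatch.

import torch
import torch.mps

# Pick the best available device once and reuse it everywhere.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

def free_cache():
    # dispatch the backend-specific cache-clearing call
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "mps":
        torch.mps.empty_cache()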

5.2 Setup

5.2.1 Import modules

# import modules
import os
import pandas as pd
import torch
from torch import nn
import torch.mps
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer
from transformers import BertModel
from pathlib import Path

5.2.2 Global config

batch_size = 16
# maximum text length (in tokens)
text_max_length = 128
# total number of training epochs; this value was picked arbitrarily
epochs = 10
# learning rate
lr = 3e-5
# fraction of the training set held out as the validation set
validation_ratio = 0.1

if torch.cuda.is_available():
    print("Nvidia CUDA device is found.")
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    print("Apple MPS device is found.")
    device = torch.device("mps")
else:
    print("No GPU device is found. Use CPU instead.")
    device = torch.device('cpu')

# print the loss every this many steps
log_per_step = 50

# dataset location; create the directory if it does not exist
dataset_dir = Path("./data")
os.makedirs(dataset_dir, exist_ok=True)

# model checkpoint directory; create it if it does not exist
model_dir = Path("./model/bert_checkpoints")
os.makedirs(model_dir, exist_ok=True)

print("Device:", device)
Apple MPS device is found.
Device: mps

5.3 Data collection and cleaning

5.3.1 Import data

# Read the datasets and do basic cleaning

pd_train_data = pd.read_csv('./data/train.csv')
pd_train_data['title'] = pd_train_data['title'].fillna('')
pd_train_data['abstract'] = pd_train_data['abstract'].fillna('')

test_data = pd.read_csv('./data/testB.csv')
test_data['title'] = test_data['title'].fillna('')
test_data['abstract'] = test_data['abstract'].fillna('')
# concatenate title, author, abstract, and keywords into a single text field
pd_train_data['text'] = pd_train_data['title'].fillna('') + ' ' + pd_train_data['author'].fillna('') + ' ' + pd_train_data['abstract'].fillna('') + ' ' + pd_train_data['Keywords'].fillna('')
test_data['text'] = test_data['title'].fillna('') + ' ' + test_data['author'].fillna('') + ' ' + test_data['abstract'].fillna('') + ' ' + test_data['Keywords'].fillna('')

5.3.2 Splitting into training and validation sets

# Randomly sample a validation set from the training set
validation_data = pd_train_data.sample(frac=validation_ratio)
train_data = pd_train_data[~pd_train_data.index.isin(validation_data.index)]
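
Note that sample() above draws a different split on every run; a minimal variant, with a random_state value of my own choosing, makes the split reproducible:

# Reproducible variant: seed the sampling
validation_data = pd_train_data.sample(frac=validation_ratio, random_state=42)
train_data = pd_train_data[~pd_train_data.index.isin(validation_data.index)]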

5.3.3 Dataset

# Build the Dataset
class MyDataset(Dataset):

    def __init__(self, mode='train'):
        super(MyDataset, self).__init__()
        self.mode = mode
        # select the corresponding data
        if mode == 'train':
            self.dataset = train_data
        elif mode == 'validation':
            self.dataset = validation_data
        elif mode == 'test':
            # In test mode, return the text together with the uuid; using the uuid
            # as the "target" makes it easy to write out the results later.
            self.dataset = test_data
        else:
            raise Exception("Unknown mode {}".format(mode))

    def __getitem__(self, index):
        # fetch the index-th row
        data = self.dataset.iloc[index]
        # get its text
        text = data['text']
        # return according to the mode
        if self.mode == 'test':
            # in test mode, use the uuid as the target
            label = data['uuid']
        else:
            label = data['label']
        # return the text and its label
        return text, label

    def __len__(self):
        return len(self.dataset)
train_dataset = MyDataset('train')
validation_dataset = MyDataset('validation')
train_dataset.__getitem__(0)
('Seizure Detection and Prediction by Parallel Memristive Convolutional Neural Networks Li, Chenqi; Lammie, Corey; Dong, Xuening; Amirsoleimani, Amirali; Azghadi, Mostafa Rahimi; Genov, Roman During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and areaconstrained settings remains a challenging task; especially when many recording channels are used. In this paper, we propose a novel low-latency parallel Convolutional Neural Network (CNN) architecture that has between 2-2,800x fewer network parameters compared to State-Of-The-Art (SOTA) CNN architectures and achieves 5-fold cross validation accuracy of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn Electroencephalogram (EEG), CHB-MIT and SWEC-ETHZ seizure datasets, respectively. We subsequently implement our network onto analog crossbar arrays comprising Resistive Random-Access Memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining hardware requirements of theCNNcomponent of our system. We parallelize the execution of convolution layer kernels on separate analog crossbars to enable 2 orders of magnitude reduction in latency compared to SOTA hybrid Memristive-CMOS Deep Learning (DL) accelerators. Furthermore, we investigate the effects of non-idealities on our system and investigate Quantization Aware Training (QAT) to mitigate the performance degradation due to lowAnalog-to-Digital Converter (ADC)/Digital-to-Analog Converter (DAC) resolution. Finally, we propose a stuck weight offsetting methodology to mitigate performance degradation due to stuck RON/ROFF memristor weights, recovering up to 32% accuracy, without requiring retraining. The CNN component of our platform is estimated to consume approximately 2.791Wof power while occupying an area of 31.255 mm(2) in a 22 nm FDSOI CMOS process. CNN; Seizure Detection; Seizure Prediction; EEG; RRAM; Memristive Crossbar Array',
 1)

5.3.4 DataLoader

# Load the tokenizer that matches the pretrained BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Next, build the DataLoader.
# We need a collate_fn that tokenizes and pads the sentences and assembles them into a batch:
def collate_fn(batch):
    """
    Convert a batch of text sentences into tensors and assemble them into a batch.
    :param batch: a batch of sentences, e.g. [('text', target), ('text', target), ...]
    :return: the processed result, e.g.:
             src: {'input_ids': tensor([[ 101, ..., 102, 0, 0, ...], ...]), 'attention_mask': tensor([[1, ..., 1, 0, ...], ...])}
             target: [1, 1, 0, ...]
    """
    text, label = zip(*batch)
    text, label = list(text), list(label)

    # src is fed straight to bert; since the tokenizer and the model are paired,
    # its output needs no further processing
    # padding='max_length' pads sequences shorter than max_length
    # truncation=True truncates sequences longer than max_length
    src = tokenizer(text, padding='max_length', max_length=text_max_length, return_tensors='pt', truncation=True)

    return src, torch.LongTensor(label)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
inputs, targets = next(iter(train_loader))
print("inputs:", inputs)
print("targets:", targets)
inputs: {'input_ids': tensor([[  101,  9229,  1996,  ...,  6024,  5758,   102],
        [  101,  2311, 21346,  ...,  2213, 16257,   102],
        [  101,  4359,  3945,  ...,  1010,  1998,   102],
        ...,
        [  101,  2047,  8973,  ..., 18872,  1011,   102],
        [  101, 15251,  2005,  ...,  4972,  1012,   102],
        [  101,  2678, 26351,  ...,  1011,  2812,   102]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])}
targets: tensor([0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0])
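
As an optional sanity check (my addition, reusing the inputs batch printed above), decoding the first encoded example back to text shows exactly what the model sees after padding and truncation to 128 tokens:

# decode the first example to inspect the effect of padding/truncation
print(tokenizer.decode(inputs['input_ids'][0]))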

5.4 Model

Defining a model in PyTorch requires subclassing nn.Module and implementing at least two methods: __init__, which builds the model structure, and forward, which defines the forward pass.
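
For illustration only (this toy class is not part of the baseline), the minimal shape of such a subclass looks like this:

# minimal nn.Module subclass: __init__ builds the layers, forward uses them
class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(768, 1)

    def forward(self, x):
        # map a 768-dim feature vector to a probability
        return torch.sigmoid(self.linear(x))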

# Define the prediction model: BERT plus a final prediction head
class MyModel(nn.Module):

    def __init__(self):
        super(MyModel, self).__init__()

        # load the pretrained BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # final prediction head
        self.predictor = nn.Sequential(
            nn.Linear(768, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, src):
        """
        :param src: the tokenized text data
        """

        # Unpack src directly into bert; this works because the tokenizer and the
        # model come as a pair. Take the encoder output and use the hidden state
        # of the leading [CLS] token as the input to the final prediction head.
        outputs = self.bert(**src).last_hidden_state[:, 0, :]

        # use the prediction head to make the final prediction
        return self.predictor(outputs)
model = MyModel()
model = model.to(device)
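
As a quick shape check (my addition, reusing the inputs batch from the DataLoader section), one forward pass should produce a single sigmoid probability per example:

# run one batch through the model; expect shape [batch_size, 1]
with torch.no_grad():
    probs = model({k: v.to(device) for k, v in inputs.items()})
print(probs.shape)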

5.4.1 Loss function and optimizer

# Define the loss function and optimizer. Binary cross entropy is used here:
criteria = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# Since inputs is a dict of tensors, define a helper that moves each one to the device
def to_device(dict_tensors):
    result_tensors = {}
    for key, value in dict_tensors.items():
        result_tensors[key] = value.to(device)
    return result_tensors

5.4.2 Validation

# Define a validation routine that computes accuracy and loss on the validation set
def validate():
    model.eval()
    total_loss = 0.
    total_correct = 0
    # no gradients are needed during evaluation
    with torch.no_grad():
        for inputs, targets in validation_loader:
            inputs, targets = to_device(inputs), targets.to(device)
            outputs = model(inputs)
            loss = criteria(outputs.view(-1), targets.float())
            total_loss += float(loss)

            # count thresholded predictions (>= 0.5) that match the targets
            correct_num = ((outputs >= 0.5).float().flatten() == targets).sum()
            total_correct += correct_num

    return total_correct / len(validation_dataset), total_loss / len(validation_dataset)

5.5 Training and evaluation

# First, switch the model to training mode
model.train()

# empty cache to avoid out of memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()

if torch.backends.mps.is_available():
    torch.mps.empty_cache()

# Define a few variables to help print the loss
total_loss = 0.
# step counter
step = 0

# best accuracy seen on the validation set
best_accuracy = 0

# start training
for epoch in range(epochs):
    model.train()
    for i, (inputs, targets) in enumerate(train_loader):
        # move the batch of training data to the device
        inputs, targets = to_device(inputs), targets.to(device)
        # forward pass
        outputs = model(inputs)
        # compute the loss, then backpropagate and update the weights
        loss = criteria(outputs.view(-1), targets.float())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        total_loss += float(loss)
        step += 1

        if step % log_per_step == 0:
            print("Epoch {}/{}, Step: {}/{}, total loss:{:.4f}".format(epoch+1, epochs, i, len(train_loader), total_loss))
            total_loss = 0

        del inputs, targets

    # after each epoch, evaluate on the validation set
    accuracy, validation_loss = validate()
    print("Epoch {}, accuracy: {:.4f}, validation loss: {:.4f}".format(epoch+1, accuracy, validation_loss))
    torch.save(model, model_dir / f"model_{epoch}.pt")

    # keep the checkpoint with the best validation accuracy
    if accuracy > best_accuracy:
        torch.save(model, model_dir / "model_best.pt")
        best_accuracy = accuracy

5.5.1 Prediction

# Load the best model and run prediction on the test set
model = torch.load(model_dir / "model_best.pt")
model = model.eval()
test_dataset = MyDataset('test')
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

5.6 Output

results = []
# no gradients are needed for inference
with torch.no_grad():
    for inputs, ids in test_loader:
        outputs = model(inputs.to(device))
        # threshold the sigmoid outputs at 0.5 to get 0/1 labels
        outputs = (outputs >= 0.5).int().flatten().tolist()
        ids = ids.tolist()
        results = results + [(id, result) for result, id in zip(outputs, ids)]
test_label = [pair[1] for pair in results]
test_data['label'] = test_label
# fill the Keywords column of the submission file with the title
test_data['Keywords'] = test_data['title'].fillna('')
test_data[['uuid', 'Keywords', 'label']].to_csv('submit-task1-DL-baseline.csv', index=None)

5.7 Results and discussion

5.7.1 Running locally

  • The baseline runs successfully on Apple Silicon, but the M1 has relatively few GPU cores, so training is very slow.
  • In the end, I completed training on an HP ZBook mobile workstation equipped with a Quadro RTX 5000.
    • With epochs = 10, the submission scored 0.998 on the competition platform.

5.7.2 Cloud services

  • Alibaba Cloud ran into network problems (I gave up after wasting two hours on them)
    • Uploading the data directly failed
      • Workaround
        • Push the code and data to a public repo on GitHub
        • Edit /etc/hosts with the IP addresses of GitHub's domains published by GitHub520
        • Pull the code and data onto the Alibaba Cloud instance
    • The model would not run
      • Solution
        • Give up on Alibaba Cloud
  • Google Colab is worth considering as an alternative