2 Baseline (Logistic Regression)

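You are reading a work-in-progress research report on "Automatically Extracting Keywords from Paper Abstracts in the Test Set". This chapter is being heavily restructured, so its content may be confusing or incomplete.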

Datawhale provides a machine-learning baseline for contestants to use as a reference.

Original tutorial: https://datawhaler.feishu.cn/docx/HGiNdHedwoAtcVx0kkScwaI3nKc

One-click run: https://aistudio.baidu.com/aistudio/projectdetail/6522950

2.1 Original code

I changed some variable names and comments, but the core code is unchanged.

2.1.1 Import the required modules

import pandas as pd

# Import the BOW (Bag of Words) vectorizer
# (TfidfVectorizer from the same module could be used instead)
from sklearn.feature_extraction.text import CountVectorizer

# Import logistic regression model
from sklearn.linear_model import LogisticRegression

# filter out warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

2.1.2 Load the data

# read train data and test data
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/testB.csv')

# fillna with empty string
train['title'] = train['title'].fillna('')
train['abstract'] = train['abstract'].fillna('')

test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')

2.1.3 Vectorize the text data

# Combine title, author and abstract into one text field
# (the training text also appends Keywords; the test set has no such column)
train['text'] = train['title'].fillna('') + ' ' + train['author'].fillna('') + ' ' + train['abstract'].fillna('') + ' ' + train['Keywords'].fillna('')
test['text'] = test['title'].fillna('') + ' ' + test['author'].fillna('') + ' ' + test['abstract'].fillna('')

# Vectorize text using CountVectorizer (or TfidfVectorizer if you want)
vector = CountVectorizer().fit(train['text'])
train_vector = vector.transform(train['text'])
test_vector = vector.transform(test['text'])

2.1.4 Fit and predict

# Set model
model = LogisticRegression()

# Fit to data
model.fit(train_vector, train['label'])

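# Predict the label for every test paper
test['label'] = model.predict(test_vector)
# Keyword extraction is a separate task; the title is used here only as a placeholder
test['Keywords'] = test['title'].fillna('')
# Write the submission file without the DataFrame index
test[['uuid','Keywords','label']].to_csv('submit_task1_LogisticRegression.csv', index=False)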

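Overfitting

Submitting submit_task1_LogisticRegression.csv to the competition page yields an f1_score of only 0.67116.

This suggests that our model overfits. A likely cause is that the training text includes each paper's keywords while the test text does not, and the keywords are highly correlated with whether a paper belongs to the medical domain.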

2.2 Adding cross-validation to the basic baseline

from sklearn.model_selection import cross_validate

# fit with logistic regression
model = LogisticRegression()
cv = cross_validate(model,
                    X=train_vector, 
                    y=train['label'], cv=5,
                    # n_jobs=-1, # mobilise all CPU cores
                    scoring='f1') # we can also set scoring='accuracy'

cv_scores = cv['test_score']
mean_cv_scores = cv_scores.mean()
fit_time = cv['fit_time'].mean()
score_time = cv['score_time'].mean()
print('Mean Fit time: {}'.format(round(fit_time, 7)))

Mean Fit time: 0.7436823

print('Mean Score time: {}'.format(round(score_time, 7)))

Mean Score time: 0.0028631

# print('CV scores: {}'.format(cv_scores))
print('Mean CV scores: {}'.format(round(mean_cv_scores, 4)))

Mean CV scores: 0.982

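2.3 Overfitting

The validation score obtained in cross-validation is clearly higher than the score the test-set predictions received on the competition platform. This indicates an overfitting problem. A plausible explanation is that the validation folds contain the original papers' keywords (Keywords), whereas the test set does not. In fact, "automatically extracting keywords from the paper abstracts in the test set" is precisely the second task of this competition.

2.4 Discussion

2.4.1 The overfitting problem

2.5 Summary and outlook

The baseline code provides a runnable framework. We can build on it and tune it into a better model.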

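Some directions for next steps

Try other classifiers introduced in machine learning courses
- For example, the machine learning course mentions that naive Bayes performs well on spam filtering (a binary text classification problem).
- For other machine learning classifiers, see (An Introduction to Statistical Learning: with Applications in Python 2023, Chap 4)
  - Linear regression
  - Support vector machines
  - etc.
Correct the tendency of cross-validation to underestimate the error
- Hand-write a cross-validation function (return values: predictions, f1_score, fit_time)
- Do not apply cross-validation to the test set
- A plain for loop is enough; our dataset is small (6000 rows), so there is no need to worry about multi-core computation
Use deep learning algorithms
- For example, BERT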
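The hand-written cross-validation idea from the list above could look roughly like the following sketch. The function name `manual_cross_validate`, the use of `StratifiedKFold`, and the toy documents are all assumptions made for illustration; the loop returns out-of-fold predictions, per-fold F1 scores, and per-fold fit times, and it is run once with logistic regression and once with naive Bayes.

```python
import time

import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB


def manual_cross_validate(model, X, y, n_splits=5, seed=0):
    """Hypothetical helper: out-of-fold predictions, per-fold F1, per-fold fit time."""
    y = np.asarray(y)
    predictions = np.empty_like(y)
    f1_scores, fit_times = [], []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # A plain for loop over the folds; no parallelism needed for small data
    for train_idx, valid_idx in folds.split(X, y):
        fold_model = clone(model)  # fresh, unfitted copy for every fold
        start = time.perf_counter()
        fold_model.fit(X[train_idx], y[train_idx])
        fit_times.append(time.perf_counter() - start)
        predictions[valid_idx] = fold_model.predict(X[valid_idx])
        f1_scores.append(f1_score(y[valid_idx], predictions[valid_idx]))
    return predictions, np.array(f1_scores), np.array(fit_times)


# Toy data (made up): label 1 = medical, label 0 = not medical
docs = [
    "tumor scan therapy", "patient tumor drug",
    "clinical drug trial", "patient scan clinical",
    "graph sorting algorithm", "compiler graph parser",
    "sorting parser network", "network compiler algorithm",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X_toy = CountVectorizer().fit_transform(docs)

results = {}
for name, clf in [("logreg", LogisticRegression()), ("naive_bayes", MultinomialNB())]:
    preds, scores, times = manual_cross_validate(clf, X_toy, labels, n_splits=2)
    results[name] = (preds, scores, times)
    print(name, "mean F1:", round(float(scores.mean()), 4))
```

Keeping the out-of-fold predictions makes it possible to inspect which papers are misclassified, which the summary returned by `cross_validate` does not expose directly.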
