- Competition name: Chinese Idiom Cloze Algorithm Challenge (中文成語填空挑戰賽)
- Competition link: https://challenge.xfyun.cn/topic/info?type=chinese-idioms
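I. Background
Chinese culture is broad, deep, and long-standing, and idioms (成語, chengyu) are among its most distinctive elements. Most idioms consist of four characters and usually trace back to a classical story or source. Some can be understood from their literal wording, such as “小題大做” (making a fuss over a trifle) and “後來居上” (latecomers overtaking those ahead). Others make sense only if one knows the story behind them, such as “朝三暮四” (three in the morning, four in the evening) and “杯弓蛇影” (mistaking the reflection of a bow in a cup for a snake).
Idioms are an important part of the Chinese-language curriculum in primary and junior high school. How should the right idiom be chosen for a given sentence? This task asks participants to build models that understand Chinese idioms.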
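II. Task
Given a Chinese sentence in which an idiom has been blanked out, participants must choose the most suitable idiom from a list of candidates based on the surrounding context and fill it into the blank.
Each training record provides the sentence (‘text’, with the idiom replaced by [MASK][MASK][MASK][MASK]), the candidate idioms (‘candidate’), and the correct idiom (‘label’); these are the column names used by the baseline code later in this document.
The training set contains 50,000 records and the test set 10,000. The label field of the test set is empty and must be predicted by participants.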
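To make the record layout concrete, here is a minimal sketch of parsing one row. The sentence and candidate list are invented for illustration and are not taken from the dataset. The sketch also assumes that the ‘candidate’ column stores a Python-style list literal, which is what the baseline code suggests.

```python
import ast

# Hypothetical record, invented for illustration; not taken from the dataset.
record = {
    "text": "這個話題一直被大家[MASK][MASK][MASK][MASK]。",
    "candidate": "['津津樂道', '息息相關', '名列前茅', '夜以繼日']",
    "label": "津津樂道",
}

MASK = "[MASK]" * 4  # the blank is marked by four consecutive [MASK] tokens


def parse_record(rec):
    """Return (candidate list, index of the correct idiom, filled sentence)."""
    # literal_eval parses the list literal without executing arbitrary code.
    candidates = ast.literal_eval(rec["candidate"])
    answer_index = candidates.index(rec["label"])
    filled = rec["text"].replace(MASK, rec["label"])
    return candidates, answer_index, filled


candidates, answer_index, filled = parse_record(record)
print(candidates, answer_index, filled)
```

The index of the correct idiom within the candidate list is what a multiple-choice classifier is trained to predict.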
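III. Evaluation Rules
1. Data description
The data consists of a training set (50,000 records) and a test set (10,000 records), both in CSV format with columns separated by \t. See sample_submit.csv for an example submission for the test set: no header is needed; simply write the 10,000 predicted idioms one per line, in order.
2. Evaluation metric
Submissions are scored by classification accuracy; the maximum score is 1. Reference evaluation code: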
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
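3. Evaluation and ranking
(1) The organizers provide the data for download; participants develop and debug locally and submit results on the competition page.
(2) Each team may submit at most 3 times per day.
(3) The leaderboard is sorted by score from high to low, using each team's best historical score.
IV. Submission Requirements
File format: prediction results are submitted as a CSV file
File size: no limit
Submission limit: at most 3 per team per day
Details of the prediction file:
Submit in CSV format, encoded as UTF-8
Before submitting, make sure the format of the prediction file matches that of sample_submit.csv. The format is as follows: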
label
津津樂道
息息相關
必經之路
顧名思義
痛快淋漓
名列前茅
無所事事
如火如荼
夜以繼日
緊鑼密鼓
源源不斷
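A minimal sketch of writing such a file is shown below. The function name `write_submission` and its `header` switch are hypothetical. The switch exists because the data description says no header is needed while the sample above begins with a `label` line, so it is worth checking sample_submit.csv before choosing either form.

```python
import csv
import os
import tempfile


def write_submission(predictions, path, header=None):
    """Write one predicted idiom per line as a UTF-8 CSV file.

    `header` is a hypothetical switch: pass "label" to emit a header row,
    or leave it as None to write only the predictions.
    """
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        if header is not None:
            writer.writerow([header])
        for idiom in predictions:
            writer.writerow([idiom])


# Usage with three idioms from the sample above.
preds = ["津津樂道", "息息相關", "必經之路"]
out_path = os.path.join(tempfile.mkdtemp(), "submit.csv")
write_submission(preds, out_path, header="label")

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```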
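V. Schedule
Official competition
August 16 to September 15
The final preliminary-round score is each team's best score within the preliminary-round period (test rankings are not counted).
The submission deadline for the preliminary round is 17:00 on September 15; the official competition rankings are announced at 10:00 on August 16.
Long-term competition
September 16 to October 24
Because the event is mainly intended for learning and practice, the official competition then becomes a long-term competition so that developers can keep practicing. Submissions in this phase continue to update the leaderboard, but the leaderboard is no longer officially published and no prizes are awarded.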
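VI. Prizes
First, second, and third prizes are awarded to one team each:
First prize: 1 team, weekly-competition first-prize certificate, prize money: 1,000 yuan
Second prize: 1 team, weekly-competition second-prize certificate, prize money: 800 yuan
Third prize: 1 team, weekly-competition third-prize certificate, prize money: 500 yuan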
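VII. Baseline Approach
- Treat the data as an NLP reading-comprehension (multiple-choice) problem and convert it to the SWAG format
- Build the context text ‘text’, the question, and the options ‘choice’, i.e. the four candidate idioms
- Feed these into an ‘AutoModelForMultipleChoice’ model for training and prediction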
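The steps above can be sketched as follows. This is only one plausible way to pair context and options, modeled on the usual SWAG preprocessing in which each example becomes one sentence pair per choice. The function name `build_choice_pairs` and the decision to fill the blank with each candidate are assumptions, not necessarily what the baseline script does.

```python
MASK = "[MASK]" * 4


def build_choice_pairs(content, question, choices):
    """Expand one example into parallel lists of first/second sentences.

    Each choice yields one pair: the full context as the first sentence and
    the question sentence with the blank filled by that choice as the second.
    """
    first_sentences = [content] * len(choices)
    second_sentences = [question.replace(MASK, choice) for choice in choices]
    return first_sentences, second_sentences


# Hypothetical example, invented for illustration.
content = "他每天練習。最後成績[MASK][MASK][MASK][MASK]。"
question = "最後成績[MASK][MASK][MASK][MASK]"
choices = ["名列前茅", "無所事事", "如火如荼", "緊鑼密鼓"]

first, second = build_choice_pairs(content, question, choices)
for a, b in zip(first, second):
    print(a, "|", b)
```

The two lists can then be tokenized as sentence pairs and regrouped into blocks of four, giving the (batch, num_choices, sequence length) layout that a multiple-choice model expects.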
Building the training and test sets
import ast
import re

import pandas as pd
from tqdm import tqdm

train = pd.read_csv('data/train.csv', sep='\t')
test = pd.read_csv('data/test.csv', sep='\t')
print(train)
print(test)


def process_text(text):
    # Collapse runs of spaces and trim both ends
    return re.sub(' +', ' ', text).strip()


def get_question(text):
    """
    Return the sentence that contains [MASK][MASK][MASK][MASK] as the question.
    :param text: full context text
    :return: the sentence with the blank, or the whole text if none is found
    """
    # Split on full-width and ASCII sentence-ending punctuation;
    # the capture group keeps the delimiters as separate items
    sentences = re.split(r'(。|!|!|\.|?|\?)', text)
    for sent in sentences:
        if '[MASK][MASK][MASK][MASK]' in sent:
            return sent
    return text


cols = [
    "Unnamed: 0",
    "video-id",
    "fold-ind",  # q_id
    "startphrase",
    "sent1",  # content
    "sent2",  # question
    "gold-source",
    "ending0", "ending1", "ending2", "ending3",  # choice
    "label"]

# ======================================================
# Build the training set
# ======================================================
res = []
for idx, row in tqdm(train.iterrows(), total=len(train)):
    q_id = f'train_{idx}'
    content = process_text(row['text'])
    question = get_question(content)
    # The candidate column holds a list literal; literal_eval parses it
    # without executing arbitrary code (unlike eval)
    modified_choices = ast.literal_eval(row['candidate'])
    label = modified_choices.index(row['label'])
    # Hard-coded column order for the SWAG format
    res.append(("",
                "",
                q_id,
                "",
                content,
                question,
                "",
                modified_choices[0],
                modified_choices[1],
                modified_choices[2],
                modified_choices[3],
                label))
df = pd.DataFrame(res, columns=cols)
Model training
Data collation function
from dataclasses import dataclass
from typing import Optional, Union

import torch
from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    Args:
        tokenizer (:class:`~transformers.PreTrainedTokenizer` or :class:`~transformers.PreTrainedTokenizerFast`):
            The tokenizer used for encoding the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # Pull the labels out so that only token-level fields get padded
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        # Flatten: one dict per (example, choice) pair
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
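The flatten, pad, and un-flatten steps in `__call__` can be traced with plain Python lists. The sketch below is a simplified stand-in: it pads with a hypothetical `pad_id` of 0 and uses nested lists instead of tensors, so it illustrates the reshaping logic only and does not replace `tokenizer.pad`.

```python
def collate_multiple_choice(features, pad_id=0):
    """Flatten choices, pad to the longest sequence, then regroup per example."""
    batch_size = len(features)
    num_choices = len(features[0]["input_ids"])

    # Flatten: one sequence per (example, choice) pair.
    flat = [seq for feature in features for seq in feature["input_ids"]]

    # Pad every sequence to the longest one in the batch.
    max_len = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]

    # Un-flatten back to shape (batch_size, num_choices, max_len).
    return [
        padded[i * num_choices:(i + 1) * num_choices]
        for i in range(batch_size)
    ]


# Two hypothetical examples with two choices each (token ids are made up).
features = [
    {"input_ids": [[1, 2, 3], [1, 2]]},
    {"input_ids": [[4], [5, 6, 7, 8]]},
]
batch = collate_multiple_choice(features)
print(batch)
```

Padding after flattening means all choices of all examples in a batch share one length, which is what allows the final reshape into a single rectangular array.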
Training the model
import os

import numpy as np
from transformers import Trainer

# model, tokenizer, training_args, model_args, tokenized_datasets, data_collator,
# last_checkpoint and logger are assumed to be defined earlier in the script.


# Metric
def compute_metrics(eval_predictions):
    predictions, label_ids = eval_predictions
    # Pick the highest-scoring choice for each example
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == label_ids).astype(np.float32).mean().item()}


# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"] if training_args.do_train else None,
    eval_dataset=tokenized_datasets["validation"] if training_args.do_eval else None,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Training
if training_args.do_train:
    if last_checkpoint is not None:
        checkpoint = last_checkpoint
    elif os.path.isdir(model_args.model_name_or_path):
        checkpoint = model_args.model_name_or_path
    else:
        checkpoint = None
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()  # Saves the tokenizer too for easy upload

    output_train_file = os.path.join(training_args.output_dir, "train_results.txt")
    if trainer.is_world_process_zero():
        with open(output_train_file, "w") as writer:
            logger.info("***** Train results *****")
            for key, value in sorted(train_result.metrics.items()):
                logger.info(f"  {key} = {value}")
                writer.write(f"{key} = {value}\n")

        # Need to save the state, since Trainer.save_model saves only the tokenizer with the model
        trainer.state.save_to_json(os.path.join(training_args.output_dir, "trainer_state.json"))
Model hyperparameters
The pretrained model is hfl/chinese-xlnet-base; training takes roughly one hour.
#!/bin/bash
python -u baseline.py \
    --model_name_or_path 'hfl/chinese-xlnet-base' \
    --do_train \
    --do_eval \
    --do_predict \
    --logging_steps=100 \
    --max_seq_length 200 \
    --train_file data/new_train.csv \
    --validation_file data/new_valid.csv \
    --test_file data/new_test.csv \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --output_dir 'models/xlnet' \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --overwrite_output_dir
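With these flags, gradients are accumulated over 4 steps of 16 examples, so each optimizer update sees 64 examples per device. The sketch below works through that arithmetic. The training-set size of 45,000 is an assumed figure for illustration (the actual train/validation split is not stated), as is the use of a single GPU, and the Trainer's own step count may differ slightly because of rounding.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_devices=1, epochs=1):
    """Return (effective batch size, approximate total optimizer updates)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    # One optimizer update happens every `grad_accum` forward/backward passes.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return effective_batch, updates_per_epoch * epochs


effective, total_updates = optimizer_steps(
    num_examples=45_000,   # assumed size, for illustration only
    per_device_batch=16,   # --per_device_train_batch_size
    grad_accum=4,          # --gradient_accumulation_steps
    epochs=2,              # --num_train_epochs
)
print(effective, total_updates)
```

Keeping the effective batch size in mind helps when tuning: halving the per-device batch size to fit memory while doubling the accumulation steps leaves the number of examples per update unchanged.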
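VIII. Ideas for Improvement
- Hyperparameter tuning: learning rate, maximum sequence length, batch size
- Cross-validation and multi-seed ensembling
- Ensembling models by voting
- Trying several different pretrained models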
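As one illustration of the ensembling ideas above, the sketch below combines predictions from several models (or several seeds or folds) by majority vote, and alternatively by averaging per-choice scores. The score values are made up, and the tie-breaking rule in the vote (the first label seen wins) is an arbitrary choice.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Pick, for each example, the label predicted by the most models."""
    voted = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the top label; ties go to the first one seen.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


def average_scores(scores_per_model):
    """Average per-choice scores across models, then take the argmax."""
    num_models = len(scores_per_model)
    fused = []
    for example_scores in zip(*scores_per_model):
        mean = [sum(col) / num_models for col in zip(*example_scores)]
        fused.append(max(range(len(mean)), key=mean.__getitem__))
    return fused


# Made-up outputs from three models on two examples with four choices each.
labels = [[0, 2], [0, 1], [3, 1]]
scores = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1]],
    [[0.6, 0.2, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]],
    [[0.2, 0.1, 0.1, 0.6], [0.2, 0.5, 0.2, 0.1]],
]
print(majority_vote(labels))
print(average_scores(scores))
```

Score averaging keeps more information than hard voting, since a confident model can outweigh two uncertain ones, but it only makes sense when the models' scores are on comparable scales.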