MindSpore-Powered Emotion Detection with BERT
Author: JeffDing Source: MindSpore Community Forum
01 Model Introduction
Bidirectional Encoder Representations from Transformers (BERT) is a new language model developed and released by Google at the end of 2018. It plays a crucial role in various natural language processing tasks, including question answering (QA), named entity recognition (NER), natural language inference (NLI), and text classification. The model is built on the transformer's encoder and features a bidirectional structure.
The primary innovation of the BERT model lies in its pre-training method, which employs the Masked Language Model (MLM) and Next Sentence Prediction (NSP) techniques to capture word-level and sentence-level representations, respectively.
When the MLM method is used to train BERT, 15% of the words in the corpus are randomly selected for masking. The masking operation covers three cases: 80% of the selected words are replaced with [MASK], 10% are replaced with a random word, and 10% remain unchanged.
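The 80/10/10 masking scheme above can be sketched in a few lines. This is a minimal, self-contained illustration (the function name, toy vocabulary, and sample sentence are invented for this example, not taken from MindNLP):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, vocab=None):
    """Toy sketch of BERT's MLM masking: select ~15% of tokens as
    prediction targets, then replace 80% of them with [MASK], 10%
    with a random vocabulary token, and leave 10% unchanged."""
    vocab = vocab or ["the", "cat", "sat", "mat", "dog"]
    masked = list(tokens)
    targets = {}  # position -> original token (what the model must predict)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"      # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: random token
            # else 10%: keep the original token
    return masked, targets

random.seed(0)
masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(masked, targets)
```

Only the positions recorded in `targets` contribute to the MLM loss; all other positions are left untouched.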
For QA and NLI tasks, NSP is added to help the model understand the relationship between two sentences. Compared with MLM, NSP is simpler: two sentences A and B are input to BERT, and half of the time B is the actual sentence that follows A in the corpus, while the other half of the time B is a random sentence. The model is trained to predict whether B is the next sentence of A.
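Constructing an NSP training pair can be sketched as follows. This is an illustrative toy (the function name and corpus are invented for this example; real pre-training pipelines work on document-segmented corpora):

```python
import random

def make_nsp_pair(sentences, idx):
    """Toy sketch of NSP pair construction: with probability 0.5,
    pair sentence idx with its true successor (label 1, "IsNext");
    otherwise with a random other sentence (label 0, "NotNext")."""
    if random.random() < 0.5:
        return sentences[idx], sentences[idx + 1], 1
    candidates = [s for j, s in enumerate(sentences) if j != idx + 1]
    return sentences[idx], random.choice(candidates), 0

random.seed(0)
corpus = ["I went to the store.", "I bought milk.", "It rained.", "The cat slept."]
a, b, label = make_nsp_pair(corpus, 0)
print(a, "|", b, "| label:", label)
```

The label feeds a binary classifier on top of the [CLS] representation during pre-training.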
After BERT is pre-trained, its embedding table and 12-layer transformer weights (BERT-BASE) or 24-layer transformer weights (BERT-LARGE) are saved. The pre-trained BERT model can be used to fine-tune downstream tasks, such as text classification, similarity calculation, and reading comprehension.
Emotion Detection (EmoTect) focuses on identifying user emotion in a conversation scenario. For user text in the conversation scenario, an emotion type of the text is automatically determined, and a corresponding confidence level is provided. The emotion type is classified into positive, negative, and neutral. Conversational EmoTect is applicable to various scenarios, including chat and customer service. It helps enterprises better control the conversation, improve user interaction experience, analyze the customer service quality, and reduce manual inspection costs.
02 MindNLP Installation
pip install mindnlp
The following uses a text sentiment classification task as an example to describe how to use BERT.
import os

import mindspore
from mindspore.dataset import text, GeneratorDataset, transforms
from mindspore import nn, context

from mindnlp._legacy.engine import Trainer, Evaluator
from mindnlp._legacy.engine.callbacks import CheckpointCallback, BestModelCallback
from mindnlp._legacy.metrics import Accuracy

# prepare dataset
class SentimentDataset:
    """Sentiment Dataset"""

    def __init__(self, path):
        self.path = path
        self._labels, self._text_a = [], []
        self._load()

    def _load(self):
        with open(self.path, "r", encoding="utf-8") as f:
            dataset = f.read()
        lines = dataset.split("\n")
        for line in lines[1:-1]:
            label, text_a = line.split("\t")
            self._labels.append(int(label))
            self._text_a.append(text_a)

    def __getitem__(self, index):
        return self._labels[index], self._text_a[index]

    def __len__(self):
        return len(self._labels)
03 Dataset Preparation
A labeled chatbot dataset that has been tokenized is provided here. The data consists of two columns separated by tabs (\t). The first column is the emotion category (0 indicates negative, 1 indicates neutral, and 2 indicates positive). The second column is the Chinese text separated by spaces. The file is encoded in UTF-8 format. The following is an example:
label	text_a
0	谁骂人了?我从来不骂人,我骂的都不是人,你是人吗 ?
1	我有事等会儿就回来和你聊
2	我见到你很高兴谢谢你帮我
This part includes dataset reading, data format conversion, data tokenization, and padding.
wget https://baidu-nlp.bj.bcebos.com/emotion_detection-dataset-1.0.0.tar.gz -O emotion_detection.tar.gz
tar xvf emotion_detection.tar.gz
04 Data Loading and Preprocessing
Create a process_dataset function for data loading and preprocessing. The following is an example:
import numpy as np

def process_dataset(source, tokenizer, max_seq_len=64, batch_size=32, shuffle=True):
    is_ascend = mindspore.get_context('device_target') == 'Ascend'

    column_names = ["label", "text_a"]
    dataset = GeneratorDataset(source, column_names=column_names, shuffle=shuffle)
    # transforms
    type_cast_op = transforms.TypeCast(mindspore.int32)

    def tokenize_and_pad(text):
        if is_ascend:
            tokenized = tokenizer(text, padding='max_length', truncation=True,
                                  max_length=max_seq_len)
        else:
            tokenized = tokenizer(text)
        return tokenized['input_ids'], tokenized['attention_mask']

    # map dataset
    dataset = dataset.map(operations=tokenize_and_pad, input_columns="text_a",
                          output_columns=['input_ids', 'attention_mask'])
    dataset = dataset.map(operations=[type_cast_op], input_columns="label",
                          output_columns='labels')
    # batch dataset
    if is_ascend:
        dataset = dataset.batch(batch_size)
    else:
        dataset = dataset.padded_batch(batch_size,
                                       pad_info={'input_ids': (None, tokenizer.pad_token_id),
                                                 'attention_mask': (None, 0)})
    return dataset
Because the Ascend backend requires static shapes, sequences are padded to the fixed max_seq_len during preprocessing on Ascend; on other backends, each batch is padded dynamically to the length of its longest sequence.
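The difference between the two padding strategies can be illustrated with plain Python lists. This is a minimal sketch (the function name and sample token IDs are invented for this example; the real work is done by the tokenizer and padded_batch above):

```python
def pad_batch(sequences, pad_id=0, max_len=None):
    """Illustrates the two strategies in process_dataset: with max_len
    set, every sequence is padded/truncated to a fixed length (static
    shapes, as required on Ascend); without it, the batch is padded
    only to its own longest sequence (dynamic shapes)."""
    target = max_len if max_len is not None else max(len(s) for s in sequences)
    return [s[:target] + [pad_id] * (target - len(s)) for s in sequences]

batch = [[101, 2769, 102], [101, 872, 1962, 1435, 102]]
print(pad_batch(batch))             # dynamic: both padded to length 5
print(pad_batch(batch, max_len=8))  # static: both padded to length 8
```

Static padding wastes some compute on padding tokens but gives every batch an identical shape, which avoids recompilation on graph-mode backends.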
from mindnlp.transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
tokenizer.pad_token_id
dataset_train = process_dataset(SentimentDataset("data/train.tsv"), tokenizer)
dataset_val = process_dataset(SentimentDataset("data/dev.tsv"), tokenizer)
dataset_test = process_dataset(SentimentDataset("data/test.tsv"), tokenizer, shuffle=False)
print(next(dataset_train.create_tuple_iterator()))
05 Model Build
Use BertForSequenceClassification to build a BERT model for sentiment classification, load pre-trained weights, and set hyperparameters for sentiment classification with three categories. Then, perform automatic mixed precision operations on the model to improve the training speed. Next, instantiate an optimizer and evaluation metrics, and set a policy for saving weights for model training. Finally, build a trainer and start model training.
from mindnlp.transformers import BertForSequenceClassification, BertModel
from mindnlp._legacy.amp import auto_mixed_precision
# set bert config and define parameters for training
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)
model = auto_mixed_precision(model, 'O1')
optimizer = nn.Adam(model.trainable_params(), learning_rate=2e-5)
metric = Accuracy()
# define callbacks to save checkpoints
ckpoint_cb = CheckpointCallback(save_path='checkpoint', ckpt_name='bert_emotect', epochs=1, keep_checkpoint_max=2)
best_model_cb = BestModelCallback(save_path='checkpoint', ckpt_name='bert_emotect_best', auto_load=True)
trainer = Trainer(network=model, train_dataset=dataset_train,
eval_dataset=dataset_val, metrics=metric,
epochs=5, optimizer=optimizer, callbacks=[ckpoint_cb, best_model_cb])
# start training
trainer.run(tgt_columns="labels")
06 Model Verification
Run the trained model on the test dataset to check how well it generalizes. The evaluation metric is accuracy.
evaluator = Evaluator(network=model, eval_dataset=dataset_test, metrics=metric)
evaluator.run(tgt_columns="labels")
07 Model Inference
Traverse the inference dataset, run the model on each sample, and print the predicted emotion alongside the ground-truth label.
from mindspore import Tensor

dataset_infer = SentimentDataset("data/infer.tsv")

def predict(text, label=None):
    label_map = {0: "negative", 1: "neutral", 2: "positive"}

    text_tokenized = Tensor([tokenizer(text).input_ids])
    logits = model(text_tokenized)
    predict_label = logits[0].asnumpy().argmax()
    info = f"inputs: '{text}', predict: '{label_map[predict_label]}'"
    if label is not None:
        info += f" , label: '{label_map[label]}'"
    print(info)

for label, text in dataset_infer:
    predict(text, label)
08 Custom Inference Dataset
Input inference data to test the generalization capability of the model.
predict("家人们咱就是说一整个无语住了 绝绝子叠buff")