MindSpore-Powered Emotion Detection with BERT
Author: JeffDing Source: MindSpore Community Forum
01 Model Introduction
Bidirectional Encoder Representations from Transformers (BERT) is a new language model developed and released by Google at the end of 2018. It plays a crucial role in various natural language processing tasks, including question answering (QA), named entity recognition (NER), natural language inference (NLI), and text classification. The model is built on the transformer's encoder and features a bidirectional structure.
The primary innovation of the BERT model lies in its pre-training method, which employs the Masked Language Model (MLM) and Next Sentence Prediction (NSP) techniques to capture word-level and sentence-level representations, respectively.
When the MLM method is used to train BERT, 15% of the words in the corpus are randomly selected for masking. The masking operation covers three cases: 80% of the selected words are replaced with [MASK], 10% are replaced with a random word, and 10% remain unchanged.
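The 80/10/10 masking scheme above can be sketched in a few lines. This is a minimal, self-contained illustration (the function name, toy vocabulary, and sample sentence are invented for this example, not taken from MindNLP):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, vocab=None):
    """Toy sketch of BERT's MLM masking: select ~15% of tokens as
    prediction targets, then replace 80% of them with [MASK], 10%
    with a random vocabulary token, and leave 10% unchanged."""
    vocab = vocab or ["the", "cat", "sat", "mat", "dog"]
    masked = list(tokens)
    targets = {}  # position -> original token (what the model must predict)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"      # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: random token
            # else 10%: keep the original token
    return masked, targets

random.seed(0)
masked, targets = mask_tokens("the quick brown fox jumps over the lazy dog".split())
print(masked, targets)
```

Only the positions recorded in `targets` contribute to the MLM loss; all other positions are left untouched.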
For QA and NLI tasks, NSP is added to help the model understand the relationship between two sentences. Compared with MLM, NSP is simpler: two sentences A and B are input to BERT, and half of the time B is the actual sentence that follows A in the corpus, while the other half of the time B is a random sentence. The model is trained to predict whether B is the next sentence of A.
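Constructing an NSP training pair can be sketched as follows. This is an illustrative toy (the function name and corpus are invented for this example; real pre-training pipelines work on document-segmented corpora):

```python
import random

def make_nsp_pair(sentences, idx):
    """Toy sketch of NSP pair construction: with probability 0.5,
    pair sentence idx with its true successor (label 1, "IsNext");
    otherwise with a random other sentence (label 0, "NotNext")."""
    if random.random() < 0.5:
        return sentences[idx], sentences[idx + 1], 1
    candidates = [s for j, s in enumerate(sentences) if j != idx + 1]
    return sentences[idx], random.choice(candidates), 0

random.seed(0)
corpus = ["I went to the store.", "I bought milk.", "It rained.", "The cat slept."]
a, b, label = make_nsp_pair(corpus, 0)
print(a, "|", b, "| label:", label)
```

The label feeds a binary classifier on top of the [CLS] representation during pre-training.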
After BERT is pre-trained, its embedding table and 12-layer transformer weights (BERT-BASE) or 24-layer transformer weights (BERT-LARGE) are saved. The pre-trained BERT model can be used to fine-tune downstream tasks, such as text classification, similarity calculation, and reading comprehension.
Emotion Detection (EmoTect) focuses on identifying user emotion in a conversation scenario. For user text in the conversation scenario, an emotion type of the text is automatically determined, and a corresponding confidence level is provided. The emotion type is classified into positive, negative, and neutral. Conversational EmoTect is applicable to various scenarios, including chat and customer service. It helps enterprises better control the conversation, improve user interaction experience, analyze the customer service quality, and reduce manual inspection costs.
02 MindNLP Installation
pip install mindnlp
The following uses a text sentiment classification task as an example to describe how to use BERT.
import os

import mindspore
from mindspore.dataset import text, GeneratorDataset, transforms
from mindspore import nn, context

from mindnlp._legacy.engine import Trainer, Evaluator
from mindnlp._legacy.engine.callbacks import CheckpointCallback, BestModelCallback
from mindnlp._legacy.metrics import Accuracy

# prepare dataset
class SentimentDataset:
    """Sentiment Dataset"""

    def __init__(self, path):
        self.path = path
        self._labels, self._text_a = [], []
        self._load()

    def _load(self):
        with open(self.path, "r", encoding="utf-8") as f:
            dataset = f.read()
        lines = dataset.split("\n")
        for line in lines[1:-1]:
            label, text_a = line.split("\t")
            self._labels.append(int(label))
            self._text_a.append(text_a)

    def __getitem__(self, index):
        return self._labels[index], self._text_a[index]

    def __len__(self):
        return len(self._labels)
03 Dataset Preparation
A labeled chatbot dataset that has been tokenized is provided here. The data consists of two columns separated by tabs (\t). The first column is the emotion category (0 indicates negative, 1 indicates neutral, and 2 indicates positive). The second column is the Chinese text separated by spaces. The file is encoded in UTF-8 format. The following is an example:
label	text_a
0	谁骂人了?我从来不骂人,我骂的都不是人,你是人吗 ?
1	我有事等会儿就回来和你聊
2	我见到你很高兴谢谢你帮我
This part includes dataset reading, data format conversion, data tokenization, and padding.
wget https://baidu-nlp.bj.bcebos.com/emotion_detection-dataset-1.0.0.tar.gz -O emotion_detection.tar.gz
tar xvf emotion_detection.tar.gz
04 Data Loading and Preprocessing
Create a process_dataset function for data loading and preprocessing. The following is an example:
import numpy as np

def process_dataset(source, tokenizer, max_seq_len=64, batch_size=32, shuffle=True):
    is_ascend = mindspore.get_context('device_target') == 'Ascend'

    column_names = ["label", "text_a"]
    dataset = GeneratorDataset(source, column_names=column_names, shuffle=shuffle)
    # transforms
    type_cast_op = transforms.TypeCast(mindspore.int32)

    def tokenize_and_pad(text):
        if is_ascend:
            tokenized = tokenizer(text, padding='max_length', truncation=True,
                                  max_length=max_seq_len)
        else:
            tokenized = tokenizer(text)
        return tokenized['input_ids'], tokenized['attention_mask']

    # map dataset
    dataset = dataset.map(operations=tokenize_and_pad, input_columns="text_a",
                          output_columns=['input_ids', 'attention_mask'])
    dataset = dataset.map(operations=[type_cast_op], input_columns="label",
                          output_columns='labels')
    # batch dataset
    if is_ascend:
        dataset = dataset.batch(batch_size)
    else:
        dataset = dataset.padded_batch(batch_size,
                                       pad_info={'input_ids': (None, tokenizer.pad_token_id),
                                                 'attention_mask': (None, 0)})
    return dataset
Because the Ascend backend requires static shapes, sequences are padded to the fixed max_seq_len during preprocessing on Ascend; on other backends, each batch is padded dynamically to the length of its longest sequence.
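The difference between the two padding strategies can be illustrated with plain Python lists. This is a minimal sketch (the function name and sample token IDs are invented for this example; the real work is done by the tokenizer and padded_batch above):

```python
def pad_batch(sequences, pad_id=0, max_len=None):
    """Illustrates the two strategies in process_dataset: with max_len
    set, every sequence is padded/truncated to a fixed length (static
    shapes, as required on Ascend); without it, the batch is padded
    only to its own longest sequence (dynamic shapes)."""
    target = max_len if max_len is not None else max(len(s) for s in sequences)
    return [s[:target] + [pad_id] * (target - len(s)) for s in sequences]

batch = [[101, 2769, 102], [101, 872, 1962, 1435, 102]]
print(pad_batch(batch))             # dynamic: both padded to length 5
print(pad_batch(batch, max_len=8))  # static: both padded to length 8
```

Static padding wastes some compute on padding tokens but gives every batch an identical shape, which avoids recompilation on graph-mode backends.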
from mindnlp.transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
tokenizer.pad_token_id
dataset_train = process_dataset(SentimentDataset("data/train.tsv"), tokenizer)
dataset_val = process_dataset(SentimentDataset("data/dev.tsv"), tokenizer)
dataset_test = process_dataset(SentimentDataset("data/test.tsv"), tokenizer, shuffle=False)
print(next(dataset_train.create_tuple_iterator()))
05 Model Build
Use BertForSequenceClassification to build a BERT model for sentiment classification, load pre-trained weights, and set hyperparameters for sentiment classification with three categories. Then, perform automatic mixed precision operations on the model to improve the training speed. Next, instantiate an optimizer and evaluation metrics, and set a policy for saving weights for model training. Finally, build a trainer and start model training.
from mindnlp.transformers import BertForSequenceClassification, BertModel
from mindnlp._legacy.amp import auto_mixed_precision
# set bert config and define parameters for training
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=3)
model = auto_mixed_precision(model, 'O1')
optimizer = nn.Adam(model.trainable_params(), learning_rate=2e-5)
metric = Accuracy()
# define callbacks to save checkpoints
ckpoint_cb = CheckpointCallback(save_path='checkpoint', ckpt_name='bert_emotect', epochs=1, keep_checkpoint_max=2)
best_model_cb = BestModelCallback(save_path='checkpoint', ckpt_name='bert_emotect_best', auto_load=True)
trainer = Trainer(network=model, train_dataset=dataset_train,
eval_dataset=dataset_val, metrics=metric,
epochs=5, optimizer=optimizer, callbacks=[ckpoint_cb, best_model_cb])
# start training
trainer.run(tgt_columns="labels")
06 Model Verification
Run the trained model on the test dataset to check how well it generalizes. The evaluation metric is accuracy.
evaluator = Evaluator(network=model, eval_dataset=dataset_test, metrics=metric)
evaluator.run(tgt_columns="labels")
07 Model Inference
Traverse the inference dataset, run the model on each sample, and print the predicted emotion alongside the ground-truth label.
from mindspore import Tensor

dataset_infer = SentimentDataset("data/infer.tsv")

def predict(text, label=None):
    label_map = {0: "negative", 1: "neutral", 2: "positive"}

    text_tokenized = Tensor([tokenizer(text).input_ids])
    logits = model(text_tokenized)
    predict_label = logits[0].asnumpy().argmax()
    info = f"inputs: '{text}', predict: '{label_map[predict_label]}'"
    if label is not None:
        info += f" , label: '{label_map[label]}'"
    print(info)

for label, text in dataset_infer:
    predict(text, label)
08 Custom Inference Dataset
Input inference data to test the generalization capability of the model.
predict("家人们咱就是说一整个无语住了 绝绝子叠buff")