{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Processing and Augmentation\n", "\n", "`Ascend` `GPU` `CPU` `Data Preparation`\n", "\n", "[![Run Online](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_modelarts.png)](https://authoring-modelarts-cnnorth4.huaweicloud.com/console/lab?share-url-b64=aHR0cHM6Ly9taW5kc3BvcmUtd2Vic2l0ZS5vYnMuY24tbm9ydGgtNC5teWh1YXdlaWNsb3VkLmNvbS9ub3RlYm9vay9tb2RlbGFydHMvcHJvZ3JhbW1pbmdfZ3VpZGUvbWluZHNwb3JlX3Rva2VuaXplci5pcHluYg==&imageid=65f636a0-56cf-49df-b941-7d2a07ba8c8c) [![Download Notebook](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_notebook.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_tokenizer.ipynb) [![Download Sample Code](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_download_code.png)](https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/r1.6/programming_guide/zh_cn/mindspore_tokenizer.py) [![View Source](https://gitee.com/mindspore/docs/raw/r1.6/resource/_static/logo_source.png)](https://gitee.com/mindspore/docs/blob/r1.6/docs/mindspore/programming_guide/source_zh_cn/tokenizer.ipynb)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "As the amount of available text data keeps growing, the need to preprocess it into clean data suitable for network training becomes more pressing. Text dataset preprocessing generally consists of two parts: dataset loading and data augmentation.\n", "\n", "Text data is usually loaded in one of the following ways (a minimal `TextFileDataset` sketch follows at the end of this overview):\n", "\n", "- Read the data through a text-oriented Dataset interface such as [CLUEDataset](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset/mindspore.dataset.CLUEDataset.html#mindspore.dataset.CLUEDataset) or [TextFileDataset](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset).\n", "- Convert the dataset into a standard format (such as MindRecord) and read it through the corresponding interface (such as [MindDataset](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset/mindspore.dataset.MindDataset.html#mindspore.dataset.MindDataset)).\n", "- Load the data through the GeneratorDataset interface, which accepts a user-defined dataset loading function; see the [Loading User-Defined Datasets](https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.6/dataset_loading.html#%E8%87%AA%E5%AE%9A%E4%B9%89%E6%95%B0%E6%8D%AE%E9%9B%86%E5%8A%A0%E8%BD%BD) section for usage.\n", "\n", "For text data augmentation, common operations include tokenization and vocabulary lookup:\n", "\n", "- After the text dataset is loaded, a tokenization step is usually required, that is, splitting each raw sentence into a sequence of basic tokens.\n", "- A vocabulary is then built to look up the id of each token; the ids contained in a sentence form the word vector that is passed to the network for training.\n", "\n", "The following sections introduce the tokenization and vocabulary lookup functions used during data augmentation. For instructions on the text processing APIs, see the [API documentation](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/mindspore.dataset.text.html)." ] },
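{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of the first loading approach (not one of the original examples), the cell below writes a tiny two-line corpus file and reads it back with `TextFileDataset`, where each line of the file becomes one sample in the `text` column. The file name `demo_corpus.txt` and its contents are illustrative assumptions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "# Write a tiny corpus file so the sketch is self-contained (hypothetical file name).\n", "with open(\"demo_corpus.txt\", \"w\", encoding=\"utf-8\") as f:\n", "    f.write(\"hello world\\n\")\n", "    f.write(\"welcome to mindspore\\n\")\n", "\n", "# Each line of the file is loaded as one sample in the \"text\" column.\n", "dataset = ds.TextFileDataset(\"demo_corpus.txt\", shuffle=False)\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))" ] },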
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Constructing and Using a Vocabulary\n", "\n", "A vocabulary provides the mapping between words and ids: through the vocabulary, an input word can be mapped to its id and, conversely, a word can be retrieved from its id.\n", "\n", "MindSpore provides several ways to construct a vocabulary (Vocab). The raw data can be taken from a dict, a file, a list, or a Dataset object through the corresponding interfaces: [from_dict](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab.from_dict), [from_file](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab.from_file), [from_list](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab.from_list) and [from_dataset](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab.from_dataset).\n", "\n", "Taking from_dict as an example, a Vocab is constructed as follows; the dict passed in contains several word-id pairs." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from mindspore.dataset import text\n", "\n", "vocab = text.Vocab.from_dict({\"home\": 3, \"behind\": 2, \"the\": 4, \"world\": 5, \"<unk>\": 6})" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Vocab provides methods to query between words and ids in both directions, namely [tokens_to_ids](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab.tokens_to_ids) and [ids_to_tokens](https://www.mindspore.cn/docs/api/zh-CN/r1.6/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab.ids_to_tokens), used as follows:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ids: [3, 5]\n", "tokens: ['behind', 'world']\n" ] } ], "source": [ "from mindspore.dataset import text\n", "\n", "vocab = text.Vocab.from_dict({\"home\": 3, \"behind\": 2, \"the\": 4, \"world\": 5, \"<unk>\": 6})\n", "\n", "ids = vocab.tokens_to_ids([\"home\", \"world\"])\n", "print(\"ids: \", ids)\n", "\n", "tokens = vocab.ids_to_tokens([2, 5])\n", "print(\"tokens: \", tokens)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In addition, Vocab is a required input to several tokenizers (such as WordpieceTokenizer). During tokenization, the words of a sentence that appear in the vocabulary are split out as separate tokens, and the corresponding token ids can then be obtained by looking them up in the vocabulary, as sketched below." ] },
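{ "cell_type": "markdown", "metadata": {}, "source": [ "The following is a minimal sketch of this lookup step inside a dataset pipeline (not taken from the original guide): a small vocabulary is built with `Vocab.from_list`, a sample sentence is split with `WhitespaceTokenizer`, and `text.Lookup` maps each token to its id. The sample sentence and the `<unk>` special token are illustrative assumptions." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "# Build a small vocabulary; \"<unk>\" is appended as the out-of-vocabulary token (assumption).\n", "vocab = text.Vocab.from_list([\"home\", \"behind\", \"the\", \"world\"], special_tokens=[\"<unk>\"], special_first=False)\n", "\n", "# Tokenize on whitespace, then map every token to its vocabulary id.\n", "dataset = ds.NumpySlicesDataset([\"behind the world\"], column_names=[\"text\"], shuffle=False)\n", "dataset = dataset.map(operations=text.WhitespaceTokenizer(), input_columns=[\"text\"])\n", "dataset = dataset.map(operations=text.Lookup(vocab, unknown_token=\"<unk>\"), input_columns=[\"text\"])\n", "\n", "for data in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(data['text'])" ] },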
{ "cell_type": "markdown", "source": [ "## MindSpore Tokenizers\n", "\n", "Tokenization is the process of re-combining a continuous character sequence into a sequence of tokens according to certain rules; tokenizing text sensibly helps with understanding its semantics.\n", "\n", "MindSpore provides tokenizers (Tokenizer) for a variety of purposes, which help users process text with high performance. Users can build their own vocabulary, split sentences into tokens with an appropriate tokenizer, and obtain the indices of those tokens in the vocabulary through lookup operations.\n", "\n", "The tokenizers currently provided by MindSpore are listed in the table below. Users can also implement custom tokenizers as needed.\n", "\n", "| Tokenizer | Description |\n", "| :-- | :-- |\n", "| BasicTokenizer | Tokenizes scalar text data according to specified rules. |\n", "| BertTokenizer | Tokenizer for processing BERT text data. |\n", "| JiebaTokenizer | Dictionary-based tokenizer for Chinese strings. |\n", "| RegexTokenizer | Tokenizes scalar text data according to a specified regular expression. |\n", "| SentencePieceTokenizer | Tokenizer based on the open-source SentencePiece toolkit. |\n", "| UnicodeCharTokenizer | Tokenizes scalar text data into Unicode characters. |\n", "| UnicodeScriptTokenizer | Tokenizes scalar text data at Unicode script boundaries (a brief sketch follows the WhitespaceTokenizer example below). |\n", "| WhitespaceTokenizer | Tokenizes scalar text data on whitespace. |\n", "| WordpieceTokenizer | Tokenizes scalar text data based on a word set. |\n", "\n", "The usage of several commonly used tokenizers is described below.\n", "\n", "### BertTokenizer\n", "\n", "`BertTokenizer` performs tokenization by calling `BasicTokenizer` and `WordpieceTokenizer`.\n", "\n", "The following example first builds a text dataset and a list of strings, then tokenizes the dataset with `BertTokenizer` and shows the text before and after tokenization." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 1, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "input_list = [\"床前明月光\", \"疑是地上霜\", \"举头望明月\", \"低头思故乡\", \"I am making small mistakes during working hours\",\n", "              \"😀嘿嘿😃哈哈😄大笑😁嘻嘻\", \"繁體字\"]\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "print(\"------------------------before tokenization----------------------------\")\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))\n", "\n", "vocab_list = [\n", "    \"床\", \"前\", \"明\", \"月\", \"光\", \"疑\", \"是\", \"地\", \"上\", \"霜\", \"举\", \"头\", \"望\", \"低\", \"思\", \"故\", \"乡\",\n", "    \"繁\", \"體\", \"字\", \"嘿\", \"哈\", \"大\", \"笑\", \"嘻\", \"i\", \"am\", \"mak\", \"make\", \"small\", \"mistake\",\n", "    \"##s\", \"during\", \"work\", \"##ing\", \"hour\", \"😀\", \"😃\", \"😄\", \"😁\", \"+\", \"/\", \"-\", \"=\", \"12\",\n", "    \"28\", \"40\", \"16\", \" \", \"I\", \"[CLS]\", \"[SEP]\", \"[UNK]\", \"[PAD]\", \"[MASK]\", \"[unused1]\", \"[unused10]\"]\n", "\n", "vocab = text.Vocab.from_list(vocab_list)\n", "tokenizer_op = text.BertTokenizer(vocab=vocab)\n", "dataset = dataset.map(operations=tokenizer_op)\n", "\n", "print(\"------------------------after tokenization-----------------------------\")\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------------------------before tokenization----------------------------\n", "床前明月光\n", "疑是地上霜\n", "举头望明月\n", "低头思故乡\n", "I am making small mistakes during working hours\n", "😀嘿嘿😃哈哈😄大笑😁嘻嘻\n", "繁體字\n", "------------------------after tokenization-----------------------------\n", "['床' '前' '明' '月' '光']\n", "['疑' '是' '地' '上' '霜']\n", "['举' '头' '望' '明' '月']\n", "['低' '头' '思' '故' '乡']\n", "['I' 'am' 'mak' '##ing' 'small' 'mistake' '##s' 'during' 'work' '##ing'\n", " 'hour' '##s']\n", "['😀' '嘿' '嘿' '😃' '哈' '哈' '😄' '大' '笑' '😁' '嘻' '嘻']\n", "['繁' '體' '字']\n" ] } ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "### JiebaTokenizer\n", "\n", "`JiebaTokenizer` performs Chinese tokenization based on jieba.\n", "\n", "The following sample code downloads the dictionary files `hmm_model.utf8` and `jieba.dict.utf8` and places them in the specified location." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "import os\n", "import requests\n", "\n", "requests.packages.urllib3.disable_warnings()\n", "\n", "def download_dataset(dataset_url, path):\n", "    filename = dataset_url.split(\"/\")[-1]\n", "    save_path = os.path.join(path, filename)\n", "    if os.path.exists(save_path):\n", "        return\n", "    if not os.path.exists(path):\n", "        os.makedirs(path)\n", "    res = requests.get(dataset_url, stream=True, verify=False)\n", "    with open(save_path, \"wb\") as f:\n", "        for chunk in res.iter_content(chunk_size=512):\n", "            if chunk:\n", "                f.write(chunk)\n", "    print(\"The {} file is downloaded and saved in the path {} after processing\".format(os.path.basename(dataset_url), path))\n", "\n", "download_dataset(\"https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/hmm_model.utf8\", \"./datasets/tokenizer/\")\n", "download_dataset(\"https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/jieba.dict.utf8\", \"./datasets/tokenizer/\")" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "The downloaded files are placed in the following directory structure:\n", "\n", "```text\n", "./datasets/tokenizer/\n", "├── hmm_model.utf8\n", "└── jieba.dict.utf8\n", "```" ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "The following example first builds a text dataset, then creates a `JiebaTokenizer` object with the HMM and MP dictionary files, tokenizes the dataset, and finally shows the text before and after tokenization." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 3, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "input_list = [\"今天天气太好了我们一起去外面玩吧\"]\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "print(\"------------------------before tokenization----------------------------\")\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))\n", "\n", "# files from open source repository https://github.com/yanyiwu/cppjieba/tree/master/dict\n", "HMM_FILE = \"./datasets/tokenizer/hmm_model.utf8\"\n", "MP_FILE = \"./datasets/tokenizer/jieba.dict.utf8\"\n", "jieba_op = text.JiebaTokenizer(HMM_FILE, MP_FILE)\n", "dataset = dataset.map(operations=jieba_op, input_columns=[\"text\"], num_parallel_workers=1)\n", "\n", "print(\"------------------------after tokenization-----------------------------\")\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------------------------before tokenization----------------------------\n", "今天天气太好了我们一起去外面玩吧\n", "------------------------after tokenization-----------------------------\n", "['今天天气' '太好了' '我们' '一起' '去' '外面' '玩吧']\n" ] } ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "### SentencePieceTokenizer\n", "\n", "`SentencePieceTokenizer` is based on [SentencePiece](https://github.com/google/sentencepiece), an open-source natural language processing toolkit.\n", "\n", "The following sample code downloads the text dataset file `botchan.txt` and places it in the specified location." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "download_dataset(\"https://obs.dualstack.cn-north-4.myhuaweicloud.com/mindspore-website/notebook/datasets/botchan.txt\", \"./datasets/tokenizer/\")" ], "outputs": [], "metadata": {} },
{ "cell_type": "markdown", "source": [ "The downloaded file is placed in the following directory structure:\n", "\n", "```text\n", "./datasets/tokenizer/\n", "└── botchan.txt\n", "```" ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "The following example first builds a text dataset, then builds a `vocab` object from the `vocab_file` file, tokenizes the dataset with `SentencePieceTokenizer`, and shows the text before and after tokenization." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 5, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType\n", "\n", "input_list = [\"I saw a girl with a telescope.\"]\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "print(\"------------------------before tokenization----------------------------\")\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))\n", "\n", "# file from MindSpore repository https://gitee.com/mindspore/mindspore/blob/r1.6/tests/ut/data/dataset/test_sentencepiece/botchan.txt\n", "vocab_file = \"./datasets/tokenizer/botchan.txt\"\n", "vocab = text.SentencePieceVocab.from_file([vocab_file], 5000, 0.9995, SentencePieceModel.UNIGRAM, {})\n", "tokenizer_op = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)\n", "dataset = dataset.map(operations=tokenizer_op)\n", "\n", "print(\"------------------------after tokenization-----------------------------\")\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------------------------before tokenization----------------------------\n", "I saw a girl with a telescope.\n", "------------------------after tokenization-----------------------------\n", "['▁I' '▁sa' 'w' '▁a' '▁girl' '▁with' '▁a' '▁te' 'les' 'co' 'pe' '.']\n" ] } ], "metadata": { "scrolled": true } },
{ "cell_type": "markdown", "source": [ "### UnicodeCharTokenizer\n", "\n", "`UnicodeCharTokenizer` tokenizes text into individual Unicode characters.\n", "\n", "The following example first builds a text dataset, then tokenizes it with `UnicodeCharTokenizer` and shows the text before and after tokenization." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 6, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "input_list = [\"Welcome to Beijing!\", \"北京欢迎您!\", \"我喜欢English!\"]\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "print(\"------------------------before tokenization----------------------------\")\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))\n", "\n", "tokenizer_op = text.UnicodeCharTokenizer()\n", "dataset = dataset.map(operations=tokenizer_op)\n", "\n", "print(\"------------------------after tokenization-----------------------------\")\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']).tolist())" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------------------------before tokenization----------------------------\n", "Welcome to Beijing!\n", "北京欢迎您!\n", "我喜欢English!\n", "------------------------after tokenization-----------------------------\n", "['W', 'e', 'l', 'c', 'o', 'm', 'e', ' ', 't', 'o', ' ', 'B', 'e', 'i', 'j', 'i', 'n', 'g', '!']\n", "['北', '京', '欢', '迎', '您', '!']\n", "['我', '喜', '欢', 'E', 'n', 'g', 'l', 'i', 's', 'h', '!']\n" ] } ], "metadata": {} },
{ "cell_type": "markdown", "source": [ "### WhitespaceTokenizer\n", "\n", "`WhitespaceTokenizer` tokenizes text on whitespace.\n", "\n", "The following example first builds a text dataset, then tokenizes it with `WhitespaceTokenizer` and shows the text before and after tokenization." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 7, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "input_list = [\"Welcome to Beijing!\", \"北京欢迎您!\", \"我喜欢English!\"]\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "print(\"------------------------before tokenization----------------------------\")\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))\n", "\n", "tokenizer_op = text.WhitespaceTokenizer()\n", "dataset = dataset.map(operations=tokenizer_op)\n", "\n", "print(\"------------------------after tokenization-----------------------------\")\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']).tolist())" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------------------------before tokenization----------------------------\n", "Welcome to Beijing!\n", "北京欢迎您!\n", "我喜欢English!\n", "------------------------after tokenization-----------------------------\n", "['Welcome', 'to', 'Beijing!']\n", "['北京欢迎您!']\n", "['我喜欢English!']\n" ] } ], "metadata": {} },
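{ "cell_type": "markdown", "source": [ "### UnicodeScriptTokenizer\n", "\n", "`UnicodeScriptTokenizer` (listed in the table above) tokenizes text at Unicode script boundaries, for example between Latin letters, Han characters, and punctuation. The following is a minimal sketch rather than one of the original examples; it reuses the sample sentences above and assumes the default `keep_whitespace=False` behavior, so whitespace is dropped from the output." ], "metadata": {} },
{ "cell_type": "code", "execution_count": null, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "input_list = [\"Welcome to Beijing!\", \"北京欢迎您!\", \"我喜欢English!\"]\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "# Split at Unicode script boundaries; whitespace is discarded unless keep_whitespace=True.\n", "tokenizer_op = text.UnicodeScriptTokenizer(keep_whitespace=False)\n", "dataset = dataset.map(operations=tokenizer_op)\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']).tolist())" ], "outputs": [], "metadata": {} },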
{ "cell_type": "markdown", "source": [ "### WordpieceTokenizer\n", "\n", "`WordpieceTokenizer` splits text according to a word set; a unit of the split can be a single word from the word set or a combination of several words.\n", "\n", "The following example first builds a text dataset, then builds a `vocab` object from a word list, tokenizes the dataset with `WordpieceTokenizer`, and shows the text before and after tokenization." ], "metadata": {} },
{ "cell_type": "code", "execution_count": 8, "source": [ "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "input_list = [\"my\", \"favorite\", \"book\", \"is\", \"love\", \"during\", \"the\", \"cholera\", \"era\", \"what\",\n", "              \"我\", \"最\", \"喜\", \"欢\", \"的\", \"书\", \"是\", \"霍\", \"乱\", \"时\", \"期\", \"的\", \"爱\", \"情\", \"您\"]\n", "vocab_english = [\"book\", \"cholera\", \"era\", \"favor\", \"##ite\", \"my\", \"is\", \"love\", \"dur\", \"##ing\", \"the\"]\n", "vocab_chinese = [\"我\", '最', '喜', '欢', '的', '书', '是', '霍', '乱', '时', '期', '爱', '情']\n", "\n", "dataset = ds.NumpySlicesDataset(input_list, column_names=[\"text\"], shuffle=False)\n", "\n", "print(\"------------------------before tokenization----------------------------\")\n", "\n", "for data in dataset.create_dict_iterator(output_numpy=True):\n", "    print(text.to_str(data['text']))\n", "\n", "vocab = text.Vocab.from_list(vocab_english+vocab_chinese)\n", "tokenizer_op = text.WordpieceTokenizer(vocab=vocab)\n", "dataset = dataset.map(operations=tokenizer_op)\n", "\n", "print(\"------------------------after tokenization-----------------------------\")\n", "\n", "for i in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):\n", "    print(text.to_str(i['text']))" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "------------------------before tokenization----------------------------\n", "my\n", "favorite\n", "book\n", "is\n", "love\n", "during\n", "the\n", "cholera\n", "era\n", "what\n", "我\n", "最\n", "喜\n", "欢\n", "的\n", "书\n", "是\n", "霍\n", "乱\n", "时\n", "期\n", "的\n", "爱\n", "情\n", "您\n", "------------------------after tokenization-----------------------------\n", "['my']\n", "['favor' '##ite']\n", "['book']\n", "['is']\n", "['love']\n", "['dur' '##ing']\n", "['the']\n", "['cholera']\n", "['era']\n", "['[UNK]']\n", "['我']\n", "['最']\n", "['喜']\n", "['欢']\n", "['的']\n", "['书']\n", "['是']\n", "['霍']\n", "['乱']\n", "['时']\n", "['期']\n", "['的']\n", "['爱']\n", "['情']\n", "['[UNK]']\n" ] } ], "metadata": {} }
], "metadata": { "kernelspec": { "display_name": "MindSpore", "language": "python", "name": "mindspore" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }