{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Illustration of text transforms\n", "\n", "[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/mindspore/blob/master/docs/api/api_python_en/samples/dataset/text_gallery.ipynb)\n", "\n", "This example illustrates the various transforms available in the [mindspore.dataset.text](https://www.mindspore.cn/docs/en/master/api_python/mindspore.dataset.transforms.html#module-mindspore.dataset.text) module.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Preparation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt (226 kB)\n", "\n", "file_sizes: 100%|████████████████████████████| 232k/232k [00:00<00:00, 2.21MB/s]\n", "Successfully downloaded file to ./bert-base-uncased-vocab.txt\n", "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt (9 kB)\n", "\n", "file_sizes: 100%|██████████████████████████| 9.06k/9.06k [00:00<00:00, 1.83MB/s]\n", "Successfully downloaded file to ./article.txt\n", "['text_gallery.ipynb', 'article.txt', 'bert-base-uncased-vocab.txt']\n" ] } ], "source": [ "import os\n", "from download import download\n", "\n", "import mindspore.dataset as ds\n", "import mindspore.dataset.text as text\n", "\n", "# Download opensource datasets\n", "# citation: https://www.kaggle.com/datasets/drknope/bertbaseuncasedvocab\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/bert-base-uncased-vocab.txt\"\n", "download(url, './bert-base-uncased-vocab.txt', replace=True)\n", "\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/article.txt\"\n", "download(url, './article.txt', replace=True)\n", "\n", "# Show the directory\n", "print(os.listdir())\n", "\n", "def call_op(op, input):\n", " print(op(input), flush=True)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Vocab\n", "\n", "The [mindspore.dataset.text.Vocab](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.Vocab.html#mindspore.dataset.text.Vocab) is used to save pairs of words and ids.\n", "It contains a map that maps each word(str) to an id(int) or reverse." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ids [18863, 18279]\n", "tokens ['##nology', 'crystalline']\n", "lookup: ids [18863 18279]\n" ] } ], "source": [ "# Load bert vocab\n", "vocab_file = open(\"bert-base-uncased-vocab.txt\")\n", "vocab_content = list(set(vocab_file.read().splitlines()))\n", "vocab = text.Vocab.from_list(vocab_content)\n", "\n", "# lookup tokens to ids\n", "ids = vocab.tokens_to_ids([\"good\", \"morning\"])\n", "print(\"ids\", ids)\n", "\n", "# lookup ids to tokens\n", "tokens = vocab.ids_to_tokens([128, 256])\n", "print(\"tokens\", tokens)\n", "\n", "# Use Lookup op to lookup index\n", "op = text.Lookup(vocab)\n", "ids = op([\"good\", \"morning\"])\n", "print(\"lookup: ids\", ids)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## AddToken\n", "\n", "The [mindspore.dataset.text.AddToken](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.AddToken.html#mindspore.dataset.text.AddToken) transform adds token to beginning or end of sequence.\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['TOKEN' 'a' 'b' 'c' 'd' 'e']\n", "['a' 'b' 'c' 'd' 'e' 'END']\n" ] } ], "source": [ "txt = [\"a\", \"b\", \"c\", \"d\", \"e\"]\n", "add_token_op = text.AddToken(token='TOKEN', begin=True)\n", "call_op(add_token_op, txt)\n", "\n", "add_token_op = text.AddToken(token='END', begin=False)\n", "call_op(add_token_op, txt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## SentencePieceTokenizer\n", "\n", "The [mindspore.dataset.text.SentencePieceTokenizer](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.SentencePieceTokenizer.html#mindspore.dataset.text.SentencePieceTokenizer) transform tokenizes scalar string into tokens by sentencepiece.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model (4.8 MB)\n", "\n", "file_sizes: 100%|██████████████████████████| 5.07M/5.07M [00:01<00:00, 2.93MB/s]\n", "Successfully downloaded file to ./sentencepiece.bpe.model\n", "['▁Today' '▁is' '▁Tuesday' '.']\n" ] } ], "source": [ "# Construct a SentencePieceVocab model\n", "url = \"https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/notebook/datasets/sentencepiece.bpe.model\"\n", "download(url, './sentencepiece.bpe.model', replace=True)\n", "sentence_piece_vocab_file = './sentencepiece.bpe.model'\n", "\n", "# Use the model to tokenize text\n", "tokenizer = text.SentencePieceTokenizer(sentence_piece_vocab_file, out_type=text.SPieceTokenizerOutType.STRING)\n", "txt = \"Today is Tuesday.\"\n", "call_op(tokenizer, txt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## WordpieceTokenizer\n", "\n", "The [mindspore.dataset.text.WordpieceTokenizer](https://mindspore.cn/docs/en/master/api_python/dataset_text/mindspore.dataset.text.WordpieceTokenizer.html#mindspore.dataset.text.WordpieceTokenizer) transform tokenizes the input text to subword tokens.\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['token' '##izer' 'will' 'outputs' 'sub' '##words']\n" ] } ], "source": [ "# Reuse the vocab defined above as input vocab\n", "tokenizer = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]')\n", "txt = [\"tokenizer\", \"will\", \"outputs\", \"subwords\"]\n", "call_op(tokenizer, txt)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Process TXT File In Dataset Pipeline\n", "\n", "Use [mindspore.dataset.TextFileDataset](https://mindspore.cn/docs/en/master/api_python/dataset/mindspore.dataset.TextFileDataset.html#mindspore.dataset.TextFileDataset) to read content of text into dataset pipeline and the perform tokenization on text." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load text content into dataset pipeline\n", "text_file = \"article.txt\"\n", "dataset = ds.TextFileDataset(dataset_files=text_file, shuffle=False)\n", "\n", "# check the column names inside the dataset\n", "print(\"column names:\", dataset.get_col_names())\n", "\n", "# tokenize all text content into tokens with bert vocab\n", "dataset = dataset.map(text.BertTokenizer(vocab=vocab), input_columns=[\"text\"])\n", "\n", "for data in dataset:\n", " print(data)" ] } ], "metadata": { "kernelspec": { "display_name": "ly37", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" }, "vscode": { "interpreter": { "hash": "9f0efe8a0d8ccef1406a56130f5ab5480567fb275f7fbf51bbc40aede97503df" } } }, "nbformat": 4, "nbformat_minor": 0 }