mindspore.dataset.text

mindspore.dataset.text.transforms

The module text.transforms is inherited from _c_dataengine and is implemented based on ICU4C and cppjieba in C++. It is a high-performance module for processing NLP text. Users can use Vocab to build their own dictionary, use appropriate tokenizers to split sentences into different tokens, and use Lookup to find the ids of tokens in a Vocab.

Note

A constructor’s arguments for every class in this module must be saved into the class attributes (self.xxx) to support save() and load().

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> dataset_file = "path/to/text_file_path"
>>> # sentences as line data saved in a file
>>> dataset = ds.TextFileDataset(dataset_file, shuffle=False)
>>> # tokenize sentence to unicode characters
>>> tokenizer = text.UnicodeCharTokenizer()
>>> # load vocabulary from a list
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # lookup is an operation for mapping tokens to ids
>>> lookup = text.Lookup(vocab)
>>> dataset = dataset.map(operations=[tokenizer, lookup])
>>> for i in dataset.create_dict_iterator():
>>>     print(i)
>>> # if text line in dataset_file is:
>>> # 深圳欢迎您
>>> # then the output will be:
>>> # {'text': array([0, 1, 2, 3, 4], dtype=int32)}
class mindspore.dataset.text.transforms.BasicTokenizer(lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string by specific rules.

Parameters
  • lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations on the input text to fold the text to lower case and strip accent characters. If False, only apply the NormalizeUTF8(‘normalization_form’ mode) operation on the input text (default=False).

  • keep_whitespace (bool, optional) – If True, the whitespace will be kept in the output tokens (default=False).

  • normalization_form (NormalizeForm, optional) – Used to specify a specific normalize mode. This is only effective when ‘lower_case’ is False. See NormalizeUTF8 for details (default=NormalizeForm.NONE).

  • preserve_unused_token (bool, optional) – If True, do not split special tokens like ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’, ‘[MASK]’ (default=True).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
>>>                                   keep_whitespace=False,
>>>                                   normalization_form=NormalizeForm.NONE,
>>>                                   preserve_unused_token=True,
>>>                                   with_offsets=False)
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                   ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.BasicTokenizer(lower_case=False,
>>>                                   keep_whitespace=False,
>>>                                   normalization_form=NormalizeForm.NONE,
>>>                                   preserve_unused_token=True,
>>>                                   with_offsets=True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
class mindspore.dataset.text.transforms.BertTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True, with_offsets=False)[source]

Tokenizer used for Bert text process.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • suffix_indicator (str, optional) – Used to show that the subword is the last part of a word (default=’##’).

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – When a token cannot be found: if ‘unknown_token’ is empty string, return the token directly, else return ‘unknown_token’ (default=’[UNK]’).

  • lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8(NFD mode), RegexReplace operation on input text to fold the text to lower case and strip accented characters. If False, only apply NormalizeUTF8(‘normalization_form’ mode) operation on input text (default=False).

  • keep_whitespace (bool, optional) – If True, the whitespace will be kept in the output tokens (default=False).

  • normalization_form (NormalizeForm, optional) – Used to specify a specific normalize mode, only effective when ‘lower_case’ is False. See NormalizeUTF8 for details (default=NormalizeForm.NONE).

  • preserve_unused_token (bool, optional) – If True, do not split special tokens like ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’, ‘[MASK]’ (default=True).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
>>>                                  unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
>>>                                  normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
>>>                                  with_offsets=False)
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                   ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.BertTokenizer(vocab=vocab, suffix_indicator='##', max_bytes_per_token=100,
>>>                                  unknown_token='[UNK]', lower_case=False, keep_whitespace=False,
>>>                                  normalization_form=NormalizeForm.NONE, preserve_unused_token=True,
>>>                                  with_offsets=True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
class mindspore.dataset.text.transforms.CaseFold[source]

Apply case fold operation on a UTF-8 string tensor.
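
Examples

A minimal usage sketch; the input column is assumed to contain UTF-8 strings, as in the other examples on this page:

>>> # "Welcome to Beijing!" would be folded to "welcome to beijing!"
>>> casefold_op = text.CaseFold()
>>> dataset = dataset.map(operations=casefold_op)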

class mindspore.dataset.text.transforms.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)[source]

Tokenize Chinese string into words based on dictionary.

Parameters
  • hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official website of cppjieba.

  • mode (JiebaMode, optional) –

    Valid values can be any of [JiebaMode.MP, JiebaMode.HMM, JiebaMode.MIX] (default=JiebaMode.MIX).

    • JiebaMode.MP, tokenize with MPSegment algorithm.

    • JiebaMode.HMM, tokenize with Hidden Markov Model Segment algorithm.

    • JiebaMode.MIX, tokenize with a mix of MPSegment and HMMSegment algorithm.

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = JiebaTokenizer(HMM_FILE, MP_FILE, mode=JiebaMode.MP, with_offsets=False)
>>> data = data.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = JiebaTokenizer(HMM_FILE, MP_FILE, mode=JiebaMode.MP, with_offsets=True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
add_dict(user_dict)[source]

Add user defined word to JiebaTokenizer’s dictionary.

Parameters

user_dict (Union[str, dict]) –

Dictionary to be added, either a file path or a Python dictionary. Python dictionary format: {word1: freq1, word2: freq2, …}. Jieba dictionary file format: word (required) and freq (optional), one entry per line, such as:

word1 freq1
word2
word3 freq3
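
Examples

A usage sketch; HMM_FILE and MP_FILE follow the JiebaTokenizer example above, and '/path/to/user_dict.txt' is a hypothetical file in the Jieba dictionary format shown here:

>>> jieba_op = text.JiebaTokenizer(HMM_FILE, MP_FILE, mode=JiebaMode.MP)
>>> # add words from a Python dictionary of {word: freq} pairs
>>> jieba_op.add_dict({"深圳欢迎您": 10})
>>> # or add words from a dictionary file
>>> jieba_op.add_dict("/path/to/user_dict.txt")
>>> data = data.map(operations=jieba_op, input_columns=["text"])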

add_word(word, freq=None)[source]

Add user defined word to JiebaTokenizer’s dictionary.

Parameters
  • word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better chance the word will be tokenized (default=None, use default frequency).
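
Examples

A usage sketch, reusing the HMM_FILE and MP_FILE placeholders from the JiebaTokenizer example above:

>>> jieba_op = text.JiebaTokenizer(HMM_FILE, MP_FILE, mode=JiebaMode.MP)
>>> # a higher freq makes it more likely the word is kept as a single token
>>> jieba_op.add_word("深圳欢迎您", freq=10)
>>> data = data.map(operations=jieba_op, input_columns=["text"])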

class mindspore.dataset.text.transforms.Lookup(vocab, unknown_token=None, data_type=mindspore.int32)[source]

Lookup operator that looks up a word to an id.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • unknown_token (str, optional) – Word to use for lookup when the word being looked up is out-of-vocabulary (OOV). If unknown_token itself is OOV, a runtime error will be thrown (default=None).

  • data_type (mindspore.dtype, optional) – The mindspore.dtype of the ids that the lookup maps strings to (default=mstype.int32).
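
Examples

A minimal sketch; the actual ids depend on how the vocabulary was built:

>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # out-of-vocabulary tokens are mapped to the id of unknown_token
>>> lookup_op = text.Lookup(vocab, unknown_token='您')
>>> data = data.map(operations=lookup_op)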

class mindspore.dataset.text.transforms.Ngram(n, left_pad=('', 0), right_pad=('', 0), separator=' ')[source]

TensorOp to generate n-gram from a 1-D string Tensor.

Refer to https://en.wikipedia.org/wiki/N-gram#Examples for an overview of what n-gram is and how it works.

Parameters
  • n (list[int]) – n in n-gram, where n >= 1. n is a list of positive integers. For example, if n=[4, 3], the result would be a 4-gram followed by a 3-gram in the same tensor. If the number of words is not enough to make up an n-gram, an empty string will be returned. For example, 3-grams on ["mindspore", "best"] will produce an empty string.

  • left_pad (tuple, optional) – ("pad_token", pad_width). Padding performed on the left side of the sequence. pad_width will be capped at n-1. For example, left_pad=("_", 2) would pad the left side of the sequence with "__" (default=('', 0), no padding).

  • right_pad (tuple, optional) – ("pad_token", pad_width). Padding performed on the right side of the sequence. pad_width will be capped at n-1. For example, right_pad=("-", 2) would pad the right side of the sequence with "--" (default=('', 0), no padding).

  • separator (str, optional) – Symbol used to join strings together. For example, if 2-grams of ["mindspore", "amazing"] are generated with separator="-", the result would be ["mindspore-amazing"] (default=' ', a space is used to join the words).
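
Examples

A small sketch of 2-gram generation; the example strings are illustrative only:

>>> # pad both sides with "_" and join each 2-gram with a space
>>> ngram_op = text.Ngram([2], left_pad=("_", 1), right_pad=("_", 1), separator=" ")
>>> # e.g. ["WildRose", "Country"] would produce
>>> # ["_ WildRose", "WildRose Country", "Country _"]
>>> data = data.map(operations=ngram_op)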

class mindspore.dataset.text.transforms.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)[source]

Apply a normalize operation on a UTF-8 string tensor.

Parameters

normalize_form (NormalizeForm, optional) –

Valid values can be any of [NormalizeForm.NONE, NormalizeForm.NFC, NormalizeForm.NFKC, NormalizeForm.NFD, NormalizeForm.NFKD] (default=NormalizeForm.NFKC). See http://unicode.org/reports/tr15/ for details.

  • NormalizeForm.NONE, do nothing for input string tensor.

  • NormalizeForm.NFC, normalize with Normalization Form C.

  • NormalizeForm.NFKC, normalize with Normalization Form KC.

  • NormalizeForm.NFD, normalize with Normalization Form D.

  • NormalizeForm.NFKD, normalize with Normalization Form KD.
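
Examples

A minimal sketch; NormalizeForm is assumed to be imported from mindspore.dataset.text, as in the tokenizer examples above:

>>> # normalize strings to Unicode Normalization Form KC
>>> normalize_op = text.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)
>>> data = data.map(operations=normalize_op)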

class mindspore.dataset.text.transforms.PythonTokenizer(tokenizer)[source]

Callable class to be used for user-defined string tokenizer.

Parameters

tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.

Examples

>>> def my_tokenizer(line):
>>>     return line.split()
>>> data = data.map(operations=PythonTokenizer(my_tokenizer))
class mindspore.dataset.text.transforms.RegexReplace(pattern, replace, replace_all=True)[source]

Replace the parts of a UTF-8 string tensor that match the regular expression ‘pattern’ with the string ‘replace’.

See http://userguide.icu-project.org/strings/regexp for supported regex patterns.

Parameters
  • pattern (str) – The regular expression pattern.

  • replace (str) – The string used to replace matched elements.

  • replace_all (bool, optional) – If False, only replace first matched element; if True, replace all matched elements (default=True).
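
Examples

A minimal sketch; the pattern and replacement string are illustrative only:

>>> # replace every run of digits with the token '<number>'
>>> replace_op = text.RegexReplace(pattern="[0-9]+", replace="<number>", replace_all=True)
>>> data = data.map(operations=replace_op)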

class mindspore.dataset.text.transforms.RegexTokenizer(delim_pattern, keep_delim_pattern='', with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string by a regex expression pattern.

See http://userguide.icu-project.org/strings/regexp for supported regex patterns.

Parameters
  • delim_pattern (str) – The pattern of regex delimiters. The original string will be split by matched elements.

  • keep_delim_pattern (str, optional) – The string matched by ‘delim_pattern’ can be kept as a token if it can be matched by ‘keep_delim_pattern’. The default value is an empty str (‘’) which means that delimiters will not be kept as an output token (default=’’).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.RegexTokenizer(delim_pattern, keep_delim_pattern, with_offsets=False)
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                   ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.RegexTokenizer(delim_pattern, keep_delim_pattern, with_offsets=True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
class mindspore.dataset.text.transforms.SentencePieceTokenizer(mode, out_type)[source]

Tokenize a scalar token or 1-D tokens into tokens with SentencePiece.

Parameters
  • mode (Union[str, SentencePieceVocab]) – The SentencePiece model to use: either the path of a SentencePiece model file (str) or a SentencePieceVocab object.

  • out_type (Union[str, int]) – The type of output.
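
Examples

A sketch of tokenizing with a vocab built by SentencePieceVocab (documented under mindspore.dataset.text.utils below); SentencePieceModel and SPieceTokenizerOutType are assumed to be importable from mindspore.dataset.text, and the file path is a placeholder:

>>> vocab = text.SentencePieceVocab.from_file(["/path/to/corpus/file"], 5000, 0.9995,
>>>                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer_op = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> data = data.map(operations=tokenizer_op)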

class mindspore.dataset.text.transforms.SlidingWindow(width, axis=0)[source]

TensorOp to construct a tensor from data (only 1-D for now), where each element in the dimension axis is a slice of data starting at the corresponding position, with a specified width.

Parameters
  • width (int) – The width of the window. It must be an integer and greater than zero.

  • axis (int, optional) – The axis along which the sliding window is computed (default=0).

Examples

>>> # Data before
>>> # |    col1     |
>>> # +-------------+
>>> # | [1,2,3,4,5] |
>>> # +-------------+
>>> data = data.map(operations=SlidingWindow(3, 0))
>>> # Data after
>>> # |     col1    |
>>> # +-------------+
>>> # |  [[1,2,3],  |
>>> # |   [2,3,4],  |
>>> # |   [3,4,5]]  |
>>> # +-------------+
class mindspore.dataset.text.transforms.ToNumber(data_type)[source]

Tensor operation to convert every element of a string tensor to a number.

Strings are cast according to the rules specified in the following links, except that any string which represents a negative number cannot be cast to an unsigned integer type: https://en.cppreference.com/w/cpp/string/basic_string/stof, https://en.cppreference.com/w/cpp/string/basic_string/stoul.

Parameters

data_type (mindspore.dtype) – The mindspore.dtype to cast to. Must be a numeric type.

Raises

RuntimeError – If a string cannot be cast, or the result is out of range after being cast.
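
Examples

A minimal sketch; mstype refers to mindspore.common.dtype, imported under that alias:

>>> import mindspore.common.dtype as mstype
>>> # cast a string column such as ["1", "2", "3"] to int32
>>> to_number_op = text.ToNumber(mstype.int32)
>>> data = data.map(operations=to_number_op)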

class mindspore.dataset.text.transforms.TruncateSequencePair(max_length)[source]

Truncate a pair of rank-1 tensors such that the total length is less than max_length.

This operation takes two input tensors and returns two output Tensors.

Parameters

max_length (int) – Maximum length required.

Examples

>>> # Data before
>>> # |  col1   |  col2   |
>>> # +---------+---------|
>>> # | [1,2,3] | [4,5]   |
>>> # +---------+---------+
>>> data = data.map(operations=TruncateSequencePair(4))
>>> # Data after
>>> # |  col1   |  col2   |
>>> # +---------+---------+
>>> # | [1,2]   | [4,5]   |
>>> # +---------+---------+
class mindspore.dataset.text.transforms.UnicodeCharTokenizer(with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string to Unicode characters.

Parameters

with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeCharTokenizer()
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeCharTokenizer(True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
class mindspore.dataset.text.transforms.UnicodeScriptTokenizer(keep_whitespace=False, with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string on Unicode script boundaries.

Parameters
  • keep_whitespace (bool, optional) – Whether or not to emit whitespace tokens (default=False).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.UnicodeScriptTokenizer(keep_whitespace=True, with_offsets=False)
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                   ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.UnicodeScriptTokenizer(keep_whitespace=True, with_offsets=True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
class mindspore.dataset.text.transforms.WhitespaceTokenizer(with_offsets=False)[source]

Tokenize a scalar tensor of UTF-8 string on ICU4C defined whitespaces, such as: ‘ ‘, ‘\t’, ‘\r’, ‘\n’.

Parameters

with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.WhitespaceTokenizer()
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str],
>>> #                                                   ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WhitespaceTokenizer(True)
>>> data = data.map(operations=tokenizer_op, input_columns=["text"],
>>>                 output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])
class mindspore.dataset.text.transforms.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)[source]

Tokenize scalar token or 1-D tokens to 1-D subword tokens.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • suffix_indicator (str, optional) – Used to show that the subword is the last part of a word (default=’##’).

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – When a token cannot be found: if ‘unknown_token’ is empty string, return the token directly, else return ‘unknown_token’ (default=’[UNK]’).

  • with_offsets (bool, optional) – Whether or not to output the offsets of tokens (default=False).

Examples

>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
>>>                                       max_bytes_per_token=100, with_offsets=False)
>>> dataset = dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
>>>                                       max_bytes_per_token=100, with_offsets=True)
>>> data = data.map(operations=tokenizer_op,
>>>                 input_columns=["text"], output_columns=["token", "offsets_start", "offsets_limit"],
>>>                 column_order=["token", "offsets_start", "offsets_limit"])

mindspore.dataset.text.utils

The module text.utils provides some general methods for NLP text processing. For example, you can use Vocab to build a dictionary, use to_bytes and to_str to encode and decode strings into a specified format.

class mindspore.dataset.text.utils.SentencePieceVocab[source]

SentencePiece object that is used to segment words.

classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a dataset.

Parameters
  • dataset (Dataset) – Dataset to build sentencepiece.

  • col_names (list) – The list of column names.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

  • model_type (SentencePieceModel) – Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

  • params (dict) – A dictionary of optional parameters (the parameters are derived from the SentencePiece library; an empty dictionary can be passed).

Returns

SentencePiece, SentencePiece object from dataset.
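
Examples

A sketch under the assumption that the corpus is a text file with one sentence per line and that SentencePieceModel is importable from mindspore.dataset.text:

>>> dataset = ds.TextFileDataset("/path/to/corpus/file", shuffle=False)
>>> vocab = text.SentencePieceVocab.from_dataset(dataset, ["text"], 5000, 0.9995,
>>>                                              SentencePieceModel.UNIGRAM, {})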

classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a file containing a list of words.

Parameters
  • file_path (list) – A list of paths to the files which contain the sentencepiece word list.

  • vocab_size (int) – Vocabulary size, of type uint32_t.

  • character_coverage (float) – Amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.

  • model_type (SentencePieceModel) – Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

  • params (dict) –

    A dictionary of optional parameters (the parameters are derived from the SentencePiece library; an empty dictionary can be passed), for example:

    input_sentence_size 0
    max_sentencepiece_length 16
    

classmethod save_model(vocab, path, filename)[source]

Save the model to the given file path.

Parameters
  • vocab (SentencePieceVocab) – A sentencepiece object.

  • path (str) – Path to store model.

  • filename (str) – The name of the file.
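
Examples

A sketch that trains a vocab from a word-list file and then saves the model; all paths are placeholders and SentencePieceModel is assumed to be importable from mindspore.dataset.text:

>>> vocab = text.SentencePieceVocab.from_file(["/path/to/corpus/file"], 5000, 0.9995,
>>>                                           SentencePieceModel.UNIGRAM, {})
>>> # write the trained model to /path/to/output/dir/m.model
>>> text.SentencePieceVocab.save_model(vocab, "/path/to/output/dir", "m.model")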

class mindspore.dataset.text.utils.Vocab[source]

Vocab object that is used to look up a word.

It contains a map that maps each word (str) to an id (int).

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)[source]

Build a vocab from a dataset.

This would collect all unique words in the dataset and return a vocab within the frequency range specified by the user in freq_range. The user will be warned if no words fall into the frequency range. Words in the vocab are ordered from highest frequency to lowest frequency. Words with the same frequency are ordered lexicographically.

Parameters
  • dataset (Dataset) – dataset to build vocab from.

  • columns (list[str], optional) – Column names to get words from. It can be a list of column names (default=None, where all columns will be used; an error will be raised if any column is not of string type).

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1. max_frequency > total_words is the same as max_frequency = total_words. min_frequency and max_frequency can be None, which corresponds to 0 and total_words respectively (default=None, all words are included).

  • top_k (int, optional) – top_k > 0. Number of words to be built into the vocab. The top_k most frequent words are taken, after freq_range is applied. If there are fewer than top_k words, all words will be taken (default=None, all words are included).

  • special_tokens (list, optional) – A list of strings, each of which is a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens is prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended (default=True).

Returns

Vocab, Vocab object built from dataset.
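
Examples

A sketch, assuming a text file with one sentence per line that is first split into words so the vocab is built over tokens:

>>> dataset = ds.TextFileDataset("/path/to/text/file", shuffle=False)
>>> dataset = dataset.map(operations=text.WhitespaceTokenizer())
>>> # keep the 5000 most frequent words and prepend two special tokens
>>> vocab = text.Vocab.from_dataset(dataset, columns=["text"], freq_range=None, top_k=5000,
>>>                                 special_tokens=["<pad>", "<unk>"], special_first=True)
>>> dataset = dataset.map(operations=text.Lookup(vocab, "<unk>"))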

classmethod from_dict(word_dict)[source]

Build a vocab object from a dict.

Parameters

word_dict (dict) – A dict that contains word and id pairs, where word should be str and id should be int. It is recommended that ids start from 0 and be continuous. ValueError will be raised if an id is negative.
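
Examples

A minimal sketch with a hand-written word-to-id mapping:

>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
>>> lookup_op = text.Lookup(vocab, unknown_token="<unk>")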

classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)[source]

Build a vocab object from a file containing a list of words.

Parameters
  • file_path (str) – path to the file which contains the vocab list.

  • delimiter (str, optional) – A delimiter to break up each line in the file; the first element is taken to be the word (default='').

  • vocab_size (int, optional) – number of words to read from file_path (default=None, all words are taken).

  • special_tokens (list, optional) – A list of strings, each of which is a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens is prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended (default=True).
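
Examples

A sketch, assuming a hypothetical vocab file with one entry per line and a comma-separated extra field:

>>> # only the part before the delimiter is taken as the word
>>> vocab = text.Vocab.from_file("/path/to/vocab/file", delimiter=",",
>>>                              special_tokens=["<pad>", "<unk>"], special_first=True)
>>> lookup_op = text.Lookup(vocab, unknown_token="<unk>")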

classmethod from_list(word_list, special_tokens=None, special_first=True)[source]

Build a vocab object from a list of words.

Parameters
  • word_list (list) – A list of strings, where each element is a word.

  • special_tokens (list, optional) – A list of strings, each of which is a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – Whether special_tokens is prepended or appended to the vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended; otherwise they will be appended (default=True).
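
Examples

A minimal sketch showing special tokens; with special_first=True the special tokens take the first ids:

>>> vocab = text.Vocab.from_list(["deep", "learning", "is", "fun"],
>>>                              special_tokens=["<pad>", "<unk>"], special_first=True)
>>> lookup_op = text.Lookup(vocab, unknown_token="<unk>")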

mindspore.dataset.text.utils.to_bytes(array, encoding='utf8')[source]

Convert NumPy array of str to array of bytes by encoding each element based on charset encoding.

Parameters
  • array (numpy.ndarray) – Array of type str representing strings.

  • encoding (str) – The charset to use for encoding (default='utf8').

Returns

numpy.ndarray, NumPy array of bytes.
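
Examples

A minimal sketch:

>>> import numpy as np
>>> str_array = np.array(["深圳", "欢迎您"])
>>> # each str element is encoded into bytes with the given charset
>>> bytes_array = text.to_bytes(str_array, encoding='utf8')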

mindspore.dataset.text.utils.to_str(array, encoding='utf8')[source]

Convert NumPy array of bytes to array of str by decoding each element based on charset encoding.

Parameters
  • array (numpy.ndarray) – Array of type bytes representing strings.

  • encoding (str) – The charset to use for decoding (default='utf8').

Returns

numpy.ndarray, NumPy array of str.
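
Examples

A minimal sketch, assuming the dataset has a string column named "text" as in the earlier examples on this page:

>>> for item in dataset.create_dict_iterator():
>>>     # decode the bytes of the 'text' column back into Python str
>>>     decoded = text.to_str(item["text"], encoding='utf8')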