mindspore.dataset.text.transforms

The module text.transforms is inherited from _c_dataengine, which is implemented in C++ based on icu4c and cppjieba. It is a high-performance module for processing NLP text. Users can use Vocab to build their own dictionary, use appropriate tokenizers to split sentences into tokens, and use Lookup to find the indices of tokens in the Vocab.

Note

The constructor arguments of every class in this module must be saved as class attributes (self.xxx) to support save() and load().

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset_file = "path/to/text_file_path"
>>> # sentences as line data saved in a file
>>> dataset = ds.TextFileDataset(dataset_file, shuffle=False)
>>> # tokenize sentence to unicode characters
>>> tokenizer = text.UnicodeCharTokenizer()
>>> # load vocabulary from a list
>>> vocab = text.Vocab.from_list(['深', '圳', '欢', '迎', '您'])
>>> # lookup is an operation for mapping tokens to ids
>>> lookup = text.Lookup(vocab)
>>> dataset = dataset.map(operations=[tokenizer, lookup])
>>> for i in dataset.create_dict_iterator():
>>>     print(i)
>>> # if text line in dataset_file is:
>>> # 深圳欢迎您
>>> # then the output will be:
>>> # {'text': array([0, 1, 2, 3, 4], dtype=int32)}
class mindspore.dataset.text.transforms.BasicTokenizer(lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True)[source]

Tokenize a scalar tensor of UTF-8 string by specific rules.

Parameters
  • lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations on the input text to lower-case it and strip accent characters; if False, only apply the NormalizeUTF8 ('normalization_form' mode) operation on the input text (default=False).

  • keep_whitespace (bool, optional) – If True, whitespace will be kept in the output tokens (default=False).

  • normalization_form (NormalizeForm, optional) – Used to specify a normalization mode; only effective when 'lower_case' is False. See NormalizeUTF8 for details (default='NONE').

  • preserve_unused_token (bool, optional) – If True, do not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]' (default=True).
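
As a rough illustration of what lower_case=True does, here is a pure-Python, stdlib-only sketch of case folding plus NFD normalization and accent stripping. This is not the actual icu4c-backed implementation, and the function name is hypothetical:

```python
import unicodedata

def lower_and_strip_accents(text):
    # Approximate CaseFold + NormalizeUTF8(NFD) + RegexReplace:
    # fold case, decompose characters, then drop combining marks (accents).
    folded = text.casefold()
    decomposed = unicodedata.normalize("NFD", folded)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(lower_and_strip_accents("Héllo Wörld"))  # hello world
```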

class mindspore.dataset.text.transforms.BertTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', lower_case=False, keep_whitespace=False, normalization_form=NormalizeForm.NONE, preserve_unused_token=True)[source]

Tokenizer used for BERT text processing.

Parameters
  • vocab (Vocab) – A Vocab object.

  • suffix_indicator (str, optional) – Used to indicate that a subword is a non-initial part of a word (default='##').

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – Used when a token cannot be found: if 'unknown_token' is an empty string, return the token directly; otherwise return 'unknown_token' (default='[UNK]').

  • lower_case (bool, optional) – If True, apply CaseFold, NormalizeUTF8 (NFD mode) and RegexReplace operations on the input text to lower-case it and strip accent characters; if False, only apply the NormalizeUTF8 ('normalization_form' mode) operation on the input text (default=False).

  • keep_whitespace (bool, optional) – If True, whitespace will be kept in the output tokens (default=False).

  • normalization_form (NormalizeForm, optional) – Used to specify a normalization mode; only effective when 'lower_case' is False. See NormalizeUTF8 for details (default='NONE').

  • preserve_unused_token (bool, optional) – If True, do not split special tokens like '[CLS]', '[SEP]', '[UNK]', '[PAD]' and '[MASK]' (default=True).

class mindspore.dataset.text.transforms.CaseFold[source]

Apply a case fold operation on a UTF-8 string tensor.
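
Conceptually, case folding matches Python's built-in str.casefold; a minimal sketch (not the actual ICU-backed implementation):

```python
def case_fold(text):
    # Unicode case folding is more aggressive than simple lower-casing:
    # e.g. the German sharp s 'ß' folds to 'ss'.
    return text.casefold()

print(case_fold("Straße"))  # strasse
```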

class mindspore.dataset.text.transforms.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX)[source]

Tokenize a Chinese string into words based on a dictionary.

Parameters
  • hmm_path (str) – Path to the dictionary file used by the HMMSegment algorithm; the dictionary can be obtained from the official website of cppjieba.

  • mp_path (str) – Path to the dictionary file used by the MPSegment algorithm; the dictionary can be obtained from the official website of cppjieba.

  • mode (JiebaMode, optional) – "MP" mode will tokenize with the MPSegment algorithm, "HMM" mode will tokenize with the Hidden Markov Model Segment algorithm, and "MIX" mode will tokenize with a mix of the MPSegment and HMMSegment algorithms (default="MIX").

add_dict(user_dict)[source]

Add user defined word to JiebaTokenizer’s dictionary.

Parameters

user_dict (str or dict) –

Dictionary to be added, either a file path or a Python dictionary. Python dict format: {word1: freq1, word2: freq2, …}. Jieba dictionary file format: word (required) and freq (optional), one entry per line, such as:

word1 freq1
word2
word3 freq3
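
The file layout above can be parsed in a few lines of Python. This hypothetical helper only illustrates the "word [freq]" format and is not part of the MindSpore API:

```python
def parse_user_dict(lines):
    # Each non-empty line is "word" or "word freq"; freq is optional.
    entries = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        entries[parts[0]] = int(parts[1]) if len(parts) > 1 else None
    return entries

print(parse_user_dict(["word1 10", "word2"]))  # {'word1': 10, 'word2': None}
```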

add_word(word, freq=None)[source]

Add user defined word to JiebaTokenizer’s dictionary.

Parameters
  • word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.

  • freq (int, optional) – The frequency of the word to be added. The higher the frequency, the better the chance the word will be tokenized (default=None, use default frequency).

class mindspore.dataset.text.transforms.Lookup(vocab, unknown=None)[source]

Lookup operator that looks up a word and returns its id.

Parameters
  • vocab (Vocab) – a Vocab object.

  • unknown (int, optional) – Default id returned when looking up a word that is out of vocabulary. If no argument is passed, 1 will be used as the default id, which is the convention for the unknown token <unk>. Otherwise, the user is strongly encouraged to pass in the id for <unk> (default=None).
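
The semantics can be sketched in pure Python, with a plain dict standing in for Vocab (names hypothetical, not the actual implementation):

```python
def lookup(vocab, tokens, unknown=None):
    # Out-of-vocabulary tokens map to 'unknown'; when unknown is None,
    # fall back to 1, the documented convention for <unk>.
    default_id = 1 if unknown is None else unknown
    return [vocab.get(token, default_id) for token in tokens]

vocab = {"<pad>": 0, "<unk>": 1, "深": 2, "圳": 3}
print(lookup(vocab, ["深", "圳", "您"]))  # [2, 3, 1]
```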

class mindspore.dataset.text.transforms.Ngram(n, left_pad=None, right_pad=None, separator=None)[source]

TensorOp to generate n-grams from a 1-D string tensor.

Refer to https://en.wikipedia.org/wiki/N-gram#Examples for an overview of what n-gram is and how it works.

Parameters
  • n (list of int) – n in n-gram, n >= 1. n is a list of positive integers; for example, n=[4, 3] yields a 4-gram followed by a 3-gram in the same tensor. If the number of words is not enough to make up an n-gram, an empty string is returned; for example, 3-grams on ["mindspore", "best"] produce an empty string.

  • left_pad (tuple, optional) – ("pad_token", pad_width). Padding performed on the left side of the sequence. pad_width will be capped at n-1. left_pad=("_", 2) would pad the left side of the sequence with "__" (default=None).

  • right_pad (tuple, optional) – ("pad_token", pad_width). Padding performed on the right side of the sequence. pad_width will be capped at n-1. right_pad=("-", 2) would pad the right side of the sequence with "--" (default=None).

  • separator (str, optional) – Symbol used to join strings together; for example, 2-grams of ["mindspore", "amazing"] with separator="-" would give ["mindspore-amazing"] (default=None, which means whitespace is used).
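
A pure-Python sketch of the behavior described above (an illustration of the n-gram logic, not the actual TensorOp):

```python
def ngram(tokens, n, left_pad=None, right_pad=None, separator=" "):
    # left_pad / right_pad are ("pad_token", pad_width) tuples; pad_width is
    # capped at n - 1, as described above.
    if left_pad is not None:
        tokens = [left_pad[0]] * min(left_pad[1], n - 1) + list(tokens)
    if right_pad is not None:
        tokens = list(tokens) + [right_pad[0]] * min(right_pad[1], n - 1)
    if len(tokens) < n:
        return [""]  # not enough words to form a single n-gram
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngram(["mindspore", "amazing"], 2, separator="-"))  # ['mindspore-amazing']
```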

class mindspore.dataset.text.transforms.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)[source]

Apply a normalization operation to a UTF-8 string tensor.

Parameters

normalize_form (NormalizeForm, optional) – Valid values are "NONE", "NFC", "NFKC", "NFD" and "NFKD". If set to "NONE", nothing is done to the input string tensor; if set to any of "NFC", "NFKC", "NFD" or "NFKD", the corresponding normalization is applied (default="NFKC"). See http://unicode.org/reports/tr15/ for details.
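
The four normalization forms correspond to the standard Unicode forms, which Python's unicodedata also exposes; a stdlib sketch (not the actual implementation):

```python
import unicodedata

def normalize_utf8(text, normalize_form="NFKC"):
    # "NONE" leaves the input untouched; the other forms delegate to
    # Unicode normalization.
    if normalize_form == "NONE":
        return text
    return unicodedata.normalize(normalize_form, text)

print(normalize_utf8("ﬁle"))  # 'file' -- NFKC expands the 'ﬁ' ligature
```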

class mindspore.dataset.text.transforms.PythonTokenizer(tokenizer)[source]

Callable class to be used for a user-defined string tokenizer.

Parameters

tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.

Examples

>>> def my_tokenizer(line):
>>>     return line.split()
>>> data = data.map(operations=PythonTokenizer(my_tokenizer))
class mindspore.dataset.text.transforms.RegexReplace(pattern, replace, replace_all=True)[source]

Replace parts of a UTF-8 string tensor that match the regular expression 'pattern' with the string 'replace'.

See http://userguide.icu-project.org/strings/regexp for supported regex patterns.

Parameters
  • pattern (str) – The regex pattern to match.

  • replace (str) – The string to replace matched elements with.

  • replace_all (bool, optional) – If False, only replace the first matched element; if True, replace all matched elements (default=True).
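
In pure Python, re.sub gives similar behavior (note that Python's re syntax differs from the ICU regex syntax that this operator actually uses):

```python
import re

def regex_replace(text, pattern, replace, replace_all=True):
    # count=0 means "replace every match"; count=1 limits re.sub to the
    # first match, mirroring replace_all=False.
    return re.sub(pattern, replace, text, count=0 if replace_all else 1)

print(regex_replace("a1b2", r"\d", "#"))                     # a#b#
print(regex_replace("a1b2", r"\d", "#", replace_all=False))  # a#b2
```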

class mindspore.dataset.text.transforms.RegexTokenizer(delim_pattern, keep_delim_pattern='')[source]

Tokenize a scalar tensor of UTF-8 string by a regex pattern.

See http://userguide.icu-project.org/strings/regexp for supported regex patterns.

Parameters
  • delim_pattern (str) – The pattern of regex delimiters. The original string will be split by matched elements.

  • keep_delim_pattern (str, optional) – A string matched by 'delim_pattern' is kept as a token if it also matches 'keep_delim_pattern'. The default value is the empty string (''), in which case delimiters are not kept as output tokens (default='').
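
A pure-Python sketch of the split-with-optional-delimiter-keeping behavior (using Python's re rather than ICU regex; function name hypothetical):

```python
import re

def regex_tokenize(text, delim_pattern, keep_delim_pattern=""):
    # Split on delim_pattern; a matched delimiter is kept as a token only if
    # it also matches keep_delim_pattern.
    tokens, last = [], 0
    for m in re.finditer(delim_pattern, text):
        if m.start() > last:
            tokens.append(text[last:m.start()])
        if keep_delim_pattern and re.fullmatch(keep_delim_pattern, m.group()):
            tokens.append(m.group())
        last = m.end()
    if last < len(text):
        tokens.append(text[last:])
    return tokens

print(regex_tokenize("a,b c", r"[, ]"))   # ['a', 'b', 'c']
print(regex_tokenize("a,b", ",", ","))    # ['a', ',', 'b']
```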

class mindspore.dataset.text.transforms.ToNumber(data_type)[source]

Tensor operation to convert every element of a string tensor to a number.

Strings are cast according to the rules specified in the following links: https://en.cppreference.com/w/cpp/string/basic_string/stof, https://en.cppreference.com/w/cpp/string/basic_string/stoul, except that any string representing a negative number cannot be cast to an unsigned integer type.

Parameters

data_type (mindspore.dtype) – mindspore.dtype to be casted to. Must be a numeric type.

Raises

RuntimeError – If strings are invalid to cast, or are out of range after being cast.
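
The range checking can be sketched in pure Python for two representative integer types (an illustration of the rule, not the actual C++ casting code; the helper name is hypothetical):

```python
def to_number(strings, data_type="int32"):
    # Reject values outside the target type's range; a negative string can
    # never fit an unsigned type, matching the restriction noted above.
    ranges = {"int32": (-2**31, 2**31 - 1), "uint32": (0, 2**32 - 1)}
    low, high = ranges[data_type]
    result = []
    for s in strings:
        value = int(s)
        if not low <= value <= high:
            raise RuntimeError(f"{s!r} is out of range for {data_type}")
        result.append(value)
    return result

print(to_number(["1", "-2"]))  # [1, -2]
```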

class mindspore.dataset.text.transforms.TruncateSequencePair(max_length)[source]

Truncate a pair of rank-1 tensors such that their total length is no longer than max_length.

This operation takes two input tensors and returns two output tensors.

Parameters

max_length (int) – Maximum length required.

Examples

>>> # Data before
>>> # |  col1   |  col2   |
>>> # +---------+---------|
>>> # | [1,2,3] | [4,5]   |
>>> # +---------+---------+
>>> data = data.map(operations=TruncateSequencePair(4))
>>> # Data after
>>> # |  col1   |  col2   |
>>> # +---------+---------+
>>> # | [1,2]   | [4,5]   |
>>> # +---------+---------+
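
The data-before/data-after example above is consistent with a simple greedy rule; here is a pure-Python sketch (the exact tie-breaking is an assumption, not the verified C++ behavior):

```python
def truncate_sequence_pair(seq1, seq2, max_length):
    # Drop elements from the end of the longer sequence until the combined
    # length fits within max_length.
    seq1, seq2 = list(seq1), list(seq2)
    while len(seq1) + len(seq2) > max_length:
        if len(seq1) > len(seq2):
            seq1.pop()
        else:
            seq2.pop()
    return seq1, seq2

print(truncate_sequence_pair([1, 2, 3], [4, 5], 4))  # ([1, 2], [4, 5])
```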
class mindspore.dataset.text.transforms.UnicodeCharTokenizer[source]

Tokenize a scalar tensor of UTF-8 string to Unicode characters.

class mindspore.dataset.text.transforms.UnicodeScriptTokenizer(keep_whitespace=False)[source]

Tokenize a scalar tensor of UTF-8 string on Unicode script boundaries.

Parameters

keep_whitespace (bool, optional) – Whether or not to emit whitespace tokens (default=False).

class mindspore.dataset.text.transforms.WhitespaceTokenizer[source]

Tokenize a scalar tensor of UTF-8 string on ICU-defined whitespace characters (such as ' ', '\t', '\r', '\n').

class mindspore.dataset.text.transforms.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]')[source]

Tokenize scalar token or 1-D tokens to 1-D subword tokens.

Parameters
  • vocab (Vocab) – A Vocab object.

  • suffix_indicator (str, optional) – Used to indicate that a subword is a non-initial part of a word (default='##').

  • max_bytes_per_token (int, optional) – Tokens exceeding this length will not be further split (default=100).

  • unknown_token (str, optional) – Used when a token cannot be found: if 'unknown_token' is an empty string, return the token directly; otherwise return 'unknown_token' (default='[UNK]').
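
Wordpiece tokenization is commonly implemented as greedy longest-match-first; the sketch below illustrates that idea in pure Python, with a plain set standing in for Vocab (not the actual implementation):

```python
def wordpiece_tokenize(word, vocab, suffix_indicator="##", unknown_token="[UNK]"):
    # Greedy longest-match-first: at each position, take the longest prefix
    # found in the vocab; non-initial pieces carry the suffix indicator.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: fall back to the unknown token (or the raw
            # word when unknown_token is an empty string).
            return [word] if unknown_token == "" else [unknown_token]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```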