mindspore.dataset.text

This module supports text processing for NLP. It consists of two parts: text transforms and utilities. The text transforms form a high-performance NLP text processing module developed with ICU4C and cppjieba. The utilities provide general-purpose methods for NLP text processing.

The corresponding API examples commonly import the following modules:

import mindspore.dataset as ds
import mindspore.dataset.text as text

Descriptions of common data processing terms are as follows:

  • TensorOperation, the base class of all data processing operations implemented in C++.

  • TextTensorOperation, the base class of all text processing operations. It is a derived class of TensorOperation.

Transforms

API Name

Description

Note

mindspore.dataset.text.BasicTokenizer

Tokenize the input UTF-8 encoded string by specific rules.

BasicTokenizer is not supported on the Windows platform yet.

mindspore.dataset.text.BertTokenizer

Tokenizer used for BERT text processing.

BertTokenizer is not supported on the Windows platform yet.

mindspore.dataset.text.CaseFold

Apply case folding to a UTF-8 string tensor. Case folding is more aggressive than simple lowercasing and converts more characters to lowercase.

CaseFold is not supported on the Windows platform yet.

mindspore.dataset.text.FilterWikipediaXML

Filter Wikipedia XML dumps to "clean" text consisting only of lowercase letters (a-z, converted from A-Z), and spaces (never consecutive).

FilterWikipediaXML is not supported on the Windows platform yet.

mindspore.dataset.text.JiebaTokenizer

Tokenize a Chinese string into words based on a dictionary.

The integrity of the HMMSegment and MPSegment algorithm files must be verified.

mindspore.dataset.text.Lookup

Map a word to its id according to the input vocabulary table.

None

mindspore.dataset.text.Ngram

Generate n-grams from a 1-D string tensor.

None

mindspore.dataset.text.NormalizeUTF8

Apply Unicode normalization to a UTF-8 string tensor.

NormalizeUTF8 is not supported on the Windows platform yet.

mindspore.dataset.text.PythonTokenizer

Apply a user-defined string tokenizer to the input string.

None

mindspore.dataset.text.RegexReplace

Replace the parts of a UTF-8 string tensor that match a regular expression with the given text.

RegexReplace is not supported on the Windows platform yet.

mindspore.dataset.text.RegexTokenizer

Tokenize a scalar UTF-8 string tensor using a regular expression pattern.

RegexTokenizer is not supported on the Windows platform yet.

mindspore.dataset.text.SentencePieceTokenizer

Tokenize a scalar token or a 1-D tensor of tokens into tokens using SentencePiece.

None

mindspore.dataset.text.SlidingWindow

Construct a tensor from the given data (only 1-D data is supported for now), where each element along the given axis is a slice of the data that starts at the corresponding position and has the specified width.

None

mindspore.dataset.text.ToNumber

Tensor operation to convert every element of a string tensor to a number.

None

mindspore.dataset.text.ToVectors

Map a token to its vector according to the input vector table.

None

mindspore.dataset.text.TruncateSequencePair

Truncate a pair of rank-1 tensors such that their total length does not exceed max_length.

None

mindspore.dataset.text.UnicodeCharTokenizer

Tokenize a scalar UTF-8 string tensor into Unicode characters.

None

mindspore.dataset.text.UnicodeScriptTokenizer

Tokenize a scalar UTF-8 string tensor based on Unicode script boundaries.

UnicodeScriptTokenizer is not supported on the Windows platform yet.

mindspore.dataset.text.WhitespaceTokenizer

Tokenize a scalar UTF-8 string tensor on ICU4C-defined whitespace characters, such as ' ', '\t', '\r', and '\n'.

WhitespaceTokenizer is not supported on the Windows platform yet.

mindspore.dataset.text.WordpieceTokenizer

Tokenize the input text into subword tokens.

None
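Some of the transforms listed above have simple semantics that can be shown in plain Python. The following is a minimal sketch of what Ngram and SlidingWindow compute on 1-D input, based only on the descriptions in the table. It does not use MindSpore, the helper names `ngrams` and `sliding_window` are hypothetical, and options such as padding are ignored, so it should be read as an illustration rather than the library's implementation.

```python
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int, separator: str = " ") -> List[str]:
    """Join every run of n consecutive tokens with the separator."""
    if n <= 0:
        raise ValueError("n must be positive")
    # There are len(tokens) - n + 1 valid start positions (none if n is too large).
    return [separator.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sliding_window(data: Sequence[int], width: int) -> List[List[int]]:
    """Return one slice of length `width` for each valid start position."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [list(data[i:i + width]) for i in range(len(data) - width + 1)]


words = "the quick brown fox".split()
print(ngrams(words, 2))                    # ['the quick', 'quick brown', 'brown fox']
print(sliding_window([1, 2, 3, 4, 5], 3))  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Both helpers return an empty list when the input is shorter than the requested size. How the real operations behave in that case is not covered here.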

Utilities

API Name

Description

Note

mindspore.dataset.text.CharNGram

CharNGram object that is used to map tokens into pre-trained vectors.

None

mindspore.dataset.text.FastText

FastText object that is used to map tokens into vectors.

None

mindspore.dataset.text.GloVe

GloVe object that is used to map tokens into vectors.

None

mindspore.dataset.text.JiebaMode

An enumeration for JiebaTokenizer.

None

mindspore.dataset.text.NormalizeForm

Enumeration class for Unicode normalization forms.

None

mindspore.dataset.text.SentencePieceModel

An enumeration of the model types supported by SentencePiece.

None

mindspore.dataset.text.SentencePieceVocab

SentencePiece object that is used to perform word segmentation.

None

mindspore.dataset.text.SPieceTokenizerLoadType

An enumeration of the model loading types of SentencePieceTokenizer.

None

mindspore.dataset.text.SPieceTokenizerOutType

An enumeration of the output types of SentencePieceTokenizer.

None

mindspore.dataset.text.Vectors

Vectors object that is used to map tokens into vectors.

None

mindspore.dataset.text.Vocab

Vocab object that is used to save pairs of words and ids.

None

mindspore.dataset.text.to_bytes

Convert a NumPy array of str to an array of bytes by encoding each element with the given character encoding.

None

mindspore.dataset.text.to_str

Convert a NumPy array of bytes to an array of str by decoding each element with the given character encoding.

None
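To make the relationship between a vocabulary and an id lookup concrete, here is a minimal plain-Python sketch of a word-to-id table with an unknown-token fallback. It does not call MindSpore. The names `build_vocab` and `lookup_ids` and the `<unk>` token are hypothetical choices for illustration, and the real Vocab and Lookup classes may assign ids or handle out-of-vocabulary words differently.

```python
from typing import Dict, Iterable, List


def build_vocab(words: Iterable[str],
                special_tokens: Iterable[str] = ("<unk>",)) -> Dict[str, int]:
    """Assign consecutive ids, special tokens first, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for word in list(special_tokens) + list(words):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def lookup_ids(tokens: Iterable[str], vocab: Dict[str, int],
               unknown_token: str = "<unk>") -> List[int]:
    """Map each token to its id, falling back to the id of unknown_token."""
    unknown_id = vocab[unknown_token]  # raises KeyError if the fallback is missing
    return [vocab.get(token, unknown_id) for token in tokens]


vocab = build_vocab(["home", "behind", "the", "world", "home"])
print(vocab)  # {'<unk>': 0, 'home': 1, 'behind': 2, 'the': 3, 'world': 4}
print(lookup_ids(["the", "world", "ahead"], vocab))  # [3, 4, 0]
```

Reserving an id for unknown words keeps the lookup total: every input token maps to some id, which is usually what a downstream embedding layer expects.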