Class RegexTokenizer

Inheritance Relationships

Base Type

  • public mindspore::dataset::TensorTransform

Class Documentation

class RegexTokenizer : public mindspore::dataset::TensorTransform

Tokenizes a scalar tensor containing a UTF-8 string, splitting it on matches of a regular expression pattern.

Public Functions

inline explicit RegexTokenizer(std::string delim_pattern, std::string keep_delim_pattern = "", bool with_offsets = false)

Constructor.

Parameters
  • delim_pattern[in] The pattern of regex delimiters.

  • keep_delim_pattern[in] A string matched by ‘delim_pattern’ is kept as a token if it is also matched by ‘keep_delim_pattern’. The default is an empty string, which means delimiters are not kept as output tokens (default="").

  • with_offsets[in] Whether to output offsets of tokens (default=false).
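
The interaction of the three parameters above can be illustrated with a minimal standalone sketch. It does not use or reproduce the MindSpore implementation: it relies on std::regex from the C++ standard library, and the names Token and SketchRegexTokenize are hypothetical helpers introduced only for this illustration. It also assumes that a delimiter is kept only when the whole delimiter string matches ‘keep_delim_pattern’, and that offsets are byte positions; the real operator may differ on both points.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper type for this sketch; not part of MindSpore.
struct Token {
  std::string text;
  std::size_t start;  // byte offset of the first byte of the token
  std::size_t limit;  // byte offset one past the last byte of the token
};

// Hypothetical illustration of the documented behavior using std::regex.
// Text between delimiter matches becomes a token. A delimiter itself becomes
// a token only if keep_delim_pattern is non-empty and matches it entirely
// (full match is an assumption of this sketch).
std::vector<Token> SketchRegexTokenize(const std::string &input,
                                       const std::string &delim_pattern,
                                       const std::string &keep_delim_pattern = "") {
  std::vector<Token> tokens;
  const std::regex delim(delim_pattern);
  const bool keep_any = !keep_delim_pattern.empty();
  std::regex keep;  // default-constructed; only used when keep_any is true
  if (keep_any) {
    keep.assign(keep_delim_pattern);
  }

  std::size_t pos = 0;  // start of the text not yet emitted
  const std::sregex_iterator end;
  for (std::sregex_iterator it(input.begin(), input.end(), delim); it != end; ++it) {
    const std::size_t m_start = static_cast<std::size_t>(it->position());
    const std::size_t m_len = static_cast<std::size_t>(it->length());
    if (m_len == 0) {
      continue;  // ignore empty delimiter matches
    }
    if (m_start > pos) {
      // Emit the text that precedes this delimiter.
      tokens.push_back(Token{input.substr(pos, m_start - pos), pos, m_start});
    }
    const std::string matched = it->str();
    if (keep_any && std::regex_match(matched, keep)) {
      // Emit the delimiter itself as a token.
      tokens.push_back(Token{matched, m_start, m_start + m_len});
    }
    pos = m_start + m_len;
  }
  if (pos < input.size()) {
    // Emit any trailing text after the last delimiter.
    tokens.push_back(Token{input.substr(pos), pos, input.size()});
  }
  return tokens;
}
```

In this sketch, calling SketchRegexTokenize("Welcome to Shanghai!", "\\s+") yields three tokens, and passing "\\s+" as the third argument additionally yields the two space delimiters as tokens. The sketch always records offsets; in the documented class, the corresponding choices would be expressed through the constructor arguments, for example RegexTokenizer("\\s+", "\\s+", true).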

~RegexTokenizer() = default

Destructor.