Class RegexTokenizer
- Defined in File text.h 
Inheritance Relationships
Base Type
- public mindspore::dataset::TensorTransform(Class TensorTransform)
Class Documentation
- 
class RegexTokenizer : public mindspore::dataset::TensorTransform
- Tokenize a scalar tensor containing a UTF-8 string by splitting it on matches of a regular expression pattern.
Public Functions
- 
inline explicit RegexTokenizer(const std::string &delim_pattern, const std::string &keep_delim_pattern = "", bool with_offsets = false)
- Constructor.
Parameters
- delim_pattern – [in] The pattern of regex delimiters. 
- keep_delim_pattern – [in] A string matched by ‘delim_pattern’ is kept as a token if it is also matched by ‘keep_delim_pattern’. An empty string means that delimiters are not kept as output tokens (default="").
- with_offsets – [in] Whether to output offsets of tokens (default=false). 
 Example
- /* Define operations */
  auto regex_op = text::RegexTokenizer("\\s+", "\\s+", false);

  /* dataset is an instance of Dataset object */
  dataset = dataset->Map({regex_op},  // operations
                         {"text"});   // input columns
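The interaction between ‘delim_pattern’ and ‘keep_delim_pattern’ can be illustrated with a small standalone sketch. It is not the MindSpore implementation: it uses std::regex from the C++ standard library, works on bytes rather than Unicode code points, and assumes that a delimiter is kept only when the whole delimiter string matches ‘keep_delim_pattern’. The names Token and SketchRegexTokenize are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token record: the token text plus byte offsets [start, limit).
struct Token {
  std::string text;
  std::size_t start;
  std::size_t limit;
};

// Illustrative only, not MindSpore code. Splits `input` on matches of
// `delim_pattern`. A matched delimiter is emitted as a token only when
// `keep_delim_pattern` is non-empty and matches the whole delimiter
// (the full-match rule is an assumption of this sketch).
std::vector<Token> SketchRegexTokenize(const std::string &input,
                                       const std::string &delim_pattern,
                                       const std::string &keep_delim_pattern = "") {
  std::vector<Token> tokens;
  const std::regex delim(delim_pattern);
  const bool keep_any = !keep_delim_pattern.empty();
  // When no keep pattern is given, this regex is never consulted.
  const std::regex keep(keep_any ? keep_delim_pattern : delim_pattern);

  std::size_t pos = 0;  // start of the text not yet emitted
  for (std::sregex_iterator it(input.begin(), input.end(), delim), end; it != end; ++it) {
    const std::size_t m_start = static_cast<std::size_t>(it->position());
    const std::size_t m_len = static_cast<std::size_t>(it->length());
    if (m_len == 0) continue;  // ignore empty delimiter matches
    if (m_start > pos) {
      // Text between the previous delimiter and this one is a token.
      tokens.push_back({input.substr(pos, m_start - pos), pos, m_start});
    }
    if (keep_any && std::regex_match(it->str(), keep)) {
      // The delimiter itself is kept as a token.
      tokens.push_back({it->str(), m_start, m_start + m_len});
    }
    pos = m_start + m_len;
  }
  if (pos < input.size()) {
    // Trailing text after the last delimiter.
    tokens.push_back({input.substr(pos), pos, input.size()});
  }
  return tokens;
}
```

In this sketch, SketchRegexTokenize("a b", "\\s+", "\\s+") returns the three tokens "a", " " and "b", while SketchRegexTokenize("a b", "\\s+") returns only "a" and "b". The start and limit fields correspond loosely to what ‘with_offsets’ adds to the output, although the exact offset convention used by the library should be checked against its documentation.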
 
 - 
~RegexTokenizer() = default
- Destructor. 
 