Class BasicTokenizer
- Defined in File text.h 
Inheritance Relationships
Base Type
- public mindspore::dataset::TensorTransform (Class TensorTransform)
Class Documentation
- 
class BasicTokenizer : public mindspore::dataset::TensorTransform
- Tokenizes a scalar tensor holding a UTF-8 string according to the rules selected by the constructor parameters.
- Note
- BasicTokenizer is not yet supported on the Windows platform.
- Public Functions
- 
explicit BasicTokenizer(bool lower_case = false, bool keep_whitespace = false, NormalizeForm normalize_form = NormalizeForm::kNone, bool preserve_unused_token = true, bool with_offsets = false)
- Constructor.
- Example
- /* Define operations */
  auto tokenizer_op = text::BasicTokenizer();
  /* dataset is an instance of Dataset object */
  dataset = dataset->Map({tokenizer_op},  // operations
                         {"text"});       // input columns
 - Parameters
- lower_case – [in] If true, apply the CaseFold, NormalizeUTF8 (NFD mode), and RegexReplace operations to the input text to fold it to lower case and strip accent characters. If false, apply only the NormalizeUTF8 operation, in the mode given by ‘normalize_form’ (default=false). 
- keep_whitespace – [in] If true, the whitespace will be kept in output tokens (default=false). 
- normalize_form – [in] The Unicode normalization mode to apply. It takes effect only when ‘lower_case’ is false. See NormalizeUTF8 for details (default=NormalizeForm::kNone). 
- preserve_unused_token – [in] If true, do not split special tokens like ‘[CLS]’, ‘[SEP]’, ‘[UNK]’, ‘[PAD]’ and ‘[MASK]’ (default=true). 
- with_offsets – [in] Whether to output offsets of tokens (default=false). 
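
To make the effect of these parameters more concrete, here is a minimal, ASCII-only sketch of this style of tokenization. It is not MindSpore's implementation and does not call the MindSpore API: `Token`, `SketchOptions`, `MatchSpecialToken`, and `BasicTokenizeAscii` are hypothetical names introduced for illustration, Unicode normalization and accent stripping are left out, and the splitting rules used (split on whitespace, emit each ASCII punctuation character as its own token) are an assumption. Byte offsets are always recorded here, standing in for what ‘with_offsets’ would expose.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not part of MindSpore.
struct Token {
  std::string text;
  std::size_t start;  // byte offset of the first character
  std::size_t limit;  // byte offset one past the last character
};

struct SketchOptions {
  bool lower_case = false;
  bool keep_whitespace = false;
  bool preserve_unused_token = true;
};

// Returns the length of a special token starting at pos, or 0 if there is none.
inline std::size_t MatchSpecialToken(const std::string &s, std::size_t pos) {
  static const char *const kSpecial[] = {"[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"};
  for (const char *tok : kSpecial) {
    const std::string t(tok);
    if (s.compare(pos, t.size(), t) == 0) return t.size();
  }
  return 0;
}

// Simplified, ASCII-only tokenizer (assumed rules, see the text above).
inline std::vector<Token> BasicTokenizeAscii(const std::string &input,
                                             const SketchOptions &opt = SketchOptions()) {
  std::vector<Token> out;
  std::size_t i = 0;
  while (i < input.size()) {
    // Keep special tokens such as "[CLS]" in one piece when requested.
    if (opt.preserve_unused_token) {
      const std::size_t n = MatchSpecialToken(input, i);
      if (n > 0) {
        out.push_back({input.substr(i, n), i, i + n});
        i += n;
        continue;
      }
    }
    const unsigned char c = static_cast<unsigned char>(input[i]);
    if (std::isspace(c)) {
      // Whitespace is dropped unless keep_whitespace is set.
      if (opt.keep_whitespace) out.push_back({std::string(1, input[i]), i, i + 1});
      ++i;
      continue;
    }
    if (std::ispunct(c)) {
      // Each punctuation character becomes its own token.
      out.push_back({std::string(1, input[i]), i, i + 1});
      ++i;
      continue;
    }
    // Otherwise consume a run of non-space, non-punctuation characters.
    std::size_t j = i;
    std::string word;
    while (j < input.size()) {
      const unsigned char d = static_cast<unsigned char>(input[j]);
      if (std::isspace(d) || std::ispunct(d)) break;
      word.push_back(opt.lower_case ? static_cast<char>(std::tolower(d)) : input[j]);
      ++j;
    }
    out.push_back({word, i, j});
    i = j;
  }
  return out;
}
```

Under these assumptions, calling `BasicTokenizeAscii("[CLS] Hello, World!")` with default options yields five tokens, with `[CLS]` kept whole. Setting `preserve_unused_token` to false splits it into `[`, `CLS`, and `]`, and setting `keep_whitespace` to true adds one token per space.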
 
 
 - 
~BasicTokenizer() = default
- Destructor. 
 