Class WordpieceTokenizer
- Defined in File text.h 
Inheritance Relationships
Base Type
- public mindspore::dataset::TensorTransform (Class TensorTransform)
Class Documentation
class WordpieceTokenizer : public mindspore::dataset::TensorTransform
- Tokenize a scalar token or 1-D tokens into 1-D sub-word tokens.
Public Functions
inline explicit WordpieceTokenizer(const std::shared_ptr<Vocab> &vocab, const std::string &suffix_indicator = "##", int32_t max_bytes_per_token = 100, const std::string &unknown_token = "[UNK]", bool with_offsets = false)
- Constructor.
Parameters
- vocab – [in] A Vocab object. 
- suffix_indicator – [in] Prefix that marks a sub-word as the continuation of a word rather than its beginning (default='##'). 
- max_bytes_per_token – [in] Tokens longer than this number of bytes will not be split further (default=100). 
- unknown_token – [in] Returned when a token cannot be found in the vocabulary. If 'unknown_token' is an empty string, the original token is returned instead (default='[UNK]'). 
- with_offsets – [in] Whether to output the offsets of tokens (default=false). 
 Example
- /* Define operations */
  std::vector<std::string> word_list = {"book", "apple", "rabbit"};
  std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
  Status s = Vocab::BuildFromVector(word_list, {}, true, &vocab);
  auto tokenizer_op = text::WordpieceTokenizer(vocab);

  /* dataset is an instance of Dataset object */
  dataset = dataset->Map({tokenizer_op},  // operations
                         {"text"});       // input columns
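To make the roles of suffix_indicator, max_bytes_per_token and unknown_token concrete, the following is a minimal standalone sketch of greedy longest-match-first WordPiece splitting. It is not MindSpore's implementation: WordpieceSplit is a hypothetical helper, the vocabulary is a plain std::unordered_set rather than a Vocab object, matching is byte-wise (so it is only meaningful for ASCII input), and it assumes the usual WordPiece convention that suffix_indicator is prefixed to every piece after the first. Treating an over-long token the same way as an unknown token is also an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of MindSpore: splits one token into
// sub-word pieces by greedy longest-match-first lookup in `vocab`.
std::vector<std::string> WordpieceSplit(
    const std::string &token,
    const std::unordered_set<std::string> &vocab,
    const std::string &suffix_indicator = "##",
    std::size_t max_bytes_per_token = 100,
    const std::string &unknown_token = "[UNK]") {
  // Emitted when the token cannot be split: the token itself if
  // unknown_token is empty, otherwise unknown_token.
  const std::string fallback = unknown_token.empty() ? token : unknown_token;

  // Assumption: tokens longer than the byte limit are not split at all.
  if (token.size() > max_bytes_per_token) {
    return {fallback};
  }

  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start < token.size()) {
    std::size_t end = token.size();
    bool found = false;
    // Try the longest remaining substring first, then shrink it.
    while (end > start) {
      std::string candidate = token.substr(start, end - start);
      if (start > 0) {
        candidate = suffix_indicator + candidate;  // non-initial piece
      }
      if (vocab.count(candidate) > 0) {
        pieces.push_back(candidate);
        found = true;
        break;
      }
      --end;
    }
    if (!found) {
      return {fallback};  // no piece matches: the whole token is unknown
    }
    start = end;
  }
  return pieces;
}
```

With a vocabulary containing "book" and "##s", this sketch splits "books" into "book" and "##s". A token with no matching pieces, such as "cat", comes back as "[UNK]", or as "cat" itself when unknown_token is passed as an empty string.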
 
~WordpieceTokenizer() override = default
- Destructor. 
 