Class JiebaTokenizer
- Defined in File text.h 
Inheritance Relationships
Base Type
- public mindspore::dataset::TensorTransform (Class TensorTransform)
Class Documentation
- 
class JiebaTokenizer : public mindspore::dataset::TensorTransform
- Tokenize a Chinese string into words based on the dictionary.
- Note
- Confirm that the dictionary files used by the HMMSegment and MPSegment algorithms are complete and intact.
Public Functions
- 
inline JiebaTokenizer(const std::string &hmm_path, const std::string &mp_path, const JiebaMode &mode = JiebaMode::kMix, bool with_offsets = false)
- Constructor.
- Parameters
- hmm_path – [in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba). 
- mp_path – [in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba). 
- mode – [in] Segmentation mode. Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix (default=JiebaMode::kMix).
- JiebaMode::kMp, tokenizes with the MPSegment algorithm. 
- JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment (HMMSegment) algorithm. 
- JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms. 
 
- with_offsets – [in] Whether to output offsets of tokens (default=false). 
 Example
- /* Define operations */
  auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

  /* dataset is an instance of Dataset object */
  dataset = dataset->Map({tokenizer_op},  // operations
                         {"text"});       // input columns
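To give a feel for what dictionary-based, maximum-probability segmentation does, here is a minimal, self-contained sketch. It is a toy illustration only, not the cppjieba or MindSpore implementation: the helper names `SplitUtf8` and `SegmentMaxProb`, the word frequencies, and the fallback score for unknown characters are all assumptions made for this example. It picks the split whose words have the highest total log-frequency.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Split a UTF-8 string into code points, each returned as its own byte string.
// Assumes well-formed UTF-8 input (hypothetical helper for this sketch).
std::vector<std::string> SplitUtf8(const std::string &s) {
  std::vector<std::string> chars;
  for (std::size_t i = 0; i < s.size();) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    std::size_t len = 1;
    if (c >= 0xF0) len = 4;
    else if (c >= 0xE0) len = 3;
    else if (c >= 0xC0) len = 2;
    chars.push_back(s.substr(i, len));
    i += len;
  }
  return chars;
}

// Toy maximum-probability segmenter (hypothetical, not a MindSpore API).
// `freq` maps dictionary words to positive frequencies.
std::vector<std::string> SegmentMaxProb(const std::string &text,
                                        const std::unordered_map<std::string, double> &freq) {
  const std::vector<std::string> chars = SplitUtf8(text);
  const std::size_t n = chars.size();
  double total = 0.0;
  for (const auto &kv : freq) total += kv.second;
  // Assumed fallback: an unknown single character gets a low but finite score.
  const double unknown = std::log(1.0 / (total + 1.0));

  std::vector<double> best(n + 1, 0.0);     // best[i]: best score for chars[i..n)
  std::vector<std::size_t> next(n + 1, n);  // next[i]: end index of the word starting at i
  for (std::size_t i = n; i-- > 0;) {
    best[i] = -std::numeric_limits<double>::infinity();
    std::string word;
    for (std::size_t j = i; j < n; ++j) {
      word += chars[j];
      const auto it = freq.find(word);
      double score = 0.0;
      if (it != freq.end()) {
        score = std::log(it->second / total);
      } else if (j == i) {
        score = unknown;
      } else {
        continue;  // multi-character candidate that is not in the dictionary
      }
      if (score + best[j + 1] > best[i]) {
        best[i] = score + best[j + 1];
        next[i] = j + 1;
      }
    }
  }

  // Walk the recorded word boundaries to rebuild the chosen segmentation.
  std::vector<std::string> words;
  for (std::size_t i = 0; i < n; i = next[i]) {
    std::string word;
    for (std::size_t k = i; k < next[i]; ++k) word += chars[k];
    words.push_back(word);
  }
  return words;
}
```

With the made-up dictionary `{{"今天", 10}, {"天气", 10}, {"天天", 1}}`, calling `SegmentMaxProb("今天天气", dict)` returns the two words `今天` and `天气`, because that split has a higher total log-frequency than any split that uses `天天` or single characters.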
 
 - 
JiebaTokenizer(const std::vector<char> &hmm_path, const std::vector<char> &mp_path, const JiebaMode &mode, bool with_offsets)
- Constructor.
- Parameters
- hmm_path – [in] Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba). 
- mp_path – [in] Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the cppjieba repository (https://github.com/yanyiwu/cppjieba). 
- mode – [in] Segmentation mode. Valid values are JiebaMode::kMp, JiebaMode::kHmm and JiebaMode::kMix (no default in this overload).
- JiebaMode::kMp, tokenizes with the MPSegment algorithm. 
- JiebaMode::kHmm, tokenizes with the Hidden Markov Model Segment (HMMSegment) algorithm. 
- JiebaMode::kMix, tokenizes with a mix of the MPSegment and HMMSegment algorithms. 
 
- with_offsets – [in] Whether to output offsets of tokens (no default in this overload). 
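As a rough illustration of what token offsets can look like, the sketch below computes a start and limit position for each token in the original text. It assumes that offsets are byte positions in the UTF-8 input and that the limit is exclusive; both are assumptions of this example rather than statements from this page. The `TokenSpan` struct and `ComputeOffsets` function are hypothetical names, not MindSpore APIs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one token and its byte span [start, limit).
struct TokenSpan {
  std::string token;
  std::size_t start;  // byte index of the token's first byte
  std::size_t limit;  // byte index one past the token's last byte
};

// Locate each token in `text`, scanning left to right, and record its byte span.
// Empty tokens and tokens not found after the previous match are skipped.
std::vector<TokenSpan> ComputeOffsets(const std::string &text,
                                      const std::vector<std::string> &tokens) {
  std::vector<TokenSpan> spans;
  std::size_t cursor = 0;
  for (const auto &tok : tokens) {
    if (tok.empty()) continue;
    const std::size_t pos = text.find(tok, cursor);
    if (pos == std::string::npos) continue;
    spans.push_back({tok, pos, pos + tok.size()});
    cursor = pos + tok.size();
  }
  return spans;
}
```

For the text `今天天气` split into `今天` and `天气`, this yields the spans [0, 6) and [6, 12), since each of these characters occupies three bytes in UTF-8.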
 
 
 - 
~JiebaTokenizer() override = default
- Destructor. 
 - 
inline Status AddWord(const std::string &word, int64_t freq = 0)
- Add a user-defined word to the JiebaTokenizer’s dictionary.
- Parameters
- word – [in] The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk. 
- freq – [in] The frequency of the word to be added. The higher the frequency, the more likely the word is to be segmented as a single token (default=0, use the default frequency). 
 
- Returns
- Status error code, returns OK if no error is encountered.
 Example
- /* Define operations */
  auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

  Status s = tokenizer_op.AddWord("hello", 2);
 
 - 
inline Status AddDict(const std::vector<std::pair<std::string, int64_t>> &user_dict)
- Add a user-defined dictionary of word-freq pairs to the JiebaTokenizer’s dictionary.
- Parameters
- user_dict – [in] Vector of word-freq pairs to be added to the JiebaTokenizer’s dictionary. 
- Returns
- Status error code, returns OK if no error is encountered.
 Example
- /* Define operations */
  auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

  std::vector<std::pair<std::string, int64_t>> user_dict = {{"a", 1}, {"b", 2}, {"c", 3}};
  Status s = tokenizer_op.AddDict(user_dict);
 
 - 
inline Status AddDict(const std::string &file_path)
- Add a user-defined dictionary of word-freq pairs to the JiebaTokenizer’s dictionary from a file. Only valid word-freq pairs in the user-defined file are added to the dictionary. Rows containing invalid input are ignored; no error or warning status is returned for them.
- Parameters
- file_path – [in] Path to the dictionary which includes user defined word-freq pairs. 
- Returns
- Status error code, returns OK if no error is encountered.
 Example
- /* Define operations */
  auto tokenizer_op = text::JiebaTokenizer("/path/to/hmm/file", "/path/to/mp/file");

  Status s = tokenizer_op.AddDict("/path/to/dict/file");
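Because invalid rows are dropped without any status, it can help to pre-check a dictionary file before loading it. The sketch below is one possible way to do that, under an assumed file format of one whitespace-separated `word freq` pair per line with a non-negative integer frequency; the actual format accepted by AddDict is not specified here. `ParseUserDict` is a hypothetical helper, not part of MindSpore.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse "word freq" rows from a stream (assumed format, see lead-in).
// Rows that do not match exactly two fields, or whose frequency is not a
// non-negative integer, are skipped silently.
std::vector<std::pair<std::string, std::int64_t>> ParseUserDict(std::istream &in) {
  std::vector<std::pair<std::string, std::int64_t>> entries;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream row(line);
    std::string word;
    std::int64_t freq = 0;
    std::string extra;
    if (!(row >> word >> freq)) continue;      // missing or non-numeric frequency
    if (freq < 0 || (row >> extra)) continue;  // negative value or trailing junk
    entries.emplace_back(word, freq);
  }
  return entries;
}
```

The returned vector has the same element type as the `user_dict` parameter of the vector overload of AddDict, so the surviving entries could be inspected or counted before being passed on.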
 
 