mindspore.dataset.text.JiebaTokenizer
- class mindspore.dataset.text.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)
- Tokenize a Chinese string into words based on a dictionary.
- Note
- The integrity of the HMMSegment and MPSegment algorithm files must be confirmed.
- Parameters
- hmm_path (str) – Dictionary file used by the HMMSegment algorithm. The dictionary can be obtained from the official cppjieba website.
- mp_path (str) – Dictionary file used by the MPSegment algorithm. The dictionary can be obtained from the official cppjieba website.
- mode (JiebaMode, optional) – Tokenization mode (default=JiebaMode.MIX). Valid values are JiebaMode.MP, JiebaMode.HMM and JiebaMode.MIX.
- JiebaMode.MP, tokenize with the MPSegment algorithm.
- JiebaMode.HMM, tokenize with the Hidden Markov Model Segment (HMMSegment) algorithm.
- JiebaMode.MIX, tokenize with a mix of the MPSegment and HMMSegment algorithms.
 
- with_offsets (bool, optional) – Whether to output the offsets of tokens (default=False).
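As a rough illustration of what the three output columns hold when with_offsets=True, the sketch below computes start and limit offsets for a list of tokens. It assumes that offsets count UTF-8 bytes, that offsets_limit is exclusive, and that tokens are contiguous in the input. The helper name byte_offsets is hypothetical and not part of MindSpore.

```python
def byte_offsets(tokens):
    """Return (starts, limits) for contiguous tokens.

    Assumption: offsets are UTF-8 byte positions and each limit is exclusive.
    """
    starts, limits = [], []
    pos = 0
    for tok in tokens:
        size = len(tok.encode("utf-8"))  # Chinese characters take 3 bytes each
        starts.append(pos)
        limits.append(pos + size)
        pos += size
    return starts, limits


tokens = ["今天天气", "太好了"]
starts, limits = byte_offsets(tokens)
print(starts, limits)  # [0, 12] [12, 21]
```

Under these assumptions, each start offset equals the previous limit, so the token text can be recovered by slicing the encoded input with the two offsets.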
 
- Raises
- ValueError – If the path of the HMMSegment dictionary is not provided.
- ValueError – If the path of the MPSegment dictionary is not provided.
- TypeError – If hmm_path or mp_path is not of type str.
- TypeError – If with_offsets is not of type bool.
 
 - Supported Platforms:
- CPU
 - Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> # If with_offsets=False, default output one column {["text", dtype=str]}
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> text_file_dataset = text_file_dataset.map(operations=tokenizer_op)
>>> # If with_offsets=True, then output three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                   ["offsets_limit", dtype=uint32]}
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> text_file_dataset_1 = text_file_dataset_1.map(operations=tokenizer_op, input_columns=["text"],
...                                               output_columns=["token", "offsets_start", "offsets_limit"],
...                                               column_order=["token", "offsets_start", "offsets_limit"])
 - add_dict(user_dict)
- Add user-defined words to JiebaTokenizer’s dictionary.
- Parameters
- user_dict (Union[str, dict]) – Two loading methods are supported: a file path (str) to a file in the Jieba dictionary format, or a Python dictionary (dict) of the form {word1: freq1, word2: freq2, …}. In the Jieba dictionary format, each row holds a word (required) and a freq (optional), such as:
  word1 freq1
  word2 None
  word3 freq3
  Only valid word-freq pairs in the user-provided file are added to the dictionary. Rows containing invalid input are ignored, and no error or warning status is returned.
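The file-loading behaviour described above can be sketched in plain Python. This is not MindSpore's parser, only a minimal illustration under stated assumptions: a valid row is a word optionally followed by whitespace and a non-negative integer frequency, the None in the format above is read as "frequency omitted", and every other row is skipped silently. The name parse_user_dict is hypothetical.

```python
import re

# Assumed row shape: word, then optionally whitespace and an integer frequency.
_ROW = re.compile(r"\s*(\S+)(?:\s+([0-9]+))?\s*")


def parse_user_dict(lines):
    """Collect word -> freq pairs, skipping rows that do not match _ROW."""
    words = {}
    for line in lines:
        match = _ROW.fullmatch(line)
        if match is None:
            continue  # invalid row: ignored without error or warning
        word, freq = match.groups()
        words[word] = int(freq) if freq is not None else None
    return words


rows = ["男默女泪 10\n", "江大桥\n", "bad row here\n", "\n"]
print(parse_user_dict(rows))  # {'男默女泪': 10, '江大桥': None}
```

Because invalid rows are dropped silently, comparing the number of rows in the file with the number of entries actually loaded is a cheap sanity check when a custom word does not seem to take effect.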
 - Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
 - add_word(word, freq=None)
- Add a user-defined word to JiebaTokenizer’s dictionary.
- Parameters
- word (str) – The word to be added to the JiebaTokenizer instance. The added word will not be written into the built-in dictionary on disk.
- freq (int, optional) – The frequency of the word to be added. The higher the frequency, the more likely the word is to be segmented out as a single token (default=None, use default frequency).
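To see why a higher freq makes a word more likely to come out as one token, consider the toy maximum-probability segmenter below. It is an illustrative model only, not cppjieba's MPSegment implementation: it assumes a unigram model where each word's probability is its frequency divided by the total, and it gives unknown single characters a small fallback probability. The function name segment and the sample frequencies are hypothetical.

```python
import math


def segment(sentence, freqs):
    """Split sentence so that the sum of log word probabilities is maximal."""
    total = sum(freqs.values()) or 1
    n = len(sentence)
    # best[i] holds (score, next_index) for the best split of sentence[i:]
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            if word in freqs:
                score = math.log(freqs[word] / total)
            elif j == i + 1:
                score = math.log(1 / total)  # fallback for an unknown character
            else:
                continue
            candidates.append((score + best[j][0], j))
        best[i] = max(candidates)
    tokens, i = [], 0
    while i < n:
        j = best[i][1]
        tokens.append(sentence[i:j])
        i = j
    return tokens


freqs = {"男": 5, "默": 5, "女": 5, "泪": 5}
print(segment("男默女泪", freqs))   # ['男', '默', '女', '泪']
freqs["男默女泪"] = 10              # analogous to add_word("男默女泪", freq=10)
print(segment("男默女泪", freqs))   # ['男默女泪']
```

In this toy model a single dictionary entry competes against the product of the probabilities of its parts, so even a moderate frequency is usually enough for a long word to win.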
 
 - Examples
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])