mindspore.dataset.text.JiebaTokenizer

class mindspore.dataset.text.JiebaTokenizer(hmm_path, mp_path, mode=JiebaMode.MIX, with_offsets=False)

Use Jieba tokenizer to tokenize Chinese strings.

Note

The dictionary files used by the Hidden Markov Model and Max Probability segmentation algorithms can be obtained from the cppjieba repository on GitHub. Please ensure the validity and integrity of these files.

Parameters
  • hmm_path (str) – Path to the dictionary file used by the Hidden Markov Model segmentation algorithm.

  • mp_path (str) – Path to the dictionary file used by the Max Probability segmentation algorithm.

  • mode (JiebaMode, optional) – The desired segmentation algorithm. See JiebaMode for details on optional values. Default: JiebaMode.MIX.

  • with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["床前明月光"], column_names=["text"])
>>>
>>> # 1) If with_offsets=False, return one data column {["text", dtype=str]}
>>> # The paths to jieba_hmm_file and jieba_mp_file can be downloaded directly from the mindspore repository.
>>> # Refer to https://gitee.com/mindspore/mindspore/blob/master/tests/ut/data/dataset/jiebadict/hmm_model.utf8
>>> # and https://gitee.com/mindspore/mindspore/blob/master/tests/ut/data/dataset/jiebadict/jieba.dict.utf8
>>> jieba_hmm_file = "tests/ut/data/dataset/jiebadict/hmm_model.utf8"
>>> jieba_mp_file = "tests/ut/data/dataset/jiebadict/jieba.dict.utf8"
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=False)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['床' '前' '明月光']
>>>
>>> # 2) If with_offsets=True, return three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                ["offsets_limit", dtype=uint32]}
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["床前明月光"], column_names=["text"])
>>> tokenizer_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP, with_offsets=True)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                                 output_columns=["token", "offsets_start", "offsets_limit"])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["token"], item["offsets_start"], item["offsets_limit"])
['床' '前' '明月光'] [0 3 6] [ 3  6 15]
>>>
>>> # Use the transform in eager mode
>>> data = "床前明月光"
>>> output = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)(data)
>>> print(output)
['床' '前' '明月光']
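The examples above use JiebaMode.MP. Below is a minimal sketch of the default JiebaMode.MIX mode in eager mode; the resulting tokens depend on the supplied dictionary files, so no expected output is listed.

>>> # A sketch using the default mixed segmentation mode; the tokenization result
>>> # depends on the dictionary files passed in.
>>> mix_tokenizer = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MIX)
>>> mix_output = mix_tokenizer("床前明月光")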
add_dict(user_dict)

Add the specified word mappings to the Vocab of the tokenizer.

Parameters

user_dict (Union[str, dict[str, int]]) – The word mappings to be added to the Vocab. If a str is passed, it is taken as the path of a file storing the word mappings to be added; each line of the file should contain two fields separated by a space, where the first field is the word itself and the second is a number indicating the word frequency. Invalid lines are ignored, with no error or warning returned. If a dict[str, int] is passed, its keys are the words to be added and its values are the corresponding word frequencies. A file-based sketch follows the example below.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> user_dict = {"男默女泪": 10}
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> jieba_op.add_dict(user_dict)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
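As noted above, add_dict also accepts the path of a user dictionary file instead of a dict. The sketch below reuses jieba_op from the example above and assumes a hypothetical file /path/to/user/dict/file whose lines each contain a word and its frequency separated by a space.

>>> # Hypothetical user dictionary file, one "word frequency" pair per line, e.g.:
>>> #     男默女泪 10
>>> #     江大桥 20000
>>> user_dict_file = "/path/to/user/dict/file"
>>> jieba_op.add_dict(user_dict_file)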
add_word(word, freq=None)

Add a specified word mapping to the Vocab of the tokenizer.

Parameters
  • word (str) – The word to be added to the Vocab.

  • freq (int, optional) – The frequency of the word to be added. The higher the word frequency, the greater the chance that the word will be tokenized. Default: None, using the default word frequency.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import JiebaMode
>>>
>>> jieba_hmm_file = "/path/to/jieba/hmm/file"
>>> jieba_mp_file = "/path/to/jieba/mp/file"
>>> jieba_op = text.JiebaTokenizer(jieba_hmm_file, jieba_mp_file, mode=JiebaMode.MP)
>>> sentence_piece_vocab_file = "/path/to/sentence/piece/vocab/file"
>>> with open(sentence_piece_vocab_file, 'r') as f:
...     for line in f:
...         word = line.split(',')[0]
...         jieba_op.add_word(word)
>>>
>>> text_file_list = ["/path/to/text_file_dataset_file"]
>>> text_file_dataset = ds.TextFileDataset(dataset_files=text_file_list)
>>> text_file_dataset = text_file_dataset.map(operations=jieba_op, input_columns=["text"])
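The example above adds words with the default frequency. A word can also be added with an explicit frequency, which makes it more likely to be kept as a single token; a minimal sketch, reusing jieba_op from the example above:

>>> # Add a single word with an explicit frequency; a higher frequency makes the
>>> # word more likely to survive segmentation as one token.
>>> jieba_op.add_word("男默女泪", freq=10)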