mindspore.dataset.text.SentencePieceTokenizer

class mindspore.dataset.text.SentencePieceTokenizer(mode, out_type)[source]

Tokenize a scalar string or a 1-D array of strings into subword tokens using SentencePiece.

Parameters

mode (Union[str, SentencePieceVocab]) – SentencePiece model. If a string is given, it is the path of the SentencePiece model file to load. If a SentencePieceVocab object is given, it must be constructed in advance.
out_type (SPieceTokenizerOutType) –
The type of the output. It can be SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.
- SPieceTokenizerOutType.STRING: the output type of the SentencePiece tokenizer is string.
- SPieceTokenizerOutType.INT: the output type of the SentencePiece tokenizer is int.

Raises

TypeError – If mode is not of type string or SentencePieceVocab.
TypeError – If out_type is not of type SPieceTokenizerOutType.

Supported Platforms: CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Hello world'], column_names=["text"])
>>> # The file referenced by sentence_piece_vocab_file can be downloaded directly from the mindspore repository. Refer to
>>> # https://gitee.com/mindspore/mindspore/blob/v2.7.1/tests/ut/data/dataset/test_sentencepiece/vocab.txt
>>> sentence_piece_vocab_file = "tests/ut/data/dataset/test_sentencepiece/vocab.txt"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 512, 0.9995,
...                                            SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['▁H' 'el' 'lo' '▁' 'w' 'or' 'l' 'd']
>>>
>>> # Use the transform in eager mode
>>> data = "Hello world"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 100, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> output = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)(data)
>>> print(output)
['▁' 'H' 'e' 'l' 'l' 'o' '▁' 'w' 'or' 'l' 'd']
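
To make the shape of the tokenizer output more concrete, the following is a minimal pure-Python sketch; it does not call MindSpore and is not how the library is implemented. The helper names `pieces_to_text` and `pieces_to_ids`, and every integer id in `toy_piece_to_id`, are hypothetical and chosen only for illustration. It shows two things: that the `▁` marker in STRING output stands for a word boundary, so the pieces can be joined back into the input text, and that INT output can be thought of as the same pieces looked up in a piece-to-id table.

```python
# Toy illustration only. The pieces below are copied from the example outputs
# above; the integer ids are invented for this sketch and are NOT the ids
# that a real SentencePiece model would assign.
toy_piece_to_id = {"▁H": 3, "el": 4, "lo": 5, "▁": 6, "w": 7, "or": 8, "l": 9, "d": 10}


def pieces_to_text(pieces):
    """Join pieces and turn the '▁' word-boundary marker back into spaces."""
    return "".join(pieces).replace("▁", " ").lstrip(" ")


def pieces_to_ids(pieces, piece_to_id, unk_id=0):
    """Map each string piece to an integer id, using unk_id for unknown pieces."""
    return [piece_to_id.get(piece, unk_id) for piece in pieces]


pipeline_pieces = ["▁H", "el", "lo", "▁", "w", "or", "l", "d"]
eager_pieces = ["▁", "H", "e", "l", "l", "o", "▁", "w", "or", "l", "d"]

# Both segmentations differ in granularity but spell out the same text.
print(pieces_to_text(pipeline_pieces))
print(pieces_to_text(eager_pieces))

# Hypothetical ids for the pipeline segmentation.
print(pieces_to_ids(pipeline_pieces, toy_piece_to_id))
```

With a real model, the ids come from the trained vocabulary, so requesting `SPieceTokenizerOutType.INT` is the way to obtain them; the table above is only a stand-in.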

Tutorial Examples:

Illustration of text transforms