mindspore.dataset.text.SentencePieceTokenizer
- class mindspore.dataset.text.SentencePieceTokenizer(mode, out_type)
- Tokenize a scalar token or 1-D tokens into subword tokens using SentencePiece.
- Parameters
- mode (Union[str, SentencePieceVocab]) – SentencePiece model. If a string is given, it is the path of the SentencePiece model file to be loaded. If a SentencePieceVocab object is given, it should be constructed in advance.
- out_type (SPieceTokenizerOutType) – The type of the output. It can be SPieceTokenizerOutType.STRING or SPieceTokenizerOutType.INT.
- SPieceTokenizerOutType.STRING means the output type of the SentencePiece tokenizer is string.
- SPieceTokenizerOutType.INT means the output type of the SentencePiece tokenizer is int.
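To make the difference between the two output types concrete, here is a minimal pure-Python sketch. It does not call MindSpore or SentencePiece: `OutType`, `TOY_VOCAB`, `toy_tokenize` and `toy_detokenize` are hypothetical names, the vocabulary and its ids are made up, and the greedy longest-match segmentation is a simplification, not the unigram algorithm a real SentencePiece model uses. It only illustrates that the same pieces come back either as strings or as their vocabulary ids, and that the "▁" character marks a word boundary.

```python
from enum import Enum

MARK = "\u2581"  # "▁", the character SentencePiece uses in place of a space


class OutType(Enum):
    """Hypothetical stand-in for SPieceTokenizerOutType."""
    STRING = 0
    INT = 1


# Made-up vocabulary: piece -> id. A real model learns this from a corpus.
TOY_VOCAB = {"<unk>": 0, MARK + "H": 1, "el": 2, "lo": 3, MARK: 4,
             "w": 5, "or": 6, "l": 7, "d": 8}


def toy_tokenize(sentence, out_type):
    """Split a sentence by greedy longest match against TOY_VOCAB."""
    # Replace spaces with the marker and prepend one at the start.
    text = MARK + sentence.replace(" ", MARK)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit the unknown piece.
            pieces.append("<unk>")
            i += 1
    if out_type is OutType.INT:
        return [TOY_VOCAB[p] for p in pieces]
    return pieces


def toy_detokenize(pieces):
    """Join string pieces and turn the markers back into spaces."""
    return "".join(pieces).replace(MARK, " ").lstrip(" ")


print(toy_tokenize("Hello world", OutType.STRING))
print(toy_tokenize("Hello world", OutType.INT))
print(toy_detokenize(toy_tokenize("Hello world", OutType.STRING)))
```

With this toy vocabulary the string output is a list of pieces and the int output is the list of their ids in the same order, so the string form is convenient for inspection while the int form is what would typically be fed to a model.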
 
 
- Raises
 - Supported Platforms:
- CPU
 - Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> from mindspore.dataset.text import SentencePieceModel, SPieceTokenizerOutType
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Hello world'], column_names=["text"])
>>> # The paths to sentence_piece_vocab_file can be downloaded directly from the mindspore repository. Refer to
>>> # https://gitee.com/mindspore/mindspore/blob/v2.7.1/tests/ut/data/dataset/test_sentencepiece/vocab.txt
>>> sentence_piece_vocab_file = "tests/ut/data/dataset/test_sentencepiece/vocab.txt"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 512, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['▁H' 'el' 'lo' '▁' 'w' 'or' 'l' 'd']
>>>
>>> # Use the transform in eager mode
>>> data = "Hello world"
>>> vocab = text.SentencePieceVocab.from_file([sentence_piece_vocab_file], 100, 0.9995,
...                                           SentencePieceModel.UNIGRAM, {})
>>> output = text.SentencePieceTokenizer(vocab, out_type=SPieceTokenizerOutType.STRING)(data)
>>> print(output)
['▁' 'H' 'e' 'l' 'l' 'o' '▁' 'w' 'or' 'l' 'd']
 - Tutorial Examples: