mindspore.dataset.text.WordpieceTokenizer

class mindspore.dataset.text.WordpieceTokenizer(vocab, suffix_indicator='##', max_bytes_per_token=100, unknown_token='[UNK]', with_offsets=False)

Tokenize the input text into subword tokens.

Parameters
  • vocab (Vocab) – Vocabulary used to look up words.

  • suffix_indicator (str, optional) – Prefix used to mark tokens in the vocabulary that are subword suffixes, as illustrated in the sketch after this parameter list. Default: '##'.

  • max_bytes_per_token (int, optional) – The maximum number of bytes allowed per input word; words exceeding this length will not be split. Default: 100.

  • unknown_token (str, optional) – The token to output for words that cannot be matched against the vocabulary. When set to an empty string, the unknown word itself is returned as the output; otherwise, this string is returned. Default: '[UNK]'.

  • with_offsets (bool, optional) – Whether to output the start and end offsets of each token in the original string. Default: False.
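
As a minimal sketch of how suffix_indicator and unknown_token interact (reusing only the Vocab and eager-call pattern shown in the Examples below; the expected result is given as a comment, not as verified output), a word whose root and '##'-prefixed suffix are both in the vocabulary is split into subwords instead of being mapped to unknown_token:

>>> import mindspore.dataset.text as text
>>>
>>> # Vocabulary containing word roots and '##'-prefixed subword suffixes
>>> vocab = text.Vocab.from_list(["favor", "##ite", "dur", "##ing"])
>>> tokenizer = text.WordpieceTokenizer(vocab=vocab, suffix_indicator='##', unknown_token='[UNK]')
>>> # Greedy longest-match-first split: "favorite" -> "favor" + "##ite", "during" -> "dur" + "##ing"
>>> output = tokenizer(["favorite", "during"])
>>> # Expected: ['favor' '##ite' 'dur' '##ing']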

Raises
  • TypeError – If vocab is not of type mindspore.dataset.text.Vocab.

  • TypeError – If suffix_indicator is not of type str.

  • TypeError – If max_bytes_per_token is not of type int.

  • TypeError – If unknown_token is not of type str.

  • TypeError – If with_offsets is not of type bool.

  • ValueError – If max_bytes_per_token is negative.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> seed = ds.config.get_seed()
>>> ds.config.set_seed(12345)
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["happy", "birthday", "to", "you"], column_names=["text"])
>>>
>>> vocab_list = ["book", "cholera", "era", "favor", "##ite", "my", "is", "love", "dur", "##ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>>
>>> # If with_offsets=False, the output is one column {["text", dtype=str]}
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=False)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
...     break
['[UNK]']
>>>
>>> # If with_offsets=True, the output is three columns {["token", dtype=str], ["offsets_start", dtype=uint32],
>>> #                                                    ["offsets_limit", dtype=uint32]}
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["happy", "birthday", "to", "you"], column_names=["text"])
>>> tokenizer_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='[UNK]',
...                                        max_bytes_per_token=100, with_offsets=True)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=tokenizer_op, input_columns=["text"],
...                                                 output_columns=["token", "offsets_start", "offsets_limit"])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["token"], item["offsets_start"], item["offsets_limit"])
...     break
['[UNK]'] [0] [5]
>>>
>>> # Use the transform in eager mode
>>> data = ["happy", "birthday", "to", "you"]
>>> vocab_list = ["book", "cholera", "era", "favor", "**ite", "my", "is", "love", "dur", "**ing", "the"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> output = text.WordpieceTokenizer(vocab=vocab, suffix_indicator="y", unknown_token='[UNK]')(data)
>>> print(output)
['[UNK]' '[UNK]' '[UNK]' '[UNK]']
>>> ds.config.set_seed(seed)
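
All of the outputs above are '[UNK]' because none of the input words can be assembled from the vocabulary. As a complementary sketch of the empty unknown_token behaviour described in the parameter list (the expected result is given as a comment, not as verified output), an unmatched word is passed through unchanged when unknown_token='':

>>> import mindspore.dataset.text as text
>>>
>>> vocab = text.Vocab.from_list(["book", "era", "favor", "##ite"])
>>> passthrough_op = text.WordpieceTokenizer(vocab=vocab, unknown_token='')
>>> output = passthrough_op(["happy", "favorite"])
>>> # Expected: 'happy' is returned as-is, 'favorite' is split -> ['happy' 'favor' '##ite']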