mindspore.dataset.text.PythonTokenizer

class mindspore.dataset.text.PythonTokenizer(tokenizer)

Class that applies a user-defined string tokenizer to the input string.

Parameters

tokenizer (Callable) – Python function that takes a str and returns a list of str as tokens.
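For illustration, a minimal sketch of a callable that satisfies this contract (the name my_lower_tokenizer is hypothetical, not part of the API):

>>> def my_lower_tokenizer(line):
...     # Lowercase the input string and split on whitespace;
...     # the return value must be a list of str.
...     return line.lower().split()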

Raises

TypeError – If tokenizer is not a callable Python function.

Supported Platforms:

CPU

Examples

>>> import numpy as np
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> def my_tokenizer(line):
...     return line.split()
>>>
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=['Hello world'], column_names=["text"])
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=text.PythonTokenizer(my_tokenizer))
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
['Hello' 'world']
>>>
>>> # Use the transform in eager mode
>>> data = np.array('Hello world'.encode())
>>> output = text.PythonTokenizer(my_tokenizer)(data)
>>> print(output)
['Hello' 'world']
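>>>
>>> # Because any Python callable is accepted, a tokenizer that needs extra
>>> # arguments can be adapted with functools.partial or a lambda. A minimal
>>> # sketch, assuming a space separator; the function name and sep value are
>>> # illustrative only.
>>> from functools import partial
>>>
>>> def my_delimiter_tokenizer(line, sep):
...     # Split on a caller-supplied separator; returns a list of str.
...     return line.split(sep)
>>>
>>> tokenizer_op = text.PythonTokenizer(partial(my_delimiter_tokenizer, sep=' '))
>>> print(tokenizer_op(np.array('Hello world'.encode())))
['Hello' 'world']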