mindspore.dataset.text.FilterWikipediaXML

class mindspore.dataset.text.FilterWikipediaXML

Filter Wikipedia XML dumps into “clean” text consisting only of lowercase letters (a-z, with A-Z converted to lowercase) and single spaces (never consecutive).

Note

FilterWikipediaXML is not yet supported on the Windows platform.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["Welcome    to    China", "!!!", "ABC"],
...                                              column_names=["text"], shuffle=False)
>>> replace_op = text.FilterWikipediaXML()
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=replace_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
...     break
welcome to china
>>>
>>> # Use the transform in eager mode
>>> data = "Welcome    to    China"
>>> output = replace_op(data)
>>> print(output)
welcome to china
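
To make the output format concrete, the following pure-Python sketch approximates only the contract described above: lowercase a-z letters separated by single spaces. The helper name `approximate_clean` is hypothetical and is not part of MindSpore. The real operator also processes Wikipedia XML markup, so its results may differ on actual dump data. Stripping leading and trailing spaces is an assumption of this sketch.

```python
import re


def approximate_clean(text: str) -> str:
    """Hypothetical helper: mimic the documented output format only."""
    # Lowercase ASCII A-Z only, leaving every other character untouched.
    lowered = "".join(c.lower() if "A" <= c <= "Z" else c for c in text)
    # Replace each run of non-letter characters with one space,
    # which guarantees that spaces are never consecutive.
    collapsed = re.sub(r"[^a-z]+", " ", lowered)
    # Assumption: drop leading and trailing spaces.
    return collapsed.strip()


print(approximate_clean("Welcome    to    China"))  # welcome to china
```

On the plain-text sample used in the examples above, this sketch prints the same string as the operator. It is meant as a reading aid for the format, not as a replacement for FilterWikipediaXML.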