mindspore.dataset.text.FilterWikipediaXML
- class mindspore.dataset.text.FilterWikipediaXML
Filter Wikipedia XML dumps down to “clean” text consisting only of lowercase letters (a-z, with A-Z converted to lowercase) and single spaces (never consecutive).
Note
FilterWikipediaXML is not yet supported on the Windows platform.
- Supported Platforms:
CPU
Examples
>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["Welcome to China", "!!!", "ABC"],
...                                              column_names=["text"], shuffle=False)
>>> replace_op = text.FilterWikipediaXML()
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=replace_op)
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
...     break
welcome to china
>>>
>>> # Use the transform in eager mode
>>> data = "Welcome to China"
>>> output = replace_op(data)
>>> print(output)
welcome to china
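To make the documented output format concrete, the following is a minimal pure-Python sketch of the contract described above: only lowercase a-z, with no consecutive spaces. It is not the MindSpore implementation and does not handle XML markup. The helper name `approximate_clean` is hypothetical, and both the replacement of every non-letter run with one space and the trimming of leading and trailing spaces are assumptions made for illustration.

```python
import re
import string

# Map ASCII uppercase letters to lowercase only (A-Z -> a-z), as the description states.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def approximate_clean(text: str) -> str:
    """Hypothetical helper: mimic the documented output format, not the real operator."""
    lowered = text.translate(_ASCII_LOWER)
    # Assumption: any run of characters outside a-z becomes a single space,
    # which guarantees that spaces are never consecutive.
    collapsed = re.sub(r"[^a-z]+", " ", lowered)
    # Assumption: leading and trailing spaces are dropped.
    return collapsed.strip()


if __name__ == "__main__":
    for sample in ["Welcome to China", "!!!", "ABC"]:
        print(repr(approximate_clean(sample)))
```

The sketch can help when writing tests that check whether text already satisfies the a-z and single-space format; for actual Wikipedia dumps, use `text.FilterWikipediaXML` itself.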