mindspore.dataset.text.Lookup

class mindspore.dataset.text.Lookup(vocab, unknown_token=None, data_type=mstype.int32)

Look up each word in the input vocabulary table and map it to its id.

Parameters
  • vocab (Vocab) – A vocabulary object.

  • unknown_token (str, optional) – Fallback word used when the word being looked up is out of vocabulary (OOV): the lookup result is then the id of unknown_token. If unknown_token is not specified, or is itself OOV, a runtime error is thrown. Default: None, meaning no unknown_token is specified.

  • data_type (mindspore.dtype, optional) – The data type of the ids that the lookup operation maps strings to. Default: mstype.int32.
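
The fallback rule for unknown_token can be illustrated with a small pure-Python sketch. This is not MindSpore's implementation: `lookup_token` is a hypothetical helper, the vocabulary is modeled as a plain dict, and ids are assumed to follow list position (consistent with the Examples, where "with" maps to 2). The sketch does not model data_type or the exact point at which MindSpore raises its error.

```python
from typing import Dict, Optional


def lookup_token(vocab: Dict[str, int], word: str,
                 unknown_token: Optional[str] = None) -> int:
    """Hypothetical helper mirroring the documented OOV fallback rule."""
    # In-vocabulary words map directly to their id.
    if word in vocab:
        return vocab[word]
    # OOV word with no fallback configured: documented as a runtime error.
    if unknown_token is None:
        raise RuntimeError(f"'{word}' is OOV and no unknown_token is specified")
    # The fallback itself must be in the vocabulary.
    if unknown_token not in vocab:
        raise RuntimeError(f"unknown_token '{unknown_token}' is itself OOV")
    # OOV word: return the id of the fallback word instead.
    return vocab[unknown_token]


# Assumption: ids are list positions, starting at 0.
words = ["?", "##", "with", "the", "test", "符号"]
vocab = {w: i for i, w in enumerate(words)}

print(lookup_token(vocab, "with"))                           # 2
print(lookup_token(vocab, "missing", unknown_token="test"))  # 4
```

With unknown_token="test", any OOV word resolves to the id of "test"; without it, the same lookup raises RuntimeError.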

Raises
  • TypeError – If vocab is not of type text.Vocab.

  • TypeError – If unknown_token is not of type string.

  • TypeError – If data_type is not of type mindspore.dtype.

Supported Platforms:

CPU

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> # Use the transform in dataset pipeline mode
>>> numpy_slices_dataset = ds.NumpySlicesDataset(data=["with"], column_names=["text"])
>>> # Load vocabulary from list
>>> vocab = text.Vocab.from_list(["?", "##", "with", "the", "test", "符号"])
>>> # Use Lookup operation to map tokens to ids
>>> lookup = text.Lookup(vocab)
>>> numpy_slices_dataset = numpy_slices_dataset.map(operations=[lookup])
>>> for item in numpy_slices_dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(item["text"])
2
>>>
>>> # Use the transform in eager mode
>>> vocab = text.Vocab.from_list(["?", "##", "with", "the", "test", "符号"])
>>> data = "with"
>>> output = text.Lookup(vocab=vocab, unknown_token="test")(data)
>>> print(output)
2
Tutorial Examples: