mindspore.dataset.text.Vocab

class mindspore.dataset.text.Vocab

Vocab object that stores pairs of words and ids.

It contains a map from each word (str) to an id (int), and the reverse map from id to word.

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)

Build a vocab from a dataset.

This collects all unique words in the dataset and returns a vocab restricted to the frequency range given in freq_range. A warning is issued if no word falls within that range. Words in the vocab are ordered from highest to lowest frequency; words with the same frequency are ordered lexicographically.

Parameters
  • dataset (Dataset) – dataset to build vocab from.

  • columns (list[str], optional) – Names of the columns to get words from (default=None, all columns are used; an error is raised if any column is not of string type).

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range are kept, where 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1, and max_frequency > total_words is the same as max_frequency=total_words. Either value can be None, which corresponds to 0 and total_words respectively (default=None, all words are included).

  • top_k (int, optional) – Number of most frequent words to build into the vocab; must be greater than 0. top_k is applied after freq_range. If fewer than top_k words are available, all of them are taken (default=None, all words are included).

  • special_tokens (list, optional) – A list of strings, each one a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens are added).

  • special_first (bool, optional) – Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is True, special_tokens are prepended; otherwise they are appended (default=True).

Returns

Vocab, vocab built from the dataset.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.Vocab.from_dataset(dataset, ["text"], freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> dataset = dataset.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])
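
The selection rules described for from_dataset (frequency filter, then ordering, then top_k) can be illustrated with a small pure-Python sketch. This is not MindSpore's implementation: the helper name select_words is hypothetical, and treating total_words as the length of the input word list is an assumption made only for this illustration.

```python
from collections import Counter


def select_words(words, freq_range=None, top_k=None):
    """Hypothetical stand-in that mimics the documented selection rules."""
    counts = Counter(words)
    min_freq, max_freq = freq_range if freq_range is not None else (None, None)
    # None bounds fall back to 0 and total_words (assumed here: len(words)).
    min_freq = 0 if min_freq is None else min_freq
    max_freq = len(words) if max_freq is None else max_freq
    kept = [(w, c) for w, c in counts.items() if min_freq <= c <= max_freq]
    # Highest frequency first; ties are broken lexicographically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    # top_k is applied after the frequency filter.
    if top_k is not None:
        kept = kept[:top_k]
    return [w for w, _ in kept]


words = "the cat sat on the mat the cat".split()
selected = select_words(words, freq_range=(1, 3), top_k=3)
print(selected)  # ['the', 'cat', 'mat']
```

Here "mat", "on" and "sat" all occur once, so the lexicographic tie-break decides that "mat" takes the last of the three slots.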

classmethod from_dict(word_dict)

Build a vocab object from a dict.

Parameters

word_dict (dict) – Dict containing word and id pairs, where each word should be a str and each id an int. Ids are recommended to start from 0 and be continuous. A ValueError is raised if any id is negative.

Returns

Vocab, vocab built from the dict.

Examples

>>> vocab = text.Vocab.from_dict({"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6})
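
Before calling from_dict, it can be useful to check a dict against the constraints stated above. The following is a minimal sketch in plain Python; the helper name ids_are_contiguous_from_zero is hypothetical and is not part of MindSpore.

```python
def ids_are_contiguous_from_zero(word_dict):
    """Hypothetical pre-check mirroring the documented from_dict constraints."""
    for word, idx in word_dict.items():
        # Negative ids are rejected, matching the documented ValueError.
        if idx < 0:
            raise ValueError(f"id for {word!r} is negative: {idx}")
    ids = sorted(word_dict.values())
    # Recommended layout: ids 0, 1, 2, ... with no gaps.
    return ids == list(range(len(ids)))


example = {"home": 3, "behind": 2, "the": 4, "world": 5, "<unk>": 6}
print(ids_are_contiguous_from_zero(example))  # False: ids start at 2
print(ids_are_contiguous_from_zero({"<unk>": 0, "home": 1, "the": 2}))  # True
```

The dict from the example above is valid input but does not follow the recommendation, because its ids start at 2 rather than 0.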

classmethod from_file(file_path, delimiter='', vocab_size=None, special_tokens=None, special_first=True)

Build a vocab object from a vocab file.

Parameters
  • file_path (str) – Path to the file which contains the vocab list.

  • delimiter (str, optional) – A delimiter used to split each line in the file; the first element is taken as the word (default="", the whole line is treated as a word).

  • vocab_size (int, optional) – Number of words to read from file_path (default=None, all words are taken).

  • special_tokens (list, optional) – A list of strings, each one a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens are added).

  • special_first (bool, optional) – Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is True, special_tokens are prepended; otherwise they are appended (default=True).

Returns

Vocab, vocab built from the file.

Examples

>>> # Assume vocab file contains the following content:
>>> # --- begin of file ---
>>> # apple,apple2
>>> # banana, 333
>>> # cat,00
>>> # --- end of file ---
>>>
>>> # Read file through this API and specify "," as delimiter.
>>> # The delimiter will break up each line in file, then the first element is taken to be the word.
>>> vocab = text.Vocab.from_file("/path/to/simple/vocab/file", ",", None, ["<pad>", "<unk>"], True)
>>>
>>> # Finally, there are 5 words in the vocab: "<pad>", "<unk>", "apple", "banana", "cat".
>>> vocabulary = vocab.vocab()
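
The line-splitting behaviour of delimiter and vocab_size can be sketched in plain Python. This is an illustration of the documented rules only, not the library's code; read_words is a hypothetical helper, and it takes an iterable of lines instead of a file path so that it runs without a file on disk.

```python
def read_words(lines, delimiter="", vocab_size=None):
    """Hypothetical helper: take the first field of each line as the word."""
    words = []
    for line in lines:
        # Stop once vocab_size words have been read.
        if vocab_size is not None and len(words) >= vocab_size:
            break
        line = line.rstrip("\n")
        # With an empty delimiter the whole line is the word.
        word = line.split(delimiter)[0] if delimiter else line
        words.append(word)
    return words


sample_lines = ["apple,apple2\n", "banana, 333\n", "cat,00\n"]
print(read_words(sample_lines, ","))                # ['apple', 'banana', 'cat']
print(read_words(sample_lines, ",", vocab_size=2))  # ['apple', 'banana']
print(read_words(sample_lines))                     # whole lines kept as words
```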

classmethod from_list(word_list, special_tokens=None, special_first=True)

Build a vocab object from a list of words.

Parameters
  • word_list (list) – A list of words, each of type str.

  • special_tokens (list, optional) – A list of strings, each one a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens are added).

  • special_first (bool, optional) – Whether special_tokens is prepended or appended to vocab. If special_tokens is specified and special_first is set to True, special_tokens will be prepended (default=True).

Returns

Vocab, vocab built from the list.

Examples

>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
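
The effect of special_first is easiest to see by comparing the two placements side by side. The sketch below is plain Python rather than MindSpore code; assign_ids is a hypothetical helper, and numbering the ids consecutively from 0 in list order is an assumption made for the illustration.

```python
def assign_ids(word_list, special_tokens=None, special_first=True):
    """Hypothetical helper showing where special tokens land in the vocab."""
    specials = list(special_tokens or [])
    words = list(word_list)
    ordered = specials + words if special_first else words + specials
    # Assumption: ids are assigned consecutively from 0 in this order.
    return {token: idx for idx, token in enumerate(ordered)}


prepended = assign_ids(["w1", "w2", "w3"], ["<unk>"], special_first=True)
appended = assign_ids(["w1", "w2", "w3"], ["<unk>"], special_first=False)
print(prepended)  # {'<unk>': 0, 'w1': 1, 'w2': 2, 'w3': 3}
print(appended)   # {'w1': 0, 'w2': 1, 'w3': 2, '<unk>': 3}
```

Prepending keeps the ids of special tokens stable regardless of how many ordinary words the vocab holds, which is convenient when a fixed padding or unknown-token id is expected downstream.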

ids_to_tokens(ids)

Converts a single index or a sequence of indices to a token or a sequence of tokens. If an id does not exist, an empty string is returned.

Parameters

ids (Union[int, list[int]]) – The token id (or token ids) to convert to tokens.

Returns

The decoded token(s).

Examples

>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> token = vocab.ids_to_tokens(0)

tokens_to_ids(tokens)

Converts a token string or a sequence of tokens to a single integer id or a sequence of ids. If a token does not exist, the id -1 is returned.

Parameters

tokens (Union[str, list[str]]) – One or several token(s) to convert to token id(s).

Returns

The token id or list of token ids.

Examples

>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> ids = vocab.tokens_to_ids(["w1", "w3"])
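
The fallback values documented for ids_to_tokens and tokens_to_ids (an empty string for an unknown id, -1 for an unknown token) can be mimicked with two dict lookups. This is a plain-Python sketch, not the library's implementation; the table contents are assumed for illustration, and the free functions below only share names with the methods.

```python
def tokens_to_ids(table, tokens):
    """Look up one token or a sequence of tokens; unknown tokens map to -1."""
    if isinstance(tokens, str):
        return table.get(tokens, -1)
    return [table.get(token, -1) for token in tokens]


def ids_to_tokens(table, ids):
    """Look up one id or a sequence of ids; unknown ids map to ''."""
    reverse = {idx: token for token, idx in table.items()}
    if isinstance(ids, int):
        return reverse.get(ids, "")
    return [reverse.get(idx, "") for idx in ids]


# Assumed word-to-id table, used only for this illustration.
table = {"<unk>": 0, "w1": 1, "w2": 2, "w3": 3}
print(tokens_to_ids(table, ["w1", "w3"]))  # [1, 3]
print(tokens_to_ids(table, "missing"))     # -1
print(ids_to_tokens(table, 0))             # <unk>
print(ids_to_tokens(table, [2, 99]))       # ['w2', '']
```

Because a missing token yields -1 rather than raising, callers that need a usable id for unknown words should map -1 to the id of a token such as "<unk>" themselves.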

vocab()

Get the vocabulary table as a dict.

Returns

A vocabulary consisting of word and id pairs.

Examples

>>> vocab = text.Vocab.from_list(["word_1", "word_2", "word_3", "word_4"])
>>> vocabulary_dict = vocab.vocab()