Class Vocab

Class Documentation

class Vocab

A Vocab object stores pairs of words and IDs.

Note

It contains a map from each word (str) to an ID (int) and also supports the reverse lookup.

Public Functions

WordIdType TokensToIds(const WordType &word) const

Look up the ID of a word. If the word does not exist in the vocab, return -1.

Parameters

word – Word to be looked up.

Returns

ID of the word in the vocab.

Example

// lookup, convert token to id
auto single_index = vocab->TokensToIds("home");
single_index = vocab->TokensToIds("hello");

std::vector<WordIdType> TokensToIds(const std::vector<WordType> &words) const

Look up the IDs of several words. Any word that does not exist in the vocab is mapped to -1.

Parameters

words – Words to be looked up.

Returns

IDs of the words in the vocab, in the same order as the input.
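
The lookup rule described above (known word gives its ID, unknown word gives -1) can be illustrated with a minimal standalone sketch. It does not use the Vocab class; `WordMap` and `LookupIds` are hypothetical names introduced only for this illustration, and the map simply stands in for the word-to-ID map a vocab holds.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the word-to-id map held by a vocab.
using WordMap = std::unordered_map<std::string, int32_t>;

// Returns one id per input word, in input order; absent words map to -1.
std::vector<int32_t> LookupIds(const WordMap &word2id,
                               const std::vector<std::string> &words) {
  std::vector<int32_t> ids;
  ids.reserve(words.size());
  for (const auto &word : words) {
    auto it = word2id.find(word);
    ids.push_back(it == word2id.end() ? -1 : it->second);
  }
  return ids;
}
```

Returning a sentinel instead of failing lets a caller look up a whole batch and decide afterwards how to treat unknown words.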

Example

// lookup multiple tokens
auto multi_indexs = vocab->TokensToIds(std::vector<std::string>{"<pad>", "behind"});
std::vector<int32_t> expected_multi_indexs = {0, 4};
// a word that is not in the vocab ("apple") maps to -1
multi_indexs = vocab->TokensToIds(std::vector<std::string>{"<pad>", "apple"});
expected_multi_indexs = {0, -1};

WordType IdsToTokens(const WordIdType &id)

Look up the word for an ID. If the ID does not exist in the vocab, return an empty string.

Parameters

id – ID to be looked up.

Returns

The word corresponding to the ID.

Example

// reverse lookup, convert id to token
auto single_word = vocab->IdsToTokens(2);
single_word = vocab->IdsToTokens(-1);

std::vector<WordType> IdsToTokens(const std::vector<WordIdType> &ids)

Look up the words for several IDs. Any ID that does not exist in the vocab is mapped to an empty string.

Parameters

ids – IDs to be looked up.

Returns

The words corresponding to the IDs, in the same order as the input.
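
The reverse lookup can be sketched in the same standalone way. The code below assumes only a word-to-ID map is available and inverts it first; `InvertMap` and `LookupWords` are hypothetical helpers for illustration, and how the Vocab class stores its reverse mapping internally is not covered here.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Builds an id-to-word map from a word-to-id map (ids assumed unique).
std::unordered_map<int32_t, std::string> InvertMap(
    const std::unordered_map<std::string, int32_t> &word2id) {
  std::unordered_map<int32_t, std::string> id2word;
  for (const auto &kv : word2id) {
    id2word.emplace(kv.second, kv.first);
  }
  return id2word;
}

// Returns one word per input id, in input order; unknown ids map to "".
std::vector<std::string> LookupWords(
    const std::unordered_map<int32_t, std::string> &id2word,
    const std::vector<int32_t> &ids) {
  std::vector<std::string> words;
  words.reserve(ids.size());
  for (int32_t id : ids) {
    auto it = id2word.find(id);
    words.push_back(it == id2word.end() ? std::string() : it->second);
  }
  return words;
}
```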

Example

// reverse lookup multiple ids
auto multi_words = vocab->IdsToTokens(std::vector<int32_t>{0, 4});
std::vector<std::string> expected_multi_words = {"<pad>", "behind"};
// an id that is not in the vocab (99) maps to an empty string
multi_words = vocab->IdsToTokens(std::vector<int32_t>{0, 99});
expected_multi_words = {"<pad>", ""};

explicit Vocab(std::unordered_map<WordType, WordIdType> map)

Constructor. It should not be called directly; it cannot be made private because std::make_unique() needs access to it.

Parameters

map – Sanitized word2id map.

void AppendWord(const std::string &word)

Add one word to the vocab. Its index is assigned automatically by incrementing.

Parameters

word – Word to be added. The word is skipped if it already exists.
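
A minimal sketch of the append rule follows. It assumes existing IDs are contiguous and start at 0, so the next free ID equals the current map size; that assumption and the name `AppendWordSketch` are illustrative and are not taken from the library.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Adds `word` with the next free id; does nothing if the word is present.
// Assumption: existing ids are 0..size-1, so the next id is the map size.
void AppendWordSketch(std::unordered_map<std::string, int32_t> *word2id,
                      const std::string &word) {
  if (word2id->find(word) != word2id->end()) {
    return;  // already known: skip
  }
  const int32_t next_id = static_cast<int32_t>(word2id->size());
  word2id->emplace(word, next_id);
}
```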

inline const std::unordered_map<WordType, WordIdType> &GetVocab() const

Return a read-only reference to the vocab as an unordered_map.

Returns

An unordered_map of word2id.

Vocab() = default

Constructor.

~Vocab() = default

Destructor.

Public Static Functions

static Status BuildFromUnorderedMap(const std::unordered_map<WordType, WordIdType> &words, std::shared_ptr<Vocab> *vocab)

Build a vocab from an unordered_map. IDs must be unique and contiguous.

Parameters
  • words[in] An unordered_map containing word-ID pairs.

  • vocab[out] A vocab object.

Returns

Status code.
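
The requirement that IDs have no duplicates and no gaps can be checked before building, for example with the standalone sketch below. `IdsAreUniqueAndContiguous` is a hypothetical helper, not part of the Vocab API, and it does not check which value the IDs start at, since that is not stated here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// True if the ids, once sorted, increase by exactly 1 at every step,
// which rules out both duplicates and gaps. An empty map passes.
bool IdsAreUniqueAndContiguous(
    const std::unordered_map<std::string, int32_t> &words) {
  std::vector<int32_t> ids;
  ids.reserve(words.size());
  for (const auto &kv : words) {
    ids.push_back(kv.second);
  }
  std::sort(ids.begin(), ids.end());
  for (std::size_t i = 1; i < ids.size(); ++i) {
    // Widen to 64 bits so the subtraction cannot overflow.
    if (static_cast<int64_t>(ids[i]) - ids[i - 1] != 1) {
      return false;
    }
  }
  return true;
}
```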

Example

// Build a map
std::unordered_map<std::string, int32_t> dict;
dict["banana"] = 0;
dict["apple"] = 1;
dict["cat"] = 2;
dict["dog"] = 3;
// Build vocab from map
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromUnorderedMap(dict, &vocab);

static Status BuildFromVector(const std::vector<WordType> &words, const std::vector<WordType> &special_tokens, bool prepend_special, std::shared_ptr<Vocab> *vocab)

Build a vocab from a C++ vector. The assigned IDs are unique and contiguous.

Parameters
  • words[in] A vector of string containing words.

  • special_tokens[in] A vector of string containing special tokens.

  • prepend_special[in] If true, special_tokens are prepended to the vocab; if false, they are appended.

  • vocab[out] A vocab object.

Returns

Status code.
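
How the position of the special tokens affects the resulting IDs can be sketched as follows. The sketch assumes IDs are assigned by position starting at 0 and that all words and special tokens are distinct; both points are assumptions for illustration, and `AssignIds` is a hypothetical helper rather than the library's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Concatenates special tokens and words in the requested order, then
// assigns ids 0, 1, 2, ... by position. Inputs are assumed distinct.
std::unordered_map<std::string, int32_t> AssignIds(
    const std::vector<std::string> &words,
    const std::vector<std::string> &special_tokens, bool prepend_special) {
  std::vector<std::string> ordered;
  ordered.reserve(words.size() + special_tokens.size());
  if (prepend_special) {
    ordered.insert(ordered.end(), special_tokens.begin(), special_tokens.end());
  }
  ordered.insert(ordered.end(), words.begin(), words.end());
  if (!prepend_special) {
    ordered.insert(ordered.end(), special_tokens.begin(), special_tokens.end());
  }
  std::unordered_map<std::string, int32_t> word2id;
  for (std::size_t i = 0; i < ordered.size(); ++i) {
    word2id.emplace(ordered[i], static_cast<int32_t>(i));
  }
  return word2id;
}
```

Under these assumptions, prepending places the special tokens at the lowest IDs, which is the usual reason to prepend a padding token.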

Example

// Build vocab from a vector of words, special tokens are prepended to vocab
std::vector<std::string> list = {"apple", "banana", "cat", "dog", "egg"};
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromVector(list, {"<unk>"}, true, &vocab);

static Status BuildFromFile(const std::string &path, const std::string &delimiter, int32_t vocab_size, const std::vector<WordType> &special_tokens, bool prepend_special, std::shared_ptr<Vocab> *vocab)

Build a vocab from a vocab file. IDs are assigned automatically.

Parameters
  • path[in] Path to the vocab file. Each line of the file is treated as one word (spaces included).

  • delimiter[in] Delimiter used to split each line. Characters after the delimiter are discarded.

  • vocab_size[in] Number of lines to be read from file.

  • special_tokens[in] A vector of string containing special tokens.

  • prepend_special[in] If true, special_tokens are prepended to the vocab; if false, they are appended.

  • vocab[out] A vocab object.

Returns

Status code.
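
The per-line handling described by the parameters above can be sketched on an in-memory stream. Several points are assumptions made for this illustration only: a negative vocab_size is treated as "no limit", the delimiter is matched as a whole substring, and `ReadWords` is a hypothetical helper that is not part of the Vocab API.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads up to `vocab_size` lines (all lines if negative, by assumption).
// Each line is cut at the first occurrence of `delimiter`; the rest of
// the line is discarded. Spaces inside the kept part are preserved.
std::vector<std::string> ReadWords(std::istream &in,
                                   const std::string &delimiter,
                                   int32_t vocab_size) {
  std::vector<std::string> words;
  std::string line;
  while ((vocab_size < 0 ||
          static_cast<int32_t>(words.size()) < vocab_size) &&
         std::getline(in, line)) {
    if (!delimiter.empty()) {
      const std::string::size_type pos = line.find(delimiter);
      if (pos != std::string::npos) {
        line.erase(pos);  // drop the delimiter and everything after it
      }
    }
    words.push_back(line);
  }
  return words;
}
```

The words returned by such a reader would then receive IDs in order, with the special tokens placed before or after them depending on prepend_special.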

Example

// Build vocab from local file
std::string vocab_dir = datasets_root_path_ + "/testVocab/vocab_list.txt";
std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
Status s = Vocab::BuildFromFile(vocab_dir, ",", -1, {"<pad>", "<unk>"}, true, &vocab);