Class Vocab
- Defined in File text.h 
Class Documentation
- 
class Vocab
- Vocab object that is used to save pairs of words and IDs.
- Note
- It contains a map that maps each word (str) to an ID (int), or the reverse.
- Public Functions
- 
WordIdType TokensToIds(const WordType &word) const
- Look up the ID of a word. If the word does not exist in the vocab, return -1.
- Parameters
- word – Word to be looked up. 
- Returns
- ID of the word in the vocab.
- Example
- // lookup, convert token to id
  auto single_index = vocab->TokensToIds("home");
  single_index = vocab->TokensToIds("hello");
 
 - 
std::vector<WordIdType> TokensToIds(const std::vector<WordType> &words) const
- Look up the IDs of several words. For each word that does not exist in the vocab, return -1 in its position.
- Parameters
- words – Words to be looked up. 
- Returns
- IDs of the words in the vocab.
- Example
- // lookup multiple tokens
  auto multi_indexs = vocab->TokensToIds(std::vector<std::string>{"<pad>", "behind"});
  std::vector<int32_t> expected_multi_indexs = {0, 4};
  multi_indexs = vocab->TokensToIds(std::vector<std::string>{"<pad>", "apple"});
  expected_multi_indexs = {0, -1};
 
 - 
WordType IdsToTokens(const WordIdType &id)
- Look up the word for an ID. If the ID does not exist in the vocab, return an empty string.
- Parameters
- id – ID to be looked up. 
- Returns
- The word corresponding to the ID.
- Example
- // reverse lookup, convert id to token
  auto single_word = vocab->IdsToTokens(2);
  single_word = vocab->IdsToTokens(-1);
 
 - 
std::vector<WordType> IdsToTokens(const std::vector<WordIdType> &ids)
- Look up the words for several IDs. For each ID that does not exist in the vocab, return an empty string in its position.
- Parameters
- ids – IDs to be looked up. 
- Returns
- The words corresponding to the IDs.
- Example
- // reverse lookup multiple ids
  auto multi_words = vocab->IdsToTokens(std::vector<int32_t>{0, 4});
  std::vector<std::string> expected_multi_words = {"<pad>", "behind"};
  multi_words = vocab->IdsToTokens(std::vector<int32_t>{0, 99});
  expected_multi_words = {"<pad>", ""};
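The lookup rules described for TokensToIds and IdsToTokens (-1 for an unknown word, an empty string for an unknown ID) can be illustrated with a small stand-in class. This is a minimal sketch under stated assumptions, not the library's implementation: the class name SketchVocab and its members word2id_ and id2word_ are hypothetical, and std::string and int32_t are assumed for WordType and WordIdType, as the examples above suggest.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration; this is not the library's Vocab class.
class SketchVocab {
 public:
  explicit SketchVocab(std::unordered_map<std::string, int32_t> word2id)
      : word2id_(std::move(word2id)) {
    // Build the reverse map once so ID lookups do not scan the whole map.
    for (const auto &entry : word2id_) {
      id2word_[entry.second] = entry.first;
    }
  }

  // Single word: an unknown word yields -1.
  int32_t TokensToIds(const std::string &word) const {
    const auto it = word2id_.find(word);
    return it == word2id_.end() ? -1 : it->second;
  }

  // Several words: each unknown word yields -1 in its own position.
  std::vector<int32_t> TokensToIds(const std::vector<std::string> &words) const {
    std::vector<int32_t> ids;
    ids.reserve(words.size());
    for (const auto &word : words) {
      ids.push_back(TokensToIds(word));
    }
    return ids;
  }

  // Single ID: an unknown ID yields an empty string.
  std::string IdsToTokens(int32_t id) const {
    const auto it = id2word_.find(id);
    return it == id2word_.end() ? std::string() : it->second;
  }

  // Several IDs: each unknown ID yields an empty string in its own position.
  std::vector<std::string> IdsToTokens(const std::vector<int32_t> &ids) const {
    std::vector<std::string> words;
    words.reserve(ids.size());
    for (const int32_t id : ids) {
      words.push_back(IdsToTokens(id));
    }
    return words;
  }

 private:
  std::unordered_map<std::string, int32_t> word2id_;
  std::unordered_map<int32_t, std::string> id2word_;
};
```

Because unknown entries are reported in place rather than dropped, the output vector always has the same length as the input vector, so callers can line results up with their inputs by index.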
 
 - 
explicit Vocab(std::unordered_map<WordType, WordIdType> map)
- Constructor. It should not be called directly, but it cannot be private because of std::make_unique().
- Parameters
- map – Sanitized word2id map. 
 
 - 
void AppendWord(const std::string &word)
- Add one word to the vocab; its index is incremented automatically.
- Parameters
- word – Word to be added. The word is skipped if it already exists. 
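The append rule (skip a word that already exists, otherwise give it the next index) can be sketched on a plain map. This is an illustration only: the helper name AppendWordSketch is hypothetical, and taking the next index to be the current map size is an assumption that holds only when the existing IDs run from 0 without gaps.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical helper, not part of the library.
// Assumption: existing IDs are 0..size-1, so the next free ID equals size().
void AppendWordSketch(std::unordered_map<std::string, int32_t> *word2id,
                      const std::string &word) {
  if (word2id->count(word) > 0) {
    return;  // The word already exists: keep its ID and skip.
  }
  const int32_t next_id = static_cast<int32_t>(word2id->size());
  word2id->emplace(word, next_id);
}
```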
 
 - 
inline const std::unordered_map<WordType, WordIdType> &GetVocab() const
- Return a read-only vocab as an unordered_map.
- Returns
- An unordered_map of word2id. 
 
 - 
Vocab() = default
- Constructor. 
 - 
~Vocab() = default
- Destructor. 
 - Public Static Functions
 - Build a vocab from an unordered_map. IDs should have no duplicates and be contiguous.
- Parameters
- words – [in] An unordered_map containing word-ID pairs. 
- vocab – [out] A vocab object. 
 
- Returns
- Status code.
- Example
- // Build a map
  std::unordered_map<std::string, int32_t> dict;
  dict["banana"] = 0;
  dict["apple"] = 1;
  dict["cat"] = 2;
  dict["dog"] = 3;
  // Build vocab from map
  std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
  Status s = Vocab::BuildFromUnorderedMap(dict, &vocab);
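The requirement that IDs have no duplicates and be contiguous can be checked before building a vocab from a map. The sketch below is one possible reading of that requirement, not the library's own validation: the function name IdsAreUniqueAndContiguous is hypothetical, and it assumes that "contiguous" means the IDs form a gap-free range, without requiring the range to start at 0.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <unordered_map>

// Hypothetical pre-check, not part of the library.
// Assumption: "contiguous" means max - min + 1 equals the number of IDs.
bool IdsAreUniqueAndContiguous(
    const std::unordered_map<std::string, int32_t> &word2id) {
  if (word2id.empty()) {
    return true;
  }
  std::set<int32_t> ids;
  for (const auto &entry : word2id) {
    // insert().second is false when the same ID was already seen.
    if (!ids.insert(entry.second).second) {
      return false;
    }
  }
  // Widen to 64 bits so the subtraction cannot overflow.
  const int64_t span = static_cast<int64_t>(*ids.rbegin()) -
                       static_cast<int64_t>(*ids.begin()) + 1;
  return span == static_cast<int64_t>(ids.size());
}
```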
 
 - Build a vocab from a C++ vector. IDs have no duplicates and are contiguous.
- Parameters
- words – [in] A vector of strings containing words. 
- special_tokens – [in] A vector of strings containing special tokens. 
- prepend_special – [in] Whether the special_tokens are prepended (true) or appended (false) to the vocab. 
- vocab – [out] A vocab object. 
 
- Returns
- Status code.
- Example
- // Build vocab from a vector of words, special tokens are prepended to vocab
  std::vector<std::string> list = {"apple", "banana", "cat", "dog", "egg"};
  std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
  Status s = Vocab::BuildFromVector(list, {"<unk>"}, true, &vocab);
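The effect of prepend_special on ID assignment can be sketched as follows. This is an illustration under assumptions rather than the library's code: the function name AssignIdsSketch is hypothetical, IDs are assumed to start at 0 and follow the final token order, and the inputs are assumed to contain no duplicate tokens.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of the library.
// Assumptions: IDs start at 0 and follow token order; inputs have no duplicates.
std::unordered_map<std::string, int32_t> AssignIdsSketch(
    const std::vector<std::string> &words,
    const std::vector<std::string> &special_tokens, bool prepend_special) {
  // Lay the tokens out in their final order first.
  std::vector<std::string> ordered;
  ordered.reserve(words.size() + special_tokens.size());
  if (prepend_special) {
    ordered.insert(ordered.end(), special_tokens.begin(), special_tokens.end());
  }
  ordered.insert(ordered.end(), words.begin(), words.end());
  if (!prepend_special) {
    ordered.insert(ordered.end(), special_tokens.begin(), special_tokens.end());
  }

  // The position in the final order becomes the ID.
  std::unordered_map<std::string, int32_t> word2id;
  for (std::size_t i = 0; i < ordered.size(); ++i) {
    word2id.emplace(ordered[i], static_cast<int32_t>(i));
  }
  return word2id;
}
```

Under these assumptions, prepending keeps special tokens at fixed low IDs regardless of how many words follow, which is convenient when a padding or unknown token is expected at a known index.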
 
 - Build a vocab from a vocab file. IDs are assigned automatically.
- Parameters
- path – [in] Path to the vocab file. Each line in the file is treated as a word (including spaces). 
- delimiter – [in] Delimiter used to break each line. Characters after the delimiter are discarded. 
- vocab_size – [in] Number of lines to be read from the file. 
- special_tokens – [in] A vector of strings containing special tokens. 
- prepend_special – [in] Whether the special_tokens are prepended (true) or appended (false) to the vocab. 
- vocab – [out] A vocab object. 
 
- Returns
- Status code.
- Example
- // Build vocab from local file
  std::string vocab_dir = datasets_root_path_ + "/testVocab/vocab_list.txt";
  std::shared_ptr<Vocab> vocab = std::make_shared<Vocab>();
  Status s = Vocab::BuildFromFile(vocab_dir, ",", -1, {"<pad>", "<unk>"}, true, &vocab);
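How the delimiter and vocab_size parameters shape the words read from a file can be sketched on an in-memory string. This is not the library's parser: the function name ReadWordsSketch is hypothetical, a negative vocab_size is assumed to mean "no limit" (the example above passes -1), and an empty delimiter is assumed to keep the whole line.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper, not part of the library.
// Assumptions: vocab_size < 0 means no limit; an empty delimiter keeps the line.
std::vector<std::string> ReadWordsSketch(const std::string &file_contents,
                                         const std::string &delimiter,
                                         int32_t vocab_size) {
  std::istringstream in(file_contents);
  std::vector<std::string> words;
  std::string line;
  while ((vocab_size < 0 ||
          words.size() < static_cast<std::size_t>(vocab_size)) &&
         std::getline(in, line)) {
    if (!delimiter.empty()) {
      // Keep only the text before the first delimiter.
      const std::size_t pos = line.find(delimiter);
      if (pos != std::string::npos) {
        line.erase(pos);
      }
    }
    words.push_back(line);
  }
  return words;
}
```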
 
 