# 比较与torchtext.data.functional.numericalize_tokens_from_iterator的功能差异

## torchtext.data.functional.numericalize_tokens_from_iterator ```python torchtext.data.functional.numericalize_tokens_from_iterator( vocab, iterator, removed_tokens=None ) ``` 更多内容详见[torchtext.data.functional.numericalize_tokens_from_iterator](https://pytorch.org/text/0.10.0/data_functional.html#numericalize-tokens-from-iterator)。 ## mindspore.dataset.text.Lookup ```python class mindspore.dataset.text.Lookup( vocab, unknown_token=None, data_type=mstype.int32 ) ``` 更多内容详见[mindspore.dataset.text.Lookup](https://mindspore.cn/docs/zh-CN/r2.0.0-alpha/api_python/dataset_text/mindspore.dataset.text.Lookup.html#mindspore.dataset.text.Lookup)。 ## 使用方式 PyTorch：从分词迭代器中生成词汇表对应的id列表，输入为词汇与id对应的映射表、词汇迭代器，返回创建好的迭代器对象，可从中获取对应词汇的id。 MindSpore：依据词汇与id的映射表，查找词汇对应的id。 ## 代码示例 ```python import mindspore.dataset as ds from mindspore.dataset import text import torch as T from torchtext.data.functional import simple_space_split, numericalize_tokens_from_iterator # In MindSpore, return id of given word with looking up the vocab. Vocab_file_path = '/path/to/testVocab/vocab_list.txt' vocab = text.Vocab.from_file(Vocab_file_path, ",", None, ["", ""], True) lookup = text.Lookup(vocab) text_file_dataset_dir = '/path/to/testVocab/words.txt' text_file_dataset = ds.TextFileDataset(dataset_files=text_file_dataset_dir) text_file_dataset = text_file_dataset.map(operations=lookup, input_columns=["text"]) for d in text_file_dataset.create_dict_iterator(num_epochs=1, output_numpy=True): print(d["text"]) # Out: # 14 # 12 # 13 # 10 # 15 # 11 # In torch, return the ids iterator with looking up the vocab. vocab = {'Sentencepiece' : 0, 'encode' : 1, 'as' : 2, 'pieces' : 3} ids_iter = numericalize_tokens_from_iterator(vocab, simple_space_split(["Sentencepiece as pieces", "as pieces"])) for ids in ids_iter: print([num for num in ids]) # Out: # [0, 2, 3] # [2, 3] ```