mindspore.dataset.TextBaseDataset.build_sentencepiece_vocab

TextBaseDataset.build_sentencepiece_vocab(columns, vocab_size, character_coverage, model_type, params)[source]

Function to create a SentencePieceVocab from the source dataset. The source dataset is expected to be a text dataset.

Parameters
  • columns (list[str]) – Column names to get words from.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Percentage of characters covered by the model; must be between 0.98 and 1.0. Good defaults are 0.9995 for languages with rich character sets like Japanese or Chinese, and 1.0 for languages with small character sets like English or Latin.

  • model_type (SentencePieceModel) – Model type. Choose from SentencePieceModel.UNIGRAM (default), SentencePieceModel.BPE, SentencePieceModel.CHAR or SentencePieceModel.WORD. The input sentence must be pre-tokenized when using SentencePieceModel.WORD.

  • params (dict) – Any extra optional parameters for the SentencePiece library, chosen according to your raw data.

Returns

SentencePieceVocab, vocab built from the dataset.

Examples

>>> import mindspore.dataset as ds
>>> from mindspore.dataset.text import SentencePieceModel
>>>
>>> # You can use any text dataset as the source; TextFileDataset is taken as an example.
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = dataset.build_sentencepiece_vocab(["text"], 5000, 0.9995, SentencePieceModel.UNIGRAM, {})