mindspore.dataset.text.SentencePieceVocab

class mindspore.dataset.text.SentencePieceVocab[source]

SentencePiece object that is used for word segmentation.

classmethod from_dataset(dataset, col_names, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a dataset.

Parameters
  • dataset (Dataset) – Dataset from which to build the SentencePiece vocabulary.

  • col_names (list) – List of column names to read text from.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Percentage of characters covered by the model. A value of 0.9995 is recommended for languages with rich character sets, such as Japanese or Chinese, and 1.0 for languages with small character sets.

  • model_type (SentencePieceModel) – The desired subword algorithm. See SentencePieceModel for details on the available values.

  • params (dict) – A dictionary of extra training parameters. No additional parameters are currently supported; pass an empty dictionary.

Returns

SentencePieceVocab, vocab built from the dataset.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>>
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = SentencePieceVocab.from_dataset(dataset, ["text"], 5000, 0.9995,
...                                         SentencePieceModel.UNIGRAM, {})
>>> # Build tokenizer based on vocab
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=text.SPieceTokenizerOutType.STRING)
>>> txt = "Today is Tuesday."
>>> token = tokenizer(txt)
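
In a pipeline, the tokenizer built from the vocab is typically applied to the dataset with map. A minimal sketch, assuming the "text" column of the dataset above:

>>> # Tokenize every row of the dataset (sketch; assumes a "text" column)
>>> dataset = dataset.map(operations=tokenizer, input_columns=["text"])
>>> for data in dataset.create_dict_iterator(num_epochs=1, output_numpy=True):
...     print(data["text"])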
classmethod from_file(file_path, vocab_size, character_coverage, model_type, params)[source]

Build a SentencePiece object from a file.

Parameters
  • file_path (list) – List of paths to the files that contain the training corpus.

  • vocab_size (int) – Vocabulary size.

  • character_coverage (float) – Percentage of characters covered by the model. A value of 0.9995 is recommended for languages with rich character sets, such as Japanese or Chinese, and 1.0 for languages with small character sets.

  • model_type (SentencePieceModel) – The desired subword algorithm. See SentencePieceModel for details on the available values.

  • params (dict) – A dictionary of extra training parameters, whose supported keys are derived from the SentencePiece library; pass an empty dictionary if no extras are needed.

Returns

SentencePieceVocab, vocab built from the file.

Examples

>>> import mindspore.dataset.text as text
>>>
>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
>>> # Build tokenizer based on vocab model
>>> tokenizer = text.SentencePieceTokenizer(vocab, out_type=text.SPieceTokenizerOutType.STRING)
>>> txt = "Today is Friday."
>>> token = tokenizer(txt)
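
The tokenizer can also return subword IDs rather than string pieces by setting out_type to SPieceTokenizerOutType.INT; a minimal sketch using the vocab built above:

>>> # Emit subword IDs instead of string pieces (sketch)
>>> id_tokenizer = text.SentencePieceTokenizer(vocab, out_type=text.SPieceTokenizerOutType.INT)
>>> ids = id_tokenizer("Today is Friday.")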
classmethod save_model(vocab, path, filename)[source]

Save the model to the given file path.

Parameters
  • vocab (SentencePieceVocab) – The SentencePieceVocab object to save.

  • path (str) – Path to the directory where the model will be stored.

  • filename (str) – The name of the file.

Examples

>>> from mindspore.dataset.text import SentencePieceVocab, SentencePieceModel
>>> vocab = SentencePieceVocab.from_file(["/path/to/sentence/piece/vocab/file"], 5000, 0.9995,
...                                      SentencePieceModel.UNIGRAM, {})
>>> SentencePieceVocab.save_model(vocab, "./", "m.model")
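
A saved model file can later be loaded into a tokenizer directly by path, without rebuilding the vocab. A minimal sketch, assuming the "m.model" file saved above:

>>> import mindspore.dataset.text as text
>>>
>>> # Load the saved SentencePiece model file by path (sketch)
>>> tokenizer = text.SentencePieceTokenizer("./m.model", out_type=text.SPieceTokenizerOutType.STRING)
>>> token = tokenizer("Today is Tuesday.")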