mindspore.dataset.text.utils

The module text.utils provides general utility methods for NLP text processing. For example, you can use Vocab to build a vocabulary, and use to_bytes and to_str to encode and decode strings with a specified charset.

class mindspore.dataset.text.utils.JiebaMode(value)[source]

An enumeration for JiebaTokenizer. Valid enumeration values are: MIX, MP, HMM.
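
A minimal sketch of how this enumeration is passed to JiebaTokenizer; the dictionary file paths below are placeholders, not files provided by this module:

  >>> import mindspore.dataset.text as text
  >>> from mindspore.dataset.text import JiebaMode
  >>> # Assumed paths to the jieba HMM and MP dictionary files.
  >>> HMM_FILE = "hmm_model.utf8"
  >>> MP_FILE = "jieba.dict.utf8"
  >>> tokenizer_op = text.JiebaTokenizer(HMM_FILE, MP_FILE, mode=JiebaMode.MP)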

class mindspore.dataset.text.utils.NormalizeForm(value)[source]

An enumeration for NormalizeUTF8. Valid enumeration values are: NONE, NFC, NFKC, NFD, NFKD.
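
A minimal sketch of how this enumeration is passed to the NormalizeUTF8 operation:

  >>> import mindspore.dataset.text as text
  >>> from mindspore.dataset.text import NormalizeForm
  >>> # Apply NFKC Unicode normalization to each input string.
  >>> normalize_op = text.NormalizeUTF8(normalize_form=NormalizeForm.NFKC)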

class mindspore.dataset.text.utils.Vocab[source]

Vocab object that is used to look up a word.

It contains a map that associates each word (str) with an id (int).

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=None)[source]

Build a vocab from a dataset.

This collects all unique words in the dataset and returns a vocab within the frequency range specified by freq_range. The user is warned if no words fall into the frequency range. Words in the vocab are ordered from highest to lowest frequency; words with the same frequency are ordered lexicographically.

Parameters
  • dataset (Dataset) – dataset to build vocab from.

  • columns (list of str, optional) – column names to get words from (default=None, where all columns will be used; an error is raised if any column is not of string type).

  • freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words within the frequency range are kept, where 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is the same as min_frequency=1, and max_frequency > total_words is the same as max_frequency = total_words. min_frequency and max_frequency can each be None, which corresponds to 0 and total_words respectively (default=None, all words are included).

  • top_k (int, optional) – number of most frequent words to build into the vocab; must be greater than 0. top_k is applied after freq_range. If fewer than top_k words remain, all of them are taken (default=None, all words are included).

  • special_tokens (list, optional) – a list of strings, each one a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is None, special_tokens will be prepended (default=None).

Returns

Vocab, Vocab object built from dataset.
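
A minimal usage sketch, assuming a plain-text corpus file "corpus.txt" with one sentence per line; the file and the surrounding pipeline steps are assumptions, not part of this API:

  >>> import mindspore.dataset as ds
  >>> import mindspore.dataset.text as text
  >>> # Tokenize each line of the assumed corpus into words before counting frequencies.
  >>> data = ds.TextFileDataset("corpus.txt", shuffle=False)
  >>> data = data.map(operations=text.WhitespaceTokenizer(), input_columns=["text"])
  >>> vocab = text.Vocab.from_dataset(data, columns=["text"], freq_range=None, top_k=None,
  ...                                 special_tokens=["<pad>", "<unk>"], special_first=True)
  >>> # The resulting vocab is typically consumed by the Lookup operation to map tokens to ids.
  >>> data = data.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])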

classmethod from_dict(word_dict)[source]

Build a vocab object from a dict.

Parameters

word_dict (dict) – a dict containing word, id pairs, where word is a str and id is an int. Ids are recommended to start from 0 and be contiguous. A ValueError is raised if any id is negative.
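
A minimal sketch; the word-to-id pairs below are illustrative only:

  >>> import mindspore.dataset.text as text
  >>> # Ids start from 0 and are contiguous, as recommended.
  >>> vocab = text.Vocab.from_dict({"<pad>": 0, "<unk>": 1, "home": 2, "world": 3})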

classmethod from_file(file_path, delimiter=None, vocab_size=None, special_tokens=None, special_first=None)[source]

Build a vocab object from a file containing a word list.

Parameters
  • file_path (str) – path to the file which contains the vocab list.

  • delimiter (str, optional) – a delimiter used to break up each line in the file; the first element on each line is taken as the word (default=None, the whole line is taken as the word).

  • vocab_size (int, optional) – number of words to read from file_path (default=None, all words are taken).

  • special_tokens (list, optional) – a list of strings, each one a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is None, special_tokens will be prepended (default=None).
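
A minimal sketch; "vocab.txt" is an assumed file with one word per line:

  >>> import mindspore.dataset.text as text
  >>> # Prepend "<pad>" and "<unk>" before the words read from the file.
  >>> vocab = text.Vocab.from_file("vocab.txt", special_tokens=["<pad>", "<unk>"],
  ...                              special_first=True)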

classmethod from_list(word_list, special_tokens=None, special_first=None)[source]

Build a vocab object from a list of words.

Parameters
  • word_list (list) – a list of strings, where each element is a word.

  • special_tokens (list, optional) – a list of strings, each one a special token, for example special_tokens=["<pad>", "<unk>"] (default=None, no special tokens will be added).

  • special_first (bool, optional) – whether special_tokens will be prepended or appended to the vocab. If special_tokens is specified and special_first is None, special_tokens will be prepended (default=None).
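
A minimal sketch; the trailing Lookup call illustrates how a vocab is typically consumed in a pipeline:

  >>> import mindspore.dataset.text as text
  >>> vocab = text.Vocab.from_list(["home", "behind", "the", "world"],
  ...                              special_tokens=["<pad>", "<unk>"], special_first=True)
  >>> # Map tokens to ids; out-of-vocabulary tokens fall back to "<unk>".
  >>> lookup_op = text.Lookup(vocab, "<unk>")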

mindspore.dataset.text.utils.to_bytes(array, encoding='utf8')[source]

Convert a numpy array of str to an array of bytes by encoding each element with the given charset.

Parameters
  • array (numpy.ndarray) – Array of type str representing strings.

  • encoding (str) – The charset to use for encoding (default='utf8').

Returns

numpy.ndarray, numpy array of bytes.
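
A minimal sketch:

  >>> import numpy as np
  >>> import mindspore.dataset.text as text
  >>> str_array = np.array(["hello", "world"])
  >>> # Encode each element with UTF-8, producing a numpy array of bytes.
  >>> byte_array = text.to_bytes(str_array, encoding='utf8')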

mindspore.dataset.text.utils.to_str(array, encoding='utf8')[source]

Convert a numpy array of bytes to an array of str by decoding each element with the given charset.

Parameters
  • array (numpy.ndarray) – Array of type bytes representing strings.

  • encoding (str) – The charset to use for decoding (default='utf8').

Returns

numpy.ndarray, numpy array of str.
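
A minimal sketch; to_str is commonly applied to string tensors retrieved from a dataset iterator, which are returned as numpy arrays of bytes:

  >>> import numpy as np
  >>> import mindspore.dataset.text as text
  >>> byte_array = np.array([b"hello", b"world"])
  >>> # Decode each element with UTF-8 back into a numpy array of str.
  >>> str_array = text.to_str(byte_array, encoding='utf8')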