mindspore.dataset.text.transforms.Ngram

class mindspore.dataset.text.transforms.Ngram(n, left_pad=('', 0), right_pad=('', 0), separator=' ')[source]

TensorOp to generate n-gram from a 1-D string Tensor.

Refer to https://en.wikipedia.org/wiki/N-gram#Examples for an overview of what n-gram is and how it works.

Parameters

n (list[int]) – n in n-gram, which is a list of positive integers. For example, if n=[4, 3], then the result would be a 4-gram followed by a 3-gram in the same tensor. If the number of words is not enough to make up for a n-gram, an empty string will be returned. For example, 3 grams on [“mindspore”, “best”] will result in an empty string produced.
left_pad (tuple, optional) – Padding performed on left side of the sequence shaped like (“pad_token”, pad_width). pad_width will be capped at n-1. For example, specifying left_pad=(“_”, 2) would pad left side of the sequence with “__” (default=None).
right_pad (tuple, optional) – Padding performed on right side of the sequence shaped like (“pad_token”, pad_width). pad_width will be capped at n-1. For example, specifying right_pad=(“_”, 2) would pad right side of the sequence with “__” (default=None).
separator (str, optional) – Symbol used to join strings together. For example. if 2-gram is [“mindspore”, “amazing”] with separator=”-”, the result would be [“mindspore-amazing”] (default=None, which will use whitespace as separator).

Examples

>>> ngram_op = text.Ngram(3, separator="-")
>>> output = ngram_op(["WildRose Country", "Canada's Ocean Playground", "Land of Living Skies"])
>>> # output
>>> # ["WildRose Country-Canada's Ocean Playground-Land of Living Skies"]
>>> # same ngram_op called through map
>>> text_file_dataset = text_file_dataset.map(operations=ngram_op)