mindformers.models.PreTrainedTokenizerFast
- class mindformers.models.PreTrainedTokenizerFast(*args, **kwargs)[source]
Base class for all fast tokenizers (wrapping HuggingFace tokenizers library).
Handles all the shared methods for tokenization and special tokens, as well as methods for downloading/caching/loading pretrained tokenizers and for adding tokens to the vocabulary.
This class also contains the added tokens in a unified way on top of all tokenizers, so we don't have to handle the specific vocabulary-augmentation methods of the various underlying dictionary structures (BPE, SentencePiece, …).
- Parameters
model_max_length (int, optional) – The maximum length (in number of tokens) for the inputs to the transformer model. Set when the tokenizer is loaded with from_pretrained() based on the model's max_model_input_sizes attribute. Default: 1e30.
padding_side (str, optional) – Specifies the side on which the model should have padding applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.
truncation_side (str, optional) – Specifies the side on which the model should have truncation applied. Options are ['right', 'left']. The default value is picked from the class attribute of the same name.
chat_template (str, optional) – A Jinja template string used to format lists of chat messages. Default: "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}".
model_input_names (List[str], optional) – Lists the names of inputs accepted by the forward pass of the model, such as "token_type_ids" or "attention_mask". Defaults to values picked from the class attribute of the same name. Default: None.
bos_token (Union[str, tokenizers.AddedToken], optional) – Represents the beginning of a sentence and is associated with self.bos_token and self.bos_token_id. Default: None.
eos_token (Union[str, tokenizers.AddedToken], optional) – Represents the end of a sentence and is associated with self.eos_token and self.eos_token_id. Default: None.
unk_token (Union[str, tokenizers.AddedToken], optional) – Represents an out-of-vocabulary token and is associated with self.unk_token and self.unk_token_id. Default: None.
sep_token (Union[str, tokenizers.AddedToken], optional) – A special token separating two different sentences in the same input (used by BERT, for example) and is associated with self.sep_token and self.sep_token_id. Default: None.
pad_token (Union[str, tokenizers.AddedToken], optional) – Used to make arrays of tokens the same size for batching purposes; it will be ignored by attention mechanisms and loss computation. It is associated with self.pad_token and self.pad_token_id. Default: None.
cls_token (Union[str, tokenizers.AddedToken], optional) – Represents the class of the input (used by BERT, for example) and is associated with self.cls_token and self.cls_token_id. Default: None.
mask_token (Union[str, tokenizers.AddedToken], optional) – Represents a masked token (used by masked-language-modeling pretraining objectives such as BERT) and is associated with self.mask_token and self.mask_token_id. Default: None.
additional_special_tokens (Union[tuple, list, tokenizers.AddedToken], optional) – Lists additional special tokens that are guaranteed to be skipped when decoding with skip_special_tokens set to True. They will be added at the end of the vocabulary if not already part of it. Default: None.
clean_up_tokenization_spaces (bool, optional) – Determines whether to clean up spaces that were added when splitting the input text during tokenization. Default: True.
split_special_tokens (bool, optional) – Specifies whether special tokens should be split during the tokenization process. This affects the internal state of the tokenizer. By default, special tokens are not split. For example, if <s> is the bos_token, then tokenizer.tokenize("<s>") = ['<s>']. If split_special_tokens = True, then tokenizer.tokenize("<s>") would result in ['<', 's', '>']. Default: False.
tokenizer_object (tokenizers.Tokenizer) – A tokenizers.Tokenizer object from tokenizers to instantiate from.
tokenizer_file (str) – A path to a local JSON file representing a previously serialized tokenizers.Tokenizer object from tokenizers.
- Returns
PreTrainedTokenizerFast instance.
Examples
>>> from transformers import LlamaTokenizerFast
>>>
>>> tokenizer = LlamaTokenizerFast(vocab_file="./llama2/tokenizer.model")
>>> tokenizer.encode("Hello this is a test")
[1, 15043, 445, 338, 263, 1243]
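Because mindformers mirrors the HuggingFace API (the example above already imports from transformers), the sketch below uses the transformers class of the same name to show the documented tokenizer_object parameter: wrapping a raw tokenizers.Tokenizer directly, with a tiny illustrative WordLevel vocabulary that is purely an assumption for the demo.

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# Tiny illustrative vocabulary; real vocabularies come from trained tokenizers.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
backend = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()

# Wrap the raw tokenizers.Tokenizer via the tokenizer_object parameter.
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")
ids = tokenizer.encode("hello world")  # [1, 2]
```

This is the in-memory counterpart of the tokenizer_file parameter, which loads the same backend from a serialized JSON file instead.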
- property added_tokens_decoder: Dict[int, tokenizers.AddedToken]
Returns the added tokens in the vocabulary as a dictionary of index to AddedToken.
- Returns
A dict, the added tokens.
- property added_tokens_encoder: Dict[str, int]
Returns the sorted mapping from string to index. The added tokens encoder is cached for performance optimisation in self._added_tokens_encoder for the slow tokenizers.
- Returns
A dict, the added tokens.
- convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False)[source]
Converts a single index or a sequence of indices to a token or a sequence of tokens, using the vocabulary and added tokens.
- convert_tokens_to_ids(tokens: Union[str, List[str]])[source]
Converts a token string (or a sequence of tokens) to a single integer id (or a sequence of ids), using the vocabulary.
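The two conversion methods are inverses of each other, and both accept either a scalar or a sequence. A sketch with an assumed toy vocabulary, using the HuggingFace transformers class of the same name:

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

backend = Tokenizer(models.WordLevel({"[UNK]": 0, "hi": 1, "there": 2},
                                     unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")

ids = tokenizer.convert_tokens_to_ids(["hi", "there"])  # [1, 2]
tokens = tokenizer.convert_ids_to_tokens(ids)           # ['hi', 'there']
single = tokenizer.convert_tokens_to_ids("hi")          # 1 (scalar in, scalar out)
```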
- get_added_vocab()[source]
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns
A dict, the added tokens.
- num_special_tokens_to_add(pair: bool = False)[source]
Returns the number of added tokens when encoding a sequence with special tokens.
Note
This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Parameters
pair (bool, optional) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence. Default: False.
- Returns
int, the number of special tokens added to sequences.
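The count depends on the tokenizer's post-processor. In the hedged sketch below (HuggingFace transformers class of the same name, with an assumed BERT-style template attached via tokenizers.processors), a single sequence gains [CLS] and [SEP], and a pair gains one more [SEP]:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors
from transformers import PreTrainedTokenizerFast

vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "hi": 3}
backend = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
# BERT-style post-processor: wraps sequences with [CLS] / [SEP].
backend.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 0), ("[SEP]", 1)],
)
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]",
                                    cls_token="[CLS]", sep_token="[SEP]")

n_single = tokenizer.num_special_tokens_to_add()         # 2: [CLS] and [SEP]
n_pair = tokenizer.num_special_tokens_to_add(pair=True)  # 3: [CLS], [SEP], [SEP]
```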
- set_truncation_and_padding(padding_strategy: PaddingStrategy, truncation_strategy: TruncationStrategy, max_length: int, stride: int, pad_to_multiple_of: Optional[int])[source]
Define the truncation and the padding strategies for fast tokenizers (provided by HuggingFace tokenizers library) and restore the tokenizer settings after.
The provided tokenizer should have no padding / truncation strategy active before the managed section. If your tokenizer had a padding / truncation strategy set beforehand, it will be reset to no padding / truncation when exiting the managed section.
- Parameters
padding_strategy (PaddingStrategy) – The kind of padding that will be applied to the input.
truncation_strategy (TruncationStrategy) – The kind of truncation that will be applied to the input.
max_length (int) – The maximum size of a sequence.
stride (int) – The stride to use when handling overflow.
pad_to_multiple_of (int, optional) – If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta). Default: None.
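This method is normally invoked internally: when you call the tokenizer with string arguments such as padding="longest", they are mapped to the PaddingStrategy / TruncationStrategy values it receives. A hedged sketch of that public path, using the HuggingFace transformers class of the same name and an assumed toy vocabulary with a [PAD] token:

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

backend = Tokenizer(models.WordLevel({"[PAD]": 0, "[UNK]": 1, "hi": 2, "there": 3},
                                     unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend,
                                    unk_token="[UNK]", pad_token="[PAD]")

# padding="longest" is translated to PaddingStrategy.LONGEST internally.
batch = tokenizer(["hi", "hi there"], padding="longest")
print(batch["input_ids"])       # [[2, 0], [2, 3]]
print(batch["attention_mask"])  # [[1, 0], [1, 1]]
```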
- train_new_from_iterator(text_iterator, vocab_size, length=None, new_special_tokens=None, special_tokens_map=None, **kwargs)[source]
Trains a tokenizer on a new corpus with the same defaults (in terms of special tokens or tokenization pipeline) as the current one.
- Parameters
text_iterator (list) – The training corpus. Should be a generator of batches of texts, for instance a list of lists of texts if you have everything in memory.
vocab_size (int) – The size of the vocabulary you want for your tokenizer.
length (int, optional) – The total number of sequences in the iterator. This is used to provide meaningful progress tracking. Default: None.
new_special_tokens (Union[list, AddedToken], optional) – A list of new special tokens to add to the tokenizer you are training. Default: None.
special_tokens_map (dict, optional) – If you want to rename some of the special tokens this tokenizer uses, pass along a mapping from old special token name to new special token name in this argument. Default: None.
kwargs (Any, optional) – Additional keyword arguments.
- Returns
PreTrainedTokenizerFast, A new tokenizer of the same type as the original one, trained on text_iterator.
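A hedged sketch of retraining, again via the HuggingFace transformers class of the same name: starting from an untrained BPE backend, a new tokenizer of the same type is trained on a tiny assumed in-memory corpus (a list of batches of texts, as the text_iterator parameter describes).

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# An untrained BPE backend; the pipeline (pre-tokenizer, specials) is reused
# by train_new_from_iterator for the new tokenizer.
backend = Tokenizer(models.BPE(unk_token="[UNK]"))
backend.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend, unk_token="[UNK]")

# Toy corpus: an in-memory list of batches of texts.
corpus = [["low lower lowest"], ["new newer newest"]]
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=64)

ids = new_tokenizer.encode("lower")  # ids from the newly trained vocabulary
```

The vocab_size of 64 is an arbitrary illustrative choice; with such a small corpus the trained vocabulary may end up smaller.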