mindformers.models.LlamaTokenizerFast
- class mindformers.models.LlamaTokenizerFast(vocab_file=None, tokenizer_file=None, clean_up_tokenization_spaces=False, unk_token='<unk>', bos_token='<s>', eos_token='</s>', add_bos_token=True, add_eos_token=False, use_default_system_prompt=False, **kwargs)[source]
- Construct a Llama tokenizer based on byte-level Byte-Pair-Encoding. This tokenizer notably uses ByteFallback and no normalization.
- Note
- Currently, llama_tokenizer_fast supports only the 'right' padding mode: padding_side = "right".
- Note
- If you want to change the bos_token or the eos_token, make sure to specify them when initializing the model, or call tokenizer.update_post_processor() to make sure that the post-processing is done correctly (otherwise the values of the first token and final token of an encoded sequence will not be correct).
- Parameters
- vocab_file (str, optional) – SentencePiece file (generally has a .model extension) that contains the vocabulary necessary to instantiate a tokenizer. Default: None.
- tokenizer_file (str, optional) – Tokenizers file (generally has a .json extension) that contains everything needed to load the tokenizer. Default: None.
- clean_up_tokenization_spaces (bool, optional) – Whether to clean up spaces after decoding; cleanup consists of removing potential artifacts like extra spaces. Default: False.
- unk_token (Union[str, tokenizers.AddedToken], optional) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. Default: "<unk>".
- bos_token (Union[str, tokenizers.AddedToken], optional) – The beginning-of-sequence token that was used during pretraining. Can be used as a sequence classifier token. Default: "<s>".
- eos_token (Union[str, tokenizers.AddedToken], optional) – The end-of-sequence token. Default: "</s>".
- add_bos_token (bool, optional) – Whether to add a bos_token at the start of sequences. Default: True.
- add_eos_token (bool, optional) – Whether to add an eos_token at the end of sequences. Default: False.
- use_default_system_prompt (bool, optional) – Whether the default system prompt for Llama should be used. Default: False.
 
- Returns
- LlamaTokenizerFast, a LlamaTokenizerFast instance.
 - Examples
 - >>> from mindformers.models import LlamaTokenizerFast
   >>>
   >>> tokenizer = LlamaTokenizerFast(vocab_file="./llama2/tokenizer.model")
   >>> tokenizer.encode("Hello this is a test")
   [1, 15043, 445, 338, 263, 1243]
 - build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]
- Inserts the special tokens (bos_token and/or eos_token, as configured) into the input_ids.
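The insertion logic can be sketched in plain Python as below. This is a minimal, library-free illustration, not the library's implementation: the real method reads add_bos_token/add_eos_token from the tokenizer instance, and the token IDs here (1 for "<s>", 2 for "</s>") are assumed for illustration.

```python
BOS_ID = 1  # assumed ID of "<s>" (illustrative)
EOS_ID = 2  # assumed ID of "</s>" (illustrative)

def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     add_bos_token=True, add_eos_token=False):
    """Prepend/append the configured special tokens to one or two sequences."""
    bos = [BOS_ID] if add_bos_token else []
    eos = [EOS_ID] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # Second sequence gets the same bos/eos treatment.
        output = output + bos + token_ids_1 + eos
    return output

print(build_inputs_with_special_tokens([15043, 445]))        # [1, 15043, 445]
print(build_inputs_with_special_tokens([15043], [338],
                                       add_eos_token=True))  # [1, 15043, 2, 1, 338, 2]
```

With the defaults (add_bos_token=True, add_eos_token=False), only the bos_token is inserted, which matches the encode example above.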
 - save_vocabulary(save_directory: str, filename_prefix: Optional[str] = None)[source]
- Saves the vocabulary to the specified directory. This method is used to export the vocabulary file from the slow tokenizer.
- Parameters
- save_directory (str) – The directory in which to save the vocabulary.
- filename_prefix (str, optional) – An optional prefix to add to the name of the saved file. Default: None.
- Returns
- A tuple containing the paths of the saved vocabulary files. 
- Raises
- ValueError – Raised if the vocabulary cannot be saved from a fast tokenizer, or if the specified save directory does not exist.
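A simplified sketch of what save_vocabulary does under the hood, assuming the tokenizer holds a path to its SentencePiece model file (the vocab_file passed at construction). The function name's free parameters and VOCAB_FILE_NAME are illustrative, not the library's internals:

```python
import os
import shutil

VOCAB_FILE_NAME = "tokenizer.model"  # assumed output file name (illustrative)

def save_vocabulary(vocab_file, save_directory, filename_prefix=None):
    """Copy the vocabulary file into save_directory; return the saved path(s)."""
    if not os.path.isdir(save_directory):
        # Mirrors the documented ValueError for a missing save directory.
        raise ValueError(f"Vocabulary path ({save_directory}) should be a directory")
    prefix = filename_prefix + "-" if filename_prefix else ""
    out_path = os.path.join(save_directory, prefix + VOCAB_FILE_NAME)
    shutil.copyfile(vocab_file, out_path)
    return (out_path,)
```

Returning a tuple of paths matches the "tuple containing the paths of the saved vocabulary files" return value documented above.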
 
 
 - slow_tokenizer_class
- alias of mindformers.models.llama.llama_tokenizer.LlamaTokenizer
 - update_post_processor()[source]
- Updates the underlying post-processor with the current bos_token and eos_token.
- Raises
- ValueError – Raised if add_bos_token or add_eos_token is set but the corresponding token is None.
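The validation and the single-sequence template this method builds can be sketched as follows. This is a hedged sketch of the behavior documented above; the real method installs the resulting template as a tokenizers post-processor rather than returning a string:

```python
def build_post_processor_template(bos_token, eos_token,
                                  add_bos_token=True, add_eos_token=False):
    """Build the single-sequence post-processing template string."""
    # The documented ValueError cases: a token is requested but undefined.
    if add_bos_token and bos_token is None:
        raise ValueError("add_bos_token = True but bos_token = None")
    if add_eos_token and eos_token is None:
        raise ValueError("add_eos_token = True but eos_token = None")
    bos = f"{bos_token}:0 " if add_bos_token else ""
    eos = f" {eos_token}:0" if add_eos_token else ""
    return f"{bos}$A:0{eos}"  # $A stands for the input sequence

print(build_post_processor_template("<s>", "</s>"))  # <s>:0 $A:0
```

This is why the class note says to call update_post_processor() after changing bos_token or eos_token: the template caches the token strings, so a stale template would emit the old first/last tokens.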