mindformers.models.LlamaConfig

class mindformers.models.LlamaConfig(batch_size=1, seq_length=2048, hidden_size=4096, num_layers=32, num_heads=32, n_kv_heads=None, max_position_embedding=None, intermediate_size=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, rms_norm_eps=1e-5, bos_token_id=1, eos_token_id=2, pad_token_id=0, ignore_token_id=-100, theta=10000.0, compute_dtype='float16', layernorm_compute_type='float32', softmax_compute_type='float32', rotary_dtype='float32', param_init_type='float16', residual_dtype=None, embedding_init_type=None, qkv_has_bias=False, qkv_concat=False, attn_proj_has_bias=False, parallel_config=default_transformer_config, moe_config=default_moe_config, use_past=False, extend_method='None', scaling_factor=1.0, is_dynamic=False, use_rope_slice=False, use_flash_attention=False, use_ring_attention=False, use_attn_mask_compression=False, use_eod_attn_mask_compression=False, parallel_optimizer=False, fine_grain_interleave=1, pp_interleave_num=1, offset=0, init_method_std=0.01, checkpoint_name_or_path='', repetition_penalty=1.0, max_decode_length=1024, block_size=16, num_blocks=512, top_k=5, top_p=1.0, do_sample=True, quant_config=None, tie_word_embeddings=False, llm_backend='', fused_rms_norm=True, input_sliced_sig=False, rmsnorm_compute_2d=False, chunk_prefill=False, calculate_per_token_loss=False, pipeline_stage=None, return_hidden_states=False, **kwargs)[source]

Llama config class which defines the model size.

Parameters
  • batch_size (int, optional) – Batch size of the input data, used in prediction. Default: 1.

  • seq_length (int, optional) – The sequence length of input_ids. Default: 2048.

  • hidden_size (int, optional) – Dimensionality of the encoder layers and the pooler layer. Default: 4096.

  • num_layers (int, optional) – Number of hidden layers in the Transformer decoder. Default: 32.

  • num_heads (int, optional) – Number of attention heads for each attention layer in the Transformer decoder. Default: 32.

  • n_kv_heads (int, optional) – Number of key-value heads used for grouped (multi-group) attention. Default: None.

  • max_position_embedding (int, optional) – Customize the maximum sequence length that the model can handle. Default: None.

  • intermediate_size (int, optional) – Customize the dimensionality of the intermediate (feed-forward) layer. Default: None.

  • vocab_size (int, optional) – Vocabulary size of the llama model. Default: 32000.

  • multiple_of (int, optional) – The multiple that the SwiGLU hidden layer size is rounded up to. Default: 256.

  • ffn_dim_multiplier (int, optional) – Multiplier applied to the FFN layer dimension. Default: None.

  • rms_norm_eps (float, optional) – The epsilon value added to the denominator in RMSNorm. Default: 1e-5.

  • bos_token_id (int, optional) – The id of the beginning-of-sequence token. Default: 1.

  • eos_token_id (int, optional) – The id of the end-of-sequence token. Default: 2.

  • pad_token_id (int, optional) – The id of the padding token. Default: 0.

  • ignore_token_id (int, optional) – The id of the token to be ignored (e.g. when computing the loss). Default: -100.

  • theta (float, optional) – Frequency factors for sine and cosine functions in RoPE. Default: 10000.0.

  • compute_dtype (str, optional) – Linear layer compute dtype. Default: float16.

  • layernorm_compute_type (str, optional) – Layernorm compute dtype. Default: float32.

  • softmax_compute_type (str, optional) – Softmax compute dtype. Default: float32.

  • rotary_dtype (str, optional) – RoPE compute dtype. Default: float32.

  • param_init_type (str, optional) – Parameter initial dtype. Default: float16.

  • residual_dtype (str, optional) – Residual compute dtype. Default: None.

  • embedding_init_type (str, optional) – Embedding weight initial dtype. Default: None.

  • qkv_has_bias (bool, optional) – Whether the Query, Key, and Value projections have bias. Default: False.

  • qkv_concat (bool, optional) – Whether to concatenate the Query, Key, and Value projections. Default: False.

  • attn_proj_has_bias (bool, optional) – Whether the attention projection has bias. Default: False.

  • parallel_config (Union[dict, TransformerOpParallelConfig], optional) – The parallel configuration. Default: default_transformer_config, an instance of TransformerOpParallelConfig with default args.

  • moe_config (Union[dict, MoEConfig], optional) – The MoE configuration. Default: default_moe_config, an instance of MoEConfig with default args.

  • use_past (bool, optional) – Whether the model should use the past key/value caches (if applicable to the model) to speed up decoding; see the inference-oriented sketch after this parameter list. Default: False.

  • extend_method (str, optional) – The method used to extend the sequence length during inference. Default: "None".

  • scaling_factor (float, optional) – Scaling factor to adjust the weights of the frequency factors in the sine and cosine functions. Default: 1.0.

  • is_dynamic (bool, optional) – Whether to use dynamic shape. Default: False.

  • use_rope_slice (bool, optional) – Whether to enable RoPE slicing. Default: False.

  • use_flash_attention (bool, optional) – Whether to enable flash attention ops. Default: False.

  • use_ring_attention (bool, optional) – Whether to enable ring attention ops. Default: False.

  • use_attn_mask_compression (bool, optional) – Whether to enable attention mask compression. Default: False.

  • use_eod_attn_mask_compression (bool, optional) – Whether to enable eod attention mask compression. Default: False.

  • parallel_optimizer (bool, optional) – Whether to enable optimizer parallelism. Default: False.

  • fine_grain_interleave (int, optional) – Set the number of fine-grained interleaving splits used to overlap computation and communication. Default: 1.

  • pp_interleave_num (int, optional) – Set the number of pipeline interleaving (virtual pipeline) stages. Default: 1.

  • offset (int, optional) – Offset of the transformer layers when setting the pipeline stage number. Default: 0.

  • init_method_std (float, optional) – The standard deviation (sigma) used when initializing Linear layers with a normal distribution. Default: 0.01.

  • checkpoint_name_or_path (str, optional) – Checkpoint path or name used to load weights into the network. Default: "".

  • repetition_penalty (float, optional) – The parameter for repetition penalty. 1.0 means no penalty. See the CTRL paper (Keskar et al., 2019) for more details. Default: 1.0.

  • max_decode_length (int, optional) – The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set. Default: 1024.

  • block_size (int, optional) – The maximum number of tokens that one block can contain when using paged attention. Default: 16.

  • num_blocks (int, optional) – The maximum number of blocks when using paged attention. Default: 512.

  • top_k (int, optional) – The number of highest probability vocabulary tokens to keep for top-k-filtering. Default: 5.

  • top_p (float, optional) – If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Default: 1.0.

  • do_sample (bool, optional) – Whether to use sampling; use greedy decoding otherwise. Default: True.

  • quant_config (dict, optional) – Quantization configuration. Default: None.

  • tie_word_embeddings (bool, optional) – Whether to tie input and output embeddings. Default: False.

  • llm_backend (str, optional) – LLM boost backend. Default: "".

  • fused_rms_norm (bool, optional) – Whether to use the fused RMSNorm operator. Default: True.

  • input_sliced_sig (bool, optional) – Whether input_ids and labels have already been processed to length seq_length. If they have, input_sliced_sig should be True; otherwise it should be False. Default: False.

  • rmsnorm_compute_2d (bool, optional) – Whether to use 2D addition in RMSNorm. Default: False.

  • chunk_prefill (bool, optional) – Whether to use inference with chunked prefill mixed with decoding. Default: False.

  • calculate_per_token_loss (bool, optional) – Whether to calculate the loss of each token. Default: False.

  • pipeline_stage (dict, optional) – A dict setting the start_stage, stage_num, and offset of the model when using pipeline parallelism. Default: None.

  • return_hidden_states (bool, optional) – Whether to return hidden states. Default: False.
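
As a loose illustration (not an official recipe) of how the inference-related arguments above fit together, the sketch below configures incremental decoding (use_past) with paged attention. The values are placeholders, and it is assumed, as in the Examples section below, that constructor arguments are exposed as attributes of the same name.

>>> from mindformers.models import LlamaConfig
>>> infer_config = LlamaConfig(batch_size=1, seq_length=4096, use_past=True,
...                            use_flash_attention=True, block_size=16, num_blocks=1024)
>>> print(infer_config.use_past)
True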

Returns

LlamaConfig, a LlamaConfig instance.

Examples

>>> from mindformers.models import LlamaConfig
>>> config = LlamaConfig(num_layers=2, seq_length=1024)
>>> print(config.num_layers)
2
>>> print(config.seq_length)
1024
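
A further hedged sketch of a generation-oriented configuration, again assuming that constructor arguments are stored as attributes of the same name; the values are illustrative only, not recommended settings.

>>> gen_config = LlamaConfig(do_sample=False, top_k=1, max_decode_length=512)
>>> print(gen_config.do_sample)
False
>>> print(gen_config.max_decode_length)
512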