mindformers.dataset.MultiTurnDataset
- class mindformers.dataset.MultiTurnDataset(dataset_config)
- Multi-turn dataset. 
- The generated dataset has two columns: [input_ids, labels]. The tensor of column input_ids is of the int32 type. The tensor of column labels is of the int32 type. 
- Parameters
- dataset_config (dict) – Required. Config for the dataset. Must be a dict that contains at least all of the keys below (a minimal key-checking sketch follows this list). 
- data_loader: Config for the data loader, or a data loader object. When data_loader is a dict, the strings "type", "dataset_dir" and "shuffle" are the keys that can be parsed. 
- type: Required. Indicates the type of the dataset. The value must be a string or a class type. 
- dataset_dir: Required. Path of the dataset. 
- shuffle: Required. Whether to perform shuffle on the dataset. Must be bool. 
 
- tokenizer: Tokenizer configuration or object. 
- max_seq_length: Maximum length of the sequence. 
- batch_size: Size of each batch. 
- drop_remainder: Whether to discard the last batch when the number of data items contained in the last batch is smaller than batch_size. Default: True. 
- num_parallel_workers: Specifies the number of concurrent processes or threads for map operations to accelerate processing. 
- python_multiprocessing: Whether to enable Python multiprocessing to accelerate map operations. 
- repeat: Number of times this dataset is repeated. 
- seed: Random seed number. 
- prefetch_size: Buffer queue size of each data processing operation in the pipeline. 
- numa_enable: Indicates whether to use the NUMA binding function. 
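As a concrete reading of the key contract above, here is a minimal sketch built around a hypothetical check_keys helper (it is not part of the MindFormers API). It only tests the data_loader sub-keys that the list marks as Required; extending it to the remaining keys follows the same pattern.

>>> # Hypothetical helper, not part of MindFormers: verify the keys that the
>>> # parameter list above marks as Required before building the dataset.
>>> def check_keys(dataset_config):
...     loader = dataset_config.get('data_loader')
...     if loader is None:
...         raise ValueError('dataset_config must contain a data_loader entry')
...     if isinstance(loader, dict):
...         missing = [k for k in ('type', 'dataset_dir', 'shuffle') if k not in loader]
...         if missing:
...             raise ValueError(f'data_loader is missing keys: {missing}')
>>> check_keys({'data_loader': {'type': 'ToolAlpacaDataLoader',
...                             'dataset_dir': '/path/to/tool_alpaca.jsonl',
...                             'shuffle': True}})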
 
- Returns
- Instance of MultiTurnDataset. 
- Raises
- ValueError – If the Python version is earlier than 3.9. 
- ValueError – If dataset_dir is missing in dataset_config.data_loader, or dataset_config.data_loader.dataset_dir does not exist. 
- ValueError – If the lengths of the tokens and the loss masks do not match. 
- ValueError – If the lengths of the input ids and the labels do not match. 
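For example, a missing or nonexistent data_loader.dataset_dir surfaces at construction time. The guard below is a hedged sketch, assuming config is a MindFormerConfig built as in the Examples that follow:

>>> from mindformers import MultiTurnDataset
>>> try:
...     dataset = MultiTurnDataset(config)  # `config` as in the Examples below
... except ValueError as err:
...     print(f'failed to build MultiTurnDataset: {err}')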
 
- Examples

>>> from mindformers import MultiTurnDataset
>>> from mindformers.tools.register import MindFormerConfig
>>> from mindformers.dataset import check_dataset_config
>>> # Note:
>>> # `"/path/to/tool_alpaca.jsonl"` should be replaced with the real path of the formatted dataset file.
>>> # `"/path/to/tokenizer.model"` should be replaced with the real path of the tokenizer file.
>>> config_dict = {
...     'data_loader': {
...         'type': 'ToolAlpacaDataLoader',
...         'dataset_dir': "/path/to/tool_alpaca.jsonl",
...         'shuffle': True
...     },
...     'tokenizer': {
...         'type': 'ChatGLM3Tokenizer',
...         'vocab_file': '/path/to/tokenizer.model'
...     },
...     'max_seq_length': 2048,
...     'batch_size': 1,
...     'drop_remainder': True,
...     'num_parallel_workers': 8,
...     'python_multiprocessing': False,
...     'repeat': 1,
...     'seed': 0,
...     'prefetch_size': 1,
...     'numa_enable': False,
... }
>>> # Initialize a MindFormerConfig instance with a dict.
>>> config = MindFormerConfig(**config_dict)
>>> check_dataset_config(config)
>>> # Use the class to build the dataset.
>>> dataset_from_class = MultiTurnDataset(config)
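The object built above can then be consumed like a regular MindSpore dataset. The snippet below is a hedged usage sketch rather than part of the documented API; it assumes the returned instance exposes MindSpore's standard create_dict_iterator method, which MindFormers dataset classes typically provide.

>>> # Hedged sketch: iterate one batch and inspect the two int32 columns.
>>> for batch in dataset_from_class.create_dict_iterator(num_epochs=1):
...     print(batch['input_ids'].shape, batch['labels'].shape)
...     break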