# Dataset

[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/r2.7.0rc1/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/r2.7.0rc1/docs/mindformers/docs/source_en/feature/dataset.md)

MindSpore Transformers currently supports multiple dataset loading methods, covering common open-source and custom scenarios. Specifically, it includes:

- **Megatron datasets**: Supports loading datasets in the Megatron-LM format, suitable for large-scale language model pre-training tasks.
- **HuggingFace datasets**: Compatible with the HuggingFace `datasets` library, making it convenient to access the wide range of public data resources in the community.
- **MindRecord datasets**: MindRecord is an efficient data storage and reading module provided by MindSpore. It offers various methods to help users convert different public datasets into the MindRecord format, as well as tools for reading, writing, and retrieving MindRecord files.

## Megatron Dataset

The Megatron dataset is an efficient data format designed for large-scale distributed language model pre-training and is widely used within the Megatron-LM framework. Such datasets are preprocessed and serialized into binary files (`.bin`) with accompanying index files (`.idx`), enabling efficient parallel loading and data partitioning in distributed cluster environments.

The following sections explain how to generate the `.bin` and `.idx` files and how to use Megatron datasets in training tasks.

### Data Preprocessing

MindSpore Transformers provides the data preprocessing script [preprocess_indexed_dataset.py](https://gitee.com/mindspore/mindformers/blob/r1.6.0/mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py), which converts raw text data in `json` format into `.bin` and `.idx` files. If the raw text data is not in `json` format, you need to convert it into this format yourself.

Below is an example of a `json` format file:

```json
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
...
```

The data fields are described as follows:

| Field Name | Description                  | Required |
|------------|------------------------------|:--------:|
| text       | Raw text data                | Yes      |
| id         | Unique identifier (in order) | No       |
| src        | Data source                  | No       |
| type       | Language type                | No       |
| title      | Data title                   | No       |

The following example demonstrates how to convert the `wikitext-103` dataset into the Megatron dataset format:

1. Download the `wikitext-103` dataset: [Link](https://dagshub.com/DagsHub/WIkiText-103/src/main/dataset/tokens)

2. Generate a `json` format data file

    The original text of the `wikitext-103` dataset looks like this:

    ```text
    = Valkyria Chronicles III =
    Valkyria Chronicles III is a tactical role-playing game developed by Sega for the PlayStation Portable. The game was released in Japan on January 27, 2011.
    = Gameplay =
    The game is similar to its predecessors in terms of gameplay...
    ```

    You need to preprocess the original text into the following format and save it as a `json` file (a conversion sketch is provided after this list):

    ```json
    {"id": 0, "text": "Valkyria Chronicles III is a tactical role-playing game..."}
    {"id": 1, "text": "The game is similar to its predecessors in terms of gameplay..."}
    ...
    ```
3. Download the model's vocabulary file

    Since different models use different vocabulary files, you need to download the vocabulary file corresponding to the model being trained. Taking the `Llama3` model as an example, download [tokenizer.model](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/original/tokenizer.model) for data preprocessing.

4. Generate `.bin` and `.idx` data files

    Run the data preprocessing script [preprocess_indexed_dataset.py](https://gitee.com/mindspore/mindformers/blob/r1.6.0/mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py) to convert the original text data into the corresponding token IDs using the model's tokenizer.

    The script accepts the following parameters:

    | Parameter Name | Description                                                                                          |
    |----------------|------------------------------------------------------------------------------------------------------|
    | input          | Path to the `json` format file                                                                         |
    | output-prefix  | Prefix for the `.bin` and `.idx` data files                                                            |
    | tokenizer-type | Type of tokenizer used by the model                                                                    |
    | vocab-file     | Path to the model's tokenizer file (`tokenizer.model` / `vocab.json`)                                  |
    | merges-file    | Path to the model's tokenizer merges file (`merges.txt`)                                               |
    | add_bos_token  | Whether to add a `bos_token` (beginning-of-sequence token)                                             |
    | add_eos_token  | Whether to add an `eos_token` (end-of-sequence token)                                                  |
    | seq-length     | Sequence length of the dataset samples                                                                 |
    | pad_or_stitch  | Whether to `pad` or `stitch` samples                                                                   |
    | register_path  | Code directory of the external tokenizer. Takes effect only when `tokenizer-type` is `AutoRegister`    |
    | auto_register  | Import path of the external tokenizer. Takes effect only when `tokenizer-type` is `AutoRegister`       |

    The optional values of `tokenizer-type` are `LlamaTokenizer`, `LlamaTokenizerFast`, and `AutoRegister`. When it is set to `LlamaTokenizer` or `LlamaTokenizerFast`, the corresponding public tokenizer class in MindSpore Transformers is called. When it is set to `AutoRegister`, the external tokenizer class specified by `register_path` and `auto_register` is used.

    Taking the public tokenizer class `LlamaTokenizerFast` as an example, execute the following command to preprocess the dataset:

    ```shell
    python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
      --input /path/data.json \
      --output-prefix /path/megatron_data \
      --tokenizer-type LlamaTokenizerFast \
      --vocab-file /path/tokenizer.model \
      --add_bos_token True \
      --add_eos_token True \
      --pad_or_stitch stitch \
      --seq-length 8192
    ```

    Taking the external tokenizer class [Llama3Tokenizer](https://gitee.com/mindspore/mindformers/blob/r1.6.0/research/llama3_1/llama3_1_tokenizer.py) as an example, make sure the **local** MindSpore Transformers repository contains `research/llama3_1/llama3_1_tokenizer.py`, and execute the following command to preprocess the dataset:

    ```shell
    python mindformers/tools/dataset_preprocess/preprocess_indexed_dataset.py \
      --input /path/data.json \
      --output-prefix /path/megatron_data \
      --tokenizer-type AutoRegister \
      --vocab-file /path/tokenizer.model \
      --add_bos_token True \
      --add_eos_token True \
      --pad_or_stitch stitch \
      --seq-length 8192 \
      --register_path research/llama3_1 \
      --auto_register llama3_1_tokenizer.Llama3Tokenizer
    ```
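The conversion in step 2 is left to the user. The following is a minimal sketch, not part of MindSpore Transformers, that turns the raw `wikitext-103` text into the `json` format shown above; the input and output file names are examples, and the section-splitting rule should be adapted to your data:

```python
# Minimal sketch: convert raw wikitext-103 text into json-lines records with
# "id" and "text" fields. Assumes sections are delimited by "= ... =" heading lines.
import json

paragraphs, current = [], []
with open("wiki.train.tokens", "r", encoding="utf-8") as f:   # example input file name
    for line in f:
        line = line.strip()
        if line.startswith("=") and line.endswith("="):       # heading line starts a new section
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line:
            current.append(line)
if current:
    paragraphs.append(" ".join(current))

with open("wiki.json", "w", encoding="utf-8") as f:           # example output file name
    for i, text in enumerate(paragraphs):
        f.write(json.dumps({"id": i, "text": text}, ensure_ascii=False) + "\n")
```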
### Model Pre-training

MindSpore Transformers recommends using Megatron datasets for model pre-training. Following the [Data Preprocessing](#data-preprocessing) steps above, you can generate the required pre-training dataset. The following explains how to configure and use Megatron datasets in the configuration file.

1. Prepare the `parallel_speed_up.json` file

    The Megatron dataset relies on the `dataset_broadcast_opt_level` feature for data broadcasting; for more details, refer to the [documentation](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/parallel/mindspore.parallel.auto_parallel.AutoParallel.html). Therefore, you need to create a `parallel_speed_up.json` file with the following content:

    ```json
    {
        "dataset_broadcast_opt_level": 3
    }
    ```

    At the same time, add the following fields to the model configuration file:

    ```yaml
    context:
      ascend_config:
        parallel_speed_up_json_path: "/path/to/parallel_speed_up.json"
    ```

2. Modify the model configuration file

    To use a Megatron dataset in a model pre-training task, mainly modify the `train_dataset` section of the configuration file.

    ```yaml
    train_dataset: &train_dataset
      data_loader:
        type: BlendedMegatronDatasetDataLoader
        datasets_type: "GPTDataset"
        sizes:
          - 1000  # Number of training dataset samples
          - 0     # Number of testing dataset samples (currently unsupported)
          - 0     # Number of evaluation dataset samples (currently unsupported)
        config:  # GPTDataset configuration options
          seed: 1234                        # Random seed for data sampling
          split: "1, 0, 0"                  # Ratio of training, testing, and evaluation datasets (currently unsupported)
          seq_length: 8192                  # Sequence length of data returned by the dataset
          eod_mask_loss: True               # Whether to compute loss at end-of-document (EOD) tokens
          reset_position_ids: True          # Whether to reset position_ids at EOD tokens
          create_attention_mask: True       # Whether to return an attention_mask
          reset_attention_mask: True        # Whether to reset the attention_mask at EOD tokens, returning a staircase-shaped mask
          create_compressed_eod_mask: False # Whether to return a compressed attention_mask
          eod_pad_length: 128               # Length of the compressed attention_mask
          eod: 0                            # Token ID of the EOD token in the dataset
          pad: 1                            # Token ID of the pad token in the dataset

          data_path:                        # Sampling ratios and paths of the Megatron datasets
            - '0.3'                         # Ratio of dataset1
            - "/path/megatron_data1"        # Path to the bin file of dataset1, excluding the .bin suffix
            - '0.7'                         # Ratio of dataset2
            - "/path/megatron_data2"        # Path to the bin file of dataset2, excluding the .bin suffix
      input_columns: ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
      construct_args_key: ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]

    parallel:
      full_batch: False
      dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1, 1, 1]]  # *dp means the same value as data_parallel

    model_config:
      input_sliced_sig: True
    ```

    Below are the descriptions of each `GPTDataset` configuration option:

    | Parameter Name             | Description |
    |----------------------------|-------------|
    | seed                       | Random seed for dataset sampling. Megatron datasets use this value to randomly sample and concatenate samples. Default: `1234` |
    | seq_length                 | Sequence length of data returned by the dataset. Should be consistent with the sequence length of the training model. |
    | eod_mask_loss              | Whether to compute loss at end-of-document (EOD) tokens. Default: `False` |
    | create_attention_mask      | Whether to return an attention_mask. Default: `True` |
    | reset_attention_mask       | Whether to reset the attention_mask at EOD tokens, returning a staircase-shaped attention_mask. Effective only when `create_attention_mask=True`. Default: `False` |
    | create_compressed_eod_mask | Whether to return a compressed attention_mask. Has higher priority than `create_attention_mask`. Default: `False` |
    | eod_pad_length             | Length of the compressed attention_mask. Effective only when `create_compressed_eod_mask=True`. Default: `128` |
    | eod                        | Token ID of the EOD token in the dataset |
    | pad                        | Token ID of the pad token in the dataset |
    | data_path                  | List in which every two consecutive elements (number, string) describe one dataset: the sampling ratio of the dataset and the path to its bin file excluding the `.bin` suffix. The ratios of all datasets must sum to 1 (a sanity-check sketch is given at the end of this section). |
    In addition, the Megatron dataset also depends on configurations such as `input_columns`, `construct_args_key`, and `full_batch`. For more details, refer to the [configuration file documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/feature/configuration.html). Here, we only explain how to configure them in different scenarios:

    - When `create_compressed_eod_mask=True`:

      ```yaml
      train_dataset: &train_dataset
        input_columns: ["input_ids", "labels", "loss_mask", "position_ids", "actual_seq_len"]
        construct_args_key: ["input_ids", "labels", "loss_mask", "position_ids", "actual_seq_len"]

      parallel:
        full_batch: False
        dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]  # *dp means the same value as data_parallel
      ```

    - When `create_compressed_eod_mask=False` and `create_attention_mask=True`:

      ```yaml
      train_dataset: &train_dataset
        input_columns: ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
        construct_args_key: ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]

      parallel:
        full_batch: False
        dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1, 1, 1]]  # *dp means the same value as data_parallel
      ```

    - When `create_compressed_eod_mask=False` and `create_attention_mask=False`:

      ```yaml
      train_dataset: &train_dataset
        input_columns: ["input_ids", "labels", "loss_mask", "position_ids"]
        construct_args_key: ["input_ids", "labels", "loss_mask", "position_ids"]

      parallel:
        full_batch: False
        dataset_strategy: [[*dp, 1], [*dp, 1], [*dp, 1], [*dp, 1]]  # *dp means the same value as data_parallel
      ```

3. Start Model Pre-training

    After modifying the dataset and parallel-related configurations in the model configuration file, refer to the model documentation to launch the pre-training task. Here, we take the [Llama3_1 model documentation](https://gitee.com/mindspore/mindformers/blob/r1.6.0/research/llama3_1/README.md) as an example.
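Before launching the task, it can be useful to sanity-check the `data_path` list from step 2. The snippet below is illustrative only, not part of MindSpore Transformers, and reuses the example paths from the configuration above:

```python
# Illustrative check of the data_path list: entries alternate (ratio, path prefix),
# the ratios must sum to 1, and each prefix must have matching .bin and .idx files.
import os

data_path = ['0.3', '/path/megatron_data1', '0.7', '/path/megatron_data2']
ratios = [float(r) for r in data_path[0::2]]
prefixes = data_path[1::2]

assert abs(sum(ratios) - 1.0) < 1e-6, "dataset sampling ratios must sum to 1"
for prefix in prefixes:
    for suffix in (".bin", ".idx"):
        if not os.path.isfile(prefix + suffix):
            print(f"missing file: {prefix + suffix}")
```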
## HuggingFace Datasets

The dataset loading functionality is currently integrated with the [Modelers open-source community](https://modelers.cn/datasets) and the [HuggingFace community](https://huggingface.co/datasets), supporting online dataset loading and preprocessing. In addition, datasets can be [packed](#dataset-packing) to improve model training efficiency.

### Usage Instructions

HuggingFace datasets support online and offline loading of datasets from both the HuggingFace community and the Modelers open-source community. Below is an introduction to environment preparation, the dataset loading process, and how to configure HuggingFace datasets in configuration files.

#### Integrating with Open-Source Communities

- Integrating with the HuggingFace community

  To use datasets from the HuggingFace community, follow these steps:

  1. Environment Setup

      The environment variable `HF_ENDPOINT` controls the remote repository used by HuggingFace. By default, it is set to `https://huggingface.co`. For users in China, it is recommended to configure it to the mirror address via `export HF_ENDPOINT=https://hf-mirror.com`.

  2. Install Dependencies

      ```shell
      pip install datasets
      ```

- Integrating with the Modelers open-source community

  To use datasets from the Modelers open-source community, follow these steps:

  1. Environment Setup

      The environment variable `OPENMIND_HUB_ENDPOINT` controls the remote repository used by the Modelers open-source community. When not configured, it defaults to `export OPENMIND_HUB_ENDPOINT=https://telecom.openmind.cn`.

  2. Install Dependencies

      ```shell
      git clone https://gitee.com/openmind-ai/openmind-hub.git
      cd openmind-hub
      pip install -e .
      cd ..
      git clone https://gitee.com/foundation-models/openmind-datasets.git
      cd openmind-datasets
      pip install -e .
      cd ..
      ```

> When the openmind-datasets component is installed in the environment, the Modelers open-source community is used by default. The environment variable `USE_OM` controls which community is used: the default value `ON` selects the Modelers community; change it to `OFF` to interface with the HuggingFace community.

#### Dataset Loading Process

![commondataloader.png](./images/commondataloader.png)

The online dataset loading and processing functionality is primarily implemented through `CommonDataLoader`. The data loading part can be customized via configuration files, with detailed configuration instructions available in the [dataloader parameter description](#dataloader-parameter-description). The online loading module requires users to implement customizations for different datasets; for example, the `AlpacaInstructDataHandler` class can be used to preprocess the `alpaca` dataset. For more information, please refer to [Custom Data Handler](#custom-data-handler).

The parameters such as `seq_length` and `tokenizer` used in the examples below are all taken from the `qwen2.5` model. Since the `qwen2.5` model is located in the `research` directory, the `--register_path` parameter needs to be specified when launching the task. Adjust these parameters according to your actual situation.
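Before writing the YAML configuration, you can optionally confirm that the `datasets` library can reach the configured endpoint and download the dataset used in the examples below. This check is not required by MindSpore Transformers; the mirror address and dataset name are the ones mentioned in this document:

```python
# Optional connectivity check: the endpoint must be set before `datasets` is imported.
import os
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

from datasets import load_dataset

ds = load_dataset("llm-wizard/alpaca-gpt4-data", split="train")
print(ds.column_names, len(ds))
```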
#### dataloader Parameter Description

The online dataset loading feature is enabled by configuring `data_loader` in the configuration file. Below is an example configuration for online dataset loading:

```yaml
train_dataset: &train_dataset
  input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
  construct_args_key: *input_columns
  data_loader:
    type: CommonDataLoader
    load_func: 'load_dataset'
    shuffle: False
    split: "train"
    path: "llm-wizard/alpaca-gpt4-data"
    packing: pack
    handler:
      - type: AlpacaInstructDataHandler
        tokenizer:
          model_max_length: 131072
          bos_token: null
          eos_token: "<|im_end|>"
          unk_token: null
          pad_token: "<|endoftext|>"
          vocab_file: "/path/vocab.json"    # qwen2.5
          merges_file: "/path/merges.txt"   # qwen2.5
          auto_register: qwen2_5_tokenizer.Qwen2Tokenizer
          type: Qwen2Tokenizer
        seq_length: 8192
        prompt_key: "conversations"
        output_columns: ["input_ids", "labels"]
        is_dynamic: False
      - type: PackingHandler
        seq_length: 8192
        output_columns: ["input_ids", "labels", "actual_seq_len"]
    adaptor_config:
      compress_mask: False
    column_names: *input_columns
```

The parameters of `data_loader` are described as follows:

| Parameter Name | Description | Type |
|----------------|-------------|:----:|
| type           | Fixed as `CommonDataLoader`. This module supports loading datasets from the HuggingFace and Modelers open-source communities. | str |
| packing        | Packing configuration applied when processing datasets with `handler`. Options are `pack` and `truncate`. | str |
| load_func      | The function used to load datasets. Options are `load_dataset` and `load_from_disk`. Use `load_from_disk` for data saved via the `save_to_disk` function, and `load_dataset` for other scenarios. Default: `load_dataset`. | str |
| path           | When `load_func=load_dataset`, this parameter aligns with the interface of [datasets.load_dataset](https://huggingface.co/docs/datasets/loading). When `load_func=load_from_disk`, it specifies the dataset loading path. | str |
| data_files     | When `load_func=load_dataset`, this parameter aligns with the interface of [datasets.load_dataset](https://huggingface.co/docs/datasets/loading). It has no effect when `load_func=load_from_disk`. | str |
| handler        | Multiple handlers can be configured to preprocess the loaded dataset in the specified order. For details on handler configuration, refer to the handler parameter description in [Custom Data Handler](#custom-data-handler). | list |
| adaptor_config | Dataset-related configuration during model training. Currently supports `compress_mask`, which is effective when `packing` is set; if enabled, a compressed data mask is returned. Default: `False`. | dict |
| shuffle        | Whether random sampling is enabled when loading the dataset. | bool |
| column_names   | Column names returned by the dataset. If not set, all columns are returned. | list |
| is_dynamic     | Whether the dataset returns dynamic-length data. Default: `False`. | bool |

> In addition to the above configurations, all parameters of the [datasets.load_dataset](https://huggingface.co/docs/datasets/loading) interface are supported with the same meanings and functions.

When packing is configured, the dataset returns an `actual_seq_len` column. For more information, refer to the `actual_seq_qlen` and `actual_seq_kvlen` parameter descriptions in the [documentation](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0027.html).
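The difference between the two `load_func` options corresponds to plain `datasets` library usage and is independent of MindSpore Transformers. The sketch below uses an assumed local JSON file and save directory:

```python
# `load_dataset` reads hub datasets or raw files; `load_from_disk` reads datasets
# previously written with `save_to_disk` (which is what offline processing produces).
from datasets import load_dataset, load_from_disk

ds = load_dataset("json", data_files="/path/alpaca_gpt4_data.json", split="train")
ds.save_to_disk("/path/saved_alpaca")            # hypothetical save directory
reloaded = load_from_disk("/path/saved_alpaca")  # corresponds to load_func: load_from_disk
print(len(ds), len(reloaded))
```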
### Feature Introduction

#### Dynamic Sequence Length Fine-Tuning

`CommonDataLoader` supports dynamic-shape fine-tuning using HuggingFace datasets, which can be loaded online or offline. Below, the `alpaca` dataset is used as an example to demonstrate the configuration for dynamic-shape fine-tuning.

- Online Loading

  The online dataset name is `llm-wizard/alpaca-gpt4-data`. You can search for and download it from the [HuggingFace official website](https://huggingface.co/datasets), or load it directly using the online name.

  Example configuration for online loading:

  ```yaml
  train_dataset: &train_dataset
    input_columns: &input_columns ["input_ids", "labels"]
    dynamic_batch: True  # Enable dynamic shape
    divisor: 32          # With divisor and remainder configured, seq_length in dynamic shape becomes a multiple of divisor plus remainder
    remainder: 1
    data_loader:
      type: CommonDataLoader
      shuffle: True
      split: "train"                       # Subset name of the online dataset
      path: "llm-wizard/alpaca-gpt4-data"  # Online dataset name
      handler:
        - type: AlpacaInstructDataHandler
          tokenizer:
            model_max_length: 131072
            bos_token: null
            eos_token: "<|im_end|>"
            unk_token: null
            pad_token: "<|endoftext|>"
            vocab_file: "/path/vocab.json"    # qwen2.5
            merges_file: "/path/merges.txt"   # qwen2.5
            auto_register: qwen2_5_tokenizer.Qwen2Tokenizer
            type: Qwen2Tokenizer
          seq_length: 8192
          prompt_key: "conversations"
          output_columns: *input_columns
          is_dynamic: True
    seed: 0
    num_parallel_workers: 8
    python_multiprocessing: False
    drop_remainder: True
    repeat: 1
    numa_enable: False
    prefetch_size: 1
  ```

  1. For parameter descriptions in `train_dataset`, please refer to the [documentation](https://www.mindspore.cn/mindformers/docs/zh-CN/r1.6.0/feature/configuration.html).

  2. `AlpacaInstructDataHandler` is an online processing script developed for the `alpaca` dataset. If you use a different dataset, you need to implement a custom data handler by referring to the [Custom Data Handler](#custom-data-handler) guide.

- Offline Loading

  For offline loading, you need to prepare the JSON files of the `alpaca` dataset. The offline configuration differs from the online configuration only in the following parameters:

  ```yaml
  train_dataset:
    data_loader:
      path: "json"                                # Load the dataset using the load_dataset interface
      data_files: '/path/alpaca_gpt4_data.json'   # File path of the alpaca dataset
  ```

After configuring the dataset loading method, you also need to set `is_dynamic=True` in the model configuration to enable dynamic-shape training for the model.

```yaml
model_config:
  is_dynamic: True
```

Since dynamic shapes may lead to operator compilation caching, it is recommended to set the following environment variables to limit the number of cached compilations when running in a memory-constrained environment. This helps prevent out-of-memory issues:

```shell
export ACLNN_CACHE_LIMIT=10
export MS_DEV_RUNTIME_CONF="aclnn_cache_queue_length:64"
```

- The `ACLNN_CACHE_LIMIT` parameter is described in the [documentation](https://www.hiascend.com/document/detail/zh/canncommercial/800/apiref/envvar/envref_07_0031.html).
- `MS_DEV_RUNTIME_CONF` is a MindSpore parameter for setting the operator cache queue length. The value `64` is the queue length, which defaults to `1024`. It can be adjusted based on the actual environment; setting it too small may affect model training performance.
After completing all the configurations above, you can proceed with dynamic-shape fine-tuning by referring to the documentation of the specific model you are using.

#### Custom Data Handler

Users can define custom data handlers to apply various preprocessing logic to the loaded dataset.

- Handler Parameter Description

  | Parameter Name | Description | Type |
  |----------------|-------------|:----:|
  | type           | Custom data handler name. A custom handler must inherit from `BaseInstructDataHandler`. | str |
  | tokenizer_name | Name of the tokenizer used. | str |
  | tokenizer      | Tokenizer configuration parameters. Can be a dictionary, a string, or a `tokenizer` object. Takes lower priority than `tokenizer_name`. | dict/str |
  | seq_length     | Maximum sequence length, usually the same as the model's sequence length. | int |
  | output_columns | Column names of the processed data returned after preprocessing. | list |
  | prompt_key     | Column name of the data after prompt processing has been applied. | str |

- Development Sample 1

  A custom data handler is usually placed in the `mindformers/dataset/handler` directory and must inherit from the abstract base class `BaseInstructDataHandler`. You need to implement the `format_func` and `tokenize_func` methods, which preprocess each loaded sample. Refer to `alpaca_handler.py`.

  ```python
  @MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
  class XXXInstructDataHandler(BaseInstructDataHandler):

      def format_func(self, example):
          # Custom data format conversion

      def tokenize_func(self, example):
          # Custom tokenization
  ```

  `BaseInstructDataHandler` provides a default implementation of the entry method `handle`, which iterates over each sample for preprocessing. `format_func` implements how to convert the raw data into the desired data format, and `tokenize_func` performs customized tokenization on the formatted data. The input parameter `example` in the sample above is a single data sample.

- Development Sample 2

  If you want to process the whole dataset directly instead of processing each sample in batches, you can implement the entry method `handle` in the custom handler, in which you receive the complete dataset, as shown below:

  ```python
  def handle(self, dataset):
      """data handler"""
      return dataset.rename_columns({"content": "prompt", "summary": "answer"})
  ```

- alpaca Dataset Sample

  Modify the task configuration file [finetune_qwen2_5_0_5b_8k.yaml](https://gitee.com/mindspore/mindformers/blob/r1.6.0/research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml).
  Modify the following parameters:

  ```yaml
  train_dataset: &train_dataset
    input_columns: &input_columns ["input_ids", "labels"]
    data_loader:
      type: CommonDataLoader
      shuffle: True
      split: "train"
      path: "llm-wizard/alpaca-gpt4-data"
      handler:
        - type: AlpacaInstructDataHandler
          tokenizer:
            model_max_length: 131072
            bos_token: null
            eos_token: "<|im_end|>"
            unk_token: null
            pad_token: "<|endoftext|>"
            vocab_file: "/path/vocab.json"    # qwen2.5
            merges_file: "/path/merges.txt"   # qwen2.5
            auto_register: qwen2_5_tokenizer.Qwen2Tokenizer
            type: Qwen2Tokenizer
          seq_length: 8192
          prompt_key: "conversations"
          output_columns: *input_columns
    seed: 0
    num_parallel_workers: 8
    python_multiprocessing: False
    drop_remainder: True
    repeat: 1
    numa_enable: False
    prefetch_size: 1
  ```

  The remaining parameters are described in the "model training configuration" and "model evaluation configuration" sections of the [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/configuration.html).

  Custom data handler:

  ```python
  @MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
  class AlpacaInstructDataHandler(BaseInstructDataHandler):

      def format_func(self, example):
          """format func"""
          source = PROMPT_INPUT.format_map(example) \
              if example.get(self.input_key, "") != "" \
              else PROMPT_NO_INPUT.format_map(example)
          target = example.get(self.output_key)
          formatted_example = [
              {
                  "from": self.user_role,
                  "value": source,
              },
              {
                  "from": self.assistant_role,
                  "value": target,
              },
          ]
          return formatted_example

      def tokenize_func(self, messages):
          """tokenize func"""
          conversation = self.gen_prompt(messages)
          sep = self.template.sep + self.assistant_role + ": "

          # Tokenize conversations
          rounds = conversation.split(self.template.sep2)
          ids = [self.tokenizer.bos_token_id]
          mask = [1]
          for _, rou in enumerate(rounds):
              if rou == "":
                  break
              conv_out = self.tokenizer(rou)
              ids.extend(conv_out['input_ids'][1:])
              mask.extend(conv_out['attention_mask'][1:])
          d = {'input_ids': ids, 'attention_mask': mask}
          # pylint: disable=W0212
          if not self.dynamic:
              d = self.tokenizer._pad(d, max_length=self.seq_length + 1, padding_strategy='max_length')
          input_id = d['input_ids'][:self.seq_length + 1]
          target = np.array(d['input_ids'])
          total_len = int(np.not_equal(target, self.tokenizer.pad_token_id).sum())
          cur_len = 1
          target[:cur_len] = self.ignore_token_id
          for _, rou in enumerate(rounds):
              if rou == "":
                  break
              parts = rou.split(sep)
              if len(parts) != 2:
                  break
              parts[0] += sep
              round_len = len(self.tokenizer(rou)['input_ids']) - 1
              instruction_len = len(self.tokenizer(parts[0])['input_ids']) - 3

              target[cur_len: cur_len + instruction_len] = self.ignore_token_id

              cur_len += round_len

          if self.dynamic:
              return {
                  "input_ids": input_id,
                  "labels": target[:len(input_id)].tolist()
              }

          target[cur_len:] = self.ignore_token_id
          if cur_len < self.seq_length + 1:
              if cur_len != total_len:
                  target[:] = self.ignore_token_id
          else:
              target = target[:self.seq_length + 1]
          label = target.tolist()
          return {
              "input_ids": input_id,
              "labels": label,
          }
  ```

- ADGEN Dataset Sample

  Modify the following parameters:

  ```yaml
  train_dataset: &train_dataset
    data_loader:
      type: CommonDataLoader
      path: "HasturOfficial/adgen"
      split: "train"
      shuffle: True
      handler:
        - type: AdgenInstructDataHandler
      phase: "train"
      version: 3
      column_names: ["prompt", "answer"]
    tokenizer:
      type: ChatGLM3Tokenizer
      vocab_file: "/path/to/tokenizer.model"
    input_columns: ["input_ids", "labels"]
    max_source_length: 1024
    max_target_length: 1023
    ignore_pad_token_for_loss: True
    num_parallel_workers: 8
    python_multiprocessing: False
    drop_remainder: True
    batch_size: 8
    repeat: 1
    numa_enable: False
    prefetch_size: 1
    seed: 0
  ```
  The remaining parameters are described in the "model training configuration" and "model evaluation configuration" sections of the [Configuration File Description](https://www.mindspore.cn/mindformers/docs/en/r1.6.0/feature/configuration.html).

  Custom adgen_handler:

  ```python
  @MindFormerRegister.register(MindFormerModuleType.DATA_HANDLER)
  class AdgenInstructDataHandler(BaseInstructDataHandler):
      """adgen data handler"""
      def handle(self, dataset):
          """data handler"""
          return dataset.rename_columns({"content": "prompt", "summary": "answer"})
  ```

#### Dataset Packing

Configuring `PackingHandler` in `CommonDataLoader` enables packing of the data. During preprocessing, the raw data needs to be processed into `input_ids` and `labels` that can be fed into the model.

- Parameter Description

  | Parameter Name | Description | Type |
  |----------------|-------------|:----:|
  | type           | Fixed as `PackingHandler`. This module packs data. When `packing=pack` or `packing=truncate` is configured in the [dataloader](#dataloader-parameter-description), it performs non-truncating or truncating concatenation of the data, respectively. | str |
  | seq_length     | Maximum sequence length of the data after packing. | int |
  | pad_token      | Token ID used to pad `input_ids` when a packed sample does not reach the maximum length. Default: 0. | int |
  | ignore_token   | Token ID used to pad `labels` when a packed sample does not reach the maximum length. Default: -100. | int |

- Packing Example

  With the configuration below, the `alpaca` dataset can be preprocessed and packed online:

  ```yaml
  train_dataset: &train_dataset
    input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
    construct_args_key: *input_columns
    data_loader:
      type: CommonDataLoader
      shuffle: False
      split: "train"
      path: "llm-wizard/alpaca-gpt4-data"
      packing: pack
      handler:
        - type: AlpacaInstructDataHandler
          tokenizer:
            model_max_length: 131072
            bos_token: null
            eos_token: "<|im_end|>"
            unk_token: null
            pad_token: "<|endoftext|>"
            vocab_file: "/path/vocab.json"    # qwen2.5
            merges_file: "/path/merges.txt"   # qwen2.5
            auto_register: qwen2_5_tokenizer.Qwen2Tokenizer
            type: Qwen2Tokenizer
          seq_length: 8192
          prompt_key: "conversations"
          output_columns: ["input_ids", "labels"]
        - type: PackingHandler
          seq_length: 8192
          output_columns: ["input_ids", "labels", "actual_seq_len"]
      adaptor_config:
        compress_mask: False
    seed: 0
    num_parallel_workers: 8
    python_multiprocessing: False
    drop_remainder: True
    repeat: 1
    numa_enable: False
    prefetch_size: 1
  ```

  Processing the `alpaca` dataset with the above configuration file executes the following steps:

  1. The raw text data is processed into `input_ids` and `labels` using `AlpacaInstructDataHandler` and the `qwen2.5` tokenizer.
  2. `PackingHandler` packs the processed `input_ids` and `labels`, producing concatenated `input_ids` and `labels` up to `seq_length`. The `actual_seq_len` column refers to the sequence length of each sub-sample in the concatenated sample; during training, this parameter is used to generate the corresponding data mask.
  3. If `compress_mask=False` is set in `adaptor_config`, a complete data mask is returned during training; otherwise, `actual_seq_len` is returned.
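To make the packing semantics concrete, here is a small conceptual illustration in plain Python. It is not the `PackingHandler` implementation, and the token IDs are made up; it only shows how two tokenized samples are concatenated and how the sub-sample lengths relate to the packed sequence:

```python
# Conceptual illustration of packing: two tokenized samples become one packed
# sample, and the sub-sample lengths are what an attention-mask generator needs
# to keep the two samples from attending to each other.
sample_a = {"input_ids": [1, 11, 12, 13, 2], "labels": [-100, 11, 12, 13, 2]}
sample_b = {"input_ids": [1, 21, 22, 2], "labels": [-100, 21, 22, 2]}
seq_length, pad_token, ignore_token = 12, 0, -100  # defaults named in the table above

input_ids = sample_a["input_ids"] + sample_b["input_ids"]
labels = sample_a["labels"] + sample_b["labels"]
sub_lengths = [len(sample_a["input_ids"]), len(sample_b["input_ids"])]  # [5, 4]

# Pad the packed sample up to seq_length so every sample has the same static shape.
padding = seq_length - len(input_ids)
input_ids += [pad_token] * padding
labels += [ignore_token] * padding

print(input_ids)    # [1, 11, 12, 13, 2, 1, 21, 22, 2, 0, 0, 0]
print(labels)       # [-100, 11, 12, 13, 2, -100, 21, 22, 2, -100, -100, -100]
print(sub_lengths)  # sub-sample lengths carried by a column such as actual_seq_len
```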
#### Offline Dataset Processing

In addition to online dataset loading and processing, `CommonDataLoader` also supports offline dataset processing and saving. The [datasets_preprocess.py](https://gitee.com/mindspore/mindformers/blob/r1.6.0/toolkit/data_preprocess/huggingface/datasets_preprocess.py) script can be used to process HuggingFace datasets offline and save the result.

- Parameter Description

  | Parameter Name | Description | Type |
  |----------------|-------------|:----:|
  | config         | Configuration file for offline data processing, used in the same way as for online processing. Refer to [dataloader](#dataloader-parameter-description) for details. | str |
  | save_path      | Path where the preprocessed dataset is saved. | str |
  | register_path  | Registration path of the model API, which contains the Python files related to the model, typically the model folder under the `research` directory. | str |

- Usage Example

  You can use the configuration file provided in the [Dataset Packing](#dataset-packing) example and execute the following command:

  ```shell
  python toolkit/data_preprocess/huggingface/datasets_preprocess.py \
    --config data_process.yaml \
    --save_path /path/processed_data \
    --register_path research/qwen2_5
  ```

  To load the saved dataset, modify the YAML configuration as follows:

  ```yaml
  train_dataset: &train_dataset
    input_columns: &input_columns ["input_ids", "labels", "loss_mask", "position_ids", "attention_mask"]
    construct_args_key: *input_columns
    data_loader:
      type: CommonDataLoader
      shuffle: False
      load_func: "load_from_disk"
      path: "/path/processed_data"
      adaptor_config:
        compress_mask: False
  ```

## MindRecord Dataset

MindRecord is an efficient data storage and reading module provided by MindSpore. It reduces disk I/O and network I/O overhead, resulting in a better data loading experience. For a more detailed feature introduction, refer to the [documentation](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/mindspore.mindrecord.html). Here, we only cover how to use MindRecord in MindSpore Transformers model training tasks.

The following example uses `qwen2_5-0.5b` fine-tuning to explain the related functionality.

### Data Preprocessing

1. Download the `alpaca` dataset: [Link](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json)

2. Execute the data processing script to convert the `alpaca` dataset into a dialogue format:

    ```shell
    python research/qwen2/alpaca_converter.py \
      --data_path /path/alpaca_data.json \
      --output_path /path/alpaca-data-messages.json
    ```

    Here, `data_path` is the path where the downloaded `alpaca` dataset is stored, and `output_path` is the save path for the generated dialogue-format data file.

3. Execute the script to convert the dialogue-format data file into the MindRecord format:

    ```shell
    python research/qwen2/qwen2_preprocess.py \
      --dataset_type 'qa' \
      --input_glob /path/alpaca-data-messages.json \
      --vocab_file /path/vocab.json \
      --merges_file /path/merges.txt \
      --seq_length 32768 \
      --output_file /path/alpaca-messages.mindrecord
    ```

    The script parameters are explained as follows:

    - `dataset_type`: Type of data preprocessing. For the alpaca dataset, set this to `qa`.
    - `input_glob`: Path to the dialogue-format data file.
    - `vocab_file`: Path to the `vocab.json` file of the qwen2 model.
    - `merges_file`: Path to the `merges.txt` file of the qwen2 model.
    - `seq_length`: Sequence length for generating the MindRecord data.
    - `output_file`: Save path of the generated MindRecord data.

    > The `vocab_file` and `merges_file` can be obtained from the qwen2 model repository in the HuggingFace community.
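Optionally, you can read a few records back from the generated MindRecord file to confirm the conversion succeeded. A minimal check with MindSpore's `FileReader` (file path taken from the step above) might look like this:

```python
# Optional sanity check: read the first few samples of the generated MindRecord file.
from mindspore.mindrecord import FileReader

reader = FileReader("/path/alpaca-messages.mindrecord")
for i, sample in enumerate(reader.get_next()):
    print({key: type(value) for key, value in sample.items()})
    if i >= 2:  # only inspect the first few samples
        break
reader.close()
```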
### Model Fine-tuning

Following the data preprocessing steps above, you can generate a MindRecord dataset for fine-tuning the `qwen2_5-0.5b` model. The following describes how to use the generated data file to start the model fine-tuning task.

1. Modify the model configuration file

    The `qwen2_5-0.5b` model fine-tuning uses the [finetune_qwen2_5_0_5b_8k.yaml](https://gitee.com/mindspore/mindformers/blob/r1.6.0/research/qwen2_5/finetune_qwen2_5_0_5b_8k.yaml) configuration file. Modify the dataset section as follows:

    ```yaml
    train_dataset: &train_dataset
      data_loader:
        type: MindDataset
        dataset_dir: "/path/alpaca-messages.mindrecord"
        shuffle: True
    ```

    When using a MindRecord dataset in a model training task, the following `data_loader` configurations need to be modified:

    - `type`: Type of the data_loader. Set to `MindDataset` when using MindRecord datasets.
    - `dataset_dir`: Path to the MindRecord data files.
    - `shuffle`: Whether to randomly sample data samples during training.

2. Start Model Fine-tuning

    After modifying the dataset and parallel-related configurations in the model configuration file, refer to the model documentation to launch the fine-tuning task. Here, we take the [Qwen2_5 model documentation](https://gitee.com/mindspore/mindformers/blob/r1.6.0/research/qwen2_5/README.md) as an example.

### Multi-source Datasets

The native MindSpore dataset loading module [MindDataset](https://www.mindspore.cn/docs/zh-CN/r2.7.0rc1/api_python/dataset/mindspore.dataset.MindDataset.html) has performance bottlenecks when loading and sampling multiple MindRecord datasets. Therefore, MindSpore Transformers provides `MultiSourceDataLoader` for efficient loading and sampling across multiple datasets.

The multi-source dataset functionality is mainly enabled by modifying the `data_loader` configuration in the config file. Below is an example:

```yaml
train_dataset: &train_dataset
  data_loader:
    type: MultiSourceDataLoader
    data_source_type: random_access
    shuffle: True
    dataset_ratios: [0.2, 0.8]
    samples_count: 1000
    nums_per_dataset: [2000]
    sub_data_loader_args:
      stage: 'train'
      column_names: ["input_ids", "target_ids", "attention_mask"]
    sub_data_loader:
      - type: MindDataset
        dataset_files: "/path/alpaca-messages.mindrecord"
      - type: MindDataset
        dataset_files: "/path/alpaca-messages.mindrecord"
    load_indices_npz_path: '/path/index.npz'
    save_indices_npz_path: '/path/index.npz'
```

The `shuffle` setting affects two parameters, `shuffle_dataset` and `shuffle_file`:

- `shuffle_dataset` indicates random sampling at the sub-dataset level.
- `shuffle_file` indicates random sampling at the sample level.
The effects of different `shuffle` values are as follows:

| shuffle | shuffle_dataset | shuffle_file |
|---------|:---------------:|:------------:|
| True    | True            | True         |
| False   | False           | False        |
| infile  | False           | True         |
| files   | True            | False        |
| global  | True            | True         |

The other configuration parameters are explained below:

| Parameter             | Description                                                                                                | Type |
|-----------------------|------------------------------------------------------------------------------------------------------------|:----:|
| dataset_ratios        | Sampling ratio of each sub-dataset; the ratios sum to 1                                                      | list |
| samples_count         | Number of sampled data, effective only when `dataset_ratios` is configured                                   | int  |
| nums_per_dataset      | Number of samples per sub-dataset, effective when `dataset_ratios` is not configured                         | list |
| sub_data_loader_args  | Common configuration for each sub-dataset, applied during sub-dataset construction                           | dict |
| sub_data_loader       | Configuration of each sub-dataset, same as the `data_loader` configuration of a single MindRecord dataset    | list |
| load_indices_npz_path | Path to load the data index file                                                                             | str  |
| save_indices_npz_path | Path to save the data index file                                                                             | str  |
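As a rough illustration of how the sampling configuration above plays out, assuming `samples_count` is the total number of drawn samples split across sub-datasets by `dataset_ratios` (refer to `MultiSourceDataLoader` for the authoritative behavior):

```python
# Illustrative only: split the configured total sample count by the sampling ratios.
dataset_ratios = [0.2, 0.8]
samples_count = 1000

per_dataset = [int(samples_count * ratio) for ratio in dataset_ratios]
print(per_dataset)  # [200, 800]
```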