# Practice Case: Interconnecting MindSpore Transformers with General Evaluation Tools

[![View Source on AtomGit](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://atomgit.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/example/model_test/model_test.md)

This article is contributed by Killjoy, chen-xialei, fuyao-15989607593, laozhuang, and oacjiewen.

During LLM development, after training or fine-tuning a model with MindSpore Transformers, users often rely on general evaluation tools to evaluate the model's capabilities on custom datasets. This document describes how to interconnect a deployed MindSpore Transformers model with such tools. It covers model deployment with vLLM-MindSpore and capability evaluation based on two general evaluation frameworks: `lm-eval` and `opencompass`. This practice case helps you understand how to use general evaluation tools to evaluate models trained or fine-tuned with MindSpore Transformers.

## 1. Environment Preparations

Install the following software for model deployment and evaluation.

| Dependent Software | Version |
|--------------------|---------|
| MindSpore Transformers | 1.6.0 |
| vLLM-MindSpore | 0.5.0 |
| lm-eval | 0.4.9 |
| opencompass | 0.5.0 |

### 1.1 MindSpore Transformers

Set up the environment by referring to the [MindSpore Transformers Installation Guidelines](https://www.mindspore.cn/mindformers/docs/en/master/installation.html).

### 1.2 vLLM-MindSpore

Run the following commands to pull the vLLM-MindSpore plugin code repository and build an image:

```bash
git clone https://atomgit.com/mindspore/vllm-mindspore.git
cd vllm-mindspore
bash build_image.sh
```

> If the image build times out, you can add `ENV UV_HTTP_TIMEOUT=3000` to the `build_image.sh` script and replace the image repository with a faster one in the `install_depend_pkgs.sh` script.

Create a Docker container based on your server configuration. For details, see [Docker Installation](https://www.mindspore.cn/vllm_mindspore/docs/en/master/getting_started/installation/installation.html).

### 1.3 lm-eval

**Note: It is strongly recommended that you create a separate conda environment with Python 3.10 or later to avoid compatibility issues.**

Install lm-eval from source; do not install it directly with `pip install lm-eval`.

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```

If the error message `Error: Please make sure the libxml2 and libxslt development packages are installed` is displayed, run the following command:

```bash
conda install -c conda-forge libxml2 libxslt
```

To avoid possible version compatibility issues, install specific versions of the `datasets` and `transformers` libraries:

```bash
pip install datasets==2.18.0
pip install transformers==4.35.2
```

### 1.4 opencompass

```bash
pip install -U opencompass
```

During the installation, the following error may be reported:

```bash
AttributeError: module 'inspect' has no attribute 'getargspec'. Did you mean: 'getargs'?
```

Solution: install from the [source code](https://github.com/open-compass/opencompass) instead, after deleting `pyext` and `rouge` from the `requirements/runtime.txt` file.
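Before moving on to deployment, you can optionally confirm that the installed versions match the table in Section 1. The following is a minimal sketch using standard package metadata; the distribution names are assumptions and may differ in your environment:

```python
# Hypothetical version check; the distribution names below are assumptions
# and may not match what your environment actually registers.
import importlib.metadata

for dist in ("mindformers", "vllm-mindspore", "lm_eval", "opencompass"):
    try:
        print(dist, importlib.metadata.version(dist))
    except importlib.metadata.PackageNotFoundError:
        print(dist, "not found in this environment")
```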
## 2. Model Deployment

Set the environment variable so that vLLM-MindSpore uses the MindSpore Transformers model backend:

```bash
export VLLM_MS_MODEL_BACKEND=MindFormers
```

Deploy the model:

```bash
python3 -m vllm_mindspore.entrypoints vllm.entrypoints.openai.api_server \
    --model MODEL_PATH \
    --port YOUR_PORT \
    --host 0.0.0.0 \
    --served-model-name YOUR_MODEL_NAME
```

## 3. lm-eval Usage for Evaluation

lm-eval is a large-scale comprehensive evaluation framework. It supports many general-domain test sets (such as MMLU and CEval) as well as convenient tests on custom data.

### 3.1 Processing the Dataset

> This step is required only for custom datasets. When testing a general test set, directly follow the [official tutorial](https://github.com/EleutherAI/lm-evaluation-harness).

Assume that there is a custom dataset on the local host, stored as a `CSV` file named `output_filtered.csv`. Each record is a single-choice question with six attributes: a question, four options, and an answer. The file has no header row. Run the following code to convert the `CSV` file into the `Dataset` format that is suitable for model processing:

```python
import os

import pandas as pd
from datasets import Dataset, DatasetDict


def convert_csv_to_parquet_dataset(csv_path, output_dir):
    """
    Convert a headerless CSV file into a Parquet dataset and register it as the validation split.

    Parameters:
        csv_path: Path of the input CSV file (no header; columns in the order of
            question, options A, B, C, and D, and answer).
        output_dir: Output directory (saved in the Hugging Face dataset format).
    """
    # 1. Read the CSV file (without a header).
    print(f"Reading the CSV file: {csv_path}")
    df = pd.read_csv(csv_path, header=None)

    # 2. Add standard column names.
    df.columns = ["question", "A", "B", "C", "D", "answer"]
    print(f"Found {len(df)} records.")

    # 3. Convert to the Hugging Face dataset format.
    dataset = Dataset.from_pandas(df)

    # 4. Create a DatasetDict and register the data as the validation split.
    dataset_dict = DatasetDict({"validation": dataset})

    # 5. Create the output directory.
    os.makedirs(output_dir, exist_ok=True)

    # 6. Save the complete dataset (in the Hugging Face format).
    print(f"Saving the dataset to: {output_dir}")
    dataset_dict.save_to_disk(output_dir)

    # 7. (Optional) Save the validation split as a separate Parquet file.
    validation_parquet_path = os.path.join(output_dir, "validation.parquet")
    dataset_dict["validation"].to_parquet(validation_parquet_path)
    print(f"Parquet file saved separately: {validation_parquet_path}")

    return dataset_dict


# Example usage.
if __name__ == "__main__":
    # Input and output configuration.
    input_csv = "output_filtered.csv"  # Replace with the path of your CSV file.
    output_dir = "YOUR_OUTPUT_PATH"    # Output directory.

    # Execute the conversion.
    dataset = convert_csv_to_parquet_dataset(input_csv, output_dir)

    # Print the verification information.
    print("\nVerification of the conversion result:")
    print(f"Dataset structure: {dataset}")
    print(f"Number of validation split samples: {len(dataset['validation'])}")
    print(f"Example of the first data record: {dataset['validation'][0]}")
```

In this way, the `CSV` file is converted into the `Dataset` format.
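Before wiring the dataset into lm-eval, it is worth confirming that the saved dataset loads back correctly. A minimal sketch, assuming the `YOUR_OUTPUT_PATH` directory written by the conversion script above:

```python
# Reload the converted dataset and inspect one record.
from datasets import load_from_disk

dataset = load_from_disk("YOUR_OUTPUT_PATH")  # directory written by the conversion script
print(dataset)                     # expect a DatasetDict with a single "validation" split
print(dataset["validation"][0])    # one record with question/A/B/C/D/answer fields
```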
### 3.2 Creating a Dataset Configuration File

Create a folder named `YOUR_DATASET_NAME` under `/lm-evaluation-harness/lm_eval/tasks` and create a `YOUR_DATASET_NAME.yaml` file in the folder. The content is as follows:

```yaml
task: YOUR_DATASET_NAME
dataset_path: YOUR_DATASET_PATH_FOLDER
test_split: validation
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: "{{['A', 'B', 'C', 'D'].index(answer)}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
```

For more creation methods, see the [yaml](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/cmmlu/_default_template_yaml) template of CMMLU in the **tasks** folder.

### 3.3 Testing Precision

Run lm-eval against the OpenAI-compatible endpoint deployed in Section 2:

```bash
lm_eval --model local-completions \
    --tasks YOUR_DATASET_NAME \
    --output_path path/to/save/output \
    --log_samples \
    --model_args '{
        "model": "your model name",
        "base_url": "http://127.0.0.1:port/v1/completions",
        "tokenizer": "model path",
        "config": "model path",
        "use_fast_tokenizer": true,
        "num_concurrent": 1,
        "max_retries": 3,
        "tokenized_requests": false
    }'
```

## 4. OpenCompass Usage for Evaluation

### 4.1 Preparing a Dataset

Download the dataset package and decompress it to the root directory of OpenCompass:

```bash
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```

### 4.2 Setting the config File and Running Script

For details about how to set the **config** file of the model, see the [OpenCompass accelerator guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/accelerator_intro.html). Change `path` to the name of the deployed model, change `openai_api_base` to the `url` of the deployed model, and set the model's `tokenizer_path`. You can increase `batch_size` for acceleration.

Generally, you do not need to write a dataset configuration yourself. You can obtain the recommended configuration from the [same guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/accelerator_intro.html) or find a suitable configuration under each dataset's **config** path. For example, for the BIG-Bench Hard (BBH) dataset, `opencompass/opencompass/configs/datasets/bbh/` contains `bbh_gen_ee62e9.py` and `bbh_0shot_nocot_academic_gen.py`, which are the few-shot and zero-shot configurations, respectively. Select one as required.

Write the running script by referring to [eval_api_demo.py](https://github.com/open-compass/opencompass/blob/main/examples/eval_api_demo.py), importing the model configurations and the datasets to be tested; a sketch of such a script is shown at the end of this document.

**Possible error**:

```bash
Traceback (most recent call last):
...
/mmengine/config/lazy.py", line 205, in __call__
    raise RuntimeError()
RuntimeError
```

Solution: in the dataset configuration containing the line `with open(os.path.join(hard_coded_path, 'lib_prompt', f'{_name}.txt'), 'r') as f:`, hard-code the prompt file path, for example:

```python
hard_coded_path = '/path/to/datasets/bbh' \
    + '/lib_prompt/' \
    + f'{_name}.txt'
```

### 4.3 Starting the Evaluation

Run the following command to start the evaluation:

```bash
opencompass /path/to/your/scripts
```

If additional parameter settings are required, refer to the [OpenCompass documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/accelerator_intro.html) for additional configurations.
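For orientation, the following is a minimal sketch of what such a running script can look like for the model deployed in Section 2. It is modeled on `eval_api_demo.py`; the dataset import path and the exact `OpenAISDK` parameters are assumptions that should be checked against your installed OpenCompass version:

```python
# Minimal OpenCompass running-script sketch (not the official demo).
# The dataset import path and the OpenAISDK parameters below are assumptions;
# verify them against your installed OpenCompass version.
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    # Example dataset configuration shipped with OpenCompass (assumed path).
    from opencompass.configs.datasets.bbh.bbh_gen import bbh_datasets

datasets = bbh_datasets

models = [
    dict(
        abbr='my-deployed-model',
        type=OpenAISDK,
        path='YOUR_MODEL_NAME',                      # served model name from Section 2
        key='EMPTY',                                 # local server, no real API key needed
        openai_api_base='http://127.0.0.1:PORT/v1',  # url of the deployed model
        tokenizer_path='/path/to/model',             # local tokenizer path
        batch_size=8,                                # increase for acceleration
        max_out_len=2048,
    )
]
```

Save the script and start it with `opencompass /path/to/your/scripts` as described in Section 4.3.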