[{"data":1,"prerenderedAt":578},["ShallowReactive",2],{"content-query-6148GCqK7j":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":572,"_id":573,"_source":574,"_file":575,"_stem":576,"_extension":577},"/technology-blogs/en/2872","en",false,"","Project Introduction | Integrating HuggingFace Datasets into MindSpore Through MindNLP","MindNLP has been in development for over a year; along the way it has encountered a variety of obstacles, as well as a series of challenges posed by LLMs.","2023-10-11","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/f9e17fd9a99048d899b69980cdec4d4f.png","technology-blogs",{"type":14,"children":15,"toc":569},"root",[16,24,41,49,54,59,74,79,87,95,100,105,110,115,120,125,130,135,140,145,150,155,160,167,172,177,182,187,195,200,205,210,215,220,225,230,244,249,254,259,264,269,274,279,284,292,299,304,309,316,321,326,333,338,343,351,356,361,366,371,383,388,396,401,408,413,418,428,433,445,450,455,466,471,476,481,488,493,500,505,519,524,532,546,557],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"project-introduction-integrating-huggingface-datasets-into-mindspore-through-mindnlp",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28,34,36],{"type":17,"tag":29,"props":30,"children":31},"strong",{},[32],{"type":23,"value":33},"Author: Lv Yufeng",{"type":23,"value":35}," | ",{"type":17,"tag":29,"props":37,"children":38},{},[39],{"type":23,"value":40},"Source: Zhihu",{"type":17,"tag":25,"props":42,"children":43},{},[44],{"type":17,"tag":29,"props":45,"children":46},{},[47],{"type":23,"value":48},"Abstract",{"type":17,"tag":25,"props":50,"children":51},{},[52],{"type":23,"value":53},"MindNLP has been in development for over a year; along the way it has encountered a variety of obstacles, as well as a series of challenges posed by LLMs. 
As a new NLP framework that relies on MindSpore, MindNLP must treat expanding its ecosystem as a critical consideration.",{"type":17,"tag":25,"props":55,"children":56},{},[57],{"type":23,"value":58},"The recent announcement of PyTorch 2.1+Ascend presents an opportunity for ecosystem integration, which could be an ideal solution.",{"type":17,"tag":25,"props":60,"children":61},{},[62,67,69],{"type":17,"tag":29,"props":63,"children":64},{},[65],{"type":23,"value":66},"01",{"type":23,"value":68}," ",{"type":17,"tag":29,"props":70,"children":71},{},[72],{"type":23,"value":73},"MindNLP Datasets",{"type":17,"tag":25,"props":75,"children":76},{},[77],{"type":23,"value":78},"MindNLP was designed to fully leverage the diverse features of MindSpore, including functional programming, dynamic graphs, and the data processing engine. Here, let's take a closer look at the data processing engine.",{"type":17,"tag":25,"props":80,"children":81},{},[82],{"type":17,"tag":29,"props":83,"children":84},{},[85],{"type":23,"value":86},"1.1 MindSpore Data Processing Engine",{"type":17,"tag":25,"props":88,"children":89},{},[90],{"type":17,"tag":91,"props":92,"children":94},"img",{"alt":7,"src":93},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/00b3e5c9aa2e422ebee2ecdd46cccb35.png",[],{"type":17,"tag":25,"props":96,"children":97},{},[98],{"type":23,"value":99},"Figure 1 MindSpore data engine pipeline",{"type":17,"tag":25,"props":101,"children":102},{},[103],{"type":23,"value":104},"As shown in the figure, the data engine is designed in pipeline mode[1], which is similar to TensorFlow and PyTorch Map-style Datasets and is mainly used for high-performance data processing.",{"type":17,"tag":25,"props":106,"children":107},{},[108],{"type":23,"value":109},"When people are still working with small models and datasets, data preprocessing is usually done offline. This allows for flexible processing with Python, and a server's large memory easily accommodates the data. 
All data is usually loaded at once, and multiple processes are launched to handle it. After the processing is complete, the data is loaded as tensors and fed to the network for training. But even so, it may take hours or even days to preprocess a slightly larger dataset.",{"type":17,"tag":25,"props":111,"children":112},{},[113],{"type":23,"value":114},"The pipeline mode focuses on the following capabilities:",{"type":17,"tag":25,"props":116,"children":117},{},[118],{"type":23,"value":119},"1. On-demand loading",{"type":17,"tag":25,"props":121,"children":122},{},[123],{"type":23,"value":124},"2. Asynchronous processing",{"type":17,"tag":25,"props":126,"children":127},{},[128],{"type":23,"value":129},"3. Parallelism",{"type":17,"tag":25,"props":131,"children":132},{},[133],{"type":23,"value":134},"Points 1 and 2 deserve a detailed look. Take text data as an example. If we use the simplest Python loading and preprocessing logic (as in a PyTorch DataLoader), the overall execution process is as follows:",{"type":17,"tag":25,"props":136,"children":137},{},[138],{"type":23,"value":139},"Load the full dataset to memory -> Traverse and preprocess all data -> Pack single data records in batches -> Return each batch cyclically",{"type":17,"tag":25,"props":141,"children":142},{},[143],{"type":23,"value":144},"The pipeline loading mode is as follows:",{"type":17,"tag":25,"props":146,"children":147},{},[148],{"type":23,"value":149},"A pointer points to the beginning of the dataset file. Every time a batch of data is obtained, the pointer moves forward by the batch size, until all data has been read.",{"type":17,"tag":25,"props":151,"children":152},{},[153],{"type":23,"value":154},"Obviously, obtaining only a moderate amount of data each time significantly reduces memory consumption, and the intermediate variables generated during preprocessing also stay small. 
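As a rough conceptual sketch (plain Python, not MindSpore's actual engine), the pointer-style loading described above behaves like a generator that holds only one batch in memory at a time:

```python
def batched_loader(path, batch_size):
    """Yield batches lazily; only `batch_size` records live in memory at once."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file cursor plays the role of the moving pointer
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # flush the final partial batch
        yield batch
```

Each step of iteration advances the cursor by one batch, so memory use stays flat regardless of the dataset size.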
Moreover, this mode turns offline data preprocessing into online preprocessing.",{"type":17,"tag":25,"props":156,"children":157},{},[158],{"type":23,"value":159},"Obtain data records of a batch size and load them -> Traverse and preprocess the loaded data -> Return a batch",{"type":17,"tag":25,"props":161,"children":162},{},[163],{"type":17,"tag":91,"props":164,"children":166},{"alt":7,"src":165},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/a7fb9e33b5284dcdbfac999d8881c429.png",[],{"type":17,"tag":25,"props":168,"children":169},{},[170],{"type":23,"value":171},"Figure 2 Data processing and network computing pipeline",{"type":17,"tag":25,"props":173,"children":174},{},[175],{"type":23,"value":176},"The data processing pipeline continuously processes data and sends the processed data to the device cache. After a step ends, the data for the next step is read directly from the device cache. Thus, data is being processed even while the network is training.",{"type":17,"tag":25,"props":178,"children":179},{},[180],{"type":23,"value":181},"Of course, this approach is also a double-edged sword. It improves memory utilization and performance, but introduces usability issues. The map operation in figure 1 is asynchronous: configuring a data preprocessing operation does not immediately execute it and return results. This is unfriendly for data that requires fine-grained control or has many special cases, and exceptions can easily surface unexpectedly during pipeline execution.",{"type":17,"tag":25,"props":183,"children":184},{},[185],{"type":23,"value":186},"However, LLMs change this situation. All tasks become next-token prediction, and all data processing becomes \"cleaning + tokenizing\" operations. 
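Incidentally, the configure-now-execute-later behavior of map described above can be illustrated with plain Python generators (a conceptual sketch, not MindSpore's implementation):

```python
def lazy_map(fn, iterable):
    """Register `fn` over `iterable` without executing it; work happens on iteration."""
    for item in iterable:
        yield fn(item)

# Configure two chained operations; no preprocessing has run yet.
pipeline = lazy_map(str.lower, lazy_map(str.strip, [" Hello ", " WORLD "]))
print(list(pipeline))  # only now do both steps execute, record by record
```

Note that if `fn` raises, the error surfaces only mid-iteration, which is exactly the debugging pain point described above.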
For service scenarios where data arrives as high-volume streams, pipelines naturally become the optimal solution (which is probably why both PyTorch and HuggingFace Datasets use them).",{"type":17,"tag":25,"props":188,"children":189},{},[190],{"type":17,"tag":29,"props":191,"children":192},{},[193],{"type":23,"value":194},"1.2 MindNLP Dataset Support",{"type":17,"tag":25,"props":196,"children":197},{},[198],{"type":23,"value":199},"As previously mentioned, MindNLP depends entirely on the MindSpore data processing engine to process data, and has added support for 20 datasets in the past year, on par with the number supported by torchtext. Although these datasets provide a solid foundation for NLP tasks, the range of real-world datasets such tasks require is considerably larger, making continuous adaptation to an open domain a challenge.",{"type":17,"tag":25,"props":201,"children":202},{},[203],{"type":23,"value":204},"In addition, MindSpore Datasets introduce some problems of their own, mainly in the design of the three types of loaders, namely:",{"type":17,"tag":25,"props":206,"children":207},{},[208],{"type":23,"value":209},"1. Specific dataset loaders, such as IMDBDataset and EnWik9Dataset",{"type":17,"tag":25,"props":211,"children":212},{},[213],{"type":23,"value":214},"2. Abstract text loader: TextFileDataset",{"type":17,"tag":25,"props":216,"children":217},{},[218],{"type":23,"value":219},"3. User-defined loader: GeneratorDataset",{"type":17,"tag":25,"props":221,"children":222},{},[223],{"type":23,"value":224},"Using type 1 requires continuous adaptation. Using type 2 requires preprocessing data in formats such as XML and JSON before loading, which goes against the highly efficient design philosophy of pipelines and still requires a significant amount of manual adaptation work. Using type 3 means going back to the full loading of the first step in figure 1, which is clearly not what we want. 
However, to quickly support datasets, we still choose the 1+3 mode.",{"type":17,"tag":25,"props":226,"children":227},{},[228],{"type":23,"value":229},"This is not efficient and needs separate adaptation each time. So, is there any one-stop solution?",{"type":17,"tag":25,"props":231,"children":232},{},[233,238,239],{"type":17,"tag":29,"props":234,"children":235},{},[236],{"type":23,"value":237},"02",{"type":23,"value":68},{"type":17,"tag":29,"props":240,"children":241},{},[242],{"type":23,"value":243},"HuggingFace Ecosystem Integration",{"type":17,"tag":25,"props":245,"children":246},{},[247],{"type":23,"value":248},"MindNLP dataset loading boils down to two main objectives:",{"type":17,"tag":25,"props":250,"children":251},{},[252],{"type":23,"value":253},"1. Support for a large number of datasets without adaptation",{"type":17,"tag":25,"props":255,"children":256},{},[257],{"type":23,"value":258},"2. Use of efficient pipelines",{"type":17,"tag":25,"props":260,"children":261},{},[262],{"type":23,"value":263},"These can be achieved using the power of ecosystems. HuggingFace has developed libraries for various AI training processes outside of the Transformers repository. After years of accumulation, HuggingFace Datasets now support a large number of datasets. With HuggingFace's hosting services, many new datasets can be directly published on the Datasets hub. Now that we have achieved objective 1 using Datasets, let's move on to objective 2.",{"type":17,"tag":25,"props":265,"children":266},{},[267],{"type":23,"value":268},"Most people who use MindSpore Datasets choose either of the following processing methods:",{"type":17,"tag":25,"props":270,"children":271},{},[272],{"type":23,"value":273},"1. Preprocess each dataset into MindRecord in offline mode and use MindDataset to load it.",{"type":17,"tag":25,"props":275,"children":276},{},[277],{"type":23,"value":278},"2. 
Load the dataset into memory and then use the user-defined loader (GeneratorDataset) to load it.",{"type":17,"tag":25,"props":280,"children":281},{},[282],{"type":23,"value":283},"Method 1 is clearly not feasible for online preprocessing. However, the idea of integrating HuggingFace Datasets is very simple. The following describes two approaches.",{"type":17,"tag":25,"props":285,"children":286},{},[287],{"type":17,"tag":29,"props":288,"children":289},{},[290],{"type":23,"value":291},"2.1 Integrating Dataset Download",{"type":17,"tag":25,"props":293,"children":294},{},[295],{"type":17,"tag":91,"props":296,"children":298},{"alt":7,"src":297},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/9b8cabec93fa47e689538b6ab4221f80.png",[],{"type":17,"tag":25,"props":300,"children":301},{},[302],{"type":23,"value":303},"Figure 3 HuggingFace Dataset, with IMDB as an example",{"type":17,"tag":25,"props":305,"children":306},{},[307],{"type":23,"value":308},"Figure 3 is a screenshot of the IMDB page, which shows that the data is already well structured. We can download it directly using HuggingFace Datasets and use the abstract text loader TextFileDataset to read each processed file, making it ready for use.",{"type":17,"tag":25,"props":310,"children":311},{},[312],{"type":17,"tag":91,"props":313,"children":315},{"alt":7,"src":314},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/9a9802baa59c4d86bdca519b6e8262ee.png",[],{"type":17,"tag":25,"props":317,"children":318},{},[319],{"type":23,"value":320},"Figure 4 TextFileDataset interface",{"type":17,"tag":25,"props":322,"children":323},{},[324],{"type":23,"value":325},"All you need to do is provide the file path or path list, and TextFileDataset will load the corresponding files automatically. 
However, in practice a problem arises: HuggingFace Datasets store data as Apache Arrow files.",{"type":17,"tag":25,"props":327,"children":328},{},[329],{"type":17,"tag":91,"props":330,"children":332},{"alt":7,"src":331},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/1de68e4e283342d68aa6ec5bdaf49727.png",[],{"type":17,"tag":25,"props":334,"children":335},{},[336],{"type":23,"value":337},"Figure 5 Arrow format of HuggingFace Datasets",{"type":17,"tag":25,"props":339,"children":340},{},[341],{"type":23,"value":342},"Apache Arrow[2] is a language-independent, cross-system standard format for high-performance data exchange, with support for zero-copy reads. This means that MindSpore Datasets cannot simply read the files directly. The PyArrow library could be used, but it introduces complexity and takes us back to preprocessing data before loading. As it turns out, however, the characteristics of Arrow files are actually a good fit for MindSpore Datasets.",{"type":17,"tag":25,"props":344,"children":345},{},[346],{"type":17,"tag":29,"props":347,"children":348},{},[349],{"type":23,"value":350},"2.2 Advantages of the Arrow Format",{"type":17,"tag":25,"props":352,"children":353},{},[354],{"type":23,"value":355},"The Apache Arrow format used by HuggingFace has the following advantages:",{"type":17,"tag":25,"props":357,"children":358},{},[359],{"type":23,"value":360},"1. Arrow's standard format allows zero-copy reads, which removes virtually all serialization overhead.",{"type":17,"tag":25,"props":362,"children":363},{},[364],{"type":23,"value":365},"2. Arrow is column-oriented, so it is faster at querying and processing slices or columns of data.",{"type":17,"tag":25,"props":367,"children":368},{},[369],{"type":23,"value":370},"3. 
Arrow treats every dataset as a memory-mapped file, allowing access to a portion of a large file without loading the entire file into memory, and enabling memory sharing among multiple processes. Memory mapping allows the use of large datasets on machines with relatively small memory. For example, loading the entire English Wikipedia dataset requires only a few MB of RAM.",{"type":17,"tag":25,"props":372,"children":373},{},[374,376,381],{"type":23,"value":375},"4. When loading data, you can set the ",{"type":17,"tag":29,"props":377,"children":378},{},[379],{"type":23,"value":380},"streaming",{"type":23,"value":382}," parameter to enable streaming mode.",{"type":17,"tag":25,"props":384,"children":385},{},[386],{"type":23,"value":387},"The MindSpore data engine has been designed for on-demand loading and online processing, making it a perfect match for HuggingFace Datasets.",{"type":17,"tag":25,"props":389,"children":390},{},[391],{"type":17,"tag":29,"props":392,"children":393},{},[394],{"type":23,"value":395},"2.3 MindNLP Adaptation",{"type":17,"tag":25,"props":397,"children":398},{},[399],{"type":23,"value":400},"Since the Arrow files loaded by HuggingFace Datasets are memory-mapped, there is no need to copy them into memory, and thanks to indexed access the full data never needs to be loaded. They can be used directly as the source data and fed into GeneratorDataset.",{"type":17,"tag":25,"props":402,"children":403},{},[404],{"type":17,"tag":91,"props":405,"children":407},{"alt":7,"src":406},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/7c2e7135b54e48c580b98d09a2288d79.png",[],{"type":17,"tag":25,"props":409,"children":410},{},[411],{"type":23,"value":412},"Figure 6 GeneratorDataset interface",{"type":17,"tag":25,"props":414,"children":415},{},[416],{"type":23,"value":417},"Construction of GeneratorDataset mainly requires source data and column names. 
Looking back at figure 3, it can be seen that HuggingFace Datasets have already named all columns. The core code is as follows:",{"type":17,"tag":419,"props":420,"children":422},"pre",{"code":421},"from mindspore.dataset import GeneratorDataset\nfrom datasets import load_dataset as hf_load\n\n......\ndef load_dataset(...):\n    ds_ret = hf_load(path,\n                     name=name,\n                     data_dir=data_dir,\n                     data_files=data_files,\n                     split=split,\n                     cache_dir=cache_dir,\n                     features=features,\n                     download_config=download_config,\n                     download_mode=download_mode,\n                     verification_mode=verification_mode,\n                     keep_in_memory=keep_in_memory,\n                     save_infos=save_infos,\n                     revision=revision,\n                     streaming=streaming,\n                     num_proc=num_proc,\n                     storage_options=storage_options,\n                     )\n    if isinstance(ds_ret, (list, tuple)):\n        ds_dict = dict(zip(split, ds_ret))\n    else:\n        ds_dict = ds_ret\n\n    datasets_dict = {}\n\n    for key, raw_ds in ds_dict.items():\n        column_names = list(raw_ds.features.keys())\n        source = TransferDataset(raw_ds, column_names) if isinstance(raw_ds, Dataset) \\\n            else TransferIterableDataset(raw_ds, column_names)\n        ms_ds = GeneratorDataset(\n            source=source,\n            column_names=column_names,\n            shuffle=shuffle,\n            num_parallel_workers=num_proc if num_proc else 1)\n        datasets_dict[key] = ms_ds\n\n    if len(datasets_dict) == 1:\n        return datasets_dict.popitem()[1]\n    return datasets_dict\n",[423],{"type":17,"tag":424,"props":425,"children":426},"code",{"__ignoreMap":7},[427],{"type":23,"value":421},{"type":17,"tag":25,"props":429,"children":430},{},[431],{"type":23,"value":432},"The 
procedure is as follows:",{"type":17,"tag":25,"props":434,"children":435},{},[436,438,443],{"type":23,"value":437},"1. Load a dataset using ",{"type":17,"tag":29,"props":439,"children":440},{},[441],{"type":23,"value":442},"load_dataset",{"type":23,"value":444}," of HuggingFace Datasets.",{"type":17,"tag":25,"props":446,"children":447},{},[448],{"type":23,"value":449},"2. Wrap the result in a transition class.",{"type":17,"tag":25,"props":451,"children":452},{},[453],{"type":23,"value":454},"3. Pass the wrapped dataset to GeneratorDataset.",{"type":17,"tag":25,"props":456,"children":457},{},[458,460,464],{"type":23,"value":459},"For ease of use, the parameters of the ",{"type":17,"tag":29,"props":461,"children":462},{},[463],{"type":23,"value":442},{"type":23,"value":465}," interface are kept consistent with those of HuggingFace Datasets. In this way, a dataset object or dictionary that the MindSpore data engine can process is returned, enabling seamless interconnection with MindSpore's data processing capabilities.",{"type":17,"tag":25,"props":467,"children":468},{},[469],{"type":23,"value":470},"The following briefly describes the construction of the transition classes.",{"type":17,"tag":25,"props":472,"children":473},{},[474],{"type":23,"value":475},"There are two types of dataset objects in HuggingFace Datasets: Dataset and IterableDataset. Which type you choose or create depends on the size of the dataset. In general, an IterableDataset is ideal for big datasets (think hundreds of GBs) due to its lazy behavior and speed advantages, while Dataset is great for everything else; the HuggingFace documentation compares the two in detail to help you pick the right one[3].",{"type":17,"tag":25,"props":477,"children":478},{},[479],{"type":23,"value":480},"Iterating over either type of dataset returns a dictionary per record, which the MindSpore data processing engine does not support. 
As a result, two transition classes are created to read data out of the dictionary, without adding any extra operations. For the Dataset, construct a TransferDataset class and read data in its __getitem__ method.",{"type":17,"tag":25,"props":482,"children":483},{},[484],{"type":17,"tag":91,"props":485,"children":487},{"alt":7,"src":486},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/1bc3c4294954494092199883afc3f507.png",[],{"type":17,"tag":25,"props":489,"children":490},{},[491],{"type":23,"value":492},"For the streaming IterableDataset, construct a TransferIterableDataset as an iterable object and read data in its __iter__ method.",{"type":17,"tag":25,"props":494,"children":495},{},[496],{"type":17,"tag":91,"props":497,"children":499},{"alt":7,"src":498},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/15/7ef23aef5209409a90d61c26d341438c.png",[],{"type":17,"tag":25,"props":501,"children":502},{},[503],{"type":23,"value":504},"Thus, a minimal-effort solution that fully integrates HuggingFace Datasets is complete. 
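At their core, the two transition classes are just dict-to-tuple adapters. A minimal sketch (a simplified stand-in, not MindNLP's actual implementation; here any list of dicts can play the role of the HuggingFace dataset):

```python
class TransferDataset:
    """Adapt a map-style dataset that returns dicts into tuple-style random access."""
    def __init__(self, dataset, column_names):
        self.dataset = dataset
        self.column_names = column_names

    def __getitem__(self, index):
        row = self.dataset[index]  # a dict such as {"text": ..., "label": ...}
        return tuple(row[name] for name in self.column_names)

    def __len__(self):
        return len(self.dataset)


class TransferIterableDataset:
    """Adapt a streaming dataset: same dict-to-tuple conversion, but via __iter__."""
    def __init__(self, dataset, column_names):
        self.dataset = dataset
        self.column_names = column_names

    def __iter__(self):
        for row in self.dataset:
            yield tuple(row[name] for name in self.column_names)
```

GeneratorDataset can then consume either wrapper, since it accepts both random-access and iterable sources.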
Compared with PaddleNLP's integration strategy, this one is even simpler and more elegant.",{"type":17,"tag":25,"props":506,"children":507},{},[508,513,514],{"type":17,"tag":29,"props":509,"children":510},{},[511],{"type":23,"value":512},"03",{"type":23,"value":68},{"type":17,"tag":29,"props":515,"children":516},{},[517],{"type":23,"value":518},"Conclusions",{"type":17,"tag":25,"props":520,"children":521},{},[522],{"type":23,"value":523},"This hands-on walkthrough of integrating HuggingFace Datasets into MindSpore should deepen our understanding of MindNLP and also contribute to the growth of the MindSpore ecosystem.",{"type":17,"tag":25,"props":525,"children":526},{},[527],{"type":17,"tag":29,"props":528,"children":529},{},[530],{"type":23,"value":531},"References",{"type":17,"tag":25,"props":533,"children":534},{},[535,537],{"type":23,"value":536},"[1]",{"type":17,"tag":538,"props":539,"children":543},"a",{"href":540,"rel":541},"https://www.mindspore.cn/docs/en/r2.1/design/data_engine.html",[542],"nofollow",[544],{"type":23,"value":545},"https://www.mindspore.cn/docs/en/r2.1/design/data_engine.html",{"type":17,"tag":25,"props":547,"children":548},{},[549,551],{"type":23,"value":550},"[2]",{"type":17,"tag":538,"props":552,"children":555},{"href":553,"rel":554},"https://arrow.apache.org/",[542],[556],{"type":23,"value":553},{"type":17,"tag":25,"props":558,"children":559},{},[560,562],{"type":23,"value":561},"[3]",{"type":17,"tag":538,"props":563,"children":566},{"href":564,"rel":565},"https://huggingface.co/docs/datasets/about_mapstyle_vs_iterable",[542],[567],{"type":23,"value":568},"https://huggingface.co/docs/datasets/about_mapstyle_vs_iterable",{"title":7,"searchDepth":570,"depth":570,"links":571},4,[],"markdown","content:technology-blogs:en:2872.md","content","technology-blogs/en/2872.md","technology-blogs/en/2872","md",1776506107842]