[{"data":1,"prerenderedAt":285},["ShallowReactive",2],{"content-query-jGfLjAVc8b":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":10,"date":11,"cover":12,"type":13,"category":14,"body":15,"_type":279,"_id":280,"_source":281,"_file":282,"_stem":283,"_extension":284},"/technology-blogs/en/1750","en",false,"",[9],"MindSpore Made Easy","Natural Language Processing (NLP) is a translator between machine language and human language for the purpose of human-machine communication. Briefly summarize the NLP process in deep learning.","2022-08-25","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/11/28/478f11adcae644669727d0763bd25bc2.png","technology-blogs","Basics",{"type":16,"children":17,"toc":271},"root",[18,32,42,50,61,69,79,90,98,106,114,127,135,143,153,161,173,181,191,199,207,215,223,231,243,251,263],{"type":19,"tag":20,"props":21,"children":23},"element","h1",{"id":22},"mindspore-made-easy-data-processing-chinese-text-preprocessing",[24,30],{"type":19,"tag":25,"props":26,"children":27},"span",{},[28],{"type":29,"value":9},"text",{"type":29,"value":31}," Data Processing – Chinese Text Preprocessing",{"type":19,"tag":33,"props":34,"children":35},"p",{},[36],{"type":19,"tag":37,"props":38,"children":39},"strong",{},[40],{"type":29,"value":41},"Introduction",{"type":19,"tag":33,"props":43,"children":44},{},[45],{"type":19,"tag":37,"props":46,"children":47},{},[48],{"type":29,"value":49},"Natural language processing (NLP) is a bridge for human-machine communication. 
The following figure shows the NLP process in deep learning (DL).",{"type":19,"tag":33,"props":51,"children":52},{},[53],{"type":19,"tag":37,"props":54,"children":55},{},[56],{"type":19,"tag":57,"props":58,"children":60},"img",{"alt":7,"src":59},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/25/b2b151e4b68e433f929364322f6e5280.png",[],{"type":19,"tag":33,"props":62,"children":63},{},[64],{"type":19,"tag":37,"props":65,"children":66},{},[67],{"type":29,"value":68},"In this blog, I would like to explain how to preprocess a Chinese corpus. The following figure presents the basic steps:",{"type":19,"tag":33,"props":70,"children":71},{},[72],{"type":19,"tag":37,"props":73,"children":74},{},[75],{"type":19,"tag":57,"props":76,"children":78},{"alt":7,"src":77},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/25/aff2f08ef9114b8aa2a0371ac17d991c.png",[],{"type":19,"tag":33,"props":80,"children":81},{},[82],{"type":19,"tag":37,"props":83,"children":84},{},[85],{"type":19,"tag":37,"props":86,"children":87},{},[88],{"type":29,"value":89},"Word segmentation",{"type":19,"tag":33,"props":91,"children":92},{},[93],{"type":19,"tag":37,"props":94,"children":95},{},[96],{"type":29,"value":97},"Word segmentation is the process of splitting long texts such as sentences, paragraphs, and articles into words for subsequent processing and analysis. Machine learning solves complex problems by building models that turn them into mathematical problems, and NLP works the same way. Word segmentation is the first step in giving raw text a structure that mathematical models can then analyze.",{"type":19,"tag":33,"props":99,"children":100},{},[101],{"type":19,"tag":37,"props":102,"children":103},{},[104],{"type":29,"value":105},"It is not appropriate to split text into single characters, because a single character does not carry complete meaning in most cases. 
For example, the single character \"量\" can be part of \"量子(quantum)\" or \"数量(quantity)\". Nor is it ideal to split the text into sentences, because a single sentence carries too much information. Therefore, words are the most appropriate units for segmentation.",{"type":19,"tag":33,"props":107,"children":108},{},[109],{"type":19,"tag":37,"props":110,"children":111},{},[112],{"type":29,"value":113},"Currently, Chinese word segmentation faces three challenges: the lack of a unified standard, word ambiguity, and new words. Typically, we use dictionary-based, statistical, and deep learning-based methods to segment words. The dictionary-based method is fast and inexpensive, but it has poor robustness and unstable performance. Statistical and deep learning-based methods deliver better robustness but are slower and more expensive. Therefore, commonly used tokenizers combine machine learning algorithms with dictionary rules to improve segmentation accuracy and adaptability.",{"type":19,"tag":115,"props":116,"children":118},"h5",{"id":117},"part-of-speech-pos-tagging",[119],{"type":19,"tag":37,"props":120,"children":121},{},[122],{"type":19,"tag":37,"props":123,"children":124},{},[125],{"type":29,"value":126},"Part-of-speech (POS) tagging",{"type":19,"tag":33,"props":128,"children":129},{},[130],{"type":19,"tag":37,"props":131,"children":132},{},[133],{"type":29,"value":134},"Chinese words are classified into content words (for example, lexical verbs, nouns, and pronouns) and function words. POS tagging is the process of marking each word in a sentence with its part of speech. Chinese words lack inflection in word formation, which causes POS disputes in linguistics. 
As a result, there is no unified granularity or tag set for classifying Chinese words, which poses great challenges for Chinese information processing.",{"type":19,"tag":33,"props":136,"children":137},{},[138],{"type":19,"tag":37,"props":139,"children":140},{},[141],{"type":29,"value":142},"POS tagging is usually performed using rules, statistical models, deep learning algorithms, or a combination of rules and statistical models.",{"type":19,"tag":33,"props":144,"children":145},{},[146],{"type":19,"tag":37,"props":147,"children":148},{},[149],{"type":19,"tag":57,"props":150,"children":152},{"alt":7,"src":151},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/25/47d21cd99dd54d17bf7767274d14516f.png",[],{"type":19,"tag":33,"props":154,"children":155},{},[156],{"type":19,"tag":37,"props":157,"children":158},{},[159],{"type":29,"value":160},"Rule-based POS tagging, an early method, reduces word ambiguity by applying rules that developers predefine based on word forms and contexts. However, ever-growing corpora make it difficult to perform POS tagging using rules alone, which led to the introduction of statistical approaches. Statistical approaches focus on word sequences: a word's POS is inferred from the sequence of surrounding words and their tags. Statistical models such as the Hidden Markov Model (HMM) and Conditional Random Field (CRF) can be trained on large-scale tagged corpora (text in which each word is assigned its correct POS tag). The hybrid method combines rules and statistics: it first produces statistical tagging results and then applies rules only to the suspicious ones to eliminate ambiguity. 
The DL-based method treats POS tagging as a sequence tagging problem, using networks such as LSTM+CRF and BiLSTM+CRF.",{"type":19,"tag":115,"props":162,"children":164},{"id":163},"named-entity-recognition-ner",[165],{"type":19,"tag":37,"props":166,"children":167},{},[168],{"type":19,"tag":37,"props":169,"children":170},{},[171],{"type":29,"value":172},"Named entity recognition (NER)",{"type":19,"tag":33,"props":174,"children":175},{},[176],{"type":19,"tag":37,"props":177,"children":178},{},[179],{"type":29,"value":180},"NER is a subtask of information extraction that locates named entities mentioned in unstructured text and classifies them into pre-defined categories such as person names, locations, organizations, and proper nouns. NER has evolved through four stages (see the following figure): early NER was rule-based, but as corpora grew, machine learning algorithms were adopted to classify named entities.",{"type":19,"tag":33,"props":182,"children":183},{},[184],{"type":19,"tag":37,"props":185,"children":186},{},[187],{"type":19,"tag":57,"props":188,"children":190},{"alt":7,"src":189},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/08/25/3c0c9e4c7ca64e0ebca0f02a650d750e.png",[],{"type":19,"tag":33,"props":192,"children":193},{},[194],{"type":19,"tag":37,"props":195,"children":196},{},[197],{"type":29,"value":198},"There are four types of machine learning algorithms for NER:",{"type":19,"tag":33,"props":200,"children":201},{},[202],{"type":19,"tag":37,"props":203,"children":204},{},[205],{"type":29,"value":206},"Supervised learning methods: These methods require a large-scale labeled corpus to train model parameters.",{"type":19,"tag":33,"props":208,"children":209},{},[210],{"type":19,"tag":37,"props":211,"children":212},{},[213],{"type":29,"value":214},"Semi-supervised learning methods: These methods depend less on labeled corpora and use small labeled datasets (seed data) for bootstrap 
learning.",{"type":19,"tag":33,"props":216,"children":217},{},[218],{"type":19,"tag":37,"props":219,"children":220},{},[221],{"type":29,"value":222},"Unsupervised learning methods: These methods use lexical resources (such as WordNet) for context clustering.",{"type":19,"tag":33,"props":224,"children":225},{},[226],{"type":19,"tag":37,"props":227,"children":228},{},[229],{"type":29,"value":230},"Hybrid methods: These methods combine multiple models, or a knowledge base built from statistical methods and summarized rules, to conduct NER.",{"type":19,"tag":115,"props":232,"children":234},{"id":233},"stop-word-removal",[235],{"type":19,"tag":37,"props":236,"children":237},{},[238],{"type":19,"tag":37,"props":239,"children":240},{},[241],{"type":29,"value":242},"Stop word removal",{"type":19,"tag":33,"props":244,"children":245},{},[246],{"type":19,"tag":37,"props":247,"children":248},{},[249],{"type":29,"value":250},"Stop words are words that carry little meaning, such as \"的\", \"那\", and \"之\". They are filtered out during text data processing to remove unnecessary information. Deleting them does not hurt training, and it shortens training time by reducing the dataset size and model complexity. To remove stop words, you need to build a stop word list based on your requirements.",{"type":19,"tag":115,"props":252,"children":254},{"id":253},"summary",[255],{"type":19,"tag":37,"props":256,"children":257},{},[258],{"type":19,"tag":37,"props":259,"children":260},{},[261],{"type":29,"value":262},"Summary",{"type":19,"tag":33,"props":264,"children":265},{},[266],{"type":19,"tag":37,"props":267,"children":268},{},[269],{"type":29,"value":270},"This blog briefly explains the preprocessing procedure for Chinese text datasets, including word segmentation, POS tagging, NER, and stop word removal, along with the purpose and common implementation methods of each step. 
I hope this blog gives you some inspiration for NLP on Chinese text data.",{"title":7,"searchDepth":272,"depth":272,"links":273},4,[274,276,277,278],{"id":117,"depth":275,"text":126},5,{"id":163,"depth":275,"text":172},{"id":233,"depth":275,"text":242},{"id":253,"depth":275,"text":262},"markdown","content:technology-blogs:en:1750.md","content","technology-blogs/en/1750.md","technology-blogs/en/1750","md",1776506104102]