[{"data":1,"prerenderedAt":421},["ShallowReactive",2],{"content-query-5WAYLBFRIp":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"category":13,"body":14,"_type":415,"_id":416,"_source":417,"_file":418,"_stem":419,"_extension":420},"/technology-blogs/en/2563","en",false,"","AI Design Patterns | Implementing Feature Hashing Using MindSpore","The feature hashing pattern can be applied in certain scenarios, but it may degrade the model accuracy.","2023-03-13","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/06/16/fa6ff9c68c2e421ca73d63900b5a7bdf.png","technology-blogs","Practices",{"type":15,"children":16,"toc":412},"root",[17,25,31,36,41,46,54,63,68,76,81,91,96,104,109,114,122,127,135,140,148,153,161,166,171,176,181,186,194,201,206,211,216,221,229,241,249,254,262,270,275,280,285,290,298,311,318,326,336,352,367,382,397],{"type":18,"tag":19,"props":20,"children":22},"element","h1",{"id":21},"ai-design-patterns-implementing-feature-hashing-using-mindspore",[23],{"type":24,"value":8},"text",{"type":18,"tag":26,"props":27,"children":28},"p",{},[29],{"type":24,"value":30},"Author: Wang Lei | Source: Zhihu",{"type":18,"tag":26,"props":32,"children":33},{},[34],{"type":24,"value":35},"When developing AI software, the choice of design patterns has a significant impact on both efficiency and result accuracy. Unlike traditional software development, which emphasizes service code encapsulation and abstraction, AI software development centers around data processing and specific algorithms. As a result, new design patterns must be developed to address common development challenges in the AI field.",{"type":18,"tag":26,"props":37,"children":38},{},[39],{"type":24,"value":40},"AI design patterns have been classified into two main categories: design and operational patterns. 
These patterns are used in end-to-end AI processes, including data representation, data processing, problem representation, network design, model training, and resilient serving. Each category contains specific patterns that address different problems. MindSpore, an AI framework, already supports some of these patterns.",{"type":18,"tag":26,"props":42,"children":43},{},[44],{"type":24,"value":45},"This article presents the feature hashing design pattern, a data representation pattern, and explores its implementation using MindSpore.",{"type":18,"tag":26,"props":47,"children":48},{},[49],{"type":18,"tag":50,"props":51,"children":53},"img",{"alt":7,"src":52},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/06/16/75b7f1d0829e42cba355fa8d4abb7bc5.png",[],{"type":18,"tag":26,"props":55,"children":56},{},[57],{"type":18,"tag":58,"props":59,"children":60},"strong",{},[61],{"type":24,"value":62},"1. Definition",{"type":18,"tag":26,"props":64,"children":65},{},[66],{"type":24,"value":67},"Feature hashing is one of the data representation patterns among the AI design patterns. It is highly effective in addressing problems such as incomplete data, high cardinality (uneven feature categories), and cold start (inability to process new categories during inference). With the user-friendly data processing interface provided by MindSpore, developers can easily implement this technique.",{"type":18,"tag":26,"props":69,"children":70},{},[71],{"type":18,"tag":58,"props":72,"children":73},{},[74],{"type":24,"value":75},"2. Problem",{"type":18,"tag":26,"props":77,"children":78},{},[79],{"type":24,"value":80},"During data processing, machine learning typically converts categorical inputs into numerical features using one-hot encoding. One-hot encoding maps each category in a fixed vocabulary to its own position in a binary vector. The length of the resulting feature is the size of the vocabulary. In the encoded vector, exactly one bit is set for any given input. 
For instance, suppose we have six zip codes [1,2,3,4,5,6] and encode them using one-hot encoding:",{"type":18,"tag":82,"props":83,"children":85},"pre",{"code":84},"import numpy as np\nimport mindspore.dataset.transforms.c_transforms as c_transforms\nimport mindspore.dataset as ds\n\ncode = [1,2,3,4,5,6]\ndata = np.array(code) # Convert the list to a NumPy array.\ndataset = ds.NumpySlicesDataset(data, column_names=[\"clz\"], shuffle=False) # Wrap the NumPy array in a MindSpore dataset object.\nonehot_op = c_transforms.OneHot(num_classes=7) # Define the operation. The value of num_classes must be greater than the maximum value in code.\ndataset = dataset.map(operations=onehot_op, input_columns=[\"clz\"]) # Apply one-hot encoding.\n\nfor item in dataset:\n    print(item)\n",[86],{"type":18,"tag":87,"props":88,"children":89},"code",{"__ignoreMap":7},[90],{"type":24,"value":84},{"type":18,"tag":26,"props":92,"children":93},{},[94],{"type":24,"value":95},"The encoding result is as follows:",{"type":18,"tag":82,"props":97,"children":99},{"code":98},"[Tensor(shape=[7], dtype=Int32, value= [0, 1, 0, 0, 0, 0, 0])]\n[Tensor(shape=[7], dtype=Int32, value= [0, 0, 1, 0, 0, 0, 0])]\n[Tensor(shape=[7], dtype=Int32, value= [0, 0, 0, 1, 0, 0, 0])]\n[Tensor(shape=[7], dtype=Int32, value= [0, 0, 0, 0, 1, 0, 0])]\n[Tensor(shape=[7], dtype=Int32, value= [0, 0, 0, 0, 0, 1, 0])]\n[Tensor(shape=[7], dtype=Int32, value= [0, 0, 0, 0, 0, 0, 1])]\n",[100],{"type":18,"tag":87,"props":101,"children":102},{"__ignoreMap":7},[103],{"type":24,"value":98},{"type":18,"tag":26,"props":105,"children":106},{},[107],{"type":24,"value":108},"Each categorical input thus maps to a unique vector.",{"type":18,"tag":26,"props":110,"children":111},{},[112],{"type":24,"value":113},"One-hot encoding requires prior knowledge of the full vocabulary. Inputs whose values are known in advance, such as categories, languages, and dates, are easy to process. 
However, when dealing with unpredictable data, certain challenges may arise.",{"type":18,"tag":26,"props":115,"children":116},{},[117],{"type":18,"tag":58,"props":118,"children":119},{},[120],{"type":24,"value":121},"2.1 Incomplete Data",{"type":18,"tag":26,"props":123,"children":124},{},[125],{"type":24,"value":126},"The training data may not contain all feature categories, leading to an incomplete vocabulary and incomplete encoded data. For instance, for some medical models, the vocabulary built from the training data cannot include every hospital and doctor.",{"type":18,"tag":26,"props":128,"children":129},{},[130],{"type":18,"tag":58,"props":131,"children":132},{},[133],{"type":24,"value":134},"2.2 High Cardinality (Uneven Feature Categories)",{"type":18,"tag":26,"props":136,"children":137},{},[138],{"type":24,"value":139},"A categorical feature with numerous distinct values, such as IP addresses and home addresses, can lead to feature vectors that are thousands to millions in length. Consequently, the model demands significant storage space and cannot be deployed on smaller devices.",{"type":18,"tag":26,"props":141,"children":142},{},[143],{"type":18,"tag":58,"props":144,"children":145},{},[146],{"type":24,"value":147},"2.3 Cold Start (Inability to Process New Categories During Inference)",{"type":18,"tag":26,"props":149,"children":150},{},[151],{"type":24,"value":152},"The model in the production environment cannot accurately predict new categorical data, leading to potential errors. Handling such inputs would otherwise require a separate serving infrastructure.",{"type":18,"tag":26,"props":154,"children":155},{},[156],{"type":18,"tag":58,"props":157,"children":158},{},[159],{"type":24,"value":160},"3. Solutions",{"type":18,"tag":26,"props":162,"children":163},{},[164],{"type":24,"value":165},"Consider the flight punctuality prediction model discussed in the reference book [1]. 
The United States has approximately 350 airports, and their flight volumes vary significantly. Moreover, new airports are constructed every year. This results in high cardinality, incomplete data, and cold start problems during one-hot encoding.",{"type":18,"tag":26,"props":167,"children":168},{},[169],{"type":24,"value":170},"The feature hashing pattern is used to address these problems with one-hot encoding of categorical data. The operations are as follows:",{"type":18,"tag":26,"props":172,"children":173},{},[174],{"type":24,"value":175},"(a) Convert the categorical airport data input into a unique string. This involves representing each airport by its three-letter IATA code and ensuring that there are no duplicates in the data.",{"type":18,"tag":26,"props":177,"children":178},{},[179],{"type":24,"value":180},"(b) Invoke a deterministic and portable hashing algorithm (one that gives the same result in both training and inference) on the string.",{"type":18,"tag":26,"props":182,"children":183},{},[184],{"type":24,"value":185},"(c) Take the remainder of the hash value after dividing it by the desired number of buckets. In the following example, the FarmHash algorithm is used to hash the airports into 10 and 1000 buckets, respectively. 
The result is as follows:",{"type":18,"tag":82,"props":187,"children":189},{"code":188},">>> import farmhash\n>>> airports = [\"DTW\", \"LBB\", \"SNA\", \"MSO\", \"ANC\"]\n>>> list(map(lambda x: farmhash.hash64withseed(x, 10) % 10, airports))\n[9, 9, 4, 0, 1]\n>>> list(map(lambda x: farmhash.hash64withseed(x, 1000) % 1000, airports))\n[416, 532, 193, 538, 971]\n",[190],{"type":18,"tag":87,"props":191,"children":192},{"__ignoreMap":7},[193],{"type":24,"value":188},{"type":18,"tag":26,"props":195,"children":196},{},[197],{"type":18,"tag":50,"props":198,"children":200},{"alt":7,"src":199},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/06/16/7396acae1b504182b4bcdc42f2becc6c.png",[],{"type":18,"tag":26,"props":202,"children":203},{},[204],{"type":24,"value":205},"How does feature hashing address the problems of categorical data?",{"type":18,"tag":26,"props":207,"children":208},{},[209],{"type":24,"value":210},"(a) Incomplete data: Even if an airport is not part of the training dataset, its hashed feature value still falls within the range of bucket indexes, so the input can always be encoded. Therefore, incomplete data is not a concern.",{"type":18,"tag":26,"props":212,"children":213},{},[214],{"type":24,"value":215},"(b) High cardinality: Hashing can effectively reduce the scale of data, ensuring that system memory and model size requirements remain practical. Even with millions of distinct inputs, the data will only be hashed into a limited number of buckets.",{"type":18,"tag":26,"props":217,"children":218},{},[219],{"type":24,"value":220},"(c) Cold start: When new categorical data is added to the system, it is hashed into one of the existing buckets, which it shares with other airports. However, the model must be retrained on the updated data to improve prediction accuracy. For instance, if there are 350 airports and the number of hash buckets is set to 70, each bucket will contain approximately five airports with data. This ensures that the model in the production environment always returns a prediction. 
Nevertheless, the prediction may not be accurate, and further training is necessary to optimize the model.",{"type":18,"tag":26,"props":222,"children":223},{},[224],{"type":18,"tag":58,"props":225,"children":226},{},[227],{"type":24,"value":228},"4. Case Study",{"type":18,"tag":26,"props":230,"children":231},{},[232,234,239],{"type":24,"value":233},"We'll use the previous example of predicting airport flight punctuality. To begin, we'll apply the pattern to the airport data and prepare the data encoding using MindSpore's one-hot encoding interface. The example depends on a hash algorithm library, which you can install by running the ",{"type":18,"tag":58,"props":235,"children":236},{},[237],{"type":24,"value":238},"pip install pyfarmhash",{"type":24,"value":240}," command.",{"type":18,"tag":82,"props":242,"children":244},{"code":243},"import farmhash\nimport numpy as np\nimport mindspore.dataset.transforms.c_transforms as c_transforms\nimport mindspore.dataset as ds\n\nairports = [\"DTW\", \"LBB\", \"SNA\", \"MSO\", \"ANC\", \"ABC\", \"CDE\", \"FGH\"] # Three-letter airport codes.\nhashed_data = list(map(lambda x: farmhash.hash64withseed(x, 1000) % 4, airports)) # Apply feature hashing: hash each string, then take the remainder over 4 buckets.\n\ndata = np.array(hashed_data) # Convert the list of bucket indexes to a NumPy array.\ndataset = ds.NumpySlicesDataset(data, column_names=[\"airport_name\"], shuffle=False) # Wrap the NumPy array in a MindSpore dataset object.\nonehot_op = c_transforms.OneHot(num_classes=4) # Define the one-hot encoding operation. 
The value of num_classes equals the number of buckets.\ndataset = dataset.map(operations=onehot_op, input_columns=[\"airport_name\"]) # Apply one-hot encoding to airport data.\n\nfor item in dataset:\n    print(item)\n",[245],{"type":18,"tag":87,"props":246,"children":247},{"__ignoreMap":7},[248],{"type":24,"value":243},{"type":18,"tag":26,"props":250,"children":251},{},[252],{"type":24,"value":253},"The command output is as follows:",{"type":18,"tag":82,"props":255,"children":257},{"code":256},"[Tensor(shape=[4], dtype=Int32, value= [1, 0, 0, 0])]\n[Tensor(shape=[4], dtype=Int32, value= [1, 0, 0, 0])]\n[Tensor(shape=[4], dtype=Int32, value= [0, 1, 0, 0])]\n[Tensor(shape=[4], dtype=Int32, value= [0, 0, 1, 0])]\n[Tensor(shape=[4], dtype=Int32, value= [0, 0, 0, 1])]\n[Tensor(shape=[4], dtype=Int32, value= [0, 0, 0, 1])]\n[Tensor(shape=[4], dtype=Int32, value= [0, 0, 1, 0])]\n[Tensor(shape=[4], dtype=Int32, value= [1, 0, 0, 0])]\n",[258],{"type":18,"tag":87,"props":259,"children":260},{"__ignoreMap":7},[261],{"type":24,"value":256},{"type":18,"tag":26,"props":263,"children":264},{},[265],{"type":18,"tag":58,"props":266,"children":267},{},[268],{"type":24,"value":269},"5. Summary",{"type":18,"tag":26,"props":271,"children":272},{},[273],{"type":24,"value":274},"The feature hashing pattern can be applied in certain scenarios, but it may degrade model accuracy. The pattern is not suitable when the categories are known in advance, the vocabulary is relatively small (on the order of 1,000 entries), and there is no cold start problem. The modulo operation is lossy: different categories are placed in the same bucket, which can reduce accuracy. When categorical data is particularly unbalanced, inference errors can be significant. For instance, suppose Airport A has relatively low traffic, while Airport B has traffic two orders of magnitude larger. 
If they are placed in the same bucket, they are encoded as the same category, leading to biased model results that favor Airport B and result in inaccurate predictions of take-off waiting time.",{"type":18,"tag":26,"props":276,"children":277},{},[278],{"type":24,"value":279},"In practice, there are two ways to mitigate the accuracy loss caused by the pattern:",{"type":18,"tag":26,"props":281,"children":282},{},[283],{"type":24,"value":284},"1. Add aggregate features: If the distribution of categorical variables is skewed or the number of buckets is so small that bucket collisions are frequent, you can add aggregate features as an input to the model to help alleviate the situation. For example, for each airport, the probability of a punctual flight can be computed from the training dataset and added to the model as a feature. This allows us to avoid losing information related to individual airports when hashing airport codes. In some cases, we might be able to avoid using the airport name as a feature entirely, because the relative frequency of on-time flights may be sufficient.",{"type":18,"tag":26,"props":286,"children":287},{},[288],{"type":24,"value":289},"2. 
Treat the number of buckets as a hyperparameter and tune it to find the best accuracy trade-off.",{"type":18,"tag":26,"props":291,"children":292},{},[293],{"type":18,"tag":58,"props":294,"children":295},{},[296],{"type":24,"value":297},"Reference:",{"type":18,"tag":26,"props":299,"children":300},{},[301,303],{"type":24,"value":302},"[1]",{"type":18,"tag":304,"props":305,"children":309},"a",{"href":306,"rel":307},"https://www.oreilly.com/library/view/machine-learning-design/9781098115777/",[308],"nofollow",[310],{"type":24,"value":306},{"type":18,"tag":26,"props":312,"children":313},{},[314],{"type":18,"tag":50,"props":315,"children":317},{"alt":7,"src":316},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/06/16/a45d4979db0c4fdb8fcbe2daa68b2bd1.png",[],{"type":18,"tag":26,"props":319,"children":320},{},[321],{"type":18,"tag":58,"props":322,"children":323},{},[324],{"type":24,"value":325},"MindSpore official documentation",{"type":18,"tag":26,"props":327,"children":328},{},[329,334],{"type":18,"tag":58,"props":330,"children":331},{},[332],{"type":24,"value":333},"Official QQ group",{"type":24,"value":335},": 871543426",{"type":18,"tag":26,"props":337,"children":338},{},[339,344,346],{"type":18,"tag":58,"props":340,"children":341},{},[342],{"type":24,"value":343},"MindSpore website",{"type":24,"value":345},": 
",{"type":18,"tag":304,"props":347,"children":350},{"href":348,"rel":349},"https://www.mindspore.cn/en",[308],[351],{"type":24,"value":348},{"type":18,"tag":26,"props":353,"children":354},{},[355,360,361],{"type":18,"tag":58,"props":356,"children":357},{},[358],{"type":24,"value":359},"Gitee",{"type":24,"value":345},{"type":18,"tag":304,"props":362,"children":365},{"href":363,"rel":364},"https://gitee.com/mindspore/mindspore",[308],[366],{"type":24,"value":363},{"type":18,"tag":26,"props":368,"children":369},{},[370,375,376],{"type":18,"tag":58,"props":371,"children":372},{},[373],{"type":24,"value":374},"GitHub",{"type":24,"value":345},{"type":18,"tag":304,"props":377,"children":380},{"href":378,"rel":379},"https://github.com/mindspore-ai/mindspore",[308],[381],{"type":24,"value":378},{"type":18,"tag":26,"props":383,"children":384},{},[385,390,391],{"type":18,"tag":58,"props":386,"children":387},{},[388],{"type":24,"value":389},"Forum",{"type":24,"value":345},{"type":18,"tag":304,"props":392,"children":395},{"href":393,"rel":394},"https://bbs.huaweicloud.com/forum/forum-1076-1.html",[308],[396],{"type":24,"value":393},{"type":18,"tag":26,"props":398,"children":399},{},[400,405,406],{"type":18,"tag":58,"props":401,"children":402},{},[403],{"type":24,"value":404},"Openl Community",{"type":24,"value":345},{"type":18,"tag":304,"props":407,"children":410},{"href":408,"rel":409},"https://openi.org.cn",[308],[411],{"type":24,"value":408},{"title":7,"searchDepth":413,"depth":413,"links":414},4,[],"markdown","content:technology-blogs:en:2563.md","content","technology-blogs/en/2563.md","technology-blogs/en/2563","md",1776506106566]