[{"data":1,"prerenderedAt":284},["ShallowReactive",2],{"content-query-U31RqvmfwL":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"category":13,"body":14,"_type":278,"_id":279,"_source":280,"_file":281,"_stem":282,"_extension":283},"/technology-blogs/en/1796","en",false,"","[AI Engineering] 07 - CD4ML: Continuous Delivery for Machine Learning (Part II)","As machine learning techniques continue to evolve, our knowledge of managing and delivering such applications to production is also evolving. By introducing and extending the specifications and practices from CD, we can better manage the risks of releasing changes in a safe and reliable way.","2022-08-18","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/11/28/6533581b0beb453aaf5c87973cf02834.png","technology-blogs","Influencers",{"type":15,"children":16,"toc":273},"root",[17,32,38,49,54,62,67,72,77,85,90,98,103,121,126,133,138,143,148,153,161,166,171,176,184,189,197,202,207,212,217,233,238,243,251,255,260,268],{"type":18,"tag":19,"props":20,"children":22},"element","h1",{"id":21},"ai-engineering-07-cd4ml-continuous-delivery-for-machine-learning-part-ii",[23,30],{"type":18,"tag":24,"props":25,"children":26},"span",{},[27],{"type":28,"value":29},"text","AI Engineering",{"type":28,"value":31}," 07 - CD4ML: Continuous Delivery for Machine Learning (Part II)",{"type":18,"tag":33,"props":34,"children":35},"p",{},[36],{"type":28,"value":37},"This article introduces the end-to-end CD4ML process and explores the future of CD4ML.",{"type":18,"tag":39,"props":40,"children":42},"h2",{"id":41},"the-end-to-end-cd4ml-process",[43],{"type":18,"tag":44,"props":45,"children":46},"strong",{},[47],{"type":28,"value":48},"The End-to-End CD4ML Process",{"type":18,"tag":33,"props":50,"children":51},{},[52],{"type":28,"value":53},"By tackling each technical challenge and using a variety of tools and technologies, the 
end-to-end process shown in Figure 7 is created. The process is managed along three dimensions: code, model, and data.",{"type":18,"tag":33,"props":55,"children":56},{},[57],{"type":18,"tag":58,"props":59,"children":61},"img",{"alt":7,"src":60},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/09/15/62a37c789bb5443898a756138737b6c1.png",[],{"type":18,"tag":33,"props":63,"children":64},{},[65],{"type":28,"value":66},"Figure 7 End-to-end CD4ML process",{"type":18,"tag":33,"props":68,"children":69},{},[70],{"type":28,"value":71},"On this basis, we need a simple way to manage, discover, access, and version our data. Then, we automate the building and training of the model to make it reproducible. This enables us to experiment and train multiple models at the same time and track the data. Once an appropriate model is determined, we can decide how it will be productionized and served. Because the model is continuously updated, we need to test it before deploying it to the production environment to ensure that the results meet users' expectations. After the model is put into production, we use the monitoring and observability services provided by the infrastructure to gather new data that can be analyzed and used to create new training datasets. 
"In this way, the feedback loop of continuous improvement is closed.",{"type":18,"tag":33,"props":73,"children":74},{},[75],{"type":28,"value":76},"A continuous delivery (CD) orchestration tool coordinates the end-to-end CD4ML process, provides the infrastructure on demand, and manages the deployment of models and applications in production environments.",{"type":18,"tag":33,"props":78,"children":79},{},[80],{"type":18,"tag":44,"props":81,"children":82},{},[83],{"type":28,"value":84},"Where Does the Road Lead?",{"type":18,"tag":33,"props":86,"children":87},{},[88],{"type":28,"value":89},"In this section, we will focus on some areas of improvement that are not covered in the workshop materials, as well as open areas to be further explored.",{"type":18,"tag":33,"props":91,"children":92},{},[93],{"type":18,"tag":44,"props":94,"children":95},{},[96],{"type":28,"value":97},"Data Versioning",{"type":18,"tag":33,"props":99,"children":100},{},[101],{"type":28,"value":102},"When talking about CD4ML, a frequently asked question is \"how do we trigger a pipeline when the data changes?\"",{"type":18,"tag":33,"props":104,"children":105},{},[106,108,113,115,119],{"type":28,"value":107},"In the authors' example, they take the following approach. The machine learning pipeline starts with the ",{"type":18,"tag":44,"props":109,"children":110},{},[111],{"type":28,"value":112},"download_data.py",{"type":28,"value":114}," script, which is used to download the training dataset from a shared location. If the content of the dataset in the shared location changes, the pipeline is not immediately triggered, because the program code has not changed and the Data Version Control (DVC) tool cannot detect the dataset change. 
To version the data, we have to create a new file or change the file name, which in turn requires us to update the ",{"type":18,"tag":44,"props":116,"children":117},{},[118],{"type":28,"value":112},{"type":28,"value":120}," script with a new path and create a new code commit.",{"type":18,"tag":33,"props":122,"children":123},{},[124],{"type":28,"value":125},"An improvement to this approach is to allow DVC to track the file content, eliminating the need to change code manually. To achieve this, the authors slightly modified their machine learning pipeline, as shown in Figure 8.",{"type":18,"tag":33,"props":127,"children":128},{},[129],{"type":18,"tag":58,"props":130,"children":132},{"alt":7,"src":131},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2022/09/15/0c9ef2af058c4dea9de49fdff2d84cd1.png",[],{"type":18,"tag":33,"props":134,"children":135},{},[136],{"type":28,"value":137},"Figure 8: Updating the first step to allow DVC to track the data versions and simplifying the ML pipeline",{"type":18,"tag":33,"props":139,"children":140},{},[141],{"type":28,"value":142},"This creates a metadata file, committed to Git, that tracks the content of the data file. When the file content changes, DVC updates the metadata file, which in turn triggers a pipeline execution.",{"type":18,"tag":33,"props":144,"children":145},{},[146],{"type":28,"value":147},"Although this allows us to retrain the model when the data changes, it does not solve all the problems of data versioning. One is data history. Ideally, we want to keep the entire history of data changes, but this is not always feasible. Another is data provenance. We want to know which processing step caused the data to change, and how the changed data propagates across different datasets. 
There is also the question of whether, as data types and schemas evolve over time, those changes remain backward and forward compatible.",{"type":18,"tag":33,"props":149,"children":150},{},[151],{"type":28,"value":152},"In the world of streaming data, these aspects of data versioning become even more complicated. We expect more practices, tools, and techniques to emerge.",{"type":18,"tag":33,"props":154,"children":155},{},[156],{"type":18,"tag":44,"props":157,"children":158},{},[159],{"type":28,"value":160},"Data Pipelines",{"type":18,"tag":33,"props":162,"children":163},{},[164],{"type":28,"value":165},"The authors prefer open source tools that can define data pipelines as code, making them easier to version-control, test, and deploy. For example, with Spark, you can use Scala to write a data pipeline, test it using ScalaTest or spark-testing-base, and then package the job as a JAR artifact that can be deployed on a deployment pipeline in GoCD.",{"type":18,"tag":33,"props":167,"children":168},{},[169],{"type":28,"value":170},"Since data pipelines usually run either as a batch job or as a long-running streaming application, the authors did not include them in the end-to-end CD4ML process diagram in Figure 7. However, if the output of a pipeline changes, it may no longer be what the model or application expects, which is another potential source of integration problems. 
Therefore, including integration and data contract tests as a part of the deployment pipeline can catch those mistakes.",{"type":18,"tag":33,"props":172,"children":173},{},[174],{"type":28,"value":175},"Another type of testing associated with data pipelines is the data quality check, which is an extensive topic in its own right and is probably best covered in a separate article.",{"type":18,"tag":33,"props":177,"children":178},{},[179],{"type":18,"tag":44,"props":180,"children":181},{},[182],{"type":28,"value":183},"Platform Thinking",{"type":18,"tag":33,"props":185,"children":186},{},[187],{"type":28,"value":188},"Various tools and technologies are used to implement CD4ML. If multiple teams are working on the same system, they might end up duplicating effort. That's why platform thinking is useful. By applying platform engineering to build domain-agnostic tools, a team can hide the underlying complexity and speed up testing and adoption of the tools by other teams.",{"type":18,"tag":33,"props":190,"children":191},{},[192],{"type":18,"tag":44,"props":193,"children":194},{},[195],{"type":28,"value":196},"Evolving Intelligent Systems without Bias",{"type":18,"tag":33,"props":198,"children":199},{},[200],{"type":28,"value":201},"After an ML system is deployed to a production environment, it will start making predictions against unseen data. It might even replace an existing, rules-based system. It is important to realize that the training data and model validation used in the system are based on historical data from the previous system, so the new system might inherit the biases of the previous system. 
Moreover, the impact of the ML system on its users also shapes the training data of the future.",{"type":18,"tag":33,"props":203,"children":204},{},[205],{"type":28,"value":206},"Consider the following two examples.",{"type":18,"tag":33,"props":208,"children":209},{},[210],{"type":28,"value":211},"First, assume there is an application that predicts demand to decide the exact quantity of products to be ordered and offered to customers. If the predicted demand is lower than the actual demand, the quantity of products will be insufficient and the transaction volume for that product will decrease. If you only use these new transactions as training data to improve the model, the demand predictions will degrade over time.",{"type":18,"tag":33,"props":213,"children":214},{},[215],{"type":28,"value":216},"Second, imagine that you are building an anomaly detection model to decide if a customer's credit card transaction is fraudulent. If the application acts on the model's decision and blocks the transactions flagged as fraudulent, over time you will have only \"true labels\" for the transactions allowed by the model and fewer fraudulent ones to train on. The model performance will also degrade because the training data becomes biased towards \"good\" transactions.",{"type":18,"tag":33,"props":218,"children":219},{},[220,222,231],{"type":28,"value":221},"There is no simple solution to this problem. In the first example, retailers can account for out-of-stock situations and order more items to cover the potential shortage. For the fraud detection scenario, the model's classification can sometimes be ignored or overridden according to some probability distribution. It is also important to realize that many datasets are temporal, that is, their distribution changes over time. 
Many validation methods that split data randomly assume the datasets are ",{"type":18,"tag":223,"props":224,"children":228},"a",{"href":225,"rel":226},"https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables",[227],"nofollow",[229],{"type":28,"value":230},"i.i.d.",{"type":28,"value":232}," (independent and identically distributed), but that assumption does not hold once the effect of time is taken into account.",{"type":18,"tag":33,"props":234,"children":235},{},[236],{"type":28,"value":237},"Therefore, it is important not only to capture a model's inputs and outputs, but also to record whether the consuming application acted on the model's output directly. You can then annotate the data to avoid this bias in future training rounds. Another key capability required when you face these issues is managing the training data and having systems that allow humans to curate it.",{"type":18,"tag":33,"props":239,"children":240},{},[241],{"type":28,"value":242},"Evolving an intelligent system to choose and improve ML models can also be seen as a meta-learning problem. Much of the state-of-the-art research in this field focuses on these types of problems, for example, the use of reinforcement learning techniques such as multi-armed bandits, or online learning in production environments.",{"type":18,"tag":33,"props":244,"children":245},{},[246],{"type":18,"tag":44,"props":247,"children":248},{},[249],{"type":28,"value":250},"Conclusion",{"type":18,"tag":33,"props":252,"children":253},{},[254],{"type":28,"value":9},{"type":18,"tag":33,"props":256,"children":257},{},[258],{"type":28,"value":259},"Using a sample sales forecasting application, this article shows the technical components of CD4ML and discusses a few approaches to implementing them. 
ML techniques will continue to evolve, and new tools will emerge and disappear, but the core principles of CD remain relevant and provide valuable guidance for your own machine learning applications.",{"type":18,"tag":33,"props":261,"children":262},{},[263],{"type":18,"tag":44,"props":264,"children":265},{},[266],{"type":28,"value":267},"References",{"type":18,"tag":33,"props":269,"children":270},{},[271],{"type":28,"value":272},"1. Danilo Sato, Arif Wider, Christoph Windheuser, Continuous Delivery for Machine Learning",{"title":7,"searchDepth":274,"depth":274,"links":275},4,[276],{"id":41,"depth":277,"text":48},2,"markdown","content:technology-blogs:en:1796.md","content","technology-blogs/en/1796.md","technology-blogs/en/1796","md",1776506104567]