[{"data":1,"prerenderedAt":411},["ShallowReactive",2],{"content-query-WuuFQ7jUpO":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":405,"_id":406,"_source":407,"_file":408,"_stem":409,"_extension":410},"/technology-blogs/en/3009","en",false,"","Idea Sharing | Fitting of the Chemical Deep Learning Model ChemGPT","The research provides motivation and practical guidance for scaling research in scientific deep learning, and provides fruitful new research directions for the intersection of large-scale and physics-based deep learning.","2024-01-26","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/5f4b995bfff844f6a5bec0e8c3e26577.png","technology-blogs",{"type":14,"children":15,"toc":402},"root",[16,24,34,42,47,52,57,62,67,75,80,88,96,101,108,113,139,147,179,187,192,200,205,213,221,229,234,241,246,254,259,266,271,279,290,297,302,307,314,319,327,332,337,345,358,369,380,391],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"idea-sharing-fitting-of-the-chemical-deep-learning-model-chemgpt",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":17,"tag":29,"props":30,"children":31},"strong",{},[32],{"type":23,"value":33},"Author: Yu Fan Source: Zhihu",{"type":17,"tag":25,"props":35,"children":36},{},[37],{"type":17,"tag":29,"props":38,"children":39},{},[40],{"type":23,"value":41},"Background",{"type":17,"tag":25,"props":43,"children":44},{},[45],{"type":23,"value":46},"The effectiveness of deep learning in fields such as computer vision (CV) and natural language processing (NLP) relies on the ability of deep neural networks to utilize the increasing amount of computing power, data size, and model capacity. Most state-of-the-art (SOTA) CV and NLP models are adapted from a small set of large pre-trained models. 
Through self-supervised pre-training, information can be successfully synthesized from large datasets, and little to no fine-tuning is needed when various downstream tasks are performed. Therefore, massive models and dataset expansion will be a prerequisite for the great success of deep learning in the scientific field.",{"type":17,"tag":25,"props":48,"children":49},{},[50],{"type":23,"value":51},"AlphaFold, the Open Catalyst Project, ChemBERTa, and other recent work have proven that key ingredients in CV and NLP, that is, larger datasets and models, pre-training, and self-supervised learning, unlock new possibilities for the application of deep learning in chemistry.",{"type":17,"tag":25,"props":53,"children":54},{},[55],{"type":23,"value":56},"However, unlike in CV and NLP, the path to and benefits of large-scale chemical deep networks are unclear. On the one hand, chemical deep learning can incorporate domain prior knowledge that may alleviate the urgency of resource requirements. On the other hand, the heterogeneity and complexity of chemical space and molecular machine learning tasks make it challenging to train a general and robust model that performs well on various downstream tasks. The enormity of chemical space and heterogeneity of downstream tasks make larger models in chemistry ideal for unlabelled multimodal datasets. At the same time, recent research has found that neural-scaling laws describe how model performance improves over many orders of magnitude of model size, dataset size, and computing workload. However, these experiments require substantial computing resources and depend on model training procedures in specific domains. These procedures do not apply outside traditional deep learning application fields. 
Moreover, the development and deployment costs of large models are high, and research on neural-scaling behavior also depends on costly hyperparameter optimization (HPO) and experimentation.",{"type":17,"tag":25,"props":58,"children":59},{},[60],{"type":23,"value":61},"Based on techniques in CV and NLP that accelerate neural architecture search and hyperparameter transfer, such as training speed estimation (TSE) and μTransfer, Nathan C. Frey, an MIT PhD, proposed that to investigate the capabilities of deep chemical models across resource scales, practical and principled approaches are needed to accelerate hyperparameter transfer and characterize neural scaling.",{"type":17,"tag":25,"props":63,"children":64},{},[65],{"type":23,"value":66},"Frey and his team developed strategies for scaling deep chemical models and studied neural-scaling behavior in large chemical models by varying the model and dataset sizes over multiple orders of magnitude. Their paper introduced the ChemGPT model with more than 1 billion parameters, trained on datasets of up to 10 million data points, and studied large language models (LLMs) for generative chemical modeling and graph neural networks (GNNs) for learning interatomic potentials. In addition, the team explored the interaction between physics-based priors and scale and discovered empirical scaling relations for language models in chemistry. 
Finally, the scaling exponent β was 0.17 for the largest dataset within the experimental scope, and 0.26 for equivariant graph neural network interatomic potentials.",{"type":17,"tag":25,"props":68,"children":69},{},[70],{"type":17,"tag":71,"props":72,"children":74},"img",{"alt":7,"src":73},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/d38a2be60ed546e8af76973a14381401.png",[],{"type":17,"tag":25,"props":76,"children":77},{},[78],{"type":23,"value":79},"Figure 1 Research on scaling relationships for deep learning chemical models",{"type":17,"tag":25,"props":81,"children":82},{},[83],{"type":17,"tag":29,"props":84,"children":85},{},[86],{"type":23,"value":87},"1. Methods",{"type":17,"tag":25,"props":89,"children":90},{},[91],{"type":17,"tag":29,"props":92,"children":93},{},[94],{"type":23,"value":95},"1.1 Neural-Scaling Laws",{"type":17,"tag":25,"props":97,"children":98},{},[99],{"type":23,"value":100},"For LLMs and CV models with sufficient parameters and/or convergence, the empirical scaling relationships between the performance and the number of parameters, size of datasets, and computing workload can be denoted as formula 1.",{"type":17,"tag":25,"props":102,"children":103},{},[104],{"type":17,"tag":71,"props":105,"children":107},{"alt":7,"src":106},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/bbd1a018c25a4602bc854029a566181c.png",[],{"type":17,"tag":25,"props":109,"children":110},{},[111],{"type":23,"value":112},"Formula 1",{"type":17,"tag":25,"props":114,"children":115},{},[116,118,124,126,130,132,137],{"type":23,"value":117},"α is the scaling coefficient. ",{"type":17,"tag":119,"props":120,"children":121},"em",{},[122],{"type":23,"value":123},"R",{"type":23,"value":125}," is the resource information, including the number of parameters, dataset size, and computing workload. 
β is the scaling exponent that measures the slope of the power law and indicates the scaling efficiency of the model with respect to the scaling factor ",{"type":17,"tag":119,"props":127,"children":128},{},[129],{"type":23,"value":123},{"type":23,"value":131},". For a fixed data budget, the scaling exponent quantifies the loss improvements caused by the increase in the model size. A larger ",{"type":17,"tag":119,"props":133,"children":134},{},[135],{"type":23,"value":136},"β",{"type":23,"value":138}," corresponds to a steeper slope and better performance as the data/model size increases. Note that this empirical formula is not applicable to the case where the resolution is limited, that is, the model is large enough but the dataset is not, or vice versa. Identifying these resolution-limited areas from neural-scaling relationships allows us to roughly understand whether model loss improvements are limited by data availability or model capacity.",{"type":17,"tag":25,"props":140,"children":141},{},[142],{"type":17,"tag":29,"props":143,"children":144},{},[145],{"type":23,"value":146},"1.2 Chemical LLMs",{"type":17,"tag":25,"props":148,"children":149},{},[150,152,157,159,164,166,171,173,177],{"type":23,"value":151},"Chemical graphs can be naturally represented by strings, so sequence-based models are a natural choice for processing chemical data. After observing that the pre-training loss of transformer-based models could be significantly improved by increasing the model or dataset size, the team designed a generative LLM called ChemGPT to study the impact of dataset and model size on the pre-training loss. ChemGPT is a GPT3-style model based on GPT-Neo with a tokenizer for self-referencing embedded strings (SELFIES). 
For chemical language modeling, a set of molecules (",{"type":17,"tag":119,"props":153,"children":154},{},[155],{"type":23,"value":156},"x1, x2, ..., xn",{"type":23,"value":158},") is represented with each molecule as a sequence of symbols (",{"type":17,"tag":119,"props":160,"children":161},{},[162],{"type":23,"value":163},"s1, s2, ..., sn",{"type":23,"value":165},"). For a given sequence, ",{"type":17,"tag":119,"props":167,"children":168},{},[169],{"type":23,"value":170},"p(x)",{"type":23,"value":172},", the probability of the sequence, is factorized as the product of the conditional probabilities of each symbol. ChemGPT uses the transformer architecture with a self-attention mechanism to calculate the conditional probabilities, estimate ",{"type":17,"tag":119,"props":174,"children":175},{},[176],{"type":23,"value":170},{"type":23,"value":178},", and sample from it to generate new molecules. ChemGPT has up to 1 billion non-embedding parameters and is pre-trained on up to 10 million molecules in the PubChem database. This scale is greatly increased compared with that of traditional generative chemical models.",{"type":17,"tag":25,"props":180,"children":181},{},[182],{"type":17,"tag":29,"props":183,"children":184},{},[185],{"type":23,"value":186},"1.3 GNN Force Fields",{"type":17,"tag":25,"props":188,"children":189},{},[190],{"type":23,"value":191},"Molecular geometry and three-dimensional structure are necessary for most chemical downstream tasks. The team used GNNs to take coordinates of atoms in molecules as input and predict the energy of a given molecular geometry. 
The energy-conserving atomic force field was then obtained by differentiating the predicted energy.",{"type":17,"tag":25,"props":193,"children":194},{},[195],{"type":17,"tag":29,"props":196,"children":197},{},[198],{"type":23,"value":199},"1.4 Training Performance Estimation (TPE)",{"type":17,"tag":25,"props":201,"children":202},{},[203],{"type":23,"value":204},"Hyperparameters, including learning rates and batch sizes, are critical to achieving optimal losses and cannot be transferred between different domains and model/dataset sizes. Therefore, effective strategies are needed for scalable HPO in deep chemical models. To scale deep chemical models efficiently under computing resource constraints, the team introduced TPE, an extension of TSE. TPE reduces the computing cost of HPO, discovers the most important hyperparameters in new domains, and accelerates HPO by automatic early stopping during training. This accelerates model selection for chemical LLMs and GNN interatomic potentials.",{"type":17,"tag":25,"props":206,"children":207},{},[208],{"type":17,"tag":29,"props":209,"children":210},{},[211],{"type":23,"value":212},"2. Results",{"type":17,"tag":25,"props":214,"children":215},{},[216],{"type":17,"tag":29,"props":217,"children":218},{},[219],{"type":23,"value":220},"2.1 Accelerated HPO",{"type":17,"tag":25,"props":222,"children":223},{},[224],{"type":17,"tag":29,"props":225,"children":226},{},[227],{"type":23,"value":228},"(1) TPE Accelerates ChemGPT HPO",{"type":17,"tag":25,"props":230,"children":231},{},[232],{"type":23,"value":233},"Figure 2 shows the TPE results of ChemGPT trained on 2 million molecules from the Molecular Sets (MOSES) dataset. MOSES is smaller than PubChem and is a representative dataset on which chemical generative models are typically trained. The team used MOSES to demonstrate how to use TPE to quickly discover the optimal settings for a chemical LLM, such as ChemGPT. 
In Figure 2, 20% of the data is used to validate TPE, showing the true loss after 50 epochs versus the TPE-predicted loss after only 10 epochs. The R^2 of the linear regression is 0.98. This process can be easily reproduced on new datasets and can accelerate HPO.",{"type":17,"tag":25,"props":235,"children":236},{},[237],{"type":17,"tag":71,"props":238,"children":240},{"alt":7,"src":239},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/648804b4e8dc477daef64574397ee043.png",[],{"type":17,"tag":25,"props":242,"children":243},{},[244],{"type":23,"value":245},"Figure 2 TPE is used to identify optimal models in the early stage of training. Training of non-optimal models is stopped, saving more than 80% of the total computing consumption.",{"type":17,"tag":25,"props":247,"children":248},{},[249],{"type":17,"tag":29,"props":250,"children":251},{},[252],{"type":23,"value":253},"(2) TPE Accelerates GNN HPO",{"type":17,"tag":25,"props":255,"children":256},{},[257],{"type":23,"value":258},"TPE performs equally well for GNNs. The team repeated the preceding procedure using 20% of the total training budget, varying the learning rate and batch size, and trained SchNet, Polarizable Atom Interaction Neural Network (PaiNN), and SpookyNet models on the MD-17 dataset. TPE for SchNet and PaiNN achieved excellent prediction performance, as shown in Figure 3. 
The team found that the effectiveness of TPE is closely related to the variance of the model loss under the full training budget, which also indicates the importance of proper HPO.",{"type":17,"tag":25,"props":260,"children":261},{},[262],{"type":17,"tag":71,"props":263,"children":265},{"alt":7,"src":264},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/4d5be9ccb1ec41849d8755b47a0c80d7.png",[],{"type":17,"tag":25,"props":267,"children":268},{},[269],{"type":23,"value":270},"Figure 3 TPE performs equally well for GNNs.",{"type":17,"tag":25,"props":272,"children":273},{},[274],{"type":17,"tag":29,"props":275,"children":276},{},[277],{"type":23,"value":278},"2.2 Fitting of the Neural-Scaling Formula",{"type":17,"tag":25,"props":280,"children":281},{},[282,284,288],{"type":23,"value":283},"As mentioned above, the resolution can be limited, that is, the sizes of the model and dataset do not match. Identifying these resolution-limited areas from neural-scaling relationships allows us to roughly understand whether model loss improvements are limited by data availability or model capacity. For a fixed data budget, the scaling exponent quantifies the loss improvements caused by the increase in the model size. Depending on the dataset size, scaling behavior similar to the power law can be observed in different ranges of model sizes. The scaling can be represented as an approximately straight-line fit of the loss versus model size on a log-log plot. A larger exponent ",{"type":17,"tag":119,"props":285,"children":286},{},[287],{"type":23,"value":136},{"type":23,"value":289}," corresponds to a steeper slope and better performance as the data/model size increases. Figure 4 shows how the number of parameters affects the loss for a given dataset size, and demonstrates the impact of dataset size on model performance across different dataset sizes and β values. 
Breaks in the fitted lines also indicate the existence of resolution-limited areas.",{"type":17,"tag":25,"props":291,"children":292},{},[293],{"type":17,"tag":71,"props":294,"children":296},{"alt":7,"src":295},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/4f418ded310840d5a7f7e5d27b4887a7.png",[],{"type":17,"tag":25,"props":298,"children":299},{},[300],{"type":23,"value":301},"Figure 4 Fitting of the neural-scaling formula of ChemGPT",{"type":17,"tag":25,"props":303,"children":304},{},[305],{"type":23,"value":306},"Within the scope of the scaling law, the model performance changes monotonically with increasing model size, dataset size, and capacity (left in Figure 5). This shows that to improve model performance, you can simply expand the dataset or model within a certain order of magnitude. At the same time, for GNN-based neural force fields (NFFs), the benefits of a low-capacity model decrease as the dataset size increases, while the benefits of a high-capacity model improve rapidly as the dataset size increases (right in Figure 5). Therefore, the benefits of expanding the model and dataset size should be balanced with the increased computational costs to find the most efficient improvement opportunities for computation and data.",{"type":17,"tag":25,"props":308,"children":309},{},[310],{"type":17,"tag":71,"props":311,"children":313},{"alt":7,"src":312},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/03/08/ddf7fc2d172d48b69c71675241599870.png",[],{"type":17,"tag":25,"props":315,"children":316},{},[317],{"type":23,"value":318},"Figure 5 Neural scaling of ChemGPT performance (validation loss) as a function of model (number of non-embedding parameters) and dataset (number of tokens) size",{"type":17,"tag":25,"props":320,"children":321},{},[322],{"type":17,"tag":29,"props":323,"children":324},{},[325],{"type":23,"value":326},"3. 
Thoughts",{"type":17,"tag":25,"props":328,"children":329},{},[330],{"type":23,"value":331},"The core contribution of the research work is the discovery of strategies (neural-scaling laws) for scaling LLMs and GNN interatomic potentials in the chemical field, and the quantification of model losses with respect to model and dataset size over multiple orders of magnitude. The team also found that model loss does not saturate with respect to model size, dataset size, or computation for LLMs and NFFs in the chemical field. Finally, the impact of physics-based priors on scaling behavior provides a rich description of how the incorporation of physics, known empirical relationships, and other forms of knowledge into machine learning frameworks affects learning quality and efficiency.",{"type":17,"tag":25,"props":333,"children":334},{},[335],{"type":23,"value":336},"The results provide motivation and practical guidance for scaling research in scientific deep learning, and provide fruitful new research directions for the intersection of large-scale and physics-based deep learning. These results can be used to optimize the allocation of computing and data budgets and achieve model loss improvements with optimal efficiency, making scalable scientific deep learning more suitable for a wider range of research areas.",{"type":17,"tag":25,"props":338,"children":339},{},[340],{"type":17,"tag":29,"props":341,"children":342},{},[343],{"type":23,"value":344},"References",{"type":17,"tag":25,"props":346,"children":347},{},[348,350],{"type":23,"value":349},"[1] Frey, N.C., Soklaski, R., Axelrod, S. et al. Neural scaling of deep chemical models. Nat Mach Intell 5, 1297–1305 (2023). 
",{"type":17,"tag":351,"props":352,"children":356},"a",{"href":353,"rel":354},"https://doi.org/10.1038/s42256-023-00740-3",[355],"nofollow",[357],{"type":23,"value":353},{"type":17,"tag":25,"props":359,"children":360},{},[361,363],{"type":23,"value":362},"[2]",{"type":17,"tag":351,"props":364,"children":367},{"href":365,"rel":366},"https://github.com/ncfrey/litmatter.git",[355],[368],{"type":23,"value":365},{"type":17,"tag":25,"props":370,"children":371},{},[372,374],{"type":23,"value":373},"[3]",{"type":17,"tag":351,"props":375,"children":378},{"href":376,"rel":377},"https://github.com/learningmatter-mit/NeuralForceField.git",[355],[379],{"type":23,"value":376},{"type":17,"tag":25,"props":381,"children":382},{},[383,385],{"type":23,"value":384},"[4]",{"type":17,"tag":351,"props":386,"children":389},{"href":387,"rel":388},"https://github.com/datamol-io/molfeat.git",[355],[390],{"type":23,"value":387},{"type":17,"tag":25,"props":392,"children":393},{},[394,396],{"type":23,"value":395},"[5]",{"type":17,"tag":351,"props":397,"children":400},{"href":398,"rel":399},"https://github.com/coleygroup/rogi-xd.git",[355],[401],{"type":23,"value":398},{"title":7,"searchDepth":403,"depth":403,"links":404},4,[],"markdown","content:technology-blogs:en:3009.md","content","technology-blogs/en/3009.md","technology-blogs/en/3009","md",1776506109163]