[{"data":1,"prerenderedAt":186},["ShallowReactive",2],{"content-query-1vChRPpVze":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":180,"_id":181,"_source":182,"_file":183,"_stem":184,"_extension":185},"/technology-blogs/en/2847","en",false,"","Idea Sharing: A Brief Summary of Scientific Discovery in the Age of Artificial Intelligence","With the advent of ChatGPT, foundation models are significantly shaping various industries. It has become a trend to use AI to promote the development of scientific research.","2023-08-19","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/10/3b28694ccd0346a2aebd1104dc15128a.png","technology-blogs",{"type":14,"children":15,"toc":177},"root",[16,24,41,61,69,74,82,87,95,100,108,113,121,126,134,145,153],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"idea-sharing-a-brief-summary-of-scientific-discovery-in-the-age-of-artificial-intelligence",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28,34,36],{"type":17,"tag":29,"props":30,"children":31},"strong",{},[32],{"type":23,"value":33},"1.",{"type":23,"value":35}," ",{"type":17,"tag":29,"props":37,"children":38},{},[39],{"type":23,"value":40},"Background",{"type":17,"tag":25,"props":42,"children":43},{},[44,46,59],{"type":23,"value":45},"With the advent of ChatGPT, foundation models are significantly shaping various industries. It has become a trend to use AI to promote the development of scientific research. 
On August 2, 2023, ",{"type":17,"tag":47,"props":48,"children":52},"a",{"href":49,"rel":50},"https://www.nature.com/articles/s41586-023-06221-2",[51],"nofollow",[53],{"type":17,"tag":54,"props":55,"children":56},"em",{},[57],{"type":23,"value":58},"Scientific discovery in the age of artificial intelligence",{"type":23,"value":60},", a paper by researchers from institutions including the University of Cambridge, the Georgia Institute of Technology, and Cornell University, was published in Nature. The paper states that AI is being integrated into every stage of scientific research, including hypothesis generation, data processing, and experiment design. This post summarizes some of the progress in AI for science described in the paper and where AI can best apply its strengths in scientific research.",{"type":17,"tag":25,"props":62,"children":63},{},[64],{"type":17,"tag":29,"props":65,"children":66},{},[67],{"type":23,"value":68},"2. AI-aided Data Collection and Curation",{"type":17,"tag":25,"props":70,"children":71},{},[72],{"type":23,"value":73},"Scientific research relies heavily on data collection and processing, where AI can perform data screening, labeling, generation, and refinement. The paper uses a particle collision experiment as an example: it generates over 100 terabytes of data every second, yet 99.99% of that data represents background events that must be discarded in real time so that the important data can be transmitted. AI-aided screening aims to identify the rare events required by subsequent tasks in such a large amount of raw data. Current approaches model the data with a deep learning autoencoder, which returns a higher loss value for previously unseen events and thereby flags them for retention. This approach has been widely used in physics, neuroscience, geoscience, oceanography, and astronomy. In biology, the functions and structure of newly characterized molecules are often not labeled. 
Take protein sequences as an example: fewer than 1% of sequenced proteins are labeled with biological functions. Two promising strategies for automatic data labeling are pseudo-labeling and label propagation. Pseudo-labeling first trains a model on a small amount of labeled data and then uses it to predict labels for the unlabeled data. In contrast, label propagation diffuses labels to unlabeled samples through similarity graphs constructed from feature embeddings, which essentially introduces an inductive bias. Another strategy for introducing inductive bias is to develop labeling rules that use domain knowledge. In terms of data generation, generative models such as generative adversarial networks (GANs) have been used to synthesize data including particle collision events, pathology slides, chest X-rays, magnetic resonance contrasts, three-dimensional (3D) material microstructure, protein functions, and DNA sequences. AI is also used for super-resolution and denoising, for example, to improve measurement resolution, eliminate measurement errors, visualize regions of spacetime such as black holes, improve the resolution of living cell images, and improve protein-RNA expression analysis.",{"type":17,"tag":25,"props":75,"children":76},{},[77],{"type":17,"tag":78,"props":79,"children":81},"img",{"alt":7,"src":80},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2023/11/10/7f1d44bc29c44eb6989d0d27b9e156c6.png",[],{"type":17,"tag":25,"props":83,"children":84},{},[85],{"type":23,"value":86},"Figure 1. Representation learning diagram: (a) Geometric deep learning; (b) Self-supervised learning; (c) Masked-language modeling",{"type":17,"tag":25,"props":88,"children":89},{},[90],{"type":17,"tag":29,"props":91,"children":92},{},[93],{"type":23,"value":94},"3. 
Learning Meaningful Representations of Scientific Data",{"type":17,"tag":25,"props":96,"children":97},{},[98],{"type":23,"value":99},"The success of deep learning is often attributed to its ability to learn high-quality representations from scientific data. The paper briefly introduces geometric deep learning, self-supervised learning, and language model representation learning in the scientific field, as shown in Figure 1. Geometric structure is an important feature in nature, and geometric deep learning can capture the underlying relational patterns. A typical approach is to construct graph representations suited to the scientific scenario in order to model complex systems. Directed graphs can assist in physical modeling of glassy systems, and hypergraphs, whose edges connect multiple nodes, can be used to understand chromatin structures. In Large Hadron Collider (LHC) physics tasks, graphs are also used to reconstruct particles from the detector output and to discriminate physics signals. In natural language processing (NLP) tasks, language models are often trained in a self-supervised manner by masking some words and predicting them. Similarly, atoms or amino acids are arranged into molecules, much as letters form the words and sentences that define the meaning of a document, which makes it possible to build chemical and protein language models. Protein language models can encode amino acid sequences to predict biological functions, and the representations they learn can be applied to a range of downstream tasks, from sequence design to structure prediction. For biochemical sequences, chemical language models help explore the vast chemical space efficiently. They have been used to predict properties, plan multi-step syntheses, and more.",{"type":17,"tag":25,"props":101,"children":102},{},[103],{"type":17,"tag":29,"props":104,"children":105},{},[106],{"type":23,"value":107},"4. 
AI-based Generation of Scientific Hypotheses",{"type":17,"tag":25,"props":109,"children":110},{},[111],{"type":23,"value":112},"Testable hypotheses are the key to scientific research. However, formulating a meaningful hypothesis can be an arduous process. In this process, AI can learn symbolic expressions from data to generate hypotheses, design objects, or build mathematical counterexamples. Take drug discovery as an example. High-throughput screening can assess thousands to millions of molecules, and algorithms can help prioritize which samples to test experimentally. Moreover, the first batch of samples selected by these models can be evaluated through experiments, and the resulting feedback is combined with optimization algorithms to propose more meaningful candidate samples. Another way is to develop a proper scoring rule for each hypothesis and then use reinforcement learning to identify an optimal hypothesis. The reinforcement learning agent takes actions in the search space to maximize a reward, which can be defined as a metric of hypothesis quality. Scientific hypotheses usually take the form of discrete objects, but it can also be useful to map discrete objects into a continuous space, typically with a variational autoencoder. In astrophysics, variational autoencoders have been used to estimate gravitational-wave detector parameters based on pre-trained black hole waveform models. This approach is up to six orders of magnitude faster than traditional approaches, making it practical to capture transient gravitational wave events. In materials science, thermodynamic rules are combined with an autoencoder to design an interpretable latent space for identifying phase maps of crystal structures.",{"type":17,"tag":25,"props":114,"children":115},{},[116],{"type":17,"tag":29,"props":117,"children":118},{},[119],{"type":23,"value":120},"5. 
AI-driven Experiments and Simulation",{"type":17,"tag":25,"props":122,"children":123},{},[124],{"type":23,"value":125},"Experiments are an important means of testing scientific hypotheses, yet many experiments can be quite costly or impractical. AI can provide experimental design and optimization tools that save resources. One example is synthesis planning in chemistry, which involves finding a series of steps by which a target compound can be synthesized from a predetermined set of chemicals. AI systems can design routes for synthesizing desired compounds, reducing the need for manual intervention. During an ongoing experiment, decisions often need to be adapted in real time, and relying solely on human experience and intuition to do so can be difficult and error-prone. Reinforcement learning provides an alternative that can continually react to a changing environment and maximize the safety and success rate of experiments. For example, reinforcement learning approaches have proven effective for magnetic control of tokamak plasmas, where the algorithm interacts with a tokamak simulator to optimize the control policy. Computer simulation is a powerful tool in experimental evaluation. However, existing simulation technologies rely heavily on human understanding and knowledge of the underlying mechanisms of the studied systems, which can be suboptimal and inefficient. AI can assist computer simulation through efficient learning, even without a deep understanding of the underlying principles. Molecular force fields are one example: they are interpretable, but designing them to represent various functions requires deep scientific knowledge. To improve the accuracy of molecular simulations, AI-based neural potentials fitted to quantum mechanical data have been developed to replace traditional force fields. 
In quantum physics, neural networks are gradually replacing manually estimated symbolic forms in parameterizing wave functions or density functionals, owing to their flexibility and ability to fit data accurately.",{"type":17,"tag":25,"props":127,"children":128},{},[129],{"type":17,"tag":29,"props":130,"children":131},{},[132],{"type":23,"value":133},"6. Summary",{"type":17,"tag":25,"props":135,"children":136},{},[137,139,143],{"type":23,"value":138},"This blog lists some achievements of AI in the scientific research fields covered in ",{"type":17,"tag":54,"props":140,"children":141},{},[142],{"type":23,"value":58},{"type":23,"value":144},", including data processing and representation learning, hypothesis generation and design, and AI-assisted experiments. AI is clearly playing, and will continue to play, a critical role in scientific research.",{"type":17,"tag":25,"props":146,"children":147},{},[148],{"type":17,"tag":29,"props":149,"children":150},{},[151],{"type":23,"value":152},"References",{"type":17,"tag":25,"props":154,"children":155},{},[156,158,163,165,170,172],{"type":23,"value":157},"[1] Wang, H., Fu, T., Du, Y. ",{"type":17,"tag":54,"props":159,"children":160},{},[161],{"type":23,"value":162},"et al.",{"type":23,"value":164}," Scientific discovery in the age of artificial intelligence. ",{"type":17,"tag":54,"props":166,"children":167},{},[168],{"type":23,"value":169},"Nature",{"type":23,"value":171}," 620, 47–60 (2023). ",{"type":17,"tag":47,"props":173,"children":175},{"href":49,"rel":174},[51],[176],{"type":23,"value":49},{"title":7,"searchDepth":178,"depth":178,"links":179},4,[],"markdown","content:technology-blogs:en:2847.md","content","technology-blogs/en/2847.md","technology-blogs/en/2847","md",1776506107585]