[{"data":1,"prerenderedAt":248},["ShallowReactive",2],{"content-query-gZdV5WH7tC":3},{"_path":4,"_dir":5,"_draft":6,"_partial":6,"_locale":7,"title":8,"description":9,"date":10,"cover":11,"type":12,"body":13,"_type":242,"_id":243,"_source":244,"_file":245,"_stem":246,"_extension":247},"/technology-blogs/en/3104","en",false,"","MindSpore Learning: Hands-on MindSpore Reinforcement Learning (1)","The DQN algorithm introduced in this blog is used to solve the problem of discrete actions in continuous states.","2024-03-04","https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/4b3279e93ae8490f965908571f8f2b4e.png","technology-blogs",{"type":14,"children":15,"toc":239},"root",[16,24,34,42,87,95,100,112,122,127,135,146,154,159,167,178,186,193,201,212,220,228],{"type":17,"tag":18,"props":19,"children":21},"element","h1",{"id":20},"mindspore-learning-hands-on-mindspore-reinforcement-learning-1",[22],{"type":23,"value":8},"text",{"type":17,"tag":25,"props":26,"children":27},"p",{},[28],{"type":17,"tag":29,"props":30,"children":31},"strong",{},[32],{"type":23,"value":33},"DQN Algorithm",{"type":17,"tag":25,"props":35,"children":36},{},[37],{"type":17,"tag":29,"props":38,"children":39},{},[40],{"type":23,"value":41},"Overview",{"type":17,"tag":25,"props":43,"children":44},{},[45,47,52,54,59,61,66,68,73,75,79,81,85],{"type":23,"value":46},"In the Q-learning algorithm we previously learned, we constructed a table in matrix form to store the ",{"type":17,"tag":29,"props":48,"children":49},{},[50],{"type":23,"value":51},"Q",{"type":23,"value":53}," values of all actions in each state. 
The value ",{"type":17,"tag":29,"props":55,"children":56},{},[57],{"type":23,"value":58},"Q(s,a)",{"type":23,"value":60}," of each action in the table estimates the expected return that can be obtained by selecting the action ",{"type":17,"tag":29,"props":62,"children":63},{},[64],{"type":23,"value":65},"a",{"type":23,"value":67}," under the state ",{"type":17,"tag":29,"props":69,"children":70},{},[71],{"type":23,"value":72},"s",{"type":23,"value":74}," and continuing to follow a specific policy. However, this method of storing action values in a table is limited to environments whose states and actions are discrete and whose spaces are relatively small. This is true in several practical environments, such as Cliff Walking. But it does not apply when the number of states or actions is very large. For example, for an RGB image of size 210 x 160 x 3, the total number of possible states is 256^{(210 x 160 x 3)}, and storing all the ",{"type":17,"tag":29,"props":76,"children":77},{},[78],{"type":23,"value":51},{"type":23,"value":80}," values under each state in a table on a computer would be impractical. Additionally, in cases where states or actions are continuous, the sheer multitude of state-action pairs makes it infeasible to record ",{"type":17,"tag":29,"props":82,"children":83},{},[84],{"type":23,"value":51},{"type":23,"value":86}," values for each pair in such a table. In this case, we need to estimate the values with function approximation. The DQN algorithm can be used to solve the problem of discrete actions in continuous states.",{"type":17,"tag":25,"props":88,"children":89},{},[90],{"type":17,"tag":29,"props":91,"children":92},{},[93],{"type":23,"value":94},"DQN Coding Practice",{"type":17,"tag":25,"props":96,"children":97},{},[98],{"type":23,"value":99},"Next, we will try to implement the DQN algorithm using the CartPole-v0 testing environment. 
This environment has a relatively simple state space with only four variables, so the network structure can also be kept simple: a single fully connected hidden layer with 128 neurons, using ReLU as the activation function. For more complex environments, such as those that take images as inputs, we can consider using deep convolutional networks.",{"type":17,"tag":25,"props":101,"children":102},{},[103,105,110],{"type":23,"value":104},"In this practice, we will leverage the ",{"type":17,"tag":29,"props":106,"children":107},{},[108],{"type":23,"value":109},"rl_utils",{"type":23,"value":111}," library, which provides a set of functions tailored for Hands-on RL, including plotting moving-average curves and calculating advantage functions, so that different algorithms can reuse them.",{"type":17,"tag":113,"props":114,"children":116},"pre",{"code":115},"import random\nimport gym\nimport numpy as np\nimport collections\nfrom tqdm import tqdm\nimport mindspore as ms\nfrom mindspore import ops, nn\nimport matplotlib.pyplot as plt\nimport rl_utils\n",[117],{"type":17,"tag":118,"props":119,"children":120},"code",{"__ignoreMap":7},[121],{"type":23,"value":115},{"type":17,"tag":25,"props":123,"children":124},{},[125],{"type":23,"value":126},"First, define the class for the experience replay pool, which includes two functions used for adding data and sampling data.",{"type":17,"tag":113,"props":128,"children":130},{"code":129},"class ReplayBuffer:\n    ''' experience replay pool '''\n    def __init__(self, capacity):\n        self.buffer = collections.deque(maxlen=capacity) # queue, first in first out\n\n    def add(self, state, action, reward, next_state, done):\n        self.buffer.append((state, action, reward, next_state, done)) # Add data to the buffer.\n\n    def sample(self, batch_size): # Sample data from the buffer, and the number of data pieces is batch_size.\n        transitions = random.sample(self.buffer, batch_size)\n        state, action, reward, next_state, done = zip(*transitions)\n        return np.array(state), action, reward, np.array(next_state), done\n\n    def size(self): # Number of data pieces in the buffer.\n        return len(self.buffer)\n",[131],{"type":17,"tag":118,"props":132,"children":133},{"__ignoreMap":7},[134],{"type":23,"value":129},{"type":17,"tag":25,"props":136,"children":137},{},[138,140,144],{"type":23,"value":139},"Next, define the class for the ",{"type":17,"tag":29,"props":141,"children":142},{},[143],{"type":23,"value":51},{"type":23,"value":145}," network with only one hidden layer.",{"type":17,"tag":113,"props":147,"children":149},{"code":148},"class Qnet(nn.Cell):\n    ''' 
Q network with only one hidden layer '''\n    def __init__(self, state_dim, hidden_dim, action_dim):\n        super(Qnet, self).__init__()\n        self.fc1 = nn.Dense(state_dim, hidden_dim)\n        self.fc2 = nn.Dense(hidden_dim, action_dim)\n\n    def construct(self, x):\n        x = ops.relu(self.fc1(x))\n        return self.fc2(x)\n",[150],{"type":17,"tag":118,"props":151,"children":152},{"__ignoreMap":7},[153],{"type":23,"value":148},{"type":17,"tag":25,"props":155,"children":156},{},[157],{"type":23,"value":158},"Then, let's see the code of the DQN algorithm.",{"type":17,"tag":113,"props":160,"children":162},{"code":161},"device = \"cuda:0\" # Kept for interface compatibility with Hands-on RL; MindSpore selects the device through its context, so this value is not used in the computation.\n\nimport mindspore as ms\nfrom mindspore import nn, ops\n\nclass DQN:\n    ''' DQN algorithm '''\n    def __init__(self, state_dim, hidden_dim, action_dim, learning_rate, gamma, epsilon, target_update, device):\n        self.action_dim = action_dim\n        self.q_net = Qnet(state_dim, hidden_dim, self.action_dim) # Q network\n        self.target_q_net = Qnet(state_dim, hidden_dim, self.action_dim) # Target network\n        self.optimizer = ms.nn.Adam(self.q_net.trainable_params(), learning_rate=learning_rate) # Use the Adam optimizer.\n        self.gamma = gamma # Discount factor\n        self.epsilon = epsilon # epsilon-greedy\n        self.target_update = target_update # Target network update frequency\n        self.count = 0 # Counter, which records the number of updates\n        self.device = device # Device\n\n        self.q_net.to_float(ms.float16)\n        self.target_q_net.to_float(ms.float16)\n\n    def take_action(self, state): # Actions taken according to the epsilon-greedy policy\n        if np.random.random() \u003C self.epsilon:\n            action = np.random.randint(self.action_dim)\n        else:\n            state = ms.Tensor([state], dtype=ms.float32)\n            action = self.q_net(state).argmax().asnumpy().item()\n        return action\n\n    def update(self, transition_dict):\n        states = ms.Tensor(transition_dict['states'], dtype=ms.float32)\n        actions = ms.Tensor(transition_dict['actions'], dtype=ms.int32).view(-1, 1) # Integer indices for gather_elements\n        rewards = ms.Tensor(transition_dict['rewards'], dtype=ms.float32).view(-1, 1)\n        next_states = ms.Tensor(transition_dict['next_states'], dtype=ms.float32)\n        dones = ms.Tensor(transition_dict['dones'], dtype=ms.float32).view(-1, 1)\n\n        def forward_pass(states, actions, rewards, next_states, dones):\n            q_values = ops.gather_elements(self.q_net(states), 1, actions) # Q value\n            max_next_q_values = self.target_q_net(next_states).max(axis=1, keepdims=True) # Maximum Q value for the next state\n            q_targets = rewards + self.gamma * max_next_q_values * (1 - dones) # TD target\n            return ops.square(q_values - q_targets).mean() # Mean square error loss function\n\n        grad_fn = ms.value_and_grad(forward_pass, None, self.optimizer.parameters)\n        loss, grads = grad_fn(states, actions, rewards, next_states, dones)\n        self.optimizer(grads)\n\n        if self.count % self.target_update == 0:\n            # Update the parameters of the target network.\n            ms.load_param_into_net(self.target_q_net, self.q_net.parameters_dict())\n        self.count += 1\n",[163],{"type":17,"tag":118,"props":164,"children":165},{"__ignoreMap":7},[166],{"type":23,"value":161},{"type":17,"tag":25,"props":168,"children":169},{},[170,172,176],{"type":23,"value":171},"Everything is ready. Let's start training and check the results. 
Afterwards, we package the training process into ",{"type":17,"tag":29,"props":173,"children":174},{},[175],{"type":23,"value":109},{"type":23,"value":177}," so that it can be easily reused when implementing the algorithms we will learn later.",{"type":17,"tag":25,"props":179,"children":180},{},[181],{"type":17,"tag":182,"props":183,"children":185},"img",{"alt":7,"src":184},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/fbba62dccff0497894b2aa5d06916286.png",[],{"type":17,"tag":25,"props":187,"children":188},{},[189],{"type":17,"tag":182,"props":190,"children":192},{"alt":7,"src":191},"https://obs-mindspore-file.obs.cn-north-4.myhuaweicloud.com/file/2024/05/17/751bde10efe744ee86dff76859b0d4c6.png",[],{"type":17,"tag":25,"props":194,"children":195},{},[196],{"type":17,"tag":29,"props":197,"children":198},{},[199],{"type":23,"value":200},"DQN Algorithm Using Images as Inputs",{"type":17,"tag":25,"props":202,"children":203},{},[204,206,210],{"type":23,"value":205},"In previous reinforcement learning environments, non-image states, such as vehicle coordinates and speeds, were used as inputs. However, in certain video games, this state information is not directly accessible; the agent can only observe the images on the screen. To enable an agent to play as a human would, we must train it to make decisions with images as states. In this case, we can use the DQN algorithm and incorporate convolutional networks into the network structure to extract image features, ultimately achieving reinforcement learning with images as inputs. The main difference between the code for the DQN algorithm with image inputs and the above code lies in the structure and data inputs of the ",{"type":17,"tag":29,"props":207,"children":208},{},[209],{"type":23,"value":51},{"type":23,"value":211}," network. Typically, the most recent few frames are used together as input to the DQN network; using more than one frame lets the network perceive the dynamics of the environment. 
Next, let's look at the code of the DQN algorithm that uses images as inputs. However, due to the long running time required, we will not display the training result here.",{"type":17,"tag":113,"props":213,"children":215},{"code":214},"class ConvolutionalQnet(nn.Cell):\n    ''' Q network with convolutional layers added '''\n    def __init__(self, action_dim, in_channels=4):\n        super(ConvolutionalQnet, self).__init__()\n        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)\n        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)\n        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)\n        self.fc4 = nn.Dense(7 * 7 * 64, 512)\n        self.head = nn.Dense(512, action_dim)\n\n    def construct(self, x):\n        x = x.float() / 255\n        x = ops.relu(self.conv1(x))\n        x = ops.relu(self.conv2(x))\n        x = ops.relu(self.conv3(x))\n        x = ops.relu(self.fc4(x.view(x.shape[0], -1)))\n        return self.head(x)\n",[216],{"type":17,"tag":118,"props":217,"children":218},{"__ignoreMap":7},[219],{"type":23,"value":214},{"type":17,"tag":25,"props":221,"children":222},{},[223],{"type":17,"tag":29,"props":224,"children":225},{},[226],{"type":23,"value":227},"Summary",{"type":17,"tag":25,"props":229,"children":230},{},[231,233,237],{"type":23,"value":232},"In this blog, we learned the DQN algorithm, which leverages a neural network to model the ",{"type":17,"tag":29,"props":234,"children":235},{},[236],{"type":23,"value":51},{"type":23,"value":238}," function of the optimal policy and updates the parameters using the Q-learning idea. To enhance the stability and efficiency of training, DQN introduces two key modules, experience replay and the target network, which enable the algorithm to achieve better performance in practical applications. DQN is the basis of deep reinforcement learning, and proficiency in this algorithm marks the first step into the field. 
As we continue to delve into this field, many more deep reinforcement learning algorithms await our discovery and exploration.",{"title":7,"searchDepth":240,"depth":240,"links":241},4,[],"markdown","content:technology-blogs:en:3104.md","content","technology-blogs/en/3104.md","technology-blogs/en/3104","md",1776506110404]