MindSpore Reinforcement API

Components for MindSpore Reinforcement Learning Framework.

mindspore_rl.agent

Components for agent, actor, learner, trainer.

class mindspore_rl.agent.Actor[source]

Base class for all actors.

Examples

>>> from mindspore.ops import operations as P
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax()
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.env = GymEnvironment({'name': 'CartPole-v0'})
>>> my_actor = MyActor()
>>> print(my_actor)
MyActor<
(actor_net): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
(environment): GymEnvironment<>
>
act(state)[source]

The act function interface. Users need to override this function according to their algorithm; its argument should be the state output from the environment. A sketch of an overridden act is shown after the Returns list below.

Parameters

state (Tensor) – the output state from the environment.

Returns

  • done (Tensor), whether the simulation is finished or not.

  • reward (Tensor), simulation reward.

  • state (Tensor), simulation state.
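
A minimal sketch of an overridden act, extending the MyActor example above. The greedy-action logic and the way the environment is stepped are illustrative assumptions, not the framework's prescribed implementation:

>>> from mindspore.ops import operations as P
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax()
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.env = GymEnvironment({'name': 'CartPole-v0'})
...   def act(self, state):
...     # Choose the greedy action from the network output and step the environment.
...     action = self.argmax(self.actor_net(state))
...     new_state, reward, done = self.env.step(action)
...     return done, reward, new_state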

act_init(state)[source]

The act initialization function interface. Users need to override this function according to their algorithm; its argument should be the state output from the environment.

Parameters

state (Tensor) – the output state from the environment.

Returns

  • done (Tensor), whether the simulation is finished or not.

  • reward (Tensor), simulation reward.

  • state (Tensor), simulation state.

env_setter(env)[source]

Set the environment for the actor using the input env. The env is created by the GymEnvironment class or another environment class.

Parameters

env (object) – the input environment.

Returns

environment.

evaluate(state)[source]

The act evaluation function interface. Users need to override this function according to their algorithm; its argument should be the state output from the environment.

Parameters

state (Tensor) – the output state from the environment.

Returns

  • done (Tensor), whether the simulation is finished or not.

  • reward (Tensor), simulation reward.

  • state (Tensor), simulation state.

reset_collect_actor()[source]

Reset the collect actor: reset the collect actor's environment and return the reset state together with a false done flag.

Returns

  • state (Tensor), the state of the actor after reset.

  • done (Tensor), a done flag that is always false after the reset.

reset_eval_actor()[source]

Reset the eval actor: reset the eval actor's environment and return the reset state together with a false done flag.

Returns

  • state (Tensor), the state of the actor after reset.

  • done (Tensor), a done flag that is always false after the reset.

update()[source]

The update function interface. Users need to override this function according to their algorithm.

class mindspore_rl.agent.Agent(num_actor, actors, learner)[source]

The base class for the Agent.

Parameters
  • num_actor (int) – The number of actors in this agent.

  • actors (object) – The actor instance.

  • learner (object) – The learner instance.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.agent.agent import Agent
>>> actor_num = 1
>>> actors = Actor()
>>> learner = Learner()
>>> agent = Agent(actor_num, actors, learner)
>>> print(agent)
Agent<
(_actors): Actor<>
(_learner): Learner<>
>
act()[source]

The act function interface.

property actors

Get the instance of actors in the agent.

Returns

actors (object), actors object created by class Actor.

env_setter(env)[source]

Set the environment for the actors in the agent.

Parameters

env (object) – the input environment.

init()[source]

Initialize the agent and reset all the actors in the agent.

learn(samples)[source]

The learn function interface.

Parameters

samples (Tensor) – the samples from the replay buffer.

property learner

Get the instance of learner in the agent.

Returns

learner (object), learner object created by class Learner.

property num_actor

Get the number of the actors of the agent.

Returns

num_actor (int), the number of actors.

reset_all()[source]

Reset all the actors in the agent, and return the reset state and the done flag.

Returns

  • state (Tensor), the state of the reset environment in actor.

  • done (Tensor), a done flag that is always false after the reset.

update()[source]

The update function interface.

class mindspore_rl.agent.Learner[source]

The base class of the learner.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def __init__(self):
...     super(MyLearner, self).__init__()
...     self.target_network = FullyConnectedNet(4, 10, 2)
>>> my_learner = MyLearner()
>>> print(my_learner)
MyLearner<
(target_network): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
>
learn(samples)[source]

The learn function interface. The behavior of the learn function depends on the user's implementation. Usually, it takes the samples from the replay buffer or other Tensors and calculates the loss to update the networks; a sketch is shown below.

Parameters

samples (Tensor) – The samples drawn from the replay buffer.

Returns

success, whether the training was successful or not.
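
A minimal sketch of an overridden learn, extending the MyLearner example above. The (state, target) sample layout and the MSE loss are illustrative assumptions; a real implementation would also update the network parameters with an optimizer:

>>> import mindspore.nn as nn
>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def __init__(self):
...     super(MyLearner, self).__init__()
...     self.target_network = FullyConnectedNet(4, 10, 2)
...     self.loss_fn = nn.MSELoss()
...   def learn(self, samples):
...     # Assume samples holds (state, target) pairs drawn from the replay buffer.
...     state, target = samples
...     loss = self.loss_fn(self.target_network(state), target)
...     return loss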

class mindspore_rl.agent.Trainer(msrl)[source]

The trainer base class.

Note

Refer to dqn_trainer.py for an example implementation.

Parameters

msrl (object) – the MSRL instance that provides the function handlers.

train(episode)[source]

The train function interface. Users need to implement this function; a sketch is shown below.

Parameters

episode (int) – the number of training episodes.
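
A minimal sketch of a Trainer subclass. The episode loop is illustrative; the per-episode logic would normally be built from the msrl function handlers (see dqn_trainer.py for a complete implementation):

>>> from mindspore_rl.agent import Trainer
>>> class MyTrainer(Trainer):
...   def __init__(self, msrl):
...     super(MyTrainer, self).__init__(msrl)
...     self.msrl = msrl
...   def train(self, episode):
...     # Run the requested number of training episodes.
...     for i in range(episode):
...       # One episode of interaction and learning, built from the msrl
...       # function handlers (e.g. agent_act, agent_learn); omitted here.
...       pass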

mindspore_rl.core

Helper components used to implement RL algorithms.

class mindspore_rl.core.MSRL(config)[source]

The MSRL class provides the function handlers and APIs for reinforcement learning algorithm development.

It exposes the following function handlers to the user. The input and output of these function handlers are identical to the user-defined functions.

agent_act_init
agent_act_collect
agent_act_eval
agent_act
agent_reset
sample_buffer
agent_learn
Parameters

config (dict) – provides the algorithm configuration. A minimal configuration sketch is given after the list below.

  • Top level: defines the algorithm components.

    • key: ‘actor’, value: the actor configuration (dict).

    • key: ‘learner’, value: the learner configuration (dict).

    • key: ‘policy_and_network’, value: the policy and networks used by actors and learners (dict).

    • key: ‘env’, value: the environment configuration (dict).

  • Second level: the configuration of each algorithm component.

    • key: ‘number’, value: the number of actors/learner (int).

    • key: ‘type’, value: the type of the actor/learner/policy_and_network/environment (class name).

    • key: ‘params’, value: the parameters of actor/learner/policy_and_network/environment (dict).

    • key: ‘policies’, value: the list of policies used by the actor/learner (list).

    • key: ‘networks’, value: the list of networks used by the actor/learner (list).

    • key: ‘environment’, value: True if the component needs to interact with the environment, False otherwise (Bool).

    • key: ‘buffer’, value: the buffer configuration (dict).
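
A minimal configuration sketch following the structure above. MyActor, MyLearner, MyPolicyAndNetwork and all parameter values are hypothetical placeholders, not a complete or validated configuration:

>>> from mindspore_rl.core import MSRL
>>> from mindspore_rl.environment import GymEnvironment
>>> config = {
...     'actor': {'number': 1, 'type': MyActor, 'params': None,
...               'policies': ['collect_policy'], 'networks': ['actor_net'],
...               'environment': True},
...     'learner': {'number': 1, 'type': MyLearner, 'params': None,
...                 'networks': ['target_net']},
...     'policy_and_network': {'type': MyPolicyAndNetwork, 'params': None},
...     'env': {'number': 1, 'type': GymEnvironment,
...             'params': {'name': 'CartPole-v0'}},
... }
>>> msrl = MSRL(config)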

get_replay_buffer()[source]

Return the instance of the replay buffer.

Returns

Buffers (object), the instance of the replay buffer. If the buffer is None, the return value will be None.

get_replay_buffer_elements(transpose=False, shape=None)[source]

Return all the elements in the replay buffer.

Parameters
  • transpose (boolean) – whether the output elements need to be transposed. If True, shape must also be given. Default: False.

  • shape (Tuple[int]) – the shape used in the transpose. Default: None.

Returns

elements (List[Tensor]), a set of tensors containing all the elements in the replay buffer.
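
A hedged usage sketch, assuming msrl is an initialized MSRL instance with a configured replay buffer; the transpose shape values are illustrative only:

>>> elements = msrl.get_replay_buffer_elements()
>>> # With transpose=True, a permutation of axes is supplied via `shape`.
>>> transposed = msrl.get_replay_buffer_elements(transpose=True, shape=(1, 0, 2))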

init(config)[source]

Initialization of the MSRL object. The function creates all the data/objects that the algorithm requires and initializes all the function handlers.

Parameters

config (dict) – algorithm configuration file.

class mindspore_rl.core.Session(config)[source]

The Session is a class for running MindSpore RL algorithms.

Parameters

config (dict) – the algorithm configuration or the deployment configuration of the algorithm. For more details on the algorithm configuration, please refer to https://www.mindspore.cn/reinforcement/docs/zh-CN/r1.5/index.html

run(class_type=None, episode=0, params=None)[source]

Execute the reinforcement learning algorithm.

Parameters
  • class_type (class type) – The class type of the algorithm’s trainer class. Default: None.

  • episode (int) – The number of training episodes. Default: 0.

  • params (dict) – The algorithm specific training parameters. Default: None.
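
A minimal usage sketch, assuming algorithm_config is a configuration dict as described for MSRL and MyTrainer is a user-defined Trainer subclass; the episode count is illustrative:

>>> from mindspore_rl.core import Session
>>> session = Session(algorithm_config)
>>> session.run(class_type=MyTrainer, episode=650)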

mindspore_rl.environment

Components used to implement custom environments.

class mindspore_rl.environment.GymEnvironment(params)[source]

The GymEnvironment class provides the functions to interact with different environments.

Parameters

params (dict) – A dictionary that contains all the parameters used to create the instance of GymEnvironment, such as the name of the environment. Since this environment is based on Gym, the name of the environment should match the name in Gym.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import GymEnvironment
>>> env_params = {'name': 'CartPole-v0'}
>>> environment = GymEnvironment(env_params)
>>> print(environment)
GymEnvironment<>
property action_space_dim

Get the action space dim of the environment.

Returns

A tuple that represents the space dimension of the action.

clone()[source]

Make a copy of the environment.

Returns

env (object), a copy of the original environment object.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It returns the value of the initial state.

Returns

A tensor that represents the initial state of the environment.

property state_space_dim

Get the state space dim of the environment.

Returns

A tuple that represents the space dimension of the state.

step(action)[source]

Execute one environment step, that is, interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), the environment state after performing the action.

  • reward (Tensor), the reward after performing the action.

  • done (mindspore.bool_), whether the simulation finishes or not.
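
A hedged usage sketch of reset and step, assuming a discrete CartPole action encoded as an int32 tensor (the exact action shape and dtype depend on the wrapped Gym environment):

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.environment import GymEnvironment
>>> environment = GymEnvironment({'name': 'CartPole-v0'})
>>> state = environment.reset()
>>> action = Tensor(np.array(0, dtype=np.int32))
>>> state, reward, done = environment.step(action)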

class mindspore_rl.environment.GymMultiEnvironment(params)[source]

The GymMultiEnvironment class provides the functions to interact with different environments. It is the multi-environment version of GymEnvironment.

Parameters

params (dict) – A dictionary that contains all the parameters used to create the instance of GymMultiEnvironment, such as the name of the environment and the number of environments. Since this environment is based on Gym, the name of the environment should match the name in Gym.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import GymMultiEnvironment
>>> env_params = {'name': 'CartPole-v0', 'env_nums': 10}
>>> environment = GymMultiEnvironment(env_params)
>>> print(environment)
GymMultiEnvironment<>
property action_space_dim

Get the action space dim of the environment.

Returns

A tuple that represents the space dimension of the action.

clone()[source]

Make a copy of the environment.

Returns

env (object), a copy of the original environment object.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It returns the initial state of each environment.

Returns

A list of tensors that represents the initial state of each environment.

property state_space_dim

Get the state space dim of the environment.

Returns

A tuple that represents the space dimension of the state.

step(action)[source]

Execute one environment step, that is, interact with each environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), a list of environment states after performing the actions.

  • reward (Tensor), a list of rewards after performing the actions.

  • done (Tensor), whether each environment's simulation finishes or not.
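
A hedged usage sketch of the multi-environment step, assuming 10 CartPole environments and one int32 action per environment (shapes and dtypes are illustrative):

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.environment import GymMultiEnvironment
>>> environment = GymMultiEnvironment({'name': 'CartPole-v0', 'env_nums': 10})
>>> states = environment.reset()
>>> actions = Tensor(np.zeros((10,), dtype=np.int32))
>>> states, rewards, dones = environment.step(actions)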

mindspore_rl.network

Network components used to implement policies.

class mindspore_rl.network.FullyConnectedNet(input_size, hidden_size, output_size)[source]

A basic fully connected neural network.

Parameters
  • input_size (int) – the number of input units.

  • hidden_size (int) – the number of hidden units.

  • output_size (int) – the number of output units.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedNet(4, 10, 2)
>>> output = net(input)
>>> print(output.shape)
(2, 2)
construct(x)[source]

Returns the output of the final Dense layer.

Parameters

x (Tensor) – Tensor as the input of network.

Returns

The output of the Dense layer.

mindspore_rl.policy

Policies used in RL algorithms.

class mindspore_rl.policy.EpsilonGreedyPolicy(input_network, size, epsi_high, epsi_low, decay, action_space_dim)[source]

Produces an epsilon-greedy sample action based on the given policy.

Parameters
  • input_network (Cell) – A network returns policy action.

  • size (int) – Shape of epsilon.

  • epsi_high (float) – A high epsilon for exploration, in the range [0, 1].

  • epsi_low (float) – A low epsilon for exploration, in the range [0, epsi_high].

  • decay (float) – A decay factor applied to epsilon.

  • action_space_dim (int) – Dimensions of the action space.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import EpsilonGreedyPolicy
>>> state_dim, hidden_dim, action_dim = (4, 10, 2)
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = EpsilonGreedyPolicy(input_net, 1, 0.1, 0.1, 100, action_dim)
>>> state = Tensor(np.ones([1, state_dim]).astype(np.float32))
>>> step =  Tensor(np.array([10,]).astype(np.float32))
>>> output = policy(state, step)
>>> print(output.shape)
(1,)
construct(state, step)[source]

The interface of the construct function.

Parameters
  • state (Tensor) – The input tensor for network.

  • step (Tensor) – The current step, which affects the epsilon decay.

Returns

The output action.

class mindspore_rl.policy.GreedyPolicy(input_network)[source]

Produces a greedy action based on the given policy.

Parameters

input_network (Cell) – The network used to generate action probabilities from the input state.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import GreedyPolicy
>>> state_dim, hidden_dim, action_dim = 4, 10, 2
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = GreedyPolicy(input_net)
>>> state = Tensor(np.ones([2, 4]).astype(np.float32))
>>> output = policy(state)
>>> print(output.shape)
(2,)
construct(state)[source]

Returns the best action.

Parameters

state (Tensor) – State tensor as the input of network.

Returns

action_max, the best action.

class mindspore_rl.policy.Policy[source]

The virtual base class for the policy. This class should be overridden before being called in the model.

construct(*inputs, **kwargs)[source]

The interface of the construct function.

Parameters
  • inputs – it depends on the user's definition.

  • kwargs – it depends on the user's definition.

Returns

User defined.
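
A minimal sketch of a user-defined policy, assuming the policy simply returns the argmax of a network's output; this is an illustrative assumption, not a prescribed implementation:

>>> from mindspore.ops import operations as P
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import Policy
>>> class MyPolicy(Policy):
...   def __init__(self, input_network):
...     super(MyPolicy, self).__init__()
...     self.input_network = input_network
...     self.argmax = P.Argmax()
...   def construct(self, state):
...     # Return the greedy action for the given state.
...     return self.argmax(self.input_network(state))
>>> policy = MyPolicy(FullyConnectedNet(4, 10, 2))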

class mindspore_rl.policy.RandomPolicy(action_space_dim)[source]

Produces a random action in the range [0, action_space_dim).

Parameters

action_space_dim (int) – the dimension of the action space.

Examples

>>> from mindspore_rl.policy import RandomPolicy
>>> action_space_dim = 2
>>> policy = RandomPolicy(action_space_dim)
>>> output = policy()
>>> print(output.shape)
(1,)
construct()[source]

Returns a random action in the range [0, action_space_dim).

Returns

A random integer in the range [0, action_space_dim).