mindspore_rl

Components for MindSpore Reinforcement Learning Framework.

mindspore_rl.agent

Components for agent, actor, learner, trainer.

class mindspore_rl.agent.Actor[source]

Base class for all actors.

Examples

>>> from mindspore import ops as P
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax()
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.environment = GymEnvironment({'name': 'CartPole-v0'})
>>> my_actor = MyActor()
>>> print(my_actor)
MyActor<
(actor_net): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
(environment): GymEnvironment<>
>
act(phase, params)[source]

The interface of the act function. Users need to overload this function according to the algorithm, and its arguments must be phase and params. This interface interacts with the environment.

Parameters
  • phase (enum) – An enumerated value that indicates the init, collect, eval or other user-defined stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors that represents the experience.

Return type

observation (tuple(Tensor))
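
For reference, the snippet below sketches one way to overload act() in a subclass. The greedy Argmax policy and the experience layout returned here (state, action, reward, new state, done) are illustrative assumptions, not requirements of the base class.

>>> import mindspore as ms
>>> from mindspore import ops as P
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):
...   def __init__(self):
...     super(MyActor, self).__init__()
...     self.argmax = P.Argmax(output_type=ms.int32)
...     self.actor_net = FullyConnectedNet(4, 10, 2)
...     self.environment = GymEnvironment({'name': 'CartPole-v0'})
...   def act(self, phase, params):
...     # params is assumed to be a tuple whose first element is the current state.
...     state = params[0]
...     # Choose the greedy action from the Q-values and step the environment once.
...     action = self.argmax(self.actor_net(state))
...     new_state, reward, done = self.environment.step(action)
...     return state, action, reward, new_state, done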

get_action(phase, params)[source]

The interface of the get_action function. Users need to overload this function according to the algorithm, and its arguments must be phase and params. This interface does not interact with the environment.

Parameters
  • phase (enum) – An enumerated value that indicates the init, collect, eval or other user-defined stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors that represents the experience.

Return type

observation (tuple(Tensor))

class mindspore_rl.agent.Agent(actors, learner)[source]

The base class for the Agent.

Parameters
  • actors (object) – The actor instance.

  • learner (object) – The learner instance.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.agent.agent import Agent
>>> actors = Actor()
>>> learner = Learner()
>>> agent = Agent(actors, learner)
>>> print(agent)
Agent<
(_actors): Actor<>
(_learner): Learner<>
>
act(phase, params)[source]

The act function takes an enumerated value and an observation or other data needed to calculate the action. It returns a set of outputs containing the new observation or other experience. In this function, the agent interacts with the environment.

Parameters
  • phase (enum) – An enumerated value that indicates the init, collect or eval stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors that represents the experience.

Return type

observation (tuple(Tensor))

get_action(phase, params)[source]

The get_action function takes an enumerated value and an observation or other data needed to calculate the action. It returns a set of outputs containing the computed actions. In this function, the agent does not interact with the environment.

Parameters
  • phase (enum) – An enumerated value that indicates the init, collect or eval stage.

  • params (tuple(Tensor)) – A tuple of tensors used as input to calculate the action.

Returns

A tuple of tensors that represents the experience.

Return type

observation (tuple(Tensor))

learn(experience)[source]

The learn function will take a set of experience as input to calculate the loss and update the weights.

Parameters

experience (tuple(Tensor)) – A tuple of tensors that represents the experience.

Returns

The result that is output after updating the weights.

Return type

results (tuple(Tensor))
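
For reference, the snippet below sketches an Agent subclass that simply forwards calls to its actor and learner. The attribute names self._actors and self._learner follow the printed representation in the example above; the delegation pattern itself is an illustrative assumption rather than a prescribed implementation.

>>> from mindspore_rl.agent.agent import Agent
>>> class MyAgent(Agent):
...   def __init__(self, actors, learner):
...     super(MyAgent, self).__init__(actors, learner)
...   def act(self, phase, params):
...     # Delegate to the actor, which interacts with the environment.
...     return self._actors.act(phase, params)
...   def learn(self, experience):
...     # Delegate to the learner, which updates the network weights.
...     return self._learner.learn(experience)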

class mindspore_rl.agent.Learner[source]

The base class of the learner.

Examples

>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def __init__(self):
...     super(MyLearner, self).__init__()
...     self.target_network = FullyConnectedNet(4, 10, 2)
>>> my_learner = MyLearner()
>>> print(my_learner)
MyLearner<
(target_network): FullyConnectedNet<
(linear1): Dense<input_channels=4, output_channels=10, has_bias=True>
(linear2): Dense<input_channels=10, output_channels=2, has_bias=True>
(relu): ReLU<>
>
>
learn(experience)[source]

The interface of the learn function. The behavior of the learn function depends on the user’s implementation. Usually, it takes samples from the replay buffer or other Tensors and calculates the loss used to update the networks.

Parameters

experience (tuple(Tensor)) – Samples from the replay buffer.

Returns

The result that is output after updating the weights.

Return type

results (tuple(Tensor))
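
The snippet below sketches a minimal learn() implementation that regresses a network towards precomputed targets. The layout of the experience tuple and the use of MSELoss with TrainOneStepCell are illustrative assumptions; real algorithms define their own loss cells.

>>> import mindspore.nn as nn
>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.network import FullyConnectedNet
>>> class MyLearner(Learner):
...   def __init__(self):
...     super(MyLearner, self).__init__()
...     self.policy_network = FullyConnectedNet(4, 10, 2)
...     loss_fn = nn.MSELoss()
...     optimizer = nn.Adam(self.policy_network.trainable_params(), learning_rate=0.001)
...     # Wrap the network and the loss into a one-step training cell.
...     self.train_net = nn.TrainOneStepCell(nn.WithLossCell(self.policy_network, loss_fn), optimizer)
...   def learn(self, experience):
...     # experience is assumed to be (state, target_q): network inputs and regression targets.
...     state, target_q = experience
...     loss = self.train_net(state, target_q)
...     return loss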

class mindspore_rl.agent.Trainer(msrl)[source]

The trainer base class.

Note

Refer to dqn_trainer.py for a reference implementation.

Parameters

msrl (object) – The function handler class (an MSRL instance).

evaluate()[source]

The interface of the evaluate function, used for evaluation during training.

load_and_eval(ckpt_path=None)[source]

The interface of the evaluation function for offline use. A checkpoint must be provided.

Parameters

ckpt_path (string) – The checkpoint file to restore net.

train(episodes, callbacks=None, ckpt_path=None)[source]

The interface of the train function. Users need to implement this function.

Parameters
  • episodes (int) – the number of training episodes.

  • callbacks (Optional[list[Callback]]) – List of callback objects. Default: None

  • ckpt_path (Optional[string]) – The checkpoint file to init or restore net. Default: None.

train_one_episode()[source]

The interface of the train-one-episode function in train. The output of this function must follow the order loss, rewards, steps, and optionally others.

trainable_variables()[source]

The variables to be saved to the checkpoint.
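
The snippet below sketches a Trainer subclass written as a simple PyNative-style loop. The msrl handlers it calls (agent_act, agent_learn, replay_buffer_insert, replay_buffer_sample) are listed in the mindspore_rl.core section below, while the collect_environment attribute, the trainer.COLLECT phase constant, the learner.policy_network reference (matching the MyLearner sketched above) and the values unpacked from agent_act are assumptions modeled on typical examples such as dqn_trainer.py.

>>> import mindspore as ms
>>> from mindspore_rl.agent import trainer
>>> from mindspore_rl.agent.trainer import Trainer
>>> class MyTrainer(Trainer):
...   def __init__(self, msrl, params=None):
...     super(MyTrainer, self).__init__(msrl)
...     self.msrl = msrl
...   def trainable_variables(self):
...     # Variables saved into the checkpoint by train() and restored by load_and_eval().
...     return {'policy_net': self.msrl.learner.policy_network}
...   def train_one_episode(self):
...     # One episode of collection followed by one learning update.
...     state = self.msrl.collect_environment.reset()
...     total_reward = ms.Tensor(0.0, ms.float32)
...     steps = ms.Tensor(0, ms.int32)
...     done = False
...     while not done:
...       # The unpacking below assumes the actor returns (new_state, action, reward, done).
...       new_state, action, reward, done = self.msrl.agent_act(trainer.COLLECT, state)
...       self.msrl.replay_buffer_insert([state, action, reward, new_state])
...       state = new_state
...       total_reward += reward
...       steps += 1
...       done = bool(done)
...     loss = self.msrl.agent_learn(self.msrl.replay_buffer_sample())
...     return loss, total_reward, steps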

mindspore_rl.core

Helper components used to implement RL algorithms.

class mindspore_rl.core.MSRL(config)[source]

The MSRL class provides the function handlers and APIs for reinforcement learning algorithm development.

It exposes the following function handlers to the user. The input and output of these function handlers are identical to those of the user-defined functions.

  • agent_act
  • sample_buffer
  • agent_learn
  • replay_buffer_sample
  • replay_buffer_insert
  • replay_buffer_reset
Parameters

config (dict) –

Provides the algorithm configuration (a configuration sketch is given after the key list below).

  • Top level: defines the algorithm components.

    • key: ‘actor’, value: the actor configuration (dict).

    • key: ‘learner’, value: the learner configuration (dict).

    • key: ‘policy_and_network’, value: the policy and networks used by actors and learners (dict).

    • key: ‘collect_environment’, value: the collect environment configuration (dict).

    • key: ‘eval_environment’, value: the eval environment configuration (dict).

    • key: ‘replay_buffer’, value: the replay buffer configuration (dict).

  • Second level: the configuration of each algorithm component.

    • key: ‘number’, value: the number of actors/learners (int).

    • key: ‘type’, value: the type of the actor/learner/policy_and_network/environment (class name).

    • key: ‘params’, value: the parameters of actor/learner/policy_and_network/environment (dict).

    • key: ‘policies’, value: the list of policies used by the actor/learner (list).

    • key: ‘networks’, value: the list of networks used by the actor/learner (list).

    • key: ‘pass_environment’, value: True if the user needs to pass the environment instance into the actor, False otherwise (bool).
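
For orientation, the snippet below sketches what such a configuration dictionary can look like. MyActor, MyLearner and MyPolicyAndNetwork are placeholder user-defined classes and the parameter values are illustrative; real algorithms (for example the DQN sample) define their own components, and off-policy algorithms additionally provide a 'replay_buffer' entry.

>>> from mindspore_rl.agent.actor import Actor
>>> from mindspore_rl.agent.learner import Learner
>>> from mindspore_rl.environment import GymEnvironment
>>> class MyActor(Actor):        # placeholder actor
...   pass
>>> class MyLearner(Learner):    # placeholder learner
...   pass
>>> class MyPolicyAndNetwork:    # placeholder container for policies and networks
...   def __init__(self, params):
...     pass
>>> algorithm_config = {
...   'actor': {'number': 1, 'type': MyActor,
...             'policies': ['collect_policy'], 'networks': ['actor_net']},
...   'learner': {'number': 1, 'type': MyLearner,
...               'params': {'gamma': 0.99}, 'networks': ['actor_net', 'target_net']},
...   'policy_and_network': {'type': MyPolicyAndNetwork,
...                          'params': {'state_space_dim': 4, 'action_space_dim': 2}},
...   'collect_environment': {'number': 1, 'type': GymEnvironment,
...                           'params': {'name': 'CartPole-v0'}},
...   'eval_environment': {'number': 1, 'type': GymEnvironment,
...                        'params': {'name': 'CartPole-v0'}},
... }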

get_replay_buffer()[source]

It will return the instance of replay buffer.

Returns

Buffers (object), the instance of the replay buffer. If the buffer is None, the return value will be None.

get_replay_buffer_elements(transpose=False, shape=None)[source]

It will return all the elements in the replay buffer.

Parameters
  • transpose (bool) – Whether the output elements need to be transposed. If transpose is True, shape must also be provided. Default: False

  • shape (Tuple[int]) – The shape used for the transpose. Default: None

Returns

elements (List[Tensor]), a list of tensors that contains all the elements in the replay buffer.

init(config)[source]

Initialization of the MSRL object. The function creates all the data/objects that the algorithm requires. It also initializes all the function handlers.

Parameters

config (dict) – algorithm configuration file.

class mindspore_rl.core.Session(config)[source]

The Session is a class for running MindSpore RL algorithms.

Parameters

config (dict) – the algorithm configuration or the deployment configuration of the algorithm. For more details of configuration of algorithm, please have a look at https://www.mindspore.cn/reinforcement/docs/zh-CN/master/custom_config_info.html

run(class_type=None, is_train=True, episode=0, params=None, callbacks=None)[source]

Execute the reinforcement learning algorithm.

Parameters
  • class_type (class type) – The class type of the algorithm’s trainer class. Default: None.

  • is_train (boolean) – Run the algorithm in train mode or eval mode. Default: True

  • episode (int) – The number of episodes of the training. Default: 0.

  • params (dict) – The algorithm specific training parameters. Default: None.

  • callbacks (list[Callback]) – The callback list. Default: None.
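
A minimal usage sketch is given below. It assumes the algorithm_config dictionary and MyTrainer class sketched earlier in this document; both are user-defined placeholders rather than objects shipped with the framework.

>>> from mindspore_rl.core import Session
>>> session = Session(algorithm_config)
>>> # Train for 100 episodes with the user-defined trainer class.
>>> session.run(class_type=MyTrainer, is_train=True, episode=100)
>>> # Evaluate the trained policy in eval mode.
>>> session.run(class_type=MyTrainer, is_train=False)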

mindspore_rl.environment

Component used to implement custom environments.

class mindspore_rl.environment.Environment[source]

The virtual base class for the environment. This class should be inherited and overridden before it is used in a model.

class mindspore_rl.environment.EnvironmentProcess(proc_no, env_num, envs, actions, observations, initial_states)[source]

An independent process responsible for creating and interacting with one or more environments.

Parameters
  • proc_no (int) – The process number assigned by the caller.

  • env_num (int) – The number of environments created by this process.

  • envs (list(Environment)) – A list that contains instances of the environment.

  • actions (Queue) – The queue used to pass actions to the environment process.

  • observations (Queue) – The queue used to pass observations to the caller process.

  • initial_states (Queue) – The queue used to pass initial states to the caller process.

Examples

>>> from multiprocessing import Queue
>>> from mindspore_rl.environment import GymEnvironment, EnvironmentProcess
>>> actions = Queue()
>>> observations = Queue()
>>> initial_states = Queue()
>>> proc_no = 1
>>> env_num = 2
>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> env_proc = EnvironmentProcess(proc_no, env_num, multi_env, actions, observations, initial_states)
>>> env_proc.start()
run()[source]

Method to be run in sub-process; can be overridden in sub-class

class mindspore_rl.environment.GymEnvironment(params, env_id=0)[source]

The GymEnvironment class provides the functions to interact with different environments.

Parameters
  • params (dict) – A dictionary that contains all the parameters used to create the instance of GymEnvironment, such as the name of the environment.

  • env_id (int) – An integer used to set the seed of this environment.

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import GymEnvironment
>>> env_params = {'name': 'CartPole-v0'}
>>> environment = GymEnvironment(env_params, 0)
>>> print(environment)
GymEnvironment<>
property action_space

Get the action space of the environment.

Returns

A tuple that represents the action space.

property observation_space

Get the state space of the environment.

Returns

A tuple that represents the state space.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the value of initial state.

Returns

A tensor that represents the initial state of the environment.

step(action)[source]

Execute one environment step, that is, interact with the environment once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), the environment state after performing the action.

  • reward (Tensor), the reward after performing the action.

  • done (mindspore.bool_), whether the simulation finishes or not.
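
The snippet below sketches a random-action rollout built only from the reset(), step() and action_space members documented here; it assumes the gym CartPole-v0 environment is available.

>>> from mindspore import Tensor
>>> from mindspore_rl.environment import GymEnvironment
>>> env = GymEnvironment({'name': 'CartPole-v0'})
>>> state = env.reset()
>>> done = False
>>> while not done:
...   # Sample a random valid action and interact with the environment once.
...   action = Tensor(env.action_space.sample())
...   state, reward, done = env.step(action)
...   done = bool(done)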

class mindspore_rl.environment.MsEnvironment(kwargs=None)[source]

The class that encapsulates the built-in environments.

Parameters

kwargs (dict) –

The dictionary of environment-specific configurations. See the table below for details:

Environment name   Configuration Parameters   Default value   Notices
Tag                seed                       42              random seed
                   environment_num            2               number of environments
                   predator_num               10              number of predators
                   max_timestep               100             max timestep per episode
                   map_length                 100             length of map
                   map_width                  100             width of map
                   wall_hit_penalty           0.1             agent wall hit penalty
                   catch_reward               10              predator catch reward
                   caught_penalty             5               prey caught penalty
                   step_cost                  0.01            step cost

Supported Platforms:

“GPU”

Examples

>>> from mindspore import Tensor
>>> from mindspore_rl.environment import MsEnvironment
>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)
property action_space

Get the valid action space of the environment.

property config

Get environment configuration.

property done_space

Get the valid done space of the environment.

property observation_space

Get the valid observation space of the environment.

reset()[source]

Reset the environment to initial observation and return the initial observation.

Inputs:

No inputs.

Returns

Tensor, the initial observation.

Supported Platforms:

“GPU”

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> print(observation.shape)
(2, 5, 21)
property reward_space

Get the valid reward space of the environment.

step(action)[source]

Run one timestep of environment.

Parameters

action (Tensor) – Actions provided by all of the agents.

Returns

Tuple of 3 tensors, the observation, the reward and the done.

  • observation (Tensor) - Observations of all agents after action.

  • reward (Tensor) - Amount of reward returned by the environment.

  • done (Tensor) - Whether the episode has ended.

Supported Platforms:

“GPU”

Examples

>>> config = {'name': 'Tag', 'predator_num': 4}
>>> env = MsEnvironment(config)
>>> observation = env.reset()
>>> action = Tensor(env.action_space.sample())
>>> observation, reward, done = env.step(action)
>>> print(observation.shape)
(2, 5, 21)
class mindspore_rl.environment.MultiEnvironmentWrapper(env_instance, num_proc=None)[source]

The MultiEnvironmentWrapper is a wrapper for the multi-environment scenario. Users implement their single-environment class and set the environment number to larger than 1 in the configuration file; the framework will then automatically invoke this class to create a multi-environment class.

Parameters
  • env_instance (list(Class)) – A list that contains instances of the environment.

  • num_proc (int) – Number of processes used when interacting with the environments. Default: None

Supported Platforms:

Ascend GPU CPU

Examples

>>> from mindspore_rl.environment import GymEnvironment, MultiEnvironmentWrapper
>>> env_params = {'name': 'CartPole-v0'}
>>> multi_env = [GymEnvironment(env_params), GymEnvironment(env_params)]
>>> wrapper = MultiEnvironmentWrapper(multi_env)
>>> print(wrapper)
MultiEnvironmentWrapper<>
property action_space

Get the action space of the environment.

Returns

A tuple that represents the action space.

property config

Get the config of environment.

Returns

A dictionary that contains the environment’s information.

property done_space

Get the done space of the environment.

Returns

A tuple that represents the done space.

property observation_space

Get the state space of the environment.

Returns

A tuple that represents the state space.

reset()[source]

Reset the environment to the initial state. It is always used at the beginning of each episode. It will return the initial state of each environment.

Returns

A list of tensors that represents the initial state of each environment.

property reward_space

Get the reward space of the environment.

Returns

A tuple that represents the reward space.

step(action)[source]

Execute one environment step, that is, interact with the environments once.

Parameters

action (Tensor) – A tensor that contains the action information.

Returns

  • state (Tensor), a list of environment states after performing the actions.

  • reward (Tensor), a list of rewards after performing the actions.

  • done (Tensor), whether the simulation of each environment finishes or not.
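
A short sketch of stepping two wrapped environments is shown below. Passing the per-environment actions as one stacked Tensor is an assumption made for illustration; the exact action layout follows the wrapped single-environment implementation.

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.environment import GymEnvironment, MultiEnvironmentWrapper
>>> env_params = {'name': 'CartPole-v0'}
>>> wrapper = MultiEnvironmentWrapper([GymEnvironment(env_params), GymEnvironment(env_params)])
>>> states = wrapper.reset()                      # one initial state per environment
>>> actions = Tensor(np.array([0, 1], np.int32))  # one action per environment (assumed layout)
>>> states, rewards, dones = wrapper.step(actions)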

class mindspore_rl.environment.Space(feature_shape, dtype, low=None, high=None, batch_shape=None)[source]

The class for environment action/observation space.

Parameters
  • feature_shape – The action/observation shape before batching.

  • dtype – The action/observation space dtype.

  • low – The action/observation space lower boundary.

  • high – The action/observation space upper boundary.

  • batch_shape – The batch shape for vectorization. It is usually used in multi-environment and multi-agent cases.

Examples

>>> import numpy as np
>>> from mindspore_rl.environment import Space
>>> action_space = Space(feature_shape=(6,), dtype=np.int32)
>>> print(action_space.ms_dtype)
Int32
property boundary

The space boundary.

property is_discrete

Whether the space is discrete.

property ms_dtype

MindSpore data type

property np_dtype

Numpy data type

property num_values

The number of optional enumeration values

sample()[source]

Take a sample from the space

property shape

Space shape after batching
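
The snippet below sketches a discrete and a continuous space. The boundary values are placeholders, and the exact boundary semantics follow the library.

>>> import numpy as np
>>> from mindspore_rl.environment import Space
>>> # An integer-typed (discrete) space with 6 features.
>>> disc_space = Space(feature_shape=(6,), dtype=np.int32, low=0, high=2)
>>> print(disc_space.is_discrete)
True
>>> # A float-typed (continuous) space bounded by low and high.
>>> cont_space = Space(feature_shape=(3,), dtype=np.float32, low=-1.0, high=1.0)
>>> sample = cont_space.sample()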

mindspore_rl.network

Network components used to implement policies.

class mindspore_rl.network.FullyConnectedLayers(fc_layer_params, dropout_layer_params=None, activation_fn=nn.ReLU(), weight_init='normal', bias_init='zeros')[source]

This is a fully connected layers module. Users can pass in an arbitrary number of fc_layer_params entries, and this module will create the corresponding number of fully connected layers.

Parameters
  • fc_layer_params (List[int]) – A list of int that states the input and output sizes of the fully connected layers. For example, if the input list is [10, 20, 3], then the module will create two fully connected layers whose input and output sizes are (10, 20) and (20, 3) respectively. The length of fc_layer_params should be at least 3.

  • dropout_layer_params (List[float]) – A list of float that states the dropout rates. If the input list is [0.5, 0.3], then two dropout layers will be created, one after each fully connected layer. The length of dropout_layer_params should be one less than that of fc_layer_params. dropout_layer_params is optional. Default: None.

  • activation_fn (Union[str, Cell, Primitive]) – An instance of the activation function. Default: nn.ReLU().

  • weight_init (Union[Tensor, str, Initializer, numbers.Number]) – The trainable weight_init parameter. The dtype is the same as that of x. The values of str refer to the function initializer. Default: ‘normal’.

  • bias_init (Union[Tensor, str, Initializer, numbers.Number]) – The trainable bias_init parameter. The dtype is the same as that of x. The values of str refer to the function initializer. Default: ‘zeros’.

Inputs:
  • x (Tensor) - Tensor of shape \((*, fc\_layer\_params[0])\).

Outputs:

Tensor of shape \((*, fc\_layer\_params[-1])\).

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedLayers
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedLayers(fc_layer_params=[4, 10, 2])
>>> output = net(input)
>>> print(output.shape)
(2, 2)
construct(x)[source]
Parameters

x (Tensor) – Tensor of shape \((*, fc\_layer\_params[0])\).

Returns

Tensor of shape \((*, fc\_layer\_params[-1])\).
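
The snippet below sketches a deeper stack with dropout and a non-default activation, following the parameter description above; the layer sizes and dropout rates are placeholders.

>>> import numpy as np
>>> import mindspore.nn as nn
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedLayers
>>> # Three fully connected layers ((8, 64), (64, 32), (32, 4)), each followed by a dropout layer.
>>> net = FullyConnectedLayers(fc_layer_params=[8, 64, 32, 4],
...                            dropout_layer_params=[0.5, 0.3, 0.1],
...                            activation_fn=nn.Tanh())
>>> x = Tensor(np.ones([2, 8]).astype(np.float32))
>>> print(net(x).shape)
(2, 4)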

class mindspore_rl.network.FullyConnectedNet(input_size, hidden_size, output_size, compute_type=mstype.float32)[source]

A basic fully connected neural network.

Parameters
  • input_size (int) – The number of input units.

  • hidden_size (int) – The number of hidden units.

  • output_size (int) – The number of output units.

  • compute_type (mindspore.dtype) – Data type used for the fully connected layers. Default: mindspore.dtype.float32.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> input = Tensor(np.ones([2, 4]).astype(np.float32))
>>> net = FullyConnectedNet(4, 10, 2)
>>> output = net(input)
>>> print(output.shape)
(2, 2)
construct(x)[source]

Returns output of Dense layer.

Parameters

x (Tensor) – Tensor as the input of network.

Returns

The output of the Dense layer.

class mindspore_rl.network.GruNet(input_size, hidden_size, weight_init='normal', num_layers=1, has_bias=True, batch_first=False, dropout=0.0, bidirectional=False)[source]

Stacked GRU (Gated Recurrent Unit) layers.

Apply GRU layer to the input.

For detailed information, please refer to mindspore.nn.GRU.

Parameters
  • input_size (int) – Number of features of input.

  • hidden_size (int) – Number of features of hidden layer.

  • weight_init (str or Initializer) – Initialization method. Default: ‘normal’.

  • num_layers (int) – Number of layers of stacked GRU. Default: 1.

  • has_bias (bool) – Whether the cell has bias b_ih and b_hh. Default: True.

  • batch_first (bool) – Specifies whether the first dimension of input x is batch_size. Default: False.

  • dropout (float) – If not 0.0, append a Dropout layer on the outputs of each GRU layer except the last layer. Default: 0.0. The range of dropout is [0.0, 1.0).

  • bidirectional (bool) – Specifies whether it is a bidirectional GRU, num_directions=2 if bidirectional=True otherwise 1. Default: False.

Inputs:
  • x_in (Tensor) - Tensor of data type mindspore.float32 and shape (seq_len, batch_size, input_size) or (batch_size, seq_len, input_size).

  • h_in (Tensor) - Tensor of data type mindspore.float32 and shape (num_directions * num_layers, batch_size, hidden_size). The data type of h_in must be the same as x_in.

Outputs:

Tuple, a tuple contains (x_out, h_out).

  • x_out (Tensor) - Tensor of shape (seq_len, batch_size, num_directions * hidden_size) or (batch_size, seq_len, num_directions * hidden_size).

  • h_out (Tensor) - Tensor of shape (num_directions * num_layers, batch_size, hidden_size).

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import GruNet
>>> net = GruNet(10, 16, num_layers=1, has_bias=True, bidirectional=False)
>>> x_in = Tensor(np.ones([3, 5, 10]).astype(np.float32))
>>> h_in = Tensor(np.ones([1, 5, 16]).astype(np.float32))
>>> x_out, h_out = net(x_in, h_in)
>>> print(x_out.shape)
(3, 5, 16)
construct(x_in, h_in)[source]

The forward calculation of the GRU network.

Parameters
  • x_in (Tensor) – Tensor of data type mindspore.float32 and shape (seq_len, batch_size, input_size) or (batch_size, seq_len, input_size).

  • h_in (Tensor) – Tensor of data type mindspore.float32 and shape (num_directions * num_layers, batch_size, hidden_size). The data type of h_in must be the same as x_in.

Returns

  • x_out (Tensor) - Tensor of shape (seq_len, batch_size, num_directions * hidden_size) or (batch_size, seq_len, num_directions * hidden_size).

  • h_out (Tensor) - Tensor of shape (num_directions * num_layers, batch_size, hidden_size).

mindspore_rl.policy

Policies used in RL algorithms.

class mindspore_rl.policy.EpsilonGreedyPolicy(input_network, size, epsi_high, epsi_low, decay, action_space_dim)[source]

Produces an epsilon-greedy sample action based on the given policy.

Parameters
  • input_network (Cell) – A network that returns the policy action.

  • size (int) – Shape of epsilon.

  • epsi_high (float) – A high epsilon for exploration, within [0, 1].

  • epsi_low (float) – A low epsilon for exploration, within [0, epsi_high].

  • decay (float) – A decay factor applied to epsilon.

  • action_space_dim (int) – Dimensions of the action space.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import EpsilonGreedyPolicy
>>> state_dim, hidden_dim, action_dim = (4, 10, 2)
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = EpsilonGreedyPolicy(input_net, 1, 0.1, 0.1, 100, action_dim)
>>> state = Tensor(np.ones([1, state_dim]).astype(np.float32))
>>> step =  Tensor(np.array([10,]).astype(np.float32))
>>> output = policy(state, step)
>>> print(output.shape)
(1,)
construct(state, step)[source]

The interface of the construct function.

Parameters
  • state (Tensor) – The input tensor for network.

  • step (Tensor) – The current step, which affects the epsilon decay.

Returns

The output action.

class mindspore_rl.policy.GreedyPolicy(input_network)[source]

Produces a greedy action based on the given policy.

Parameters

input_network (Cell) – The network used to generate action probabilities from the input state.

Examples

>>> import numpy as np
>>> from mindspore import Tensor
>>> from mindspore_rl.network import FullyConnectedNet
>>> from mindspore_rl.policy import GreedyPolicy
>>> state_dim, hidden_dim, action_dim = 4, 10, 2
>>> input_net = FullyConnectedNet(state_dim, hidden_dim, action_dim)
>>> policy = GreedyPolicy(input_net)
>>> state = Tensor(np.ones([2, 4]).astype(np.float32))
>>> output = policy(state)
>>> print(output.shape)
(2,)
construct(state)[source]

Returns the best action.

Parameters

state (Tensor) – State tensor as the input of network.

Returns

action_max, the best action.

class mindspore_rl.policy.Policy[source]

The virtual base class for the policy. This class should be inherited and overridden before it is used in the model.

construct(*inputs, **kwargs)[source]

The interface of the construct function.

Parameters
  • inputs – Depends on the user’s definition.

  • kwargs – Depends on the user’s definition.

Returns

User defined.

class mindspore_rl.policy.RandomPolicy(action_space_dim)[source]

Produces a random action within [0, action_space_dim).

Parameters

action_space_dim (int) – Dimension of the action space.

Examples

>>> from mindspore_rl.policy import RandomPolicy
>>> action_space_dim = 2
>>> policy = RandomPolicy(action_space_dim)
>>> output = policy()
>>> print(output.shape)
(1,)
construct()[source]

Returns a random number within [0, action_space_dim).

Returns

A random integer within [0, action_space_dim).