Models

GCP-HOLO uses Stable Baselines 3 for reinforcement learning training, with customizations that adapt it to the Mech Gym environment. Stable Baselines 3 is a popular RL library that provides a set of pre-implemented algorithms, such as Proximal Policy Optimization (PPO) and Deep Q-Network (DQN).

GCP-HOLO customizes the Stable Baselines 3 algorithms to work with the Mech Gym environment, a custom environment designed for path synthesis of linkage systems. The Mech Gym environment defines a specific action space to improve efficiency; the customizations also mask invalid actions determined from the scaffold nodes and ensure that each model selects actions non-deterministically.
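The sketch below shows how these customized classes plug into the usual Stable Baselines 3 workflow. It is illustrative only: the environment constructor MechGymEnv is a stand-in for the real Mech Gym class, and the hyperparameter values are placeholders rather than recommended settings.

    # Illustrative sketch only. MechGymEnv is a hypothetical stand-in for the
    # actual Mech Gym environment constructor.
    from stable_baselines3 import A2C

    from models.a2c import CustomActorCriticPolicy

    env = MechGymEnv()  # hypothetical Mech Gym environment instance

    # The customized policy class is handed to the standard SB3 algorithm.
    model = A2C(policy=CustomActorCriticPolicy, env=env, verbose=1)
    model.learn(total_timesteps=10_000)

    # Actions are sampled stochastically (deterministic=False), matching the
    # non-deterministic action selection described above.
    obs = env.reset()
    action, _ = model.predict(obs, deterministic=False)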

A2C

This is the custom actor-critic policy that GCP-HOLO uses with the A2C algorithm.

class models.a2c.CustomActorCriticPolicy(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, lr_schedule: typing.Callable[[float], float], net_arch: typing.Optional[typing.List[typing.Union[int, typing.Dict[str, typing.List[int]]]]] = None, activation_fn: typing.Type[torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, *args, **kwargs)[source]

Bases: ActorCriticPolicy

Custom Actor Critic Policy for GCP-HOLO

evaluate_actions(obs: Tensor, actions: Tensor)[source]

Evaluate actions according to the current policy, given the observations.

Parameters
  • obs – Observation

  • actions – Actions

Returns

estimated value, log likelihood of taking those actions and entropy of the action distribution.

forward(obs: Tensor, deterministic: bool = False)[source]

Forward pass in all the networks (actor and critic)

Parameters
  • obs – Observation

  • deterministic – Whether to sample or use deterministic actions

Returns

action, value and log probability of the action

get_distribution(obs: Tensor)[source]

Get the current policy distribution given the observations.

Parameters

obs – Observation

Returns

the action distribution.
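A small, self-contained sketch of the invalid-action masking mentioned in the introduction is shown below. It illustrates the general idea rather than the exact GCP-HOLO implementation: logits of actions ruled out by the scaffold nodes are set to negative infinity before a categorical distribution is built, so sampling stays stochastic but can never return an invalid action.

    import torch

    def masked_categorical(logits, valid_mask):
        """Illustrative helper: assign zero probability to invalid actions
        (valid_mask is True for valid actions) before sampling."""
        masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
        return torch.distributions.Categorical(logits=masked_logits)

    # Example: 5 candidate actions; actions 1 and 3 are invalid for the scaffold.
    logits = torch.zeros(5)
    valid = torch.tensor([True, False, True, False, True])
    dist = masked_categorical(logits, valid)
    action = dist.sample()            # only valid actions can be drawn
    log_prob = dist.log_prob(action)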

training: bool

DQN

This is the custom DQN that GCP-HOLO uses.

class models.dqn.CustomDQN(policy: Union[str, Type[DQNPolicy]], env: Union[Env, VecEnv, str], learning_rate: Union[float, Callable[[float], float]] = 0.0001, buffer_size: int = 1000000, learning_starts: int = 50000, batch_size: int = 32, tau: float = 1.0, gamma: float = 0.99, train_freq: Union[int, Tuple[int, str]] = 4, gradient_steps: int = 1, replay_buffer_class: Optional[ReplayBuffer] = None, replay_buffer_kwargs: Optional[Dict[str, Any]] = None, optimize_memory_usage: bool = False, target_update_interval: int = 10000, exploration_fraction: float = 0.1, exploration_initial_eps: float = 1.0, exploration_final_eps: float = 0.05, max_grad_norm: float = 10, tensorboard_log: Optional[str] = None, create_eval_env: bool = False, policy_kwargs: Optional[Dict[str, Any]] = None, verbose: int = 0, seed: Optional[int] = None, device: Union[device, str] = 'auto', _init_setup_model: bool = True)[source]

Bases: DQN

predict(observation: ndarray, state: Optional[Tuple[ndarray, ...]] = None, episode_start: Optional[ndarray] = None, deterministic: bool = False)[source]

Overrides the base_class predict function to include epsilon-greedy exploration.

Parameters
  • observation – the input observation

  • state – The last states (can be None, used in recurrent policies)

  • episode_start – The last masks (can be None, used in recurrent policies)

  • deterministic – Whether or not to return deterministic actions.

Returns

the model’s action and the next state (used in recurrent policies)

train(gradient_steps: int, batch_size: int = 100)[source]

Sample the replay buffer and do the updates (gradient descent and update target networks)
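The effect of the overridden predict can be pictured with the sketch below. The attribute names exploration_rate and action_space mirror standard Stable Baselines 3 / Gym attributes, but the function body is an assumption, not the exact GCP-HOLO code.

    import numpy as np

    def epsilon_greedy_predict(model, observation, deterministic=False):
        """Illustrative only: random action with probability epsilon,
        otherwise the greedy (argmax-Q) action from the policy."""
        if not deterministic and np.random.rand() < model.exploration_rate:
            action = np.array([model.action_space.sample()])
        else:
            action, _ = model.policy.predict(observation, deterministic=True)
        return action, None  # DQN keeps no recurrent state

In practice, calling model.predict(observation, deterministic=False) on a CustomDQN instance takes this exploration path, while deterministic=True should always return the greedy action.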

class models.dqn.CustomDQNPolicy(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, lr_schedule: typing.Callable[[float], float], net_arch: typing.Optional[typing.List[int]] = None, activation_fn: typing.Type[torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, features_extractor_class: typing.Type[stable_baselines3.common.torch_layers.BaseFeaturesExtractor] = <class 'stable_baselines3.common.torch_layers.FlattenExtractor'>, features_extractor_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, normalize_images: bool = True, optimizer_class: typing.Type[torch.optim.optimizer.Optimizer] = <class 'torch.optim.adam.Adam'>, optimizer_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None)[source]

Bases: DQNPolicy

Policy class with Q-Value Net and target net for DQN

Parameters
  • observation_space – Observation space

  • action_space – Action space

  • lr_schedule – Learning rate schedule (could be constant)

  • net_arch – The specification of the policy and value networks.

  • activation_fn – Activation function

  • features_extractor_class – Features extractor to use.

  • features_extractor_kwargs – Keyword arguments to pass to the features extractor.

  • normalize_images – Whether to normalize images or not, dividing by 255.0 (True by default)

  • optimizer_class – The optimizer to use, th.optim.Adam by default

  • optimizer_kwargs – Additional keyword arguments, excluding the learning rate, to pass to the optimizer
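These constructor parameters are typically supplied when building the algorithm. A hedged sketch is shown below; env is assumed to be an already-constructed Mech Gym environment, and the hyperparameter values are placeholders rather than recommended settings.

    from models.dqn import CustomDQN, CustomDQNPolicy

    model = CustomDQN(
        policy=CustomDQNPolicy,
        env=env,                                 # Mech Gym environment instance
        learning_rate=1e-4,
        buffer_size=100_000,
        exploration_fraction=0.1,
        exploration_final_eps=0.05,
        policy_kwargs=dict(net_arch=[64, 64]),   # forwarded to CustomDQNPolicy
        verbose=1,
    )
    model.learn(total_timesteps=50_000)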

make_q_net()[source]

training: bool

class models.dqn.CustomQNetwork(observation_space: gym.spaces.space.Space, action_space: gym.spaces.space.Space, features_extractor: torch.nn.modules.module.Module, features_dim: int, net_arch: typing.Optional[typing.List[int]] = None, activation_fn: typing.Type[torch.nn.modules.module.Module] = <class 'torch.nn.modules.activation.ReLU'>, normalize_images: bool = True)[source]

Bases: QNetwork

Action-Value (Q-Value) network for DQN

Parameters
  • observation_space – Observation space

  • action_space – Action space

  • net_arch – The specification of the policy and value networks.

  • activation_fn – Activation function

  • normalize_images – Whether to normalize images or not, dividing by 255.0 (True by default)

forward(obs: Tensor)[source]

Predict the q-values.

Parameters

obs – Observation

Returns

The estimated Q-Value for each action.

training: bool
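As a small illustration, the Q-values returned by forward can be turned into a greedy action, optionally combined with the invalid-action masking described in the introduction. This sketch shows the general idea only; it is not necessarily where GCP-HOLO applies the mask.

    import torch

    # Sketch only: q_net is a CustomQNetwork instance, obs a batched observation
    # tensor, and valid_mask a boolean tensor with True for valid actions.
    with torch.no_grad():
        q_values = q_net(obs)                                    # (batch, n_actions)
        q_values = q_values.masked_fill(~valid_mask, float("-inf"))
        greedy_action = q_values.argmax(dim=1)                   # best valid action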

GCN

This is the graph convolutional policy network adapted from You et al.

class models.gcpn.GNN(observation_space, max_nodes, num_features, hidden_channels=64, out_channels=64, normalize=False, batch_normalization=False, lin=True, add_loop=False)[source]

Bases: BaseFeaturesExtractor

Graph convolution network, adapted from Zhao et al., “RoboGrammar”

Parameters
  • observation_space (gym.observation) – The observation space of the gym environment

  • max_nodes (int) – Maximum number of nodes for the linkage graph

  • num_features (int) – Number of points in the trajectory used to describe the node features

  • hidden_channels (int, optional) – Hidden channels for the Dense SAGE convolutions. Defaults to 64.

  • out_channels (int, optional) – Number of output features. Defaults to 64.

  • normalize (bool, optional) – Normalization used in Dense SAGE. Defaults to False.

  • batch_normalization (bool, optional) – Whether batch normalization is used. Defaults to False.

  • lin (bool, optional) – Add a linear layer at the end. Defaults to True.

  • add_loop (bool, optional) – Add self loops. Defaults to False.

bn(i, x)[source]

forward(observations)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
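Because GNN subclasses BaseFeaturesExtractor, it is plugged in through Stable Baselines 3's policy_kwargs mechanism. The sketch below is illustrative: env is assumed to be an already-constructed Mech Gym environment, the keyword values are placeholders rather than recommended settings, and observation_space is supplied automatically by Stable Baselines 3.

    from stable_baselines3 import A2C

    from models.a2c import CustomActorCriticPolicy
    from models.gcpn import GNN

    policy_kwargs = dict(
        features_extractor_class=GNN,
        features_extractor_kwargs=dict(
            max_nodes=11,        # maximum number of nodes in the linkage graph
            num_features=64,     # trajectory points per node feature
            hidden_channels=64,
            out_channels=64,
        ),
    )

    model = A2C(
        policy=CustomActorCriticPolicy,
        env=env,                 # an already-constructed Mech Gym environment
        policy_kwargs=policy_kwargs,
        verbose=1,
    )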