Models
GCP-HOLO uses Stable Baselines 3 for reinforcement learning training, with customizations that make it work with the Mech Gym environment. Stable Baselines 3 is a popular RL library that provides pre-implemented algorithms such as Proximal Policy Optimization (PPO) and Deep Q-Network (DQN).
GCP-HOLO customizes the Stable Baselines 3 algorithms to work with the Mech Gym environment, a custom environment designed for path synthesis of linkage systems. The Mech Gym environment uses a specialized action space to improve efficiency; the customizations also mask invalid actions determined from the scaffold nodes and ensure that each model selects actions non-deterministically.
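The masking step can be pictured with a short sketch. This is illustrative only, not GCP-HOLO's code: the hypothetical `action_mask` stands in for the validity mask derived from the scaffold nodes, and invalid logits are pushed to negative infinity so that stochastic sampling only ever draws valid actions.

```python
import torch

def masked_distribution(logits: torch.Tensor, action_mask: torch.Tensor) -> torch.distributions.Categorical:
    """Build a categorical distribution restricted to valid actions.

    `action_mask` is a hypothetical boolean tensor (True = valid) derived
    from the scaffold nodes; invalid logits are set to -inf so their
    probability is zero while sampling remains non-deterministic.
    """
    masked_logits = logits.masked_fill(~action_mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

# Example: five candidate actions, two of which are invalid for this scaffold.
logits = torch.randn(1, 5)
mask = torch.tensor([[True, False, True, False, True]])
action = masked_distribution(logits, mask).sample()  # stochastic draw over valid actions
```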
A2C
This is the custom A2C that GCP-HOLO uses.
- class models.a2c.CustomActorCriticPolicy(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, lr_schedule: Callable[[float], float], net_arch: Optional[List[Union[int, Dict[str, List[int]]]]] = None, activation_fn: Type[torch.nn.Module] = torch.nn.ReLU, *args, **kwargs)[source]
Bases: ActorCriticPolicy
Custom Actor Critic Policy for GCP-HOLO
- evaluate_actions(obs: Tensor, actions: Tensor)[source]
Evaluate actions according to the current policy, given the observations.
- Parameters
obs – Observation
actions – Actions
- Returns
estimated value, log likelihood of taking those actions and entropy of the action distribution.
- forward(obs: Tensor, deterministic: bool = False)[source]
Forward pass in all the networks (actor and critic)
- Parameters
obs – Observation
deterministic – Whether to sample or use deterministic actions
- Returns
action, value and log probability of the action
- get_distribution(obs: Tensor)[source]
Get the current policy distribution given the observations.
- Parameters
obs – Observation
- Returns
the action distribution.
- training: bool
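A minimal usage sketch for this policy is shown below. It assumes `env` is an already-constructed Mech Gym environment and that the default hyperparameters are acceptable; it is not the training script GCP-HOLO ships with.

```python
from stable_baselines3 import A2C

from models.a2c import CustomActorCriticPolicy

# `env` is assumed to be an already-constructed Mech Gym environment instance.
model = A2C(CustomActorCriticPolicy, env, verbose=1)
model.learn(total_timesteps=10_000)

# Actions are sampled stochastically (deterministic=False), matching the
# non-deterministic selection described above.
obs = env.reset()
action, _state = model.predict(obs, deterministic=False)
```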
DQN
This is the custom DQN that GCP-HOLO uses.
- class models.dqn.CustomDQN(policy: Union[str, Type[DQNPolicy]], env: Union[Env, VecEnv, str], learning_rate: Union[float, Callable[[float], float]] = 0.0001, buffer_size: int = 1000000, learning_starts: int = 50000, batch_size: int = 32, tau: float = 1.0, gamma: float = 0.99, train_freq: Union[int, Tuple[int, str]] = 4, gradient_steps: int = 1, replay_buffer_class: Optional[ReplayBuffer] = None, replay_buffer_kwargs: Optional[Dict[str, Any]] = None, optimize_memory_usage: bool = False, target_update_interval: int = 10000, exploration_fraction: float = 0.1, exploration_initial_eps: float = 1.0, exploration_final_eps: float = 0.05, max_grad_norm: float = 10, tensorboard_log: Optional[str] = None, create_eval_env: bool = False, policy_kwargs: Optional[Dict[str, Any]] = None, verbose: int = 0, seed: Optional[int] = None, device: Union[device, str] = 'auto', _init_setup_model: bool = True)[source]
Bases: DQN
- predict(observation: ndarray, state: Optional[Tuple[ndarray, ...]] = None, episode_start: Optional[ndarray] = None, deterministic: bool = False)[source]
Overrides the base_class predict function to include epsilon-greedy exploration.
- Parameters
observation – the input observation
state – The last states (can be None, used in recurrent policies)
episode_start – The last masks (can be None, used in recurrent policies)
deterministic – Whether or not to return deterministic actions.
- Returns
the model’s action and the next state (used in recurrent policies)
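The epsilon-greedy behaviour this override refers to follows the standard pattern sketched below. This is an illustrative reimplementation, not the repository's exact code; `exploration_rate` is the schedule-driven attribute maintained by the base DQN class.

```python
import numpy as np

def epsilon_greedy_predict(model, observation, deterministic: bool = False):
    """Illustrative epsilon-greedy selection (not the exact GCP-HOLO override).

    With probability `model.exploration_rate`, sample a random action from
    the action space; otherwise take the greedy (max Q-value) action.
    """
    if not deterministic and np.random.rand() < model.exploration_rate:
        action = np.array([model.action_space.sample()])
        return action, None
    return model.policy.predict(observation, deterministic=True)
```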
- class models.dqn.CustomDQNPolicy(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, lr_schedule: Callable[[float], float], net_arch: Optional[List[int]] = None, activation_fn: Type[torch.nn.Module] = torch.nn.ReLU, features_extractor_class: Type[BaseFeaturesExtractor] = FlattenExtractor, features_extractor_kwargs: Optional[Dict[str, Any]] = None, normalize_images: bool = True, optimizer_class: Type[torch.optim.Optimizer] = torch.optim.Adam, optimizer_kwargs: Optional[Dict[str, Any]] = None)[source]
Bases: DQNPolicy
Policy class with Q-Value Net and target net for DQN
- Parameters
observation_space – Observation space
action_space – Action space
lr_schedule – Learning rate schedule (could be constant)
net_arch – The specification of the policy and value networks.
activation_fn – Activation function
features_extractor_class – Features extractor to use.
features_extractor_kwargs – Keyword arguments to pass to the features extractor.
normalize_images – Whether to normalize images or not, dividing by 255.0 (True by default)
optimizer_class – The optimizer to use, th.optim.Adam by default
optimizer_kwargs – Additional keyword arguments, excluding the learning rate, to pass to the optimizer
- training: bool
- class models.dqn.CustomQNetwork(observation_space: gym.spaces.Space, action_space: gym.spaces.Space, features_extractor: torch.nn.Module, features_dim: int, net_arch: Optional[List[int]] = None, activation_fn: Type[torch.nn.Module] = torch.nn.ReLU, normalize_images: bool = True)[source]
Bases: QNetwork
Action-Value (Q-Value) network for DQN
- Parameters
observation_space – Observation space
action_space – Action space
net_arch – The specification of the policy and value networks.
activation_fn – Activation function
normalize_images – Whether to normalize images or not, dividing by 255.0 (True by default)
- forward(obs: Tensor)[source]
Predict the q-values.
- Parameters
obs – Observation
- Returns
The estimated Q-Value for each action.
- training: bool
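A minimal training sketch tying the DQN pieces together is given below; `env` is assumed to be a constructed Mech Gym environment and the hyperparameters are placeholders rather than GCP-HOLO's tuned values.

```python
from models.dqn import CustomDQN, CustomDQNPolicy

# `env` is assumed to be an already-constructed Mech Gym environment instance;
# the hyperparameters below are placeholders, not GCP-HOLO's tuned values.
model = CustomDQN(
    CustomDQNPolicy,
    env,
    learning_rate=1e-4,
    buffer_size=100_000,
    exploration_fraction=0.1,
    verbose=1,
)
model.learn(total_timesteps=50_000)

obs = env.reset()
action, _state = model.predict(obs, deterministic=False)  # epsilon-greedy selection
```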
GCN
This is the graph convolutional policy network (GCPN) adapted from You et al.
- class models.gcpn.GNN(observation_space, max_nodes, num_features, hidden_channels=64, out_channels=64, normalize=False, batch_normalization=False, lin=True, add_loop=False)[source]
Bases: BaseFeaturesExtractor
- Graph Convolution network: adapted from Zhao et al., “RoboGrammar”
- Parameters
observation_space (gym.observation) – The observation space of the gym environment
max_nodes (int) – maximum number of nodes for linkage graph
num_features (int) – number of points in the trajectory to describe the node features
hidden_channels (int, optional) – hidden channels for the Dense SAGE convolutions. Defaults to 64.
out_channels (int, optional) – number of output features. Defaults to 64.
normalize (bool, optional) – normalization used in Dense SAGE. Defaults to False.
batch_normalization (bool, optional) – Batch Normalization used. Defaults to False.
lin (bool, optional) – Add linear layer to the end. Defaults to True.
add_loop (bool, optional) – Add self loops. Defaults to False.
- forward(observations)[source]
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool
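Feature extractors in Stable Baselines 3 are wired in through `policy_kwargs`; the sketch below shows one plausible way to attach the GNN above, assuming the custom policy forwards these keyword arguments to its Stable Baselines 3 base class. The `max_nodes` and `num_features` values are placeholders that must match the Mech Gym observation.

```python
from stable_baselines3 import A2C

from models.a2c import CustomActorCriticPolicy
from models.gcpn import GNN

# Placeholder sizes: `max_nodes` and `num_features` must match the linkage-graph
# observation of the (assumed already-constructed) Mech Gym environment `env`.
policy_kwargs = dict(
    features_extractor_class=GNN,
    features_extractor_kwargs=dict(
        max_nodes=11,
        num_features=50,
        hidden_channels=64,
        out_channels=64,
    ),
)

model = A2C(CustomActorCriticPolicy, env, policy_kwargs=policy_kwargs, verbose=1)
model.learn(total_timesteps=10_000)
```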
Random Search
This is the random search method, which applies random actions to generate linkages.
- models.random_search.random_search(env, episodes=100)[source]
Random search for linkage graph generation.
- Parameters
env (gym.Env) – linkage_gym
episodes (int, optional) – number of linkage graphs to generate. Defaults to 100.
- Returns
Best designs from search, all rewards, all designs, all episode lengths
- Return type
(dict, list, list, list)
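A brief usage sketch, assuming `env` is a constructed Mech Gym (linkage_gym) environment; the variable names simply mirror the documented return order.

```python
from models.random_search import random_search

# `env` is assumed to be an already-constructed Mech Gym (linkage_gym) environment.
best_designs, rewards, designs, episode_lengths = random_search(env, episodes=100)

print(f"Evaluated {len(designs)} designs; best reward: {max(rewards):.3f}")
```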