Hindsight policy gradients (2019-02-20)

A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enable sample-efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this paper, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
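
A minimal sketch of the hindsight idea applied to a REINFORCE-style gradient, under our own simplifications (every function argument below is a placeholder we introduce, not the paper's API): a trajectory collected while pursuing one goal is reused for other goals via an importance weight in which the unknown dynamics terms cancel, leaving only a ratio of goal-conditioned policy probabilities.

```python
import numpy as np

def hindsight_pg_estimate(trajectories, alt_goals, logp, grad_logp, reward):
    """Hindsight policy-gradient sketch. Placeholders (our assumptions):
    trajectories: list of (states, actions, original_goal),
    logp(s, a, g): log pi_theta(a | s, g),
    grad_logp(s, a, g): gradient of logp w.r.t. theta (np.ndarray),
    reward(s, g): sparse goal-achievement reward."""
    grad, count = 0.0, 0
    for states, actions, g0 in trajectories:
        steps = list(zip(states, actions))
        logp0 = sum(logp(s, a, g0) for s, a in steps)
        for g in alt_goals:
            # Importance weight: dynamics terms cancel, only policy terms remain.
            w = np.exp(sum(logp(s, a, g) for s, a in steps) - logp0)
            ret = sum(reward(s, g) for s in states)  # return under goal g
            score = sum(grad_logp(s, a, g) for s, a in steps)
            grad += w * ret * score
            count += 1
    return grad / max(count, 1)
```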

Deep Reinforcement Learning for Complex Manipulation Tasks with Sparse Feedback (2020-01-12)

Learning optimal policies from sparse feedback is a known challenge in reinforcement learning. Hindsight Experience Replay (HER) is a multi-goal reinforcement learning algorithm designed to solve such tasks. The algorithm treats every failure as a success for an alternative (virtual) goal that has been achieved in the episode and then generalizes from that virtual goal to real goals. HER has known flaws and is limited to relatively simple tasks. In this thesis, we present three algorithms based on the existing HER algorithm that improve its performance. First, we prioritize virtual goals from which the agent will learn more valuable information. We call this property the \textit{instructiveness} of the virtual goal and define it by a heuristic measure, which expresses how well the agent will be able to generalize from that virtual goal to actual goals. Secondly, we designed a filtering process that detects and removes misleading samples that may induce bias throughout the learning process. Lastly, we enable the learning of complex, sequential tasks using a form of curriculum learning combined with HER. We call this algorithm \textit{Curriculum HER}. To test our algorithms, we built three challenging manipulation environments with sparse reward functions. Each environment has three levels of complexity. Our empirical results show vast improvement in the final success rate and sample efficiency when compared to the original HER algorithm.
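
As a concrete reference for the relabeling mechanism this abstract builds on, here is a minimal sketch of HER-style "future" relabeling (function names and the transition layout are our placeholders, not the thesis code):

```python
import random

def her_future_relabel(episode, achieved_goal, reward, k=4):
    """Relabel an episode's transitions with up to k 'virtual' goals sampled
    from states achieved later in the same episode. episode is a list of
    (state, action, next_state) tuples; achieved_goal(s) extracts the goal a
    state attains; reward(ag, g) is the sparse goal-reaching reward."""
    relabeled = []
    for t, (s, a, s_next) in enumerate(episode):
        future = episode[t:]
        for _, _, fut_next in random.sample(future, min(k, len(future))):
            g_virtual = achieved_goal(fut_next)   # pretend this was intended
            r = reward(achieved_goal(s_next), g_virtual)
            relabeled.append((s, a, g_virtual, r, s_next))
    return relabeled
```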

Bias-Reduced Hindsight Experience Replay with Virtual Goal Prioritization (2020-03-20)

Hindsight Experience Replay (HER) is a multi-goal reinforcement learning algorithm for sparse reward functions. The algorithm treats every failure as a success for an alternative (virtual) goal that has been achieved in the episode. Virtual goals are randomly selected, irrespective of which are most instructive for the agent. In this paper, we present two improvements over the existing HER algorithm. First, we prioritize virtual goals from which the agent will learn more valuable information. We call this property the instructiveness of the virtual goal and define it by a heuristic measure, which expresses how well the agent will be able to generalize from that virtual goal to actual goals. Second, we reduce the existing bias in HER by removing misleading samples. To test our algorithms, we built two challenging environments with sparse reward functions. Our empirical results in both environments show vast improvement in the final success rate and sample efficiency when compared to the original HER algorithm. A video showing experimental results is available at https://youtu.be/3cZwfK8Nfps .
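
The paper defines instructiveness by a heuristic measure we do not reproduce here; the sketch below only shows where such a score would slot into HER's goal selection, replacing uniform sampling with score-weighted sampling (the softmax weighting is our assumption):

```python
import numpy as np

def prioritized_virtual_goals(candidates, instructiveness, n, temperature=1.0):
    """Draw n virtual goals weighted by a heuristic score instead of
    uniformly. instructiveness(g) is a placeholder for the paper's measure
    of how well the agent can generalize from g to actual goals."""
    scores = np.array([instructiveness(g) for g in candidates]) / temperature
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    n = min(n, len(candidates))
    idx = np.random.choice(len(candidates), size=n, replace=False, p=probs)
    return [candidates[i] for i in idx]
```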

Hindsight Trust Region Policy Optimization (2020-02-10)

Reinforcement learning (RL) with sparse rewards is a major challenge. We propose \emph{Hindsight Trust Region Policy Optimization} (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with \emph{hindsight} to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including ones not intended for the current task. HTRPO leverages two main ideas. It introduces QKL, a quadratic approximation to the KL divergence constraint on the trust region, leading to reduced variance in KL divergence estimation and improved stability in policy updates. It also presents Hindsight Goal Filtering (HGF) to select conducive hindsight goals. In experiments, we evaluate HTRPO on various sparse-reward tasks, including simple benchmarks, image-based Atari games, and simulated robot control. Ablation studies indicate that QKL and HGF contribute greatly to learning stability and high performance. Comparison results show that in all tasks, HTRPO consistently outperforms both TRPO and HPG, a state-of-the-art algorithm for RL with sparse rewards.
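
Our reading of the QKL idea, as a hedged sketch (the paper's exact construction may differ): for a small policy change, the KL divergence admits a quadratic Monte-Carlo surrogate whose terms are non-negative, unlike the standard estimator, whose summands can take either sign.

```python
import numpy as np

def kl_estimate(logp_old, logp_new):
    # Standard MC estimate of KL(old || new) on actions drawn from the old
    # policy; individual terms can be negative, inflating variance.
    return np.mean(np.asarray(logp_old) - np.asarray(logp_new))

def qkl_estimate(logp_old, logp_new):
    # Quadratic surrogate: KL ~= 0.5 * E[(log old - log new)^2] for nearby
    # policies. Every summand is >= 0, which helps stabilize the
    # trust-region constraint.
    d = np.asarray(logp_old) - np.asarray(logp_new)
    return 0.5 * np.mean(d ** 2)
```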

MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks (2021-05-13)

In goal-oriented reinforcement learning, relabeling the raw goals in past experience to provide agents with hindsight ability is a major solution to the reward-sparsity problem. In this paper, to enhance the diversity of relabeled goals, we develop FGI (Foresight Goal Inference), a new relabeling strategy that relabels goals by looking into the future with a learned dynamics model. In addition, to improve sample efficiency, we propose to use the dynamics model to generate simulated trajectories for policy training. By integrating these two improvements, we introduce the MapGo framework (Model-Assisted Policy Optimization for Goal-oriented tasks). In our experiments, we first show the effectiveness of the FGI strategy compared with hindsight relabeling, and then show that the MapGo framework achieves higher sample efficiency than model-free baselines on a set of complicated tasks.
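
A minimal sketch of the foresight contrast with hindsight relabeling (policy, model, and achieved_goal are placeholders we introduce): rather than reusing goals achieved earlier in the stored trajectory, roll the learned dynamics model forward and relabel with the goals it predicts.

```python
def foresight_goals(state, policy, model, achieved_goal, horizon=5):
    """Generate relabeling goals by rolling a learned dynamics model forward
    from a visited state, instead of reusing goals achieved earlier in the
    stored trajectory."""
    goals, s = [], state
    for _ in range(horizon):
        a = policy(s)              # action proposed by the current policy
        s = model(s, a)            # one-step prediction of the learned model
        goals.append(achieved_goal(s))
    return goals
```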

Maximum Entropy Gain Exploration for Long Horizon Multi-goal Reinforcement Learning (2020-07-06)

What goals should a multi-goal reinforcement learning agent pursue during training in long-horizon tasks? When the desired (test time) goal distribution is too distant to offer a useful learning signal, we argue that the agent should not pursue unobtainable goals. Instead, it should set its own intrinsic goals that maximize the entropy of the historical achieved goal distribution. We propose to optimize this objective by having the agent pursue past achieved goals in sparsely explored areas of the goal space, which focuses exploration on the frontier of the achievable goal set. We show that our strategy achieves an order of magnitude better sample efficiency than the prior state of the art on long-horizon multi-goal tasks including maze navigation and block stacking.
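
A sketch of the goal-selection rule under our own simplifications (the paper's density model and objective are more careful than this kernel average): among previously achieved goals, prefer those lying in low-density regions of the achieved-goal distribution, since re-achieving them adds the most entropy.

```python
import numpy as np

def entropy_gain_goal(achieved_goals, bandwidth=0.1, n_candidates=64):
    """Pick a previously *achieved* goal from a low-density region of the
    achieved-goal distribution; pursuing such goals (approximately)
    maximizes the entropy of that distribution. A Gaussian kernel average
    stands in for whatever density model the paper uses."""
    G = np.asarray(achieved_goals, dtype=float)
    idx = np.random.choice(len(G), size=min(n_candidates, len(G)), replace=False)
    def density(g):
        return np.mean(np.exp(-np.sum((G - g) ** 2, axis=1) / (2 * bandwidth ** 2)))
    # The lowest-density candidate sits on the frontier of reachable goals.
    return G[idx[int(np.argmin([density(G[i]) for i in idx]))]]
```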

Complex Robotic Manipulation via Graph-Based Hindsight Goal Generation (2020-07-27)

Reinforcement learning algorithms such as hindsight experience replay (HER) and hindsight goal generation (HGG) have been able to solve challenging robotic manipulation tasks in multi-goal settings with sparse rewards. HER achieves its training success through hindsight replays of past experience with heuristic goals, but under-performs in challenging tasks in which goals are difficult to explore. HGG enhances HER by selecting intermediate goals that are easy to achieve in the short term and promising to lead to target goals in the long term. This guided exploration makes HGG applicable to tasks in which target goals are far away from the object's initial position. However, HGG is not applicable to manipulation tasks with obstacles because the Euclidean metric used by HGG is not an accurate distance metric in such environments. In this paper, we propose graph-based hindsight goal generation (G-HGG), an extension of HGG that selects hindsight goals based on shortest distances in an obstacle-avoiding graph, a discrete representation of the environment. We evaluated G-HGG on four challenging manipulation tasks with obstacles, demonstrating significant gains in both sample efficiency and overall success rate over HGG and HER. Videos can be viewed at https://sites.google.com/view/demos-g-hgg/.
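
To make the metric substitution concrete, here is a toy sketch (our construction, on a 2D occupancy grid rather than the paper's graph built from the 3D workspace): the distance between two goals is the shortest-path hop count around obstacles, not the straight-line Euclidean distance.

```python
from collections import deque

def graph_distance(grid, start, goal):
    """BFS hop count between (x, y) cells on an occupancy grid where
    grid[y][x] == 1 marks an obstacle; stands in for the obstacle-avoiding
    graph distance used to select hindsight goals."""
    h, w = len(grid), len(grid[0])
    dist = {start: 0}
    q = deque([start])
    while q:
        x, y = q.popleft()
        if (x, y) == goal:
            return dist[(x, y)]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and not grid[ny][nx] and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                q.append((nx, ny))
    return float("inf")  # goal unreachable around the obstacles

# A wall splits the room: Euclidean distance is 2, the path around is 6.
grid = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(graph_distance(grid, (0, 0), (2, 0)))  # -> 6
```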

Automatic Curriculum Learning through Value Disagreement (2020-06-16)

Continually solving new, unsolved tasks is the key to learning diverse behaviors. Through reinforcement learning (RL), we have made massive strides towards solving tasks that have a single goal. However, in the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can greatly affect sample efficiency. When biological agents learn, there is often an organized and meaningful order to which learning happens. Inspired by this, we propose setting up an automatic curriculum for goals that the agent needs to solve. Our key insight is that if we can sample goals at the frontier of the set of goals that an agent is able to reach, it will provide a significantly stronger learning signal compared to randomly sampled goals. To operationalize this idea, we introduce a goal proposal module that prioritizes goals that maximize the epistemic uncertainty of the Q-function of the policy. This simple technique samples goals that are neither too hard nor too easy for the agent to solve, hence enabling continual improvement. We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.
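
A sketch of the proposal rule under our simplifications (the ensemble and the exact sampling distribution are our stand-ins for the paper's epistemic-uncertainty measure): score each candidate goal by the spread of an ensemble of Q-estimates and sample in proportion to it; goals the ensemble agrees are solved, or hopeless, score low.

```python
import numpy as np

def disagreement_goal(q_ensemble, state, goals):
    """Sample a training goal in proportion to the standard deviation of
    ensemble Q-estimates; q_ensemble is a list of callables q(state, goal)."""
    scores = np.array([np.std([q(state, g) for q in q_ensemble]) for g in goals])
    if scores.sum() <= 0:                    # total agreement: fall back to uniform
        probs = np.full(len(goals), 1.0 / len(goals))
    else:
        probs = scores / scores.sum()
    return goals[np.random.choice(len(goals), p=probs)]
```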

PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals (2020-06-01)

Learning with sparse rewards remains a significant challenge in reinforcement learning (RL), especially when the aim is to train a policy capable of achieving multiple different goals. To date, the most successful approaches for dealing with multi-goal, sparse reward environments have been model-free RL algorithms. In this work we propose PlanGAN, a model-based algorithm specifically designed for solving multi-goal tasks in environments with sparse rewards. Our method builds on the fact that any trajectory of experience collected by an agent contains useful information about how to achieve the goals observed during that trajectory. We use this to train an ensemble of conditional generative models (GANs) to generate plausible trajectories that lead the agent from its current state towards a specified goal. We then combine these imagined trajectories into a novel planning algorithm in order to achieve the desired goal as efficiently as possible. The performance of PlanGAN has been tested on a number of robotic navigation/manipulation tasks in comparison with a range of model-free reinforcement learning baselines, including Hindsight Experience Replay. Our studies indicate that PlanGAN can achieve comparable performance whilst being around 4-8 times more sample efficient.

Bias-reduced multi-step hindsight experience replay (2021-02-25)

Multi-goal reinforcement learning is widely used in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle both challenges with hindsight knowledge. However, HER and its previous variants still require millions of samples and substantial computation. In this paper, we propose \emph{Multi-step Hindsight Experience Replay} (MHER) based on $n$-step relabeling, incorporating multi-step relabeled returns to improve sample efficiency. Despite the advantages of $n$-step relabeling, we show theoretically and experimentally that the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address this issue, we present two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER). MHER($\lambda$) exploits the $\lambda$-return, while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER, with little additional computation beyond HER.
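
For reference, a minimal sketch of an n-step target computed on relabeled rewards (our notation; the paper's $\lambda$-return and model-based corrections build on top of this): the bias discussed above arises because the summed rewards follow the behavior episode's actions, not the current policy's.

```python
def nstep_relabeled_target(rewards, values, gamma, n, t):
    """n-step target on rewards recomputed against a hindsight goal, then
    bootstrapped from a value estimate after n steps (assumes t + n is a
    valid index into values). rewards: relabeled per-step rewards;
    values: value estimates under the hindsight goal."""
    target, discount = 0.0, 1.0
    for i in range(n):
        target += discount * rewards[t + i]
        discount *= gamma
    return target + discount * values[t + n]
```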

Curiosity Driven Exploration of Learned Disentangled Goal Spaces (2018-11-04)

Intrinsically motivated goal exploration processes enable agents to autonomously sample goals in order to efficiently explore complex environments with high-dimensional continuous actions. They have been applied successfully to real-world robots to discover repertoires of policies producing a wide diversity of effects. These algorithms have often relied on engineered goal spaces, but it was recently shown that deep representation learning algorithms can learn an adequate goal space in simple environments. However, in more complex environments containing multiple objects or distractors, efficient exploration requires that the structure of the goal space reflect that of the environment. In this paper we show that using a disentangled goal space leads to better exploration performance than an entangled goal space. We further show that when the representation is disentangled, one can leverage it by sampling goals that maximize learning progress in a modular manner. Finally, we show that the measure of learning progress used to drive curiosity-driven exploration can simultaneously be used to discover abstract, independently controllable features of the environment.

Self-supervised Learning of Distance Functions for Goal-Conditioned Reinforcement Learning (2020-06-02)

Goal-conditioned policies are used to break down complex reinforcement learning (RL) problems via subgoals, which can be defined either in state space or in a latent feature space. This can increase learning efficiency through a curriculum, and also enables simultaneous learning and generalization across goals. A crucial requirement of goal-conditioned policies is the ability to determine whether the goal has been achieved, so a notion of distance to a goal is a crucial component of this approach. However, it is not straightforward to come up with an appropriate distance, and in some tasks the goal space may not even be known a priori. In this work we learn a distance-to-goal estimate in a self-supervised manner, computed in terms of the number of actions that would need to be carried out to reach the goal. Our method solves complex tasks without prior domain knowledge in the online setting in three different scenarios for goal-conditioned policies: a) the goal space is the same as the state space; b) the goal space is given but an appropriate distance is unknown; and c) the state space is accessible, but only a subset of the state space represents desired goals, and this subset is known a priori. We also propose a goal-generation mechanism as a secondary contribution.
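
The labeling step lends itself to a compact sketch (our formulation of the idea, not the paper's code): pairs of states from the same trajectory are labeled with the number of actions actually taken between them, and a regressor fit on these pairs becomes the distance estimate.

```python
def distance_labels(episodes, max_gap=50):
    """From each trajectory (a list of states), every ordered pair
    (s_t, s_{t+k}) with k <= max_gap becomes a training example whose
    regression target is k: the number of actions actually executed between
    the two states. A regressor fit on these pairs then serves as the
    distance-to-goal estimate."""
    pairs = []
    for states in episodes:
        for t in range(len(states)):
            for k in range(1, min(max_gap + 1, len(states) - t)):
                pairs.append((states[t], states[t + k], k))
    return pairs
```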

Learn Goal-Conditioned Policy with Intrinsic Motivation for Deep Reinforcement Learning (2021-04-11)

It is important for an agent to learn a widely applicable, general-purpose policy that can achieve diverse goals, including images and text descriptions. Considering such perceptually-specific goals, the frontier of deep reinforcement learning research is to learn a goal-conditioned policy without hand-crafted rewards. To learn this kind of policy, recent works usually take as the reward the non-parametric distance to a given goal in an explicit embedding space. From a different viewpoint, we propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM), which jointly learns both an abstract-level policy and a goal-conditioned policy. The abstract-level policy is conditioned on a latent variable to optimize a discriminator and discovers diverse states that are further rendered into perceptually-specific goals for the goal-conditioned policy. The learned discriminator serves as an intrinsic reward function for the goal-conditioned policy to imitate the trajectory induced by the abstract-level policy. Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method, which substantially outperforms prior techniques.

Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration (2020-06-12)

Developmental machine learning studies how artificial agents can model the way children learn open-ended repertoires of skills. Such agents need to create and represent goals, select which ones to pursue and learn to achieve them. Recent approaches have considered goal spaces that were either fixed and hand-defined or learned using generative models of states. This limits agents to sampling goals within the distribution of known effects. We argue that the ability to imagine out-of-distribution goals is key to enable creative discoveries and open-ended learning. Children do so by leveraging the compositionality of language as a tool to imagine descriptions of outcomes they never experienced before, targeting them as goals during play. We introduce Imagine, an intrinsically motivated deep reinforcement learning architecture that models this ability. Such imaginative agents, like children, benefit from the guidance of a social peer who provides language descriptions. To take advantage of goal imagination, agents must be able to leverage these descriptions to interpret their imagined out-of-distribution goals. This generalization is made possible by modularity: a decomposition between a learned goal-achievement reward function and a policy relying on deep sets, gated attention and object-centered representations. We introduce the Playground environment and study how this form of goal imagination improves generalization and exploration over agents lacking this capacity. In addition, we identify the properties of goal imagination that enable these results and study the impacts of modularity and social interactions.

Hierarchical Policy Learning is Sensitive to Goal Space Design (2019-06-25)

Hierarchy in reinforcement learning agents allows for control at multiple time scales yielding improved sample efficiency, the ability to deal with long time horizons and transferability of sub-policies to tasks outside the training distribution. It is often implemented as a master policy providing goals to a sub-policy. Ideally, we would like the goal-spaces to be learned, however, properties of optimal goal spaces still remain unknown and consequently there is no method yet to learn optimal goal spaces. Motivated by this, we systematically analyze how various modifications to the ground-truth goal-space affect learning in hierarchical models with the aim of identifying important properties of optimal goal spaces. Our results show that, while rotation of ground-truth goal spaces and noise had no effect, having additional unnecessary factors significantly impaired learning in hierarchical models.

ACTRCE: Augmenting Experience via Teacher's Advice For Multi-Goal Reinforcement Learning (2019-02-12)

Sparse reward is one of the most challenging problems in reinforcement learning (RL). Hindsight Experience Replay (HER) attempts to address this issue by converting a failed experience to a successful one by relabeling the goals. Despite its effectiveness, HER has limited applicability because it lacks a compact and universal goal representation. We present Augmenting experienCe via TeacheR's adviCE (ACTRCE), an efficient reinforcement learning technique that extends the HER framework using natural language as the goal representation. We first analyze the differences among goal representations, and show that ACTRCE can efficiently solve difficult reinforcement learning problems in challenging 3D navigation tasks, whereas HER with a non-language goal representation fails to learn. We also show that with language goal representations, the agent can generalize to unseen instructions, and even generalize to instructions with unseen lexicons. We further demonstrate that it is crucial to use hindsight advice to solve challenging tasks, and that even a small amount of advice is sufficient for the agent to achieve good performance.
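
A minimal sketch of hindsight advice in this setting (teacher_describe is our placeholder for the advice source, and the paper's relabeling covers more than the final state): a failed episode is stored as a success for the instruction describing what was actually achieved.

```python
def hindsight_advice_relabel(states, actions, teacher_describe, success_reward=1.0):
    """Replace a failed episode's instruction with the teacher's language
    description of the final achieved state, turning the episode into a
    positive example for that instruction."""
    achieved_instruction = teacher_describe(states[-1])
    return states, actions, achieved_instruction, success_reward
```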

Learning Deep Parameterized Skills from Demonstration for Re-targetable Visuomotor Control (2019-10-23)

Robots need to learn skills that can not only generalize across similar problems but also be directed to a specific goal. Previous methods either train a new skill for every different goal or do not infer the specific target in the presence of multiple goals from visual data. We introduce an end-to-end method that represents targetable visuomotor skills as a goal-parameterized neural network policy. By training on an informative subset of available goals with the associated target parameters, we are able to learn a policy that can zero-shot generalize to previously unseen goals. We evaluate our method in a representative 2D simulation of a button-grid and on both button-pressing and peg-insertion tasks on two different physical arms. We demonstrate that our model trained on 33% of the possible goals is able to generalize to more than 90% of the targets in the scene for both simulation and robot experiments. We also successfully learn a mapping from target pixel coordinates to a robot policy to complete a specified goal.

Visual Reinforcement Learning with Imagined Goals (2018-12-04)

For an autonomous agent to fulfill a wide range of user-specified goals at test time, it must be able to learn broadly applicable and general-purpose skill repertoires. Furthermore, to provide the requisite level of generality, these skills must handle raw sensory input such as images. In this paper, we propose an algorithm that acquires such general-purpose skills by combining unsupervised representation learning and reinforcement learning of goal-conditioned policies. Since the particular goals that might be required at test-time are not known in advance, the agent performs a self-supervised "practice" phase where it imagines goals and attempts to achieve them. We learn a visual representation with three distinct purposes: sampling goals for self-supervised practice, providing a structured transformation of raw sensory inputs, and computing a reward signal for goal reaching. We also propose a retroactive goal relabeling scheme to further improve the sample-efficiency of our method. Our off-policy algorithm is efficient enough to learn policies that operate on raw image observations and goals for a real-world robotic system, and substantially outperforms prior techniques.
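
A sketch of the goal-imagination and reward pieces under our simplifications (encode and sample_prior stand in for a learned generative model such as a VAE; the paper's exact reward may differ): goals are imagined by sampling the model's prior, and goal reaching is scored by negative distance in the latent space.

```python
import numpy as np

def imagined_goal(sample_prior):
    # Self-supervised practice: draw a latent goal from the generative
    # model's prior (sample_prior is a placeholder callable).
    return sample_prior()

def latent_goal_reward(encode, obs, z_goal):
    # Reward for goal reaching: negative distance between the encoded
    # observation and the latent goal vector.
    return -float(np.linalg.norm(encode(obs) - z_goal))
```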

Hierarchical Reinforcement Learning with Hindsight (2019-03-08)

Reinforcement Learning (RL) algorithms can suffer from poor sample efficiency when rewards are delayed and sparse. We introduce a solution that enables agents to learn temporally extended actions at multiple levels of abstraction in a sample efficient and automated fashion. Our approach combines universal value functions and hindsight learning, allowing agents to learn policies belonging to different time scales in parallel. We show that our method significantly accelerates learning in a variety of discrete and continuous tasks.

A Decentralized Policy Gradient Approach to Multi-task Reinforcement Learning (2020-06-07)

We develop a mathematical framework for solving multi-task reinforcement learning problems based on a type of decentralized policy gradient method. The goal in multi-task reinforcement learning is to learn a common policy that operates effectively in different environments; these environments have similar (or overlapping) state and action spaces, but have different rewards and dynamics. Agents immersed in each of these environments communicate with other agents by sharing their models (i.e. their policy parameterizations) but not their state/reward paths. Our analysis provides a convergence rate for a consensus-based distributed, entropy-regularized policy gradient method for finding such a policy. We demonstrate the effectiveness of the proposed method using a series of numerical experiments. These experiments range from small-scale "Grid World" problems that readily demonstrate the trade-offs involved in multi-task learning to large-scale problems, where common policies are learned to play multiple Atari games or to navigate an airborne drone in multiple (simulated) environments. |
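
A one-step sketch of the consensus update under our assumptions (W is a doubly stochastic mixing matrix for the communication graph; grads holds each agent's local, possibly entropy-regularized policy gradient): agents average their neighbors' parameters, then ascend their own task's gradient, so models are shared while trajectories stay private.

```python
import numpy as np

def consensus_pg_step(thetas, grads, W, lr=1e-2):
    """One decentralized policy-gradient update. thetas, grads:
    (n_agents, dim) arrays of per-agent policy parameters and local
    gradients; W: (n_agents, n_agents) doubly stochastic mixing matrix."""
    thetas = np.asarray(thetas, dtype=float)
    mixed = W @ thetas                       # consensus (neighbor averaging)
    return mixed + lr * np.asarray(grads)    # local policy-gradient ascent
```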