Goal-Conditioned Generators of Deep Policies

Vincent Herrmann, Aditya Ramesh, Louis Kirsch, Juergen Schmidhuber

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Goal-conditioned Reinforcement Learning (RL) aims at learning optimal policies, given goals encoded in special command inputs. Here we study goal-conditioned neural nets (NNs) that learn to generate deep NN policies in the form of context-specific weight matrices, similar to Fast Weight Programmers and other methods from the 1990s. Using context commands of the form "generate a policy that achieves a desired expected return," our NN generators combine powerful exploration of parameter space with generalization across commands to iteratively find better and better policies. A form of weight-sharing HyperNetworks and policy embeddings scales our method to generate deep NNs. Experiments show how a single learned policy generator can produce policies that achieve any return seen during training. Finally, we evaluate our algorithm on a set of continuous control tasks where it exhibits competitive performance.
Original language: English (US)
Title of host publication: Proceedings of the AAAI Conference on Artificial Intelligence
Publisher: Association for the Advancement of Artificial Intelligence (AAAI)
Pages: 7503-7511
Number of pages: 9
State: Published - Jun 26 2023

Bibliographical note

KAUST Repository Item: Exported on 2023-09-08
Acknowledgements: We thank Mirek Strupl, Dylan Ashley, Róbert Csordás, Aleksandar Stanić and Anand Gopalakrishnan for their feedback. This work was supported by the ERC Advanced Grant (no: 742870), the Swiss National Science Foundation grant (200021 192356), and by the Swiss National Supercomputing Centre (CSCS, projects: s1090, s1154). We also thank NVIDIA Corporation for donating a DGX-1 as part of the Pioneers of AI Research Award and IBM for donating a Minsky machine.
