Adversarial Policies Beat Superhuman Go AIs

We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies that play against frozen KataGo victims. …

Tony Wang, Adam Gleave, Nora Belrose, Tom Tseng, Joseph Miller, Michael Dennis, Yawen Duan, Viktor Pogrebniak, Sergey Levine, Stuart Russell

Invariance in Policy Optimisation and Partial Identifiability in Reward Learning

It’s challenging to design reward functions for complex, real-world tasks. Reward learning lets one instead infer reward …

Joar Skalse, Matthew Farrugia-Roberts, Stuart Russell, Alessandro Abate, Adam Gleave

Preprocessing Reward Functions for Interpretability

In many real-world applications, the reward function is too complex to be manually specified. In such cases, reward functions must …

Erik Jenner, Adam Gleave

Quantifying Differences in Reward Functions

For many tasks, the reward function is inaccessible to introspection or too complex to be specified procedurally, and must instead be …

Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

DERAIL: Diagnostic Environments for Reward And Imitation Learning

The objective of many real-world tasks is complex and difficult to procedurally specify. This makes it necessary to use reward or …

Pedro Freire, Adam Gleave, Sam Toyer, Stuart Russell

Understanding Learned Reward Functions

In many real-world tasks, it is not possible to procedurally specify an RL agent’s reward function. In such cases, a reward …

Eric J. Michaud, Adam Gleave, Stuart Russell

Adversarial Policies: Attacking Deep Reinforcement Learning

Deep reinforcement learning (RL) policies are known to be vulnerable to adversarial perturbations to their observations, similar to …

Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

Inverse Reinforcement Learning for Video Games

Deep reinforcement learning achieves superhuman performance in a range of video game environments, but requires that a designer …

Aaron Tucker, Adam Gleave, Stuart Russell

Multi-task Maximum Causal Entropy Inverse Reinforcement Learning

Multi-task Inverse Reinforcement Learning (IRL) is the problem of inferring multiple reward functions from expert demonstrations. Prior …

Adam Gleave, Oliver Habryka

Active Inverse Reward Design

Reward design, the problem of selecting an appropriate reward function for an AI system, is both critically important, as it encodes …

Sören Mindermann, Rohin Shah, Adam Gleave, Dylan Hadfield-Menell