Reinforcement Learning and Adversarial Thinking

We all know that learning a new craft is hard. We spend a large part of our lives learning how to operate in the everyday physical world. Much of this learning comes from observing others, and when others can’t help we learn through trial and error.

In machine learning, the process of learning how to deal with an environment is called Reinforcement Learning (RL). Through continuous interaction with its environment, an agent learns a policy that enables it to perform better. Observational learning in RL is referred to as Imitation Learning. Both trial and error and imitation learning are hard: environments are not trivial, you often can’t tell the ramifications of an action until far in the future, environments are full of non-determinism, and there is no such thing as a correct policy.

So, unlike in supervised and unsupervised learning, it is hard to tell whether your decisions are correct. Episodes usually consist of thousands of decisions, and you will only know whether you performed well after exploring other options. But experimenting is itself a hard decision: do you exploit the skill you already have, or try something new and explore the unknown?
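The exploit-or-explore dilemma is easiest to see in a toy setting. The sketch below is a minimal epsilon-greedy strategy on a three-armed Bernoulli bandit; the win probabilities, the step count and the value of epsilon are illustrative assumptions of ours, not taken from any of the papers discussed here.

```python
import random

def epsilon_greedy_bandit(win_probs, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; return (value estimates, pull counts)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(win_probs)   # running mean reward per arm
    counts = [0] * len(win_probs)        # how often each arm was pulled
    for _ in range(steps):
        if rng.random() < epsilon:
            # explore: try an arm uniformly at random
            arm = rng.randrange(len(win_probs))
        else:
            # exploit: pull the arm that currently looks best
            arm = max(range(len(win_probs)), key=estimates.__getitem__)
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        counts[arm] += 1
        # incremental update of the sample mean for the chosen arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arms that pay out 20%, 50% and 80% of the time.
estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print("estimated values:", [round(v, 2) for v in estimates])
print("pull counts:", counts)
```

With epsilon set to zero the agent can lock onto the first arm that ever pays out; with a small amount of exploration it eventually discovers the better arm and spends most of its pulls there.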

Despite all these complexities, RL has achieved incredible performance in a wide variety of tasks, from robotics through recommender systems to trading. More impressively, RL agents have achieved superhuman performance in Go and other games, tasks previously believed to be impossible for computers.

It has also become apparent that many machine-learning models can be manipulated with small perturbations to their inputs. Most work in Adversarial Machine Learning develops attacks on classification systems that disrupt an otherwise correct decision. A Reinforcement Learning agent can be viewed as a classifier whose task is to pick an action, so in principle the attacks developed in adversarial machine learning carry over to reinforcement learning.
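To make the "agent as classifier" view concrete, here is a minimal sketch of a sign-of-gradient perturbation (in the spirit of FGSM-style attacks) against a linear policy. The weights, the observation and the budget `eps` are hypothetical values chosen for illustration; real agents are deep networks, where the gradient would come from backpropagation rather than from a difference of weight rows.

```python
def scores(weights, obs):
    """Linear action scores: one dot product per action."""
    return [sum(w * x for w, x in zip(row, obs)) for row in weights]

def act(weights, obs):
    """Greedy policy: pick the action with the highest score."""
    s = scores(weights, obs)
    return max(range(len(s)), key=s.__getitem__)

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def perturb(weights, obs, target, eps):
    """Shift each observation feature by at most eps towards `target`."""
    chosen = act(weights, obs)
    # For a linear policy, the gradient of (score[target] - score[chosen])
    # with respect to the observation is the difference of the weight rows.
    grad = [wt - wc for wt, wc in zip(weights[target], weights[chosen])]
    return [x + eps * sign(g) for x, g in zip(obs, grad)]

W = [[1.0, -0.5, 0.2],   # hypothetical weights for action 0
     [0.8, -0.3, 0.4]]   # hypothetical weights for action 1
obs = [0.5, 0.1, 0.2]
adv = perturb(W, obs, target=1, eps=0.1)
print(act(W, obs), act(W, adv))  # prints "0 1"
```

No feature moves by more than 0.1, yet the chosen action flips from 0 to 1, because the clean observation sits close to the decision boundary.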

Yet attacks mean a very different thing in RL. Here, a single misprediction does not necessarily reduce the overall reward; agents usually recover from a single mistake. Instead, attacks must be precise and correctly timed. There are two main questions: how to attack and when to attack. Both questions are extremely hard.
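Why timing matters can be shown with a toy corridor of our own invention (the layout, rewards and trap position are all assumptions made purely for illustration). The agent always steps right towards a goal; the attacker may flip exactly one action. Almost everywhere the agent simply recovers, but at one position a wrong move is fatal.

```python
def run_episode(attack_step=None, length=10, trap=6, max_steps=50):
    """Return the total reward when one action is flipped at `attack_step`."""
    pos, total = 0, 0
    for t in range(max_steps):
        action = +1                  # the fixed policy: always step right
        if t == attack_step:
            action = -1              # the attacker forces one wrong move
        if action == -1 and pos == trap:
            return total - 20        # a wrong move at the trap ends the episode
        pos = max(0, pos + action)   # the corridor has a wall at position 0
        total -= 1                   # every step costs one unit
        if pos == length:
            return total + 10        # reaching the goal pays out
    return total

baseline = run_episode()                          # no attack at all
returns = {t: run_episode(t) for t in range(10)}  # one flipped action at step t
worst_step = min(returns, key=returns.get)
print(baseline, returns[3], returns[worst_step])  # prints "0 -2 -26"
print(worst_step)                                 # prints "6"
```

A mistimed attack costs the agent two units at most, while the same single flipped action at step 6 costs twenty-six. The attacker's real problem is finding that step without being told where the trap is.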

In this Three Paper Thursday we look at papers that investigate adversarial thinking in Reinforcement Learning.


Zhao et al. investigated whether it is possible to attack reinforcement learning agents in a fully black-box setting, i.e. without making any assumptions about the agents or environments [1]. The authors use imitation learning to build an approximate model of the agent by passively observing it operate, develop attacks against the imitation, and then transfer them to the agent itself. Against some game-playing agents, timing is everything. Their attack model can sometimes find just the right time to intervene, so that an attack has a significant effect several moves later, creating a kind of ‘time bomb’. Finally, the authors note that agents can often be disrupted just as well by random noise as by sophisticated attacks, highlighting a methodological issue with previous research.
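The observe, imitate, attack and transfer pipeline can be sketched in a few lines. This is emphatically a toy and not the authors' method: the `target_agent` below is a hypothetical threshold policy, the imitation is a simple lookup table rather than a learned sequence model, and the "attack" is a one-unit nudge to a discrete observation.

```python
from collections import Counter, defaultdict

def target_agent(state):
    """Hypothetical black-box agent; the attacker only sees its actions."""
    return "run" if state < 5 else "jump"

def clone_policy(trajectory):
    """Tabular behavioural cloning: majority action per observed state."""
    seen = defaultdict(Counter)
    for state, action in trajectory:
        seen[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in seen.items()}

# 1. Passively observe the agent acting in states 0..9.
trajectory = [(s, target_agent(s)) for s in range(10)]
# 2. Fit the imitation from the observations alone.
clone = clone_policy(trajectory)
# 3. Search the clone (not the agent) for states where nudging the
#    observation by +1 changes the chosen action.
candidates = [s for s in range(9) if clone[s] != clone[s + 1]]
# 4. Transfer: check whether the same nudge also changes the real agent's action.
transferred = [s for s in candidates if target_agent(s + 1) != target_agent(s)]
print(candidates, transferred)  # prints "[4] [4]"
```

The expensive search in step 3 never queries the target. The attacker touches the real agent only in step 4, once a promising perturbation has already been found on the imitation.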

Gleave et al. model the emergence of deception in a competitive multi-agent setting [2]. The authors find that agents can in fact disrupt their opponents, and often do so by generating random and uncoordinated movements. Interestingly, one adversarial policy is distraction: instead of hitting the ball, it may be more effective to lie on the ground and twitch. The authors then investigate what happens when the target agent expects trickery and learns how to deal with it. It appears that adversarial training does help with handling deception, but the attacker can also adapt its adversarial policies to continue exploiting the target.

Attacks in RL can also involve betrayal of trust. Lin et al. investigate what happens in a cooperative multi-agent RL setting when one agent in the group is taken over and starts behaving maliciously [3]. The authors find that even random behaviour degrades performance significantly on a large number of benchmarks. The authors then improve the attack: by exploiting its interaction with the group, a rogue agent can learn to degrade performance even further. This further highlights how fragile agent cooperation can be and why worst-case scenarios need to be considered when deploying this technology.


[1] Blackbox Attacks on Reinforcement Learning Agents Using Approximated Temporal Information (2020), Yiren Zhao, Ilia Shumailov, Han Cui, Xitong Gao, Robert Mullins, Ross Anderson

[2] Adversarial Policies: Attacking Deep Reinforcement Learning (2019), Adam Gleave, Michael Dennis, Cody Wild, Neel Kant, Sergey Levine, Stuart Russell

[3] On the Robustness of Cooperative Multi-Agent Reinforcement Learning (2020), Jieyu Lin, Kristina Dzeparoska, Sai Qian Zhang, Alberto Leon-Garcia, Nicolas Papernot
