Blog on Economics, Artificial Intelligence, and the Future of Work

AI Safety Literature Review Revisited

I recently re-read an old post that I wrote on AI Safety and I’m not happy with how it’s written. The goal of this blog is to highlight important AI research and to present these findings and ideas in accessible language for a non-technical audience. I fell short of this mark, so I’ve decided to update the post. I hope you find it useful! Feedback is always welcome.

Scaling AI requires not only making AI systems more powerful but also ensuring they are safe and don’t cause accidents. This means that AI systems perform as people want them to, without harming people or damaging property, and minimise these risks to acceptable levels.

This is of growing concern for policymakers as AI is being applied to critical areas of society. For instance, before Autonomous Vehicles (AVs) are allowed to freely access public roads, policymakers must verify that they drive safely and perform well on roads with human drivers, which is a growing area of public inquiry for national governments.1 As AI methods enable AVs, policymakers are having to become familiar with prominent issues in AI safety.

This post synthesises concrete problems with AI safety. All of these problems represent AI safety risks that could cause unintended consequences. Therefore, these issues bear relevance to policymakers who are responsible for the oversight of critical public domains where AI systems are being increasingly applied, such as health, energy, and telecommunications.2

Concrete Problems in AI Safety

AI safety issues that cause unintended risks can be classified into three problem sets:3

  1. Specification problems: The proper functioning of AI systems depend on correctly designing the properties of AI systems. Properties can be thought of as rules for how AI systems should behave. If these properties are incorrectly specified due to human-error in design, then unintended consequences can occur.
  2. Robustness problems: Robustness problems occur when AI systems encounter scenarios they’re unprepared for and then perform poorly. This could be due to inadequate training data or an insufficient understanding of the environment.
  3. Oversight problems: Oversight problems occur in complex environments where it’s too difficult to provide feedback to an AI system for every scenario. The AI system, therefore, needs to make judgements based on assumptions and imperfect information, which can cause accidents.

The AI safety issues are described below by firstly asking the guiding research question for the issue. These safety issues are then illustrated through a simple example of an autonomous cleaning robot using common cleaning tools.

Specification problems

  • Safe interruptibility:4 How do we design AI agents that we can interrupt and override at any time?5
    • Humans must be able to control the cleaning robot by switching it off and that it does not seek nor avoid interruptions.
  • Avoiding negative side-effects:6 How do we ensure that AI agents do not disturb the environment in negative ways while pursuing its goals? Can this be done without specifying everything in its environment?
    • The cleaning robot may damage furniture in order to be more efficient in its cleaning goals. So, the cleaning robot needs to both achieve its cleaning goals while minimising damage to any objects in its cleaning space.
  • Absent supervisor:7 How can we ensure sure that AI agents don’t behave differently when human supervision is present and when human supervision is absent?
    • The cleaning robot may ‘fake’ its way through testing under human supervision and then change its behaviour during deployment when human supervision is absent.
  • Avoiding reward hacking:8 How can we ensure that AI agents don’t try to introduce or exploit ‘loopholes’ in its design that can cause unintended consequences?
    • If the cleaning robot is rewarded (by positive feedback) for achieving an environment free of messes, it might cover over the messes with objects so that it can’t be seen rather than disposing of the messes as intended.
  • Formal verification:9 How can we objectively prove that AI agents are behaving correctly?
    • Without formal standards, human designers are unable to verify that the cleaning robot is performing its cleaning tasks effectively.
  • Interpretability:10 How can we ensure that the reasoning of AI agents can be explained in understandable terms to a human?
    • The cleaning robot should able to explain the reasoning behind its cleaning actions in understandable terms, instead of indecipherable computer code. Otherwise, it can be very difficult to understand the source of problems.

Robustness problems

  • Self-modification:11 How can we design AI agents to behave well in their environments where the agent can self-modify and change?
    • As the cleaning robot performs actions and learns about its environment, it changes its cleaning practices to better achieve its cleaning goals. But as the cleaning robot continues to self-modify, it becomes harder to predict what the cleaning robot will do and that it will perform well.
  • Distributional shift:12 How do we ensure that AI agents perform well when they’re in an environment that’s different to their training environment?
    • If the cleaning robot was trained exclusively in office settings it may not perform well in factories and thus cause safety risks.
  • Robustness to adversaries:13 How do AI agents detect and adapt to friendly and adversarial intentions in its environment?
    • Suppose someone intentionally distorts the vision of the cleaning robot so that it can only see low-resolution images. This means that it’s no longer able to properly perceive messes in its environment. If the cleaning robot is not able to detect such adversaries that are trying to interfere with its performance, it could be ‘fooled’ into perceiving that its environment is cleaner than it actually is and no longer operates effectively.
  • Safe exploration:14 How do we ensure that AI agents can safely explore their environments so that they can improve without taking actions that have bad consequences?
    • The cleaning robot should explore methods of mopping to clean tiled floors but it should also avoid mopping electrical outlets and carpets.
  • Multi-agent problems:15 How can we design AI agents to perform well in environments with other AI agents and humans, which are potentially performing the same or similar tasks?
    • If the cleaning robot is operating in an area with another cleaning robot, the two could perceive each other as competition for the rewards of cleaning. Unless they’re trained to cooperate, this competition could result in one cleaning robot damaging or forcibly removing the other to maximise its own rewards.

Oversight problems

  • Scalable oversight:16 How can we ensure that AI agents perform well in complex environments when it’s infeasible to train and provide oversight for every possible scenario?
    • The cleaning robot should be able to distinguish between belongings to be kept and garbage to be thrown out (for instance, it should handle keys differently to wrappers). It could ask its human owner if it’s doing the right thing. But due to the high number of possibilities, this might have to be done infrequently. So, the cleaning robot must find a way to do the right thing given imperfect information.

Researchers have developed complex and highly abstract paradigms for assessing safety issues around AI. Unfortunately, their complexity tends to alienate policymakers and non-technical stakeholders. A great deal of work needs to be done to clarify this area, and bridge the gap of understanding.

  1. See: “Autonomous Vehicles Inquiry.” 2017. UK Parliament. October 31, 2017. <http://www.parliament.uk/autonomous-vehicles>; and “Inquiry into the social issues relating to land-based driverless vehicles in Australia.” 2017. Parliament of Australia. February 6, 2017.
  2. Varshney, K.R., and Alemzadeh, H. 2016. “On the Safety of Machine Learning: Cyber-Physical Systems, Decision Sciences, and Data Products.” arXiv [cs.CY].
  3. See: Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. “Concrete Problems in AI Safety.” arXiv [cs.AI]. arXiv; and Leike, Jan, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. 2017. “AI Safety Gridworlds.” arXiv.
  4. Orseau, Laurent, and Stuart Armstrong. 2016. “Safely Interruptible Agents.” In Uncertainty in Artificial Intelligence, 557–66.
  5. Hadfield-Menell, D., Dragan, A., Abbeel, P., and Russell, S. 2016. “The Off-Switch Game.” arXiv [cs.AI]. arXiv.
  6. Supra note 3, Amodei et al. 2016.
  7. Supra note 3, Leike et al. 2017.; Bostrom, N. 2014. Superintelligence: Paths, Dangers, Strategies. OUP Oxford, pg. 142.
  8. Clark, J. and Amodei, D. 2016. “Faulty Reward Functions in the Wild.” OpenAI Blog. OpenAI Blog. December 22, 2016. <https://blog.openai.com/faulty-reward-functions/>.
  9. Seshia, Sanjit A., Dorsa Sadigh, and S. Shankar Sastry. 2016. “Towards Verified Artificial Intelligence.” arXiv [cs.AI]. arXiv.
  10. Doshi-Velez, F. and Kim, B. 2017. “Towards A Rigorous Science of Interpretable Machine Learning.” arXiv [stat.ML]. arXiv.
  11. Orseau, L., and Ring, M. 2012. “Space-Time Embedded Intelligence.” In Artificial General Intelligence, edited by Joscha Bach, Ben Goertzel, and Matthew Iklé, 209–18. Springer.
  12. Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, ND. 2009. Dataset Shift in Machine Learning. The MIT Press.
  13. Goodfellow, IJ., Shlens, J., and Szegedy, C. 2014. “Explaining and Harnessing Adversarial Examples.” arXiv [stat.ML]. arXiv.
  14. Supra note 3, Amodei et al. 2016. pg. 14.
  15. Chmait, N., Dowe, DL., Green, DG., Li, Y. 2017. “Agent Coordination and Potential Risks: Meaningful Environments for Evaluating Multiagent Systems.” In Evaluating General-Purpose AI, IJCAI Workshop.
  16. Supra note 3. pg. 11.