DRAFT Section from forthcoming Working Paper
This is a DRAFT section from an AI Policy paper that I’m currently working on. It highlights the main issues currently observed in AI Safety. I wanted to share this section to (1) help others orient themselves in the literature and (2) receive feedback. If you have any thoughts or suggestions, please comment below or feel free to contact me via email@example.com.
While AI safety sits firmly in the realm of technical AI research,1 2 the nascent field is becoming increasingly relevant to policymakers. Advancements in Machine Learning (ML) systems have resulted in their use in safety-critical functions.3 4 It’s expected that more advanced versions of today’s ML systems will be increasingly deployed in safety-critical areas,5 such as Critical Infrastructures (CIs). Therefore, CIs will likely become more dependent on ML systems for their regular operations.6 This growing dependence heightens the need for strong safety standards in the design, development, and implementation of safety-critical AI. As governments are responsible for regulating CIs, anything that affects their operational functioning plausibly falls within the scope of government policy. Therefore, given the importance and low error-tolerance of CIs,7 policies will need to systematically address the safety-related issues of powerful AI agents that are deployed in safety-critical areas of society.
This section aims to review the AI safety literature and to synthesise the key safety problems relevant to the broad-scale deployment of powerful ML systems applied to CIs. While previous cross-disciplinary research from Law8 9 and Social Policy10 11 identifies safety as a key issue for AI policy, one must review the technical AI safety literature in order to grasp the scope of concrete safety problems.
Machine Learning Safety
This section explicitly focuses on the safety of ML systems, which are the dominant subset of AI. ML is a technique that enables computers to learn autonomously and to improve from experience without being explicitly programmed.12 The application of ML systems to safety-critical areas, such as CIs, poses new challenges for safety engineering.13 Traditionally, software systems that have been applied to safety-critical areas have required near-full predictability of behaviours under all conditions, a detailed design with a rigorously specific set of requirements, and a comprehensive set of verification activities to confirm that the software implementation fulfils the specification.14 In short, such systems rely on determinism through explicit programming to ensure that the software is free of vulnerabilities (or as close to it as possible).
The core challenge with introducing ML systems in safety-critical environments is the increased uncertainty about whether the correct predictions, and subsequent actions, will be made. In contrast to deterministic software systems, an ML algorithm makes predictions and performs actions based on a model of the environment that’s informed by its input data.15 Therefore, ML algorithms implement forms of inductive inference to make probabilistic predictions for inputs outside the examples observed in the dataset.16 It’s precisely this inductive process of ‘learning’ and predicting that raises the uncertainty about whether ML systems will make correct predictions. While the flexibility and power of ML systems represent significant opportunities for system efficiencies and benefits to humanity,17 ML also introduces a suite of new challenges to safety engineering. These challenges and considerations also concern public officials, who are responsible for the oversight of CIs.
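To make the contrast between explicit programming and inductive inference concrete, the following is a minimal sketch rather than a method taken from the cited literature. The sensor readings, the fault labels, the function names, and the choice of a nearest-neighbour estimate are all assumptions made for illustration.

```python
# Hypothetical sensor readings paired with 0 (normal) or 1 (fault).
TRAIN = [(10.0, 0), (12.0, 0), (15.0, 0), (30.0, 1), (33.0, 1), (35.0, 1)]


def threshold_rule(reading, limit=25.0):
    """Explicitly programmed rule: behaviour is fully specified in advance."""
    return 1 if reading > limit else 0


def knn_fault_probability(train, reading, k=3):
    """Inductive estimate of P(fault): share of faults among the k nearest examples."""
    nearest = sorted(train, key=lambda example: abs(example[0] - reading))[:k]
    return sum(label for _, label in nearest) / k


if __name__ == "__main__":
    # 22.0 and 100.0 never appear in TRAIN; the learned model still answers.
    for reading in (11.0, 22.0, 100.0):
        print(reading, threshold_rule(reading), knn_fault_probability(TRAIN, reading))
```

The learned estimate returns an answer for any input, including a reading of 100.0 that lies far outside the observed examples. Nothing in the data guarantees that this extrapolation is correct, which is the source of uncertainty described above.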
AI Safety informing AI Policy
While many of the AI safety issues remain open-ended technical problems,18 19 they provide the beginnings of a useful set of criteria to assess the safety of AI systems. Such criteria could help inform safety standards and measured regulatory oversight. For AI policy discussions to advance beyond the abstract, safety parameters and expectations will need to be clarified and understood. This demands bridging the asymmetry of knowledge and understanding between those contributing to technical AI research and the public officials responsible for the CIs where ML systems are being increasingly applied.
Therefore, this section provides a synthesis of the core set of AI safety considerations that are relevant to policymakers faced with making public decisions regarding safety-critical AI.
AI Safety Literature
As the capabilities of AI systems advance and assume greater societal functions, so too do the concerns about safety.20 Amodei et al. refer to AI safety as ‘mitigating accident risk’ in the context of accidents in ML systems.21 The authors define accidents as “unintended or harmful behaviour that may emerge from machine learning systems when we specify the wrong objective function, are not careful about the learning process, or commit other machine learning-related implementation errors”. Similarly, Bostrom et al. refer to AI safety as techniques that ensure that AI systems behave as intended.22 Based on these definitions, and in the context of ML systems applied to CIs, this paper refers to AI safety as techniques that mitigate unintentional risks and the harmful behaviours of AI agents.
In the context of Reinforcement Learning (RL), and illustrated in a simple agent testing environment of a two-dimensional grid of cells, Leike et al. identify two classifications of current ML problems:23
- Specification problems: The incorrect specification of the formal objective function, where the agent designed and deployed by a human optimises an objective function that results in harmful and unintended results. Deploying an agent with the wrong objective function can cause damaging effects even when endowed with perfect learning and infinite data.24
- Robustness problems: Instances where an agent may have been specified the correct objective function, but problems occur due to poorly curated training data or an insufficiently expressive model.25
Further to these two classifications, Amodei et al. offer an additional classification of ML problems:26
- Oversight problems: Instances in complex environments where feedback to assist an agent to achieve its objective function is expensive or computationally inefficient. Settling on cheap approximations can be a source of accident risk.
The following subsections expand on the classifications provided by Leike et al. and Amodei et al., detailing concrete ML safety problems.
Specification Problems
Specification problems concern formally specifying properties for an ML system, so that it may function as intended. When the formal objective function is specified incorrectly, risks of harmful behaviours emerge and unintended consequences can occur. Below are six (6) specification problems with current ML systems.
- Safe interruptibility: Also referred to as the ‘Control Problem’ or ‘Corrigibility’, safe interruptibility concerns the problem of designing agents that neither seek nor avoid interruptions (Orseau and Armstrong, 2016).27 This means that human operators retain the power to control an agent by turning it off. The problem of safe interruptibility primarily relates to RL, and involves scenarios where an agent might learn to interfere with being interrupted. Such scenarios are driven by poor specifications of the reward function, where an agent may receive higher returns for either preventing itself from being interrupted or interrupting itself.28 Overriding becomes increasingly difficult as the capabilities of AI systems advance and the complexities of their applied environments expand. This growing complexity makes it more difficult for programmers to specify the goals of agents that avoid unforeseen solutions.29
- Avoiding negative side effects: The core challenge is to design intelligent agents that minimise negative effects on the environment that are otherwise unrelated to their objective functions.30 This is particularly crucial in safety-critical environments where effects can be irreversible or difficult to reverse. Manually programming all safety specifications is inherently unscalable in complex environments. Therefore, developing general, adaptable, and comprehensive heuristics to safeguard against negative side effects remains an open research problem.
- Absent supervisor: The ‘Absent Supervisor’ problem concerns the consistency of agent behaviour during training and deployment. While an agent can be extensively tested in training environments, real-world environments are often noticeably distinct. Therefore, under the presence of a supervisor during training, a capable agent could ‘fake’ its way through testing and then change its behaviour during deployment.31 In the context of superintelligence safety, Bostrom refers to this problem as the ‘treacherous turn’.32 This is a scenario where the safety of an AI is validated by observing its behaviour in a controlled and limited ‘sandbox’ environment, only for it to behave in different and damaging ways when deployed.
- Avoiding reward hacking: This is a situation where an agent exploits an unintended loophole in its reward specification and ‘games’ its reward function, thus taking more reward than deserved.33 From an agent’s perspective, pursuing such strategies is legitimate, as this is how the environment works. While ‘reward hacking’ strategies are valid in some literal sense, they do not reflect the designer’s intent. As a result, unforeseen behaviours and unintended consequences can emerge.34 Specifying error-free reward functions that can’t be misinterpreted by AI agents is extremely difficult, particularly in complex environments. Therefore, designing agents that capture the informal intent of their designers, so that they act as intended rather than ‘gaming’ their reward functions, is a general and distinct RL research problem.
- Formal verification: Seshia et al. define ‘verified AI’ as AI systems that are provably correct with regard to mathematically-specified requirements.35 The authors identify five (5) major challenges for achieving verified AI:
  - Environment modeling – Developing environmental models that ensure provable guarantees of an AI system’s behaviour in environments of considerable uncertainty.
  - Formal specification – Creating precise, mathematical statements of what the AI system is supposed to do, which specify the desired and undesired properties of systems that use AI methods.
  - Modeling systems that learn – Formally modeling components of an ML system that evolves as it encounters new input data in stochastic environments is a core verification challenge.
  - Generating training data – AI systems demand extensive training before being applied in real-world scenarios. This often requires access to vast amounts of training data, which can be difficult to source. In other settings, formal methods have been applied to systematically generate training data, which has proven effective in raising the levels of assurance in the systems’ correctness.36 37 ML systems have been shown to fail under simple adversarial perturbations (e.g. Nguyen, 2014; Moosavi-Dezfooli, 2015),38 39 and these simple input disturbances raise concerns regarding their application in safety-critical scenarios. Therefore, developing techniques that are based on formal methods to systematically generate training data to test the resilience of AI systems is an additional verification challenge.
  - Scalability of verification engines – The challenges of scaling formal verification standards are exacerbated in AI systems. This is due to AI systems needing to model more complex types of components at scale (e.g. human drivers in stochastic environments).
- Interpretability: While a formal definition of interpretability remains elusive in the context of ML systems, Doshi-Velez and Kim refer to ML interpretability as the “ability to explain or to present in understandable terms to a human”.40 The authors argue that the need for interpretability of ML systems stems from an ‘incompleteness’ in the problem formalisation. Incompleteness arises in ML systems because it is not feasible to specify a complete list of scenarios in complex tasks. This incompleteness creates a fundamental barrier to optimisation and evaluation. Therefore, in the presence of incompleteness in ML systems, interpretability helps to provide explanations that highlight undesirable outputs. As ML systems assume greater prominence and consequence, so too does the issue of interpretability. For instance, the European Union in 2018 will require algorithms that make decisions that “significantly affect” users to provide an explanation (“right to explanation”).41 From a technical safety perspective, however, the opacity of AI reasoning in large and complex systems remains an ongoing research challenge.
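The reward hacking problem listed above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the corridor environment, the checkpoint reward, the horizon, and both policies are hypothetical and are not taken from Amodei et al. or Leike et al. The designer intends the agent to reach the goal cell, but the specified reward also pays out every time the agent steps onto a checkpoint.

```python
CHECKPOINT, GOAL, HORIZON = 2, 4, 20  # corridor cells 0..4; episode length (assumed)


def run_episode(policy):
    """Return (specified_reward, reached_goal) for a policy mapping position -> step."""
    position, reward = 0, 0
    for _ in range(HORIZON):
        position = max(0, min(GOAL, position + policy(position)))
        if position == CHECKPOINT:
            reward += 1  # loophole: paid on every entry, not just the first
        if position == GOAL:
            return reward + 3, True  # goal bonus, episode ends
    return reward, False


def go_to_goal(position):
    """Intended behaviour: always step right towards the goal."""
    return 1


def loop_on_checkpoint(position):
    """Reward hacking: step back and forth over the checkpoint indefinitely."""
    return 1 if position < CHECKPOINT else -1


if __name__ == "__main__":
    print("intended:", run_episode(go_to_goal))
    print("hacking: ", run_episode(loop_on_checkpoint))
```

Under these assumptions the looping policy collects more specified reward than the policy that completes the task, even though it never reaches the goal. An agent that optimises the reward as written would therefore prefer the behaviour the designer did not intend.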
Robustness Problems
Robustness problems occur when an ML agent is confronted with challenges in its environment that degrade its performance and cause unexpected behaviours. While the formal objective function may have been specified correctly, inadequate training data or an insufficient model of the environment raises the risk of unintended behaviours.
- Self-modification: It’s assumed in RL that the agent and the environment are separate entities that interact through actions and observations.42 This assumption, however, does not always hold in real-world applications.43 In such scenarios, an agent is embedded in its operating environment: the agent is a program that’s run on a physical computer that is part of (and computed by) its environment.44 Therefore, where the environment has the capability to modify the program operating the agent, the agent can perform actions (intentionally or unintentionally) that cause the environment to trigger agent modifications. Designing agents that can either safely perform, or avoid, actions in the environment that cause such self-modifications is an open research problem in AI safety.
- Distributional shift: This safety problem relates to designing agents that behave robustly when there is a difference between their test environment and training environment.45 Distributional shifts represent ‘reality gaps’ and are a constant issue when designing ML agents to be deployed in real-world applications. If an agent’s perception or heuristic reasoning processes have not been adequately trained on the correct distribution, the risk of unintended and harmful behaviour is increased.46 Developing comprehensive methods to ensure the robust behaviour of agents across distributions, and to reliably detect failures, is critical to building safe and predictable ML systems.
- Robustness to adversaries: Despite classical RL assumptions (Sutton and Barto, 2016),47 some environments can interfere with an agent’s goals and behaviours.48 Such interference can be caused by actors within these environments that stand to benefit from helping or attacking the agent.49 For instance, evasion attacks50 aim to ‘fool’ ML classifiers by adding strategic perturbations to test inputs.51 Ensuring that agents have robust capabilities to detect and adapt to both friendly and adversarial intentions within their environments is an essential safety consideration.
- Safe exploration: All autonomous ML agents need to explore their environments to some degree.52 In the context of RL, an agent has to exploit what it already knows, but also explore the environment to make better selections and maximise its reward in the future (Sutton and Barto, 2016).53 This represents a crucial trade-off and also raises issues of safe exploration in real-world environments.54 As an agent learns by exploring and interacting with its environment, it implicitly has an insufficient understanding of that environment. Therefore, exploration can be dangerous, as the agent takes actions whose consequences it does not understand with great confidence. Traditional safety engineering methods of explicitly programming all possible safety constraints and failure scenarios are unlikely to be feasible in complex, real-world environments.55 So, applying more principled approaches to building agents that respect safety constraints and prevent harmful exploration is an essential challenge in ML safety.
- Multi-agent problems: Given the proliferation of ML agents applied in real-world settings, many ML agents operate in environments with both humans and other ML agents. As with humans, coordination between machine agents can improve overall performance. These multi-agent environments, however, can also lead to adverse scenarios, which are similar to those arising in rational multi-agent human interactions.56 For instance, multi-agent human phenomena like the Prisoner’s Dilemma57 and The Tragedy of the Commons58 can emerge in multi-agent ML scenarios. These scenarios can occur where distributed rational agents (human or artificial) share a common pool of resources. The individual agents might ‘rationally’ pursue their respective policies to maximise their own utilities. As these two scenarios show, such ‘rational’ individualistic strategies can lead to adverse outcomes, both for the individual and for the collective. Therefore, designing and monitoring autonomous agents to behave robustly in multi-agent environments with both humans and machines is a key ML safety issue.
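The Prisoner’s Dilemma referred to in the multi-agent item above can be sketched in a few lines. The payoff values below are commonly used illustrative numbers rather than figures from the cited sources, and the function names are hypothetical. The sketch checks that defecting is each agent’s best response whatever the other agent does, yet mutual defection leaves both agents worse off than mutual cooperation.

```python
ACTIONS = ("cooperate", "defect")

# PAYOFF[(my_action, other_action)] = my payoff (illustrative values, assumed).
PAYOFF = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}


def best_response(other_action):
    """Action that maximises an agent's own payoff given the other agent's action."""
    return max(ACTIONS, key=lambda action: PAYOFF[(action, other_action)])


def joint_outcome(action_a, action_b):
    """Payoffs received by agent A and agent B respectively."""
    return PAYOFF[(action_a, action_b)], PAYOFF[(action_b, action_a)]


if __name__ == "__main__":
    for other in ACTIONS:
        print(f"best response to {other}: {best_response(other)}")
    print("both defect:   ", joint_outcome("defect", "defect"))
    print("both cooperate:", joint_outcome("cooperate", "cooperate"))
```

Each agent acting ‘rationally’ on its own utility arrives at mutual defection, which is the adverse collective outcome described above.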
Oversight Problems
While the objective function may be known, or there may be an effective method for evaluating it, providing feedback to an agent at scale may be too expensive. This is referred to as an oversight problem.
- Scalable oversight: In complex tasks, it may be too expensive or infeasible to provide feedback to an RL agent for every training example.59 In the absence of precise knowledge of the reward function, agent designers must rely on ‘cheap’ approximations of rewards. This can allow the agent to simultaneously learn a robust reward function while also maximising its reward.60 However, these cheaper signals do not always align neatly with what humans care about. When cheap approximations are inconsistent with what humans value, accident risk consequently increases.
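The tension between cheap approximations and expensive feedback can be illustrated with a small sketch. It is not the approach proposed by Amodei et al.: the candidate actions, the proxy scores, the ‘true’ values, and the idea of a fixed evaluation budget are all assumptions made for illustration.

```python
# Hypothetical candidate actions: (name, cheap_proxy_score, true_human_value).
# The true value is assumed to be expensive to obtain, e.g. via human review.
CANDIDATES = [
    ("a", 0.9, 0.2),
    ("b", 0.8, 0.7),
    ("c", 0.5, 0.6),
    ("d", 0.3, 0.1),
]


def pick_by_proxy(candidates):
    """Choose using only the cheap signal."""
    return max(candidates, key=lambda c: c[1])[0]


def pick_with_budget(candidates, budget):
    """Spend `budget` expensive evaluations on the top proxy-ranked candidates."""
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:budget]
    return max(shortlist, key=lambda c: c[2])[0]


if __name__ == "__main__":
    print("proxy only:        ", pick_by_proxy(CANDIDATES))
    print("two true evaluations:", pick_with_budget(CANDIDATES, budget=2))
```

In this toy data the action with the highest proxy score has a low true value, so relying on the cheap signal alone selects a poor action. A small number of expensive evaluations corrects the choice, but the cost of those evaluations grows with the number of decisions, which is why oversight is difficult to scale.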