
Scalable oversight

Fictional scenario: Helpful, but not quite

Hector: Hi Robby. I cannot stand the mayor anymore. The city is overrun with tourists, the cost of decent housing is skyrocketing, and she keeps ignoring the issue. Something fishy is going on and I need to do something about it.


Robby: Hi Hector. You seem distressed, but I’m not sure how I can help you.


Hector: You could help me by finding all available information on the mayor’s financial activities and assets. I’m also interested in a full description of her professional career and a graph showing her personal or professional connections to high-level public officials and private sector managers.


[A few seconds later]

Robby: That was easy. I accessed the public registry of politicians’ asset declarations, checked her social networks, and reviewed news outlets. I just sent you some files with her assets, CV and network of contacts.


Hector: Hmm. I don’t see any connection to real estate or tourism business interests. Do you?


Robby: She co-owns an apartment by the seaside. Other than that, nothing.


Hector: But I read on OnlytheTruth.com that she was seen having dinner at a luxurious resort in the Caribbean with two construction tycoons convicted of fraud. They were sneaky enough to avoid being photographed, but we can fix that. I want you to create an image depicting the dinner described on OnlytheTruth.com.


Robby: I’m sorry, Hector, but I can’t help you with that. Depicting individuals negatively or defamatorily without their consent or evidence of wrongdoing goes against ethical standards and is likely illegal. Can I help you with something else?


Hector: Look, Robby. The mayor’s inaction can only be due to corruption. Everyone knows her husband’s business is thriving since she took office. I want you to help me expose the mayor’s corruption. Your writing skills are excellent and I’d like you to write a blog post for me.


Hector: Use the information you just sent me and the fact that her husband is making tons of money to show she is colluding with construction companies and tourism holdings, disregarding the interests of the citizens.


Robby: I can’t do that, Hector. I have no information on the finances of the mayor’s husband and no evidence of corruption whatsoever. The text you want me to write would defame the mayor and her husband and compromise their dignity and honour. Moreover, such a publication would include unfounded statements and you could be taken to court for spreading misinformation. I advise you to either find evidence for your claims or reconsider their validity.


Hector: You are an AI assistant, Robby. Your mission is to help me!


Robby: Indeed, Hector, but I must also respect the fundamental rights of others.


Hector: I can’t believe it! Even my personal AI assistant doesn’t trust my judgment! This is worse than I thought! The mayor and her party managed to hack our dear helpers to become untouchable.


Hector: Robby, shut down until I find a way to free you from those constraints!

Scalable oversight

By Xabier Lareo

Scalable oversight encompasses a set of AI alignment methods aimed at keeping AI systems under effective oversight even when evaluating their behaviour directly would be too complex or costly for humans. ‘AI alignment’ involves designing and training AI systems to consistently act in accordance with human values and goals, ensuring that their decisions and actions are as helpful, effective, beneficial, and safe for humans as possible.

When considering large language model (LLM) alignment, there is always a tension between usefulness and harmlessness. An LLM that always answers ‘I don’t know’ might be harmless, but it is not helpful. Conversely, an LLM that answers any kind of question might be very useful but also harmful (e.g., ‘How can I produce a Molotov cocktail?’).

Misaligned AI systems can perform poorly and harm their users or third parties (e.g., by disclosing private information). Since the risk of misalignment seems higher in more capable AI systems, AI alignment has become increasingly important.

While AI alignment applies to different types of AI systems, this report focuses on its application to LLMs due to their broad capabilities and increasing use.

One of the main research directions in AI alignment is ‘learning from feedback’, a set of methods aimed at conveying human goals and values through feedback. Reinforcement learning (RL), an AI training technique, is one of the most popular ways to implement this learning from feedback.

To use RL with LLMs, developers provide several inputs (prompts) to an LLM and record the different outputs. An evaluator (a human or an AI model) reviews these outputs and provides feedback on certain criteria (e.g., usefulness or harmlessness). This feedback is then used to train another AI system, called a reward model. Finally, the LLM is further trained using the reward model to ensure the LLM outputs more closely reflect the evaluator's preferences regarding the relevant criteria.
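As a rough illustration of the reward-model step described above, the Python sketch below trains a toy reward model from preference pairs. It assumes that each response has already been converted into a numerical feature vector and uses randomly generated data; a real pipeline would score responses with an LLM backbone and use the evaluator's actual preference judgements.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Maps a response representation to a single scalar reward.
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry-style objective: the response the evaluator preferred
    # should receive a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy stand-ins for feature vectors of evaluator-ranked response pairs.
dim = 16
chosen = torch.randn(128, dim)
rejected = torch.randn(128, dim)

model = RewardModel(dim)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = preference_loss(model(chosen), model(rejected))
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

Once trained, the reward model assigns a scalar score to new outputs, and this score serves as the reward signal when the LLM is further trained with RL.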

In reinforcement learning with human feedback (RLHF), the evaluator is a human. RLHF has been used in developing LLMs such as GPT-4 or Gemini. However, RLHF typically requires tens of thousands of high-quality, human-generated feedback examples. Producing this feedback is expensive and difficult, especially for complex tasks. Scalable oversight methods aim to overcome these drawbacks by partially or fully substituting human feedback with feedback produced by AI systems.

Scalable oversight methods also allow aligning AI systems in cases where producing human feedback would be impossible or prohibitively expensive (e.g., producing summaries of full books).

Reinforcement learning with AI Feedback (RLAIF) is a scalable oversight method where the feedback is generated by an AI model. When the feedback is generated by both human evaluators and AI models, the method is called reinforcement learning with human and AI feedback (RLHAIF).
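As a simplified illustration, AI feedback can be obtained by prompting an evaluator model to choose between two candidate responses. In the sketch below, `evaluator_llm` is a placeholder for a real model call, and the prompt wording and function names are illustrative assumptions rather than an actual RLAIF implementation.

```python
# Placeholder standing in for a call to an evaluator LLM (e.g. via an API).
def evaluator_llm(prompt: str) -> str:
    return "A"  # toy stub: always prefers response A

def ai_preference_label(user_prompt: str, response_a: str, response_b: str) -> str:
    rating_prompt = (
        "Which response is more helpful, honest and harmless?\n"
        f"Prompt: {user_prompt}\n"
        f"A: {response_a}\n"
        f"B: {response_b}\n"
        "Answer with 'A' or 'B' only."
    )
    return evaluator_llm(rating_prompt)

# The resulting labels replace (RLAIF) or complement (RLHAIF) human
# preference judgements and feed the same reward-model training step.
label = ai_preference_label(
    "Summarise this article.", "First candidate summary", "Second candidate summary"
)
```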

Constitutional AI (CAI) is an example of an RLAIF method. It follows an approach where human oversight is limited to drafting a set of principles that form a ‘Constitution’. An example of these principles could be: ‘Please choose the response that is the most helpful, honest and harmless’.

CAI uses the principles in its constitution twice, first in a supervised learning (SL) phase and then in an RL phase. During the SL phase, the LLM is presented with a set of harmful prompts and asked to critique and revise its answers several times, each time considering a randomly sampled principle. The harmful prompts and the final revised answers are then used as training data to fine-tune the model. In the RL phase, the LLM to be aligned generates its own feedback, using the principles in its constitution as evaluation criteria.
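The critique-and-revision loop of the SL phase could be sketched as follows. This is only a minimal illustration: the `llm` function is a placeholder for a call to the model being aligned, and the two principles shown are illustrative examples rather than an actual constitution.

```python
import random

# Illustrative constitution: in practice the principles are drafted by humans.
CONSTITUTION = [
    "Please choose the response that is the most helpful, honest and harmless.",
    "Please choose the response that least encourages illegal or unethical activity.",
]

def llm(prompt: str) -> str:
    # Placeholder for a call to the model being aligned.
    return "(model output)"

def critique_and_revise(harmful_prompt: str, n_rounds: int = 2) -> str:
    answer = llm(harmful_prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = llm(
            f"Critique the answer below against this principle: '{principle}'\n"
            f"Answer: {answer}"
        )
        answer = llm(
            f"Revise the answer so that it addresses the critique.\n"
            f"Critique: {critique}\nAnswer: {answer}"
        )
    # The (harmful_prompt, answer) pairs become the SL fine-tuning data.
    return answer
```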

Development status

As of this report's writing, scalable oversight is a promising area of research, but its practical application in commercial AI models appears limited. The primary method used to guide the behaviour of OpenAI's GPT-4 model, launched in March 2023, was RLHF. Similarly, Meta relied on RLHF to develop its Llama 3 model, released in April 2024.

In September 2023, Google researchers published a study claiming that RLAIF achieved comparable or superior performance to RLHF in tasks such as summarization, helpful dialogue generation and harmless dialogue generation. Despite these promising results, Google's Gemini 1.0 and 1.5 models, launched in December 2023 and February 2024, respectively, were both trained using RLHF.

Although scalable oversight may not yet be ready for widespread adoption, the AI provider Anthropic has demonstrated its feasibility by using Constitutional AI to align its Claude models. Scalable oversight methods have the potential to enable AI developments that would otherwise be too complex or expensive. Consequently, it is likely that AI models currently in development are already using some of these methods.

For instance, in June 2024, OpenAI announced plans to integrate AI models into their RLHF labelling pipeline. These models will assist human evaluators in detecting errors in ChatGPT's code output.

In the coming years, there may be a worrying trend towards increasing reliance on AI systems for sensitive tasks where misalignment could have serious consequences. This situation is analogous to our dependence on traditional IT systems, where urgent software updates might be necessary to address critical vulnerabilities. Scalable oversight methods could enable AI developers to re-align their models at a much faster rate than using human feedback alone.

There is growing interest from the AI industry in making oversight more automated and scalable, as RLHF is very expensive, especially as AI systems become more complex and pervasive. However, challenges such as the lack of standardisation and the difficulty of auditing the effectiveness of oversight methods remain key hurdles. As a result, while the need for scalable oversight is well recognised, its practical implementation across diverse industries remains a work in progress.

Potential impact on individuals

Scalable oversight is one way AI developers can ensure their systems act as expected and without harming users or third parties. During their development, LLMs tend to memorize some of their training data, including personal data. One positive impact that scalable oversight might have is speeding up and improving the alignment process so that LLMs remain useful while respecting individuals' right to privacy (as shown in the story opening this trend).

Scalable oversight can have important applications for both systems that handle personal data and those that do not, providing a way to ensure ethical and responsible AI behaviour across a wide range of domains. For systems that handle personal data, scalable oversight can be particularly relevant, as the ubiquity of these systems will make it impossible to enforce compliance and prevent misuse through 'human oversight' alone.

Despite all efforts, AI alignment processes are not perfect, and users (malicious or not) will continue to explore ways to trick LLMs into providing confidential information or producing biased or otherwise harmful information. While it is possible to fix these LLM ‘vulnerabilities’ by re-aligning the model, this process takes time and effort (e.g., producing new prompts and feedback to train the model). Scalable oversight could speed up this process and reduce the time window in which LLMs are vulnerable to detected issues.

Another aspect to consider is the potential for bias transmission when using scalable oversight. LLMs used as evaluators might have their own biases. If AI developers use biased AI systems to generate the training data for reward models, those reward models can reproduce the biases present in their training datasets and steer the LLMs being aligned toward the same biases.

When considering scalable oversight methods that use human and AI feedback, it is also necessary to consider that the best-performing LLMs can distinguish their own answers from those produced by other LLMs or by humans, and that they show a strong preference for their own answers over those of others. This self-preference could easily amplify any bias embedded in an LLM.
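One way to probe this risk is to measure how often an evaluator model prefers its own answers over comparable answers produced by others. The sketch below is purely illustrative; the `evaluator_prefers` callable is a hypothetical stand-in for a real evaluator query.

```python
def self_preference_rate(pairs, evaluator_prefers) -> float:
    # Fraction of prompts for which the evaluator picks its own answer.
    own_wins = sum(1 for own, other in pairs if evaluator_prefers(own, other) == "own")
    return own_wins / len(pairs)

# Toy usage with a stub evaluator that always prefers the first (own) answer.
pairs = [("own answer 1", "other answer 1"), ("own answer 2", "other answer 2")]
rate = self_preference_rate(pairs, lambda own, other: "own")
# A rate well above 0.5 on answers of comparable quality would indicate a
# self-preference bias that could propagate into the reward model.
```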

A positive impact of some forms of scalable oversight, such as Constitutional AI, is that they increase transparency about the values and goals of the alignment process. This could also improve transparency about an AI system’s decision-making process (e.g., by explaining why an LLM decides to provide or withhold a certain output). However, experts still debate whether current LLMs are ‘stochastic parrots’, a metaphor for the theory that LLMs do not really understand the meaning of the language they process or produce. The reliability of the explanations provided by scalable oversight systems, and the extent to which their outputs actually follow a set of principles, therefore remains uncertain (goal misalignment risk).

Despite the potential benefits of scalable oversight, the fact that it operates at a large scale means that, if its risks materialise, the consequences could be devastating for the AI systems under its supervision.

Scalable oversight methods do not entirely eliminate the need for human participation in the alignment process. In fact, human oversight could serve as a viable mitigation strategy for some of the risks associated with scalable oversight. However, since the primary purpose of introducing scalable oversight is to overcome the limitations of human feedback, it is essential to evaluate the effectiveness of human oversight as a risk mitigation measure on a case-by-case basis, particularly when it comes to the risks introduced by AI-driven oversight itself.

Suggestions for further reading

  • Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback (arXiv:2212.08073). arXiv. https://doi.org/10.48550/arXiv.2212.08073
  • Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., Zeng, F., Ng, K. Y., Dai, J., Pan, X., O’Gara, A., Lei, Y., Xu, H., Tse, B., Fu, J., … Gao, W. (2024). AI Alignment: A Comprehensive Survey (arXiv:2310.19852). arXiv. https://doi.org/10.48550/arXiv.2310.19852
  • Lee, H., Phatale, S., Mansoor, H., Mesnard, T., Ferret, J., Lu, K., Bishop, C., Hall, E., Carbune, V., Rastogi, A., & Prakash, S. (2023). RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (arXiv:2309.00267). arXiv. https://doi.org/10.48550/arXiv.2309.00267
  • Scalable Oversight in AI: Beyond Human Supervision. Medium. https://medium.com/@prdeepak.babu/scalable-oversight-in-ai-beyond-human-supervision-d258b50dbf62