Any updates, guys?
TL;DR: We are writing an AI safety textbook designed to serve as a comprehensive educational resource for the field. The current version has already begun to see use in various courses and universities, so the final, improved version is poised to have a large impact on the AI safety education and field-building ecosystem.
Currently, AI safety courses rely heavily on a patchwork of external links to blog posts and research papers (e.g. AISF). This content varies in technical depth and communication style, posing a significant barrier to coherent learning. We are writing a textbook that unifies these sources into a seamless narrative to improve the learning experience. We also want our curriculum to include important considerations that are often neglected in other courses.
Our textbook is designed to serve as a comprehensive educational resource for the AI safety field. It will include an adaptation of the Turing Seminar, an accredited course with talks, workshops, and exercises at École Normale Supérieure (ENS) Ulm and ENS Paris-Saclay in France. The original syllabus for the seminar was loosely inspired by the AI Safety Fundamentals (AISF) curriculum that I teach in the Master MVA program.
However, this adaptation into a textbook requires significantly more effort than the initial course development. Since the beginning of the project, we have expanded the content considerably, adding new chapters regularly. We estimate that we have completed roughly 50% of the project and that it will take us six months to produce a strong final version.
The current preliminary version of the textbook, as well as previous versions of written chapters, can be found here. The content has gone through iterative feedback loops with students who read earlier versions, and the current text reflects that back-and-forth dialogue. The writers have teaching experience: we know the most common questions and where to place emphasis, based on where students most often get confused (more on this in our CVs and in section (e) Additional Details).
Below is the proposed table of contents:
Introduction to AI capabilities
Reasoning for why HLMI is plausible; Definitions
Scaling laws, the bitter lesson, timelines, takeoff scenarios
Risk landscape
Understanding scales of AI risk and specific failure stories
Explaining specific dangerous capabilities (e.g. autonomous replication, emergence, situational awareness)
Solution landscape
Strategies and trade-offs for mitigating risks
Responses to naive solutions and motivation for in-depth problem/solution discussions
Reward misspecification
Outer alignment/reward misspecification
Imitation learning-based approaches (IRL, CIRL, RLHF, etc.)
Feedback learning-based approaches (RLHF, RLAIF/CAI, etc.)
Goal misgeneralization
2D Robustness: Goal vs. capability distribution shifts
Loss landscapes, path dependence, inductive biases, mesa-optimizers, and deceptive alignment
E.g. content covered in the first three of Hubinger's 2023 MATS lectures.
Oversight
Recursive reward modeling, debate, adversarial training, factored cognition, iterated amplification, etc.
Interpretability
Techniques for interpretability in both vision and language models
Overview of concept-based and developmental interpretability.
Deep-dive distillation of mechanistic interpretability.
E.g. distillation of all content from Neel Nanda’s detailed guide.
Agent foundations
Overviews of decision theory, causal graphs, natural abstractions, logical induction, embedded agency
Perspectives on optimization and abstract notions of agency
Evaluations
Explanation of different evaluation frameworks (behavioral vs. understanding-based) and the core difficulties of evaluations.
E.g. content covered in the last three of Hubinger's 2023 MATS lectures.
Control evaluations by Redwood Research; evaluations proposed by METR, Apollo Research, etc.
AI governance
The AI triad, Responsible Scaling Policies (RSPs), the Preparedness Framework, etc.
Outlining differences between corporate, national and international governance needs.
Distillation of major governance proposals (EU AI Act, Biden-Harris EO, etc.)
This is an overview of the materials. Several chapters have already been written, but others are still being edited and finalized, so we expect the content to shift around. We have tried to make the content as future-proof as possible, but some sections may need occasional rewrites, and new sections may be added as the AI safety field evolves.
The introductory and governance chapters are designed for general accessibility. The textbook's core, however, is tailored to students wanting to deepen their technical understanding of AI safety.
The textbook in its entirety is targeted towards longer upskilling programs like AISF, or semester-long university courses, but individual chapters are poised to be quite useful in their own right. As an example, interpretability has a steep learning curve, so our chapter can potentially serve as preliminary reading for specialized programs like the Alignment Research Engineer Accelerator (ARENA), or the interpretability tracks of MATS. We envision similar uses for other individual chapters as well.
Dan Hendrycks's textbook from CAIS is a similar effort. However, our audiences are different, and what we propose is more technical in nature, providing deeper dives into specific research agendas currently being pursued by AI labs (OpenAI, DeepMind, Anthropic, Redwood, etc.). For example, the CAIS textbook dedicates only one chapter to single-agent safety, and that chapter focuses mainly on risks without going deep into state-of-the-art proposals for addressing them. We hope that a reader of our textbook can “hit the ground running” and start actively contributing to either engineering or research in AI safety after having read our text. The CAIS textbook and ours fulfill different needs, each contributing to the overall AI safety field-building space in its own way.
Professor Vincent Corruble has offered to proofread everything. He is a friend of Stuart Russell, and he thinks Stuart could potentially write a foreword for our completed text. Vincent is also able to put us in contact with a scientific editor specializing in ML.
We expect the textbook to be distributed primarily in digital form; however, over the past year people have also expressed interest in a physical copy. We have therefore included a budget item for printing copies of the finalized text to distribute to some universities.
Students have read previous versions of our textbook during educational programs such as the Turing Seminar, the Machine Learning for Good (ML4G) boot camps in France and Germany, and AI Safety North's iteration of AISF. We have been gathering feedback and iteratively improving the book, with positive reactions from participants. Notably, 90% of ML4Good participants preferred the beta version of this textbook over BlueDot's materials, indicating strong potential. Here are some anonymously collected feedback quotes from students in these courses:
“I found it to be very well written and super insightful. Learned tons of new things. Looking forwards to continue reading.” - Participant of ML4G Germany 2023.
“I liked the text, was well written, concise, easy to follow, contained many important points.” - Participant of ML4G Germany 2023.
“The textbooks are very helpful to keep a clear line of thought and a better structure.” - Participant of ML4G France 2023.
“The material and content are great, thank you for writing it and I can't wait to read it in its entirety.” - Participant of ML4G France 2023.
The book will be used by the Turing Seminar, which last year had 100 machine learning master's students enrolled. The content will also be used across Europe by organizations such as AI Safety North and ML4Good, which together reached roughly 150 additional students last year. Given the increasing demand for AI safety education, there are plans to run more iterations of ML4Good, so these enrollment numbers are expected to rise. Feedback from iterations with students will continue to refine and enhance the project's usefulness.
Given that the structure of the textbook was inspired by BlueDot's AISF course, we have also tried to make our textbook plug-and-play for participants in that course, which has an anticipated enrollment of 1,000 students per year. Ideally, we hope it gets adopted as the textbook for the AISF course. Additionally, most AI safety courses worldwide use the AISF syllabus as their core (e.g. MIT AI Alignment), so any upgrade to the AISF structure will also have an impact on all downstream independently run iterations. More examples of courses that could potentially use our text are highlighted at the end of section (e)(iv), “If there was demand, wouldn't someone already have written this?”
Many people apply to BlueDot courses and get rejected (roughly 200 accepted participants out of 1,200+ applicants), and even those who do participate often cannot keep up with the pace or find the time to read everything.
To mitigate this problem and increase retention rates, the structure of our book will also contain hierarchical summaries: each chapter is summarized in one page, each section is summarized in one paragraph, and deep technical details like algorithms, proofs, or code are presented in the appendices. This reduces the time barrier to the readings and increases macro-comprehension, while still presenting all the technical details for those who are interested. We further detail our approach to solving this problem in section (e)(iii), “Do you expect that everyone would read it all the way through?”
The AISF curriculum was initially designed almost five years ago, when the AI safety field was still developing. At that point, having any course on AI safety at all was a big win. But we have come a long way since then, and AI safety education needs to be updated to reflect that. An update does not simply mean pointing to new blog posts; it means improving the overall pedagogical experience based on iteration and on empirically observed failures of existing approaches.
We have also observed empirically that when students are offered the choice between doing the readings from the list of links in AISF or from a single text that organizes all the information for them, they overwhelmingly pick the latter without any prompting from our side.
Even as we write this application, independent AISF cohorts in Germany, Russia, and other countries have already begun using the preliminary version of our text in place of the existing AISF curriculum. We as the writers didn't even know about some of these cohorts until they reached out to thank us for our work. A quote from an AI Safety Fundamentals cohort facilitator in Moscow: "a ton of thanks for the texts you write; that's a hell of a lot of work and it's done so, so well! You're a lifesaver. Thanks again for everything!"
We have already spent 7-8 months working on this to get it off the ground even without funding, so we are clearly dedicated to this work.
There is clearly a demand for this project. People are actively choosing to use even the semi-finished version that we already have, so it is safe to assume that the finished work will have an even greater impact.
(e)(ii) Are you the right people to be writing this? / Who are the authors?
Charbel has taught the Turing Seminar at École Normale Supérieure (ENS) Ulm and ENS Paris-Saclay in France three times. He has taught at Machine Learning for Good boot camps six times and was a facilitator of an AI Safety Fundamentals group in Switzerland. In total, he has taught more than 200 students about AI safety over the last two years, which has given him a lot of on-the-ground interaction with students. We have been actively refining the structure and syllabus by observing the specific confusions, questions, and misunderstandings students have during the lectures; this iterative, interactive process will continue over the next few months during the course of this project. In addition to teaching, Charbel has made several research contributions. He has helped advance the AI safety discourse through work such as the article "Against Almost Every Theory of Impact of Interpretability" and papers like "Compendium of Problems with RLHF," the latter of which was adapted and published as "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback." These contributions have helped shape current discourse and approaches in AI safety. He also helped popularize the agenda of Davidad, who is now a MATS mentor.
Jeanne has also taught the Turing Seminar at ENS Ulm and organized a weekly AI safety reading group. She is a fellow at the Athena AI Alignment Fellowship, where she is working on a mechanistic interpretability project.
Markov has a lot of experience in communication and distillation across various mediums: written, video, and spoken. He has been part of five AISF cohorts, both as a participant and as a facilitator, including the advanced machine learning cohort offered by BlueDot and other independently run cohorts at universities in the US and Europe. He has been both an AI safety research fellow and an AI safety distillation fellow at organizations like AI Safety Info. He has won hackathons for his writing on AI governance as part of Apart Research sprints, given multiple public lectures distilling AI safety research agendas, and authored several technical AI safety scripts for the Rational Animations YouTube channel.
Based on the combined experiences of the contributors to this project, we identified some common threads of complaint that we hope to address.
First, in all of the existing AI safety courses we have seen, the readings are not self-contained. One link in the syllabus does not equal one link's worth of reading to understand what the post or paper says. We have heard direct complaints from our students about the external-link approach: a blog post or paper assumes a lot of background knowledge that students, especially newcomers, may not have. This sends students down a descending whirlpool of blog posts, which point to further external links, just to understand a single concept.
Having the key arguments and connections presented coherently, without requiring students to play a guessing game about which external link or referenced paper they should read to truly understand the material, has in our experience significantly enhanced learning. We further address the coherence and reading-time problems in the answer to (e)(iii), “Do you expect that everyone would read it all the way through?”
Second, we want to improve various other small things that add up to make the learning experience difficult. One example is the use of different terminology across papers and posts, which can and often does confuse students: specification gaming vs. reward misspecification vs. reward hacking vs. proxy gaming, deceptive alignment vs. deception, deceptive alignment vs. scheming, inner alignment vs. goal misgeneralization, and so on.
Third, beyond terminology, there are concepts that are either not clearly explained or that get entangled because AI safety was born out of a mixture of many disciplines. For example, papers from philosophy, rationality, mathematics, economics, or cognitive science often refer to the same underlying insights in different formulations. This again confuses students. We have observed it even among existing researchers in the AI safety space who previously went through courses structured in this manner; it wastes time in discussions and hinders collaboration on potentially mutually beneficial research agendas.
We have taken on the challenge of sifting through these insights and distilling them in an understandable and coherent manner. Our project consolidates and cleans up many of these confusions, serving as a reference not only for newcomers but also for existing researchers. In each chapter, we take care to clearly provide definitions, comparisons, and distinctions between these concepts as necessary.
Lastly, while reading a text is a good learning tool, visualizations and interactivity help immensely as well. We have recently decided to take this project further by building a website, taking inspiration from the Distill.pub article “Communicating with Interactive Articles.”
Here is an example of Chapter 1 on the website. We are currently in the process of porting content over, so please expect things like references and certain pieces of formatting to be broken.
The current grant application is mainly to finish the core text. However, in the long term, we hope to introduce interactivity to the website similar to existing distill.pub articles like this or this. Letting students play with concepts improves learning by allowing them to explore toy versions of a problem through visual media.
Examples of interactivity might include a JavaScript snippet that lets the reader explore Tom Davidson's takeoff-speeds charts and feedback loops, designing reward functions to understand specification gaming, small interactive gridworlds to explain the intuition behind goal misgeneralization failures, or traversing toy models of mechanistic interpretability. Such elements could be included in every chapter; a minimal sketch of the gridworld idea is given below.
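To make the gridworld example concrete, here is a minimal, hypothetical sketch (in TypeScript; the setup and all names are invented for illustration, not taken from the textbook) of the kind of toy we would make interactive: a policy that learned "always move right" because the coin always sat at the right edge during training keeps moving right when the coin is placed elsewhere, competently pursuing the wrong goal.

```typescript
// Hypothetical 1-D gridworld sketch of goal misgeneralization (illustrative only).

type State = { agent: number; coin: number; size: number };

// The "learned" policy: always step right, because that sufficed during training.
const learnedPolicy = (_s: State): number => +1;

// The intended goal: reach the coin, wherever it is.
const reachedCoin = (s: State): boolean => s.agent === s.coin;

// Roll the policy forward until the coin is reached or the step budget runs out.
function rollout(start: State, maxSteps: number): State {
  const s = { ...start };
  for (let t = 0; t < maxSteps && !reachedCoin(s); t++) {
    s.agent = Math.min(s.size - 1, Math.max(0, s.agent + learnedPolicy(s)));
  }
  return s;
}

// Training-like episode: coin at the right edge -> the policy looks aligned.
console.log(reachedCoin(rollout({ agent: 2, coin: 9, size: 10 }, 20))); // true

// Distribution shift: coin on the left -> competent behavior, wrong goal.
console.log(reachedCoin(rollout({ agent: 5, coin: 0, size: 10 }, 20))); // false
```

An interactive version would let the reader drag the coin around and watch the rollout, making the capability-vs-goal distinction visible at a glance.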
A secondary motivation for creating the website was to formalize the project a bit more, rather than just publishing a blog post. Ideally, we want our work to be citable and to receive academic credit. The Distill format provides both a better learning experience through interactivity and easier attribution in academic contexts.
During his participation in AI Safety Fundamentals cohorts, Markov noticed that the drop-off rate was high. By the fourth week, 50% of participants tended to drop out, and by the last week there were often only 2-3 participants still regularly attending, out of a starting cohort of 10-12. The most common complaints are that students cannot keep up with the pace, or that the advertised reading times for AISF are far off the mark.
We have specifically taken care to address concerns about the text being too long or wordy. The text has a multi-level structure so people can read in as much or as little depth as they have time for.
During his initial AISF cohort in 2022, Markov wrote one-page summaries of every paper and post in the AISF syllabus, to help current and future students stay abreast of the material and improve retention even when they did not have time to do all the weekly readings. Facilitators of past AISF groups and other students greatly appreciated these summaries, which was one of the initial sparks for this project. The Google Doc containing those summaries was shared informally with many AISF groups that wanted to read a summary instead of the primary source.
Taking inspiration from the impact of that initial summarization project, our book will also contain hierarchical summaries. Each chapter will be summarized in one page, and each section will be summarized in a paragraph at the beginning of its chapter. Technical details will then be presented in the sections themselves, with extras such as full algorithms, pseudo-code, or proofs presented in appendices.
This hierarchical distillation structure should increase retention for readers with limited time, while maintaining the technical depth needed by those who want to use the text as the go-to reference for any particular topic in AI safety.
Even though AI safety has grown over the last few years, the field is still small. People have undertaken smaller, agenda-specific distillation projects in the past, but nothing of this scope; those who get involved are mainly incentivized to go into research roles. Here are some related projects:
AI Safety Info (AISI). This website has a different goal: it presents content as short question-answer pairs rather than as an end-to-end, organized explanation. Additionally, its target audience appears to be lay people and advocates for AI safety, whereas ours is prospective engineers and researchers wanting to upskill and contribute to the field. Markov worked as a distillation fellow with AI Safety Info, which gave him considerable experience communicating AI safety topics.
The CAIS textbook. As we pointed out, the target audiences for our works are different, and we believe our textbooks will be complementary. The CAIS textbook focuses on a macro overview of safety, while we delve into the specific details of proposed solutions. Quoting from the preface of the CAIS textbook - “The book has been designed to be accessible to readers from diverse academic backgrounds. You do not need to have studied AI, philosophy, or other such topics. … This textbook moves beyond the confines of machine learning to provide a holistic understanding of AI risk.”
The Alignment Newsletter, written by Rohin Shah. It provided summarized distillations of many papers and posts, but Rohin has become busy with his research and the last issue was almost two years ago. Other newsletters offer summaries, but those are scattered across many places, and there is roughly zero chance that a beginner will read through past newsletters to build a picture of the field. We don't think this is the best way to build momentum; newsletters are for people who are already familiar with the landscape.
Why don't researchers continue to work on distillation? One possibility is perverse incentives: a research paper earns far more academic credit and citations than a research distillation does. Whatever the incentives, experts put most of their effort into research and comparatively little into distillation, which is less prestigious.
There's a tradeoff between the energy put into explaining an idea and the energy needed to understand it. Because of this tradeoff, a lack of high-quality explanations means every student has to put more energy into understanding the same research. Research debt is the accumulation of this missing distillation labor, which then has to be paid by students.
The learning curve does not have to be as steep as it currently is. Our project pays down this research debt by making the learning experience easier. Good explanations often involve transforming and refining an idea, which can take just as much effort and deep understanding as the initial discovery.
Like the theoretician, the experimentalist, or the research engineer, the research distiller has an integral role in a healthy research community. Right now, almost no one is filling it, as can be seen from the consistent calls for distillation of research agendas (e.g. the Call For Distillers).
To make sure we can have a large impact and improve the student experience, we have also spent the last few months researching AI safety courses being run around the world. Many of them use either the AISF curriculum or a small variation of it, so participants in these courses are likely facing the same challenges we have noticed in our students. This means that any upgrade to the AISF structure will also improve all downstream independently run iterations. Overall, we think the impact of our project on the whole AI safety education and field-building ecosystem will be quite significant.
As examples, here are just some of the AI Safety courses being run at various universities that we looked into:
Cambridge
AI Policy: https://www.cambridgeaisafety.org/policy
Technical: https://www.cambridgeaisafety.org/technical
Berkeley
Technical: https://berkeleyaisafety.com/decal
Governance: https://berkeleyaisafety.com/governance-decal
Harvard
AI Policy: https://haist.ai/policy-fellowship
Technical AI Safety: https://haist.ai/tech-fellowship
MIT
AI Policy: https://www.mitalignment.org/aisf-policy
Technical AI Safety: https://www.mitalignment.org/aisf-ml
Markov
21 days ago
@Ryan Kidd Hey Ryan! The major updates are -
We have built a new website that incorporates a bunch of new functionality! There are many details around this, but the core things are:
Smooth workflow & fast iteration: We wanted to be able to collaborate and get granular, line-by-line feedback on our writing while still providing a clean experience for students who simply want to read the text. So we structured the workflow so that we write, collaborate, and comment in Google Docs, and a simple export publishes the content to the site. This system makes versioning, updates, and overall life easy for both commenters and us as writers. It allows new papers and corrections to be incorporated easily over time, making it more of a 'live' project.
Interactivity: We have embedded content from places like Metaculus, giving students live prediction-market updates on things like timelines and takeoff instead of static images.
Collaborations: Besides this, we are officially collaborating with Stampy (AI Safety Info): we provide detailed technical content, and they provide shorter summaries! As just one example of added functionality from this collaboration, the website has on-hover definitions for all jargon, sourced from Stampy. We have also been in talks with other projects like AI Digest about incorporating more of their interactive snippets to add value on top of the core text.
Plug and play: We now provide online reading, offline downloads, and facilitation guides for most chapters. This makes it extremely easy for any AI safety course or reading group to pick up the project and run with it! (We are looking to scale this type of functionality even more.)
More modalities: We are experimenting with both audio and video (by popular request from students) using tools like NotebookLM, and we also plan to record talks/presentations as video alternatives to reading the chapters.
Core text: As for the core text, we have written a lot! We have updated some older chapters and written new ones, and we will have a v0 of every chapter on the site soon. Currently, just three chapters remain; they are all written but still need to be refined before appearing on the site. We will be pushing them very soon!
Increasing students: New courses frequently reach out to use the content. Student numbers are steadily increasing, and our projections for how many more to expect are really exciting! Some current courses, with approximate students per year:
ML4Good (250)
AI Safety Collab (80)
ENS Ulm + Saclay (100)
UBC AI Safety, in Vancouver (~30-50)
AI Safety Gothenburg
AI Safety Hungary
Independent readers
Impact metrics and analytics: Besides the courses already using the text, we want to be a bit more rigorous about gathering impact metrics. We already get feedback forms from some courses, but we plan to add analytics to the website and a feedback form at the end of every chapter. This will let us see how many unique readers we reach and let people who use our text outside of any course give feedback.
Based on conversations at various conferences (EAG/EAGx/...), both experts and students think this is a great project and have found value even in the current version built out so far.
We are in talks with senior architects to get more people formally on board.
Charbel-Raphael Segerie
20 days ago
@Markov And the book has a much better name now: "The AI Safety Atlas"!
Loppukilpailija
5 months ago
I do think that the textbook improves upon BlueDot Impact's material, and more broadly I think the "small things" such as having one self-contained exposition (as opposed to a collection of loosely-connected articles) are actually quite important.
I second Ryan Kidd's recommendation of consulting people with experience in AI safety strategy and AI governance. I think getting in-depth feedback from many experts could considerably improve the quality of the material, and suggest allocating time for properly collecting and integrating such feedback.
My understanding is that there are a lot of AI safety courses that use materials from the same reference class, and would use improved materials in the future. Best of skill on executing the project!
Ryan Kidd
6 months ago
Dan Hendrycks' AI safety textbook seems great, but it principally serves as an introduction to the field rather than an in-depth overview of current technical AI safety research directions, which is the intent of this project. Periodically updated "topics courses" could serve as an equivalent source of value, but these might be bound to particular universities and updatable on a slower timescale than an online textbook. I'm also enthused by Markov's plans to eventually integrate interactive content and live content from sources like Metaculus, Our World in Data, Stampy, and more.
I believe that the AI safety research field should grow 10-100x over the next 10-20 years and AI safety student groups should be a strong driver of this growth. Currently, I think AI safety student groups need more "plug-and-play" curricula to best prepare members for progression into research, engineering, and policy roles, especially at universities without dedicated AI safety courses like that based on Hendrycks' textbook. I think BlueDot Impact's AI Safety Fundamentals courses are great, but I don't see why BlueDot and CAIS should be the only players in this space and think there is some benefit from healthy competition/collaboration.
Charbel has road-tested content from the early stages of this textbook project with several AI safety university groups and courses with apparently good feedback.
I've been impressed with Charbel's LessWrong posts and nuanced takes on AI safety research agendas.
The online version of the textbook will be free and open-source (MIT License), which I think is important for introductory AI safety fieldbuilding materials to be maximally impactful.
I think that the optimal form of this project is a continually updated online resource that periodically integrates new papers and research paradigms, and therefore this project will eventually need long-term funding and a permanent home. However, I believe that my grant will greatly assist Charbel and Markov in producing a proof-of-concept sufficient to secure long-term funding or institutional support. Additionally, the textbook MVP seems likely to be high-value for the near term regardless of whether the project continues. Lastly, if the textbook is high-value and Charbel and Markov are unable to secure long-term funding, I'm sure it will be useful to established curriculum developers like BlueDot Impact.
I wonder if this project should actually be converted into a wiki once the MVP is developed. Markov has previously worked with Stampy and has mentioned that they might want to integrate some Stampy articles into the online textbook. However, even if this project is ideally a wiki, building a viable MVP seems crucial to securing long-term funding and core content for iterating upon.
I don't know Markov or the proposed editor, Professor Vincent Corruble, very well, which slightly decreases my confidence in the textbook quality. However, Markov comes highly recommended by Charbel, has previously worked at Rational Animations in charge of AI safety, and has produced good-according-to-me content for the textbook so far. Professor Corruble is an Associate Professor at Sorbonne Université and a UC Berkeley CHAI affiliate, which indicates he has the technical expertise to oversee the computer science aspects of the textbook. I additionally recommend that Charbel and Markov enlist the support of further editors with experience in AI safety strategy and AI governance, as I believe these are critical aspects of the textbook.
I chose to donate enough to fund the minimum amount for this project to proceed because:
I want full-time work on this textbook to commence immediately to minimize its time-to-impact and Markov is unable to do this until he receives confirmed funding for 6 months;
I think it is relatively unlikely that this project will be funded by Open Philanthropy or the LTFF and I have greater risk tolerance for projects like this;
I have 5x the grant budget I had last year and I think this project is probably more impactful than I would have considered necessary for a counterfactual $7.8k regrant made last year based on the projects I funded;
I didn't give more than the minimum amount as I feel my marginal funding is high-value for other projects and I think Charbel and Markov can likely secure additional funding from other sources (including other Manifund regrantors) if necessary.
I don't believe there are any conflicts of interest to declare.
Adrian Regenfuss
6 months ago
Everybody is (was?) always talking about how important distillation is. A field of the size of AI safety needs efficient onboarding, and a textbook with exercises & (afaik?) flashcards is overdue.