Understanding Trust

Abram Demski

Active Grant
$100,000 raised
$100,000 funding goal
Fully funded and not currently accepting donations.

Brief

I originally created this Manifund grant request with the intention of posting it in December 2024, to fundraise for 2025. However, I found a donor through other means, and decided against posting this at that time.

Figuring out the best way to transfer the money from my private donor to me (considering taxes and other such things) has been a bit of a slog. After about 6 months of consideration (including paying some lawyers), we’ve come back around to Manifund as a tool for the money transfer.

As such, I am posting this mainly so that my existing donor can send me money through Manifund. However, as always, additional money would help by giving me more resources, more runway, and letting me not worry about raising money for longer. I've set the project goal at 100k because that is the amount I am expecting from that single donor via this transfer method.

I have already run the AI Safety Camp project mentioned in the grant proposal below; I took on 4 students through that program, and they helped me improve my latest paper. I didn’t make it into the MATS program I mention below. I did record the lectures I gave for AI Safety Camp, and am in the process of having them edited.

Research Summary

The Tiling Agents problem (AKA reflective consistency, deference) consists of analyzing when one agent (the "predecessor") will choose to deliberately modify another agent (the "successor"). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties "tiles" if those properties, when present in both predecessor and successor, guarantee that any self-modifications will avoid changing those properties.

You can think of this as the question of when agents will preserve certain desirable properties (such as safety-relevant properties) when given the opportunity to self-modify. Another way to think about it is the slightly broader question: when can one intelligence trust another? The bottleneck for avoiding harmful self-modifications is self-trust; so, getting tiling results is mainly a matter of finding conditions for trust.
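
One rough way to write down the core requirement (a sketch in my own notation, not a definition taken from this proposal): write $\mathrm{Mod}(A, A')$ for "agent $A$ deliberately modifies itself into agent $A'$", and let $S$ be a set of desirable properties. Then

$$S \text{ tiles} \iff \forall A, A' : \big(S(A) \wedge \mathrm{Mod}(A, A')\big) \Rightarrow S(A'),$$

so that any chain of deliberate self-modifications starting from an agent satisfying $S$ keeps satisfying $S$.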

The search for tiling results has four main motivations:

  • AI-AI tiling, for the purpose of finding conditions under which AI systems will want to preserve safety-relevant properties.

  • Human-AI tiling, for the purpose of understanding when we can justifiably trust AI systems.

  • AI-Human tiling, as a model of corrigibility. (Will AIs choose to interfere with human decisions?)

  • Tiling as a consistency constraint on decision theories, for the purpose of studying rationality. Improved pictures of rational decision-making may shift the way we make critical decisions about the future.

These four application areas have a large overlap, and all four seem important.

This line of research was historically one of the main topics considered at MIRI under the Agent Foundations program. However, that research program has been disbanded, and other researchers who were once involved have moved on to other things. I am the only person I know of who is currently continuing this line of research (with the exception of the students I plan to take on in 2025, some of whom have already started to prepare for the project). In some ways, this line of research is the most well-justified path for AI safety, since it focuses on mathematically modeling what a positive relationship between AI systems and humans could look like. I therefore consider it neglected relative to its importance.

Budget Summary

I can get employment at a rate of at least $100/hour elsewhere. Charging 60% of that and assuming full-time work brings us to a salary of $124,800/year. Health insurance costs about $800/month per person, and I pay for one dependent, bringing it to $19,200/year. This totals to my requested amount of $144,000 for the year.
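
Spelling out the arithmetic (my own breakdown; the 2,080-hour full-time work year is implied by the salary figure rather than stated explicitly):

$$0.6 \times \$100/\text{hr} \times 2{,}080\ \text{hr/yr} = \$124{,}800$$
$$\$800/\text{mo} \times 2\ \text{people} \times 12\ \text{mo} = \$19{,}200$$
$$\$124{,}800 + \$19{,}200 = \$144{,}000$$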

Deliverables

Mentoring

My current plan is to do focused 1-on-1 mentoring with four especially promising applicants found through AI Safety Camp. This mentorship will run for three months. During these three months, I will also offer weekly presentations to a wider audience (I estimate 5-10 people will come in addition to the four mentees). This will help grow a research community around these topics.

I have also applied to MATS and will probably mentor 1-2 students through that program in 2025.

Written Output

I primarily report on my research through essays written on my LessWrong page and cross-posted to the Alignment Forum. I may also choose to write an academic paper for specific results. My writing seems to be widely read and to have an impact on the field (more on this in the Track Record section).

Many of the best academic books started out as lecture notes. Since I aim to give a series of presentations on the Tiling Agents problem in 2025, this seems like a good opportunity to try to put together a new collection of writing on the subject, based on those presentations as well as the work that I and others do over the course of 2025. I will at least record those lectures and upload them for viewing by a wider audience. I will also aim to polish the transcripts into a series of essays which can serve as a new reference for the current state of the research agenda.

In addition to this, I expect researchers I mentor to write about the research we do together. Again, this may take the form of blog posts or formal academic papers.

Track Record

Writing

I have been working on Agent Foundations research for over a decade in some capacity (I first attended a workshop at MIRI in 2012). I worked on AI Safety research full-time from 2017 to 2024, as part of the Agent Foundations team. My most illustrious output there is probably the Embedded Agency comic.

I have 19 academic-style papers, 12 of which have been peer-reviewed. I have 294 total citations. My h-index and i10 index are both 8. I have a PhD in computer science from the University of Southern California.

At present, I have 217 LessWrong posts, with a total karma (for posts and comments) of 18,490 (currently putting me 17th on the site when sorting by karma). 101 of these posts are cross-posted to the AI Alignment Forum, where I have 3,501 karma.

Out of the 42 essays in the Best of 2018 LessWrong books, I am an author on three. Of the 58 for 2019, I have five. Of the 49 for 2020, I have three. (I do not have any for the 2021 books. Books have not been produced for any other years.)

This helps establish that I am able to produce written content that finds an audience, some of which that audience rates highly. This is important to my impact, since my research needs to be read and understood in order to make a difference.

I find that other people working on AI safety have frequently read some of my posts and appreciated them. This suggests that I am reaching a technical audience and having an impact on the field.

I was not a coauthor on the Logical Induction paper, but Scott Garrabrant’s account of the history mentions me repeatedly.

This shows my involvement in the research community, and the utility of my ideas for sparking others to do good research. 

Mentoring

I have been involved in mentoring for AI Safety research several times with the MIRI Summer Fellows Program (AKA the AI Summer Fellows Program, depending on the year). I have also mentored students through PIBBSS, CLR, and SERI MATS.

I don't have good data on how effective I am as a mentor, however.

Detailed Research Plan

I should add that I wouldn't consider this funding to be an absolute commitment to work exclusively on the tiling problem over the next year. I regularly explore other research problems related to AI safety. It is possible that I change my mind about priorities and work on a different line of research (still related to AI safety). However, I am quite confident that I'll do dedicated work on the tiling problem for at least 3 months, and I would give a 75% probability that I will dedicate at least 60% of my time to tiling problems over the next year.

Concerning other lines of research, I expect with 75% probability that I will spend at least 30% of my time on nearer-term work focused on applying my understanding of tiling so far (incomplete as it may be) to LLMs. This includes work such as "o1 is a bad idea" and "why not just shoggoth+face?"

Motivation

In the big picture, tiling seems like perhaps the single most well-motivated approach to theoretical AI safety: it allows us to directly formally address questions about when humans can justifiably trust AIs. However, extremely few people are directly working on this approach. Indeed, out of all the people who have worked on tiling in the past, the only person I’m currently aware of who continues to work on this is myself.

I think part of this is about timelines. Tiling results remain very theoretical. I am optimistic that significant improvements to tiling results are possible with some focused attention, but I think there is a long way to go before tiling results will be realistic enough to offer concrete advice which helps with the real problems we face. (At least, a long way to go to yield “non-obvious” advice.)

However, it still seems to me like more people should be working on this approach overall. There are many unpublished results and ideas, so my major aim in 2025 will be to get some of these things into a shape fit to publish, and disseminate the knowledge.

Tiling Overview

The basic idea of tiling is that an agent architecture is “no good” in some sense if we can show that an agent designed according to that architecture would self-modify to something which does not match the architecture, given the choice. This is a highly fruitful sort of coherence requirement to impose, in the sense that it rules out a lot of proposals, including many existing decision theories.

One motivation for this criterion is for building robust AI systems: if an AI system would remove its safety constraints given the chance, then we are effectively in a fight with it (we have to avoid giving it a chance). Hence, even if we don’t plan to allow the AI to self-modify, it seems wise to build safety precautions in a way which we can show to be self-preserving. This perspective on tiling focuses on AI-AI tiling.

A broader motivation is to study the structure of trust. AI alignment is, from this perspective, the study of how to build trustworthy systems. Tiling is the study of how one mindlike thing can trust another mindlike thing. If we can make tiling theorems sufficiently realistic, then we can derive principles of trust which can provide guidance about how to create trustworthy systems in the real world. This perspective focuses on human-AI tiling.

A third motivation imposes tiling as a constraint on decision theories, as a way of trying to understand rationality better, in the hopes of subsequently using the resulting decision theory to guide our decision-making (eg, with respect to AI risk). Tiling appears to be a rather severe requirement for decision theories, so (provided one is convinced that tiling is an important consistency requirement) it weeds out a lot of bad answers that might otherwise seem good. Novel decision-theoretic insights might point to crucial considerations which would otherwise have been missed.

Fourth, studying tiling might yield insights about corrigibility. The idea of corrigibility is that AI systems should allow themselves to be modified by their creators to correct problems, which seems intuitively difficult to combine with rational decision-making due to instrumental convergence. Stuart Armstrong and others have shown that corrigibility is indeed difficult to achieve in a rational-agent framework.

By studying conditions under which AI systems can justifiably trust humans, we may gain further insights into corrigibility. Just as we don't want an AI system to have an incentive to interfere with the decision-making of its future self, we similarly don't want it to have an incentive to interfere with the decision-making of humans. If desirable AI-AI tiling properties can be engineered, perhaps corrigibility is possible as well.

Logical Uncertainty

Wei Dai's Updateless Decision Theory (UDT) admits a simple tiling result. However, this simple result does not respect the Vingean Principle which Eliezer Yudkowsky identified as a key desideratum for tiling results: an agent should be able to trust its future self without being able to reason about its future actions in detail. Trusting a process when you already know the output of that process is trivial, but not realistically applicable to the problems we want tiling results for. (It only gives us advice about how to trust AI systems if we know exactly what they'll do in every situation that might arise.)

Instead, we need a more abstract kind of trust (which we might call Vingean trust). This trust must arise from understanding something about how decisions are made: an AI system should be able to trust its future self based on the knowledge that its future self has the same goals in mind when making decisions. We can trust a modern chess AI to win at chess, even though we can't predict precisely which moves it will make.
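
One common way to schematize the contrast (my own paraphrase, not a result claimed in this proposal): write $\mathrm{Act}(B)$ for the action the successor $B$ will take, and $\mathrm{Good}(a)$ for "action $a$ serves the shared goals". Then

$$\text{Naive trust:} \quad \text{compute } a^{*} = \mathrm{Act}(B) \text{ and check } \mathrm{Good}(a^{*}).$$
$$\text{Vingean trust:} \quad \text{establish } \forall a : \big(\mathrm{Act}(B) = a\big) \Rightarrow \mathrm{Good}(a) \text{ without computing } \mathrm{Act}(B).$$

The second form relies only on knowledge of how $B$ makes decisions, which is what makes it usable when the successor's reasoning cannot be simulated in detail.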

This involves a type of reasoning called logical uncertainty (although in hindsight, I believe “computational uncertainty” would have been a clearer term). This refers to the kind of uncertainty which can be reduced simply by thinking longer, without making empirical observations; for example, the kind of uncertainty expressed by a mathematical conjecture.

MIRI’s Logical Induction gives us a mathematical framework which slightly generalizes Bayesianism, to address logical uncertainty in a more suitable way. This gives rise to a “bounded rationality” perspective, where agents are not perfectly rational, but avoid a given class of easily recognizable inconsistencies.
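
A rough gloss of that consistency notion (my summary of the published logical induction criterion, not language from this proposal): a sequence of belief states $\mathbb{P}_1, \mathbb{P}_2, \ldots$ over logical sentences qualifies as a logical inductor iff

$$\forall\, T \in \text{poly-time traders}:\ \neg\big[\, T\text{'s plausible holdings against } (\mathbb{P}_n) \text{ are bounded below but unbounded above}\,\big],$$

i.e., no efficiently computable betting strategy can extract unbounded potential gains from the beliefs-as-prices while risking only bounded losses.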

We have a limited tiling result for logical-induction-based decision theory (due to Sam). I hope to significantly improve upon this by building on research into logically uncertain UDT which Martin Soto and I collaborated on last summer (none of which has been published yet). UDT has been very resistant to attempts to combine it with logical uncertainty so far. However, I believe that I am close to a solution to this problem. This would finally yield a realistically-applicable decision theory with a satisfying formal tiling result.

This area contains the clearest open problems and the lowest-hanging fruit for new tiling theorems.

However, I can imagine this area of research getting entirely solved without yet providing significant insight into the human-AI tiling problem (that is, the AI safety problem). My intuition is that it primarily addresses AI-AI tiling, and specifically the case where the "values" of the AI are entirely pinned down in a strong sense. Therefore, to derive significant insights about AI risks, it seems important to generalize tiling further, including more of the messiness of the real-world problems we face.

Value Uncertainty

Open-minded updatelessness allows us to align with an unknown prior, in the same way that regular updatelessness allows us to align with a known prior. 

Specifying agents who are aligned with unknown values is the subject of value learning and Stuart Russell's alignment program, focusing on assistance games (CIRL).

If we combine the two, we get a general notion of aligning with unknown preference structures. This gives us a highly general decision-theoretic concept which I hope to formally articulate and study over the next year.

In particular, with traditional uncertainty, we can model an AI which is uncertain about human values and is trying to learn them from humans; however, the humans themselves have to know their own values (as is assumed in CIRL). With open-minded uncertainty, I think there will be a much better picture of AIs aligning with humans who are themselves uncertain about their own values.

Another important research thread here is how to integrate the insights of Quantilization; softly optimizing imperfect proxy values seems like a critical safety tool. Infrabayesianism appears to offer important theoretical insight into making quantilizer-like strategies tile.

Value Plurality

Here’s where we bring in the concerns of Critch’s negotiable reinforcement learning: an agent aligned to multiple stakeholders whose values may differ from each other. Several directions for moving beyond Critch’s result present themselves:

  • Making it into a proper ‘tiling’ result by using tools from previous subsections; Critch only shows Pareto-optimality, but we would like to show that we can trust such a system in a deeper sense.

  • Combining tools outlined in previous subsections to analyze realistic stakeholders who don’t fully know what they want or what they believe, and who are themselves boundedly rational.

  • Using concepts from bargaining theory and voting theory / social choice theory to go beyond Pareto-optimality and include notions of fairness. In particular, we would like to ensure that the outcome is not catastrophic with respect to the values of any of the stakeholders. We also need to care about how manipulable the values are under 'strategic voting' and how this impacts the outcome.

  • Is there an appealing formalism for multi-stakeholder Quantilization?

In terms of general approach: I want to take the formalism of logical induction and try to extend it to recognize the “traders” as potentially having their own values, rather than only beliefs. This resembles some of the ideas of shard theory.

Ontology Plurality

Critch’s negotiable RL formalism not only assumes that the beliefs and values of the stakeholders are known; it also assumes that the stakeholders and the AI agent all share a common ontology in which these beliefs and values can be described. To meaningfully address problems such as ontological crisis, we need to move beyond such assumptions, and model “where the ontology comes from” more deeply.

My ideas here are still rather murky and speculative. One thread involves extending the market metaphor behind logical induction, to model new stocks being introduced to the market. Another idea involves modeling the market structure using linear logic.

More Monetary Details

Funding Tiers

If this project achieves less than my funding target, I will exercise my own judgement, choosing between considering myself funded for a shorter period (at the same monthly rate) or funded for the full one-year period proposed (at a lower monthly rate). I expect I can maintain a similar quality of life for 100k a year, making some compromises.

Similarly, if this project gets more funding than my target, I will exercise my own judgement between being funded for longer vs. at a higher rate. With more money, I might be able to afford a dedicated office space (my office space currently takes up half of my bedroom, which has significant disadvantages). This and other changes might significantly improve my productivity. I don't expect significant improvements of this kind beyond 180k/year, so with funding beyond that level I would expect to consider myself funded for more than a year.

Other Fundraising

I applied to SFF and LTFF mid-summer, around the time when I lost my MIRI position. I received 10k in speculation granting from SFF, but they ultimately decided not to fund my application. I haven't received a definite answer from LTFF yet. If they approve my grant request after I've gotten money here, I will avoid accepting funding from two sources for overlapping funding periods, by either moving funding periods to be non-overlapping, or rejecting some money from one or both sources.

I also started a Patreon around that time, which is currently getting around 2k/month. I will treat extra income from Patreon similarly to how I would treat more-than-target funding here, using it flexibly to either extend the grant period or increase the monthly salary. I will also let my Patreon donors know about this funding source, so that they can make informed decisions about whether to keep giving me money.

If I decide to agree to another form of employment during the period when I am funded by this grant, I will return money to donors reflecting the fraction of the grant's funding period during which I am otherwise employed.

What are the most likely causes and outcomes if this project fails?

The biggest problem with this line of research, at least in my view, is that progress has been slow, and is liable to continue to be slow. Many people think that AI timelines are short, on the order of a few more years until superintelligence. It is plausible that tiling-centric AI safety research will be too slow for this deadline.

Scaling up the research program may help significantly, especially if we get top talent. However, the timeline from this type of abstract research to concrete applications is frequently measured in decades.

It is also possible that AI itself will soon be competent enough to accelerate this type of research significantly. Again, though, by that time it may already be too late.


Comments

Austin Chen

1 day ago

Approving this project! While I'm not very familiar with this field of research myself, Abram is well-regarded in the AI safety space; I've spoken with another funder I trust who was interested in supporting Abram's work.

I'm also glad that Manifund can help out here, by serving as a simple, low-cost option for fiscal sponsorship!