Alignment Is Hard

Project summary

Alexander Bistagne is seeking funding to formalize the following argument:

If we can't tell that a black box program will loop forever, How could we verify that a black box agent will care about us forever instead of eventually betraying us? Answer: We can't in the worst case.

In order to make the minimum formal argument, this project needs to clearly articulate a large quantity of assumptions about what we mean by betrayal, a black box agent, time, the universe and other things. It will do this through the language of theoretical computer science: Turing Machines and Formal language theory. After defining everything appropriately, this project will reduce the complement of the halting problem to testing the alignment of a black box agent. I previously made decent progress on this project, but got sidetracked because the task of finding a job took priority. I think this project is feasible because I have made creative attempts at alignment before with a school project labeled suicidal AI, and I have attempted complexity theorems(see LvsNC1 on my github)

Project goals

The goal of this project is to convincingly argue that Alignment is Hard. My motivation for this is that; if we can convince the theoretical computer science community that this problem is important, they might be able to develop enough theory to do something or have enough power to throw the brakes on AGI projects. The concrete deliverable is to get the claim "Alignment is Hard" published in places that the current AGI designers might respect.

Concrete steps to acheive those goals

Type up the already existing rough draft.

Edit the rough draft to remove discussions about cosmology.

Submit a draft to the Alignment Forum.

Review & incorporate feedback I get on Less Wrong.

Submit the project/paper to FOCS or other conferences on theoretical computer science.

Work on removing agents' immutability assumption.

Work on stop button variation of this problem.

How will this funding be used?

Basic Project: 6,000 USD

5,000- 1 month salary

500- Payroll insurance & Taxes

500- Travel & Board for conferences.

This would let me work on this project as my top priority for 1 month. I would submit the core paper to a theoretical CS conference.

Organized Project: 40,000 USD

30,000- 6 months salary

3,000- Payroll insurance & Taxes

4,000- 26 meetings with an Editor/Mentor

3,000- Travel & Board for conferences

More funding would allow me to attempt to weaken or vary the assumptions and make more general claims about alignment. The most notable claims to weaken are around mutability of an agent’s code and the agent being in an enclosed space. Other angles to approach the problem are from agents with known utility function instead of known architecture.

What is your track record on similar projects?

I think the following projects outline that I am capable of thinking about AI in creative ways, finishing academic papers when deadlines are present, and understanding the deep mathematical field of computational complexity theory.

Suicidal AI paper https://drive.google.com/file/d/1ElnLoRbEfsAXNymIYg1nnkOisYCegSCm/view?usp=drive_link - This paper in Lise Getoor’s class Algorithms & Ethics won the most creative award. It attempted to provide counterexample style evidence against the claim that all agents want to live.

Matching Algorithms paper https://drive.google.com/file/d/1HYYfSjl38LnyBPeeBX43KtyvnMq-AoMP/view?usp=sharing - This paper was for my senior seminar in mathematics at UCSC. It mostly summarizes the topic for an undergraduate math audience.

Complexity Theorem Attempt LvsNC1

https://github.com/Alexhb61/Sorting_Algorithms/blob/main/LvsNC1_attempt - This rough draft aims to prove a nontrivial complexity theory claim, but has an unproven assumption and is thus incomplete.

Singular Value Decomposition attempt https://github.com/Alexhb61/Matrix_Algorithms/blob/main/SVD_draft1.pdf - This rough draft aimed to prove a nontrivial claim about the complexity of basic mathematical operations; specifically that singular value decomposition can be done in subcubic time in the theoretical setting. This paper needs a lot of polish.

TSort https://github.com/Alexhb61/Sorting_Algorithms/blob/main/truncation_sorts/Tsort

This rough pseudocode outlines an algorithm with sub n(logn) computational complexity for numbers with significantly less than n bits.

Cliques Via Cuts

https://github.com/Alexhb61/graph_algorithms/blob/main/Cliques_via_cuts.pdf

An ambitious attempt at disproving the strong exponential time hypothesis. Recently made into a shareable rough draft. This is more evidence of computational complexity work.

How could this project be actively harmful?

This research might convince people to give up on aligning black box agents instead of giving up on making black box agents.

What other funding is this person or project getting?

None at the moment. I have applied to the nonlinear network, but did not recieve any concrete indication that funding was available.