

Synthesizing Standalone World-Models

Science & technology · Technical AI safety · Global catastrophic risks

Thane Ruthenis

Proposal · Grant
Closes November 1st, 2025
$0 raised · $500 minimum funding · $200,000 funding goal


Motivation: Suppose we're living in a world in which none of the alignment methods refined against frontier LLMs scale to ASI. In such a world, we're not on track to solve the alignment problem. There's too much non-parallelizable research to do, too few people competent to do it, and not enough time. So perhaps we... shouldn't try. Instead, we can try using adjacent theory to produce a different tool powerful enough to get us out of the current mess. Ideally, without having to directly deal with AGIs/agents at all.

Concrete goal: develop methods for constructing sufficiently powerful, safe, easily interpretable, well-structured world-models.

  1. "Sufficiently powerful": the world-model contains or can be used to generate knowledge sufficient to resolve our AGI-doom problem, such as recipes for comprehensive mechanistic interpretability, mind uploading, or adult intelligence enhancement, or for robust solutions to alignment directly.

  2. "Safe": it's not embedded in a superintelligent agent eager to eat our lightcone, and it also doesn't spawn superintelligent simulacra eager to eat our lightcone, and doesn't cooperate with acausal terrorists eager to eat our lightcone, and isn't liable to Basilisk-hack its human operators into prompting it to generate a superintelligent agent eager to eat our lightcone, and so on down the list.

  3. "Easily interpretable": it's written in some symbolic language, such that interpreting it is in the reference class of "understand a vast complex codebase" combined with "learn new physics from a textbook", not "solve major philosophical/theoretical problems".

  4. "Well-structured": has an organized top-down hierarchical structure, learning which lets you quickly navigate to specific information in it.

Some elaborations:

Safety: The problem of making such world-models safe is fairly nontrivial: a world-model powerful enough to be useful would need to be a strongly optimized construct, and strongly optimized things are inherently dangerous, agent-like or not. There's also the problem of what exerted this strong optimization pressure in the first place: we would need to ensure the process synthesizing the world-model isn't itself the type of thing to develop an appetite for our lightcone.

But I'm cautiously optimistic this is achievable in this narrow case. Intuitively, it ought to be possible to generate just an "inert" world-model, without a value-laden policy (an agent) on top of it.

That said, this turning out to be harder than I expect is certainly one of the reasons I might end up curtailing this agenda.

Interpretability: There are two primary objections I expect here.

  • "This is impossible, because advanced world-models are inherently messy". I think this is confused/wrong, because there's already an existence proof: a human's world-model is symbolically interpretable by the human mind containing it. More on that later.

  • "(Neuro)symbolic methods have consistently failed to do anything useful". I'll address that below too, but in short, neurosymbolic methods fail because it's a bad way to learn: it's hard to traverse the space of neurosymbolic representations in search of the right one. But I'm not suggesting a process that "learns by" symbolic methods, I'm suggesting a process that outputs a symbolic world-model.


Why Do You Consider This Agenda Promising?

On the inside view, this problem and the subproblems it decomposes into seem pretty tractable. Importantly, it seems tractable using a realistic amount of resources (a small group of researchers, then perhaps a larger-scale engineering effort for crossing the theory-practice gap), in a fairly short span of time (I optimistically think 3-5 years; under a decade definitely seems realistic).

(You may think a decade is too slow given LLM timelines. Caveat: "a decade" is the pessimistic estimate under my primary, bearish-on-LLMs, model. In worlds in which LLM progress goes as fast as some hope/fear, this agenda should likewise advance much faster, for one reason: it doesn't seem that far from being fully formalized. Once it is, it would become possible to feed it to narrowly superintelligent math AIs (which are likely to appear first, before omnicide-capable general ASIs), and they'd cut years of math research down to ~zero.

I do not centrally rely on/expect that. I don't think LLM progress would go this fast; and if LLMs do speed up towards superintelligence, I'm not convinced it would be in the predictable, on-trend way people expect.

That said, I do assign nontrivial weight to those worlds, and care about succeeding in them. I expect this agenda to fare pretty well there.)

On the outside view, almost nobody has been working on this, and certainly not using modern tools. Meaning, there's no long history of people failing to solve the relevant problems. (Indeed, on the contrary: one of its main challenges is something John Wentworth and David Lorell are working on, and they've been making very promising progress recently.)

On the strategic level, I view the problem of choosing the correct research agenda as the problem of navigating between two failure modes:

  • Out-of-touch theorizing: If you pick a too-abstract starting point, you won't be able to find your way to the practical implementation in time. (Opinionated example: some of the agent-foundations agendas.)

  • Blind empirical tinkering: If you pick a too-concrete starting point, you won't be able to generalize it to ASI in time. (Opinionated example: some of the agendas focused on frontier LLMs.)

I think most alignment research agendas, if taken far enough, do produce ASI-complete alignment schemes eventually. However, they significantly differ in how long it takes them, and how much data they need. Thus, you want to pick the starting point that gets you to ASI-complete alignment in as few steps as possible: with the least amount of concretization or generalization.

Most researchers disagree with most others regarding what that correct starting point is. Currently, this agenda is mine. 


High-Level Outline

As I'd stated above, I expect significant parts of this to turn out confused, wrong, or incorrect in a technical-but-not-conceptual way. This is a picture painted with a fairly broad brush.

I am, however, confident in the overall approach. If some of its modules/subproblems turn out faulty, I expect it'd be possible to swap them for functional ones as we go.

Theoretical Justifications

1. Proof of concept. Note that human world-models appear to be "autosymbolic": able to be parsed as symbolic structures by the human mind in which they're embedded. Given that the complexity of things humans can reason about is strongly limited by their working memory, how is this possible?

Human world-models rely on chunking. To understand a complex phenomenon, we break it down into parts, understand the parts individually, then understand the whole in terms of the parts. (The human biology in terms of cells/tissues/organs, the economy in terms of various actors and forces, a complex codebase in terms of individual functions and modules.)

Alternatively, we may run this process in reverse. To predict something about a specific low-level component, we could build a model of the high-level state, then propagate that information "downwards", but only focusing on that component. (If we want to model a specific corporation, we should pay attention to the macroeconomic situation. But when translating that situation into its effects on the corporation, we don't need to model the effects on all corporations that exist. We could then narrow things down further, to e. g. predict how a specific geopolitical event impacted an acquaintance holding a specific position at that corporation.)

Those tricks seem to work pretty well for us, both in daily life and in our scientific endeavors. It seems that the process of understanding and modeling the universe can be broken up into a sequence of "locally simple" steps: steps which are simple given all preceding steps. Simple enough to fit within a human's working memory.
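
To make the chunking picture concrete, here is a toy sketch of a "query it top-down, expand only what you need" hierarchy. All class and attribute names are invented for illustration; nothing here is meant as the actual representation.

```python
# Toy sketch of a chunked world-model: each node stores a high-level summary
# of its children, and a query only expands the path it actually needs.
class Chunk:
    def __init__(self, name, summary, children=None):
        self.name = name          # e.g. "economy", "corp_X", "employee_Y"
        self.summary = summary    # this chunk's high-level state
        self.children = children or {}

    def query(self, path, context=None):
        """Propagate high-level context downwards along the requested path only."""
        context = (context or []) + [self.summary]  # refine context with this level's state
        if not path:
            return context                          # reached the component we care about
        head, *rest = path
        return self.children[head].query(rest, context)

world = Chunk("world", "macro state", {
    "economy": Chunk("economy", "recession", {
        "corp_X": Chunk("corp_X", "cutting costs", {
            "employee_Y": Chunk("employee_Y", "at risk of layoff"),
        }),
    }),
})

print(world.query(["economy", "corp_X", "employee_Y"]))
# -> ['macro state', 'recession', 'cutting costs', 'at risk of layoff']
# Sibling chunks (other corporations, other employees) are never touched.
```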

To emphasize: the above implies that the world's structure has this property at the ground-true level. The ability to construct such representations is an objective fact about data originating from our universe; our universe is well-abstracting.

The Natural Abstractions research agenda is a formal attempt to model all of this. In its terms, the universe is structured such that low-level parts of the systems in it are independent given their high-level state. Flipping it around: the high-level state is defined by the information redundantly represented in all low-level parts.
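
One way to write that claim down, as a rough sketch in my own notation (not necessarily the agenda's official definitions):

```latex
% Low-level variables X_1, \dots, X_n of a system; high-level state \Lambda.
% "Low-level parts are independent given their high-level state":
P(X_1, \dots, X_n \mid \Lambda) \;=\; \prod_{i=1}^{n} P(X_i \mid \Lambda).
% Flipped around: \Lambda carries only information that is redundantly
% represented across the parts, so it is (approximately) recoverable
% from most sufficiently large subsets S of the low-level variables:
H(\Lambda \mid X_S) \;\approx\; 0 \quad \text{for most } S \subseteq \{1, \dots, n\},\ |S| \gg 1.
```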

That greatly simplifies the task. Instead of defining some subjective, human-mind-specific "interpretability" criterion, we simply need to extract this objectively privileged structure. How can we do so?

2. Compression. Conceptually, the task seems fairly easy. The kind of hierarchical structure we want to construct happens to also be the lowest-description-length way to losslessly represent the universe. Note how it would follow the "don't repeat yourself" principle: at every level, higher-level variables would extract all information shared between the low-level variables, such that no bit of information is present in more than one variable. More concretely, if we wanted to losslessly transform the Pile into a representation that takes up the least possible amount of disk space, a sufficiently advanced compression algorithm would surely exploit various abstract regularities and correspondences in the data – and therefore, it'd discover them.

So: all we need is to set up a sufficiently powerful compression process, and point it at a sufficiently big and diverse dataset of natural data. The output would be isomorphic to a well-structured world-model.

... If we can interpret the symbolic language it's written in.

The problem with neural networks is that we don't have the "key" for deciphering them. There might be similar neat structures inside those black boxes, but we can't get at them. How can we avoid this problem here?

By defining "complexity" as the description length in some symbolic-to-us language, such as Python.

3. How does that handle ontology shifts? Suppose this symbolic-to-us language turns out to be suboptimal for compactly representing the universe. The compression process would want to use some other, more "natural" language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.

The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks. Although they'd technically still be symbolic (matrix multiplication plus activation functions), every parameter of the network would have to be specified independently, counting towards the definition's total complexity. If the core idea regarding the universe's "abstraction-friendly" structure is correct, this can't be the cheapest way to define it. As such, the "bridge" between the symbolic-to-us language and the correct alien ontology would consist of locally simple steps.
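
To put rough numbers on the neural-network comparison (the figures below are mine and purely illustrative):

```latex
% Defining the "natural language" as a neural network with N independently
% specified parameters at b bits of precision costs about N \cdot b bits:
10^{9} \ \text{params} \times 16 \ \text{bits} \;\approx\; 1.6 \times 10^{10} \ \text{bits} \;\approx\; 2 \ \mathrm{GB}.
% A chain of locally simple symbolic definitions costs roughly
% (number of definitions) \times (bits per definition):
10^{4} \ \text{definitions} \times 10^{3} \ \text{bits} \;=\; 10^{7} \ \text{bits} \;\approx\; 1.25 \ \mathrm{MB}.
```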

Alternate frame: Suppose this "correct" natural language is theoretically understandable by us. That is, if we spent some years/decades working on the problem, we would have managed to figure it out, define it formally, and translate it into code. If we then looked back at the path that led us to insight, we would have seen a chain of mathematical abstractions from the concepts we knew in the past (e. g., 2025) to this true framework, with every link in that chain being locally simple (since each link would need to be human-discoverable). Similarly, the compression process would define the natural language using the simplest possible chain like this, with every link in it locally easy-to-interpret.

Interpreting the whole thing, then, would amount to: picking a random part of it, iteratively following the terms in its definition backwards, arriving at some locally simple definition that only uses the terms in the initial symbolic-to-us language, then turning around and starting to "step forwards", iteratively learning new terms and using them to comprehend more terms.

I. e.: the compression process would implement a natural "entry point" for us, a thread we'd be able to pull on to unravel the whole thing. The remaining task would still be challenging – "understand a complex codebase" multiplied by "learn new physics from a textbook" – but astronomically easier than "derive new scientific paradigms from scratch", which is where we're currently at.
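
A toy sketch of that "pull on a thread" workflow, treating the world-model as a DAG of definitions. Every name below is invented for illustration.

```python
# Walk one term's definition backwards until it bottoms out in the known
# symbolic-to-us vocabulary, then return the terms in the order they should
# be learned when "stepping forwards" again.
def trace_back(term, definitions, known):
    order, seen = [], set()

    def visit(t):
        if t in known or t in seen:
            return
        seen.add(t)
        for dep in definitions[t]:   # terms used in t's definition
            visit(dep)
        order.append(t)              # t is "locally simple" given its deps

    visit(term)
    return order

# Hypothetical fragment of a synthesized world-model:
definitions = {
    "grand_concept": ["mid_concept_a", "mid_concept_b"],
    "mid_concept_a": ["python_float", "python_list"],
    "mid_concept_b": ["mid_concept_a", "python_dict"],
}
known = {"python_float", "python_list", "python_dict"}

print(trace_back("grand_concept", definitions, known))
# -> ['mid_concept_a', 'mid_concept_b', 'grand_concept']
```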

(To be clear, I still expect a fair amount of annoying messiness there, such as code-golfing. But this seems like the kind of problem that could be ameliorated by some practical tinkering and regularization, and other "schlep".)

4. Computational tractability. But why would we think that this sort of compressed representation could be constructed compute-efficiently, such that the process finishes before the stars go out (forget "before the AGI doom")?

First, as above, we have existence proofs. Human world-models seem to be structured this way, and they are generated at fairly reasonable compute costs. (Potentially at shockingly low compute costs. The numbers in that post feel somewhat low to me, but I think it's directionally correct.)

Second: Any two Turing-complete languages are mutually interpretable, at the flat complexity cost of the interpreter (which depends on the languages but not on the program). As a result, the additional computational cost of interpretability – of computing a translation to the hard-coded symbolic-to-us language – would be flat.
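
This is the standard invariance theorem from algorithmic information theory, which is the fact the "flat cost" claim leans on:

```latex
% For any two Turing-complete languages A and B there exists a constant
% c_{A,B} -- the length of an interpreter for B written in A -- that does
% not depend on the object x being described:
K_A(x) \;\le\; K_B(x) + c_{A,B} \quad \text{for all } x.
```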

5. How is this reconciled with the failures of previous symbolic learning systems? That is: if the universe has this neat symbolic structure that could be uncovered in compute-efficient ways, why didn't pre-DL approaches work?

This essay does an excellent job explaining why. To summarize: even if the final correct output would be (isomorphic to) a symbolic structure, the compute-efficient path to getting there, the process of figuring that structure out, is not necessarily a sequence of ever-more-correct symbolic structures. On the contrary: if we start from sparse hierarchical graphs and add provisions for making it easy to traverse their space in search of the correct graph, we pretty quickly arrive at (more or less) neural networks.

However: I'm not suggesting that we use symbolic learning methods. The aim is to set up a process which would output a highly useful symbolic structure. How that process works, what path it takes there, how it constructs that structure, is up in the air.

Designing such a process is conceptually tricky. But as I argue above, theory and common sense say that it ought to be possible; and I do have ideas.


Subproblems

The compression task can be split into three subproblems. Below are their summaries; see extended descriptions here.

Summaries:

1. "Abstraction-learning". Given a set of random low-level variables which implement some higher-level abstraction, how can we learn that abstraction? What functions map from the molecules of a cell to that cell, from a human's cells to that human, from the humans of a given nation to that nation; or from the time-series of some process to the laws governing it?

As mentioned above, this is the problem the natural-abstractions agenda is currently focused on.

My current guess is that, at the high level, this problem can be characterized as a "constructive" version of Partial Information Decomposition. It involves splitting (every subset of) the low-level variables into unique, redundant, and synergistic components.
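
For reference, the standard PID bookkeeping identity for two low-level variables X_1, X_2 and a target Y (here, playing the role of the higher-level abstraction); the "constructive" version would learn explicit variables carrying each component, rather than merely measuring their sizes:

```latex
I(X_1, X_2 ; Y) \;=\; \mathrm{Unq}(X_1 ; Y) + \mathrm{Unq}(X_2 ; Y)
                 + \mathrm{Red}(X_1, X_2 ; Y) + \mathrm{Syn}(X_1, X_2 ; Y),
% where Red is the information about Y available in either variable alone,
% Unq is what only one of the two carries, and Syn is visible only jointly.
% The redundant component is the candidate for the natural abstraction.
```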

Given correct formal definitions for unique/redundant/synergistic variables, it should be straightforwardly solvable via machine learning.
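
As a gesture at what "solvable via machine learning" might look like, here's a minimal sketch that trains per-variable encoders to agree with each other, i.e. to keep only redundantly represented information. The architecture and loss are stand-ins of my own, not the agenda's method; as the final comment notes, a real objective would need the correct formal definitions to avoid degenerate solutions.

```python
# Minimal sketch: learn the redundant component of two low-level variables
# by training separate encoders whose outputs must agree.
import torch
import torch.nn as nn

class Abstractor(nn.Module):
    """Maps one low-level variable to a candidate high-level summary."""
    def __init__(self, dim_in, dim_abs):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU(), nn.Linear(64, dim_abs))
    def forward(self, x):
        return self.net(x)

def redundancy_loss(summaries):
    """Stand-in objective: summaries computed from *different* low-level
    variables should agree, i.e. capture only shared information."""
    mean = torch.stack(summaries).mean(dim=0)
    return sum(((s - mean) ** 2).mean() for s in summaries)

# Toy data: two noisy "views" x1, x2 sharing a latent (the abstraction).
latent = torch.randn(512, 1)
x1 = torch.cat([latent + 0.1 * torch.randn(512, 1), torch.randn(512, 3)], dim=1)
x2 = torch.cat([latent + 0.1 * torch.randn(512, 1), torch.randn(512, 3)], dim=1)

f1, f2 = Abstractor(4, 1), Abstractor(4, 1)
opt = torch.optim.Adam(list(f1.parameters()) + list(f2.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = redundancy_loss([f1(x1), f2(x2)])
    loss.backward()
    opt.step()
# Caveat: without an extra term preserving information about the inputs,
# this collapses to a constant summary; the correct formal definitions of
# unique/redundant/synergistic components are exactly what's missing here.
```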

Current status: the theory is well-developed and it appears highly tractable.

2. "Truesight". When we're facing a structure-learning problem, such as abstraction-learning, we assume that we get many samples from the same fixed structure. In practice, however, the probabilistic structures are themselves resampled.

Examples:

  • The cone cells in your eyes connect to different abstract objects depending on what you're looking at, or where your feet carry you.

  • The text on the frontpage of an online newsletter is attached to different real-world structures on different days.

  • The glider in Conway's Game of Life "drifts across" cells in the grid, rather than being an abstraction over some fixed set of them.

  • The same concept of a "selection pressure" can be arrived at by abstracting from evolution or ML models or corporations or cultural norms.

  • The same human mind can "jump substrates" from biological neurons to a digital representation (mind uploading), while still remaining "the same object".

I. e.,

  • The same high-level abstraction can "reattach" to different low-level variables.

  • The same low-level variables can change which high-level abstraction they implement.

On a sample-to-sample basis, we can't rely on any static abstraction functions to be valid. We need to search for appropriate ones "at test-time": by trying various transformations of the data until we spot the "simple structure" in it.

Here, "simplicity" is defined relative to the library of stored abstractions. What we want, essentially, is to be able to recognize reoccurrences of known objects despite looking at them "from a different angle". Thus, "truesight".

Current status: I think I have a solid conceptual understanding of it, but it's at the pre-formalization stage. There's one obvious way to formalize it, but it seems best avoided, or only used as a stepping stone.

3. Dataset-assembly. There's a problem:

  • Solving abstraction-learning requires truesight. We can't learn abstractions if we don't have many samples of the random variables over which they're defined.

  • Truesight requires already knowing what abstractions are around. Otherwise, the problem of finding simple transformations of the data that make them visible is computationally intractable. (We can't recognize reoccurring objects if we don't know any objects.)

Thus, subproblem 3: how to automatically spot ways to slice the data into datasets whose entries are isomorphic to samples from some fixed probabilistic structure, making them suitable for abstraction-learning.

Current status: basically greenfield. I don't have a solid high-level model of this subproblem yet, only some preliminary ideas.


Funding Breakdown

I currently reside in a country with low costs of living, and I don't require much compute at this stage, so the raw resources needed are small; e. g., $40k would cover me for a year. That said, my not residing in the US increasingly seems like a bottleneck on collaborating with other researchers. As such, I'm currently aiming to build up a financial safety cushion, then immigrate there. Funding would be useful up to $200k.

Track Record

I've been working as an independent alignment researcher since 2022, largely funded by the Long-Term Future Fund. See my LW profile.

The agent-foundations space is sparse on legible accolades, but I've won prizes at ARC's ELK contest (judges: Paul Christiano, Mark Xu) and the AI Alignment Awards contest (judges: Nate Soares, John Wentworth, Richard Ngo).

If you want a reference, reach out to John Wentworth.
