PIBBSS (pibbss.ai) draws on diverse fields to foster and incubate novel approaches for AI safety that bridge between ‘theory’ – from mathematics, physics, complex systems, neuroscience, information theory, and more – and ML ‘practice’ towards an integrated Science of AI. Our current focus is on research which aims at maturing a science of deep learning capable of providing rigorous theoretical grounding for understanding-based white-box safety audits. We do so by i) actively scouting for novel research bets and world-class talent in technical domains that we deem promising and neglected by the alignment community, and ii) providing institutional, research & strategic support to researchers at the PhD-to-Professorship level, typically on 4-12 month contracts with the possibility of renewal upon satisfactory performance. Zooming out slightly, we understand our affiliate program to be an effort to institutionalize the diversification of research bets in AI Safety which are theoretically principled and empirically grounded.
We are looking for funding for research affiliates' salaries, including relevant research-related costs (e.g., office space, compute), staff costs, and general running expenses. We have room for up to ~$3,000,000 in funding, but our primary funding goal is $753,500, covering 6 affiliates for 6 months.
In order to safely steer humanity through the unfolding AI transition, we need to make informed R&D and policy decisions concerning the risks posed by both current and near-future AI systems. To do so, we must develop evaluations and safety measures which are reliable, trustworthy, and will generalize across models, architectures, and capability gains. But we cannot do either of these things without a mature Science of AI. (See Appendix A for more details on our strategic picture and threat models.)
At PIBBSS, we seek to foster theoretically ambitious and empirically grounded AI safety research. We believe theory is critical to adequately addressing AI risk and that progress in theory requires substantive and iterative empirical engagement.
As such, we are working on building out a group of research affiliates capable of integrating insights between ‘theory’ – from mathematics, physics, complex systems, neuroscience, information theory and more – and ML ‘practice’.
We provide institutional, research & strategic support to our affiliates by offering 4-12 month contracts, with the possibility of renewal upon satisfactory performance. We aim to have a mix of 6-12 researchers at the PhD to Professorship level[1] and have a track record of attracting talent at this level of seniority[2].
The support we provide to research affiliates is tailored to their individual needs, but its core aspects include:
A salary
Access to office space in major AI Safety hubs (such as London or Berkeley)
Access to compute and engineering infrastructure
Personalized research and strategic support
Access to a network of research peers and (potential) collaborators
Other research support scaffolds, such as progress reports, quarterly retreats, opportunities to present work in progress, travel support to visit relevant conferences or research teams, etc.
Operational and administrative support seeking to remove any hurdles preventing affiliates from focusing on their research
As a result of a successful engagement with us, we expect research affiliates to publish high-impact research that makes legible progress towards addressing key AI risks[3] and to be en route to continuing on this path, whether by renewing their affiliateship with us, founding new organizations, or joining other existing efforts.
As an effort towards institutionalizing the diversification of theoretical bets in AI Safety, we actively scout for novel research bets and world-class talent in technical domains that we deem promising and neglected by the alignment community. We have recently made this work more systematic and contracted additional researchers to strengthen our efforts. Our research scouting is anchored, on the one hand, in a nuanced understanding of the ‘open problems’ in AI safety and, on the other, in a breadth of knowledge and networks across scientific fields currently considered outside of AI Safety. We expect our research taste, and consequently our bets, to become increasingly refined over time.
Our current focus is on research that can develop and validate deep learning theory aimed at providing rigorous justification that interpretability techniques used in white-box evaluations faithfully represent the computation happening in the systems. This standard of evidence is necessary for closing the gap between behavioral and understanding-based evaluations, achieving what Hubinger (2022) refers to as “worst-case, training process transparency”.[4]
Past affiliates have worked on topics such as i. leveraging computational mechanics as a theoretically rigorous framework on which to build novel interpretability and evaluation methods; ii. tracking information theoretic measures across inference to detect deceptive behavior; iii. operationalizing the idea of emergent selection pressures occurring in deep learning systems over the course of training; and iv. developing a phenomenology of training dynamics by analytically solving idealized models. (We provide a more detailed summary of outcomes and research progress to date in the Evidence & Track record section.)
Over the last 3 years, we have built a unique and thriving research community across several layers of engagement, combining interdisciplinary expertise and providing a valuable bridge between AI Safety and academia. We have made a point of fostering a research culture that is sensitive to the epistemic challenges arising from the non-paradigmatic nature of AI safety, coupled with the appropriate philosophical nuance and methodological sensitivity.
We have a focused inside view which we are betting on. Rather than aiming to accelerate agendas which are already present in the field and have people who can lead them, we aim to produce new research leads who are pursuing novel bets. Not only does this diversify the field, it also helps relieve the mentorship shortage.[5]
(See Appendix B for more detail on how our research taste relates to other actors in the space.)
We are tapping into the significant potential of onboarding into AI safety established academics with deep expertise in areas that are relevant to, but as yet neglected by, the field. This approach has proved very advantageous (c.f. Developmental Interpretability, Computational Mechanics) but remains undervalued. We are advised by Alexander Gietelink Oldenziel, who has a strong track record in this type of work, having, among others, scouted Dan Murfet, thereby originating the agenda of Developmental Interpretability.
We are aiming at a higher-caliber talent demographic than other talent interventions in the field. Our distribution of researchers ranges from mid-PhD to Professor. (Also see notes 1 & 2.)
We actively address a common bottleneck in getting senior people from theoretical backgrounds into alignment: their lack of exposure to ML research engineering. We pair them with experienced ML research engineers[6] and introduce them to best practices for code maintenance that allow for scalable and reproducible research without compromising on the speed of empirical iteration.
We have a solid understanding of the AI Safety landscape backed by years of experience in the field, and are able to quickly get newcomers up to date. We do this in a personalized manner which allows us to quickly refine what would otherwise be misguided ideas from new folks coming into the field.
We are institutionalizing research taste & the diversification of research bets. Due to our close collaboration with technical experts in a wide variety of theoretical domains, we are able to continuously refine our research & talent-scouting portfolio. Having in-house technical management allows us to not only understand the key risks and promises specific to each research thread, but also to make connections between different theoretical domains which would otherwise have gone unnoticed by their respective specialists. This makes our research output greater than the sum of its parts.
In January 2024, we launched a 6-month “prototype” of the affiliate program with 5 research affiliates, two of them part-time. We are pleased with the results so far, which have reinforced our plans to build out the program, albeit with some changes to the original setup based on what we learnt.
(Note that the initial period hasn’t concluded yet. As such, the below provides only an intermediary snapshot of the results. We intend to publish a more comprehensive retrospective publicly towards the end of July.)
For ease of reading, we will summarize some of the highlights here:
Adam Shai has published the most upvoted research post of 2024 on LessWrong & Alignment Forum, sharing initial empirical results in support of Computational Mechanics as a fruitful framework for pushing the boundaries of AI interpretability. The work has been positively received and referenced in relevant literature. He and his collaborator Paul Riechers have secured, with our support, seed funding for a new AI Safety organization – Simplex – where they seek to further pursue this agenda. We also co-organized a successful research hackathon that made progress on a number of open problems (see recordings of the final presentations).
Fernando Rosas is confirmed to join PIBBSS as a research affiliate starting ~July 2024. He is a recognized leader in the study of multi-scale complex systems, with more than 125 peer-reviewed articles in relevant fields, including Nature Physics & Neuroscience as well as the Proceedings of the National Academy of Sciences. He is also a Lecturer in Informatics at the University of Sussex and a Research Fellow at Imperial College London and the University of Oxford. His work will focus on formalizing what “generalization” means in a rigorous way, building on recent results in emergence and computational mechanics, which has developed mathematical formalisms for exploring structure in the set of all possible abstractions (i.e., coarse-grainings). This work could be leveraged in several directions, from unsupervised feature extraction in SOTA models to developing rigorous benchmarks for mechanistic anomaly detection and more.
Nischal Mainali has been working as a part-time affiliate. His main focus has been on extending the rigor of evals by pursuing a phenomenological study of training dynamics, where empirically observed tendencies of safety-relevant features are analytically solved and explained under simplifying assumptions. He has concentrated on examining the development of in-context learning by looking at various signatures of its spectral properties, as well as by employing capacity analysis, a neuroscience technique used to measure how much memory a network can store in its parameters. Nischal has also secured funding from Open Philanthropy in support of his research & building relevant career capital.
Clem von Stengel has pushed novel conceptual frontiers, exploring possible routes to empirical investigation of the phenomenon of emergent selection pressures arising in deep learning systems over the course of training. This progress is not yet public, but partial results have been positively received by both John Wentworth and Jan Kulveit.
Guillaume Corlouer explored two main projects throughout the affiliateship. The first project asks how information theoretic measures could be used in lie detection in LLMs (more on that below). The second project seeks to understand how degeneracies in the loss landscape affect SGD trajectories (slides). On the latter point, he has produced a write-up clarifying a common confusion in the literature between the Hessian of the loss, SGD covariance, and Fisher Information Matrix when analysing degeneracies throughout training as well as a blog post (in collaboration with Nicolas Macé) comparing the behaviour of SGD trajectories and the Bayesian posterior around degenerate minima, tackling an important open question in Singular Learning Theory.
Ann-Kathrin Dombrowski collaborated with Guillaume in exploring how information theoretic measures could be used in lie detection in LLMs, which culminated in an accepted submission to an ICML workshop. They are considering further work in this direction, exploring measures inspired by Multivariate Information Theory and examining how these measures detect other deceptive phenomena, such as steganography.
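To give a concrete flavor of the computational mechanics thread mentioned above: an optimal predictor of a sequence generated by a hidden Markov model must track a Bayesian belief state over the hidden states, and it is the geometry of these belief states that this line of interpretability work looks for inside trained networks. The sketch below shows the belief-state update at the heart of that framework; the toy 2-state HMM and all of its numbers are our own illustrative invention, not taken from the affiliates' actual work.

```python
import numpy as np

# Toy 2-state hidden Markov model (numbers are illustrative only).
# T[o][i, j] is the joint probability of moving from hidden state i to
# hidden state j while emitting observation o, so T[0] + T[1] is a proper
# row-stochastic transition matrix.
T = np.array([
    [[0.4, 0.1],   # transitions that emit observation 0
     [0.1, 0.2]],
    [[0.3, 0.2],   # transitions that emit observation 1
     [0.2, 0.5]],
])

def update_belief(belief, obs):
    """One Bayesian update of the belief over hidden states given an observation."""
    unnorm = belief @ T[obs]
    return unnorm / unnorm.sum()

# An optimal predictor, in the computational-mechanics sense, carries this
# belief vector forward as it reads the sequence token by token.
belief = np.array([0.5, 0.5])   # uniform prior over the two hidden states
for obs in [0, 1, 1, 0, 1]:     # an example observed sequence
    belief = update_belief(belief, obs)
print(belief)  # a point in the belief simplex (entries sum to 1)
```

The empirical results referenced above concern whether trained transformers represent the geometry of such belief states in their internal activations.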
See our website for a short description of our affiliates' profiles.
Going forward, we plan to continue supporting some of our current affiliates and hire new affiliates in early Q4 (details depend on funding outcomes).
Note that beyond the affiliate program, PIBBSS pursues a set of other research & field-building efforts. For a brief overview of those, see Appendix C.
We’re looking for funding for 6 research affiliates for 6 months, including relevant research-related costs (e.g., office space, compute) and staff costs.
How much: $753,500
Breakdown:
For 6 affiliates, 6 months:
Salaries (monthly, pp – incl. HR overhead): $12,000
Office cost (monthly, pp): $1,250
Travel cost (pp): $4,000
Misc support (pp): $4,000
One semi-annual research retreat: $25,000
Compute (full cohort): $10,000
Staff cost (2 FTE, 6 months – incl. HR overhead): $125,000
Buffer (10%): $68,500
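As a sanity check, the line items above reproduce the headline figure (reading the per-person travel and misc support as one-off rather than monthly, which is the reading the total implies):

```python
# Reconstructing the $753,500 ask from the budget line items above.
affiliates, months = 6, 6
salaries = 12_000 * affiliates * months   # $432,000 (monthly, per person)
office   = 1_250 * affiliates * months    # $45,000  (monthly, per person)
travel   = 4_000 * affiliates             # $24,000  (one-off, per person)
misc     = 4_000 * affiliates             # $24,000  (one-off, per person)
retreat  = 25_000                         # one semi-annual retreat
compute  = 10_000                         # full cohort
staff    = 125_000                        # 2 FTE, 6 months
subtotal = salaries + office + travel + misc + retreat + compute + staff  # $685,000
total = round(subtotal * 1.10)            # plus 10% buffer
print(total)  # 753500
```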
Note: If the program secures other funding, our minimum ask would fund additional researcher time. Funders who wish to support more than our ask here are encouraged to reach out to us. If we receive this marginal funding but not enough to run even the minimum version of the program, we will reach out to donors and consult accordingly.
PIBBSS was founded in 2021 by Nora Ammann and TJ as a research initiative aiming to draw insights & talent from fields studying intelligent behavior in natural systems towards progress on questions in AI risk and safety. Since its inception, PIBBSS has supported ~50 researchers for 3-month full-time fellowships, is currently supporting 6 in-house, long-term research affiliates, and has organized 15+ AI safety research events/workshops. Over the years, we have built a substantive and vibrant research community, spanning many disciplines across academia and industry, both inside and outside of AI safety. (For a brief overview of PIBBSS initiatives other than our Research Affiliate program, see Appendix C.)
Our executive team consists of:
Lucas Teixeira (Research) has an interdisciplinary background in Philosophy, Anthropology, and Computer Science. They act as a research manager and collaborator for the various research threads currently pursued by PIBBSS affiliates. Prior to joining the PIBBSS team, they worked at Conjecture, first as an Applied Epistemologist, where they helped co-run Refine as well as used insights from History and Philosophy of Science to unblock researchers, and then later as a Research Engineer on the Cognitive Emulation agenda.
Dušan D. Nešić (Operations) has led PIBBSS’ operations since Autumn 2022. He has 10 years of experience running NGOs (including Rotary and EA Serbia) and private companies and scaling them from 0 to 1 and beyond. He has academic teaching experience in Economics and Finance and is pursuing a PhD in Finance with a focus on Financial Institution design under TAI. He serves as a founder and board member of ENAIS and a Trustee of CEEALAR.
Collaborators who we’ve hired for the summer to work on setting the technical directions and priorities for the affiliateship, as well as begin the talent scouting for the affiliateship:
Mike X Cohen: Mike was a tenured associate professor in neuroscience at Radboud University (NL) who recently decided to transition into AI Safety in an attempt to make a greater impact. Some of their most notable accomplishments include leading groundbreaking work on “midfrontal theta,” a brainwave pattern linked to error detection in humans. They have also self-published multiple textbooks and courses, authored or co-authored 126 peer-reviewed articles receiving over 25,000 citations, and managed more than 8 million Euro in research funding.
Dmitry Vaintrob: Dmitry is an alignment researcher whose work spans developing a Mathematical Framework for Computation in Superposition and investigations into Grokking. In a previous life, he was a Morrey Visiting Assistant Professor at UC Berkeley with a PhD from MIT, where his interests lay at the intersection of logarithmic algebraic geometry, higher algebra, mirror symmetry, number theory, and Quantum Field Theory.
Lauren Greenspan: Lauren is an independent alignment researcher with a background in high energy physics. She was a participant in Neel Nanda's online MATS training program, where she worked on understanding superposition in attention heads. She received her PhD from the University of Porto and has previously held positions at NYU as an adjunct professor, academic advisor, and postdoc researcher. CV here.
Eric Winsor: Eric is an independent AI Safety researcher whose previous work has touched on various aspects of interpretability including interpreting neural networks through the polytope lens, re-examinations of LayerNorm, mapping out statistical regularities in LLM Internals as well as understanding macroscopic universal motifs in the high-level information flow in LLM internals. Eric studied mathematics at the University of Michigan, was a computer science graduate student at Stanford and was previously employed as a research engineer at Conjecture.
Angie Normandale is working with us in a project management capacity to develop and systematize our processes in view of scaling up the affiliate program.
Angie is a skilled operations generalist with a wide range of experience in project management, communications, and finance, both inside and outside of academia. She has a background in experimental psychology and is currently pursuing a part-time Master's in Computer Science.
Our board provides us with substantive support and research & strategy advice. At present, the board consists of:
Alexander Gietelink Oldenziel is Director of Strategy and Outreach at Timaeus and a PhD student in Theoretical Computer Science at University College London. They have been critical in identifying and bringing into AI Safety academics such as Dan Murfet and Paul Riechers; this, together with their strong research taste, makes them an excellent advisor to our research scouting efforts.
Ben Goldhaber is the Director of FAR Labs. He’s passionate about diversifying research bets in AI safety and building intellectually generative cultures for high-impact research. He has extensive experience in leading and supporting organizations. He has previously worked in operational and engineering roles at top tech companies and early stage startups.
Gabriel Weil is an Assistant Professor of Law at Touro University, and a former PIBBSS fellow. He is doing work at the intersection of Legal Theory and AI safety, such as this paper on the role of Tort Law in mitigating AI Catastrophic Risks. We are excited to be able to draw on his expertise on legal and policy matters, as well as his academic experience and nuanced strategic outlook.
Nora Ammann works as a Technical Specialist for the Safeguarded AI programme at the UK’s Advanced Research and Invention Agency (ARIA). She co-founded and directed PIBBSS until Spring 2024 and continues her support as President of the Board. She has pursued various research and field-building efforts in AI safety for more than 6 years. Her research background spans political theory, complex systems, and philosophy of science. She is a PhD student in Philosophy and AI and a Foresight Fellow. Her prior experience includes work with the Future of Humanity Institute (University of Oxford), the Alignment of Complex Systems research group, the Epistemic Forecasting project, and the Simon Institute for Longterm Governance.
Tan Zhi Xuan is a PhD student at MIT’s Probabilistic Computing Project and Computational Cognitive Science lab, advised by Vikash Mansinghka and Josh Tenenbaum. Xuan has a deep understanding of PIBBSS due to her being a part of the journey from the start. We are excited to draw on her appreciation and breadth of knowledge when it comes to underexplored bets in AI Safety.
For more background on PIBBSS, you may wish to consult:
On our website: pibbss.ai
Announcement post of the Affiliate prototype, January 2024
Other posts on LessWrong with the ‘PIBBSS’-tag
>> When humans first started building bridges based on a trial and error approach, a lot of those bridges would soon and unexpectedly collapse. A couple of hundred years, and many insights later, civil engineers are able to make highly precise and reliable statements of the type: “This bridge has less than a 1 in a billion chance of collapsing if it’s exposed to weights below 3 tonnes”. In the case of AI, it matters that we get there faster. <<
In terms of threat models, we put significant credence on relatively short timelines (~3-15y) for general, long-horizon superintelligent AI capabilities, and relatively sharp left turns (i.e. level 8 and above in this taxonomy). However, we also take seriously risk scenarios within a ‘smooth left turn’ regime arising from (multiple) AI transitions, which we expect to unfold over the coming years up until when, or if, a sharp left turn manifests. The latter scenario brings with it associated structural risks as humanity delegates more and more of its agency to general AI systems. [7]
A cornerstone of our strategy is the acceleration of a science of AI. Progress towards a mature science of AI is critical in both the ‘sharp left turn’ and the ‘smooth left turn’ world; our strategy thus has the benefit of being robustly useful across both threat scenarios. The core point: only with a strong enough theoretical basis – in particular, one that has been empirically and instrumentally validated – can we move from a regime where we build systems we don’t understand, and which are likely to cause harm in expected as well as unexpected ways, to a regime where we know how to build systems, from the ground up, such that they reliably exhibit the properties we want them to have.
A priori, such a “safe-by-design” system can be built via two main avenues. On one route, we build systems that – in virtue of their top-down design – are constrained to act only in specified-to-be-safe ways. The ‘Guaranteed Safe AI’ family of approaches is a central example of this avenue.
On the second route – which represents our current research focus – we arrive at similarly rigorous levels of safety assurance ‘from the bottom up’. Ultimately, we develop the sort of scientific understanding of AI systems that allows us to make confident, calibrated, robust, and precise statements about a system’s behavior across training and its contexts of use, because the system has been designed or evaluated on a rigorous theoretical foundation. In other words, the goal is to develop training stories and containment measures capable of giving us justified credence that our engineering artifact will work as intended, reliably, on the first critical try – commensurate with what we know from epistemic practices in the physical sciences at their best.
How does our research taste differentiate from other actors in the AI Safety space? We will provide some – incomplete but hopefully informative – pointers in the direction of answering this question.
Purely Theoretical Research (e.g. (traditionally) MIRI, ARC Theory, Learning Theoretic Agenda, Natural Abstractions, etc.):
With this camp, we share the concerns that the kinds of safety guarantees which we need for ASI to go well require substantial scientific conceptual innovation and that scientific work without theoretical engagement is likely to fall short.
However, we tend to diverge from this camp on the value we see empirical research providing for conceptual innovation, and on the possibility of carefully grounding experiments on current-day models in a way that allows us to generalize towards more model-agnostic theoretical results.
Prosaic Research (e.g. MechInterp, Activation Engineering, Behavioral Evaluations, Alignment Capabilities (i.e. RLHF, Constitutional AI, ‘Make AI Solve It’))
We are in agreement that fruitful scientific engagement towards safety can and should be led by empirical engagement with present day models, and we also share the virtue of quick feedback loops of empirical iteration.
However, many of these agendas follow aspects of ML culture that are in disagreement with our scientific virtues.
Behavioral Evaluations
Ultimately, black box evaluations are insufficient. There is community consensus that white box evaluations are needed.
Mech Interp
Not enough effort goes into falsifying the null hypothesis; relying on semantic interpretations and labeling of features is not robust to spurious correlations; and there is currently not enough emphasis on establishing the faithfulness of interpretations.
Alignment Capabilities (Adversarial Robustness, RLHF, Make AI Solve It, etc…)
We expect a significant amount of industry effort to be leveraged towards solving these issues. Unfortunately, we also expect this area to be the most vulnerable to the epistemic vices of ML culture, notably hill-climbing on benchmarks.
In this funding ask, we’re specifically seeking funding for our research affiliates. That said, PIBBSS has run, and intends to continue running, a range of other research & field-building efforts. We provide a brief overview here:
Fellowship
Since 2022, we have run our annual 3-month research fellowship, pairing fellows who have expertise in diverse fields studying intelligent behavior in natural systems with experienced AI safety researchers.
Public retrospectives of our fellowship programmes in 2022 and in 2023. The 2024 fellowship is currently underway (see here for an overview of this year’s cohort). More in-depth internal evaluation reports for these programmes are available upon request.
Research events/workshops
To date, we have run over 15 research events, retreats, and workshops. Some more recent events include: a hybrid Hackathon on Computational Mechanics and AI safety (June 2024), a 5-day workshop bringing together scholars in Physics of Information and AI Safety (March 2024), a workshop titled “Agent Foundations for AI Alignment” (October 2023), as well as quarterly researcher retreats for affiliates & close collaborators.
Speaker Series
We run semi-regular speaker events which we record and put online. The talks feature researchers from both AI Safety and adjacent fields studying intelligent behavior with the goal of exploring connections between the work of the speaker and questions in AI Safety.
Reading Groups
We have developed and run several reading groups. The first one was developed by TJ and can be found here. In 2024 we developed a novel, 7-week curriculum, primarily aimed at participants in our 2024 fellowship programme.
[1] Credentials should not be read as necessary requirements or as indicative of our terminal interests; rather, the range communicates the caliber of researcher we aim to attract.
[2] Some illustrative examples include:
- Yevgeny Liokumovich (PIBBSS Fellow 2024; Assistant Professor of Mathematics University of Toronto, with prior academic positions at MIT, the Institute for Advanced Study and Imperial College London);
- Gabriel Weil (PIBBSS Fellow 2023; Assistant Professor at Touro University Law Center; Author of “Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence”);
- Fernando Rosas (Incoming PIBBSS Affiliate 2024; Recognised leader in the study of multi-scale complex systems, with more than 14 years of experience and >125 peer-reviewed articles, including in Nature and the Proceedings of the National Academy of Sciences);
- Adam Shai (Current Affiliate; PhD in Empirical Neuroscience & 10+y research experience at Caltech & Stanford)
[3] In rare cases, we may encourage affiliates not to publish their research due to info-hazard reasons. In those cases, research may only be shared with a highly select group of people, or further work might be done to reduce the uncertainty about whether a genuine info-hazard applies.
[4] From the same post (bolding by us): “[T]raining process transparency [...] is understanding what’s happening in training processes themselves—e.g. understanding when and why particular features of models emerge during training.” and “The key distinction here is that worst-case transparency is about quantifying over the entire model—e.g. ‘does X exist in the model anywhere?’—whereas best-case transparency is about picking a piece of the model and the understanding that—e.g. ‘what does attention head X do?’.”
[5] As a case in point, one of our affiliates (Adam Shai) is taking on scholars at the next MATS iteration, and one of our 2024 fellows (Yevgeny Liokumovich, Assistant Professor) plans to direct his own students towards projects in SLT/Developmental Interpretability after the end of the fellowship.
[6] For example, we were able to significantly improve the quality and throughput of the experimental side of Adam Shai’s work. Lucas brings relevant experience from working as a research engineer at Conjecture under the supervision of Sid Black (now at AISI), co-founder of EleutherAI, author of NeoX-20B (one of the largest open-source LLMs before the ChatGPT revolution), and co-author of The Pile, a standard dataset used in LLM research with nearly 500 citations.
[7] Examples of such scenarios include Robust Agent Agnostic Processes, ascended economy, ‘human enfeeblement,’ epistemic insecurity, the exploitation of societal vulnerabilities, especially of critical cyber-physical infrastructure, and more.