This project aims to develop new machine-learning tools for microbial forensics. In particular, my PhD will aim to bring genetic engineering attribution and detection algorithms from proof-of-concept to prototype. I am hoping to start my PhD in October 2024 at the Statistics department of Oxford University.
During my PhD, I will develop machine learning tools to determine whether a genetic sequence has been engineered (genetic engineering detection or “GED”) and, if so, attribute it to its origin (genetic engineering attribution or “GEA”).
Theory of Change. The concept is simple: identifying who made a genetically engineered organism can discourage misuse and help trace the source in cases like lab accidents. This could incentivise laboratories to install more stringent safety measures or to give up the most dangerous experiments altogether. For more details on the theory of change, I recommend this paper by Greg Lewis et al.
Goals. My project will build on initial studies which exclusively used plasmids (small, circular pieces of DNA found mostly in bacteria). I plan to:
Expand this research to include viruses that pose real threats to biosecurity.
Enhance detection and attribution methods using advanced machine learning models.
Develop a user-friendly interface that allows stakeholders to use our systems effectively.
You can find a more detailed research proposal here.
Building Career Capital. Earning a PhD in Statistics from Oxford University will open doors to a range of impactful careers. The programme will equip me with specialised statistics and machine learning skills, enabling me to shift focus to other urgent areas, such as AGI safety, if working on them seems likely to have a greater positive impact.
Requested Amount:
Minimum: $39,157 (First-year PhD fees)
Maximum: ~ $296,227 (Complete PhD over four years, including cost-of-living)
Budget Breakdown:
Course Fees: Fees are expected to increase by less than 6% per year. Taking the full 6% as an upper bound, starting from initial fees of $39,157 and compounding annually, the four-year total is approximately $171,297.
Living Costs: Estimated at between $20,000 and $29,000 per year. The midpoint of $24,500 per year is used for the calculation, totalling $98,000 over four years.
Buffer: 10% of total estimated living costs and course fees, adding approximately $26,930 to account for systematic underestimation.
Total Estimated Budget: Course Fees ($171,297) + Living Costs ($98,000) + Buffer ($26,930) = $296,227
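As a check on the arithmetic, the figures above can be reproduced with a short Python sketch. It assumes that fees rise by the full 6% each year, which is an upper bound since the expected increase is below 6%. The function and parameter names are illustrative only.

```python
def estimate_budget(first_year_fee=39_157, fee_growth=0.06, years=4,
                    living_per_year=24_500, buffer_rate=0.10):
    """Reproduce the budget breakdown (all amounts in USD).

    fee_growth=0.06 is an assumed upper bound: the proposal only states
    that the annual increase is below 6%.
    """
    # Fees compound annually: year 0 is the first-year fee, then +6% each year.
    fees = sum(first_year_fee * (1 + fee_growth) ** year for year in range(years))
    living = living_per_year * years
    # The buffer is a fixed share of fees plus living costs.
    buffer = buffer_rate * (fees + living)
    # Round each component to whole dollars, then sum, as in the breakdown.
    fees, buffer = round(fees), round(buffer)
    return {"fees": fees, "living": living, "buffer": buffer,
            "total": fees + living + buffer}


if __name__ == "__main__":
    for item, amount in estimate_budget().items():
        print(f"{item:>7}: ${amount:,}")
```

Lowering `fee_growth` shows how much the maximum request would fall if fees rise more slowly than the 6% upper bound.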
Personal Track record. I have researched GEA full-time over the past 8 months to write my undergraduate thesis. During this time I reached state-of-the-art performance in two sub-tasks of GEA. My thesis is available upon request.
After completing my bachelor’s with first-class honours, I was accepted into the Statistics PhD at Oxford University, despite only having a bachelor’s degree in Cognitive Science. You can find my CV here.
Organisational Track record. My current and future supervisor, Dr Oliver Crook, has co-authored two papers on GEA, published in Nature Communications and Nature Computational Science.
Results don’t transfer from the toy problem. The success of machine learning models heavily depends on the quality and quantity of data available. So far it seems challenging to obtain data from engineered pathogens. As a result, too much of my work might be focused on toy problems (GEA with plasmids) for which data is easily available. Our results might not transfer to the real use case (GEA with engineered pathogens).
This is the most likely reason this project fails in its current form. The first part of this project will thus be to find out if the necessary data is available to us by comprehensively searching existing databases and contacting relevant laboratories. If it's not, we might conclude that we need to pivot to other projects.
Lack of adoption by stakeholders. Even if I develop effective GEA systems, there is a risk that stakeholders (e.g., universities and policymakers) may not adopt them, for reasons such as lack of awareness, resistance to change, or concerns about implementation costs or reliability.
This currently seems very unlikely to me. As far as I know, multiple important stakeholders are interested in this technology.
Ethical and regulatory hurdles. Governments might classify GEA algorithms as “dual-use technologies” and strictly regulate their export. This could pose challenges to the progress and implementation of my research.
I can only say that this has not prevented the publication of this research so far and that I am personally optimistic that regulations will continue to allow academic research to be published on this topic.