Physics verification for AI-generated hardware designs

Project summary

AI can design real hardware in seconds, but it can't tell you whether the thing will actually hold. I built the part that checks. It takes a hardware design, tests it against real physics solvers, and flags anything it can't verify as UNCHECKED instead of asserting it. Take a wheelchair seat bracket sized for a 100 kg occupant on rough ground: the obvious design a chatbot hands back measures a factor of safety between 0.31 and 0.91, below the 1.0 at which a part only just carries its load, so it breaks under the person. The checker finds one that holds at 2.90, re-proves it with the mounting bolt holes cut in, and the file it exports is exactly the solid it verified. The checker runs on a laptop with no GPU, and every number here regenerates from open code, so I can reproduce any of it live.

The checker already works past single parts. It matches textbook physics to within about one percent, clears a 44-piece engine assembly with zero interference through a full rotation, and proved out a wearable ultrasound gimbal while refusing to certify five things it had no way to check, like skin pressure and thermal rise. That refusal is the product, not a gap in it: the UNCHECKED flag is structural, so a claim that was never measured can't render as passed. And the loop is not hypothetical: that 44-piece engine was produced by a model from inside it, on rented cluster compute, proposing and revising against the checker until nothing interfered. The wheelchair, gimbal, and hook I built by hand to exercise the checker; the engine is what the loop does with a model driving it. What I have not done is make that cheap, or measure how reliably it works on designs it has never seen, and that is what this funds.
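
To make "structural" concrete, here is a minimal sketch of one way such a flag can work. It is an illustration, not the project's actual code: the names `Status`, `Verdict`, `judge`, and `solver_run_id`, and the required factor of safety of 2.0, are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PASSED = "PASSED"
    FAILED = "FAILED"
    UNCHECKED = "UNCHECKED"


@dataclass(frozen=True)
class Verdict:
    claim: str
    status: Status
    measured: Optional[float] = None       # value from a solver run, if any
    solver_run_id: Optional[str] = None    # hypothetical provenance handle

    def __post_init__(self) -> None:
        # A PASSED or FAILED verdict cannot exist without a measurement
        # and the solver run that produced it.
        if self.status is not Status.UNCHECKED:
            if self.measured is None or self.solver_run_id is None:
                raise ValueError(
                    f"{self.claim!r}: {self.status.value} needs a solver measurement"
                )


def judge(claim: str, required: float,
          measured: Optional[float] = None,
          solver_run_id: Optional[str] = None) -> Verdict:
    """Return UNCHECKED unless a measured value and its run id are supplied."""
    if measured is None or solver_run_id is None:
        return Verdict(claim, Status.UNCHECKED)
    status = Status.PASSED if measured >= required else Status.FAILED
    return Verdict(claim, status, measured, solver_run_id)
```

With an assumed requirement of 2.0, `judge("bracket factor of safety", 2.0, 2.90, "run-001")` passes, the same call with 0.91 fails, and `judge("skin pressure", 2.0)` comes back UNCHECKED because no measurement was supplied. Constructing `Verdict("skin pressure", Status.PASSED)` directly raises an error, which is the point of the design.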

The goal is a loop where a model proposes a design and the checker tests it against real physics, scaling from a single part to a whole machine, so that being sure a design will hold costs about as much as generating it. The checker itself needs no GPU; the model that drives it does, which is why the engine took a cluster. That compute is the whole ask. It is staged from $10,000, to measure how reliably the loop works on unseen designs, up to $100,000, for a larger model and a hosted version others can use; the breakdown is below. Underneath, it is a scalable-oversight bet: a proposer that can't ship a claim its checker hasn't independently confirmed, in one of the few domains where the checker itself is graded against textbook physics instead of another model's opinion.

What are this project's goals? How will you achieve them?

The goal is a loop where a model proposes hardware and the checker tests it against real physics, growing two ways: more parts interacting at once, and more kinds of check. It has already gone from one bracket to a 44-piece engine with no interference through a full rotation, and from a strength check alone to interference, manufacturability, and honest refusals when it hits something it cannot verify. Each new check (thermal, fatigue, cost, whether an ordinary workshop can build it) plugs into the same loop without a rewrite.
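
The propose-and-check loop with pluggable checks can be sketched as follows. This is a toy under stated assumptions, not the real system: the `Design` dictionary, the `register` decorator, the `strength` rule (a stand-in for a real solver call), and the proposer that simply thickens the part are all invented for illustration.

```python
from typing import Callable, Dict, List, Optional

Design = Dict[str, float]                   # hypothetical design representation
Check = Callable[[Design], Optional[str]]   # None if the check passes, else a message
Proposer = Callable[[List[str]], Design]    # takes the last round's failures

CHECKS: Dict[str, Check] = {}


def register(name: str) -> Callable[[Check], Check]:
    """Add a check to the loop without touching the loop itself."""
    def wrap(fn: Check) -> Check:
        CHECKS[name] = fn
        return fn
    return wrap


@register("strength")
def strength(design: Design) -> Optional[str]:
    # Toy rule standing in for a solver run; the 2.0 requirement is assumed.
    fos = design["thickness_mm"] / 2.0
    return None if fos >= 2.0 else f"factor of safety {fos:.2f} below 2.00"


def run_loop(propose: Proposer, max_rounds: int = 10) -> Optional[Design]:
    """Propose, run every registered check, feed failures back, repeat."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        design = propose(feedback)
        feedback = []
        for name, check in CHECKS.items():
            message = check(design)
            if message is not None:
                feedback.append(f"{name}: {message}")
        if not feedback:
            return design      # every registered check passed
    return None                # nothing verified within the budget


def make_toy_proposer() -> Proposer:
    """Stand-in for the model: thickens the part whenever a check fails."""
    state = {"thickness_mm": 1.0}

    def propose(feedback: List[str]) -> Design:
        if feedback:
            state["thickness_mm"] += 1.0
        return dict(state)

    return propose
```

Calling `run_loop(make_toy_proposer())` returns the first design that clears every registered check. A new kind of check is one more function under `@register(...)`; `run_loop` does not change.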

How I get there, in order. First, the piece this money buys: measure how often the loop succeeds on designs it has never seen. A model has already produced a 44-piece engine from inside the loop, but that is one result on a cluster, not a success rate. I built the exam for the real number: eleven held-out design families the system has never seen. Nobody has run it yet, so that number does not exist. Producing it honestly, flattering or not, is the first milestone. Second, own the compute, so each attempt costs electricity instead of rented-GPU money, because real design is hundreds of small iterations and metering kills that. Third, keep widening the checker, so the loop can verify more of a machine.
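
As a sketch of how that pass rate could be tallied once the sweep runs, assuming each attempt is recorded as a (family, passed) pair: the function name and record shape are hypothetical, and no results exist yet, so the example carries no data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def pass_rates(results: Iterable[Tuple[str, bool]],
               families: Iterable[str]) -> Dict[str, Optional[float]]:
    """Per-family pass rate; a family with no attempts reports None, not 0."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for family, ok in results:
        total[family] += 1
        passed[family] += int(ok)
    return {f: (passed[f] / total[f] if total[f] else None) for f in families}
```

Reporting `None` for a family that was never attempted keeps an unrun family from reading as a failure or a success, in the same spirit as the UNCHECKED flag.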

The checks already drive design, not just grade it. On a printable wall hook the obvious shape fails the safety bar; the system sizes a lighter one through 22 real solves, re-checks it with the mounting holes cut, and ships only what still holds. In that same run I planted an unverifiable claim, ten-year UV resistance, and the export refused to ship until I signed off. An honesty lint fails the build if any number does not trace to a solver run, and every check here regenerates from open code on a laptop, so I can reproduce the results live. Of these designs, I built the wheelchair, gimbal, and hook by hand to exercise the checker; the 44-piece engine is the one a model produced from inside the loop, on a rented cluster.
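
A lint of that kind can be sketched in a few lines. This is an illustration under assumptions, not the project's lint: it supposes every reported number is a record with a `label` and an optional `solver_run_id`, and that the set of real solver run ids is available from a log.

```python
from typing import Dict, Iterable, List, Optional, Set


def lint_report(numbers: Iterable[Dict[str, Optional[str]]],
                solver_runs: Set[str]) -> List[str]:
    """Return one error per reported number that does not trace to a solver run."""
    errors: List[str] = []
    for entry in numbers:
        run_id = entry.get("solver_run_id")
        if run_id is None:
            errors.append(f"{entry['label']}: no solver run cited")
        elif run_id not in solver_runs:
            errors.append(f"{entry['label']}: cites unknown solver run {run_id}")
    return errors


def fail_build_if_untraced(numbers: Iterable[Dict[str, Optional[str]]],
                           solver_runs: Set[str]) -> None:
    """Exit non-zero, as a build step would, if any number is untraced."""
    errors = lint_report(numbers, solver_runs)
    if errors:
        raise SystemExit("honesty lint failed:\n" + "\n".join(errors))
```

A number with no run id, or with a run id that does not appear in the log, produces an error line and a non-zero exit, so the build stops before anything ships.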

The stretch goal, not something this money buys directly: a first real brief from someone who isn't me. That could be a wheelchair workshop, a disaster-relief fabrication team, a volunteer assistive-device network, or anywhere else a confidently wrong AI answer has no engineer downstream to catch it. No conversations yet, so I am not naming names.

How will this funding be used?

The checker runs on a laptop; the model that drives the loop does not, which is the split this money pays for. The loop already works with a real model: it produced the 44-piece engine on a rented cluster. But that has only ever run on rented compute, never cheaply, and I have never measured how often it succeeds on designs it has not seen. The exam for that is built: eleven held-out design families, decontamination-checked at a maximum overlap of 0.14 against a 0.50 threshold. Nobody has run it yet. Producing that pass rate is what the first GPU hours buy.
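
For readers unfamiliar with decontamination checks, here is a minimal sketch of the general idea. The overlap metric is an assumption: this example uses Jaccard similarity over word trigrams, which may not be the measure behind the 0.14 figure. Only the 0.50 threshold comes from the text above.

```python
from typing import FrozenSet, List, Sequence, Tuple

Gram = Tuple[str, ...]


def ngrams(text: str, n: int = 3) -> FrozenSet[Gram]:
    """Set of word n-grams in a text (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    return frozenset(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def jaccard(a: FrozenSet[Gram], b: FrozenSet[Gram]) -> float:
    """Shared n-grams divided by all n-grams; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_overlap(held_out: Sequence[str], seen: Sequence[str], n: int = 3) -> float:
    """Worst-case overlap between any held-out item and any seen item."""
    seen_grams: List[FrozenSet[Gram]] = [ngrams(s, n) for s in seen]
    return max(
        (jaccard(ngrams(h, n), g) for h in held_out for g in seen_grams),
        default=0.0,
    )


def is_decontaminated(held_out: Sequence[str], seen: Sequence[str],
                      threshold: float = 0.50, n: int = 3) -> bool:
    """True when even the closest pair stays under the threshold."""
    return max_overlap(held_out, seen, n) < threshold
```

Taking the maximum over all pairs, not the average, is the conservative choice: a single near-copy of a seen design is enough to fail the check.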

The ask is staged. The $10,000 minimum rents the GPU hours to run the held-out sweep to completion and post the pass rate here, flattering or not. $25,000 buys a workstation built around an RTX PRO 6000, so the hundreds of small iterations the dev loop needs cost electricity instead of rental fees; metered compute is what kills the back-and-forth real design takes. I cover the power, storage, and electrical work myself. $100,000 funds the capacity for the cluster-scale model the engine needed, owned instead of rented, plus hosting so a workshop or another lab can reach the checker through a browser. Renting that node runs about $140,000 a year; this buys it once.
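
The rent-versus-own comparison can be checked with one line of arithmetic. As a rough sketch, this assumes the whole $100,000 is set against the $140,000-a-year rental figure, which overstates the node's share since that tier also covers hosting, and it ignores power, maintenance, and depreciation.

```python
def break_even_months(purchase_usd: float, rental_usd_per_year: float) -> float:
    """Months of rental that cost as much as buying outright."""
    return 12.0 * purchase_usd / rental_usd_per_year


# Figures from the funding breakdown above.
months = break_even_months(100_000, 140_000)
```

Under those assumptions the purchase pays for itself in under nine months of rental.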

Who is on your team? What's your track record on similar projects?

Solo project. Most relevant is Axiom, a study app around a language model I fine-tuned to turn a student's notes into flashcards: live at axiomchat.app and on the iOS App Store, around 400 monthly users, built to run on one CPU core and two gigabytes of memory so it stays free. Axiom taught me both halves this needs: how to fine-tune a model to do a job well, and exactly how convincing a model stays right up until it's confidently wrong. Before that: a tennis-court occupancy sensor on small wireless boards, then self-hosted vision models tracking the ball and a player's posture from a phone mount I designed and 3D-printed. The last few months went into the work this application describes: building the checker, and getting a model to design the 44-piece engine through it on rented compute.

What are the most likely causes and outcomes if this project fails?

The most likely one is just that the hard version is hard. The engine is one machine a model got right through the loop; making that reliable, on machines it has never seen, could land slower or shakier than a single success makes it look. If so, the proven half still ships: real solver checks instead of the model's own say-so, for whatever range it covers.

The riskier failures are the ones I'm asking to measure. Whether the checking generalizes past the cases I built: the held-out exam tests exactly that, has no number yet, and I won't invent one. Whether the gate only catches what I thought to wire a verifier for: the gimbal carries five UNCHECKED flags because I knew to name them; nothing catches an unanticipated sixth. The sharpest one shows up once a model trains against the gate instead of being prompted at it: does it converge on brackets that genuinely hold, or on exploits in whatever the verifier doesn't check? That's specification gaming on cheap, deterministic ground truth, and it's the first thing to instrument once training starts.

Worth saying, though: even the bad outcomes leave something. Every run pairs a design with a real pass-or-fail verdict against actual physics. The floor isn't zero: the gate already blocked an export over an unverifiable claim I deliberately planted, and it stays blocked until I sign off. Whatever happens, this never reports a check it didn't run as passed.

How much money have you raised in the last 12 months, and from where?

I have raised zero dollars in the last 12 months: no grants, no investment. The project is entirely self-funded, and my own spend so far is about a hundred dollars on rented compute.