@Tym Thanks, Tym, I really appreciate you engaging with this, and you're right on the model selection point. I will update my planned comparison to include Claude Opus 4.7 as the primary Anthropic model. It's the current frontier model, and testing it makes the findings more defensible.
Your point about capability trajectories is actually the core question I want the paper to answer. If even Opus 4.7 fails significantly on this benchmark, the problem isn't that the model isn't smart enough; it's that the knowledge simply isn't in the training data, which no capability improvement automatically fixes.
Looking forward to sharing the full comparison results once the closed model runs are complete.