I have, and to be fair that has solved the “basically incorrect code” issue with...

I have, and to be fair that has solved the “basically incorrect code” issue with reasonable regularity. Occasionally the error messages don’t seem helpful enough for it, which is understandable, and I’ve had a few occurrences of it getting “stuck” in a loop trying to e.g. use an invalid addressing mode (it may have gotten itself out of those situations if I were more patient) but generally, with one of the Claude 4 models in agent mode in cursor or Claude code, I’ve found it’s possible to get reasonably good results in terms of “does it assemble”.

I’m still working on a good way to integrate more feedback for this kind of workflow, e.g. for the attempt it made at AP bootstrap - debugging that is just hard, and giving an agent enough control over the running code and the ability to extract the information it would need to debug the resulting triple fault is an interesting challenge (even if probably not all that generally useful).

I have a bunch of pretty ad-hoc test harnesses and the like that I use for general hosted testing, but that can only get you so far in this kind of low-level code.