Nice work on these benchmarks, Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to say thank you for all the high-quality content you share for free. It’s become my primary source for keeping up to date.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).
The variance is way too high for this test to have any value at all.
I ran it 10 times, and every pelican on a bicycle was a better rendition than that; about half of them you could call perfect.
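A quick way to put a number on the run-to-run variance is to sample the same prompt N times and score each output. A minimal sketch of that idea (the `generate` and `score` functions here are hypothetical stand-ins for a real model call and a pass/fail rating, not any actual API):

```python
import random


def sample_and_score(generate, score, n=10):
    """Run the same prompt n times and collect pass/fail scores."""
    return [score(generate()) for _ in range(n)]


# Stand-in generator: pretend output quality varies from run to run.
random.seed(0)


def fake_generate():
    return random.random()  # placeholder for an SVG string from the model


def fake_score(output):
    return output > 0.5  # placeholder for "looks like a pelican on a bicycle"


scores = sample_and_score(fake_generate, fake_score, n=10)
print(f"{sum(scores)} of {len(scores)} runs passed")
```

With 10 samples per model you at least get a pass rate instead of a single anecdotal image, though 10 is still a small sample for comparing two models.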
They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - I've yet to try 5.2 out...).
I've not seen any model that's good at graphic/SVG creation so far - everything mostly looks ugly and somewhat "synthetically distorted".
And lately, Claude (web) started drawing ASCII charts from one day to the next instead of the colorful infographic-style images it produced before (which were only slightly better than the ASCII charts).