
Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...




Nice work on these benchmarks, Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to say thank you for all the high-quality content you share for free. It’s become my primary source for keeping up to date.

I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).

In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).


That is a very good benchmark. Interesting to see GPT-5.2 delivering on the promise of better vision support there.

The variance is way too high for this test to have any value at all. I ran it 10 times, and each pelican on a bicycle was a better rendition than that; you could say about half of them were perfect.

Compared to the other benchmarks which are much more gameable, I trust PelicanBikeEval way more.

Well, the variance is itself interesting.

They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - have yet to try 5.2 out...).

I added GPT-5.2 Pro to my pelican-alternatives benchmark for the first three prompts:

Generate an SVG of an octopus operating a pipe organ

Generate an SVG of a giraffe assembling a grandfather clock

Generate an SVG of a starfish driving a bulldozer

https://gally.net/temp/20251107pelican-alternatives/index.ht...

GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.
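
For scale, here is a rough back-of-the-envelope sketch of that cost. It assumes the quoted figure of about 80 cents per prompt holds flat across all thirty prompts, which is only an approximation since real cost varies with output length. The function name is hypothetical, chosen just for this illustration.

```python
def total_cost_cents(cost_per_prompt_cents: int, prompt_count: int) -> int:
    """Estimated total spend in cents, assuming a flat per-prompt cost."""
    return cost_per_prompt_cents * prompt_count


# Assumption: "about 80 cents per prompt", taken from the comment above.
COST_PER_PROMPT_CENTS = 80
PROMPTS_DONE = 3    # the first three prompts that were actually run
PROMPTS_TOTAL = 30  # the full set of prompts in the gallery

spent = total_cost_cents(COST_PER_PROMPT_CENTS, PROMPTS_DONE)
full_run = total_cost_cents(COST_PER_PROMPT_CENTS, PROMPTS_TOTAL)

# Work in integer cents and convert to dollars only for display.
print(f"first three prompts: ${spent / 100:.2f}")
print(f"all thirty prompts:  ${full_run / 100:.2f}")
```

Under that flat-rate assumption, the three prompts come to about $2.40 and the full set of thirty would be roughly $24.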


Hi, it doesn't have Gemini 3.5 Pro which seems to be the best at this

That's probably because "Gemini 3.5 Pro" doesn't exist

That gallery is an excellent advertisement for Gemini 3.0 Pro.

Seems to be getting more aerodynamic. A clear sign of AI intelligence

the only benchmark i trust

What happens if you ask for a pterodactyl on a motorbike?

Would like to know how much they are optimizing for your pelican....



I was expecting to see a pterodactyl :(

Is that the first SVG pelican with drop shadows?

No, I got drop shadows from DeepSeek 3.2 recently https://simonwillison.net/2025/Dec/1/deepseek-v32/ (probably others as well.)

Do you think the big guys are on to your game and have been adding extra pelicans to the training data?

What is good at SVG design?

Not svg, but basically the same challenge:

https://clocks.brianmoore.com/

Probably Kimi or DeepSeek are best


Graphic designers?

I've not seen any model that is good at graphic/SVG creation so far - all of the stuff mostly looks ugly and somewhat "synthetic-distorted".

And lately, Claude (web) started drawing ASCII charts from one day to the next instead of the colorful infographic-styled images it produced before (those were only slightly better than the ASCII charts).


seems to be eating something

Probably a jellyfish. You're seeing the tentacles

benchmarks probably should not be used for so long.


