Nice work on these benchmarks, Simon. I’ve followed your blog closely since your great talk at the AI Engineers World Fair, and I want to say thank you for all the high-quality content you share for free. It’s become my primary source for keeping up to date.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).
The variance is way too high for this test to have any value at all.
I ran it 10 times, and every pelican on a bicycle was a better rendition than that; about half of them you could call perfect.
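A quick way to put a number on the run-to-run variance is to sample the same prompt N times and score each output. A minimal sketch of that idea (the `generate` and `score` functions here are hypothetical stand-ins for a real model call and a pass/fail rating, not any actual API):

```python
import random


def sample_and_score(generate, score, n=10):
    """Run the same prompt n times and collect pass/fail scores."""
    return [score(generate()) for _ in range(n)]


# Stand-in generator: pretend output quality varies from run to run.
random.seed(0)


def fake_generate():
    return random.random()  # placeholder for an SVG string from the model


def fake_score(output):
    return output > 0.5  # placeholder for "looks like a pelican on a bicycle"


scores = sample_and_score(fake_generate, fake_score, n=10)
print(f"{sum(scores)} of {len(scores)} runs passed")
```

With 10 samples per model you at least get a pass rate instead of a single anecdotal image, though 10 is still a small sample for comparing two models.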
They probably saw your complaint that 5.1 was too spartan and a regression (I had the same experience with 5.1 in the POV-Ray version - I've yet to try 5.2 out...).
I've not seen any model that's good at graphic/SVG creation so far - everything mostly looks ugly and somewhat "synthetically distorted".
And lately, Claude (web) started drawing ASCII charts from one day to the next instead of the colorful infographic-style images it produced before (which were only slightly better than the ASCII charts).