Yea, I've experienced this too with 3.7. Not always though. It has been helpful ...

consumer451 · 2025-04-07T18:15:05 1744049705

> Would be interesting to see some methods come out for individuals to measure their own personal success rate/ productivity / whatever with these different models. And then have a way for people to compare them with each other so we can figure out who is working well with these models and who isn't and figure out why the difference.

This would be so useful. I have thought about this missing piece a lot.

Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.

This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully will have a popular blog/site on their hands.
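
For what it's worth, the scoring half seems easy once every run is logged as a pass or fail per tool and model. Here is a rough Python sketch of just that part. The `success_rates` helper, the tuple layout, and the "model-a" label are all made up for illustration, and the sample outcomes are placeholders, not real measurements. Actually driving each tool (with Playwright or anything else) is the hard part and is not shown.

```python
from collections import defaultdict


def success_rates(results):
    """Compute the pass rate for each (tool, model) pair.

    `results` is an iterable of (tool, model, passed) tuples, where
    `passed` is True if the task was completed successfully.
    This layout is an assumption for illustration only.
    """
    # (tool, model) -> [number of passes, number of attempts]
    tally = defaultdict(lambda: [0, 0])
    for tool, model, passed in results:
        entry = tally[(tool, model)]
        entry[0] += int(passed)
        entry[1] += 1
    return {key: passes / attempts for key, (passes, attempts) in tally.items()}


# Placeholder outcomes, NOT real measurements.
sample = [
    ("Cursor", "model-a", True),
    ("Cursor", "model-a", False),
    ("Windsurf", "model-a", True),
    ("Windsurf", "model-a", True),
]

rates = success_rates(sample)
for (tool, model), rate in sorted(rates.items()):
    print(f"{tool} / {model}: {rate:.0%}")
```

Keying on the (tool, model) pair rather than the model alone keeps each tool's system prompt as part of what is being measured, which is the point above about testing in the context of each tool.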