Yeah, I've experienced this too with 3.7, though not always. It has been helpful for me more often than not. But yeah, 3.5 "felt" better to me.
Part of me thinks this is because I expected less of 3.5 and therefore interacted with it differently.
It's funny because it's unlikely that everyone interacts with these models in the same way. And that's pretty much guaranteed to give different results.
It would be interesting to see some methods come out for individuals to measure their own personal success rate, productivity, or whatever with these different models, and then a way for people to compare results with each other, so we can figure out who is working well with these models, who isn't, and why.
> It would be interesting to see some methods come out for individuals to measure their own personal success rate, productivity, or whatever with these different models, and then a way for people to compare results with each other, so we can figure out who is working well with these models, who isn't, and why.
This would be so useful. I have thought about this missing piece a lot.
Different tools like Cursor vs. Windsurf likely have their own system prompts for each model, so the testing really needs to be done in the context of each tool.
This seems somewhat straightforward to do using a testing tool like Playwright, correct? Whoever first does this successfully will have a popular blog/site on their hands.
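For what it's worth, the bookkeeping side of this could be very small. Here is a rough sketch, assuming each attempt is logged by hand as a (tool, model, succeeded) tuple. The function name `success_rates` and the log entries are made up for illustration, and this does not cover the Playwright automation itself.

```python
from collections import defaultdict

def success_rates(records):
    """Return {(tool, model): fraction of attempts marked successful}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tool, model, succeeded in records:
        totals[(tool, model)] += 1
        if succeeded:
            wins[(tool, model)] += 1
    return {key: wins[key] / totals[key] for key in totals}

# Hypothetical log entries, not real measurements.
log = [
    ("Cursor", "3.5", True),
    ("Cursor", "3.7", True),
    ("Cursor", "3.7", False),
    ("Windsurf", "3.7", True),
]

rates = success_rates(log)
for (tool, model), rate in sorted(rates.items()):
    print(f"{tool} / {model}: {rate:.0%}")
```

Keying on the (tool, model) pair rather than the model alone keeps each tool's system prompt from getting mixed into the comparison.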