Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking Qwen3-VL for the appropriate next action, e.g. a mouse click. While it knows what to click, it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates returned?
How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).
Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:
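To turn values on that 0–1000 scale into real pixels, you scale by the screenshot's width and height. A minimal sketch, assuming the box comes back in [x1, y1, x2, y2] order (check what your prompt actually returns); the function names and the sample numbers are made up for illustration, not actual model output:

```python
def norm_to_pixels(box, width, height, scale=1000):
    """Convert a box normalized to 0..scale into pixel coordinates."""
    x1, y1, x2, y2 = box  # assumed order: left, top, right, bottom
    return (
        round(x1 / scale * width),
        round(y1 / scale * height),
        round(x2 / scale * width),
        round(y2 / scale * height),
    )


def click_point(box, width, height, scale=1000):
    """Return the pixel center of a normalized box, a reasonable place to click."""
    px1, py1, px2, py2 = norm_to_pixels(box, width, height, scale)
    return ((px1 + px2) // 2, (py1 + py2) // 2)


# Hypothetical box on a 1920x1080 screenshot.
print(norm_to_pixels([250, 500, 750, 600], 1920, 1080))  # (480, 540, 1440, 648)
print(click_point([250, 500, 750, 600], 1920, 1080))     # (960, 594)
```

Use the dimensions of the image you actually sent to the model. If the screenshot was resized before upload, scale to the real screen size instead.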
It’s been about a year since I looked into this sort of thing, but Molmo will give you x,y coordinates. I hacked together a project about it. I think Microsoft’s OmniParser is also good at finding coordinates.
It's not very accurate, but sometimes instructing the model to return pyautogui code works.
prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
attachment: <screenshot>
reply:
import pyautogui
pyautogui.click(100, 200)
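If you go this route, one option is to pull the coordinates out of the reply and sanity-check them instead of executing generated code blindly. A rough sketch, assuming the reply contains a literal pyautogui.click(x, y) call with integer arguments; extract_click is a hypothetical helper, not part of pyautogui:

```python
import re

# Matches calls like pyautogui.click(100, 200) with plain integer arguments.
CLICK_RE = re.compile(r"pyautogui\.click\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def extract_click(reply, width, height):
    """Return (x, y) from the first pyautogui.click(x, y) in reply, or None.

    Coordinates that fall outside the screenshot bounds are rejected.
    """
    match = CLICK_RE.search(reply)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    if 0 <= x < width and 0 <= y < height:
        return (x, y)
    return None


reply = "import pyautogui\npyautogui.click(100, 200)"
print(extract_click(reply, 1920, 1080))  # (100, 200)
```

You can then pass the returned tuple to pyautogui.click(x, y) yourself once it has passed the bounds check.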
You can get exact coordinates by running a keypoint network to pinpoint where the next click should go. Here I show a simple example prompt that returns the keypoint location of the next button to click and marks that point visually on the image.
Hospitals all over the world are switching wholesale to Chinese equipment, particularly Mindray monitors and anaesthetic machines. China could brick all of these hospitals. We are so incredibly dependent on them.
Has it really though? Genuinely asking. I’ve checked out a lot since 2010 or so, but I'm not sure I hear anything wildly different. Vaporwave and sorta meme music was quite fresh, but other than that I'm not sure. Maybe it's just part of getting old and having less time to hunt around.
Yes, it has. In both breadth and depth. People paying attention know this.
Even within techno (my favorite genre), which is already a quite narrow genre in terms of sounds, the variety of novel sounds birthing new techno sub-genres over the last 10-15 years has been wild.
I dislike calling them genres, they're more like trends or styles. One producer makes something new and unusual that breaks the established patterns, people like it a lot, other producers copy it, and that cycle continues until fans get bored and move on. That lifecycle usually lasts about 2-5 years, sometimes not even long enough to get a proper name, but if you're into the scene you know that "genre" when you hear it.
To give a recent "mainstream" example, Odd Mob has created a certain sound that blew up in popularity despite not fitting neatly into any of the existing boxes we had (tracks like Get Busy, Losing Control, Palm Of My Hands). Other producers copied it, and by now you have anonymous shitposters on social media complaining that most new songs sound like they were made by him.
As a DJ, the endgame is building a set from a variety of different kinds of music which still sounds great together but doesn’t all follow the same boring formula. And it’s pretty great.
Less regulated? Circle has to keep 100% reserves backing all accounts whereas most US banks operate a low fractional reserve and lend mostly to billionaires funding moderately risky leveraged commercial real estate.
You’re conveniently ignoring what “reserves” means in the GENIUS Act. Unlike regular banks, Circle can use US Treasuries instead of cash, so they earn interest and prop up US government debt at the same time. It’s a clever scheme, but not the same as being forced to hold fiat reserves.
Many other banking regulations also don’t apply. No FDIC insurance and most importantly none of the regulations that apply to true fiduciaries since they are only “custodians”.
Yep, I mean regulated with respect to who's allowed to offer and hold digital accounts in the currency. Circle itself may not do anything too funky, but big chunks of USD then get stored elsewhere online for banking-like activity in a way that wouldn't be legal with USD — right?
Why use something which appears to have very similar results to tirzepatide/Mounjaro but hasn’t been used by tens of millions of people without obvious issues, like tirzep has?
Does it use OpenRouter for model selection? Which models did you achieve the WebArena result with? Are there any open source models which are any good for this?
For the WebArena result, we actually used a mixture of models checking each other's work and evaluating in real time. We found the verifications to be really effective in producing accurate results. Feel free to take a look at our architectural blog post to learn more in detail: https://blog.withmeka.com/introducing-meka-an-open-source-fr...
Unfortunately, we didn't try it out with open source models, but you are welcome to pull the repo and try with any model that has good visual grounding! (I heard UI-TARS and the latest Qwen visual model are quite good)