Hacker News | chhxdjsj's comments

Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking Qwen3-VL for the appropriate next action, e.g. a mouse click. While it knows what to click, it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates returned?

How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).

Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:

```json
[
  {"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```

Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:

[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
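A minimal Python sketch of that conversion, assuming the reply is plain JSON in the shape shown above and the screenshot is 1920x1080. The helper names `bbox_to_pixels` and `click_point` are made up for illustration, and clicking the center of the box is just one reasonable choice of target:

```python
import json


def bbox_to_pixels(bbox, image_width, image_height, scale=1000):
    """Convert [x_min, y_min, x_max, y_max] on a 0-1000 grid to pixel coordinates."""
    x_min, y_min, x_max, y_max = bbox
    return [
        x_min / scale * image_width,
        y_min / scale * image_height,
        x_max / scale * image_width,
        y_max / scale * image_height,
    ]


def click_point(bbox, image_width, image_height):
    """Return the integer pixel at the center of the box as a click target."""
    x_min, y_min, x_max, y_max = bbox_to_pixels(bbox, image_width, image_height)
    return round((x_min + x_max) / 2), round((y_min + y_max) / 2)


# Example reply, copied from the JSON above; 1920x1080 is an assumed screenshot size.
reply = '[{"bbox_2d": [217, 112, 920, 956], "label": "cat"}]'
detections = json.loads(reply)
box = detections[0]["bbox_2d"]

pixels = bbox_to_pixels(box, 1920, 1080)
center = click_point(box, 1920, 1080)
print(pixels, center)
```

The resulting `center` tuple could then be passed to something like `pyautogui.click(*center)`, provided the screenshot and the screen share the same resolution.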

Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595


Will give this a go, cheers :)

It’s been about a year since I looked into this sort of thing, but Molmo will give you x,y coordinates. I hacked together a project around it. I also think Microsoft’s OmniParser is good at finding coordinates.

https://huggingface.co/allenai/Molmo-7B-D-0924

https://github.com/logankeenan/george

https://github.com/microsoft/OmniParser


Thanks, I'll try this!

Could you combine it with a classic OCR segmentation process, so that along with the image you also provide box coordinates of each string?
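A rough sketch of what that could look like in Python, assuming some OCR step has already produced (text, box) pairs in pixel coordinates. The OCR output, the prompt wording, and the model reply below are all made-up placeholders, and whether this actually improves accuracy is untested:

```python
def build_prompt(ocr_boxes, task):
    """List each OCR string with an index and its pixel box so the model can answer by index."""
    lines = [
        f"Task: {task}",
        "Text regions found on screen (index: text at x_min,y_min,x_max,y_max):",
    ]
    for i, (text, box) in enumerate(ocr_boxes):
        lines.append(f"{i}: {text!r} at {box[0]},{box[1]},{box[2]},{box[3]}")
    lines.append("Reply with only the index of the region to click.")
    return "\n".join(lines)


def center_of(ocr_boxes, index):
    """Return the integer center of the chosen OCR box."""
    _, (x_min, y_min, x_max, y_max) = ocr_boxes[index]
    return (x_min + x_max) // 2, (y_min + y_max) // 2


# Hypothetical OCR output for a dialog with two buttons.
ocr_boxes = [
    ("Cancel", (100, 900, 220, 940)),
    ("Submit", (260, 900, 380, 940)),
]

prompt = build_prompt(ocr_boxes, "click the submit button")
model_reply = "1"  # placeholder for whatever the model returns
target = center_of(ocr_boxes, int(model_reply.strip()))
print(prompt)
print(target)
```

The idea is that the model only has to pick an index, and the click coordinates come from the OCR boxes rather than from the model.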

It's not very accurate, but sometimes instructing the model to return pyautogui code works.

  prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
  attachment: <screenshot>
  reply:
    import pyautogui
    pyautogui.click(100, 200)

I've been asking for pyautogui output already, but it is still very hit-and-miss.

Also curious about this. I tried https://moondream.ai/ for this task as well, and it still felt far from bulletproof.

You can get exact coordinates by running a keypoint network to pinpoint where the next click should go. Here is a simple example prompt that returns the keypoint location of the next button to click and marks that point on the image:

https://chat.vlm.run/c/e12f0153-7121-4599-9eb9-cd8c60bbbd69


HFSP


E) bitcoin


Hospitals all over the world are switching wholesale to Chinese equipment, particularly Mindray monitors and anaesthetic machines. China could brick all of these hospitals. We are so incredibly dependent on them.


Yep, what the hell


When quantum is near, just get everyone to move their BTC to quantum-secure addresses and then orphan all the other BTC.


Addressed in TFA


Nope


Has it really though? Genuinely asking. I’ve checked out a lot since 2010 or so, but I'm not sure I hear anything wildly different. Vaporwave and sort of meme music was quite fresh, but other than that I'm not sure. Maybe it's just part of getting old and having less time to hunt around.


Yes, it has. In both breadth and depth. People paying attention know this.

Even within techno (my favorite genre), which is already a quite narrow genre in terms of sounds, the variety of novel sounds birthing new techno sub-genres over the last 10-15 years has been wild.


What's the endgame? Micro-genres of techno only five people listen to. I can't wait!


I dislike calling them genres, they're more like trends or styles. One producer makes something new and unusual that breaks the established patterns, people like it a lot, other producers copy it, and that cycle continues until fans get bored and move on. That lifecycle usually lasts about 2-5 years, sometimes not even long enough to get a proper name, but if you're into the scene you know that "genre" when you hear it.

To give a recent "mainstream" example, Odd Mob has created a certain sound that blew up in popularity despite not fitting neatly in any of the existing boxes we had (tracks like Get Busy, Losing Control, Palm Of My Hands). Other producers copied it, and by now you have anonymous shitposters on social media complaining that most new songs sound like they were made by him.


As a DJ, the endgame is building a set from a variety of different kinds of music which still sounds great together but doesn’t all follow the same boring formula. And it’s pretty great.


More like wine - more profitable in the attention/branding market to make your own label than take someone else's.


Less regulated? Circle has to keep 100% reserves backing all accounts whereas most US banks operate a low fractional reserve and lend mostly to billionaires funding moderately risky leveraged commercial real estate.


You’re conveniently ignoring what “reserves” means in the GENIUS Act. Unlike regular banks, Circle can use US Treasuries instead of cash so that they earn interest and prop up US government debt at the same time. It’s a clever scheme, but not the same as being forced to hold fiat reserves.

Many other banking regulations also don’t apply. No FDIC insurance and most importantly none of the regulations that apply to true fiduciaries since they are only “custodians”.


Yep, I mean regulated with respect to who's allowed to offer and hold digital accounts in the currency. Circle itself may not do anything too funky, but big chunks of USD then get stored elsewhere online for banking-like activity in a way that wouldn't be legal with USD — right?


Why use something that appears to have very similar results to tirzepatide/Mounjaro but hasn’t been used by tens of millions of people without obvious issues, as tirzepatide has?


Well there's no reason, except it's even more effective.

And 100%, using Retatrutide right now is illegal/not a good idea. It is super risky.

That said, anecdata from people with that risk tolerance is certainly worth looking at.


Hi, great work, congrats!

Does it use OpenRouter for model selection? Which models did you achieve the WebArena result with? Are there any open source models that are any good for this?


For the WebArena result, we actually used a mixture of models checking each other's work and evaluating in real time. We found the verifications to be really effective in producing accurate results. Feel free to take a look at our architectural blog post to learn more in detail: https://blog.withmeka.com/introducing-meka-an-open-source-fr...

Unfortunately, we didn't try it out with open source models, but you are welcome to pull the repo and try with any model that has good visual grounding! (I heard UI-TARS and the latest Qwen visual model are quite good)

