
chhxdjsj · 2025-12-03T03:37:28 1764733048

Not so relevant to the thread, but I've been uploading screenshots from Citrix GUIs and asking qwen3-vl for the appropriate next action, e.g. a mouse click. While it knows what to click, it struggles to accurately return which pixel coordinates to click. Anyone know a way to get accurate pixel coordinates returned?

spherelot · 2025-12-03T06:11:14 1764742274

How do you prompt the model? In my experience, Qwen3-VL models have very accurate grounding capabilities (I’ve tested Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-30B-A3B-Thinking, and Qwen3-VL-235B-A22B-Thinking-FP8).

Note that the returned values are not direct pixel coordinates. Instead, they are normalized to a 0–1000 range. For example, if you ask for a bounding box, the model might output:

```json
[
  {"bbox_2d": [217, 112, 920, 956], "label": "cat"}
]
```

Here, the values represent [x_min, y_min, x_max, y_max]. To convert these to pixel coordinates, use:

[x_min / 1000 * image_width, y_min / 1000 * image_height, x_max / 1000 * image_width, y_max / 1000 * image_height]
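
As a minimal sketch of the conversion above, assuming the model really does return values normalized to 0-1000 as described: the function names `bbox_to_pixels` and `bbox_center` are made up for illustration, and rounding to whole pixels is my own choice rather than anything the model or its docs prescribe.

```python
def bbox_to_pixels(bbox, image_width, image_height, scale=1000):
    """Convert [x_min, y_min, x_max, y_max] from a 0-`scale` range to pixels.

    `scale=1000` follows the normalization described in this comment;
    treat it as an assumption and check it against your own outputs.
    """
    x_min, y_min, x_max, y_max = bbox
    return [
        round(x_min / scale * image_width),
        round(y_min / scale * image_height),
        round(x_max / scale * image_width),
        round(y_max / scale * image_height),
    ]


def bbox_center(pixel_bbox):
    """Return the integer center of a pixel-space box, a natural click target."""
    x_min, y_min, x_max, y_max = pixel_bbox
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


# The "cat" box from the example, on a hypothetical 1920x1080 screenshot.
pixels = bbox_to_pixels([217, 112, 920, 956], 1920, 1080)
click_x, click_y = bbox_center(pixels)
```

The image size must be that of the screenshot actually sent to the model; if the image is resized before upload, scale the result back to the real screen yourself.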

Also, if you’re running the model with vLLM > 0.11.0, you might be hitting this bug: https://github.com/vllm-project/vllm/issues/29595

chhxdjsj · 2025-12-03T06:55:21 1764744921

Will give this a go, cheers :)

logankeenan · 2025-12-03T04:13:06 1764735186

It’s been about a year since I looked into this sort of thing, but Molmo will give you x,y coordinates. I hacked together a project about it. I think Microsoft’s OmniParser is also good at finding coordinates.

https://huggingface.co/allenai/Molmo-7B-D-0924

https://github.com/logankeenan/george

https://github.com/microsoft/OmniParser

chhxdjsj · 2025-12-03T06:54:57 1764744897

Thanks, I'll try this!

jazzyjackson · 2025-12-03T03:47:11 1764733631

Could you combine it with a classic OCR segmentation process, so that along with the image you also provide box coordinates of each string?
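
One possible reading of this idea, as a rough sketch: run OCR first, list each detected string with its box in the prompt, and ask the model to pick a region by id instead of emitting coordinates. Everything here is hypothetical, including the `ocr_items` shape (a list of dicts with `text` and `box` in pixels), the prompt wording, and both function names. The OCR step itself is left out, since any engine that returns text with boxes would do.

```python
def build_prompt(task, ocr_items):
    """Build a text prompt listing OCR regions so the model can answer with an id."""
    lines = [
        f"Task: {task}",
        "Detected text regions (id: text @ [x_min, y_min, x_max, y_max]):",
    ]
    for i, item in enumerate(ocr_items):
        lines.append(f"{i}: {item['text']!r} @ {list(item['box'])}")
    lines.append("Reply with only the id of the region to click.")
    return "\n".join(lines)


def click_point_for_reply(reply, ocr_items):
    """Map the model's id reply back to the center of the matching OCR box."""
    idx = int(reply.strip())  # raises ValueError if the reply is not a bare id
    x_min, y_min, x_max, y_max = ocr_items[idx]["box"]
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


# Hypothetical OCR output for a dialog with two buttons.
ocr_items = [
    {"text": "Cancel", "box": (100, 500, 200, 540)},
    {"text": "Submit", "box": (300, 500, 420, 540)},
]
prompt = build_prompt("Click the submit button", ocr_items)
point = click_point_for_reply(" 1\n", ocr_items)
```

The trade-off is that the click coordinates come from the OCR engine rather than the model, so this only helps for targets that contain readable text.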

hamasho · 2025-12-03T04:33:12 1764736392

It's not very accurate, but sometimes instructing the model to return pyautogui code works.

  prompt: I attach a screenshot (1920x1080). Write code to click the submit button using pyautogui.
  attachment: <screenshot>
  reply:
    import pyautogui
    pyautogui.click(100, 200)
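
Since the coordinates in such a reply are unreliable, it may be safer to parse them out and sanity-check them than to execute the returned code directly. A small sketch, assuming the reply contains a literal `pyautogui.click(x, y)` call with integer arguments; `extract_click` is a hypothetical helper, not part of pyautogui.

```python
import re

# Matches calls like pyautogui.click(100, 200) with plain integer arguments.
CLICK_RE = re.compile(r"pyautogui\.click\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def extract_click(reply, screen_width, screen_height):
    """Return (x, y) from the first pyautogui.click call, or None if unusable."""
    match = CLICK_RE.search(reply)
    if match is None:
        return None
    x, y = int(match.group(1)), int(match.group(2))
    # Reject points that fall outside the screenshot the model was shown.
    if not (0 <= x < screen_width and 0 <= y < screen_height):
        return None
    return (x, y)


reply = "import pyautogui\npyautogui.click(100, 200)"
target = extract_click(reply, 1920, 1080)
# If target is not None, pass it on yourself: pyautogui.click(*target)
```

This only catches out-of-bounds or missing coordinates; it cannot tell whether an in-bounds point actually lands on the intended button.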

chhxdjsj · 2025-12-03T06:55:57 1764744957

I've been asking for pyautogui output already, but it is still very hit and miss.

8f2ab37a-ed6c · 2025-12-03T03:42:28 1764733348

Also curious about this. I tried https://moondream.ai/ as well for this task and it still felt far from being bulletproof.

visioninmyblood · 2025-12-03T04:24:44 1764735884

You can get the exact coordinates by running a keypoint network to pinpoint where the next click should go. Here is an example of a simple prompt that returns the keypoint location of the next button to click and visually marks that point on the image:

https://chat.vlm.run/c/e12f0153-7121-4599-9eb9-cd8c60bbbd69

chhxdjsj · 2025-11-30T16:10:29 1764519029

chhxdjsj · 2025-11-28T19:06:51 1764356811

E) bitcoin

chhxdjsj · 2025-11-05T20:31:28 1762374688

Hospitals all over the world are switching wholesale to Chinese equipment, particularly Mindray monitors and anaesthetic machines. China could brick all of these hospitals. We are so incredibly dependent on them.

chhxdjsj · 2025-10-10T17:09:53 1760116193

Yep, what the hell

chhxdjsj · 2025-10-01T05:06:46 1759295206

When quantum is near just get everyone to move their btc to quantum secure addresses and then orphan all the other btcs.

igor47 · 2025-10-01T05:46:30 1759297590

Addressed in TFA

chhxdjsj · 2025-10-01T06:45:23 1759301123

chhxdjsj · 2025-09-28T01:14:38 1759022078

Has it really though? Genuinely asking. I’ve checked out a lot since 2010 or so, but I'm not sure I hear anything wildly different. Vaporwave and sorta meme music were quite fresh, but other than that I'm not sure. Maybe it's just part of getting old and having less time to hunt around.

caseyohara · 2025-09-28T04:23:05 1759033385

Yes, it has. In both breadth and depth. People paying attention know this.

Even within techno (my favorite genre), which is already a quite narrow genre in terms of sounds, the variety of novel sounds birthing new techno sub-genres over the last 10-15 years has been wild.

mbac32768 · 2025-09-28T04:32:35 1759033955

What's the endgame? Micro-genres of techno only five people listen to. I can't wait!

input_sh · 2025-09-28T07:29:50 1759044590

I dislike calling them genres, they're more like trends or styles. One producer makes something new and unusual that breaks the established patterns, people like it a lot, other producers copy it, and that cycle continues until fans get bored and move on. That lifecycle usually lasts about 2-5 years, sometimes not even long enough to get a proper name, but if you're into the scene you know that "genre" when you hear it.

To give a recent "mainstream" example, Odd Mob has created a certain sound that blew up in popularity despite not fitting neatly in any of the existing boxes we had (tracks like Get Busy, Losing Control, Palm Of My Hands), other producers copied it and by now you have anonymous shitposters on social media complaining that most new songs sound like they were made by him.

golergka · 2025-09-28T05:53:16 1759038796

As a DJ, the endgame is building a set from a variety of different kinds of music which still sounds great together but doesn’t all follow the same boring formula. And it’s pretty great.

calenti · 2025-09-29T12:11:37 1759147897

More like wine - more profitable in the attention/branding market to make your own label than take someone else's.

chhxdjsj · 2025-09-05T00:32:40 1757032360

Less regulated? Circle has to keep 100% reserves backing all accounts whereas most US banks operate a low fractional reserve and lend mostly to billionaires funding moderately risky leveraged commercial real estate.

mdorazio · 2025-09-05T12:05:29 1757073929

You’re conveniently ignoring what “reserves” means in the GENIUS Act. Unlike regular banks, Circle can use US Treasuries instead of cash so that they earn interest and prop up US government debt at the same time. It’s a clever scheme, but not the same as being forced to hold fiat reserves.

Many other banking regulations also don’t apply. No FDIC insurance and most importantly none of the regulations that apply to true fiduciaries since they are only “custodians”.

Liron · 2025-09-05T02:24:07 1757039047

Yep, I mean regulated with respect to who's allowed to offer and hold digital accounts in the currency. Circle itself may not do anything too funky, but big chunks of USD then get stored elsewhere online for banking-like activity in a way that wouldn't be legal with USD — right?

chhxdjsj · 2025-08-06T06:03:01 1754460181

Why use something which appears to have very similar results to tirzepatide/Mounjaro but, unlike tirzepatide, hasn’t been used by tens of millions of people without obvious issues?

glp1guide · 2025-08-06T06:13:11 1754460791

Well there's no reason, except it's even more effective.

And 100%, using Retatrutide right now is illegal/not a good idea. It is super risky.

That said, anecdata from people with that risk tolerance is certainly worth looking at.

chhxdjsj · 2025-07-30T23:00:32 1753916432

Hi, great work congrats!

Does it use OpenRouter for model selection? Which models did you achieve the WebArena result with? Are there any open source models which are any good for this?

tcwd · 2025-07-30T23:05:12 1753916712

For the WebArena result, we actually used a mixture of models checking each other's work and evaluating in real time. We found the verifications to be really effective in producing accurate results. Feel free to take a look at our architectural blog post to learn more in detail: https://blog.withmeka.com/introducing-meka-an-open-source-fr...

Unfortunately, we didn't try it out with open source models, but you are welcome to pull the repo and try with any model that has good visual grounding! (I heard UI-TARS and the latest Qwen visual model are quite good)