tjsk's comments

tjsk · 2025-06-17T00:35:07 1750120507

what made you fork browser-use? what were the missing bits? your use case sounds similar to what they're trying with their new workflow-use repo (I am not affiliated with them, just curious)

arcb · 2025-06-17T01:21:25 1750123285

It's a great repo! We had issues with iframes and framesets (old DOM tags), which we had to write custom code for. Some DOMs need annotation to give an LLM meaning: for example, a button is clearly an "add demographics" button to the human eye but is ambiguous in the DOM (ul contains li...). Some bottlenecks in navigation required manual attention; we keep those to a minimum. I think the future is being able to progress from highly deterministic JS code to more agentic, LLM-driven decisions. One does need to be able to control this for performance, cost, and accuracy. And yes, we have some overlap with workflow-use's direction, but I hope more such OSS methods gain popularity! It'd simply mean we can go after higher-value, more complex clinical tasks!
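To make the "deterministic first, agentic when needed" idea concrete, here is a minimal sketch, not the actual implementation. The names `run_step`, `StepResult`, and the `llm_calls` budget are hypothetical, and the agentic handler is just a placeholder callable standing in for an LLM-driven decision.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class StepResult:
    value: str
    source: str  # "deterministic" or "agentic"


def run_step(
    deterministic: Callable[[], Optional[str]],
    agentic: Callable[[], str],
    budget: Dict[str, int],
) -> StepResult:
    """Try the cheap scripted handler first; escalate only if it fails.

    `budget["llm_calls"]` is a hypothetical cap used to keep cost and
    latency under control.
    """
    result = deterministic()
    if result is not None:
        return StepResult(result, "deterministic")
    if budget["llm_calls"] <= 0:
        raise RuntimeError("deterministic step failed and no LLM budget left")
    budget["llm_calls"] -= 1  # spend one agentic call
    return StepResult(agentic(), "agentic")
```

The budget is one simple way to keep the performance, cost, and accuracy trade-off explicit: a scripted step that succeeds never spends an agentic call.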

tjsk · 2025-06-17T15:04:19 1750172659

Did you consider working around those using vision models instead of DOM parsing? Was cost/latency the bottleneck? Seems like the agentic future you describe would need more vision-based parsing.

arcb · 2025-06-17T17:44:57 1750182297

I believe we will at some point. It's all a question of the right need coming up. Text OCR has gotten really good, and if you think of it from a UI perspective, the only real contract is that a screen will show text that's representative of the information entered. The DOM is useful but is a changeable contract!
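A small sketch of the "visible text is the contract" point, assuming plain HTML and using only Python's standard `html.parser`. This is not OCR; extracting text from markup is a stand-in for what a screen would show, and the two markup snippets are made-up examples.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text a user would see, ignoring script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)


# Two hypothetical versions of the same control: the structure changes,
# the text shown on screen does not.
old_markup = "<ul><li><span>Add demographics</span></li></ul>"
new_markup = "<div role='button'><b>Add demographics</b></div>"
```

A selector written against `ul > li > span` would break on the second snippet, while a lookup keyed on the displayed text would survive the redesign.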

tjsk · 2025-05-01T22:18:25 1746137905

Slack is owned by Salesforce, which is doing its own Agentforce stuff.

spacebanana7 · 2025-05-01T22:55:54 1746140154

Salesforce loves acquisitions. I can already picture Benioff’s victory speech on CNBC.

tjsk · 2025-03-20T20:22:12 1742502132

I’ve been experimenting with different LLM + search combos too, but results have been mixed. One thing I’m particularly interested in is improving retrieval for both images and videos. Right now, most tools seem to rely heavily on metadata or simple embeddings, but I wonder if there’s a better way to handle complex visual queries. Have you tried anything for video search as well, or are you mainly focused on images? Also, what kinds of queries have you tested?