esha_manideep's comments

The latest Cursor update, where they started charging for tokens, is pretty good. I don't use non-MAX mode on Cursor anymore.


Claude's limits are so vague - it's not clear whether buying Claude Max is cheaper than just using the API. Has anyone benchmarked this?
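The comparison boils down to a break-even calculation. A minimal sketch, where every price is a placeholder (substitute the current published rates for your plan and model):

```python
# Rough break-even sketch: flat subscription vs. pay-per-token API.
# All prices below are PLACEHOLDERS, not real published rates.

def api_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Monthly API cost in dollars for a given token volume."""
    return (input_tokens / 1e6) * in_price_per_m + \
           (output_tokens / 1e6) * out_price_per_m

SUBSCRIPTION = 100.0             # hypothetical flat monthly price
IN_PRICE, OUT_PRICE = 3.0, 15.0  # hypothetical $ per 1M tokens

# Example month: 20M input tokens, 2M output tokens
monthly = api_cost(20_000_000, 2_000_000, IN_PRICE, OUT_PRICE)
print(f"API: ${monthly:.2f} vs subscription: ${SUBSCRIPTION:.2f}")
print("Subscription wins" if monthly > SUBSCRIPTION else "API wins")
```

The hard part is not the arithmetic but measuring your actual token volume, which is exactly why a benchmark would be useful.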


Sounds like they are great fans of Numberwang.


They check after they scrape


How? Do real people read the millions of pages of internet text to verify it?


Looks like it’s just scrambling each individual word. Seems straightforward to programmatically look for groups of things that aren’t legitimate words on a page.
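A minimal sketch of that check: flag pages where too many tokens are not legitimate dictionary words. The tiny word list here is a stand-in; a real scraper would use a proper dictionary or a language model.

```python
# Flag scrambled pages by the fraction of out-of-dictionary words.
import re

# Stand-in word list; a real check would load a full dictionary.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy",
              "dog", "a", "is", "this", "page", "of", "text"}

def gibberish_ratio(text: str) -> float:
    """Fraction of alphabetic tokens not found in the dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in DICTIONARY)
    return unknown / len(words)

def looks_scrambled(text: str, threshold: float = 0.5) -> bool:
    return gibberish_ratio(text) > threshold

print(looks_scrambled("the quick brown fox jumps over the lazy dog"))  # False
print(looks_scrambled("eht kciuq nworb xof spmuj revo eht yzal god"))  # True
```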


That's a lot of time and bandwidth to waste


Great work, guys! How did you benchmark traiser as 10-20% better? Would love to see exactly how each method scored.


Great question! See this thread:

https://news.ycombinator.com/item?id=40369713


Pretty amazing to see training data being discussed more openly


Indeed. I think part of the reason they are not discussed openly may be that much of the data used is copyrighted, which introduces some legal ambiguities.


IANAL, but hiding something doesn't make someone legally immune. Any company could sue LLM companies, and they can't hide it during the case - e.g. there is already a similar case against OpenAI.


Yes, but it at the very least delays any findings while you rake in the cash and try to create a favorable environment. OpenAI even stated that they think using copyrighted texts is necessary and should be covered by fair use.


These models are compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, open-source) with the latest Stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.


I'm interested in collaborating. For example, from the comments it occurred to me that a 128-bit SIMD register can contain 64 2-bit values. It seems straightforward that SIMD bitwise logical operations could be used in training such models.
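The packing idea above can be sketched without intrinsics: 64 two-bit values fit in one 128-bit word, and lane-wise masking is a single bitwise op. Here a Python integer stands in for a SIMD register; real training code would use intrinsics or a library like numpy.

```python
# Pack 64 two-bit values into one 128-bit word and operate lane-wise.

def pack_2bit(values):
    """Pack up to 64 values in {0,1,2,3} into one 128-bit integer."""
    assert len(values) <= 64 and all(0 <= v <= 3 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (2 * i)
    return word

def unpack_2bit(word, n):
    """Recover the first n two-bit lanes."""
    return [(word >> (2 * i)) & 0b11 for i in range(n)]

vals = [3, 1, 0, 2]
packed = pack_2bit(vals)
assert unpack_2bit(packed, 4) == vals

# Example lane-wise operation: clear the low bit of every 2-bit lane
# with a single AND against a repeating 0b10-per-lane mask.
mask = int("10" * 64, 2)
print(unpack_2bit(packed & mask, 4))  # [2, 0, 0, 2]
```

The same masking trick is how real SIMD code does per-lane logic on sub-byte values, since there are no native 2-bit lanes.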


Highly interested in collaborating - got a bunch of proprietary legal data already pre-sorted and labeled for various scenarios. I've already benchmarked legal use-cases (i.e. legal specialty, a few logic-based questions, and specific document creation) with various LLMs - so would love to see what benchmarks this can produce compared to early Mistral or Llama.

Let me know what's the best way to reach out!


Having a feature to summarise the website would be much better than the first two, as I constantly find myself wishing for such a feature :/


I’ve been working on a project [1] to do just that from within a Chrome extension. The idea was that as an extension, it could make use of the context menu and feel more like a native feature of the browser. I’m always hesitant to link to my things from comments but in this case I think it’s a perfect fit for what you’re describing.

[1] https://smudge.ai


I suspect Google is afraid of getting sued by more publishers :/


The linked post links to another blog post about this.


The Arc browser I use has that feature on hover.

