Looks like it’s just scrambling each individual word. Seems straightforward to programmatically look for groups of things that aren’t legitimate words on a page.
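Something like this rough C sketch is what I had in mind: flag runs of tokens that aren't in a word list (the tiny dictionary and the run threshold here are placeholders, not anything from the article):

    /* Sketch: flag runs of tokens that don't appear in a word list,
     * which is how per-word scrambling would stand out on a page.
     * The dictionary here is a tiny stand-in for a real word list. */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static const char *dict[] = {"the", "quick", "brown", "fox", "jumps"};

    static int in_dict(const char *w) {
        for (size_t i = 0; i < sizeof dict / sizeof *dict; i++)
            if (strcmp(w, dict[i]) == 0) return 1;
        return 0;
    }

    int main(void) {
        char text[] = "the qiuck bworn fox jmups";
        int run = 0;  /* consecutive non-dictionary tokens */
        for (char *w = strtok(text, " "); w; w = strtok(NULL, " ")) {
            for (char *p = w; *p; p++) *p = (char)tolower((unsigned char)*p);
            if (!in_dict(w)) {
                if (++run >= 2) printf("suspicious run ending at: %s\n", w);
            } else {
                run = 0;
            }
        }
        return 0;
    }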
Indeed. I think part of the reason they are not discussed openly may be that much of the data used is copyrighted, which introduces some legal ambiguities.
IANAL, but hiding something doesn't make someone legally immune. Any company could sue the LLM companies, and they couldn't hide the data during the case; e.g. there is already a similar case against OpenAI.
Yes, but it at the very least delays any findings while you rake in the cash and try to create a favorable environment. OpenAI even stated that they think using copyrighted texts is necessary and should be covered by fair use.
These models are compatible with llama.cpp out of the box. We (GigaML - https://gigaml.com) are planning to train a small model (3-4B, 1-bit, open source) on the latest stack-v2 dataset released today. Let me know if anyone is interested in collaborating with us.
I'm interested in collaborating. For example, from the comments it occurred to me that a 128-bit SIMD register can contain 64 2-bit values. It seems straightforward that SIMD bitwise logical operations could be used in training such models.
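To make that concrete, here is a minimal C sketch of the idea (the 2-bit encoding and the mask are my own assumptions, just to show that a single SSE2 bitwise op touches all 64 packed codes at once):

    /* Sketch: 64 two-bit weight codes packed into one 128-bit SSE2 register,
     * processed with a single bitwise AND. Encoding is an assumption:
     * 0b00 = 0, 0b01 = +1, 0b10 = -1. */
    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t packed[16];                 /* 16 bytes * 4 codes = 64 codes */
        for (int i = 0; i < 16; i++)
            packed[i] = 0x66;               /* four codes per byte: 01 10 01 10 */

        __m128i w = _mm_loadu_si128((const __m128i *)packed);

        /* One AND isolates the low bit of every 2-bit code simultaneously. */
        __m128i low_mask = _mm_set1_epi8(0x55);      /* 01010101 per byte */
        __m128i low_bits = _mm_and_si128(w, low_mask);

        uint8_t out[16];
        _mm_storeu_si128((__m128i *)out, low_bits);
        printf("low bits of first byte: 0x%02x\n", out[0]);
        return 0;
    }

Whether bitwise tricks like this help during training (rather than just inference) is a separate question, but the packing itself is trivial.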
Highly interested in collaborating – I've got a bunch of proprietary legal data already pre-sorted and labeled for various scenarios. I've already benchmarked legal use-cases (e.g. legal specialty, a few logic-based questions, and specific document creation) with various LLMs – so I'd love to see what benchmarks this can produce compared to early Mistral or Llama.
I’ve been working on a project [1] to do just that from within a Chrome extension. The idea was that as an extension, it could make use of the context menu and feel more like a native feature of the browser. I’m always hesitant to link to my things from comments but in this case I think it’s a perfect fit for what you’re describing.