"Not sure where we go from here. I don't want my posts slurped up by AI companies for free^[1] but what else can I do?"
Why not display a brief notice, like the ones on US government websites, that is impossible to miss? In this case the notice could state the terms and conditions for using the website: in effect, a brief copyright license that governs the use of material found on the website. The license could include a term prohibiting use of the material in machine learning and neural networks, including "training LLMs".
The idea is that even if these "AI" companies are complying with copyright law when using others' data for LLMs without permission, they would still be violating the license, and this could be used to overcome any fair use defense that the "AI" company intends to rely on.
https://www.authorsalliance.org/2023/02/23/fair-use-week-202...
Like relying on robots.txt, the contents of a User-Agent header (if there is one), or IP addresses, this costs nothing. Unlike robots.txt, User-Agent, or IP addresses, it has potential legal enforceability.
That potential might be enough to deter some of these "AI" projects. You never know until you try.
Clearly, robots.txt, the User-Agent header, and IP addresses do not work.
Why would anyone aware of www history rely on the user-agent string as an accurate source of information?
As early as 1992, a year before the www went public, "user-agent spoofing" was expected.
https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...
By 1998, webmasters who relied on user-agent strings were referred to as "ill-advised":
"Rather than using other methods of content-negotiation, some ill-advised webmasters have chosen to look at the User-Agent to decide whether the browser being used was capable of using certain features (frames, for example), and would serve up different content for browsers that identified themselves as ``Mozilla''."
"Consequently, Microsoft made their browser lie, and claim to be Mozilla, because that was the only way to let their users view many web pages in their full glory: Mozilla/2.0 (compatible; MSIE 3.02; Update a; AOL 3.0; Windows 95)"
https://www-archive.mozilla.org/build/user-agent-strings.htm...
https://webaim.org/blog/user-agent-string-history/
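To make the point concrete, here is a minimal Python sketch of why sniffing fails on the string quoted above. The helper names (naive_is_mozilla, claims_msie) and the example.com URL are made up for illustration; this is not how any particular server actually did it.

```python
import urllib.request

# The User-Agent string quoted above, as sent by MSIE 3.02 under AOL.
MSIE_UA = "Mozilla/2.0 (compatible; MSIE 3.02; Update a; AOL 3.0; Windows 95)"


def naive_is_mozilla(user_agent: str) -> bool:
    """Hypothetical 1990s-style check: trust the leading product token."""
    return user_agent.startswith("Mozilla/")


def claims_msie(user_agent: str) -> bool:
    """Look inside the parenthesised comment for an MSIE token."""
    start = user_agent.find("(")
    end = user_agent.rfind(")")
    if start == -1 or end < start:
        return False
    tokens = [t.strip() for t in user_agent[start + 1:end].split(";")]
    return any(t.startswith("MSIE ") for t in tokens)


# The naive check is fooled: the string "is" Mozilla and also says MSIE.
print(naive_is_mozilla(MSIE_UA))
print(claims_msie(MSIE_UA))

# Any client can claim any identity. Building the request sends nothing;
# example.com is a placeholder.
req = urllib.request.Request("https://example.com/", headers={"User-Agent": MSIE_UA})
print(req.get_header("User-agent"))
```

The header is whatever the client decides to type in, so a server that filters on it only filters out the clients that choose to be honest.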
As for robots.txt, many sites do not even have one.
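A rough sketch of what that means in practice, using Python's standard urllib.robotparser. The crawler names ExampleBot and OtherBot and the URL are hypothetical, and the rules are parsed from in-memory lines rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

URL = "https://example.com/posts/1"  # placeholder URL

# A robots.txt that tries to shut out one (hypothetical) crawler.
BLOCKING = ["User-agent: ExampleBot", "Disallow: /"]


def allowed(robots_lines, agent, url):
    """Ask what a well-behaved crawler would conclude from these rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


print(allowed(BLOCKING, "ExampleBot", URL))  # the named bot is told no
print(allowed(BLOCKING, "OtherBot", URL))    # a bot under another name is not
print(allowed([], "ExampleBot", URL))        # empty rules: everything is allowed
```

Note that the answer only matters to a crawler that bothers to ask. Nothing in the file stops a client that skips the check or shows up under a different name.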