Every paragraph that you've included up there just reinforces my point.
The recursive behavior isn't incidental; it's literally part of the definition of a crawler. You can't just skip past that and pretend that the people who specifically included the word "recursive" (or the phrase "many pages") didn't really mean it.
The first paragraph of the two about access controls is the context for what "should not be accessed" means. It refers to "very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)": pages that should not be indexed by search engines but that, for the most part, aren't a problem for something like Perplexity. As I said in my comment, it's about search engine crawlers and indexers.
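To make that concrete, here's a hypothetical robots.txt in the spirit of those guidelines; every path below is invented for illustration:

```
# Hypothetical paths, one per category the guidelines list.
User-agent: *
Disallow: /cgi-bin/vote   # a CGI script with side effects
Disallow: /calendar/      # a very deep virtual tree of generated pages
Disallow: /tmp/           # temporary information
```

All three are hazards for a program that recursively walks every link it finds, not for one that fetches a single article.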
I'm glad that you at least cherry-picked a paragraph from that second page, because I was starting to worry that you weren't even reading your sources to check whether they support your argument. That said, the paragraph does very little for you: it gives one example of what isn't a robot, which doesn't imply that everything else is, and you're deliberately ignoring that the same page is also very specific about the recursive nature of the robots being protected against.
Again, this is the definition that you just cited, which can't possibly include a single request from Perplexity's server (emphasis added):
> WWW Robots (also called wanderers or spiders) are programs that traverse *many pages* in the World Wide Web by *recursively* retrieving linked pages.
The only way you can possibly apply that definition to the behavior in TFA is if you delete most of it and just end up with "programs ... that traverse ... the WWW", at which point you've also included normal web browsers in your new definition.
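For contrast, here's what "recursively retrieving linked pages" actually looks like; this is a toy sketch with an invented seed URL, not any real bot:

```python
# Toy sketch of the cited definition: a crawler follows the links it
# discovers, recursively. A single ad hoc fetch never does this.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, seen, depth=2):
    """Retrieve a page, then recursively retrieve the pages it links to."""
    if depth == 0 or url in seen:
        return
    seen.add(url)
    html = urlopen(url).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    for link in parser.links:
        crawl(urljoin(url, link), seen, depth - 1)

crawl("https://example.com/", set())  # invented seed URL

# The behavior in TFA, by contrast, is a single request:
# urlopen("https://example.com/article").read()
```

Strip the recursion out of `crawl()` and all that's left is the one-line fetch at the bottom, which is exactly the "programs ... that traverse ... the WWW" reduction I'm describing.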
It honestly just feels like you have a lot of beef with LLM tech, which is fair, but there are much better arguments to be made against LLMs than "Perplexity's ad hoc requests are made by a crawler and should respect robots.txt". Your sources do not back up what you claim; on the contrary, they support mine in every respect. Either find better sources or try a different argument.
Perplexity's ad hoc requests are still made by a crawler, whether you believe it or not. A web browser presents the content directly to the user. There may be extensions or features (reader mode, for example) that modify the retrieved content in the browser, but Perplexity's summarization feature does not present the content directly to the user in any way.
It honestly just feels like you abandon critical thinking when it comes to LLM tech and want to pretend that an autonomous crawler isn't a crawler just because it retrieves only a single page for processing.
I have used, with the permission of the site owner, a crawler to retrieve data from a single URL on a scheduled basis. It is fully automated data retrieval not intended for direct user consumption, and THAT is what makes it a crawler. If the page I was retrieving the data from were disallowed in `/robots.txt`, the site owner would expect that an automated program would not pull it. Recursiveness is not what makes a web robot; unattended and/or disconnected requests do.
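A minimal sketch of what I mean, in Python; the URL, user agent, and hourly schedule are invented placeholders, and the robots.txt check uses the standard library's `urllib.robotparser`:

```python
# Sketch of a fully automated, non-recursive fetcher that still honors
# robots.txt. Target URL, user agent, and schedule are hypothetical.
import time
import urllib.request
from urllib.robotparser import RobotFileParser

TARGET_URL = "https://example.com/data/latest.json"  # hypothetical
USER_AGENT = "ExampleDataBot/1.0"                    # hypothetical

def allowed_by_robots(url, user_agent):
    """Ask the site's robots.txt whether this automated fetch is permitted."""
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def fetch_once():
    if not allowed_by_robots(TARGET_URL, USER_AGENT):
        return None  # the owner has opted this page out of automated access
    req = urllib.request.Request(TARGET_URL, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

while True:  # the "scheduled basis": one unattended request per hour
    fetch_once()
    time.sleep(3600)
```

There's no recursion anywhere in that program, yet `can_fetch()` exists precisely because site owners expect unattended programs like this one to ask first.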
You are inventing your own definition for a term that is widely understood and clearly and unambiguously defined in sources that you yourself cited. Since you can't engage honestly with your own sources, I see no value in continuing this conversation.