
I don't think this robots.txt is valid:

  User-agent: Googlebot PetalBot Bingbot YandexBot Kagibot
  Disallow: /bomb/*
  Disallow: /bomb
  Disallow: /babble/*

  Sitemap: https://maurycyz.com/sitemap.xml
I think this is telling a single bot named "Googlebot PetalBot Bingbot YandexBot Kagibot" - which doesn't exist - not to visit those URLs, while every other bot is still allowed to. User-agent is supposed to be one name per line, and there's no User-agent: * group specified here.
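
That's also how Python's urllib.robotparser reads it - a rough sketch, not necessarily what Googlebot's own parser does, but it follows the same grouping rules:

  from urllib.robotparser import RobotFileParser

  # robots.txt as currently served: all bot names crammed into one User-agent line
  served = [
      "User-agent: Googlebot PetalBot Bingbot YandexBot Kagibot",
      "Disallow: /bomb/*",
      "Disallow: /bomb",
      "Disallow: /babble/*",
  ]
  rp = RobotFileParser()
  rp.parse(served)
  # No group name matches "Googlebot", so access falls through to "allowed"
  print(rp.can_fetch("Googlebot", "https://maurycyz.com/bomb/page"))  # True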

So a much simpler solution than setting up a Markov generator might be for the site owner to just publish a valid robots.txt. It's not evident to me that the bots crawling this site are in fact breaking any rules. I also suspect that Googlebot, being served the Markov slop, will treat the site as spam. Meanwhile, this incentivizes AI companies to build heuristics to detect this kind of thing rather than to build rules-respecting crawlers.



You're correct; it should read:

    User-agent: Googlebot
    User-agent: PetalBot
    User-agent: Bingbot
    User-agent: YandexBot
    User-agent: Kagibot
    Disallow: /bomb/*
    Disallow: /bomb
    Disallow: /babble/*
    
    Sitemap: https://maurycyz.com/sitemap.xml
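
With that change, the same urllib.robotparser sketch (again, illustrative only - real crawlers ship their own parsers) shows Googlebot matching the group and the /bomb pages being blocked:

    from urllib.robotparser import RobotFileParser

    corrected = [
        "User-agent: Googlebot",
        "User-agent: PetalBot",
        "User-agent: Bingbot",
        "User-agent: YandexBot",
        "User-agent: Kagibot",
        "Disallow: /bomb/*",
        "Disallow: /bomb",
        "Disallow: /babble/*",
    ]
    rp = RobotFileParser()
    rp.parse(corrected)
    # Googlebot now matches the user-agent group, so the path is disallowed
    print(rp.can_fetch("Googlebot", "https://maurycyz.com/bomb/page"))  # False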



