
We need to bring the term Autoped back. Beats the snot out of escooter

Personally, I suspect Autoped wouldn't do too well; it would provoke too many jokes about robo-pedophiles. Not quite as bad as Nonce Finance, though.

"Beats the snot out of escooter"

Shit, not snot. Please do the job properly! I agree: Autoped is a far better name.


Maybe we should pretend escooter is a Spanish word.

I wonder if it's under an open license. Not as impactful as seat belts, but it would be nice to see Volvo continue that legacy.


> Not as impactful as seat belts, but it would be nice to see Volvo continue that legacy

I'm afraid that legacy is long lost; Volvo is a very different company today than it used to be.


Volvo no longer exists. It's a brand name owned by a conglomerate, the Zhejiang Geely Holding Group.


I mean, it does exist: it goes by "Volvo Car AB", and it's a real company owned by Geely Holding (full name "Zhejiang Geely Holding Group Co., Ltd.").

It just isn't the same as it used to be, back in the "seat belts are for everyone" era.


Unless they're covered by a design patent, it's a free-for-all anyway in many places: https://en.wikipedia.org/wiki/Intellectual_property_protecti...


I think you are missing the point. The font curves/shapes/Bézier outlines are not copyrightable. But the source code and the resulting font file are (everywhere). Fonts are licensed just like software (or, more precisely, like software plugins).

So you can take any typeface and trace/redraw it just fine. But you can't use the original font files unless you have a proper license.


That's exactly what I was wondering, since I vaguely remember that all this "copyrighted fonts" silly business boils down to the exact source code, and the same shape can be represented a hundred ways. So what's the big deal, anyway? I've never tried it, but I'd assume that making a "different" copy of a font with minimal human intervention must be a trivial computing task by now. Sure, in theory there are subtleties like the many possible ligatures and kerning, but I doubt it's really that critical. And it only matters if all you have is a picture with so many letter combinations. If you have the actual font file, you have the full information anyway.

And if so, why do people still even bother with all that "font licenses" stuff? I'd think the only reason to buy a font by now must be when a design studio actually does custom work for you. And the emphasis is on "custom", because it isn't truly "for you": anyone is effectively free to use it after you've used it once anywhere.


Fonts might seem trivial, but there is actually a lot of engineering going on underneath. There is a whole programming language inside a font that allows it to do what it does: spacing between characters, how glyphs should be rasterized on screens, and staying widely compatible.

So, in the same way one might expect "it's just a CRUD app", it might seem like a trivial computing task to make a "different" copy. It isn't, unless you do some decompilation of the font, which breaks the license.

About why people bother… maybe the biggest issue typography (and a lot of design in general) has is that, done right, it's mostly invisible or natural. You notice typography only when it's done badly - it's very subconscious. That doesn't mean it's not an ongoing topic, with experts dedicating their lives to it. And for them, even the differences between variations of Helvetica matter. If you look around, typography is absolutely everywhere - you probably have thousands of different fonts just in your home. You probably don't notice them, but you would if our society's standards were lower.
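
If you want to see that engineering for yourself, cracking open a font file shows far more than outlines. A rough sketch using the fontTools library (the file name is just a placeholder):

    # Minimal sketch using fontTools (pip install fonttools) to peek inside a font.
    # "SomeFont.ttf" is a placeholder; point it at any .ttf/.otf you have locally.
    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")

    # A font file is a collection of tables, not just curves:
    # 'glyf'/'CFF ' hold the outlines, 'GPOS'/'kern' hold spacing rules,
    # 'fpgm'/'prep'/'cvt ' hold the hinting programs that steer rasterization.
    print(sorted(font.keys()))

    # Kerning and other layout features live in GPOS lookups.
    if "GPOS" in font:
        gpos = font["GPOS"].table
        if gpos.LookupList is not None:
            print("GPOS lookups:", gpos.LookupList.LookupCount)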


This is 100% the why.


Whenever I see this much vehement agreement about something on HN, it sets off serious groupthink alarm bells.

Idk what the answer is, but it is not 100% this. It’s too simple and satisfying of an answer to be true.


I understand what you mean, but it does match MBA/McKinsey thinking very closely.

Make a metric a goal, work tirelessly towards that new metric.

Does it make the product better? Well, the product is already made, so it doesn’t make a difference.

It’s only software developers who think a product is never “done”; normal MBA thinking is “we have invested in R&D, now there is a product, how do we get as many users of our product as possible”.


You don't think the reason we have seemingly broken optimization is because poorly thought out metrics are being gamed?

That's all it's been for the last few decades. Everyone is now "data driven" and "metrics oriented". That's a footgun - if people can game it, they will, and the numbers don't say what people think they say.


Normally I would agree, but I've seen this happen too often. Common sense be damned, just make the number look good.


osv.dev exists and is worlds better


To be fair the CVE system can't even encode a version string


Not sure whether this is a limitation of the scanning tooling or of the CVE format itself, but it also cannot express subpackages. So if some Jackson-very-specific-module has a CVE, the whole of Jackson gets marked as impacted. Same with Netty.
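
By contrast, an osv.dev record is keyed to an ecosystem, an exact package name, and explicit version ranges, so the very-specific-module case is expressible. A rough sketch of asking the public OSV API about one exact package version (the Maven coordinates and version below are just examples):

    # Rough sketch: query osv.dev for vulnerabilities affecting one exact
    # package version. The package name and version are illustrative.
    import json
    import urllib.request

    query = {
        "package": {"ecosystem": "Maven",
                    "name": "com.fasterxml.jackson.core:jackson-databind"},
        "version": "2.13.0",
    }
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # Each returned entry names the affected package and version ranges
    # explicitly, instead of flagging "Jackson" as a whole.
    for vuln in result.get("vulns", []):
        print(vuln["id"], vuln.get("summary", ""))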


AI is going great


He's not wrong about actions being neglected


Yeah, but you can't call developers "monkeys" and "losers" and violate your own Code of Conduct.


True eventually, but not today


Exhibit A is GitHub user joelreymont, who seems to be making a habit of this behavior. He did very similar spamming on the OCaml repo: github.com/ocaml/ocaml/pull/14369


Reminds me of Blindsight by Peter Watts. Aliens viewed our radio signals as a type of malware aimed to consume the resources of a recipient for zero payoff and reduced fitness. This is the same.


This is absolutely insane. If you look at joelreymont's recent activity on GitHub, there is what I would consider a bomb of AI slop PRs, with thousands and thousands of changes, AI-generated summaries/notes, copyright issues, you name it.

People like this are ruining open source for the rest of us, and those poor maintainers really have their work cut out for them sorting through this slop.


What are you going to do? You can't expect some sort of self-censorship based on righteousness and morals. I see joelreymont as a pioneer planting Chesterton's fences. LET THE MAN COOK!


Has anyone done a talk/blog/whatever on how LLM crawlers are different from classical crawlers? I'm not up on the difference.


IMO there was something of a de facto contract, pre-LLMs, that the set of things one would publicly mirror/excerpt/index and the set of things one would scrape were one and the same.

Back then, legitimate search engines wouldn’t want to scrape things that would just make their search results less relevant with garbage data anyways, so by and large they would honor robots.txt and not overwhelm upstream servers. Bad actors existed, of course, but were very rarely backed by companies valued in the billions of dollars.

People training foundation models now have no such constraints or qualms - they need as many human-written sentences as possible, regardless of the context in which they are extracted. That’s coupled with a broader familiarity with ubiquitous residential proxy providers that can tunnel traffic through consumer connections worldwide. That’s an entirely different social contract, one we are still navigating.


For all its sins, Google had a vested interest in the sites it was linking to staying alive. LLMs don't.


That's a shortcut. LLM providers are very short-sighted, but not to that extreme: live websites are needed to produce new data for future training. Edit: damn, I've seen this movie before


Text, images, video, all of it. I can’t think of any form of data they don’t want to scoop up, other than noise and poisoned data.


I am not well versed in this problem, but can't web servers rate limit by the known IP addresses of these crawlers/scrapers?


Not the exact same problem, but a few months ago I tried to block YouTube traffic from my home by IP (I was writing a parental app for my child). After a few hours of trying to collect IPs, I gave up, realizing that YouTube was dynamically load-balanced across millions of IPs, some of which also served traffic from other Google services I didn't want to block.

I wouldn't be surprised if it was the same with LLMs. Millions of workers allocated dynamically on AWS, with varying IPs.

In my specific case, as I was dealing with browser-initiated traffic, I wrote a Firefox add-on instead. No such shortcut for web servers, though.


Yoric, dropping some knowledge vis a vis the downstream regarding DNS:

* https://www.dnsrpz.info/

* https://github.com/m3047/rear_view_rpz


Thanks!


Why not run local DNS at your router and do the block there? It can even be per-client with AdGuard Home.


I did that, but my router doesn't offer a documented API (or even SSH access) that I can use to reprogram DNS blocks dynamically. I wanted to stop YouTube only during homework hours, so enabling/disabling it a few times per day quickly became tiresome.


Your router almost certainly lets you assign a DNS server instead of using whatever your ISP sends down, so you point it at an internal device running your own DNS.

Your DNS mostly passes lookup requests through, but during homework time, when there’s a request for the IP of “www.youtube.com”, it returns an IP of your choice instead of the actual one. The domain's TTL is 5 minutes.

Or don't; technical solutions to social problems are of limited value.


Any solution based on this sounds monstrously more complicated than my browser add-on.

And technical band-aids for hyperactivity, however imperfect, are damn useful.


I think dnsmasq plus a cron job on a server of your choice will do this pretty easily. With an LLM you could set this up in less than 15 minutes if you already have a server somewhere (even one in the home).
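
Roughly like this, assuming a Linux box running dnsmasq (the config path, domain, and service name are assumptions for illustration; dnsmasq needs a restart because it does not re-read its config files on SIGHUP):

    #!/usr/bin/env python3
    # Rough sketch: run from cron with "block" or "unblock" to toggle a
    # dnsmasq drop-in that sinkholes YouTube during homework hours.
    # Paths, domain, and service name are illustrative assumptions.
    import subprocess
    import sys

    CONF = "/etc/dnsmasq.d/homework-block.conf"

    if __name__ == "__main__":
        if sys.argv[1] == "block":
            with open(CONF, "w") as f:
                # address=/youtube.com/0.0.0.0 answers for the domain and
                # all its subdomains with an unroutable address.
                f.write("address=/youtube.com/0.0.0.0\n")
        else:
            with open(CONF, "w") as f:
                f.write("")  # empty drop-in: nothing blocked
        # dnsmasq does not re-read config files on SIGHUP, so restart it.
        subprocess.run(["systemctl", "restart", "dnsmasq"], check=True)

Two crontab lines along the lines of "0 17 * * 1-5 /usr/local/bin/homework.py block" and "0 19 * * 1-5 /usr/local/bin/homework.py unblock" would then handle the schedule.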


Thanks for the tip.

In this case, I don't have a server I can conveniently use as DNS. Plus I wanted to also control the launching of some binaries, so that would considerably complicate the architecture.

Maybe next time :)


Makes sense! Keeping your home tech simple is definitely a recipe for a happier life when you have kids, haha.


A browser add-on wouldn't do the job. The use case was a parent controlling a child's behavior, not someone controlling their own.


Yes, my kid has ADHD. The browser add-on does the job of slowing down the impulse to go to YouTube (and a few online gaming sites) during homework hours.

I've deployed the same one for myself, but set up for Reddit during work hours.

Both of us know how to get around the add-on. It's not particularly hard. But since Firefox is the primary browser for both of us, it does the trick.


For those who don’t want to build their own add-on, Cold Turkey Blocker works quite well. It supports multiple browsers and can block apps too.

I'm not affiliated with them, but it has helped me when I really need to focus.

https://getcoldturkey.com/


They rely on residential proxies powered by botnets — often built by compromising IoT devices (see: https://krebsonsecurity.com/2025/10/aisuru-botnet-shifts-fro... ). In other words, many AI startups — along with the corporations and VC funds backing them — are indirectly financing criminal botnets.


You cannot block LLM crawlers by IP address, because some of them use residential proxies. Source: 1) a friend admins a slightly popular site and has decent bot detection heuristics, 2) just Google “residential proxy LLM”, they are not exactly hiding. Strip-mining original intellectual property for commercial usage is big business.


How does this work? Why would people let randos use their home internet connections? I googled it but the companies selling these services are not exactly forthcoming on how they obtained their "millions of residential IP addresses".

Are these botnets? Are AI companies mass-funding criminal malware companies?


>Are these botnets? Are AI companies mass-funding criminal malware companies?

Without a doubt some of them are botnets. AI companies got their initial foothold by violating copyright en masse with pirated textbook dumps for training data, and whatnot. Why should they suddenly develop scruples now?


It used to be Hola VPN, which would let you use someone else’s connection while someone else could use yours - that part was communicated transparently. But that same Hola client would also route traffic for business users. I’m sure many other free VPN clients do the same thing nowadays.


I have seen it claimed that's a way of monetizing free phone apps. Just bundle a proxy and get paid for that.


A recent HN thread about this: https://news.ycombinator.com/item?id=45746156


So the user either has a malware proxy running requests without being noticed, or voluntarily signed up as a proxy to make extra $ off their home connection. Either way, I don't care if their IP is blocked. The only problem is if users behind CGNAT get their IP blocked; then legitimate users may later be blocked too.

Edit: ah yes, another person above mentioned VPNs - that's a good possibility. Another vector is users on mobile selling the extra data they don't use to third parties. Probably many more ways to acquire endpoints.


“Known IP addresses” to me implies an infrequently changing list of large datacenter ranges. Maintaining a dynamic list of individual IPs (along with any metadata required for throttling purposes) is a different undertaking with a higher level of effort.

Of course, if you don’t care about affecting genuine users then it is much simpler. One could say it’s collateral damage and show a message suggesting to boycott companies and/or business practices that prompted these measures.


Large cloud providers could offer that solution, but then crawlers can also cycle through IPs.



The only real difference is that LLM crawlers tend not to respect /robots.txt, and some of them hammer sites with pretty heavy traffic.

The trap in the article has a link. Bots are instructed not to follow the link. The link is normally invisible to humans. A client that visits the link is probably therefore a poorly behaved bot.
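
A minimal version of that trap, sketched with Flask (the path name and in-memory ban set are illustrative only; real setups usually push bans to a firewall or reverse proxy):

    # Rough sketch of a crawler trap: the trap URL is disallowed in robots.txt
    # and only reachable via a link humans never see, so anything requesting
    # it is almost certainly a badly behaved bot.
    from flask import Flask, abort, request

    app = Flask(__name__)
    banned = set()                  # in-memory for illustration only
    TRAP_PATH = "/do-not-follow"    # hypothetical path

    @app.before_request
    def reject_banned_clients():
        if request.remote_addr in banned:
            abort(403)

    @app.route("/robots.txt")
    def robots():
        # Well-behaved crawlers read this and skip the trap.
        body = "User-agent: *\nDisallow: %s\n" % TRAP_PATH
        return body, 200, {"Content-Type": "text/plain"}

    @app.route(TRAP_PATH)
    def trap():
        banned.add(request.remote_addr)  # real setups would ban at the firewall
        abort(403)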


Recently there have been more crawlers coming from tens to hundreds of IP netblocks across dozens (or more!) of ASNs, in a highly time- and URL-correlated fashion, with spoofed user-agent(s) and no regard for rate or request limiting or robots.txt. These attempt to visit every possible permutation of URLs on the domain and have a lot of bandwidth and established TCP connections available to them. It’s not that this didn’t happen pre-2023, but it’s noticeably more common now. If you have a public webserver, you’ve probably experienced it at least once.

Actual LLM involvement as the requesting user-agent is vanishingly small. It's the same problem as ever: corporations, their profit motive during $hypecycle coupled with access to capital for IT resources, and the protection of the abusers via the company's abstraction away of legal liability for their behavior.


The crawlers themselves are not that different: it is their number, how the information is used once scraped (including referencing, or the lack thereof), and whether they obey the rules:

1. Their number: every other company and the mangy mutt that is its mascot is scraping for LLMs at the moment, so you get hit by them far more than you get hit by search engine bots and similar. This makes them harder to block too, because even ignoring tricks like using botnets to spread requests over many source addresses (potentially the residential connections of unwitting users infected by malware), the sheer number coming from so many places, new places all the time, means you cannot maintain a practical blocklist of source addresses. The number of scrapers out there means that small sites can easily be swamped, much like when HN, Slashdot, or a popular subreddit links to a site and it gets “hugged to death” by a sudden glut of individual people who are interested.

2. Use of the information: Search engines actually provide something back: sending people to your site. Useful if that is desirable, which in many cases it is. LLMs don't tend to do that, though: by their very nature, very few results from them come with any indication of the source of the data they use for their guesswork. They scrape, they take, they give nothing back. Search engines had a vested interest in your site surviving, as they don't want to hand out dead links; those scraping for LLMs have no such requirement, because they can still summarise your work from what is effectively cached within their model. This isn't unique to LLMs; go back a few years to the pre-LLM days and you will find several significant legal cases about search engines offering summaries of the information found instead of just sending people to the site where the information is.

3. Ignoring rules: Because so many sites are attempting to block scrapers now, usually at a minimum using accepted methods to discourage it (robots.txt, nofollow attributes, etc.), these signals are just ignored. Sometimes this is malicious, with people running the scrapers simply not caring despite knowing the problem they could create; sometimes it is like the spam problem in mail: each scraper thinks it'll be fine because it is only them, with each of the many others thinking the same thing… With people as big as Meta openly defending piracy as just fine for the purposes of LLM training, others see that as a declaration of open season. Those that are malicious, or at least amoral (most of them), don't care. Once they have scraped your data they have, as mentioned above, no vested interest in whether your site lives or dies (either by withering away from lack of attention or falling over under their load, never to be brought back up); in fact, they might have an incentive to want your site dead: it would no longer compete with the LLM as a source of information.

No one of these is the problem, but together they are a significant problem.

