
I've wondered why people have tried all sorts of cumbersome ways to splice metadata onto HTML, like RDFa, but never tried the obvious approach of basing extraction rules on CSS selectors. Often these work without the cooperation of the target site, so long as it uses CSS the way it was supposed to be used (e.g. not Tailwind, Bootstrap, etc.)
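A rough sketch of the idea (Python with BeautifulSoup; the markup, class names, and field mapping are invented for illustration, not taken from any real site):

    # Extraction rules written as plain CSS selectors over semantic class names.
    from bs4 import BeautifulSoup

    html = """
    <article class="recipe">
      <h1 class="recipe-title">Tomato Soup</h1>
      <span class="prep-time">25 min</span>
      <ul>
        <li class="ingredient">4 tomatoes</li>
        <li class="ingredient">1 onion</li>
      </ul>
    </article>
    """

    RULES = {
        "title": "article.recipe h1.recipe-title",
        "prep_time": "article.recipe .prep-time",
        "ingredients": "article.recipe li.ingredient",
    }

    soup = BeautifulSoup(html, "html.parser")
    record = {field: [el.get_text(strip=True) for el in soup.select(sel)]
              for field, sel in RULES.items()}
    print(record)
    # {'title': ['Tomato Soup'], 'prep_time': ['25 min'],
    #  'ingredients': ['4 tomatoes', '1 onion']}

Nothing is needed from the site beyond it keeping those class names stable.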


Back in the optimistic 2000s there was the idea of GRDDL – using XSLT stylesheets and XPath selectors for extracting stuff, e.g. microformats, HTML meta, FOAF, etc:

https://www.w3.org/TR/grddl/
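The flavour of it, sketched with lxml rather than a real GRDDL processor (the stylesheet and input document below are invented; actual GRDDL links the transform from the document itself via profile/namespace attributes rather than hard-coding it):

    # Apply an XSLT transform to an (X)HTML document to pull structured data out.
    from lxml import etree

    xslt = etree.XML(b"""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- one "name,url" line per hCard-style vcard block -->
      <xsl:template match="/">
        <xsl:for-each select="//*[contains(@class, 'vcard')]">
          <xsl:value-of select=".//*[contains(@class, 'fn')]"/>
          <xsl:text>,</xsl:text>
          <xsl:value-of select=".//*[contains(@class, 'url')]/@href"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
    """)

    doc = etree.XML(b"""
    <html><body>
      <div class="vcard">
        <a class="fn url" href="https://example.org/">Example Person</a>
      </div>
    </body></html>
    """)

    transform = etree.XSLT(xslt)
    print(str(transform(doc)))   # Example Person,https://example.org/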


Having learned XPath and a little XSLT, I've always wondered why they aren't more popular. They seem like a powerhouse for reading and transforming data from XML-type documents. I've found it hard to find decent resources for learning more than the basics (and none for XQuery) because of the lack of popularity nowadays, but I do think it's a skill you should have, like SQL and regex. Seems like a no-brainer.
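For anyone who hasn't tried it, a few XPath queries (here via lxml; the XML is invented for the example) give a feel for why it's worth knowing:

    # XPath can filter, navigate, and even compute inside the query itself.
    from lxml import etree

    doc = etree.XML(b"""
    <library>
      <book year="1999"><title>XSLT Basics</title><price>10</price></book>
      <book year="2008"><title>XQuery Notes</title><price>25</price></book>
    </library>
    """)

    print(doc.xpath("//book/title/text()"))                 # all titles
    print(doc.xpath("//book[@year > 2000]/title/text()"))   # titles of books after 2000
    print(doc.xpath("sum(//book/price)"))                    # 35.0, computed in the query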


I’ve thought about that. My first take on XSLT was that it was “too complicated”, but when I got to talking with XSLT enthusiasts later I found out how many good ideas XSLT has in it.

My take is that some specifications can be written out in a linear way, where you can start reading at the beginning and work to the end without feeling like you need to read ahead.

Some specs have a minor discontinuity. I remember perceiving it in the K&R book on C, but it seemed like there was just one kink in it, and if you read the book twice you’d do OK.

Books on C++ are worse, with numerous topics that resist being put in the right order. It’s not unusual for “resource acquisition is initialization” to be mentioned hundreds of times before it is defined, for instance.

That circularity is both a function of the domain and a function of the text. I think a certain amount of circularity is inherent to many domains, but frequently you can bootstrap a domain by dividing it into numerous layers and putting the circularity into a layer built just to manage it.

XSLT, XML Schema, and many other XML specs have that kind of circular structure: you are left wondering what exact kind of machine is required to implement them, so you look at the spec and have a hard time understanding how to do the easy things and no grasp of the hard-looking things that are actually easy. Couple that with numerous sharp edges in XML, such as numeric values not being allowed in ID or IDREF fields (hate to break it to them, but numeric identifiers are rampant in the industry), and it is no wonder people would rather use deeply lame ‘standards’ like JSON that lack comments, aren’t really clear about the semantics of numbers, and don’t have the moral authority to say “quit screwing around and just use ISO 8601 dates.”
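That ID sharp edge is easy to see with a DTD validator (a sketch using lxml; the element and attribute names are made up):

    # XML ID values must match the Name production, so purely numeric
    # identifiers are rejected by a validating parser.
    from io import StringIO
    from lxml import etree

    dtd = etree.DTD(StringIO("<!ELEMENT item EMPTY> <!ATTLIST item id ID #REQUIRED>"))

    print(dtd.validate(etree.fromstring('<item id="a123"/>')))  # True
    print(dtd.validate(etree.fromstring('<item id="123"/>')))   # False: IDs can't start with a digit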

Now I’ve finally realized the OWL spec is perfectly clear, in the sense that you can understand what it really does by understanding the mapping of OWL axioms to first-order logic; the trouble is that logic is the most treacherous branch of mathematics.
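Two of the standard correspondences, schematically (my notation, not quoted from the spec):

    \mathrm{SubClassOf}(C, D) \;\mapsto\; \forall x\,\bigl(C(x) \rightarrow D(x)\bigr)

    \mathrm{ObjectPropertyDomain}(R, C) \;\mapsto\; \forall x\,\forall y\,\bigl(R(x,y) \rightarrow C(x)\bigr)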


CSS selectors have been common in the scrapers I've been using for years.


I quite like the microformats approach to this. https://developer.mozilla.org/en-US/docs/Web/HTML/microforma...


Sadly the trend does seem to be a move away from semantic CSS. I get the appeal of Tailwind for creating components and custom designs, but it's surprising to see content-heavy sites like the BBC no longer using class attributes in their news articles the way they used to.



