That's cool! At some point, I thought about making a clone of HN myself that filters out everything that is not a blog post. However, I couldn't come up with solid filter criteria.
I agree that your method is not quite there yet: there are still a lot of large domains (airbnb.com, spiegel.de, spectator.co, ...), but you got started, and that is already more than I ever did ;)
I would suggest including the HN metadata, such as the number of upvotes and comments. In combination with the title, these are important criteria for me when deciding whether to click on something.
A link is to a blog post if and only if a) the linked-to page contains a feed autodiscovery tag and b) the autodiscovered feed contains an entry for the same page and c) that feed entry is at least half as long as the page (in words).
That test isn't quite right. The false negatives include links to old blog posts, and the false positives include the honourable few sites that provide full-textish feeds of something other than a blog. But it's pretty good if you want to filter away content marketing and read tech blogs.
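A minimal Python sketch of the three-part test above, using only the standard library. It assumes RSS 2.0 or Atom feeds, exact URL matching between the page and the feed entry, and a crude whitespace word count. The names `is_blog_post` and `fetch` are made up for illustration; `fetch` is any callable that takes a URL and returns the feed XML as a string, for example a thin wrapper around `urllib.request.urlopen`.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
ATOM = "{http://www.w3.org/2005/Atom}"
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"


class _PageScanner(HTMLParser):
    """Collects the first autodiscovered feed href and the page's visible words."""

    def __init__(self):
        super().__init__()
        self.feed_href = None
        self.words = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif (tag == "link" and self.feed_href is None
              and "alternate" in (a.get("rel") or "").lower().split()
              and (a.get("type") or "").lower() in FEED_TYPES):
            self.feed_href = a.get("href")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.split())


def _word_count(html_fragment):
    # Strip tags naively, then count whitespace-separated tokens.
    return len(re.sub(r"<[^>]+>", " ", html_fragment).split())


def _first_text(parent, *tags):
    # Text of the first child element that exists, else an empty string.
    for tag in tags:
        el = parent.find(tag)
        if el is not None:
            return "".join(el.itertext())
    return ""


def _entry_words(feed_xml, page_url):
    """Word count of the feed entry that links to page_url, or None if absent."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):  # RSS 2.0
        if (item.findtext("link") or "").strip() == page_url:
            return _word_count(_first_text(item, CONTENT_ENCODED, "description"))
    for entry in root.iter(ATOM + "entry"):  # Atom
        if page_url in [l.get("href") for l in entry.findall(ATOM + "link")]:
            return _word_count(_first_text(entry, ATOM + "content", ATOM + "summary"))
    return None


def is_blog_post(page_url, page_html, fetch):
    scanner = _PageScanner()
    scanner.feed(page_html)
    if not scanner.feed_href:
        return False  # (a) no feed autodiscovery tag
    feed_xml = fetch(urljoin(page_url, scanner.feed_href))
    entry_words = _entry_words(feed_xml, page_url)
    if entry_words is None:
        return False  # (b) the feed has no entry for this page
    return 2 * entry_words >= len(scanner.words)  # (c) entry is at least half the page
```

Exact URL comparison will miss entries whose links differ by a trailing slash or tracking parameters, so a real filter would probably want to normalize URLs first.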
Thank you. Filtering is indeed the issue, and like you, I haven't found a way to filter out all the big sites.
The filters I'm using are:
- users who post too often
- domains that appear too frequently
- a list of blacklisted words in the title
- a list of blacklisted domains
I'd say I already filter out about 80% of links, which leaves a list short enough to go through every day (about 200 posts).
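As a rough illustration of those four filters, here is a Python sketch. It is not the actual implementation: the post fields (`user`, `url`, `title`), the function name `filter_posts`, and the thresholds `max_per_user` and `max_per_domain` are all assumptions, and frequency is counted within the batch passed in rather than over a time window.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def filter_posts(posts, blocked_words, blocked_domains,
                 max_per_user=2, max_per_domain=2):
    """Return the posts that survive all four filters.

    posts: list of dicts with 'user', 'url' and 'title' keys (assumed shape).
    blocked_words / blocked_domains: sets of lower-case strings.
    """
    user_counts = Counter(p["user"] for p in posts)
    domain_counts = Counter(domain_of(p["url"]) for p in posts)

    kept = []
    for p in posts:
        domain = domain_of(p["url"])
        title_words = set(re.findall(r"\w+", p["title"].lower()))
        if user_counts[p["user"]] > max_per_user:
            continue  # user posts too often
        if domain_counts[domain] > max_per_domain:
            continue  # domain appears too frequently
        if title_words & blocked_words:
            continue  # blacklisted word in the title
        if domain in blocked_domains:
            continue  # blacklisted domain
        kept.append(p)
    return kept
```

Calling `filter_posts(posts, {"hiring"}, {"airbnb.com"})` would, under these assumptions, drop every post from airbnb.com, every title containing "hiring", and anything from a user or domain that shows up more than twice in the batch.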
About the HN metadata: I don't think it is a good idea to show the upvotes, because that is exactly where the issue is. People tend not to read a post with few upvotes, but that doesn't mean it isn't interesting. The same goes for comments, so I don't display the count either, but you can still reach the HN comment page by clicking on 'hn link'.
That's an interesting problem. Aside from HN, I frequent TechMeme a lot for the more news-y side of tech. They have a leaderboard section [1] that ranks all of the biggest tech publications. That could be a good starting filter?