
So, the clustering situation is a bit of a double-edged sword. Here are the pros and cons I've found:

Pros:

1) Allows for instant similarity searching. I'm using the Postgres cube data type to index users' votes, and it has a limit of 100 dimensions. So if I didn't use clusters there could only be 100 cards maximum, and ideally you'd have even fewer than that, because cube can start to slow down as you approach that limit.

2) It's also a bit of a privacy feature as people can only see how you voted along cluster lines, and not how you voted on individual cards. This provides the aforementioned plausible deniability.
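To make the first point concrete, here is a rough sketch of the idea, not the site's actual code. It assumes votes are encoded as -1/0/+1 and that each cluster is collapsed to the average of its cards' votes; the function names and the averaging rule are made up for illustration. The Euclidean distance stands in for what a cube distance query would compute in Postgres.

```python
from math import sqrt

CUBE_MAX_DIM = 100  # default dimension limit of the Postgres cube type


def cluster_vector(votes, clusters):
    """Collapse per-card votes (-1, 0, +1) into one number per cluster.

    votes:    {card_name: vote}
    clusters: {cluster_name: [card_name, ...]}
    Averaging is an assumption; the source does not say how votes are combined.
    """
    vec = []
    for name in sorted(clusters):  # fixed order so vectors are comparable
        cast = [votes[c] for c in clusters[name] if c in votes]
        vec.append(sum(cast) / len(cast) if cast else 0.0)
    if len(vec) > CUBE_MAX_DIM:
        raise ValueError("too many clusters to fit in one cube value")
    return vec


def distance(a, b):
    """Euclidean distance between two cluster vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(target, others, k=3):
    """Return the k user names whose vectors are closest to target."""
    return sorted(others, key=lambda name: distance(target, others[name]))[:k]
```

The vector length depends on the number of clusters rather than the number of cards, which is why the card count can grow past 100. It also means only the per-cluster values are stored for comparison, not the individual votes.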

Cons:

1) Not all clusters are ideal, as you've seen. I spent a lot of time exploring different clustering algorithms and none of them were perfect. Some cards were naturally a part of multiple clusters and others didn't align to any at all. I'm sure a lot of this comes down to card choice, which I definitely could improve.

2) Can be confusing to users, as opposed to just listing their own and others' votes on cards.

---

If you'd like to help and create better clusters, I'd definitely be open to tweaking them. Most of the required data can be found by navigating to the Cards page. For example, if you go there and click "War on Drugs" and then click "View Correlations", you'll find that "War on Terror" correlates the most with "War on Drugs". This data can then be used to try and create your own clusters. I've found it to be a very tricky puzzle satisfying all the constraints.
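For anyone who wants to try this offline, here is a minimal sketch of how card-to-card correlations could be computed from raw votes. It assumes Pearson correlation over users who voted on both cards; the site may well compute "View Correlations" differently, and the vote data in the usage note is invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length vote lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a card everyone voted the same way on carries no signal
    return cov / (sx * sy)


def most_correlated(votes_by_card, card):
    """Return (other_card, r) for the card that correlates most with `card`.

    votes_by_card: {card_name: [vote_of_user_0, vote_of_user_1, ...]}
    """
    scores = {
        other: pearson(votes_by_card[card], votes)
        for other, votes in votes_by_card.items()
        if other != card
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With made-up votes such as `{"War on Drugs": [1, -1, 1, -1, 1], "War on Terror": [1, -1, 1, -1, -1], "Card X": [-1, 1, 1, -1, 1]}`, calling `most_correlated(data, "War on Drugs")` picks "War on Terror". Cards with high pairwise correlation are candidates for the same cluster.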

In the end, for performance reasons, I felt like I had to choose between either having clusters or only 50-75 cards, and I chose clusters. There's probably a better way of doing it, but I was unable to find it at the time.



Have you manually reviewed some of these clusters? Automatic clustering for subjective topics tends to get gamed by users (if they can influence it) or labeled incorrectly.


Yeah, the first pass for the clustering used scikit-learn's agglomerative hierarchical clustering, and then the clusters were manually tweaked in an attempt to increase accuracy. Clusters are also manually updated and curated, so they shouldn't be gameable at the moment.
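For readers unfamiliar with the technique, here is a small pure-Python stand-in for that first pass. It is not scikit-learn's implementation and not the site's code: it assumes each card is described by its list of votes across users, uses Euclidean distance with average linkage (the actual settings aren't stated), and the function name is hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    points: {card_name: [vote_of_user_0, vote_of_user_1, ...]}
    Returns a list of clusters, each a sorted list of card names.
    """
    clusters = [[name] for name in points]  # start with one cluster per card

    def linkage(a, b):
        # Average linkage: mean distance over all cross-cluster pairs.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(points[x], points[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return [sorted(c) for c in clusters]
```

The output of something like this would only be a starting point; the manual tweaking and curation described above happens afterwards.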



