If I may go on a tangent, I’ve thought a bit about the problem of link farms and how it might be addressed.
The problem is essentially this: since pagerank is basically the probability that a random walk through the link graph will end up on your site, linking back to yourself, and no other websites, gives a big boost to your pagerank because a random walk will get stuck on your site. Of course it’s easy to just ignore self-links, but you can get essentially the same effect through clique-like groups of websites and this can be more difficult to detect.
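To make the self-link effect concrete, here is a minimal sketch of the random-walk ranking as a power iteration. The three-page graph, the page names (`hub`, `a`, `b`), the damping factor of 0.85 and the uniform restart are all assumptions chosen for illustration; this is not how any real search engine is configured.

```python
def pagerank(links, damping=0.85, iterations=200):
    """Power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Restart term: the walker jumps to a uniformly random page.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


# Hypothetical toy web: "hub" links to "a" and "b".
# In the first graph both link back; in the second, "b" links only to itself.
honest_graph = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"]}
selfish_graph = {"hub": ["a", "b"], "a": ["hub"], "b": ["b"]}

honest_rank = pagerank(honest_graph)
selfish_rank = pagerank(selfish_graph)
print(honest_rank["b"], selfish_rank["b"])
```

In this toy graph, b's share rises from roughly 0.26 to roughly 0.74 once it stops linking out, purely because the walker gets trapped there until the next restart.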
What’s interesting is that an algorithm based on how electrical current flows (so a link is a one-way resistor, i.e. a resistor in series with a diode) would not have this problem. Attaching a conductive loop to some point in a circuit does not change how current flows. Electrons don’t get stuck in loops because they don’t drift around randomly; they move from lower voltage to higher voltage.
Tangents are disallowed. But I'll grant you a hyperbolic trajectory.
Link graphs remind me of lightning descending leaders. If you can have a sense of charge potential between cloud and ground, there might be a circuit equivalent which drains largely self-referential link-farms.
Or is that rephrasing your description? My circuit physics / EE-fu is exceedingly weak.
I think so? There is an obvious circuit equivalent where links are interpreted as one-way resistors. You get a charge potential automatically given the circuit and a choice of source / sink nodes (which you need to decide on anyway to apply pagerank).
The notion of a potential & sink seems to be key, and the idea that if you identify a collectively chained circuit, it cannot be both source and sink to itself. Which is what link farms (self-referral) or "mutual admiration societies" are. The question is whether or not you can determine the interconnectedness. Since DNS obscures true ownership relations, that's a challenge.
Cluster analysis generally shows such relations though.
(I'm pretty sure these questions have generated multiple PhDs at Google.)
“The question is whether or not you can determine the interconnectedness.” I’m not sure what you mean by this. The algorithm I'm describing doesn't require any data beyond what is already used by PageRank.
I have a feeling that you think the electrical potential must be defined by some ad-hoc method before applying an electrical algorithm, and this requires fancy techniques like cluster analysis and the like? This is not the case. Let me re-emphasize that you only need to specify the network of resistors (which is the link graph) and the source and sink, and then the potential is defined automatically in terms of those things (the same way as in physical circuits).
Hacker News starts to severely rate-limit replies in deeply nested threads so if you want to continue discussing this you can email me (email in profile).
In regard to the question, the latter. The source corresponds roughly to the E vector of the PageRank whitepaper. It's a set of websites that are axiomatically good. The PageRank paper suggested the Netscape home page and John McCarthy's homepage as examples, or alternatively making every website a source, although the last option is more prone to abuse. The sink is harder to find a good analogy for. I think a reasonable choice would be to add a new node to the graph representing the sink, and then connect every website to it via a directed resistor. Then websites could be ranked by how much current flows through their resistor to the sink.
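A rough sketch of that construction, under several assumptions of mine: every link is a unit resistor with an ideal diode, source pages are pinned at potential 1, the added sink node is pinned at 0, and the potentials are found by a simple node-by-node relaxation with bisection. The function name `circuit_rank` and the toy graphs are hypothetical.

```python
def circuit_rank(links, sources, sweeps=200):
    """Rank pages by the current through their resistor into an added sink."""
    SINK = object()  # extra node, held at potential 0
    # Drop self-links (both ends sit at the same potential, so no current)
    # and give every page one directed unit resistor into the sink.
    out = {p: [q for q in ts if q != p] + [SINK] for p, ts in links.items()}
    incoming = {p: [] for p in links}
    for p, targets in out.items():
        for q in targets:
            if q is not SINK:
                incoming[q].append(p)
    volt = {p: (1.0 if p in sources else 0.0) for p in links}
    volt[SINK] = 0.0

    def net_current(p, v):
        # Diode rule: a link conducts only from higher to lower potential.
        inflow = sum(max(0.0, volt[q] - v) for q in incoming[p])
        outflow = sum(max(0.0, v - volt[q]) for q in out[p])
        return inflow - outflow

    for _ in range(sweeps):
        for p in links:
            if p in sources:
                continue
            lo, hi = 0.0, 1.0  # net_current is non-increasing in v
            for _ in range(50):
                mid = (lo + hi) / 2
                if net_current(p, mid) > 0:
                    lo = mid
                else:
                    hi = mid
            volt[p] = (lo + hi) / 2
    # Unit resistor into a sink at 0: current equals the page's potential.
    return {p: volt[p] for p in links}


honest = circuit_rank({"hub": ["a", "b"], "a": ["hub"], "b": ["hub"]}, {"hub"})
selfish = circuit_rank({"hub": ["a", "b"], "a": ["hub"], "b": ["b"]}, {"hub"})
farm = circuit_rank(
    {"hub": ["a", "b"], "a": ["hub"], "b": ["hub", "c"], "c": ["b"]}, {"hub"}
)
print(honest["b"], selfish["b"], farm["b"])
```

In these toy graphs the self-link leaves b's score unchanged, and adding a mutual-link partner c lowers it slightly, because some of b's current now drains through c instead.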
Edit: By the way, upon thinking about this more, I've just realized that I'm not sure about the exact mechanism by which link farms boost pagerank. As far as I can tell there are two possibilities:
A. The original "stuck in a loop" explanation, by which you can illegitimately magnify legitimately earned rank.
B. With the every-page-as-source strategy, you can boost the rank of a website by creating a new website and linking to it, even if nothing links to the new website. Then the other extra links in the link farm just exist to avoid detection of this sort of thing. So you are illegitimately creating rank from scratch, rather than magnifying existing rank.
Admittedly the electrical circuit thing only addresses mechanism A. Mechanism B can be addressed by being more discriminating about which websites are in the source. That applies whether you use pagerank or the electrical thing.