The "person one" vs "person two" bias seems trivially solvable by running each pair evaluation twice with each possible labelling and the averaging the scores.
Although of course that behavior may be a signal that the model is sort of guessing randomly rather than actually producing a signal.
Agreed on the second part. Correcting for bias this way might average out the scores but not in a way that correctly evaluates the HN comments.
The LLM isn't performing the desired task.
It sounds possible to cancel out the comments where reversing the labels swaps the outcome because of bias. That will leave the more "extreme" HN comments that it consistently scored regardless of the label. But that still may not solve the intended task.
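Something like this, reusing the hypothetical `judge` from the sketch above: keep a pair only if the winner is the same under both label orders, and drop it when the verdict flips with the labels.

```python
from typing import Optional


def consistent_winner(comment_a: str, comment_b: str) -> Optional[str]:
    forward = judge(comment_a, comment_b)        # comment_a as "person one"
    reverse = 1.0 - judge(comment_b, comment_a)  # comment_a as "person two", score flipped
    winner_forward = comment_a if forward > 0.5 else comment_b
    winner_reverse = comment_a if reverse > 0.5 else comment_b
    # If the verdict flips when the labels are swapped, it was likely positional
    # bias rather than a real judgment, so discard the pair.
    return winner_forward if winner_forward == winner_reverse else None
```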