Training a second agent as a qualitative evaluator ("LLM-as-a-judge") works pretty well. You train it on expert-labelled critiques, iterate a few times, then point it at your ground-truth human-labelled data (the "golden dataset"). The quantitative output metric is human2ai alignment on the golden dataset, mixed with some expert judgment of the critiques the AI actually produces.
This has worked well for me; you can typically get the judge's agreement with humans to within the range of human2human variance.
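For the concrete metric, here's a minimal sketch of what "human2ai alignment on the golden dataset" can look like, with made-up labels; the names (agreement, golden_labels, judge_labels, rater_a, rater_b) are purely illustrative, and in practice you might prefer something like Cohen's kappa over raw agreement:

    # Minimal sketch: score a judge model against a golden dataset and
    # compare to inter-annotator (human2human) agreement.
    # All names and data here are illustrative, not from the original post.

    def agreement(a, b):
        """Fraction of items where two label lists agree."""
        assert len(a) == len(b)
        return sum(x == y for x, y in zip(a, b)) / len(a)

    # Human-labelled golden dataset (e.g. pass/fail verdicts from experts).
    golden_labels = ["pass", "fail", "pass", "pass", "fail"]

    # Labels produced by the trained judge model on the same items.
    judge_labels = ["pass", "fail", "fail", "pass", "fail"]

    # Two independent expert annotators, to estimate human2human variance.
    rater_a = ["pass", "fail", "pass", "pass", "fail"]
    rater_b = ["pass", "fail", "pass", "fail", "fail"]

    human2ai = agreement(golden_labels, judge_labels)
    human2human = agreement(rater_a, rater_b)

    print(f"human2ai alignment:   {human2ai:.2f}")
    print(f"human2human baseline: {human2human:.2f}")

If human2ai lands at or above the human2human baseline, the judge is about as consistent with your experts as they are with each other.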