
You keep treating the count of wrong labels as if it automatically determined the probability of being wrong. That is only true once you have already assumed a uniform distribution, and nothing in probability theory forces that assumption.

Let

Ω = {y₁, y₂, …, yₙ} (sample space = the labels)

y ∈ Ω (the single correct label)

P : Ω → [0,1] with ∑ᵢ P(yᵢ)=1 (a probability measure)

Define two events

Correct = {y}

Wrong = Ω \ {y}

Then

P(Correct) = P({y}) = P(y)

Because P is arbitrary apart from normalisation, we are free to set:

P(y) = 0.50

P(any other yᵢ) = 0.50 / (n-1)

That instantly gives P(Correct) = 0.50, P(Wrong) = 0.50.

The outcome space collapses to a Bernoulli(½) coin-flip no matter whether n = 2 or n = 10⁹.
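
Here is the same point as a few lines of Python (a minimal sketch; the function name and the 0.5 choice are mine): however many labels there are, the success probability is simply whatever mass the distribution puts on the correct one.

    def success_probability(n_labels, correct_index, mass_on_correct=0.5):
        # Put `mass_on_correct` on the right label and spread the rest
        # uniformly over the remaining n-1 labels.
        probs = [(1.0 - mass_on_correct) / (n_labels - 1)] * n_labels
        probs[correct_index] = mass_on_correct
        assert abs(sum(probs) - 1.0) < 1e-9   # still a valid distribution
        return probs[correct_index]

    print(success_probability(2, 0))        # 0.5
    print(success_probability(10**6, 0))    # 0.5 -- label count is irrelevant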

Going back to your marble example:

99 black marbles (wrong)

1 red marble (right)

Uniform draw => 1% success. But "uniform" is a weight choice. Make the red marble 99x heavier (or magnetic, or add 98 dummy red slips you ignore when grading):

P(red) = 99 / (99 + 99·1) = 0.50

P(any single black) = 1 / 198 ≈ 0.00505, and the 99 blacks together still carry only 0.50

Same 100 physical marbles, now 50% success. The count of wrong ways (99) never changed; only the weights did.
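
A quick simulation of that weighted draw (a sketch using Python's standard random.choices; the weights are exactly the ones assumed above):

    import random

    # 1 red (correct) marble weighted 99, 99 black (wrong) marbles weighted 1 each.
    marbles = ["red"] + ["black"] * 99
    weights = [99] + [1] * 99

    draws = random.choices(marbles, weights=weights, k=100_000)
    print(draws.count("red") / len(draws))   # ~0.5, despite 99 wrong marbles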

This matters for ML classifiers because every multiclass classifier is ultimately scored on accuracy, i.e.

Pr(output = ground-truth)

That accuracy is exactly P(Correct) above. The model’s internal distribution over labels (learned or engineered) determines that number. Uniform guessing over 100 labels gives 1% accuracy. A better model might concentrate 50% mass on the right label and reach 50% accuracy, which is literally a coin-flip in outcome space even though 99 wrong labels remain.
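
A toy illustration of that, assuming a model that samples its output from its predicted distribution (the 100-label setup, the index 42, and the helper names are hypothetical):

    import random

    LABELS = list(range(100))     # a 100-way classification problem
    TRUE_LABEL = 42               # hypothetical ground-truth label

    def predict_proba(concentrated):
        # Toy model: uniform guessing vs. 0.5 mass on the true label.
        if not concentrated:
            return [1 / 100] * 100            # uniform -> expect ~1% accuracy
        p = [0.5 / 99] * 100
        p[TRUE_LABEL] = 0.5                   # half the mass on the right label
        return p

    def accuracy(concentrated, trials=100_000):
        probs = predict_proba(concentrated)
        outputs = random.choices(LABELS, weights=probs, k=trials)
        return sum(o == TRUE_LABEL for o in outputs) / trials

    print(accuracy(False))   # ~0.01
    print(accuracy(True))    # ~0.50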

As to your strawman, I never said every real-world question lands at 50%. I said: if a system places 0.5 probability mass on the correct label, then its success odds are 50%, FULL STOP. Whether that distribution is realistic for "What day of the week is it?" or "Which stock will lead the S&P tomorrow?" is an empirical question, but it has nothing to do with the mere fact that there are many wrong answers.

Probability theory says that success is whatever probability mass you assign to the single correct label; the label count is irrelevant once the distribution is non-uniform.

The short math proof above settles that and ends our discussion.


