There is a sample space (the choices, e.g. 100 different status labels) and an event space (how the system grades your choice: right or wrong).
My statement is true no matter how many choices there are, or how skewed the probabilities are. Your count of 99 incorrect labels is perfectly fine, but it lives in sample space.
Arguing that there are 99 incorrect answers doesn't refute that evaluation is binary.
So counting 99 wrong labels tells us how many ways you can miss, but probability is assigned, not counted. Once a choice is made, the system collapses everything to the two outcomes "correct" and "incorrect", and if the right label happens to carry 50% probability, the situation is mathematically identical to a coin flip, regardless of how many other labels sit on the die.
Example with a weighted die and 99 incorrect answers:
Die Faces: 100
Weights: right status face = 0.50; the other 99 faces share the remaining 0.50
P(correct) = 0.50 -> exactly the coin-flip case
The 1/N rule only applies when all faces are equally likely; once you introduce weights, the number of faces no longer tells you the probability.
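If it helps, the weighted-die example can be simulated directly. A hypothetical Python sketch (the seed, trial count, and face labels are my own choices for illustration):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# 100 faces: face 0 is the correct status, faces 1..99 are the wrong ones.
faces = list(range(100))
# 0.50 weight on the correct face, the remaining 0.50 split over 99 wrong faces.
weights = [0.50] + [0.50 / 99] * 99

trials = 100_000
rolls = random.choices(faces, weights=weights, k=trials)
hits = sum(1 for face in rolls if face == 0)

print(f"empirical P(correct) = {hits / trials:.3f}")  # hovers around 0.50
```

Despite 99 distinct ways to miss, the empirical hit rate sits at roughly one half, exactly as if the die were a coin.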
> My statement is true no matter how many choices there are, or how skewed the probabilities are. Your count of 99 incorrect labels is perfectly fine, but it lives in sample space.
No, it's not.
If you have a 99% chance of picking the wrong outcome, you don't have a 50% chance of picking the right outcome.
The 1% chance of being right doesn't suddenly become 50% just because you reduce the problem space to a boolean outcome.
If I put 100 marbles into a jar, 99 of them black and one red, and your single-step instruction is "Draw the red marble from the jar", you don't have a 50% chance of picking the right marble if you're drawing randomly (i.e. the AI has no intelligence whatsoever).
Sample space: how many distinct labels sit on the die / in the jar (100).
Event space: did the guess match the ground-truth label? ("correct" vs. "incorrect")
Knowing there are 99 wrong labels tells us how many distinct ways we can be wrong, NOT how likely we are to be wrong.
Probability lives in the weights you place on each label, not in the label count itself. The moment you say "uniformly at random" you've chosen a particular weighting (each label gets 1/100), but nothing in the original claim required that assumption.
Imagine a classifier that, on any query, behaves like this:
emits the single correct status 50% of the time,
sprays its remaining 50% probability mass uniformly over the 99 wrong statuses (≈ 0.505% each).
There are still 99 ways to miss, but they jointly receive 0.50 of the probability mass, while the “hit” receives 0.50. When you grade the output, the experiment collapses to:
Outcome   Probability
correct   0.50
wrong     0.50
Mathematically, and for every metric that only cares about right vs. wrong (accuracy, recall, etc.), this is a coin flip.
Your jar contains 99 black marbles and 1 red marble, and you assume each marble is equally likely to be drawn. Under that specific weight assignment, P(red) = 0.01; yes, accuracy is 1%. But that's a special case (uniform weights), not a law of nature. Give the red marble extra weight (make it larger, magnetic, whatever) until P(red) = 0.50, and suddenly the exact same jar of 100 physical objects yields a 50% success chance.
Once the system emits one label, the grader only records "match" or "mismatch". Every multiclass classification benchmark in machine learning does exactly that. So:
99 wrong labels -> many ways to fail
50% probability mass on "right" -> coin-flip odds of success
Nothing about the count of wrong options can force the probability of success down to 1 %. Only your choice of weights can do that.
"Fifty-fifty" refers to how much probability you allocate to the correct label, not to how many other labels exist. If the correct label soaks up 0.50 of the total probability mass, whether the rest is spread across 1, 9, or 99 alternatives, the task is indistinguishable from a coin flip in terms of success odds.
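That last point can be checked exactly, without any simulation. A hypothetical Python sketch (the function name and the 1/9/99 alternatives are mine):

```python
def success_probability(mass_on_correct: float, num_wrong: int) -> float:
    """P(correct) when the leftover mass is spread over num_wrong wrong labels."""
    wrong_each = (1.0 - mass_on_correct) / num_wrong
    # Sanity check: the full distribution still sums to 1.
    assert abs(mass_on_correct + num_wrong * wrong_each - 1.0) < 1e-12
    return mass_on_correct

# Whether the leftover 0.50 is spread over 1, 9, or 99 alternatives,
# the success odds never move.
for num_wrong in (1, 9, 99):
    print(num_wrong, success_probability(0.50, num_wrong))
```

The number of alternatives only changes how thinly the "wrong" mass is sliced, never how much of it there is.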
EDIT: If you still don't understand, just let me know and I will show you the math proof that confirms what I said.
The outcome of a single-shot instruction has an x% chance of being right; that chance has nothing to do with the boolean framing of right vs. wrong, and is almost never (and only coincidentally) 50%.
It does not make sense to assume a 50% chance of success for an instruction like: tell me what day of the week it is, tell me what town in the world my mom was born in, tell me who Homer claimed the progeny of Dionysus to be, tell me which stock will perform best in the S&P tomorrow, tell me what time I will arrive in Tokyo, tell me how many stars there are in the Milky Way, etc.
You keep treating the count of wrong labels as if it automatically fixes the probability of being wrong. That is only true when you have already assumed a uniform distribution.
Nothing in probability theory forces that assumption.
Let
Ω = {y₁, y₂, …, yₙ} (sample space = the labels)
y ∈ Ω (the single correct label)
P : Ω → [0,1] with ∑ᵢ P(yᵢ)=1 (a probability measure)
Define two events
Correct = {y}
Wrong = Ω \ {y}
Then
P(Correct) = P({y}) = P(y)
Because P is arbitrary apart from normalisation, we are free to set:
P(y) = 0.50
P(any other yᵢ) = 0.50 / (n-1)
That instantly gives P(Correct) = 0.50, P(Wrong) = 0.50.
The outcome space collapses to a Bernoulli(½) coin-flip no matter whether n = 2 or n = 10⁹.
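The proof above can also be verified mechanically with exact rational arithmetic, so no floating-point caveats apply. A hypothetical Python sketch (the tested values of n are mine):

```python
from fractions import Fraction

def collapse(n: int) -> tuple[Fraction, Fraction]:
    """Exact (P(Correct), P(Wrong)) when P(y) = 1/2 and the rest is uniform."""
    p_correct = Fraction(1, 2)
    p_each_wrong = Fraction(1, 2) / (n - 1)
    p_wrong = (n - 1) * p_each_wrong  # summing the measure over Ω \ {y}
    return p_correct, p_wrong

# Bernoulli(1/2) whether n = 2 or n = 10^9.
for n in (2, 100, 10**9):
    assert collapse(n) == (Fraction(1, 2), Fraction(1, 2))
```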
Going back to your marble example:
99 black marbles (wrong)
1 red marble (right)
Uniform draw => 1% success.
But "uniform" is a weight choice. Make the red marble 99x heavier (or magnetic, or add 98 dummy red slips you ignore when grading):
P(red) = 99 / (99+99) = 0.50
P(each black) = 1 / (99+99) ≈ 0.00505
Same 100 physical marbles, now 50% success.
The count of wrong ways (99) never changed, only the weights did.
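In weights, the reweighted jar looks like this. A hypothetical Python sketch (the concrete numbers just mirror the 99x-heavier variant above):

```python
# The red marble is made 99x heavier; each black marble keeps weight 1.
red_weight = 99
black_weights = [1] * 99

total = red_weight + sum(black_weights)  # 99 + 99 = 198
p_red = red_weight / total               # 99/198 = 0.50
p_each_black = 1 / total                 # 1/198  ≈ 0.00505

print(p_red, p_each_black)
```

The count of black marbles (99) is untouched; only the weights moved, and the success chance went from 1% to 50%.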
It matters for ML classifiers because every multiclass classifier ultimately gets scored on accuracy, i.e.
Pr(output = ground-truth)
That accuracy is exactly P(Correct) above. The model’s internal distribution over labels (learned or engineered) determines that number. Uniform guessing over 100 labels gives 1% accuracy. A better model might concentrate 50% mass on the right label and reach 50% accuracy, which is literally a coin-flip in outcome space even though 99 wrong labels remain.
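That contrast between uniform guessing (1%) and a model with 0.50 mass on the right label can be simulated the way a benchmark would score it. A hypothetical Python sketch (the seed, label set, and example count are my own choices):

```python
import random

random.seed(1)  # fixed seed for a reproducible run

LABELS = list(range(100))
N_EXAMPLES = 20_000

def simulated_accuracy(mass_on_truth: float) -> float:
    """Accuracy of a classifier that puts mass_on_truth on the ground-truth
    label and spreads the rest uniformly over the 99 wrong labels."""
    hits = 0
    for _ in range(N_EXAMPLES):
        truth = random.choice(LABELS)
        wrong = [label for label in LABELS if label != truth]
        population = [truth] + wrong
        weights = [mass_on_truth] + [(1.0 - mass_on_truth) / 99] * 99
        prediction = random.choices(population, weights=weights, k=1)[0]
        hits += prediction == truth  # the grader only records match/mismatch
    return hits / N_EXAMPLES

acc_uniform = simulated_accuracy(0.01)  # 1/100 on truth = uniform guessing
acc_model = simulated_accuracy(0.50)    # half the mass on the right label

print(f"uniform guessing: {acc_uniform:.3f}")  # near 0.01
print(f"0.50-mass model:  {acc_model:.3f}")    # near 0.50
```

Same 100 labels, same 99 ways to be wrong; only the distribution over them changed.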
As to your strawman: I never said every real-world question lands at 50%.
I said: if a system places 0.5 probability mass on the correct label, then its success odds are 50%, FULL STOP. Whether that distribution is realistic for "What day of the week is it?" or "Which stock will lead the S&P tomorrow?" is an empirical question, but it has nothing to do with the mere fact that there are many wrong answers.
Probability theory says that success is whatever probability mass you assign to the single correct label; the label count is irrelevant once the distribution is non-uniform.
The short math proof above settles that and ends our discussion.