The article is all about why "0.05" might be a bad value to choose. But, more fundamentally, p is often the wrong thing to be looking at in the first place.
1. Effect sizes.
Suppose you are a doctor or a patient and you are interested in two drugs. Both are known to be safe (maybe they've been used for decades for some problem other than the one you're now facing). As for efficacy against the problem you have, one has been tried on 10000 people, and it gave an average benefit of 0.02 units with a standard deviation of 1 unit, on some 5-point scale. So the standard deviation of the average over 10k people is about 0.01 units, the average benefit is about 2 sigma, and p is about 0.05. Very nice.
The other drug has only been tested on 100 people. It gave an average benefit of 0.1 unit with a standard deviation of 0.5 units. Standard deviation of average is about 0.05, average benefit is about 2 sigma, p is again about 0.05.
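(If you want to check the arithmetic, here's a quick sketch in Python, treating the sample means as normal and doing a two-sided z-test; the numbers are the made-up ones above, not data from any real trial.)

    import math
    from scipy.stats import norm

    def z_test_summary(mean_benefit, sd, n):
        se = sd / math.sqrt(n)         # standard deviation of the average
        z = mean_benefit / se          # how many sigmas the average is from zero
        p = 2 * norm.sf(abs(z))        # two-sided p-value under normality
        return se, z, p

    print(z_test_summary(0.02, 1.0, 10_000))   # drug 1: se=0.01, z=2.0, p~0.046
    print(z_test_summary(0.10, 0.5, 100))      # drug 2: se=0.05, z=2.0, p~0.046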
Are these two interventions equally promising? Heck no. The first one almost certainly does very little on average, and does substantially more harm than good about half the time. The second one is probably about 5x better on average, and seems to be less likely to harm you. It's more uncertain because the sample size is smaller, and for sure we should do a study with more patients to nail it down better, but I would definitely prefer the second drug.
(With those very large standard deviations, if the second drug didn't help me I would want to try the first one, in case I'm one of the lucky people it gives > 1 unit of benefit to. But it might well be > 1 unit of harm instead.)
Looking only at p-values means only caring about effect size in so far as it affects how confident you are that there's any effect at all. (Or, e.g., any improvement on the previous best.) But usually you do, in fact, care about the effect size too.
Here's another way to think about this. When computing a p-value, you are asking "if the null hypothesis is true, how likely are results like the ones we actually got?". That's a reasonable question. But you will notice that it makes no reference at all to any not-null hypothesis. p < 0.05 means something a bit like "the null hypothesis is probably wrong" (though that is not in fact quite what it means) but usually you also care how wrong it is, and the p-value won't tell you that.
2. Prior probability.
The parenthetical remark in the last paragraph indicates another way in which the p-value is fundamentally the Wrong Thing. Suppose your null hypothesis is "people cannot psychically foretell the future by looking at tea leaves". If you test this and get a p=0.05 "positive" result, then indeed you should probably think it a little more likely than you previously did that this sort of clairvoyance is possible. But if you are a reasonable person, your previous opinion was a much-much-less-than-5% chance that tasseomancy actually works[1], and when someone gets a 1-in-20 positive result you should be thinking "oh, they got lucky", not "oh, it seems tasseomancy works after all".
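(To put rough numbers on that, here's a back-of-the-envelope Bayes calculation; the prior and the "sensitivity" are made up purely for illustration.)

    # Posterior odds = prior odds * likelihood ratio (Bayes' rule in odds form).
    prior = 1e-6                   # made-up prior that tasseomancy works
    p_pos_if_real = 0.80           # made-up chance of a "positive" if it were real
    p_pos_if_null = 0.05           # chance of a "positive" by luck under the null

    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * (p_pos_if_real / p_pos_if_null)
    posterior = posterior_odds / (1 + posterior_odds)
    print(posterior)               # ~1.6e-5: still firmly in "they got lucky" territory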
[1] By psychic powers, anyway. Some people might be good at predicting the future and just pretend to be doing it by reading tea leaves, or imagine that that's how they're doing it.
3. Model errors.
And, of course, if someone purporting to read the future in their tea leaves does really well -- maybe they get p=0.0000001 -- this still doesn't oblige a reasonable person to start believing in tasseomancy. That p-value comes from assuming a particular model of what's going on and, again, the test makes no reference to any specific alternative hypothesis. If you see p=0.0000001 then you can be pretty confident that the null hypothesis's model is wrong, but it could be wrong in lots of ways. For instance, maybe the test subject cheated; maybe that probability comes from assuming a normal distribution but the actual distribution is much heavier-tailed; maybe you're measuring something and your measurement process is biased, and your model assumes all the errors are independent; maybe there's a way for the test subject to get good results that doesn't require either cheating or psychic powers.
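(To illustrate just the heavier-tails point: here's a loose sketch comparing the tail probability of a "5 sigma" observation under a normal model versus a t distribution with 3 degrees of freedom. The two distributions aren't scaled identically, so this is only qualitative, but the many-orders-of-magnitude gap is the point.)

    from scipy.stats import norm, t

    x = 5.0                   # an observation "5 sigma" out under the assumed model
    print(2 * norm.sf(x))     # ~5.7e-7 if the noise really is Gaussian
    print(2 * t.sf(x, df=3))  # ~1.5e-2 if the noise has much heavier tails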
None of these things is helped much by replacing p=0.05 with p=0.001 or p=0.25. They're fundamental problems with the whole idea that p-values are what we should care about in the first place.
(I am not claiming that p-values are worthless. It is sometimes useful to know that your test got results that are unlikely-to-such-and-such-a-degree to be the result of such-and-such a particular sort of random chance. Just so long as you are capable of distinguishing that from "X is effective" or "Y is good" or "Z is real", which it seems many people are not.)
I feel like a lot of these critiques are just straw-manning the consideration of p-values.
Consider effect sizes - this seems to be a completely different (yes, important) question. Obviously the magnitude of the impact of the drug is important - but it isn't a replacement or "something to look at instead of the p-value", because the chance that the results you saw are due to random variation is still important! You can see a massive effect size, but if that is totally expected within your null model, then it is probably not all that exciting!
Effect sizes are a complement to some sort of hypothesis testing, but they are not a replacement.
> Prior probability.
Yes, when you can effectively encode your prior probability, I would say the posterior probability of seeing what you did is at least as good as a p-value.
I agree that you shouldn't look only at effect sizes any more than you should look only at p-values. (What I would actually prefer you to do, where you can figure out a good way to do it, is to compute a posterior probability distribution and look at the whole distribution. Then you can look at its mean or median or mode or something to get a point estimate of effect size, you can look at how much of the distribution is > 0 to get something a bit like a p-value but arguably more useful, etc.)
If anything I wrote appeared to be saying "just look at effect size, it's the only thing that matters" then that was an error on my part. I definitely didn't intend to say that.
But I was responding to an article saying "p<0.05 considered harmful" that never mentions effect sizes at all. I think that's enough to demonstrate that, in context, "it's bad to look at p-values and ignore effect sizes" is not in fact a straw man.
Incidentally, I am not convinced that the p-value as such is often a good way to assess how likely it is that your results are due to random chance. Suppose you see an effect size of 1 unit with p=0.05. OK, so there's a 5% chance of getting these results if the true effect size is zero. But you should also care what the chance is of getting these results if the true effect size is +0.1. (Maybe the distribution of errors is really weird and these results are very likely with a positive but much smaller effect size; then you have good evidence against the null hypothesis but very weak evidence for an effect size of the magnitude you measured.) In fact, what you really want to know is what the probability is for every possible effect size, because that gives you the likelihood ratios you can use to decide how likely you think any given effect size is after seeing the results. For sure, having the p-value is better than having nothing, but if you were going to pick one statistic to know in addition to (say) a point estimate of the effect size, it's not at all clear that the p-value is what you should choose.
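(Here's roughly what I mean, as a sketch: a flat prior over a grid of candidate effect sizes and a normal likelihood, using the second-drug numbers from my example above. It's just one way to do the computation, not anything from the article.)

    import numpy as np
    from scipy.stats import norm

    obs_mean, se = 0.1, 0.05                    # second drug: observed mean and its standard error

    effects = np.linspace(-0.3, 0.5, 1601)      # candidate true effect sizes
    likelihood = norm.pdf(obs_mean, loc=effects, scale=se)
    posterior = likelihood / likelihood.sum()   # flat prior, so posterior ~ normalized likelihood

    print(effects[np.argmax(posterior)])        # point estimate, ~0.1
    print(posterior[effects > 0].sum())         # P(effect > 0) ~ 0.977, the p-value-ish number
    print(posterior[effects > 0.05].sum())      # P(effect at least half what we measured) ~ 0.84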
What I have discovered after working in medicine for pretty long is that many biologists and MDs think p-values are a measure of effect sizes. Even a reviewer from Nature thought that, which is incredibly disturbing.
p-values were created to facilitate rigorous inference with minimal computation, which was the norm during the first half of the 20th century. For those who work within a frequentist framework, inference should be done using a likelihood-based approach plus model selection, e.g. AIC. It makes it much harder to lie.
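(For anyone unfamiliar: AIC is just 2k - 2 ln L from the maximized log-likelihood. A toy sketch on made-up data, comparing two nested Gaussian models; not a full model-selection workflow.)

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    x = rng.normal(0.3, 1.0, 50)                     # made-up sample

    def aic(loglik, k):                              # AIC = 2k - 2 ln L
        return 2 * k - 2 * loglik

    sigma0 = np.sqrt(np.mean(x**2))                  # MLE of sigma with the mean fixed at 0
    aic0 = aic(norm.logpdf(x, 0.0, sigma0).sum(), k=1)

    sigma1 = x.std(ddof=0)                           # MLE of sigma with the mean estimated too
    aic1 = aic(norm.logpdf(x, x.mean(), sigma1).sum(), k=2)

    print(aic0, aic1)                                # lower AIC is the preferred model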
AIC is an estimate of prediction error. I would caution against using it for selecting a model for the purpose of inference of e.g. population parameters from some dataset (without producing some additional justification that this is a sensible thing to do). Also, uncertainty quantification after data-dependent model selection can be tricky.
Best practice (as I understand it) is to fix the model ahead of time, before seeing the data, if possible (as in a randomized controlled trial of a new medicine, etc.).
And it is not uncommon that an intentionally bad model (high AIC) will be used for inference on a parameter when one wants to test the robustness of the parameter to covariates.
You can always encode the prior. If you take the frequentist approach and ignore bayesian concepts, it’s the same as just going bayesian but with an “uninformative prior” (constant distribution).
The only question is… would you rather be up front and explicit about your assumptions, or not?
An uninformative prior is an assumption, even if it’s the one that doesn’t bias the posterior (note that here “bias” is not a bad word).
There is potentially bias (of the bad word variant) introduced by the mismatch between the prior in your own mind and the distribution and params you choose to try to approximate that, especially if you're trying to pick out a distribution with a nice posterior conjugate.
I'm also not sure why everyone perceived my comment as anti-bayesian.
I did not perceive your comment as anti-bayesian, or at least not necessarily so! :)
But are you sure you know what I meant when I said “uninformative prior”? Because choosing an uninformative prior does not involve choosing any parameters: there is only one uninformative prior, and it’s the constant (flat) distribution which assigns equal probability to every value. It encodes no information and does not bias the posterior or result. It is the one and only mathematically-neutral prior. You can think of it as being a bit like an “identity function”.
Depending on the hypothesis space what you call the "uninformative prior" does not exist in the frequentist approach. If you search for a real value, then the uninformative prior is a uniform distribution on the infinite line. This distribution does not normalize and is off-limits to bayesians.
Ultimately, I think you are strawmanning frequentism here. Just because the maximum-likelihood estimate is sometimes the same as the MAP does not imply that they have the same meaning. This is why computed uncertainties of both approaches are often not the same and have a not-so-subtle difference in their interpretation. The one computes uncertainty in belief, the other imprecision of an experiment. You can't summarize that with "do you want to be explicit about assumptions".
Nobody normalizes an uninformative prior on its own. Normalization only happens when you get your posterior.
I am not straw-manning anything: bayesian methods are a generalization of frequentist methods. The equality / isomorphism to frequentist methods, in the special case of the uninformative prior, is commonly demonstrated in introductory (undergrad-level) bayesian textbooks and is in fact trivial. One need not even talk of any infinities: if you're doing discrete observations, infinities never show up (and in that case the prior normalizes just fine). And if you're curve-fitting (i.e. using parametric methods), the "infinite" line goes away as soon as you multiply your uninformative prior with your likelihood.
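(A few lines make the triviality visible in the discrete/binomial case; this is just a sketch.)

    import numpy as np

    k, n = 7, 20                                  # successes out of trials
    theta = np.linspace(0, 1, 1001)

    likelihood = theta**k * (1 - theta)**(n - k)
    flat_prior = np.ones_like(theta)              # the uninformative prior: constant everywhere
    posterior = likelihood * flat_prior
    posterior /= posterior.sum()                  # normalization happens at the posterior, as said above

    # With a flat prior the posterior is just the normalized likelihood,
    # so the MAP estimate coincides with the maximum-likelihood estimate k/n.
    print(theta[np.argmax(posterior)], k / n)     # both 0.35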
Yeah, and it's super simple to roll significance and effect sizes into one with confidence (or credible) intervals.
Plus the interpretation is super straightforward: if the CI contains zero, it's not significant. If nothing else, we should make CIs the primary default instead of p-values.
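(For instance, something like this, with fake data and a normal-approximation 95% interval; a sketch, not a recommendation of any particular library.)

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(0.00, 1.0, 500)     # control (fake data)
    b = rng.normal(0.15, 1.0, 500)     # treatment (fake data)

    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    lo, hi = diff - 1.96 * se, diff + 1.96 * se

    # One line carries significance (does the interval exclude 0?),
    # effect size (the center), and noise (the width).
    print(f"diff = {diff:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")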
Trying to fit your preconception into a mathematically convenient conjugate distribution is not as far afield from making stuff up as people want to believe. Maybe it is better with numerical approaches.
But yes, you usually do have some information and it is net better to encode when possible.
I think if hypothesis testing is understood properly, these objections don't have much teeth.
1. Typically we use p-values to construct confidence intervals, answering the concern about quantifying the effect size. (That is, the confidence interval is the collection of all values not rejected by the hypothesis test.)
2. P-values control type I error. Well-powered designs control type I and type II error. Good control of these errors is a kind of minimal requirement for a statistical procedure. Your example shows that we should perhaps consider more than just these aspects, but we should certainly be suspicious of any procedure that doesn't have good type I and II error control.
3. This is a problem with any kind of statistical modeling, and is not specific to p-values. All statistical techniques make assumptions that generally render them invalid when violated.
Your points are theoretically correct, and probably the reason why many statisticians still regard p-values and NHST favorably.
But looking at the practical application, in particular the replication crisis, specification curve analysis, de facto power of published studies and many more, we see that there is an immense practical problem and p-values are not making it better.
We need to criticize p-values and NHST hard, not because they cannot be used correctly, but because they are not used correctly (and are arguably hard to use right, see the Gigerenzer paper I linked).
The items you listed are certainly problems, but p-values don't have much to do with them, as far as I can see. Poor power is an experimental design problem, not a problem with the analysis technique. Not reporting all analyses is a data censoring problem (this is what I understand "specification curve analysis" to mean, based on some Googling - let me know if I misinterpreted). Again, this can't really be fixed at the analysis stage (at least without strong assumptions on the form of the censoring). The replication crisis is a combination of these two things, and other design issues.
I can understand why you see it this way, but still disagree:
(1) p-values make significance the target, and thus create incentives for underpowered studies, misspecified analyses, early stopping (monitoring significance while collecting data), and p-hacking.
(2) p-values separate crucial pieces of information. A p-value represents a highly specific probability (of data at least as extreme as observed, given that the null hypothesis is true), but does not include effect size or a comprehensive estimate of uncertainty. Thus, to be useful, p-values need to be combined with effect sizes and ideally simulations, specification curves, or meta-analyses.
Thus my primary problem with p-values is that they are an incomplete solution that is too easy to use incorrectly. Ultimately, they just don't convey enough information in their single summary. CIs, for example, are just as simple to communicate, but much more informative.
I don't understand. CIs are equivalent to computing a bunch of p-values, by test-interval duality. Should I interpret your points as critiques of simple analyses that only test a single point null of no effect (and go no further)? (I would agree that is bad.)
Yes, I argue that individual p-values (as they are used almost exclusively in numerous disciplines) are bad, and adding more information on effect size and errors is needed. CIs do that by conveying (1) significance (the interval does not include zero), (2) magnitude of effect (the center of the CI), and (3) errors/noise (the width of the CI). That's significantly better than a single p-value (excuse the pun).
I think part of the problem with p-values and NHST is that they encourage (or don't discourage) underpowered studies. That's because p-hacking benefits from the noise of underpowered studies. If you can test a large number of models and only report the significant one, then a noisy, underpowered study still gives you a good chance of turning up a spurious but significant-looking result.
So I think you are correct that properly powering studies is the crucial thing, but the incentives are against fixing this as long as lone p-values are publishable.
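(A toy simulation of that incentive: twenty metrics per small study, no true effects anywhere, and the "researcher" reports whichever metric comes out significant. The numbers are arbitrary.)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_studies, n_metrics, n = 2000, 20, 30                    # small, noisy studies

    hits = 0
    for _ in range(n_studies):
        data = rng.normal(0.0, 1.0, size=(n_metrics, n))      # every metric is pure noise
        pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue
        if pvals.min() < 0.05:                                # "only report the significant one"
            hits += 1

    print(hits / n_studies)    # ~0.64: most purely-null studies yield something "publishable"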
But here the issue is the uncorrected multiple testing and under-reporting of results, not the p-values themselves. Any criterion for judging the presence of an effect is going to suffer from the same issue, if researchers don't pre-register and report all of their analyses (since otherwise you have censored data, "researcher degrees of freedom," and so on). This is really a problem with the design and reporting of studies, not the analysis method.
That sounds like saying that if you write C code correctly you don't make memory errors, when in reality it's very common not to write correct code.
That's why Rust came along, to stop that behaviour: you simply can't make that mistake. Hence the point: maybe there's a better test to use as a standard than the p-value.
(Did HN finally increase the character limit for toplevel comments? Normally >2500 chars will get you booted to the very bottom over time. Happy to see this one at the top. It might be because of a 2008-era account though. Thanks for putting in a bunch of effort into your writing!)
It’s worth pointing out that processes like meta analyses ought to be able to catch #2. The problem is, GIGO. A lot of poorly designed studies (no real control group, failure to control for other relevant variables, other methodological errors) make meta analysis outcomes unreliable.
An area I follow closely is exercise science, and I am amazed that researchers are able to get grants for some of the research they do. For instance, sometimes researchers aim to compare, say, the amount of hypertrophy on one training program versus another. They'll use a study of maybe 15 individuals. The group on intervention A will experience 50% higher hypertrophy than intervention B, at p=0.08, and they'll conclude that there's no difference in hypertrophy between the two protocols rather than suggesting that the study needs more statistical power.
Another great example is studies whose interventions fail to produce any muscle growth in new trainees. They’ll compare two programs, one with, say, higher training volume, and one with lower training volume. Both fail to produce results for some reason. They conclude that variable is not important, rather than perhaps concluding that their interventions are poorly designed since a population that is begging to gain muscle couldn’t gain anything from it.
I find it interesting that the prior probability example always happens to be the one where the prior is correct, while the opposite direction is never given more than half a line. You can very easily turn it around so that an unreasonably large amount of evidence is needed to move a strongly believed wrong prior.
My favourite example is to do that with normal distributions and the posterior ends up as "let's meet in the middle and agree that we both were wrong" -a very low variance posterior with mean right in the middle between prior and data.
I agree, and would add lack of statistical power to the list. Underpowered studies (e.g. small effects, noisy measurement, small N) decrease the chances of finding a true effect, paradoxically increasing the risk of false positives.
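(A quick sketch of that paradox with made-up numbers: a real but small effect of 0.2 SD, n = 20 per study, and only "significant" results get kept. The surviving estimates are both rare and exaggerated.)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    true_effect, n, sims = 0.2, 20, 20_000

    kept = []                                         # estimates that clear p < 0.05
    for _ in range(sims):
        sample = rng.normal(true_effect, 1.0, n)
        if stats.ttest_1samp(sample, 0.0).pvalue < 0.05:
            kept.append(sample.mean())

    print(len(kept) / sims)                           # power: roughly 0.1-0.15
    print(np.mean(kept) / true_effect)                # significant estimates overshoot by ~2-3x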
It's immensely frustrating that we haven't made a lot of progress since Cohen's (1962) paper.
> The article is all about why "0.05" might be a bad value to choose.
No. It's more about why choosing a value to serve as the default choice is a bad idea in the first place. The specific value chosen as the default itself (i.e., 0.05 in this case) is irrelevant.
The idea is that the value you choose should reflect some prior knowledge about the problem. Therefore choosing 5% all the time would be somewhat analogous to a bayesian choosing a specific default gaussian as their prior: it defeats the point of choosing a prior, and it's actively harmful if it sabotages what you should actually be using as a prior, because it's not even uninformative, it's a highly opinionated prior instead.
As for points 1 to 3, technically I agree, but there's a lot of misdirection involved.
The point on effect sizes is true (and indeed something many people get wrong), but it is contrived to make effect sizes more useful than p-values. In which case, the obvious answer is, you should be choosing an example where reporting a p-value is more important than an effect size. One way to look at a p-value is as a ranking, which would be useful for comparing between effect sizes of incomparable units. Is a student with a 19/20 grade from a European school better than an American student with a 4.6 GPA? Reducing the compatibility scores to rankings can help you compare these two effect sizes immediately.
Prior probability, similarly. "If" you interpret things as you did, then yes, p-values suck. But you're not supposed to. What the p-value tells you is "the data and model are 'this much' compatible". It's up to you then to say "and therefore" vs "must have been an atypical sample". In other words, there is still space for a prior here. And in theory, you are free to repurpose the compatibility score of a p-value to introduce this prior directly (though nobody does this in practice).
Regarding p=0.05 vs p=0.001 not mattering; of course they do. But only if they're used as compatibility rankings as opposed to decision thresholds. If you compare two models, and one has p=0.05 and the other has p=0.001, this tells you two things: a) they are both very incompatible with the data, b) the latter is a lot more incompatible than the former. The problem is not that people use p-values, the problem is that people abuse them, to make decisions that are not necessarily crisply supported by the p-values used to push them. But this could be said of any metric. I have actively seen people propose "deciding in favour of a model if the Bayes Factor is > 3". This is exactly the same faulty logic, and the fact that BF is somehow "bayesian" won't protect subsequent researchers who use this heuristic from entering a new reproducibility crisis.
I agree that the article is somewhat about the idea that having a Standard Default p-value Threshold is unwise, but I think it's mostly suggesting that in its particular context it's generally better to use a larger p-value. "Defaults matter", the author says (not "Defaults are a trap"). "Do we need that much risk aversion?" (not "Sometimes we need more risk aversion, sometimes less"). "If we were starting over ... I doubt that's what we would pick" (not "If we were starting over, I think we'd have a process that begins by considering what p-value threshold would be appropriate"). At the end, he considers situations where the criterion used is strict and "the company progresses too slowly" (but not ones where it's lax and "the company moves too fast and breaks too many things" or "the company thrashes about unstably") and "Why not just admit you want something akin to p=0.25 in the first place" (not "... something akin to p=0.25 some of the time, and something akin to p=0.001 some of the time?") and "I believe that nobody wants to write down a large p-value threshold, because it feels unscientific" (and nothing about reasons why people might be reluctant to write down an unusually small p-value threshold).
I agree, in case it needs to be said, that if you are doing hypothesis testing then you should adapt your choice of thresholds to your (explicit or implicit) prior. That was approximately the second of the three points I made.
"You should be choosing an example where reporting a p-value is more important than an effect size." Why should I be doing that? I think that almost always both are important (more precisely, I think that almost always thinking of what you've learned from an experiment into a p-value and a point-estimate effect size is suboptimal, but you want the information both of those things are trying to tell you). And if I've correctly understood what you say about rankings, I completely disagree with it; almost none of the time is the p-value what you should be ranking things on. Again, consider those two medications in my hypothetical example: I do not believe any reasonable person would think that the right way to rank them relative to one another matches the p-values. (For comparing two individual students? Yeah, maybe, kinda. But approximately 0% of cases where people compute p-values are analogous to that.)
I agree that picking any single metric and applying brainless black-and-white rules using it is liable to get you in trouble, whatever that metric is. But some metrics are better than others, and a fixed p-value threshold (1) has been a widespread common practice and (2) is for most purposes a really bad idea.
Yes; as I said, wasn't disagreeing per se, just clarifying important context more than anything.
> I completely disagree with it; almost none of the time is the p-value what you should be ranking things on.
Not necessarily true; people compare models on the basis of their p-values all the time; but in any case, that's not what I was referring to here. I was saying the p-value in itself expresses a ranking. That's just what a p-value is by definition: a ranking of the particular compatibility score of the data-event under consideration, with respect to the whole domain of compatibility scores resulting from all possible data-events considered.
And yes, when the score itself is more important than the ranking, it's prudent to focus on that; when the p-value "ranking" is more important, it's prudent to focus on that instead; ideally, people should consider both.
Having said that, the medical example is a weird one. There's a specific sentence there that I have a problem with:
> It's more uncertain because the sample size is smaller [...] I would definitely prefer the second drug.
Perhaps this is simply a misnomer, since one can only be so precise when using language, but as expressed above, I feel this is wrong, as it alludes to an expression of posterior probability, which is not information given to you by a p-value in this setting. More specifically, you should be saying it's not "precise". The subtle difference being, you are interpreting this as "despite the wide range, it's most likely to be around 0.1, so I'll take my chances", whereas what the confidence interval is really telling you is "the data would be atypical for models outside this large range, but not inside", meaning there is an almost equal chance of the drug harming you as of it giving you a benefit. So, no, I wouldn't prefer the 2nd drug; I'd actually prefer the 1st one, which has a guaranteed, albeit smaller, benefit. This is a very different statement from a credible interval, which more or less would state that 0.1 is the most likely value.
So in fact, the p-value would be a better measure of comparison here as well. You're comparing the compatibility of two different data-events (which in this case reflects samplings) against the same (null) hypothesis; so you should prefer the event that is least compatible with the null (and ideally, you should propose an alternative hypothesis, and confirm that it demonstrates higher compatibility with that one as well). But having said this, comparing p-values typically comes up in the reverse scenario, where one has a single data-event and wants to compare two different hypotheses against the same data-event, since that gives information on which model is most compatible with what you observed.
Anyway. Sorry, I know you know, I'm just enjoying the exposition :)
When I see a post "P < 0.05 Considered Harmful" I think OK now they are going to talk about maybe Bonferroni vs. other multiple hypothesis corrections if they are a frequentist or otherwise they are going to try to explain Bayesian things.
But no, this one isn't from a frequentist, or from a Bayesian. It's from a techbro whose solution isn't any kind of multiple hypothesis correction or getting Bayes-pilled, it's to say "Why not just admit you want something akin to p = 0.25 in the first place?" for 'ship criteria' in the only stats context he appears to know, which is A/B testing, talking about Maslow's hierarchy and namedropping Hula and Netflix. It's seriously like some Silicon Valley parody. Wait, is it actually a satire blog?
p=0.2 doesn't work too well for ship criteria for medicine.
p=0.2 for "this reordering of the landing text improves the rate conversion events" is fine. People make changes based on less information all the time. Waiting for certainty has its own expenses.
Not that fine. The author claims that neutral changes are not costly for users so the only reason to avoid them is to avoid wasting time/money. But that's not really true. Pointless churn annoys users. If your P threshold isn't low enough then you can get stuck in an endless treadmill of making changes that you thought would have benefit but which don't actually do anything because your threshold for something being considered significant is too loose.
Sure, excessive rate of change is bad. Even if it's all validated and well-proven.
> If your P threshold isn't low enough then you can get stuck in an endless treadmill of making changes
To measure P, I have to make the change.
There's plenty of situations where accumulating enough data to meet a p<0.05 threshold would take months. If my prior is that it's a reasonably good change and has a decent chance to be helpful, and then I take a measurement that makes that roughly 5x more likely... it's OK to push the button. There are many decisions in business that have to be made with much less information or statistical proof than this.
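(For a sense of scale, a rough normal-approximation calculation; the 5% baseline conversion rate, the 5% relative lift, and the 2,000 visitors per day are all made up for illustration.)

    from scipy.stats import norm

    def visitors_per_arm(p_base, rel_lift, alpha=0.05, power=0.8):
        # Normal-approximation sample size for comparing two proportions.
        p_new = p_base * (1 + rel_lift)
        z = norm.isf(alpha / 2) + norm.isf(1 - power)
        var = p_base * (1 - p_base) + p_new * (1 - p_new)
        return z**2 * var / (p_base - p_new)**2

    n = visitors_per_arm(0.05, 0.05)        # detect a 5% relative lift on a 5% baseline
    print(round(n))                         # ~122,000 visitors per arm
    print(round(2 * n / 2000))              # ~122 days at 2,000 visitors/day across two arms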
It all depends on what you're experimenting on. I do think there are many teams out there who are (in effect) making decisions with less certainty than this, but they wouldn't want to actually quantify it.
It made the front page because of the title. Bashing p-values is a formula for upvotes, regardless of the content.
A lot of people associate p-values with p-hacking and replication crisis, which they read about on HN a few times while sitting on the can. Therefore, in their expert opinions, p-values are bad.
So much has been said about p-values and null hypothesis significance testing (NHST) that one blog post probably won't change anybody's opinion.
But I want to recommend the wonderful paper "the null ritual" [1] by Gerd Gigerenzer et al. It shows that a precise understanding of what a p-value means is extremely rare even among statistics lecturers, and, more interestingly, that there have always been fundamentally different understandings even when p-values were invented, e.g. between Fisher and Neyman-Pearson.
Beyond that, there was a special issue recently in the american statistician discussing at length the issues of p-values in general and .05 in particular [2].
Personally, I of course feel that null hypothesis testing is often stupid and harmful, because null effects never exist in reality, effects without magnitude are useless, and because the "statelessness" of NHST creates or exacerbates problems such as publication bias and lack of power.
Thanks for that. I'll give [1] a read. I'm familiar with [2], and cited one of those papers in the blog.
About the stupid or harmful nature of null hypothesis testing in general, what do you recommend instead for decision making and for summarization of uncertainty? In the scenario of large (yet fast moving) organizations where most people will have little stats background.
Thanks, I hope you find Gigerenzer useful. The paper is a bit academic, but he also wrote a couple of nice popular science books on the (mis-)perception of numbers and statistics, those might be useful in a business environment.
For real-world applications outside engineering and academia, I would rely heavily on confidence intervals and/or confidence bands. For example, the packages from easystats [1] in R have quite a few very useful visualization functions, which make it very easy to interpret results of statistical tests. You can even get a precise textual description, but then again, that's intended for papers and not a wider audience.
Apart from that, I would mainly echo recommendations from people like Andrew Gelman, John Tukey, Edward Tufte etc.: Visuals are extremely useful and contain a lot of data. Use e.g. scatterplots with jittered points to show raw data and the goodness of fit. People will intuitively make more of it than of a single p-value.
Totally agree about visualization, and that those authors are great advocates for it. Confidence intervals are definitely much more informative and intuitive than p-values.
Would the policy be "look at our confidence intervals later and then decide what to do"? One remaining issue is how to have consistent decision criteria, and to convey it ahead of time. Imagine a context with 10-50 teams at a company that run experiments, where the teams are implicitly incentivized to find ways to report their experiments as successful. Quantified criteria can be helpful in minimizing that bad incentive.
Thanks for this, looking forward to read it! I would also recommend in turn the paper by Greenland et al (2016) called "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations.". It's a great read.
I recently had a chat with a colleague who has "abandoned p-values for bayes factors" in their research, on how, in principle, there's nothing stopping you from having a "bayesian" p-value (i.e. the definition of the p-value at its most general can easily accommodate bayesian inference, priors, posteriors, etc). The counter-retort was more or less "no it can't, educate yourself, bayes factors are better" and didn't want to hear about it. It made me sad.
p-values are an incredibly insightful device. But because most people (ab)use it in the same way most people abuse normality assumptions or ordinal scales as continuous, it's gotten a bad rep and means something entirely different to most now by default.
2σ is fine, but the benefit of modern technology is that we can compute exactly how many standard deviations out our results are, i.e. how unlikely it would be for the null hypothesis to randomly generate them. Particle physics holds itself to an "industry standard" of 5 sigma, for example.
The real conversation to be had is -- what standard deviation will we tolerate? Is this something we'll keep doing, and thus A/B test ourselves into a (perhaps quite horrifying) local minimum based on random noise? Is this a single great experiment? Are lives on the line? Will these results be taken Quite Seriously? Is this a test to say "look, this is worth further investigation"?
I'm no statistician but in my opinion this is the first conversation that needs to be had when doing an experiment: what p / σ levels are satisfactory to claim confidence? p<0.05 is a decent heuristic for some experiments, but not all.
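(For reference, the sigma-to-p conversion is a one-liner. Two-sided here; particle physics usually quotes the one-sided figure, about 3e-7 for 5 sigma.)

    from scipy.stats import norm

    for sigma in (1, 2, 3, 5):
        print(sigma, "sigma -> two-sided p =", 2 * norm.sf(sigma))
    # 2 sigma -> ~0.046 (the familiar 0.05-ish level); 5 sigma -> ~5.7e-7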
The thing that made me realize how ineffective P < 0.05 was, was playing D&D, because the odds of something happening 5% of the time are the same as rolling a 1 on a 20-sided die, which happens surprisingly frequently once you roll it more than a couple of times.
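(The arithmetic backs that intuition up; a throwaway calculation, nothing from the article.)

    # Chance of rolling at least one natural 1 in n d20 rolls -- i.e. at least
    # one 5%-level "false positive" across n looks.
    for n in (1, 5, 14, 20, 60):
        print(n, round(1 - (19 / 20) ** n, 2))
    # 1 -> 0.05, 5 -> 0.23, 14 -> 0.51, 20 -> 0.64, 60 -> 0.95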
I've seen no source that shows that Xcom fudges its displayed hit chances. You may be thinking of Fire Emblem, whose games use a variety of well-documented approaches to fudging their rolls: https://fireemblemwiki.org/wiki/True_hit
I know at least XCOM 2 does on certain difficulties. The aim assist values are directly in the INI files (it fudges the numbers in your favor for lower difficulties). Here are instructions on how to remove the aim assists:
https://steamcommunity.com/sharedfiles/filedetails/?id=61799...
Think there is confusion here - if I understand correctly, you are asking about the number generator, while they are talking about the process of determining success. Like in League of Legends: you have a listed crit chance, but the way they determine success isn't to generate a number and compare it to your chance; you start with a smaller base number that gets incremented each time you fail and reset when you succeed. The end result is that your overall chance remains the same but the likelihood of a streak (of fails or successes) goes down.
Doesn't change the overall probability, but drastically reduces variance.
Some games (both computer games and physical board games) intentionally use "shuffled randomness" where e.g. for percentile fail/success rolls you'd take numbers from 1-100 and use true randomness to shuffle that list; in this way the overall probability is the same, but has a substantially different feel as it's impossible for someone to have bad/good luck throughout the whole game and things like "gambler's fallacy" which are false for actual randomness become true.
Just as with any standard PRNG (e.g. Mersenne Twister), the outcome may be fully predictable with some information about the state or the previous rolls, but it's still usable; and the general distribution of values matches what is desired for that section of the game.
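(In code, the shuffled-bag approach is something like this; a sketch, and actual games typically layer more tuning on top.)

    import random

    class ShuffledD100:
        """Draw 1-100 without replacement; reshuffle when the bag empties.
        Long-run frequencies match a fair d100, but streaks are damped and
        gambler's-fallacy reasoning actually becomes valid within a bag."""

        def __init__(self):
            self.bag = []

        def roll(self):
            if not self.bag:
                self.bag = list(range(1, 101))
                random.shuffle(self.bag)
            return self.bag.pop()

    die = ShuffledD100()
    rolls = [die.roll() for _ in range(300)]
    print(rolls.count(1))    # exactly 3 ones in 300 rolls, by construction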
That's not what I meant. Say that the first roll was a 1. Then you know that for the next 99 rolls, you won't get a 1 again. And with more rolls, more numbers are eliminated.
Can you elaborate on how XCOM lies? I often suspected this (but you can never be sure, since human intuition is bad at probabilities). Is there hard evidence?
Nothing like spending 10 minutes reading a paper to see results which are likely nonsense. However, it pales in comparison to spending 3 weeks trying to replicate popular works... only to find it doesn't generalize... you know that ROC was likely from cooked data-sets confounded with systematic compression artifact errors... likely not harmful, but certainly irritating. lol =)
There is a joke among mathematicians: the confidence interval is nothing more than the interval between first learning about it and thinking you understood it, and the moment you realize what it really means.
The fundamental error that causes misuse of p-values (and statistics in general) is misunderstanding what statistics is in the first place. Statistics is applied epistemology. There is no algorithm that fits all situations where you're trying to think about and learn new things. It's just hard and we have to deal with that.
Arguably, the main point of p-values is trying to prevent people who really know better from "cheating" by reporting "interesting" results from very low sample sizes. Having a very rigid framework as a rejection criterion helps with this. However, the scientific community and system are not capable of dealing with "real cheating" that includes fabrication of data. Also, any such rigid metric is going to be gamed. Of course, there are some people who would also cheat themselves, and maybe learning about p-values makes this less frequent. But such people using them don't understand what those values are telling them, beyond "yes, I can publish this". This is counterproductive because it prevents deep thinking and thorough investigation.
Most scientists realize that in practice there is no simple list of objective criteria that tells you what experiments to perform and how exactly to interpret the results. This takes a lot of work, careful thinking, trying different things and very much benefits from collaboration. But who's got time for that? There's papers to publish and grant applications to write. So p-values it is. Or maybe some other thing eventually, that also won't solve the fundamental problem.
Maybe tech industry insiders can tell me this ... but do real people actually make product decisions based solely on p < 0.05? Seems like the author is writing about a contrived problem.
Product decisions are made based on someone's gut instinct. p values are mostly used when (or abused until) they align with that instinct, if they are being considered at all.
Bold of you to assume that product decisions are based on rigorous statistical tests.
Jokes aside, for product decisions (or all kinds of decisions, really) you should differentiate between statistical significance and relevance. A small measured difference in some metric, even if statistically significant to p < 0.000001, may not be relevant for a decision.
The "stasis" and "arbitrarily adjustments" regimes that I wrote about are certainly ones that I've seen, which don't rely solely on p < 0.05 but are still pretty suboptimal. Furthermore, it's not only about whether 0.05 is the sole criteria, but also about whether it's a useful criteria for us to highlight at all, depending on whether the anchoring effect of it is damaging relative to alternatives.
But let me turn that around and ask: what product decision regime do you see most often or think would be the most relevant to use as an example? I'd be happy to hear your perspective and make sure I keep it in mind for future blogs.
Hi, I wrote some other response in this thread where I called you a techbro, so sorry about that; I probably count as one too, and I wasn't trying to insult you too much.
Anyway when you say "Furthermore, it's not only about whether 0.05 is the sole criteria, but also about whether it's a useful criteria for us to highlight at all, depending on whether the anchoring effect of it is damaging relative to alternatives." I love this analogy of the null hypothesis vs. alternative hypotheses in frequentist statistics to the 'anchoring effect' cognitive bias that you try to work to your advantage in marketing and sales and negotiation or management. https://en.wikipedia.org/wiki/Anchoring_(cognitive_bias)
If you don't want to be tied to p-values and you only care about downstream decisions rather than quantifying beliefs, you can use some ideas in decision theory https://en.wikipedia.org/wiki/Decision_theory
For example, maybe you are deciding between two alternative ways of doing something. You don't know which one is better, and you are confronted with not only the decision of which one to use, but also with the decision of whether to experiment (with A/B testing for example) to be more sure of which one is right, versus whether to exploit the one that you currently think is better. This is the multi-arm bandit problem and it doesn't necessarily use p-values so your intuition is right! https://en.wikipedia.org/wiki/Multi-armed_bandit
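(To make that concrete, here's a toy Thompson-sampling sketch of the explore/exploit idea; the conversion rates are made up, and this isn't anything from the blog.)

    import random

    true_rates = {"A": 0.050, "B": 0.055}  # unknown to the algorithm; used only to simulate users
    wins = {"A": 1, "B": 1}                # Beta(1, 1) priors on each arm's conversion rate
    losses = {"A": 1, "B": 1}
    shown = {"A": 0, "B": 0}

    for _ in range(100_000):
        # Sample a plausible rate for each arm from its posterior and play the best sample.
        arm = max(true_rates, key=lambda a: random.betavariate(wins[a], losses[a]))
        shown[arm] += 1
        if random.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1

    print(shown)    # traffic drifts toward the better arm; no p-value threshold anywhere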
Maybe that's not your situation. Maybe your situation is that you have an existing business process and you want to know whether to switch to one that might be better. Someone might say it's a p-value problem, but again I agree with your intuition that it really isn't best to think about it that way, especially when there is a cost to switching. Instead, it's a more complicated decision that depends on what is the switching cost, how much better you think the new process would be (including uncertainty of it), and what kind of business horizon you care about. There might even be a multi-armed bandit effect again even in this situation, where you also have to weigh the costs of reducing your uncertainty of the switching improvement or even of reducing the uncertainty of the switching cost itself.
Anyway, these problems do involve concepts from probability and statistics but it's for sure true that the decisions don't always reduce to P < 0.05 at the end! Good luck best wishes living your best techbro life!
I've been called worse things, and at this stage in life it is hard to be offended by people who don't know me well enough to give an insightful insult. But I hadn't responded to the earlier comment because it was more generally antagonistic and seemed to reflect a (possibly intentional?) misinterpretation and negative reading of the blog. "Don't feed the trolls", as the saying goes.
I started writing that particular blog with different experimental methods in mind, but wrote so much as a prerequisite that I wanted to stop and make that first part a standalone post. My last paragraph was supposed to make it clear that this was a launching off point, rather than a summation of everything I know about decision theory or experimentation.
Thanks for writing more substance into this comment. These opinions are sensible, I agree with much that you wrote here. On one: I do use MABs and see them as a feasible method for many organizations, and at some point I'd like to write about some challenges with those too.
Wishing you the best in your techbro or ftxbro or <whatever>bro life too!
Yep. I've seen "rigorous" A/B testing regimes set up based on a p<0.05 requirement that then went straight into "we'll keep running this until we reach significance" and "this one's clearly trending towards significance, we should just implement it now" nonsense.
> Neutral changes aren’t costly on our users, so while we should be somewhat averse to wasting time and adding tech debt for neutral changes, it isn’t the end of the world.
While I think most of the points in the post are reasonable, I strongly disagree with the idea that neutral changes aren't costly. Some neutral changes are good because they are a part of a larger strategy or vision, but in my experience most neutral features are simply change for change's sake. Any reasonably large company that ships all/most neutral tests is going to end up feeling like a bloated, unfocused mess.
Does the sentence imply that they aren't costly? Would you write "strongly averse" instead of "somewhat averse" or "it is the end of the world" instead of "it isn't the end of the world"? If so, what language would you use to convey that strongly negative changes are that much worse than neutral ones?
A relevant quote here seems to be: "if a measure becomes a target, it ceases to be a good measure". Also true of academia in general because of the obsession with citations.