Not sure what big data has to do with this. It's a problem that has always existed, particularly in the "soft" sciences and biology. It's certainly not the case in physics, for instance.
Big data may bring another dimension to the problem once deep learning starts being used in science. However, that's a detail and we're not there yet.
Whether a paper is reproducible or not is not problematic per se. This is not what defines science. The real problem arises when 1/ a big claim/discovery is made in a paper that is not reproducible, 2/ nobody tries to check the results independently, and 3/ the community nevertheless takes the paper seriously and accepts its findings. None of this has anything to do with the use of statistics (unless the whole community makes the exact same errors) or big data.
I don’t think the author has the causality correct here, at least for biosciences. The statistical problems existed long before omics, big data, etc.
Most graduate programs don't require students to take statistics, or if they do, it's very cursory. Furthermore, students often learn very little about assay design - they end up treating non-linear responses as linear and do things like dividing assay signals to get ratios (two sins here: assuming the assay response has a zero intercept and assuming it's linear).
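As a toy illustration of that second point (the numbers and the assay response function below are entirely made up), a non-zero intercept plus a non-linear response makes a naive ratio of signals badly underestimate the true ratio of the underlying quantities:

    # Hypothetical assay response with an offset and a non-linear term.
    import math

    def assay_signal(conc):
        # invented response curve: intercept of 100, sqrt-shaped signal
        return 100 + 50 * math.sqrt(conc)

    conc_a, conc_b = 4.0, 8.0                     # true 2x difference
    sig_a, sig_b = assay_signal(conc_a), assay_signal(conc_b)

    print(sig_b / sig_a)   # naive ratio of signals ~= 1.2
    print(conc_b / conc_a) # true ratio of the underlying quantities = 2.0

Without a calibration curve, the signal ratio (~1.2 here) says almost nothing about the real 2x difference.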
So at least for the biosciences, it’s been a shitshow for a while.
Part of the answer here comes down to the size of the effect.
If the effect size is large, then you can be pretty sloppy with your statistics - in some ways it doesn’t matter because it’s almost a qualitative/binary difference.
Obviously statistics becomes much more important when the effect size shrinks and you are squinting at some data trying to see if a 20% difference is real or not.
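A rough simulation makes the point (all numbers are invented, including the noise level): with a huge effect, even n = 3 per group and a plain t-test flags it essentially every time, while a ~20% effect with typical assay noise is mostly missed at that sample size:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def detection_rate(effect, noise_sd, n, trials=2000, alpha=0.05):
        # fraction of simulated experiments where a two-sample t-test
        # reaches p < alpha
        hits = 0
        for _ in range(trials):
            control = rng.normal(1.0, noise_sd, n)
            treated = rng.normal(1.0 + effect, noise_sd, n)
            if stats.ttest_ind(control, treated).pvalue < alpha:
                hits += 1
        return hits / trials

    print(detection_rate(effect=2.0, noise_sd=0.3, n=3))  # ~1.0: obvious even with tiny samples
    print(detection_rate(effect=0.2, noise_sd=0.3, n=3))  # well under 50%: mostly missed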
Due to the poor foundations in the biosciences (everything from the assays themselves to the lack of assay replication to errors in interpretation), engineering/optimizing something can be fiendishly hard.
Science would first grind to a halt, reverse itself, then descend into a thousand-year dark age of sophistry, apologetics, and superstition. Alexander Fleming and Jonas Salk would be laughingstocks, vaccinations would be banned, and plagues would visit upon the Earth. Norman Borlaug would be denounced, crop varieties would be chosen by astrology, and famine would visit upon the Earth. The guilty party for any crime would be determined by phrenology or Theranos blood tests, the punishment decided by Jungian psychoanalysis, and injustice would rule the Earth.
I only say this because this is more or less what happened during Mao's Cultural Revolution, Pol Pot's Khmer Rouge, and the European Dark Ages. Historically speaking, it's a very bad sign when science is outright rejected; it most likely means millions of people are about to be murdered, and then millions more are going to starve to death or die of preventable diseases.
A famous Harvard professor discovers a possible route to extending lifespan and then spends $1bn+ in funding pursuing that idea. I haven't kept up with that story since then, so it's possible that someone might be able to use it as a therapeutic target.
Data-driven hypotheses have always been central to science, but the trick is that they are used to generate a theory which produces a prediction not (so far) seen in the data, one that can then be tested with statistical methods.
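A minimal sketch of that workflow, using fabricated data: fit a trend on the exploratory data, turn it into a prediction, then test the prediction only against measurements collected afterwards:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Exploratory data suggests a linear trend -> hypothesis: y ~ slope * x + intercept.
    x_old = rng.uniform(0, 10, 30)
    y_old = 2.0 * x_old + rng.normal(0, 5, 30)
    slope, intercept, *_ = stats.linregress(x_old, y_old)

    # The hypothesis predicts outcomes for *new* measurements; collect them,
    # then test the prediction against the new data, not the old.
    x_new = rng.uniform(0, 10, 30)
    y_new = 2.0 * x_new + rng.normal(0, 5, 30)
    residuals = y_new - (slope * x_new + intercept)
    # a large p-value here means no evidence the out-of-sample prediction is biased
    print(stats.ttest_1samp(residuals, 0.0).pvalue)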
For an article on proper use of statistics in science, this is rather short on data for an empirical test. For example, did studies from the pre-Big Data era (whenever you think that was) actually have a higher rate of reproducibility? If this has been demonstrated, I am not aware of it, and certainly we are not given a reference to such data in this article.
I don't see an issue with having a data collection and data engineering function that operates separately from the scientists who are creating hypotheses; those scientists can then search data catalogs and libraries to serve their hypotheses, no? It seems the author has mistaken the ability to collect and process data for running an experiment. Further, does cross-validation not apply in the sciences? Does sample size not apply? In the apocryphal example given in the story, wouldn't a study get tossed if it used a sample size of 8 data points to begin with? And wouldn't it be really, really stupid for scientists attempting to reproduce the study to go and reuse the same 8 samples?
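For what it's worth, cross-validation does also flag the sample-size problem. Here's a toy example (fabricated noise data, nothing to do with the article's example) where 8 samples give wildly unstable scores:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    X = rng.normal(size=(8, 5))   # 8 samples, 5 features: far too few
    y = rng.normal(size=8)        # pure noise: there is nothing to find

    scores = cross_val_score(LinearRegression(), X, y, cv=4, scoring="r2")
    print(scores)  # unstable, mostly negative R^2: a hint the "finding" won't replicate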
Usually the cost of doing the experiments precludes large sample sizes.
And you would be shocked at the lack of statistical education amongst biologists. Often many of them freely admit that they chose biology because they like science but are terrible at math.