
"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."

-Leo Breiman, like 24 years ago

Machine learning isn't the native language of biology; the author just realized that there's more than one approach to modeling. I'm a statistician working in an ML role, and most of the issues I run into (from a modeling perspective) are the reverse of what this article describes: people trying to use ML for the precise things inferential statistics and mechanistic models are designed for. Not that the distinction is that clear to begin with.



Agreed wholeheartedly. I have argued with the VP of our department about this paper quite a few times.

I feel like Breiman sets up a strawman that I've never encountered when working with colleagues who were trained in the statistics community. That doesn't mean it didn't exist 25 years ago when he wrote it. I concede that we are sometimes willing to make simplifying assumptions in order to state something particular, but it's almost like we've been culturally conditioned to steep everything we say in every caveat possible.

Meanwhile, I am constantly having to point out the poor feedback we've had about some of the XGBoost models, despite the fact that they're clearly the most "predictive" when evaluated naively.


This is largely my feeling as well.



