Hacker News

Do you know of any (families of) examples of tabular datasets of any size (you can choose what "big" means) where deep learning convincingly outperforms traditional methods? I would love some quality examples of this nature to use in my teaching.


Regression targets where extrapolation may be needed. Decision tree methods cannot extrapolate: their predictions have to be a mean of a subgroup of the training data.

Consider: Predicting how much a customer might pay by end of month, with information we have at the start of the month.

In this example, if a customer has a record $10m of open invoices due by EoM, but the largest payment received in any prior month was $5m, a decision tree cannot possibly predict the payment will be ~$10m, even when its best feature indicates exactly that.

There are some hacks/techniques that can mitigate this issue, but they don't always work.
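A minimal sketch of the extrapolation limit described above (invented toy data, not from the thread): train a decision tree and a linear model on y = x over [0, 10], then ask both for a prediction at x = 20. The tree's answer is capped by the largest target it saw in training, while the linear model extrapolates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Train both models on y = x for x in [0, 10]
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = X_train.ravel()

tree = DecisionTreeRegressor().fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

X_new = np.array([[20.0]])  # well outside the training range
tree_pred = tree.predict(X_new)[0]      # capped at the max training target (10)
linear_pred = linear.predict(X_new)[0]  # extrapolates to ~20
```

Any point to the right of the training range falls into the tree's rightmost leaf, so no input, however extreme, can push the tree's prediction past the training maximum.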


What? Can you explain the mechanism by which a NN can "extrapolate" an invoice where a tree model couldn't? This is all just down to how the modeler builds the features.

Also, all models are a "mean of a subgroup of the data." The prediction is by definition the conditional mean as a function of the input values.
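A sketch of the "it's just feature/target engineering" point, using the invoice scenario from upthread (all numbers and names invented): if the modeler predicts a scale-free target, here the fraction of open invoices that gets paid, and multiplies back, a plain decision tree effectively "extrapolates" in dollar terms.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
open_invoices = rng.uniform(0.1, 5.0, size=200)       # $m, training range
paid_fraction = 0.9 + rng.normal(0, 0.02, size=200)   # customers pay ~90%
paid = open_invoices * paid_fraction

X = open_invoices.reshape(-1, 1)
tree_dollars = DecisionTreeRegressor().fit(X, paid)           # raw-dollar target
tree_ratio = DecisionTreeRegressor().fit(X, paid_fraction)    # ratio target

x_new = np.array([[10.0]])  # $10m open invoices, far above the training range
pred_dollars = tree_dollars.predict(x_new)[0]  # stuck below max paid seen (<$5m)
pred_via_ratio = x_new[0, 0] * tree_ratio.predict(x_new)[0]   # ~0.9 * $10m
```

The raw-dollar tree is pinned to its training range; the ratio formulation lands near $9m without changing the model class at all.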


Recommendation engines: search, feeds (tiktok / youtube shorts / etc), ads, netflix suggestions, doordash suggestions, etc etc. Also happens to be my specialty.


I worked with search and ads models at Google, and for most things tree models were better. What evidence do you have that neural nets are better there? I worked with large parts of Google search ranking, so I know what I'm talking about: some parts want a neural net, but most of the work is done by tree models and similar, which both perform better and run faster.


I'm not sure that is true. I think inference speed is often the bottleneck for the use cases stated, as is the need for frequent re-training. As a result, algorithms like CatBoost are very popular in those domains. I think CatBoost was actually invented at Yandex.

PS: It's weird that you are being downvoted. I think your opinion is reasonable.


Inference speed: more sophisticated stacks use multiple stages. An early stage might be a sublinear vector search, and the heavy-hitting neural nets only rerank the remainder. ByteDance has a paper on their fairly fancy sublinear approach.
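A minimal sketch of that two-stage pattern (all names and sizes invented): a cheap dot-product pass narrows a large catalog down to a shortlist, and only the shortlist goes through the expensive reranker. In production the first stage would be a sublinear ANN index; brute-force scoring stands in for it here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 100_000, 32
item_vecs = rng.normal(size=(n_items, dim)).astype(np.float32)
user_vec = rng.normal(size=dim).astype(np.float32)

# Stage 1: fast candidate generation. In production this is an ANN index
# (sublinear); here a brute-force dot product stands in for it.
scores = item_vecs @ user_vec
candidates = np.argpartition(-scores, 100)[:100]  # top-100, unordered

# Stage 2: a "heavy" reranker scores only the 100 candidates. This toy
# function stands in for the neural net.
def heavy_rerank(user, items):
    return (items @ user) + 0.1 * np.linalg.norm(items, axis=1)

rerank_scores = heavy_rerank(user_vec, item_vecs[candidates])
top10 = candidates[np.argsort(-rerank_scores)[:10]]
```

The point of the split is that the expensive model's cost scales with the shortlist (100 items), not the catalog (100k items).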

Retraining - online training solves this for the most part.
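A sketch of the online-training idea (assumed setup, toy data): instead of periodic full retrains, the model consumes each mini-batch of fresh interactions via `partial_fit`, staying current without a retraining pipeline.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
true_coef = np.array([1.0, -2.0, 0.5, 3.0])
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Simulate a stream of mini-batches arriving over time.
for _ in range(200):
    X_batch = rng.normal(size=(32, 4))
    y_batch = X_batch @ true_coef + rng.normal(0, 0.1, size=32)
    model.partial_fit(X_batch, y_batch)  # incremental update, no full retrain
```

Each `partial_fit` call is a cheap gradient step on the newest data, which is what makes keeping the model fresh tractable at serving scale.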

Frameworks - the only battle-tested, batteries-included one I've seen is Vespa. No one else publishes any of the interesting bits. KDD is the most relevant conference if you're interested in the field. IIRC Xiaohongshu has some papers that can only really be done with NNs.


Wonderful! Any public datasets you could point me to?


Unfortunately, none that I know of. Maybe the Netflix movie recommendations challenge from ages ago? I haven't looked at it personally.



