Hacker News | reexpressionist's comments

This type of behavior (and related issues) would primarily be a problem only with unconstrained generative models. If you're the one deploying the model, or a downstream consumer, then once the network is trained it can be re-expressed (via an exogenous/secondary model or process) to derive reliable and interpretable uncertainty quantification: condition on reference classes in a held-out calibration set, formed from the Similarity to Training (depth matches to training), the Distance to Training, and a CDF-based per-class threshold on the output magnitude. If the prediction/output falls below the desired probability threshold, fail gracefully by rejecting the prediction rather than allowing silent errors to accumulate.

For higher-risk settings, you can always turn the crank to be more conservative (i.e., more stringent parameters and/or requiring a larger sample size in the highest probability and reliability data partition).
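To make the rejection mechanism concrete, here is a minimal sketch of a per-class CDF threshold derived from a held-out calibration set. All function names are illustrative (not from any particular library), and this covers only the output-magnitude component, not the similarity/distance conditioning:

```python
# Hypothetical sketch of selective classification: accept a prediction
# only if its softmax magnitude clears a per-class empirical quantile
# (CDF-based threshold) estimated on a held-out calibration set.
import numpy as np

def per_class_thresholds(cal_probs, cal_labels, alpha=0.95):
    """For each class, a lower-quantile floor computed from the output
    magnitudes of *correctly* predicted calibration examples."""
    thresholds = {}
    preds = cal_probs.argmax(axis=1)
    for c in np.unique(cal_labels):
        correct = cal_probs[(preds == cal_labels) & (cal_labels == c), c]
        # Accept only outputs above what the least-confident
        # (1 - alpha) fraction of correct predictions achieved.
        thresholds[int(c)] = np.quantile(correct, 1 - alpha)
    return thresholds

def predict_or_reject(probs, thresholds):
    """Return the predicted class, or None (graceful failure)."""
    c = int(probs.argmax())
    return c if probs[c] >= thresholds.get(c, 1.0) else None
```

Turning the crank for higher-risk settings then amounts to raising `alpha` (a more stringent quantile) and/or requiring more calibration examples per class before trusting the threshold at all.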

For classification tasks, this follows directly. For generative output, this comes into play with the final verification classifier used over the output.


The alternative approach is to start with a small[er] model but derive reliable uncertainty estimates, only moving to a larger model if necessary (i.e., if the probability of the predictions is lower than the task requires).

And I agree that the leaderboards don't currently reflect the quantities of interest typically needed in practice.


> derive reliable uncertainty estimates

That is very, very hard to do in an objective manner, as the current LLM benchmark gaming demonstrates.

Sure, you can deploy a smaller model to production to get real-world user data and feedback, but a) deploying a suboptimal model can give a bad first impression and b) the quality is still subjective and requires other metrics to be analyzed. Looking at prediction probabilities only really helps if you have a single correct output token, which isn't what LLM benchmarks test for.


I believe we have two rather different settings in mind. My statement assumes the enterprise use-case, where having a verifier is required. (In this context, I'm also assuming the approach of constraining against the observed data.) In such a selective classification setting, the end-user need not be exposed to lower quality outputs, but rather null predictions if the model cascade has been exhausted (i.e., progressively moving to larger models until the probability is acceptable).

Hopefully in 2024 we can get at least one of the benchmarks to move to assessing non-parametric/distribution-free uncertainty for selective classification, reflecting more recent CS/Stats advances that should be used in practice. Working on it.


If the end goal is document classification and/or semantic search, the Reexpress Fast I model (3.2 billion parameters) is a good choice. The key is that it produces reliable uncertainty estimates (for classification), so you know if you need a larger (or alternative) model. (In fact, an argument can be made that since the other models don't produce such uncertainty estimates, they are not ideal for serious use cases without adding an additional mechanism, such as ensembling with the Reexpress model.)


TL;DR: Reexpress makes it really easy (and inexpensive) to fine-tune a large language model (LLM) for typical document classification tasks. All of the processing happens on your Mac and you also get the indispensable additional advantages of uncertainty quantification, interpretability by example/exemplar, and semantic search capabilities.


Important essay and points. I want to mention that there now exist practical technical approaches that can be used to create trustworthy AI...and such approaches can be run on local models, as this comment suggests.

> "[...] [AI] will act trustworthy, but it will not be trustworthy. We won’t know how they are trained. We won’t know their secret instructions. We won’t know their biases, either accidental or deliberate. [...]"

I agree that this is true of standard deployments of generative AI models, but we can instead reframe networks as a direct connection between the observed/known data and new predictions, and tightly constrain predictions against the known labels. In this way, we can have controllable oversight of biases, out-of-distribution errors, and, more broadly, a clear relation to the task-specific training data.

That is to say, I believe the concerns in the essay are valid in that they reflect one possible path in the current fork in the road, but it is not inevitable, given the potential of reliable, on-device, personal AI.


Ditto. This is the most sophisticated viz of parameters I've seen...and it's also an interactive, step-through tutorial!


"As with most software development, modern AI work is all about knowing your tools and when it's appropriate to use them." 100% agree. Ditto with just using the easiest to access models as initial proof-of-concept/dev/etc. to get started.

(I do agree with the overall sentiment of the TC article, although as noted by others below, there's some mashing of terminology in the article. E.g., I, too, associate GOFAI with symbolic AI and planning.)

There's another dimension, too, not mentioned in the article: even with general-purpose LLMs, production applications still require labeled data to produce uncertainty estimates. (There's a sense in which any well-defined and tested production application is a 'single-task' setting, in its own way.) One of the reasons on-device/edge AI has gotten so interesting, in my opinion, is that we now know how to derive reliable uncertainty estimates with neural models (more or less independent of scale). As long as prediction uncertainty is sufficiently low, there's no particular reason to go to a larger model. That can lead to non-trivial cost/resource savings, as well as the other benefits of keeping things on-device.


Can you link to any methods for deriving reliable uncertainty estimates? Sounds useful.


I like the analogy to a router and local Mixture of Experts; that's basically how I see things going, as well. (Also, agreed that Huggingface has really gone far in making it possible to build such systems across many models.)

There's also another related sense for which we want routing across models for efficiency reasons in the local setting, even for tasks for the same input modalities:

First, attempt prediction on small(er) models, and if the constrained output is not sufficiently high probability (with highest calibration reliability), route to progressively larger models. If the process is exhausted, kick it to a human for further adjudication/checking.
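The routing loop above can be sketched in a few lines. The model interface here is hypothetical (each model returns a label plus a calibrated probability); the point is just the smallest-first escalation with a human fallback:

```python
# Illustrative model-cascade router: try models from smallest to
# largest, accepting the first prediction whose calibrated probability
# clears the threshold; if the cascade is exhausted, defer to a human.
from typing import Callable, List, Optional, Tuple

Model = Callable[[str], Tuple[int, float]]  # input -> (label, calibrated prob)

def cascade(x: str, models: List[Model],
            threshold: float = 0.9) -> Tuple[Optional[int], str]:
    for model in models:  # ordered smallest -> largest
        label, prob = model(x)
        if prob >= threshold:
            return label, "model"
    # Cascade exhausted: return a null prediction for human adjudication.
    return None, "human"
```

This also shows why the end-user never sees low-quality output in the selective setting: the only two exits are a sufficiently high-probability prediction or a null result routed to a person.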

