Ask HN: What are the drawbacks of caching LLM responses?
1 point by XCSme on March 15, 2024 | 3 comments
I recently added AI integration to my application. While it works great, I dislike two things:

  1. I pay for all user prompts, even for duplicate ones.
  2. I am at the response-time mercy of the LLM API.
I could easily cache all prompts locally in a KV store and simply return the cached answer for duplicate ones.

Why isn't everyone doing this?

I assume one reason is that LLM responses are not deterministic - the same query can return different responses - but this could be handled with a "forceRefresh" parameter on the query.
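
Roughly what I had in mind - just a sketch, where callLLM and the in-memory Map stand in for the real API call and KV store:

  // Sketch only: callLLM and the Map-backed store are placeholders,
  // not the real API client or KV store.
  import { createHash } from "crypto";

  const cache = new Map<string, string>(); // swap for Redis/SQLite/etc.

  const cacheKey = (prompt: string) =>
    createHash("sha256").update(prompt).digest("hex");

  async function askLLM(
    prompt: string,
    callLLM: (p: string) => Promise<string>,
    forceRefresh = false
  ): Promise<string> {
    const key = cacheKey(prompt);
    if (!forceRefresh && cache.has(key)) {
      return cache.get(key)!; // duplicate prompt: no API cost, no latency
    }
    const answer = await callLLM(prompt);
    cache.set(key, answer);
    return answer;
  }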



Two major ones: you now need to handle all the usual cache issues like invalidation (what happens when you want to upgrade or the model improves?), and you also need to think about security - given the drastic timing difference between a cache hit and a real call, anyone can probe your cache to figure out what calls have been made and extract anything in the prompts, like passwords or PII (e.g. going token by token and trying the top possibilities each time).
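
To make the timing point concrete, here's a rough sketch of the single probe that a token-by-token attack would repeat (the endpoint URL and 500ms threshold are made up):

  // Hypothetical probe against a shared prompt cache.
  async function wasAskedBefore(candidatePrompt: string): Promise<boolean> {
    const start = Date.now();
    await fetch("https://your-app.example/ask", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt: candidatePrompt }),
    });
    const elapsed = Date.now() - start;
    // Cache hits come back in milliseconds, real LLM calls take seconds,
    // so latency alone reveals whether this exact prompt was asked before.
    return elapsed < 500;
  }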


> happens when you want to upgrade or the model improves

I was thinking of prefixing the key of each query with the model that returned it, e.g. model_gpt3.5-1000.
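
Something like this (the model id is just an example):

  // Sketch: prefix the cache key with the model that produced the answer,
  // so switching or upgrading models naturally misses the old entries.
  import { createHash } from "crypto";

  function versionedCacheKey(modelId: string, prompt: string): string {
    const promptHash = createHash("sha256").update(prompt).digest("hex");
    return `model_${modelId}:${promptHash}`; // e.g. "model_gpt3.5-1000:9f2a..."
  }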

> anyone can probe your cache to figure out what calls have been made

My use-case is local-only[0], where each user sends requests from their own machine. I could maybe cache by default and add some indication that the answer was returned from cache, alongside a "force regenerate answer" button (rough sketch below).

[0]: https://docs.uxwizz.com/guides/ask-ai-new
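
Something like this for the response shape (the field and function names are just illustrative):

  // Sketch: include a fromCache flag alongside the answer so the UI can
  // show "answered from cache" plus a "force regenerate answer" button.
  interface AskResult {
    answer: string;
    fromCache: boolean;
  }

  function cacheNotice(result: AskResult): string {
    return result.fromCache
      ? "Answered from cache - click to force regenerate"
      : "Fresh answer from the model";
  }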


Just found this: https://github.com/zilliztech/GPTCache, which seems to address this exact idea/issue.



