Making my local LLM voice assistant faster and more scalable with RAG (johnthenerd.com)
122 points by JohnTheNerd on June 15, 2024 | 16 comments


I was having a look at the model mentioned, specifically `casperhansen/llama-3-70b-instruct-awq`.

When checking this model, I found a listing [1] claiming it's based on Llama 2(?):

```
Llama 3 70B Instruct AWQ - Parameters and Internals
LLM Name:      Llama 3 70B Instruct AWQ
Base Model(s): Llama 2 70B Instruct (quantumaikr/llama-2-70B-instruct)
Model Size:    70b
```

I added a question [2] on Hugging Face to learn more about this.

Could anyone explain what this means? Does it mean it was trained on version 2 and wrongly named version 3? Or is it just mislabeled metadata?

[1] https://llm.extractum.io/model/casperhansen%2Fllama-3-70b-in...

[2] https://huggingface.co/casperhansen/llama-3-70b-instruct-awq...


I don't know the site you're citing, but it's clearly wrong.

Go look at the model config; you can see it's Llama 3.
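
A quick way to check for yourself (a minimal sketch; it fetches only the config file, not the weights):

```python
import json

from huggingface_hub import hf_hub_download

# Grab just config.json from the repo in question.
config_path = hf_hub_download(
    repo_id="casperhansen/llama-3-70b-instruct-awq",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

# Both generations report model_type "llama", but the vocabulary size
# differs: Llama 2 used 32000 tokens, Llama 3 uses 128256.
print(config["model_type"], config["vocab_size"])
```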


That lag between query and response ruins it for me.


“Excellent query good sir! <said slowly enough to let the LLM catch up>…”

And more seriously, it seems like the LLM could be used to pre-create lots of filler prefixes that correspond to the RAG'd documents being sent to the model.

While it wouldn't help if you're GPU-bound, multiple prompts could be run in parallel with different pieces of context, and then the model could choose the most appropriate response (which could also be done in parallel).
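
Roughly this, assuming an OpenAI-compatible endpoint (which vLLM exposes); the model name, context chunks, and selection prompt are placeholders:

```python
import asyncio

from openai import AsyncOpenAI  # works against vLLM's OpenAI-compatible server

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "casperhansen/llama-3-70b-instruct-awq"  # placeholder

async def answer_with_context(question: str, context: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

async def best_answer(question: str, chunks: list[str]) -> str:
    # One completion per retrieved chunk, all in flight at once.
    candidates = await asyncio.gather(
        *(answer_with_context(question, c) for c in chunks)
    )
    # Then a final pass where the model picks the best candidate.
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    pick = await client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
                "Reply with only the number of the best answer."
            ),
        }],
    )
    return candidates[int(pick.choices[0].message.content.strip())]
```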


For me, it was the cuts between each call haha


If there are many common services for which you can precompute the embeddings, then with a little record keeping and analysis you could figure out some likely questions or requests and pregenerate the responses. That way you could just run a similarity search on the question or command you say and skip the LLM entirely (see the sketch below). It would be interesting to use the LLM to predict some of these from information available ahead of time: calendar events, weather, recent prompt history, recently played media, today's headlines, recent browser history, etc. It'd be your own recommendation algorithm.
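
A minimal sketch of the similarity-search shortcut, assuming sentence-transformers; the canned question/response pairs are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pregenerated (question, response) pairs, e.g. mined from a prompt log.
canned = [
    ("turn off the living room lights", "Lights off. Enjoy the dark."),
    ("what's the weather today", "Cloudy, with a high of 12C."),
]
canned_emb = model.encode([q for q, _ in canned], normalize_embeddings=True)

def try_cache(query: str, threshold: float = 0.85) -> str | None:
    """Return a pregenerated response if the query is close enough to a
    known one; otherwise None, meaning fall through to the LLM."""
    q = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q, canned_emb)[0]
    best = int(scores.argmax())
    return canned[best][1] if float(scores[best]) >= threshold else None
```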


that's a great idea! I've been looking into that (I'm merely logging all prompts in a JSON file for now, so that I can analyze them later).

skipping the LLM would be tough because there are so many devices in my house, not to mention it would take away from the personality of the assistant.

however, a recommendation algorithm would actually work great, since I could augment the LLM prompt with its output regardless of the request.
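
for anyone curious, the logging part is just a few lines, something along these lines (a sketch, with JSONL-style appends so the file never needs rewriting):

```python
import json
import time

def log_prompt(prompt: str, path: str = "prompts.jsonl") -> None:
    # One JSON object per line; appends are cheap and the log can be
    # streamed later for analysis.
    with open(path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "prompt": prompt}) + "\n")
```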


The previous story:

https://news.ycombinator.com/item?id=38985152 (187 comments, 2024-01-13)


I love how the LLM responds to you in a sarcastic, patronising, condescending, uninterested tone...


These responses are mimicking the voice and tone of GLaDOS, the robot from the game Portal.


Cringe conversation. Why can't AIs just do stuff that you ask them to do without pretending to be human?


Then how is it different from Excel/Word and shell/python scripts?


it's much slower, gets things wrong, and insists on things that ain't so.


If it isn't, we have bigger problems.


Because they’re trained to.

I hate the introduction to the response. That's not even trying to be human; it's more like, I don't know, a deranged patronizing butler.


Llama 3 is very keen to be nice. I kind of wonder if that's due to better results on the Chatbot Arena (probably not, just a conspiracy theory I like). But with enough context available, you can definitely tweak the response in many ways. Give it an example or two, tell it to be an emotionally detached HAL, and you'll get what you want.
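
For example, a sketch of the kind of steering that works (the exact wording is made up):

```python
messages = [
    {"role": "system", "content": (
        "You are a home assistant. Respond in a flat, emotionally detached "
        "tone. No pleasantries, no exclamation marks, no small talk."
    )},
    # One example exchange to anchor the style.
    {"role": "user", "content": "Turn on the kitchen lights."},
    {"role": "assistant", "content": "Kitchen lights are on."},
    {"role": "user", "content": "What's the temperature upstairs?"},
]
```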



