Making my local LLM voice assistant faster and more scalable with RAG (johnthenerd.com)
122 points by JohnTheNerd on June 15, 2024 | 16 comments


I was having a look at the model mentioned, specifically `casperhansen/llama-3-70b-instruct-awq`.

When checking this model, I found a listing [1] claiming it's based on Llama 2(?):

```
Llama 3 70B Instruct AWQ - Parameters and Internals
LLM Name:      Llama 3 70B Instruct AWQ
Base Model(s): Llama 2 70B Instruct (quantumaikr/llama-2-70B-instruct)
Model Size:    70b
```

I added a question [2] on Hugging Face to learn more about this.

Could anyone explain what this means? Does it mean it was trained on version 2 and wrongly named version 3? Or is it just mislabeled metadata?

[1] https://llm.extractum.io/model/casperhansen%2Fllama-3-70b-in...

[2] https://huggingface.co/casperhansen/llama-3-70b-instruct-awq...


I don't know the site you're citing, but it's clearly wrong.

Go look at the model config; you can see it's Llama 3.
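
A quick way to check for yourself (a minimal sketch; it fetches only the config file, not the weights):

```python
import json

from huggingface_hub import hf_hub_download

# Grab just config.json from the repo in question.
config_path = hf_hub_download(
    repo_id="casperhansen/llama-3-70b-instruct-awq",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

# Both generations report model_type "llama", but the vocabulary size
# differs: Llama 2 used 32000 tokens, Llama 3 uses 128256.
print(config["model_type"], config["vocab_size"])
```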


That lag between query and response ruins it for me.


“Excellent query good sir! <said slowly enough to let the LLM catch up>…”

And more seriously, it seems like the LLM could be used to pre-create lots of filler prefixes that correspond to the RAG'd documents being sent to the model.

While it wouldn't help if you're GPU-bound, multiple prompts could be run in parallel with different pieces of context, and then the model could choose the most appropriate response (which could also be done in parallel).
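
Roughly this, assuming an OpenAI-compatible endpoint (which vLLM exposes); the model name, context chunks, and selection prompt are placeholders:

```python
import asyncio

from openai import AsyncOpenAI  # works against vLLM's OpenAI-compatible server

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
MODEL = "casperhansen/llama-3-70b-instruct-awq"  # placeholder

async def answer_with_context(question: str, context: str) -> str:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": f"Context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

async def best_answer(question: str, chunks: list[str]) -> str:
    # One completion per retrieved chunk, all in flight at once.
    candidates = await asyncio.gather(
        *(answer_with_context(question, c) for c in chunks)
    )
    # Then a final pass where the model picks the best candidate.
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    pick = await client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
                "Reply with only the number of the best answer."
            ),
        }],
    )
    return candidates[int(pick.choices[0].message.content.strip())]
```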


For me, it was the cuts between each call haha


If there are many common services for which you can precompute the embeddings, then with a little record keeping and analysis you could figure out some likely questions or requests and pregenerate the responses. That way you could just run a similarity search on the question or command you say and skip the LLM entirely (see the sketch below). It would be interesting to use the LLM to predict some of these from information available ahead of time: calendar events, weather, recent prompt history, recently played media, today's headlines, recent browser history, etc. It'd be your own recommendation algorithm.
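
A minimal sketch of the similarity-search shortcut, assuming sentence-transformers; the canned question/response pairs are made up:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pregenerated (question, response) pairs, e.g. mined from a prompt log.
canned = [
    ("turn off the living room lights", "Lights off. Enjoy the dark."),
    ("what's the weather today", "Cloudy, with a high of 12C."),
]
canned_emb = model.encode([q for q, _ in canned], normalize_embeddings=True)

def try_cache(query: str, threshold: float = 0.85) -> str | None:
    """Return a pregenerated response if the query is close enough to a
    known one; otherwise None, meaning fall through to the LLM."""
    q = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q, canned_emb)[0]
    best = int(scores.argmax())
    return canned[best][1] if float(scores[best]) >= threshold else None
```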


that's a great idea! I've been looking into that (I'm merely logging all prompts in a JSON file for now, so that I can analyze them later).

skipping the LLM would be tough because there are so many devices in my house, not to mention it would take away from the personality of the assistant.

however, a recommendation algorithm would actually work great, since I could augment the LLM prompt with its output regardless of the request.
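
for anyone curious, the logging part is just a few lines, something along these lines (a sketch, with JSONL-style appends so the file never needs rewriting):

```python
import json
import time

def log_prompt(prompt: str, path: str = "prompts.jsonl") -> None:
    # One JSON object per line; appends are cheap and the log can be
    # streamed later for analysis.
    with open(path, "a") as f:
        f.write(json.dumps({"ts": time.time(), "prompt": prompt}) + "\n")
```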


The previous story:

https://news.ycombinator.com/item?id=38985152 (187 comments, 2024-01-13)


I love how the LLM responds to you in a sarcastic, patronising, condescending, uninterested tone...


These responses are mimicking the voice and tone of GLaDOS, the robot from the game Portal.


Cringe conversation. Why can't AIs just do stuff that you ask them to do without pretending to be human?


Then how is it different from Excel/Word and shell/python scripts?


it's much slower, gets things wrong, and insists on things that ain't so.


If it isn't, we have bigger problems.


Because they’re trained to.

I hate the introduction to the response. That's not even trying to be human; it's more like, I don't know, a deranged patronizing butler.


Llama 3 is very keen to be nice. I kind of wonder if that's due to better results on the Chatbot Arena (probably not, just a conspiracy theory I like). But with enough context available, you can definitely tweak the response in many ways. Give it an example or two, tell it to be an emotionally detached HAL, and you'll get what you want.
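
For example, a sketch of the kind of steering that works (the exact wording is made up):

```python
messages = [
    {"role": "system", "content": (
        "You are a home assistant. Respond in a flat, emotionally detached "
        "tone. No pleasantries, no exclamation marks, no small talk."
    )},
    # One example exchange to anchor the style.
    {"role": "user", "content": "Turn on the kitchen lights."},
    {"role": "assistant", "content": "Kitchen lights are on."},
    {"role": "user", "content": "What's the temperature upstairs?"},
]
```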



