
> However, skills are different from MCP. Skills has nothing to do with tool calling at all

Although skills do require certain tools to be available, like basic file system operations, so the model can read the skill files. Usually this is implemented as an ephemeral "sandbox environment" where the LLM has access to a file system and can also execute Python, run bash commands, etc.
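
For illustration, the tool surface of such a sandbox can be as small as this (a hypothetical sketch; the tool names and shapes are made up, not Anthropic's actual implementation):

  import subprocess
  from pathlib import Path

  SKILLS_DIR = Path("/sandbox/skills")  # hypothetical mount point

  def read_file(relative_path: str) -> str:
      """Lets the model read a skill's instruction files."""
      return (SKILLS_DIR / relative_path).read_text()

  def run_bash(command: str) -> str:
      """Lets the model execute shell commands inside the sandbox."""
      result = subprocess.run(command, shell=True,
                              capture_output=True, text=True)
      return result.stdout + result.stderr

  # The model "loads" a skill simply by reading its entry point, e.g.
  # read_file("pdf-extraction/SKILL.md")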


We're doing something similar. We first chunk the documents based on h1/h2/h3 headings. Then we prepend the headings to the beginning of each chunk as context. As an imaginary example, instead of one chunk being:

  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
It is now something like:

  # Fever
  ## Treatment
  ---
  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
This seems to work pretty well, and doesn't require any LLMs when indexing documents.
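
Roughly, the indexing step looks something like this (a simplified sketch, not our production code; it assumes the documents are already markdown):

  import re

  HEADING = re.compile(r"^(#{1,3}) (.+)$", re.MULTILINE)

  def chunk_with_context(markdown: str) -> list[str]:
      """Split on h1-h3 headings; prefix each chunk with its heading path."""
      chunks, stack, pos = [], [], 0

      def flush(end: int) -> None:
          body = markdown[pos:end].strip()
          if body:
              header = "\n".join(stack)
              chunks.append(f"{header}\n---\n{body}" if header else body)

      for m in HEADING.finditer(markdown):
          flush(m.start())
          pos = m.end()
          level = len(m.group(1))
          del stack[level - 1:]      # pop headings at this level or deeper
          stack.append(m.group(0))   # keep e.g. "## Treatment" verbatim
      flush(len(markdown))
      return chunks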

(Edited formatting)


I always used to wonder how LLMs know whether a particular long article or audio transcript was written by, say, Alan Watts. This kind of metadata annotation would be common when preparing training data for Llama models and so on. It could also be the genesis of the argument that ChatGPT got slower in December: that "date" metadata would "inform" ChatGPT to be unhelpful.


I am working on question answering over long documents / bundles of documents (100+ pages), and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to provide a hierarchical index, which organizes the whole bundle into a tree. At query time I include the path in the tree as additional context.
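
A sketch of the query-time side, assuming the model has already produced the tree during indexing (the node shape and titles here are made up for illustration):

  # Hypothetical tree the model produced during indexing:
  index = {"title": "Bundle", "pages": [], "children": [
      {"title": "Contracts", "pages": [], "children": [
          {"title": "Lease agreement", "pages": [12, 13], "children": []}]}]}

  def path_for_page(node, page, trail=()):
      """Return the root-to-leaf title path of the node covering `page`."""
      trail = trail + (node["title"],)
      if page in node["pages"]:
          return " > ".join(trail)
      for child in node["children"]:
          if (found := path_for_page(child, page, trail)):
              return found
      return None

  # Prepended to the retrieved page text as additional context:
  print(path_for_page(index, 12))  # Bundle > Contracts > Lease agreement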


Did you experiment with different ways to format those included headers? Asking because I am doing something similar to that as well.


Nope, not yet. We've stuck with a markdown-ish syntax so far.


So the regex version still beats the LLM solution, and the LLM adds the risk of hallucinations. I wonder if they tried making the SLM rewrite or update the existing regex solution instead of generating the whole content again? That would mean fewer output tokens and faster inference, and the output couldn't contain hallucinations. Although I'm not sure whether small language models are capable of writing regex.
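
Something like this, I mean: the model only proposes the pattern and the extraction itself stays deterministic, so whatever comes out is a verbatim substring of the page (a hypothetical sketch, where `propose_pattern` stands in for the SLM call; not something the paper tested):

  import re

  def extract(page: str, pattern: str, propose_pattern) -> list[str]:
      """Reuse the existing regex; ask the SLM for an updated one only
      when it stops matching. The output can't be hallucinated because
      it is always copied straight out of `page`."""
      matches = re.findall(pattern, page)
      if not matches:
          # propose_pattern(page, stale_pattern) -> new regex string;
          # a regex is far fewer output tokens than regenerating the
          # whole extracted content.
          pattern = propose_pattern(page, pattern)
          matches = re.findall(pattern, page)
      return matches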


I think a regex can beat an SLM for a specific use case. But for the general case, there is no chance you can come up with a pattern that works for all sites.


I just recently found out that there is a Deno kernel for Jupyter notebooks:

https://blog.jupyter.org/bringing-modern-javascript-to-the-j...


Chroma used DuckDB at some point; that might not be the case anymore, though.


We use the term "pre-googling" for this sort of information retrieval: you have some concept in your head and want to know the exact term for it. Once you get the term you're looking for from the LLM, you move to Google and search for the "facts".

This might be a weird example for native English speakers, but recently I just couldn't remember the term for a graph where you're only allowed to move in one direction and cannot make loops. The LLM gave me the answer (directed acyclic graph, or DAG) right away. Once I had the term I was looking for, I moved on to Google search.

The same "pre-googling" works when you don't know whether some concept exists at all.


> a graph where you're only allowed to move in one direction and cannot make loops

To be fair, you didn't need an LLM for this. Googling that, the answer (DAG) is in the title of the first Google result.

(Not to invalidate your point, but the example has to be more obscure than that for this strategy to be useful)


I recently started watching Fallout, and it reminded me of a book I read about a future religious order piecing together pre-bomb scientific knowledge. The LLM immediately pointed me to A Canticle for Leibowitz (which is great, btw). Google results will do the same, but the LLM is much faster and more direct. I find it great for stuff like this, where you know there is an answer and will recognise it as soon as you see it. I genuinely think it can become an extension of my long-term memory, but I'm slightly nervous about the effect it will have on my actual memory if I just don't need to remember stuff like this anymore!


The pre-googling is an excellent idea. You are augmenting the query, not generating nonsense answers. My wife uses ChatGPT as a thesaurus quite a lot.


I assume you need to split the data into suitably sized database rows matching your model's max length? Or does it do some chunking magic automatically?


There is no chunking built into the postgres extension yet, but we are working on it.

It does check the context length of the request against the limits of the chat model before sending the request, and it optionally lets you auto-trim the least relevant documents out of the request so that it fits the model's context window. IMO it's worth spending time getting chunks prepared, sized, and tuned for your use case, though. There are some good conversations above discussing methods for this, such as using a summarization model to create the chunks.
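
Conceptually the trimming is just this (a simplified sketch, not the extension's actual code; `count_tokens` stands in for whatever tokenizer you use):

  def fit_to_context(docs, budget_tokens, count_tokens):
      """Keep the most relevant documents that fit in the token budget.
      `docs` is a list of (relevance_score, text) pairs from retrieval."""
      kept, used = [], 0
      for score, text in sorted(docs, key=lambda d: d[0], reverse=True):
          cost = count_tokens(text)
          if used + cost <= budget_tokens:
              kept.append((score, text))
              used += cost
      return kept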


I wonder how this compares to https://vlcn.io?


I commented on another thread about cr-sqlite [0]. In addition to that, I believe Mycelial is VC funded, whereas Matt has GitHub Sponsors. I hope there's a future where he teams up with, say, Fly.io to make cr-sqlite sustainable.

0: https://news.ycombinator.com/item?id=36475514


Getting there! fly.io is using cr-sqlite these days :)


Doesn't the same question apply to any content you're about to read? How can you know that the blog post or article writer didn't "hallucinate"?


This is 3 years old but still a pretty well done explanation of the MH370 case: https://www.youtube.com/watch?v=kd2KEHvK-q8

