We're doing something similar. We first chunk the documents based on h1/h2/h3 headings, then prepend those headings to the beginning of each chunk as context. As an imaginary example, instead of one chunk being:
```
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
```
It is now something like:
```
# Fever
## Treatment
---
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
```
This seems to work pretty well, and doesn't require any LLMs when indexing documents.
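A minimal sketch of such a chunker in Python (the heading-path-plus-`---` chunk format follows the example above; the function name, regex, and data structures are my own assumptions):

```python
import re

def chunk_markdown(text: str) -> list[str]:
    """Split a markdown document on h1/h2/h3 headings and prepend
    the current heading path to each chunk as context."""
    chunks: list[str] = []
    path: dict[int, str] = {}   # heading level -> heading line
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            header = "\n".join(path[lvl] for lvl in sorted(path))
            chunks.append(f"{header}\n---\n{content}" if header else content)
        body.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,3})\s", line)
        if m:
            flush()
            level = len(m.group(1))
            path[level] = line.strip()
            # A new h2 starts a new section, so drop any stale h3 below it.
            for deeper in [lvl for lvl in path if lvl > level]:
                del path[deeper]
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk then carries its section path, so an embedding of the dosage paragraph also encodes that it sits under Fever > Treatment.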
I used to always wonder how LLMs know whether a particular long article or audio transcript was written by, say, Alan Watts. This kind of metadata annotation is presumably common when preparing training data for the Llama models and so on. It could also be the genesis of the claim that ChatGPT got lazier in December: the "date" metadata would "inform" ChatGPT to be unhelpful.
I am working on question answering over long documents / bundles of documents (100+ pages), and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to produce a hierarchical index, organizing the whole bundle into a tree. At query time, I include the retrieved node's path in the tree as additional context.
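A rough sketch of that query-time step in Python, assuming the tree has already been built from the page summaries (the `Node` shape and the toy contract index are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    children: list["Node"] = field(default_factory=list)

def path_to(root: Node, title: str,
            prefix: list[str] | None = None) -> list[str] | None:
    """Return the chain of titles from the root down to the node named `title`."""
    prefix = (prefix or []) + [root.title]
    if root.title == title:
        return prefix
    for child in root.children:
        found = path_to(child, title, prefix)
        if found:
            return found
    return None

# A toy index of the kind the model might produce for a 100-page bundle.
tree = Node("Employment contract", children=[
    Node("Compensation", children=[Node("Base salary"), Node("Bonus scheme")]),
    Node("Termination", children=[Node("Notice period")]),
])

# At query time, prepend the retrieved section's tree path to the prompt.
section = "Notice period"
context = " > ".join(path_to(tree, section) or [])
prompt = f"Context: {context}\n\nQuestion: How much notice must the employee give?"
print(prompt)  # Context: Employment contract > Termination > Notice period ...
```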