We're doing something similar. We first chunk the documents based on h1/h2/h3 headings, then prepend those headings to the beginning of each chunk as context. As an imaginary example, instead of one chunk being:
```
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
```
It is now something like:
```
# Fever
## Treatment
---
The usual dose for adults is one or two 200mg tablets or
capsules 3 times a day.
```
This seems to work pretty well, and doesn't require any LLMs when indexing documents.
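A minimal sketch of such a chunker in Python (the heading-path-plus-`---` chunk format follows the example above; the function name, regex, and data structures are my own assumptions):

```python
import re

def chunk_markdown(text: str) -> list[str]:
    """Split a markdown document on h1/h2/h3 headings and prepend
    the current heading path to each chunk as context."""
    chunks: list[str] = []
    path: dict[int, str] = {}   # heading level -> heading line
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            header = "\n".join(path[lvl] for lvl in sorted(path))
            chunks.append(f"{header}\n---\n{content}" if header else content)
        body.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,3})\s", line)
        if m:
            flush()
            level = len(m.group(1))
            path[level] = line.strip()
            # A new h2 starts a new section, so drop any stale h3 below it.
            for deeper in [lvl for lvl in path if lvl > level]:
                del path[deeper]
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk then carries its section path, so an embedding of the dosage paragraph also encodes that it sits under Fever > Treatment.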
I used to always wonder how LLMs know whether a particular long article or audio transcript was written by, say, Alan Watts. This kind of metadata annotation is presumably common when preparing training data for the Llama models and so on. It could also be the genesis of the claim that ChatGPT got lazier in December: the "date" metadata would "inform" ChatGPT to be unhelpful.
I am working on question answering over long documents / bundles of documents (100+ pages), and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to produce a hierarchical index, organizing the whole bundle into a tree. At query time, I include the retrieved node's path in the tree as additional context.
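A rough sketch of that query-time step in Python, assuming the tree has already been built from the page summaries (the `Node` shape and the toy contract index are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str
    children: list["Node"] = field(default_factory=list)

def path_to(root: Node, title: str,
            prefix: list[str] | None = None) -> list[str] | None:
    """Return the chain of titles from the root down to the node named `title`."""
    prefix = (prefix or []) + [root.title]
    if root.title == title:
        return prefix
    for child in root.children:
        found = path_to(child, title, prefix)
        if found:
            return found
    return None

# A toy index of the kind the model might produce for a 100-page bundle.
tree = Node("Employment contract", children=[
    Node("Compensation", children=[Node("Base salary"), Node("Bonus scheme")]),
    Node("Termination", children=[Node("Notice period")]),
])

# At query time, prepend the retrieved section's tree path to the prompt.
section = "Notice period"
context = " > ".join(path_to(tree, section) or [])
prompt = f"Context: {context}\n\nQuestion: How much notice must the employee give?"
print(prompt)  # Context: Employment contract > Termination > Notice period ...
```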