Can the index size exceed the RAM size (e.g., via memory mapping), or are index size and number of documents limited by RAM size?
It would be good to mention those limitations in the README.
The most widely used DHT is Kademlia by Petar Maymounkov and David Mazières.
It is used in Ethereum, IPFS, I2P, Gnutella DHT, and many other applications.
For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.
For SimilarityType::Bm25fProximity, which takes into account the proximity between query term matches within the document, we so far have only anecdotal evidence that it returns significantly more relevant results for many queries.
Systematic relevancy benchmarks such as BEIR and MS MARCO are planned.
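For readers unfamiliar with the "vanilla BM25" baseline mentioned above, here is a minimal sketch of the textbook per-term BM25 score. It is not taken from SeekStorm's source; the function names, the Lucene-style non-negative IDF variant, and the commonly used parameters k1 = 1.2 and b = 0.75 are assumptions for illustration only.

```rust
/// Inverse document frequency (Lucene-style variant, never negative).
/// `num_docs`: documents in the index, `doc_freq`: documents containing the term.
fn idf(num_docs: f64, doc_freq: f64) -> f64 {
    (1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln()
}

/// Textbook BM25 contribution of a single query term to one document.
/// `tf`: term frequency in the document,
/// `doc_len` / `avg_doc_len`: document length and average length in tokens,
/// `k1`: term-frequency saturation, `b`: strength of length normalization.
fn bm25_term(
    tf: f64,
    doc_len: f64,
    avg_doc_len: f64,
    num_docs: f64,
    doc_freq: f64,
    k1: f64,
    b: f64,
) -> f64 {
    // Longer-than-average documents are penalized, shorter ones are boosted.
    let length_norm = 1.0 - b + b * doc_len / avg_doc_len;
    idf(num_docs, doc_freq) * tf * (k1 + 1.0) / (tf + k1 * length_norm)
}
```

The score of a multi-term query is the sum of `bm25_term` over its terms. A proximity-aware ranking would add a further signal on top of this; how SeekStorm does that is not shown here.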
In SeekStorm you can choose per index whether to use Mmap or let SeekStorm fully control Ram access. The latter has a slight performance advantage, at the cost of a longer index load time than the former.
https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...
SeekStorm does not currently use io_uring, but it is on our roadmap.
The challenge is cross-platform compatibility: Linux (io_uring) and Windows (IoRing) use different implementations, and other operating systems don't support it. There is no abstraction layer over those implementations in Rust, so we are on our own.
It would increase concurrent read and write speed (index loading, searching) by removing the need to hold a lock across each seek and read/write.
But I would expect that the mmap implementations already use io_uring / IoRing.
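To illustrate the locking mentioned above, here is a minimal standard-library sketch of reading a block through a shared file handle. It is not SeekStorm's code and does not use io_uring; the function name `read_block_locked` is hypothetical. It only shows why a seek followed by a read needs a lock when several threads share one file cursor.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::sync::Mutex;

/// Read `len` bytes starting at `offset` from a file shared between threads.
/// The file has a single cursor, so the lock must be held across BOTH calls:
/// otherwise another thread could move the cursor between seek and read.
fn read_block_locked(file: &Mutex<File>, offset: u64, len: usize) -> io::Result<Vec<u8>> {
    let mut guard = file.lock().expect("file lock poisoned");
    guard.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    guard.read_exact(&mut buf)?;
    Ok(buf)
}
```

With this pattern, concurrent readers are serialized on the mutex. Interfaces that take the offset as part of the request, which is how io_uring read requests are submitted, do not depend on a shared cursor, so this particular lock is no longer needed.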
Yes, lazy loading would be possible, but pure RAM access does not offer enough benefits to justify the effort to replicate much of the memory mapping.
The benchmark should be reasonably fair, as it was developed by Tantivy themselves (and Jason Wolfe). So, the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark it on your machine with your data and your queries.
The query "to be or not to be" that you mentioned, consisting solely of stopwords, returns complete results and performs quite well in the benchmark: https://github.com/SeekStorm/SeekStorm?tab=readme-ov-file#be...
Both Lucene and Elastic still offer stopword filters: https://lucene.apache.org/core/10_3_2/analysis/common/org/ap... https://www.elastic.co/docs/reference/text-analysis/analysis...