Squeezing millions of documents in 128 TB of virtual memory (meilisearch.com)
14 points by dureuill on May 9, 2023 | hide | past | favorite | 5 comments


I tried Meilisearch; unfortunately, despite being built on LMDB, it does not work at scale.

For any collection above 1.5M documents, insert times are crazy: around 60 seconds to index an additional 6,000 documents. There is something very wrong with the core data model, and the single-writer LMDB design becomes a bottleneck, IMO.


Hey rkwasny,

Which version of Meilisearch were you using?

We chose LMDB over RocksDB because it is a much faster, more memory- and CPU-efficient key-value store. It would probably be quicker to insert all those inverted indexes into RocksDB, but RocksDB is an LSM tree: entries are spread across sorted runs that must be merged on read, so the search side of the engine would suffer. LMDB is also much easier to operate, with far fewer parameters to manage (user-end cache, etc.).
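To illustrate that read-path difference, here is a toy sketch (not either engine's actual code): an LSM-style store may have to probe several immutable sorted runs per point lookup, while a B-tree store like LMDB answers from a single ordered structure.

```python
import bisect

# Toy LSM read path: a key may live in any of several immutable sorted
# runs (newest first), so a point lookup probes runs until it hits.
runs = [
    sorted([("apple", 3), ("kiwi", 7)]),    # newest memtable flush
    sorted([("apple", 1), ("banana", 2)]),  # older SSTable
]

def lsm_get(key):
    for run in runs:  # newest run wins (newer values shadow older ones)
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return run[i][1]
    return None

# B-tree-style store (LMDB-like): one ordered structure, one lookup.
btree = {"apple": 3, "banana": 2, "kiwi": 7}

print(lsm_get("apple"))   # 3 (newest run shadows the older value 1)
print(btree["banana"])    # 2
```

The merge-on-read cost is why LSM stores tend to favor write throughput while B-tree stores favor read latency.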

The bottleneck of Meilisearch's indexing phase is not writing into LMDB. It is mainly the computation of a number of derived databases, such as the ones storing prefix words. We are working on making that much faster and on reducing the index size on disk.
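A toy sketch of what a prefix-words database could look like (a hypothetical structure for illustration, not Meilisearch's actual schema): the posting lists of every word sharing a prefix are merged ahead of time, so prefix queries don't have to scan the whole word index at search time.

```python
from collections import defaultdict

# Base inverted index: word -> set of document ids.
word_docids = {
    "search": {1, 2},
    "seat": {2, 3},
    "sun": {4},
}

# Derived database: prefix -> merged docids, for prefixes up to
# length 2 (a hypothetical cutoff; a real engine would tune which
# prefixes are worth materializing).
prefix_docids = defaultdict(set)
for word, ids in word_docids.items():
    for n in range(1, min(len(word), 2) + 1):
        prefix_docids[word[:n]] |= ids

print(sorted(prefix_docids["se"]))  # [1, 2, 3]
```

Precomputing these merged lists is pure CPU work over the whole index, which hints at why this phase, rather than the LMDB writes themselves, dominates indexing time.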


meilisearch 1.1.1

Yup, that's exactly what I'm saying: there is something wrong with the data model. I also noticed it's only using a single thread.

Out of curiosity, what dataset sizes are you testing on? Indexing 2M documents on a server in 2023 should be instant, or at least take <60 seconds.

Another learning: not every piece of software written in Rust is fast :-)


> Yup, that's exactly what I'm saying: there is something wrong with the data model.

We are an open-source project and know we can do better on indexing speed. We have already done a lot of work on that subject; for instance, we enabled the auto-batching feature, which significantly improved document-indexing times.

Are you sending all your documents in one go, or are you waiting for each batch's tasks to be indexed before sending the next one?

> I also noticed it's only using a single thread.

Unfortunately, one pass of the indexing process is currently single-threaded. We can do better, and we are aware of that. Software development takes time.

> Out of curiosity, what dataset sizes are you testing on?

We are testing on a broad list of datasets, ranging from a 200k-document movies dataset to a 142-million-document songs dataset.

> Another learning: not every piece of software written in Rust is fast :-)

It depends on what you want to be fast. Software engineering is a matter of trade-offs, and our search-speed results are excellent.


Co-founder and Tech Lead here,

Thanks to Louis' work and the design of LMDB, we can efficiently use the virtual address space of the OS and let it manage the memory Meilisearch uses in the best way possible.
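The underlying technique, a memory-mapped file whose pages the OS faults in and evicts on demand, can be sketched with Python's stdlib mmap (a minimal illustration of the mechanism, not Meilisearch or LMDB internals):

```python
import mmap
import os
import tempfile

# A file larger than we'd want to hold in heap memory at once.
path = os.path.join(tempfile.mkdtemp(), "store.bin")
with open(path, "wb") as f:
    f.truncate(16 * 1024 * 1024)  # 16 MiB, sparse on most filesystems

with open(path, "r+b") as f:
    # The mapping only reserves address space; physical pages are
    # faulted in by the OS when we actually touch them.
    mm = mmap.mmap(f.fileno(), 0)
    mm[0:5] = b"hello"             # touching this page pulls it in
    mm[8 * 1024 * 1024] = 0x2A     # a far-away page, faulted separately
    print(mm[0:5], mm[8 * 1024 * 1024])  # b'hello' 42
    mm.close()                     # flushes dirty pages to the file
```

Because the kernel handles paging, a process can map far more data than it has RAM, which is how "128 TB of virtual memory" can back millions of documents on an ordinary machine.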

We continuously improve the indexing speed, index size, memory usage, language support, and search speed. Perfection is a goal, and we will eventually reach it. Unfortunately, computer science is the science of trade-offs, and the hard part is choosing which features are most critical. For us, those are search speed and accuracy.



