How come sortable attributes are so fast? #4587
Replies: 2 comments
-
Note: Upon further inspection, it looks like my PostgreSQL complaint doesn't hold much water. My simplified example is too simplified; it only becomes a problem when we have a `text_pattern_ops` index in use.
-
Hey @ricardopieper 👋 Meilisearch is very fast with sortable attributes, and the reason is probably the tradeoff we've made to store more information on disk. We use a tree-based data structure that looks like a B+tree; everything is stored in LMDB, our key-value store.

The first layer of this database stores, lexicographically ordered, the price as the key and the set of document IDs with that price as the associated data. When you request documents, the engine walks the entries from smallest to biggest to fetch the documents with the smallest price first. However, when the candidate set of documents is small, this can hurt performance, as we would still be O(n) on the total number of distinct prices.

This is where the design of this B+tree comes into play: we do not have only one layer of key-values (price and associated document IDs), but many more. On top of the first layer, we create groups of four prices and take the union of all the document IDs in each group. We keep adding layers on top of layers until we no longer can, because there are not enough entries left to group.

By starting the search from the highest layer, the densest one, we can decide whether to descend into a group based on whether it contains any candidate documents. If it does, we fetch the sub-layer; if not, we skip it, which means we can skip many entries at once. You can see more details, and some great ASCII art, about this in the codebase.

In conclusion, whether there are many documents to return or few, whether the set of documents is dense or sparse, and however many distinct keys need to be explored, this algorithm will always (or almost always) be fast 🚀
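To make the layered structure above concrete, here is a minimal Python sketch (not Meilisearch's actual code; names and the group size of four are taken from the description above). Layer 0 maps each price to its set of document IDs; each higher layer groups four consecutive entries and unions their IDs, so a search descending from the top layer can skip whole groups that contain no candidate documents:

```python
GROUP_SIZE = 4

def build_layers(price_to_docids):
    """Build the layered index. price_to_docids: dict price -> set of doc IDs."""
    layer = sorted((price, set(ids)) for price, ids in price_to_docids.items())
    layers = [layer]  # layers[0] is the finest layer
    while len(layer) > GROUP_SIZE:
        grouped = []
        for i in range(0, len(layer), GROUP_SIZE):
            chunk = layer[i:i + GROUP_SIZE]
            union = set().union(*(ids for _, ids in chunk))
            grouped.append((chunk[0][0], union))  # keyed by the group's smallest price
        layer = grouped
        layers.append(layer)
    return layers  # layers[-1] is the densest, top layer

def ascending_docids(layers, candidates):
    """Yield (price, doc_id) in ascending price order, restricted to `candidates`.

    Starts at the top layer and only descends into groups whose unioned doc IDs
    intersect the candidate set, skipping many entries at once otherwise.
    """
    def walk(level, lo, hi):
        for idx in range(lo, min(hi, len(layers[level]))):
            price, ids = layers[level][idx]
            if not (ids & candidates):
                continue  # skip this whole group at once
            if level == 0:
                for doc in sorted(ids & candidates):
                    yield price, doc
            else:
                # Entry idx at this level covers children idx*4 .. idx*4+3 below.
                yield from walk(level - 1, idx * GROUP_SIZE, (idx + 1) * GROUP_SIZE)
    yield from walk(len(layers) - 1, 0, len(layers[-1]))
```

For example, with 20 distinct prices and a filter matching only three documents, the walk touches a handful of groups instead of all 20 entries.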
-
Hello,
I want to understand why Meilisearch works so well in my scenario, to the point our team is growing a bit skeptical even if it works. I haven't tested, but I guess Elasticsearch would behave well too. I am partial to Meilisearch though.
Similar questions have been asked before, and I do understand there is a disk space tradeoff: Meilisearch just stores a whole lot of data in indices to make queries fast. But I doubt it's just that.
Basically, we want to understand a bit of the magic behind sorting with `sortable-attributes`. I will explain our problem in PostgreSQL first so you understand where my question comes from.

We currently have a problem in PostgreSQL with a discontinuous read on the index when we use an `order by`: it cannot do an efficient, minimal read on the index and has to read a whole lot of data instead.

Suppose we are a big e-commerce site and we have millions of products in our catalog. We want to run this query:
And suppose we have this index:
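(The original snippets did not survive formatting; hypothetically reconstructed from the description below, with assumed table and column names, the query and index would look something like:)

```sql
-- Hypothetical reconstruction, not the original poster's exact SQL
SELECT *
FROM products
WHERE category_id IN (1, 2, 3 /* ...many category IDs... */)
ORDER BY created_at DESC
LIMIT 25;

-- With a composite index such as:
CREATE INDEX products_category_created_at_idx
    ON products (category_id, created_at DESC);
```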
It would be amazing if PostgreSQL could do a very small index read in this scenario, reading only the top 25 records that match the query. But the query has the complication of passing many category IDs; if it were just one, this would work.
At a very high level, I picture the index as composed of many blocks, each representing a category ID, and inside each block there is a list of pointers ordered by `created_at`. But since I want an ordering involving many of these blocks, PostgreSQL decides to load the entire blocks for these categories from disk and THEN do the sorting... even though each block should already be sorted? I can see this with `EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON) select ...`
and pasting the output into Dalibo (a query plan visualizer). The result is high read throughput from disk and queries taking 2 seconds, sometimes more. With many people running this query, our database performance suffers.

I wonder what happens behind the scenes in Meilisearch. I understand `sort` is one of the ranking rules. When I search without a query, only using filters (passing all the categories separated by OR) and sorting by `created_at` descending, without a full-text search `q` parameter, it takes Meilisearch around 1 ms. I see this kind of performance even on my laptop: millions of records, a complex query, and single-digit milliseconds.

Is that due to the bucket sort that is explained somewhere in the docs? Can you navigate efficiently through the B-tree because it is already sorted on disk? But then how come results are sorted even when I do a full-text search, where the other ranking rules come into play?
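For what it's worth, the per-category-block mental model described above can be sketched in Python: if each category's block is already sorted by `created_at` descending, a lazy k-way merge could surface the newest 25 rows without materializing whole blocks. This is only an illustration of that mental model (with made-up data), not of what either PostgreSQL or Meilisearch literally does:

```python
import heapq
from itertools import islice

# Hypothetical per-category "index blocks": each is a list of
# (created_at, product_id) pairs, already sorted newest-first.
blocks = {
    1: [(95, "a"), (60, "b"), (10, "c")],
    2: [(90, "d"), (55, "e")],
    3: [(99, "f"), (97, "g"), (5, "h")],
}

def top_newest(blocks, category_ids, limit):
    """Lazily k-way merge the requested categories' sorted blocks, newest first."""
    streams = (blocks[c] for c in category_ids if c in blocks)
    # Inputs are already sorted descending, so merge with reverse=True.
    merged = heapq.merge(*streams, key=lambda row: row[0], reverse=True)
    return list(islice(merged, limit))
```

`heapq.merge` never loads a whole block; it only keeps one cursor per category, which is the behavior I would have hoped for from the composite index.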
I don't feel I have a good understanding of this yet.