Skip to content
This repository has been archived by the owner on Jan 28, 2021. It is now read-only.

Multi column indexes are only used if all columns share operation #261

Open
erizocosmico opened this issue Jul 4, 2018 · 17 comments
Open
Labels
blocked Some other issue is blocking this wip work in progress

Comments

@erizocosmico
Copy link
Contributor

Because our methods for Index accept ...interface{}, where the len of that slice is the length of columns in the index, all those methods require all the values, one for each column.

This creates a problem: all columns must use the exact same operation.

For example, consider we have an index on A and B:

  • A = 1 AND B = 1 will use the index.
  • A > 1 AND B > 5 will use the index.
  • A = 1 AND B < 5 will not, because = and < are not the same operation.
@erizocosmico erizocosmico added the bug Something isn't working label Jul 4, 2018
@erizocosmico
Copy link
Contributor Author

Closing, as discussed via slack with @ajnavarro this is expected behaviour.

@ajnavarro
Copy link
Contributor

maybe we should have a look of how is working on mysql or postgres: https://dev.mysql.com/doc/refman/8.0/en/multiple-column-indexes.html

@ajnavarro ajnavarro reopened this Jul 4, 2018
@erizocosmico
Copy link
Contributor Author

In MySQL, it uses the index as long as the value of the leftmost column is specified.

e.g., for an index on (a, b), it would be used for a = 1 AND b > 5

@erizocosmico
Copy link
Contributor Author

What are we going to do with this in the end @ajnavarro?

@ajnavarro
Copy link
Contributor

Let's keep that issue open until we find a good solution for that. Right now is not a high priority.

@kuba--
Copy link
Contributor

kuba-- commented Oct 16, 2018

What if we do not have indexes on multiple columns (internally), so the index on (A, B) would be the same as index on A and index on B.
Ultimately we have to merge 2 indexes (what is pretty fast). Last but not least, it will also dedup. indexes. Right now if you create an index on A, B, C, (A, B), (A, B, C) it means 5 indexes, but if we treat them independently it will be still just 3 indexes (1 per column).
WDYT?

@erizocosmico
Copy link
Contributor Author

I think that might work. I can't think of any reason this may be bad right now. In fact, it should actually help with updates (less indexes to update once repos change). Only downside I can see is the fact that you will have to check all remaining indexes to see if you can delete a pilosa index or not, because it may be in use in another gitbase index.

WDYT @src-d/data-retrieval?

@ajnavarro
Copy link
Contributor

If it is internally, at pilosa implementation level, I'm totally in on that.

We should save on pilosa metadata the pilosa indexes that go-mysql-server indexes are using to know if we can delete the pilosa index on a DELETE INDEX statement or not.

@kuba--
Copy link
Contributor

kuba-- commented Nov 8, 2018

Ok, so I'll start prototype the idea of dedup indexes (#261 (comment))

@kuba-- kuba-- self-assigned this Nov 8, 2018
@kuba-- kuba-- added the wip work in progress label Nov 8, 2018
@kuba--
Copy link
Contributor

kuba-- commented Nov 15, 2018

I noticed a small thing to improve. Having the following index:

CREATE INDEX email_idx ON commits USING pilosa (commit_author_email, commit_author_name);

Following AND query uses the index:

SELECT * FROM commits WHERE commit_author_email='...' AND commit_author_name='...';

but OR query doesn't!

SELECT * FROM commits WHERE commit_author_email='...' OR commit_author_name='...';

but actually, if we have independent indexes (so in this case 2) then for logic operations we may always merge 2 indexes.

@kuba--
Copy link
Contributor

kuba-- commented Nov 15, 2018

It's quite easy to fix the problem with index on (A, B) which can be used only with one condition, e.g.:

WHERE A = '...'

it may require some convention between driver and analyzer (for instance we may always pass to index lookup as many keys as index expressions but we have to keep the order and lookup will skip nil keys), e.g.:

// index (A, B), WHERE A=5
index.Get({5, nil})

@ajnavarro
Copy link
Contributor

But if you create one index with two columns, we shouldn`t try to use the index with only one of the columns.

This can break the intended way to communicate between the Analyzer and the different kind of indexes. Not all indexes can do the same as we are doing with pilosa index, so we should keep the common interface with some constraints.

@kuba--
Copy link
Contributor

kuba-- commented Nov 16, 2018

@ajnavarro - right, it was just experiment, because with bitmaps it's doable.
But actually with (A, B) only A AND B works. If you need A OR B you need to create A, B independently.

@kuba--
Copy link
Contributor

kuba-- commented Nov 16, 2018

Generally with bitmaps I don't see benefits of having multi-column indexes. It works better if we have index per expression and merge them because merging bitmaps is super fast and it's more flexible from composition point of view.

@kuba--
Copy link
Contributor

kuba-- commented Nov 16, 2018

Notes from slack:

If you create one index on expressions (A, B) , actually we don't index tuples, but independently A values and B values as pilosa fields, so under the hood they are already independent structures.
But in our index driver API we require a list of expressions which were specified at creating time, (moreover we require them in the same order) to internally compute intersection for all expressions.
Because we use bitmaps per expression value, combining multiple expression into one index is kind of artificial, under the hood.
For our internals having index on (A, B) is exactly the same as having index on (A) and on (B). Although, for multi-expression indexes every time we compute intersection (A and B). What means that for non-and queries, e.g.:
WHERE A='...' OR B='...' the index won't be used. For independent indexes we can always merge 2 indexes (lookup) by some logic operation, so it will work for AND, OR, ...
Right now if we try to re-use (A, B) index to compute A OR B it will try to do: (A AND B) OR B what is incorrect.
So, I propose that index driver will work only with one expression (so api will simplify a little bit, moreover lookup will not have to compute internally intersects), but analyzer/registry will keep kind of mapping:

[A, B] --> A, B
              ^
[B, C]--------|----> C

In this case for requests WHERE B = '...' analyzer will have to check if we have indexed B expression.
And if you register an index (A, B) instead of getting one lookup with 2 expressions, we'll return 2 lookups (one per expression).

@kuba--
Copy link
Contributor

kuba-- commented Nov 27, 2018

@erizocosmico - where is the main problem of using indexes with 2 different operations? E.g.:

A = 1 AND B < 5

and if we have index per columns (A), (B) instead of one (A, B) it will work?

@erizocosmico
Copy link
Contributor Author

There is no problem, we just only did it when they share operations.
It would work if we have (A), (B), yes.

@kuba-- kuba-- added blocked Some other issue is blocking this and removed bug Something isn't working labels Dec 14, 2018
@kuba-- kuba-- removed their assignment May 20, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
blocked Some other issue is blocking this wip work in progress
Projects
None yet
Development

No branches or pull requests

3 participants