
Composite index of "STRING" type #1754

Open
SimoneLazzaris opened this issue Jul 28, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@SimoneLazzaris
Collaborator

Creating a composite index on a document collection does not currently work with STRING fields: the index can be created, but an error occurs when a document is inserted.

@SimoneLazzaris SimoneLazzaris added the enhancement New feature or request label Jul 28, 2023
@prog8

prog8 commented Nov 28, 2023

The problem is related to how the index key is created during the upsert.
https://github.com/codenotary/immudb/blob/master/embedded/sql/stmt.go#L981-L999

First, each component of the index is encoded. For the varchar type, the encoder copies the bytes of the string, pads the value with zero bytes, and appends the length of the string at the end. With the default settings, an encoded varchar ends up being 517 bytes long. Later, when the final key is created, all the components are concatenated. If two string columns are used to build the index, the key ends up being more than 517+517 bytes long. A later step checks that the key does not exceed the maximum allowed size, which defaults to 1024. And here is the "kaboom".
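To make the arithmetic concrete, here is a minimal sketch of the failure, not the actual immudb code. The helper names (`encodePaddedVarchar`, `compositeKey`), the not-null marker value, and the exact byte layout (1 prefix byte + 512 padded bytes + 4 length bytes) are assumptions chosen only so that the total matches the 517 bytes mentioned above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	maxVarcharLen = 512  // assumed default maximum varchar length
	maxKeyLen     = 1024 // default maximum key size mentioned above
)

// encodePaddedVarchar is a hypothetical helper, not immudb's real encoder.
// Assumed layout: 1 marker byte, the string zero-padded to maxLen bytes,
// then the string length as a 4-byte big-endian integer.
func encodePaddedVarchar(s string, maxLen int) ([]byte, error) {
	if len(s) > maxLen {
		return nil, fmt.Errorf("value of %d bytes exceeds max length %d", len(s), maxLen)
	}
	enc := make([]byte, 1+maxLen+4) // zero-initialized, so padding is implicit
	enc[0] = 0x80                   // assumed "not null" marker
	copy(enc[1:], s)
	binary.BigEndian.PutUint32(enc[1+maxLen:], uint32(len(s)))
	return enc, nil
}

// compositeKey concatenates the encoded columns and enforces the size limit.
func compositeKey(maxLen int, cols ...string) ([]byte, error) {
	var key []byte
	for _, c := range cols {
		enc, err := encodePaddedVarchar(c, maxLen)
		if err != nil {
			return nil, err
		}
		key = append(key, enc...)
	}
	if len(key) > maxKeyLen {
		return nil, fmt.Errorf("key of %d bytes exceeds limit of %d", len(key), maxKeyLen)
	}
	return key, nil
}

func main() {
	one, _ := compositeKey(maxVarcharLen, "foo")
	fmt.Println(len(one)) // a single column fits

	_, err := compositeKey(maxVarcharLen, "foo", "bar")
	fmt.Println(err) // two columns do not, regardless of how short the strings are
}
```

Note that the failure does not depend on the inserted values: even two three-character strings produce an oversized key, because each column always occupies its full padded width.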

Given the length limitation, one solution could be to build the key from the actual characters only (without the zero-byte padding). But this solution wouldn't be correct, because "foo" (from column1) followed by "bar" (from column2) would be indistinguishable from "fo" (from column1) followed by "obar" (from column2). Another solution is to put a separator byte between the columns in the index and drop the zero-byte padding, but a user can still insert long strings (up to 512 bytes), and we would hit the same issue.

@SimoneLazzaris
Collaborator Author

Good catch.
I think another possible solution could be to use the first n characters of the string, then a separator, and then a strong hash (e.g. sha256). n should be chosen so that n + 1 (separator) + hash_length = 512. That way we can guarantee uniqueness and sorting, except for pathologically long strings.
