`Index.load()` creates instance which does not find words present in the inverted index #503
Some further debugging shows that at some point I have a token set with two edges, one for '0' and one for '1', which both point to a single final token set that happens to have id 4. I also happen to have a token set with a single edge for '0' which points to a token set with id '414'. Both of them have the same string representation, but they don't represent the same thing. The first represents that your search term ends with either a '0' or a '1', while the second represents that you can have a '0' followed by some other characters.

The reason I could not reduce the number of entries in the inverted index is that it would cause that last id '414' to change. This is also the reason If I change
I don't think this can cause the same collisions since a label is only a single character, and an id can only contain numbers. But
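The collision between the two token sets can be reproduced in isolation. This is a minimal sketch, not lunr's actual code: it assumes the string representation is built from the final flag followed by each sorted edge label and the id of the node it points to. The helper names `makeNode` and `nodeKey`, and the ids 10 and 11, are hypothetical and chosen only for illustration.

```javascript
// Simplified stand-in for a token set node (hypothetical, not lunr's class).
function makeNode(id, isFinal) {
  return { id: id, final: isFinal, edges: {} };
}

// ASSUMPTION: the key is the final flag ('1' or '0') followed by
// each sorted edge label and the id of the node that edge points to.
function nodeKey(node) {
  let key = node.final ? '1' : '0';
  for (const label of Object.keys(node.edges).sort()) {
    key += label + node.edges[label].id;
  }
  return key;
}

// Token set with edges '0' and '1', both pointing to a final node with id 4.
const leaf = makeNode(4, true);
const branching = makeNode(10, false);
branching.edges['0'] = leaf;
branching.edges['1'] = leaf;

// Token set with a single edge '0' pointing to a node with id 414.
const deeper = makeNode(414, false);
const chain = makeNode(11, false);
chain.edges['0'] = deeper;

const keyBranching = nodeKey(branching); // '0' + '0' + 4 + '1' + 4
const keyChain = nodeKey(chain);         // '0' + '0' + 414

console.log(keyBranching, keyChain, keyBranching === keyChain);
```

Under that assumed format, the label '1' and the id 4 of the first node run together with the digits of id 414 of the second, so two structurally different nodes end up with the same key.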
I have a generated index which, after loading, will not match certain keywords in the inverted index. However, some tricks will cause lunr to find matches for the exact same keywords.
I was not able to create a very minimal example, but the original generated index (and the reduced version) is manageable. I could not attach JSON files, so I have created a gist which includes both the full index and the one referenced in the code below. There is also a reproduce script which illustrates the behaviour.
Loading the index and searching for the keyword will not work:
But the keyword is in the inverted index, as we can see by manually inspecting the JSON:
In the reduced example I have removed all the entries in the inverted index starting from that match. When I add just the first entry (the one we try to match) we do find one:
However, when the entry after that is also added, no matches are found:
Even weirder is that altering some lunr internals (the ids of the TokenSet) will cause everything to work again.
Changing any of the entries in the inverted index before `eifuw001r00` will make the search work too. Although I have found some changes which will still cause the search to fail, nearly all changes will make it work. This is also the reason why I have a rather long index in the reproduction example.

I get the impression that the behaviour is related to this code here. The result of `TokenSet.toString()`, which includes the id, is used as the key for a lookup in `this.minimizedNodes`. I'm guessing that it matches something that it should not, and modifies a token set's edges, which causes it to lose information.

I have also tried to peek inside the generation of the token set inside the `TokenSet.Builder` (using the code in the gist). Everything seems to be going fine until the call to `TokenSet.finish()`. After that it seems like it only knows about `eifuw0` and `eifuw1` instead of `eifuw001r00` and `eifuw001r01`.

Any idea what is causing this?
Is it expected behaviour, or is this a bug?
Is there a way to fix, or detect this?
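To make the guess about the `this.minimizedNodes` lookup concrete, here is a minimal sketch of how a key-based lookup could drop words. It is a simplified model, not lunr's implementation: the key format, the `register` helper, the `words` helper, and all ids and labels are assumptions chosen for illustration.

```javascript
// Simplified stand-in for a token set node (hypothetical, not lunr's class).
function makeNode(id, isFinal) {
  return { id: id, final: isFinal, edges: {} };
}

// ASSUMPTION: key = final flag + each sorted label followed by the child id.
function nodeKey(node) {
  let key = node.final ? '1' : '0';
  for (const label of Object.keys(node.edges).sort()) {
    key += label + node.edges[label].id;
  }
  return key;
}

// Collect every word accepted when starting from `node`.
function words(node, prefix = '', out = []) {
  if (node.final) out.push(prefix);
  for (const label of Object.keys(node.edges).sort()) {
    words(node.edges[label], prefix + label, out);
  }
  return out;
}

// Nodes already seen, indexed by their string key.
const minimized = {};

// If a node with the same key was seen before, reuse it in place of the child.
function register(parent, label) {
  const child = parent.edges[label];
  const key = nodeKey(child);
  if (key in minimized) {
    parent.edges[label] = minimized[key];
  } else {
    minimized[key] = child;
  }
}

// Branch 'a': edges '0' and '1' both lead to the final node with id 4.
const leaf = makeNode(4, true);
const branching = makeNode(10, false);
branching.edges['0'] = leaf;
branching.edges['1'] = leaf;

// Branch 'w': edge '0' leads to node 414, which continues with 'x'.
const tail = makeNode(415, true);
const deeper = makeNode(414, false);
deeper.edges['x'] = tail;
const chain = makeNode(11, false);
chain.edges['0'] = deeper;

const root = makeNode(0, false);
root.edges['a'] = branching;
root.edges['w'] = chain;

const before = words(root);
register(root, 'a'); // stores the branching node under its key
register(root, 'w'); // same key, so the chain node is replaced
const after = words(root);

console.log(before, after);
```

In this model the word `w0x` is accepted before the lookup and replaced by `w0` and `w1` afterwards, which resembles the observed `eifuw0` and `eifuw1` appearing in place of the longer keywords.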
Kind regards