Warnings while creating cosine based index #106

Open
shriyog opened this issue Dec 21, 2021 · 5 comments

Comments

@shriyog

shriyog commented Dec 21, 2021

While building an NGT index with the cosine distance metric, I see a large number of warnings like the one below.

createIndex: Warning. The specified number of edges could not be acquired, because the pruned parameter [-S] might be set.
  The node id=6651608
  The number of edges for the node=7
  The pruned parameter (edgeSizeForSearch [-S])=40

I created the index with the commands below, without specifying the -S parameter (the default is 40).

ngt create -d 40 -D c cosine-index
ngt append -d 40 cosine-index vectors.ssv

I find this suspicious because the index differs from another one built with the L2 (Euclidean) distance metric on the same input vectors.

  1. Index build time: 4 min (cosine) vs 45 min (L2)
  2. Epsilon vs. precision (tables below)
  3. Index size on disk is the same, though

Euclidean
# Factor (Epsilon)      # of Queries    Precision       Time(msec)      # of computations       # of visted nodes
0       100     0.436   0.293037        0       0
0.01    100     0.55    0.0437106       0       0
0.02    100     0.664   0.0645273       0       0
0.03    100     0.802   100.782         0       0
0.04    100     0.889   728.165         0       0
0.05    100     0.932   2077.52         0       0
0.06    100     0.958   3091.21         0       0
0.07    100     0.973   4509.79         0       0
0.08    100     0.985   5053.05         0       0
0.09    100     0.988   5463.39         0       0
0.1     100     0.993   5964.26         0       0

Cosine
# Factor (Epsilon)      # of Queries    Precision       Time(msec)      # of computations       # of visted nodes
0       100     0.256   0.0588535       0       0
0.01    100     0.273   0.033929        0       0
0.02    100     0.278   0.0337207       0       0
0.03    100     0.286   0.0346833       0       0
0.04    100     0.295   0.0367112       0       0
0.05    100     0.318   0.0401136       0       0
0.06    100     0.355   0.0426844       0       0
0.07    100     0.384   0.0472755       0       0
0.08    100     0.394   0.0479118       0       0
0.09    100     0.415   0.0516687       0       0
0.1     100     0.441   0.057455        0       0

The warning seems to originate from here, so I suspect the cosine-based index was not built properly, which would explain the lower accuracy. Is this expected, or do you have any thoughts on it?

@masajiro
Member

Could you run the command below to get your index's information?

ngt info [your cosine index path]

@masajiro
Member

I tried to reproduce your problem with the datasets I have, but I could not. Since the problem might depend on the dataset, could you provide yours, if possible?

@shriyog
Author

shriyog commented Dec 24, 2021

Hey @masajiro, thanks for the command. It lays out the index metadata, which is quite helpful.

This is the output for an index created with the above-mentioned warnings.

> ngt info catalog-mod-0-cosine/
NGT version: 1.13.7
Processed 1000000
Processed 2000000
Processed 3000000
Processed 4000000
Processed 5000000
Processed 6000000
The size of the object repository (not the number of the objects):	6652051
The number of the removed objects:	0/6652051
The number of the nodes:	6652051
The number of the edges:	130936766
The mean of the edge lengths:	-nan
The mean of the number of the edges per node:	19.68366839
The number of the nodes without edges:	0
The maximum of the outdegrees:	139690
The minimum of the outdegrees:	10
The number of the nodes where indegree is 0:	0
The maximum of the indegrees:	139690
The minimum of the indegrees:	10
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:6652051:130936766:0:19.68366839:-nan:139690:10:1432.146574:139690:10:1432.146574:10:10:10:10:136.0223814:10:198.2695696:10:-nan:0:-nan:0

The dataset had empty vectors, which may or may not be the reason for the warnings. I created another index with 1 million clean vectors, and it didn't give any warnings this time. Here's the command output for it.

> ngt info catalog-1m-clean-cosine/
NGT version: 1.13.7
Processed 1000000
The size of the object repository (not the number of the objects):      1000000
The number of the removed objects:      0/1000000
The number of the nodes:        1000000
The number of the edges:        19999890
The mean of the edge lengths:   0.2193799515
The mean of the number of the edges per node:   19.99989
The number of the nodes without edges:  0
The maximum of the outdegrees:  3598
The minimum of the outdegrees:  10
The number of the nodes where indegree is 0:    0
The maximum of the indegrees:   3598
The minimum of the indegrees:   10
#-nodes,#-edges,#-no-indegree,avg-edges,avg-dist,max-out,min-out,v-out,max-in,min-in,v-in,med-out,med-in,mode-out,mode-in,c95,c5,o-distance(10),o-skip,i-distance(10),i-skip:1000000:19999890:0:19.99989:0.2193799515:3598:10:29.96648422:3598:10:29.96648422:13:13:10:10:92.58104:10:177.7591:10:0.2021325693:0:0.2021325693:0
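
To compare the two `ngt info` reports above without reading them line by line, a minimal sketch follows. It assumes the `label:<whitespace>value` layout shown in the pasted output; the helper names `parse_info` and `summarize` are hypothetical and not part of NGT. It only surfaces fields that parse as NaN (such as the `-nan` mean edge length in the first report) and the ratio of the maximum outdegree to the mean edge count; it does not decide what counts as a healthy index.

```python
import math


def parse_info(text):
    """Collect numeric 'label: value' lines from pasted `ngt info` output.

    Hypothetical helper: assumes the layout shown in this thread.
    """
    stats = {}
    for line in text.splitlines():
        # Skip the CSV-style summary line and lines without a label.
        if line.startswith("#") or ":" not in line:
            continue
        label, _, value = line.rpartition(":")
        try:
            # float() accepts "nan" and "-nan", so NaN fields are kept.
            stats[label.strip()] = float(value.strip())
        except ValueError:
            # Non-numeric values such as "1.13.7" or "0/6652051" are ignored.
            continue
    return stats


def summarize(stats):
    """Return the labels whose value is NaN and the max-outdegree/mean-edges ratio."""
    nan_fields = sorted(k for k, v in stats.items() if math.isnan(v))
    ratio = (stats["The maximum of the outdegrees"]
             / stats["The mean of the number of the edges per node"])
    return nan_fields, ratio
```

Applied to the two reports above, the first yields a NaN mean edge length and a much larger outdegree ratio than the clean 1 million vector index, which is consistent with the suspicion about the input data but does not prove it.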

Also, I want to mention that the optimization guide helped me a lot in achieving the desired accuracy and performance with the ONNG index. Thanks a lot for putting it together.

@shriyog
Author

shriyog commented Dec 24, 2021

The dataset has 6.6 million vectors. I'll try to reproduce the issue with a minimal dataset and share it with you. Let me get back to you on this by Monday.
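
One way to narrow the data down to a minimal reproduction is to scan the input for rows that cosine distance cannot handle. The sketch below is an illustration under stated assumptions: the input is taken to be one whitespace-separated vector per line (as the `.ssv` name suggests), the dimension is 40 as in `ngt create -d 40`, and the function names are hypothetical. Whether NGT itself hits this case internally is not confirmed in this thread; the sketch only shows that cosine distance is undefined for a zero-norm vector, which would be consistent with the `-nan` mean edge length reported earlier.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(a, b), or NaN when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # IEEE 754 arithmetic gives NaN for 0.0/0.0; Python raises
        # ZeroDivisionError instead, so NaN is returned explicitly here.
        return float("nan")
    return 1.0 - dot / (norm_a * norm_b)


def split_usable_rows(lines, dim):
    """Split rows into usable text lines and 1-based numbers of bad rows.

    A row is treated as bad if it is empty, has the wrong width, contains a
    non-numeric or non-finite field, or is all zeros (assumed criteria).
    """
    good, bad = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            vec = [float(field) for field in line.split()]
        except ValueError:
            bad.append(lineno)
            continue
        usable = (len(vec) == dim
                  and all(math.isfinite(v) for v in vec)
                  and any(vec))
        if usable:
            good.append(line.rstrip("\n"))
        else:
            bad.append(lineno)
    return good, bad
```

Passing an open file handle for the vector file as `lines` with `dim=40` would return the rows to keep and the line numbers to inspect; a handful of the flagged rows plus some clean ones could serve as the minimal dataset.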

@masajiro
Member

Did you solve this issue?
