create_index: No validation when split_length <= split_overlap [BUG] #605

Open
pandu-k opened this issue Sep 22, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@pandu-k
Collaborator

pandu-k commented Sep 22, 2023

Describe the bug
An internal error occurs in add_documents when split_length <= split_overlap, because the step passed to windowed (split_length - split_overlap) is then less than 1. This issue was raised on our forums.

Reproducing the issue
To reproduce:

# create index: 

curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index -d '{ "index_defaults": { "text_preprocessing": { "split_length": 2, "split_overlap": 5, "split_method": "word" }, "treat_urls_and_pointers_as_images": false, "model": "hf/all_datasets_v4_MiniLM-L6", "normalize_embeddings": true, "image_preprocessing": { "patch_method": null }, "ann_parameters" : { "space_type": "cosinesimil", "parameters": { "ef_construction": 128, "m": 16 } } }, "number_of_shards": 3, "number_of_replicas": 0 }'

# add docs

curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index/documents -d '{ "documents" : [{"_id":"1","title":"Fat cat","description":"The fat cat sits on the mat in the sunshine"},{"_id":"2","title":"Brown fox","description":"The quick brown fox jumps over the lazy dog"}], "tensorFields" : ["description"] }'

Yields this error:

Marqo logs:

  File "/app/src/marqo/tensor_search/tensor_search.py", line 522, in add_documents
    content_chunks = text_processor.split_text(field_content, split_by=split_by,
  File "/app/src/marqo/s2_inference/processing/text.py", line 147, in split_text
    segments = list(windowed(split_text, n=split_length, step=split_length - split_overlap))
  File "/usr/local/lib/python3.8/dist-packages/more_itertools/more.py", line 841, in windowed
    raise ValueError('step must be >= 1')
ValueError: step must be >= 1

The response returned to the client is unhelpful: Internal Server Error.
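To make the failure concrete, here is a minimal sketch of the arithmetic behind the traceback, using the settings from the reproduction above. `window_step` and `check_step` are hypothetical helpers written for illustration; `check_step` only stands in for the guard inside `more_itertools.windowed`, with the message copied from the logs.

```python
def window_step(split_length: int, split_overlap: int) -> int:
    """Step passed to windowed() in the traceback: split_length - split_overlap."""
    return split_length - split_overlap


def check_step(step: int) -> None:
    # Stand-in for the check in more_itertools.windowed (not the library code itself).
    if step < 1:
        raise ValueError("step must be >= 1")


# Settings from the index created in the reproduction steps.
step = window_step(split_length=2, split_overlap=5)
print(step)  # -3

try:
    check_step(step)
except ValueError as exc:
    print(exc)  # step must be >= 1
```

Note that equal values (for example split_length = 5 and split_overlap = 5) give a step of 0, which fails the same check.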

Expected behavior
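Validation at index creation time should reject settings where split_length <= split_overlap and return a clear error message, instead of allowing the index to be created and failing later when documents are added.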
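One possible shape for such a check is sketched below. This is not Marqo's actual code: `InvalidIndexSettingsError` and `validate_text_preprocessing` are hypothetical names, and the sketch assumes both `split_length` and `split_overlap` are present in the `text_preprocessing` settings, as they are in the request above.

```python
class InvalidIndexSettingsError(ValueError):
    """Hypothetical error type; Marqo's real exception classes may differ."""


def validate_text_preprocessing(text_preprocessing: dict) -> dict:
    """Reject settings that would make the window step smaller than 1."""
    split_length = text_preprocessing["split_length"]
    split_overlap = text_preprocessing["split_overlap"]

    # step = split_length - split_overlap must be >= 1
    if split_length <= split_overlap:
        raise InvalidIndexSettingsError(
            f"split_length ({split_length}) must be greater than "
            f"split_overlap ({split_overlap})"
        )
    return text_preprocessing


# The settings from this report would be rejected at index creation time.
try:
    validate_text_preprocessing(
        {"split_length": 2, "split_overlap": 5, "split_method": "word"}
    )
except InvalidIndexSettingsError as exc:
    print(exc)
```

Raising a dedicated error at creation time would let the API return a descriptive client error rather than an Internal Server Error on add_documents.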

pandu-k added the bug label Sep 22, 2023
@TeimasTeimoso

Hello @pandu-k, can I try to pick up this issue?
