Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyCurl for Image Downloading on Add Documents #814

Draft
wants to merge 4 commits into
base: mainline
Choose a base branch
from

Conversation

OwenPendrighElliott
Copy link
Contributor

  • What kind of change does this PR introduce?

Improvement

  • What is the current behavior? (You can also link to an open issue here)

Requests is used to download images and introduces a bottleneck in multimodal indexing.

  • What is the new behavior (if this is a feature change)?

In another piece of work I was experimenting with alternative to requests for https requests to get files. My testing showed that pycurl was about 3-4x faster than requests so I wanted to test if the same applied to image downloading in Marqo. It appears that it does hold true, even with a naive implementation. Hitting images in a US S3 bucket from Aus was 3-4x faster with pycurl, examples in our docs are about 2-3x faster. Using our local image search demo as an example, running on a machine with an Nvidia 2080 SUPER using 1 worker and a batch size of 32:

  • The mainline branch of Marqo would take about 17 hours (9-10 seconds per batch of 32)
  • The updated pycurl implementation would take 7.5 hours (3-4 seconds per batch of 32)

Below is a histogram of the image_download.full_time from telemetry for the first 100 batches using each implementation, the difference is notable - this pattern holds at 6 concurrent workers. There are other benchmarks which confirm pycurl is faster in many applications.

image (1)

Here is an example response from the multimodal example in our docs:

With requests (1446ms):

{'errors': False, 'processingTimeMs': 1445.500699999684, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '6424ff33-63e0-4104-ab25-8f1adea50ead'}, 
{'status': 200, '_id': '0b89345a-1dee-4050-9caa-e48b794a1662'}, {'status': 200, '_id': 'c7ccee6f-213f-4b7f-be2a-1e8e3402b5dd'}]}

With PyCurl (563ms):

{'errors': False, 'processingTimeMs': 562.6993000005314, 'index_name': 'my-first-multimodal-index', 'items': [{'status': 200, '_id': '3c70a84c-e68b-4547-b8ef-20cf2fab4cf9'}, 
{'status': 200, '_id': 'a8646b51-0640-4bb0-8661-95efc1023f0b'}, {'status': 200, '_id': 'ca44ae44-88d5-4eb8-8802-4b280e48c968'}]}

The current implementation I did is pretty basic - it simply instantiates the PyCurl object in each image download. I expect that we can could get even better performance using the I/O multiplexing and DNS caching. As far as I can tell the only reason this implementation is faster is because libcurl is simply faster than urllib3, I tried disabling the streaming with requests as is currently done but that didn't change the speed at all. This adds a dependency for pycurl and certifi (certifi may or may not be needed but I had certificate issues on WSL without it), this also might increase memory usage because there is a moment when the raw image data and the PIL image are held in memory together although the raw data should be dropped from memory almost instantly as it immediately becomes out of scope. Can someone check the changes (maybe from a machine in the US) and confirm these results are not just specific to my machine?

  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No

  • Have unit tests been run against this PR? (Has there also been any additional testing?)

No

TODO

  • check memory impact
  • test from other machines in other localities
  • Please check if the PR fulfills these requirements
  • The commit message follows our guidelines
  • Tests for the changes have been added (for bug fixes/features)
  • Docs have been added / updated (for bug fixes / features)

raise UnidentifiedImageError(
f"image url `{image_path}` is unreachable, perhaps due to timeout. "
f"Timeout threshold is set to {timeout} seconds."
f"\nConnection error type: `{e.__class__.__name__}`")
f"\nConnection error type: `{e.__class__.__name__}`"
f"\n{e}")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to remove f"\n{e}"

@@ -2,7 +2,10 @@
from marqo.tensor_search.enums import ModelProperties, InferenceParams
from marqo.tensor_search.models.private_models import ModelLocation, ModelAuth
import validators
import requests
# import requests
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove commented import

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant