
Streaming requests read full response contents before initial cache write #878

Open
tgrandje opened this issue Sep 19, 2023 · 3 comments

@tgrandje

Problem with streaming queries

Requests do not seem to stream properly: the whole response appears to be read by the CachedSession before the handle is released (the example in the docs doesn't show this, since it completes very quickly).

Expected behavior

Chunks should be available for iteration as soon as they are received.

Steps to reproduce the behavior

import requests
import requests_cache

url = "https://wxs.ign.fr/859x8t863h6a09o9o6fy4v60/telechargement/inspire/BDTOPO-FRANCE-ADMIN-SSDOUBLON-PACK_17.1$BDTOPO_2-2_ADMINISTRATIF_SHP_WGS84G_FRA_2017-01-01/file/BDTOPO_2-2_ADMINISTRATIF_SHP_WGS84G_FRA_2017-01-01.7z"

s = requests.Session()
r = s.get(url, stream=True)
print(r.status_code)
for chunk in r.iter_content(chunk_size=1024):
    print(chunk[:10])
    break  # breaks immediately, as expected

requests_cache.install_cache()
r = s.get(url, stream=True)
print(r.status_code)
for chunk in r.iter_content(chunk_size=1024):
    print(chunk[:10])
    break  # still breaks immediately

# Let's use a new cache
s = requests_cache.CachedSession("dummy")
r = s.get(url, stream=True)
print(r.status_code)  # not printed until the entire response has been downloaded
for chunk in r.iter_content(chunk_size=1024):
    print(chunk[:10])
    break

Workarounds

Using install_cache() seems to be enough to circumvent this behaviour. I checked the behaviour of requests-cache and can confirm that the stream argument is correctly passed on to requests.

Environment

  • requests-cache version: 1.1.0
  • Python version: 3.9
  • Platform: Windows 10 pro
@JWCook
Member

JWCook commented Oct 5, 2023

You're correct: the current level of support for streaming requests only ensures that the stream can be played back correctly when returned from the cache. In other words, the chunking behavior of the underlying file-like object used by urllib3 is the same for original and cached responses. The initial request is expected to be slower, though, since the entire response contents must be read and cached before being returned to the user. I'm not able to reproduce any difference in behavior with install_cache(), however.
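
For illustration, here is a minimal sketch of that behavior (the URL is just a placeholder pointing at a small payload, not the one from the original report):

import requests_cache

session = requests_cache.CachedSession("demo_cache")
url = "https://httpbin.org/bytes/1024"  # placeholder: any small binary payload

# First request: the full body is read and written to the cache before the
# response is returned, so nothing is printed until the download finishes.
r1 = session.get(url, stream=True)
print(r1.from_cache, r1.status_code)

# Second request: served from the cache; iter_content() replays the same
# chunks as the original response.
r2 = session.get(url, stream=True)
print(r2.from_cache, r2.status_code)
for chunk in r2.iter_content(chunk_size=256):
    print(len(chunk))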

I definitely agree that it would be an improvement for large requests like this if we could cache a streaming response only after it reaches the end of the stream. In general, this library isn't optimized for file downloads and other large requests, but it is on my radar (#407). There are a few different ways to approach this, but I can't think of a particularly clean solution right now, so I'll need to give it some more thought. In the meantime, I'll try to at least come up with a workaround you can use.
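
One possible interim workaround (only a sketch, assuming your version exposes CachedSession.cache_disabled(); the URL is a placeholder) is to bypass the cache for large downloads so they stream normally, while other requests stay cached:

import requests_cache

session = requests_cache.CachedSession("dummy")
large_file_url = "https://example.com/large-file.7z"  # placeholder

# Temporarily bypass the cache so the response streams immediately instead
# of being read in full and cached first.
with session.cache_disabled():
    r = session.get(large_file_url, stream=True)
    print(r.status_code)  # available as soon as the headers arrive
    for chunk in r.iter_content(chunk_size=1024):
        print(chunk[:10])
        break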

@JWCook changed the title from "Problem with streaming queries" to "Streaming requests read full response contents before initial cache write" on Oct 23, 2023
@JWCook
Member

JWCook commented Nov 24, 2023

Thanks for the example. I'm guessing Content-Length isn't being set correctly. I'll look into it!

@JWCook
Member

JWCook commented Nov 24, 2023

I don't think that's related, though. Could you create a separate issue for that, please?
