Discrepancy between playback and actual headers when dealing with binary data #194

Open
kfcaio opened this issue Oct 21, 2021 · 7 comments

@kfcaio

kfcaio commented Oct 21, 2021

@sigmavirus24 I wrote a test for a function that downloads a large zip file using the requests module. I found a discrepancy between the Content-Length header and the actual body length when running the test with Betamax versus without it. With Betamax, the extracted binary string is much larger (288055 bytes instead of 159406). I also need to pass that binary string to BytesIO and then to zipfile.ZipFile, but doing so raises zipfile.BadZipFile: Bad magic number for central directory.

My test setup:

import os

import betamax
from betamax.fixtures import unittest


# Fall back to Betamax's default record mode ('once') when the variable is unset.
mode = os.getenv('BETAMAX_RECORD_MODE', 'once')
with betamax.Betamax.configure() as config:
    config.cassette_library_dir = 'tests/test_funcs/cassettes'
    config.default_cassette_options['record_mode'] = mode
    print(f'Using record mode <{mode}>')


def the_function(session):
    # session = requests.Session()
    from io import BytesIO
    from zipfile import ZipFile

    response = session.get("https://ww2.stj.jus.br/docs_internet/processo/dje/xml/stj_dje_20211011_xml.zip")

    zip_in_memory = BytesIO(response.content)

    try:
        my_zip = ZipFile(zip_in_memory, 'r')
        # testzip() returns the name of the first corrupt member, or None.
        result = my_zip.testzip() is None
    except Exception:
        # Includes zipfile.BadZipFile raised while opening the archive.
        result = False

    return result


class BaseTest(unittest.BetamaxTestCase):
    custom_headers = None
    custom_proxies = None
    _path_to_ignore = None
    _no_generator_return_search = False

    def setUp(self):
        super(BaseTest, self).setUp()
        if self.custom_headers:
            self.session.headers.update(self.custom_headers)
        if self.custom_proxies:
            self.session.proxies.update(self.custom_proxies)
        self.worker_under_test = self.worker_class()
        self.worker_under_test._session = self.session

    def test_search(self):
        result = the_function(self.session)
        assert result

I pass self.session to the function under test and use it to request an endpoint. From that endpoint I get the zip file as a bytes string (response.content). The test runs without errors if I don't use the Betamax session.

Test

Session headers

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response headers

{'Accept-Ranges': 'bytes', 'ETag': 'W/"159406-1633990217000"', 'Last-Modified': 'Mon, 11 Oct 2021 22:10:17 GMT', 'Content-Type': 'application/zip', 'Content-Length': '159406', 'Date': 'Thu, 21 Oct 2021 14:37:27 GMT', 'Set-Cookie': 'BIGipServerpool_wserv=973081866.20480.0000; path=/; Httponly, TS01dc523b=016a5b383346ca02628a7c1dd47ef26e8cadf4a1b22fa9261c6b9ac1de8ac5665e99bd4a42c5b1d0af72b97105f57020b5e0f78fa7452df6080bf5ea3ee7a85d2de98968a2; Path=/; Domain=.www.stj.jus.br', 'Strict-Transport-Security': 'max-age=604800; includeSubDomains', 'Content-Security-Policy': "upgrade-insecure-requests; frame-ancestors 'self' https://*.stj.jus.br https://*.web.stj.jus.br https://stjjus.sharepoint.com/"}

Actual content length

len(response.content) == 288055

Script execution

Session headers

{'User-Agent': 'python-requests/2.25.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Response headers

{'Accept-Ranges': 'bytes', 'ETag': 'W/"159406-1633990217000"', 'Last-Modified': 'Mon, 11 Oct 2021 22:10:17 GMT', 'Content-Type': 'application/zip', 'Content-Length': '159406', 'Date': 'Thu, 21 Oct 2021 14:39:24 GMT', 'Set-Cookie': 'BIGipServerpool_wserv=973081866.20480.0000; path=/; Httponly, TS01dc523b=016a5b3833746a54a2d1276a2b3de87f48f672e9cd7c18c4dad842ddddeac244bcbcf1a470b59eecf83bd6a3bdeffc7c7017210981de929d01df6c054118625399d2b04ad2; Path=/; Domain=.www.stj.jus.br', 'Strict-Transport-Security': 'max-age=604800; includeSubDomains', 'Content-Security-Policy': "upgrade-insecure-requests; frame-ancestors 'self' https://*.stj.jus.br https://*.web.stj.jus.br https://stjjus.sharepoint.com/"}

Actual content length

len(response.content) == 159406
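
Both runs report the same Content-Length header but different body sizes. Since neither set of headers includes Content-Encoding, the declared and actual lengths would be expected to match, so a small check along these lines could flag the corruption before the zip parser does. This is only a sketch: the helper name body_length_mismatch is hypothetical, and it takes a plain dict of headers rather than a requests response.

```python
def body_length_mismatch(headers, body):
    """Return (declared, actual) when the lengths disagree, otherwise None.

    `headers` is a plain dict here, so key lookups are case-sensitive.
    """
    if headers.get("Content-Encoding"):
        # A compressed transfer makes the two lengths incomparable.
        return None
    declared = headers.get("Content-Length")
    if declared is None:
        return None
    declared = int(declared)
    actual = len(body)
    return None if declared == actual else (declared, actual)


headers = {"Content-Type": "application/zip", "Content-Length": "159406"}
print(body_length_mismatch(headers, b"x" * 159406))  # None
print(body_length_mismatch(headers, b"x" * 288055))  # (159406, 288055)
```

The two calls use the lengths reported above for the plain script run and the Betamax run respectively.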

I'm using Python 3.8.2, Betamax 0.8.1, Requests 2.25.1, and Pytest 5.4.1 to run the tests.

Related question: https://stackoverflow.com/questions/69653406/how-to-mock-a-function-that-downloads-a-large-binary-content-using-betamax

Related issue: #122
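
One explanation consistent with the numbers in this report is that the binary body was stored as text and encoded again on playback. That is an assumption about the mechanism, not something confirmed in this thread. The standard-library sketch below shows the effect in isolation: decoding zip bytes as UTF-8 with replacement characters and encoding them again makes the data longer and breaks the archive, while a base64 round trip returns identical bytes. All function names are illustrative.

```python
import base64
import io
import zipfile


def make_zip_bytes():
    """Build a small zip archive in memory whose payload is not valid text."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("payload.bin", bytes(range(256)) * 64)
    return buffer.getvalue()


def is_valid_zip(data):
    """Return True if data opens as a zip and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data), "r") as archive:
            return archive.testzip() is None
    except Exception:
        # zipfile.BadZipFile is the usual failure; corrupt input can raise others.
        return False


def text_round_trip(data, encoding="utf-8"):
    """Store bytes as text, replacing undecodable bytes, then encode them again."""
    return data.decode(encoding, "replace").encode(encoding)


def base64_round_trip(data):
    """Store bytes as base64 text, which decodes back to the identical bytes."""
    return base64.b64decode(base64.b64encode(data).decode("ascii"))


original = make_zip_bytes()
mangled = text_round_trip(original)
preserved = base64_round_trip(original)

print(len(mangled) > len(original))  # True: each replaced byte becomes three
print(is_valid_zip(original))        # True
print(is_valid_zip(mangled))         # False
print(preserved == original)         # True
```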

@sigmavirus24
Collaborator

Can you try setting preserve_exact_body_bytes=True in your config? See https://betamax.readthedocs.io/en/latest/api.html?highlight=bytes#forcing-bytes-to-be-preserved. I wonder if we need a heuristic around Content-Type: application/zip.
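
Applied to the test setup from the report, the suggestion might look like the sketch below. It assumes the configuration attribute described in the linked documentation section. A cassette recorded before the change would presumably need to be deleted and recorded again, because the body stored in it has already been altered.

```python
import betamax

with betamax.Betamax.configure() as config:
    config.cassette_library_dir = 'tests/test_funcs/cassettes'
    # Keep response bodies byte for byte so binary payloads such as
    # zip files are replayed unchanged.
    config.preserve_exact_body_bytes = True
```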

@kfcaio
Author

kfcaio commented Oct 22, 2021

Thank you for your quick response. It worked, but no HTTP interactions were recorded when using BETAMAX_RECORD_MODE=all:

{"http_interactions": [], "recorded_with": "betamax/0.8.1"}

Is that expected?

@sigmavirus24
Collaborator

No, but the all record mode is not generally advisable. Why are you using it?

@kfcaio
Author

kfcaio commented Oct 22, 2021

@sigmavirus24 My bad, I was creating a new session somewhere in my actual script, so the requests did not go through the Betamax session. It worked as expected, thank you! I think you can close this one.

@sigmavirus24
Collaborator

Would you like to open a PR adding a heuristic that automatically preserves the exact body bytes for that Content-Type? I think that is a reasonable feature request, and the PR should be small-ish in effort.

@kfcaio
Author

kfcaio commented Oct 22, 2021

Sure : )

@sigmavirus24
Collaborator

If it helps to get started,

if (preserve_exact_body_bytes or
        'gzip' in r.headers.get('Content-Encoding', '')):

is where I'm thinking we need a change. I suspect, however, that we want to keep that from becoming too complicated to read, so if you want to make the condition a separate function I'm 👍 on that.
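
A minimal sketch of what such a separate function could look like. The name should_preserve_exact_bytes and the BINARY_CONTENT_TYPES tuple are hypothetical and not part of Betamax. The first two conditions come from the snippet above, and application/zip is the content type discussed in this thread.

```python
# Hypothetical list of content types whose bodies should never be treated as text.
BINARY_CONTENT_TYPES = ('application/zip',)


def should_preserve_exact_bytes(headers, preserve_exact_body_bytes):
    """Decide whether a response body should be stored byte for byte."""
    if preserve_exact_body_bytes:
        return True
    if 'gzip' in headers.get('Content-Encoding', ''):
        return True
    # Drop parameters such as "; charset=..." and compare case-insensitively.
    media_type = headers.get('Content-Type', '').split(';')[0].strip().lower()
    return media_type in BINARY_CONTENT_TYPES
```

The original condition could then shrink to a single call, for example if should_preserve_exact_bytes(r.headers, preserve_exact_body_bytes), which keeps the calling code readable as more content types are added.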
