
Adding support for HTTP Basic Auth #566

Open · wants to merge 3 commits into main
Conversation

edsu (Contributor) commented May 7, 2024

You can use the --httpBasicAuth flag to pass colon-separated username and password credentials to the browser used for crawling.

For some reason I couldn't get page.authenticate() to work as advertised in the Puppeteer docs. I could get it to work outside of browsertrix-crawler's browser, so maybe there's some interference from another setting? But setting the Authorization header directly worked nicely. Thanks for the tip @vnznznz!

Fixes #168
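The approach described above amounts to attaching a precomputed Basic Authorization header to requests. A minimal sketch of that idea (not the PR's actual code; the helper name and usage are hypothetical):

```javascript
// Sketch only: encode "user:pass" credentials as the value of an HTTP
// Basic Authorization header, as the --httpBasicAuth flag would need to.
function basicAuthHeader(credentials) {
  // Buffer handles UTF-8 credentials safely in Node.
  return "Basic " + Buffer.from(credentials).toString("base64");
}

// Hypothetical usage with Puppeteer's extra-headers API:
// await page.setExtraHTTPHeaders({ Authorization: basicAuthHeader("foo:bar") });
```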

ikreymer (Member) commented May 7, 2024

Hm, it's probably because we also override Fetch.requestPaused for recording but not Fetch.authRequired, even though they're two separate events there.
The downside of this approach is that it always sends the auth header with every request. Is that potentially a security risk?
I think it will probably be sent to third-party URLs as well.
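One way to limit the exposure raised here would be to gate the header on the request host matching the seed host. A sketch under that assumption (hypothetical helper, not part of the PR):

```javascript
// Sketch: only attach credentials when the request targets the same host
// (including port) as the seed, to avoid leaking them to third parties.
function shouldSendAuth(requestUrl, seedUrl) {
  try {
    return new URL(requestUrl).host === new URL(seedUrl).host;
  } catch {
    return false; // malformed URL: never attach credentials
  }
}
```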

ikreymer (Member) commented May 7, 2024

I wonder if the auth should be associated with a particular seed, and then only sent with requests to the seed's domain. That'd be a bit more work, though, and require slight refactoring of seeds... I suppose you would probably never have two seeds with two different HTTP auth passwords?

edsu (Contributor, Author) commented May 7, 2024

I was wondering about that too. I guess it raises the question of how it would show up in the Browsertrix user interface?

I don't think URL-specific auth would be easy from the command line, but it would certainly work from the YAML configuration.

Maybe having the blunt command-line version AND the ability to fine-tune in the YAML config would be good?

seeds:
  - url: https://webrecorder.net/
    depth: 1
    scopeType: "prefix"
    httpBasicAuth: "alice:abc123"
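If per-seed credentials like the YAML above were supported, the value would need to be split on the first colon only, since passwords can themselves contain colons. A hypothetical sketch (names are illustrative, not the crawler's API):

```javascript
// Sketch: split "user:pass" on the first colon so that passwords
// containing colons survive intact.
function parseBasicAuth(value) {
  const i = value.indexOf(":");
  if (i === -1) {
    throw new Error("httpBasicAuth expects username:password");
  }
  return { username: value.slice(0, i), password: value.slice(i + 1) };
}
```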


test("test that http basic auth works", async () => {
  child_process.execSync(
    'docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url http://httpbin.org/basic-auth/foo/bar --httpBasicAuth foo:bar --collection basicauth-test --statsFilename stats.json',
  );
});
A Contributor commented:

httpbin.org doesn't have a responding maintainer at the moment. For test stability, I think it's better to boot https://github.com/psf/httpbin with Docker instead of relying on external hosting.

A Contributor replied:

Agreed, but that might be complicated by psf/httpbin#27.

edsu (Contributor, Author) replied:

Yeah, I noticed the Docker image isn't currently working.

I was going to suggest https://github.com/mccutchen/go-httpbin (I'm not sure whether it supports HTTP auth), but I wasn't entirely sure how to slot running the container into the current test suite.

A Contributor replied:

I couldn't find support for HTTP auth in go-httpbin. Another option would be to boot a web server inside the test suite, like here: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/replayserver.ts#L28-L64

edsu (Contributor, Author) commented May 16, 2024

I pushed up a revised test that starts a web server and then checks that browsertrix-crawler can access it. Unfortunately it doesn't seem to work as written (even before I check the username/password).

The test fails and the crawler log reports:

{
  "timestamp": "2024-05-16T18:28:24.497Z",
  "logLevel": "error",
  "context": "general",
  "message": "Page Load Timeout, skipping page",
  "details": {
    "msg": "net::ERR_CONNECTION_REFUSED at http://host.docker.internal:9998/",
    "page": "http://host.docker.internal:9998/",
    "workerid": 0
  }
}

But when I run the server and the Docker container independently (outside the test itself), things work fine with http://host.docker.internal:9998. So I'm a bit stumped about what I'm doing wrong here. Maybe it's some kind of async/sync problem?

A Contributor replied:

@edsu I think beforeAll expects a promise to be returned, so it might be a timing/async issue as you suspect. It might be easier to just start the server inline before running the crawler and close it after it's done, all within the test?
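The timing fix suggested here boils down to not letting any test run until the server is actually listening, e.g. via a promise-returning wrapper (hypothetical helper, not code from the PR):

```javascript
// Sketch: resolve only once the server is accepting connections, so a
// Jest beforeAll that returns this promise genuinely waits.
function listenAsync(server, port = 0) {
  return new Promise((resolve, reject) => {
    server.once("error", reject);
    server.listen(port, () => resolve(server.address().port));
  });
}

// Hypothetical usage in the test file:
// beforeAll(() => listenAsync(httpServer, 9998));
// afterAll(() => new Promise((resolve) => httpServer.close(resolve)));
```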

tests/http_basicauth.test.js (review thread resolved)
Successfully merging this pull request may close: HTTP Basic Auth profile?
4 participants