Adding support for HTTP Basic Auth #566
base: main
Conversation
You can use the `--httpBasicAuth` flag to pass colon-separated username and password credentials to the browser used for crawling. For some reason I couldn't get `page.authenticate()` to work as advertised in the Puppeteer docs. I could get it to work outside of browsertrix-crawler's browser, so maybe there's some interference from another setting? But setting the Authorization header directly worked nicely. Thanks for the tip @vnznznz! Fixes webrecorder#168
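For context, the header-based approach described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; the function name is hypothetical.

```javascript
// Hedged sketch: build the HTTP Basic auth "Authorization" header value
// from the colon-separated credentials passed via --httpBasicAuth.
// (Illustrative only; basicAuthHeader is not a name from the PR itself.)
function basicAuthHeader(credentials) {
  // credentials is the raw "user:pass" string
  return "Basic " + Buffer.from(credentials).toString("base64");
}

// A header built this way could then be attached to every request, e.g.
// with Puppeteer's page.setExtraHTTPHeaders():
//   await page.setExtraHTTPHeaders({ Authorization: basicAuthHeader("foo:bar") });
console.log(basicAuthHeader("foo:bar")); // "Basic Zm9vOmJhcg=="
```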
Hm, it's probably because we are also overriding Fetch.requestPaused, but not Fetch.authRequired, for recording, though they're two separate events there.
I wonder if the auth should be associated with a particular seed, and then only sent with requests to the seed domain. That'd be a bit more work, though, and a slight refactoring of seeds... I suppose you would probably never have two seeds with two different HTTP auth passwords?
I was wondering about that too. I guess it raises the question of how it would show up in the Browsertrix user interface. I don't think it would be easy to have URL-specific auth from the command line, but it would certainly work from YAML configuration. Maybe having the blunt command-line version AND the ability to fine-tune in the YAML config would be good?

```yaml
seeds:
  - url: https://webrecorder.net/
    depth: 1
    scopeType: "prefix"
    httpBasicAuth: "alice:abc123"
```
tests/http_basicauth.test.js (outdated)

```javascript
test("test that http basic auth works", async () => {
  child_process.execSync(
    'docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url http://httpbin.org/basic-auth/foo/bar --httpBasicAuth foo:bar --collection basicauth-test --statsFilename stats.json',
```
httpbin.org doesn't have a responding maintainer at the moment. I think it's better for test stability to boot https://github.com/psf/httpbin with Docker instead of relying on external hosting.
Agreed, but it might be made complicated by psf/httpbin#27
Yeah, I noticed the Docker image isn't currently working.
I was going to suggest https://github.com/mccutchen/go-httpbin (I'm not sure if it supports HTTP auth), but I wasn't entirely sure how to slot running the container into the current test suite.
Couldn't find support for HTTP auth in go-httpbin. Another option would be to boot a web server inside the test suite, like here: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/replayserver.ts#L28-L64
I pushed up a revised test that starts a web server and then ensures browsertrix-crawler can access it. Unfortunately it doesn't seem to work as written (even when I'm not yet checking username/password).
The test fails and the crawler log reports:

```json
{
  "timestamp": "2024-05-16T18:28:24.497Z",
  "logLevel": "error",
  "context": "general",
  "message": "Page Load Timeout, skipping page",
  "details": {
    "msg": "net::ERR_CONNECTION_REFUSED at http://host.docker.internal:9998/",
    "page": "http://host.docker.internal:9998/",
    "workerid": 0
  }
}
```

But I can see, when running the server and the Docker container independently (outside the test itself), that things work OK with http://host.docker.internal:9998. So I'm a bit stumped about what I'm doing wrong here. Maybe it's some kind of async/sync problem?
@edsu I think beforeAll expects a promise to be returned, so it might be a timing/async issue as you suspect. It might be easier to just start the server in-line before running the crawler and close it after it's done, all within the test?