
Adding support for HTTP Basic Auth #566

Open · wants to merge 3 commits into main
Conversation

edsu (Contributor) commented May 7, 2024

You can use the --httpBasicAuth flag to pass colon-separated username and password credentials to the browser used for crawling.

For some reason I couldn't get page.authenticate() to work as advertised in the Puppeteer docs. I could get it to work outside of browsertrix-crawler's browser, so maybe there's some interference from another setting? But setting the Authorization header directly worked nicely. Thanks for the tip @vnznznz!

Fixes #168
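The approach described above amounts to attaching a precomputed Basic Authorization header to requests. A minimal sketch of that idea (not the PR's actual code; the helper name and usage are hypothetical):

```javascript
// Sketch only: encode "user:pass" credentials as the value of an HTTP
// Basic Authorization header, as the --httpBasicAuth flag would need to.
function basicAuthHeader(credentials) {
  // Buffer handles UTF-8 credentials safely in Node.
  return "Basic " + Buffer.from(credentials).toString("base64");
}

// Hypothetical usage with Puppeteer's extra-headers API:
// await page.setExtraHTTPHeaders({ Authorization: basicAuthHeader("foo:bar") });
```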

ikreymer (Member) commented May 7, 2024

Hm, it's probably because we also override Fetch.requestPaused for recording but not Fetch.authRequired, even though they're two separate events there.
The downside of this approach is that it always sends the auth header with every request. Is that potentially a security risk?
I think it will probably be sent to third-party URLs as well.
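One way to limit the exposure raised here would be to gate the header on the request host matching the seed host. A sketch under that assumption (hypothetical helper, not part of the PR):

```javascript
// Sketch: only attach credentials when the request targets the same host
// (including port) as the seed, to avoid leaking them to third parties.
function shouldSendAuth(requestUrl, seedUrl) {
  try {
    return new URL(requestUrl).host === new URL(seedUrl).host;
  } catch {
    return false; // malformed URL: never attach credentials
  }
}
```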

ikreymer (Member) commented May 7, 2024

I wonder if the auth should be associated with a particular seed, and then only sent with requests to the seed's domain. That'd be a bit more work, though, and require slight refactoring of seeds... I suppose you would probably never have two seeds with two different HTTP auth passwords?

edsu (Contributor, Author) commented May 7, 2024

I was wondering about that too. I guess it raises the question of how it would show up in the Browsertrix user interface?

I don't think URL-specific auth would be easy from the command line, but it would certainly work from the YAML configuration.

Maybe having the blunt command-line version AND the ability to fine-tune in the YAML config would be good?

seeds:
  - url: https://webrecorder.net/
    depth: 1
    scopeType: "prefix"
    httpBasicAuth: "alice:abc123"
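If per-seed credentials like the YAML above were supported, the value would need to be split on the first colon only, since passwords can themselves contain colons. A hypothetical sketch (names are illustrative, not the crawler's API):

```javascript
// Sketch: split "user:pass" on the first colon so that passwords
// containing colons survive intact.
function parseBasicAuth(value) {
  const i = value.indexOf(":");
  if (i === -1) {
    throw new Error("httpBasicAuth expects username:password");
  }
  return { username: value.slice(0, i), password: value.slice(i + 1) };
}
```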


test("test that http basic auth works", async () => {
  child_process.execSync(
    'docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url http://httpbin.org/basic-auth/foo/bar --httpBasicAuth foo:bar --collection basicauth-test --statsFilename stats.json',
  );
});
A Contributor commented:

httpbin.org doesn't have a responding maintainer at the moment. For test stability, I think it's better to boot https://github.com/psf/httpbin with Docker instead of relying on external hosting.

A Contributor replied:

Agreed, but that might be complicated by psf/httpbin#27.

edsu (Contributor, Author) replied:

Yeah, I noticed the Docker image isn't currently working.

I was going to suggest https://github.com/mccutchen/go-httpbin (I'm not sure whether it supports HTTP auth), but I wasn't entirely sure how to slot running the container into the current test suite.

A Contributor replied:

I couldn't find support for HTTP auth in go-httpbin. Another option would be to boot a web server inside the test suite, like here: https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/replayserver.ts#L28-L64

edsu (Contributor, Author) commented May 16, 2024

I pushed up a revised test that starts a web server and then checks that browsertrix-crawler can access it. Unfortunately it doesn't seem to work as written (even before I check the username/password).

The test fails and the crawler log reports:

{
  "timestamp": "2024-05-16T18:28:24.497Z",
  "logLevel": "error",
  "context": "general",
  "message": "Page Load Timeout, skipping page",
  "details": {
    "msg": "net::ERR_CONNECTION_REFUSED at http://host.docker.internal:9998/",
    "page": "http://host.docker.internal:9998/",
    "workerid": 0
  }
}

But when I run the server and the Docker container independently (outside the test itself), things work fine with http://host.docker.internal:9998. So I'm a bit stumped about what I'm doing wrong here. Maybe it's some kind of async/sync problem?

A Contributor replied:

@edsu I think beforeAll expects a promise to be returned, so it might be a timing/async issue as you suspect. It might be easier to just start the server inline before running the crawler and close it after it's done, all within the test?
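The timing fix suggested here boils down to not letting any test run until the server is actually listening, e.g. via a promise-returning wrapper (hypothetical helper, not code from the PR):

```javascript
// Sketch: resolve only once the server is accepting connections, so a
// Jest beforeAll that returns this promise genuinely waits.
function listenAsync(server, port = 0) {
  return new Promise((resolve, reject) => {
    server.once("error", reject);
    server.listen(port, () => resolve(server.address().port));
  });
}

// Hypothetical usage in the test file:
// beforeAll(() => listenAsync(httpServer, 9998));
// afterAll(() => new Promise((resolve) => httpServer.close(resolve)));
```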

tests/http_basicauth.test.js (review thread resolved)
Successfully merging this pull request may close: HTTP Basic Auth profile?
4 participants