Option to set Request User-Agent string #12

Open
mafrosis opened this issue Jul 31, 2016 · 6 comments

@mafrosis

I've been trialling riko and it seems great. I do have a small request however, that an option be added to change the User-Agent on outgoing requests. Some servers will block the default User-Agent: Python-urllib/3.5.
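
For context, a minimal sketch of how the default agent arises and how a single request can override it using only the standard library. The URL and the `Special-Agent` value are placeholders, and this is not riko's code:

```python
from urllib.request import Request, build_opener

# The default opener advertises itself as "Python-urllib/<major>.<minor>".
default_headers = dict(build_opener().addheaders)
default_ua = default_headers["User-agent"]

# A header attached to the Request takes precedence over the opener default.
# Building the Request does not touch the network.
request = Request("http://example.com", headers={"User-Agent": "Special-Agent"})

# urllib normalises header names with str.capitalize(), hence "User-agent".
custom_ua = request.get_header("User-agent")
```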

Alternatively, have you considered using urllib3 instead of the mess that's in Python core? In that case you can easily pass headers into the PoolManager constructor.

https://urllib3.readthedocs.io/en/latest/

Thanks for your work!

@reubano
Member

reubano commented Jul 31, 2016

Glad you are enjoying riko and thanks for the suggestion! I agree this is a useful feature, but I'm not sure when I will be able to work on it, since it will touch multiple files and require a bit of time to properly integrate into the entire project. If this is something you are willing to take a stab at, I can happily point you in the right direction :).

My initial thought is to add a ua key to the conf kwarg of the appropriate pipes. Then you could do, e.g., pipe(conf={'url': 'example.com', 'ua': 'Special-Agent'}). There would also need to be an option added to SyncPipe (plus the async versions of both).

IIRC, I don't think urllib3 can read local files (file://), only remote (http://).
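
For comparison, the standard library's `urlopen` does accept `file://` URLs, which is the behavior at stake here. A small offline sketch, where the file name and its contents are made up for illustration:

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feed.txt"
    path.write_text("hello from disk", encoding="utf-8")

    # Path.as_uri() yields a file:// URL, which urlopen can read directly.
    with urlopen(path.as_uri()) as response:
        local_content = response.read().decode("utf-8")
```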

reubano modified the milestone: 1.0.0-rc Jul 31, 2016
@mafrosis
Author

mafrosis commented Aug 1, 2016

Hi! I just took a look through the source code, and the part I'm not really clear on is what would need to change in SyncPipe. It seems the new "ua" field would just be passed down into each module via kwargs?

Also, which modules will want this feature? I was looking specifically at fetchpage, but I guess fetchdata and xpathfetchpage are obvious candidates.

@reubano
Member

reubano commented Aug 1, 2016

It seems the new "ua" field would just be passed down into each module via kwargs?

This would be true if ua were passed as

SyncPipe('fetch', conf={'url': 'example.com'}, ua='Special-Agent')

instead of

SyncPipe('fetch', conf={'url': 'example.com', 'ua': 'Special-Agent'})

The choice of whether ua should be in conf or kwargs essentially boils down to how you want to extract the value:
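
As a minimal sketch of that difference (the function names and the default value below are hypothetical, not riko's code), the two styles might read the value like this:

```python
DEFAULT_UA = "Python-urllib"  # placeholder default, not riko's actual value


def ua_from_conf(conf=None, **kwargs):
    """Hypothetical pipe that expects the agent inside ``conf``."""
    conf = conf or {}
    return conf.get("ua", DEFAULT_UA)


def ua_from_kwargs(conf=None, ua=DEFAULT_UA, **kwargs):
    """Hypothetical pipe that expects the agent as its own keyword."""
    return ua


in_conf = ua_from_conf(conf={"url": "example.com", "ua": "Special-Agent"})
as_kwarg = ua_from_kwargs(conf={"url": "example.com"}, ua="Special-Agent")
fallback = ua_from_conf(conf={"url": "example.com"})
```

With `conf`, the value travels with the rest of the pipe's configuration; as a keyword argument, it sits alongside the other options passed to the pipe.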

Also, which modules will want this feature?

I would say almost all of the source pipes, with the exceptions being itembuilders and input. The non-source pipe exchangerate is also a candidate. For the source pipes, we could intercept the parse_rss function call and pass the required kwargs to urlopen. There may be some edge cases as well, but a simple search for all uses of urlopen should suffice. Plus the async variant async_url_open. Plus py2 and py3 compatibility. Plus the appropriate unit tests.... Phew!
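
One way to keep that change small would be to route every `urlopen` call through a single helper. The sketch below is an assumption about how that could look, not riko's implementation; `open_with_ua` and its `opener` parameter are hypothetical names, and it covers Python 3 only:

```python
from urllib.request import Request, urlopen


def open_with_ua(url, ua=None, opener=urlopen, **kwargs):
    """Hypothetical helper: open ``url``, optionally with a custom User-Agent.

    ``opener`` is injectable so unit tests can run without network access.
    Extra keyword arguments (e.g. ``timeout``) are passed through to it.
    """
    headers = {"User-Agent": ua} if ua else {}
    return opener(Request(url, headers=headers), **kwargs)


def fake_opener(request, **kwargs):
    """Test double that hands the prepared Request back instead of fetching."""
    return request


captured = open_with_ua("http://example.com", ua="Special-Agent", opener=fake_opener)
untouched = open_with_ua("http://example.com", opener=fake_opener)
```

Injecting the opener also makes the unit tests mentioned above straightforward, since the prepared request can be inspected without any real traffic.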

I hope that didn't overwhelm you :)

@mafrosis
Author

Hey, I'm sorry, I really haven't had time to take on this work. I wish I did! It's looking more complicated than I can realistically tackle right now, so please close this issue if you are unlikely to implement it yourself.

Thanks!

@reubano
Member

reubano commented Aug 15, 2016

I'll keep it open since it's a valid request. Any area in particular causing difficulty?

@reubano
Member

reubano commented May 19, 2020

CR #45
