Expose cookiejars #1878
Comments
Getting and setting cookies in Scrapy is a real pain, so a big 👍 from me. In the project I'm working on now we use the following solution, which sets the "jars" from the cookie middleware on the spider so that the spider can use them:

from scrapy import Spider, signals
from scrapy.downloadermiddlewares import cookies

class CustomCookiesMiddleware(cookies.CookiesMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        o = super(CustomCookiesMiddleware, cls).from_crawler(crawler)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.enabled = getattr(spider, 'cookies_enabled', self.enabled)
        # expose the middleware's jars on the spider
        spider._cookiejars = self.jars

class BaseSpider(Spider):

    def get_cookie(self, name, cookiejar=None):
        if cookiejar not in self._cookiejars:
            raise KeyError(u'cookiejar {} does not exist'.format(cookiejar))
        _dict = {c.name: c.value for c in self._cookiejars[cookiejar]}
        return _dict.get(name)

But this is just for getting cookies; we don't have anything for setting them, and we should definitely add something. The last time I had to replace a cookie value I had to write ugly code like this:

locale_cookie = self._cookiejars[None]._cookies[".xbox.com"]["/"].get("defCulture")
locale_cookie.value = self.locale
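A less fragile way to replace a cookie value than reaching into `_cookies` might look like the sketch below. It assumes Python 3, where `cookielib` is named `http.cookiejar`, and relies only on the fact that iterating a `CookieJar` yields the stored `Cookie` objects. `find_cookie` and `set_cookie_value` are hypothetical helper names, not part of Scrapy.

```python
from http.cookiejar import Cookie, CookieJar  # Python 3 name for cookielib


def find_cookie(jar, name, domain=None, path=None):
    """Return the first cookie matching name (and domain/path if given), else None."""
    for cookie in jar:  # iterating a CookieJar yields the stored Cookie objects
        if cookie.name != name:
            continue
        if domain is not None and cookie.domain != domain:
            continue
        if path is not None and cookie.path != path:
            continue
        return cookie
    return None


def set_cookie_value(jar, name, value, domain=None, path=None):
    """Overwrite the value of an existing cookie; raise KeyError if it is absent."""
    cookie = find_cookie(jar, name, domain, path)
    if cookie is None:
        raise KeyError("cookie {!r} not found".format(name))
    cookie.value = value  # mutates the object held by the jar
    return cookie


# Demo with a plain stdlib jar standing in for the spider's jar.
jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name="defCulture", value="en-US", port=None, port_specified=False,
    domain=".xbox.com", domain_specified=True, domain_initial_dot=True,
    path="/", path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}, rfc2109=False))
set_cookie_value(jar, "defCulture", "pl-PL", domain=".xbox.com")
```

The helpers only use public iteration, so they should keep working if the jar's private layout changes.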
One other difficulty that arises when you work with Scrapy cookies is that we use the Cookie object from cookielib, which is incompatible with the cookie objects from the Cookie module. So if you want to create and add a cookie, you CANNOT use the nice and easy SimpleCookie object; you have to use the Cookie object from cookielib. Using SimpleCookie ends like this:

# coding: utf-8
from cookielib import CookieJar # this is what we use in Scrapy
from Cookie import SimpleCookie
jar = CookieJar()
c = SimpleCookie()
c["name"] = "foo"
c["name"]["domain"] = ".github.com"
c["name"]["path"] = "/"
c.output() # 'Set-Cookie: name=foo; Domain=.github.com; Path=/'
jar.set_cookie(c)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-11-ad88adc0c10c> in <module>()
----> 1 jar.set_cookie(c)
/opt/python2.7/lib/python2.7/cookielib.pyc in set_cookie(self, cookie)
1641 self._cookies_lock.acquire()
1642 try:
-> 1643 if cookie.domain not in c: c[cookie.domain] = {}
1644 c2 = c[cookie.domain]
1645 if cookie.path not in c2: c2[cookie.path] = {}
AttributeError: 'SimpleCookie' object has no attribute 'domain'

This means that if you want to set a cookie on a Scrapy cookiejar you have to use cookielib.Cookie, and this object is definitely not made for humans. For example, here's how you create a Cookie from cookielib: every argument except rfc2109 is required, and init will fail if a value is not provided. There are no defaults for the rest, even though some values are clearly static and don't change much (e.g. comment_url=None).

from cookielib import Cookie # this is what we use in Scrapy
c = Cookie(version=0, name='name', value='value', port=None, port_specified=False,
domain='.github.com',
domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False,
expires=1511172829, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)

IMO it would be nice to be able to use SimpleCookie in Scrapy; it would simplify things.
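One way to bridge the two cookie types is a small adapter that fills in the static arguments. The sketch below assumes Python 3 (`http.cookies` and `http.cookiejar` instead of `Cookie` and `cookielib`). `make_cookie` and `cookies_from_simple` are hypothetical helpers, not Scrapy API. Only the `max-age` attribute is turned into an expiry time; an `expires` string on the morsel is ignored in this sketch.

```python
import time
from http.cookiejar import Cookie, CookieJar
from http.cookies import SimpleCookie


def make_cookie(name, value, domain, path="/", secure=False, expires=None):
    """Build an http.cookiejar.Cookie, defaulting the rarely changed arguments."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=bool(domain),
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=bool(path),
        secure=secure, expires=expires,
        discard=expires is None,  # no expiry means a session cookie
        comment=None, comment_url=None, rest={}, rfc2109=False,
    )


def cookies_from_simple(simple_cookie):
    """Convert every morsel of a SimpleCookie into an http.cookiejar.Cookie."""
    converted = []
    for morsel in simple_cookie.values():
        max_age = morsel["max-age"]  # empty string when not set
        expires = int(time.time()) + int(max_age) if max_age else None
        converted.append(make_cookie(
            morsel.key, morsel.value,
            domain=morsel["domain"],
            path=morsel["path"] or "/",
            secure=bool(morsel["secure"]),
            expires=expires,
        ))
    return converted


# Demo: the SimpleCookie from the failing example can now be stored in a jar.
simple = SimpleCookie()
simple["name"] = "foo"
simple["name"]["domain"] = ".github.com"
simple["name"]["path"] = "/"

demo_jar = CookieJar()
for converted_cookie in cookies_from_simple(simple):
    demo_jar.set_cookie(converted_cookie)
```

Keeping the conversion in one function means callers can keep building cookies with SimpleCookie and never touch the long constructor.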
How should this cookiejar API be designed, @kmike? Should it be part of Scrapy or perhaps some external library? I imagine that an external library could simply subclass the cookie middleware and add some useful functions and utilities, e.g. for setting and getting cookies, or maybe even for persisting cookies across spider runs (something that is currently not supported but could be very useful). Reading about some bot detection systems, e.g. here, they seem to appreciate clients that have long-living cookies, so persisting some cookies could be useful in dealing with them. One problem here is communication between the cookie middleware and the spider. Cookiejars are stored as an attribute of the middleware, so if we want to expose them they would probably have to be an attribute of the spider. Are there any problems with linking the middleware "jars" to the spider? For example, we could add a spider_opened handler to the middleware that sets the middleware "jars" on the spider instance, then add methods for getting and setting cookies to the middleware and make them available from the spider.
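To make the idea of persisting cookies across runs concrete, here is a minimal sketch that uses only the Python 3 standard library (`http.cookiejar.LWPCookieJar`). It is not something Scrapy provides; `save_cookies` and `load_cookies` are hypothetical helpers, and the file name is an arbitrary choice for the demo.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar


def save_cookies(cookies, path):
    """Write an iterable of Cookie objects to `path` in LWP text format."""
    lwp_jar = LWPCookieJar()
    for cookie in cookies:
        lwp_jar.set_cookie(cookie)
    # ignore_discard=True also keeps session cookies, which are normally dropped
    lwp_jar.save(path, ignore_discard=True)


def load_cookies(path):
    """Return a jar with the cookies stored at `path` (empty if the file is missing)."""
    lwp_jar = LWPCookieJar()
    if os.path.exists(path):
        # already expired cookies are skipped by default when loading
        lwp_jar.load(path, ignore_discard=True)
    return lwp_jar


# Demo: one long-lived cookie survives a save/load round trip.
long_lived = Cookie(
    version=0, name="session", value="abc123", port=None, port_specified=False,
    domain=".github.com", domain_specified=True, domain_initial_dot=True,
    path="/", path_specified=True, secure=False,
    expires=int(time.time()) + 3600, discard=False,
    comment=None, comment_url=None, rest={}, rfc2109=False)

cookie_file = os.path.join(tempfile.mkdtemp(), "cookies.lwp")
save_cookies([long_lived], cookie_file)
restored = load_cookies(cookie_file)
```

A middleware subclass could call helpers like these when the spider opens and closes, but where exactly to hook them in is the open design question raised above.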
Having cookie management built in makes more sense to me. Of course, nothing prevents creating a separate library for that (well, maybe #1877 can be a problem), but I'd prefer having good cookie management in Scrapy itself. This is a basic task that everyone needs to solve. In scrapy-splash I implemented another cookie middleware; it exposes current cookiejar as
Having cookiejars on the spider makes sense; it also makes sense to store the current cookiejar in the spider state so that it is preserved when the on-disk request queue is restarted (how is that handled now?). Or maybe there is another clever API trick which can make working with cookies even more convenient, I don't know :)
Please can anybody help me: I created a subclass of Request to log in, because I have more than one login to parse the page.
My problem is the async part: I need to wait for zero.dologin().
I found this in the unit tests: https://github.com/scrapy/scrapy/blob/master/scrapy/utils/defer.py
Also, it would be great to be able to have settings like CONCURRENT_REQUESTS and DOWNLOAD_DELAY enforced per cookiejar.
@gdomod we use GitHub Issues to discuss the development of Scrapy; please use community channels like Stack Overflow or the mailing list to ask for help on how to use it.
+1 to exposing cookiejars. I need it now for a new project, and intend to do the same as @pawelmhm mentioned (a custom middleware adding a spider attribute that references the jars object).
#3563 (comment) has yet another syntax proposal (haven't thought about it in depth though).
Scrapy cookiejar API is limited:
the Request.meta key is named cookiejar, but you can't put a CookieJar object there; in fact it means cookiejar_id or session_id, not cookiejar; this is confusing. It should have been called session_id IMHO.

I think we should provide a better API for 'sessions'. It should allow to
Currently I'm using an ugly hack to access cookies:
I don't have a concrete API proposal, but it should likely use the word 'session' :)
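The naming complaint about the cookiejar meta key can be illustrated with a small sketch: the value is only a lookup key, and the middleware maps it to a jar (the `self.jars` mapping shown in the first comment, where the default key is None). The `SessionJars` class below is a hypothetical stand-in written against the Python 3 standard library, not Scrapy's actual middleware.

```python
from collections import defaultdict
from http.cookiejar import CookieJar


class SessionJars:
    """Map a session id (the value stored under meta['cookiejar']) to a jar."""

    def __init__(self):
        # a new, empty jar is created the first time an id is seen
        self._jars = defaultdict(CookieJar)

    def jar_for(self, meta):
        # a request without the key falls back to the shared default jar (None)
        return self._jars[meta.get("cookiejar")]


sessions = SessionJars()
user_one = sessions.jar_for({"cookiejar": "user-1"})
user_two = sessions.jar_for({"cookiejar": "user-2"})
default_jar = sessions.jar_for({})
```

Seen this way, a name such as session_id would describe the meta value more accurately than cookiejar.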