selenium.common.exceptions.TimeoutException ImmoScout24 #272

Open
flyingdodo11 opened this issue Dec 8, 2022 · 24 comments

Comments

@flyingdodo11

Hi guys,
I'm trying to set up flathunter for ImmoScout24. I have already used it successfully with ebay-kleinanzeigen and immowelt.

I have already checked all the other issues regarding this problem, such as Issue214; none of the solutions worked for me.

I also tried it on macOS and Ubuntu 20.04, with both the normal version and the Docker version.
I always get the same errors.

[2022/12/08 10:51:24|_common.py              |ERROR   ]: Giving up get_soup_from_url(...) after 3 tries (selenium.common.exceptions.TimeoutException: Message: 
Stacktrace:
#0 0x56378df8f2a3 <unknown>
#1 0x56378dd4df77 <unknown>
#2 0x56378dd8a80c <unknown>
#3 0x56378dd8aa71 <unknown>
#4 0x56378ddc4734 <unknown>
#5 0x56378ddaab5d <unknown>
#6 0x56378ddc247c <unknown>
#7 0x56378ddaa903 <unknown>
#8 0x56378dd7dece <unknown>
#9 0x56378dd7efde <unknown>
#10 0x56378dfdf63e <unknown>
#11 0x56378dfe2b79 <unknown>
#12 0x56378dfc589e <unknown>
#13 0x56378dfe3a83 <unknown>
#14 0x56378dfb8505 <unknown>
#15 0x56378e004ca8 <unknown>
#16 0x56378e004e36 <unknown>
#17 0x56378e020333 <unknown>
#18 0x7fdf58a1bea7 start_thread)
@codders

codders commented Dec 8, 2022

Hi @flyingdodo11 ,

How much RAM do you have available for your docker containers? I think the docker daemon is by default not very generous on Mac. You should have at least 1GB of memory to run the Immoscout crawler.
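
One rough way to check this from inside a Linux container is to read the cgroup memory limit. The following is only a sketch: the two file paths are the usual cgroup v2 and v1 locations, the 1 GiB threshold is my reading of "at least 1GB", and the function names are made up for illustration. A limit imposed on the Docker Desktop VM itself (as on Mac) may not show up here.

```python
from pathlib import Path
from typing import Iterable, Optional

# Usual cgroup locations inside a Linux container (v2 first, then v1).
CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)

# Assumption: "at least 1GB" from the comment above, taken as 1 GiB.
MIN_BYTES = 1024 ** 3


def read_memory_limit(paths: Iterable[str] = CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the container memory limit in bytes, or None if unlimited/unknown."""
    for path in paths:
        try:
            raw = Path(path).read_text().strip()
        except OSError:
            continue  # file missing: other cgroup version, or not Linux
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        try:
            return int(raw)
        except ValueError:
            continue  # unexpected content, try the next candidate file
    return None


def has_enough_memory(limit: Optional[int], minimum: int = MIN_BYTES) -> bool:
    """Treat "no limit found" as enough; the host itself may still be short on RAM."""
    return limit is None or limit >= minimum


if __name__ == "__main__":
    limit = read_memory_limit()
    print("memory limit:", "none detected" if limit is None else f"{limit} bytes")
    print("enough for the crawler:", has_enough_memory(limit))
```

Running this inside the flathunter container before starting a crawl gives a quick yes/no answer without needing access to the Docker host.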

@marcelmindemann

I am running Flathunter in docker on Linux with no resource limits, and I am getting the same issue. More logging output:

flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 54, in hunt_flats
flathunter  |     for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 34, in crawl_for_exposes
flathunter  |     for searcher in self.config.searchers()
flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 35, in <listcomp>
flathunter  |     for url in self.config.target_urls()])
flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 25, in try_crawl
flathunter  |     return searcher.crawl(url, max_pages)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 142, in crawl
flathunter  |     return self.get_results(url, max_pages)
flathunter  |   File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 57, in get_results
flathunter  |     soup = self.get_page(search_url, self.driver, page_no)
flathunter  |   File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 145, in get_page
flathunter  |     afterlogin_string=self.afterlogin_string
flathunter  |   File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
flathunter  |     ret = target(*args, **kwargs)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 76, in get_soup_from_url
flathunter  |     self.resolve_recaptcha(driver, checkbox, afterlogin_string)
flathunter  |   File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
flathunter  |     ret = target(*args, **kwargs)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 190, in resolve_recaptcha
flathunter  |     iframe_present = self._wait_for_iframe(driver)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 248, in _wait_for_iframe
flathunter  |     (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
flathunter  |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 95, in until
flathunter  |     raise TimeoutException(message, screen, stacktrace)
flathunter  | selenium.common.exceptions.TimeoutException: Message:
flathunter  | Stacktrace:
flathunter  | #0 0x55d9051522a3 <unknown>
flathunter  | #1 0x55d904f10f77 <unknown>
flathunter  | #2 0x55d904f4d80c <unknown>
flathunter  | #3 0x55d904f4da71 <unknown>
flathunter  | #4 0x55d904f87734 <unknown>
flathunter  | #5 0x55d904f6db5d <unknown>
flathunter  | #6 0x55d904f8547c <unknown>
flathunter  | #7 0x55d904f6d903 <unknown>
flathunter  | #8 0x55d904f40ece <unknown>
flathunter  | #9 0x55d904f41fde <unknown>
flathunter  | #10 0x55d9051a263e <unknown>
flathunter  | #11 0x55d9051a5b79 <unknown>
flathunter  | #12 0x55d90518889e <unknown>
flathunter  | #13 0x55d9051a6a83 <unknown>
flathunter  | #14 0x55d90517b505 <unknown>
flathunter  | #15 0x55d9051c7ca8 <unknown>
flathunter  | #16 0x55d9051c7e36 <unknown>
flathunter  | #17 0x55d9051e3333 <unknown>
flathunter  | #18 0x7fd708a11ea7 start_thread
flathunter  |
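
For readers unfamiliar with the last frames: the traceback ends in Selenium's explicit wait, which polls a condition until a timeout and then raises TimeoutException. Below is a stdlib-only sketch of that polling pattern, not flathunter's or Selenium's actual code; `wait_until` and `wait_or_none` are hypothetical names. The second helper shows one way a caller could treat "the captcha iframe never appeared" as a normal outcome instead of an error.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float,
               poll_interval: float = 0.5) -> T:
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            # Stand-in for selenium.common.exceptions.TimeoutException.
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(poll_interval)


def wait_or_none(condition: Callable[[], Optional[T]],
                 timeout: float,
                 poll_interval: float = 0.5) -> Optional[T]:
    """Same wait, but report a timeout as None so the caller decides what it means."""
    try:
        return wait_until(condition, timeout, poll_interval)
    except TimeoutError:
        return None
```

With the second form, a missing iframe can be logged and the crawl can continue, which is useful when it is unclear whether a captcha is shown at all.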

@flyingdodo11
Author

@codders I already tried that; it doesn't work.

@codders

codders commented Dec 9, 2022

Okay. I've made a PR #273 - you can try it and see if that fixes your issue. Unfortunately it's not something I can reproduce locally, so it's a bit of guesswork. Let me know!

@vitalik239

@codders Unfortunately that didn't help. Immobilienscout crawling doesn't work even with the increased timeout.
The IS24 variable is not found when the --headless driver argument is set; removing the argument solves the problem, but only for the first loop.

@flyingdodo11
Author

Doesn't work for me either.

@ivanarkhipov

Same error for me, though it was running fine earlier today.

@codders

codders commented Dec 13, 2022

I had a look at this again today. What I can see is that even when I run from the command line (without Docker), I get the timeout / "cannot find IS24 variable" message. Debugging further, I can see that in these cases the bot detection has kicked in:

[Screenshot: 2022-12-13-162032_1272x1515_scrot]

If I disable the '--headless' argument (or unset FLATHUNTER_HEADLESS_BROWSER), the immoscout crawl works as normal. Somehow, the version I have running in the cloud (which uses the headless argument and docker) is still succeeding.

The undetected_chromedriver package is supposed to make it impossible to detect the fact that we're driving the browser from a script, and that seemed to help us for a while, but I guess it's a cat and mouse game. If anyone has any hot tips on avoiding bot detection, those would be most welcome :)
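
To compare headless and non-headless runs without editing the config each time, the flag can be toggled from the environment. This is a minimal sketch, not flathunter's actual logic: the variable name comes from the comment above, but treating any non-empty value as "headless" is an assumption, and `driver_arguments` is a hypothetical helper.

```python
import os
from typing import List, Mapping

HEADLESS_FLAG = "--headless"
# Name taken from the comment above; how flathunter parses its value is assumed.
HEADLESS_ENV_VAR = "FLATHUNTER_HEADLESS_BROWSER"


def driver_arguments(configured: List[str],
                     env: Mapping[str, str] = os.environ) -> List[str]:
    """Return Chrome arguments with --headless present only if the env var is set."""
    # Drop any configured --headless so the environment is the single switch.
    args = [arg for arg in configured if arg != HEADLESS_FLAG]
    if env.get(HEADLESS_ENV_VAR):  # assumption: any non-empty value means "headless"
        args.append(HEADLESS_FLAG)
    return args


if __name__ == "__main__":
    print(driver_arguments(["--no-sandbox", "--headless"], env={}))
    # prints ['--no-sandbox']
```

Passing `env` explicitly keeps the helper easy to test; in normal use the default reads the real process environment.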

@ozeidan

ozeidan commented Dec 26, 2022

I got this partially fixed: undetected_chromedriver provides a Docker image in which it is possible to run chromedriver without the --headless flag. The image creates a virtual display on which the Chrome window is rendered. I got it to work by basing the Dockerfile of this repository on the one from undetected-chromedriver. However, the browser crashes quite often; I'm still looking into fixing that.
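
For anyone who wants to try the virtual-display idea without that image, here is a rough sketch that starts an Xvfb server and runs the crawler against it. Using Xvfb is my assumption (the comment only says "a virtual display"), the display number and screen size are arbitrary, and the helper names are hypothetical. It does not wait for Xvfb to finish starting, so a short delay may be needed in practice.

```python
import os
import shutil
import subprocess
from typing import Dict, List, Optional


def xvfb_command(display: int = 99, width: int = 1280, height: int = 1024,
                 depth: int = 24) -> List[str]:
    """Build the command line for an Xvfb virtual X server."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x{depth}"]


def env_with_display(display: int = 99,
                     base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and point DISPLAY at the virtual screen."""
    env = dict(os.environ if base is None else base)
    env["DISPLAY"] = f":{display}"
    return env


if __name__ == "__main__":
    if shutil.which("Xvfb") is None:
        raise SystemExit("Xvfb is not installed in this image")
    server = subprocess.Popen(xvfb_command())
    try:
        # Same command line that appears in the container logs in this thread.
        subprocess.run(["python", "flathunt.py", "-c", "/config.yaml"],
                       env=env_with_display(), check=False)
    finally:
        server.terminate()  # stop the virtual display when the crawl ends
```

Chrome then renders into the virtual screen, so the --headless flag can stay off even on a server without a GUI.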

@codders

codders commented Dec 26, 2022

That's exciting news - thanks for taking a look! Often when I've seen crashes it's been about memory usage, but I guess you've already tried that. If you make a draft PR I can also have a go at running it here and see what happens.

@flyingdodo11
Author

Any updates on this?

@codders

codders commented Jan 8, 2023

@flyingdodo11 I haven't heard anything. I don't know if this helps you, but if you're just searching in Berlin and you're okay with a pretty default setup, you can also just use the hosted version: https://flathunter.codders.io . That's running okay right now (and crawling immoscout still).

@anamyk

anamyk commented Jan 12, 2023

I also ran into this issue. Any update or workaround would be great.

@hruzgar

hruzgar commented Jan 28, 2023

I tried this method, basing my Docker image on the undetected-chromedriver image like this:

FROM ultrafunk/undetected-chromedriver:latest

I also set the flags "--no-sandbox" and "--disable-setuid-sandbox".
I didn't set the "--headless" flag (that's the whole point),
but it didn't work. I still couldn't get past the bot detection.
Then I thought that my IP address might be blacklisted, so I connected my container to a VPN (thanks to nordvpn-docker),
but still no success.

First I get this message for a while:

[2023/01/28 12:13:22|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...

Then, in the end, it shows a long error message and stops.

@codders

codders commented Jan 28, 2023

@hruzgar Can you copy the long error message?

@infctr

infctr commented Jan 28, 2023

I've also tried running a job on Google Cloud Run based on the ultrafunk/undetected-chromedriver image, however the container stops immediately after executing

running: /bin/sh -c python cloud_job.py
running keepUpScreen()
Container called exit(0)

What am I missing here?

@hruzgar

hruzgar commented Jan 28, 2023

this is the full lifecycle of the execution

haso:flathunter/ (main✗) $ sudo docker run --net=container:vpn --mount type=bind,source=/opt/flathunter/config.yaml,target=/config.yaml flathunter
running: python flathunt.py -c /config.yaml
running keepUpScreen()
[2023/01/28 14:14:18|config.py               |INFO    ]: Using config path /config.yaml
[2023/01/28 14:14:18|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler
...
[2023/01/28 14:14:19|patcher.py              |INFO    ]: patching driver executable /root/.local/share/undetected_chromedriver/753613c1953be3c0_chromedriver
[2023/01/28 14:14:32|abstract_crawler.py     |INFO    ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/01/28 14:14:32|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2023/01/28 14:14:32|crawl_immobilienscout.py|ERROR   ]: IS24 bot detection has identified our script as a bot - we've been blocked
[2023/01/28 14:14:34|imagetyperz_solver.py   |INFO    ]: Trying to solve geetest.
[2023/01/28 14:14:35|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:41|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:46|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:51|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:56|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:02|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:07|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:12|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:17|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:23|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:28|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:33|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:38|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:44|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:49|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:54|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:59|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:05|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:10|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:15|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:20|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:26|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:31|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:36|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:41|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:47|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:53|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:58|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:04|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:09|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:14|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:19|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:25|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:30|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:35|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:40|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:46|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:51|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:56|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:01|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:07|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:12|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:17|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:22|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:28|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:33|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:38|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:43|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:49|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:00|_common.py              |INFO    ]: Backing off resolve_geetest(...) for 1.0s (flathunter.captcha.captcha_solver.CaptchaUnsolvableError)
[2023/01/28 14:19:01|imagetyperz_solver.py   |INFO    ]: Trying to solve geetest.
[2023/01/28 14:19:01|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:06|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:12|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:17|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:22|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
Traceback (most recent call last):
  File "/usr/src/app/flathunt.py", line 109, in <module>
    main()
  File "/usr/src/app/flathunt.py", line 105, in main
    launch_flat_hunt(config, heartbeat)
  File "/usr/src/app/flathunt.py", line 29, in launch_flat_hunt
    hunter.hunt_flats()
  File "/usr/src/app/flathunter/hunter.py", line 54, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/hunter.py", line 33, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/hunter.py", line 33, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/hunter.py", line 25, in try_crawl
    return searcher.crawl(url, max_pages)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/abstract_crawler.py", line 142, in crawl
    return self.get_results(url, max_pages)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 57, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 145, in get_page
    return self.get_soup_from_url(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/abstract_crawler.py", line 77, in get_soup_from_url
    return BeautifulSoup(driver.page_source, 'html.parser')
                         ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 740, in __getattribute__
    return super().__getattribute__(item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 541, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: chrome=109.0.5414.119)
Stacktrace:
#0 0x5613f8e04303 <unknown>
#1 0x5613f8bd8bbd <unknown>
#2 0x5613f8bc3233 <unknown>
#3 0x5613f8bc1c77 <unknown>
#4 0x5613f8bc2408 <unknown>
#5 0x5613f8bcf67f <unknown>
#6 0x5613f8bd02d2 <unknown>
#7 0x5613f8be0fd0 <unknown>
#8 0x5613f8be534b <unknown>
#9 0x5613f8bc29c5 <unknown>
#10 0x5613f8be0bd2 <unknown>
#11 0x5613f8c4d7a0 <unknown>
#12 0x5613f8c35753 <unknown>
#13 0x5613f8c08a14 <unknown>
#14 0x5613f8c09b7e <unknown>
#15 0x5613f8e5332e <unknown>
#16 0x5613f8e56c0e <unknown>
#17 0x5613f8e39610 <unknown>
#18 0x5613f8e57c23 <unknown>
#19 0x5613f8e2b545 <unknown>
#20 0x5613f8e786a8 <unknown>
#21 0x5613f8e78836 <unknown>
#22 0x5613f8e93d13 <unknown>
#23 0x7fc0d591cea7 start_thread

[2023/01/28 14:19:29|__init__.py             |INFO    ]: ensuring close

@hruzgar

hruzgar commented Jan 28, 2023

@infctr You need to set the "--no-sandbox" and "--disable-setuid-sandbox" flags in your config.yaml file. Also, don't set the "--headless" flag.

@infctr

infctr commented Jan 28, 2023

@hruzgar Did imagetyperz work for you before with IS24? I had a similar "Captcha is not ready yet" error, so I had to switch to 2captcha.

@hruzgar

hruzgar commented Jan 28, 2023

Yeah, it was working (and is still working) on my main PC. But I want to run the bot on my server to avoid a high energy bill (my PC is beefy). That's the reason I am trying to get it working inside Docker without any GUI.
I could still check whether it works with 2captcha, though. Worth a try for sure.

@infctr

infctr commented Jan 28, 2023

I've started the image with these driver flags but it didn't make a difference in the container unfortunately

"--no-sandbox",
"--disable-gpu",
"--disable-setuid-sandbox",

@hruzgar

hruzgar commented Jan 28, 2023

I just tried running the bot locally on my PC again. The weird thing is that it works with the "--headless" argument for a certain amount of time before it fails again. But as soon as I comment out the "--headless" flag and run the bot again, it opens a Chrome tab that says I am a robot, so I don't get access to the site.

@codders

codders commented Jan 30, 2023

@infctr The cloud_job script is expected to run once and then quit. It is designed to be installed as a cron job running on a timer. The flathunt script is configurable either to run in a loop, or as a one-time job.
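
The difference between the two modes can be sketched roughly as follows. This is an illustration only, not the code in cloud_job.py or flathunt.py; `run_hunt`, `loop_period_seconds` and `max_runs` are hypothetical names.

```python
import time
from typing import Callable, Optional


def run_hunt(hunt: Callable[[], None],
             loop_period_seconds: Optional[float] = None,
             max_runs: Optional[int] = None,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Run `hunt` once, or repeatedly with a pause if a loop period is given.

    Returns the number of completed runs.
    """
    runs = 0
    while True:
        hunt()
        runs += 1
        if loop_period_seconds is None:
            # One-shot mode: exit so cron or a cloud scheduler can re-invoke us.
            return runs
        if max_runs is not None and runs >= max_runs:
            return runs
        sleep(loop_period_seconds)
```

In one-shot mode a clean exit right after the crawl is the expected behaviour, so "Container called exit(0)" on its own does not indicate a failure.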

@codders

codders commented Jan 30, 2023

@hruzgar CaptchaUnsolvableError sometimes comes up if it just can't solve the captcha, but it should retry and that shouldn't be fatal. Usually a message like 'session deleted because of page crash' comes after the container runs out of memory - are you running with a memory limit on your docker container?
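
To illustrate the "retry, and don't treat it as fatal" behaviour: the logs above show the backoff package doing the retries, and the sketch below is a stdlib-only approximation of the same idea rather than flathunter's implementation. The exception class is a local stand-in for the real one, and `solve_with_retries` with its parameters is hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CaptchaUnsolvableError(Exception):
    """Local stand-in for flathunter.captcha.captcha_solver.CaptchaUnsolvableError."""


def solve_with_retries(solve: Callable[[], T],
                       max_tries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call `solve` up to `max_tries` times, doubling the pause after each failure.

    Returns None when every attempt fails, so one unsolved captcha does not
    abort the whole crawl.
    """
    for attempt in range(max_tries):
        try:
            return solve()
        except CaptchaUnsolvableError:
            if attempt == max_tries - 1:
                return None  # give up quietly; the caller can skip this page
            sleep(base_delay * 2 ** attempt)
    return None
```

Only the captcha error is caught here. A WebDriverException such as "session deleted because of page crash" still propagates, which is appropriate because the browser session is gone at that point and has to be recreated.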
