Performance #10

Open

aprokofyev opened this issue Jan 3, 2019 · 6 comments

Comments

@aprokofyev

aprokofyev commented Jan 3, 2019

Could you please explain whether rendora + headless Chrome can process concurrent requests in parallel, or whether all requests are handled synchronously? I ran a simple benchmark and this is what I got:

wrk -H 'User-Agent: bot' http://127.0.0.1:3001
Running 10s test @ http://127.0.0.1:3001
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.01s     0.00us   1.01s   100.00%
    Req/Sec     0.11      0.33     1.00     88.89%
  9 requests in 10.09s, 334.15KB read
  Socket errors: connect 0, read 1, write 0, timeout 8
Requests/sec:      0.89
Transfer/sec:     33.10KB

Config (http://backend.d is a simple SPA that fetches dummy data from an API; on its own it handles ~3000 req/s):

listen:
    address: 0.0.0.0
    port: 3001
target:
    url: "http://backend.d"
backend:
    url: "http://backend.d"
headless:
    waitAfterDOMLoad: 1000
    internal:
        url: http://localhost:9222
    timeout: 5
output:
    minify: true
debug: true
cache:
    type: none
filters:
    userAgent:
        defaultPolicy: blacklist
        exceptions:
            keywords:
                - bot
                - bing
                - crawler
                - curl
@geokb
Member

geokb commented Jan 3, 2019

You set waitAfterDOMLoad: 1000, so rendora waits an entire extra second after the initial DOM load. Set it back to 0 and check the latencies again. I see latencies as low as 10ms for hello-world-tier pages and ~200ms for some really complex pages on a website running in production.
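
For reference, the headless section of the config above with that fixed wait removed would look like this (only the headless block is shown; everything else stays the same):

headless:
    waitAfterDOMLoad: 0
    internal:
        url: http://localhost:9222
    timeout: 5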

@aprokofyev
Author

aprokofyev commented Jan 4, 2019

Thank you for the reply. I understand what waitAfterDOMLoad does; what I'm trying to figure out is whether rendora + headless Chrome can process multiple concurrent requests in parallel, or whether requests stack up in a queue and each one is processed only after the previous one finishes. The benchmark shows that 8 requests out of 10 timed out, which makes me believe requests are processed sequentially. Could you please clarify?

wrk -H 'User-Agent: bot' http://127.0.0.1:3001
Running 10s test @ http://127.0.0.1:3001
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.01s     0.00us   1.01s   100.00%
    Req/Sec     0.11      0.33     1.00     88.89%
  9 requests in 10.09s, 334.15KB read
  Socket errors: connect 0, read 1, write 0, timeout 8
Requests/sec:      0.89
Transfer/sec:     33.10KB

@geokb
Member

geokb commented Jan 4, 2019

Ah okay, sorry, I didn't read your post carefully the first time. Yes, rendora currently sends requests to the headless Chrome instance sequentially, not in parallel, although rendora itself can accept as many parallel requests as your OS and resources allow. And since your config sets headless.timeout to a fairly low 5 seconds, some responses will come back with status 500: each queued request has to wait an additional 1000ms (your waitAfterDOMLoad value) for every older request ahead of it.
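
To make the arithmetic concrete: with renders serialized and each taking at least 1s (the waitAfterDOMLoad value), the Nth queued request waits roughly N seconds before it even starts, so most of wrk's 10 concurrent connections exceed the 5s headless.timeout. Below is a rough Go sketch of that kind of mutex-serialized access with a timeout; it is purely illustrative, not rendora's actual code, and renderWithChrome is a hypothetical stand-in for the call to a single shared Chrome tab.

package main

import (
    "fmt"
    "sync"
    "time"
)

// renderWithChrome is a hypothetical stand-in for a render call against a single
// headless Chrome tab; the sleep mirrors the 1000ms waitAfterDOMLoad above.
func renderWithChrome(url string) string {
    time.Sleep(1000 * time.Millisecond)
    return "<html>rendered " + url + "</html>"
}

var chromeMu sync.Mutex // a single shared tab: only one render at a time

// render serializes every render behind one mutex and gives up after a timeout,
// roughly modelling the queue-plus-timeout behaviour described above.
func render(url string, timeout time.Duration) (string, error) {
    done := make(chan string, 1)
    go func() {
        chromeMu.Lock()
        defer chromeMu.Unlock()
        done <- renderWithChrome(url)
    }()
    select {
    case html := <-done:
        return html, nil
    case <-time.After(timeout):
        return "", fmt.Errorf("render of %s timed out", url) // would surface as a 500
    }
}

func main() {
    var wg sync.WaitGroup
    start := time.Now()
    for i := 0; i < 10; i++ { // ten concurrent clients, as in the wrk run above
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            _, err := render(fmt.Sprintf("http://backend.d/page/%d", i), 5*time.Second)
            fmt.Printf("request %d finished after %v, err: %v\n", i, time.Since(start).Round(time.Millisecond), err)
        }(i)
    }
    wg.Wait()
}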

@aprokofyev
Author

Thank you, understood. Do you by any chance know a way to make headless Chrome process requests concurrently?

@geokb
Member

geokb commented Jan 4, 2019

Yes, I think by using the Target domain of the Chrome DevTools Protocol (see https://chromedevtools.github.io/devtools-protocol/tot/Target). It could be done in rendora, but it would mean sacrificing the type-safe RPC layer and sending/receiving raw JSON to the browser. I actually wanted to implement parallelism from the very beginning, but it's not very common to see hundreds, or even tens, of crawler requests per second, except perhaps for top websites with hundreds of thousands of pages. I will probably implement it in some version before v1.0 if there is enough interest.
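
For anyone who wants to experiment with per-request tabs outside rendora, here is a rough Go sketch using the third-party chromedp library, which opens a new tab (a new target via Target.createTarget) for each child context. This is only an illustration, not rendora's code; the URLs are placeholders, and for simplicity it launches its own Chrome instead of attaching to an existing instance at localhost:9222.

package main

import (
    "context"
    "log"
    "sync"
    "time"

    "github.com/chromedp/chromedp"
)

// renderHTML opens a new tab (a new CDP target) for each call, so several
// renders can proceed in parallel inside one Chrome process.
func renderHTML(browserCtx context.Context, url string) (string, error) {
    tabCtx, cancel := chromedp.NewContext(browserCtx)
    defer cancel()

    tabCtx, cancel = context.WithTimeout(tabCtx, 5*time.Second)
    defer cancel()

    var html string
    err := chromedp.Run(tabCtx,
        chromedp.Navigate(url),
        chromedp.OuterHTML("html", &html),
    )
    return html, err
}

func main() {
    // One shared browser; chromedp launches it on the first Run.
    browserCtx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    if err := chromedp.Run(browserCtx); err != nil { // start the browser
        log.Fatal(err)
    }

    // Placeholder URLs; in the scenario above these would be backend.d pages.
    urls := []string{"http://backend.d/", "http://backend.d/a", "http://backend.d/b"}

    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            html, err := renderHTML(browserCtx, u)
            if err != nil {
                log.Printf("%s: %v", u, err)
                return
            }
            log.Printf("%s: rendered %d bytes", u, len(html))
        }(u)
    }
    wg.Wait()
}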

@agonsalves

Seconding the interest. Even though a single website may not get many crawler requests per second on average, I often see surges of 600+ requests per minute when someone forgets to throttle their bot. And if you have many websites running through this, it definitely adds up.
