Memory leak issues #79

Open
jaymzh opened this issue Jul 12, 2023 · 11 comments

@jaymzh
Member

jaymzh commented Jul 12, 2023

The webserver crashes a few times a week due to running out of memory. @irabinovitch did some sleuthing and believes it's holding old database connections open, and they're stacking up.

We should definitely try to sort this out as part of the 2024 launch.

@DrupalPhil
Collaborator

@irabinovitch Can you share what you found?

@irabinovitch
Collaborator

I haven't found anything or done any investigation yet.

@irabinovitch
Collaborator

Looking briefly at monitoring data, both CPU and memory spike when this happens. Datadog process monitoring points at the Apache2 processes as the source.

From a quick look at the configs, it seems we're using mpm_prefork here. One idea we could consider is setting MaxRequestsPerChild to something other than the default of 0. That would limit the number of requests an individual Apache process serves before it is retired/restarted. I don't know that I'd call that a fix, but it would probably at least mitigate the runaway memory usage.
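
For concreteness, here is a minimal sketch of that change, assuming a stock mpm_prefork setup and written with the current MaxConnectionsPerChild spelling that comes up just below. The file path and the value 1000 are illustrative assumptions, not values from this repo:

```
# Hypothetical sketch, e.g. in /etc/apache2/mods-available/mpm_prefork.conf.
# The value 1000 is illustrative, not a tested recommendation.
<IfModule mpm_prefork_module>
    # Retire each child after this many connections so leaked memory is
    # reclaimed; the default of 0 means children are never retired.
    MaxConnectionsPerChild 1000
</IfModule>
```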

@jaymzh
Member Author

jaymzh commented Jul 18, 2023

I think you mean MaxConnectionsPerChild?
Definitely agree on changing that. Even if we could track down some specific bug, PHP can be pretty leaky in this way, and my understanding is that best practice is not to leave MaxConnectionsPerChild at 0. But... my experience in such areas is pretty outdated.

Oh, I see, the old name was MaxRequestsPerChild. Same thing. I can whip up a PR for that.

@irabinovitch
Collaborator

Right, MaxConnectionsPerChild is the new name for MaxRequestsPerChild as of 2.3.9. I think they have the exact same effect, but yes, we should use the new name. It definitely shouldn't be 0; that's just the default. Not sure what a reasonable number is, and yes, I'd like to find whatever we're leaking or hanging on to, but this seems like a reasonable defense if someone wants to try it.

@jaymzh
Member Author

jaymzh commented Jul 18, 2023

@irabinovitch
Collaborator

This doesn't seem to have helped stability.

@DrupalPhil
Collaborator

Can you provide access to Datadog or whatever logging system you have?

@irabinovitch
Collaborator

irabinovitch commented Oct 7, 2023 via email

@DrupalPhil
Collaborator

It's not a glaring issue, but I've optimized the query anyway. While I look for other potential culprits, let's get this merged into prod.

jaymzh added a commit to jaymzh/scale-chef that referenced this issue Oct 8, 2023
* Log the state of apache processes when we restart so we can continue
  to understand the problem
* Restart slightly less often
* Update MaxConnectionsPerChild to 50, which is probably a bit more
  reasonable than 5
* Update server limit (see below)
* Turn off KeepAlive (see below)

From scale-infra, an explanation of the last two items above:

```
This is a confirmed bug in Apache with no configuration workaround:

https://bz.apache.org/bugzilla/show_bug.cgi?id=53555

There's a pretty good description of the details, which match our behavior here:

https://serverfault.com/questions/516373/what-is-the-meaning-of-ah00485-scoreboard-is-full-not-at-maxrequestworkers

One thing that some people seem to have had success with is turning keepalive off.

ServerLimit, for the event MPM (what we use), the recommendation is:

With event, increase this directive if the process number defined by your MaxRequestWorkers and ThreadsPerChild settings, plus the number of gracefully shutting down processes, is more than 16 server processes (default).

Our MaxRequestWorkers is 50. We don't define ThreadsPerChild which defaults to 25. So if I read that correctly (and I haven't tuned Apache for a living in a looonnnggg time), we want something like 50+25=75 plus some more for shutting down processes, so like... 80? I can prep a diff for that plus keepalive and see how that goes.
```

All continuing socallinuxexpo/scale-drupal#79

Signed-off-by: Phil Dibowitz <[email protected]>
jaymzh added a commit to socallinuxexpo/scale-chef that referenced this issue Oct 12, 2023 (same commit message as above)
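
For reference, the tuning described in that commit message would look roughly like the following mpm_event stanza. The values come straight from the message above (MaxConnectionsPerChild 50, ServerLimit 80, KeepAlive Off, MaxRequestWorkers 50, ThreadsPerChild defaulting to 25); the file layout and grouping are assumptions:

```
# Sketch of the tuning described above; file layout is an assumption.
KeepAlive Off                   # workaround some report for the scoreboard bug

<IfModule mpm_event_module>
    MaxRequestWorkers      50   # existing cap on simultaneous request threads
    ThreadsPerChild        25   # the Apache default, made explicit here
    ServerLimit            80   # extra scoreboard slots for gracefully
                                # shutting-down processes (see bug 53555)
    MaxConnectionsPerChild 50   # retire children to reclaim leaked memory
</IfModule>
```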
@jaymzh
Member Author

jaymzh commented Oct 12, 2023

OK, I've done a few more things here:

  1. atop is now deployed on all our servers, including the webserver, which should give us some better visibility.
  2. I dropped the restarts to every hour instead of every 30 minutes, and before each restart we log some data about the state of the Apache processes (one possible mechanism is sketched below). If this shows the processes aren't in a wait state, we can probably try 2 hours and so on.
  3. Did a variety of other small tunings to Apache that may or may not help. See Try some more tuning scale-chef#293 for details.
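
The issue doesn't show how that process-state logging is implemented; one common way to expose the Apache scoreboard so a pre-restart job can snapshot it is mod_status, sketched here purely as an assumption:

```
# Hypothetical: expose the scoreboard locally via mod_status so a
# pre-restart job can snapshot worker states. The actual logging
# mechanism used here isn't shown in the issue.
<IfModule status_module>
    ExtendedStatus On
    <Location "/server-status">
        SetHandler server-status
        Require local             # only reachable from the box itself
    </Location>
</IfModule>
```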
