
Add ability to collect traces only when the CPU is active (--on-cpu) #327

Open · wants to merge 1 commit into base: main

Conversation

@michelhe (Contributor) commented Jun 7, 2021

This change builds on what #127 tried to bring, allowing measurement of how much time is spent on CPU (but implemented slightly differently and more up to date).

We filter traces that are off-cpu using two checks (a rough sketch of how they combine is shown after the list):

  • OS-specific, currently using the remoteprocess crate to determine if there are any active threads at this point in time.
  • Using the rb_thread_struct.status field to figure out if a thread is in THREAD_RUNNABLE or not when we come to get a trace from it. The problem is that this is only a heuristic and will be inaccurate in cases when the process gets preempted by the OS while it has runnable threads, so we rely on OS-specific checks as well when possible.
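
A minimal sketch of how these two checks might compose. This is not the PR's actual code: `is_on_cpu` and `ruby_thread_status` are hypothetical names, and the `remoteprocess` calls are assumed to match the crate's `threads()`/`active()` API quoted later in the review.

```rust
use anyhow::Result;
use remoteprocess::Process;

// Hypothetical combination of the two off-cpu filters described above.
// `ruby_thread_status` stands in for the value read from rb_thread_struct.status;
// in the code under review, 0 corresponds to THREAD_RUNNABLE.
fn is_on_cpu(process: &Process, ruby_thread_status: i32) -> Result<bool> {
    // Check 1 (OS-specific): is any thread in the thread group currently active?
    let any_thread_active = process
        .threads()?
        .iter()
        .any(|thread| thread.active().unwrap_or(false));
    if !any_thread_active {
        return Ok(false);
    }
    // Check 2 (interpreter): is the current ruby thread marked runnable?
    Ok(ruby_thread_status == 0 /* THREAD_RUNNABLE */)
}
```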

@acj (Member) commented Jun 8, 2021

Hey, thanks for this. We've been wanting on/off CPU filtering for a long time. I'll try to do a deep review sometime in the next week or so.

@Jongy left a comment

Some comments for @michelhe :)

README.md Outdated
rbspy profiles everything the program does (including waiting) -- there's no
option to just profile when the program is using the CPU.
By defeault, rbspy profiles everything the program does (including waiting).
There is an experimantal option to profile only when the program is using the CPU. (`--on-cpu`)

Suggested change
There is an experimantal option to profile only when the program is using the CPU. (`--on-cpu`)
There is an experimental option to profile only when the program is using the CPU. (`--on-cpu`)

fn is_on_cpu_os_specific(&self) -> Result<bool> {
    /* We don't have OS specific checks for these targets,
     * so fallback to using the interpreter based check down the line */
    Ok(false)
@Jongy:

If you return false here and on_cpu is used, then nothing will be collected at all.

In this configuration - perhaps just deny using on_cpu?

@michelhe (Contributor Author):

Nice catch, meant to return true 😅

#[cfg(any(target_os = "macos", target_os = "windows", target_os = "linux"))]
fn is_on_cpu_os_specific(&self) -> Result<bool> {
    // remoteprocess crate exposes a Thread.active() method for each of these targets
    for thread in self.process.threads()?.iter() {
@Jongy:

This is not the logic you want - you need to check the "current" thread that we're going to get stacktraces from, not just any thread in the process

@michelhe (Contributor Author):

We don't know which OS thread the "current" ruby thread is mapped to until we start examining the interpreter's memory, so the heuristic is: "if any thread in the thread group is on CPU"

@Jongy:

If the "current" ruby thread is not active OS-wise, it's surely not active ruby-wise (to be precise: it may be marked active ruby-wise, this just means it's preempted). Another thread may be running, doing non-ruby stuff, and the thread doing ruby-stuff may be preempted, and you will incorrectly account that as the ruby code)

@acj (Member):

I touched on this in another comment, but if this is meant as an initial filter and the more accurate check happens later (via get_thread_status), then I think it's okay. Unless the remoteprocess functions we're calling here are expensive, this should reduce rbspy's CPU usage a bit when we're doing on-CPU profiling. It might be worth it to check that assumption, though -- I don't know offhand how expensive this loop is for a typical program.

@Jongy:

@acj this is just not accurate enough - imagine this case: 1 thread in the application is running some non-related C code doing whatever (and it's active, so thread.active() returns true). Meanwhile, another thread is running Ruby code (so it has rb_thread_status_THREAD_RUNNABLE, I guess?). But the OS decided that it's time that the Ruby thread relinquishes CPU, and it gets preempted. Its rb_thread_status remains as it was.

So, per the 2 checks:

  • the "process" is active, because one of its threads is active.
  • the "Ruby thread" is also "active" because its status is THREAD_RUNNABLE.

And the result is that we collect Ruby stacks from a thread that's not currently running!

About the cost of thread.active() - it's not very expensive. It's a single read from /proc/tid/stat, so 3 syscalls (open, read, close). Compared to that, get_thread_status is only one process_vm_readv IIRC, so it's faster; but the two should be benched against each other if we want a real comparison.
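
For illustration, here is roughly what such an OS-level check amounts to on Linux, independent of rbspy: one read of `/proc/<pid>/task/<tid>/stat` and a parse of the state field ('R' means running). A minimal sketch, not the remoteprocess crate's implementation.

```rust
use std::fs;

// Sketch of a /proc-based "is this thread on-CPU right now?" check on Linux.
fn thread_is_running(pid: u32, tid: u32) -> std::io::Result<bool> {
    let stat = fs::read_to_string(format!("/proc/{}/task/{}/stat", pid, tid))?;
    // The state character is the first field after the ')' that closes the comm
    // field (which can itself contain spaces and parentheses).
    let after_comm = stat.rsplit(')').next().unwrap_or("");
    Ok(after_comm.trim_start().starts_with('R'))
}
```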

@acj (Member):

Yeah, sorry, I didn't mean to suggest that the entire approach is okay. I understood your first example and agree that it's something we need to solve (I'm still thinking about this). What I meant to say is that this for loop seems reasonable, provided that we can somehow do an accurate check later when we examine process memory.

@michelhe (Contributor Author) commented Jun 20, 2021:

> I touched on this in another comment, but if this is meant as an initial filter and the more accurate check happens later (via get_thread_status), then I think it's okay. Unless the remoteprocess functions we're calling here are expensive, this should reduce rbspy's CPU usage a bit when we're doing on-CPU profiling. It might be worth it to check that assumption, though -- I don't know offhand how expensive this loop is for a typical program.

Anyway, I used the flame crate to insert tracepoints in get_trace() and this is what I measured:

THREAD: 140546956667136
| get_trace: 0.246437ms
  | is_on_cpu_os_specific: 0.024342ms
  | get_stack_trace_from_current_thread: 0.221337ms

It seems like the remoteprocess functions are a bit expensive to call per trace.
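
For reference, this is roughly how such tracepoints can be inserted with the `flame` crate; the span names mirror the measurement above, but the wiring is illustrative (assuming the crate's `span_of`/`dump_html` API), not the actual instrumentation used for the numbers quoted here.

```rust
use std::fs::File;

// Cargo.toml: flame = "0.2"
fn main() {
    flame::span_of("get_trace", || {
        flame::span_of("is_on_cpu_os_specific", || {
            // OS-level active-thread check would run here.
        });
        flame::span_of("get_stack_trace_from_current_thread", || {
            // Interpreter memory reads and stack walking would run here.
        });
    });
    // Write an HTML report with per-span timings like the ones quoted above.
    flame::dump_html(&mut File::create("flame-graph.html").unwrap()).unwrap();
}
```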

@michelhe (Contributor Author):

@acj What is your take given this measurement? Should I get rid of the remoteprocess calls?

@acj (Member):

I don't know yet. On the one hand, the call to check active threads isn't free, but on the other hand we'd avoid the more expensive call to get_stack_trace_from_current_thread if the process is definitely not on-CPU. That might be a reasonable tradeoff for some workloads.

Let's revisit this after we solve the correctness/reliability questions.

src/core/ruby_version.rs Outdated (resolved)
let thread: $thread_type = get_execution_context(ruby_current_thread_address_location, ruby_vm_address_location, source)
.context(ruby_current_thread_address_location)?;

// testing the thread state in the interpreter.
if on_cpu && get_thread_status(&thread, source)? != 0 /* THREAD_RUNNABLE */ {
@Jongy:

I don't know ruby well enough - is this the equivalent of CPython's GIL? If so, it's probably better to add this as another option (compare to py-spy's --active for on-CPU, and --gil for GIL; 2 different options that can go together, but I might want --active only w/o --gil)

@acj (Member):

My understanding is that they're very similar. Only one thread per VM can be executing ruby code at a time. Native C extensions can explicitly release the GVL, though, so that ruby and C code can execute in parallel when it's safe to do so. The situation is changing a bit in ruby 3 with the addition of ractors, which have their own lock and will probably need different logic for on/off CPU analysis. rbspy only has basic support for ractors right now (only looks at the main one), so we may need to defer that work until later.

I'm not opposed to introducing a --gvl option if we can find a reliable way to detect which thread holds the lock, but I'd recommend doing that in a future PR so that we can keep this one small-ish and focused. We might also want an --off-cpu option. Lots of future paths to explore here.

It seems that this get_thread_status check is the key mechanism, and the earlier check using remoteprocess's Thread::active is a fast/cheap filter to skip get_trace if we know that none of the process's threads is active (from the OS's perspective). Is that accurate?

@Jongy:

Thanks for the brief Ruby internals :)

> I'm not opposed to introducing a --gvl option if we can find a reliable way to detect which thread holds the lock,

I might have missed that - isn't it what we expect THREAD_RUNNABLE to tell? Or does runnable mean "can be run given that it obtains the GVL"? If the latter, then I think that looking at it doesn't make any sense at all.

> We might also want an --off-cpu option. Lots of future paths to explore here.

Yup. Well, basically, no --on-cpu means --off-cpu now :)

> It seems that this get_thread_status check is the key mechanism, and the earlier check using remoteprocess's Thread::active is a fast/cheap filter to skip get_trace if we know that none of the process's threads is active (from the OS's perspective). Is that accurate?

To my understanding - not precisely, I'll give an elaborated answer on the other discussion 😅

@acj (Member) commented Jun 15, 2021:

> I might have missed that - isn't it what we expect THREAD_RUNNABLE to tell? Or does runnable mean "can be run given that it obtains the GVL"? If the latter, then I think that looking at it doesn't make any sense at all.

I think this is a key question, and I don't know enough about MRI VM internals to give a confident answer. What you said seems reasonable. I'll poke around in some reference materials that I've collected and will post again if I find anything.

> Yup. Well, basically, no --on-cpu means --off-cpu now :)

My feeling is that the absence of --on-cpu should cause rbspy to collect all traces, which is the current shipping behavior (and what this PR would do in its current state, IIUC), and then we could introduce --off-cpu to do the opposite filtering that --on-cpu does, i.e. only collect traces if the thread is waiting.
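
To make the proposed semantics concrete, here is a tiny sketch of how the three modes could map to a filter decision; the `CpuFilter` enum and `should_keep_trace` function are hypothetical names, not part of this PR.

```rust
// Hypothetical filtering modes matching the discussion above.
enum CpuFilter {
    All,    // default: no flag, collect every trace (current shipping behavior)
    OnCpu,  // --on-cpu: collect only when the current ruby thread is runnable
    OffCpu, // possible future --off-cpu: collect only when the thread is waiting
}

fn should_keep_trace(filter: &CpuFilter, thread_is_runnable: bool) -> bool {
    match filter {
        CpuFilter::All => true,
        CpuFilter::OnCpu => thread_is_runnable,
        CpuFilter::OffCpu => !thread_is_runnable,
    }
}
```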

- OS level thread information (only available for linux, macos and windows)
- current thread state in the ruby interpreter.

In addition, rbspy can either profile in a blocking or nonblocking mode.
@acj (Member):

Thanks for adding this. We should probably mention the --nonblocking flag to be consistent with the paragraph above.

README.md Outdated (resolved)
@@ -74,7 +83,25 @@ impl StackTraceGetter {
Ok(self.get_trace_from_current_thread()?)
}

fn get_trace_from_current_thread(&self) -> Result<StackTrace, MemoryCopyError> {
#[cfg(any(target_os = "macos", target_os = "windows", target_os = "linux"))]
@acj (Member):

remoteprocess has a Thread implementation for FreeBSD (including the active function), so we might not need this guard. I think the _os_specific suffix could be dropped in that case -- much of rbspy's core work is OS-specific in some way.



README.md Outdated (resolved)
README.md Outdated
rbspy profiles everything the program does (including waiting) -- there's no
option to just profile when the program is using the CPU.
By defeault, rbspy profiles everything the program does (including waiting).
There is an experimental option to profile only when the program is using the CPU. (`--on-cpu`)
@acj (Member):

> experimental option

Are there any concerns or caveats that we should document? What's blocking this from being a non-experimental feature?

@acj (Member) commented Jul 10, 2021

I looked through the ruby source and noticed that rb_thread_struct has a thread_id field that we might be able to use to correlate active OS threads with the active thread in the ruby VM. That would involve reading VM memory earlier than we currently do. For example, we could build a list of active OS threads, lock the process, and then immediately get the active thread ID from the VM to see if it's in the active list. If not, then unlock and return. If it's in the active list, then I think we can be reasonably confident that the active thread is on-CPU if its VM thread state is runnable. (This assumes that there's no subtle VM and/or GVL behavior that I'm not accounting for here.)

Thoughts?
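
A hedged sketch of the correlation described above, assuming the remoteprocess crate's `threads()`, `Thread::active()`, `Thread::id()`, and `Process::lock()` APIs; `read_current_ruby_thread_id` is a hypothetical placeholder for the extra VM memory read this would require.

```rust
use std::collections::HashSet;

use anyhow::Result;
use remoteprocess::Process;

// Hypothetical placeholder: the real version would read rb_thread_struct.thread_id
// for the VM's current thread using rbspy's ruby-version bindings.
fn read_current_ruby_thread_id(_process: &Process) -> Result<u64> {
    unimplemented!()
}

fn current_ruby_thread_is_on_cpu(process: &Process) -> Result<bool> {
    // 1. Build the set of OS threads that are currently running.
    let active_tids: HashSet<u64> = process
        .threads()?
        .iter()
        .filter(|t| t.active().unwrap_or(false))
        .filter_map(|t| t.id().ok().map(|id| id as u64))
        .collect();

    // 2. Pause the process and read the current VM thread's thread_id; the lock is
    //    released when `_lock` goes out of scope.
    let _lock = process.lock()?;
    let ruby_thread_id = read_current_ruby_thread_id(process)?;

    // 3. The current ruby thread is on-CPU only if it is among the active OS threads
    //    (its VM status still needs to be THREAD_RUNNABLE, checked separately).
    Ok(active_tids.contains(&ruby_thread_id))
}
```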

@michelhe (Contributor Author):
Hey @acj,
I wouldn't want the work done here to become stale, so please lmk what you think needs to be done in order to merge this upstream 😄

@acj (Member) commented Nov 15, 2021

@michelhe Yeah, we should make a decision here. Are we still blocked on the accuracy/correctness issues? Did you have any thoughts on my last comment?

If you'd rather merge this as an experimental feature and iterate on it later, I'm okay with that too.

@Jongy commented Nov 16, 2021

> @michelhe Yeah, we should make a decision here. Are we still blocked on the accuracy/correctness issues? Did you have any thoughts on my last comment?

I think that we want to keep the OS-level "is running" check.

About locking/unlocking - I prefer to keep it as accurate as possible in nonblocking mode too. I recently read Ben's blog at https://www.benfrederickson.com/why-python-needs-paused-during-profiling/, which led me to understand that rbspy doesn't take as long as py-spy to read the stacks. So I suggest that we do something similar to py-spy, that is (roughly sketched after this list):

  • Lock if blocking
  • Check OS level "is running"
  • Check Ruby-level "is running" (in py-spy, that's done only if --gil is passed - does rbspy have any counterpart or we should always do it? I suppose it should be configurable)
  • Read the stacks array
  • Unlock if blocking
  • Symbolicate the stacks array
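
A rough sketch of that sampling flow; everything here is illustrative. The remoteprocess `lock()`, `threads()`, and `Thread::active()` calls are assumed from the crate as used elsewhere in this thread, and the ruby-level check, stack reading, and symbolication steps are left as comments.

```rust
use anyhow::Result;
use remoteprocess::Process;

fn sample_once(process: &Process, blocking: bool, on_cpu: bool) -> Result<()> {
    // 1. Lock (pause) the target only in blocking mode.
    let lock_guard = if blocking { Some(process.lock()?) } else { None };

    // 2. OS-level "is running" check.
    if on_cpu && !process.threads()?.iter().any(|t| t.active().unwrap_or(false)) {
        return Ok(());
    }

    // 3. Ruby-level "is running" check (rb_thread_struct.status == THREAD_RUNNABLE)
    //    would go here, followed by
    // 4. reading the raw stacks array while the process is still paused.

    // 5. Unlock as soon as the raw reads are done.
    drop(lock_guard);

    // 6. Symbolicate the raw frames outside the locked region, keeping the pause short.
    Ok(())
}
```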

@acj (Member) commented Nov 26, 2021

Sorry for the delay. From rereading the above discussion, I think the open questions are:

  1. Are we okay with the current heuristic for the OS-level check, i.e. if any thread in the thread group is on-CPU, then the check passes?
  2. If that's not accurate enough, then do we want to try what I suggested in my earlier comment, i.e. to compare the ruby thread's thread_id field with the active OS thread? There may be portability issues here.
  3. Depending on how confident we are about 1 and 2, should we treat the --on-cpu feature as experimental for now?
  4. Does the current code work on FreeBSD? The API seems to have what we need, so I think it should work, but I haven't tried it.

If there's any lingering doubt, I'd vote to deal with 4 (FreeBSD support) and then merge this with --on-cpu marked as experimental. Then we can let people try it out, gather more data from our own usage of it, etc.

> Check Ruby-level "is running" (in py-spy, that's done only if --gil is passed - does rbspy have any counterpart or we should always do it? I suppose it should be configurable)

@Jongy We touched on that here. rbspy doesn't have an equivalent flag right now, but we could add one if we can detect which thread holds the GVL. We might need to ask the ruby VM developers about it.

@omarabid mentioned this pull request Mar 28, 2022
@michelhe (Contributor Author):
> Sorry for the delay. From rereading the above discussion, I think the open questions are:
>
>   1. Are we okay with the current heuristic for the OS-level check, i.e. if any thread in the thread group is on-CPU, then the check passes?
>   2. If that's not accurate enough, then do we want to try what I suggested in my earlier comment, i.e. to compare the ruby thread's thread_id field with the active OS thread? There may be portability issues here.
>   3. Depending on how confident we are about 1 and 2, should we treat the --on-cpu feature as experimental for now?
>   4. Does the current code work on FreeBSD? The API seems to have what we need, so I think it should work, but I haven't tried it.
>
> If there's any lingering doubt, I'd vote to deal with 4 (FreeBSD support) and then merge this with --on-cpu marked as experimental. Then we can let people try it out, gather more data from our own usage of it, etc.
>
> > Check Ruby-level "is running" (in py-spy, that's done only if --gil is passed - does rbspy have any counterpart or we should always do it? I suppose it should be configurable)
>
> @Jongy We touched on that here. rbspy doesn't have an equivalent flag right now, but we could add one if we can detect which thread holds the GVL. We might need to ask the ruby VM developers about it.

Hey @acj, how are you?

Sorry for taking this long, I'm now ready to begin the needed work to finally bring this feature to mainline.
We actually have been using rbspy from this branch for the past year in our gProfiler and are generally happy with the results it produces for on-cpu, even though the accuracy with the current state of the code should really be taken with a grain of salt.

Reiterating past conversations in this PR, I think the best course of action will be:

  1. Resolve current conflicts with master, which has changed quite a bit.
  2. Make sure CI passes for FreeBSD - is this enough? @acj
  3. Merge the current logic into master and label it as experimental.
  4. Open an issue for trying what you suggested in this comment.

I wouldn't want the --on-cpu feature to always lock the process, and I think we should invest some future effort into a solution that is as accurate as possible and doesn't harm the performance of profiled processes.
Once we stabilize this feature to a level we feel is accurate enough without locking the process, we may drop the "experimental" label.

@acj (Member) commented Jul 19, 2022

@michelhe I'm doing well, thanks for asking. :-) How are you?

> Sorry for taking this long, I'm now ready to begin the needed work to finally bring this feature to mainline.
> We actually have been using rbspy from this branch for the past year in our gProfiler and are generally happy with the results it produces for on-cpu, even though the accuracy with the current state of the code should really be taken with a grain of salt.

That's great to hear. I know that a few rbspy users are interested in this functionality (even if it's experimental) and can help us track down any bugs or quirks. Your proposed course of action sounds good to me.

> Resolve current conflicts with master

Please note that the default branch is now main, so you'll need to update any local references.

> Making sure CI passes for FreeBSD - is this enough?

Yep, that's fine. I can try the feature in my FreeBSD VM after the dust settles. It doesn't need to block merging the PR though as long as CI is happy.

This change builds on what rbspy#127 tried to bring, allowing measurement of how much time is spent on CPU (but implemented slightly differently and more up to date).

We filter traces that are off-cpu using 2 checks:
- OS-specific, currently using the remoteprocess crate to determine if there are any active threads at this point in time.
- Using the rb_thread_struct.status field to figure out if a thread is in `THREAD_RUNNABLE` or not when we come to get a trace from it.
 The problem is that this is only a heuristic and will be inaccurate in cases when the process gets preempted by the OS while it has runnable threads, so we rely on OS-specific checks as well when possible.
@michelhe (Contributor Author):
I rebased all my work on top of the latest version, v0.12.1, only to discover later that main is still too far ahead with new refactors from when you moved to spytools 😭

@acj (Member) commented Oct 2, 2022

Ah, sorry for the churn. The parts that we moved into spytools are (currently) for process discovery and memory maps and such, so they shouldn't affect this work much. Let me know if you run into any snags.

@acj (Member) commented Jun 11, 2024

@michelhe 👋 I've freshened this up in a branch here. There are a few minor changes to align with refactors from the past year, clarify that it's experimental, etc. I also reworked the off-cpu test to make sure the timing is reliable. If you're still interested in the feature, please let me know what you think.
