
Add ability to collect traces only when the CPU is active (--on-cpu) #327

Open · wants to merge 1 commit into base: main

Conversation

@michelhe (Contributor) commented Jun 7, 2021

This change builds on what #127 tried to bring, allowing measurement of how much time is spent on CPU (but implemented slightly differently and more up to date).

We filter traces that are off-cpu using two checks (a rough sketch of how they combine is shown after the list):

  • OS-specific, currently using the remoteprocess crate to determine if there are any active threads at this point in time.
  • Using the rb_thread_struct.status field to figure out if a thread is in THREAD_RUNNABLE or not when we come to get a trace from it. The problem is that this is only a heuristic and will be inaccurate in cases when the process gets preempted by the OS while it has runnable threads, so we rely on OS-specific checks as well when possible.
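
A minimal sketch of how these two checks might compose. This is not the PR's actual code: `is_on_cpu` and `ruby_thread_status` are hypothetical names, and the `remoteprocess` calls are assumed to match the crate's `threads()`/`active()` API quoted later in the review.

```rust
use anyhow::Result;
use remoteprocess::Process;

// Hypothetical combination of the two off-cpu filters described above.
// `ruby_thread_status` stands in for the value read from rb_thread_struct.status;
// in the code under review, 0 corresponds to THREAD_RUNNABLE.
fn is_on_cpu(process: &Process, ruby_thread_status: i32) -> Result<bool> {
    // Check 1 (OS-specific): is any thread in the thread group currently active?
    let any_thread_active = process
        .threads()?
        .iter()
        .any(|thread| thread.active().unwrap_or(false));
    if !any_thread_active {
        return Ok(false);
    }
    // Check 2 (interpreter): is the current ruby thread marked runnable?
    Ok(ruby_thread_status == 0 /* THREAD_RUNNABLE */)
}
```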

@acj (Member) commented Jun 8, 2021

Hey, thanks for this. We've been wanting on/off CPU filtering for a long time. I'll try to do a deep review sometime in the next week or so.

@Jongy left a comment

Some comments for @michelhe :)

README.md Outdated
rbspy profiles everything the program does (including waiting) -- there's no
option to just profile when the program is using the CPU.
By defeault, rbspy profiles everything the program does (including waiting).
There is an experimantal option to profile only when the program is using the CPU. (`--on-cpu`)

Suggested change
There is an experimantal option to profile only when the program is using the CPU. (`--on-cpu`)
There is an experimental option to profile only when the program is using the CPU. (`--on-cpu`)

fn is_on_cpu_os_specific(&self) -> Result<bool> {
    /* We don't have OS specific checks for these targets,
     * so fallback to using the interpreter based check down the line */
    Ok(false)
@Jongy:

If you return false here and on_cpu is used, then nothing will be collected at all.

In this configuration - perhaps just deny using on_cpu?

@michelhe (Contributor Author):

Nice catch, meant to return true 😅

#[cfg(any(target_os = "macos", target_os = "windows", target_os = "linux"))]
fn is_on_cpu_os_specific(&self) -> Result<bool> {
    // remoteprocess crate exposes a Thread.active() method for each of these targets
    for thread in self.process.threads()?.iter() {
@Jongy:

This is not the logic you want - you need to check the "current" thread that we're going to get stacktraces from, not just any thread in the process

@michelhe (Contributor Author):

We don't know which OS thread the "current" ruby thread is mapped to until we start examining the interpreter's memory, so the heuristic is: "if any thread in the thread group is on CPU"

@Jongy:

If the "current" ruby thread is not active OS-wise, it's surely not active ruby-wise (to be precise: it may be marked active ruby-wise, this just means it's preempted). Another thread may be running, doing non-ruby stuff, and the thread doing ruby-stuff may be preempted, and you will incorrectly account that as the ruby code)

@acj (Member):

I touched on this in another comment, but if this is meant as an initial filter and the more accurate check happens later (via get_thread_status), then I think it's okay. Unless the remoteprocess functions we're calling here are expensive, this should reduce rbspy's CPU usage a bit when we're doing on-CPU profiling. It might be worth it to check that assumption, though -- I don't know offhand how expensive this loop is for a typical program.

@Jongy:

@acj this is just not accurate enough - imagine this case: 1 thread in the application is running some non-related C code doing whatever (and it's active, so thread.active() returns true). Meanwhile, another thread is running Ruby code (so it has rb_thread_status_THREAD_RUNNABLE, I guess?). But the OS decided that it's time that the Ruby thread relinquishes CPU, and it gets preempted. Its rb_thread_status remains as it was.

So, per the 2 checks:

  • the "process" is active, because one of its threads is active.
  • the "Ruby thread" is also "active" because its status is THREAD_RUNNABLE.

And the result is that we collect Ruby stacks from a thread that's not currently running!

About the cost of thread.active() - it's not very expensive. It's a single read from /proc/tid/stat, so 3 syscalls (open, read, close). Compared to that, get_thread_status is only one process_vm_readv IIRC, so it's faster; but the two should be benched against each other if we want a real comparison.
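
For illustration, here is roughly what such an OS-level check amounts to on Linux, independent of rbspy: one read of `/proc/<pid>/task/<tid>/stat` and a parse of the state field ('R' means running). A minimal sketch, not the remoteprocess crate's implementation.

```rust
use std::fs;

// Sketch of a /proc-based "is this thread on-CPU right now?" check on Linux.
fn thread_is_running(pid: u32, tid: u32) -> std::io::Result<bool> {
    let stat = fs::read_to_string(format!("/proc/{}/task/{}/stat", pid, tid))?;
    // The state character is the first field after the ')' that closes the comm
    // field (which can itself contain spaces and parentheses).
    let after_comm = stat.rsplit(')').next().unwrap_or("");
    Ok(after_comm.trim_start().starts_with('R'))
}
```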

@acj (Member):

Yeah, sorry, I didn't mean to suggest that the entire approach is okay. I understood your first example and agree that it's something we need to solve (I'm still thinking about this). What I meant to say is that this for loop seems reasonable, provided that we can somehow do an accurate check later when we examine process memory.

@michelhe (Contributor Author) commented Jun 20, 2021:

> I touched on this in another comment, but if this is meant as an initial filter and the more accurate check happens later (via get_thread_status), then I think it's okay. Unless the remoteprocess functions we're calling here are expensive, this should reduce rbspy's CPU usage a bit when we're doing on-CPU profiling. It might be worth it to check that assumption, though -- I don't know offhand how expensive this loop is for a typical program.

Anyway, I used the flame crate to insert tracepoints in get_trace() and this is what I measured:

THREAD: 140546956667136
| get_trace: 0.246437ms
  | is_on_cpu_os_specific: 0.024342ms
  | get_stack_trace_from_current_thread: 0.221337ms

It seems like the remoteprocess functions are a bit expensive to call per trace.
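
For reference, this is roughly how such tracepoints can be inserted with the `flame` crate; the span names mirror the measurement above, but the wiring is illustrative (assuming the crate's `span_of`/`dump_html` API), not the actual instrumentation used for the numbers quoted here.

```rust
use std::fs::File;

// Cargo.toml: flame = "0.2"
fn main() {
    flame::span_of("get_trace", || {
        flame::span_of("is_on_cpu_os_specific", || {
            // OS-level active-thread check would run here.
        });
        flame::span_of("get_stack_trace_from_current_thread", || {
            // Interpreter memory reads and stack walking would run here.
        });
    });
    // Write an HTML report with per-span timings like the ones quoted above.
    flame::dump_html(&mut File::create("flame-graph.html").unwrap()).unwrap();
}
```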

@michelhe (Contributor Author):

@acj What is your take given this measurement? Should I get rid of the remoteprocess calls?

@acj (Member):

I don't know yet. On the one hand, the call to check active threads isn't free, but on the other hand we'd avoid the more expensive call to get_stack_trace_from_current_thread if the process is definitely not on-CPU. That might be a reasonable tradeoff for some workloads.

Let's revisit this after we solve the correctness/reliability questions.

src/core/ruby_version.rs Outdated (resolved)
let thread: $thread_type = get_execution_context(ruby_current_thread_address_location, ruby_vm_address_location, source)
.context(ruby_current_thread_address_location)?;

// testing the thread state in the interpreter.
if on_cpu && get_thread_status(&thread, source)? != 0 /* THREAD_RUNNABLE */ {
@Jongy:

I don't know ruby well enough - is this the equivalent of CPython's GIL? If so, it's probably better to add this as another option (compare to py-spy's --active for on-CPU, and --gil for GIL; 2 different options that can go together, but I might want --active only w/o --gil)

@acj (Member):

My understanding is that they're very similar. Only one thread per VM can be executing ruby code at a time. Native C extensions can explicitly release the GVL, though, so that ruby and C code can execute in parallel when it's safe to do so. The situation is changing a bit in ruby 3 with the addition of ractors, which have their own lock and will probably need different logic for on/off CPU analysis. rbspy only has basic support for ractors right now (only looks at the main one), so we may need to defer that work until later.

I'm not opposed to introducing a --gvl option if we can find a reliable way to detect which thread holds the lock, but I'd recommend doing that in a future PR so that we can keep this one small-ish and focused. We might also want an --off-cpu option. Lots of future paths to explore here.

It seems that this get_thread_status check is the key mechanism, and the earlier check using remoteprocess's Thread::active is a fast/cheap filter to skip get_trace if we know that none of the process's threads is active (from the OS's perspective). Is that accurate?

@Jongy:

Thanks for the brief Ruby internals :)

> I'm not opposed to introducing a --gvl option if we can find a reliable way to detect which thread holds the lock,

I might have missed that - isn't it what we expect THREAD_RUNNABLE to tell? Or does runnable mean "can be run given that it obtains the GVL"? If the latter, then I think that looking at it doesn't make any sense at all.

> We might also want an --off-cpu option. Lots of future paths to explore here.

Yup. Well, basically, no --on-cpu means --off-cpu now :)

> It seems that this get_thread_status check is the key mechanism, and the earlier check using remoteprocess's Thread::active is a fast/cheap filter to skip get_trace if we know that none of the process's threads is active (from the OS's perspective). Is that accurate?

To my understanding - not precisely, I'll give an elaborated answer on the other discussion 😅

@acj (Member) commented Jun 15, 2021:

> I might have missed that - isn't it what we expect THREAD_RUNNABLE to tell? Or does runnable mean "can be run given that it obtains the GVL"? If the latter, then I think that looking at it doesn't make any sense at all.

I think this is a key question, and I don't know enough about MRI VM internals to give a confident answer. What you said seems reasonable. I'll poke around in some reference materials that I've collected and will post again if I find anything.

> Yup. Well, basically, no --on-cpu means --off-cpu now :)

My feeling is that the absence of --on-cpu should cause rbspy to collect all traces, which is the current shipping behavior (and what this PR would do in its current state, IIUC), and then we could introduce --off-cpu to do the opposite filtering that --on-cpu does, i.e. only collect traces if the thread is waiting.
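
To make the proposed semantics concrete, here is a tiny sketch of how the three modes could map to a filter decision; the `CpuFilter` enum and `should_keep_trace` function are hypothetical names, not part of this PR.

```rust
// Hypothetical filtering modes matching the discussion above.
enum CpuFilter {
    All,    // default: no flag, collect every trace (current shipping behavior)
    OnCpu,  // --on-cpu: collect only when the current ruby thread is runnable
    OffCpu, // possible future --off-cpu: collect only when the thread is waiting
}

fn should_keep_trace(filter: &CpuFilter, thread_is_runnable: bool) -> bool {
    match filter {
        CpuFilter::All => true,
        CpuFilter::OnCpu => thread_is_runnable,
        CpuFilter::OffCpu => !thread_is_runnable,
    }
}
```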

- OS level thread information (only available for linux, macos and windows)
- current thread state in the ruby interpreter.

In addition, rbspy can either profile in a blocking or nonblocking mode.
@acj (Member):

Thanks for adding this. We should probably mention the --nonblocking flag to be consistent with the paragraph above.

README.md Outdated (resolved)
@@ -74,7 +83,25 @@ impl StackTraceGetter {
Ok(self.get_trace_from_current_thread()?)
}

fn get_trace_from_current_thread(&self) -> Result<StackTrace, MemoryCopyError> {
#[cfg(any(target_os = "macos", target_os = "windows", target_os = "linux"))]
@acj (Member):

remoteprocess has a Thread implementation for FreeBSD (including the active function), so we might not need this guard. I think the _os_specific suffix could be dropped in that case -- much of rbspy's core work is OS-specific in some way.



README.md Outdated (resolved)
README.md Outdated
rbspy profiles everything the program does (including waiting) -- there's no
option to just profile when the program is using the CPU.
By defeault, rbspy profiles everything the program does (including waiting).
There is an experimental option to profile only when the program is using the CPU. (`--on-cpu`)
@acj (Member):

> experimental option

Are there any concerns or caveats that we should document? What's blocking this from being a non-experimental feature?

@acj (Member) commented Jul 10, 2021

I looked through the ruby source and noticed that rb_thread_struct has a thread_id field that we might be able to use to correlate active OS threads with the active thread in the ruby VM. That would involve reading VM memory earlier than we currently do. For example, we could build a list of active OS threads, lock the process, and then immediately get the active thread ID from the VM to see if it's in the active list. If not, then unlock and return. If it's in the active list, then I think we can be reasonably confident that the active thread is on-CPU if its VM thread state is runnable. (This assumes that there's no subtle VM and/or GVL behavior that I'm not accounting for here.)

Thoughts?
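
A hedged sketch of the correlation described above, assuming the remoteprocess crate's `threads()`, `Thread::active()`, `Thread::id()`, and `Process::lock()` APIs; `read_current_ruby_thread_id` is a hypothetical placeholder for the extra VM memory read this would require.

```rust
use std::collections::HashSet;

use anyhow::Result;
use remoteprocess::Process;

// Hypothetical placeholder: the real version would read rb_thread_struct.thread_id
// for the VM's current thread using rbspy's ruby-version bindings.
fn read_current_ruby_thread_id(_process: &Process) -> Result<u64> {
    unimplemented!()
}

fn current_ruby_thread_is_on_cpu(process: &Process) -> Result<bool> {
    // 1. Build the set of OS threads that are currently running.
    let active_tids: HashSet<u64> = process
        .threads()?
        .iter()
        .filter(|t| t.active().unwrap_or(false))
        .filter_map(|t| t.id().ok().map(|id| id as u64))
        .collect();

    // 2. Pause the process and read the current VM thread's thread_id; the lock is
    //    released when `_lock` goes out of scope.
    let _lock = process.lock()?;
    let ruby_thread_id = read_current_ruby_thread_id(process)?;

    // 3. The current ruby thread is on-CPU only if it is among the active OS threads
    //    (its VM status still needs to be THREAD_RUNNABLE, checked separately).
    Ok(active_tids.contains(&ruby_thread_id))
}
```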

@michelhe (Contributor Author):
Hey @acj,
I wouldn't want the work done here to become stale, so please lmk what you think needs to be done in order to merge this upstream 😄

@acj (Member) commented Nov 15, 2021

@michelhe Yeah, we should make a decision here. Are we still blocked on the accuracy/correctness issues? Did you have any thoughts on my last comment?

If you'd rather merge this as an experimental feature and iterate on it later, I'm okay with that too.

@Jongy commented Nov 16, 2021

> @michelhe Yeah, we should make a decision here. Are we still blocked on the accuracy/correctness issues? Did you have any thoughts on my last comment?

I think that we want to keep the OS-level "is running" check.

About locking/unlocking - I prefer to keep it as accurate as possible in nonblocking mode too. I recently read Ben's blog at https://www.benfrederickson.com/why-python-needs-paused-during-profiling/, which led me to understand that rbspy doesn't take as long as py-spy to read the stacks. So I suggest that we do something similar to py-spy, that is (roughly sketched after this list):

  • Lock if blocking
  • Check OS level "is running"
  • Check Ruby-level "is running" (in py-spy, that's done only if --gil is passed - does rbspy have any counterpart or we should always do it? I suppose it should be configurable)
  • Read the stacks array
  • Unlock if blocking
  • Symbolicate the stacks array
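
A rough sketch of that sampling flow; everything here is illustrative. The remoteprocess `lock()`, `threads()`, and `Thread::active()` calls are assumed from the crate as used elsewhere in this thread, and the ruby-level check, stack reading, and symbolication steps are left as comments.

```rust
use anyhow::Result;
use remoteprocess::Process;

fn sample_once(process: &Process, blocking: bool, on_cpu: bool) -> Result<()> {
    // 1. Lock (pause) the target only in blocking mode.
    let lock_guard = if blocking { Some(process.lock()?) } else { None };

    // 2. OS-level "is running" check.
    if on_cpu && !process.threads()?.iter().any(|t| t.active().unwrap_or(false)) {
        return Ok(());
    }

    // 3. Ruby-level "is running" check (rb_thread_struct.status == THREAD_RUNNABLE)
    //    would go here, followed by
    // 4. reading the raw stacks array while the process is still paused.

    // 5. Unlock as soon as the raw reads are done.
    drop(lock_guard);

    // 6. Symbolicate the raw frames outside the locked region, keeping the pause short.
    Ok(())
}
```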

@acj (Member) commented Nov 26, 2021

Sorry for the delay. From rereading the above discussion, I think the open questions are:

  1. Are we okay with the current heuristic for the OS-level check, i.e. if any thread in the thread group is on-CPU, then the check passes?
  2. If that's not accurate enough, then do we want to try what I suggested in my earlier comment, i.e. to compare the ruby thread's thread_id field with the active OS thread? There may be portability issues here.
  3. Depending on how confident we are about 1 and 2, should we treat the --on-cpu feature as experimental for now?
  4. Does the current code work on FreeBSD? The API seems to have what we need, so I think it should work, but I haven't tried it.

If there's any lingering doubt, I'd vote to deal with 4 (FreeBSD support) and then merge this with --on-cpu marked as experimental. Then we can let people try it out, gather more data from our own usage of it, etc.

> Check Ruby-level "is running" (in py-spy, that's done only if --gil is passed - does rbspy have any counterpart or we should always do it? I suppose it should be configurable)

@Jongy We touched on that here. rbspy doesn't have an equivalent flag right now, but we could add one if we can detect which thread holds the GVL. We might need to ask the ruby VM developers about it.

@omarabid mentioned this pull request Mar 28, 2022
@michelhe (Contributor Author):
> Sorry for the delay. From rereading the above discussion, I think the open questions are:
>
>   1. Are we okay with the current heuristic for the OS-level check, i.e. if any thread in the thread group is on-CPU, then the check passes?
>   2. If that's not accurate enough, then do we want to try what I suggested in my earlier comment, i.e. to compare the ruby thread's thread_id field with the active OS thread? There may be portability issues here.
>   3. Depending on how confident we are about 1 and 2, should we treat the --on-cpu feature as experimental for now?
>   4. Does the current code work on FreeBSD? The API seems to have what we need, so I think it should work, but I haven't tried it.
>
> If there's any lingering doubt, I'd vote to deal with 4 (FreeBSD support) and then merge this with --on-cpu marked as experimental. Then we can let people try it out, gather more data from our own usage of it, etc.
>
> > Check Ruby-level "is running" (in py-spy, that's done only if --gil is passed - does rbspy have any counterpart or we should always do it? I suppose it should be configurable)
>
> @Jongy We touched on that here. rbspy doesn't have an equivalent flag right now, but we could add one if we can detect which thread holds the GVL. We might need to ask the ruby VM developers about it.

Hey @acj, how are you?

Sorry for taking this long, I'm now ready to begin the needed work to finally bring this feature to mainline.
We actually have been using rbspy from this branch for the past year in our gProfiler and are generally happy with the results it produces for on-cpu, even though the accuracy with the current state of the code should really be taken with a grain of salt.

Reiterating past conversations in this PR, I think the best course of action will be:

  1. Resolve current conflicts with master, which has changed quite a bit.
  2. Make sure CI passes for FreeBSD - is this enough? @acj
  3. Merge the current logic into master and label it as experimental.
  4. Open an issue for trying what you suggested in this comment.

I wouldn't want the --on-cpu feature to always lock the process, and I think we should invest some future effort into a solution that is as accurate as possible and doesn't harm the performance of profiled processes.
Once we stabilize this feature to a level we feel is accurate enough without locking the process, we may drop the "experimental" label.

@acj (Member) commented Jul 19, 2022

@michelhe I'm doing well, thanks for asking. :-) How are you?

> Sorry for taking this long, I'm now ready to begin the needed work to finally bring this feature to mainline.
> We actually have been using rbspy from this branch for the past year in our gProfiler and are generally happy with the results it produces for on-cpu, even though the accuracy with the current state of the code should really be taken with a grain of salt.

That's great to hear. I know that a few rbspy users are interested in this functionality (even if it's experimental) and can help us track down any bugs or quirks. Your proposed course of action sounds good to me.

> Resolve current conflicts with master

Please note that the default branch is now main, so you'll need to update any local references.

> Making sure CI passes for FreeBSD - is this enough?

Yep, that's fine. I can try the feature in my FreeBSD VM after the dust settles. It doesn't need to block merging the PR though as long as CI is happy.

This change builds on what rbspy#127 tried to bring, allowing measurement of how much time is spent on CPU (but implemented slightly differently and more up to date).

We filter traces that are off-cpu using 2 checks:
- OS-specific, currently using the remoteprocess crate to determine if there are any active threads at this point in time.
- Using the rb_thread_struct.status field to figure out if a thread is in `THREAD_RUNNABLE` or not when we come to get a trace from it.
 The problem is that this is only a heuristic and will be inaccurate in cases when the process gets preempted by the OS while it has runnable threads, so we rely on OS-specific checks as well when possible.
@michelhe (Contributor Author):
I rebased all my work on top of the latest version, v0.12.1, only to discover later that main is still too far ahead with new refactors from when you moved to spytools 😭

@acj (Member) commented Oct 2, 2022

Ah, sorry for the churn. The parts that we moved into spytools are (currently) for process discovery and memory maps and such, so they shouldn't affect this work much. Let me know if you run into any snags.

@acj (Member) commented Jun 11, 2024

@michelhe 👋 I've freshened this up in a branch here. There are a few minor changes to align with refactors from the past year, clarify that it's experimental, etc. I also reworked the off-cpu test to make sure the timing is reliable. If you're still interested in the feature, please let me know what you think.
