Add ability to collect traces only when the CPU is active (--on-cpu) #327
Conversation
Hey, thanks for this. We've been wanting on/off CPU filtering for a long time. I'll try to do a deep review sometime in the next week or so.
Some comments for @michelhe :)
README.md
Outdated
```diff
-rbspy profiles everything the program does (including waiting) -- there's no
-option to just profile when the program is using the CPU.
+By defeault, rbspy profiles everything the program does (including waiting).
+There is an experimantal option to profile only when the program is using the CPU. (`--on-cpu`)
```
Suggested change:

```diff
-There is an experimantal option to profile only when the program is using the CPU. (`--on-cpu`)
+There is an experimental option to profile only when the program is using the CPU. (`--on-cpu`)
```
src/core/initialize.rs
Outdated
```rust
fn is_on_cpu_os_specific(&self) -> Result<bool> {
    /* We don't have OS specific checks for these targets,
     * so fallback to using the interpreter based check down the line */
    Ok(false)
}
```
If you return `false` here and `on_cpu` is used, then nothing will be collected at all. In this configuration, perhaps just deny using `on_cpu`?
Nice catch, meant to return true 😅
```rust
#[cfg(any(target_os = "macos", target_os = "windows", target_os = "linux"))]
fn is_on_cpu_os_specific(&self) -> Result<bool> {
    // remoteprocess crate exposes a Thread.active() method for each of these targets
    for thread in self.process.threads()?.iter() {
        if thread.active()? { return Ok(true); }
    }
    Ok(false)
}
```
This is not the logic you want - you need to check the "current" thread that we're going to get stacktraces from, not just any thread in the process
We don't know which OS thread the "current" ruby thread maps to until we start examining the interpreter's memory, so the heuristic is: "if any thread in the thread group is on CPU"
If the "current" ruby thread is not active OS-wise, it's surely not active ruby-wise (to be precise: it may be marked active ruby-wise, which just means it's preempted). Another thread may be running, doing non-ruby stuff, while the thread doing ruby stuff is preempted, and you will incorrectly account that as ruby code.
I touched on this in another comment, but if this is meant as an initial filter and the more accurate check happens later (via `get_thread_status`), then I think it's okay. Unless the remoteprocess functions we're calling here are expensive, this should reduce rbspy's CPU usage a bit when we're doing on-CPU profiling. It might be worth it to check that assumption, though -- I don't know offhand how expensive this loop is for a typical program.
@acj this is just not accurate enough - imagine this case: 1 thread in the application is running some non-related C code doing whatever (and it's active, so `thread.active()` returns `true`). Meanwhile, another thread is running Ruby code (so it has `rb_thread_status_THREAD_RUNNABLE`, I guess?). But the OS decided that it's time the Ruby thread relinquishes the CPU, and it gets preempted. Its `rb_thread_status` remains as it was.

So, per the 2 checks:

- the "process" is active, because one of its threads is active.
- the "Ruby thread" is also "active" because its status is `THREAD_RUNNABLE`.

And the result is that we collect Ruby stacks from a thread that's not currently running!
About the cost of `thread.active()` - it's not very expensive. It's a single read from `/proc/tid/stat`, so 3 syscalls (`open`, `read`, `close`). Compared to `get_thread_status` - which is only one `process_vm_readv` IIRC, so faster; but it should be benched & compared to reading `/proc/tid/stat` if we want a real comparison.
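For reference, on Linux the OS-level "is running" check boils down to reading the state letter out of `/proc/<pid>/task/<tid>/stat`. A minimal sketch of that parse (an illustration of the idea, not remoteprocess's actual code; it assumes the standard `proc(5)` stat layout where the state field follows the parenthesized comm):

```rust
// Sketch: decide whether a thread is on-CPU from its /proc/<pid>/task/<tid>/stat
// line. The comm field is wrapped in parentheses and may itself contain spaces
// or ')' characters, so we split at the LAST ')' before reading the state.
fn is_running(stat_line: &str) -> bool {
    match stat_line.rfind(')') {
        Some(idx) => stat_line[idx + 1..]
            .split_whitespace()
            .next()
            // "R" = running/runnable; everything else ("S", "D", "T", "Z", ...)
            // counts as off-CPU for this heuristic.
            .map_or(false, |state| state == "R"),
        None => false,
    }
}

fn main() {
    // Shortened stat line for a thread named "my thread)" in state R.
    assert!(is_running("12345 (my thread)) R 1 12345 12345 0 -1"));
    // A sleeping thread (state S) is off-CPU.
    assert!(!is_running("12345 (worker) S 1 12345 12345 0 -1"));
    println!("ok");
}
```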
Yeah, sorry, I didn't mean to suggest that the entire approach is okay. I understood your first example and agree that it's something we need to solve (I'm still thinking about this). What I meant to say is that this `for` loop seems reasonable, provided that we can somehow do an accurate check later when we examine process memory.
> I touched on this in another comment, but if this is meant as an initial filter and the more accurate check happens later (via `get_thread_status`), then I think it's okay. Unless the remoteprocess functions we're calling here are expensive, this should reduce rbspy's CPU usage a bit when we're doing on-CPU profiling. It might be worth it to check that assumption, though -- I don't know offhand how expensive this loop is for a typical program.
Anyway, I used the flame crate to insert tracepoints in `get_trace()` and this is what I measured:
```
THREAD: 140546956667136
| get_trace: 0.246437ms
| is_on_cpu_os_specific: 0.024342ms
| get_stack_trace_from_current_thread: 0.221337ms
```
It seems like the remoteprocess functions are a bit expensive to call per trace.
@acj What is your take given this measurement? Should I get rid of the `remoteprocess` calls?
I don't know yet. On the one hand, the call to check active threads isn't free, but on the other hand we'd avoid the more expensive call to `get_stack_trace_from_current_thread` if the process is definitely not on-CPU. That might be a reasonable tradeoff for some workloads.
Let's revisit this after we solve the correctness/reliability questions.
src/core/ruby_version.rs
Outdated
```rust
let thread: $thread_type = get_execution_context(ruby_current_thread_address_location, ruby_vm_address_location, source)
    .context(ruby_current_thread_address_location)?;

// testing the thread state in the interpreter.
if on_cpu && get_thread_status(&thread, source)? != 0 /* THREAD_RUNNABLE */ {
```
I don't know ruby well enough - is this the equivalent of CPython's GIL? If so, it's probably better to add this as another option (compare to py-spy's `--active` for on-CPU, and `--gil` for GIL; 2 different options that can go together, but I might want `--active` only w/o `--gil`).
My understanding is that they're very similar. Only one thread per VM can be executing ruby code at a time. Native C extensions can explicitly release the GVL, though, so that ruby and C code can execute in parallel when it's safe to do so. The situation is changing a bit in ruby 3 with the addition of ractors, which have their own lock and will probably need different logic for on/off CPU analysis. rbspy only has basic support for ractors right now (only looks at the main one), so we may need to defer that work until later.
I'm not opposed to introducing a `--gvl` option if we can find a reliable way to detect which thread holds the lock, but I'd recommend doing that in a future PR so that we can keep this one small-ish and focused. We might also want an `--off-cpu` option. Lots of future paths to explore here.
It seems that this `get_thread_status` check is the key mechanism, and the earlier check using remoteprocess's `Thread::active` is a fast/cheap filter to skip `get_trace` if we know that none of the process's threads is active (from the OS's perspective). Is that accurate?
Thanks for the brief on Ruby internals :)
> I'm not opposed to introducing a --gvl option if we can find a reliable way to detect which thread holds the lock,
I might have missed that - isn't that what we expect `THREAD_RUNNABLE` to tell? Or does runnable mean "can be run given that it obtains the GVL"? If the latter, then I think that looking at it doesn't make any sense at all.
> We might also want an --off-cpu option. Lots of future paths to explore here.
Yup. Well, basically, no `--on-cpu` means `--off-cpu` now :)
> It seems that this get_thread_status check is the key mechanism, and the earlier check using remoteprocess's Thread::active is a fast/cheap filter to skip get_trace if we know that none of the process's threads is active (from the OS's perspective). Is that accurate?
To my understanding - not precisely, I'll give an elaborated answer on the other discussion 😅
> I might have missed that - isn't it what we expect THREAD_RUNNABLE to tell? Or does runnable mean "can be run given that it obtains the GVL"? If the latter, then I think that looking at it doesn't make any sense at all.
I think this is a key question, and I don't know enough about MRI VM internals to give a confident answer. What you said seems reasonable. I'll poke around in some reference materials that I've collected and will post again if I find anything.
> Yup. Well, basically, no --on-cpu means --off-cpu now :)
My feeling is that the absence of `--on-cpu` should cause rbspy to collect all traces, which is the current shipping behavior (and what this PR would do in its current state, IIUC), and then we could introduce `--off-cpu` to do the opposite filtering that `--on-cpu` does, i.e. only collect traces if the thread is waiting.
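That default/on-CPU/off-CPU split can be sketched as a simple filter predicate. This is a hypothetical illustration of the semantics being discussed - the `Mode` and `ThreadState` names are illustrative, not rbspy's actual types:

```rust
// Hypothetical sketch of trace filtering for --on-cpu / --off-cpu;
// names are illustrative, not rbspy's actual code.
#[derive(PartialEq)]
enum Mode { All, OnCpu, OffCpu }

#[derive(PartialEq)]
enum ThreadState { Runnable, Waiting }

// Decide whether a sampled trace should be kept for the given mode.
fn keep_trace(mode: &Mode, state: &ThreadState) -> bool {
    match mode {
        Mode::All => true,                               // no flag: keep everything
        Mode::OnCpu => *state == ThreadState::Runnable,  // --on-cpu: only running samples
        Mode::OffCpu => *state == ThreadState::Waiting,  // --off-cpu: only waiting samples
    }
}

fn main() {
    assert!(keep_trace(&Mode::All, &ThreadState::Waiting));
    assert!(keep_trace(&Mode::OnCpu, &ThreadState::Runnable));
    assert!(!keep_trace(&Mode::OnCpu, &ThreadState::Waiting));
    assert!(keep_trace(&Mode::OffCpu, &ThreadState::Waiting));
    println!("ok");
}
```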
```markdown
- OS level thread information (only available for linux, macos and windows)
- current thread state in the ruby interpreter.

In addition, rbspy can either profile in a blocking or nonblocking mode.
```
Thanks for adding this. We should probably mention the `--nonblocking` flag to be consistent with the paragraph above.
```
@@ -74,7 +83,25 @@ impl StackTraceGetter {
        Ok(self.get_trace_from_current_thread()?)
    }

    fn get_trace_from_current_thread(&self) -> Result<StackTrace, MemoryCopyError> {
    #[cfg(any(target_os = "macos", target_os = "windows", target_os = "linux"))]
```
remoteprocess has a `Thread` implementation for FreeBSD (including the `active` function), so we might not need this guard. I think the `_os_specific` suffix could be dropped in that case -- much of rbspy's core work is OS-specific in some way.
README.md
Outdated

```diff
-rbspy profiles everything the program does (including waiting) -- there's no
-option to just profile when the program is using the CPU.
+By defeault, rbspy profiles everything the program does (including waiting).
+There is an experimental option to profile only when the program is using the CPU. (`--on-cpu`)
```
> experimental option

Are there any concerns or caveats that we should document? What's blocking this from being a non-experimental feature?
I looked through the ruby source and noticed that […] Thoughts?

Hey @acj ,

@michelhe Yeah, we should make a decision here. Are we still blocked on the accuracy/correctness issues? Did you have any thoughts on my last comment? If you'd rather merge this as an experimental feature and iterate on it later, I'm okay with that too.
I think that we want to keep the OS-level "is running" check. About locking/unlocking - I prefer to keep it as accurate as possible also in the nonblocking mode. I recently read Ben's blog at https://www.benfrederickson.com/why-python-needs-paused-during-profiling/ which led me to understand that rbspy doesn't take as long as py-spy to read the stacks. So I suggest that we do something similar to py-spy, that is: […]
Sorry for the delay. From rereading the above discussion, I think the open questions are: […]

If there's any lingering doubt, I'd vote to deal with 4 (FreeBSD support) and then merge this with […]

@Jongy We touched on that here. rbspy doesn't have an equivalent flag right now, but we could add one if we can detect which thread holds the GVL. We might need to ask the ruby VM developers about it.
Hey @acj , how are you? Sorry for taking this long, I'm now ready to begin the needed work to finally bring this feature to mainline. Reiterating past conversations in this PR, I think the best course of action will be: […]

I wouldn't want the --on-cpu feature to always lock the process, and I think we should invest some future effort into a solution that is as accurate as possible and doesn't harm the performance of profiled processes.
@michelhe I'm doing well, thanks for asking. :-) How are you?

That's great to hear. I know that a few rbspy users are interested in this functionality (even if it's experimental) and can help us track down any bugs or quirks. Your proposed course of action sounds good to me.

Please note that the default branch is now […]

Yep, that's fine. I can try the feature in my FreeBSD VM after the dust settles. It doesn't need to block merging the PR though as long as CI is happy.
Force-pushed from `e466cf3` to `42c5826`.
This change adds on what rbspy#127 tried to bring to allow measuring how much time is spent on CPU (but implemented slightly differently and more up to date). We filter traces that are off-cpu using 2 checks:

- OS-specific, currently using the remoteprocess crate to determine if there are any active threads at this point in time.
- Using the rb_thread_struct.status field to figure out if a thread is in `THREAD_RUNNABLE` or not when we come to get a trace from it.

The problem is that this is only a heuristic and will be inaccurate in cases when the process gets preempted by the OS while it has runnable threads, so we rely on OS-specific checks as well when possible.
Force-pushed from `42c5826` to `ebeafcd`.
I rebased all my work on top of the latest version v0.12.1 only to discover later that […]

Ah, sorry for the churn. The parts that we moved into spytools are (currently) for process discovery and memory maps and such, so they shouldn't affect this work much. Let me know if you run into any snags.

@michelhe 👋 I've freshened this up in a branch here. There are a few minor changes to align with refactors from the past year, clarify that it's experimental, etc. I also reworked the off-cpu test to make sure the timing is reliable. If you're still interested in the feature, please let me know what you think.
This change adds on what #127 tried to bring to allow measuring how much time is spent on CPU (but implemented slightly differently and more up to date).

We filter traces that are off-cpu using 2 checks:

- OS-specific, currently using the `remoteprocess` crate to determine if there are any active threads at this point in time.
- Using the `rb_thread_struct.status` field to figure out if a thread is in `THREAD_RUNNABLE` or not when we come to get a trace from it.

The problem is that this is only a heuristic and will be inaccurate in cases when the process gets preempted by the OS while it has runnable threads, so we rely on OS-specific checks as well when possible.
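The two checks described above compose naturally as a short-circuiting filter: the cheap OS-level check runs first, and the interpreter-level status read only happens when it can't rule the sample out. A hypothetical sketch of that composition (function and constant names are illustrative, not rbspy's actual code):

```rust
// Hypothetical sketch of composing the two off-CPU filters described above;
// names are illustrative, not rbspy's actual code.
//
// - os_reports_active: whether the OS sees any active thread in the process
//   (None on targets without an OS-specific check).
// - thread_status: the interpreter's rb_thread_struct.status value.
fn should_collect_trace(os_reports_active: Option<bool>, thread_status: u32) -> bool {
    const THREAD_RUNNABLE: u32 = 0;
    // Cheap OS-level check first: if the OS says no thread is active,
    // skip the more expensive interpreter memory read entirely.
    if os_reports_active == Some(false) {
        return false;
    }
    // Otherwise fall through to the interpreter-level status heuristic.
    thread_status == THREAD_RUNNABLE
}

fn main() {
    // No OS info (unsupported target): rely on the interpreter check alone.
    assert!(should_collect_trace(None, 0));
    // OS says nothing is running: skip regardless of the Ruby-side status.
    assert!(!should_collect_trace(Some(false), 0));
    // OS says active and the thread is THREAD_RUNNABLE: collect the trace.
    assert!(should_collect_trace(Some(true), 0));
    println!("ok");
}
```

Note the caveat from the review discussion still applies: a preempted Ruby thread keeps its `THREAD_RUNNABLE` status, so this combination can still over-count on-CPU time.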