Time.monotonic makes it impossible to line up distributed traces #361

Open
bobhansen opened this issue Aug 8, 2023 · 1 comment

@bobhansen

We're using viztracer for lightweight tracing when training PyTorch models. We run in a datacenter, so all of the clocks are synchronized to within a few milliseconds. However, because viztracer uses only the monotonic clock during tracing (absolutely the correct choice), traces from different machines have wildly different timestamps: each machine's monotonic clock counts from its own arbitrary starting point. Since we can't force the traces to start at the same moment, the --align_combine feature only gets them to within seconds of each other (some improvement!), but I think we can do better.

It would be great to have an option (or a change to the default) that calculates the offset between the system time and the monotonic time when the trace is saved, and shifts the timestamps by that difference. That way, we would project monotonic time onto global time (+/- the error of the system clock) and be able to compare traces that have been combined.
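
A minimal sketch of the proposed offset, assuming the trace's "ts" values are monotonic-clock microseconds (the Chrome trace format uses microseconds; whether viztracer stores raw monotonic time internally is an assumption here). Both function names are hypothetical, not viztracer API. The wall-clock read is bracketed by two monotonic reads so the measurement error is bounded:

```python
import time
from typing import Any, Dict, List


def monotonic_to_wall_offset_us() -> float:
    """Estimate (wall clock - monotonic clock) in microseconds.

    Hypothetical helper. The wall-clock read is bracketed by two monotonic
    reads and the midpoint is used, so the estimate is off by at most half
    the gap between the two monotonic reads (on top of the system clock's
    own sync error).
    """
    before = time.monotonic_ns()
    wall = time.time_ns()
    after = time.monotonic_ns()
    midpoint = (before + after) / 2
    return (wall - midpoint) / 1_000  # nanoseconds -> microseconds


def project_to_wall_clock(
    events: List[Dict[str, Any]], offset_us: float
) -> List[Dict[str, Any]]:
    """Hypothetical helper: shift every event that has a "ts" onto wall-clock time.

    Events without "ts" (e.g. some metadata events) are passed through as-is;
    the input dicts are copied, not mutated.
    """
    return [dict(e, ts=e["ts"] + offset_us) if "ts" in e else e for e in events]
```

If each machine computes its own offset once at save time and adds it to its monotonic timestamps, all traces land on the shared system-clock timeline, accurate to within the clock synchronization error.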

If it's something you're interested in, I can look into making a PR.

@gaogaotiantian
Owner

What are you looking for to solve this issue? There are a few ways to do this.

  1. Post-run edit. This would be the most straightforward way to solve the problem, and you probably don't even need any changes from viztracer: loop through the events and apply the offset however you want.
  2. Have an option to pass in an offset, 0 by default, and apply it when saving the trace. This is not too bad, but there will be C code involved, and the trace-saving part is... hmm, not the cleanest code to follow.
  3. Apply it at run time, adding the offset when getting the timestamp. This would probably be the easiest, since getts is already a function, but I don't want to do this because it hurts performance.
  4. An even more interesting way: add the system clock (or its offset from the monotonic clock) to the metadata, and let the --combine command resolve it. Similar to --align_combine, but with a known offset.
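
A rough sketch of options 1 and 4, assuming Chrome-trace-style JSON with a top-level "traceEvents" list. The "clock_offset_us" key is hypothetical (viztracer does not write it today) and stands in for the offset that option 4 would record in the metadata; the real --combine logic may differ:

```python
import json
from typing import Any, Dict, List

# Hypothetical top-level key holding (system clock - monotonic clock) in microseconds.
OFFSET_KEY = "clock_offset_us"


def shift_events(trace: Dict[str, Any], offset_us: float) -> Dict[str, Any]:
    """Option 1 (post-run edit): add offset_us to every event that has a "ts"."""
    shifted = []
    for event in trace.get("traceEvents", []):
        if "ts" in event:
            event = dict(event, ts=event["ts"] + offset_us)  # copy, don't mutate
        shifted.append(event)
    return dict(trace, traceEvents=shifted)


def combine_with_known_offsets(traces: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Option 4: shift each trace by its recorded offset, then merge the events."""
    merged: List[Dict[str, Any]] = []
    for trace in traces:
        offset_us = trace.get(OFFSET_KEY, 0.0)  # no recorded offset -> leave as-is
        merged.extend(shift_events(trace, offset_us)["traceEvents"])
    return {"traceEvents": merged}


def rewrite_file(path: str, offset_us: float) -> None:
    """Option 1 applied to a saved JSON trace file, rewritten in place."""
    with open(path) as f:
        trace = json.load(f)
    with open(path, "w") as f:
        json.dump(shift_events(trace, offset_us), f)
```

For example, a trace recorded with offset 1000.0 and an event at ts 5.0 and a second trace with offset 3000.0 and an event at ts 1.0 merge into events at 1005.0 and 3001.0. Unlike --align_combine, which has to guess the alignment, this uses an offset measured on each machine.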
