Flight Recorder Sequence IDs are insufficient #125173

Open

wconstab opened this issue Apr 29, 2024 · 1 comment

Comments

@wconstab
Contributor

I started working on a script to do basic processing of the flight-recorder buffers.

Pasting the doc from the script here for context:

"""Flight Recorder Trace Analyzer
This script primarily merges data from individual flight recorder buffers from individual ranks in a
PyTorch Distributed program into a flattened database format that can be used for further analysis.
However, as part of the merging process, it is necessary to perform some analysis in order to match operators
on one rank with corresponding operators on other ranks and register them as one 'collective' entry.  During this
process, a significant amount of useful information can already be extracted, such as where the first mismatch occurs
in cases of desync (when not all ranks issue a compatible collective in a particular process group).
Not Yet Implemented

- TODO: tracebacks aren't implemented

Known Issues
- Flight Recorder buffer sequence_id information is not sufficient to match collectives and coalesced collectives
  unless we have the trace data from the beginning of the program.  To enable confident analysis of trace buffers that
  do not start from zero (and to simplify the script's matching logic) we need to add more information to the recorder.
- Currently, the script omits checking the 'status' of collectives.  We can look for the first 'non-completed'
  collective easily enough and report that.

Usage
python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
- Omitting the optional output file will still yield analysis information to stdout
- The output file is a pickle of the flat DB, which may change in format in the future.
"""

Proposal: Add a collective-only sequence ID

  • or separate seq_id into 'collective_seq_id' and 'p2p_seq_id', or something along those lines

For collective operations (operations that occur on every rank within a particular process group), we should keep a separate counter that increments only for those operations and not for P2P operations.

With this change, even when loading flight recorder buffers that do not start at time 0, it is possible to unambiguously match up collective operations across ranks. This already covers a few use cases:

  • process groups that only use collective operations (typically the majority of PGs) are completely covered
  • PGs that mix P2P operations and collective operations can use the most recent collective operations to re-establish a frame of reference (similar to having the flight-recorder start from time 0)
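A minimal sketch of what the recording side could look like with separate counters. The `ProcessGroupRecorder` class and the entry layout are made up for illustration; only the counter names come from the proposal:

```python
class ProcessGroupRecorder:
    """Hypothetical per-process-group recorder keeping separate counters."""

    def __init__(self, pg_id):
        self.pg_id = pg_id
        self.collective_seq_id = 0  # increments only for collectives
        self.p2p_seq_id = 0         # increments only for send/recv

    def record_collective(self, op_name):
        self.collective_seq_id += 1
        return {
            "pg_id": self.pg_id,
            "op": op_name,
            "collective_seq_id": self.collective_seq_id,
            "p2p_seq_id": self.p2p_seq_id,
        }

    def record_p2p(self, op_name, peer):
        self.p2p_seq_id += 1
        return {
            "pg_id": self.pg_id,
            "op": op_name,
            "peer": peer,
            # the last collective seen gives a shared frame of reference
            "collective_seq_id": self.collective_seq_id,
            "p2p_seq_id": self.p2p_seq_id,
        }
```

Because collective_seq_id advances identically on every rank in a group, any two buffers that share at least one collective can be aligned even if they were captured mid-run.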

Open Question: Solving mismatch between P2P Ops
The above doesn't resolve the challenge of matching up flight-recorder data for P2P operations on a PG that performs no collectives at all, when the captured data starts later than time 0.

A few ideas came up. Any other ideas folks have?

  • we could "artificially require" all P2P ops to be called collectively. this means that if rank 0 wants to send to rank 1, rank 2 and 3 need to also issue this 'dummy api call'. This is basically a non-starter as it's not going to help existing model code, and it's too bad a UX change to even propose with a straight face
  • The pipeline parallelism library could inject a collective 'heartbeat' at some frequency. The performance cost of this heartbeat would be a factor to consider, but doing so would provide a reference frame. This would be a little hard to tune since the necessary frequency depends on the user-configurable flight-buffer size.
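A rough sketch of the heartbeat idea inside a pipeline loop, using a plain all_reduce as the periodic collective. The interval, the function, and the surrounding loop are invented for illustration, and picking the interval is exactly the tuning problem mentioned above:

```python
import torch
import torch.distributed as dist

HEARTBEAT_INTERVAL = 50  # hypothetical; would need tuning against the flight-buffer size

def pipeline_step(step, pg):
    # ... issue the usual send/recv ops for this pipeline stage ...
    if step % HEARTBEAT_INTERVAL == 0:
        # Periodic collective that every rank in the PG issues, giving the
        # flight recorder a shared frame of reference for later P2P matching.
        # Assumes a CUDA/NCCL setup; adjust the device for other backends.
        heartbeat = torch.zeros(1, device="cuda")
        dist.all_reduce(heartbeat, group=pg)
```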
@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: distributed labels on Apr 30, 2024
@c-p-i-o self-assigned this on Apr 30, 2024
@c-p-i-o
Contributor

c-p-i-o commented May 10, 2024

A few ideas came up. Any other ideas folks have?
@wconstab

Is it possible to have two different buffers?
One for tracking collectives and a separate one for tracking P2P?
Right now we have a single vector that we dump to a file when we are debugging.
How about we keep two buffers and, when dumping, emit a start marker and an end marker for each buffer?

OR:

Alternatively, we could split the existing buffer into two: say 80% of NUM_ENTRIES dedicated to writing collectives and 20% of it for writing P2P.
The disadvantage is that there may be a bit of wasted space.
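A minimal sketch of the split-buffer idea using two fixed-size rings. NUM_ENTRIES and the 80/20 split mirror the numbers above, but the class and fields are hypothetical, not the existing recorder implementation:

```python
from collections import deque

NUM_ENTRIES = 2000  # hypothetical total capacity; normally user-configurable

class SplitFlightBuffer:
    """Two independent ring buffers: one for collectives, one for P2P."""

    def __init__(self, num_entries=NUM_ENTRIES, collective_fraction=0.8):
        collective_cap = int(num_entries * collective_fraction)
        self.collectives = deque(maxlen=collective_cap)
        self.p2p = deque(maxlen=num_entries - collective_cap)

    def record(self, entry):
        if entry["is_p2p"]:  # hypothetical field
            self.p2p.append(entry)
        else:
            self.collectives.append(entry)

    def dump(self):
        # Separate keys delimit the two buffers so the analyzer can tell them apart.
        return {"collectives": list(self.collectives), "p2p": list(self.p2p)}
```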

c-p-i-o added a commit that referenced this issue May 13, 2024
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea
here is that collectives that go to all machines should have identical
collective_seq_id values, which makes it easier to spot if one of the
machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the
sender(s)/receiver(s) are in sync.

Resolves issue: #125173

ghstack-source-id: c31b3164d2e51efeab210e6a949cd4c8d1ecd3d7
Pull Request resolved: #125727
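On the analysis side, identical collective_seq_id values across ranks make the first mismatch straightforward to find. A minimal sketch, assuming a merged view like the one sketched earlier plus a (hypothetical) map of which ranks belong to each process group:

```python
def find_first_mismatch(grouped, pg_ranks):
    """Return the first (pg_id, collective_seq_id) that some expected rank is missing from.

    grouped:  {(pg_id, collective_seq_id): {rank: entry}}, as built by a merge step
    pg_ranks: {pg_id: set of ranks expected to participate in that group}
    """
    for (pg_id, seq_id) in sorted(grouped):
        present = set(grouped[(pg_id, seq_id)])
        missing = pg_ranks[pg_id] - present
        if missing:
            return pg_id, seq_id, missing
    return None
```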