Flight Recorder Sequence IDs are insufficient #125173

Open

wconstab opened this issue Apr 29, 2024 · 1 comment

Comments

@wconstab
Contributor

I started working on a script to do basic processing of the flight-recorder buffers.

Pasting the doc from the script here for context:

"""Flight Recorder Trace Analyzer
This script primarily merges data from individual flight recorder buffers from individual ranks in a
PyTorch Distributed program into a flattened database format that can be used for further analysis.
However, as part of the merging process, it is necessary to perform some analysis in order to match operators
on one rank with corresponding operators on other ranks and register them as one 'collective' entry.  During this
process, a significant amount of useful information can already be extracted, such as where the first mismatch occurs
in cases of desync (when not all ranks issue a compatible collective in a particular process group).
Not Yet Implemented

- TODO: tracebacks aren't implemented

Known Issues
- Flight Recorder buffer sequence_id information is not sufficient to match collectives and coalesced collectives
  unless we have the trace data from the beginning of the program.  To enable confident analysis of trace buffers that
  do not start from zero (and to simplify the script's matching logic) we need to add more information to the recorder.
- Currently, the script omits checking the 'status' of collectives.  We can look for the first 'non-completed'
  collective easily enough and report that.

Usage
python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
- Omitting the optional output file will still yield analysis information to stdout
- The output file is a pickle of the flat DB, which may change in format in the future.
"""

Proposal: Add a collective-only sequence ID

  • or separate seq_id into 'collective_seq_id' and 'p2p_seq_id', or something along those lines

For collective operations (operations that occur on every rank within a particular process group), we should keep a separate counter that increments only for those operations and not for P2P operations.

With this change, even when loading flight recorder buffers that do not start at time 0, it is possible to unambiguously match up collective operations across ranks. This already covers a few use cases:

  • process groups that only use collective operations (typically the majority of PGs) are completely covered
  • PGs that mix P2P operations and collective operations can use the most recent collective operations to re-establish a frame of reference (similar to having the flight-recorder start from time 0)
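A minimal sketch of what the recording side could look like with separate counters. The `ProcessGroupRecorder` class and the entry layout are made up for illustration; only the counter names come from the proposal:

```python
class ProcessGroupRecorder:
    """Hypothetical per-process-group recorder keeping separate counters."""

    def __init__(self, pg_id):
        self.pg_id = pg_id
        self.collective_seq_id = 0  # increments only for collectives
        self.p2p_seq_id = 0         # increments only for send/recv

    def record_collective(self, op_name):
        self.collective_seq_id += 1
        return {
            "pg_id": self.pg_id,
            "op": op_name,
            "collective_seq_id": self.collective_seq_id,
            "p2p_seq_id": self.p2p_seq_id,
        }

    def record_p2p(self, op_name, peer):
        self.p2p_seq_id += 1
        return {
            "pg_id": self.pg_id,
            "op": op_name,
            "peer": peer,
            # the last collective seen gives a shared frame of reference
            "collective_seq_id": self.collective_seq_id,
            "p2p_seq_id": self.p2p_seq_id,
        }
```

Because collective_seq_id advances identically on every rank in a group, any two buffers that share at least one collective can be aligned even if they were captured mid-run.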

Open Question: Solving mismatch between P2P Ops
The above doesn't resolve the challenge of matching up flight-recorder data for P2P operations on a PG that performs no collectives at all, when the captured data starts later than time 0.

A few ideas came up. Any other ideas folks have?

  • we could "artificially require" all P2P ops to be called collectively. this means that if rank 0 wants to send to rank 1, rank 2 and 3 need to also issue this 'dummy api call'. This is basically a non-starter as it's not going to help existing model code, and it's too bad a UX change to even propose with a straight face
  • The pipeline parallelism library could inject a collective 'heartbeat' at some frequency. The performance cost of this heartbeat would be a factor to consider, but doing so would provide a reference frame. This would be a little hard to tune since the necessary frequency depends on the user-configurable flight-buffer size.
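A rough sketch of the heartbeat idea inside a pipeline loop, using a plain all_reduce as the periodic collective. The interval, the function, and the surrounding loop are invented for illustration, and picking the interval is exactly the tuning problem mentioned above:

```python
import torch
import torch.distributed as dist

HEARTBEAT_INTERVAL = 50  # hypothetical; would need tuning against the flight-buffer size

def pipeline_step(step, pg):
    # ... issue the usual send/recv ops for this pipeline stage ...
    if step % HEARTBEAT_INTERVAL == 0:
        # Periodic collective that every rank in the PG issues, giving the
        # flight recorder a shared frame of reference for later P2P matching.
        # Assumes a CUDA/NCCL setup; adjust the device for other backends.
        heartbeat = torch.zeros(1, device="cuda")
        dist.all_reduce(heartbeat, group=pg)
```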
@cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) and module: distributed labels on Apr 30, 2024
@c-p-i-o self-assigned this on Apr 30, 2024
@c-p-i-o
Contributor

c-p-i-o commented May 10, 2024

A few ideas came up. Any other ideas folks have?
@wconstab

Is it possible to have two different buffers?
One for tracking collectives and a separate one for tracking P2P?
Right now we have a single vector that we dump to a file when we are debugging.
How about we keep two buffers and, when dumping, emit a start marker and an end marker for each buffer?

OR:

Alternatively, we could split the existing buffer into two: say 80% of NUM_ENTRIES dedicated to writing collectives and 20% of it for writing P2P.
The disadvantage is that there may be a bit of wasted space.
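A minimal sketch of the split-buffer idea using two fixed-size rings. NUM_ENTRIES and the 80/20 split mirror the numbers above, but the class and fields are hypothetical, not the existing recorder implementation:

```python
from collections import deque

NUM_ENTRIES = 2000  # hypothetical total capacity; normally user-configurable

class SplitFlightBuffer:
    """Two independent ring buffers: one for collectives, one for P2P."""

    def __init__(self, num_entries=NUM_ENTRIES, collective_fraction=0.8):
        collective_cap = int(num_entries * collective_fraction)
        self.collectives = deque(maxlen=collective_cap)
        self.p2p = deque(maxlen=num_entries - collective_cap)

    def record(self, entry):
        if entry["is_p2p"]:  # hypothetical field
            self.p2p.append(entry)
        else:
            self.collectives.append(entry)

    def dump(self):
        # Separate keys delimit the two buffers so the analyzer can tell them apart.
        return {"collectives": list(self.collectives), "p2p": list(self.p2p)}
```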

c-p-i-o added a commit that referenced this issue May 13, 2024
Summary:
Split out seq_id into collective_seq_id and p2p_seq_id. The main idea
here is that collectives that go to all machines should have identical
collective_seq_id values, which makes it easier to spot if one of the
machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the
sender(s)/receiver(s) are in sync.

Resolves issue: #125173

ghstack-source-id: c31b3164d2e51efeab210e6a949cd4c8d1ecd3d7
Pull Request resolved: #125727
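On the analysis side, identical collective_seq_id values across ranks make the first mismatch straightforward to find. A minimal sketch, assuming a merged view like the one sketched earlier plus a (hypothetical) map of which ranks belong to each process group:

```python
def find_first_mismatch(grouped, pg_ranks):
    """Return the first (pg_id, collective_seq_id) that some expected rank is missing from.

    grouped:  {(pg_id, collective_seq_id): {rank: entry}}, as built by a merge step
    pg_ranks: {pg_id: set of ranks expected to participate in that group}
    """
    for (pg_id, seq_id) in sorted(grouped):
        present = set(grouped[(pg_id, seq_id)])
        missing = pg_ranks[pg_id] - present
        if missing:
            return pg_id, seq_id, missing
    return None
```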