Flight Recorder Sequence IDs are insufficient #125173
Comments
Is it possible to have two different buffers? Alternatively, could we split the existing buffer in two? Say, 80% of NUM_ENTRIES dedicated to recording collectives and the remaining 20% to recording P2P operations?
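To make the split-buffer idea concrete, here is a minimal sketch, not the flight recorder's actual implementation. `SplitRecorder`, `COLLECTIVE_PERCENT`, and the tiny `NUM_ENTRIES` value are all hypothetical names and numbers chosen for illustration; only the 80/20 ratio comes from the comment above.

```python
from collections import deque

NUM_ENTRIES = 10         # hypothetical small capacity, for illustration only
COLLECTIVE_PERCENT = 80  # share of entries reserved for collectives


class SplitRecorder:
    """Two ring buffers, so a burst of P2P ops cannot evict collective entries."""

    def __init__(self, num_entries=NUM_ENTRIES, collective_percent=COLLECTIVE_PERCENT):
        n_collective = num_entries * collective_percent // 100
        # A deque with maxlen drops its oldest entry when full, like a ring buffer.
        self.collectives = deque(maxlen=n_collective)
        self.p2p = deque(maxlen=num_entries - n_collective)

    def record(self, entry, is_p2p):
        (self.p2p if is_p2p else self.collectives).append(entry)


rec = SplitRecorder()
rec.record("all_reduce0", is_p2p=False)
for i in range(5):
    rec.record(f"send{i}", is_p2p=True)
```

With these numbers the P2P buffer holds only the two most recent sends, while the single collective entry survives regardless of how many P2P operations follow it.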
Summary: Split out seq_id into collective_seq_id and p2p_seq_id. The main idea is that collectives that go to all machines should have an identical collective_seq_id, which makes it easier to spot when one of the machines isn't handling a collective operation. Next, we can attempt to match up P2P operations to ensure that the sender(s) and receiver(s) are in sync. Resolves issue: #125173 ghstack-source-id: c31b3164d2e51efeab210e6a949cd4c8d1ecd3d7 Pull Request resolved: #125727
I started working on a script to do basic processing of the flight-recorder buffers.
Pasting the doc from the script here for context:
Proposal: Add a collective-only sequence ID
For collective operations (operations that occur on every rank within a particular process group), we should keep a separate counter that only increments for such operations, and does not increment for P2P operations that are not collective.
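A minimal sketch of the two-counter idea, assuming a hypothetical `P2P_OPS` set to classify operations. The `SeqCounters` class and the op names are illustrative and are not taken from the flight recorder code.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical classification of op names; illustration only.
P2P_OPS = {"send", "recv"}


@dataclass
class SeqCounters:
    collective_seq_id: int = 0
    p2p_seq_id: int = 0

    def record(self, op_name: str) -> Tuple[int, int]:
        """Advance only the counter matching the op kind; return both IDs."""
        if op_name in P2P_OPS:
            self.p2p_seq_id += 1
        else:
            self.collective_seq_id += 1
        return self.collective_seq_id, self.p2p_seq_id


counters = SeqCounters()
trace = [counters.record(op) for op in ["all_reduce", "send", "recv", "all_gather"]]
```

In this sketch the two P2P operations between the collectives leave `collective_seq_id` untouched, so every rank that runs the same collectives ends up with the same collective IDs no matter how many sends or receives it performs in between.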
With this change, collective operations can be matched up unambiguously across ranks even when the loaded flight recorder buffers do not start at time 0. This alone already solves a few use cases.
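As a rough illustration of that matching, the sketch below takes the collective sequence IDs found in each rank's buffer and reports which ranks lack a given ID. The function name and the input shape (a dict from rank to list of IDs) are assumptions made for this example, not the format of the real dump, and each rank is assumed to have at least one entry.

```python
from typing import Dict, List


def find_missing_collectives(seq_ids_by_rank: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Map each collective sequence ID to the ranks whose buffer lacks it.

    Only IDs at or after the latest buffer start are compared, so entries
    that a rank has already overwritten are not reported as missing.
    """
    id_sets = {rank: set(ids) for rank, ids in seq_ids_by_rank.items()}
    start = max(min(ids) for ids in id_sets.values())  # first ID every buffer can cover
    end = max(max(ids) for ids in id_sets.values())    # newest ID seen on any rank
    missing: Dict[int, List[int]] = {}
    for seq_id in range(start, end + 1):
        absent = sorted(rank for rank, ids in id_sets.items() if seq_id not in ids)
        if absent:
            missing[seq_id] = absent
    return missing


# Rank 0's buffer starts earlier than the others; rank 2 never recorded ID 8.
buffers = {0: [5, 6, 7, 8], 1: [6, 7, 8], 2: [6, 7]}
result = find_missing_collectives(buffers)
```

A missing ID only says that the rank had not recorded the collective when its buffer was dumped; telling a hung rank from one that is merely behind would need additional information such as timestamps.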
Open Question: Solving mismatch between P2P Ops
The proposal above does not solve the problem of matching up flight recorder data for P2P operations on a process group (PG) that performs no collectives, when the data starts later than time 0.
A few ideas came up. Do folks have any others?