test: Assumeutxo: import snapshot in a node with a divergent chain #29996
Open · alfonsoromanz wants to merge 3 commits into bitcoin:master from alfonsoromanz:assumeutxo_tests
This should also check that the background validation succeeds. Otherwise there could be a bug where the diverging chain is not rewound?
Thanks for pointing this out. The background validation does not seem to finish in this case.

I am now testing with a new node n3 (in my local copy) to avoid the manual rollback to START_HEIGHT (L207). Additionally, sync_blocks() was throwing a timeout when reusing n2 for this test (not sure why). Here's the new approach I'm trying:

- Start n3 and sync it up to START_HEIGHT (199).
- Generate a divergent chain from START_HEIGHT up to height 298 (< SNAPSHOT_BASE_HEIGHT).
- After loading the snapshot, I can see these two chain states:

Next, I connect the nodes and ensure they all see the same tip. After syncing, these are the chain states:

It seems that the snapshot chain has now synced to the tip (height 399). However, this line times out after syncing the blocks:

self.wait_until(lambda: len(n3.getchainstates()['chainstates']) == 1)

I'm not sure if I'm doing something wrong or if there is indeed a bug where the divergent chain is not rewound. I will continue investigating.

Something to note here: if I follow the same process but don't generate any divergent chain, the validation completes successfully.

Any directions on how to proceed would be appreciated.
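For readers unfamiliar with the timeout pattern above, here is a hedged, self-contained sketch of what the check is waiting for. wait_until is a simplified stand-in for the test framework's helper, and node_state is a toy substitute for the getchainstates() RPC result (two chainstates while background validation runs, one after it completes); none of this is the actual framework code.

```python
import time

def wait_until(predicate, timeout=5, poll=0.1):
    """Simplified stand-in for the test framework's self.wait_until():
    poll until the predicate holds, or raise on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    raise AssertionError("predicate not true within timeout")

# Toy stand-in for n3.getchainstates(): while background validation runs
# there are two chainstates; once it completes, only one remains.
node_state = {"chainstates": [{"name": "background"}, {"name": "snapshot"}]}

def background_validation_done():
    return len(node_state["chainstates"]) == 1

# If the background chain never finishes (the suspected bug), the wait
# would time out. Here we simulate a successful completion instead.
node_state["chainstates"] = [{"name": "snapshot"}]
print(wait_until(background_validation_done))  # True
```

The timeout the comment describes corresponds to the predicate never becoming true: the background chainstate is never validated away, so the list never shrinks to one entry.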
This sounds like a bug. Just to clarify, the active chain is stuck at height 298 and the background chain continues to sync past 399?
I guess I am confused by the terms "active" and "background" chain. I assume the active chain is the snapshot chain, which starts at height 299, and the divergent chain is the one stuck at 298. I base this on the fact that when I run getbestblockhash after loadtxoutset, I get the hash of the snapshot tip. However, I might be mistaken. Shouldn't the divergent chain be rewound to START_HEIGHT and become the background validation chain?

To address your questions:

- The divergent chain is stuck at height 298.
- The snapshot chain syncs past 399. Although 399 is the FINAL_HEIGHT for this test, I was able to mine an additional 100 blocks on top of node0, resulting in both node0 and node3 syncing again, and the snapshot chain syncing up to height 499.

However, even after syncing past 399, the background validation does not seem to finish. I always get a timeout when running this line after syncing the nodes:

Additionally, I am experiencing an intermittent issue where the sync does not always finish, and I have not been able to determine the cause yet. This issue happens only after mining past 399.
I can confirm this behavior and also think that this has uncovered a bug in net_processing. I think the root cause is in TryDownloadingHistoricalBlocks (which is responsible for downloading the background chain):

This function calls FindNextBlocks() with pindexWalk set to the current tip of the background chainstate (from_tip), and target_block set to the snapshot block. FindNextBlocks then walks from the snapshot block backwards to the height of from_tip, saves these blocks in vToFetch, and then begins to download them in forward order.

This is incorrect, because the blocks starting at the last common ancestor of from_tip and the snapshot block, up to the height of from_tip, are never requested that way (their height is smaller than the height of from_tip).

So, my proposed fix would be something like mzumsande@edb2b69 (feel free to cherry-pick / adjust as you like).

@alfonsoromanz: Could you check if that fix solves the issue for you?

@ryanofsky: Could you take a look - would you agree with that explanation and the proposed fix?
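To make the walk described above concrete, here is a hedged Python toy model. Block and walk_back are hypothetical stand-ins for the C++ CBlockIndex walk in FindNextBlocks, not the real implementation; the heights (fork at 199, divergent tip 298, snapshot base 299) mirror the test scenario in this thread.

```python
# Toy model of the backwards walk in FindNextBlocks (hypothetical stand-in).

class Block:
    def __init__(self, height, parent=None):
        self.height = height
        self.parent = parent

# Honest chain: heights 0..299, where 299 is the snapshot base block.
honest = [Block(0)]
for h in range(1, 300):
    honest.append(Block(h, honest[-1]))
snapshot_block = honest[299]

# Divergent chain forks off after height 199 and reaches height 298.
fork_tip = honest[199]
for h in range(200, 299):
    fork_tip = Block(h, fork_tip)
from_tip = fork_tip  # background chainstate tip, height 298 (on the fork)

def walk_back(target, stop_height):
    """Collect blocks from target down to (but not including) stop_height."""
    out, cur = [], target
    while cur is not None and cur.height > stop_height:
        out.append(cur)
        cur = cur.parent
    return out

# Buggy behavior: walk from the snapshot block back to from_tip's *height*.
# Only block 299 is collected; honest blocks 200..298 (below from_tip's
# height, but missing from the background chain) are never requested.
buggy = walk_back(snapshot_block, from_tip.height)
print([b.height for b in buggy])  # [299]

# Behavior after the proposed fix: walk back to the height of the last
# common ancestor (199) instead, so honest blocks 200..299 are all queued.
fixed = walk_back(snapshot_block, 199)
print(min(b.height for b in fixed), len(fixed))  # 200 100
```

The toy model shows why the background sync stalls: with the height of from_tip as the stop point, the fetch list never includes the honest-chain blocks at or below that height, so the background chainstate can never advance past the fork point.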
re: https://github.com/bitcoin/bitcoin/pull/29996/files#r1633915935

Nice find! Would suggest opening a separate PR so it is easier to understand the problem and fix. And maybe it is possible to come up with a simpler test for this problem specifically, like adding an assert in FindNextBlocks() that pindexWalk is an ancestor of state->pindexBestKnownBlock, and then adding a test that triggers the assert.

Would also consider tweaking the fix to call LastCommonAncestor() before calling TryDownloadingHistoricalBlocks(), so it is easier to understand the from_tip variable as being the immediate predecessor of the blocks to download next, instead of having a more complicated meaning.
re: #29996 (comment)

I think the todo list came from my comment #27596 (comment), and I probably was thinking of interpretation (B), not (A), but maybe (B) is not a very interesting scenario.

I'm not sure why it needs to "lead to a much earlier error," though, or lead to any error. If the node is syncing to some chain that seems to have the most work, but then headers are announced for a second chain that has more work, the chainstate tip is not going to switch to the second chain, even though it has more work, until enough blocks from it are downloaded and validated, and a block from the second chain is reached that is valid and has more work than the chainstate tip. Before that happens, a snapshot from the second chain could be loaded such that the current chain tip has less work than the snapshot block and is not an ancestor of the snapshot block, but the snapshot block is valid and its ancestors can be downloaded.

Or maybe that is wrong, but at least it's my understanding.
Maybe I misunderstand, but that seems overly complicated. I assume we're talking about the scenario "not an ancestor of the snapshot block but has less work":

In my local test, I just gave the node the headers of the snapshot chain, and then used -generate to mine a divergent chain from the old tip. The number of blocks doesn't really matter. The other chain (with the snapshot in it) will have more work, but it is headers-only, so the tip will be on the divergent chain no matter how much work the other chain has. Then, after the snapshot is loaded and we connect to a peer that has all the blocks, the node will successfully download the snapshot chain, but currently the background sync won't complete unless you apply my fix above. I assume that @alfonsoromanz's test (not pushed yet) works in a similar way.

Just to avoid any confusion: there are two independent issues. My issue pops up if you don't use invalidateblock anymore, as the current version of the PR still does.

I neither want to hijack this PR nor open a PR with just the fix and no test, so my suggestion would be that you incorporate the one-line fix into this PR (meaning that this PR wouldn't be test-only anymore) - if you're interested and have the time, that is.
I interpreted it as (B): less work than the snapshot block itself.

Regarding (A): if the new node (divergent chain) has less work than the chain tip (399) but more than the snapshot (299), the expected behavior would be to get the error: "Unable to Load UTXO Snapshot - [snapshot] activation failed - work does not exceed active chainstate."

This same behavior is expected for the other scenario of a node with a divergent chain but with more work, i.e. "TODO: Not an ancestor or a descendant of the snapshot block and has more work". So both scenarios look very similar to me. That's why I was leaning towards option (B). However, I am not an expert on real scenarios on mainnet and I may be missing something.

Also, I started working on your approach and completed this part:

But this is where I get confused: after following your steps and connecting and syncing the nodes, the divergent chain is replaced with the original chain because it has more work. If I don't sync, it's not replaced, but I guess that's just a matter of time? Or maybe I don't understand how connect and sync work.

Given that the snapshot will not be loaded because it doesn't exceed the active chainstate, I don't see much difference between making the divergent chain have less or more work than the original chain. What am I missing?

Either way, I am happy to add tests for both the (A) and (B) scenarios.
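The activation check discussed here can be modeled with a small hedged sketch. can_activate_snapshot is a hypothetical helper, and the integers stand in for chainwork values (they are illustrative only, not real arith_uint256 chainwork); the error string matches the one quoted above.

```python
# Hedged sketch of the snapshot activation rule discussed in this thread:
# loading a snapshot is rejected when the snapshot base block's work does
# not exceed the active chainstate's. Hypothetical helper; toy "work" ints.

def can_activate_snapshot(snapshot_base_work, active_tip_work):
    if snapshot_base_work <= active_tip_work:
        raise RuntimeError(
            "[snapshot] activation failed - work does not exceed active chainstate")
    return True

# Scenario (A): the node's tip has more work than the snapshot base -> error.
try:
    can_activate_snapshot(snapshot_base_work=299, active_tip_work=350)
except RuntimeError as err:
    print("rejected:", err)

# Divergent tip with less work than the snapshot base -> load can proceed.
print(can_activate_snapshot(snapshot_base_work=299, active_tip_work=298))  # True
```

Under this model, both "less work than the tip but more than the snapshot" and "more work than the tip" are rejected the same way, which is why the two scenarios look so similar in the comment above.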
Yes, that's what I'm doing in my local code (not pushed). I'm submitting the headers to n3 just like the original code does to n1 and n2. After that, I call a test function, test_snapshot_in_a_divergent_chain, where I generate the divergent chain and load the snapshot. The background validation only finishes if I apply your fix.

Yes, I can incorporate it and add you as a co-author. Thanks!
I just pushed my recent changes for this PR. This is the approach I decided to move forward with:

- A new node (n3) for the scenario where we load the snapshot in a node with a divergent chain but less work. I am not reusing previous nodes because I don't know a way to do a clean rollback without invalidating blocks. As mentioned by @fjahr, we shouldn't expect to load a snapshot in a scenario where part of the snapshot chain was invalidated. There is actually a new PR to prevent this from happening: assumeutxo: Check snapshot base block is not in invalid chain #30267.
- A new node (n4) for the scenario where we load the snapshot in a node with a divergent chain and more work. I was not able to reuse n3 for the same reason described previously: n3 has already synced to the tip, and I don't know any other way to roll back the chain other than invalidating blocks.

Any feedback is appreciated. Thanks!