HDDS-10295. Provide an "ozone repair" subcommand to update the snapshot info in transactionInfoTable #6533
Since this is an offline CLI I think it should also support finding the largest updateID (even if it's slow) and doing the update. Maybe as two steps (one to find the largest ID, and another to update to that). Doing the repair incorrectly can result in some bad states and we should try to make the repair commands as safe as possible. @ChenSammi or @fapifta can probably confirm what the correct steps to do the repair are since I haven't actually manually repaired a DB from this bug myself. I think scanning the DB for largest update ID will give the correct number to set the transaction index to.
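The two-step idea in the comment above (first scan for the largest update ID, then write it back as a separate, explicit step) could look roughly like the sketch below. This is only an illustration under stated assumptions: the in-memory maps stand in for tables read from RocksDB, and `LargestUpdateIdSketch` and `findLargestUpdateId` are hypothetical names, not part of Ozone.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

public class LargestUpdateIdSketch {

  // Step 1 (read-only): scan every row of every table and return the largest update ID.
  // Each map is a hypothetical stand-in for one table: key -> update ID of that row.
  static OptionalLong findLargestUpdateId(List<Map<String, Long>> tables) {
    return tables.stream()
        .flatMap(table -> table.values().stream())
        .mapToLong(Long::longValue)
        .max();
  }

  public static void main(String[] args) {
    List<Map<String, Long>> tables = List.of(
        Map.of("/vol1", 3L, "/vol2", 17L),
        Map.of("/vol1/bucket1", 42L));

    OptionalLong largest = findLargestUpdateId(tables);
    // Step 2 (the write) would be a separate command that takes this value as input,
    // so the operator can review it before anything is changed.
    System.out.println(largest.isPresent() ? largest.getAsLong() : "no rows found");
  }
}
```

Keeping the scan read-only and the write separate matches the safety concern raised above: the operator sees the value before the DB is modified.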
Thanks @DaveTeng0 for the patch.
Overall looks good to me, left some cosmetic comments.
Hey @ChenSammi @szetszwo, I looked at the previous jira https://issues.apache.org/jira/browse/HDDS-9342, but I'm still not sure what the best way to retrieve the highest TermIndex is, other than checking the OM's log. I see that the two maps 'applyTransactionMap' and 'ratisTransactionMap', which might have contained that information, have been removed from the OM. Do you know where we could retrieve the TermIndex information, other than looking at the OM's log?
@DaveTeng0, since this is an offline CLI, there is no OM running, so these two maps would not be available even if they had not been removed.
I guess you mean the OM raft log? It also cannot be used, since the log entries may or may not have been applied. The correct way is to find the highest index from RocksDB. This should be what @errose28 has suggested.
Oh! That's right!
Created a jira, HDDS-10730, to investigate how to parse all RocksDB files to get the latest (highest) TermIndex of the OM.
LGTM.
Thanks @DaveTeng0 added some comments for improved testing and usability.
Hello! If there are no further comments, please feel free to merge. Thanks!
@DaveTeng0 TestTransactionInfoRepair tests are failing due to an NPE. Can you please fix that?
LGTM+1
@errose28 can you please take a look at the final PR?
Just a few more comments based on the latest iteration.
@Test
public void testUpdateTransactionInfoTable() throws Exception {
  CommandLine cmd = new CommandLine(new RDBRepair()).addSubcommand(new TransactionInfoRepair());
  String dbPath = OMStorage.getOmDbDir(conf) + OM_KEY_PREFIX + OM_DB_NAME;
nit. This is a path on the local filesystem, so it should be constructed from a Path or File object. OM_KEY_PREFIX is for files in the Ozone filesystem.
Makes sense! Changed to create a File object and retrieve its path, and changed the test case to use a plain "/" instead.
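As a rough illustration of the point in this thread, a local DB path can be built from a parent directory and a child name with `java.io.File` or `java.nio.file.Paths` rather than by concatenating strings with a separator constant. The class name, method names, and the directory and file names below are placeholders chosen for the example, not values taken from the patch.

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalDbPathSketch {

  // Joins parent directory and DB name using the platform's separator via java.io.File.
  static String dbPath(String dbDir, String dbName) {
    return new File(dbDir, dbName).getPath();
  }

  // The same idea with java.nio.file, which returns a Path object instead of a String.
  static Path dbPathNio(String dbDir, String dbName) {
    return Paths.get(dbDir, dbName);
  }

  public static void main(String[] args) {
    // Placeholder directory and DB name, for illustration only.
    System.out.println(dbPath("/tmp/om-meta", "om.db"));
    System.out.println(dbPathNio("/tmp/om-meta", "om.db"));
  }
}
```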
String cmdOut2 = scanTransactionInfoTable(dbPath);
assertThat(cmdOut2).contains(testTerm + "#" + testIndex);
cluster.getOzoneManager().restart();
Is the goal to make sure that the OM starts correctly after the repair? If so, we should use the same transaction update command to restore the old values, then do a metadata write operation on the cluster when it comes back up.
makes sense! updated!
System.err.println(TRANSACTION_INFO_TABLE + " is not in a column family in DB for the given path.");
return null;
I think the command will still exit 0 in this case. If you throw something like IllegalArgumentException, the stack trace will be filtered out, the message printed, and the return code will be non-zero.
This can be tested in TestTransactionInfoRepair too.
Definitely makes sense. My mistake, I should have chosen to throw an exception here instead! Thanks for catching it. I will add verification of the error message in the test cases.
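The difference discussed in this thread (printing an error and returning `null` versus throwing) can be shown with a small self-contained sketch. The `run` method below is a hypothetical stand-in for whatever the CLI framework does with an uncaught exception. It is not Ozone's or picocli's actual handler, and the error message is illustrative.

```java
import java.util.concurrent.Callable;

public class ExitCodeSketch {

  // Hypothetical command body: throws instead of printing and returning null,
  // so the caller can tell that the command failed.
  static Callable<Void> command(boolean columnFamilyPresent) {
    return () -> {
      if (!columnFamilyPresent) {
        throw new IllegalArgumentException(
            "transactionInfoTable is not a column family in the DB at the given path.");
      }
      return null;
    };
  }

  // Hypothetical runner: prints only the exception message (no stack trace)
  // and maps any failure to a non-zero exit code.
  static int run(Callable<Void> cmd) {
    try {
      cmd.call();
      return 0;
    } catch (Exception e) {
      System.err.println(e.getMessage());
      return 1;
    }
  }

  public static void main(String[] args) {
    System.out.println("exit code when table exists: " + run(command(true)));
    System.out.println("exit code when table is missing: " + run(command(false)));
  }
}
```

A test can then assert on both the exit code and the printed message, which is the kind of verification suggested for TestTransactionInfoRepair.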
What changes were proposed in this pull request?
The issue found in HDDS-9342 caused the snapshot info in the OM transactionInfoTable to not be updated in time, so that OM restart failed at the update ID check during raft log reapply.
The recovery solution is to find the largest update ID and update the snapshot info in transactionInfoTable with it.
This task aims to provide a CLI to update the table. Note that the largest update ID and its term currently still have to be found manually.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10295
How was this patch tested?
Integration test