
Multiple NameNode role-groups may lead to cluster startup failure #294

maltesander opened this issue Jan 9, 2023 · 4 comments

@maltesander
Member

Affected version

0.7.0-nightly

Current and expected behavior

Currently we have a format-namenode init container on the namenodes and a script that formats each namenode as either active or standby (roughly sketched below).
With a single role-group and podManagementPolicy: "OrderedReady" we make sure that the namenodes (and actually the data- and journalnodes as well) spin up one after another.
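
For illustration, here is a minimal sketch of that active/standby decision, assuming a check via hdfs haadmin (the actual script may differ):

  #!/usr/bin/env bash
  # Hypothetical simplification of the format-namenode logic; not the actual script.
  if hdfs haadmin -getAllServiceState 2>/dev/null | grep -q "active"; then
    # Another namenode is already active: copy its metadata and come up as standby.
    hdfs namenode -bootstrapStandby -nonInteractive
  else
    # No active namenode visible: format a fresh image (new cluster/block pool IDs).
    hdfs namenode -format -nonInteractive
  fi

With two role-groups starting in parallel, both Pods can take the else branch at the same time, which is exactly the race described below.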

With two role-groups like:

  nameNodes:
    roleGroups:
      default:
        replicas: 1
      other_default:
        replicas: 1

we get two StatefulSets, each of which respects the "OrderedReady" policy on its own but spins up its Pods in parallel with the other StatefulSet.

This can lead to a cluster startup failure. The namenodes of the different role-groups sometimes (flakily) both format themselves as active, with different cluster/block pool IDs etc., which causes the "slower" namenode to fail to start up and join the cluster:

Failed to start namenode.
java.io.FileNotFoundException: No valid image files found
	at org.apache.hadoop.hdfs.server.namenode.FSImageTransactionalStorageInspector.getLatestImages(FSImageTransactionalStorageInspector.java:158)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:688)
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:339)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1201)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:779)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:768)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1020)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:995)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1769)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1834)

Possible solution

We have to improve the format-namenode init container script to take into account namenodes (and their formatting) starting in parallel. Currently it just checks whether there already is an active namenode and, depending on that, formats as active or standby, which leads to the race condition of two namenodes formatting themselves as active with different IDs.

  • Take ZooKeeper into account?
  • Let the operator determine which role-group should format as active? (see the sketch after this list)
  • Introduce "wait" times for different role-groups to make sure they will not spin up in parallel (not very deterministic)
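
As a rough sketch of the second option, the operator could inject a flag into exactly one namenode Pod so that only this Pod ever formats, while all others bootstrap as standby. FORMAT_ACTIVE and the data directory path below are hypothetical, not existing operator behaviour:

  #!/usr/bin/env bash
  # Hypothetical: the operator sets FORMAT_ACTIVE=true on exactly one namenode Pod.
  NAME_DIR="/data/namenode"   # placeholder for the actual namenode data mount

  if [ "${FORMAT_ACTIVE:-false}" = "true" ]; then
    # Only the designated namenode formats, so two Pods can no longer race.
    if [ ! -f "${NAME_DIR}/current/VERSION" ]; then
      hdfs namenode -format -nonInteractive
    fi
  else
    # Everybody else waits for the designated namenode and copies its metadata.
    until hdfs namenode -bootstrapStandby -nonInteractive; do
      echo "Waiting for the designated namenode to be formatted and active..."
      sleep 5
    done
  fi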

@lfrancke @soenkeliebau @Jimvin any ideas?

Additional context

This came up while implementing logging for the HDFS operator (the integration tests use multiple role-groups per role for custom and automatic log testing).

We should try to get rid of the "OrderedReady" policy anyway (see #261) to speed up cluster creation.

Environment

Failed on GKE 1.23, AWS 1.22, Azure 1.23 (and probably any other provider)

Would you like to work on fixing this bug?

None

@soenkeliebau
Member

Personally, I think we should try to have the operator determine this and only add the init container to one namenode; this would simply bypass the race condition, artificial delays, etc.

We have talked in the past about tracking the "format state" in the status of the HDFS object; that might be helpful here as well.

@maltesander
Member Author

Don't we have to format the other namenodes as standby?

@lfrancke
Member

@soenkeliebau How critical do you think this is?

@soenkeliebau
Member

I'd say if the only effect is "a namenode doesn't come up and needs to be reset", then it should be ok-ish.
To resolve this when it happens, we probably need to remove the backing PVs for this namenode, right? Or would it overwrite the state there with the state from the active namenode when it is restarted?

If, however, this can cause the formatting of an actual filesystem that contains data, it would be a bit more critical.
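
For the reset case, a rough sketch of the manual steps discussed above; the Pod and PVC names are placeholders, and this assumes the init container then bootstraps the recreated namenode as standby from the current active:

  # Hypothetical manual reset of a namenode that formatted the "wrong" image.
  kubectl delete pvc data-simple-hdfs-namenode-default-1 --wait=false
  kubectl delete pod simple-hdfs-namenode-default-1
  # The StatefulSet recreates the Pod with a fresh volume; the init container should
  # then see the existing active namenode and bootstrap this one as standby.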
