
Instance metadata will be lost after Nacos restart #11890

Open
nkorange opened this issue Mar 28, 2024 · 7 comments · May be fixed by #11932
Labels: area/Naming, kind/bug, kind/discussion

Comments

nkorange (Collaborator) commented Mar 28, 2024

Describe the bug

Instance metadata will be lost after Nacos restart

Expected behavior

Instance metadata is not affected after Nacos restart

Actual behavior

Instance metadata is lost after Nacos restart

How to Reproduce
Steps to reproduce the behavior:

  1. Deploy 3 Nacos servers.
  2. Run a Nacos 2.x client and register an instance with service name 'test.1' (see the registration sketch after these steps).
  3. Set the instance to offline on the Nacos console.
  4. Restart the Nacos server that the Nacos client is connected to.
  5. Wait for 5 minutes.
  6. Call the following command to force a refresh of the data:
curl 'http://127.0.0.1:8848/nacos/v1/ns/instance/list?serviceName=test.1&udpPort=1111' -H 'User-Agent:Nacos-Java-Client:v2.0.0'
  7. Observe that the instance status is back to online.
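
For step 2, a minimal registration sketch using the nacos-client Java API; the server address, IP, port and metadata values below are placeholders:

import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

import java.util.HashMap;
import java.util.Map;

public class RegisterTestInstance {

    public static void main(String[] args) throws NacosException {
        // Connect to one of the three Nacos servers (placeholder address).
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        // Register an instance with some metadata so metadata loss is observable later.
        Instance instance = new Instance();
        instance.setIp("10.0.0.1");
        instance.setPort(8080);
        Map<String, String> metadata = new HashMap<>();
        metadata.put("version", "1.0");
        instance.setMetadata(metadata);

        naming.registerInstance("test.1", instance);
    }
}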

Desktop (please complete the following information):

  • OS: MacOS
  • Version: nacos-server 2.3.1, nacos-client 2.1.2
  • Module: naming
  • SDK: original

Additional context

Issue #10975 reported a similar problem. The fix for that issue did solve the metadata loss after client reconnection.

But for a Nacos restart, the metadata loss issue still persists.

The root cause of this bug is that Nacos has an ExpiredClientCleaner that removes all expired clients.

Consider the three Nacos servers Nacos1, Nacos2 and Nacos3:

  1. The client connects to Nacos1 with client ID client_1.
  2. The client_1 connection data is synced to Nacos2 and Nacos3.
  3. Set the instance to offline on the Nacos console.
  4. Restart Nacos1.
  5. The client reconnects to Nacos2 with client ID client_2.
  6. Since there are no more heartbeats from client_1, and Nacos1 never sent a Client-Delete event to Nacos2 and Nacos3, the client_1 connection data is still present on Nacos2 and Nacos3. Nacos2 and Nacos3 therefore consider client_1 expired, and after 3 minutes they trigger the clean task in ExpiredClientCleaner.
  7. ExpiredClientCleaner publishes a ClientDisconnectEvent.
  8. NamingMetadataManager receives the ClientDisconnectEvent and marks the instance metadata as expired.
  9. After 1 minute, the instance metadata is deleted (see the simplified sketch after this list).
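
To make that chain concrete, here is a heavily simplified, purely illustrative sketch of the expiry flow. The class is made up for this issue and is not the actual Nacos implementation; only the 1-minute grace period and the "mark all of the client's instances expired" behavior from the steps above are kept:

// Purely illustrative -- not the real Nacos code. It condenses steps 7-9 above.
import java.util.HashMap;
import java.util.Map;

class MetadataExpirySketch {

    // Step 9: metadata is deleted about 1 minute after being marked expired (assumed value).
    static final long METADATA_EXPIRED_MS = 60 * 1000;

    // instance key -> metadata (e.g. the "offline" flag set on the console)
    final Map<String, Map<String, String>> metadata = new HashMap<>();
    // instance key -> deadline after which the metadata is deleted
    final Map<String, Long> expireAt = new HashMap<>();

    // Steps 7-8: when client_1 is considered expired, a disconnect event is published and
    // every instance published by client_1 is marked expired -- without checking whether
    // the same instance has been re-registered through client_2.
    void onClientDisconnected(Iterable<String> instancesOfClient, long now) {
        for (String instanceKey : instancesOfClient) {
            expireAt.put(instanceKey, now + METADATA_EXPIRED_MS);
        }
    }

    // Step 9: a periodic sweep deletes the expired metadata, so the offline flag is lost.
    void sweep(long now) {
        expireAt.entrySet().removeIf(entry -> {
            if (now >= entry.getValue()) {
                metadata.remove(entry.getKey());
                return true;
            }
            return false;
        });
    }
}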
KomachiSion (Collaborator)

Does "restart Nacos server" mean restarting one node or restarting the whole cluster?

KomachiSion added the kind/bug, area/Naming and kind/discussion labels Mar 28, 2024
guozongkang (Contributor) commented Mar 28, 2024

Does "restart Nacos server" mean restarting one node or restarting the whole cluster?

Only the Nacos node that holds the client's long-lived gRPC connection is restarted; the whole cluster does not need to be restarted.

nkorange (Collaborator, Author) commented Apr 2, 2024

A quick fix might be to check whether the instance still exists before marking its metadata as expired in NamingMetadataManager:

private void updateExpiredInfo(boolean expired, ExpiredMetadataInfo expiredMetadataInfo) {

    Instance instance = queryInstance(...); // newly added code: look up the instance first

    // Only queue the metadata for expiration if the instance really no longer exists;
    // otherwise make sure it is removed from the expiration queue.
    if (expired && instance == null) {
        expiredMetadataInfos.add(expiredMetadataInfo);
    } else {
        expiredMetadataInfos.remove(expiredMetadataInfo);
    }
}

But in the long term, I think this part should be redesigned: instance metadata should be bound to an abstract session (similar to a lease in etcd), instead of to a connection.
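
A rough sketch of what such a session abstraction could look like; the names below are hypothetical (not existing Nacos classes) and only illustrate the idea that metadata outlives any single connection as long as the session keeps being renewed:

// Hypothetical sketch -- these classes do not exist in Nacos; they only illustrate the idea.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NamingSession {

    final String sessionId;          // stable identity across reconnects, unlike a connection ID
    volatile long expireAtMillis;    // renewed by whichever connection the client currently uses
    final Map<String, Map<String, String>> instanceMetadata = new ConcurrentHashMap<>();

    NamingSession(String sessionId, long ttlMillis) {
        this.sessionId = sessionId;
        this.expireAtMillis = System.currentTimeMillis() + ttlMillis;
    }

    // Called on every heartbeat; a server restart only matters if no renewal arrives within the TTL.
    void renew(long ttlMillis) {
        expireAtMillis = System.currentTimeMillis() + ttlMillis;
    }

    // Metadata is only dropped when the whole session expires, not when a single connection dies.
    boolean isExpired(long now) {
        return now >= expireAtMillis;
    }
}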

@KomachiSion Any thoughts on this?

KomachiSion (Collaborator)

  1. I think the check should be done in the Cleaner.
  2. How should queryInstance work? If it uses ServiceStorage, that is a cache which might be updated with a slight delay. If it queries by traversing, could that cause a performance problem?

nkorange (Collaborator, Author) commented Apr 4, 2024

How should queryInstance work? If it uses ServiceStorage, that is a cache which might be updated with a slight delay. If it queries by traversing, could that cause a performance problem?

I don't know a better solution; I would use ServiceStorage.getPushData(...) to query the instance. As I mentioned, this is a temporary fix, and I think this whole expiration mechanism should be refactored.

KomachiSion (Collaborator)

How should queryInstance work? If it uses ServiceStorage, that is a cache which might be updated with a slight delay. If it queries by traversing, could that cause a performance problem?

I don't know a better solution; I would use ServiceStorage.getPushData(...) to query the instance. As I mentioned, this is a temporary fix, and I think this whole expiration mechanism should be refactored.

Let's fix it this way first; the performance problem can be improved later, but if we can find a better approach now, prefer the one with better performance.

  1. Do the check in the Cleaner, which is what actually deletes the metadata.
  2. Use ServiceStorage.getPushData(...) first to get the real data (see the sketch below).
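
A fragment-level sketch of that guard, in the same spirit as the snippet above; the exact types, accessors and field names in the Nacos codebase may differ, so treat them and the key format as assumptions:

// Sketch only -- exact Nacos types/accessors may differ. The cleaner asks ServiceStorage
// for the current push data and keeps the metadata if the instance is still present.
private boolean instanceStillExists(Service service, String instanceKey) {
    // ServiceStorage is a cache and may lag slightly behind, as noted above.
    ServiceInfo serviceInfo = serviceStorage.getPushData(service);
    for (Instance each : serviceInfo.getHosts()) {
        // Assumed key format "ip:port"; adjust to whatever key the metadata is stored under.
        if (instanceKey.equals(each.getIp() + ":" + each.getPort())) {
            return true;
        }
    }
    return false;
}

// In the cleaner, only delete when the instance is really gone
// (assumed accessor and helper names):
// if (!instanceStillExists(expiredInfo.getService(), expiredInfo.getMetadataId())) {
//     removeExpiredInstanceMetadata(expiredInfo);
// }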

nkorange (Collaborator, Author) commented Apr 8, 2024

Okay, let me create a PR
