
Instance metadata will be lost after Nacos restart #11890

Open
nkorange opened this issue Mar 28, 2024 · 7 comments · May be fixed by #11932
Labels: area/Naming, kind/bug, kind/discussion

Comments

nkorange (Collaborator) commented Mar 28, 2024

Describe the bug

Instance metadata will be lost after Nacos restart

Expected behavior

Instance metadata is not affected after Nacos restart

Actual behavior

Instance metadata is lost after Nacos restart

How to Reproduce
Steps to reproduce the behavior:

  1. Deploy 3 Nacos servers.
  2. Run a Nacos 2.x client and register an instance with service name 'test.1' (see the registration sketch after these steps).
  3. Set the instance to offline on the Nacos console.
  4. Restart the Nacos server that the Nacos client is connected to.
  5. Wait for 5 minutes.
  6. Call the following command to force a refresh of the data:
curl 'http://127.0.0.1:8848/nacos/v1/ns/instance/list?serviceName=test.1&udpPort=1111' -H 'User-Agent:Nacos-Java-Client:v2.0.0'
  7. Observe that the instance status is back to online.
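
For step 2, a minimal registration sketch using the nacos-client Java API; the server address, IP, port and metadata values below are placeholders:

import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

import java.util.HashMap;
import java.util.Map;

public class RegisterTestInstance {

    public static void main(String[] args) throws NacosException {
        // Connect to one of the three Nacos servers (placeholder address).
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        // Register an instance with some metadata so metadata loss is observable later.
        Instance instance = new Instance();
        instance.setIp("10.0.0.1");
        instance.setPort(8080);
        Map<String, String> metadata = new HashMap<>();
        metadata.put("version", "1.0");
        instance.setMetadata(metadata);

        naming.registerInstance("test.1", instance);
    }
}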

Desktop (please complete the following information):

  • OS: MacOS
  • Version: nacos-server 2.3.1, nacos-client 2.1.2
  • Module: naming
  • SDK: original

Additional context

Issue #10975 reported a similar problem. The fix for that issue did solve the metadata loss after client reconnection.

But for a Nacos restart, the metadata loss issue still persists.

The root cause of this bug is that Nacos has an ExpiredClientCleaner that removes all expired clients.

Consider the three Nacos servers Nacos1, Nacos2 and Nacos3:

  1. The client connects to Nacos1 with client ID client_1.
  2. The client_1 connection data is synced to Nacos2 and Nacos3.
  3. Set the instance to offline on the Nacos console.
  4. Restart Nacos1.
  5. The client reconnects to Nacos2 with client ID client_2.
  6. Since there are no more heartbeats from client_1, and Nacos1 never sent a Client-Delete event to Nacos2 and Nacos3, the client_1 connection data is still present on Nacos2 and Nacos3. Nacos2 and Nacos3 therefore consider client_1 expired, and after 3 minutes they trigger the clean task in ExpiredClientCleaner.
  7. ExpiredClientCleaner publishes a ClientDisconnectEvent.
  8. NamingMetadataManager receives the ClientDisconnectEvent and marks the instance metadata as expired.
  9. After 1 minute, the instance metadata is deleted (see the simplified sketch after this list).
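
To make that chain concrete, here is a heavily simplified, purely illustrative sketch of the expiry flow. The class is made up for this issue and is not the actual Nacos implementation; only the 1-minute grace period and the "mark all of the client's instances expired" behavior from the steps above are kept:

// Purely illustrative -- not the real Nacos code. It condenses steps 7-9 above.
import java.util.HashMap;
import java.util.Map;

class MetadataExpirySketch {

    // Step 9: metadata is deleted about 1 minute after being marked expired (assumed value).
    static final long METADATA_EXPIRED_MS = 60 * 1000;

    // instance key -> metadata (e.g. the "offline" flag set on the console)
    final Map<String, Map<String, String>> metadata = new HashMap<>();
    // instance key -> deadline after which the metadata is deleted
    final Map<String, Long> expireAt = new HashMap<>();

    // Steps 7-8: when client_1 is considered expired, a disconnect event is published and
    // every instance published by client_1 is marked expired -- without checking whether
    // the same instance has been re-registered through client_2.
    void onClientDisconnected(Iterable<String> instancesOfClient, long now) {
        for (String instanceKey : instancesOfClient) {
            expireAt.put(instanceKey, now + METADATA_EXPIRED_MS);
        }
    }

    // Step 9: a periodic sweep deletes the expired metadata, so the offline flag is lost.
    void sweep(long now) {
        expireAt.entrySet().removeIf(entry -> {
            if (now >= entry.getValue()) {
                metadata.remove(entry.getKey());
                return true;
            }
            return false;
        });
    }
}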
KomachiSion (Collaborator)

Does "restart Nacos server" mean restarting one node or restarting the whole cluster?

KomachiSion added the kind/bug, area/Naming and kind/discussion labels Mar 28, 2024
guozongkang (Contributor) commented Mar 28, 2024

Does "restart Nacos server" mean restarting one node or restarting the whole cluster?

Only the Nacos node that holds the client's long-lived gRPC connection is restarted; the whole cluster does not need to be restarted.

nkorange (Collaborator, Author) commented Apr 2, 2024

A quick fix might be to check whether the instance still exists before marking its metadata as expired in NamingMetadataManager:

private void updateExpiredInfo(boolean expired, ExpiredMetadataInfo expiredMetadataInfo) {

    Instance instance = queryInstance(...); // newly added code: look up the instance first

    // Only queue the metadata for expiration if the instance really no longer exists;
    // otherwise make sure it is removed from the expiration queue.
    if (expired && instance == null) {
        expiredMetadataInfos.add(expiredMetadataInfo);
    } else {
        expiredMetadataInfos.remove(expiredMetadataInfo);
    }
}

But in the long term, I think this part should be redesigned: instance metadata should be bound to an abstract session (similar to a lease in etcd), instead of to a connection.
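
A rough sketch of what such a session abstraction could look like; the names below are hypothetical (not existing Nacos classes) and only illustrate the idea that metadata outlives any single connection as long as the session keeps being renewed:

// Hypothetical sketch -- these classes do not exist in Nacos; they only illustrate the idea.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NamingSession {

    final String sessionId;          // stable identity across reconnects, unlike a connection ID
    volatile long expireAtMillis;    // renewed by whichever connection the client currently uses
    final Map<String, Map<String, String>> instanceMetadata = new ConcurrentHashMap<>();

    NamingSession(String sessionId, long ttlMillis) {
        this.sessionId = sessionId;
        this.expireAtMillis = System.currentTimeMillis() + ttlMillis;
    }

    // Called on every heartbeat; a server restart only matters if no renewal arrives within the TTL.
    void renew(long ttlMillis) {
        expireAtMillis = System.currentTimeMillis() + ttlMillis;
    }

    // Metadata is only dropped when the whole session expires, not when a single connection dies.
    boolean isExpired(long now) {
        return now >= expireAtMillis;
    }
}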

@KomachiSion Any thoughts on this?

KomachiSion (Collaborator)

  1. I think the check should be done in the Cleaner.
  2. How should queryInstance work? If it uses ServiceStorage, that is a cache which might be updated with a slight delay. If it queries by traversing, could that cause a performance problem?

nkorange (Collaborator, Author) commented Apr 4, 2024

How should queryInstance work? If it uses ServiceStorage, that is a cache which might be updated with a slight delay. If it queries by traversing, could that cause a performance problem?

I don't know a better solution; I would use ServiceStorage.getPushData(...) to query the instance. As I mentioned, this is a temporary fix, and I think this whole expiration mechanism should be refactored.

KomachiSion (Collaborator)

How should queryInstance work? If it uses ServiceStorage, that is a cache which might be updated with a slight delay. If it queries by traversing, could that cause a performance problem?

I don't know a better solution; I would use ServiceStorage.getPushData(...) to query the instance. As I mentioned, this is a temporary fix, and I think this whole expiration mechanism should be refactored.

Let's fix it this way first; the performance problem can be improved later, but if we can find a better approach now, prefer the one with better performance.

  1. Do the check in the Cleaner, which is what actually deletes the metadata.
  2. Use ServiceStorage.getPushData(...) first to get the real data (see the sketch below).
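
A fragment-level sketch of that guard, in the same spirit as the snippet above; the exact types, accessors and field names in the Nacos codebase may differ, so treat them and the key format as assumptions:

// Sketch only -- exact Nacos types/accessors may differ. The cleaner asks ServiceStorage
// for the current push data and keeps the metadata if the instance is still present.
private boolean instanceStillExists(Service service, String instanceKey) {
    // ServiceStorage is a cache and may lag slightly behind, as noted above.
    ServiceInfo serviceInfo = serviceStorage.getPushData(service);
    for (Instance each : serviceInfo.getHosts()) {
        // Assumed key format "ip:port"; adjust to whatever key the metadata is stored under.
        if (instanceKey.equals(each.getIp() + ":" + each.getPort())) {
            return true;
        }
    }
    return false;
}

// In the cleaner, only delete when the instance is really gone
// (assumed accessor and helper names):
// if (!instanceStillExists(expiredInfo.getService(), expiredInfo.getMetadataId())) {
//     removeExpiredInstanceMetadata(expiredInfo);
// }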

nkorange (Collaborator, Author) commented Apr 8, 2024

Okay, let me create a PR
