Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove acl-check and cancel instead when REPLCONF ACK fails to validate #2920

Merged
merged 8 commits into from
May 1, 2024
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
8 changes: 7 additions & 1 deletion src/server/main_service.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1147,7 +1147,13 @@ void Service::DispatchCommand(CmdArgList args, facade::ConnectionContext* cntx)
// Bonus points because this allows to continue replication with ACL users who got
// their access revoked and reinstated
if (cid->name() == "REPLCONF" && absl::EqualsIgnoreCase(ArgS(args_no_cmd, 0), "ACK")) {
LOG(ERROR) << "Tried to reply to REPLCONF";
// TODO remove this comment when we remove the acl-checker
// This code will allow us to transition on later versions without
// breaking replica/master setup with different versions of dragonfly
auto info_ptr = server_family_.GetReplicaInfo(dfly_cntx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adiholden please help me reviewing these changes

if (info_ptr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont you need to call DflyCmd::CancelReplication?

info_ptr->cntx.Cancel();
}
return;
}
dfly_cntx->SendError(std::move(*err));
Expand Down
4 changes: 4 additions & 0 deletions src/server/protocol_client.h
Original file line number Diff line number Diff line change
Expand Up @@ -116,6 +116,10 @@ class ProtocolClient {
return sock_.get();
}

void SetSocketTimeout(uint32_t msec) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this called ?

sock_->set_timeout(msec);
}

private:
ServerContext server_context_;

Expand Down
2 changes: 2 additions & 0 deletions src/server/replica.cc
Original file line number Diff line number Diff line change
Expand Up @@ -893,6 +893,7 @@ void Replica::RedisStreamAcksFb() {
}
}

// TODO(kostasrim): Remove this on 20/6/2024
Copy link
Collaborator

@romange romange Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused. Should not you remove the replica side checks so that later you will be able to remove master side responses?
Where do you remove acl-check as written in the PR title?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my bad here. While I was making the changes I realized that the code I introduced above is enough to allow us a smooth transition (that means as long as two subsequent versions of df contain the code below we can remove both the acl-checks and be fine without the clients noticing.

      auto info_ptr = server_family_.GetReplicaInfo(dfly_cntx);
      if (info_ptr) {
        info_ptr->cntx.Cancel();
      }
      return;

However, I do agree we should remove this now. I will push an update soon.

class AclCheckerClient : public ProtocolClient {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a small bug if master is an older version that we cover in our tests. The problem is that if ACL of the masteruser changes then REPLCONF ACK will start failing silently without noticing and replication will break :(

I guess this setup is more probably since upgrading from one version to the other would involve a replica copying the data and then being promoted to master. An intermediate solution would be to keep this check for older version masters and deprecate/completely remove at some later point.

I am open for suggestions

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine, the use-case is not interesting.

public:
AclCheckerClient(ServerContext server, Context* cntx)
Expand All @@ -918,6 +919,7 @@ class AclCheckerClient : public ProtocolClient {
LOG(INFO) << "Failed to connect with acl client " << ec.message();
cntx_->Cancel();
}
SetSocketTimeout(4000);
}

Context* cntx_;
Expand Down
6 changes: 5 additions & 1 deletion src/server/server_family.h
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
#include "facade/reply_builder.h"
#include "server/channel_store.h"
#include "server/detail/save_stages_controller.h"
#include "server/dflycmd.h"
#include "server/engine_shard_set.h"
#include "server/replica.h"
#include "server/server_state.h"
Expand Down Expand Up @@ -47,7 +48,6 @@ class Journal;
class ClusterFamily;
class ConnectionContext;
class CommandRegistry;
class DflyCmd;
class Service;
class ScriptMgr;

Expand Down Expand Up @@ -215,6 +215,10 @@ class ServerFamily {
bool HasReplica() const;
std::optional<Replica::Info> GetReplicaInfo() const;

std::shared_ptr<DflyCmd::ReplicaInfo> GetReplicaInfo(ConnectionContext* cntx) const {
return dfly_cmd_->GetReplicaInfo(cntx);
}

void OnClose(ConnectionContext* cntx);

void BreakOnShutdown();
Expand Down