
Multi-instance raft #1759

Open · mbanck opened this issue Nov 14, 2020 · 6 comments

@mbanck (Contributor) commented Nov 14, 2020

It seems pure raft based on pysyncobj does not namespace the configuration values the way, e.g., etcd does (which uses /$SCOPE/ as a prefix/namespace). So you can't run two Patroni instances on the same raft port.

Is this not possible due to restrictions in the way the key-value store works (no directories/nesting), or even because it would be intrinsically tied to the Patroni instance creating it (maybe like the system identifier in PostgreSQL)?

It would of course be possible to just select a different port for each instance, but this makes it more cumbersome to set up multiple instances the way Debian does it, because the packaging currently assumes the DCS configuration is the same for each Patroni instance.

As an aside, is there some default port assigned for raft? I see both 1234 and 222[234] in the code.

Maybe it could be made to just use the API port + 10000 (so 18008 by default) if no port is specified (assuming that this class could even get at the API port)?

@CyberDem0n (Collaborator) commented Nov 14, 2020

> It seems pure raft based on pysyncobj does not namespace the configuration values the way, e.g., etcd does (which uses /$SCOPE/ as a prefix/namespace).

In this regard, the raft implementation isn't different from any other DCS: it uses /$namespace/$scope/ as a prefix for keys.

> So you can't run two Patroni instances on the same raft port.

IMO, the main use case for raft is when you have just a single cluster with two or three PostgreSQL nodes.
In theory (and I believe it wouldn't be hard to show in practice), it is possible to run a pysyncobj cluster of three patroni_raft_controller processes. This cluster could then be used by Patroni: you just don't specify raft.self_addr in patroni.yaml, and Patroni will not listen for incoming connections but will instead connect to the aforementioned cluster. But again, IMO, instead of such a setup it would be better to rely on a well-tested DCS like Etcd, Consul, or Zookeeper.
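For illustration, a minimal sketch of what such a client-only raft section in patroni.yaml might look like (the addresses are borrowed from the status output quoted below; treat the exact layout and the data_dir path as assumptions, not a tested configuration):

raft:
  data_dir: /var/lib/patroni/raft
  # no self_addr: this Patroni instance does not listen, it only connects out
  partner_addrs:
    - 192.168.122.114:4321
    - 192.168.122.211:4321
    - 192.168.122.38:4321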

> It would of course be possible to just select a different port for each instance, but this makes it more cumbersome to set up multiple instances the way Debian does it, because the packaging currently assumes the DCS configuration is the same for each Patroni instance.

Yes, that's the major difference between raft and the other DCS options: you have to specify in the config the list of the other nodes of this specific PostgreSQL cluster.

> As an aside, is there some default port assigned for raft? I see both 1234 and 222[234] in the code.

I don't think that pysyncobj recommends or advertises any default port. It just happened that in my experiments I used some "random" ports, so they ended up in the sample configs and code.

> Maybe it could be made to just use the API port + 10000 (so 18008 by default) if no port is specified (assuming that this class could even get at the API port)?

The thing is that DCS implementations don't know anything about the REST API. Even the code which handles the config file doesn't care much about the DCS; it only tries to match the available DCS modules against the config. I.e., it knows that there is, for example, an etcd3.py in patroni/dcs and checks whether there is an etcd3 section in the config. After that it will try to load patroni.dcs.etcd3 and use it. When the DCS class is created, it gets only the part of the config which it needs. In addition to that, it is possible to change the REST API port on the fly, while the raft port must be static, because all raft nodes identify themselves by host:port.
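A simplified sketch of that matching logic (hypothetical code, not a copy of Patroni's actual implementation):

import importlib
import pkgutil

import patroni.dcs

def find_dcs(config):
    # modules shipped in patroni/dcs: etcd.py, etcd3.py, raft.py, ...
    for module_info in pkgutil.iter_modules(patroni.dcs.__path__):
        name = module_info.name
        if name in config:  # e.g. an 'etcd3' section exists in patroni.yaml
            module = importlib.import_module('patroni.dcs.' + name)
            # the DCS implementation only ever sees its own config section
            return module, config[name]
    raise SystemExit('no supported DCS section found in the config')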

It seems that you put some effort into the raft implementation, so I should warn you that it is not yet production-ready.
Since the Patroni 2.0.0 release, a few problems have been identified in pysyncobj. Some of them were fixed by the maintainer; fixes for others I contributed myself.
At least one issue is still open. There is a sort of workaround in Patroni (enforcing log compaction on topology change), but it would be better to wait for a permanent solution in pysyncobj. Another problem is that traffic encryption with a password doesn't work; this is partially due to some bugs in Patroni and partially due to bugs in pysyncobj that are already fixed, but we have to wait for the next release.

@mbanck (Contributor, Author) commented Nov 14, 2020

> It seems pure raft based on pysyncobj does not namespace the configuration values the way, e.g., etcd does (which uses /$SCOPE/ as a prefix/namespace).

> In this regard, the raft implementation isn't different from any other DCS: it uses /$namespace/$scope/ as a prefix for keys.

Hrm, OK. I only did a quick syncobj_admin -conn 192.168.122.114:4321 -status and saw a flat list, like:

root@pg1:~# syncobj_admin -conn 192.168.122.114:4321 -status | head
commit_idx: 7602
enabled_code_version: 0
last_applied: 7602
leader: 192.168.122.114:4321
leader_commit_idx: 7602
log_len: 78
match_idx_count: 2
match_idx_server_192.168.122.211:4321: 7602
match_idx_server_192.168.122.38:4321: 7602
next_node_idx_count: 2

I now realize that this is probably just pysyncobj-internal stuff; I'll have to look into how to get at the Patroni DCS data itself. Is there some CLI way to get at it?

When I try to start a second Patroni cluster on the same raft port, I just get INFO: waiting on raft on all nodes, so I assumed the port clash was the problem.

> So you can't run two Patroni instances on the same raft port.

> IMO, the main use case for raft is when you have just a single cluster with two or three PostgreSQL nodes.
> In theory (and I believe it wouldn't be hard to show in practice), it is possible to run a pysyncobj cluster of three patroni_raft_controller processes. This cluster could then be used by Patroni: you just don't specify raft.self_addr in patroni.yaml, and Patroni will not listen for incoming connections but will instead connect to the aforementioned cluster. But again, IMO, instead of such a setup it would be better to rely on a well-tested DCS like Etcd, Consul, or Zookeeper.

I see; I think one of the main advantages of raft would be that you don't need another service besides Patroni, and having to run patroni_raft_controller would defeat that.

> Maybe it could be made to just use the API port + 10000 (so 18008 by default) if no port is specified (assuming that this class could even get at the API port)?

> The thing is that DCS implementations don't know anything about the REST API. Even the code which handles the config file doesn't care much about the DCS; it only tries to match the available DCS modules against the config. I.e., it knows that there is, for example, an etcd3.py in patroni/dcs and checks whether there is an etcd3 section in the config. After that it will try to load patroni.dcs.etcd3 and use it. When the DCS class is created, it gets only the part of the config which it needs. In addition to that, it is possible to change the REST API port on the fly, while the raft port must be static, because all raft nodes identify themselves by host:port.

Right, I took a look at hacking that in but didn't get far.

> It seems that you put some effort into the raft implementation, so I should warn you that it is not yet production-ready.
> Since the Patroni 2.0.0 release, a few problems have been identified in pysyncobj. Some of them were fixed by the maintainer; fixes for others I contributed myself.
> At least one issue is still open. There is a sort of workaround in Patroni (enforcing log compaction on topology change), but it would be better to wait for a permanent solution in pysyncobj. Another problem is that traffic encryption with a password doesn't work; this is partially due to some bugs in Patroni and partially due to bugs in pysyncobj that are already fixed, but we have to wait for the next release.

OK, thanks for the heads-up. I did not spend a lot of time on it; I just tried to see whether it's possible to easily integrate raft into the Debian package, now that pysyncobj is in Debian.

@mbanck (Contributor, Author) commented Nov 15, 2020

> When I try to start a second Patroni cluster on the same raft port, I just get INFO: waiting on raft on all nodes, so I assumed the port clash was the problem.

So, one thing is that having the same data_dir should also clash, as the .journal files would have the same file names; there might even be data corruption if both pysyncobj instances use them?

Otherwise, I would expect binding to the same port to just fail, but it seems pysyncobj just sits idle in a loop (INFO: waiting on raft); if it will never work, it should fail early. Probably a pysyncobj issue, though?

I tried starting patroni_raft_controller twice with the same config, and while the KVStoreTTL() call returns for the first instance, it never returns for the second.

@mbanck (Contributor, Author) commented Nov 15, 2020

> I tried starting patroni_raft_controller twice with the same config, and while the KVStoreTTL() call returns for the first instance, it never returns for the second.

I just noticed that the second, waiting instance spins in a tight loop and takes 100% CPU.

@CyberDem0n (Collaborator)

> I now realize that this is probably just pysyncobj-internal stuff; I'll have to look into how to get at the Patroni DCS data itself. Is there some CLI way to get at it?

Exactly: syncobj_admin only reports the internal state of pysyncobj and doesn't know anything about the distributed key-value store implemented in Patroni on top of pysyncobj. Currently there is no good interface for looking at what is inside; patronictl works only with the state of a single cluster, and there is no way to list all clusters.

> When I try to start a second Patroni cluster on the same raft port, I just get INFO: waiting on raft on all nodes, so I assumed the port clash was the problem.

Unfortunately, pysyncobj is not very good at providing feedback about what is going on. It tries to bind to the port, and if the bind fails it immediately retries without even writing anything to the log; that is why the CPU usage jumps to 100%. It is not possible to run multiple Patroni instances on the same host with raft on the same port.
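A minimal sketch of that failure mode (an assumed simplification, not pysyncobj's actual code): a failed bind() is retried immediately, with no back-off and no log message, which pins one CPU core at 100%.

import socket

def bind_retry_forever(host, port):
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock    # success: the port was free after all
        except OSError:    # e.g. EADDRINUSE: another instance holds the port
            sock.close()   # retry immediately and silently -> busy loop, 100% CPU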

> I see; I think one of the main advantages of raft would be that you don't need another service besides Patroni, and having to run patroni_raft_controller would defeat that.

Yes, this is the main advantage, but it is useful only when you need to manage just one cluster. If you need to run more than one, it is better to invest in setting up, for example, Etcd.

The only reason to have patroni_raft_controller is to be able to run Patroni on raft with only two Postgres nodes; in such a case you have to run patroni_raft_controller on a third node for quorum. As I already said, one could also use patroni_raft_controller processes running on different nodes to set up the equivalent of an Etcd cluster, but IMO it is not the best idea.

> So, one thing is that having the same data_dir should also clash, as the .journal files would have the same file names; there might even be data corruption if both pysyncobj instances use them?

Yes, that's definitely an issue: two instances with the same config will try to use the same files and might damage the raft logs and dump files. Although it isn't as dangerous as it sounds: the log is only mirrored to the file, while the content is kept in memory, so only on restart could an instance read something unexpected, and even in that case it will get the latest snapshot from the leader.

> Otherwise, I would expect binding to the same port to just fail, but it seems pysyncobj just sits idle in a loop (INFO: waiting on raft); if it will never work, it should fail early. Probably a pysyncobj issue, though?

It fails for sure, but pysyncobj is very silent about that and simply retries without a back-off :(
The INFO: waiting on raft message is produced by Patroni while it is waiting for pysyncobj to report that it has become ready, but since the port is already occupied, that never happens.

@mbanck (Contributor, Author) commented Dec 1, 2020

> Maybe it could be made to just use the API port + 10000 (so 18008 by default) if no port is specified (assuming that this class could even get at the API port)?

> The thing is that DCS implementations don't know anything about the REST API. Even the code which handles the config file doesn't care much about the DCS; it only tries to match the available DCS modules against the config. I.e., it knows that there is, for example, an etcd3.py in patroni/dcs and checks whether there is an etcd3 section in the config. After that it will try to load patroni.dcs.etcd3 and use it. When the DCS class is created, it gets only the part of the config which it needs.

I thought about this some more. I think the DCS setup being encapsulated and insulated from Patroni makes sense for the other DCS providers, but for RAFT I believe it would be beneficial to have one RAFT instance per Patroni instance, and maybe let the RAFT code know a bit more about the config (in particular, the REST API port) in order to deduce a RAFT port ab initio, without having to touch the DCS config for each instance. This could be done (I've got a PoC working and will open a PR in a bit) by exposing restapi in dcs/__init__.py like this:

                         # propagate some parameters
                         config[name].update({p: config[p] for p in ('namespace', 'name', 'scope', 'loop_wait',
-                                             'patronictl', 'ttl', 'retry_timeout') if p in config})
+                                             'patronictl', 'ttl', 'retry_timeout', 'restapi') if p in config})
                         return item(config[name])

The API port can then be extracted from config['restapi']['listen'], and the RAFT port set to that value prepended with a 1 (i.e., the API port + 10000). Of course, the user can still configure RAFT with an explicit port in the DCS config if they choose.
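A sketch of that derivation (a hypothetical helper; the function name is mine, not the PoC's):

def default_raft_port(config):
    listen = config['restapi']['listen']      # e.g. '0.0.0.0:8008'
    api_port = int(listen.rsplit(':', 1)[1])  # rsplit, in case the host part contains colons
    return api_port + 10000                   # 8008 -> 18008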

Another possibility would be the Postgres port, but that one is hidden even further down in config['postgresql'].

> In addition to that, it is possible to change the REST API port on the fly, while the raft port must be static, because all raft nodes identify themselves by host:port.

Is that done a lot in practice? I think it would be OK to just use the REST API port that the Patroni configuration mentioned at startup time and keep using that.

> It seems that you put some effort into the raft implementation, so I should warn you that it is not yet production-ready.

I think it would be wise to mark features like that as "Preview" or "Beta", as users have come to expect Patroni to be a pretty mature and solid product; they might assume that once you've cut a release with it, the feature must be ready for prime time.

mbanck pushed a commit to credativ/patroni that referenced this issue Dec 1, 2020
If no ports are specified for self_addr, bind_addr and/or listen_addrs, then Patroni
will self-assign the startup API port + 10000 as RAFT port.
mbanck pushed a commit to credativ/patroni that referenced this issue Dec 1, 2020
If no ports are specified for self_addr, bind_addr and/or partner_addrs, then
Patroni will self-assign the startup API port + 10000 as RAFT port.
mbanck pushed a commit to credativ/patroni that referenced this issue Dec 1, 2020
If no ports are specified for self_addr, bind_addr and/or partner_addrs, then
Patroni will self-assign the startup API port + 10000 as RAFT port.
mbanck pushed a commit to credativ/patroni that referenced this issue Oct 17, 2021
If no ports are specified for self_addr, bind_addr and/or partner_addrs, then
Patroni will self-assign the startup API port + 10000 as RAFT port.
mbanck pushed a commit to credativ/patroni that referenced this issue Dec 31, 2021
If no ports are specified for self_addr, bind_addr and/or partner_addrs, then
Patroni will self-assign the startup API port + 10000 as RAFT port.
mbanck pushed a commit to credativ/patroni that referenced this issue Feb 10, 2022
If no ports are specified for self_addr, bind_addr and/or partner_addrs, then
Patroni will self-assign the startup API port + 10000 as RAFT port.