feature request: can multiple clusters share the same server #319

Open
StephenPCG opened this issue Oct 20, 2017 · 5 comments

Let me first describe my scenario.

At first, I had a single instance of carbon-c-relay doing some aggregation jobs. Then, as metric volume grew, the relay started running out of CPU, so I wanted to scale out. Since aggregation requires all related metrics to be sent to the same relay instance, I added another layer of relays; let's call this layer relay-gate, and the original layer relay-worker. So I now have 1 instance of relay-gate and 4 instances of relay-worker. relay-gate has the following configuration:

cluster worker1 forward $worker1:2003;
cluster worker2 forward $worker2:2003;
cluster worker3 forward $worker3:2003;
cluster worker4 forward $worker4:2003;

match ^prefix1\..* send to worker1 stop;
match ^prefix2\..* send to worker2 stop;
match ^prefix3\..* send to worker3 stop;
match ^prefix4\..* send to worker4 stop;

The CPU load now splits nicely across multiple machines. However, in this setup each relay-worker is a single point of failure. I would like something like this:

cluster worker1
    failover
        worker1:2003
        worker2:2003
        worker3:2003
        worker4:2003
    ;
cluster worker2
    failover
        worker2:2003
        worker3:2003
        worker4:2003
        worker1:2003
    ;
...

That way, each worker instance is the primary target for some set of metrics, and when one fails, all of its load migrates to another instance. Thus, no worker instance is a single point of failure.

But the current carbon-c-relay does not support such a configuration; it complains:

relay.conf:18:4: cannot share server a.b.c.d:2003 with any_of/failover cluster 'worker2'

Would you consider implementing such a feature? Or do you have any better suggestions for me?

Thanks in advance!

grobian (Owner) commented Oct 20, 2017

This is unrelated to the request, but do you think your workers can handle the load when one worker becomes unavailable? From a failover point of view this feels very unlikely, and it risks a cascading failure, where the workers get overloaded one after the other, each time with an even larger load.
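
To make that concern concrete, here is a minimal sketch (plain Python, with hypothetical 30% starting loads) of the chained failover layout proposed above: every failure dumps the failed worker's entire load onto the next worker in line, so each surviving node carries more than the last.

# Cascading load under chained failover (illustrative numbers only).
def cascade(loads):
    loads = list(loads)
    for failed in range(len(loads) - 1):
        # the failover target absorbs everything the failed worker carried
        loads[failed + 1] += loads[failed]
        loads[failed] = 0
        print("after worker%d fails: worker%d carries %d%% CPU"
              % (failed + 1, failed + 2, loads[failed + 1]))

cascade([30, 30, 30, 30])
# after worker1 fails: worker2 carries 60% CPU
# after worker2 fails: worker3 carries 90% CPU
# after worker3 fails: worker4 carries 120% CPU  (overloaded)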

The reason for not sharing servers between these kinds of clusters is a technical one; solving it isn't going to be trivial, I think, because the failover logic is implemented in the servers themselves.

StephenPCG (Author) commented

I will monitor CPU usage to prevent a node failure due to high load. For example, I will make sure each worker instance occupies no more than 30% CPU, so that if the entire load of a failed instance migrates to another, that server still won't be overloaded (30% + 30% = 60%). If any relay instance exceeds 30% CPU due to growing metric volume, I will add more workers and rebalance the load as soon as I get alerted.

What I want is HA during hardware failure. In my current setup, any single hardware failure disrupts neither the service nor the data (except for relay-worker). I have multiple instances of relay-gate with a load balancer in front of them, and the same for carbonapi. I have many go-carbon instances, with all data replicated at least twice. But relay-worker does not have any failover option now; if one of the workers fails, we lose the data sent to it.
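
(For reference, that kind of storage-side replication is something carbon-c-relay itself can express with a consistent-hash cluster and a replication count; a minimal sketch with made-up hostnames, regardless of whether the replication actually happens here or on the go-carbon side:

cluster storage
    fnv1a_ch replication 2
        store1:2003
        store2:2003
        store3:2003
    ;

match * send to storage stop;

Either way, the point is the same: every datapoint lands on at least two disks.)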

grobian (Owner) commented Oct 20, 2017

I think real HA means you'd have to do it on two nodes at the same time (double), because aggregations depend on state, which gets lost if the engine stops.
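
In carbon-c-relay terms that duplication could look like the sketch below: a forward cluster sends every metric to all of its members, so a pair of workers runs the same aggregations in parallel (hostnames are hypothetical):

cluster worker1_pair
    forward
        worker1a:2003
        worker1b:2003
    ;

match ^prefix1\..* send to worker1_pair stop;

Both members then hold the same aggregation state, at the cost of doubling the worker fleet, and the duplicate aggregate datapoints still have to be deduplicated or tolerated downstream.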

StephenPCG (Author) commented

"Real HA" is really expensive to achieve. If the feature I expected can be implemented, when a worker fails, only aggregated metrics will be affected, and ideally only those at about two time points are corrupted, this is acceptable, compared to losing all metrics during worker failure.

Anyway, if this is not trivial to implement, that's OK; I will look for other solutions. You can close this issue at any time.

grobian (Owner) commented Oct 21, 2017

Indeed, my criticism aside, the problem is an implementation detail. In the past I used a technique to share queues of servers; perhaps I can use that to implement this FR, as well as another request asking about multiple servers for the same destination.
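
For what it's worth, one plausible shape of that queue-sharing idea, sketched in Python rather than the relay's actual C internals: keep a registry keyed by host:port, so every cluster that names the same destination gets a reference to one shared server object and its single queue.

# Illustrative sketch only, not carbon-c-relay's real implementation.
from collections import deque

class Server:
    def __init__(self, host, port):
        self.host, self.port = host, port
        self.queue = deque()  # one outbound queue per destination

_servers = {}  # registry keyed by (host, port)

def get_server(host, port):
    # return the shared Server for host:port, creating it on first use
    if (host, port) not in _servers:
        _servers[(host, port)] = Server(host, port)
    return _servers[(host, port)]

class FailoverCluster:
    def __init__(self, name, members):
        self.name = name
        # clusters hold references, they don't own the servers, so
        # worker2:2003 can appear in several failover clusters at once
        self.members = [get_server(h, p) for h, p in members]

With this, the worker1 and worker2 clusters from the config above would share one queue (and one connection) per destination, instead of each cluster requiring exclusive ownership of its servers.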
