
Removing nodes creates multiple processes out of 1 process #207

Open
KamilZielinski opened this issue Jun 28, 2020 · 19 comments
Labels
bug Something isn't working

Comments

@KamilZielinski

KamilZielinski commented Jun 28, 2020

Hey guys. I was learning how the library works and ran into a weird issue that I can't really understand or find an explanation for in the docs: why does 1 process, created within a 3-node cluster, end up as 8 processes after removing 2 nodes from the cluster?

Setup

Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
IEx 1.10.2 (compiled with Erlang/OTP 22)

Deps (I'm using 0.8.0-rc.1 as it solved the issue with init/1):

{:phoenix, "~> 1.5.3"},
{:phoenix_ecto, "~> 4.1"},
{:ecto_sql, "~> 3.4"},
{:postgrex, ">= 0.0.0"},
{:phoenix_live_dashboard, "~> 0.2.0"},
{:telemetry_metrics, "~> 0.4"},
{:telemetry_poller, "~> 0.4"},
{:gettext, "~> 0.11"},
{:jason, "~> 1.0"},
{:plug_cowboy, "~> 2.0"},
{:horde, "~> 0.8.0-rc.1"}

Issue

I have a DynamicSupervisor(members: :auto) which creates 1 process (transient, called Conversation in my example) with a random name on an API request. I don't worry about moving the state between nodes for now. The process just waits for 60s and then dies.
My setup: 3 nodes (Phoenix APIs) started locally on different ports (Node1, 2 and 3 from left to right on the screenshot).

  1. Node1 takes the request (I hit the API's URL on its port) and is supposed to create 1 process. It gets created on Node3. All available nodes show a process count of 1.
  2. A few seconds later I send SIGTERM to Node3 from a different console. Node2 schedules 2 processes instead of 1. All available nodes show a process count of 2.
  3. A few seconds later I kill the app with Ctrl+C in the Node2 console. Node1 schedules 8 processes instead of 1 or 2 (or even 4, as I'd assume based on the previous step). Node1 shows 8 processes running.

[Screenshot: Zrzut ekranu 2020-06-28 o 03 37 20-kopia — the three node consoles showing the process counts]
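(The Conversation module itself isn't included in the issue; a minimal sketch of the kind of worker described above could look like the following. The supervisor name MyApp.DistributedSupervisor is hypothetical; the transient restart and the 60s lifetime come from the description.)

defmodule Conversation do
  use GenServer, restart: :transient

  def start_link(name) do
    GenServer.start_link(__MODULE__, name)
  end

  def init(name) do
    # the worker just waits 60 seconds and then stops normally
    Process.send_after(self(), :stop, 60_000)
    {:ok, name}
  end

  def handle_info(:stop, name) do
    {:stop, :normal, name}
  end
end

# started from the API request handler with a random name, e.g.:
# Horde.DynamicSupervisor.start_child(MyApp.DistributedSupervisor, {Conversation, random_name})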

@derekkraan added the bug label on Oct 29, 2020
@philipgiuliani

philipgiuliani commented Nov 17, 2020

I am also seeing duplicated processes in my cluster. We have implemented the :name_conflict message, but my two processes are running concurrently without either of them receiving the message.
It also seems to happen when starting/stopping servers in the cluster.

@derekkraan
Owner

Could someone make a minimal application that exhibits this behaviour?

@KamilZielinski
Author

KamilZielinski commented Nov 18, 2020

Sure, I'll share the same code I used back then in a few hours.
edit: Unfortunately, I didn't have time to do it yesterday. I'll do it over the weekend.

@philipgiuliani

Hi @KamilZielinski,
thanks for your effort! Have you already had time to look into it?

@KamilZielinski
Author

@philipgiuliani Hey, I already have it on my GitHub. I need to clean it up a bit and describe the scenario to reproduce. Sorry for the delay, I will try to drop it here soon.

@KamilZielinski
Author

KamilZielinski commented Nov 25, 2020

Hey guys, sorry for the delay. It seems I didn't push the correct version many months ago. Fortunately, I still had the changes locally. Long live Git! (or something 😂)

Repo is here: https://github.com/KamilZielinski/horderoro
The code might be messy, I was still learning it. It might even be my mistake, who knows.

Start nodes 1/2/3 on ports 4000/4001/4002 (simply change the port in dev.exs before you run the next node; I had some issues with passing the port in the run command, not sure why):

$ iex --name [email protected] --cookie asdf -S mix phx.server
$ iex --name [email protected] --cookie asdf -S mix phx.server
$ iex --name [email protected] --cookie asdf -S mix phx.server

Join them together, e.g. from node1:

iex([email protected])2> Node.connect(:"[email protected]")
iex([email protected])3> Node.connect(:"[email protected]")

and verify that the nodes are connected, e.g. from Node3:
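(For example, Node.list/0 run on any node should list the other two, assuming all three joined successfully:)

iex([email protected])1> Node.list()
[:"[email protected]", :"[email protected]"]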

Now hit the [GET] http://localhost:4000/api/conversations (node1) endpoint to create a new worker. It may be spawned on any of the nodes. Kill the node it was created on and you will see the process(es) move to another node; then kill that node too (the one the process moved to) and you will see multiple processes spawned on the last node.

@dbuos
Contributor

dbuos commented Jan 21, 2021

Hi everyone, any updates on this issue? I'm seeing very strange behaviour when nodes are removed from the cluster (testing with a huge number of processes) and I suspect it has the same cause.

@bigardone

The same issue is happening to me :(

@fidr

fidr commented Apr 16, 2021

Script to reproduce:

defmodule TestServer do
  use GenServer

  def start_link(val) do
    GenServer.start_link(__MODULE__, val)
  end

  def init(val) do
    IO.puts "starting TestServer #{inspect val}"
    {:ok, val}
  end
end

{:ok, _} =
  Horde.DynamicSupervisor.start_link(
    name: :sup_1,
    strategy: :one_for_one,
    delta_crdt_options: [sync_interval: 20]
  )

{:ok, _} =
  Horde.DynamicSupervisor.start_link(
    name: :sup_2,
    strategy: :one_for_one,
    delta_crdt_options: [sync_interval: 20]
  )

Horde.Cluster.set_members(:sup_1, [:sup_1, :sup_2])

Process.sleep(200)

{:ok, _} = Horde.DynamicSupervisor.start_child(:sup_1, {TestServer, "foo1"})
{:ok, _} = Horde.DynamicSupervisor.start_child(:sup_1, {TestServer, "foo2"})

{:ok, _pid} = Horde.DynamicSupervisor.start_child(:sup_2, {TestServer, "foo3"})
{:ok, _pid} = Horde.DynamicSupervisor.start_child(:sup_2, {TestServer, "foo4"})

Process.sleep(200)

stop_sup_2 = fn ->
  # remove sup_2 and stop it
  Horde.Cluster.set_members(:sup_1, [:sup_1])
  Horde.DynamicSupervisor.stop(:sup_2)
  Process.sleep(200)
end

restart_sup_2 = fn ->
  # start sup_2 and add it
  {:ok, _} =
    Horde.DynamicSupervisor.start_link(
      name: :sup_2,
      strategy: :one_for_one,
      delta_crdt_options: [sync_interval: 20]
    )
  Horde.Cluster.set_members(:sup_1, [:sup_1, :sup_2])
  Process.sleep(200)
end

count_processes_in_crdt = fn ->
  Enum.filter(DeltaCrdt.read(:"sup_1.Crdt"), fn
    {{:process, _}, _} -> true
    _ -> false
  end)
  |> length()
end

count_processes_in_supervisor = fn ->
  length(Horde.DynamicSupervisor.which_children(:sup_1))
end

4 = count_processes_in_crdt.()
4 = count_processes_in_supervisor.()

stop_sup_2.()
restart_sup_2.()
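
# After stopping sup_2 and bringing it back, the handed-over children are
# tracked under new random IDs while the old CRDT entries are never removed,
# so both counts keep growing on every cycle instead of returning to 4.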

6 = count_processes_in_supervisor.()
8 = count_processes_in_crdt.()

stop_sup_2.()
restart_sup_2.()

10 = count_processes_in_supervisor.()
12 = count_processes_in_crdt.()

@bartj3

bartj3 commented May 26, 2021

Hi,
At T-Mobile we are running into the same bug, and we've spent some time diving into Horde to see if we could tackle it.

Upon inspection of the CRDT state during the test script, we noticed that after stopping a supervisor, its children are handed over with new random IDs, but the old children are not removed from the CRDT. Our assumption at this point is that in the old (pre-randomized-ID) version a cleanup was not needed, because by using the same ID you would overwrite instead of add.
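A quick way to see this (a minimal sketch assuming the supervisor names from the reproduction script above, so the CRDT process is registered as :"sup_1.Crdt") is to list the child IDs still tracked in the CRDT after a handover; the stale entries show up as extra {:process, child_id} keys:

:"sup_1.Crdt"
|> DeltaCrdt.read()
|> Enum.filter(&match?({{:process, _child_id}, _value}, &1))
|> Enum.map(fn {{:process, child_id}, _value} -> child_id end)
|> IO.inspect(label: "child ids tracked in the CRDT")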

So there are two options here:

  1. Either we no longer randomize the child ID on handover, meaning we would overwrite the old process, or
  2. We delete the old process from the CRDT during handover.

The first solution is already suggested in PR #226; to us that felt somewhat scary because we don't understand why the randomization was added to begin with. So we explored option 2.

During the exploration of option 2 we realized that the :active/:passive setup in the handover means that during a :passive handover no process is terminated, so disown_child_process is never called either.

We decided to call this on the recipient supervisor during handover, because at that point we are sure that the old supervisor is dead, as the code already validates that. This resulted in PR #231. It is still a WIP, as we'd like to add tests to prove that this does indeed work and does not break the :active handover.

@l3nz

l3nz commented Jun 24, 2022

I am seeing similar behavior. I have some supervisors that start and register. I understood that registering multiple times under the same name would not be allowed, and that on a merge, if the same process had started on multiple nodes, one of them would be killed.

In fact, I see that many processes are present multiple times.

The process is started by calling:

Horde.DynamicSupervisor.start_child(@supervisor, {Logfwd.Trk.StatusSrv, []})

when each application boots. What I get is that it either starts, or it is found to be a duplicate and not started. All is good.
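(For context, a minimal sketch of how such a singleton is presumably registered; the registry name Logfwd.HordeRegistry is an assumption, only Logfwd.HordeSupervisor appears below. Registering under a via tuple is what makes a duplicate start return {:error, {:already_started, pid}} instead of starting a second copy.)

defmodule Logfwd.Trk.StatusSrv do
  use GenServer

  def start_link(_args) do
    # a second registration under the same name fails, so the duplicate is not started
    GenServer.start_link(__MODULE__, [], name: via_tuple())
  end

  def init(state), do: {:ok, state}

  # the registry name is hypothetical; the key matches the entry shown below
  defp via_tuple do
    {:via, Horde.Registry, {Logfwd.HordeRegistry, __MODULE__}}
  end
end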

HordeRegistry

Yet if I ask the HordeRegistry about a specific process (which is supposed to be a singleton), it is present just once:

 {Logfwd.Trk.StatusSrv, "-", #PID<46889.30038.0>},

But if I look at the list of processes it is linked to, there are TWO processes for it:

#PID<46889.30038.0> (Logfwd.Trk.StatusSrv.init/1)   <---- active, the registered one
#PID<46889.2452.0> (Logfwd.Trk.StatusSrv.init/1)    <---- gets no input but it's still there

HordeSupervisor

If I ask the HordeSupervisor for the list of all children, I see that the process is present multiple times:

> Horde.DynamicSupervisor.which_children Logfwd.HordeSupervisor
[
  {:undefined, #PID<46889.2452.0>, :worker, [Logfwd.Trk.StatusSrv]},
  {:undefined, #PID<46889.30038.0>, :worker, [Logfwd.Trk.StatusSrv]},
  {:undefined, #PID<46890.21379.0>, :worker, [Logfwd.Trk.StatusSrv]},
  ....
]

I wonder what the point would be of having Horde allow multiple instances of StatusSrv (with the same call parameters). I thought the whole point of Horde was to have a supervisor that keeps cluster-unique processes alive, and that it did so by preventing, and when necessary killing, processes booting with the same name. Or am I missing something trivial?

@bernardo-martinez

The same issue is happening to me. Have you figured out a way to prevent it?

@derekkraan
Owner

@bernardo-martinez I don't have time to dig into this right now unfortunately.

@arjan
Contributor

arjan commented Nov 16, 2022

I think @bartj3's comment above actually explains it pretty well.

@arjan
Contributor

arjan commented Nov 16, 2022

@bernardo-martinez could you try applying the patch from #226? I think it helps. I am not sure why a new random child ID is generated on process handoff; the identity of the child is still the same, it is just being started somewhere else...

@bernardo-martinez

@arjan I get the problem; what I don't get is where in the code I should dive (and how it works internally) in order to fix it. But OK, when I have time I will first go down the path you suggest, and if there's no success I'll go the long way (basically studying all the code). Thanks for the replies :)

@l3nz

l3nz commented Nov 16, 2022

@derekkraan I believe it would be a good idea to put a note in the main README that there are currently issues with process duplication, because this is a major issue. I personally decided to use a different approach because in our environment this was a complete no-go.

I'm totally fine with you not having the time/bandwidth to address these issues, as every open source maintainer knows (we are all in the same boat), but for such a critical piece of infrastructure as Horde, it would definitely be nice to be clear on this topic, e.g. by saying that there are some unsolved major issues and that they won't be solved any time soon.

People (even open source maintainers) have lives and jobs and other interests as well, so I don't think there is any loss of face in being open about it. Horde is still a terrific library.

@rossvz

rossvz commented Nov 29, 2022

I personally decided to use a different approach because in our environment this was a complete no-go.

@l3nz can you elaborate on the workaround you used to avoid this? We're running into the same issue and it's causing problems with our rolling deploys.

@l3nz

l3nz commented Dec 11, 2022

I basically have a singleton-per-cluster process, using the approach from https://github.com/derekkraan/highlander, that maintains a list of processes that must be up and distributes them across the cluster nodes. When there are topology changes, and in any case every few minutes, it checks that all processes that are supposed to be up are actually up and that all nodes are reachable.
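(Roughly, a sketch of that kind of reconciler; the module name and the check interval are hypothetical, only the Highlander dependency comes from the link above. Highlander keeps a single instance of the wrapped child running per cluster.)

defmodule Reconciler do
  use GenServer

  @check_interval :timer.minutes(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(opts) do
    schedule_check()
    {:ok, opts}
  end

  def handle_info(:check, state) do
    # placeholder: compare the processes that must be up against what is
    # actually running, (re)start the missing ones and check node reachability
    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check, @check_interval)
end

# in the application's supervision tree, wrapped so only one instance
# runs per cluster:
# children = [{Highlander, {Reconciler, []}}]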
