
Removing nodes creates multiple processes out of 1 process #207

Open
KamilZielinski opened this issue Jun 28, 2020 · 19 comments
Labels
bug Something isn't working

Comments

@KamilZielinski

KamilZielinski commented Jun 28, 2020

Hey guys. I was learning how the library works and ran into a weird issue that I can't really understand or find an explanation for in the docs: why does 1 process, created within a 3-node cluster, end up as 8 processes after removing 2 nodes from the cluster?

Setup

Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe] [dtrace]
IEx 1.10.2 (compiled with Erlang/OTP 22)

Deps (I'm using 0.8.0-rc.1 as it solved the issue with init/1):

{:phoenix, "~> 1.5.3"},
{:phoenix_ecto, "~> 4.1"},
{:ecto_sql, "~> 3.4"},
{:postgrex, ">= 0.0.0"},
{:phoenix_live_dashboard, "~> 0.2.0"},
{:telemetry_metrics, "~> 0.4"},
{:telemetry_poller, "~> 0.4"},
{:gettext, "~> 0.11"},
{:jason, "~> 1.0"},
{:plug_cowboy, "~> 2.0"},
{:horde, "~> 0.8.0-rc.1"}

Issue

I have a DynamicSupervisor(members: :auto) which creates 1 process (transient, called Conversation in my example) with a random name on an API request. I don't worry about moving the state between nodes for now. The process just waits for 60s and then dies.
My setup: 3 nodes (Phoenix APIs) started locally on different ports (Node1, 2 and 3 from left to right on the screenshot).

  1. Node1 takes the request (I hit the API's URL on its port) and is supposed to create 1 process. It gets created on Node3. All available nodes show a process count of 1.
  2. A few seconds later I send SIGTERM to Node3 from a different console. Node2 schedules 2 processes instead of 1. All available nodes show a process count of 2.
  3. A few seconds later I kill the app with Ctrl+C in the Node2 console. Node1 schedules 8 processes instead of 1 or 2 (or even 4, as I'd assume based on the previous step). Node1 shows 8 processes running.

[Screenshot: Zrzut ekranu 2020-06-28 o 03 37 20-kopia — the three node consoles showing the process counts]
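(The Conversation module itself isn't included in the issue; a minimal sketch of the kind of worker described above could look like the following. The supervisor name MyApp.DistributedSupervisor is hypothetical; the transient restart and the 60s lifetime come from the description.)

defmodule Conversation do
  use GenServer, restart: :transient

  def start_link(name) do
    GenServer.start_link(__MODULE__, name)
  end

  def init(name) do
    # the worker just waits 60 seconds and then stops normally
    Process.send_after(self(), :stop, 60_000)
    {:ok, name}
  end

  def handle_info(:stop, name) do
    {:stop, :normal, name}
  end
end

# started from the API request handler with a random name, e.g.:
# Horde.DynamicSupervisor.start_child(MyApp.DistributedSupervisor, {Conversation, random_name})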

@derekkraan added the bug label on Oct 29, 2020
@philipgiuliani

philipgiuliani commented Nov 17, 2020

I am also seeing duplicated processes in my cluster. We have implemented the :name_conflict message, but my two processes are running concurrently without either of them receiving the message.
It also seems to happen when starting/stopping servers in the cluster.

@derekkraan
Owner

Could someone make a minimal application that exhibits this behaviour?

@KamilZielinski
Author

KamilZielinski commented Nov 18, 2020

Sure, I'll share the same code I used back then in a few hours.
edit: Unfortunately, I didn't have time to do it yesterday. I'll do it over the weekend.

@philipgiuliani

Hi @KamilZielinski,
thanks for your effort! Have you already had time to look into it?

@KamilZielinski
Author

@philipgiuliani Hey, I already have it on my GitHub. I need to clean it up a bit and describe the scenario to reproduce. Sorry for the delay, I will try to drop it here soon.

@KamilZielinski
Author

KamilZielinski commented Nov 25, 2020

Hey guys, sorry for the delay. It seems I didn't push the correct version many months ago. Fortunately, I still had the changes locally. Long live Git! (or something 😂)

Repo is here: https://github.com/KamilZielinski/horderoro
The code might be messy, I was still learning it. It might even be my mistake, who knows.

Start nodes 1/2/3 on ports 4000/4001/4002 (simply change the port in dev.exs before you run the next node; I had some issues with passing the port in the run command, not sure why):

$ iex --name [email protected] --cookie asdf -S mix phx.server
$ iex --name [email protected] --cookie asdf -S mix phx.server
$ iex --name [email protected] --cookie asdf -S mix phx.server

Join them together, e.g. from node1:

iex([email protected])2> Node.connect(:"[email protected]")
iex([email protected])3> Node.connect(:"[email protected]")

and verify that the nodes are connected, e.g. from Node3:
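(For example, Node.list/0 run on any node should list the other two, assuming all three joined successfully:)

iex([email protected])1> Node.list()
[:"[email protected]", :"[email protected]"]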

Now hit the [GET] http://localhost:4000/api/conversations (node1) endpoint to create a new worker. It may be spawned on any of the nodes. Kill the node it was created on and you will see the process(es) move to another node; then kill that node too (the one the process moved to) and you will see multiple processes spawned on the last node.

@dbuos
Contributor

dbuos commented Jan 21, 2021

Hi everyone, any updates on this issue? I'm seeing very strange behaviour when nodes are removed from the cluster (testing with a huge number of processes) and I suspect it has the same cause.

@bigardone

The same issue is happening to me :(

@fidr

fidr commented Apr 16, 2021

Script to reproduce:

defmodule TestServer do
  use GenServer

  def start_link(val) do
    GenServer.start_link(__MODULE__, val)
  end

  def init(val) do
    IO.puts "starting TestServer #{inspect val}"
    {:ok, val}
  end
end

{:ok, _} =
  Horde.DynamicSupervisor.start_link(
    name: :sup_1,
    strategy: :one_for_one,
    delta_crdt_options: [sync_interval: 20]
  )

{:ok, _} =
  Horde.DynamicSupervisor.start_link(
    name: :sup_2,
    strategy: :one_for_one,
    delta_crdt_options: [sync_interval: 20]
  )

Horde.Cluster.set_members(:sup_1, [:sup_1, :sup_2])

Process.sleep(200)

{:ok, _} = Horde.DynamicSupervisor.start_child(:sup_1, {TestServer, "foo1"})
{:ok, _} = Horde.DynamicSupervisor.start_child(:sup_1, {TestServer, "foo2"})

{:ok, _pid} = Horde.DynamicSupervisor.start_child(:sup_2, {TestServer, "foo3"})
{:ok, _pid} = Horde.DynamicSupervisor.start_child(:sup_2, {TestServer, "foo4"})

Process.sleep(200)

stop_sup_2 = fn ->
  # remove sup_2 and stop it
  Horde.Cluster.set_members(:sup_1, [:sup_1])
  Horde.DynamicSupervisor.stop(:sup_2)
  Process.sleep(200)
end

restart_sup_2 = fn ->
  # start sup_2 and add it
  {:ok, _} =
    Horde.DynamicSupervisor.start_link(
      name: :sup_2,
      strategy: :one_for_one,
      delta_crdt_options: [sync_interval: 20]
    )
  Horde.Cluster.set_members(:sup_1, [:sup_1, :sup_2])
  Process.sleep(200)
end

count_processes_in_crdt = fn ->
  Enum.filter(DeltaCrdt.read(:"sup_1.Crdt"), fn
    {{:process, _}, _} -> true
    _ -> false
  end)
  |> length()
end

count_processes_in_supervisor = fn ->
  length(Horde.DynamicSupervisor.which_children(:sup_1))
end

4 = count_processes_in_crdt.()
4 = count_processes_in_supervisor.()

stop_sup_2.()
restart_sup_2.()
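
# After stopping sup_2 and bringing it back, the handed-over children are
# tracked under new random IDs while the old CRDT entries are never removed,
# so both counts keep growing on every cycle instead of returning to 4.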

6 = count_processes_in_supervisor.()
8 = count_processes_in_crdt.()

stop_sup_2.()
restart_sup_2.()

10 = count_processes_in_supervisor.()
12 = count_processes_in_crdt.()

@bartj3

bartj3 commented May 26, 2021

Hi,
At T-Mobile we are running into the same bug, and we've spent some time diving into Horde to see if we could tackle it.

Upon inspection of the CRDT state during the test script, we noticed that after stopping a supervisor, its children are handed over with new random IDs, but the old children are not removed from the CRDT. Our assumption at this point is that in the old (pre-randomized-ID) version a cleanup was not needed, because by using the same ID you would overwrite instead of add.
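A quick way to see this (a minimal sketch assuming the supervisor names from the reproduction script above, so the CRDT process is registered as :"sup_1.Crdt") is to list the child IDs still tracked in the CRDT after a handover; the stale entries show up as extra {:process, child_id} keys:

:"sup_1.Crdt"
|> DeltaCrdt.read()
|> Enum.filter(&match?({{:process, _child_id}, _value}, &1))
|> Enum.map(fn {{:process, child_id}, _value} -> child_id end)
|> IO.inspect(label: "child ids tracked in the CRDT")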

So there are two options here:

  1. Either we no longer randomize the child ID on handover, meaning we would overwrite the old process, or
  2. We delete the old process from the CRDT during handover.

The first solution is already suggested in PR #226; to us that felt somewhat scary because we don't understand why the randomization was added to begin with. So we explored option 2.

During the exploration of option 2 we realized that the :active/:passive setup in the handover means that during a :passive handover no process is terminated, so disown_child_process is never called either.

We decided to call this on the recipient supervisor during handover, because at that point we are sure that the old supervisor is dead, as the code already validates that. This resulted in PR #231. It is still a WIP, as we'd like to add tests to prove that this does indeed work and does not break the :active handover.

@l3nz

l3nz commented Jun 24, 2022

I am seeing similar behavior. I have some supervisors that start and register. I understood that registering multiple times under the same name would not be allowed, and that on a merge, if the same process had started on multiple nodes, one of them would be killed.

In fact, I see that many processes are present multiple times.

The process is started by calling:

Horde.DynamicSupervisor.start_child(@supervisor, {Logfwd.Trk.StatusSrv, []})

when each application boots. What I get is that it either starts, or it is found to be a duplicate and not started. All is good.
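(For context, a minimal sketch of how such a singleton is presumably registered; the registry name Logfwd.HordeRegistry is an assumption, only Logfwd.HordeSupervisor appears below. Registering under a via tuple is what makes a duplicate start return {:error, {:already_started, pid}} instead of starting a second copy.)

defmodule Logfwd.Trk.StatusSrv do
  use GenServer

  def start_link(_args) do
    # a second registration under the same name fails, so the duplicate is not started
    GenServer.start_link(__MODULE__, [], name: via_tuple())
  end

  def init(state), do: {:ok, state}

  # the registry name is hypothetical; the key matches the entry shown below
  defp via_tuple do
    {:via, Horde.Registry, {Logfwd.HordeRegistry, __MODULE__}}
  end
end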

HordeRegistry

Yet if I ask the HordeRegistry about a specific process (which is supposed to be a singleton), it is present just once:

 {Logfwd.Trk.StatusSrv, "-", #PID<46889.30038.0>},

But if I look at the list of processes it is linked to, there are TWO processes for it:

#PID<46889.30038.0> (Logfwd.Trk.StatusSrv.init/1)   <---- active, the registered one
#PID<46889.2452.0> (Logfwd.Trk.StatusSrv.init/1)    <---- gets no input but it's still there

HordeSupervisor

If I ask the HordeSupervisor for the list of all children, I see that the process is present multiple times:

> Horde.DynamicSupervisor.which_children Logfwd.HordeSupervisor
[
  {:undefined, #PID<46889.2452.0>, :worker, [Logfwd.Trk.StatusSrv]},
  {:undefined, #PID<46889.30038.0>, :worker, [Logfwd.Trk.StatusSrv]},
  {:undefined, #PID<46890.21379.0>, :worker, [Logfwd.Trk.StatusSrv]},
  ....
]

I wonder what the point would be of having Horde allow multiple instances of StatusSrv (with the same call parameters). I thought the whole point of Horde was to have a supervisor that keeps cluster-unique processes alive, and that it did so by preventing, and when necessary killing, processes booting with the same name. Or am I missing something trivial?

@bernardo-martinez

The same issue is happening to me. Have you figured out a way to prevent it?

@derekkraan
Owner

@bernardo-martinez I don't have time to dig into this right now unfortunately.

@arjan
Contributor

arjan commented Nov 16, 2022

I think @bartj3's comment above actually explains it pretty well.

@arjan
Contributor

arjan commented Nov 16, 2022

@bernardo-martinez could you try applying the patch from #226? I think it helps. I am not sure why a new random child ID is generated on process handoff; the identity of the child is still the same, it is just being started somewhere else...

@bernardo-martinez

@arjan I get the problem; what I don't get is where in the code I should dive (and how it works internally) in order to fix it. But OK, when I have time I will first go down the path you suggest, and if there's no success I'll go the long way (basically studying all the code). Thanks for the replies :)

@l3nz

l3nz commented Nov 16, 2022

@derekkraan I believe it would be a good idea to put a note in the main README that there are currently issues with process duplication, because this is a major issue. I personally decided to use a different approach because in our environment this was a complete no-go.

I'm totally fine with you not having the time/bandwidth to address these issues, as every open source maintainer knows (we are all in the same boat), but for such a critical piece of infrastructure as Horde, it would definitely be nice to be clear on this topic, e.g. by saying that there are some unsolved major issues and that they won't be solved any time soon.

People (even open source maintainers) have lives and jobs and other interests as well, so I don't think there is any loss of face in being open about it. Horde is still a terrific library.

@rossvz

rossvz commented Nov 29, 2022

I personally decided to use a different approach because in our environment this was a complete no-go.

@l3nz can you elaborate on the workaround you used to avoid this? We're running into the same issue and it's causing problems with our rolling deploys.

@l3nz

l3nz commented Dec 11, 2022

I basically have a singleton-per-cluster process, using the approach from https://github.com/derekkraan/highlander, that maintains a list of processes that must be up and distributes them across the cluster nodes. When there are topology changes, and in any case every few minutes, it checks that all processes that are supposed to be up are actually up and that all nodes are reachable.
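(Roughly, a sketch of that kind of reconciler; the module name and the check interval are hypothetical, only the Highlander dependency comes from the link above. Highlander keeps a single instance of the wrapped child running per cluster.)

defmodule Reconciler do
  use GenServer

  @check_interval :timer.minutes(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(opts) do
    schedule_check()
    {:ok, opts}
  end

  def handle_info(:check, state) do
    # placeholder: compare the processes that must be up against what is
    # actually running, (re)start the missing ones and check node reachability
    schedule_check()
    {:noreply, state}
  end

  defp schedule_check, do: Process.send_after(self(), :check, @check_interval)
end

# in the application's supervision tree, wrapped so only one instance
# runs per cluster:
# children = [{Highlander, {Reconciler, []}}]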
