When a joiner requests an rsync SST, wsrep_sst_rsync on the donor node executes FLUSH TABLES WITH READ LOCK before donating the SST. If FLUSH TABLES WITH READ LOCK is not successful, the wsrep_sst_rsync process does not die. Instead, it seems to stick around.
Often, this leftover script seems to hold some locks in the database, which can cause strange problems, such as the node being stuck in the DONOR/DESYNCED state.
To reproduce, let's say that we have a 2-node cluster: one will act as the donor, and one as the joiner.
Let's first create and populate a table:
Then let's stop one of the nodes and delete the datadir:
And then on the donor node, let's start some DDL that will take a long time:
Once the DDL is started, let's start the SST on the joiner:
The donor will see an error like this:
And the wsrep_sst_rsync process will not die. For each additional SST attempt, there will be another leftover process:
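One possible way to count these leftover processes on the donor is to filter `ps` output for the script name. The sketch below is only an illustration, not part of the original report: it assumes a Linux donor where `ps -eo pid,args` is available, and the helper names `find_sst_processes` and `list_sst_processes` are hypothetical.

```python
import subprocess


def find_sst_processes(ps_output, name="wsrep_sst_rsync"):
    """Return (pid, command line) pairs whose command line mentions `name`.

    `ps_output` is expected to be the text printed by `ps -eo pid,args`.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # Lines whose first field is not a number (such as the header) are skipped.
        if len(parts) == 2 and parts[0].isdigit() and name in parts[1]:
            matches.append((int(parts[0]), parts[1]))
    return matches


def list_sst_processes():
    """Run ps on the local host (assumed to be the donor) and filter its output."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_sst_processes(result.stdout)


if __name__ == "__main__":
    leftovers = list_sst_processes()
    print(f"{len(leftovers)} wsrep_sst_rsync process(es) found")
    for pid, cmd in leftovers:
        print(pid, cmd)
```

Running this after each failed SST attempt should show the count growing by one per attempt if the processes are indeed left behind. While an SST is legitimately in progress, a matching process is expected, so the count is only meaningful once the attempt has failed.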