  1. MariaDB Server
  2. MDEV-8291

Parallel replication causes slave threads to not pick up new global config after restart

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.1, 10.0
    • Fix Version/s: 10.0.20
    • Component/s: Replication
    • Labels:
    • Environment:
      CentOS 6.6

      Description

      This issue only occurs when parallel replication is enabled. Changes to global configuration variables do not appear to be picked up by the slave threads after the slave is stopped and started again.

      Reproduce:
      On slave:
      Configuration:

      sql_mode='STRICT_TRANS_TABLES'
      group_concat_max_len=1024
      

      On master:

      SET SESSION binlog_format=statement;
      drop table if exists testbreak; drop table if exists testdata;
      create table testbreak (big text not null) engine=MyISAM;
      create table testdata (part varchar(1024) not null) engine=MyISAM;
      insert into testdata VALUES (REPEAT('a', 1024));
      insert into testdata VALUES (REPEAT('a', 1024));
      insert into testdata VALUES (REPEAT('a', 1024));
      set session group_concat_max_len=4096;
      insert into testbreak SELECT group_concat(part) FROM testdata;
      

      On slave, witness:

      Last_SQL_Error: Error 'Row 2 was cut by GROUP_CONCAT()' on query. Default database: 'mariadb_test'. Query: 'insert into testbreak SELECT group_concat(part) FROM testdata'
      

      On slave, execute:

      STOP SLAVE;
      SET GLOBAL group_concat_max_len=4096;
      START SLAVE;
      

      On slave, witness:

      Last_SQL_Error: Error 'Row 2 was cut by GROUP_CONCAT()' on query. Default database: 'mariadb_test'. Query: 'insert into testbreak SELECT group_concat(part) FROM testdata'
      

      On slave, execute:

      STOP SLAVE;
      SET GLOBAL slave_parallel_threads=0;
      START SLAVE;
      

      Error goes away.
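To confirm that the new value really is set globally while the error persists, a check along these lines can be run on the slave. It is only a sketch that reuses the variable names from the steps above; the comments describe what one would expect in this scenario, not captured output.

```sql
-- Expected to show the value set with SET GLOBAL (4096 in the steps above).
SHOW GLOBAL VARIABLES LIKE 'group_concat_max_len';
-- A value greater than 0 means parallel replication is enabled.
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_threads';
-- Last_SQL_Error is the field quoted in the report.
SHOW SLAVE STATUS;
```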

              Activity

              elenst Elena Stepanova added a comment - - edited

              Technically, the reason is understandable: even when the slave stops, the parallel worker threads do not, so they are never restarted and cannot pick up the changed value.
              But from the users' perspective, the complaint is valid. I also have a vague (and maybe wrong) feeling it's been discussed before, but I don't remember the outcome. Assigning to Kristian Nielsen for further feedback.

              See also MDEV-8294 regarding parallel threads that are not terminated.

              knielsen Kristian Nielsen added a comment -

              Duplicate of MDEV-5289, I think. No version was given in the report, but
              here is what it looks like on a recent version of 10.0:

              MariaDB [test]> change master to master_host='127.0.0.1', master_port=3310, master_user='root';
              Query OK, 0 rows affected (0.05 sec)
              
              MariaDB [test]> show processlist;
              +----+------+-----------+------+---------+------+-------+------------------+----------+
              | Id | User | Host      | db   | Command | Time | State | Info             | Progress |
              +----+------+-----------+------+---------+------+-------+------------------+----------+
              |  2 | root | localhost | test | Query   |    0 | init  | show processlist |    0.000 |
              +----+------+-----------+------+---------+------+-------+------------------+----------+
              1 row in set (0.00 sec)
              
              MariaDB [test]> start slave;
              Query OK, 0 rows affected (0.01 sec)
              
              MariaDB [test]> show processlist;
              +----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+------------------+----------+
              | Id | User        | Host      | db   | Command | Time | State                                                                       | Info             | Progress |
              +----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+------------------+----------+
              |  2 | root        | localhost | test | Query   |    0 | init                                                                        | show processlist |    0.000 |
              |  3 | system user |           | NULL | Connect |    1 | Waiting for master to send event                                            | NULL             |    0.000 |
              |  4 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              |  5 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              |  6 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              |  7 | system user |           | NULL | Connect |   53 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              |  8 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              |  9 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              | 10 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              | 11 | system user |           | NULL | Connect |    1 | Waiting for work from SQL thread                                            | NULL             |    0.000 |
              | 12 | system user |           | NULL | Connect |    1 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL             |    0.000 |
              +----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+------------------+----------+
              11 rows in set (0.00 sec)
              
              MariaDB [test]> stop slave;
              Query OK, 0 rows affected (0.02 sec)
              
              MariaDB [test]> show processlist;
              +----+------+-----------+------+---------+------+-------+------------------+----------+
              | Id | User | Host      | db   | Command | Time | State | Info             | Progress |
              +----+------+-----------+------+---------+------+-------+------------------+----------+
              |  2 | root | localhost | test | Query   |    0 | init  | show processlist |    0.000 |
              +----+------+-----------+------+---------+------+-------+------------------+----------+
              1 row in set (0.00 sec)
              
              MariaDB [test]> 
              

              Worker threads are stopped when all slave threads are stopped.
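One way to check this without reading the whole process list is to count the idle workers directly. The query below is a sketch: it assumes the workers appear under the user 'system user' with the state text shown in the output above, so a worker that is busy applying an event would not be counted.

```sql
-- Count idle parallel replication worker threads.
-- Assumption: workers are identified by the state text seen in SHOW PROCESSLIST.
SELECT COUNT(*) AS idle_workers
  FROM information_schema.PROCESSLIST
 WHERE USER = 'system user'
   AND STATE = 'Waiting for work from SQL thread';
```

If this still returns a non-zero count after the slave has stopped, the workers were presumably not terminated and may still hold their old session values.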

              In earlier versions (or in case of hitting MDEV-8294, perhaps?), a
              work-around is to just change the number of threads; this will cause the
              worker threads to be re-spawned:

                SET GLOBAL <config>, <config>
                SET GLOBAL slave_parallel_threads=0;
                SET GLOBAL slave_parallel_threads=10;
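Written out for the variable from the original report, the work-around might look like the sketch below. The leading STOP SLAVE is an assumption: as far as the MariaDB documentation describes it, slave_parallel_threads can only be changed while the slave threads are stopped, and after an error stop the I/O thread may still be running. The value 10 is just the example thread count used above.

```sql
STOP SLAVE;                            -- also stops the I/O thread if it is still running
SET GLOBAL group_concat_max_len=4096;  -- the setting the workers should pick up
SET GLOBAL slave_parallel_threads=0;   -- shut down the existing worker pool
SET GLOBAL slave_parallel_threads=10;  -- re-spawn the workers (example value)
START SLAVE;
```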
              
              elenst Elena Stepanova added a comment -

              Something is not right: it can't be a duplicate of MDEV-5289, because I checked it against the current 10.0 (which is why no particular minor version was given). I can double-check, but please note that it only happens if replication aborted with an error – that is, if MDEV-8294 is hit. Basically, it's a consequence of MDEV-8294, which, as I can see, you've just fixed.

              knielsen Kristian Nielsen added a comment -

              Ok, let's call it a duplicate of MDEV-5289, then.

              Basically, before MDEV-5289, it was necessary to change the value of
              @@slave_parallel_threads to make the parallel replication worker threads
              respawn and pick up new configuration settings.

              After MDEV-5289, STOP SLAVE (for all slaves in case of multi-source) is
              enough to respawn the worker threads.

              However, there was a bug with the MDEV-5289 implementation (MDEV-8294), so
              that a slave stopping with an error would leave the worker threads still
              running, with old session variable values. And then a STOP SLAVE was also
              not effective (because the slave is already stopped). Then, a successful
              START SLAVE followed by normal STOP SLAVE (not error stop) was needed to
              re-spawn the worker threads.

              After the fix for MDEV-8294, it should (hopefully) be enough to stop and start
              slaves to respawn the worker threads, even if the stop happens due to an
              error.

              Note that in the case of multi-source, all slaves must be stopped at once
              for worker threads to be respawned, as worker threads are shared among
              multi-source connections. As long as at least one SQL thread is running,
              worker threads will remain using old configuration values in their session
              variables.
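For a multi-source setup, the point above could be sketched as follows, using MariaDB's STOP ALL SLAVES and START ALL SLAVES statements so that no SQL thread keeps the shared worker pool alive. The variable and value are taken from the original report and serve only as an example.

```sql
STOP ALL SLAVES;                       -- stop every replication connection at once
SET GLOBAL group_concat_max_len=4096;  -- example setting from the report
START ALL SLAVES;                      -- worker threads are re-spawned with the new value
```

Stopping the connections one at a time and restarting each before the next is stopped would leave at least one SQL thread running throughout, so the workers would keep their old session values.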

              elenst Elena Stepanova added a comment -

              Closing as fixed in 10.0.20 because it should go away after a fix for the root cause – MDEV-8294.


                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  michaeldg Michaël de groot
                • Votes:
                  0
                  Watchers:
                  2
