Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6549

Assertion `(thd->variables.option_bits & (1ULL << 25)) == 0' failed with parallel replication

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0.13
    • Fix Version/s: 10.0.14
    • Component/s: None
    • Labels:
      None

      Description

      10.0/sql/log_event.cc:6555: virtual int Gtid_log_event::do_apply_event(rpl_group_info*): Assertion `(thd->variables.option_bits & (1ULL << 25)) == 0' failed.
      140807  2:27:35 [ERROR] mysqld got signal 6 ;
      
      #5  0x00007fdc71db07c0 in *__GI_abort () at abort.c:92
      #6  0x00007fdc71da66f1 in *__GI___assert_fail (assertion=0x1019e60 "(thd->variables.option_bits & (1ULL << 25)) == 0", file=<optimized out>, line=6555, function=0x101cae0 "virtual int Gtid_log_event::do_apply_event(rpl_group_info*)") at assert.c:81
      #7  0x00000000009561c4 in Gtid_log_event::do_apply_event (this=0x7fdc47336470, rgi=0x7fdc483dc800) at 10.0/sql/log_event.cc:6555
      #8  0x00000000005edf08 in Log_event::apply_event (this=0x7fdc47336470, rgi=0x7fdc483dc800) at 10.0/sql/log_event.h:1343
      #9  0x00000000005e4f8e in apply_event_and_update_pos (ev=0x7fdc47336470, thd=0x7fdc478bd070, rgi=0x7fdc483dc800, rpt=0x7fdc7163eeb0) at 10.0/sql/slave.cc:3255
      #10 0x00000000007f06b9 in rpt_handle_event (qev=0x7fdc46d83770, rpt=0x7fdc7163eeb0) at 10.0/sql/rpl_parallel.cc:36
      #11 0x00000000007f1f7b in handle_rpl_parallel_thread (arg=0x7fdc7163eeb0) at 10.0/sql/rpl_parallel.cc:724
      #12 0x00007fdc7395fb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
      #13 0x00007fdc71e5720d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
      

      Stack trace from:

      revision-id: sergii@pisem.net-20140805163520-tz0w14h8uju0ti9n
      revno: 4330
      branch-nick: 10.0
      

      To reproduce:

      • put the attached set of binary logs into the master datadir;
      • start master with log-bin=mysql-bin;
      • start slave with --slave-skip-errors=1049,1305,1539,1505,1317 --slave-parallel-threads=8;
      • start replication with MASTER_USE_GTID = current_pos;
      • wait.

      Note: most likely not all slave-skip-errors values are needed, but some are.

      The failure happens sporadically in different places over the course of replicating the binary logs. For me it happens 2-3 times before the slave catches up with the master.

      The binary logs were created during a usual stress test on Power 8 machine and were not tampered with anyhow.

      Could not reproduce without parallel replication.

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            knielsen Kristian Nielsen added a comment -

            Ok, I can reproduce as described.

            The thread that asserts turns out to have needed to do a deadlock retry of the transaction executed just before the one that asserts.

            It seems likely that the problem is missing cleanup of some transactional state flag in case of transaction retry. I'll dig into it.

            Show
            knielsen Kristian Nielsen added a comment - Ok, I can reproduce as described. The thread that asserts turns out to have needed to do a deadlock retry of the transaction executed just before the one that asserts. It seems likely that the problem is missing cleanup of some transactional state flag in case of transaction retry. I'll dig into it.
            Hide
            knielsen Kristian Nielsen added a comment -

            Ok, so the real problem here turned out to be as follows:

            When a replicated transaction updates (but does not commit yet) the update of
            gtid_slave_pos, the flag is cleared that marks that a GTID exists to be so
            recorded in gtid_slave_pos. Then if the transaction is later killed due to a
            deadlock and retried, the flag remained cleared, so that during the retry, the
            update of gtid_slave_pos would not happen, and this the GTID position would be
            wrong.

            As a side effect the code would also skip clearing OPTION_GTID_BEGIN when
            skipping the gtid_slave_pos update, and this is what causes the assertion, but
            the real (and serious) problem is the missing update of GTID position.

            Good catch!

            Show
            knielsen Kristian Nielsen added a comment - Ok, so the real problem here turned out to be as follows: When a replicated transaction updates (but does not commit yet) the update of gtid_slave_pos, the flag is cleared that marks that a GTID exists to be so recorded in gtid_slave_pos. Then if the transaction is later killed due to a deadlock and retried, the flag remained cleared, so that during the retry, the update of gtid_slave_pos would not happen, and this the GTID position would be wrong. As a side effect the code would also skip clearing OPTION_GTID_BEGIN when skipping the gtid_slave_pos update, and this is what causes the assertion, but the real (and serious) problem is the missing update of GTID position. Good catch!
            Hide
            knielsen Kristian Nielsen added a comment -

            Pushed to 10.0.14

            Show
            knielsen Kristian Nielsen added a comment - Pushed to 10.0.14

              People

              • Assignee:
                knielsen Kristian Nielsen
                Reporter:
                elenst Elena Stepanova
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: