Details
-
Type:
Bug
-
Status: Closed
-
Priority:
Major
-
Resolution: Fixed
-
Affects Version/s: 10.0.12
-
Fix Version/s: 10.0.13
-
Component/s: None
-
Labels:
Description
While replicating from the attached binary logs, a slave started with slave-parallel-threads=8 fails and then goes down with the assertion failure.
The binary log contains changes made by a bootstrap process on master, so the replication failure with the duplicate key is expected, since the system table on the slave already exists and is populated.
Assertion failure is the problem.
140625 15:01:47 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='3306', master_log_file='', master_log_pos='4'. 140625 15:01:47 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos 140625 15:01:55 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './wheezy-64-relay-bin.000001' position: 4; GTID position '' 140625 15:01:55 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:3306',replication starts at GTID position '' 140625 15:02:15 [ERROR] Slave SQL: Error 'Duplicate entry '%-test-' for key 'PRIMARY'' on query. Default database: 'mysql'. Query: 'INSERT INTO db SELECT * FROM tmp_db WHERE @had_db_table=0;', Internal MariaDB error code: 1062 140625 15:02:15 [Warning] Slave: Duplicate entry '%-test-' for key 'PRIMARY' Error_code: 1062 140625 15:02:15 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 63928; GTID position '0-1-85' mysqld: 10.0/sql/sql_base.cc:902: void close_thread_tables(THD*): Assertion `thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)' failed. 140625 15:02:15 [ERROR] mysqld got signal 6 ;
#6 0x00007fcfc01ff621 in *__GI___assert_fail (assertion=0xf16a20 "thd->transaction.stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)", file=<optimized out>, line=902, function=0xf19fb0 "void close_thread_tables(THD*)") at assert.c:81 #7 0x0000000000614865 in close_thread_tables (thd=0x7fcfa3049070) at 10.0/sql/sql_base.cc:901 #8 0x00000000007eae3f in rpl_slave_state::record_gtid (this=0x1886760, thd=0x7fcfa3049070, gtid=0x7fcf969ab090, sub_id=89, in_transaction=false, in_statement=false) at 10.0/sql/rpl_gtid.cc:688 #9 0x00000000007e9950 in rpl_slave_state::record_and_update_gtid (this=0x1886760, thd=0x7fcfa3049070, rgi=0x7fcf969ab070) at 10.0/sql/rpl_gtid.cc:80 #10 0x00000000007825a2 in Relay_log_info::stmt_done (this=0x7fcfad7fccb8, event_master_log_pos=64952, event_creation_time=1403693523, thd=0x7fcfa3049070, rgi=0x7fcf969ab070) at 10.0/sql/rpl_rli.cc:1244 #11 0x00000000009484e4 in Log_event::do_update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.cc:975 #12 0x000000000094ea7b in Query_log_event::do_update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.cc:4487 #13 0x00000000005ed13a in Log_event::update_pos (this=0x7fcf96998f30, rgi=0x7fcf969ab070) at 10.0/sql/log_event.h:1357 #14 0x00000000005e4538 in apply_event_and_update_pos (ev=0x7fcf96998f30, thd=0x7fcfa3049070, rgi=0x7fcf969ab070, rpt=0x7fcfb661a930) at 10.0/sql/slave.cc:3271 #15 0x00000000007eebd6 in rpt_handle_event (qev=0x7fcf969d3b70, rpt=0x7fcfb661a930) at 10.0/sql/rpl_parallel.cc:46 #16 0x00000000007efad4 in handle_rpl_parallel_thread (arg=0x7fcfb661a930) at 10.0/sql/rpl_parallel.cc:502 #17 0x00007fcfc1db6b50 in start_thread (arg=<optimized out>) at pthread_create.c:304 #18 0x00007fcfc02aea7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
I couldn't reproduce it without parallel replication or with smaller number of parallel threads, but with slave-parallel-threads=8 the failure seems to be reliably reproducible for me.
RQG test to reproduce, in case replaying replication from binary logs does not do it:
perl ./runall-new.pl --threads=1 --duration=10 --queries=100M --grammar=conf/optimizer/optimizer_subquery.yy --rpl_mode=row --use-gtid=current_pos --mysqld=--slave_parallel_threads=8 --basedir=<basedir> --vardir=<vardir>
It is to be run with MariaDB's flavor of RQG (lp:~elenst/randgen/mariadb-patches)
Gliffy Diagrams
Attachments
Activity
- All
- Comments
- Work Log
- History
- Activity
- Transitions
Great test, thanks.
I actually found two issues that together result in the assertion:
1. Given two transactions T1 and T2 in the master's binlog, not in the same
group commit. On the slave, T2 is only allowed to run once T1 has started to
commit. However, there was a bug, so that if T1 failed, then T2 was still
allowed to start (T2 would then catch the error at its own commit and roll
back, so the problem was often not visible).
2. When T2 is DDL (drop temporary table in this case), record_gtid() needs to
update mysql.gtid_slave_pos in an auto-committed transaction. This autocommit
ends up noticing the error in T1 and failing. There was missing error handling
in this case (missing ha_rollback_trans()), which triggered the assertion.
Note that it is necessary to run without binlogging on the slave
(--skip-log-bin or --log-slave-updates=0) to trigger this.
Here is an MTR test case that reproduces the assertion:
-slave.opt file: