Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 10.0.17
- Fix Version/s: 10.0.18
- Component/s: Replication
- Labels:
Description
I encountered 11 instances of the following statement in a single binlog group commit, along with one other transaction on a different table:
UPDATE variable SET value='a:0:{}'
WHERE ( (name = 'rules_event_whitelist') )
The slave settings were:
(replication not in gtid mode)
slave_transaction_retries=10 (default)
slave-parallel-threads=20
master had binlog_commit_wait_count=20
As a result, replication stopped and the following errors were in the log:
150701 2:01:33 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
150701 2:01:33 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid 0-8-1270304033, Internal MariaDB error code: 1213
150701 2:01:33 [Warning] Slave: Connection was killed Error_code: 1927
150701 2:01:33 [Warning] Slave: Connection was killed Error_code: 1927
150701 2:01:33 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
150701 2:01:33 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.010502' position 38545619
a) Stopping replication once slave_transaction_retries has been reached puts me in a much worse state than continuing to retry, so is there any case where slave_transaction_retries is useful (apart from catching a coding error)?
b) Is setting slave_transaction_retries = slave-parallel-threads sufficient to resolve this in the meantime, or do I need to set it to factorial(slave-parallel-threads)?
I read through MDEV-7882 and it looked different. Does anything else after the 10.0.17 release change the handling of this situation?
Activity
It does look like the same problem as MDEV-7882 to me.
Suppose that T1 and T2 happen to conflict on the slave. Before 10.0.18, we
would deadlock kill T2 when T1 goes to wait. But it was possible for thread
scheduling to let T2 run ahead and once again get the lock before T1. This
causes a new deadlock kill and new race, eventually hitting the 10 retries
limit.
In 10.0.18, it is fixed so that T2 will not retry until T1 (and any other
prior transactions) have finished their commit, avoiding repeated conflicts
and retries.
Until an upgrade to 10.0.18 can be done, a possible workaround is to
increase slave_transaction_retries to a really large value, say 10000.
Hopefully thread scheduling will eventually allow T1 to get the
lock before T2. Conflicts between transactions should hopefully be rare in
10.0 parallel replication anyway.
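The workaround above could be applied on the slave roughly as follows. This is only a sketch: it assumes slave_transaction_retries can be changed at runtime on the affected version, and 10000 is simply the illustrative value mentioned above, not a tuned recommendation.

```sql
-- Check the current retry limit (10 by default, per the report).
SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';

-- Raise the limit so that repeated deadlock kills do not stop replication.
-- 10000 is the example value from the comment above.
SET GLOBAL slave_transaction_retries = 10000;

-- The SQL thread aborted after exhausting its retries, so start it again.
START SLAVE;

-- Verify that the slave threads are running and the error has cleared.
SHOW SLAVE STATUS;
```

A value set with SET GLOBAL does not survive a server restart; to keep it until the upgrade, the same setting would also need to go into the server configuration file (for example, slave_transaction_retries=10000 under the [mysqld] section).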