A master was running a LOAD DATA statement, and the statement was interrupted by pressing CTRL-C on the client. This partially completed query caused errors on the slaves. For example, SHOW SLAVE STATUS showed:
The error log also showed the error:
An error on the slave about a partially completed query is expected, but note that the master's error in this case was #1158 (ER_NET_READ_ERROR), and not error #1317 (ER_QUERY_INTERRUPTED), which would be the expected error when a query is interrupted.
When the partially completed query was skipped with "SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE;" as the error message recommended, this caused something strange to happen on slaves that had slave_parallel_threads > 0:
The relay log position jumps ahead from (relaylog.031235, 4) to (relaylog.031235, 1866), but the master binlog position actually goes backwards from (binlog.140008, 20798609) to (binlog.140007, 1582) and the GTID position goes backwards from '0-187226073-1732217638,999-186069024-2' to 0-187226073-1732162247,999-186069024-2'. When replication is restarted, duplicate key errors are seen as old insert statements are re-run.
This did not happen on slaves that were not using parallel replication.
After closer inspection, it appeared that the values of (Relay_Log_File, Relay_Log_Pos) and (Relay_Master_Log_File, Exec_Master_Log_Pos) reported by SHOW SLAVE STATUS did not match up to the same logical position. It seems that the slave's relay log position is corrupted when parallel replication is enabled and LOAD DATA is cancelled on the master with error #1158.
When the error occurred on slaves using parallel replication, relay log position corruption could be avoided by using this process:
- stop slave;
- start slave io_thread; # This wipes the relay logs and download back the binlog from the failed LOAD DATA INFILE.
- SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1; START SLAVE SQL_THREAD;