Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 10.0.0
- Fix Version/s: 10.0.2
- Component/s: None
- Labels: None
- Environment: Linux (RHEL6)
Description
The original unmodified description of the problem can be found at the bottom of the field
On SQL thread start, a slave checks permissions for slave_load_tmpdir by creating (and immediately deleting) a dummy file there:
sql/slave.cc:
  /*
    Check permissions to create a file.
  */
  if ((fd= mysql_file_create(key_file_misc,
                             tmp_file, CREATE_MODE,
                             O_WRONLY | O_BINARY | O_EXCL | O_NOFOLLOW,
                             MYF(MY_WME))) < 0)
    DBUG_RETURN(1);

  /*
    Clean up.
  */
  mysql_file_close(fd, MYF(0));
  mysql_file_delete(key_file_misc, tmp_file, MYF(0));

  DBUG_RETURN(0);
}
The resulting file is <slave_load_tmpdir>/SQL_LOAD-. Because the file is created with O_EXCL, the slave fails to start if the file already exists. The draft test case below demonstrates this; it fails on every version.
Usually this is not a problem, since the slave deletes the file immediately. However, with multiple slaves, if for some reason they restart simultaneously, I suppose it is possible to hit a race condition in which one slave has created the file but not yet deleted it, and another slave attempts to create it too and fails.
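The failure can be illustrated outside the server. The following is a minimal sketch, not the server code: it assumes that mysql_file_create() ends up in a POSIX open() call with O_CREAT added to the given flags, and probe_fixed_name is a hypothetical helper name. With a leftover file in place the probe returns EEXIST, which is errno 17 on Linux.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper, not part of the server: probe `dir` the way the
// excerpt above does, using a fixed file name and O_EXCL.
// Returns 0 on success, or the errno value set by open() on failure.
int probe_fixed_name(const std::string &dir)
{
  assert(!dir.empty());
  const std::string path = dir + "/SQL_LOAD-";

  // O_EXCL makes open() fail with EEXIST if the file is already there,
  // for example because another thread is in the middle of the same probe.
  int fd = open(path.c_str(), O_WRONLY | O_CREAT | O_EXCL | O_NOFOLLOW, 0600);
  if (fd < 0)
    return errno;

  // Clean up, as the server code does.
  close(fd);
  unlink(path.c_str());
  return 0;
}
```

Calling probe_fixed_name() on an empty, writable directory returns 0. Calling it while SQL_LOAD- exists in that directory returns EEXIST, which is the window the race condition relies on.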
Test case below shows how slave fails to start when the file exists
(it uses $MYSQLTEST_VARDIR/tmp because that's what MTR sets for slave_load_tmpdir):
--source include/master-slave.inc
--save_master_pos
--connection slave
stop slave;
--echo # Writing $MYSQLTEST_VARDIR/tmp/SQL_LOAD-
--write_file $MYSQLTEST_VARDIR/tmp/SQL_LOAD-
test
EOF
start slave;
--sync_with_master 0
show slave status;
=================================
Original description
Multi-source replication seems to randomly stop with
Last_SQL_Error: Unable to use slave's temporary directory /tmp - Can't create/write to file '/tmp/SQL_LOAD-' (Errcode: 17 "File exists")
We only see it occasionally, on a huge replication volume (100s of GB/day).
Is the master name perhaps missing off the end of that file name?
Is there any data that would be useful to capture when this happens to help debugging?
Activity
Looking at some of the other, older error reports on the internet, it seems the file name is missing a suffix of some sort, and it looks like the multiple replication slave threads are trying to write to the same suffixless file.
We are using row-based (rather than statement-based or mixed) replication.
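To illustrate the suffix idea, here is a minimal sketch of a permission probe that cannot collide with a concurrent probe or a leftover file. It is an assumption for illustration only, not the change that went into 10.0.2: probe_unique_name is a hypothetical helper, and it relies on the POSIX mkstemp() call to choose a unique suffix.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>
#include <vector>
#include <unistd.h>

// Hypothetical helper, not part of the server: check that a file can be
// created in `dir`, letting mkstemp() replace the XXXXXX with a unique suffix.
// Returns 0 on success, or the errno value set by mkstemp() on failure.
int probe_unique_name(const std::string &dir)
{
  assert(!dir.empty());
  const std::string tmpl = dir + "/SQL_LOAD-XXXXXX";

  // mkstemp() rewrites the template in place, so it needs a writable buffer.
  std::vector<char> buf(tmpl.begin(), tmpl.end());
  buf.push_back('\0');

  int fd = mkstemp(buf.data());
  if (fd < 0)
    return errno;

  // Clean up the probe file; buf now holds the name that was actually used.
  close(fd);
  unlink(buf.data());
  return 0;
}
```

With this approach an existing SQL_LOAD- file in the directory is left untouched and does not make the probe fail.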