Details
- Type: Task
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Fix Version/s: None
- Component/s: None
- Labels:
Description
Currently, when parallel replication needs to retry a transaction due to a
deadlock or other error, it must re-open the relay log file, seek to the
correct position, and re-read the events from the file.
This is necessary in the general case, where a transaction may be huge and
not fit in memory. But in the usual case, this is wasteful, as the events are
likely to already be available in-memory in the list of events queued up for
the worker thread.
This list is only freed in batches for efficiency, so in most cases the events
will still be in the list when a transaction needs to be retried.
Transaction retry efficiency becomes somewhat more important with MDEV-6676,
speculative parallel replication. Thus, it might be worthwhile to implement a
simple facility for this.
Say, the worker thread, when freeing queued events, will keep around the last
event group unless it would require more than (--slave-parallel-max-queued/3)
bytes of memory. Then, on transaction retry, if the entire transaction to be
retried is still in the queue, execute the events directly from the queue
rather than re-opening and reading the relay log file.
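The retention policy described above could be sketched roughly as follows. This is a standalone illustration under stated assumptions, not server code: `QueuedEvent`, `WorkerQueue`, `free_applied_events` and `can_retry_from_memory` are hypothetical names, `max_queued_bytes` stands in for the configured queue limit, and "the entire transaction is still in the queue" is approximated by checking that the group's first event has not been freed.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for an event queued for a worker thread.
struct QueuedEvent {
  std::uint64_t group_id;  // event group (transaction) the event belongs to
  std::size_t size_bytes;  // memory held by the event
  bool first_in_group;     // true for the event that starts the group
  bool applied;            // true once the worker has executed the event
};

// Hypothetical per-worker queue; the oldest event is at the front.
struct WorkerQueue {
  std::deque<QueuedEvent> events;
  std::size_t max_queued_bytes;  // assumed stand-in for the configured limit

  // Total memory held by the queued events of one group.
  std::size_t group_bytes(std::uint64_t gid) const {
    std::size_t total = 0;
    for (const QueuedEvent &ev : events)
      if (ev.group_id == gid) total += ev.size_bytes;
    return total;
  }

  // Batch-free applied events, but keep the most recently applied group
  // when it fits within one third of the limit.
  void free_applied_events() {
    bool have_last = false;
    std::uint64_t last_gid = 0;
    for (const QueuedEvent &ev : events)
      if (ev.applied) { last_gid = ev.group_id; have_last = true; }
    const bool keep_last =
        have_last && group_bytes(last_gid) <= max_queued_bytes / 3;
    while (!events.empty() && events.front().applied &&
           !(keep_last && events.front().group_id == last_gid))
      events.pop_front();
  }

  // A group can be retried from memory only if its first event is still queued.
  bool can_retry_from_memory(std::uint64_t gid) const {
    for (const QueuedEvent &ev : events)
      if (ev.group_id == gid) return ev.first_in_group;
    return false;
  }
};
```

On retry, a caller would check `can_retry_from_memory()` for the failed group and fall back to reading the relay log file whenever it returns false, so huge transactions keep using the existing path.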
The main problem with this approach is testing. The code that reads events
from the relay log during retry will be executed very rarely, so fatal bugs
could hide and be very hard to deal with when they finally turn up. I think
some DBUG injection should be used to force the existing retry mtr test
cases onto the rare relay-log-read code path, so that this code keeps
getting some test coverage.
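One way to picture the injection is a single decision point that a debug keyword can override. The sketch below is standalone and hypothetical: `DebugKeywords` merely imitates a debug-keyword facility (in the server this role would be played by the DBUG macros, for example `DBUG_EXECUTE_IF`), and the keyword name `force_relay_log_retry` is invented for illustration.

```cpp
#include <string>
#include <unordered_set>

// Where a retried transaction reads its events from.
enum class RetrySource { InMemoryQueue, RelayLogFile };

// Hypothetical stand-in for the server's debug-keyword facility.
struct DebugKeywords {
  std::unordered_set<std::string> enabled;
  bool is_set(const std::string &kw) const { return enabled.count(kw) != 0; }
};

// Pick the source for a retry. The invented keyword "force_relay_log_retry"
// lets a test force the rarely used relay-log path even when the whole
// event group is still queued in memory.
inline RetrySource choose_retry_source(bool group_still_queued,
                                       const DebugKeywords &dbg) {
  if (dbg.is_set("force_relay_log_retry"))
    return RetrySource::RelayLogFile;
  return group_still_queued ? RetrySource::InMemoryQueue
                            : RetrySource::RelayLogFile;
}
```

With a hook of this kind, the existing retry tests could run with the keyword enabled to cover the relay-log path, while new tests run without it to cover the in-memory path.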
Activity
One thing to look out for here is how to handle the Log_event objects stored in the work queue.
Currently, they are deleted immediately after being first applied, in delete_or_keep_event_post_apply(). That deletion will have to be postponed if an event is to be re-used for a retry.
The question then is whether all the different do_apply_event() implementations in log_event.cc leave the event object in its original state. It seems quite possible that in some cases an object is left in a different state, so that a retry that reuses the event object could give subtly different results. Again, testing will be a challenge.
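A debug-only check along the following lines might help catch such cases: snapshot an event before applying it and compare afterwards. Everything here is hypothetical and standalone (`SketchEvent`, `snapshot`, `unchanged_by_apply`); real Log_event classes have no such snapshot method as far as this sketch assumes, so one would have to be written per event type.

```cpp
#include <string>

// Hypothetical event with state that applying it may or may not modify.
struct SketchEvent {
  std::string payload;
  int rows_remaining;

  // Deliberately mutates the object, the way a real apply routine might.
  void apply_mutating() { rows_remaining = 0; }

  // Leaves the object exactly as it was.
  void apply_clean() const {}

  // Flatten the state that matters for re-execution into a comparable string.
  std::string snapshot() const {
    return payload + "#" + std::to_string(rows_remaining);
  }
};

// Return true if applying the event left its observable state unchanged,
// i.e. the same object could safely be applied again on retry.
template <typename ApplyFn>
bool unchanged_by_apply(SketchEvent &ev, ApplyFn apply) {
  const std::string before = ev.snapshot();
  apply(ev);
  return ev.snapshot() == before;
}
```

An event type that fails such a check would either need its apply routine fixed or would have to be excluded from in-memory retry and re-read from the relay log instead.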