MariaDB Server: MDEV-6795

More efficient transaction retry in parallel replication

    Details

    • Type: Task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None

      Description

      Currently, when parallel replication needs to retry a transaction due to a
      deadlock or other error, the worker must re-open the relay log file, seek to
      the correct position, and re-read the events from the file.

      This is necessary in the general case, where a transaction may be huge and
      not fit in memory. But in the usual case it is wasteful, as the events are
      likely still available in memory, in the list of events queued up for the
      worker thread.

      This list is only freed in batches for efficiency, so in most cases the events
      will still be in the list when a transaction needs to be retried.

      Transaction retry efficiency becomes somewhat more important with MDEV-6676,
      speculative parallel replication. Thus, it might be worthwhile to implement a
      simple facility for this.

      Say the worker thread, when freeing queued events, keeps around the last
      event group unless it would require more than (--slave-max-queued-events/3)
      bytes of memory. Then, on transaction retry, if the entire transaction to be
      retried is still in the queue, execute the events directly from the queue
      rather than re-opening and re-reading the relay log file.
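The proposed mechanism could look roughly like the following sketch. All names here (RetryCache, keep_group, events_for_retry) are invented for illustration and do not correspond to actual server code; the point is the memory-budget check on freeing and the cache-or-fallback decision on retry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Minimal stand-in for a replication event in the worker's queue.
struct Event {
    std::string data;                      // serialized event payload
    size_t size() const { return data.size(); }
};

// Hypothetical cache: keep the most recent event group in memory, but
// only if it fits within one third of the --slave-max-queued-events budget.
struct RetryCache {
    std::vector<Event> last_group;         // events of the last event group
    size_t max_bytes;                      // slave_max_queued_events / 3

    explicit RetryCache(size_t slave_max_queued_events)
        : max_bytes(slave_max_queued_events / 3) {}

    // Called when the worker would otherwise free applied events:
    // keep the group only if it stays within the memory budget.
    bool keep_group(const std::vector<Event> &group) {
        size_t total = 0;
        for (const Event &e : group) total += e.size();
        if (total > max_bytes) {           // too big: retry must use the relay log
            last_group.clear();
            return false;
        }
        last_group = group;
        return true;
    }

    // On retry: if the whole transaction is still cached, replay from
    // memory; otherwise the caller must re-read the relay log.
    const std::vector<Event> *events_for_retry() const {
        return last_group.empty() ? nullptr : &last_group;
    }
};
```

The budget check on the free path keeps the general case (huge transactions) correct: such groups are simply never cached, and retry falls back to the existing relay-log read.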

      The main problem with this approach is testing. The code path that reads
      events from the relay log during retry would be executed very rarely, so
      serious bugs could hide there and be very hard to deal with when they
      finally turn up. Some DBUG injection should be used to ensure that the
      existing retry mtr test cases still exercise the rare relay-log-read code
      path, so that this code keeps getting tested.
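One way such an injection point might look: a debug keyword (the keyword name below is invented) that an mtr test can set via debug_dbug to force the relay-log path even when the events are still cached. In the server this would use the real DBUG facility (DBUG_EXECUTE_IF); here the keyword lookup is simulated with an environment variable so the sketch is self-contained.

```cpp
#include <cstdlib>
#include <cstring>

// Simulated DBUG keyword check; in the server this would be
// DBUG_EXECUTE_IF("rpl_retry_force_relay_log_read", ...) with the
// keyword set by the test via SET GLOBAL debug_dbug="+d,...".
static bool dbug_keyword_set(const char *keyword) {
    const char *dbug = std::getenv("DEBUG_DBUG");   // stand-in for debug_dbug
    return dbug && std::strstr(dbug, keyword) != nullptr;
}

enum class RetrySource { MEMORY_QUEUE, RELAY_LOG };

// Decide where to replay the transaction from on retry.  The DBUG
// keyword lets existing retry test cases force the rare relay-log
// path even when the events are still cached in memory.
RetrySource choose_retry_source(bool events_still_in_queue) {
    if (dbug_keyword_set("rpl_retry_force_relay_log_read"))
        return RetrySource::RELAY_LOG;
    return events_still_in_queue ? RetrySource::MEMORY_QUEUE
                                 : RetrySource::RELAY_LOG;
}
```

With this, the existing retry mtr tests could be run twice, once normally and once with the keyword set, keeping both code paths covered.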

            Activity

            Kristian Nielsen (knielsen) added a comment:

            One thing to look out for with this is what to do with the Log_event objects stored in the work queue.

            Now, they are deleted immediately after being first applied, in delete_or_keep_event_post_apply(). So that will have to be postponed, in case of re-using an event for retry.

            But then the question is whether all the do_apply_event() implementations in log_event.cc leave the event object in the same state as it was originally. It seems quite possible that there are cases where the object is left in a different state, so that a retry that reuses the event object could give subtly different results. Again, testing will be a challenge.


              People

              • Assignee: Kristian Nielsen (knielsen)
              • Reporter: Kristian Nielsen (knielsen)
              • Votes: 0
              • Watchers: 0
