MariaDB Server / MDEV-6429

Create public storage engine API for reducing impact of deadlocks during parallel replication

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Component/s: None

      Description

      There are some corner cases where parallel replication can hit a
      deadlock. This needs to be handled so that replication does not stall or even
      fail.

      In 10.0, a minimal patch was made to solve the problem for InnoDB/XtraDB:
      MDEV-5262, MDEV-5914, MDEV-5941, MDEV-6020.

      In 10.1, we want to introduce a general storage engine API, properly
      implemented as a service, that can be used by any storage engine to implement
      the same optimisations/fixes.

      Ideally, the API should be optional, and only used as an optimisation to
      improve the performance or behaviour in case a deadlock does occur. So we
      should try to find some way to handle deadlocks as well as possible for
      storage engines that do not implement the API.

      Mailing list threads that discuss the API and related issues:

      https://lists.launchpad.net/maria-developers/msg07480.html
      https://lists.launchpad.net/maria-developers/msg07489.html


            Activity

            Kristian Nielsen (knielsen) added a comment:

            I have a new idea for how to design this API, mostly based on Serg's
            suggestion in the discussions on the mailing list.

            The idea is to introduce a new (optional) handlerton method, say

            void (*rpl_info)(void *master_connection, uint32 domain_id, uint64 sub_id, uint64 commit_id)

            Parallel replication will then invoke this method at the start of every
            transaction, to inform the storage engine of the necessary metadata about the
            replicated transaction to handle possible deadlocks.
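
            As a rough, non-authoritative sketch of how this could look (the handlerton
            and trx_t fragments below are illustrative stand-ins, not the actual server
            definitions, and innobase_rpl_info() / current_trx() are hypothetical names):

            #include <cstdint>
            typedef uint32_t uint32;
            typedef uint64_t uint64;

            /* Illustrative handlerton fragment: the new method is just an optional
               function pointer that engines may leave NULL. */
            struct handlerton
            {
              /* ... existing members ... */
              void (*rpl_info)(void *master_connection, uint32 domain_id,
                               uint64 sub_id, uint64 commit_id);
            };

            /* Hypothetical InnoDB-side transaction object holding the metadata. */
            struct trx_t
            {
              void  *rpl_master_connection;
              uint32 rpl_domain_id;
              uint64 rpl_sub_id;
              uint64 rpl_commit_id;
            };

            /* Stand-in for however the engine would locate the active transaction;
               a real implementation would likely need a THD handle or similar. */
            trx_t *current_trx();

            /* Hypothetical InnoDB implementation: simply record the metadata in the
               trx object for later use during lock waits (step 1 below). */
            void innobase_rpl_info(void *master_connection, uint32 domain_id,
                                   uint64 sub_id, uint64 commit_id)
            {
              trx_t *trx= current_trx();
              trx->rpl_master_connection= master_connection;
              trx->rpl_domain_id= domain_id;
              trx->rpl_sub_id= sub_id;
              trx->rpl_commit_id= commit_id;
            }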

            So with this new idea, things work in the opposite direction from the original
            patch, where it was the storage engine that called back into the upper layer
            at various points during transaction execution.

            The actual logic and code will be much the same in either case; however,
            with the rpl_info() handlerton method, more of it will live in the storage
            engine rather than in the server layer.

            The originally proposed API tried to hide the details of parallel
            replication from the storage engines. This one exposes the details, which
            seems to simplify it somewhat. It does mean that the new API will not be
            generally applicable to other cases in the future; however any such
            applicability of the old API is so far only theoretical.

            A benefit of the new API is that it allows the server layer to detect
            whether a storage engine supports it, by checking if the hton->rpl_info
            pointer is non-NULL. This might allow the server layer to take default,
            less efficient and/or less robust deadlock-prevention measures for storage
            engines that do not implement such measures themselves.
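
            As an illustration of that detection (again using the illustrative
            handlerton fragment from the sketch above; the helper name is hypothetical):

            #include <cstdint>
            typedef uint32_t uint32;
            typedef uint64_t uint64;

            struct handlerton
            {
              void (*rpl_info)(void *master_connection, uint32 domain_id,
                               uint64 sub_id, uint64 commit_id);
            };

            /* Hypothetical decision point in the server layer: engines that leave
               rpl_info NULL get a generic, more conservative strategy, for example
               relying purely on deadlock kill and transaction retry. */
            bool engine_handles_rpl_deadlocks(const handlerton *hton)
            {
              return hton->rpl_info != nullptr;
            }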

            The code that needs to be added to InnoDB/XtraDB to handle this new API is as
            follows:

            1. In the innobase_rpl_info() method, store the supplied replication metadata
            inside the trx object for later usage. This replaces the need for the previous
            thd_need_wait_for() function.

            2. When trx1 is about to wait for trx2, check the metadata. If
            trx1->master_connection==trx2->master_connection and
            trx1->domain_id==trx2->domain_id and trx1->sub_id < trx2->sub_id, then
            this wait will cause a deadlock later during parallel replication commit. So
            in this case, call some thd_deadlock_kill(trx2) server method, which will mark
            trx2 as being deadlocked and send it a kill signal (see the sketch after this
            list). This code replaces the previous thd_report_wait_for() function. It is
            necessary for handling some corner-case deadlocks with statement-based
            parallel replication and InnoDB.

            3. When trx1 and trx2 have conflicting gap locks, check their replication
            metadata. If trx1->master_connection==trx2->master_connection and
            trx1->domain_id==trx2->domain_id and trx1->commit_id==trx2->commit_id, then
            the locking semantics can be relaxed and both transactions can be allowed to
            proceed in parallel without waiting on the lock of the other, because the gap
            lock would be needed only to ensure correct commit order for serialisation
            purposes, and this commit order was already determined on the master and will
            be enforced on the slave (also sketched after this list). This code replaces
            the previous thd_need_ordering_with() function. This code is optional; it
            eliminates some corner-case deadlocks with statement-based parallel
            replication, saving the deadlock kill (and later transaction retry) that
            would otherwise happen from the code in (2).

            4. It would be possible to similarly replace the
            thd_deadlock_victim_preference() function. However, I think that part of the
            original API was considered ok in the earlier mailing list discussions, so no
            need to change it.
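
            A rough sketch of the two checks from (2) and (3), reusing the illustrative
            trx_t fields from the earlier fragment; thd_deadlock_kill() is the assumed
            server callback named in (2), not an existing function:

            #include <cstdint>
            typedef uint32_t uint32;
            typedef uint64_t uint64;

            struct trx_t
            {
              void  *rpl_master_connection;   /* unset for non-replication trx */
              uint32 rpl_domain_id;
              uint64 rpl_sub_id;
              uint64 rpl_commit_id;
            };

            /* Hypothetical server callback from (2): mark trx2 as a deadlock victim
               and send it a kill signal so it can be rolled back and retried. */
            void thd_deadlock_kill(trx_t *trx2);

            /* True when both transactions come from the same replication domain of
               the same master connection. */
            bool same_rpl_domain(const trx_t *a, const trx_t *b)
            {
              return a->rpl_master_connection &&
                     a->rpl_master_connection == b->rpl_master_connection &&
                     a->rpl_domain_id == b->rpl_domain_id;
            }

            /* Check from (2): trx1 is about to wait for trx2.  If trx1 must commit
               before trx2 (smaller sub_id), this wait would deadlock at commit time,
               so kill trx2 now; it will be retried later. */
            void check_wait_for_deadlock(trx_t *trx1, trx_t *trx2)
            {
              if (same_rpl_domain(trx1, trx2) && trx1->rpl_sub_id < trx2->rpl_sub_id)
                thd_deadlock_kill(trx2);
            }

            /* Check from (3): trx1 and trx2 have conflicting gap locks.  If they were
               part of the same group commit on the master (same commit_id), the gap
               lock only protects commit ordering, which the slave enforces anyway,
               so the wait can be skipped. */
            bool can_relax_gap_lock(const trx_t *trx1, const trx_t *trx2)
            {
              return same_rpl_domain(trx1, trx2) &&
                     trx1->rpl_commit_id == trx2->rpl_commit_id;
            }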

            Not all storage engines may need to implement this API. For example, there are
            currently no known cases where parallel replication can get deadlocks with
            MyISAM, due to the simpler table-based locking. (However, maybe it could
            happen with parallel insert at the end of the table?).

            Alternatively, we could use this API to implement more aggressive parallel
            replication for supporting engines. If transactions T1 and T2 did not commit
            in parallel on the master, we can still try to run them in parallel on the
            slave. If it works, then great: fixed commit order will ensure that the result
            is correct. If it does not work, the engine will detect the conflict as in
            case (2) above, and we can roll back T2 and retry it, this time not in
            parallel.

            One issue with the new API proposal is that it exposes somewhat internal
            details of parallel replication to the storage engines. But depending on the
            point of view, this can also be seen as an advantage: it eliminates extra
            abstraction and makes it easier for the storage engine to do exactly what
            needs to be done to handle the issues efficiently.

            The proposed API requires that the parallel replication code will be able to
            invoke the handlerton method rpl_info() for the storage engines participating
            in a transaction; this needs to be done early when the storage engine joins
            the transaction, before it can do any row lock waits. I suppose this could
            happen when the storage engine is added to the list of those participating in
            the transaction?
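
            A hedged sketch of where such a call could sit, assuming a hypothetical hook
            that runs when an engine is first added to the list of engines participating
            in the replicated transaction (all names here are illustrative, not existing
            server functions):

            #include <cstdint>
            typedef uint32_t uint32;
            typedef uint64_t uint64;

            struct handlerton
            {
              void (*rpl_info)(void *master_connection, uint32 domain_id,
                               uint64 sub_id, uint64 commit_id);
            };

            /* Hypothetical per-worker replication metadata kept by the SQL layer for
               the transaction currently being applied. */
            struct rpl_trans_info
            {
              void  *master_connection;
              uint32 domain_id;
              uint64 sub_id;
              uint64 commit_id;
            };

            /* Hypothetical hook, called when a storage engine joins the current
               transaction, i.e. before it can enter any row lock wait.  For a
               non-replication transaction, info would be a null pointer and nothing
               happens. */
            void on_engine_joins_transaction(handlerton *hton,
                                             const rpl_trans_info *info)
            {
              if (info && hton->rpl_info)
                hton->rpl_info(info->master_connection, info->domain_id,
                               info->sub_id, info->commit_id);
            }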


    People

    • Assignee: Kristian Nielsen (knielsen)
    • Reporter: Kristian Nielsen (knielsen)
    • Votes: 0
    • Watchers: 1

    Dates

    • Created:
    • Updated: