I have a new idea for how to design this API, mostly based on Serg's
suggestion in the discussions on the mailing list.
The idea is to introduce a new (optional) handlerton method, say
void (*rpl_info)(void *master_connection, uint32 domain_id, uint64 sub_id, uint64 commit_id)
Parallel replication will then invoke this method at the start of every
replicated transaction, to pass the storage engine the metadata about the
transaction that it needs to handle possible deadlocks.
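As a rough sketch of what this could look like in struct handlerton (the
placement and comment wording are just illustrative, not a settled design):

  /*
    Optional.  Called by the parallel replication worker at the start of
    each replicated transaction, to pass the engine the metadata it needs
    for deadlock handling.  Left NULL by engines that do not implement it.
  */
  void (*rpl_info)(void *master_connection, uint32 domain_id,
                   uint64 sub_id, uint64 commit_id);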
So with this new idea, things work in the opposite direction from the original
patch, where it was the storage engine that called back into the upper layer
at various points during transaction execution.
The actual logic and code will be much the same in either case; however, with
the rpl_info() handlerton method, more of it will live in the storage engine
rather than in the server layer.
The originally proposed API tried to hide the details of the parallel
replication from the storage engines. This one exposes the details, which
seems to simplify it somewhat. It does mean that the new API will not be
generally applicable to other cases in the future; however any such
applicability of the old API is so far only theoretical.
A benefit of the new API is that it allows the server layer to detect whether
a storage engine supports it, simply by checking if the hton->rpl_info pointer
is non-NULL. This might allow the server layer to fall back to default, less
efficient and/or less robust deadlock-prevention measures for storage engines
that do not implement such measures themselves.
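For example, such a capability check in the server layer could be as simple as
the following (the helper name and the fallback branch are hypothetical, for
illustration only):

  /* Hypothetical server-layer helper. */
  static bool engine_handles_rpl_deadlocks(const handlerton *hton)
  {
    return hton->rpl_info != NULL;
  }

  ...
  if (!engine_handles_rpl_deadlocks(hton))
  {
    /*
      Engine does not implement the new API; apply some generic, more
      conservative deadlock-prevention strategy instead (details TBD).
    */
  }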
The code that needs to be added to InnoDB/XtraDB to handle this new API is as
follows:
1. In the innobase_rpl_info() method, store the supplied replication metadata
inside the trx object for later use. This replaces the need for the previous
thd_need_wait_for() function. (A rough sketch of points 1-3 is given after
this list.)
2. When trx1 is about to wait for trx2, check the metadata. If
trx1->master_connection==trx2->master_connection and
trx1->domain_id==trx2->domain_id and trx1->sub_id < trx2->sub_id, then
this wait will cause a deadlock later during parallel replication commit. So
in this case, call some thd_deadlock_kill(trx2) server method, which will mark
trx2 as being deadlocked and send it a kill signal. This code replaces the
previous thd_report_wait_for() function; it is necessary to handle some
corner-case deadlocks that can occur with statement-based parallel replication
and InnoDB.
3. When trx1 and trx2 have conflicting gap locks, check their replication
metadata. If trx1->master_connection==trx2->master_connection and
trx1->domain_id==trx2->domain_id and trx1->commit_id==trx2->commit_id, then
the locking semantics can be relaxed and both transactions can be allowed to
proceed in parallel without waiting on each other's locks. The gap lock would
be needed only to ensure correct commit order for serialisation purposes, and
this commit order was already determined on the master and will be enforced on
the slave. This code replaces the previous thd_need_ordering_with() function.
This part is optional; it eliminates some corner-case deadlocks with
statement-based parallel replication, saving the deadlock kill (and later
transaction retry) that would otherwise happen from the code in (2).
4. It would be possible to similarly replace the
thd_deadlock_victim_preference() function. However, I think that part of the
original API was considered ok in the earlier mailing list discussions, so no
need to change it.
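To make points 1-3 concrete, here is a rough sketch of the InnoDB side. All
field and function names below (the rpl metadata fields in trx_t,
innobase_check_rpl_deadlock(), innobase_rpl_allows_parallel(), and the
thd_deadlock_kill() server call) are assumptions of this proposal, not
existing code:

  /* Point 1: replication metadata to be stored in each trx_t. */
  struct trx_rpl_info_t {
    void   *master_connection;  /* NULL if not a replicated transaction */
    uint32  domain_id;
    uint64  sub_id;
    uint64  commit_id;
  };

  static void innobase_rpl_info(void *master_connection, uint32 domain_id,
                                uint64 sub_id, uint64 commit_id)
  {
    /* Locate the trx for the current THD (exact mechanism TBD). */
    trx_t *trx= check_trx_exists(current_thd);
    trx->rpl.master_connection= master_connection;
    trx->rpl.domain_id= domain_id;
    trx->rpl.sub_id= sub_id;
    trx->rpl.commit_id= commit_id;
  }

  /* Point 2: called from the lock wait code when trx1 is about to wait
     for a lock held by trx2. */
  static void innobase_check_rpl_deadlock(trx_t *trx1, trx_t *trx2)
  {
    if (trx1->rpl.master_connection &&
        trx1->rpl.master_connection == trx2->rpl.master_connection &&
        trx1->rpl.domain_id == trx2->rpl.domain_id &&
        trx1->rpl.sub_id < trx2->rpl.sub_id)
    {
      /*
        trx1 must commit before trx2, but is waiting on trx2: this would
        deadlock at commit time.  Ask the server layer to kill trx2 so it
        can be rolled back and retried.
      */
      thd_deadlock_kill(trx2->mysql_thd);
    }
  }

  /* Point 3 (optional): gap locks between transactions that group-committed
     together on the master can be relaxed; commit order will still be
     enforced on the slave. */
  static bool innobase_rpl_allows_parallel(const trx_t *trx1, const trx_t *trx2)
  {
    return trx1->rpl.master_connection &&
           trx1->rpl.master_connection == trx2->rpl.master_connection &&
           trx1->rpl.domain_id == trx2->rpl.domain_id &&
           trx1->rpl.commit_id == trx2->rpl.commit_id;
  }

The thd_deadlock_kill() call here corresponds to the server method proposed in
point (2); the actual rollback would use the existing deadlock-victim handling,
and the replication code would do the retry as it does today.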
Not all storage engines may need to implement this API. For example, there are
currently no known cases where parallel replication can get deadlocks with
MyISAM, due to its simpler table-based locking. (Though maybe it could happen
with parallel insert at the end of the table?)
Alternatively, we could use this API to implement more aggressive parallel
replication for supporting engines. If transactions T1 and T2 did not commit
in parallel on the master, we can still try to run them in parallel on the
slave. If it works, then great: fixed commit order will ensure that the result
is correct. If it does not work, the engine will detect the conflict as in
case (2) above, and we can roll back T2 and retry it, this time not in
parallel.
One issue with the new API proposal is that it exposes somewhat internal
details of parallel replication to the storage engines. But depending on one's
point of view this can also be seen as an advantage: it eliminates an extra
layer of abstraction and makes it easier for the storage engine to do exactly
what needs to be done to handle the issues efficiently.
The proposed API requires that the parallel replication code be able to invoke
the handlerton method rpl_info() for the storage engines participating in a
transaction; this needs to be done early, when the storage engine joins the
transaction, before it can do any row lock waits. I suppose this could happen
when the storage engine is added to the list of those participating in the
transaction?
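One possible shape for this, sketched below, would be a hook at the point where
an engine is registered as participating in the transaction (trans_register_ha()
or thereabouts). The rpl_group_info member names are written from memory and
should be treated as illustrative:

  /* Hypothetical hook, called when a storage engine joins a transaction. */
  static void rpl_notify_engine_joined(THD *thd, handlerton *hton)
  {
    rpl_group_info *rgi= thd->rgi_slave;  /* NULL unless a replication worker */
    if (rgi && hton->rpl_info)
      hton->rpl_info(rgi->rli,            /* identifies the master connection */
                     rgi->current_gtid.domain_id,
                     rgi->gtid_sub_id,
                     rgi->commit_id);     /* commit_id from the GTID event */
  }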