MariaDB Server - MDEV-7202

[PATCH] additional statistics for parallel replication - Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending


      Description

In MDEV-6680 I thought some additional status variables would be helpful.

The attached patch adds a total status across all threads for Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending.

Rather than, or in addition to, totals, would a per-thread status variable such as slave_parallel_eventqueue_0_size be acceptable?

      Anything else useful to capture/graph here?
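
As a rough illustration of what the per-thread variant could look like, here is a minimal standalone C++ sketch; it is not MariaDB status-variable code, and the struct name, the assumption of 4 workers, and the printed names are hypothetical. It keeps a per-worker size and free-pending counter and derives the summed total that the attached patch exposes:

    // Minimal standalone sketch, not the MariaDB status-variable plumbing.
    // Names like slave_parallel_eventqueue_<N>_size are hypothetical.
    #include <atomic>
    #include <cstddef>
    #include <cstdio>

    struct WorkerStats {
      std::atomic<std::size_t> eventqueue_size{0};        // bytes currently queued for this worker
      std::atomic<std::size_t> eventqueue_freepending{0}; // bytes on the worker's pending free list
    };

    int main() {
      WorkerStats workers[4];                // e.g. slave_parallel_threads=4
      workers[0].eventqueue_size = 1024;     // pretend the driver thread queued some events
      workers[2].eventqueue_size = 512;

      std::size_t total = 0;
      for (std::size_t i = 0; i < 4; ++i) {
        // One status variable per worker, as proposed above.
        std::printf("slave_parallel_eventqueue_%zu_size = %zu\n",
                    i, workers[i].eventqueue_size.load());
        total += workers[i].eventqueue_size.load();
      }
      // The summed total across all threads, as the attached patch adds.
      std::printf("Slave_parallel_eventqueue_size = %zu\n", total);
      return 0;
    }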


              Activity

Kristian Nielsen added a comment -

              Ok, I (finally) got to look at this patch.

              > Attached patch adds a total status for all threads for the
              > Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending.

              The patch exposes the loc_qev_size and qev_free_pending fields as status
              variables. I don't really see how this is useful?

This is a completely internal detail of the memory management in parallel
replication. It is the size of an internal free list of buffers that each
thread keeps to handle event queueing efficiently. Its size does not say
much about how parallel replication is actually performing.
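
As a concrete (if simplified) illustration of why such a free list says little about replication progress, here is a standalone sketch; it is not the actual rpl_parallel.cc code, and the EventBufferCache/QueuedEvent names are invented. Freed event buffers are kept for reuse, so the list's length mostly reflects allocation history and the batching of frees, not how far a worker is behind:

    // Standalone illustration of a per-thread free list of event buffers.
    // Not MariaDB code; EventBufferCache and QueuedEvent are invented names.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct QueuedEvent { char payload[256]; };

    class EventBufferCache {
      std::vector<QueuedEvent*> free_list;   // roughly analogous in spirit to qev_free_pending
    public:
      QueuedEvent* get() {
        if (!free_list.empty()) {            // reuse a previously freed buffer if possible
          QueuedEvent* ev = free_list.back();
          free_list.pop_back();
          return ev;
        }
        return new QueuedEvent();            // otherwise allocate a fresh one
      }
      void put(QueuedEvent* ev) { free_list.push_back(ev); }  // recycle instead of deleting
      std::size_t size() const { return free_list.size(); }
      ~EventBufferCache() { for (QueuedEvent* ev : free_list) delete ev; }
    };

    int main() {
      EventBufferCache cache;
      QueuedEvent* a = cache.get();
      QueuedEvent* b = cache.get();
      cache.put(a);
      cache.put(b);
      // The free list now holds 2 buffers whether replication is idle or
      // heavily loaded, which is why exposing its size as a status variable
      // tells an operator little about replication progress.
      std::printf("free list size = %zu\n", cache.size());
      return 0;
    }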

              > Rather/in addition to totals, would push a per thread status as
              > slave_parallel_eventqueue_0_size be acceptable?

              Do you mean here that there would be N status variables, one for each worker
              thread? Maybe there are better places to expose per-thread statistics, like
              performance schema or information_schema? But as I said, it seems to me that
              these particular values will be more confusing than useful.

              > I thought some additional status would be helpful.
              > Anything else useful to capture/graph here?

              I 100% agree that more monitoring of parallel replication is needed.

With respect to the size of the event queues, the issue is that the code does
not update the queue size after every event execution, in order to reduce lock
contention. So the information is not readily available in the current code.

I'm trying to think of a way to get the size of pending events without
introducing additional locking overhead.

              The SQL driver thread takes LOCK_rpl_thread whenever an event is queued. And
              the worker thread takes LOCK_parallel_entry whenever a new event group
              (transaction) is started. So maybe we could do something while these locks are
              held?

Under LOCK_parallel_entry, a worker thread could update a counter of the size
of events processed but not yet freed (in class rpl_parallel_entry). And under
LOCK_rpl_thread, the SQL driver thread could increment a per-thread count of
the size of events queued. The status variable would then combine these to
obtain the right value. But it sounds a bit too complicated... it would be
nice to come up with a simpler idea for adding good monitoring of parallel
replication status (not just queue size).
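
Here is a standalone sketch of one possible reading of that two-counter scheme; the names are placeholders (RplThread, RplParallelEntry, and the functions below are not the real MariaDB classes or code). The driver adds to a queued-size counter while it already holds the per-worker lock, the worker adds to a processed-size counter while it already holds the entry lock at the start of an event group, and a status reader combines the two for an approximate outstanding size:

    // Standalone sketch of the two-counter idea; placeholder names, not MariaDB code.
    #include <cstddef>
    #include <cstdio>
    #include <mutex>

    struct RplThread {                       // stand-in for a per-worker rpl_parallel_thread
      std::mutex LOCK_rpl_thread;
      std::size_t queued_size = 0;           // updated by the SQL driver thread
    };

    struct RplParallelEntry {                // stand-in for rpl_parallel_entry
      std::mutex LOCK_parallel_entry;
      std::size_t processed_size = 0;        // updated by worker threads
    };

    // Driver side: already takes LOCK_rpl_thread when queueing an event,
    // so the counter update adds no new lock acquisition.
    void queue_event(RplThread& thd, std::size_t event_size) {
      std::lock_guard<std::mutex> guard(thd.LOCK_rpl_thread);
      thd.queued_size += event_size;
    }

    // Worker side: already takes LOCK_parallel_entry when starting a new
    // event group, so the processed counter can be folded in there.
    void start_event_group(RplParallelEntry& entry, std::size_t size_since_last_group) {
      std::lock_guard<std::mutex> guard(entry.LOCK_parallel_entry);
      entry.processed_size += size_since_last_group;
    }

    // Status reader: outstanding ~= queued - processed (slightly stale by design,
    // since the processed counter only advances at event-group boundaries).
    std::size_t pending_eventqueue_size(RplThread& thd, RplParallelEntry& entry) {
      std::lock_guard<std::mutex> g1(thd.LOCK_rpl_thread);
      std::lock_guard<std::mutex> g2(entry.LOCK_parallel_entry);
      return thd.queued_size - entry.processed_size;
    }

    int main() {
      RplThread thd;
      RplParallelEntry entry;
      queue_event(thd, 300);
      queue_event(thd, 200);
      start_event_group(entry, 300);         // worker finished the first event group
      std::printf("pending ~ %zu bytes\n", pending_eventqueue_size(thd, entry));
      return 0;
    }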

              In general, I'm unsure how to balance the need for more monitoring against the
              overhead of locking/atomics needed to maintain such monitoring. There is
              already significant locking overhead in parallel replication, and not much
              benchmarking has been done to understand the significance of this overhead.

Kristian Nielsen added a comment -

I tried to assign the issue back to Daniel Black, but that did not seem to be possible.

Daniel Black added a comment -

Pavel Ivanov suggested much better options in MDEV-7340, so let's close this and continue there.


People

• Assignee: Unassigned
• Reporter: Daniel Black
• Votes: 0
• Watchers: 3
