MariaDB Server
MDEV-7588

[PATCH] Slow connections with thread pool and replication

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 10.0, 5.5
    • Fix Version/s: 10.0, 5.5
    • Component/s: OTHER, Replication
    • Environment:
      Debian Linux 7, Linux 3.10, 4 cores / 8 threads, official MariaDB packages

      Description

      If MariaDB is set up with pool-of-threads and a replication slave connects and issues the COM_BINLOG_DUMP command, every nth connection (in my setup every 8th) can take as much as 1 second to establish. That is, the TCP connection completes quickly, but it takes a (relatively speaking) long time before the server greeting is sent.

      The reason is apparently that the binlog dump thread doesn't notify the thread pool that it is waiting, so active_thread_count remains > 0 and queue_put() will not do anything. Instead, the thread group is only activated by the pool's stall timer (thread_pool_stall_limit), which with the default value of 500ms gives us a worst case close to 1 second.

      I only have a superficial understanding of the MariaDB core, but it seems that mysql_binlog_send() ought to call thd_wait_begin. Either something like that, or the active_thread_count check in queue_put() should be removed.

            Activity

            pchri03 Peter Nørlund added a comment - edited

            I've tried adding thd_wait_begin/thd_wait_end to MYSQL_BIN_LOG::wait_for_update_bin_log, where the thread waits on a condition variable. This had an immediate effect on the reaction time. (Attached patch: yield_wait_for_update_bin_log.diff)

            I couldn't compile the Debian package without commenting out the tests (and I actually had to backport two patches to compile at all), but all the failed tests seem to be related to SSL, and at least one of the failures was caused by an expired certificate.

            Anyway, the patch is currently running on the internal databases at my workplace, to see if it has any unwanted side effects.
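            The attached diff is not shown inline; as a purely hypothetical sketch of what such a change could look like (assuming the 5.5 source layout, where MYSQL_BIN_LOG::wait_for_update_bin_log in sql/log.cc blocks on update_cond under LOCK_log; the attached yield_wait_for_update_bin_log.diff is the authoritative version and may differ):

```diff
--- a/sql/log.cc
+++ b/sql/log.cc
@@ void MYSQL_BIN_LOG::wait_for_update_bin_log(...)
+  /* Tell the thread pool this thread is about to block, so the
+     group can wake or create another worker immediately instead
+     of waiting for the stall timer. */
+  thd_wait_begin(thd, THD_WAIT_BINLOG);
   mysql_cond_wait(&update_cond, &LOCK_log);
+  thd_wait_end(thd);
```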

            jb-boin Jean Weisbuch added a comment -

            The SSL certificate issues are due to the certificates used by the test suite having expired; this has since been fixed in MDEV-7536, so you can disregard those tests.

            About your issue: I had problems on a slave running pool-of-threads that could be related: MDEV-5951.
            That slave was running sequential dumps of each database while, at the same time, another thread was dumping the database structures (also sequentially) and yet another the table contents. At some point one of the dumps hit the thread_pool_max_threads limit (which was not set low), even though the only connections to the server were the dumps and a simple Zabbix monitoring script that just checked some server variables every minute.
            Disabling the thread pool made the issue go away.

            Do you think this bug could produce that kind of issue?

            pchri03 Peter Nørlund added a comment -

            I doubt that it is related to MDEV-5951, which looks like somebody forgetting to unlock LOCK_thread_count somewhere.

            elenst Elena Stepanova added a comment -

            Thanks for the report and the patch. Below is a simple MTR test case to confirm the problem (not to be included in the regression suite).

            Results with and without the patch on the current 5.5 tree, debug build:

            without patch, pool-of-threads
            # Min connection time:   1366
            # Max connection time:   997304 
            # Total connection time: 16919926
            # Avg connection time:   338398.5200
            
            without patch, one-thread-per-connection
            # Min connection time:   1370
            # Max connection time:   1965 
            # Total connection time: 78925
            # Avg connection time:   1578.5000
            
            with patch, pool-of-threads
            # Min connection time:   1274
            # Max connection time:   2034 
            # Total connection time: 74009
            # Avg connection time:   1480.1800
            
            with patch, one-thread-per-connection
            # Min connection time:   1419
            # Max connection time:   2322 
            # Total connection time: 85742
            # Avg connection time:   1714.8400
            

            So the patch does help, although I don't know if it's otherwise correct.

            Test case
            
            --source include/master-slave.inc
            --source include/have_binlog_format_statement.inc
            
            --let $num = 50
            --let $sum = 0
            --let $max = 0
            --let $min = 10000000
            --let $n = $num
            while ($n) 
            {
            	--let $before = `select now(6)`
            	--connect (con$n,localhost,root,,)
            	--let $tm = `select timestampdiff(microsecond,'$before', now(6))`
            	--echo # Connection time: $tm
            	--let $sum = `select $sum + $tm`
            	--let $max = `select if($tm>$max,$tm,$max)`
            	--let $min = `select if($tm<$min,$tm,$min)`
            	--connection default
            	--dec $n
            }
            
            --let $avg = `select $sum / $num`
            
            --echo # Min connection time:   $min
            --echo # Max connection time:   $max 
            --echo # Total connection time: $sum
            --echo # Avg connection time:   $avg
            
            --source include/rpl_end.inc
            
            pomyk Patryk Pomykalski added a comment -

            We have the same problem on servers with thread pool.


              People

              • Assignee: serg Sergei Golubchik
              • Reporter: pchri03 Peter Nørlund
              • Votes: 3
              • Watchers: 5
