Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.21
    • Fix Version/s: 10.1
    • Labels:
      None
    • Environment:
      ppc64el
    • Sprint:
      10.1.8-1, 10.1.8-3

      Description

      See http://bugs.mysql.com/bug.php?id=74832

      Patch has languished since Nov 2014.

      So I'm just pasting in my original description here:

      The code here has seemingly changed in 5.7.5 compared to previous versions. I'm talking about 5.7.5-m15 here.

      The purpose of ut_delay() is to spin for a while before InnoDB attempts to acquire a mutex again. Optimizations include executing the pause instruction inside the spin loop (on x86) and setting the thread priority to low for the duration of ut_delay (on POWER).
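
      To make that concrete, here is a minimal sketch (not the actual MySQL macros; the names are illustrative) of what those two hints look like as GCC-style inline asm. On POWER, "or 1,1,1" (which disassemblers print as mr r1,r1) drops the SMT thread priority, and "or 2,2,2" (mr r2,r2) restores it: exactly the instructions visible in the disassembly further down.

          /* Hypothetical sketch of the spin-loop hints described above. */
          #if defined(__powerpc__) || defined(__powerpc64__)
          # define LOW_PRIORITY_CPU()    __asm__ __volatile__ ("or 1,1,1")  /* SMT priority low */
          # define RESUME_PRIORITY_CPU() __asm__ __volatile__ ("or 2,2,2")  /* SMT priority normal */
          # define RELAX_CPU()           ((void) 0)
          #elif defined(__i386__) || defined(__x86_64__)
          # define LOW_PRIORITY_CPU()    ((void) 0)
          # define RESUME_PRIORITY_CPU() ((void) 0)
          # define RELAX_CPU()           __asm__ __volatile__ ("pause")     /* x86 spin-loop hint */
          #endif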

      Here is the current (MySQL 5.7.5) implementation of ut_delay:
      ulint
      ut_delay(
      /*=====*/
              ulint   delay)  /*!< in: delay in microseconds on 100 MHz Pentium */
      {
              ulint   i, j;

              UT_LOW_PRIORITY_CPU();

              j = 0;

              for (i = 0; i < delay * 50; i++) {
                      j += i;
                      UT_RELAX_CPU();
              }

              if (ut_always_false) {
                      ut_always_false = (ibool) j;
              }

              UT_RESUME_PRIORITY_CPU();

              return(j);
      }

      There are a couple of problems with this code:
      1) ut_always_false could quite legitimately be compiled away by the compiler
      2) j is actually unneeded, and if UT_RELAX_CPU() were not implemented, the compiler could legitimately optimize the loop away entirely (see the sketch below)
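
      A hypothetical sketch of problem 2, assuming UT_RELAX_CPU() expands to nothing: the loop body then has no observable side effects, so an optimizing compiler is entitled to replace it with the closed-form sum, or delete it outright once it proves j is never meaningfully used.

          /* Illustrative only: what the compiler sees if UT_RELAX_CPU() is empty. */
          unsigned long delay_body(unsigned long delay)
          {
                  unsigned long i, j = 0;

                  for (i = 0; i < delay * 50; i++) {
                          j += i;         /* no volatile, no asm: freely removable */
                  }

                  /* The compiler may fold this to j = n*(n-1)/2 with n = delay * 50,
                     or drop the loop entirely after inlining. */
                  return j;
          }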

      But there's another problem that's a bit hidden....

      In ut0ut.h we have the following:
      #ifndef UNIV_HOTBACKUP

      # if defined(HAVE_PAUSE_INSTRUCTION)
         /* According to the gcc info page, asm volatile means that the
         instruction has important side-effects and must not be removed.
         Also asm volatile may trigger a memory barrier (spilling all registers
         to memory). */
      #  ifdef __SUNPRO_CC
      #   define UT_RELAX_CPU() asm ("pause" )
      #  else
      #   define UT_RELAX_CPU() __asm__ __volatile__ ("pause")
      #  endif /* __SUNPRO_CC */
      # elif defined(HAVE_FAKE_PAUSE_INSTRUCTION)
      #  define UT_RELAX_CPU() __asm__ __volatile__ ("rep; nop")
      # elif defined(HAVE_ATOMIC_BUILTINS)
      #  define UT_RELAX_CPU() do { \
             volatile lint  volatile_var; \
             os_compare_and_swap_lint(&volatile_var, 0, 1); \
           } while (0)
      # elif defined(HAVE_WINDOWS_ATOMICS)
         /* In the Win32 API, the x86 PAUSE instruction is executed by calling
         the YieldProcessor macro defined in WinNT.h. It is a CPU architecture-
         independent way by using YieldProcessor. */
      #  define UT_RELAX_CPU() YieldProcessor()
      # else
      #  define UT_RELAX_CPU() ((void)0) /* avoid warning for an empty statement */
      # endif

      So if HAVE_PAUSE_INSTRUCTION or HAVE_FAKE_PAUSE_INSTRUCTION is defined (i.e. on recent x86), you get the desired effect: there will be a pause instruction.

      However, if you HAVE_ATOMIC_BUILTINS, then you get this:

          do {
                  volatile lint volatile_var;
                  os_compare_and_swap_lint(&volatile_var, 0, 1);
          } while (0)

      Which is anything but relaxing. So, on POWER, where we have atomics but no pause instruction, we get that instead of an empty statement.

      This likely affects other platforms too (e.g. SPARC, MIPS, ARM, ia64, m68k... basically everything that isn't x86).

      What we really want here is just a compiler barrier, so that the compiler knows it cannot optimize away the loop.
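
      For reference, a minimal sketch of such a barrier, assuming GCC-style inline asm (this matches the UT_COMPILER_BARRIER() macro in the patch discussed below): an empty asm statement with a "memory" clobber emits no instructions, but the compiler must assume it touches memory, so a loop containing it cannot be optimized away.

          #define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

          void delay_sketch(unsigned long delay)  /* illustrative name */
          {
                  unsigned long i;

                  for (i = 0; i < delay * 50; i++) {
                          COMPILER_BARRIER();  /* keeps the loop, adds no work */
                  }
          }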

      Back to ut_delay, if we look at the original PowerPC assembler for this, it's rather larger than it needs to be:

      0000000000000380 <._Z8ut_delaym>:
      380: fb e1 ff f8 std r31,-8(r1)
      384: f8 21 ff b1 stdu r1,-80(r1)
      388: 7c 3f 0b 78 mr r31,r1
      38c: 7c 21 0b 78 mr r1,r1
      390: 1d 03 00 32 mulli r8,r3,50
      394: 38 60 00 00 li r3,0
      398: 2f a8 00 00 cmpdi cr7,r8,0
      39c: 41 9e 00 44 beq cr7,3e0 <._Z8ut_delaym+0x60>
      3a0: 39 20 00 00 li r9,0
      3a4: 38 e0 00 01 li r7,1
      3a8: 60 00 00 00 nop
      3ac: 60 00 00 00 nop
      3b0: 7c 00 04 ac sync
      3b4: 7c 63 4a 14 add r3,r3,r9
      3b8: 38 df 00 30 addi r6,r31,48
      3bc: 7d 40 30 a8 ldarx r10,0,r6
      3c0: 2c 2a 00 00 cmpdi r10,0
      3c4: 40 82 00 0c bne 3d0 <._Z8ut_delaym+0x50>
      3c8: 7c e0 31 ad stdcx. r7,0,r6
      3cc: 40 a2 ff ec bne 3b8 <._Z8ut_delaym+0x38>
      3d0: 4c 00 01 2c isync
      3d4: 39 29 00 01 addi r9,r9,1
      3d8: 7f a9 40 40 cmpld cr7,r9,r8
      3dc: 40 9e ff d4 bne cr7,3b0 <._Z8ut_delaym+0x30>
      3e0: 3c c2 00 00 addis r6,r2,0
      3e4: e9 26 00 00 ld r9,0(r6)
      3e8: 2f a9 00 00 cmpdi cr7,r9,0
      3ec: 41 9e 00 08 beq cr7,3f4 <._Z8ut_delaym+0x74>
      3f0: f8 66 00 00 std r3,0(r6)
      3f4: 7c 42 13 78 mr r2,r2
      3f8: 38 3f 00 50 addi r1,r31,80
      3fc: eb e1 ff f8 ld r31,-8(r1)
      400: 4e 80 00 20 blr

      The bits that stare at me are the sync and isync instructions. We're executing memory barriers in there! In a loop! When we're meant to be relaxing!

      So, once I remove the buggy UT_RELAX_CPU() implementation and simplify ut_delay (patch attached), I end up with:

      0000000000000380 <._Z8ut_delaym>:
      380: fb e1 ff f8 std r31,-8(r1)
      384: f8 21 ff c1 stdu r1,-64(r1)
      388: 7c 3f 0b 78 mr r31,r1
      38c: 7c 21 0b 78 mr r1,r1
      390: 1c 63 00 32 mulli r3,r3,50
      394: 7c 69 03 a6 mtctr r3
      398: 2f a3 00 00 cmpdi cr7,r3,0
      39c: 41 9e 00 08 beq cr7,3a4 <._Z8ut_delaym+0x24>
      3a0: 42 00 00 00 bdnz 3a0 <._Z8ut_delaym+0x20>
      3a4: 7c 42 13 78 mr r2,r2
      3a8: 38 3f 00 40 addi r1,r31,64
      3ac: eb e1 ff f8 ld r31,-8(r1)
      3b0: 4e 80 00 20 blr
      3b4: 00 00 00 00 .long 0x0
      3b8: 00 09 00 00 .long 0x90000
      3bc: 80 01 00 00 lwz r0,0(r1)

      Which is exactly what we should be doing - we go into low priority (mr r1,r1), spin for a while, then resume normal priority (mr r2,r2) and return. We also avoid doing unnecessary work (which is good).

      This may also have a positive performance impact on x86, since the extra arithmetic and bookkeeping no longer have to be done, and IIRC modern KVM on x86 will trap the pause instruction and attempt to schedule a vCPU that may hold the lock we're spinning on.

      How to repeat:
      look at profiles, or disassemble the code and examine it (as I've done above)

      Suggested fix:
      merge my patch (attached) that fixes this.

              Activity

              stewart-ibm Stewart Smith added a comment - edited

              Attached is SQL and a script to recreate the issue.

              This will be related to another issue I'm about to file too, but from the perf output you can clearly see that ut_delay() is sucking huge amounts of CPU time, and this isn't really ideal.

              We've observed this in 10.0.17 and 10.0.20 (or was it 21... I can't remember).

              (Pasting in from Anton below)

               # mysql -u root < sql
              
               # time ./psdoit 1
               0.022s
              
               # time ./psdoit 10
               0.112s
              
               # time ./psdoit 100
               7.150s
              
               # time ./psdoit 200
               30.422s
              
               # time ./psdoit 400
               190.378s
              
               Things get very bad, very fast. We spend almost all our time in a locking
               food fight:
              
                   37.42%  mysqld         mysqld              [.] ut_delay(unsigned long)
                           |
                           ---ut_delay(unsigned long)
                              |
                              |--99.74%-- mutex_spin_wait(void*, bool, char const*, unsigned long)
                              |          |
                              |          |--74.70%-- pfs_mutex_enter_func(ib_mutex_t*, char const*, unsigned long) [clone .constprop.71]
                              |          |          lock_rec_convert_impl_to_expl(buf_block_t const*, unsigned char const*, dict_index_t*, unsigned long const*)
                              |          |          lock_clust_rec_read_check_and_lock(unsigned long, buf_block_t const*, unsigned char const*, dict_index_t*, unsigned long const
                              |          |          sel_set_rec_lock(buf_block_t const*, unsigned char const*, dict_index_t*, unsigned long const*, unsigned long, unsigned long,
                              |          |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
                              |          |          |
                              |          |          |--76.07%-- ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
                              |          |          |          handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
                              |          |          |          handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
                              |          |          |          join_read_key2(THD*, st_join_table*, TABLE*, st_table_ref*)
                              |          |          |          sub_select(JOIN*, st_join_table*, bool)
                              |          |          |

               and:

                   27.96%  mysqld         [kernel.kallsyms]   [k] _raw_spin_lock
                           |
                           ---_raw_spin_lock
                              |
                              |--58.11%-- futex_wait_setup
                              |          |
                              |          |--99.96%-- 0
                              |          |          futex_wait
                              |          |          do_futex
                              |          |          sys_futex
                              |          |          system_call
                              |          |          |
                              |          |          |--64.91%-- __lll_lock_wait
                              |          |          |          |
                              |          |          |          |--83.30%-- pthread_mutex_lock
                              |          |          |          |          |
                              |          |          |          |          |--52.72%-- os_event_reset(os_event*)
                              |          |          |          |          |          sync_array_reserve_cell(sync_array_t*, void*, unsigned long, char const*, unsigned long, unsi
                              |          |          |          |          |          mutex_spin_wait(void*, bool, char const*, unsigned long)
                              |          |          |          |          |          |
                              |          |          |          |          |          |--75.08%-- pfs_mutex_enter_func(ib_mutex_t*, char const*, unsigned long) [clone .constprop.7
                              |          |          |          |          |          |          lock_rec_convert_impl_to_expl(buf_block_t const*, unsigned char const*, dict_index
                              |          |          |          |          |          |          lock_clust_rec_read_check_and_lock(unsigned long, buf_block_t const*, unsigned cha
                              |          |          |          |          |          |          sel_set_rec_lock(buf_block_t const*, unsigned char const*, dict_index_t*, unsigned
                              |          |          |          |          |          |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long
              
              svoj Sergey Vojtovich added a comment -

              About a year ago we also analyzed ut_delay performance (I can't remember the exact workload, though). We applied a similar patch and ut_delay disappeared from the perf output, but overall system performance suffered as well. So it was shelved and forgotten until now.

              We need to resume this analysis with suggested workload.

              stewart-ibm Stewart Smith added a comment -

              It's possible that the delay needs to be increased when we add this patch, as it'll do less work and thus take a lot less time (probably a LOT less, actually, since we're no longer forcing things out of the L1 cache).

              stewart-ibm Stewart Smith added a comment -

              Using the microbenchmarks from https://github.com/stewart-ibm/microbenchmarks/commit/9210e849374f14e1ffd652ed869637f3d70ba354

              Here I use the timebase register on ppc to work out how long things take in wall time (the timebase runs at 512MHz and is "good enough" for this purpose).
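
              (For reference, a minimal sketch, assuming GCC on 64-bit POWER, of how such a timebase read can be done; the helper name is illustrative, not necessarily what the microbenchmarks use:)

              #include <stdint.h>

              /* SPR 268 is the 64-bit timebase register on POWER. */
              static inline uint64_t read_timebase(void)
              {
                      uint64_t tb;
                      __asm__ __volatile__ ("mfspr %0, 268" : "=r" (tb));
                      return tb;
              }

              /* Usage: take readings around the code under test and divide
                 the delta by the 512MHz timebase frequency for wall time. */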

              [stewart@p82 ~]$ gcc -O3 ut_delay_optim.c -o ut_delay_optim
              [stewart@p82 ~]$ gcc -O3 ut_delay_mysql.c -o ut_delay_mysql
              [stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_mysql ; done
              tb change (avg over 1000000): 3209
              [stewart@p82 ~]$ ./ut_delay_optim
              tb change (avg over 1000000): 225

              So with the optimized ut_delay, we're actually delaying for less than a tenth of the time.

              So, to tune that to approximately the same amount of wall time, we try changing the loop multiplier in ut_delay from *50 to *1024 (which should optimize down to a bit shift).

              With this (ut_delay_optim2.c) I compare the run of the MySQL one to the optim2 one (see https://github.com/stewart-ibm/microbenchmarks/commit/fa26a1cb71b5799ae846e49b951bb54d55a6a168 )

              [stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_optim2 ; done
              tb change (avg over 1000000): 2771
              tb change (avg over 1000000): 2811
              tb change (avg over 1000000): 2640
              tb change (avg over 1000000): 2571
              tb change (avg over 1000000): 2789
              tb change (avg over 1000000): 2776
              tb change (avg over 1000000): 2859
              tb change (avg over 1000000): 2739
              tb change (avg over 1000000): 2665
              tb change (avg over 1000000): 2799

              [stewart@p82 ~]$ for i in `seq 1 10`; do ./ut_delay_mysql ; done
              tb change (avg over 1000000): 2882
              tb change (avg over 1000000): 2880
              tb change (avg over 1000000): 2882
              tb change (avg over 1000000): 2878
              tb change (avg over 1000000): 2880
              tb change (avg over 1000000): 2878
              tb change (avg over 1000000): 3117
              tb change (avg over 1000000): 3071
              tb change (avg over 1000000): 2880
              tb change (avg over 1000000): 3209

              So it's probably worth trying that implementation to see if there's still a negative performance impact. It should burn the same amount of (wall) CPU time without placing additional stress on parts of the CPU.

              svoj Sergey Vojtovich added a comment -

              So here are OLTP RW results for 160 threads, 16 tables, 2 CPUs/20 cores, SMT=8, DSCR=1, sysbench-0.5 + RNG fix.

              Fresh 10.1 snapshot: 16688 TPS (ut_delay takes 4.74%)
              Fresh 10.1 snapshot + following patch: 12023 TPS (ut_delay takes 1.63%)
              Fresh 10.1 snapshot + following patch + multiplier increased to 1024: 16706 TPS (ut_delay takes 6.19%)

              Patch:

              diff --git a/storage/xtradb/include/ut0ut.h b/storage/xtradb/include/ut0ut.h
              index 0caf379..090eb15 100644
              --- a/storage/xtradb/include/ut0ut.h
              +++ b/storage/xtradb/include/ut0ut.h
              @@ -78,11 +78,6 @@ struct ut_when_dtor {
              
               # elif defined(HAVE_FAKE_PAUSE_INSTRUCTION)
               #  define UT_RELAX_CPU() __asm__ __volatile__ ("rep; nop")
              -# elif defined(HAVE_ATOMIC_BUILTINS)
              -#  define UT_RELAX_CPU() do { \
              -     volatile lint	volatile_var; \
              -     os_compare_and_swap_lint(&volatile_var, 0, 1); \
              -   } while (0)
               # elif defined(HAVE_WINDOWS_ATOMICS)
                  /* In the Win32 API, the x86 PAUSE instruction is executed by calling
                  the YieldProcessor macro defined in WinNT.h. It is a CPU architecture-
              @@ -92,6 +87,8 @@ struct ut_when_dtor {
               #  define UT_RELAX_CPU() ((void)0) /* avoid warning for an empty statement */
               # endif
              
              +#define UT_COMPILER_BARRIER() __asm__ __volatile__ ("":::"memory")
              +
               /*********************************************************************//**
               Delays execution for at most max_wait_us microseconds or returns earlier
               if cond becomes true.
              @@ -332,7 +329,7 @@ struct ut_when_dtor {
               in microseconds on 100 MHz Pentium + Visual C++.
               @return	dummy value */
               UNIV_INTERN
              -ulint
              +void
               ut_delay(
               /*=====*/
               	ulint	delay);	/*!< in: delay in microseconds on 100 MHz Pentium */
              diff --git a/storage/xtradb/ut/ut0ut.cc b/storage/xtradb/ut/ut0ut.cc
              index 96f2c53..76a7ea3 100644
              --- a/storage/xtradb/ut/ut0ut.cc
              +++ b/storage/xtradb/ut/ut0ut.cc
              @@ -45,9 +45,6 @@
               # include "mysql_com.h" /* NAME_LEN */
               #endif /* UNIV_HOTBACKUP */
              
              -/** A constant to prevent the compiler from optimizing ut_delay() away. */
              -UNIV_INTERN ibool	ut_always_false	= FALSE;
              -
               #ifdef __WIN__
               /*****************************************************************//**
               NOTE: The Windows epoch starts from 1601/01/01 whereas the Unix
              @@ -397,25 +394,17 @@ This is the Windows version of gettimeofday(2).
               in microseconds on 100 MHz Pentium + Visual C++.
               @return	dummy value */
               UNIV_INTERN
              -ulint
              +void
               ut_delay(
               /*=====*/
               	ulint	delay)	/*!< in: delay in microseconds on 100 MHz Pentium */
               {
              -	ulint	i, j;
              -
              -	j = 0;
              +	ulint	i;
              
               	for (i = 0; i < delay * 50; i++) {
              -		j += i;
              +		UT_COMPILER_BARRIER();
               		UT_RELAX_CPU();
               	}
              -
              -	if (ut_always_false) {
              -		ut_always_false = (ibool) j;
              -	}
              -
              -	return(j);
               }
               #endif /* !UNIV_HOTBACKUP */
              
              svoj Sergey Vojtovich added a comment -

              I couldn't reproduce any noticeable performance impact with the "psdoit" benchmark either.

              Stewart Smith, did you see any effect in your benchmarks?

              Generally I agree with this patch, but it is hard to justify with no visible improvement.

              Otherwise I'd prefer to have this problem fixed by:
              1. abandoning the InnoDB mutex implementation in favor of pthread mutexes
              2. avoiding mutexes on hot paths

              svoj Sergey Vojtovich added a comment -

              Jan Lindström, could you also share your thoughts on this?

              jplindst Jan Lindström added a comment -

              Very interesting to see that different implementations spend very different amounts of time in that single function. However, the results are not clear: why does the middle one, which spends the least time there, have significantly lower performance? Does it hit the real mutex wait sooner? Comparing the first and the last, there is no significant difference, so I don't see any real reason to change from the original to the last one. Do you see similar results when the number of threads is varied?

              stewart-ibm Stewart Smith added a comment -

              (Sorry I've been a bit behind on this - a million other things going on).

              I'm totally in favour of instead just using pthread mutexes and fixing the problem there! I'm doubly in support of doing this upstream too.

              I think what's going on with the first patch and decreased performance is that we poke the cacheline with the mutex in it a lot more and we end up sleeping a lot sooner.

              I don't think anyone has looked at the best way to spin like this for a while. I'll poke some people and see if we can come up with something better, as we may want to dig into various in-depth simulators/modelling tools, or even just ask the chip designers.


                 People

                 • Assignee:
                   jplindst Jan Lindström
                   Reporter:
                   stewart-ibm Stewart Smith
                 • Votes:
                   0
                   Watchers:
                   5